Fix a few issues in Qwen3_Omni_Moe #44848
Conversation
run-slow: qwen3_omni_moe
vasqu
left a comment
Thanks, checking with CI 🫡
|
This comment contains models: ["models/qwen3_omni_moe"]
Ok, it fixes one issue and reveals some other ones 😓 Can you recheck, or rather rename the PR, since it does unblock partially at least?
CI Results / Commit Info
Model CI Report: ❌ 3 new failed tests from this PR 😭
Hey, pushed a potential fix for these. I think the
Sorry, I was off for a few days. Now back 🤗 @Sai-Suraj-27, checking run-slow.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
The tests no longer crash, but just fail (on expected-text mismatches).
Yea, seems reasonable: the test didn't run at all before and crashed; this PR at least lets the integration tests produce output again @ydshieh. Can we change the title though, @Sai-Suraj-27? Also, it looks like the meta device issue is not specific to the multi-GPU case (which we talked about before).
Yes, but for the text expectation mismatch failures, should I try and update the expected text, maybe?
Nope, not for now. Imo I would like to keep a failure for now / xmark it in Qwen3OmniModelIntegrationTests. Something somewhere changed, and arbitrarily changing the values is not good.
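A minimal sketch of the xmark approach discussed above, using plain `unittest` rather than the repo's actual test harness (the test and string values here are hypothetical placeholders): the drifted expectation stays visible as an expected failure instead of being silently overwritten.

```python
import unittest

class Qwen3OmniModelIntegrationTests(unittest.TestCase):
    # Hypothetical stand-in for the real integration test: the generated
    # text drifted from the recorded expectation, so we mark it as an
    # expected failure rather than updating the expected value.
    @unittest.expectedFailure
    def test_generated_text_matches(self):
        generated = "drifted output"              # placeholder value
        expected = "previously recorded output"   # placeholder value
        self.assertEqual(generated, expected)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(Qwen3OmniModelIntegrationTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # True: the failure is expected, not a new regression
```

If the expectation mismatch is later understood (or the root cause is fixed), the marker can simply be removed and the original assertion starts enforcing again.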
Do you want to investigate the meta device issue? Otherwise, I would merge as is for now |
Sure, let me check that over the weekend.
[For maintainers] Suggested jobs to run (before merge) run-slow: qwen3_omni_moe |
@Sai-Suraj-27 To move fast, I pushed some commits that should work well (on our CI runner), including the fixes for
I will merge once @vasqu has a final look 🙏. Thanks again for the work!
vasqu
left a comment
Thanks, will also try to investigate a bit more later because something clearly goes wrong within the model
)
self.assertFalse(torch.isnan(output[1]).any().item())

@run_first
Any reason we want this to run first?
if "inputs_embeds" in model_kwargs:
    return torch.ones((batch_size, 0), dtype=torch.long, device=self.device)
return torch.ones(
Can you add a comment with a reference to here?
Ohoh yes, forgot before pushing.
Hey, @ydshieh. Thanks for pushing the fix. I was able to run the test on an RTX PRO 6000, and it's running fine without the meta device issue. But in the case of an A10 GPU, device_map="auto" offloads the talker module to CPU, and iiuc from the accelerate code, it keeps the parameters of CPU/disk-offloaded modules as meta tensors (which is why model.talker.device returns "meta" on the A10) and only loads the real weights onto the GPU right before forward.
Since the test ran fine on the big GPU but fails on the A10, I can confirm with this fix that the issue is with how we use self.device in this method. So maybe we can add a comment about this accelerate behaviour here, pointing to the relevant accelerate code.
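The offloading behaviour described above can be sketched with plain torch (hypothetical module name; this is an illustration of the symptom, not the actual accelerate internals):

```python
import torch
from torch import nn

# Hypothetical illustration: a module scheduled for CPU/disk offload keeps
# its parameters as `meta` tensors, so reading the parameter device reports
# "meta", not the device the computation will eventually run on.
offloaded_talker = nn.Linear(4, 4, device="meta")
print(next(offloaded_talker.parameters()).device)  # meta

# A tensor already flowing through the model carries the real execution
# device, so deriving the device from such a tensor sidesteps the problem.
inputs_embeds = torch.zeros((1, 3, 4))
print(inputs_embeds.device)  # cpu here; cuda:0 on a GPU runner
```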
A comment is added (a few lines below):
# Use the device of the existing tensor to avoid any potential `meta` device issue.
# See PR #44848. (Previously, it used `self.device`.)
I think it's enough with the reference to this PR.
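As a rough sketch of the pattern that comment describes (hypothetical helper name and simplified logic, not the real Qwen3 Omni MoE method):

```python
import torch

def make_dummy_input_ids(model_kwargs, batch_size, self_device):
    # Hypothetical, simplified version of the snippet in the diff above.
    if "inputs_embeds" in model_kwargs:
        embeds = model_kwargs["inputs_embeds"]
        # After the fix: reuse the device of an existing tensor instead of
        # `self.device`, which can report `meta` when the module is offloaded.
        return torch.ones((batch_size, 0), dtype=torch.long, device=embeds.device)
    # Fallback path still uses the module's own device.
    return torch.ones((batch_size, 1), dtype=torch.long, device=self_device)

embeds = torch.zeros((2, 5, 8))
dummy = make_dummy_input_ids({"inputs_embeds": embeds}, batch_size=2, self_device="cpu")
print(dummy.shape, dummy.device)  # torch.Size([2, 0]) cpu
```

The design choice is simply that a tensor which already reached this code path must live on a real device, whereas the module attribute may not.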
|
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=44848&sha=0865d5 |
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=44848&sha=f57a22 |
What does this PR do?
Update Qwen3_Omni_Moe to fix the attribute errors in Qwen3OmniModelIntegrationTests.
Almost the same issue was fixed initially in #43084, but the config refactor in #41250 dropped/missed initializer_range from Qwen3OmniMoeConfig.
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Who can review?
@Rocketknight1 @vasqu