[ROCm][Bugfix][CI] Fix hybrid models and their tests (Mamba/Jamba/Bamba) #32710
tjtanaa merged 5 commits into vllm-project:main from
Conversation
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Code Review
This pull request addresses a bug where Mamba-1 models produce incorrect output on ROCm platforms. The root cause is correctly identified as ROCm's GEMM implementation producing incorrect results for non-contiguous input tensors. The fix involves ensuring the time_step tensor is contiguous before it's used in the dt_proj layer, which performs a GEMM operation. The change is targeted and effectively resolves the issue. The code is clean and the accompanying comment clearly explains the necessity of the fix for ROCm. The change looks good and correctly fixes the described problem.
Hi @AndreasKaratzas, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.
…cing batch variance Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com>
d0d444a to 6f9ba30
The Language Models (Hybrid) test group is now passing in AMD CI.
```python
if current_platform.is_rocm():
    time_step = time_step.contiguous()
```
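As background for why this fix helps: `torch.split()` returns strided views of the parent tensor, so the resulting `time_step` is not contiguous in memory, and ROCm's GEMM path mishandles such inputs. The same memory-layout behavior can be sketched with a NumPy analogy (NumPy here is only an illustration; the PR itself uses PyTorch's `Tensor.contiguous()`):

```python
import numpy as np

# Splitting along the last dimension yields strided views that are NOT
# contiguous in memory -- torch.split() behaves the same way.
x = np.arange(12, dtype=np.float32).reshape(3, 4)
left, right = np.split(x, 2, axis=1)

print(left.flags["C_CONTIGUOUS"])  # False: a view with a stride gap per row

# np.ascontiguousarray copies into a fresh, densely packed buffer,
# mirroring what tensor.contiguous() does in PyTorch.
left_c = np.ascontiguousarray(left)
print(left_c.flags["C_CONTIGUOUS"])  # True
print(np.array_equal(left, left_c))  # True: same values, new layout
```

The copy is what costs latency, which is why the fix is gated to ROCm rather than applied on all platforms.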
Would it make sense to just do this unconditionally, i.e., for all platforms?
It introduces a latency overhead, so since hybrid models' correctness is fine on other platforms, I don't see a reason to apply this everywhere.
We can't add .contiguous calls in the model code or _custom_ops, for example, because it introduces severe performance regressions. @tdoublep
@rasmith Sorry for asking again. So, adding the `.contiguous()` does not introduce overhead?
@tjtanaa It introduces overhead, but on ROCm these tensors needed to be contiguous the last time I tested. I can rerun the test without the calls and check whether it still passes, since other PRs may already have addressed this issue at a more core kernel level.
@tjtanaa Tests are now passing without the contiguous as well: `pytest -v -s tests/models/language/generation -m hybrid_model`
The PR that made this change obsolete was: #33366
Let me know if there is any other blocker in this one :)
EDIT: Just ran the same test group with contiguous. I think I'll update this PR.
@AndreasKaratzas You mentioned that the test is passing without the `.contiguous()`. So does this mean we can remove the `time_step = time_step.contiguous()` call?
@tjtanaa True, but the results are a bit more accurate with the contiguous. At the same time, it does not look like it incurs a performance issue: at least at the test level, the tests finish sooner with the contiguous than without it.
Ok, sure. Then let's go with the current fix first.
```python
# Reduce the effects of batch variance on ROCm since batch invariance is not
# yet supported. See: https://github.com/vllm-project/vllm/issues/27433
if current_platform.is_rocm():
    vllm_runner_kwargs["max_num_seqs"] = 4
```
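The test-side change above caps the number of concurrent sequences so that batch composition varies less between runs. A minimal sketch of that pattern, assuming a hypothetical helper name (`apply_rocm_workarounds` is not in the PR; the real test passes the kwargs straight to `vllm_runner`):

```python
def apply_rocm_workarounds(vllm_runner_kwargs: dict, is_rocm: bool) -> dict:
    """Cap max_num_seqs on ROCm, where batch invariance is not yet
    supported, so batch-variance effects on logprobs stay small."""
    if is_rocm:
        # Smaller concurrent batches -> less run-to-run variance.
        vllm_runner_kwargs["max_num_seqs"] = 4
    return vllm_runner_kwargs

# Example: only the ROCm path picks up the cap.
print(apply_rocm_workarounds({"max_model_len": 1024}, is_rocm=True))
print(apply_rocm_workarounds({"max_model_len": 1024}, is_rocm=False))
```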
Is this change strictly related to the bug fix from this PR?
Yes, in the sense that this PR is intended to fix test failures in the Language Models (Hybrid) test group on AMD ROCm. Without this change, that test consistently fails.
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
The Test Results:

Tests pass without the `.contiguous()` call, but output quality degrades.

Recommendation: KEEP the `.contiguous()` call - it improves output accuracy.

Test Statistics Comparison

Affected Models and Impact
- Notable Impact
- Moderate Impact
- Minor Impact
- Also Affected

Numerical Analysis
- Logprob Divergence
- Token Prediction Changes

Performance

Contiguous seems to boost performance a bit (with vs. without).

Full logs with `.contiguous()`

Full logs without `.contiguous()`

cc @tjtanaa @mawong-amd @tdoublep I am thinking of keeping the contiguous there. Let me know what you think based on the above.
…(issuecomment-3845064063) Signed-off-by: Andreas Karatzas <akaratza@amd.com>
While the total test time is almost certainly correlated with e2e latency, it's not a robust measure of performance. In the future it is probably better to collect proper
…ba) (vllm-project#32710) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Matthew Wong <Matthew.Wong2@amd.com> Co-authored-by: Matthew Wong <Matthew.Wong2@amd.com>
Fixes Mamba-1 models producing garbage output on ROCm.
Problem

On ROCm, Mamba models (e.g., `state-spaces/mamba-130m-hf`) produce completely incorrect output.

Root Cause

In `_ssm_transform`, after `torch.split()`:
- `time_step` is non-contiguous after `split()`
- `dt_proj` is a `ColumnParallelLinear` that uses GEMM

Testing
All Mamba-130m tests pass:

pytest tests/models/language/generation/test_hybrid.py -k "mamba-130m" -v

Out of Scope
The following failures are also resolved in a later commit in this PR:
- `mamba_mixer2.py`: may need a similar fix in a separate PR