
[AMD] Fix AMD CI test of TestToolChoiceLfm2Moe #19113

Merged
HaiShaw merged 14 commits into main from fix/amd-ci-flaky-tests
Feb 27, 2026
Conversation

@michaelzhang-ai
Collaborator

@michaelzhang-ai michaelzhang-ai commented Feb 21, 2026

Motivation

Fix pre-existing stage-b AMD CI test failures:
https://github.com/sgl-project/sglang/actions/runs/22327699165/job/64603707599#step:6:14389

ValueError: layer_id=0 not in full attention layers: dict_keys([2, 6, 10, 14, 18, 21])
raised at aiter_backend.py line 126. The aiter attention backend hardcodes layer_id=0 when querying the value head dim, but LFM2-MoE is a hybrid Mamba+attention model in which only layers [2, 6, 10, 14, 18, 21] are attention layers; layer 0 is a Mamba layer.
On CUDA the model uses flashinfer, which handles this correctly; on AMD, aiter is auto-selected and crashes.
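The failure mode can be reproduced with a minimal sketch. The class and method names below are illustrative, not the actual sglang KV-pool API: the point is that a hybrid pool only keeps value buffers for its attention layers, so a hardcoded layer-0 lookup fails when layer 0 is a Mamba layer.

```python
# Minimal illustration of the crash (illustrative names, not the sglang API):
# a hybrid pool only keeps value buffers for its attention layers.
class HybridKVPool:
    def __init__(self, full_attention_layer_ids, head_dim):
        # Only attention layers get a value buffer; Mamba layers do not.
        self.value_buffers = {
            layer_id: [0.0] * head_dim for layer_id in full_attention_layer_ids
        }

    def get_value_buffer(self, layer_id):
        if layer_id not in self.value_buffers:
            raise ValueError(
                f"layer_id={layer_id} not in full attention layers: "
                f"{self.value_buffers.keys()}"
            )
        return self.value_buffers[layer_id]


pool = HybridKVPool(full_attention_layer_ids=[2, 6, 10, 14, 18, 21], head_dim=64)
assert len(pool.get_value_buffer(2)) == 64  # layer 2 is an attention layer
try:
    pool.get_value_buffer(0)  # layer 0 is a Mamba layer
except ValueError as e:
    print(e)  # same shape of error as the CI failure above
```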

Please help review: @yctseng0211, @bingxche, @HaiShaw, @sogalin . Thanks!

Modifications

  • aiter_backend.py: Use hasattr(get_v_head_dim) to cover all hybrid KV pool types including LFM2, matching triton_backend.py. Previous check only covered hybrid_gdn_config/kimi_linear_config.

  • causal_conv1d.py: Fall back to Triton kernels when sgl_kernel causal_conv1d ops are unavailable on ROCm. (by @bingxche)

  • test_tool_choice.py: Remove TestToolChoiceLfm2 AMD skip since causal_conv1d now falls back to Triton. (by @bingxche)

  • pr-test-amd.yml, pr-test-amd-rocm720.yml: Add new AMD stage-b job. (by @yctseng0211)

  • run_suite.py, slash_command_handler.py: CI plumbing for new stage-b job. (by @yctseng0211)
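The aiter_backend.py change can be sketched as follows. `resolve_v_head_dim` and the pool methods other than `get_v_head_dim` are illustrative names, not the actual sglang API: the idea is to probe for the hybrid-pool method instead of enumerating model configs.

```python
def resolve_v_head_dim(kv_pool):
    """Pick the value head dim without assuming layer 0 is an attention layer.

    Sketch only, under assumed names: hybrid KV pools (Mamba+attention
    models such as LFM2-MoE) expose get_v_head_dim(), while plain pools
    fall back to the shape of layer 0's value buffer.
    """
    if hasattr(kv_pool, "get_v_head_dim"):
        # Covers every hybrid pool type, regardless of which model config
        # (hybrid_gdn, kimi_linear, LFM2, ...) produced it.
        return kv_pool.get_v_head_dim()
    # Non-hybrid pools always have an attention layer at index 0.
    return kv_pool.get_value_buffer(0).shape[-1]
```

A capability check like this is more robust than listing configs: any future hybrid pool that implements `get_v_head_dim` is handled without touching the backend again.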

Test plan

  • Verify TestToolChoiceLfm2Moe no longer crashes with ValueError: layer_id=0
  • Verify AMD CI stage-b shards 5, 11, 13 pass

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions bot added lora Multi-modal multi-modal language model labels Feb 21, 2026
@michaelzhang-ai michaelzhang-ai marked this pull request as ready for review February 21, 2026 07:47

@michaelzhang-ai michaelzhang-ai changed the title [AMD] Fix 3 pre-existing AMD CI test failures [AMD] [DO NOT MERGE] Fix 3 pre-existing AMD CI test failures Feb 21, 2026
@michaelzhang-ai michaelzhang-ai changed the title [AMD] [DO NOT MERGE] Fix 3 pre-existing AMD CI test failures [AMD] [DO NOT MERGE] Fix pre-existing AMD CI test failures Feb 24, 2026
@michaelzhang-ai michaelzhang-ai force-pushed the fix/amd-ci-flaky-tests branch 2 times, most recently from 759ac90 to cfdea4c Compare February 24, 2026 22:32
- Relax LoRA multi-batch ROUGE-L tolerance from 1.0 to 0.95 to account
  for minor numerical non-determinism on ROCm
- Fix aiter attention backend crashing on hybrid Mamba+attention models
  (e.g. LFM2-MoE): use get_v_head_dim() for hybrid KV pools instead of
  hardcoded get_value_buffer(0) which fails when layer 0 is not an
  attention layer
- Skip TestToolChoiceLfm2Moe on AMD: sgl_kernel ROCm build lacks
  causal_conv1d_update op needed by Mamba layers
@yctseng0211 yctseng0211 marked this pull request as ready for review February 25, 2026 11:34
@michaelzhang-ai michaelzhang-ai force-pushed the fix/amd-ci-flaky-tests branch 2 times, most recently from 6111bf0 to 39d1391 Compare February 25, 2026 20:23
michaelzhang-ai and others added 7 commits February 25, 2026 14:25

  • Upstream already has a proper v_head_dim fix (handling MLA, hybrid_gdn, kimi_linear models), so our hasattr-based version is no longer needed.

  • The aiter RoPE backend has lower precision (as warned by apex), causing consistent single-token differences between SRT and HF reference outputs (ROUGE-L 0.9774 vs required 1.0). Disable it for the LoRA multi-batch test to produce exact matches.

  • The existing check only covers hybrid_gdn_config and kimi_linear_config, but LFM2 models use HybridLinearKVPool without either config. Use hasattr(get_v_head_dim) to cover all hybrid KV pool types, matching triton_backend.py.
@michaelzhang-ai
Collaborator Author

The stage-b LoRA test will be fixed in the next PR. cc: @yctseng0211, @bingxche, @sogalin, @HaiShaw

@michaelzhang-ai michaelzhang-ai changed the title [AMD] [DO NOT MERGE] Fix pre-existing AMD CI test failures [AMD] Fix AMD CI test of TestToolChoiceLfm2Moe Feb 27, 2026
@bingxche
Collaborator

bingxche commented Feb 27, 2026

The test_moriep_small error will be fixed in another PR. https://github.com/sgl-project/sglang/actions/runs/22471492647/job/65100816457?pr=19113#step:7:12894

cc @yctseng0211 @michaelzhang-ai @HaiShaw

@bingxche
Collaborator

test_tool_choice.py in the NV CI also passed: https://github.com/sgl-project/sglang/actions/runs/22471492640/job/65102200115?pr=19113#step:5:31

The install-dependency timeout error is unrelated to this PR.

Could you please take a look? Thanks in advance. @alisonshao @Kangyan-Zhou

@michaelzhang-ai
Collaborator Author

@hubertlu-tw Could you have another look? The PR now only changes TestToolChoiceLfm2Moe, using the Triton implementation. Thanks! Cc: @HaiShaw

@HaiShaw HaiShaw merged commit 1b79934 into main Feb 27, 2026
199 of 222 checks passed
@HaiShaw HaiShaw deleted the fix/amd-ci-flaky-tests branch February 27, 2026 18:18
magicYang1573 pushed a commit to magicYang1573/sglang that referenced this pull request Mar 9, 2026
Co-authored-by: michaelzhang-ai <michaelzhang-ai@users.noreply.github.com>
Co-authored-by: bingxche <Bingxu.Chen@amd.com>
Co-authored-by: yctseng0211 <yctseng@amd.com>
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
Co-authored-by: michaelzhang-ai <michaelzhang-ai@users.noreply.github.com>
Co-authored-by: bingxche <Bingxu.Chen@amd.com>
Co-authored-by: yctseng0211 <yctseng@amd.com>