[AMD] Fix AMD CI test of TestToolChoiceLfm2Moe#19113
- Relax LoRA multi-batch ROUGE-L tolerance from 1.0 to 0.95 to account for minor numerical non-determinism on ROCm
- Fix aiter attention backend crashing on hybrid Mamba+attention models (e.g. LFM2-MoE): use get_v_head_dim() for hybrid KV pools instead of the hardcoded get_value_buffer(0), which fails when layer 0 is not an attention layer
- Skip TestToolChoiceLfm2Moe on AMD: the sgl_kernel ROCm build lacks the causal_conv1d_update op needed by Mamba layers
Upstream already has a proper v_head_dim fix (handling MLA, hybrid_gdn, kimi_linear models) so our hasattr-based version is no longer needed.
The aiter RoPE backend has lower precision (as warned by apex), causing consistent single-token differences between SRT and HF reference outputs (ROUGE-L 0.9774 vs required 1.0). Disable it for the LoRA multi-batch test to produce exact matches.
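One way to keep the LoRA multi-batch test on the higher-precision RoPE path is to scope an environment override to the test class. This is only an illustrative sketch: SGLANG_ROCM_DISABLE_AITER_ROPE is a hypothetical variable name, not an actual sglang flag.

```python
# Sketch of scoping a backend override to a single test class.
# SGLANG_ROCM_DISABLE_AITER_ROPE is a hypothetical name used only to
# illustrate the pattern; it is not the real sglang configuration flag.
import os
import unittest

class TestLoraMultiBatch(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Save any pre-existing value so other tests are unaffected.
        cls._saved = os.environ.get("SGLANG_ROCM_DISABLE_AITER_ROPE")
        os.environ["SGLANG_ROCM_DISABLE_AITER_ROPE"] = "1"

    @classmethod
    def tearDownClass(cls):
        # Restore the original environment after the class finishes.
        if cls._saved is None:
            os.environ.pop("SGLANG_ROCM_DISABLE_AITER_ROPE", None)
        else:
            os.environ["SGLANG_ROCM_DISABLE_AITER_ROPE"] = cls._saved

    def test_override_is_scoped(self):
        self.assertEqual(os.environ["SGLANG_ROCM_DISABLE_AITER_ROPE"], "1")
```

The save/restore in setUpClass/tearDownClass keeps the override from leaking into other test classes in the same CI job.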
The existing check only covers hybrid_gdn_config and kimi_linear_config, but LFM2 models use HybridLinearKVPool without either config. Use hasattr(get_v_head_dim) to cover all hybrid KV pool types, matching triton_backend.py.
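The hasattr-based dispatch described above can be sketched as follows. The pool classes and the resolve_v_head_dim helper here are illustrative stand-ins, not the actual sglang code: they only show why probing for get_v_head_dim() covers hybrid pools that the config-based check misses.

```python
# Sketch of selecting the value head dim for both regular and hybrid KV
# pools. RegularKVPool / HybridKVPool are illustrative stand-ins, not the
# real sglang memory-pool classes.

class RegularKVPool:
    """Pool where layer 0 is guaranteed to be an attention layer."""
    def __init__(self, v_head_dim):
        self._v_head_dim = v_head_dim

    def get_value_buffer(self, layer_id):
        # Returns a buffer whose last dimension is the value head dim.
        return [[0.0] * self._v_head_dim]

class HybridKVPool:
    """Hybrid Mamba+attention pool: layer 0 may be a Mamba layer,
    so get_value_buffer(0) is not safe to call."""
    def __init__(self, v_head_dim):
        self._v_head_dim = v_head_dim

    def get_v_head_dim(self):
        return self._v_head_dim

def resolve_v_head_dim(kv_pool):
    # Hybrid pools (LFM2's HybridLinearKVPool, hybrid_gdn, kimi_linear, ...)
    # expose get_v_head_dim(); hardcoding get_value_buffer(0) would fail
    # when layer 0 is not an attention layer.
    if hasattr(kv_pool, "get_v_head_dim"):
        return kv_pool.get_v_head_dim()
    return len(kv_pool.get_value_buffer(0)[0])

print(resolve_v_head_dim(HybridKVPool(64)))    # 64
print(resolve_v_head_dim(RegularKVPool(128)))  # 128
```

Probing the method rather than enumerating model configs means any future hybrid pool type that implements get_v_head_dim() is handled without touching the backend again, which is the same approach triton_backend.py takes.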
The stage-b LoRA test will be fixed in the next PR. cc: @yctseng0211, @bingxche, @sogalin, @HaiShaw
The install-dependency timeout error is unrelated to this PR. Could you please take a look? Thanks in advance. @alisonshao @Kangyan-Zhou
@hubertlu-tw Could you have another look? The PR now only changes TestToolChoiceLfm2Moe, using the Triton implementation. Thanks! Cc: @HaiShaw
Co-authored-by: michaelzhang-ai <michaelzhang-ai@users.noreply.github.com> Co-authored-by: bingxche <Bingxu.Chen@amd.com> Co-authored-by: yctseng0211 <yctseng@amd.com>


Motivation
Fix pre-existing stage-b AMD CI test failures:
https://github.com/sgl-project/sglang/actions/runs/22327699165/job/64603707599#step:6:14389
ValueError: layer_id=0 not in full attention layers: dict_keys([2, 6, 10, 14, 18, 21])
raised at aiter_backend.py line 126. The aiter attention backend hardcodes layer_id=0 to get the value head dim, but LFM2-MoE is a hybrid Mamba+attention model where only layers [2, 6, 10, 14, 18, 21] are attention layers; layer 0 is a Mamba layer.
On CUDA, the model uses flashinfer which handles this correctly. On AMD, aiter is auto-selected and crashes.
Please help review: @yctseng0211, @bingxche, @HaiShaw, @sogalin . Thanks!
Modifications
- aiter_backend.py: Use hasattr(get_v_head_dim) to cover all hybrid KV pool types including LFM2, matching triton_backend.py. The previous check only covered hybrid_gdn_config / kimi_linear_config.
- causal_conv1d.py: Fall back to Triton kernels when the sgl_kernel causal_conv1d ops are unavailable on ROCm. (by @bingxche)
- test_tool_choice.py: Remove the TestToolChoiceLfm2 AMD skip since causal_conv1d now falls back to Triton. (by @bingxche)
- pr-test-amd.yml, pr-test-amd-rocm720.yml: Add a new AMD stage-b job. (by @yctseng0211)
- run_suite.py, slash_command_handler.py: CI plumbing for the new stage-b job. (by @yctseng0211)
Test plan
- TestToolChoiceLfm2Moe no longer crashes with ValueError: layer_id=0