[AMD] fix tbo specv2 seq_lens_cpu NoneType error by billishyahao · Pull Request #24319 · sgl-project/sglang

billishyahao · 2026-05-03T16:47:45Z

Motivation

The issue was identified in CI task https://github.com/sgl-project/sglang/actions/runs/25151525365/job/73798865426#step:7:16755
See also the workaround in #24205 and the tracking issue #24212.

cc @HaiShaw @hubertlu-tw @bingxche

This patch fixes a crash that occurs when SpecV2 (now the default) is enabled together with TBO, as exercised by TestMTPwithTBOLowLatency. The traceback is as follows:

File "/sgl-workspace/sglang/python/sglang/srt/batch_overlap/two_batch_overlap.py", line 208, in split_spec_info
    if end_seq_index == spec_info.seq_lens_cpu.shape[0]:
AttributeError: 'NoneType' object has no attribute 'shape'

The crash occurs at two_batch_overlap.py:208 in split_spec_info because spec_info.seq_lens_cpu is None.

In the v2 draft path (eagle_worker_v2.py:390), EagleVerifyInput is created with seq_lens_cpu=None and seq_lens_sum=None, whereas the v1 path (eagle_worker.py:800) correctly passes forward_batch.seq_lens_cpu. Later, prepare_for_v2_verify updates batch.seq_lens_cpu but never writes it back to the EagleVerifyInput object itself. When TBO calls split_spec_info(forward_batch.spec_info, ...), it reads spec_info.seq_lens_cpu, which is still None.

The same issue applies to multi_layer_eagle_worker_v2.py:283, but both paths share the EagleVerifyInputV2Mixin, so the fix covers both.
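The failure mode described above can be illustrated with a minimal sketch. The classes `VerifyInput` and `Batch` and the function `split_like_tbo` are hypothetical stand-ins, not the actual SGLang types; plain lists replace tensors, so the stand-in raises `TypeError` on `len(None)` where the real code raises `AttributeError` on `.shape`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VerifyInput:
    # Hypothetical stand-in for EagleVerifyInput; only the relevant field.
    seq_lens_cpu: Optional[List[int]] = None


@dataclass
class Batch:
    # Hypothetical stand-in for the scheduler batch.
    seq_lens_cpu: List[int]
    spec_info: Optional[VerifyInput] = None


def split_like_tbo(spec_info: VerifyInput, end_seq_index: int) -> bool:
    # Mirrors the shape of the failing check: it dereferences
    # spec_info.seq_lens_cpu without consulting the batch.
    return end_seq_index == len(spec_info.seq_lens_cpu)


batch = Batch(seq_lens_cpu=[5, 7, 9])
# v2 draft path: the verify input is created with seq_lens_cpu=None.
batch.spec_info = VerifyInput(seq_lens_cpu=None)

# The batch-level value is updated, but nothing is written back
# to the verify input.
batch.seq_lens_cpu = [n + 1 for n in batch.seq_lens_cpu]

try:
    split_like_tbo(batch.spec_info, end_seq_index=3)
    failed = False
except TypeError:
    # The field on the verify input is still None.
    failed = True
```

The point of the sketch is that the consumer reads the field from the spec info object, so keeping the batch up to date is not enough on its own.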

Modifications

python/sglang/srt/speculative/eagle_info_v2.py: In prepare_for_v2_verify, populate self.seq_lens_cpu and self.seq_lens_sum from batch before ForwardBatch.init_new() is called.
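As a rough sketch of the idea behind this change (not the actual diff), the verify input can copy the batch-level values onto itself during preparation. The names `VerifyInput`, `Batch`, and `prepare_for_verify` below are hypothetical stand-ins, and lists replace tensors.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Batch:
    # Hypothetical stand-in for the scheduler batch.
    seq_lens_cpu: List[int]
    seq_lens_sum: int


@dataclass
class VerifyInput:
    # Hypothetical stand-in for the v2 verify input.
    seq_lens_cpu: Optional[List[int]] = None
    seq_lens_sum: Optional[int] = None

    def prepare_for_verify(self, batch: Batch) -> None:
        # Copy the batch-level values onto the verify input so that
        # later consumers reading from the spec info see real data.
        self.seq_lens_cpu = batch.seq_lens_cpu
        self.seq_lens_sum = batch.seq_lens_sum


batch = Batch(seq_lens_cpu=[6, 8, 10], seq_lens_sum=24)
spec_info = VerifyInput()
spec_info.prepare_for_verify(batch)

# A check shaped like the one in split_spec_info now works.
is_last_chunk = 3 == len(spec_info.seq_lens_cpu)
```

Placing the copy in the shared preparation step is what lets a single change cover every path that goes through the same mixin.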

Accuracy Tests

TestMTPwithTBOLowLatency passed

Speed Tests and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

gemini-code-assist

Code Review

This pull request updates python/sglang/srt/speculative/eagle_info_v2.py to populate seq_lens_cpu and seq_lens_sum on the verify input, which is necessary for correct custom_mask slicing in split_spec_info. Feedback was provided to move these assignments into an existing conditional block to eliminate a redundant check and improve code clarity.

hubertlu-tw

LGTM

hubertlu-tw · 2026-05-04T18:03:32Z

/tag-and-rerun-ci

HaiShaw · 2026-05-05T08:47:08Z

@amd-bot ci-status

amd-bot · 2026-05-05T08:50:01Z

@HaiShaw

Commit Info

PR head SHA: 5553c474
Diff: 1 file, +5/-0 in python/sglang/srt/speculative/eagle_info_v2.py — populates self.seq_lens_cpu / self.seq_lens_sum on the verify input inside prepare_for_v2_verify, so TBO's split_spec_info can slice custom_mask. Scope: EAGLE V2 + TBO (Two-Batch Overlap) AMD path only.

CI Status for PR #24319

PR: [AMD] fix tbo specv2 seq_lens_cpu NoneType error
Changed files: python/sglang/srt/speculative/eagle_info_v2.py (+5/-0)

AMD: 3 failures (0 likely related) | Others: 4 failures (0 related) + 5 fast-fail skips + 2 aggregator failures

AMD CI Failures

| Job | Test File | Test Function | Error | Related? | Explanation | Log |
| --- | --- | --- | --- | --- | --- | --- |
| stage-b-test-1-gpu-small-amd-mi35x | `test/registered/core/test_gpt_oss_1gpu.py` | `test_mxfp4_20b` | `AssertionError: False is not true` (empty streaming response) | 🟢 Unlikely | Known never-passed test on `mi35x` (and `mi35x-rocm720`); intended fix #23829 was closed unmerged 2026-04-28. GPT-OSS mxfp4 path has no overlap with EAGLE V2 / TBO code. | Log |
| stage-b-test-1-gpu-large-amd (1) | `test/registered/perf/test_bench_serving_1gpu_part2.py` | `test_score_api_batch_scaling` | `AssertionError: 73.229 not less than 70` (avg_latency_ms perf threshold) | 🟢 Unlikely | Perf-budget regression on `gte-Qwen2-1.5B` embedding/score API (latency 73 ms vs budget 70 ms). Embedding model + score API don't use speculative decoding at all; no path to EAGLE V2. | Log |
| stage-c-test-large-8-gpu-amd (2) | `test/registered/amd/test_moriep_small.py` | `setUpClass` (server failed to come up; `py-spy dump` could not get traces) | `timeout after 3600s` | 🟢 Unlikely | MoRI-EP (MoE expert-parallelism) test; server stuck during model load (`Multi-thread loading shards` then `Pyspy failed`). Unrelated to EAGLE V2 / TBO change. Looks like infra/MoE-loading hang. | Log |

Other CI Failures

| Job | Test File | Test Function | Error | Related? | Explanation | Log |
| --- | --- | --- | --- | --- | --- | --- |
| stage-c-test-4-gpu-b200 (5) | `test/registered/4-gpu-models/test_qwen35_fp4_triton.py` | `test_gsm8k` | `FileNotFoundError: '/root/.cache/tvm-ffi/sgl_kernel_jit_activation_bf16_t_true_92ff72c75a8dd73f'` (during `kernel_warmup` on all TP ranks) | 🟢 Unlikely | sgl-kernel JIT cache directory missing; this is a runner/build-cache infrastructure issue in the sgl-kernel JIT path, not in spec decoding. | Log |
| stage-c-test-8-gpu-h20 | `test/registered/distributed/test_disaggregation_different_tp.py` (in-progress when killed) | N/A (process killed) | `Process completed with exit code 137` (SIGKILL / OOM) | 🟢 Unlikely | PD-disaggregation test killed by OOM-killer / runner. PR doesn't touch disaggregation code; no spec-decoding path involved. | Log |
| stage-c-test-4-gpu-b200 (0), (3), (4), stage-c-test-8-gpu-h200 (1), stage-c-test-4-gpu-h100 (2) | N/A | N/A | `Fast-fail: skipping — root cause job(s): stage-c-test-4-gpu-b200 (5)` / `stage-c-test-8-gpu-h20` | 🟢 Unlikely | These jobs were skipped by the fast-fail mechanism because the b200 (5) / h20 jobs above failed first. They themselves did not run any tests. | N/A |
| pr-test-finish, pr-test-amd-finish, pr-test-npu-finish, finish | N/A | N/A | Aggregator jobs reporting overall pipeline failure | 🟢 Unlikely | These are roll-up gates that fail when any upstream job fails; not standalone failures. | N/A |

Details

No failure on this PR appears related to the diff. The PR modifies eagle_info_v2.py:284 (one site, additive) inside prepare_for_v2_verify to copy seq_lens_cpu / seq_lens_sum from batch onto self. To trigger this code, a CI test would need to exercise EAGLE Spec V2 with TBO enabled. None of the failing test files do:

test_gpt_oss_1gpu.py::test_mxfp4_20b — GPT-OSS MoE inference, no spec decoding. Pre-existing never-passed mi35x failure (per memory).
test_bench_serving_1gpu_part2.py::test_score_api_batch_scaling — embedding/score API on gte-Qwen2-1.5B, no spec decoding. Perf-budget bound (70 ms) is tight; 73 ms is a marginal flake/perf-regression in something else.
test_moriep_small.py — MoRI-EP MoE distributed, no spec decoding. Hung during model shard loading (Pyspy failed (py-spy dump)).
test_qwen35_fp4_triton.py::test_gsm8k — JIT cache directory missing on b200 runner; infrastructure / sgl-kernel JIT path, not spec decoding.
test_disaggregation_*.py on h20 — PD-disaggregation, killed by OOM (exit 137). Disaggregation does not enter the modified function.
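The reading of exit code 137 as a SIGKILL follows the common POSIX shell convention that a process terminated by signal N is reported with status 128 + N. A small sketch of that decoding, assuming a POSIX platform (the helper name `describe_exit_code` is hypothetical):

```python
import signal


def describe_exit_code(code: int) -> str:
    # Shell convention: status above 128 means "terminated by
    # signal (code - 128)". This is a convention, not a guarantee.
    if code > 128:
        return f"killed by {signal.Signals(code - 128).name}"
    return f"exited with status {code}"


description = describe_exit_code(137)
```

The exit code identifies only the signal. Attributing the kill to the OOM killer is an inference from context, since other actors (for example the CI runner) can also send SIGKILL.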

Recommendation: The PR's CI failures are all pre-existing or infrastructure issues — there is nothing to fix in this PR for any of them. Author can ignore these CI failures and may want to re-run the b200/h20/large-amd jobs once to confirm the JIT-cache and OOM signals are transient. The mi35x `test_mxfp4_20b` failure is a tracked never-passed AMD issue separate from this PR.

Generated by amd-bot using Claude Code CLI

[AMD] fix tbo specv2 seq_lens_cpu NoneType error

0a83c9b

billishyahao requested review from Qiaolin-Yu, Ying1123, hnyls2002 and merrymercy as code owners May 3, 2026 16:47

gemini-code-assist Bot reviewed May 3, 2026

Comment thread python/sglang/srt/speculative/eagle_info_v2.py Outdated

billishyahao mentioned this pull request May 3, 2026

[Feature][AMD] Address the specv2 + moriep tbo compatibility #24212

Open

2 tasks

fix comments

5553c47

HaiShaw approved these changes May 4, 2026

hubertlu-tw approved these changes May 4, 2026

github-actions Bot added the run-ci label May 4, 2026

HaiShaw merged commit 80ccb6b into sgl-project:main May 5, 2026
188 of 219 checks passed

LucQueen pushed a commit to LucQueen/sglang that referenced this pull request May 12, 2026

[AMD] fix tbo specv2 seq_lens_cpu NoneType error (sgl-project#24319)

3da3b93

HaiShaw merged 2 commits into sgl-project:main from HaiShaw:fix_tbo_specv2

4 participants
