[AMD] fix tbo specv2 seq_lens_cpu NoneType error #24319

Merged
HaiShaw merged 2 commits into sgl-project:main from HaiShaw:fix_tbo_specv2
May 5, 2026

Conversation

Contributor

@billishyahao billishyahao commented May 3, 2026

Motivation

The issue was identified in CI task https://github.com/sgl-project/sglang/actions/runs/25151525365/job/73798865426#step:7:16755, with a workaround in #24205 and a tracking issue in #24212.

cc @HaiShaw @hubertlu-tw @bingxche

This patch fixes a crash that occurs when SpecV2 (now the default) is enabled together with TBO, as exercised by TestMTPwithTBOLowLatency. The traceback is as follows:

File "/sgl-workspace/sglang/python/sglang/srt/batch_overlap/two_batch_overlap.py", line 208, in split_spec_info
    if end_seq_index == spec_info.seq_lens_cpu.shape[0]:
AttributeError: 'NoneType' object has no attribute 'shape'

The failure occurs at two_batch_overlap.py:208 in split_spec_info because spec_info.seq_lens_cpu is None.

In the v2 draft path (eagle_worker_v2.py:390), EagleVerifyInput is created with seq_lens_cpu=None and seq_lens_sum=None. The v1 path (eagle_worker.py:800) correctly passes forward_batch.seq_lens_cpu. Later, prepare_for_v2_verify updates batch.seq_lens_cpu but never writes it back to the EagleVerifyInput object itself. When TBO calls split_spec_info(forward_batch.spec_info, ...), it reads spec_info.seq_lens_cpu which is still None.

The same issue applies to multi_layer_eagle_worker_v2.py:283, but both paths share the EagleVerifyInputV2Mixin, so the fix covers both.
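The failure mode described above can be reproduced in isolation with a minimal sketch. All class and function names below (FakeLens, FakeVerifyInput, FakeBatch, is_last_split, reproduce) are hypothetical stand-ins, not the SGLang types; only the comparison against seq_lens_cpu.shape[0] is taken from the traceback.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FakeLens:
    """Hypothetical stand-in for a CPU tensor of sequence lengths."""
    values: List[int]

    @property
    def shape(self) -> Tuple[int]:
        return (len(self.values),)


@dataclass
class FakeVerifyInput:
    """Stand-in for the verify input; only the two relevant fields are modeled."""
    seq_lens_cpu: Optional[FakeLens] = None
    seq_lens_sum: Optional[int] = None


@dataclass
class FakeBatch:
    """Stand-in for the batch that owns the up-to-date lengths."""
    seq_lens_cpu: FakeLens
    spec_info: FakeVerifyInput


def is_last_split(spec_info: FakeVerifyInput, end_seq_index: int) -> bool:
    # Same comparison as the line shown in the traceback.
    return end_seq_index == spec_info.seq_lens_cpu.shape[0]


def reproduce() -> str:
    # v2 draft path: the verify input starts with seq_lens_cpu=None.
    spec_info = FakeVerifyInput()
    batch = FakeBatch(seq_lens_cpu=FakeLens([3, 5]), spec_info=spec_info)
    # The batch is updated, but nothing is written back to spec_info.
    batch.seq_lens_cpu = FakeLens([4, 6])
    try:
        is_last_split(batch.spec_info, end_seq_index=2)
    except AttributeError as exc:
        return str(exc)
    return "no error"
```

Calling reproduce() returns the AttributeError message, because the batch and the verify input hold separate references and only the batch was refreshed.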

Modifications

  • python/sglang/srt/speculative/eagle_info_v2.py: In prepare_for_v2_verify, populate self.seq_lens_cpu and self.seq_lens_sum from batch before ForwardBatch.init_new() is called.
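The shape of this change can be sketched as follows. This is an illustration under stated assumptions, not the merged diff: SketchBatch, SketchVerifyInput, and lens_after_prepare are hypothetical names, plain lists stand in for tensors, and the batch is assumed to expose both seq_lens_cpu and seq_lens_sum.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SketchBatch:
    """Hypothetical stand-in for the batch; the real fields are tensors."""
    seq_lens_cpu: List[int]
    seq_lens_sum: int


@dataclass
class SketchVerifyInput:
    """Hypothetical stand-in for the verify input used by the v2 path."""
    seq_lens_cpu: Optional[List[int]] = None
    seq_lens_sum: Optional[int] = None

    def prepare_for_v2_verify(self, batch: SketchBatch) -> None:
        # Write the batch's lengths back onto the verify input so that
        # later consumers (such as TBO's split_spec_info) do not see None.
        self.seq_lens_cpu = batch.seq_lens_cpu
        self.seq_lens_sum = batch.seq_lens_sum
        # In the real method, forward-batch construction would follow here.


def lens_after_prepare(lens: List[int]) -> SketchVerifyInput:
    # Build a batch, run the prepare step, and return the populated input.
    batch = SketchBatch(seq_lens_cpu=lens, seq_lens_sum=sum(lens))
    spec_info = SketchVerifyInput()
    spec_info.prepare_for_v2_verify(batch)
    return spec_info
```

Because the assignment lives in the shared prepare step, any worker that builds its verify input with None for these fields would pick up the values before the forward batch is constructed.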

Accuracy Tests

TestMTPwithTBOLowLatency passed

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates python/sglang/srt/speculative/eagle_info_v2.py to populate seq_lens_cpu and seq_lens_sum on the verify input, which is necessary for correct custom_mask slicing in split_spec_info. Feedback was provided to move these assignments into an existing conditional block to eliminate a redundant check and improve code clarity.

Comment thread python/sglang/srt/speculative/eagle_info_v2.py Outdated
Collaborator

@hubertlu-tw hubertlu-tw left a comment

LGTM

@hubertlu-tw
Collaborator

/tag-and-rerun-ci

@github-actions github-actions Bot added the run-ci label May 4, 2026
@HaiShaw
Collaborator

HaiShaw commented May 5, 2026

@amd-bot ci-status

@amd-bot

amd-bot commented May 5, 2026

@HaiShaw

Commit Info

  • PR head SHA: 5553c474
  • Diff: 1 file, +5/-0 in python/sglang/srt/speculative/eagle_info_v2.py — populates self.seq_lens_cpu / self.seq_lens_sum on the verify input inside prepare_for_v2_verify, so TBO's split_spec_info can slice custom_mask. Scope: EAGLE V2 + TBO (Two-Batch Overlap) AMD path only.

CI Status for PR #24319

PR: [AMD] fix tbo specv2 seq_lens_cpu NoneType error
Changed files: python/sglang/srt/speculative/eagle_info_v2.py (+5/-0)

AMD: 3 failures (0 likely related) | Others: 4 failures (0 related) + 5 fast-fail skips + 2 aggregator failures

AMD CI Failures

| Job | Test File | Test Function | Error | Related? | Explanation | Log |
|---|---|---|---|---|---|---|
| stage-b-test-1-gpu-small-amd-mi35x | test/registered/core/test_gpt_oss_1gpu.py | test_mxfp4_20b | AssertionError: False is not true (empty streaming response) | 🟢 Unlikely | Known never-passed test on mi35x (and mi35x-rocm720); intended fix #23829 was closed unmerged 2026-04-28. GPT-OSS mxfp4 path has no overlap with EAGLE V2 / TBO code. | Log |
| stage-b-test-1-gpu-large-amd (1) | test/registered/perf/test_bench_serving_1gpu_part2.py | test_score_api_batch_scaling | AssertionError: 73.229 not less than 70 (avg_latency_ms perf threshold) | 🟢 Unlikely | Perf-budget regression on gte-Qwen2-1.5B embedding/score API (latency 73 ms vs budget 70 ms). Embedding model + score API don't use speculative decoding at all; no path to EAGLE V2. | Log |
| stage-c-test-large-8-gpu-amd (2) | test/registered/amd/test_moriep_small.py | setUpClass (server failed to come up; py-spy dump could not get traces) | timeout after 3600s | 🟢 Unlikely | MoRI-EP (MoE expert-parallelism) test; server stuck during model load (Multi-thread loading shards then Pyspy failed). Unrelated to EAGLE V2 / TBO change. Looks like infra/MoE-loading hang. | Log |

Other CI Failures

| Job | Test File | Test Function | Error | Related? | Explanation | Log |
|---|---|---|---|---|---|---|
| stage-c-test-4-gpu-b200 (5) | test/registered/4-gpu-models/test_qwen35_fp4_triton.py | test_gsm8k | FileNotFoundError: '/root/.cache/tvm-ffi/sgl_kernel_jit_activation_bf16_t_true_92ff72c75a8dd73f' (during kernel_warmup on all TP ranks) | 🟢 Unlikely | sgl-kernel JIT cache directory missing; this is a runner/build-cache infrastructure issue in the sgl-kernel JIT path, not in spec decoding. | Log |
| stage-c-test-8-gpu-h20 | test/registered/distributed/test_disaggregation_different_tp.py (in-progress when killed) | N/A (process killed) | Process completed with exit code 137 (SIGKILL / OOM) | 🟢 Unlikely | PD-disaggregation test killed by OOM-killer / runner. PR doesn't touch disaggregation code; no spec-decoding path involved. | Log |
| stage-c-test-4-gpu-b200 (0), (3), (4), stage-c-test-8-gpu-h200 (1), stage-c-test-4-gpu-h100 (2) | N/A | N/A | Fast-fail: skipping; root cause job(s): stage-c-test-4-gpu-b200 (5) / stage-c-test-8-gpu-h20 | 🟢 Unlikely | These jobs were skipped by the fast-fail mechanism because the b200 (5) / h20 jobs above failed first. They themselves did not run any tests. | |
| pr-test-finish, pr-test-amd-finish, pr-test-npu-finish, finish | N/A | N/A | Aggregator jobs reporting overall pipeline failure | 🟢 Unlikely | These are roll-up gates that fail when any upstream job fails; not standalone failures. | |

Details

No failure on this PR appears related to the diff. The PR modifies eagle_info_v2.py:284 (one site, additive) inside prepare_for_v2_verify to copy seq_lens_cpu / seq_lens_sum from batch onto self. To trigger this code, a CI test would need to exercise EAGLE Spec V2 with TBO enabled. None of the failing test files do:

  • test_gpt_oss_1gpu.py::test_mxfp4_20b — GPT-OSS MoE inference, no spec decoding. Pre-existing never-passed mi35x failure (per memory).
  • test_bench_serving_1gpu_part2.py::test_score_api_batch_scaling — embedding/score API on gte-Qwen2-1.5B, no spec decoding. Perf-budget bound (70 ms) is tight; 73 ms is a marginal flake/perf-regression in something else.
  • test_moriep_small.py — MoRI-EP MoE distributed, no spec decoding. Hung during model shard loading (Pyspy failed (py-spy dump)).
  • test_qwen35_fp4_triton.py::test_gsm8k — JIT cache directory missing on b200 runner; infrastructure / sgl-kernel JIT path, not spec decoding.
  • test_disaggregation_*.py on h20 — PD-disaggregation, killed by OOM (exit 137). Disaggregation does not enter the modified function.

Recommendation: The PR's CI failures are all pre-existing or infrastructure issues — there is nothing to fix in this PR for any of them. Author can ignore these CI failures and may want to re-run the b200/h20/large-amd jobs once to confirm the JIT-cache and OOM signals are transient. The mi35x test_mxfp4_20b failure is a tracked never-passed AMD issue separate from this PR.

Generated by amd-bot using Claude Code CLI

@HaiShaw HaiShaw merged commit 80ccb6b into sgl-project:main May 5, 2026
188 of 219 checks passed
LucQueen pushed a commit to LucQueen/sglang that referenced this pull request May 12, 2026

4 participants