[AMD] fix tbo specv2 seq_lens_cpu NoneType error #24319
Code Review
This pull request updates python/sglang/srt/speculative/eagle_info_v2.py to populate seq_lens_cpu and seq_lens_sum on the verify input, which is necessary for correct custom_mask slicing in split_spec_info. Feedback was provided to move these assignments into an existing conditional block to eliminate a redundant check and improve code clarity.
/tag-and-rerun-ci
@amd-bot ci-status
Commit Info
CI Status for PR #24319
PR: [AMD] fix tbo specv2 seq_lens_cpu NoneType error
AMD: 3 failures (0 likely related) | Others: 4 failures (0 related) + 5 fast-fail skips + 2 aggregator failures
AMD CI Failures
Other CI Failures
Details
No failure on this PR appears related to the diff. The PR modifies
Recommendation: The PR's CI failures are all pre-existing or infrastructure issues — there is nothing to fix in this PR for any of them. Author can ignore these CI failures and may want to re-run the b200/h20/large-amd jobs once to confirm the JIT-cache and OOM signals are transient. The mi35x
Motivation
The issue is identified in CI task https://github.com/sgl-project/sglang/actions/runs/25151525365/job/73798865426#step:7:16755
and workaround #24205
and issue tracker #24212
cc @HaiShaw @hubertlu-tw @bingxche
This patch fixes the crash that occurs when SpecV2 (now the default) is enabled together with TBO, as exercised by TestMTPwithTBOLowLatency. The crash is raised at two_batch_overlap.py:208 in split_spec_info, because spec_info.seq_lens_cpu is None.
In the v2 draft path (eagle_worker_v2.py:390), EagleVerifyInput is created with seq_lens_cpu=None and seq_lens_sum=None. The v1 path (eagle_worker.py:800) correctly passes forward_batch.seq_lens_cpu. Later, prepare_for_v2_verify updates batch.seq_lens_cpu but never writes it back to the EagleVerifyInput object itself. When TBO calls split_spec_info(forward_batch.spec_info, ...), it reads spec_info.seq_lens_cpu, which is still None.
The same issue applies to multi_layer_eagle_worker_v2.py:283, but both paths share the EagleVerifyInputV2Mixin, so the fix covers both.
Modifications
python/sglang/srt/speculative/eagle_info_v2.py: In prepare_for_v2_verify, populate self.seq_lens_cpu and self.seq_lens_sum from batch before ForwardBatch.init_new() is called.
Accuracy Tests
TestMTPwithTBOLowLatency passed
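For readers unfamiliar with the code, the failure mode and the write-back described under Modifications can be sketched in isolation. This is a minimal illustration and not the actual sglang code: `VerifyInput`, `Batch`, `prepare_for_verify`, and `split_lens` are hypothetical stand-ins for `EagleVerifyInput`, the scheduler batch, `prepare_for_v2_verify`, and the slicing done in `split_spec_info`. Plain Python lists are assumed in place of tensors.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VerifyInput:
    """Hypothetical stand-in for EagleVerifyInput (only the two fields discussed)."""
    seq_lens_cpu: Optional[List[int]] = None
    seq_lens_sum: Optional[int] = None


@dataclass
class Batch:
    """Hypothetical stand-in for the batch that already holds up-to-date lengths."""
    seq_lens_cpu: List[int]

    @property
    def seq_lens_sum(self) -> int:
        return sum(self.seq_lens_cpu)


def prepare_for_verify(spec_info: VerifyInput, batch: Batch, write_back: bool) -> None:
    """Mimics the shape of the fix: copy the batch's lengths onto the verify input."""
    if write_back:
        spec_info.seq_lens_cpu = batch.seq_lens_cpu
        spec_info.seq_lens_sum = batch.seq_lens_sum


def split_lens(spec_info: VerifyInput, start: int, end: int) -> List[int]:
    """Simplified consumer: slices per-request lengths for one sub-batch."""
    # Raises TypeError when seq_lens_cpu was never populated (still None).
    return spec_info.seq_lens_cpu[start:end]


batch = Batch(seq_lens_cpu=[5, 7, 3, 9])

# Without the write-back, the consumer fails on None.
broken = VerifyInput()
prepare_for_verify(broken, batch, write_back=False)
try:
    split_lens(broken, 0, 2)
except TypeError as exc:
    print("without write-back:", exc)

# With the write-back, the slice and the sum are available.
fixed = VerifyInput()
prepare_for_verify(fixed, batch, write_back=True)
print(split_lens(fixed, 0, 2), fixed.seq_lens_sum)  # [5, 7] 24
```

The point of the sketch is the ordering: the fields have to be set on the verify input before any consumer reads them, which in the real patch means before ForwardBatch.init_new() is called.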