[AMD] DO NOT MERGE - Move unwrap_shm_features before broadcast to fix TP>1 VLM crash (AMD nightly-4-gpu) by yctseng0211 · Pull Request #21737 · sgl-project/sglang

yctseng0211 · 2026-03-31T05:53:55Z

Motivation

#21465 moved unwrap_shm_features to after broadcast_pyobj so that only shm_name metadata (instead of full tensors) is serialized during broadcast. However, shared memory segments are created by the tokenizer manager process and are only accessible from that process's namespace. When non-rank-0 schedulers deserialize ShmPointerMMData during broadcast, they call shm_open(shm_name), which fails with FileNotFoundError because the segment has already been unlinked by the tokenizer side.
This causes test_encoder_dp.py (Qwen3-VL-32B, TP=4, --mm-enable-dp-encoder) to crash on TP1/TP3 with:
FileNotFoundError: [Errno 2] No such file or directory: '/psm_43f004c4' (see: https://github.com/sgl-project/sglang/actions/runs/23759205665/job/69221928169#step:5:12320)
As a result, MMMU accuracy drops to 0.25 (random guessing).
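The failure mode described above can be reproduced in isolation with Python's standard multiprocessing.shared_memory module. This is a minimal sketch, not SGLang code: the function name attach_after_unlink is hypothetical, and it assumes a POSIX platform, where unlinking removes the segment name (on Windows, unlink() has no effect, so the late attach would succeed).

```python
from multiprocessing import shared_memory


def attach_after_unlink() -> str:
    """Create a segment, unlink it, then try to attach by name (hypothetical helper)."""
    # The "producer" side: create a segment and write a few bytes into it.
    seg = shared_memory.SharedMemory(create=True, size=16)
    name = seg.name
    seg.buf[:4] = b"abcd"

    # The producer releases and unlinks the segment before the consumer attaches.
    seg.close()
    seg.unlink()

    # The "consumer" side: only the name is known, and the segment is already gone.
    try:
        late = shared_memory.SharedMemory(name=name)
    except FileNotFoundError:
        return "FileNotFoundError"
    late.close()
    return "attached"


if __name__ == "__main__":
    print(attach_after_unlink())
```

A pointer that carries only the segment name is therefore valid only while the segment still exists and is reachable from the process that attaches to it.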

Modifications

Move unwrap_shm_features back to before broadcast. Rank 0 first materializes the shm-backed data into regular tensors; broadcast then sends ordinary tensor data that all ranks can deserialize without shm access.
The ShmPointerMMData refactoring from #21465 (materialize(), zero-copy __setstate__, __del__ guard) is preserved; only the call order in scheduler.py is changed.
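The call order can be illustrated with a small, self-contained sketch. It is not the scheduler.py implementation: ShmPointer and demo are hypothetical stand-ins, raw bytes replace tensors, and pickle.dumps/pickle.loads stand in for the serialization that a broadcast performs. The point it shows is that once rank 0 has copied the data out of shared memory, the receivers never need to open the segment.

```python
import pickle
from dataclasses import dataclass
from multiprocessing import shared_memory


@dataclass
class ShmPointer:
    """Hypothetical stand-in for a pointer that carries only a segment name and size."""

    name: str
    size: int

    def materialize(self) -> bytes:
        # Attach to the segment and copy the payload into ordinary process memory.
        seg = shared_memory.SharedMemory(name=self.name)
        try:
            return bytes(seg.buf[: self.size])
        finally:
            seg.close()


def demo(world_size: int = 4) -> list:
    # Producer side: place the features in shared memory.
    seg = shared_memory.SharedMemory(create=True, size=8)
    seg.buf[:8] = b"features"
    pointer = ShmPointer(seg.name, 8)

    # Rank 0: unwrap first, then serialize the plain data for broadcast.
    wire = pickle.dumps(pointer.materialize())

    # The segment can disappear now; the serialized payload no longer refers to it.
    seg.close()
    seg.unlink()

    # Every rank deserializes ordinary data without touching shared memory.
    return [pickle.loads(wire) for _ in range(world_size)]


if __name__ == "__main__":
    print(demo())
```

The trade-off is the one the description names: the broadcast carries the full payload again instead of only the segment name, in exchange for not depending on shm access from every rank.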

Accuracy Tests

https://github.com/sgl-project/sglang/actions/runs/23782802453/job/69299041224

Speed Tests and Profiling

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

gemini-code-assist · 2026-03-31T05:53:59Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

gemini-code-assist · 2026-03-31T06:18:16Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

yctseng0211 · 2026-03-31T06:28:39Z

cc @yhyang201
Could you help review this forward fix?

yhyang201 · 2026-03-31T06:32:57Z

Can we use this PR? #21655

I think this PR can resolve the issue. Could you give it a try?

yctseng0211 · 2026-03-31T06:37:41Z

> Can we use this PR? #21655
> I think this PR can resolve the issue. Could you give it a try?

Sure, I will verify the AMD nightly test with this PR branch.

yhyang201 · 2026-03-31T06:40:11Z

>> Can we use this PR? #21655
>> I think this PR can resolve the issue. Could you give it a try?
>
> Sure, I will verify the AMD nightly test with this PR branch.

Thank you so much. I really appreciate it, and sorry for the trouble.

yctseng0211 · 2026-03-31T06:59:19Z

>>> Can we use this PR? #21655
>>> I think this PR can resolve the issue. Could you give it a try?
>>
>> Sure, I will verify the AMD nightly test with this PR branch.
>
> Thank you so much. I really appreciate it, and sorry for the trouble.

@yhyang201
Verified: #21655 resolves this issue. Thanks for your help; I will close this PR once #21655 is merged.
https://github.com/sgl-project/sglang/actions/runs/23784141110/job/69303257296

yhyang201 · 2026-04-01T08:17:04Z

#21655 has been merged.

[Fix] Move unwrap_shm_features before broadcast to fix TP>1 VLM crash

cf1c413

yctseng0211 changed the title ~~[AMD] Move unwrap_shm_features before broadcast to fix TP>1 VLM crash~~ [AMD] Move unwrap_shm_features before broadcast to fix TP>1 VLM crash (AMD nightly-4-gpu) Mar 31, 2026

yctseng0211 added the run-ci label Mar 31, 2026

yctseng0211 marked this pull request as ready for review March 31, 2026 06:18

yctseng0211 requested review from Ying1123, hnyls2002, merrymercy and xiezhq-hermann as code owners March 31, 2026 06:18

Merge branch 'main' into fix_amd_nightly_qwen3

f436f57

yctseng0211 changed the title ~~[AMD] Move unwrap_shm_features before broadcast to fix TP>1 VLM crash (AMD nightly-4-gpu)~~ [AMD] DO NOT MERGE - Move unwrap_shm_features before broadcast to fix TP>1 VLM crash (AMD nightly-4-gpu) Mar 31, 2026

yctseng0211 closed this Apr 1, 2026

yctseng0211 wants to merge 2 commits into main from fix_amd_nightly_qwen3
