[AMD] DO NOT MERGE - Move unwrap_shm_features before broadcast to fix TP>1 VLM crash (AMD nightly-4-gpu)#21737

Closed
yctseng0211 wants to merge 2 commits intomainfrom
fix_amd_nightly_qwen3

Conversation

@yctseng0211
Collaborator

@yctseng0211 yctseng0211 commented Mar 31, 2026

Motivation

#21465 moved unwrap_shm_features to after broadcast_pyobj so that only the shm_name metadata, rather than the full tensors, is serialized during the broadcast. However, the shared memory segments are created by the tokenizer manager process and are only accessible from that process's namespace. When the non-rank-0 schedulers deserialize ShmPointerMMData during the broadcast, they call shm_open(shm_name), which fails with FileNotFoundError because the segment has already been unlinked by the tokenizer side.
This causes test_encoder_dp.py (Qwen3-VL-32B, TP=4, --mm-enable-dp-encoder) to crash on TP1/TP3 with:
FileNotFoundError: [Errno 2] No such file or directory: '/psm_43f004c4' (see: https://github.com/sgl-project/sglang/actions/runs/23759205665/job/69221928169#step:5:12320)
As a result, MMMU accuracy drops to 0.25 (random guessing).
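
The failure mode can be reproduced in isolation. The following is a minimal sketch, not sglang code: it assumes a POSIX system and uses Python's standard multiprocessing.shared_memory module, and the helper name attach_after_unlink is hypothetical. It creates a segment, unlinks it on the creator side, and then tries to reattach by name, which is roughly what a receiver does when it deserializes a pointer to a segment that no longer exists.

```python
from multiprocessing import shared_memory


def attach_after_unlink(payload: bytes) -> str:
    """Create a segment, unlink it, then try to reattach by name (hypothetical helper)."""
    seg = shared_memory.SharedMemory(create=True, size=len(payload))
    seg.buf[: len(payload)] = payload
    name = seg.name

    # Creator side: detach and remove the segment name.
    seg.close()
    seg.unlink()

    # Receiver side: opening by name now fails because the name is gone.
    try:
        other = shared_memory.SharedMemory(name=name)
    except FileNotFoundError:
        return "FileNotFoundError"
    other.close()
    return "attached"


if __name__ == "__main__":
    print(attach_after_unlink(b"pixel-values"))
```

The segment name in the CI log resembles the names this module generates by default on POSIX, but that is an observation about the log, not something the sketch depends on.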

Modifications

Move unwrap_shm_features back to before the broadcast. Rank 0 first materializes the shm data into regular tensors, and the broadcast then sends ordinary tensor data that all ranks can deserialize without shm access.
The ShmPointerMMData refactoring from #21465 (materialize(), zero-copy __setstate__, __del__ guard) is preserved; only the call order in scheduler.py is changed.
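
Why the call order matters can be illustrated with a toy model. This is a hedged sketch under stated assumptions, not the actual scheduler.py or ShmPointerMMData code: the ShmPointer class, fake_broadcast, and both ordering functions are hypothetical stand-ins, bytes replace tensors, and a pickle round trip replaces broadcast_pyobj. It assumes a POSIX system.

```python
import pickle
from multiprocessing import shared_memory


class ShmPointer:
    """Hypothetical pointer-style wrapper: pickles only the segment name and size."""

    def __init__(self, payload: bytes):
        self._shm = shared_memory.SharedMemory(create=True, size=len(payload))
        self._shm.buf[: len(payload)] = payload
        self.name = self._shm.name
        self.size = len(payload)

    def materialize(self) -> bytes:
        # Copy the data out of shared memory into an ordinary bytes object.
        return bytes(self._shm.buf[: self.size])

    def release(self) -> None:
        # Creator side: detach and remove the segment name.
        self._shm.close()
        self._shm.unlink()

    def __getstate__(self):
        return {"name": self.name, "size": self.size}

    def __setstate__(self, state):
        self.name = state["name"]
        self.size = state["size"]
        # Reattaching by name fails if the segment is already gone.
        self._shm = shared_memory.SharedMemory(name=self.name)


def fake_broadcast(obj):
    """Stand-in for a broadcast: serialize on the sender, deserialize on a receiver."""
    return pickle.loads(pickle.dumps(obj))


def unwrap_before_broadcast(payload: bytes) -> bytes:
    ptr = ShmPointer(payload)
    data = ptr.materialize()  # materialize first, while the segment exists
    ptr.release()
    return fake_broadcast(data)  # receivers only ever see plain data


def broadcast_before_unwrap(payload: bytes):
    ptr = ShmPointer(payload)
    wire = pickle.dumps(ptr)  # only the name and size go on the wire
    ptr.release()  # the segment disappears before the receiver attaches
    try:
        received = pickle.loads(wire)
    except FileNotFoundError:
        return None
    data = received.materialize()
    received._shm.close()
    return data
```

In this toy model the first ordering always delivers the payload, at the cost of serializing the full data, while the second ordering depends on the segment still being reachable when the receiver deserializes.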

Accuracy Tests

https://github.com/sgl-project/sglang/actions/runs/23782802453/job/69299041224

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@yctseng0211 yctseng0211 changed the title [AMD] Move unwrap_shm_features before broadcast to fix TP>1 VLM crash [AMD] Move unwrap_shm_features before broadcast to fix TP>1 VLM crash (AMD nightly-4-gpu) Mar 31, 2026
@yctseng0211 yctseng0211 marked this pull request as ready for review March 31, 2026 06:18

@yctseng0211
Collaborator Author

cc @yhyang201
Could you help review this forward fix?

@yhyang201
Collaborator

yhyang201 commented Mar 31, 2026

Can we use this PR? #21655

I think this PR can resolve the issue—could you give it a try?

@yctseng0211
Collaborator Author

Sure, I will verify the AMD nightly test with this PR branch.

@yhyang201
Collaborator

Thank you so much — I really appreciate it, and sorry for the trouble.

@yctseng0211
Collaborator Author

@yhyang201
Verified: #21655 resolves this issue. Thanks for your help. I will close this PR once #21655 is merged.
https://github.com/sgl-project/sglang/actions/runs/23784141110/job/69303257296

@yctseng0211 yctseng0211 changed the title [AMD] Move unwrap_shm_features before broadcast to fix TP>1 VLM crash (AMD nightly-4-gpu) [AMD] DO NOT MERGE - Move unwrap_shm_features before broadcast to fix TP>1 VLM crash (AMD nightly-4-gpu) Mar 31, 2026
@yhyang201
Collaborator

#21655 has been merged.

@yctseng0211 yctseng0211 closed this Apr 1, 2026