[AMD] DO NOT MERGE - Move unwrap_shm_features before broadcast to fix TP>1 VLM crash (AMD nightly-4-gpu)#21737
yctseng0211 wants to merge 2 commits into main
cc @yhyang201
Can we use this PR? #21655 I think this PR can resolve the issue. Could you give it a try?
Sure, I will verify the AMD nightly test with this PR branch.
Thank you so much. I really appreciate it, and sorry for the trouble.
@yhyang201
#21655 has been merged.
Motivation
#21465 moved `unwrap_shm_features` to after `broadcast_pyobj` so that only `shm_name` metadata (instead of full tensors) is serialized during broadcast. However, shared memory segments are created by the tokenizer manager process and are only accessible from that process's namespace. When non-rank-0 schedulers deserialize `ShmPointerMMData` during broadcast, they call `shm_open(shm_name)`, which fails with `FileNotFoundError` because the segment has already been unlinked by the tokenizer side.

This causes `test_encoder_dp.py` (Qwen3-VL-32B, TP=4, `--mm-enable-dp-encoder`) to crash on TP1/TP3 with:

`FileNotFoundError: [Errno 2] No such file or directory: '/psm_43f004c4'`

(see: https://github.com/sgl-project/sglang/actions/runs/23759205665/job/69221928169#step:5:12320), resulting in MMMU accuracy dropping to 0.25 (random guessing).
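The failure mode can be reproduced in isolation with Python's standard `multiprocessing.shared_memory` module. The sketch below is only an illustration of the mechanism, not code from sglang: the function name `attach_after_unlink` is hypothetical, and a single process plays both the producer and the consumer role. It assumes a platform where attaching to a released segment raises `FileNotFoundError` (as on Linux).

```python
from multiprocessing import shared_memory


def attach_after_unlink() -> str:
    """Create a segment, release it, then try to attach by name (illustrative only)."""
    # Producer side: create a segment and keep only its name as "metadata".
    producer = shared_memory.SharedMemory(create=True, size=16)
    producer.buf[:4] = b"\x01\x02\x03\x04"
    name = producer.name

    # The producer releases the segment before any consumer has attached.
    producer.close()
    producer.unlink()

    # Consumer side: attaching by name now fails because the segment is gone.
    try:
        consumer = shared_memory.SharedMemory(name=name)
    except FileNotFoundError as exc:
        return type(exc).__name__
    consumer.close()
    return "attached"


if __name__ == "__main__":
    print(attach_after_unlink())
```

A name that outlives its segment is therefore not safe to hand to another rank, which is the situation described above.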
Modifications
Move `unwrap_shm_features` back to before broadcast. Rank 0 materializes shm into regular tensors first, then broadcast sends normal tensor data that all ranks can deserialize without shm access.

The `ShmPointerMMData` refactoring from #21465 (`materialize()`, zero-copy `__setstate__`, `__del__` guard) is preserved; only the call order in `scheduler.py` is changed.

Accuracy Tests
https://github.com/sgl-project/sglang/actions/runs/23782802453/job/69299041224
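For readers who want to see the call-order change from the Modifications section in miniature, here is a hedged sketch. Everything in it is an assumption made for illustration: `ShmPointer`, `broadcast`, `unwrap_before_broadcast`, and `unwrap_after_broadcast` are hypothetical stand-ins, a pickle round trip replaces the real `broadcast_pyobj`, and plain `bytes` replace tensors. It is not the code in `scheduler.py`.

```python
import pickle
from multiprocessing import shared_memory


class ShmPointer:
    """Hypothetical stand-in for a shm-backed payload (not sglang's class)."""

    def __init__(self, data: bytes):
        self._shm = shared_memory.SharedMemory(create=True, size=len(data))
        self._shm.buf[: len(data)] = data
        self.name, self.size = self._shm.name, len(data)

    def materialize(self) -> bytes:
        # Copy the payload out of shared memory, then release the segment.
        data = bytes(self._shm.buf[: self.size])
        self._shm.close()
        self._shm.unlink()
        return data

    def __getstate__(self):
        # Only metadata is serialized; the receiver must re-attach by name.
        return {"name": self.name, "size": self.size}

    def __setstate__(self, state):
        self.name, self.size = state["name"], state["size"]
        # Raises FileNotFoundError if the segment no longer exists.
        self._shm = shared_memory.SharedMemory(name=self.name)


def broadcast(obj):
    # Stand-in for a pickle-based object broadcast: serialize, then deserialize.
    return pickle.loads(pickle.dumps(obj))


def unwrap_before_broadcast(payload: bytes) -> bytes:
    # Order used by this PR: materialize first, then send ordinary data.
    ptr = ShmPointer(payload)
    return broadcast(ptr.materialize())


def unwrap_after_broadcast(payload: bytes) -> str:
    # Previous order: send the pointer; the segment is gone before the receiver attaches.
    ptr = ShmPointer(payload)
    wire = pickle.dumps(ptr)
    ptr.materialize()  # simulates the sender side releasing the segment
    try:
        pickle.loads(wire)
    except FileNotFoundError:
        return "FileNotFoundError"
    return "ok"
```

The trade-off is the one the PR description states: materializing first means full payloads are serialized during broadcast, but receivers no longer need access to the sender's shared memory.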