support qwen3_vl vision model dp #13724
Conversation
Summary of Changes (Gemini Code Assist): This pull request adds data parallelism to the Qwen3_VL vision model. The distributed-execution changes let the vision encoder process multimodal data more efficiently, improving inference speed as validated by the benchmark results below.
Code Review
This pull request adds data parallelism support for the Qwen3-VL vision model. The changes correctly propagate a use_data_parallel flag to disable tensor parallelism in the vision encoder components when DP is enabled. The core logic for DP is added to get_image_feature and get_video_feature methods, which now use run_dp_sharded_mrope_vision_model for sharded execution. The changes look correct and follow a standard pattern for adding DP support. I have a couple of suggestions to improve maintainability by reducing code duplication.
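The flag-propagation pattern the review describes can be sketched in isolation. This is a minimal, hypothetical illustration, not the actual SGLang code: the two module-level helpers stand in for SGLang's real distributed utilities of the same names, and `VisionBlock` is an invented stand-in for the vision encoder components in `qwen3_vl.py`.

```python
# Hedged sketch: when use_data_parallel is set, each rank holds a full replica
# of the vision encoder, so the block behaves as if it were a single-rank TP
# group (tp_size=1, tp_rank=0) instead of querying the real TP group.

def get_tensor_model_parallel_world_size() -> int:
    """Stand-in for SGLang's distributed helper; pretend TP world size is 4."""
    return 4

def get_tensor_model_parallel_rank() -> int:
    """Stand-in for SGLang's distributed helper; pretend this rank is 2."""
    return 2

class VisionBlock:
    """Hypothetical stand-in for a vision encoder component."""

    def __init__(self, use_data_parallel: bool = False):
        self.use_data_parallel = use_data_parallel
        # With DP enabled, tensor parallelism inside the encoder is disabled.
        self.tp_size = 1 if use_data_parallel else get_tensor_model_parallel_world_size()
        self.tp_rank = 0 if use_data_parallel else get_tensor_model_parallel_rank()

print(VisionBlock(use_data_parallel=True).tp_size)   # 1
print(VisionBlock(use_data_parallel=False).tp_size)  # 4
```

The rest of the model code can then stay branch-free: every TP-aware layer simply consults `self.tp_size`/`self.tp_rank` and naturally degenerates to unsharded behavior under DP.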
python/sglang/srt/models/qwen3_vl.py
Outdated
```python
self.tp_size = 1 if use_data_parallel else get_tensor_model_parallel_world_size()
self.tp_rank = 0 if use_data_parallel else get_tensor_model_parallel_rank()
```

```python
if self.use_data_parallel:
    return run_dp_sharded_mrope_vision_model(
        self.visual, pixel_values, video_grid_thw.tolist(), rope_type="rope_3d"
    )
else:
    video_embeds = self.visual(pixel_values, grid_thw=video_grid_thw)
    return video_embeds
```
The logic in this if/else block is almost identical to the one in get_image_feature. The surrounding functions get_image_feature and get_video_feature are also very similar. To improve maintainability and reduce redundancy, you could extract this logic into a private helper method that accepts the grid_thw_attr as an argument.
For example:
```python
def _get_media_feature(
    self, items: List[MultimodalDataItem], grid_thw_attr: str
) -> torch.Tensor:
    # in qwen-vl, the last dim is the same
    pixel_values = torch.cat([item.feature for item in items], dim=0).type(
        self.visual.dtype
    )
    grid_thw = torch.concat([getattr(item, grid_thw_attr) for item in items], dim=0)
    assert pixel_values.dim() == 2, pixel_values.dim()
    assert grid_thw.dim() == 2, grid_thw.dim()
    if self.use_data_parallel:
        return run_dp_sharded_mrope_vision_model(
            self.visual, pixel_values, grid_thw.tolist(), rope_type="rope_3d"
        )
    return self.visual(pixel_values, grid_thw=grid_thw)

def get_image_feature(self, items: List[MultimodalDataItem]) -> torch.Tensor:
    return self._get_media_feature(items, "image_grid_thw")

def get_video_feature(self, items: List[MultimodalDataItem]) -> torch.Tensor:
    return self._get_media_feature(items, "video_grid_thw")
```
Is it tested for MoE?
Add acc and perf bench for
The label was added by the github-actions bot. It seems I don't have permission to remove it; if you could help remove it, I'd really appreciate it.

Thanks to @ShangmingCai for removing the label. I have rebased on main.
```python
else:
    # Handle empty case
    image_embeds_local = torch.empty(
        (0, vision_model.out_hidden_size),
```
This is a bug. run_dp_sharded_mrope_vision_model is not used only by qwen2_5_vl and qwen3_vl; more and more VLM models with mrope may adopt this function to support DP, and they do not necessarily define vision_model.out_dim, so all of those models would be affected.
Change this back to vision_model.out_hidden_size here, and change back to out_hidden_size in qwen3_vl.py as well.
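The shape contract behind this comment can be shown with a small sketch. Everything here is hypothetical illustration (`DummyVisionModel` and `local_embed_shape` are invented names, not SGLang code); the point is only that the empty case must still produce a zero-row result with the correct feature width.

```python
# Hedged sketch: a DP rank that receives no patches must still report a
# (0, hidden) shape, where hidden comes from the shared out_hidden_size
# attribute rather than a model-specific one like out_dim. Otherwise a
# later gather/concat of per-rank embeddings would mismatch on dim 1.

class DummyVisionModel:
    """Hypothetical vision model exposing the shared width attribute."""
    out_hidden_size = 8

def local_embed_shape(vision_model, num_local_patches: int) -> tuple:
    # Zero rows are fine; the feature width must always be out_hidden_size.
    return (num_local_patches, vision_model.out_hidden_size)

print(local_embed_shape(DummyVisionModel(), 0))   # (0, 8)
print(local_embed_shape(DummyVisionModel(), 16))  # (16, 8)
```

Relying on `out_hidden_size` keeps the empty-case tensor allocation generic across every mrope VLM that routes through this shared helper.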
/tag-and-rerun-ci |
/tag-and-rerun-ci |
Motivation
Based on PR 13126, this adds data parallelism (DP) support for the Qwen3_VL vision model.
Modifications
Accuracy Tests
Qwen/Qwen3-VL-32B-Instruct
Benchmarking and Profiling
server cmd:
bench cmd:
Checklist