[Bugfix] Fix incorrect use of merge_size in Qwen3-VL video timestamp calculation #37439
Isotr0py merged 5 commits into vllm-project:main
Conversation
…idx and _calculate_timestamps Signed-off-by: chengyufang <cnyvfang@outlook.com>
Code Review
The pull request correctly identifies and addresses a critical bug where merge_size was incorrectly used for temporal video timestamp calculations instead of temporal_patch_size. The changes consistently replace the erroneous variable across the _calculate_timestamps function and its call site in _get_video_second_idx. This fix aligns with the model's actual temporal design and resolves the AssertionError reported in the description.
Pull request overview
Fixes Qwen3-VL/Qwen3.5 video timestamp grouping by using temporal_patch_size (temporal token grouping) instead of merge_size (spatial merging), preventing mismatches between computed timestamps and tokens_per_frame under non-default configs.
Changes:
- Update `_calculate_timestamps` to pad/group using `temporal_patch_size` rather than `merge_size`.
- Update `_get_video_second_idx` to pass `video_processor.temporal_patch_size` into `_calculate_timestamps`.
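The corrected grouping can be sketched as follows. This is a simplified stand-in for vLLM's actual `_calculate_timestamps`, not the real implementation; the padding strategy (repeating the last frame index) is assumed for illustration:

```python
def calculate_timestamps(
    indices: list[int], video_fps: float, temporal_patch_size: int
) -> list[float]:
    """Sketch of the fixed logic: group sampled frame indices temporally.

    Pads the frame indices so their count is a multiple of
    temporal_patch_size (the temporal token grouping factor, NOT the
    spatial merge_size), then emits one timestamp per temporal group.
    """
    indices = list(indices)
    # Pad by repeating the last frame index until the count divides evenly.
    while len(indices) % temporal_patch_size != 0:
        indices.append(indices[-1])
    # One timestamp (in seconds) per group of temporal_patch_size frames.
    return [
        indices[i] / video_fps
        for i in range(0, len(indices), temporal_patch_size)
    ]
```

With the default `temporal_patch_size=2`, five frames sampled at 2 fps pad to six indices and yield three timestamps, one per temporal token group.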
```diff
 def _calculate_timestamps(
-    self, indices: list[int] | torch.Tensor, video_fps: float, merge_size: int
+    self, indices: list[int] | torch.Tensor, video_fps: float, temporal_patch_size: int
 ):
```
But I thought this was basically referred from the implementation on the Transformers side? cc @JJJYmmm
https://github.com/huggingface/transformers/blob/670f3c85fde62cede8206d7ade43a8eaa1edc205/src/transformers/models/qwen3_vl/processing_qwen3_vl.py#L262-L273
Hi @Isotr0py, thanks for your comments.
I have checked the Transformers implementation. In fact, the argument passed to this function is `temporal_patch_size`, not `merge_size`:
```python
curr_timestamp = self._calculate_timestamps(
    metadata.frames_indices,
    metadata.fps,
    self.video_processor.temporal_patch_size,
)
```
You can check here:
https://github.com/huggingface/transformers/blob/670f3c85fde62cede8206d7ade43a8eaa1edc205/src/transformers/models/qwen3_vl/processing_qwen3_vl.py#L152-L156
As discussed in huggingface/transformers#43519 (use `temporal_patch_size` rather than `merge_size` for the video processor).
@cnyvfang I think modifying lines 770/809 is enough (to align with the HF side): https://github.com/vllm-project/vllm/pull/37439/changes#diff-89b6b3d99f8961ecdc0cf8cf16b0f9e091a6227ee2a9bfb4f04b88043278e0e0R809.
I agree with your point of view; I have made the minimal changes in the new commits. Thank you for your suggestion! @JJJYmmm
…tion specification. Signed-off-by: chengyufang <cnyvfang@outlook.com>
…calculation (vllm-project#37439) Signed-off-by: chengyufang <cnyvfang@outlook.com>
…calculation (vllm-project#37439) Signed-off-by: chengyufang <cnyvfang@outlook.com> Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
…calculation (vllm-project#37439) Signed-off-by: chengyufang <cnyvfang@outlook.com> Signed-off-by: Vinay Damodaran <vrdn@hey.com>
…calculation (vllm-project#37439) Signed-off-by: chengyufang <cnyvfang@outlook.com> Signed-off-by: EricccYang <yangyang4991@gmail.com>
Purpose
This PR fixes an incorrect use of `merge_size` in Qwen3-VL video timestamp processing. (This bug also affects Qwen3.5.)

Previously, `_get_video_second_idx` passed `video_processor.merge_size` into `_calculate_timestamps`, and `_calculate_timestamps` used it to pad frame indices and group timestamps. However, `merge_size` only affects the spatial dimension in Qwen3-VL and Qwen3.5; it does not affect the temporal dimension, because temporal tokens are not merged by the merger.

The correct parameter here is `temporal_patch_size`, since both the final number of temporal groups and the timestamp grouping logic depend on temporal patching rather than spatial merging.

The old code appeared to work under default model settings only because `merge_size` and `temporal_patch_size` happen to be numerically equal in the default Qwen3-VL and Qwen3.5 configurations. Once a user initializes the model with non-default values or uses a checkpoint with a modified `temporal_patch_size`, timestamp grouping becomes inconsistent with `tokens_per_frame`, which can trigger an `AssertionError`.

This PR replaces the incorrect use of `merge_size` with `temporal_patch_size` in both `_get_video_second_idx` and `_calculate_timestamps`, so the timestamp calculation matches the model's actual temporal design.

Test Plan
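To make the failure mode concrete, here is a small numeric illustration. The counts are hypothetical, chosen to match the non-default configuration described above (spatial `merge_size` left at its default while `temporal_patch_size` is raised to 16):

```python
# Hypothetical example: with a non-default temporal_patch_size, grouping
# timestamps by merge_size yields a different number of groups than the
# temporal token grouping that tokens_per_frame is based on.
num_frames = 32
merge_size = 2            # spatial merging factor (default, unchanged)
temporal_patch_size = 16  # modified, non-default temporal grouping

# tokens_per_frame has one entry per temporal group of frames:
num_temporal_groups = num_frames // temporal_patch_size   # 2

# The old code grouped timestamps by merge_size instead:
num_timestamps_old = num_frames // merge_size             # 16
# The fixed code groups timestamps by temporal_patch_size:
num_timestamps_new = num_frames // temporal_patch_size    # 2

assert num_timestamps_old != num_temporal_groups  # mismatch -> AssertionError
assert num_timestamps_new == num_temporal_groups  # lengths now agree
```

In the default configuration both factors equal 2, so the two groupings coincide and the bug stays hidden.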
- Modify `temporal_patch_size` in the following config files from the default value `2` to any other value (in our case, we set it to `16`): `config.json`, `preprocessor_config.json`, `video_preprocessor_config.json`, `processor_config.json`.
- Run vLLM inference on an input containing video, exercising `_get_video_second_idx`, `_calculate_timestamps`, and `get_video_repl`.
- Before the fix: `timestamps` are grouped using `merge_size`, while `tokens_per_frame` is derived from the true temporal grouping based on `temporal_patch_size`.
- After the fix: `timestamps` are grouped using `temporal_patch_size`, `len(timestamps) == len(tokens_per_frame)` holds, `get_video_repl` no longer fails, and behavior is unchanged in the default case where `merge_size == temporal_patch_size`.

Test Result
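The core invariant from the test plan can be sketched as a standalone check. The names here (`tokens_per_frame`, `timestamps_align`) are simplified stand-ins for the real processor outputs, not vLLM's actual API:

```python
import math


def timestamps_align(num_frames: int, video_fps: float,
                     temporal_patch_size: int) -> bool:
    """Check the invariant asserted by get_video_repl: one timestamp
    per temporal token group (simplified model of the real outputs)."""
    # One tokens_per_frame entry per temporal group of frames.
    tokens_per_frame = [1] * math.ceil(num_frames / temporal_patch_size)
    # With the fix, timestamps are grouped by temporal_patch_size as well.
    timestamps = [
        i / video_fps for i in range(0, num_frames, temporal_patch_size)
    ]
    return len(timestamps) == len(tokens_per_frame)
```

With the fix applied, the check holds for the default `temporal_patch_size=2` as well as non-default values like 16, including frame counts that do not divide evenly.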
Before this fix, after modifying `temporal_patch_size` in the Qwen3-VL or Qwen3.5 config files to a non-default value such as `16`, running vLLM inference on an input containing video content reproduced a mismatch between `timestamps` and `tokens_per_frame`, leading to an `AssertionError` at runtime.

After replacing `merge_size` with `temporal_patch_size`, the same workflow runs correctly. The computed `timestamps` now align with the actual temporal token grouping, the assertion passes, and the behavior is consistent with the model design.