[Bugfix] Fix incorrect use of merge_size in Qwen3-VL video timestamp calculation #37439
Isotr0py merged 5 commits into vllm-project:main
Conversation
…idx and _calculate_timestamps Signed-off-by: chengyufang <cnyvfang@outlook.com>
Code Review
The pull request correctly identifies and addresses a critical bug where merge_size was incorrectly used for temporal video timestamp calculations instead of temporal_patch_size. The changes consistently replace the erroneous variable across the _calculate_timestamps function and its call site in _get_video_second_idx. This fix aligns with the model's actual temporal design and resolves the AssertionError reported in the description.
Pull request overview
Fixes Qwen3-VL/Qwen3.5 video timestamp grouping by using temporal_patch_size (temporal token grouping) instead of merge_size (spatial merging), preventing mismatches between computed timestamps and tokens_per_frame under non-default configs.
Changes:
- Update `_calculate_timestamps` to pad/group using `temporal_patch_size` rather than `merge_size`.
- Update `_get_video_second_idx` to pass `video_processor.temporal_patch_size` into `_calculate_timestamps`.
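The corrected grouping can be sketched as follows. This is a simplified stand-in for vLLM's actual `_calculate_timestamps`, not the real implementation; the padding strategy (repeating the last frame index) is assumed for illustration:

```python
def calculate_timestamps(
    indices: list[int], video_fps: float, temporal_patch_size: int
) -> list[float]:
    """Sketch of the fixed logic: group sampled frame indices temporally.

    Pads the frame indices so their count is a multiple of
    temporal_patch_size (the temporal token grouping factor, NOT the
    spatial merge_size), then emits one timestamp per temporal group.
    """
    indices = list(indices)
    # Pad by repeating the last frame index until the count divides evenly.
    while len(indices) % temporal_patch_size != 0:
        indices.append(indices[-1])
    # One timestamp (in seconds) per group of temporal_patch_size frames.
    return [
        indices[i] / video_fps
        for i in range(0, len(indices), temporal_patch_size)
    ]
```

With the default `temporal_patch_size=2`, five frames sampled at 2 fps pad to six indices and yield three timestamps, one per temporal token group.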
```diff
 def _calculate_timestamps(
-    self, indices: list[int] | torch.Tensor, video_fps: float, merge_size: int
+    self, indices: list[int] | torch.Tensor, video_fps: float, temporal_patch_size: int
 ):
```
But I thought this was basically referred from the implementation on the Transformers side? cc @JJJYmmm
https://github.com/huggingface/transformers/blob/670f3c85fde62cede8206d7ade43a8eaa1edc205/src/transformers/models/qwen3_vl/processing_qwen3_vl.py#L262-L273
Hi @Isotr0py, thanks for your comments.
I have checked the Transformers implementation. In fact, the argument passed to this function is `temporal_patch_size`, not `merge_size`:
```python
curr_timestamp = self._calculate_timestamps(
    metadata.frames_indices,
    metadata.fps,
    self.video_processor.temporal_patch_size,
)
```
You can check here:
https://github.com/huggingface/transformers/blob/670f3c85fde62cede8206d7ade43a8eaa1edc205/src/transformers/models/qwen3_vl/processing_qwen3_vl.py#L152-L156
As discussed in huggingface/transformers#43519 (use `temporal_patch_size` rather than `merge_size` for the video processor).
@cnyvfang I think modifying lines 770/809 is enough (to align with the HF side): https://github.com/vllm-project/vllm/pull/37439/changes#diff-89b6b3d99f8961ecdc0cf8cf16b0f9e091a6227ee2a9bfb4f04b88043278e0e0R809.
I agree with your point of view; I have made the minimal changes in the new commits. Thank you for your suggestion! @JJJYmmm
…tion specification. Signed-off-by: chengyufang <cnyvfang@outlook.com>
…calculation (vllm-project#37439) Signed-off-by: chengyufang <cnyvfang@outlook.com>
…calculation (vllm-project#37439) Signed-off-by: chengyufang <cnyvfang@outlook.com> Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
…calculation (vllm-project#37439) Signed-off-by: chengyufang <cnyvfang@outlook.com> Signed-off-by: Vinay Damodaran <vrdn@hey.com>
…calculation (vllm-project#37439) Signed-off-by: chengyufang <cnyvfang@outlook.com> Signed-off-by: EricccYang <yangyang4991@gmail.com>
Purpose
This PR fixes an incorrect use of `merge_size` in Qwen3-VL video timestamp processing. (This bug also affects Qwen3.5.)

Previously, `_get_video_second_idx` passed `video_processor.merge_size` into `_calculate_timestamps`, and `_calculate_timestamps` used it to pad frame indices and group timestamps. However, `merge_size` only affects the spatial dimension in Qwen3-VL and Qwen3.5; it does not affect the temporal dimension, because temporal tokens are not merged by the merger.

The correct parameter here is `temporal_patch_size`, since both the final number of temporal groups and the timestamp grouping logic depend on temporal patching rather than spatial merging.

The old code appeared to work under default model settings only because `merge_size` and `temporal_patch_size` happen to be numerically equal in the default Qwen3-VL and Qwen3.5 configurations. Once a user initializes the model with non-default values or uses a checkpoint with a modified `temporal_patch_size`, timestamp grouping becomes inconsistent with `tokens_per_frame`, which can trigger an `AssertionError`.

This PR replaces the incorrect use of `merge_size` with `temporal_patch_size` in both `_get_video_second_idx` and `_calculate_timestamps`, so the timestamp calculation matches the model's actual temporal design.

Test Plan
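To make the failure mode concrete, here is a small numeric illustration. The counts are hypothetical, chosen to match the non-default configuration described above (spatial `merge_size` left at its default while `temporal_patch_size` is raised to 16):

```python
# Hypothetical example: with a non-default temporal_patch_size, grouping
# timestamps by merge_size yields a different number of groups than the
# temporal token grouping that tokens_per_frame is based on.
num_frames = 32
merge_size = 2            # spatial merging factor (default, unchanged)
temporal_patch_size = 16  # modified, non-default temporal grouping

# tokens_per_frame has one entry per temporal group of frames:
num_temporal_groups = num_frames // temporal_patch_size   # 2

# The old code grouped timestamps by merge_size instead:
num_timestamps_old = num_frames // merge_size             # 16
# The fixed code groups timestamps by temporal_patch_size:
num_timestamps_new = num_frames // temporal_patch_size    # 2

assert num_timestamps_old != num_temporal_groups  # mismatch -> AssertionError
assert num_timestamps_new == num_temporal_groups  # lengths now agree
```

In the default configuration both factors equal 2, so the two groupings coincide and the bug stays hidden.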
- Modify `temporal_patch_size` in the following config files from the default value `2` to any other value (in our case, we set it to `16`): `config.json`, `preprocessor_config.json`, `video_preprocessor_config.json`, `processor_config.json`.
- Run vLLM inference on an input containing video, exercising `_get_video_second_idx`, `_calculate_timestamps`, and `get_video_repl`.
- Before the fix: `timestamps` are grouped using `merge_size`, while `tokens_per_frame` is derived from the true temporal grouping based on `temporal_patch_size`.
- After the fix: `timestamps` are grouped using `temporal_patch_size`, `len(timestamps) == len(tokens_per_frame)` holds, `get_video_repl` no longer fails, and behavior is unchanged in the default case where `merge_size == temporal_patch_size`.

Test Result
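The core invariant from the test plan can be sketched as a standalone check. The names here (`tokens_per_frame`, `timestamps_align`) are simplified stand-ins for the real processor outputs, not vLLM's actual API:

```python
import math


def timestamps_align(num_frames: int, video_fps: float,
                     temporal_patch_size: int) -> bool:
    """Check the invariant asserted by get_video_repl: one timestamp
    per temporal token group (simplified model of the real outputs)."""
    # One tokens_per_frame entry per temporal group of frames.
    tokens_per_frame = [1] * math.ceil(num_frames / temporal_patch_size)
    # With the fix, timestamps are grouped by temporal_patch_size as well.
    timestamps = [
        i / video_fps for i in range(0, num_frames, temporal_patch_size)
    ]
    return len(timestamps) == len(tokens_per_frame)
```

With the fix applied, the check holds for the default `temporal_patch_size=2` as well as non-default values like 16, including frame counts that do not divide evenly.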
Before this fix, after modifying `temporal_patch_size` in the Qwen3-VL or Qwen3.5 config files to a non-default value such as `16`, running vLLM inference on an input containing video content reproduced a mismatch between `timestamps` and `tokens_per_frame`, leading to an `AssertionError` at runtime.

After replacing `merge_size` with `temporal_patch_size`, the same workflow runs correctly. The computed `timestamps` now align with the actual temporal token grouping, the assertion passes, and the behavior is consistent with the model design.