
[Bugfix] Fix incorrect use of merge_size in Qwen3-VL video timestamp calculation#37439

Merged
Isotr0py merged 5 commits into vllm-project:main from cnyvfang:main
Mar 18, 2026

Conversation

@cnyvfang

@cnyvfang cnyvfang commented Mar 18, 2026

Purpose

This PR fixes an incorrect use of merge_size in Qwen3-VL video timestamp processing. (This bug also affects Qwen3.5)

Previously, _get_video_second_idx passed video_processor.merge_size into _calculate_timestamps, and _calculate_timestamps used it to pad frame indices and group timestamps. However, merge_size only affects the spatial dimension in Qwen3-VL and Qwen3.5. It does not affect the temporal dimension, because temporal tokens are not merged by the merger.

The correct parameter here is temporal_patch_size, since both the final number of temporal groups and the timestamp grouping logic depend on temporal patching rather than spatial merging.
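
For intuition, the two parameters act on different axes of the video token grid. The helpers below are illustrative only (hypothetical names, not code from vLLM): merge_size reduces spatial tokens within each temporal group, while temporal_patch_size only determines how many frames form one group.

```python
def tokens_per_temporal_group(grid_h: int, grid_w: int, merge_size: int) -> int:
    # merge_size collapses merge_size x merge_size spatial patches into one token.
    return (grid_h // merge_size) * (grid_w // merge_size)

def temporal_groups(num_frames: int, temporal_patch_size: int) -> int:
    # temporal_patch_size groups frames along time; no merging happens there.
    return -(-num_frames // temporal_patch_size)  # ceiling division

# e.g. a 16x16 patch grid with merge_size=2 gives 64 tokens per temporal group,
# while 8 frames with temporal_patch_size=2 give 4 temporal groups.
```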

The old code appeared to work under default model settings only because merge_size and temporal_patch_size happen to be numerically equal in the default Qwen3-VL and Qwen3.5 configurations. Once a user initializes the model with non-default values or uses a checkpoint with a modified temporal_patch_size, timestamp grouping becomes inconsistent with tokens_per_frame, which can trigger:

AssertionError: timestamps and tokens_per_frame must have the same length

This PR replaces the incorrect use of merge_size with temporal_patch_size in both _get_video_second_idx and _calculate_timestamps, so the timestamp calculation matches the model's actual temporal design.
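
A minimal sketch of the corrected grouping logic (simplified; the real _calculate_timestamps in qwen3_vl.py also handles torch tensors and may pad differently):

```python
def calculate_timestamps(indices: list[int], video_fps: float,
                         temporal_patch_size: int) -> list[float]:
    # Pad frame indices to a multiple of temporal_patch_size by repeating the
    # last index, so each temporal group is complete.
    if len(indices) % temporal_patch_size != 0:
        pad = temporal_patch_size - len(indices) % temporal_patch_size
        indices = list(indices) + [indices[-1]] * pad
    # Emit one timestamp per temporal group (first frame of each group).
    return [indices[i] / video_fps
            for i in range(0, len(indices), temporal_patch_size)]
```

Grouping by temporal_patch_size keeps len(timestamps) equal to the number of temporal token groups, which is what the assertion in get_video_repl checks.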

Test Plan

  1. Start from a Qwen3-VL or Qwen3.5 checkpoint (in our case, the Qwen3.5 4B dense model).
  2. Modify temporal_patch_size in the following config files from the default value 2 to any other value (in our case, 16):
    • config.json
    • preprocessor_config.json
    • video_preprocessor_config.json
    • processor_config.json
  3. Load the checkpoint with the modified configs in vLLM for inference.
  4. Run an input whose QA content includes video input, so that the video processing path reaches:
    • _get_video_second_idx
    • _calculate_timestamps
    • get_video_repl
  5. Verify that the original implementation fails because timestamps are grouped using merge_size, while tokens_per_frame are derived from the true temporal grouping based on temporal_patch_size.
  6. Apply this patch and rerun the same inference workflow.
  7. Verify that:
    • timestamps are grouped using temporal_patch_size
    • len(timestamps) == len(tokens_per_frame)
    • the assertion in get_video_repl no longer fails
    • the behavior remains unchanged for default configurations where merge_size == temporal_patch_size
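
Step 2 of the plan can be scripted roughly as follows. This is a sketch: the checkpoint directory path is a placeholder, and the recursive walk assumes temporal_patch_size may be nested (e.g. under a vision_config section); adjust for your checkpoint layout.

```python
import json
from pathlib import Path

def set_temporal_patch_size(cfg: dict, value: int) -> dict:
    """Overwrite every temporal_patch_size entry, including nested ones."""
    for key, val in cfg.items():
        if key == "temporal_patch_size":
            cfg[key] = value
        elif isinstance(val, dict):
            set_temporal_patch_size(val, value)
    return cfg

def patch_checkpoint(ckpt_dir: str, value: int = 16) -> None:
    # The four config files from step 2; files absent from the checkpoint
    # are simply skipped.
    for name in ("config.json", "preprocessor_config.json",
                 "video_preprocessor_config.json", "processor_config.json"):
        path = Path(ckpt_dir) / name
        if path.exists():
            cfg = set_temporal_patch_size(json.loads(path.read_text()), value)
            path.write_text(json.dumps(cfg, indent=2))
```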

Test Result

Before this fix, after modifying temporal_patch_size in the Qwen3-VL or Qwen3.5 config files to a non-default value such as 16, running vLLM inference on an input containing video content reproduced a mismatch between timestamps and tokens_per_frame, leading to the following runtime error:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/mnt/beijing_fast/fangchengyu.fcy/BasicMed/swift/cli/infer.py", line 5, in <module>
[rank0]:     infer_main()
[rank0]:   File "/mnt/beijing_fast/fangchengyu.fcy/BasicMed/swift/pipelines/infer/infer.py", line 307, in infer_main
[rank0]:     return SwiftInfer(args).main()
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/beijing_fast/fangchengyu.fcy/BasicMed/swift/pipelines/base.py", line 52, in main
[rank0]:     result = self.run()
[rank0]:              ^^^^^^^^^^
[rank0]:   File "/mnt/beijing_fast/fangchengyu.fcy/BasicMed/swift/pipelines/infer/infer.py", line 96, in run
[rank0]:     result = self.infer_dataset()
[rank0]:              ^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/beijing_fast/fangchengyu.fcy/BasicMed/swift/pipelines/infer/infer.py", line 256, in infer_dataset
[rank0]:     result = self._batch_infer(shard_dataset, request_config)
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/beijing_fast/fangchengyu.fcy/BasicMed/swift/pipelines/infer/infer.py", line 295, in _batch_infer
[rank0]:     resp_list = self.infer(val_dataset, request_config, use_tqdm=True, **self.infer_kwargs)
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/mnt/beijing_fast/fangchengyu.fcy/BasicMed/swift/infer_engine/vllm_engine.py", line 718, in infer
[rank0]:     self._add_request(inputs, generation_config, request_id, adapter_request=adapter_request)
[rank0]:   File "/mnt/beijing_fast/fangchengyu.fcy/BasicMed/swift/infer_engine/vllm_engine.py", line 390, in _add_request
[rank0]:     return self.engine.add_request(request_id, llm_inputs, generation_config, **kwargs)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/envs/fangchengyu.fcy/BasicMed/lib/python3.12/site-packages/vllm/v1/engine/llm_engine.py", line 248, in add_request
[rank0]:     request = self.input_processor.process_inputs(
[rank0]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/envs/fangchengyu.fcy/BasicMed/lib/python3.12/site-packages/vllm/v1/engine/input_processor.py", line 263, in process_inputs
[rank0]:     processed_inputs = self.input_preprocessor.preprocess(
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/envs/fangchengyu.fcy/BasicMed/lib/python3.12/site-packages/vllm/inputs/preprocess.py", line 317, in preprocess
[rank0]:     return self._process_decoder_only_prompt(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/envs/fangchengyu.fcy/BasicMed/lib/python3.12/site-packages/vllm/inputs/preprocess.py", line 298, in _process_decoder_only_prompt
[rank0]:     return self._prompt_to_llm_inputs(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/envs/fangchengyu.fcy/BasicMed/lib/python3.12/site-packages/vllm/inputs/preprocess.py", line 231, in _prompt_to_llm_inputs
[rank0]:     return self._process_tokens(prompt)  # type: ignore[arg-type]
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/envs/fangchengyu.fcy/BasicMed/lib/python3.12/site-packages/vllm/inputs/preprocess.py", line 144, in _process_tokens
[rank0]:     inputs = self._process_multimodal(
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/envs/fangchengyu.fcy/BasicMed/lib/python3.12/site-packages/vllm/inputs/preprocess.py", line 103, in _process_multimodal
[rank0]:     return self.renderer._process_multimodal(
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/envs/fangchengyu.fcy/BasicMed/lib/python3.12/site-packages/vllm/renderers/base.py", line 563, in _process_multimodal
[rank0]:     mm_inputs = mm_processor.apply(mm_processor_inputs, mm_timing_ctx)
[rank0]:                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/envs/fangchengyu.fcy/BasicMed/lib/python3.12/site-packages/vllm/multimodal/processing/processor.py", line 1682, in apply
[rank0]:     ) = self._cached_apply_hf_processor(inputs, timing_ctx)
[rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/envs/fangchengyu.fcy/BasicMed/lib/python3.12/site-packages/vllm/multimodal/processing/processor.py", line 1471, in _cached_apply_hf_processor
[rank0]:     ) = self._apply_hf_processor_main(
[rank0]:         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/envs/fangchengyu.fcy/BasicMed/lib/python3.12/site-packages/vllm/multimodal/processing/processor.py", line 1288, in _apply_hf_processor_main
[rank0]:     mm_processed_data = self._apply_hf_processor_mm_only(
[rank0]:                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/envs/fangchengyu.fcy/BasicMed/lib/python3.12/site-packages/vllm/multimodal/processing/processor.py", line 1246, in _apply_hf_processor_mm_only
[rank0]:     _, mm_processed_data, _ = self._apply_hf_processor_text_mm(
[rank0]:                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/envs/fangchengyu.fcy/BasicMed/lib/python3.12/site-packages/vllm/multimodal/processing/processor.py", line 1173, in _apply_hf_processor_text_mm
[rank0]:     processed_data = self._call_hf_processor(
[rank0]:                      ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/envs/fangchengyu.fcy/BasicMed/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_vl.py", line 1032, in _call_hf_processor
[rank0]:     video_repl = Qwen3VLMultiModalProcessor.get_video_repl(
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/opt/conda/envs/fangchengyu.fcy/BasicMed/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_vl.py", line 1195, in get_video_repl
[rank0]:     assert len(timestamps) == len(tokens_per_frame), (
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: AssertionError: timestamps and tokens_per_frame must have the same length
[rank0]:[W318 13:01:36.285935784 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())

After replacing merge_size with temporal_patch_size, the same workflow runs correctly. The computed timestamps now align with the actual temporal token grouping, the assertion passes, and the behavior is consistent with the model design.
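
The length mismatch can also be reproduced numerically. With illustrative values of 32 sampled frames, merge_size = 2, and temporal_patch_size = 16, the buggy grouping produces 16 timestamp groups while the temporal merger actually yields 2 token groups:

```python
from math import ceil

def num_groups(n: int, group: int) -> int:
    # Number of groups after padding n items up to a multiple of `group`.
    return ceil(n / group)

num_frames, merge_size, temporal_patch_size = 32, 2, 16  # illustrative values

# Old code grouped timestamps by merge_size, while tokens_per_frame follows
# the temporal grouping, so the two lengths disagree and get_video_repl raises.
assert num_groups(num_frames, merge_size) == 16          # timestamps (buggy)
assert num_groups(num_frames, temporal_patch_size) == 2  # tokens_per_frame
# With the fix, both sides derive from temporal_patch_size and always match.
```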


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

…idx and _calculate_timestamps

Signed-off-by: chengyufang <cnyvfang@outlook.com>
Copilot AI review requested due to automatic review settings March 18, 2026 14:06
@cnyvfang cnyvfang requested a review from sighingnow as a code owner March 18, 2026 14:06
@mergify mergify bot added qwen Related to Qwen models bug Something isn't working labels Mar 18, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request correctly identifies and addresses a critical bug where merge_size was incorrectly used for temporal video timestamp calculations instead of temporal_patch_size. The changes consistently replace the erroneous variable across the _calculate_timestamps function and its call site in _get_video_second_idx. This fix aligns with the model's actual temporal design and resolves the AssertionError reported in the description.

Contributor

Copilot AI left a comment


Pull request overview

Fixes Qwen3-VL/Qwen3.5 video timestamp grouping by using temporal_patch_size (temporal token grouping) instead of merge_size (spatial merging), preventing mismatches between computed timestamps and tokens_per_frame under non-default configs.

Changes:

  • Update _calculate_timestamps to pad/group using temporal_patch_size rather than merge_size.
  • Update _get_video_second_idx to pass video_processor.temporal_patch_size into _calculate_timestamps.


Comment on lines 747 to 749

    def _calculate_timestamps(
-       self, indices: list[int] | torch.Tensor, video_fps: float, merge_size: int
+       self, indices: list[int] | torch.Tensor, video_fps: float, temporal_patch_size: int
    ):
Member

@Isotr0py Isotr0py Mar 18, 2026


Author

@cnyvfang cnyvfang Mar 18, 2026


Hi @Isotr0py, thanks for your comments.

I have checked the Transformers implementation. In fact, the argument passed to this function is temporal_patch_size, not merge_size.

curr_timestamp = self._calculate_timestamps(
    metadata.frames_indices,
    metadata.fps,
    self.video_processor.temporal_patch_size,
)

You can check here:
https://github.com/huggingface/transformers/blob/670f3c85fde62cede8206d7ade43a8eaa1edc205/src/transformers/models/qwen3_vl/processing_qwen3_vl.py#L152-L156

Contributor

@JJJYmmm JJJYmmm Mar 18, 2026


As discussed in huggingface/transformers#43519 (use temporal_patch_size rather than merge_size for the video processor).

@cnyvfang I think modifying line 770/809 is enough. (to align with hf side) https://github.com/vllm-project/vllm/pull/37439/changes#diff-89b6b3d99f8961ecdc0cf8cf16b0f9e091a6227ee2a9bfb4f04b88043278e0e0R809.

Author


I agree with your point; I've made the minimal changes in the new commits. Thank you for the suggestion! @JJJYmmm

@cnyvfang cnyvfang requested a review from Isotr0py March 18, 2026 14:35
cnyvfang and others added 2 commits March 18, 2026 22:39
…tion specification.

Signed-off-by: chengyufang <cnyvfang@outlook.com>
@Isotr0py Isotr0py enabled auto-merge (squash) March 18, 2026 14:49
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 18, 2026
@Isotr0py Isotr0py merged commit 738d0a2 into vllm-project:main Mar 18, 2026
54 checks passed
fxdawnn pushed a commit to fxdawnn/vllm that referenced this pull request Mar 19, 2026
…calculation (vllm-project#37439)

Signed-off-by: chengyufang <cnyvfang@outlook.com>
SouthWest7 pushed a commit to SouthWest7/vllm that referenced this pull request Mar 27, 2026
…calculation (vllm-project#37439)

Signed-off-by: chengyufang <cnyvfang@outlook.com>
khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 27, 2026
…calculation (vllm-project#37439)

Signed-off-by: chengyufang <cnyvfang@outlook.com>
Monishver11 pushed a commit to Monishver11/vllm that referenced this pull request Mar 27, 2026
…calculation (vllm-project#37439)

Signed-off-by: chengyufang <cnyvfang@outlook.com>
Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
JiantaoXu pushed a commit to JiantaoXu/vllm that referenced this pull request Mar 28, 2026
…calculation (vllm-project#37439)

Signed-off-by: chengyufang <cnyvfang@outlook.com>
vrdn-23 pushed a commit to vrdn-23/vllm that referenced this pull request Mar 30, 2026
…calculation (vllm-project#37439)

Signed-off-by: chengyufang <cnyvfang@outlook.com>
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
EricccYang pushed a commit to EricccYang/vllm that referenced this pull request Apr 1, 2026
…calculation (vllm-project#37439)

Signed-off-by: chengyufang <cnyvfang@outlook.com>
Signed-off-by: EricccYang <yangyang4991@gmail.com>

Labels

bug Something isn't working qwen Related to Qwen models ready ONLY add when PR is ready to merge/full CI is needed


4 participants