Migrate Qwen2 inputs to TensorSchema#23475
Conversation
There was a problem hiding this comment.
Code Review
This pull request successfully migrates Qwen2 inputs from TypedDict to TensorSchema, which enhances runtime shape validation and improves input contract enforcement. The changes are well-aligned with the project's goals for multi-modal models. However, I've identified several instances where the type hints for tensor fields are incorrect. Specifically, they are declared as Union[torch.Tensor, list[torch.Tensor]] when the values passed at runtime are always torch.Tensor. Correcting these type hints will improve code clarity, maintainability, and alignment with static analysis tools.
|
May I ask the purpose of refactoring existing models? Based on your description, I cannot understand. |
Sorry for any confusion. We're migrating from TypedDict to TensorSchema to enable shape validations. More details can be found in this RFC. Feel free to lmk if any questions. |
|
This pull request has merge conflicts that must be resolved before it can be |
|
Observing failing MM test similar to #23471, #22021: |
Head branch was pushed to by a user without write access
224f5a1 to
edc9642
Compare
There was a problem hiding this comment.
I think we should just allow unpadded features for this model.
There was a problem hiding this comment.
Ack, thanks for the reminder. Will investigate further and work on a fix.
There was a problem hiding this comment.
Observed some differences in dims between the input_features for Qwen2-Audio (ndim=3) and Qwen2.5-Omni (ndim=2). Qwen2-Audio inputs match the shape definition, but Qwen2.5-Omni currently does not. I’ve yet to find a way of resolving the conflict by squeezing/flattening or updating schema annotation. Need more time to root cause the differences. Feel free to let me know if any thoughts or pointers for investigating further.
Original Schema:
class Qwen2AudioFeatureInputs(TypedDict):
type: Literal["audio_features"]
input_features: torch.Tensor
"""Shape: `(num_audios, num_mel_bins, 3000)`"""
feature_attention_mask: torch.Tensor
"""Shape: `(num_audios, 3000)`"""
# Qwen2-Audio
python -m pytest tests/models/multimodal/processing/test_tensor_schema.py::test_model_tensor_schema[Qwen2AudioForConditionalGeneration-Qwen/Qwen2-Audio-7B-Instruct]
> /home/bbeckca/vllm/vllm/model_executor/models/qwen2_audio.py(369)_parse_and_validate_audio_input()
-> input_features = self._validate_and_reshape_mm_tensor(
(Pdb) [x.shape for x in input_features]
[torch.Size([1, 128, 3000]), torch.Size([1, 128, 3000]), torch.Size([1, 128, 3000])]
(Pdb) [x.shape for x in feature_attention_mask]
[torch.Size([1, 3000]), torch.Size([1, 3000]), torch.Size([1, 3000])]
(Pdb) self._validate_and_reshape_mm_tensor(input_features, 'input_features').shape
torch.Size([3, 128, 3000])
(Pdb) self._validate_and_reshape_mm_tensor(feature_attention_mask, 'feature_attention_mask').shape
torch.Size([3, 3000])
# Qwen2.5-Omni
python -m pytest tests/models/multimodal/processing/test_common.py::test_processing_correctness[1.0-32-0.5-Qwen/Qwen2.5-Omni-3B]
> /home/bbeckca/vllm/vllm/model_executor/models/qwen2_5_omni_thinker.py(547)_parse_and_validate_audio_input()
(Pdb) [x.shape for x in input_audio_features]
[torch.Size([128, 3000]), torch.Size([128, 1500]), torch.Size([128, 750])]
(Pdb) [x.shape for x in feature_attention_mask]
[torch.Size([1, 30000]), torch.Size([1, 30000]), torch.Size([1, 30000])]
(Pdb) self._validate_and_reshape_mm_tensor(input_audio_features, 'input_audio_features', dim=1).shape
torch.Size([128, 5250])
(Pdb) self._validate_and_reshape_mm_tensor(feature_attention_mask, 'feature_attention_mask').shape
torch.Size([3, 30000])
There was a problem hiding this comment.
Perhaps we should define a separate shape definition for Qwen2.5-Omni audio inputs then.
|
Sorry, pushed incorrectly. Feel free to ignore. Will undo those changes. |
Signed-off-by: Benji Beck <benjibeck@meta.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Benji Beck <benjibeck@meta.com>
Signed-off-by: Benji Beck <benjibeck@meta.com>
Signed-off-by: Benji Beck <benjibeck@meta.com>
There was a problem hiding this comment.
@DarkLight1337 Added new input for Omni with best effort naming. Feel free to review and share any thoughts.
Should be resolved. Apologies to any reviewers that were tagged unnecessarily. Feel free to let me know if there's anything I should do to fix the mergify labels. |
Signed-off-by: Benji Beck <benjibeck@meta.com>
Signed-off-by: Benji Beck <benjibeck@meta.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Benji Beck <benjibeck@meta.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Benji Beck <benjibeck@meta.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Purpose
This PR migrates Qwen2 inputs from a TypedDict-based definition to a structured TensorSchema model with runtime shape validation. This brings it in line with recent changes to Phi3VImagePixelInputs, and is part of a broader effort to improve input contract enforcement and debug-ability across multi-modal models.
More details: #14764 (comment)
Classes Migrated:
qwen2_vl.py:Qwen2VLImagePixelInputsQwen2VLImageEmbeddingInputsQwen2VLVideoPixelInputsQwen2VLVideoEmbeddingInputsqwen2_5_vl.py:Qwen2_5_VLImagePixelInputsQwen2_5_VLImageEmbeddingInputsQwen2_5_VLVideoPixelInputsQwen2_5_VLVideoEmbeddingInputsqwen2_audio.py:Qwen2AudioInputsTest Plan
Confirm validation works via standalone tests in tests/standalone_test/test_tensor_schema.py and rely on CI to check integration.
Test Result