Migrate Phi4 inputs to TensorSchema#23471
Conversation
There was a problem hiding this comment.
Code Review
This pull request successfully migrates Phi-4 multi-modal input definitions from TypedDict to the more structured TensorSchema, enhancing input validation and code clarity. The changes also include some useful bug fixes in error messages. My review identifies a few critical correctness issues related to the handling and parsing of image inputs, particularly image embeddings, in both phi4_multimodal.py and phi4mm.py. While some of these issues may predate this PR, addressing them is crucial for the correct functioning of the multi-modal capabilities.
There was a problem hiding this comment.
The usage of this Phi4MMImageEmbeddingInputs class in _process_image_input seems to have a few issues:
- The type hint for
image_inputin_process_image_inputisPhi4MMImagePixelInputs, but it should bePhi4MMImageInputto handle both pixel values and embeddings. - It accesses
image_input["image_embeds"], but the field name in this class isdata. This will cause aKeyError. - It uses
self.visual.dtype, butself.visualis not defined inPhi4MultimodalForCausalLM. This will cause anAttributeError.
While these issues might not be introduced in this PR, they are related to the data structures being modified and affect correctness.
There was a problem hiding this comment.
This might be worth reviewing, but it beyond the scope of this diff and seems better served as a follow up task.
There was a problem hiding this comment.
This function _parse_and_validate_image_input seems to only handle image_pixel_values. However, _parse_and_validate_multimodal_inputs calls it for both image_pixel_values and image_embeds. If only image_embeds are provided, this function will return None, and the embeddings will be ignored.
This function should be updated to handle image_embeds as well, similar to how _parse_and_validate_audio_input handles both features and embeddings. The return type hint should also be updated from Optional[Phi4MMImagePixelInputs] to Optional[Phi4MMImageInput].
vllm/model_executor/models/phi4mm.py
Outdated
There was a problem hiding this comment.
There seems to be a naming inconsistency here. The function _parse_and_validate_image_input gets input_image_embeds from kwargs, but it seems to be treating them as pixel values and returns a Phi4MMImagePixelInputs object.
This inconsistency propagates. For example, _get_mm_fields_config in Phi4MMMultiModalProcessor expects input_image_embeds, but the HF processor is likely to produce image_pixel_values. This could lead to a KeyError at runtime.
It seems input_image_embeds should be renamed to image_pixel_values throughout the image processing path in this file for clarity and correctness.
There was a problem hiding this comment.
I'm just fixing a typo. Similarly, this might be worth reviewing as a follow up task but trying to keep changes in this PR focused on the migration.
|
Observing failing MM test for: Based on existing schema, this seems to be an issue with the inputs: Will find time to investigate further. |
Head branch was pushed to by a user without write access
Signed-off-by: Benji Beck <benjibeck@meta.com>
Signed-off-by: Benji Beck <benjibeck@meta.com>
Signed-off-by: Benji Beck <benjibeck@meta.com>
…hi4MMAudioFeatureInputs.data Signed-off-by: Benji Beck <benjibeck@meta.com>
Signed-off-by: Benji Beck <benjibeck@meta.com>
Signed-off-by: Benji Beck <benjibeck@meta.com>
Signed-off-by: Benji Beck <benjibeck@meta.com>
Purpose
This PR migrates Phi4MMImagePixelInputs, Phi4MMAudioFeatureInputs, Phi4MMAudioEmbeddingInputs, Phi4MMImageEmbeddingInputs from a TypedDict-based definition to a structured TensorSchema model with runtime shape validation. This brings it in line with recent changes to Phi3VImagePixelInputs, and is part of a broader effort to improve input contract enforcement and debug-ability across multi-modal models.
Test Plan
Confirm validation works via standalone tests in tests/standalone_test/test_tensor_schema.py and rely on CI to check integration.
Test Result