[ROCm][CI] Fix accuracy for llama-nemotron-vl pooling tests#37613
DarkLight1337 merged 5 commits into vllm-project:main from
Conversation
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Testing MI250 to see if issue is resolved (added
Code Review
This pull request addresses an accuracy issue for llama-nemotron-vl pooling tests on ROCm by generalizing the patch to force SDPA for vision encoders. The changes refactor patch_hf_vision_attn_for_rocm to support more model architectures and apply this patch in the relevant tests. Additionally, the relative tolerance for test assertions is increased for ROCm to account for numerical differences. My feedback includes a suggestion to improve the robustness of the patching logic to prevent potential errors with different model structures in the future.
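For context, forcing SDPA on a Hugging Face model comes down to flipping the `_attn_implementation` field on the relevant config. A minimal sketch, with a hypothetical stand-in config class (the real patch lives in `patch_hf_vision_attn_for_rocm` in `tests/models/multimodal/conftest.py` and may differ in detail):

```python
def force_sdpa(config) -> None:
    """Switch a HF-style config to the SDPA attention backend.

    HF configs record the chosen backend in `_attn_implementation`;
    "sdpa" routes attention through
    torch.nn.functional.scaled_dot_product_attention.
    """
    if hasattr(config, "_attn_implementation"):
        config._attn_implementation = "sdpa"


# Hypothetical stand-in for a vision encoder config:
class FakeVisionConfig:
    _attn_implementation = "flash_attention_2"


cfg = FakeVisionConfig()
force_sdpa(cfg)
print(cfg._attn_implementation)  # sdpa
```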
tests/models/multimodal/conftest.py
Outdated
```python
if hasattr(inner, "vision_embedding"):
    vit = inner.vision_embedding[0]
    for layer in vit.encoder.layers:
        if hasattr(layer, "self_attn"):
            layer.self_attn.vision_config._attn_implementation = "sdpa"
    _patch_encoder_layers(vit.encoder)
```
The current implementation assumes that inner.vision_embedding is a non-empty list and that its first element has an encoder attribute. This could lead to IndexError or AttributeError if a model has a vision_embedding attribute with a different structure. To make this patch more robust and prevent future test failures, it's better to add checks for the list's existence and content, as well as for the presence of the encoder attribute.
Suggested change:

```diff
-if hasattr(inner, "vision_embedding"):
-    vit = inner.vision_embedding[0]
-    for layer in vit.encoder.layers:
-        if hasattr(layer, "self_attn"):
-            layer.self_attn.vision_config._attn_implementation = "sdpa"
-    _patch_encoder_layers(vit.encoder)
+if hasattr(inner, "vision_embedding") and inner.vision_embedding:
+    vit = inner.vision_embedding[0]
+    if hasattr(vit, "encoder"):
+        _patch_encoder_layers(vit.encoder)
```
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Test group confirmed passing: https://buildkite.com/vllm/amd-ci/builds/6723/steps/canvas?sid=019d09de-226d-4565-8db4-bd4f91370f0d&tab=output
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
…ject#37613) Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Follow-up for:
Fixes a small accuracy diff, caused by differences between the HF and vLLM attention backends on ROCm, in `mi250_1: Multi-Modal Models (Extended Pooling)`.

Motivation: https://buildkite.com/vllm/amd-ci/builds/6701/steps/canvas?sid=019d07a7-1a1e-445a-8480-1feaf029a19d&tab=output
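The tolerance bump amounts to selecting a looser `rtol` when the tests run on ROCm. A sketch with illustrative values (the actual tolerances and the platform check used by the test suite may differ):

```python
import math

ON_ROCM = True  # in vLLM this would come from a platform check

# SDPA on ROCm accumulates slightly different rounding than the HF
# reference path, so the relative tolerance is relaxed there
# (values below are illustrative, not the ones used in CI).
RTOL = 1e-2 if ON_ROCM else 1e-3


def pooling_close(a: float, b: float) -> bool:
    return math.isclose(a, b, rel_tol=RTOL)


print(pooling_close(0.8123, 0.8171))  # True under the relaxed rtol
```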
cc @kenroche