[ROCm][CI] Mark gemma3 as large GPU test to avoid OOM on MI250#37610

Merged
DarkLight1337 merged 5 commits into vllm-project:main from ROCm:akaratza_gfx90a_multi_mod_gemma on Mar 21, 2026

@AndreasKaratzas
Collaborator

@AndreasKaratzas AndreasKaratzas commented Mar 19, 2026

@AndreasKaratzas AndreasKaratzas added the rocm Related to AMD ROCm label Mar 19, 2026
@github-project-automation github-project-automation bot moved this to Todo in AMD Mar 19, 2026
@AndreasKaratzas AndreasKaratzas added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 19, 2026
@AndreasKaratzas
Collaborator Author

Testing MI250 to see if issue is resolved (added rocm and ready labels).

@mergify mergify bot added the multi-modality Related to multi-modality (#4194) label Mar 19, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses an Out-Of-Memory error on MI250 for gemma3 tests under ROCm by skipping the Scaled Dot-Product Attention (SDP) override. The change is simple and effective. I've added one suggestion to improve code maintainability by documenting the reason for this skip directly in the code.

@AndreasKaratzas AndreasKaratzas force-pushed the akaratza_gfx90a_multi_mod_gemma branch from 3e44e63 to 879b58b Compare March 20, 2026 00:46
@AndreasKaratzas AndreasKaratzas changed the title from "[ROCm][CI] Skip SDP override for gemma3 to avoid OOM on MI250 GCDs" to "[ROCm][CI] Reduce image resolution for gemma3 to avoid OOM on MI250" Mar 20, 2026
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
@AndreasKaratzas AndreasKaratzas force-pushed the akaratza_gfx90a_multi_mod_gemma branch from 700d164 to cdcea97 Compare March 20, 2026 06:06
@AndreasKaratzas AndreasKaratzas changed the title from "[ROCm][CI] Reduce image resolution for gemma3 to avoid OOM on MI250" to "[ROCm][CI] Split test step for gemma3 to avoid OOM on MI250" Mar 20, 2026
@mergify mergify bot added the ci/build label Mar 20, 2026
@AndreasKaratzas
Collaborator Author

I have just added a large-GPU mark that applies only on ROCm. The test is now skipped when the platform is MI250, which resolves the OOM there.
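A minimal sketch of the idea behind a ROCm-only large-GPU mark, assuming the skip decision is based on per-device memory. The `DeviceInfo` type, the `needs_skip` helper, and the 80 GiB threshold are hypothetical and are not taken from this PR; 64 GiB is the HBM capacity of a single MI250 GCD.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class DeviceInfo:
    """Hypothetical description of the accelerator a CI job runs on."""
    is_rocm: bool
    total_memory_bytes: int


def needs_skip(device: DeviceInfo, min_gib: int, rocm_only: bool = True) -> bool:
    """Return True if a 'large GPU' test should be skipped on this device.

    With rocm_only=True the memory check is applied only on ROCm, so other
    platforms keep running the test regardless of their memory size.
    """
    if rocm_only and not device.is_rocm:
        return False
    return device.total_memory_bytes < min_gib * GIB


# One MI250 GCD exposes 64 GiB; the 80 GiB requirement is an assumed value.
mi250_gcd = DeviceInfo(is_rocm=True, total_memory_bytes=64 * GIB)
cuda_small = DeviceInfo(is_rocm=False, total_memory_bytes=24 * GIB)

print(needs_skip(mi250_gcd, min_gib=80))   # skipped on the ROCm device
print(needs_skip(cuda_small, min_gib=80))  # not skipped: mark is ROCm-only
```

In a pytest suite, a boolean like this would typically be passed as the condition of `pytest.mark.skipif`, so the test is reported as skipped rather than failing with an OOM.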

@AndreasKaratzas AndreasKaratzas marked this pull request as ready for review March 20, 2026 21:45
@AndreasKaratzas AndreasKaratzas changed the title [ROCm][CI] Split test step for gemma3 to avoid OOM on MI250 [ROCm][CI] Mark gemma3 as large GPU test to avoid OOM on MI250 Mar 20, 2026
@AndreasKaratzas
Collaborator Author

@DarkLight1337
Member

Actually, how can a 4B model cause OOM?

@AndreasKaratzas
Collaborator Author

I think the profiling stage generates a tensor large enough to cause the OOM. It happens during the SDPA stage.
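A rough back-of-the-envelope sketch of how a single intermediate tensor can outweigh a small model's weights, assuming a non-fused SDPA path that materializes the full attention score matrix. The thread does not confirm that this is the actual cause, and the head count and sequence length below are hypothetical illustration values, not Gemma 3's configuration.

```python
GIB = 1024 ** 3


def attention_scores_bytes(batch: int, num_heads: int, seq_len: int,
                           bytes_per_elem: int) -> int:
    """Size of a fully materialized [batch, heads, seq, seq] score tensor."""
    return batch * num_heads * seq_len * seq_len * bytes_per_elem


def weights_bytes(num_params: int, bytes_per_elem: int) -> int:
    """Size of the model weights at a given element width."""
    return num_params * bytes_per_elem


# Hypothetical profiling shape: 1 sequence, 8 heads, 32768 tokens, 16-bit values.
scores = attention_scores_bytes(batch=1, num_heads=8, seq_len=32768, bytes_per_elem=2)
# A 4B-parameter model stored in a 16-bit dtype.
weights = weights_bytes(num_params=4_000_000_000, bytes_per_elem=2)

print(f"score tensor: {scores / GIB:.1f} GiB")
print(f"weights:      {weights / GIB:.1f} GiB")
```

Because the score tensor grows with the square of the sequence length, doubling the profiled length quadruples it, while the weight footprint stays fixed.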

@DarkLight1337
Member

DarkLight1337 commented Mar 21, 2026

Hmm ok, maybe you should investigate this further as it's quite unexpected. Let's get the CI to pass first though

@DarkLight1337 DarkLight1337 merged commit 0d50fa1 into vllm-project:main Mar 21, 2026
22 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in AMD Mar 21, 2026
@AndreasKaratzas AndreasKaratzas deleted the akaratza_gfx90a_multi_mod_gemma branch March 21, 2026 05:24
JartX pushed a commit to JartX/vllm that referenced this pull request Mar 21, 2026

Labels

ci/build
multi-modality: Related to multi-modality (#4194)
ready: ONLY add when PR is ready to merge/full CI is needed
rocm: Related to AMD ROCm

Projects

Status: Done

2 participants