
[ROCm][CI/Build] ROCm 7.2.1 release version; torch 2.10; triton 3.6 #38252

Merged
gshtras merged 5 commits into vllm-project:main from
ROCm:rocm_7.2.1_torch_triton
Mar 27, 2026

Collaborator

@gshtras gshtras commented Mar 26, 2026

Updating the base library versions.

Includes the profiler workaround, which uses a custom kineto submodule.

Includes the triton BUFFER OPS fix, cherry-picked from upstream triton.

Testing plan

A candidate base image is being built. Once it is ready, I will run a round of CI tests with it and update this PR.

Testing results

Testing revealed a pytest issue: after torch is imported, pytest exits with status 0 even when there are failing tests.

Worked around by creating a pytest_sessionfinish hook until we get to the bottom of it

gshtras added 2 commits March 24, 2026 15:26
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
@gshtras gshtras requested a review from tjtanaa as a code owner March 26, 2026 15:41

@claude claude Bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates the ROCm base image to version 7.2.1 and bumps the branch versions for Triton and PyTorch. It also introduces a cherry-pick for Triton and a specific commit checkout for the Kineto third-party dependency in PyTorch. Feedback was provided to avoid using global git configurations within the Dockerfile and to improve the robustness of git operations by avoiding non-idempotent remote additions.

Comment thread docker/Dockerfile.rocm_base
Comment thread docker/Dockerfile.rocm_base
Collaborator

@tjtanaa tjtanaa left a comment


LGTM as long as it passed internal testing.

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
@mergify mergify Bot added the ci/build and rocm (Related to AMD ROCm) labels Mar 26, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD Mar 26, 2026
Collaborator Author

gshtras commented Mar 26, 2026

CI run with the new image https://buildkite.com/vllm/amd-ci/builds/6964/summary
Base image built from the new dockerfile: rocm/vllm-dev:base_custom_rocm_7.2.1_torch_triton_20260326

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
@gshtras gshtras added the ready (ONLY add when PR is ready to merge/full CI is needed) label Mar 27, 2026
@gshtras gshtras merged commit 731285c into vllm-project:main Mar 27, 2026
12 of 13 checks passed
@github-project-automation github-project-automation Bot moved this from Todo to Done in AMD Mar 27, 2026
@gshtras gshtras deleted the rocm_7.2.1_torch_triton branch March 27, 2026 23:03
Comment thread docker/Dockerfile.rocm
# will not be imported by other tests
RUN mkdir src && mv vllm src/vllm

# This is a workaround to ensure pytest exits with the correct status code in CI tests.
Collaborator


@AndreasKaratzas can you note this down and formally fix the issue?

Collaborator

@AndreasKaratzas AndreasKaratzas Mar 28, 2026


We are triaging this. There are some exit hooks in this torch version, and we are working on that. This is not AMD-specific, by the way: the gap identified here could also affect upstream if an external package has exit hooks that override pytest's exit status. As for how to document it, I think it would be better to just open a PR and put the workaround formally in the master conftest file.
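To illustrate the general mechanism described above, here is a small standalone sketch of how an exit-time handler can replace a process's exit status. It is not a reproduction of the torch behavior, whose cause is still being triaged in this thread. The `CHILD` script and `run_child` helper are hypothetical names used only for this example.

```python
import subprocess
import sys

# Hypothetical child program: it asks to exit with status 1, but an atexit
# handler registered earlier terminates the process with status 0 instead.
CHILD = """
import atexit, os, sys
atexit.register(lambda: os._exit(0))
sys.exit(1)
"""


def run_child(source):
    """Run a Python snippet in a fresh interpreter and return its exit status."""
    return subprocess.run([sys.executable, "-c", source]).returncode


masked = run_child(CHILD)                          # handler overrides status
plain = run_child("import sys; sys.exit(1)")       # no handler registered
print(masked, plain)
```

The first child reports success even though it requested a failing status. A test runner that relies on the normal interpreter shutdown path is exposed to the same kind of override from any imported package.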

JiantaoXu pushed a commit to JiantaoXu/vllm that referenced this pull request Mar 28, 2026
…llm-project#38252)

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
neweyes pushed a commit to neweyes/vllm that referenced this pull request Mar 31, 2026
…llm-project#38252)

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: neweyes <328719365@qq.com>
puririshi98 pushed a commit to puririshi98/vllm that referenced this pull request Apr 7, 2026
…llm-project#38252)

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Rishi Puri <riship@nvidia.com>
mtparet pushed a commit to blackfuel-ai/vllm that referenced this pull request Apr 9, 2026
…llm-project#38252)

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
