
[ROCm] ROCm7.2.2 + profiler fix + AITER 0.1.12.post2 #41386

Merged
gshtras merged 15 commits into vllm-project:main from ROCm:rocclr_profiler_hotfix_old_torch on May 4, 2026

@gshtras gshtras commented Apr 30, 2026

Collaborator

A combination of base libraries that works, including the profiler fix applied through the ROCm runtime.

@gshtras gshtras requested a review from tjtanaa as a code owner April 30, 2026 16:00

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify Bot added the ci/build and rocm (Related to AMD ROCm) labels Apr 30, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD Apr 30, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request updates the ROCm base image to version 7.2.2 and the AITER branch to v0.1.12.post2, while removing a manual kineto repository checkout. It also introduces a torch profiler hotfix by rebuilding the CLR from source. However, the hotfix as implemented is likely ineffective because it lacks the correct installation prefix for CMake, causing the system to load original libraries instead of the patched ones. Additionally, the build steps should be consolidated into a single Docker layer to optimize image size.

Comment on lines +109 to +128
RUN apt-get update && apt-get install -y rocm-llvm-dev
RUN pip install CppHeaderParser
RUN git clone --no-checkout --filter=blob:none https://github.com/ROCm/rocm-systems /tmp/rocm-systems \
    && cd /tmp/rocm-systems \
    && git sparse-checkout init --cone \
    && git sparse-checkout set projects/hip projects/clr \
    && git checkout 35e8c7bf8911862e5389509800e65fdf125412b3 \
    && export CLR_DIR=/tmp/rocm-systems/projects/clr \
    && export HIP_DIR=/tmp/rocm-systems/projects/hip \
    && mkdir -p $CLR_DIR/build && cd $CLR_DIR/build \
    && cmake \
        -DHIP_COMMON_DIR=$HIP_DIR \
        -DCMAKE_PREFIX_PATH="/opt/rocm/" \
        -DCLR_BUILD_HIP=ON \
        -DCLR_BUILD_OCL=OFF \
        -DHIP_PLATFORM=amd \
        .. \
    && make -j$(nproc) \
    && make install \
    && rm -rf /tmp/rocm-systems

Priority: high

The hotfix for the torch profiler has a critical functional issue and an efficiency concern:

  1. Functional Issue: The cmake command is missing -DCMAKE_INSTALL_PREFIX="/opt/rocm/". By default, CMake installs to /usr/local. However, the LD_LIBRARY_PATH (defined on line 29) prioritizes /opt/rocm/lib over /usr/local/lib. This means the system will continue to load the original libraries from the base image instead of the hotfixed ones, rendering the fix ineffective.
  2. Image Efficiency: The hotfix is implemented across three separate RUN layers and does not clean up the apt cache. Consolidating these into a single layer and cleaning up /var/lib/apt/lists/* is standard practice to minimize image size and improve build performance.
RUN apt-get update && apt-get install -y --no-install-recommends rocm-llvm-dev \
    && pip install CppHeaderParser \
    && git clone --no-checkout --filter=blob:none https://github.com/ROCm/rocm-systems /tmp/rocm-systems \
    && cd /tmp/rocm-systems \
    && git sparse-checkout init --cone \
    && git sparse-checkout set projects/hip projects/clr \
    && git checkout 35e8c7bf8911862e5389509800e65fdf125412b3 \
    && export CLR_DIR=/tmp/rocm-systems/projects/clr \
    && export HIP_DIR=/tmp/rocm-systems/projects/hip \
    && mkdir -p $CLR_DIR/build && cd $CLR_DIR/build \
    && cmake \
        -DHIP_COMMON_DIR=$HIP_DIR \
        -DCMAKE_PREFIX_PATH="/opt/rocm/" \
        -DCMAKE_INSTALL_PREFIX="/opt/rocm/" \
        -DCLR_BUILD_HIP=ON \
        -DCLR_BUILD_OCL=OFF \
        -DHIP_PLATFORM=amd \
        .. \
    && make -j$(nproc) \
    && make install \
    && cd /app \
    && rm -rf /tmp/rocm-systems \
    && rm -rf /var/lib/apt/lists/*
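
The load-order argument in item 1 can be illustrated with a small shell sketch (not part of the PR). `first_match` is a hypothetical helper that walks a colon-separated search path and prints the first directory containing a given file, which is how the entries of LD_LIBRARY_PATH are tried in order. The temporary directory layout and the library name libamdhip64.so are assumptions made for the demonstration; the real dynamic loader also consults RPATH/RUNPATH and the ld.so cache, which this sketch ignores.

```shell
#!/bin/sh
# Hypothetical helper: print the first directory in a colon-separated
# search path that contains the named file. Empty entries are skipped.
first_match() {
    rest="$1:"
    file_name=$2
    while [ -n "$rest" ]; do
        dir=${rest%%:*}      # take the entry before the first colon
        rest=${rest#*:}      # drop that entry from the remaining path
        if [ -n "$dir" ] && [ -e "$dir/$file_name" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Simulate the two install locations inside a scratch directory.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/opt/rocm/lib" "$demo_root/usr/local/lib"
: > "$demo_root/opt/rocm/lib/libamdhip64.so"    # stands in for the base-image library
: > "$demo_root/usr/local/lib/libamdhip64.so"   # stands in for a rebuild installed to /usr/local

# Same ordering as described in the review: /opt/rocm/lib before /usr/local/lib.
demo_path="$demo_root/opt/rocm/lib:$demo_root/usr/local/lib"
winner=$(first_match "$demo_path" libamdhip64.so)
echo "${winner#"$demo_root"}"   # prints /opt/rocm/lib

rm -rf "$demo_root"
```

Inside a built image, something like `first_match "$LD_LIBRARY_PATH" libamdhip64.so` would give a quick hint about which copy comes first on the search path. For an authoritative answer, glibc's `LD_DEBUG=libs` shows the paths the loader actually searches.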

mergify Bot commented Apr 30, 2026

Contributor

Hi @gshtras, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.
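
A note on the install command above: in a POSIX shell an unquoted `>` starts an output redirection, so a version specifier such as `pre-commit>=4.5.1` is best passed in quotes. The sketch below demonstrates the difference in a scratch directory. It is illustrative only: `echo` stands in for `uv pip install`, and the variable names are made up for the example.

```shell
#!/bin/sh
# Work in a scratch directory so the stray file is easy to observe and clean up.
scratch=$(mktemp -d)

# Unquoted: the shell parses `>=4.5.1` as "redirect stdout to a file named =4.5.1",
# so the command itself only receives the argument `pre-commit`.
( cd "$scratch" && echo pre-commit>=4.5.1 )

unquoted_file_created=no
if [ -e "$scratch/=4.5.1" ]; then
    unquoted_file_created=yes
fi

# Quoted: the whole requirement reaches the command as a single argument.
quoted_arg=$(echo "pre-commit>=4.5.1")

echo "file created by unquoted form: $unquoted_file_created"   # prints yes
echo "argument seen by quoted form: $quoted_arg"               # prints pre-commit>=4.5.1

rm -rf "$scratch"
```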

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Rohan138 and others added 14 commits April 30, 2026 16:02
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
@gshtras gshtras force-pushed the rocclr_profiler_hotfix_old_torch branch from 92df0bf to b4fdd2e Compare April 30, 2026 16:02

@Bortlesboat Bortlesboat left a comment

Contributor

Hotfix pattern looks clean — sparse-checkout + --filter=blob:none + pinned commit (35e8c7bf8...) + the temp clone removed at the end. The kineto remote/checkout removal makes sense once the profiler fix lives in CLR itself rather than in pytorch's third_party/kineto.

One small Docker layer hygiene nit: the apt-get update && apt-get install -y rocm-llvm-dev doesn't clean /var/lib/apt/lists/* afterwards, so this RUN layer carries the apt cache permanently. Since the comment marks this as a temp block to remove at ROCm 7.2.3 it's probably not worth the churn, but a trailing && rm -rf /var/lib/apt/lists/* would keep the base image leaner in the meantime. Minor.

tjtanaa commented Apr 30, 2026

Member

@gshtras can you help to update this line to rocm722?

VARIANT: "rocm721"

@tjtanaa tjtanaa left a comment

Member

LGTM if it passed all required tests (internal and external).

Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
@gshtras gshtras added the ready (ONLY add when PR is ready to merge/full CI is needed) label May 4, 2026
@gshtras gshtras merged commit e724b0e into vllm-project:main May 4, 2026
14 of 16 checks passed
@gshtras gshtras deleted the rocclr_profiler_hotfix_old_torch branch May 4, 2026 18:07
@github-project-automation github-project-automation Bot moved this from Todo to Done in AMD May 4, 2026