[ROCm] ROCm7.2.2 + profiler fix + AITER 0.1.12.post2#41386
Conversation
There was a problem hiding this comment.
Code Review
This pull request updates the ROCm base image to version 7.2.2 and the AITER branch to v0.1.12.post2, while removing a manual kineto repository checkout. It also introduces a torch profiler hotfix by rebuilding the CLR from source. However, the hotfix as implemented is likely ineffective because it lacks the correct installation prefix for CMake, causing the system to load original libraries instead of the patched ones. Additionally, the build steps should be consolidated into a single Docker layer to optimize image size.
| RUN apt-get update && apt-get install -y rocm-llvm-dev | ||
| RUN pip install CppHeaderParser | ||
| RUN git clone --no-checkout --filter=blob:none https://github.com/ROCm/rocm-systems /tmp/rocm-systems \ | ||
| && cd /tmp/rocm-systems \ | ||
| && git sparse-checkout init --cone \ | ||
| && git sparse-checkout set projects/hip projects/clr \ | ||
| && git checkout 35e8c7bf8911862e5389509800e65fdf125412b3 \ | ||
| && export CLR_DIR=/tmp/rocm-systems/projects/clr \ | ||
| && export HIP_DIR=/tmp/rocm-systems/projects/hip \ | ||
| && mkdir -p $CLR_DIR/build && cd $CLR_DIR/build \ | ||
| && cmake \ | ||
| -DHIP_COMMON_DIR=$HIP_DIR \ | ||
| -DCMAKE_PREFIX_PATH="/opt/rocm/" \ | ||
| -DCLR_BUILD_HIP=ON \ | ||
| -DCLR_BUILD_OCL=OFF \ | ||
| -DHIP_PLATFORM=amd \ | ||
| .. \ | ||
| && make -j$(nproc) \ | ||
| && make install \ | ||
| && rm -rf /tmp/rocm-systems |
There was a problem hiding this comment.
The hotfix for the torch profiler has a critical functional issue and an efficiency concern:
- Functional Issue: The
cmakecommand is missing-DCMAKE_INSTALL_PREFIX="/opt/rocm/". By default, CMake installs to/usr/local. However, theLD_LIBRARY_PATH(defined on line 29) prioritizes/opt/rocm/libover/usr/local/lib. This means the system will continue to load the original libraries from the base image instead of the hotfixed ones, rendering the fix ineffective. - Image Efficiency: The hotfix is implemented across three separate
RUNlayers and does not clean up the apt cache. Consolidating these into a single layer and cleaning up/var/lib/apt/lists/*is standard practice to minimize image size and improve build performance.
RUN apt-get update && apt-get install -y --no-install-recommends rocm-llvm-dev \
&& pip install CppHeaderParser \
&& git clone --no-checkout --filter=blob:none https://github.com/ROCm/rocm-systems /tmp/rocm-systems \
&& cd /tmp/rocm-systems \
&& git sparse-checkout init --cone \
&& git sparse-checkout set projects/hip projects/clr \
&& git checkout 35e8c7bf8911862e5389509800e65fdf125412b3 \
&& export CLR_DIR=/tmp/rocm-systems/projects/clr \
&& export HIP_DIR=/tmp/rocm-systems/projects/hip \
&& mkdir -p $CLR_DIR/build && cd $CLR_DIR/build \
&& cmake \
-DHIP_COMMON_DIR=$HIP_DIR \
-DCMAKE_PREFIX_PATH="/opt/rocm/" \
-DCMAKE_INSTALL_PREFIX="/opt/rocm/" \
-DCLR_BUILD_HIP=ON \
-DCLR_BUILD_OCL=OFF \
-DHIP_PLATFORM=amd \
.. \
&& make -j$(nproc) \
&& make install \
&& cd /app \
&& rm -rf /tmp/rocm-systems \
&& rm -rf /var/lib/apt/lists/*
|
Hi @gshtras, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, Tip Is
|
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com> Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com> Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com> Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com> Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com> Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com> Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com> Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com> Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com> Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com> Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com> Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
92df0bf to
b4fdd2e
Compare
Bortlesboat
left a comment
There was a problem hiding this comment.
Hotfix pattern looks clean — sparse-checkout + --filter=blob:none + pinned commit (35e8c7bf8...) + the temp clone removed at the end. The kineto remote/checkout removal makes sense once the profiler fix lives in CLR itself rather than in pytorch's third_party/kineto.
One small Docker layer hygiene nit: the apt-get update && apt-get install -y rocm-llvm-dev doesn't clean /var/lib/apt/lists/* afterwards, so this RUN layer carries the apt cache permanently. Since the comment marks this as a temp block to remove at ROCm 7.2.3 it's probably not worth the churn, but a trailing && rm -rf /var/lib/apt/lists/* would keep the base image leaner in the meantime. Minor.
|
@gshtras can you help to update this line to vllm/.buildkite/release-pipeline.yaml Line 726 in 9c61864 |
tjtanaa
left a comment
There was a problem hiding this comment.
LGTM if it passed all required tests (internal and external).
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
A combination of base libraries that works
Including the profiler fix through rocm runtime