[AMD] Merge Dockerfiles for ROCm#19203
Conversation
Summary of ChangesHello @akao-amd, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request streamlines the Docker build process for ROCm environments by consolidating two previously separate Dockerfiles, Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Ignored Files
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
The pull request successfully unifies rocm720.Dockerfile into rocm.Dockerfile, streamlining the Dockerfile management for ROCm builds. The changes involve updating base images, adding ROCm 7.2 specific patches for AMD SMI and AITER, and modifying the CI script to use the unified Dockerfile. The accuracy tests passed for the specified models, indicating functional correctness. The removal of rocm720.Dockerfile and the consolidation of logic are positive steps towards better maintainability.
docker/rocm.Dockerfile
Outdated
| fi \ | ||
| && echo "export PYTHONPATH=/sgl-workspace/aiter:\${PYTHONPATH}" >> /etc/bash.bashrc |
There was a problem hiding this comment.
The PYTHONPATH export should ideally be placed in a more persistent configuration file like /etc/profile or a custom script that sources it, rather than /etc/bash.bashrc. While /etc/bash.bashrc works for interactive shells, it might not be sourced in non-interactive contexts (e.g., when running scripts or services), leading to unexpected behavior. Alternatively, if this is only for the build environment, it could be set as an ENV variable in the Dockerfile itself, or passed as a build argument to the Python setup.py command if applicable.
&& echo "export PYTHONPATH=/sgl-workspace/aiter:${PYTHONPATH}" >> /etc/profile
8a57843 to
7a49043
Compare
|
/tag-and-rerun-ci |
HaiShaw
left a comment
There was a problem hiding this comment.
Let's keep using release-docker-amd-nightly.yml vs. release-docker-amd-rocm720-nightly.yml
| @@ -117,17 +174,6 @@ RUN cd aiter \ | |||
| sh -c "GPU_ARCHS=$GPU_ARCH_LIST python setup.py develop"; \ | |||
| fi | |||
|
|
|||
There was a problem hiding this comment.
Lost && echo "export PYTHONPATH=/sgl-workspace/aiter:\${PYTHONPATH}" >> /etc/bash.bashrc in case ROCm 7.2
Please be careful at merging two dockerfiles
There was a problem hiding this comment.
We (YC, BingXu) discussed yesterday that the line wasn't needed.
There was a problem hiding this comment.
This is needed so sys.path on ROCm 7.2 is set properly to include /sgl-workspace/aiter
There was a problem hiding this comment.
Sorry but can you elaborate? If it is the case, why doesn't ROCm 7.0 image need it in the first place?
CC. @yctseng0211 @bingxche
| && cd python \ | ||
| && python setup.py install; \ | ||
| fi | ||
|
|
There was a problem hiding this comment.
Should keep this for ROCm 7.0
There was a problem hiding this comment.
It was just move to bottom, not removed.
| && pip install -r python/requirements.txt \ | ||
| && pip install -e .; \ | ||
| fi | ||
|
|
There was a problem hiding this comment.
L426-L509, pls keep it identical as in ROCm 7.2 dockerfile, and apply it only to ROCm7.2 build
There was a problem hiding this comment.
No. I believe BUILD_TRITON is already a better flag for this block and serves its purpose correctly.
Currently ROCm 7.2 images gives BUILD_TRITON=1 while ROCm 7.0 images give BUILD_TRITON=0. The flag has a finer granularity that can serve us well. For example, if we don't want to update gfx942-rocm720 for some reason, then it can keep as is while we may have fixed the triton dependency issue for gfx950-rocm720.
|
@HaiShaw Thanks for the review. Fixed two positions (vllm removal and cancel prompt for pip install) accordingly. For the rest I commented without modification. |
a27cd12 to
525cbc6
Compare
525cbc6 to
a5cd2f1
Compare
Co-authored-by: bingxche <Bingxu.Chen@amd.com>
Co-authored-by: bingxche <Bingxu.Chen@amd.com>
Co-authored-by: bingxche <Bingxu.Chen@amd.com>
Co-authored-by: bingxche <Bingxu.Chen@amd.com>
Motivation
rocm720.Dockerfileandrocm.Dockerfileshares a lot in common. It is time to unify them.Modifications
Making
rocm720.Dockerfilea superset ofrocm.Dockerfilewas the design choice back in #17799 . This PR mostly replacesrocm.Dockerfilewithrocm720.Dockerfile, and modifies all the users of it accordingly. Refactoring workflow files is a reasonable next step but not include in this PR.Accuracy Tests
Both
rocm700-mi35xandrocm720-mi35xpass the accuracy test of DEEPSEEK-R1-MXFP4-Preview model.Benchmarking and Profiling
N/A.
Checklist
Review Process
/tag-run-ci-label,/rerun-failed-ci,/tag-and-rerun-ci