[ROCm][CI/Build] ROCm 7.2.1 release version; torch 2.10; triton 3.6#38252
gshtras merged 5 commits into vllm-project:main
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Code Review
This pull request updates the ROCm base image to version 7.2.1 and bumps the branch versions for Triton and PyTorch. It also introduces a cherry-pick for Triton and a specific commit checkout for the Kineto third-party dependency in PyTorch. Feedback was provided to avoid using global git configurations within the Dockerfile and to improve the robustness of git operations by avoiding non-idempotent remote additions.
tjtanaa left a comment
LGTM as long as it passed internal testing.
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
CI run with the new image: https://buildkite.com/vllm/amd-ci/builds/6964/summary
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
# will not be imported by other tests
RUN mkdir src && mv vllm src/vllm

# This is a workaround to ensure pytest exits with the correct status code in CI tests.
@AndreasKaratzas can you note this down and formally fix the issue?
We are triaging this. There are some exit hooks related to the torch version, and we are working on that. This is not AMD-specific, by the way: the same gap could also affect upstream if an external package has exit hooks that override pytest's exit status. As for how to document it, I think it would be better to open a PR and put the fix formally in the master conftest file.
…llm-project#38252) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
…llm-project#38252) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Signed-off-by: neweyes <328719365@qq.com>
…llm-project#38252) Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com> Signed-off-by: Rishi Puri <riship@nvidia.com>
Updates the base library versions (ROCm 7.2.1, torch 2.10, triton 3.6).
Includes the profiler workaround, which uses a custom kineto submodule.
Includes the triton BUFFER OPS fix, cherry-picked from upstream triton.
Testing plan
A candidate base image is being built. Once it is ready, a round of CI tests will be run with it and this PR will be updated with the results.
Testing results
The CI run revealed a pytest issue: after importing torch, pytest exits with status 0 even when tests fail.
This is worked around with a pytest_sessionfinish hook until the root cause is found.
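
For illustration, a workaround of this kind could look roughly like the conftest.py sketch below. This is not the code from this PR: the helper name `finish` and the choice to terminate through `os._exit` are assumptions made for the example. The idea is that `pytest_sessionfinish` receives the real exit status, so the hook can end the process with that status before any later exit handler has a chance to replace it.

```python
# conftest.py (hypothetical sketch, not the code merged in this PR)
import os
import sys


def finish(exitstatus, terminate=os._exit):
    """Flush output, then end the process with pytest's own exit status.

    `terminate` defaults to os._exit, which skips atexit handlers; it is a
    parameter only so the helper can be exercised without killing the process.
    """
    code = int(exitstatus)  # pytest may pass an ExitCode enum or a plain int
    sys.stdout.flush()
    sys.stderr.flush()
    terminate(code)


def pytest_sessionfinish(session, exitstatus):
    # Called by pytest once the whole test run is finished, with the status
    # pytest intends to return to the shell.
    finish(exitstatus)
```

Because `os._exit` ends the process immediately, any teardown or reporting that would normally run after this hook may be skipped, which is one reason to treat this as a stopgap rather than a formal fix.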