[Hardware][AMD][CI/Build][Doc] Upgrade to ROCm 6.1, Dockerfile improvements, test fixes #5422
Do you know whether the base image rocm/pytorch:rocm6.1.2_ubuntu20.04_py3.9_pytorch_staging works with the PyTorch ROCm 5.7 and ROCm 6.0 wheels?
Not sure if I understand this question, or the spirit behind the question: is the question whether "pytorch_staging" works on ROCm 5.7 or 6.0 without wheels, or whether these wheels work on a non-"pytorch_staging" image, or something else? Because you'd never install rocm5.7 or rocm6.0 wheels on a rocm6.1.2 image.
For the first question, the answer is "no": you need at least PyTorch 2.4.0 for torch.cuda._device_count_amdsmi, while pytorch_staging only goes up to PyTorch 2.3.0. For the second, the answer is "yes": the wheels for each ROCm version are designed to work on (and do work on) any official base image of the corresponding version.
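For reference, a minimal sketch (assuming only a standard ROCm PyTorch install) of probing for the amdsmi-based device count mentioned above; on builds older than 2.4.0 the attribute is simply absent and the usual device_count() is the fallback:

```python
import torch

# Sketch: prefer the amdsmi-backed count available in PyTorch 2.4.0 nightlies;
# older builds (e.g. the 2.3.0 "pytorch_staging" image) do not expose it.
if hasattr(torch.cuda, "_device_count_amdsmi"):
    num_gpus = torch.cuda._device_count_amdsmi()
else:
    num_gpus = torch.cuda.device_count()

print(f"torch {torch.__version__}: {num_gpus} visible ROCm device(s)")
```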
Sorry for the confusing question. I was wondering whether we should use the "staging" image as the base image, since the name implies a testing status. I was also wondering whether rocm-6.0 + torch 2.4 or rocm-5.7 + torch 2.4 works out of the box for certain users on certain devices (especially Navi3x users, for example), because in the past the supported base docker images shipped torch 2.0.1 or torch 2.1.1.
"Staging" in this case refers to the PyTorch version (2.3.0, which we are targeting for stable release sometime), so it's a moot point when we're upgrading to a PyTorch 2.4.0 nightly. And this particular nightly has not raised problems on a wide range of internal testing on multiple GPUs architectures, including MI300, MI250, and Navi32.
Let's keep in mind the broader picture here: whether this PR is an improvement to the status quo, as opposed to whether this PR is "perfect". The status quo has multiple failing tests, even on the tests AMD supports, and most importantly, does not have ROCm 6.1 support, which brings in a host of performance improvements. And upstream vLLM has moved such that PyTorch 2.3+/2.4+ on ROCm is needed for a full complement of tests.
We can offer this support: even though on paper it is "experimental", it has not raised issues in testing. Unless someone provides a specific failing example that arises as a result of this PR, quibbling about potential issues that may arise because of the "experimental" or "nightly" status, without doing the legwork to verify that such issues actually exist, seems to me to be an example of making the perfect the enemy of the good.
Reusing the PYTORCH_ROCM_ARCH environment variable in the vLLM environment exposes some potential problems. This environment variable is used by PyTorch on ROCm when building its ROCm backend (https://github.com/pytorch/pytorch/blob/0707811286d1846209676435f4f86f2b4b3d1a17/cmake/Dependencies.cmake#L1075). Reusing it here hides that information and may make debugging more difficult.
I'm not really sure this is a concern: this PR installs prebuilt torch wheels, so PYTORCH_ROCM_ARCH in the base image is decoupled from whatever PyTorch used when building its ROCm backend.
Yes, I am concerned about the name clash. Can you change it to VLLM_ROCM_SUPPORTED_ARCHS? Though the Dockerfile in this PR installs wheels, the wheels were built using the same environment variable and the same logic.
It was working fine prior to #2830, and the workflow people are used to is to specify PYTORCH_ROCM_ARCH to decide which architectures to build vLLM for: this has been the case at ROCm/vllm and here, it just hasn't been working here for a while. I haven't yet seen a good reason to change this workflow; if this change is made, it should be made in a separate PR with proper discussion. Also, if the wheels are all built using the same env var, then it can easily be remembered as a constant.
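To illustrate the workflow being discussed, here is a minimal sketch of how a build script might honour PYTORCH_ROCM_ARCH (the helper name and the default architecture list are hypothetical, not vLLM's actual setup code):

```python
import os

# Hypothetical default list, for illustration only.
_DEFAULT_ROCM_ARCHS = "gfx906;gfx908;gfx90a;gfx942;gfx1030;gfx1100"

def rocm_target_archs() -> list[str]:
    """Return the GPU architectures to compile kernels for.

    Honours the PYTORCH_ROCM_ARCH environment variable when set,
    mirroring the pre-#2830 workflow described above.
    """
    raw = os.environ.get("PYTORCH_ROCM_ARCH", _DEFAULT_ROCM_ARCHS)
    return [arch for arch in raw.split(";") if arch]

if __name__ == "__main__":
    # e.g. PYTORCH_ROCM_ARCH="gfx90a;gfx942" python archs.py
    print(rocm_target_archs())
```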