Merge EmbeddedLLM/vllm-rocm into vLLM main #1749

Closed
tjtanaa wants to merge 20 commits into vllm-project:main from EmbeddedLLM:vllm-rocm-merge-to-vllm

tjtanaa (Collaborator) commented Nov 22, 2023

Checklist:

  • Merge changes from upstream vllm commit 094f716
  • Dynamic code path selection for CUDA or ROCm in PyTorch
  • Pass all unit tests
  • ROCm Dockerfile
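
The "dynamic code path selection" item in the checklist can be illustrated with a minimal Python sketch. It is not the code from this PR. It relies only on the fact that ROCm builds of PyTorch set `torch.version.hip` to a version string, while CUDA builds leave it as `None`. The helper name `detect_gpu_backend` is hypothetical.

```python
from types import SimpleNamespace


def detect_gpu_backend(version_info) -> str:
    """Return "rocm", "cuda", or "cpu" for a torch.version-like object.

    Hypothetical helper for illustration; not part of vLLM.
    """
    # ROCm builds of PyTorch expose a version string in torch.version.hip.
    if getattr(version_info, "hip", None) is not None:
        return "rocm"
    # CUDA builds expose a version string in torch.version.cuda instead.
    if getattr(version_info, "cuda", None) is not None:
        return "cuda"
    # Neither attribute is set: treat the build as CPU-only.
    return "cpu"


# Stand-in objects so the sketch runs without PyTorch installed.
rocm_like = SimpleNamespace(hip="5.7.0", cuda=None)
cuda_like = SimpleNamespace(hip=None, cuda="11.8")
print(detect_gpu_backend(rocm_like))
print(detect_gpu_backend(cuda_like))
```

With PyTorch installed, the same function can be called as `detect_gpu_backend(torch.version)` to choose between a CUDA and a ROCm code path at import time.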

tjtanaa and others added 20 commits October 27, 2023 00:27
* port dtype_float16.cuh and cache_kernels.cu

* port dtype_bfloat16.cuh

* port attention_utils.cuh

* port more kernels

* fix typo

* add cuda_compat.h

* sync branches

* update

* update

* fixes

* cleanup

* update

* update

* update

* fmt

* cleanup

* refactor

* update

* detecting rocm and adding flag for compiling

* using asm volatile instead of hip api

* using asm volatile for type casting of f16

---------

Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Amir Balwel <amoooori04@gmail.com>

simon-mo (Collaborator) left a comment

Thank you for upstreaming this! We will review soon.

Comment thread Dockerfile

ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]

FROM rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1

Review comment (Collaborator):

Please make a new Dockerfile for this.

@tjtanaa tjtanaa mentioned this pull request Nov 22, 2023
15 tasks
Comment thread Dockerfile
&& git clone https://github.com/ROCmSoftwarePlatform/flash-attention.git \
&& cd flash-attention \
&& git submodule update --init \
&& sed -i -e "s/--offload-arch=native/--offload-arch=$(/opt/rocm/llvm/bin/amdgpu-offload-arch)/g" setup.py \

hongxiayang (Collaborator) commented Nov 26, 2023

Thank you for the pull request.
This line is a no-op, since I don't see any reference to offload-arch in the setup.py file.
As a result, when I tested this pull request and built the Docker image with this Dockerfile, the build failed.
Can you check the setup.py file?

/opt/rocm/bin/hipcc  -I/app/libs/flash-attention/csrc/flash_attn_rocm -I/app/libs/flash-attention/csrc/flash_attn_rocm/src -I/app/libs/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/app/libs/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include -I/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include/TH -I/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include/THC -I/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.10/include/python3.10 -c -c /app/libs/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.hip -o /app/libs/flash-attention/build/temp.linux-x86_64-cpython-310/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
clang++: error: cannot determine amdgcn architecture: /opt/rocm/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch'
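
The failure mode described in this comment, a substitution that silently matches nothing, can be guarded against. The sketch below is a hypothetical helper, not something in this PR or in flash-attention. It performs the same replacement as the `sed` line but raises an error when the pattern is absent. The architecture value `gfx90a` is only an example argument.

```python
def replace_offload_arch(setup_py_text: str, arch: str) -> str:
    """Replace --offload-arch=native with an explicit architecture.

    Hypothetical helper for illustration. Unlike a bare `sed -i`, it fails
    loudly when the pattern is missing instead of silently doing nothing.
    """
    pattern = "--offload-arch=native"
    if setup_py_text.count(pattern) == 0:
        raise ValueError(f"{pattern!r} not found; substitution would be a no-op")
    return setup_py_text.replace(pattern, f"--offload-arch={arch}")


# Example input that does contain the flag (made-up file content).
sample = "extra_flags = ['-O3', '--offload-arch=native']"
print(replace_offload_arch(sample, "gfx90a"))
```

In a Docker build, a check like this would turn the silent no-op into an immediate, readable failure rather than a compiler error much later in the build.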

We implemented a temporary workaround in the build process with ROCm/flash-attention@edc7698.
The hardcoded --offload-arch=native issue has since been resolved by commit ROCm/flash-attention@5f1ae07, so the temporary fix appears to be no longer necessary. After testing the most recent version of flash-attention, we plan to revise the Dockerfile accordingly.

hongxiayang (Collaborator) commented Nov 27, 2023

Yes, pin a specific commit of a named branch rather than using the default branch, which may keep changing; that gives a stable and reproducible result. As for the Dockerfile name, you might want to rename it to Dockerfile.rocm_xxx, where xxx is the ROCm version you are using.
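
The pinning advice can be sketched as a small Python helper that builds the git commands for a reproducible checkout. The function name `pinned_clone_commands` is hypothetical, and the hash check is only a heuristic: a branch whose name happens to be all hex digits would pass. The short hash from this thread is used purely as an example argument.

```python
import re
from typing import List

# 7 to 40 lowercase hex characters: an abbreviated or full commit hash.
_COMMIT_RE = re.compile(r"[0-9a-f]{7,40}")


def pinned_clone_commands(repo_url: str, commit: str, dest: str) -> List[List[str]]:
    """Return git argv lists that clone a repo and check out a fixed commit.

    Hypothetical helper for illustration. It rejects refs that do not look
    like a commit hash (for example "main"), since a branch can move.
    """
    if not _COMMIT_RE.fullmatch(commit):
        raise ValueError(f"{commit!r} does not look like a commit hash")
    return [
        ["git", "clone", repo_url, dest],
        ["git", "-C", dest, "checkout", commit],
        # Update submodules after checkout so they match the pinned commit.
        ["git", "-C", dest, "submodule", "update", "--init"],
    ]


for argv in pinned_clone_commands(
    "https://github.com/ROCmSoftwarePlatform/flash-attention.git",
    "5f1ae07",  # example value taken from the thread, not a recommendation
    "flash-attention",
):
    print(" ".join(argv))
```

Each returned list could be passed to `subprocess.run(argv, check=True)`, or the printed lines could be copied into a Dockerfile `RUN` step.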

tjtanaa (Collaborator, Author) commented Nov 29, 2023

@hongxiayang @WoosukKwon @simon-mo
I am closing this PR and will continue the work in PR https://github.com/EmbeddedLLM/vllm-rocm/pull/17
🙏

@tjtanaa tjtanaa closed this Nov 29, 2023
@kliuae kliuae deleted the vllm-rocm-merge-to-vllm branch December 1, 2023 17:21

simon-mo (Collaborator) commented

The full version is merged in #1836!

amy-why-3459 pushed a commit to amy-why-3459/vllm that referenced this pull request Sep 15, 2025
### What this PR does / why we need it?
Because the interface in vllm-ascend has changed so quickly, the
quantization function in mindie_turbo is no longer needed and should
be removed.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Through CI.

Signed-off-by: zouyida <zouyida@huawei.com>
Co-authored-by: zouyida <zouyida@huawei.com>
WeNeedMoreCode pushed a commit to WeNeedMoreCode/vllm that referenced this pull request Dec 15, 2025
### What this PR does / why we need it?
Cherry-pick of vllm-project#1749 from v0.9.1-dev.
Because the interface in vllm-ascend has changed so quickly, the
quantization function in mindie_turbo is no longer needed and should
be removed.

Co-authored-by: zouyida <zouyida@huawei.com>
Co-authored-by: wangli <wangli858794774@gmail.com>

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.10.0
- vLLM main: vllm-project@207b750

Signed-off-by: wangli <wangli858794774@gmail.com>