Merge EmbeddedLLM/vllm-rocm into vLLM main #1749
Conversation
* port dtype_float16.cuh and cache_kernels.cu
* port dtype_bfloat16.cuh
* port attention_utils.cuh
* port more kernels
* fix typo
* add cuda_compat.h
* sync branches
* update
* update
* fixes
* cleanup
* update
* update
* update
* fmt
* cleanup
* refactor
* update
* detecting rocm and adding flag for compiling
* using asm volatile instead of hip api
* using asm volatile for type casting of f16

---------

Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Amir Balwel <amoooori04@gmail.com>
…oblem and xformers license
simon-mo
left a comment
Thank you for upstreaming this! We will review soon.
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]
FROM rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1
Please create a new Dockerfile.
&& git clone https://github.com/ROCmSoftwarePlatform/flash-attention.git \
&& cd flash-attention \
&& git submodule update --init \
&& sed -i -e "s/--offload-arch=native/--offload-arch=$(/opt/rocm/llvm/bin/amdgpu-offload-arch)/g" setup.py \
Thank you for the pull request.
This line is a no-op, since I don't see any reference to offload-arch in the setup.py file.
As a result, when I tested this pull request and built the Docker image from this Dockerfile, the build failed.
Can you check the setup.py file?
/opt/rocm/bin/hipcc -I/app/libs/flash-attention/csrc/flash_attn_rocm -I/app/libs/flash-attention/csrc/flash_attn_rocm/src -I/app/libs/flash-attention/csrc/flash_attn_rocm/composable_kernel/include -I/app/libs/flash-attention/csrc/flash_attn_rocm/composable_kernel/library/include -I/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include -I/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include/TH -I/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include/THC -I/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include/THH -I/opt/rocm/include -I/opt/conda/envs/py_3.10/include/python3.10 -c -c /app/libs/flash-attention/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.hip -o /app/libs/flash-attention/build/temp.linux-x86_64-cpython-310/csrc/flash_attn_rocm/src/flash_bwd_runner_batched_hdim32_bf16_causal_gfx9x_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -std=c++20 -DNDEBUG -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ --offload-arch=native -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=flash_attn_2_cuda -D_GLIBCXX_USE_CXX11_ABI=1 -fno-gpu-rdc
clang++: error: cannot determine amdgcn architecture: /opt/rocm/llvm/bin/amdgpu-arch: ; consider passing it via '--offload-arch'
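One way to keep a silent no-op like this from slipping through is to make the substitution fail loudly when the pattern is missing. The following is only a sketch, not part of the PR: the helper name `pin_offload_arch` and the example architecture `gfx90a` are assumptions chosen for illustration, and it operates on the text of setup.py rather than on the file itself.

```python
import re

# Matches the hardcoded flag discussed above.
ARCH_FLAG = re.compile(r"--offload-arch=native\b")


def pin_offload_arch(setup_py_text: str, arch: str) -> str:
    """Replace --offload-arch=native with an explicit architecture.

    Raises ValueError when the flag is absent, so a substitution that
    would silently do nothing (as with `sed -i`) is surfaced instead.
    """
    # A callable replacement avoids backslash processing of `arch`.
    new_text, count = ARCH_FLAG.subn(lambda _m: f"--offload-arch={arch}", setup_py_text)
    if count == 0:
        raise ValueError("no '--offload-arch=native' found; substitution would be a no-op")
    return new_text


# Hypothetical line standing in for the relevant part of setup.py.
sample = 'cc_flag = ["-O3", "--offload-arch=native"]'
print(pin_offload_arch(sample, "gfx90a"))
# prints: cc_flag = ["-O3", "--offload-arch=gfx90a"]
```

In a Dockerfile the same check could run as a build step, so the image build stops at the substitution rather than later inside the compiler.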
We implemented a temporary solution during the build process with ROCm/flash-attention@edc7698.
The issue with the hardcoded --offload-arch=native has been resolved by the commit ROCm/flash-attention@5f1ae07. It appears that the temporary fix is no longer necessary. Following the testing of the most recent version of flash-attention, we plan to revise the Dockerfile accordingly.
Yes, pin a specific commit of a named branch rather than the default branch: the default branch may keep changing, while a pinned commit gives stable, reproducible results. Regarding the name of the Dockerfile, you might want to rename it to Dockerfile.rocm_xxx, where xxx is the version of ROCm you are using.
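As a rough illustration of the pinning suggestion, the sketch below builds the git commands for cloning a repository and checking out a fixed commit. The function name `pinned_clone_commands` is hypothetical, and the commit `5f1ae07` is used only because it is mentioned earlier in this thread; whether it is the right commit to pin is an assumption, not something confirmed here.

```python
import shlex
from typing import List


def pinned_clone_commands(repo_url: str, commit: str, dest: str) -> List[List[str]]:
    """Return git commands that clone `repo_url` into `dest` at a fixed commit."""
    return [
        ["git", "clone", repo_url, dest],
        # Check out the pinned commit instead of whatever the default branch points at.
        ["git", "-C", dest, "checkout", commit],
        # Initialize submodules after the checkout so they match the pinned commit.
        ["git", "-C", dest, "submodule", "update", "--init"],
    ]


for cmd in pinned_clone_commands(
    "https://github.com/ROCmSoftwarePlatform/flash-attention.git",
    "5f1ae07",  # example only; taken from the discussion above
    "flash-attention",
):
    print(shlex.join(cmd))
```

The printed lines can be chained with `&&` in a single Dockerfile `RUN` step, in the same style as the existing clone commands.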
@hongxiayang @WoosukKwon @simon-mo
The full version is merged in #1836!
### What this PR does / why we need it?
Since the interface in vllm-ascend has changed so quickly, the quantization function in mindie_turbo is no longer needed, so it needs to be discarded.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
through ci

Signed-off-by: zouyida <zouyida@huawei.com>
Co-authored-by: zouyida <zouyida@huawei.com>

### What this PR does / why we need it?
cherry pick vllm-project#1749 from v0.9.1-dev

Since the interface in vllm-ascend has changed so quickly, the quantization function in mindie_turbo is no longer needed, so it needs to be discarded.

Co-authored-by: zouyida <zouyida@huawei.com>
Co-authored-by: wangli <wangli858794774@gmail.com>

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
- vLLM version: v0.10.0
- vLLM main: vllm-project@207b750

Signed-off-by: wangli <wangli858794774@gmail.com>