[Continuation] Merge EmbeddedLLM/vllm-rocm into vLLM main by tjtanaa · Pull Request #1836 · vllm-project/vllm

tjtanaa · 2023-11-29T16:00:29Z

Add ROCm Support

Dynamic code path selection for CUDA or ROCm in PyTorch
Llama2 support
SqueezeLLM ROCm
Add documentation amd-installation.rst, describing how to set up the vLLM ROCm version.
format.sh all the code
Prepare amd.Dockerfile
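
The "dynamic code path selection" item can be pictured with a minimal sketch. It relies on PyTorch exposing `torch.version.hip` (a version string on ROCm builds, `None` otherwise) and `torch.version.cuda`. The helper name `detect_backend` and its return values are illustrative assumptions, not the code added by this PR.

```python
from types import SimpleNamespace


def detect_backend(version_info) -> str:
    """Return "rocm", "cuda", or "cpu" for a torch.version-like object.

    Hypothetical helper for illustration; pass `torch.version` in real use.
    """
    # ROCm builds of PyTorch set `hip` to a version string; CUDA builds leave it None.
    if getattr(version_info, "hip", None) is not None:
        return "rocm"
    if getattr(version_info, "cuda", None) is not None:
        return "cuda"
    return "cpu"


# Stand-ins for torch.version on different builds (version strings are made up).
rocm_build = SimpleNamespace(hip="5.7.0", cuda=None)
cuda_build = SimpleNamespace(hip=None, cuda="12.1")
cpu_build = SimpleNamespace(hip=None, cuda=None)

print(detect_backend(rocm_build))  # rocm
print(detect_backend(cuda_build))  # cuda
print(detect_backend(cpu_build))   # cpu
```

Taking the version object as a parameter keeps the sketch runnable on machines without PyTorch installed.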

Because too many changes have been made since #1749,
the previous PR #1749 has been closed and is continued here.

PR Authors:
@kliuae
@iAmir97
@tjtanaa
@tanpinsiang

Contributor:
@pcmoritz

This pull request also incorporates the work from Port most vLLM kernels to ROCm #1313 by @pcmoritz, which was not merged. We appreciate @pcmoritz's contribution.

Features
* Auto-code path selection
* support llama2
* support squeezellm rocm
* add documentation amd-installation.rst. Describing how to setup vllm ROCm version
* format.sh all the code
* add base amd.Dockerfile

---------

Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: kuanfu <kuanfu.liu@embeddedllm.com>

WoosukKwon · 2023-12-07T06:46:01Z

Thanks for the amazing work. I just verified the Dockerfile.rocm and ran the benchmark on the llama2-7b model on MI210. The throughput is:
Throughput: 0.89 requests/s, 424.90 tokens/s

Hi @hongxiayang, could you share your benchmark settings? I'm asking because the number seems quite a bit lower than what we got from A100-80GB GPUs on the ShareGPT benchmark.

hongxiayang · 2023-12-07T12:51:28Z

The settings are as follows: a single GPU on MI210 (dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0).
MI250 should do better, and I could run the same benchmark there to get the numbers.
@WoosukKwon what is your setting on A100?

Namespace(backend='vllm', dataset='/app/dataset/ShareGPT_V3_unfiltered_cleaned_split.json', input_len=None, output_len=None, model='/app/model', tokenizer='/app/model', quantization=None, tensor_parallel_size=1, n=1, use_beam_search=False, num_prompts=1000, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto')
INFO 12-05 22:40:38 llm_engine.py:73] Initializing an LLM engine with config: model='/app/model', tokenizer='/app/model', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.1.0+cu121 with CUDA 1201 (you have 2.0.1+gita61a294)
    Python  3.10.13 (you have 3.10.13)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/__init__.py:546: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
INFO 12-05 22:41:11 llm_engine.py:222] # GPU blocks: 5705, # CPU blocks: 512
Processed prompts: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [18:45<00:00,  1.13s/it]
Throughput: 0.89 requests/s, 424.90 tokens/s
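
As a sanity check, the reported request rate can be reproduced from the log itself. The sketch below assumes the progress-bar time (18:45) is close to the elapsed time the benchmark script measured, and that tokens/s is total tokens divided by that same elapsed time; the helper `parse_mmss` is hypothetical.

```python
def parse_mmss(elapsed: str) -> int:
    """Convert a progress-bar style "MM:SS" string to seconds."""
    minutes, seconds = elapsed.split(":")
    return int(minutes) * 60 + int(seconds)


num_prompts = 1000               # num_prompts=1000 in the Namespace line
elapsed_s = parse_mmss("18:45")  # from the "Processed prompts" progress bar

requests_per_s = num_prompts / elapsed_s
# Working backwards from the reported 424.90 tokens/s (an approximation).
implied_total_tokens = 424.90 * elapsed_s

print(f"{elapsed_s} s elapsed, {requests_per_s:.2f} requests/s")  # 1125 s elapsed, 0.89 requests/s
print(f"roughly {implied_total_tokens / 1000:.0f}k tokens processed")
```

The 0.89 requests/s figure is therefore consistent with the run time shown in the log; the low number reflects the run itself rather than a reporting error.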

tjtanaa · 2023-12-07T14:40:40Z

@WoosukKwon It is ready for another review. Thank you very much.

WoosukKwon · 2023-12-07T17:23:57Z

@hongxiayang This is my benchmark result on llama-7b and ShareGPT (benchmark_throughput.py), which is quite different from your results but seems more reasonable.

| GPU | A100 | MI210x |
| --- | --- | --- |
| TFLOPs (FP16) | 312 | 181 |
| Memory capacity | 80 GB | 64 GB |
| Memory bandwidth | 1.9 TB/s | 1.6 TB/s |
| Throughput | 8.30 reqs/s | 5.25 reqs/s |
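
To put these figures in proportion, the short sketch below computes the MI210x-to-A100 ratio for each row. The dictionary keys are illustrative names; the values are copied from the table.

```python
# Values taken from the comparison table; key names are made up for this sketch.
a100 = {"tflops_fp16": 312, "mem_gb": 80, "bandwidth_tbs": 1.9, "reqs_per_s": 8.30}
mi210 = {"tflops_fp16": 181, "mem_gb": 64, "bandwidth_tbs": 1.6, "reqs_per_s": 5.25}

# Ratio of MI210x to A100 for every metric.
ratios = {key: mi210[key] / a100[key] for key in a100}

for key, value in ratios.items():
    print(f"{key}: {value:.2f}")
# tflops_fp16: 0.58
# mem_gb: 0.80
# bandwidth_tbs: 0.84
# reqs_per_s: 0.63
```

The measured throughput ratio (0.63) falls between the FP16 compute ratio (0.58) and the memory-bandwidth ratio (0.84), which is in line with the judgment that this result "seems more reasonable".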

hongxiayang · 2023-12-07T19:16:16Z

Wow, is this one GPU or 8 GPUs? Your number is quite different from mine, and I am wondering whether we used the same parameters when running the test, like INFO 12-05 22:40:38 llm_engine.py:73] Initializing an LLM engine with config: model='/app/model', tokenizer='/app/model', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=0)

edit: My MI210 was wonky when I ran the test. Your number is valid.

WoosukKwon

@tjtanaa Thanks again for the great work! I found the code super clean and well organized. I also like the detailed documentation and the provided docker image. I could run vLLM on MI210 very smoothly and the performance was great! Thanks a lot for the contribution.

Left some minor comments on the code style. Please take a look!

WoosukKwon · 2023-12-07T23:40:45Z

+       -v <path/to/model>:/app/model \
+       vllm-rocm \
+       bash
+


I think we can keep this as is. I tried it out and it worked pretty smoothly!

WoosukKwon · 2023-12-07T23:50:51Z

 # Supported NVIDIA GPU architectures.
-SUPPORTED_ARCHS = {"7.0", "7.5", "8.0", "8.6", "8.9", "9.0"}
+NVIDIA_SUPPORTED_ARCHS = {"7.0", "7.5", "8.0", "8.6", "8.9", "9.0"}
+ROCM_SUPPORTED_ARCHS = {"gfx90a", "gfx908", "gfx906", "gfx1030", "gfx1100"}
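
One way a set like `ROCM_SUPPORTED_ARCHS` could be used is to reject unsupported build targets early. The sketch below is an assumption for illustration only: the function `check_rocm_archs` and the ';'-separated input format are hypothetical and are not taken from this diff.

```python
ROCM_SUPPORTED_ARCHS = {"gfx90a", "gfx908", "gfx906", "gfx1030", "gfx1100"}


def check_rocm_archs(requested: str) -> set[str]:
    """Split a ';'-separated arch string and reject unsupported entries.

    Hypothetical helper; the real build script may validate differently.
    """
    archs = {arch.strip() for arch in requested.split(";") if arch.strip()}
    unsupported = archs - ROCM_SUPPORTED_ARCHS
    if unsupported:
        raise ValueError(
            f"Unsupported ROCm archs {sorted(unsupported)}; "
            f"supported: {sorted(ROCM_SUPPORTED_ARCHS)}"
        )
    return archs


print(sorted(check_rocm_archs("gfx90a;gfx908")))  # ['gfx908', 'gfx90a']

try:
    check_rocm_archs("gfx803")
except ValueError as err:
    print(err)
```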


Just curious: which part of the code imposes this requirement? That is, why is gfx8 not supported? While I don't think we have to support it, I'd like to know why we don't.

We compiled this list of ROCm-supported archs based on what AMD supports for ROCm and HIP. Furthermore, each arch has its own set of assembly instructions, so we have to make sure that the assembly instructions currently used are supported by those archs as well.

To the best of our knowledge, the following are the ARCH requirements needed by different libraries:

PyTorch: gfx900 gfx906 gfx908 gfx90a gfx1030 gfx1101

vLLM Custom Ops: gfx90a gfx908 gfx906 gfx1030 gfx1100

Flash-Attention-ROCm: gfx90a gfx940 gfx941 gfx942

Should we use the intersection of all three ARCH requirements instead?
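
For reference, the intersection asked about here can be computed directly from the three lists as written above. This is a small sketch; the variable names are illustrative.

```python
# Arch lists copied from the comment above.
pytorch_archs = {"gfx900", "gfx906", "gfx908", "gfx90a", "gfx1030", "gfx1101"}
vllm_custom_op_archs = {"gfx90a", "gfx908", "gfx906", "gfx1030", "gfx1100"}
flash_attention_rocm_archs = {"gfx90a", "gfx940", "gfx941", "gfx942"}

# Archs supported by all three libraries at once.
common_archs = pytorch_archs & vllm_custom_op_archs & flash_attention_rocm_archs
print(sorted(common_archs))  # ['gfx90a']

# Archs supported by PyTorch and the vLLM custom ops, ignoring Flash-Attention-ROCm.
without_flash = pytorch_archs & vllm_custom_op_archs
print(sorted(without_flash))  # ['gfx1030', 'gfx906', 'gfx908', 'gfx90a']
```

With the lists as given, only gfx90a appears in all three, so a strict intersection would narrow the supported set considerably.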

@tjtanaa Thanks for the detailed explanation. Sorry, I have little background on this stuff. Maybe I should learn more about ROCm and AMD GPUs 😂

As far as I understand, the vLLM custom ops support every "recent" AMD GPU, and the supported GPU list is currently limited by ROCm Flash Attention. Is this correct?

@WoosukKwon We believe that, in the near future, the supported GPU archs will be restricted by Flash Attention ROCm.

fyi: The supported gfx arch for ROCm is documented here (as "LLVM target" column): https://rocm.docs.amd.com/en/latest/release/gpu_os_support.html#linux-supported-gpus.

WoosukKwon · 2023-12-08T00:20:48Z

@hongxiayang It's LLaMA2-7B on a single MI210x. Basically, it should be the same setup as yours.

tjtanaa · 2023-12-08T05:56:47Z

@WoosukKwon We have finished updating the code style and replied to your questions regarding the supported ARCHs.

WoosukKwon · 2023-12-08T07:16:36Z

@tjtanaa @kliuae LGTM! Many thanks again for the wonderful work! This will be HUGE!!

Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
Co-authored-by: Amir Balwel <amoooori04@gmail.com>
Co-authored-by: root <kuanfu.liu@akirakan.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: kuanfu <kuanfu.liu@embeddedllm.com>
Co-authored-by: miloice <17350011+kliuae@users.noreply.github.com>

pcmoritz and others added 28 commits October 10, 2023 08:19

port dtype_float16.cuh and cache_kernels.cu

43af310

port dtype_bfloat16.cuh

cc81866

port attention_utils.cuh

475b5e2

port more kernels

ddc496c

fix typo

5eaa7a1

add cuda_compat.h

f7273c6

Merge branch 'main' into port-to-rocm

99c3be7

sync branches

f8093dc

update

41df689

update

93be9c5

fixes

d96fa3c

cleanup

421365b

update

06b800e

update

2312beb

update

2958b39

fmt

3f89734

cleanup

5397a57

refactor

90e02d2

update

a420202

Merge branch 'main' into port-to-rocm

b072182

detecting rocm and adding flag for compiling

2d1e435

using asm volatile instead of hip api

e231b79

using asm volatile for type casting of f16

31bb335

Hipifying csrc file to accomodate rocm builds

b027d06

merged with latest upstream

0f67117

format code

7dbf2d4

downgrade torch requirement in toml to torch 2.0.1 to accommodate ROC…

52ffcf0

…m support

WoosukKwon added the rocm Related to AMD ROCm label Dec 1, 2023

Merged changes from vllm main

27f0513

kliuae added 3 commits December 6, 2023 15:26

Format code

4a52977

Restored awq file

23a987a

Format code

8787a4e

kliuae added 3 commits December 7, 2023 07:58

Merge latest vllm main

5911131

Updated rocm dockerfile

9fa8075

Update amd installation guide

81e052d

Update vLLM Documentations (#18)

fb8ac26

Update the vLLM installation procedures on AMD platform. Update vLLM documentations.

WoosukKwon approved these changes Dec 8, 2023

kliuae added 4 commits December 8, 2023 04:02

Updated setup.py, vllm/utils.py and amd-installation doc

98f5487

Updated setup.py

d90187a

Format code

c840531

Merge branch 'main' into vllm-cuda-rocm-mod

9dba1d8

WoosukKwon merged commit 6ccc0bf into vllm-project:main Dec 8, 2023

tjtanaa mentioned this pull request Dec 10, 2023

Roadmap EmbeddedLLM/vllm#4

Closed

15 tasks

This was referenced Dec 10, 2023

Port most vLLM kernels to ROCm #1313

Closed

[Do not merge] Hacks for the ROCm port #1314

Closed

simon-mo mentioned this pull request Dec 10, 2023

Merge EmbeddedLLM/vllm-rocm into vLLM main #1749

Closed

4 tasks

tanpinsiang mentioned this pull request Dec 11, 2023

Merging with vLLM main branch EmbeddedLLM/vllm#12

Closed

nonetrix mentioned this pull request Jan 25, 2024

Support Vllm backend oobabooga/textgen#4860

Closed

1 task

jinyouzhi pushed a commit to jinyouzhi/vllm that referenced this pull request Sep 12, 2025

v0 aware padding scheduler fix for bs=1 (vllm-project#1836)

8fad535

Signed-off-by: Iryna Boiko <iboiko@habana.ai>


7 participants
