Update Dockerfile to build for Blackwell by mgoin · Pull Request #18095 · vllm-project/vllm

mgoin · 2025-05-13T19:12:47Z

Updates the docker to build wheels for blackwell (SM 10.0) and include the latest flashinfer for performance blackwell attention support (FIX #17325). We didn't include SM 12.0 for now because of wheel size concerns.

Updates to latest flashinfer main as of 5/15 since there isn't a release yet: flashinfer-ai/flashinfer@e00e8ce

Signed-off-by: mgoin <mgoin64@gmail.com>

github-actions · 2025-05-13T19:12:56Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

docker/Dockerfile

Signed-off-by: mgoin <mgoin64@gmail.com>

chenyang78 · 2025-05-13T20:11:58Z

Thanks for the prompt fix, @mgoin ! Attaching the eval results (perf results in a separate comment below) with this fix on GB200. All the experiments were conducted with the latest flashinfer commit (25fb40) plus cherry-picking vllm PR (#15777).

Evals:

Llama-3.1B

$ VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
vllm (pretrained=meta-llama/Llama-3.1-8B-Instruct,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7741|±  |0.0115|
|     |       |strict-match    |     5|exact_match|↑  |0.7498|±  |0.0119|

Llama-3.2-1B

$ VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=meta-llama/Llama-3.2-1B-Instruct --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto 
...
vllm (pretrained=meta-llama/Llama-3.2-1B-Instruct,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.3321|±  | 0.013|
|     |       |strict-match    |     5|exact_match|↑  |0.3321|±  | 0.013|

Qwen2.5-7B

$ VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=Qwen/Qwen2.5-7B-Instruct --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
vllm (pretrained=Qwen/Qwen2.5-7B-Instruct,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8226|±  |0.0105|
|     |       |strict-match    |     5|exact_match|↑  |0.7839|±  |0.0113|

QwQ-32B-FP8

$ VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=RedHatAI/QwQ-32B-FP8-dynamic,tensor_parallel_size=2 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
vllm (pretrained=RedHatAI/QwQ-32B-FP8-dynamic,tensor_parallel_size=2,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.4496|±  |0.0137|
|     |       |strict-match    |     5|exact_match|↑  |0.7346|±  |0.0122|

chenyang78 · 2025-05-13T20:22:40Z

Some perf numbers for the FlashInfer backend and FlashAttention V2 backend on GB200, using the same settings as the evals above.

Llama 8B at 1024/128 input/output tokens:

# flashinfer
$ VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER python benchmarks/benchmark_throughput.py --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 1000 --input-len 1024 --output-len 128

Throughput: 50.68 requests/s, 58330.33 total tokens/s, 6486.70 output tokens/s

# flash attn V2
$ VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN VLLM_FLASH_ATTN_VERSION=2 python benchmarks/benchmark_throughput.py --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 1000 --input-len 1024 --output-len 128

Throughput: 46.47 requests/s, 53479.79 total tokens/s, 5948.69 output tokens/s

Llama 8B at 1000/1000 input/output tokens:

# flashinfer
$ VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER python benchmarks/benchmark_throughput.py --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 1000 --input-len 1000 --output-len 1000

Throughput: 10.01 requests/s, 20002.62 total tokens/s, 10006.33 output tokens/s

# flash attn V2
$ VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN VLLM_FLASH_ATTN_VERSION=2 python benchmarks/benchmark_throughput.py --model meta-llama/Llama-3.1-8B-Instruct --num-prompts 1000 --input-len 1000 --output-len 1000

Throughput: 11.00 requests/s, 21982.08 total tokens/s, 10999.94 output tokens/s

QwQ 32B FP8-dynamic TP=2 at 1000/1000 input/output tokens

# flashinfer
$ VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASHINFER python benchmarks/benchmark_throughput.py --model RedHatAI/QwQ-32B-FP8-dynamic --tensor-parallel-size=2 --num-prompts 1000 --input-len 1000 --output-len 1000

Throughput: 7.46 requests/s, 14892.90 total tokens/s, 7455.20 output tokens/s

# flash attn V2
$ VLLM_USE_V1=1 VLLM_ATTENTION_BACKEND=FLASH_ATTN VLLM_FLASH_ATTN_VERSION=2 python benchmarks/benchmark_throughput.py --model RedHatAI/QwQ-32B-FP8-dynamic --tensor-parallel-size=2 --num-prompts 1000 --input-len 1000 --output-len 1000

Throughput: 6.60 requests/s, 13185.27 total tokens/s, 6600.92 output tokens/s

mgoin · 2025-05-14T00:59:01Z

It seems if we build with SM 10.0+12.0, we increase the wheel size to 450MB

[2025-05-13T23:00:45Z] #32 0.715 Not allowed: Wheel dist/vllm-0.8.5.dev650+g114a0f311-cp38-abi3-linux_x86_64.whl is larger (450.40 MB) than the limit (400 MB).

Signed-off-by: mgoin <mgoin64@gmail.com>

lanking520 · 2025-05-14T22:49:34Z

docker/Dockerfile

-    FLASHINFER_ENABLE_AOT=1 TORCH_CUDA_ARCH_LIST='7.5 8.0 8.6 8.9 9.0+PTX' \
-    uv pip install --system --no-build-isolation "git+https://github.com/flashinfer-ai/flashinfer@v0.2.2.post1" ; \
+    FLASHINFER_ENABLE_AOT=1 TORCH_CUDA_ARCH_LIST='7.5 8.0 8.9 9.0 10.0+PTX' \
+    uv pip install --system --no-build-isolation "git+https://github.com/flashinfer-ai/flashinfer@948a14622bd624773918d738b0f66137a9ac4784" ; \


is that possible to based on a release tag than a commit? This will be very hard for different users to consume as dependencies.

There isn't one available atm.

That contains Blackwell kernel, which was a reason for such upgrade

simon-mo · 2025-05-15T16:47:13Z

Extended time out build here https://buildkite.com/vllm/ci/builds/20161/steps

simon-mo · 2025-05-15T20:12:47Z

FAILED samplers/test_rejection_sampler.py::test_compare_nonflashinfer_backend[cuda:0-1-30000-6] - RuntimeError: target_probs must be a 3D tensor

Let's try to get #15777 in

Signed-off-by: mgoin <mgoin64@gmail.com>

simon-mo · 2025-05-16T05:09:51Z

Longer time out build: https://buildkite.com/vllm/ci/builds/20203/steps

Signed-off-by: mgoin <mgoin64@gmail.com>

simon-mo · 2025-05-16T22:14:30Z

@mgoin, sampler fixes merged. Can you resolve the conflict?

mergify · 2025-05-16T22:15:23Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @mgoin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

Signed-off-by: mgoin <mgoin64@gmail.com>

This reverts commit dcfe952. Signed-off-by: Mark McLoughlin <markmc@redhat.com>

Signed-off-by: Yuqi Zhang <yuqizhang@google.com>

Update FlashInfer

4ea258a

Signed-off-by: mgoin <mgoin64@gmail.com>

mergify bot added the ci/build label May 13, 2025

simon-mo added this to the v0.9.0 milestone May 13, 2025

simon-mo reviewed May 13, 2025

View reviewed changes

docker/Dockerfile Outdated Show resolved Hide resolved

Add blackwell cuda arch

114a0f3

Signed-off-by: mgoin <mgoin64@gmail.com>

Try dropping 8.6 and 12.0

c044a06

Signed-off-by: mgoin <mgoin64@gmail.com>

lanking520 reviewed May 14, 2025

View reviewed changes

mgoin added 2 commits May 15, 2025 23:46

Merge branch 'main' into enable-flashinfer-blackwell

cb1dc18

Update flashinfer to latest

eabdfbb

Signed-off-by: mgoin <mgoin64@gmail.com>

mgoin changed the title ~~Update FlashInfer~~ Update Dockerfile to build for Blackwell May 15, 2025

Add SM 12.0

6e7eca2

Signed-off-by: mgoin <mgoin64@gmail.com>

mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label May 16, 2025

mgoin added 2 commits May 16, 2025 11:02

Revert SM 12.0

23d6c02

Signed-off-by: mgoin <mgoin64@gmail.com>

Merge branch 'main' into enable-flashinfer-blackwell

9f8ecfb

0xjunhao mentioned this pull request May 16, 2025

[CI/Build] Update the Dockerfile to include Blackwell archs #18092

Closed

mergify bot added the needs-rebase label May 16, 2025

Merge branch 'main' into enable-flashinfer-blackwell

813b9a2

Signed-off-by: mgoin <mgoin64@gmail.com>

mergify bot removed the needs-rebase label May 16, 2025

simon-mo merged commit dcfe952 into vllm-project:main May 17, 2025
86 of 93 checks passed

simon-mo mentioned this pull request May 18, 2025

[Build] Supports CUDA 12.6 and 11.8 after Blackwell Update #18316

Merged

markmc added a commit to markmc/vllm that referenced this pull request May 21, 2025

Revert "Update Dockerfile to build for Blackwell (vllm-project#18095)"

70812a9

This reverts commit dcfe952. Signed-off-by: Mark McLoughlin <markmc@redhat.com>

markmc mentioned this pull request May 21, 2025

[DO NOT MERGE] Revert to pre #15777 #18462

Closed

zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025

Update Dockerfile to build for Blackwell (vllm-project#18095)

6cfc7cc

Signed-off-by: Yuqi Zhang <yuqizhang@google.com>

mgoin mentioned this pull request Jun 4, 2025

[V1] Use FlashInfer by default on Blackwell GPUs #19118

Merged

huydhn mentioned this pull request Jun 4, 2025

[Bug]: vllm-openai:0.9.0 docker image raise 'CUDA error: no kernel image is available for execution on the device' for Llama4 Maverick FP8 #18841

Closed

1 task

Uh oh!

Conversation

mgoin commented May 13, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented May 13, 2025

Uh oh!

Uh oh!

chenyang78 commented May 13, 2025 • edited by mgoin Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

chenyang78 commented May 13, 2025

Uh oh!

mgoin commented May 14, 2025

Uh oh!

lanking520 May 14, 2025

Choose a reason for hiding this comment

Uh oh!

simon-mo May 15, 2025

Choose a reason for hiding this comment

Uh oh!

simon-mo May 15, 2025

Choose a reason for hiding this comment

Uh oh!

simon-mo commented May 15, 2025

Uh oh!

simon-mo commented May 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

simon-mo commented May 16, 2025

Uh oh!

simon-mo commented May 16, 2025

Uh oh!

mergify bot commented May 16, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

mgoin commented May 13, 2025 •

edited by github-actions bot

Loading

chenyang78 commented May 13, 2025 •

edited by mgoin

Loading

simon-mo commented May 15, 2025 •

edited

Loading