Llama 3.1 405B fp4 changes upstreaming from 355_wip #25135
mgoin merged 17 commits into vllm-project:main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed 798e475 to f9626ee.
This pull request has merge conflicts that must be resolved before it can be merged.
SageMoore left a comment:
Can we get some unit tests for the batched_rotary_embedding kernel?
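For reference, a minimal sketch of what such a unit test could check, in pure PyTorch: apply NeoX-style rotation to a whole batch of tokens at once and compare against a per-token loop. `apply_rope_ref` and the cos/sin cache construction below are illustrative stand-ins, not the actual `batched_rotary_embedding` kernel API.

```python
import torch


def apply_rope_ref(x: torch.Tensor, cos: torch.Tensor,
                   sin: torch.Tensor) -> torch.Tensor:
    # NeoX-style rotation: split the head dim in half and rotate.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((x1 * cos - x2 * sin, x2 * cos + x1 * sin), dim=-1)


def test_batched_rope_matches_per_token_reference():
    torch.manual_seed(0)
    num_tokens, num_heads, head_dim, max_pos = 16, 4, 64, 1024
    positions = torch.randint(0, max_pos, (num_tokens,))
    q = torch.randn(num_tokens, num_heads, head_dim)

    # Precompute a cos/sin cache, as rope implementations usually do.
    inv_freq = 1.0 / (10000.0**(torch.arange(0, head_dim, 2).float() /
                                head_dim))
    freqs = torch.outer(torch.arange(max_pos).float(), inv_freq)
    cos_cache, sin_cache = freqs.cos(), freqs.sin()

    # "Batched" path: gather cos/sin for all tokens at once.
    cos = cos_cache[positions].unsqueeze(1)  # [num_tokens, 1, head_dim // 2]
    sin = sin_cache[positions].unsqueeze(1)
    out_batched = apply_rope_ref(q, cos, sin)

    # Reference path: one token at a time.
    out_ref = torch.stack([
        apply_rope_ref(q[i], cos_cache[positions[i]], sin_cache[positions[i]])
        for i in range(num_tokens)
    ])
    torch.testing.assert_close(out_batched, out_ref)
```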
(Resolved review thread on vllm/model_executor/layers/quantization/quark/schemes/quark_w4a4_mxfp4.py.)
Removed batched rope for now to speed up landing this PR.
fxmarty-amd left a comment:
Great that the CDNA4 MXFP4 GEMM gets upstreamed!
(Two resolved review threads on vllm/model_executor/layers/quantization/quark/schemes/quark_w4a4_mxfp4.py.)
CC @mgoin
The CI test model-executor-test also fails on main (f552d5e578077574276aa9d83139b91e1d5ae163), which this branch is based on. Please force-merge this PR. Thanks.
mgoin left a comment:
Let's remove the x_quant_scales change to the linear layer
Removed, please look again.
The CI test that failed passes locally:
```python
if self.emulate:
    layer.weight_scale = torch.nn.Parameter(layer.weight_scale.data,
                                            requires_grad=False)
    try:
        from quark.torch.export.nn.modules import realquantizer
        from quark.torch.quantization.config.config import (
            QuantizationSpec)
    except ImportError as err:
        raise ImportError(
            "The package `amd-quark` is required to use AMD Quark "
            "MX-FP4 models. Please install it with `pip install "
            "amd-quark`.") from err

    weight_quant_spec = QuantizationSpec.from_dict(
        self.weight_quant_spec)

    weight_quantizer = realquantizer.get_real_quantizer(
        qspec=weight_quant_spec,
        quantizer=None,
        real_quantized=True,
        reorder=False,
        float_dtype=self.out_dtype,
        scale_shape=layer.weight_scale.shape,
        zero_point_shape=None,
    )
    weight_quantizer.scale.data = layer.weight_scale.data

    layer.weight = torch.nn.Parameter(
        weight_quantizer(layer.weight.data).to(self.out_dtype),
        requires_grad=False,
    )
    layer.weight_scale = None

    # This call is necessary to release the scales memory.
    torch.cuda.empty_cache()
```
I insist that this is unnecessary (https://github.com/vllm-project/vllm/pull/25135/files#r2378191214); unfortunately, I was not able to reopen the thread that was closed.
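For context, a minimal pure-PyTorch sketch of what this emulation path effectively computes: MXFP4 groups every 32 weights into a block that shares one scale, so dequantization expands each block by its scale into a higher-precision dtype before the GEMM. The function name and tensor layout here are illustrative assumptions, not Quark's actual API.

```python
import torch

BLOCK = 32  # MX format block size: 32 elements share one scale


def dequant_mxfp4_blocks(values: torch.Tensor,
                         scales: torch.Tensor,
                         out_dtype: torch.dtype = torch.bfloat16
                         ) -> torch.Tensor:
    """Illustrative block-wise dequantization.

    values: [rows, cols] FP4 codes already decoded to float
    scales: [rows, cols // BLOCK], one shared scale per block
    """
    rows, cols = values.shape
    blocks = values.view(rows, cols // BLOCK, BLOCK)
    dequant = blocks * scales.unsqueeze(-1)  # broadcast scale over block
    return dequant.view(rows, cols).to(out_dtype)


# Example: a 4x64 weight with two 32-element blocks per row.
w = torch.randn(4, 64)
s = torch.rand(4, 2) + 0.5
print(dequant_mxfp4_blocks(w, s).shape)  # torch.Size([4, 64])
```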
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Doug Lehr <douglehr@amd.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Perf is the same for upstream tp1 and 355_wip tp1.

Command: run the client benchmark.

Correctness: shows reasonable answers for the command.