[ROCM] DSfp4 mla projection gemms weight dynamic quantization #32238
gshtras merged 6 commits into vllm-project:main
Conversation
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Code Review
This pull request introduces support for DSfp4 MLA projection GEMMs with dynamic weight quantization on ROCm. The changes look good overall, adding new environment variables and logic to handle FP4 batched matrix multiplication. I've found a critical issue and a high-severity typo that should be addressed.
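For readers unfamiliar with the technique, the sketch below shows the general shape of dynamic per-block FP4 (e2m1) weight quantization that a path like this relies on. It is a plain-PyTorch fake-quantization reference, not this PR's AITER Triton kernel; the block size of 32, the helper name, and the scale heuristic are assumptions.

```python
import torch

# Positive magnitudes representable in FP4 e2m1 (sign handled separately).
FP4_E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_fp4_blockwise(w: torch.Tensor, block: int = 32):
    """Dynamically quantize a 2-D weight to the e2m1 grid, one scale per block.

    Returns the dequantized (fake-quantized) weight and the per-block scales;
    a real kernel would keep packed 4-bit codes and the scales instead.
    """
    rows, cols = w.shape
    assert cols % block == 0, "illustrative sketch assumes divisible blocks"
    w_blocks = w.reshape(rows, cols // block, block)

    # Per-block scale chosen so the largest magnitude maps to the grid max.
    scale = w_blocks.abs().amax(dim=-1, keepdim=True) / 6.0
    scaled = w_blocks / scale.clamp(min=1e-12)

    # Snap each value to the nearest representable e2m1 magnitude.
    # (Materializes an 8x-wide distance tensor; illustration only.)
    dist = (scaled.abs().unsqueeze(-1) - FP4_E2M1_GRID).abs()
    q = FP4_E2M1_GRID[dist.argmin(dim=-1)] * scaled.sign()

    return (q * scale).reshape(rows, cols), scale.squeeze(-1)
```

Because the scales are computed from the weights at load time rather than calibrated offline, this is "dynamic" quantization in the sense the PR title uses.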
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
```diff
@@ -2045,7 +2074,7 @@ def forward(
             scale=layer._k_scale,
         )

-        if fp8_attention:
+        if fp8_attention and not self.is_aiter_triton_fp4_bmm_enabled:
```
FP8 KV cache view skipped when FP4BMM enabled
High Severity
The condition `fp8_attention and not self.is_aiter_triton_fp4_bmm_enabled` can cause data corruption when both the FP8 KV cache and FP4BMM are enabled simultaneously. The `concat_and_cache_mla` call at line 2068 writes FP8-encoded data when `kv_cache_dtype` is `"fp8"`, but line 2077 skips the `.view(fp8_dtype)` conversion when `is_aiter_triton_fp4_bmm_enabled` is True. Since these conditions are independent (one depends on the KV cache dtype, the other on the weight dtype and a feature flag), both can be True, causing subsequent operations to read FP8 bytes as the original dtype and produce garbage results.
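To make the finding concrete, here is one possible shape of the corrected guard; the names follow the diff above, and this is a sketch, not the merged code.

```python
# Sketch of a possible fix (names follow the diff; not the merged change).
# The cache view must track the KV-cache dtype alone: if concat_and_cache_mla
# wrote FP8 bytes, reinterpret the cache as FP8 here regardless of whether
# the FP4 BMM weight path is enabled, and gate only the weight-side BMM
# logic on the FP4 flag.
if fp8_attention:
    kv_cache = kv_cache.view(fp8_dtype)
```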
```python
            transpose_bm=True,
            prequant=True,
            y_scale=None,
        )
```
Base class missing FP4BMM attribute setup in process_weights_after_loading
Medium Severity
`MLACommonBaseImpl._v_up_proj` was modified to use `self.W_V` and `self.W_V_scale` when `is_aiter_triton_fp4_bmm_enabled` is True, but `MLACommonBaseImpl.process_weights_after_loading` only sets these attributes for FP8BMM (in the `if self.is_aiter_triton_fp8_bmm_enabled` branch). When FP4BMM is enabled but FP8BMM is disabled, `process_weights_after_loading` falls into the `else` branch, which sets `W_UV` and `W_UK_T` instead, leaving `W_V` and `W_V_scale` undefined. This leaves the base class in an inconsistent state: while `MLACommonImpl` handles FP4BMM setup correctly, any direct subclass of `MLACommonBaseImpl` would hit an `AttributeError` at runtime.
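A minimal structural sketch of the missing branch, assuming the flag and attribute names from the finding; `_dynamic_quant_stub` is a stand-in for the real AITER FP8/FP4 quantization helpers, and quantizing from `self.W_UV` is an assumption.

```python
import torch

def _dynamic_quant_stub(w: torch.Tensor):
    """Stand-in for the real AITER dynamic quantizer (returns weight, scale)."""
    scale = w.abs().amax().clamp(min=1e-12)
    return (w / scale), scale

class MLACommonBaseImplSketch:
    """Structure only; mirrors the branches described in the finding."""

    def process_weights_after_loading(self) -> None:
        if self.is_aiter_triton_fp8_bmm_enabled:
            # Existing branch: populates W_V / W_V_scale via the FP8 path.
            self.W_V, self.W_V_scale = _dynamic_quant_stub(self.W_UV)
        elif self.is_aiter_triton_fp4_bmm_enabled:
            # Missing branch per the finding: the FP4 path must also set
            # W_V and W_V_scale, since _v_up_proj reads them when this
            # flag is enabled.
            self.W_V, self.W_V_scale = _dynamic_quant_stub(self.W_UV)
        else:
            # Unquantized fallback: W_UV / W_UK_T remain as loaded.
            pass
```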
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
…roject#32238)
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>

…roject#32238)
Signed-off-by: Aleksandr Malyshev <maleksan@amd.com>
Co-authored-by: Aleksandr Malyshev <maleksan@amd.com>
Commands:
Server:
Client:
Output:
Legacy path (both flags off) also works: `VLLM_ROCM_USE_AITER_FP4BMM=0 VLLM_ROCM_USE_AITER_FP8BMM=0`
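For completeness, the flags above must be set in the environment before vLLM starts; a minimal Python sketch (the env var names come from this PR and the line above, everything else is illustrative):

```python
import os

# Opt in to the AITER Triton FP4 batched-GEMM path for the MLA projections
# (must be set before vLLM reads its environment configuration).
os.environ["VLLM_ROCM_USE_AITER_FP4BMM"] = "1"

# Legacy behavior, as noted above, disables both quantized BMM paths:
# os.environ["VLLM_ROCM_USE_AITER_FP4BMM"] = "0"
# os.environ["VLLM_ROCM_USE_AITER_FP8BMM"] = "0"
```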