[refactor] refactor weight trans nz and transpose by zzzzwwjj · Pull Request #4878 · vllm-project/vllm-ascend

zzzzwwjj · 2025-12-10T07:43:45Z

What this PR does / why we need it?

VLLM_ASCEND_ENABLE_NZ now accepts three values:
0: disable NZ;
1: enable NZ only for quantized weights;
2: enable NZ wherever it is supported.

The default is VLLM_ASCEND_ENABLE_NZ=1.
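
A minimal sketch of how such a three-valued switch could be read, assuming the value arrives as an integer string in the process environment. The helper name `read_enable_nz`, the constant names, and the error handling are hypothetical and are not taken from this PR.

```python
import os

# Hypothetical labels for the three documented values.
NZ_DISABLED = 0        # never convert weights to NZ
NZ_QUANT_ONLY = 1      # convert only quantized weights (documented default)
NZ_WHEN_POSSIBLE = 2   # convert wherever NZ is supported


def read_enable_nz(environ=None) -> int:
    """Return the NZ mode from VLLM_ASCEND_ENABLE_NZ, defaulting to 1."""
    environ = os.environ if environ is None else environ
    raw = environ.get("VLLM_ASCEND_ENABLE_NZ", "1")
    try:
        mode = int(raw)
    except ValueError:
        raise ValueError(
            f"VLLM_ASCEND_ENABLE_NZ must be 0, 1 or 2, got {raw!r}") from None
    if mode not in (NZ_DISABLED, NZ_QUANT_ONLY, NZ_WHEN_POSSIBLE):
        raise ValueError(
            f"VLLM_ASCEND_ENABLE_NZ must be 0, 1 or 2, got {raw!r}")
    return mode
```

Rejecting unknown values early is a design choice of this sketch; the actual project may treat out-of-range values differently.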

All cases are shown in the table below:

	W4A4	W4A8	W8A8	fp16/bf16	fp32
trans nz	NZ not supported	converted to NZ by default	converted to NZ by default	converted to NZ only when VLLM_ASCEND_ENABLE_NZ is 2	NZ not supported
transpose	non-transposed only	transposed only	transposed only	linear: non-transposed only; gmm: transposed only	same as fp16/bf16
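
The "trans nz" row of the table can be read as a small decision function. The sketch below is one interpretation of that row, not the PR's `maybe_trans_nz` implementation: the function name `should_trans_nz` and the string labels for weight kinds are hypothetical, and it assumes that a value of 0 disables NZ for quantized weights as well.

```python
def should_trans_nz(weight_kind: str, enable_nz: int) -> bool:
    """Decide whether a weight would be converted to NZ.

    weight_kind: one of "W4A4", "W4A8", "W8A8", "fp16/bf16", "fp32"
                 (hypothetical labels mirroring the table columns).
    enable_nz:   value of VLLM_ASCEND_ENABLE_NZ (0, 1 or 2).
    """
    if enable_nz not in (0, 1, 2):
        raise ValueError(f"unexpected VLLM_ASCEND_ENABLE_NZ value: {enable_nz}")
    if enable_nz == 0:
        # Assumption: 0 turns NZ off everywhere.
        return False
    if weight_kind in ("W4A4", "fp32"):
        # The table lists these as not supporting NZ at all.
        return False
    if weight_kind in ("W4A8", "W8A8"):
        # Quantized weights are converted by default (values 1 and 2).
        return True
    if weight_kind == "fp16/bf16":
        # Unquantized half-precision weights need the explicit opt-in.
        return enable_nz == 2
    raise ValueError(f"unknown weight kind: {weight_kind!r}")
```

The exceptional cases listed in the PR description (MLAPO and W_UV) are deliberately left out of this sketch.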

Some exceptional cases:

The MLAPO op needs additional weight preprocessing, including conversion to NZ. When the MLAPO op is used, some weights are therefore always converted to NZ;
The MLA/SFA weight W_UV is consumed by the op torch.ops._C_ascend.batch_matmul_transpose, which does not currently support NZ.

Does this PR introduce any user-facing change?

fp16/bf16 weights are no longer converted to NZ by default.

How was this patch tested?

vLLM version: v0.12.0
vLLM main: vllm-project/vllm@ad32e3e

gemini-code-assist

Code Review

This pull request refactors the handling of the VLLM_ASCEND_ENABLE_NZ environment variable by centralizing the logic into a new maybe_trans_nz helper function. This is a significant improvement in code clarity and maintainability. The changes are consistently applied across various modules, and the tests have been updated to reflect the new behavior. However, I've identified a critical issue in one of the quantization files where a torch.nn.Parameter is incorrectly replaced by a torch.Tensor, which could lead to incorrect model behavior.

gemini-code-assist · 2025-12-10T07:45:30Z

+        layer.w13_weight = maybe_trans_nz(layer.w13_weight)
+        layer.w2_weight = maybe_trans_nz(layer.w2_weight)


The maybe_trans_nz function returns a torch.Tensor. By assigning the result directly to layer.w13_weight and layer.w2_weight, you are replacing the torch.nn.Parameter objects with regular tensors. This will cause them to no longer be treated as model parameters, which can lead to issues with device placement, state dicts, and optimizer behavior.

The original code used an in-place operation torch_npu.npu_format_cast_, which preserved the Parameter status. To fix this, you should assign the result of maybe_trans_nz to the .data attribute of the parameter.

Suggested change

-        layer.w13_weight = maybe_trans_nz(layer.w13_weight)
-        layer.w2_weight = maybe_trans_nz(layer.w2_weight)
+        layer.w13_weight.data = maybe_trans_nz(layer.w13_weight.data)
+        layer.w2_weight.data = maybe_trans_nz(layer.w2_weight.data)

github-actions · 2025-12-10T08:15:31Z

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

A PR should do only one thing; smaller PRs enable faster reviews.
Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
Write the commit message by filling in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
