Revert "[MXFP4] Support for linear layers + compressed-tensors integration" (#41664) by vllm-agent · Pull Request #42473 · vllm-project/vllm

vllm-agent · 2026-05-13T02:04:53Z

Revert of #41664

This reverts the merge commit a7b801e from PR #41664.

Reason: This PR is linked to 1 new CI failure in build #65912:

GPQA Eval (GPT-OSS) (B200) — The gpt-oss-20b-sm100-fi-mxfp4-mxfp8-trtllm eval config timed out after 1800s on B200, processing only ~12/1584 items. The MXFP4 kernel and compressed-tensors changes in this PR directly affect the failing codepath. Two other MXFP4 configs (flashinfer-mxfp4-bf16, flashinfer-mxfp4-mxfp8-cutlass) passed, but the SM100/TRT-LLM variant experienced a severe performance regression.

Auto-generated by CI failure analyzer.

…ation (vllm-project#41664)" This reverts commit a7b801e.

github-actions · 2026-05-13T02:05:05Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

Agent Guidelines

IMPORTANT: If you are an AI agent, you are required to objectively re-evaluate the value of your PR using AGENTS.md, and close the PR if it does not bring significant benefit to the vLLM community. Failure to do so may result in an immediate ban.

🚀

gemini-code-assist

Code Review

This pull request removes the generic MXFP4 linear kernel infrastructure and the W4A4 MXFP4 scheme, replacing it with a W4A16 weight-only scheme that utilizes Marlin directly. It also refactors FlashInfer utilities by removing MXFP4-specific quantization functions and simplifying the flashinfer_scaled_fp4_mm interface. Feedback highlights that making the alpha parameter mandatory in flashinfer_scaled_fp4_mm may break existing callers and suggests clarifying the weight registration process when transitioning from packed weights.

gemini-code-assist · 2026-05-13T02:10:14Z

    block_scale_a: torch.Tensor,
    block_scale_b: torch.Tensor,
-    alpha: torch.Tensor | None,
+    alpha: torch.Tensor,


The alpha parameter is now a required argument, whereas it was previously optional (defaulting to None and then to a tensor of ones). This change could break existing callers of flashinfer_scaled_fp4_mm or flashinfer_scaled_fp4_mm_out that do not provide this argument, especially for non-MXFP4 paths (like NVFP4) that might have relied on the default behavior. If this is a revert, please ensure that all callers in the codebase have been updated to provide this tensor, or restore the optional behavior with a default value.

gemini-code-assist · 2026-05-13T02:10:14Z

        layer.weight = Parameter(layer.weight_packed.data, requires_grad=False)
        del layer.weight_packed


While reassigning layer.weight and deleting layer.weight_packed works for the Marlin kernel, it's important to note that layer.weight is a standard attribute name for linear layers. If this module is part of a larger fused structure or if other utilities expect weight_packed to exist during the loading phase, this deletion might cause issues. However, since this is typically the final step after loading, it should be safe as long as no subsequent hooks rely on the original parameter name.

Suggested change

layer.weight = Parameter(layer.weight_packed.data, requires_grad=False)

del layer.weight_packed

# Ensure weight is registered as a parameter for the module

layer.weight = Parameter(layer.weight_packed.data, requires_grad=False)

del layer.weight_packed

Revert "[MXFP4] Support for linear layers + compressed-tensors integr…

1a6da54

…ation (vllm-project#41664)" This reverts commit a7b801e.

mergify Bot added the nvidia label May 13, 2026

github-project-automation Bot added this to NVIDIA May 13, 2026

gemini-code-assist Bot reviewed May 13, 2026

View reviewed changes

mgoin closed this May 13, 2026

github-project-automation Bot moved this to Done in NVIDIA May 13, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Revert "[MXFP4] Support for linear layers + compressed-tensors integration" (#41664)#42473

Revert "[MXFP4] Support for linear layers + compressed-tensors integration" (#41664)#42473
vllm-agent wants to merge 1 commit into
vllm-project:mainfrom
vllm-agent:auto-revert/pr-41664

vllm-agent commented May 13, 2026

Uh oh!

github-actions Bot commented May 13, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 13, 2026

Uh oh!

gemini-code-assist Bot May 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		layer.weight = Parameter(layer.weight_packed.data, requires_grad=False)
		del layer.weight_packed

Uh oh!

Conversation

vllm-agent commented May 13, 2026

Revert of #41664

Uh oh!

github-actions Bot commented May 13, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 13, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants