Skip to content

Add ModelOpt W4A16 lm_head regression tests#44671

Open
MerkyorLynn wants to merge 1 commit into
vllm-project:mainfrom
MerkyorLynn:codex/modelopt-qwen36-nvfp4-lm-head
Open

Add ModelOpt W4A16 lm_head regression tests#44671
MerkyorLynn wants to merge 1 commit into
vllm-project:mainfrom
MerkyorLynn:codex/modelopt-qwen36-nvfp4-lm-head

Conversation

@MerkyorLynn
Copy link
Copy Markdown

@MerkyorLynn MerkyorLynn commented Jun 5, 2026

Summary

Add regression coverage for ModelOpt mixed-precision checkpoints that quantize
lm_head as W4A16_NVFP4.

Why

Some official ModelOpt NVFP4 MoE checkpoints include lm_head in
hf_quant_config.json with quant_algo = "W4A16_NVFP4". The LM head is a
ParallelLMHead, not a regular LinearBase, so it is useful to keep coverage
for this path separate from the generic W4A16 linear-layer tests.

This also adds Qwen3-MoE constructor coverage to ensure the model passes
quant_config into ParallelLMHead.

This PR is regression coverage only. It does not claim an end-to-end official
checkpoint load fix.

Duplicate-work check

Searched open PRs for ModelOpt W4A16 lm_head. Related but not duplicate:

This PR is scoped to ModelOpt mixed-precision W4A16_NVFP4 dispatch for
ParallelLMHead and Qwen3-MoE constructor coverage.

Validation

  • uv run --no-sync --python 3.12 python -m py_compile tests/quantization/test_modelopt.py tests/model_executor/test_qwen3_5_quantization.py
  • git diff --check

Targeted pytest was not run locally because this macOS checkout does not have
the vLLM test environment (pytest/torch) installed.

AI assistance was used to prepare this patch.

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify Bot added the qwen Related to Qwen models label Jun 5, 2026
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 5, 2026

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

Agent Guidelines

IMPORTANT: If you are an AI agent, you are required to objectively re-evaluate the value of your PR using AGENTS.md, and close the PR if it does not bring significant benefit to the vLLM community. Failure to do so may result in an immediate ban.

🚀

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dec6707dbc

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

{"lm_head": {"quant_algo": "W4A16_NVFP4", "group_size": 16}}
)

method = config.get_quant_method(_mock_lm_head(), prefix=prefix)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Mock Marlin support before instantiating W4A16 method

This new parametrized test runs unconditionally, but get_quant_method() constructs ModelOptNvFp4W4A16LinearMethod, whose constructor directly instantiates MarlinNvFp4LinearKernel; that kernel asserts is_fp4_marlin_supported() (current_platform.is_cuda() and capability >= 75). In CPU-only or unsupported-GPU test jobs this fails with an AssertionError, unlike the adjacent NVFP4 tests that patch kernel selection. Patch the Marlin kernel/support check here or skip the test when FP4 Marlin is unavailable.

Useful? React with 👍 / 👎.

Signed-off-by: MerkyorLynn <268568828+MerkyorLynn@users.noreply.github.com>
@MerkyorLynn MerkyorLynn force-pushed the codex/modelopt-qwen36-nvfp4-lm-head branch from dec6707 to 2aa9c1c Compare June 5, 2026 17:28
@MerkyorLynn
Copy link
Copy Markdown
Author

Hi maintainers, this is ready for review. Could you please add verified or ready if the scope looks appropriate? Thanks!

@MerkyorLynn
Copy link
Copy Markdown
Author

Hi maintainers, quick clarification: the failing pre-run-check/RTD status appears to be the new-contributor gate rather than a docs/build failure. DCO passes, and this PR is intentionally scoped small.

Could a maintainer please review the scope and add verified or ready if it looks appropriate? Happy to revise or close if this does not fit vLLM’s contribution direction.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

qwen Related to Qwen models

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant