Fix AutoRound ignore-layer metadata handling and add Qwen3-30B to mxfp8 example. by changwangss · Pull Request #2687 · vllm-project/llm-compressor

changwangss · 2026-05-06T10:37:44Z

Summary
This PR makes two targeted fixes:

- Ensure that FP layers ignored by AutoRound are not re-marked as quantized: post-processing skips restoring their qparams and clears their quantization metadata.
- Update the Qwen3 MXFP8 example to use iters=0 for Qwen/Qwen3-30B-A3B-Instruct-2507.
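
The first fix can be pictured with a minimal sketch. This is not the PR's code: `FakeLayer`, `postprocess_ignored_layers`, the `saved_qparams` mapping, and the layer name `mlp.experts.0.up_proj` are hypothetical stand-ins. The only behavior taken from this PR's test log is that post-processing skips restoring qparams for FP layers and clears their `quantization_scheme`.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Optional


@dataclass
class FakeLayer:
    """Hypothetical stand-in for a module that carries quantization metadata."""
    name: str
    quantization_scheme: Optional[str] = None
    qparams: Dict[str, Any] = field(default_factory=dict)


def postprocess_ignored_layers(
    layers: Iterable[FakeLayer],
    saved_qparams: Dict[str, Dict[str, Any]],
    ignored: Iterable[str],
) -> List[str]:
    """Restore saved qparams on quantized layers; keep ignored layers in FP."""
    ignored_names = set(ignored)
    cleared = []
    for layer in layers:
        if layer.name in ignored_names:
            # Ignored (FP) layer: do not restore qparams, drop the scheme.
            layer.qparams.clear()
            layer.quantization_scheme = None
            cleared.append(layer.name)
        else:
            # Quantized layer: put the saved qparams back.
            layer.qparams.update(saved_qparams.get(layer.name, {}))
    return cleared


layers = [
    FakeLayer("mlp.gate", quantization_scheme="MXFP8"),
    FakeLayer("mlp.experts.0.up_proj", quantization_scheme="MXFP8"),
]
saved = {
    "mlp.gate": {"weight_scale": 1.0},
    "mlp.experts.0.up_proj": {"weight_scale": 0.5},
}
cleared = postprocess_ignored_layers(layers, saved, ignored=["mlp.gate"])
```

In this sketch the ignored gate ends up with no scheme and no qparams, while the other layer keeps its scheme and gets its saved qparams back.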

Test
Tested the MXFP8 example with Qwen/Qwen3-30B-A3B-Instruct-2507:

root@12bc54b086b6:/data3/changwa1/qwen_example/llmc/examples/autoround/quantization_w8a8_mxfp8# CUDA_VISIBLE_DEVICES=1 python qwen3_example.py
Skipping import of cpp extensions due to incompatible torch version. Please upgrade to torch >= 2.11.0 (found 2.10.0+cu128).
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████| 16/16 [00:04<00:00,  3.31it/s]
2026-05-11 07:42:38 INFO calib_dataset.py L912: Preprocessing calibration dataset in a subprocess to avoid memory leaks...
2026-05-11T07:42:52.0631 | __init__ | WARNING - Disabling tokenizer parallelism due to threading conflict between FastTokenizer and Datasets. Set TOKENIZERS_PARALLELISM=false to suppress this warning.
2026-05-11T07:42:52.5234 | reset | INFO - Compression lifecycle reset
2026-05-11T07:42:52.6649 | moe_calibration_context | INFO - Found 48 MoE modules to replace
Replacing MoE modules for calibration: 100%|██████████████████████████████████████████| 48/48 [00:00<00:00, 1286.97it/s]
2026-05-11T07:42:52.7037 | moe_calibration_context | INFO - Replaced 48 MoE modules for calibration
2026-05-11T07:42:52.7038 | moe_calibration_context | INFO - 48/48 modules will be restored after calibration
2026-05-11T07:42:52.7046 | from_modifiers | INFO - Creating recipe from modifiers
2026-05-11T07:42:57.0035 | initialize | INFO - Compression lifecycle initialized for 1 modifiers
2026-05-11T07:42:57.0040 | IndependentPipeline | INFO - Inferred `SequentialPipeline` for `AutoRoundModifier`
W0511 07:43:00.305000 5546 torch/fx/_symbolic_trace.py:53] is_fx_tracing will return true for both fx.symbolic_trace and torch.export. Please use is_fx_tracing_symbolic_tracing() for specifically fx.symbolic_trace or torch.compiler.is_compiling() for specifically torch.export/compile.
Preparing cache: 100%|██████████████████████████████████████████████████████████████| 128/128 [00:00<00:00, 2834.11it/s]
(1/49): Calibrating: 100%|████████████████████████████████████████████████████████████| 128/128 [00:01<00:00, 64.19it/s]
(1/49): Propagating: 100%|████████████████████████████████████████████████████████████| 128/128 [00:01<00:00, 82.96it/s]
(2/49): Calibrating: 100%|████████████████████████████████████████████████████████████| 128/128 [00:10<00:00, 11.93it/s]
2026-05-11T07:43:15.8864 | apply_autoround | INFO - Applying AutoRound on layer model.layers.0
2026-05-11 07:43:16 INFO config.py L50: `enable_opt_rtn` is turned on, set `--disable_opt_rtn` for higher speed at the cost of accuracy.
2026-05-11 07:43:16 INFO entry.py L491: Using LLM mode (new architecture).
2026-05-11 07:43:16 WARNING base.py L195: unrecognized keys ['enable_alg_ext', 'QuantizationConfig', 'dataset', 'iters', 'processor', 'image_processor', 'template', 'extra_data_dir', 'guidance_scale', 'num_inference_steps', 'generator_seed', 'lr'] were passed. Please check them. If you use old api, just ignore this warning.
2026-05-11 07:43:16 WARNING base.py L613: reset enable_torch_compile to `False` as fp8 is enabled
2026-05-11 07:43:16 INFO base.py L518: Using predefined ignore_layers: model.layers.0.mlp.gate
2026-05-11 07:43:16 INFO base.py L518: Using predefined ignore_layers: model.layers.0.mlp.gate
2026-05-11 07:43:19 WARNING quantizer.py L145: MoE layer detected: optimized RTN is disabled for efficiency. Use `--enable_opt_rtn` to force-enable it for MoE layers.
2026-05-11T07:43:21.0857 | _postprocess_qparams | INFO - Skipped restoring LLMC qparams for 1 FP layers (sample: mlp.gate).
2026-05-11T07:43:21.0861 | _postprocess_qparams | INFO - Cleared quantization_scheme on 1 FP layers (sample: mlp.gate).
(2/49): Propagating: 100%|████████████████████████████████████████████████████████████| 128/128 [00:12<00:00, 10.08it/s]
(3/49): Calibrating: 100%|████████████████████████████████████████████████████████████| 128/128 [00:13<00:00,  9.69it/s]
2026-05-11T07:43:46.9994 | apply_autoround | INFO - Applying AutoRound on layer model.layers.1
2026-05-11 07:43:47 INFO config.py L50: `enable_opt_rtn` is turned on, set `--disable_opt_rtn` for higher speed at the cost of accuracy.
2026-05-11 07:43:47 INFO entry.py L491: Using LLM mode (new architecture).
2026-05-11 07:43:47 WARNING base.py L195: unrecognized keys ['enable_alg_ext', 'QuantizationConfig', 'dataset', 'iters', 'processor', 'image_processor', 'template', 'extra_data_dir', 'guidance_scale', 'num_inference_steps', 'generator_seed', 'lr'] were passed. Please check them. If you use old api, just ignore this warning.
2026-05-11 07:43:48 WARNING base.py L613: reset enable_torch_compile to `False` as fp8 is enabled
...
2026-05-11T08:04:59.2942 | apply_autoround | INFO - Applying AutoRound on layer model.layers.46
2026-05-11 08:05:00 INFO config.py L50: `enable_opt_rtn` is turned on, set `--disable_opt_rtn` for higher speed at the cost of accuracy.
2026-05-11 08:05:00 INFO entry.py L491: Using LLM mode (new architecture).
2026-05-11 08:05:00 WARNING base.py L195: unrecognized keys ['enable_alg_ext', 'QuantizationConfig', 'dataset', 'iters', 'processor', 'image_processor', 'template', 'extra_data_dir', 'guidance_scale', 'num_inference_steps', 'generator_seed', 'lr'] were passed. Please check them. If you use old api, just ignore this warning.
2026-05-11 08:05:01 WARNING base.py L613: reset enable_torch_compile to `False` as fp8 is enabled
2026-05-11 08:05:01 INFO base.py L518: Using predefined ignore_layers: model.layers.0.mlp.gate
2026-05-11 08:05:01 INFO base.py L518: Using predefined ignore_layers: model.layers.0.mlp.gate
2026-05-11T08:05:03.4751 | _postprocess_qparams | INFO - Skipped restoring LLMC qparams for 1 FP layers (sample: mlp.gate).
2026-05-11T08:05:03.4753 | _postprocess_qparams | INFO - Cleared quantization_scheme on 1 FP layers (sample: mlp.gate).
(48/49): Propagating: 100%|███████████████████████████████████████████████████████████| 128/128 [00:13<00:00,  9.70it/s]
(49/49): Calibrating: 100%|███████████████████████████████████████████████████████████| 128/128 [00:12<00:00, 10.66it/s]
2026-05-11T08:05:28.6792 | apply_autoround | INFO - Applying AutoRound on layer model.layers.47
2026-05-11 08:05:29 INFO config.py L50: `enable_opt_rtn` is turned on, set `--disable_opt_rtn` for higher speed at the cost of accuracy.
2026-05-11 08:05:29 INFO entry.py L491: Using LLM mode (new architecture).
2026-05-11 08:05:29 WARNING base.py L195: unrecognized keys ['enable_alg_ext', 'QuantizationConfig', 'dataset', 'iters', 'processor', 'image_processor', 'template', 'extra_data_dir', 'guidance_scale', 'num_inference_steps', 'generator_seed', 'lr'] were passed. Please check them. If you use old api, just ignore this warning.
2026-05-11 08:05:30 WARNING base.py L613: reset enable_torch_compile to `False` as fp8 is enabled
2026-05-11 08:05:30 INFO base.py L518: Using predefined ignore_layers: model.layers.0.mlp.gate
2026-05-11 08:05:30 INFO base.py L518: Using predefined ignore_layers: model.layers.0.mlp.gate
2026-05-11T08:05:33.4549 | _postprocess_qparams | INFO - Skipped restoring LLMC qparams for 1 FP layers (sample: mlp.gate).
2026-05-11T08:05:33.4551 | _postprocess_qparams | INFO - Cleared quantization_scheme on 1 FP layers (sample: mlp.gate).
(49/49): Propagating: 100%|███████████████████████████████████████████████████████████| 128/128 [00:12<00:00, 10.55it/s]
2026-05-11T08:05:46.6343 | finalize | INFO - Compression lifecycle finalized for 1 modifiers
2026-05-11T08:05:46.6344 | post_process | WARNING - Optimized model is not saved. To save, please provide`output_dir` as input arg.Ex. `oneshot(..., output_dir=...)


========== SAMPLE GENERATION ==============

Hello my name is Mandy and I am 25 years old. I live in the United States and I am a female. I am interested in a career in data science and I am currently learning Python and SQL. I am also learning about statistics and machine learning. I am interested in working in the tech industry, particularly in data analysis or data science roles. I would like to know what skills I need to develop to become a successful data scientist, what resources are available to help me learn, and what steps
==========================================


Compressing model: 100%|████████████████████████████████████████████████████████| 18624/18624 [00:08<00:00, 2282.39it/s]
/usr/local/lib/python3.12/dist-packages/transformers/modeling_utils.py:3970: UserWarning: Attempting to save a model with offloaded modules. Ensure that unallocated cpu memory exceeds the `shard_size` (5GB default)
  warnings.warn(
Saving checkpoint shards: 100%|███████████████████████████████████████████████████████████| 7/7 [00:59<00:00,  8.53s/it]
Dispatching model: 100%|███████████████████████████████████████████████████████| 31351/31351 [00:01<00:00, 27766.34it/s]
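
One way to confirm from a captured log that each processed layer had its ignored FP module cleared is to count the relevant messages per layer. The sketch below is an assumption about how such a check might look, not part of the PR; `summarize_log` is a hypothetical helper, and the sample lines are copied from the run above.

```python
import re

# Patterns match the llm-compressor log messages shown in the run above.
APPLY_RE = re.compile(r"Applying AutoRound on layer (\S+)")
CLEARED_RE = re.compile(r"Cleared quantization_scheme on (\d+) FP layers")


def summarize_log(text):
    """Map each layer name to the number of FP layers cleared after it."""
    summary = {}
    current = None
    for line in text.splitlines():
        match = APPLY_RE.search(line)
        if match:
            # A new layer starts; later "Cleared" lines belong to it.
            current = match.group(1)
            summary[current] = 0
            continue
        match = CLEARED_RE.search(line)
        if match and current is not None:
            summary[current] += int(match.group(1))
    return summary


sample = """\
2026-05-11T07:43:15.8864 | apply_autoround | INFO - Applying AutoRound on layer model.layers.0
2026-05-11T07:43:21.0861 | _postprocess_qparams | INFO - Cleared quantization_scheme on 1 FP layers (sample: mlp.gate).
2026-05-11T08:05:28.6792 | apply_autoround | INFO - Applying AutoRound on layer model.layers.47
2026-05-11T08:05:33.4551 | _postprocess_qparams | INFO - Cleared quantization_scheme on 1 FP layers (sample: mlp.gate).
"""

summary = summarize_log(sample)
# Layers with no "Cleared" message after them.
missing = [name for name, count in summary.items() if count == 0]
```

On a truncated log, layers whose messages were elided will also show up in `missing`, so the check is only meaningful on a complete capture.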

Load the MXFP8 model quantized by LLMC AutoRound with vLLM

mergify · 2026-05-06T10:38:25Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviews

Waiting for

#approved-reviews-by >= 2

This rule is failing.

PRs labelled "two-reviews" must have at least two approving reviews before merging.

#approved-reviews-by >= 2
#changes-requested-reviews-by = 0

github-actions · 2026-05-06T10:39:19Z

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

gemini-code-assist

Code Review

This pull request updates the examples/autoround/README.md file to include MXFP8 and MXFP4 quantization schemes in the examples table and the list of supported schemes. Review feedback suggests improving the consistency of file paths for the new examples and correcting the casing of 'WNA16' along with punctuation in the documentation.

coderabbitai · 2026-05-06T10:40:30Z

Walkthrough

The PR updates the examples/autoround README by adding three new quantization precision rows (MXFP8, MXFP4, NVFP4) to the Support Matrix table with their associated example script paths, and updates the Known Issues section to reflect expanded AutoRound scheme support.

Changes

README Documentation Update

Layer / File(s)	Summary
Support Matrix Expansion `examples/autoround/README.md` (lines 72–75)	Three new quantization precision rows added: MXFP8, MXFP4, and NVFP4, each linked to their corresponding example script paths.
Known Issues Clarification `examples/autoround/README.md` (line 78)	Known Issues section updated to list supported AutoRound schemes as WNA16, MXFP8, MXFP4, and NVFP4; W8A8-FP8 reference removed.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related PRs

vllm-project/llm-compressor#2673: Both PRs modify documentation and example listings related to MXFP8/MXFP4/NVFP4 support and example paths.
vllm-project/llm-compressor#2685: Both PRs modify the examples/autoround README to update and add MXFP4-related example links and documentation.
vllm-project/llm-compressor#2678: Both PRs update README documentation to add and re-scope new quantization precisions (MXFP8/MXFP4/NVFP4 and related variants).

Suggested labels

autoround, documentation, fp8, nvfp4, enhancement

Suggested reviewers

brian-dellabetta

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Title check	⚠️ Warning	The title mentions 'Fix AutoRound ignore-layer metadata handling and add Qwen3-30B to mxfp8 example', but the actual changes only document support for MXFP8, MXFP4, and NVFP4 schemes in the README with no evidence of ignore-layer metadata handling fixes.	Update the title to accurately reflect the actual changes: 'Add MXFP8, MXFP4, and NVFP4 support to AutoRound README with Qwen3-30B example' or similar, matching the documented additions in the file.
Description check	❓ Inconclusive	The PR description mentions two targeted fixes (AutoRound FP layer handling and Qwen3 MXFP8 example updates), but the actual changeset only shows README documentation updates for quantization support matrix.	Clarify whether the PR scope is limited to README documentation updates or if other code changes addressing the two mentioned fixes are also included.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

🧹 Nitpick comments (1)

examples/autoround/README.md (1)
78-78: 💤 Low value

Consider standardizing scheme capitalization.

Line 78 uses "WNA16" while the Support Matrix table (lines 66–68) uses "wNa16". For consistency, consider matching the casing used in the table.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/autoround/README.md` at line 78, The README uses inconsistent
capitalization for the quantization scheme names: change the occurrence of
"WNA16" to match the Support Matrix's casing "wNa16" (or alternatively normalize
both entries to a single chosen casing) so that the term "wNa16" is used
consistently across the README (reference symbols: "WNA16" and "wNa16", and the
Support Matrix table).


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0618543b-9bcc-45ea-8cf4-0e8534319205

📥 Commits

Reviewing files that changed from the base of the PR and between 9eb75fc and 18150f0.

📒 Files selected for processing (1)

examples/autoround/README.md

mergify Bot added the documentation (Improvements or additions to documentation) and two-reviews (When a PR requires two reviews) labels May 6, 2026

gemini-code-assist Bot reviewed May 6, 2026

coderabbitai Bot added enhancement New feature or request fp8 For any issue / PR related to FP8 support nvfp4 For any PR / issue related to NVFP4 support autoround For any PR / issue related to autoround support w4a16 and removed two-reviews When a PR requires two reviews labels May 6, 2026

mergify Bot added the two-reviews (When a PR requires two reviews) label May 6, 2026

coderabbitai Bot reviewed May 6, 2026

changwangss marked this pull request as draft May 6, 2026 11:06

changwangss changed the title ~~Update example autoround README.md~~ Fix AutoRound ignore-layer metadata handling and add Qwen3-30B to mxfp8 example. May 11, 2026

changwangss marked this pull request as ready for review May 11, 2026 06:27

changwangss closed this May 11, 2026

changwangss force-pushed the patch-1 branch from 458ee6d to 726599e Compare May 11, 2026 08:17

changwangss wants to merge 0 commits into vllm-project:main from changwangss:patch-1
