
Fix AutoRound ignore-layer metadata handling and add Qwen3-30B to mxfp8 example. #2687

Closed
changwangss wants to merge 0 commits into vllm-project:main from changwangss:patch-1

Conversation

Contributor

@changwangss changwangss commented May 6, 2026

Summary
This PR mainly includes two targeted fixes:

  1. Ensure AutoRound ignored FP layers are not re-marked as quantized by clearing/restoring quantization metadata correctly in post-processing.
  2. Update the Qwen3 MXFP8 example to use iters=0 for Qwen/Qwen3-30B-A3B-Instruct-2507.
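
The first fix can be illustrated with a minimal sketch. It is not the code from this PR: the function name `postprocess_qparams`, its arguments, and the plain-object stand-ins for modules are all hypothetical, chosen only to mirror the `_postprocess_qparams` log messages in the test output below. It assumes that ignored full-precision (FP) layers should neither get their saved qparams restored nor keep a `quantization_scheme` attribute.

```python
from types import SimpleNamespace


def postprocess_qparams(modules, saved_qparams, fp_layer_names):
    """Restore saved qparams on quantized layers and strip metadata from FP layers.

    modules:        dict mapping layer name -> module-like object (hypothetical stand-in)
    saved_qparams:  dict mapping layer name -> {qparam name: value}
    fp_layer_names: set of layer names that must stay in full precision
    Returns (skipped, cleared): names whose qparams were not restored, and
    names whose quantization_scheme attribute was removed.
    """
    skipped, cleared = [], []
    for name, module in modules.items():
        if name in fp_layer_names:
            # Do not restore qparams on an ignored layer.
            if name in saved_qparams:
                skipped.append(name)
            # Drop the scheme so the layer is not treated as quantized later.
            if hasattr(module, "quantization_scheme"):
                delattr(module, "quantization_scheme")
                cleared.append(name)
            continue
        # Quantized layer: put the saved qparams back.
        for key, value in saved_qparams.get(name, {}).items():
            setattr(module, key, value)
    return skipped, cleared


if __name__ == "__main__":
    gate = SimpleNamespace(quantization_scheme="MXFP8")
    up_proj = SimpleNamespace(quantization_scheme="MXFP8")
    skipped, cleared = postprocess_qparams(
        {"mlp.gate": gate, "mlp.up_proj": up_proj},
        {"mlp.gate": {"weight_scale": 1.0}, "mlp.up_proj": {"weight_scale": 0.5}},
        {"mlp.gate"},
    )
    print(f"Skipped restoring qparams for {len(skipped)} FP layers")
    print(f"Cleared quantization_scheme on {len(cleared)} FP layers")
```

Returning the two name lists keeps the sketch easy to log and to assert on; the real implementation may track this differently.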

Test
Tested MXFP8 quantization with Qwen/Qwen3-30B-A3B-Instruct-2507:

root@12bc54b086b6:/data3/changwa1/qwen_example/llmc/examples/autoround/quantization_w8a8_mxfp8# CUDA_VISIBLE_DEVICES=1 python qwen3_example.py
Skipping import of cpp extensions due to incompatible torch version. Please upgrade to torch >= 2.11.0 (found 2.10.0+cu128).
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████| 16/16 [00:04<00:00,  3.31it/s]
2026-05-11 07:42:38 INFO calib_dataset.py L912: Preprocessing calibration dataset in a subprocess to avoid memory leaks.
..
2026-05-11T07:42:52.0631 | __init__ | WARNING - Disabling tokenizer parallelism due to threading conflict between FastTokenizer and Datasets. Set TOKENIZERS_PARALLELISM=false to suppress this warning.
2026-05-11T07:42:52.5234 | reset | INFO - Compression lifecycle reset
2026-05-11T07:42:52.6649 | moe_calibration_context | INFO - Found 48 MoE modules to replace
Replacing MoE modules for calibration: 100%|██████████████████████████████████████████| 48/48 [00:00<00:00, 1286.97it/s]
2026-05-11T07:42:52.7037 | moe_calibration_context | INFO - Replaced 48 MoE modules for calibration
2026-05-11T07:42:52.7038 | moe_calibration_context | INFO - 48/48 modules will be restored after calibration
2026-05-11T07:42:52.7046 | from_modifiers | INFO - Creating recipe from modifiers
2026-05-11T07:42:57.0035 | initialize | INFO - Compression lifecycle initialized for 1 modifiers
2026-05-11T07:42:57.0040 | IndependentPipeline | INFO - Inferred `SequentialPipeline` for `AutoRoundModifier`
W0511 07:43:00.305000 5546 torch/fx/_symbolic_trace.py:53] is_fx_tracing will return true for both fx.symbolic_trace and torch.export. Please use is_fx_tracing_symbolic_tracing() for specifically fx.symbolic_trace or torch.compiler.is_compiling() for specifically torch.export/compile.
Preparing cache: 100%|██████████████████████████████████████████████████████████████| 128/128 [00:00<00:00, 2834.11it/s]
(1/49): Calibrating: 100%|████████████████████████████████████████████████████████████| 128/128 [00:01<00:00, 64.19it/s]
(1/49): Propagating: 100%|████████████████████████████████████████████████████████████| 128/128 [00:01<00:00, 82.96it/s]
(2/49): Calibrating: 100%|████████████████████████████████████████████████████████████| 128/128 [00:10<00:00, 11.93it/s]
2026-05-11T07:43:15.8864 | apply_autoround | INFO - Applying AutoRound on layer model.layers.0
2026-05-11 07:43:16 INFO config.py L50: `enable_opt_rtn` is turned on, set `--disable_opt_rtn` for higher speed at the cost of accuracy.
2026-05-11 07:43:16 INFO entry.py L491: Using LLM mode (new architecture).
2026-05-11 07:43:16 WARNING base.py L195: unrecognized keys ['enable_alg_ext', 'QuantizationConfig', 'dataset', 'iters', 'processor', 'image_processor', 'template', 'extra_data_dir', 'guidance_scale', 'num_inference_steps', 'generator_seed', 'lr'] were passed. Please check them. If you use old api, just ignore this warning.
2026-05-11 07:43:16 WARNING base.py L613: reset enable_torch_compile to `False` as fp8 is enabled
2026-05-11 07:43:16 INFO base.py L518: Using predefined ignore_layers: model.layers.0.mlp.gate
2026-05-11 07:43:16 INFO base.py L518: Using predefined ignore_layers: model.layers.0.mlp.gate
2026-05-11 07:43:19 WARNING quantizer.py L145: MoE layer detected: optimized RTN is disabled for efficiency. Use `--enable_opt_rtn` to force-enable it for MoE layers.
2026-05-11T07:43:21.0857 | _postprocess_qparams | INFO - Skipped restoring LLMC qparams for 1 FP layers (sample: mlp.gate).
2026-05-11T07:43:21.0861 | _postprocess_qparams | INFO - Cleared quantization_scheme on 1 FP layers (sample: mlp.gate).
(2/49): Propagating: 100%|████████████████████████████████████████████████████████████| 128/128 [00:12<00:00, 10.08it/s]
(3/49): Calibrating: 100%|████████████████████████████████████████████████████████████| 128/128 [00:13<00:00,  9.69it/s]
2026-05-11T07:43:46.9994 | apply_autoround | INFO - Applying AutoRound on layer model.layers.1
2026-05-11 07:43:47 INFO config.py L50: `enable_opt_rtn` is turned on, set `--disable_opt_rtn` for higher speed at the cost of accuracy.
2026-05-11 07:43:47 INFO entry.py L491: Using LLM mode (new architecture).
2026-05-11 07:43:47 WARNING base.py L195: unrecognized keys ['enable_alg_ext', 'QuantizationConfig', 'dataset', 'iters', 'processor', 'image_processor', 'template', 'extra_data_dir', 'guidance_scale', 'num_inference_steps', 'generator_seed', 'lr'] were passed. Please check them. If you use old api, just ignore this warning.
2026-05-11 07:43:48 WARNING base.py L613: reset enable_torch_compile to `False` as fp8 is enabled
...
2026-05-11T08:04:59.2942 | apply_autoround | INFO - Applying AutoRound on layer model.layers.46
2026-05-11 08:05:00 INFO config.py L50: `enable_opt_rtn` is turned on, set `--disable_opt_rtn` for higher speed at the cost of accuracy.
2026-05-11 08:05:00 INFO entry.py L491: Using LLM mode (new architecture).
2026-05-11 08:05:00 WARNING base.py L195: unrecognized keys ['enable_alg_ext', 'QuantizationConfig', 'dataset', 'iters', 'processor', 'image_processor', 'template', 'extra_data_dir', 'guidance_scale', 'num_inference_steps', 'generator_seed', 'lr'] were passed. Please check them. If you use old api, just ignore this warning.
2026-05-11 08:05:01 WARNING base.py L613: reset enable_torch_compile to `False` as fp8 is enabled
2026-05-11 08:05:01 INFO base.py L518: Using predefined ignore_layers: model.layers.0.mlp.gate
2026-05-11 08:05:01 INFO base.py L518: Using predefined ignore_layers: model.layers.0.mlp.gate
2026-05-11T08:05:03.4751 | _postprocess_qparams | INFO - Skipped restoring LLMC qparams for 1 FP layers (sample: mlp.gate).
2026-05-11T08:05:03.4753 | _postprocess_qparams | INFO - Cleared quantization_scheme on 1 FP layers (sample: mlp.gate).
(48/49): Propagating: 100%|███████████████████████████████████████████████████████████| 128/128 [00:13<00:00,  9.70it/s]
(49/49): Calibrating: 100%|███████████████████████████████████████████████████████████| 128/128 [00:12<00:00, 10.66it/s]
2026-05-11T08:05:28.6792 | apply_autoround | INFO - Applying AutoRound on layer model.layers.47
2026-05-11 08:05:29 INFO config.py L50: `enable_opt_rtn` is turned on, set `--disable_opt_rtn` for higher speed at the cost of accuracy.
2026-05-11 08:05:29 INFO entry.py L491: Using LLM mode (new architecture).
2026-05-11 08:05:29 WARNING base.py L195: unrecognized keys ['enable_alg_ext', 'QuantizationConfig', 'dataset', 'iters', 'processor', 'image_processor', 'template', 'extra_data_dir', 'guidance_scale', 'num_inference_steps', 'generator_seed', 'lr'] were passed. Please check them. If you use old api, just ignore this warning.
2026-05-11 08:05:30 WARNING base.py L613: reset enable_torch_compile to `False` as fp8 is enabled
2026-05-11 08:05:30 INFO base.py L518: Using predefined ignore_layers: model.layers.0.mlp.gate
2026-05-11 08:05:30 INFO base.py L518: Using predefined ignore_layers: model.layers.0.mlp.gate
2026-05-11T08:05:33.4549 | _postprocess_qparams | INFO - Skipped restoring LLMC qparams for 1 FP layers (sample: mlp.gate).
2026-05-11T08:05:33.4551 | _postprocess_qparams | INFO - Cleared quantization_scheme on 1 FP layers (sample: mlp.gate).
(49/49): Propagating: 100%|███████████████████████████████████████████████████████████| 128/128 [00:12<00:00, 10.55it/s]
2026-05-11T08:05:46.6343 | finalize | INFO - Compression lifecycle finalized for 1 modifiers
2026-05-11T08:05:46.6344 | post_process | WARNING - Optimized model is not saved. To save, please provide`output_dir` as input arg.Ex. `oneshot(..., output_dir=...)


========== SAMPLE GENERATION ==============

Hello my name is Mandy and I am 25 years old. I live in the United States and I am a female. I am interested in a career in data science and I am currently learning Python and SQL. I am also learning about statistics and machine learning. I am interested in working in the tech industry, particularly in data analysis or data science roles. I would like to know what skills I need to develop to become a successful data scientist, what resources are available to help me learn, and what steps
==========================================


Compressing model: 100%|████████████████████████████████████████████████████████| 18624/18624 [00:08<00:00, 2282.39it/s]
/usr/local/lib/python3.12/dist-packages/transformers/modeling_utils.py:3970: UserWarning: Attempting to save a model with offloaded modules. Ensure that unallocated cpu memory exceeds the `shard_size` (5GB default)
  warnings.warn(
Saving checkpoint shards: 100%|███████████████████████████████████████████████████████████| 7/7 [00:59<00:00,  8.53s/it]
Dispatching model: 100%|███████████████████████████████████████████████████████| 31351/31351 [00:01<00:00, 27766.34it/s]
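
A log like the one above can be checked mechanically. The following is a rough sketch, not part of the PR: the helper `summarize_log` and its regular expressions are hypothetical and are written against the two message formats that appear in this output (`Applying AutoRound on layer ...` and `Cleared quantization_scheme on N FP layers`). It reports which layers were processed and which of them have no matching "Cleared" message.

```python
import re

# Patterns assume the message formats shown in the log above.
APPLY_RE = re.compile(r"apply_autoround \| INFO - Applying AutoRound on layer (\S+)")
CLEARED_RE = re.compile(
    r"_postprocess_qparams \| INFO - Cleared quantization_scheme on (\d+) FP layers"
)


def summarize_log(text):
    """Return (layers, cleared, missing) for an AutoRound run log.

    layers:  layer names in the order AutoRound was applied
    cleared: dict mapping layer name -> number of FP layers cleared
    missing: layers with no 'Cleared quantization_scheme' message
    """
    layers, cleared, current = [], {}, None
    for line in text.splitlines():
        match = APPLY_RE.search(line)
        if match:
            current = match.group(1)
            layers.append(current)
            continue
        match = CLEARED_RE.search(line)
        if match and current is not None:
            cleared[current] = int(match.group(1))
    missing = [name for name in layers if name not in cleared]
    return layers, cleared, missing
```

An excerpt that elides lines with `...`, as this one does, will report the elided layers as missing, so the check is only meaningful on a complete log.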

Load the MXFP8 model quantized by LLMC AutoRound with vLLM.

@mergify mergify Bot added the documentation and two-reviews labels May 6, 2026
Contributor

mergify Bot commented May 6, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviews

Waiting for

  • #approved-reviews-by >= 2
This rule is failing.

PRs labelled "two-reviews" must have at least two approving reviews before merging.

  • #approved-reviews-by >= 2
  • #changes-requested-reviews-by = 0


github-actions Bot commented May 6, 2026

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates the examples/autoround/README.md file to include MXFP8 and MXFP4 quantization schemes in the examples table and the list of supported schemes. Review feedback suggests improving the consistency of file paths for the new examples and correcting the casing of 'WNA16' along with punctuation in the documentation.

Comment thread examples/autoround/README.md Outdated
Comment thread examples/autoround/README.md Outdated
Contributor

coderabbitai Bot commented May 6, 2026

Walkthrough

The PR updates the examples/autoround README by adding three new quantization precision rows (MXFP8, MXFP4, NVFP4) to the Support Matrix table with their associated example script paths, and updates the Known Issues section to reflect expanded AutoRound scheme support.

Changes

README Documentation Update

| Layer / File(s) | Summary |
| --- | --- |
| Support Matrix Expansion: examples/autoround/README.md (lines 72–75) | Three new quantization precision rows added: MXFP8, MXFP4, and NVFP4, each linked to their corresponding example script paths. |
| Known Issues Clarification: examples/autoround/README.md (line 78) | Known Issues section updated to list supported AutoRound schemes as WNA16, MXFP8, MXFP4, and NVFP4; W8A8-FP8 reference removed. |

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Possibly related PRs

Suggested labels

autoround, documentation, fp8, nvfp4, enhancement

Suggested reviewers

  • brian-dellabetta
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ⚠️ Warning | The title mentions 'Fix AutoRound ignore-layer metadata handling and add Qwen3-30B to mxfp8 example', but the actual changes only document support for MXFP8, MXFP4, and NVFP4 schemes in the README with no evidence of ignore-layer metadata handling fixes. | Update the title to accurately reflect the actual changes: 'Add MXFP8, MXFP4, and NVFP4 support to AutoRound README with Qwen3-30B example' or similar, matching the documented additions in the file. |
| Description check | ❓ Inconclusive | The PR description mentions two targeted fixes (AutoRound FP layer handling and Qwen3 MXFP8 example updates), but the actual changeset only shows README documentation updates for quantization support matrix. | Clarify whether the PR scope is limited to README documentation updates or if other code changes addressing the two mentioned fixes are also included. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@coderabbitai coderabbitai Bot added the enhancement, fp8, nvfp4, autoround, and w4a16 labels and removed the two-reviews label May 6, 2026
@mergify mergify Bot added the two-reviews label May 6, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
examples/autoround/README.md (1)

78-78: 💤 Low value

Consider standardizing scheme capitalization.

Line 78 uses "WNA16" while the Support Matrix table (lines 66–68) uses "wNa16". For consistency, consider matching the casing used in the table.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/autoround/README.md` at line 78, The README uses inconsistent
capitalization for the quantization scheme names: change the occurrence of
"WNA16" to match the Support Matrix's casing "wNa16" (or alternatively normalize
both entries to a single chosen casing) so that the term "wNa16" is used
consistently across the README (reference symbols: "WNA16" and "wNa16", and the
Support Matrix table).


📥 Commits

Reviewing files that changed from the base of the PR and between 9eb75fc and 18150f0.

📒 Files selected for processing (1)
  • examples/autoround/README.md

@changwangss changwangss marked this pull request as draft May 6, 2026 11:06
@changwangss changwangss changed the title from "Update example autoround README.md" to "Fix AutoRound ignore-layer metadata handling and add Qwen3-30B to mxfp8 example." May 11, 2026
@changwangss changwangss marked this pull request as ready for review May 11, 2026 06:27

Labels

autoround, documentation, enhancement, fp8, nvfp4, two-reviews, w4a16


1 participant