[Bugfix] Expand quantization method support in perf metrics #37231

markmc merged 4 commits into vllm-project:main
Conversation
Code Review
This pull request is a solid improvement that expands support for various quantization methods in the performance metrics module. The refactoring to use a shared dictionary for quantization method properties is a clean and maintainable approach. The accompanying tests are thorough and cover all the new additions. I have one suggestion to refactor the new tests to reduce code duplication and further improve maintainability.
```python
# INT4 / FP4 quantization methods (weight_byte_size == 0.5)
_INT4_FP4_METHODS = [
    m for m, s in _QUANT_WEIGHT_BYTE_SIZE.items() if s == 0.5
]


@pytest.mark.parametrize("quant_method", _INT4_FP4_METHODS)
def test_quantization_config_parser_int4_methods(quant_method):
    """Test quantization parsers with INT4/FP4 methods (0.5 bytes)."""

    class MockQuantConfig:
        def get_name(self):
            return quant_method

    hf_config = Qwen3Config(
        hidden_size=2048,
        num_attention_heads=16,
        intermediate_size=8192,
        num_hidden_layers=1,
    )
    vllm_config = create_mock_vllm_config(
        hf_config, quant_config=MockQuantConfig()
    )

    attn_result = AttentionMetrics.get_parser().parse(vllm_config)
    assert attn_result.weight_byte_size == 0.5, (
        f"Expected 0.5 for {quant_method}, got {attn_result.weight_byte_size}"
    )

    ffn_result = FfnMetrics.get_parser().parse(vllm_config)
    assert ffn_result.weight_byte_size == 0.5, (
        f"Expected 0.5 for {quant_method}, got {ffn_result.weight_byte_size}"
    )


# FP8 / INT8 quantization methods (weight_byte_size == 1)
_FP8_INT8_METHODS = [
    m for m, s in _QUANT_WEIGHT_BYTE_SIZE.items() if s == 1
]


@pytest.mark.parametrize("quant_method", _FP8_INT8_METHODS)
def test_quantization_config_parser_fp8_methods(quant_method):
    """Test quantization parsers with FP8/INT8 methods (1 byte)."""

    class MockQuantConfig:
        def get_name(self):
            return quant_method

    hf_config = Qwen3Config(
        hidden_size=2048,
        num_attention_heads=16,
        intermediate_size=8192,
        num_hidden_layers=1,
    )
    vllm_config = create_mock_vllm_config(
        hf_config, quant_config=MockQuantConfig()
    )

    attn_result = AttentionMetrics.get_parser().parse(vllm_config)
    assert attn_result.weight_byte_size == 1, (
        f"Expected 1 for {quant_method}, got {attn_result.weight_byte_size}"
    )

    ffn_result = FfnMetrics.get_parser().parse(vllm_config)
    assert ffn_result.weight_byte_size == 1, (
        f"Expected 1 for {quant_method}, got {ffn_result.weight_byte_size}"
    )
```
The two new test functions, `test_quantization_config_parser_int4_methods` and `test_quantization_config_parser_fp8_methods`, are nearly identical. To improve maintainability and reduce code duplication, they can be combined into a single, more general parametrized test that iterates over all items in `_QUANT_WEIGHT_BYTE_SIZE`. This will also make it easier to add tests for new quantization sizes in the future.
```python
@pytest.mark.parametrize("quant_method, expected_byte_size",
                         list(_QUANT_WEIGHT_BYTE_SIZE.items()))
def test_quantization_config_parser(quant_method, expected_byte_size):
    """Test quantization parsers with all supported methods."""

    class MockQuantConfig:
        def get_name(self):
            return quant_method

    hf_config = Qwen3Config(
        hidden_size=2048,
        num_attention_heads=16,
        intermediate_size=8192,
        num_hidden_layers=1,
    )
    vllm_config = create_mock_vllm_config(
        hf_config, quant_config=MockQuantConfig()
    )

    attn_result = AttentionMetrics.get_parser().parse(vllm_config)
    assert attn_result.weight_byte_size == expected_byte_size, (
        f"Expected {expected_byte_size} for {quant_method}, "
        f"got {attn_result.weight_byte_size}"
    )

    ffn_result = FfnMetrics.get_parser().parse(vllm_config)
    assert ffn_result.weight_byte_size == expected_byte_size, (
        f"Expected {expected_byte_size} for {quant_method}, "
        f"got {ffn_result.weight_byte_size}"
    )
```
Hi @thillai-c, the pre-commit checks have failed. Please run `uv pip install "pre-commit>=4.5.1"`, then `pre-commit install`, then `pre-commit run --all-files`. After that, commit the changes and push to your branch. Once `pre-commit install` has been run, the hooks will run automatically on future commits.
Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com>
force-pushed from 0722c5e to ff3a94b
Hi @markmc, the CI pipeline takes ~2 hours to complete, and the PR repeatedly becomes out-of-date before I get a chance to merge. Would you be able to enable auto-merge or help merge this once checks pass? That would help avoid rerunning the full pipeline again.
…ject#37231) Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com> Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
…ject#37231) Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com>
…ject#37231) Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com> Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
…ject#37231) Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com> Signed-off-by: Vinay Damodaran <vrdn@hey.com>
…ject#37231) Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com> Signed-off-by: EricccYang <yangyang4991@gmail.com>
Purpose
The MFU (Model Flops Utilization) metrics module uses `AttentionQuantizationConfigParser` and `FfnQuantizationConfigParser` to determine `weight_byte_size` for flops/memory estimation. Currently, only 3 quantization methods are supported (`fp8`, `fbgemm_fp8`, `mxfp4`). All other methods, including widely used ones like GPTQ, AWQ, and BitsAndBytes, raise `InvalidComponent`, silently breaking MFU reporting for quantized models.

This PR resolves the multiple `FIXME` comments in `vllm/v1/metrics/perf.py` requesting broader quantization method support:

- `FIXME: Add more parsing logic for different quant methods.`
- `FIXME: This is a hacky coarse-grained fp8 quantization detection.`

Changes

- Added a `_QUANT_WEIGHT_BYTE_SIZE` dict: a shared mapping of 22 quantization method names to their effective `weight_byte_size` (1 byte for FP8 variants, 0.5 bytes for INT4/FP4 variants), used by both parsers.
- Replaced the parsers' `if/elif/else` chains with lookups into this dict. Unknown methods still raise `InvalidComponent`, now with a descriptive error message including the method name.
- Added a test exercising `ModelMetrics` aggregation with a quantized config.

Newly supported methods

- 1 byte (FP8/INT8): `fp8`, `fbgemm_fp8`, `ptpc_fp8`, `fp_quant`, `modelopt`, `modelopt_mxfp8`, `experts_int8`
- 0.5 bytes (INT4/FP4): `mxfp4`, `awq`, `awq_marlin`, `gptq`, `gptq_marlin`, `bitsandbytes`, `modelopt_fp4`, `petit_nvfp4`, `gguf`, `compressed-tensors`, `torchao`, `quark`, `moe_wna16`, `inc`, `cpu_awq`
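The dictionary-lookup pattern described here can be sketched with a self-contained stand-in. The names `_QUANT_WEIGHT_BYTE_SIZE` and `InvalidComponent` mirror the PR, but this is an illustrative excerpt, not the actual code in `vllm/v1/metrics/perf.py`:

```python
# Illustrative stand-in for the shared mapping (excerpt, not vLLM source).
_QUANT_WEIGHT_BYTE_SIZE: dict[str, float] = {
    # FP8 / INT8 variants: 1 byte per weight element
    "fp8": 1, "fbgemm_fp8": 1, "ptpc_fp8": 1, "experts_int8": 1,
    # INT4 / FP4 variants: 0.5 bytes per weight element
    "mxfp4": 0.5, "awq": 0.5, "gptq": 0.5, "bitsandbytes": 0.5,
}


class InvalidComponent(Exception):
    """Raised when a quantization method cannot be parsed."""


def weight_byte_size(quant_method: str) -> float:
    """Single dict lookup replacing per-method if/elif/else chains."""
    try:
        return _QUANT_WEIGHT_BYTE_SIZE[quant_method]
    except KeyError:
        # Descriptive error that names the offending method.
        raise InvalidComponent(
            f"Unsupported quantization method: {quant_method!r}"
        ) from None


# For a 7B-parameter model, the weight footprint follows directly:
# weight_byte_size("gptq") * 7e9 -> 3.5e9 bytes (~3.5 GB vs ~14 GB at bf16)
```

Besides being shorter, the shared dict gives both parsers one place to register a new method and one consistent failure mode for everything else.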
Run the perf metrics test suite:
New tests added
- `test_quantization_config_parser_int4_methods[<method>]`: asserts `weight_byte_size == 0.5` for both the Attention and FFN parsers
- `test_quantization_config_parser_fp8_methods[<method>]`: asserts `weight_byte_size == 1` for both parsers
- `test_quantization_config_parser_unknown_method`: asserts that an unknown method raises `InvalidComponent`
- `test_quantized_model_metrics_aggregation`: verifies that `ModelMetrics` produces valid, consistent flops breakdowns with a GPTQ-quantized model config

Test Results
All existing tests continue to pass. The 4 new tests (expanding to 22+ via parametrization) verify correctness for every supported quantization method.
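The unknown-method case can be illustrated with a self-contained sketch. The real test goes through the vLLM parsers (e.g. `AttentionMetrics.get_parser().parse(...)` with a mock quant config); here a hypothetical bare lookup stands in for them so the snippet runs on its own:

```python
# Hypothetical stand-ins; the actual test exercises the vLLM parsers.
_QUANT_WEIGHT_BYTE_SIZE = {"fp8": 1, "gptq": 0.5}  # excerpt, illustrative


class InvalidComponent(Exception):
    pass


def parse_weight_byte_size(quant_method: str) -> float:
    if quant_method not in _QUANT_WEIGHT_BYTE_SIZE:
        raise InvalidComponent(
            f"Unsupported quantization method: {quant_method!r}"
        )
    return _QUANT_WEIGHT_BYTE_SIZE[quant_method]


def test_quantization_config_parser_unknown_method():
    """An unrecognized method must raise InvalidComponent, naming the method."""
    try:
        parse_weight_byte_size("made_up_method")
    except InvalidComponent as exc:
        assert "made_up_method" in str(exc)
    else:
        raise AssertionError("expected InvalidComponent")


test_quantization_config_parser_unknown_method()
```

Asserting that the error message names the method is what makes the failure actionable, in contrast to the previous silent `InvalidComponent` with no context.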
Essential Elements of an Effective PR Description Checklist
- Update `supported_models.md` and `examples` for a new model.