[Bugfix] Expand quantization method support in perf metrics#37231

Merged
markmc merged 4 commits into vllm-project:main from thillai-c:expand-quant-perf-metrics
Mar 18, 2026

Conversation

@thillai-c
Contributor

@thillai-c thillai-c commented Mar 16, 2026

Purpose

The MFU (Model Flops Utilization) metrics module uses AttentionQuantizationConfigParser and FfnQuantizationConfigParser to determine weight_byte_size for flops/memory estimation. Currently, only 3 quantization methods are supported (fp8, fbgemm_fp8, mxfp4). All other methods — including widely-used ones like GPTQ, AWQ, and BitsAndBytes — raise InvalidComponent, silently breaking MFU reporting for quantized models.
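As background, weight_byte_size feeds the memory side of the flops/bytes estimate. A back-of-the-envelope sketch of the idea (not vLLM code; assumes a SwiGLU-style FFN with three hidden×intermediate projection matrices):

```python
# Illustrative only: how a per-weight byte size drives the weight-memory
# estimate that MFU-style metrics use. ffn_weight_bytes is a hypothetical
# helper, not part of vllm/v1/metrics/perf.py.
def ffn_weight_bytes(hidden: int, intermediate: int,
                     weight_byte_size: float) -> float:
    # Gate, up, and down projections: 3 matrices of hidden x intermediate.
    num_params = 3 * hidden * intermediate
    return num_params * weight_byte_size
```

With hidden_size=2048 and intermediate_size=8192, a 0.5-byte method like GPTQ yields a 4x smaller FFN weight footprint than fp16 (2 bytes), so a parser that wrongly raises instead of returning 0.5 breaks the whole estimate.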

This PR resolves the multiple FIXME comments in vllm/v1/metrics/perf.py requesting broader quantization method support:

  • FIXME: Add more parsing logic for different quant methods.
  • FIXME: This is a hacky coarse-grained fp8 quantization detection.

Changes

  • Added _QUANT_WEIGHT_BYTE_SIZE dict — a shared mapping of 22 quantization method names to their effective weight_byte_size (1 byte for FP8/INT8 variants, 0.5 bytes for INT4/FP4 variants), used by both parsers.
  • Refactored both quantization parsers to use the shared dict lookup instead of separate if/elif/else chains. Unknown methods still raise InvalidComponent, now with a descriptive error message including the method name.
  • Added 4 new tests: parametrized tests covering all INT4/FP4 and FP8/INT8 methods, unknown method error handling, and end-to-end ModelMetrics aggregation with a quantized config.

Newly supported methods

Byte size       Methods
1 (FP8/INT8)    fp8, fbgemm_fp8, ptpc_fp8, fp_quant, modelopt, modelopt_mxfp8, experts_int8
0.5 (FP4/INT4)  mxfp4, awq, awq_marlin, gptq, gptq_marlin, bitsandbytes, modelopt_fp4, petit_nvfp4, gguf, compressed-tensors, torchao, quark, moe_wna16, inc, cpu_awq

Note: Methods like GPTQ and BitsAndBytes support variable bit-widths (e.g., 4-bit and 8-bit). We default to the most common configuration (4-bit). The existing FIXME comments about per-layer "ignored layers" handling remain as a separate concern for a follow-up.
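If a follow-up wanted to honor variable bit-widths rather than the 4-bit default, one hypothetical approach (weight_byte_size_from_bits is not part of this PR) would be to read the bit-width from a GPTQ/BitsAndBytes-style HF quantization config dict:

```python
# Hypothetical follow-up sketch: derive byte size from the bit-width carried
# in the model's quantization config, falling back to the 4-bit default that
# this PR assumes. The "bits" key matches GPTQ-style configs; other methods
# may store this differently.
def weight_byte_size_from_bits(quant_cfg: dict, default_bits: int = 4) -> float:
    bits = quant_cfg.get("bits", default_bits)
    return bits / 8
```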

Test Plan

Run the perf metrics test suite:

pytest tests/v1/metrics/test_perf_metrics.py -v

New tests added

  • test_quantization_config_parser_int4_methods[<method>]: parametrized across all 15 INT4/FP4 methods; asserts weight_byte_size == 0.5 for both the Attention and FFN parsers.
  • test_quantization_config_parser_fp8_methods[<method>]: parametrized across all 7 FP8/INT8 methods; asserts weight_byte_size == 1 for both parsers.
  • test_quantization_config_parser_unknown_method: verifies that an unrecognized quant method raises InvalidComponent.
  • test_quantized_model_metrics_aggregation: end-to-end test that ModelMetrics produces valid, consistent flops breakdowns with a GPTQ-quantized model config.

Test Results

All existing tests continue to pass. The 4 new tests (expanding to 22+ via parametrization) verify correctness for every supported quantization method.

tests/v1/metrics/test_perf_metrics.py::test_quantization_config_parser_int4_methods[mxfp4]        PASSED
tests/v1/metrics/test_perf_metrics.py::test_quantization_config_parser_int4_methods[awq]          PASSED
tests/v1/metrics/test_perf_metrics.py::test_quantization_config_parser_int4_methods[awq_marlin]   PASSED
tests/v1/metrics/test_perf_metrics.py::test_quantization_config_parser_int4_methods[gptq]         PASSED
tests/v1/metrics/test_perf_metrics.py::test_quantization_config_parser_int4_methods[gptq_marlin]  PASSED
tests/v1/metrics/test_perf_metrics.py::test_quantization_config_parser_int4_methods[bitsandbytes] PASSED
tests/v1/metrics/test_perf_metrics.py::test_quantization_config_parser_int4_methods[modelopt_fp4] PASSED
tests/v1/metrics/test_perf_metrics.py::test_quantization_config_parser_int4_methods[petit_nvfp4]  PASSED
tests/v1/metrics/test_perf_metrics.py::test_quantization_config_parser_int4_methods[gguf]         PASSED
tests/v1/metrics/test_perf_metrics.py::test_quantization_config_parser_int4_methods[compressed-tensors] PASSED
tests/v1/metrics/test_perf_metrics.py::test_quantization_config_parser_int4_methods[torchao]      PASSED
tests/v1/metrics/test_perf_metrics.py::test_quantization_config_parser_int4_methods[quark]        PASSED
tests/v1/metrics/test_perf_metrics.py::test_quantization_config_parser_int4_methods[moe_wna16]    PASSED
tests/v1/metrics/test_perf_metrics.py::test_quantization_config_parser_int4_methods[inc]          PASSED
tests/v1/metrics/test_perf_metrics.py::test_quantization_config_parser_int4_methods[cpu_awq]      PASSED
tests/v1/metrics/test_perf_metrics.py::test_quantization_config_parser_fp8_methods[fp8]           PASSED
tests/v1/metrics/test_perf_metrics.py::test_quantization_config_parser_fp8_methods[fbgemm_fp8]    PASSED
tests/v1/metrics/test_perf_metrics.py::test_quantization_config_parser_fp8_methods[ptpc_fp8]      PASSED
tests/v1/metrics/test_perf_metrics.py::test_quantization_config_parser_fp8_methods[fp_quant]      PASSED
tests/v1/metrics/test_perf_metrics.py::test_quantization_config_parser_fp8_methods[modelopt]      PASSED
tests/v1/metrics/test_perf_metrics.py::test_quantization_config_parser_fp8_methods[modelopt_mxfp8] PASSED
tests/v1/metrics/test_perf_metrics.py::test_quantization_config_parser_fp8_methods[experts_int8]  PASSED
tests/v1/metrics/test_perf_metrics.py::test_quantization_config_parser_unknown_method             PASSED
tests/v1/metrics/test_perf_metrics.py::test_quantized_model_metrics_aggregation                   PASSED

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@thillai-c thillai-c requested a review from markmc as a code owner March 16, 2026 21:46
@mergify mergify bot added v1 bug Something isn't working labels Mar 16, 2026
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request is a solid improvement that expands support for various quantization methods in the performance metrics module. The refactoring to use a shared dictionary for quantization method properties is a clean and maintainable approach. The accompanying tests are thorough and cover all the new additions. I have one suggestion to refactor the new tests to reduce code duplication and further improve maintainability.

Comment on lines +913 to +980
# INT4 / FP4 quantization methods (weight_byte_size == 0.5)
_INT4_FP4_METHODS = [
    m for m, s in _QUANT_WEIGHT_BYTE_SIZE.items() if s == 0.5
]


@pytest.mark.parametrize("quant_method", _INT4_FP4_METHODS)
def test_quantization_config_parser_int4_methods(quant_method):
    """Test quantization parsers with INT4/FP4 methods (0.5 bytes)."""

    class MockQuantConfig:
        def get_name(self):
            return quant_method

    hf_config = Qwen3Config(
        hidden_size=2048,
        num_attention_heads=16,
        intermediate_size=8192,
        num_hidden_layers=1,
    )
    vllm_config = create_mock_vllm_config(
        hf_config, quant_config=MockQuantConfig()
    )

    attn_result = AttentionMetrics.get_parser().parse(vllm_config)
    assert attn_result.weight_byte_size == 0.5, (
        f"Expected 0.5 for {quant_method}, got {attn_result.weight_byte_size}"
    )

    ffn_result = FfnMetrics.get_parser().parse(vllm_config)
    assert ffn_result.weight_byte_size == 0.5, (
        f"Expected 0.5 for {quant_method}, got {ffn_result.weight_byte_size}"
    )


# FP8 / INT8 quantization methods (weight_byte_size == 1)
_FP8_INT8_METHODS = [
    m for m, s in _QUANT_WEIGHT_BYTE_SIZE.items() if s == 1
]


@pytest.mark.parametrize("quant_method", _FP8_INT8_METHODS)
def test_quantization_config_parser_fp8_methods(quant_method):
    """Test quantization parsers with FP8/INT8 methods (1 byte)."""

    class MockQuantConfig:
        def get_name(self):
            return quant_method

    hf_config = Qwen3Config(
        hidden_size=2048,
        num_attention_heads=16,
        intermediate_size=8192,
        num_hidden_layers=1,
    )
    vllm_config = create_mock_vllm_config(
        hf_config, quant_config=MockQuantConfig()
    )

    attn_result = AttentionMetrics.get_parser().parse(vllm_config)
    assert attn_result.weight_byte_size == 1, (
        f"Expected 1 for {quant_method}, got {attn_result.weight_byte_size}"
    )

    ffn_result = FfnMetrics.get_parser().parse(vllm_config)
    assert ffn_result.weight_byte_size == 1, (
        f"Expected 1 for {quant_method}, got {ffn_result.weight_byte_size}"
    )
Contributor


Severity: high

The two new test functions, test_quantization_config_parser_int4_methods and test_quantization_config_parser_fp8_methods, are nearly identical. To improve maintainability and reduce code duplication, they can be combined into a single, more general parametrized test that iterates over all items in _QUANT_WEIGHT_BYTE_SIZE. This will also make it easier to add tests for new quantization sizes in the future.

@pytest.mark.parametrize("quant_method, expected_byte_size",
                         list(_QUANT_WEIGHT_BYTE_SIZE.items()))
def test_quantization_config_parser(quant_method, expected_byte_size):
    """Test quantization parsers with all supported methods."""

    class MockQuantConfig:
        def get_name(self):
            return quant_method

    hf_config = Qwen3Config(
        hidden_size=2048,
        num_attention_heads=16,
        intermediate_size=8192,
        num_hidden_layers=1,
    )
    vllm_config = create_mock_vllm_config(
        hf_config, quant_config=MockQuantConfig()
    )

    attn_result = AttentionMetrics.get_parser().parse(vllm_config)
    assert attn_result.weight_byte_size == expected_byte_size, (
        f"Expected {expected_byte_size} for {quant_method}, "
        f"got {attn_result.weight_byte_size}"
    )

    ffn_result = FfnMetrics.get_parser().parse(vllm_config)
    assert ffn_result.weight_byte_size == expected_byte_size, (
        f"Expected {expected_byte_size} for {quant_method}, "
        f"got {ffn_result.weight_byte_size}"
    )

@mergify

mergify bot commented Mar 16, 2026

Hi @thillai-c, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com>
@thillai-c thillai-c force-pushed the expand-quant-perf-metrics branch from 0722c5e to ff3a94b on March 16, 2026 22:05
Member

@markmc markmc left a comment


lgtm, thanks

@markmc markmc moved this from Backlog to Ready in Metrics & Tracing Mar 18, 2026
@markmc markmc added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 18, 2026
@thillai-c
Contributor Author

Hi @markmc, the CI pipeline takes ~2 hours to complete, and the PR repeatedly becomes out-of-date before I get a chance to merge.

Would you be able to enable auto-merge or help merge this once checks pass? That would help avoid rerunning the full pipeline again.

@markmc markmc enabled auto-merge (squash) March 18, 2026 22:33
@markmc markmc merged commit 828f862 into vllm-project:main Mar 18, 2026
47 checks passed
@github-project-automation github-project-automation bot moved this from Ready to Done in Metrics & Tracing Mar 18, 2026
ikaadil pushed a commit to ikaadil/vllm that referenced this pull request Mar 19, 2026
…ject#37231)

Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com>
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
ikaadil pushed a commit to ikaadil/vllm that referenced this pull request Mar 19, 2026
…ject#37231)

Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com>
Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
fxdawnn pushed a commit to fxdawnn/vllm that referenced this pull request Mar 19, 2026
…ject#37231)

Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com>
SouthWest7 pushed a commit to SouthWest7/vllm that referenced this pull request Mar 27, 2026
…ject#37231)

Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com>
khairulkabir1661 pushed a commit to khairulkabir1661/vllm that referenced this pull request Mar 27, 2026
…ject#37231)

Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com>
Monishver11 pushed a commit to Monishver11/vllm that referenced this pull request Mar 27, 2026
…ject#37231)

Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com>
Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
JiantaoXu pushed a commit to JiantaoXu/vllm that referenced this pull request Mar 28, 2026
…ject#37231)

Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com>
vrdn-23 pushed a commit to vrdn-23/vllm that referenced this pull request Mar 30, 2026
…ject#37231)

Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com>
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
EricccYang pushed a commit to EricccYang/vllm that referenced this pull request Apr 1, 2026
…ject#37231)

Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com>
Signed-off-by: EricccYang <yangyang4991@gmail.com>

Labels

bug Something isn't working ready ONLY add when PR is ready to merge/full CI is needed v1

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants