[Bugfix] Expand quantization method support in perf metrics #37231

markmc merged 4 commits into vllm-project:main
Conversation
Code Review
This pull request is a solid improvement that expands support for various quantization methods in the performance metrics module. The refactoring to use a shared dictionary for quantization method properties is a clean and maintainable approach. The accompanying tests are thorough and cover all the new additions. I have one suggestion to refactor the new tests to reduce code duplication and further improve maintainability.
```python
# INT4 / FP4 quantization methods (weight_byte_size == 0.5)
_INT4_FP4_METHODS = [
    m for m, s in _QUANT_WEIGHT_BYTE_SIZE.items() if s == 0.5
]


@pytest.mark.parametrize("quant_method", _INT4_FP4_METHODS)
def test_quantization_config_parser_int4_methods(quant_method):
    """Test quantization parsers with INT4/FP4 methods (0.5 bytes)."""

    class MockQuantConfig:
        def get_name(self):
            return quant_method

    hf_config = Qwen3Config(
        hidden_size=2048,
        num_attention_heads=16,
        intermediate_size=8192,
        num_hidden_layers=1,
    )
    vllm_config = create_mock_vllm_config(
        hf_config, quant_config=MockQuantConfig()
    )

    attn_result = AttentionMetrics.get_parser().parse(vllm_config)
    assert attn_result.weight_byte_size == 0.5, (
        f"Expected 0.5 for {quant_method}, got {attn_result.weight_byte_size}"
    )

    ffn_result = FfnMetrics.get_parser().parse(vllm_config)
    assert ffn_result.weight_byte_size == 0.5, (
        f"Expected 0.5 for {quant_method}, got {ffn_result.weight_byte_size}"
    )


# FP8 / INT8 quantization methods (weight_byte_size == 1)
_FP8_INT8_METHODS = [
    m for m, s in _QUANT_WEIGHT_BYTE_SIZE.items() if s == 1
]


@pytest.mark.parametrize("quant_method", _FP8_INT8_METHODS)
def test_quantization_config_parser_fp8_methods(quant_method):
    """Test quantization parsers with FP8/INT8 methods (1 byte)."""

    class MockQuantConfig:
        def get_name(self):
            return quant_method

    hf_config = Qwen3Config(
        hidden_size=2048,
        num_attention_heads=16,
        intermediate_size=8192,
        num_hidden_layers=1,
    )
    vllm_config = create_mock_vllm_config(
        hf_config, quant_config=MockQuantConfig()
    )

    attn_result = AttentionMetrics.get_parser().parse(vllm_config)
    assert attn_result.weight_byte_size == 1, (
        f"Expected 1 for {quant_method}, got {attn_result.weight_byte_size}"
    )

    ffn_result = FfnMetrics.get_parser().parse(vllm_config)
    assert ffn_result.weight_byte_size == 1, (
        f"Expected 1 for {quant_method}, got {ffn_result.weight_byte_size}"
    )
```
The two new test functions, `test_quantization_config_parser_int4_methods` and `test_quantization_config_parser_fp8_methods`, are nearly identical. To improve maintainability and reduce code duplication, they can be combined into a single, more general parametrized test that iterates over all items in `_QUANT_WEIGHT_BYTE_SIZE`. This will also make it easier to add tests for new quantization sizes in the future.
```python
@pytest.mark.parametrize("quant_method, expected_byte_size",
                         list(_QUANT_WEIGHT_BYTE_SIZE.items()))
def test_quantization_config_parser(quant_method, expected_byte_size):
    """Test quantization parsers with all supported methods."""

    class MockQuantConfig:
        def get_name(self):
            return quant_method

    hf_config = Qwen3Config(
        hidden_size=2048,
        num_attention_heads=16,
        intermediate_size=8192,
        num_hidden_layers=1,
    )
    vllm_config = create_mock_vllm_config(
        hf_config, quant_config=MockQuantConfig()
    )

    attn_result = AttentionMetrics.get_parser().parse(vllm_config)
    assert attn_result.weight_byte_size == expected_byte_size, (
        f"Expected {expected_byte_size} for {quant_method}, "
        f"got {attn_result.weight_byte_size}"
    )

    ffn_result = FfnMetrics.get_parser().parse(vllm_config)
    assert ffn_result.weight_byte_size == expected_byte_size, (
        f"Expected {expected_byte_size} for {quant_method}, "
        f"got {ffn_result.weight_byte_size}"
    )
```
Hi @thillai-c, the pre-commit checks have failed. Please run `uv pip install "pre-commit>=4.5.1"`, then `pre-commit install`, then `pre-commit run --all-files`. After that, commit the changes and push to your branch. Once `pre-commit install` has been run, the hooks will run automatically on future commits.
Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com>
force-pushed from 0722c5e to ff3a94b
Hi @markmc, the CI pipeline takes ~2 hours to complete, and the PR repeatedly becomes out-of-date before I get a chance to merge. Would you be able to enable auto-merge or help merge this once checks pass? That would help avoid rerunning the full pipeline again.
…ject#37231) Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com> Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
…ject#37231) Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com>
…ject#37231) Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com> Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
…ject#37231) Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com> Signed-off-by: Vinay Damodaran <vrdn@hey.com>
…ject#37231) Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com> Signed-off-by: EricccYang <yangyang4991@gmail.com>
Purpose
The MFU (Model Flops Utilization) metrics module uses `AttentionQuantizationConfigParser` and `FfnQuantizationConfigParser` to determine `weight_byte_size` for flops/memory estimation. Currently, only 3 quantization methods are supported (`fp8`, `fbgemm_fp8`, `mxfp4`). All other methods, including widely used ones like GPTQ, AWQ, and BitsAndBytes, raise `InvalidComponent`, silently breaking MFU reporting for quantized models.

This PR resolves the multiple `FIXME` comments in `vllm/v1/metrics/perf.py` requesting broader quantization method support:

- `FIXME: Add more parsing logic for different quant methods.`
- `FIXME: This is a hacky coarse-grained fp8 quantization detection.`

Changes

- Added a `_QUANT_WEIGHT_BYTE_SIZE` dict: a shared mapping of 22 quantization method names to their effective `weight_byte_size` (1 byte for FP8 variants, 0.5 bytes for INT4/FP4 variants), used by both parsers.
- Replaced the parsers' `if/elif/else` chains with lookups into this dict. Unknown methods still raise `InvalidComponent`, now with a descriptive error message including the method name.
- Added a test exercising `ModelMetrics` aggregation with a quantized config.

Newly supported methods

- 1 byte (FP8/INT8): `fp8`, `fbgemm_fp8`, `ptpc_fp8`, `fp_quant`, `modelopt`, `modelopt_mxfp8`, `experts_int8`
- 0.5 bytes (INT4/FP4): `mxfp4`, `awq`, `awq_marlin`, `gptq`, `gptq_marlin`, `bitsandbytes`, `modelopt_fp4`, `petit_nvfp4`, `gguf`, `compressed-tensors`, `torchao`, `quark`, `moe_wna16`, `inc`, `cpu_awq`
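The dictionary-lookup pattern described here can be sketched with a self-contained stand-in. The names `_QUANT_WEIGHT_BYTE_SIZE` and `InvalidComponent` mirror the PR, but this is an illustrative excerpt, not the actual code in `vllm/v1/metrics/perf.py`:

```python
# Illustrative stand-in for the shared mapping (excerpt, not vLLM source).
_QUANT_WEIGHT_BYTE_SIZE: dict[str, float] = {
    # FP8 / INT8 variants: 1 byte per weight element
    "fp8": 1, "fbgemm_fp8": 1, "ptpc_fp8": 1, "experts_int8": 1,
    # INT4 / FP4 variants: 0.5 bytes per weight element
    "mxfp4": 0.5, "awq": 0.5, "gptq": 0.5, "bitsandbytes": 0.5,
}


class InvalidComponent(Exception):
    """Raised when a quantization method cannot be parsed."""


def weight_byte_size(quant_method: str) -> float:
    """Single dict lookup replacing per-method if/elif/else chains."""
    try:
        return _QUANT_WEIGHT_BYTE_SIZE[quant_method]
    except KeyError:
        # Descriptive error that names the offending method.
        raise InvalidComponent(
            f"Unsupported quantization method: {quant_method!r}"
        ) from None


# For a 7B-parameter model, the weight footprint follows directly:
# weight_byte_size("gptq") * 7e9 -> 3.5e9 bytes (~3.5 GB vs ~14 GB at bf16)
```

Besides being shorter, the shared dict gives both parsers one place to register a new method and one consistent failure mode for everything else.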
Run the perf metrics test suite:
New tests added
- `test_quantization_config_parser_int4_methods[<method>]`: asserts `weight_byte_size == 0.5` for both the Attention and FFN parsers
- `test_quantization_config_parser_fp8_methods[<method>]`: asserts `weight_byte_size == 1` for both parsers
- `test_quantization_config_parser_unknown_method`: asserts that an unknown method raises `InvalidComponent`
- `test_quantized_model_metrics_aggregation`: verifies that `ModelMetrics` produces valid, consistent flops breakdowns with a GPTQ-quantized model config

Test Results
All existing tests continue to pass. The 4 new tests (expanding to 22+ via parametrization) verify correctness for every supported quantization method.
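The unknown-method case can be illustrated with a self-contained sketch. The real test goes through the vLLM parsers (e.g. `AttentionMetrics.get_parser().parse(...)` with a mock quant config); here a hypothetical bare lookup stands in for them so the snippet runs on its own:

```python
# Hypothetical stand-ins; the actual test exercises the vLLM parsers.
_QUANT_WEIGHT_BYTE_SIZE = {"fp8": 1, "gptq": 0.5}  # excerpt, illustrative


class InvalidComponent(Exception):
    pass


def parse_weight_byte_size(quant_method: str) -> float:
    if quant_method not in _QUANT_WEIGHT_BYTE_SIZE:
        raise InvalidComponent(
            f"Unsupported quantization method: {quant_method!r}"
        )
    return _QUANT_WEIGHT_BYTE_SIZE[quant_method]


def test_quantization_config_parser_unknown_method():
    """An unrecognized method must raise InvalidComponent, naming the method."""
    try:
        parse_weight_byte_size("made_up_method")
    except InvalidComponent as exc:
        assert "made_up_method" in str(exc)
    else:
        raise AssertionError("expected InvalidComponent")


test_quantization_config_parser_unknown_method()
```

Asserting that the error message names the method is what makes the failure actionable, in contrast to the previous silent `InvalidComponent` with no context.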
Essential Elements of an Effective PR Description Checklist
- Update `supported_models.md` and `examples` for a new model.