
[Quantization] Support pre-load online quantization for compressed-tensors W8A8 channel-wise schema#27280

Open
luccafong wants to merge 4 commits intovllm-project:mainfrom
luccafong:fp8_channelwise_quantization_online

Conversation

@luccafong
Collaborator

@luccafong luccafong commented Oct 21, 2025

Purpose

Support pre-load online quantization for the compressed-tensors W8A8 channel-wise schema on BF16 checkpoints.

  • Memory Optimization: This differs from the existing online quantization approach, which quantizes after loading via process_weights_after_loading. Here we quantize weights via process_weights_before_loading, for cases where the GPU cannot hold the original-dtype weights and CPU offloading is too slow. Each weight is quantized as it is loaded, which enables online dynamic quantization of raw BF16 Llama4 Maverick on H100 — previously not possible.
  • Extensible: The PR implements compressed-tensors fp8 channelwise (same as FP_dynamic in offline quantization), and the approach extends to other quantization methods that implement process_weights_before_loading.
  • MoE and Linear Support: Supports both MoE and linear layers.
  • Llama4-specific optimization: Llama4 transposes/chunks fused weights in its weight loader; this PR also improves Llama4 model loading by copying weights to device before the transpose/chunk happens, avoiding an expensive contiguous() call (model loading time drops from 40 minutes to 2 minutes).
 --quantization compressed-tensors \
--quantization-schema fp8_channelwise \
--hf-overrides '{"quantization_config":{"ignore":["re:.*self_attn","re:.*lm_head","re:.*router","re:.*vision_model","re:.*multi_modal_projector","re:.*feed_forward.gate_up_proj","re:.*feed_forward.down_proj", "re:.*shared_expert"]}}'
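The quantize-while-loading flow described above can be sketched in pure Python. This is a minimal illustration, not vLLM's actual API: `PreLoadQuantLinear`, `load_row`, and `quantize_channelwise` are invented names; the point is only that each weight is quantized the moment the loader delivers it, so the full-precision tensor never needs to stay resident.

```python
# Illustrative sketch of quantize-while-loading (not vLLM's actual API).

FP8_MAX = 448.0  # max magnitude representable in float8 e4m3

def quantize_channelwise(row):
    """Quantize one output channel to fp8-range integers plus a scale."""
    scale = max(abs(v) for v in row) / FP8_MAX or 1.0
    q = [max(-FP8_MAX, min(FP8_MAX, round(v / scale))) for v in row]
    return q, scale

class PreLoadQuantLinear:
    """Toy layer whose weight loader quantizes each incoming row
    immediately, mimicking process_weights_before_loading semantics."""

    def __init__(self, out_features, in_features):
        self.qweight = [None] * out_features
        self.scales = [None] * out_features

    def load_row(self, idx, row):
        # Quantize before storing -- the high-precision row is dropped here.
        self.qweight[idx], self.scales[idx] = quantize_channelwise(row)

layer = PreLoadQuantLinear(2, 3)
layer.load_row(0, [0.5, -1.0, 0.25])
layer.load_row(1, [2.0, 0.0, -2.0])
```

Dequantizing a value is then just `qweight[i][j] * scales[i]`.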

Test Plan

Online Serving

vllm serve /data/local/models/oss/Llama-4-Maverick-17B-128E-Instruct -tp 8 --quantization compressed-tensors --quantization-schema fp8_channelwise --hf-overrides '{"quantization_config":{"ignore":["re:.*self_attn","re:.*lm_head","re:.*router","re:.*vision_model","re:.*multi_modal_projector","re:.*feed_forward.gate_up_proj","re:.*feed_forward.down_proj", "re:.*shared_expert"]}}' --max_num_seqs 32 --max-model-len 32768

UT/CI Tests

pytest tests/quantization/test_fp8_channelwise.py
pytest tests/quantization/test_compressed_tensors.py -k "test_compressed_tensors_fp8_online_quantization_channelwise"

Added tests with both Llama3 and Qwen to guard linear and MoE models.

Test Result

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.940|±  |0.0168|
|     |       |strict-match    |     5|exact_match|↑  |0.945|±  |0.0162|

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added the llama Related to Llama models label Oct 21, 2025
@luccafong luccafong changed the title [Quantizon] Support compressed-tensors W8A8 channelwise online quantization [Quantizion] Support compressed-tensors W8A8 channelwise online quantization Oct 21, 2025
@@ -196,6 +197,9 @@ class ModelConfig:
`quantization_config` attribute in the model config file. If that is
`None`, we assume the model weights are not quantized and use `dtype` to
determine the data type of the weights."""
quantization_schema: str | None = None
Contributor

can we just use quantization and hf_overrides to specify the config? like #23014

Collaborator Author

I feel the user experience might be too complicated for fp8_channelwise, so I added a schema for the pre-defined config here.

Contributor

I guess it's fine to use a string, maybe reuse hf_override to specify the string?

quantization == "compressed-tensors"
and quantization_schema == "fp8_channelwise"
):
return {
Contributor

yeah I guess this could be passed as a json string, or a file that stores the serialized json string

otherwise we'd need to invent quantization_schema name for each settings

Member

@hmellor hmellor Oct 23, 2025

JSON configs can also be passed with . notation. i.e. --quantization-scheme.format float-quantize --quantization-scheme.quant_method compressed-tensors ..., but this will be extremely verbose for a CLI.

Perhaps as a file is a good idea as @jerryzh168 suggested?

Collaborator Author

I feel it's not a good user experience if users have to define these settings in a file or pass them through args, so I think pre-defined schemas should be fine for frequently used configurations @hmellor @jerryzh168

Contributor

I think it's OK to have predefined strings, but can this live in hf_overrides, instead of adding a new quantization_schema in parallel to quantization?

Collaborator Author

@luccafong luccafong Nov 12, 2025

later we may add fp8_blockwise; without a separate field we could not differentiate. The current quantization values are all compressed-tensors, which is not enough to distinguish different schemas of the same quantization method

Collaborator Author

ah, I see, do you mean we specify a field in hf config to differentiate?

Contributor

ah, I see, do you mean we specify a field in hf config to differentiate?

yeah, just putting this in hf_override is cleaner I think, since you already have some other configs there, and this is specific to llm-compressor

@mergify

mergify bot commented Oct 21, 2025

Documentation preview: https://vllm--27280.org.readthedocs.build/en/27280/

@mergify mergify bot added the documentation Improvements or additions to documentation label Oct 21, 2025
@luccafong luccafong changed the title [Quantizion] Support compressed-tensors W8A8 channelwise online quantization [Quantizaion] Support pre-load online quantization for compressed-tensors W8A8 channel-wise schema Oct 21, 2025
@luccafong luccafong changed the title [Quantizaion] Support pre-load online quantization for compressed-tensors W8A8 channel-wise schema [Quantization] Support pre-load online quantization for compressed-tensors W8A8 channel-wise schema Oct 21, 2025
@luccafong luccafong marked this pull request as ready for review October 21, 2025 22:49
Signed-off-by: Lu Fang <fanglu@fb.com>
Signed-off-by: Lu Fang <fanglu@fb.com>
@luccafong luccafong force-pushed the fp8_channelwise_quantization_online branch from f2b24a4 to 27e66ac Compare November 11, 2025 19:53
hf_overrides_kw = {}
dict_overrides = {}
if quant_config_override:
dict_overrides["quantization_config"] = quant_config_override
Contributor

Should this warn if overriding non null quantization configuration? Otherwise may be unclear which takes precedence

Collaborator Author

@HDCharles thanks for the comments, I moved the get_default_quantization_hf_config call closer to this line and also added a warning to the original get_default_quantization_hf_config method

Signed-off-by: Lu Fang <fanglu@fb.com>
if isinstance(value, dict):
dict_overrides[key] = value
if dict_overrides.get(key):
dict_overrides[key] = {**dict_overrides[key], **value}
Contributor

@HDCharles HDCharles Nov 12, 2025

this seems rather convoluted.

We're assuming dict_overrides[key] is always a mapping as long as dict_overrides.get(key) and isinstance(value, dict) are true which doesn't seem obvious

as far as the logic:

we take the dict that we assume we get from dict_overrides[key] and add the value dict to it, letting value take precedence where there are any conflicts.

Feels like we should be taking one or the other i.e. overriding things rather than this merge operation.

also still no warning when one thing overrides another.
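One way to make precedence explicit, as asked above, is a merge helper that warns whenever an override replaces a non-null value. The `merge_overrides` below is a hypothetical sketch of that shape, not code from this PR:

```python
import warnings

def merge_overrides(base: dict, override: dict) -> dict:
    """Shallow-merge override into base, warning on every replaced key so
    precedence is visible to the user (hypothetical helper)."""
    merged = dict(base)
    for key, value in override.items():
        if key in merged and merged[key] is not None and merged[key] != value:
            warnings.warn(
                f"hf_overrides: replacing existing {key!r}={merged[key]!r} "
                f"with {value!r}"
            )
        merged[key] = value
    return merged

cfg = merge_overrides(
    {"quantization_config": {"quant_method": "fp8"}},
    {"quantization_config": {"quant_method": "compressed-tensors"}},
)
```

Here the override wins wholesale and the user sees a warning, instead of a silent key-by-key merge.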

enforce_eager=True,
dtype="bfloat16",
hf_overrides={
"quantization_config": {"ignore": ["re:.*self_attn", "re:.*lm_head"]}
Contributor

i.e. here:

hf_overrides={
    "llmc_quantization_schema": "fp8_channelwise",
    "llmc_quantization_config": {"ignore": ["re:.*self_attn", "re:.*lm_head"]}
}

llm = LLM(
"meta-llama/Llama-3-8B-Instruct",
quantization="compressed-tensors",
quantization_schema="fp8_channelwise"
Contributor

Shouldn't this match the per tensor API?

Super weird that

quantization="fp8" does per tensor while
quantization="compressed-tensors", quantization_schema="fp8_channelwise" does per channel

Why not just quantization="fp8_channelwise"?

Collaborator Author

good point, let's make the adjustment for the interface

Contributor

@HDCharles HDCharles left a comment

  1. At a high level, the API for online per tensor fp8 should match the one for online per channel fp8.

  2. It looks like this PR takes a completely different path than the existing per-tensor fp8 online support, creating entirely new preprocessing steps. Rather than having two unrelated paths for these techniques, it'd be much cleaner to either support both techniques with the new abstractions or just do whatever per-tensor did initially.

return unfused_matches[0] if all(unfused_matches) else None


def fp8_channelwise_quantize(x: Tensor, channel_dim: int = -1) -> tuple[Tensor, Tensor]:
Contributor

is there a higher level function in compressed-tensors to do this so we don't have to redefine it here? @kylesayrs since I'm very n00b in compressed-tensors
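For reference, the per-channel absmax scale computation such a helper typically performs can be written out as follows; `channel_scales` is illustrative only, not the compressed-tensors API:

```python
# Illustrative per-channel absmax scales; not the compressed-tensors API.
FP8_E4M3_MAX = 448.0

def channel_scales(matrix, channel_dim=-1):
    """channel_dim=0 treats each row as a channel; channel_dim=-1 treats
    each column as a channel."""
    groups = matrix if channel_dim == 0 else list(zip(*matrix))
    return [max(abs(v) for v in g) / FP8_E4M3_MAX for g in groups]

m = [[1.0, -4.0], [2.0, 0.5]]
row_scales = channel_scales(m, channel_dim=0)   # one scale per output channel
col_scales = channel_scales(m, channel_dim=-1)  # one scale per input channel
```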

@kylesayrs
Contributor

@luccafong Can this be closed now that #29196 has landed?

vkuzo added a commit to vkuzo/vllm that referenced this pull request Jan 23, 2026
Summary:

vllm-project#29196 implemented streaming
weight post-processing for online fp8 quant but did not actually reduce peak
memory, because the linear|moe weights were created in bf16 and references to
them were held for the entire `load_weights` loop in model loaders.

this PR fixes it by changing fp8 online quant to create zero-sized weights in
`create_weights`, and materialize them to the correct size just-in-time
in `patched_weight_loader`.

I would note that this PR is a bit hacky, and there are two more proper ways to
fix this that I can think of, both with a much wider blast radius:
- 1: change weight creation in vllm to be materialized just-in-time
  (same as this PR, just explicit instead of hacky callables)
- 2: or, add an extension point for post-processing the weight before
  loading it (similar to vllm-project#27280)

fixes vllm-project#31805

Test Plan:

inspect memory usage inside of `load_weights` and verify that it
increases ~monotonically as weights are loaded

```bash
// dense
python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --enforce-eager --dtype=bfloat16 --max_model_len=2048 --quantization=fp8
// moe
CUDA_VISIBLE_DEVICES=7 python3 examples/offline_inference/basic/generate.py --model Qwen/Qwen3-30B-A3B  --enforce-eager --dtype=bfloat16 --block-size=64 --max_model_len=2048 --gpu-memory-utilization=0.8 --trust-remote-code --quantization=fp8
```

Reviewers:

Subscribers:

Tasks:

Tags:

Signed-off-by: vasiliy <vasiliy@fb.com>
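The create-zero-sized-then-materialize-just-in-time idea from the commit message above can be sketched like this; `LazyWeight` and its methods are invented for illustration, not the actual vLLM code:

```python
# Sketch of zero-sized weight creation with just-in-time materialization.

class LazyWeight:
    """Placeholder allocated with no storage; real data is attached only
    when the loader delivers it, then immediately quantized, so the bf16
    copy is never retained."""

    def __init__(self, shape):
        self.shape = shape
        self.data = None  # zero-sized until load time

    def materialize_and_quantize(self, values, scale):
        assert len(values) == self.shape[0] * self.shape[1]
        self.data = [round(v / scale) for v in values]  # quantized storage

w = LazyWeight((2, 2))
assert w.data is None  # nothing allocated before loading
w.materialize_and_quantize([0.5, 1.0, -0.5, 0.0], scale=0.5)
```

Because the placeholder holds no storage, peak memory tracks the quantized size rather than the bf16 size during the load loop.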
@github-actions

github-actions bot commented Mar 9, 2026

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

@github-actions github-actions bot added the stale Over 90 days of inactivity label Mar 9, 2026