[Quantization] Support pre-load online quantization for compressed-tensors W8A8 channel-wise schema #27280
Conversation
| @@ -196,6 +197,9 @@ class ModelConfig: | |||
| `quantization_config` attribute in the model config file. If that is | |||
| `None`, we assume the model weights are not quantized and use `dtype` to | |||
| determine the data type of the weights.""" | |||
| quantization_schema: str | None = None | |||
can we just use quantization and hf_overrides to specify the config? like #23014

I feel the user experience might be too complicated for fp8_channelwise, so I added a schema name for a pre-defined config here.

I guess it's fine to use a string, maybe reuse hf_overrides to specify the string?
| quantization == "compressed-tensors" | ||
| and quantization_schema == "fp8_channelwise" | ||
| ): | ||
| return { |
yeah, I guess this could be passed as a JSON string, or a file that stores the serialized JSON string; otherwise we'd need to invent a quantization_schema name for each setting.

JSON configs can also be passed with `.` notation, i.e. --quantization-scheme.format float-quantize --quantization-scheme.quant_method compressed-tensors ..., but this would be extremely verbose for a CLI. Perhaps passing it as a file, as @jerryzh168 suggested, is a good idea?
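As a rough sketch of the file-or-string option being discussed (a hypothetical helper, not part of vLLM), the CLI could accept either an inline JSON string or a path to a JSON file:

```python
import json
import os


def parse_quant_config(value: str) -> dict:
    """Accept either an inline JSON string or a path to a JSON file.

    Hypothetical helper for illustration only; not vLLM API.
    """
    if os.path.isfile(value):
        with open(value) as f:
            return json.load(f)
    return json.loads(value)
```

This keeps the common case (a short inline string) terse while still allowing large configs to live in a file.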
I feel it's not a good user experience if users have to define these settings in a file or pass them through args, so I think pre-defined schemas should be fine for frequently used settings @hmellor @jerryzh168
I think it's OK to have predefined strings, but can this live in hf_overrides instead of adding a new quantization_schema in parallel to quantization?
later we may add fp8_blockwise; without a separate field we could not differentiate them. The current quantization values are all compressed-tensors, which is not enough to differentiate different schemas of the same quantization method.
ah, I see, do you mean we specify a field in the hf config to differentiate?
> ah, I see, do you mean we specify a field in the hf config to differentiate?

yeah, I think just putting this in hf_overrides is cleaner, since you already have some other configs there, and this is specific to llm-compressor.
Documentation preview: https://vllm--27280.org.readthedocs.build/en/27280/
Signed-off-by: Lu Fang <fanglu@fb.com>
Force-pushed from f2b24a4 to 27e66ac (Compare)
vllm/config/model.py
Outdated
| hf_overrides_kw = {} | ||
| dict_overrides = {} | ||
| if quant_config_override: | ||
| dict_overrides["quantization_config"] = quant_config_override |
Should this warn when overriding a non-null quantization configuration? Otherwise it may be unclear which one takes precedence.
@HDCharles thanks for the comments, I moved the call to get_default_quantization_hf_config closer to this line and also added a warning to the original get_default_quantization_hf_config method.
| if isinstance(value, dict): | ||
| dict_overrides[key] = value | ||
| if dict_overrides.get(key): | ||
| dict_overrides[key] = {**dict_overrides[key], **value} |
this seems rather convoluted.

We're assuming dict_overrides[key] is always a mapping whenever dict_overrides.get(key) and isinstance(value, dict) are true, which doesn't seem obvious.

As for the logic: we take the dict we assume dict_overrides[key] holds and merge the value dict into it, letting value take precedence wherever there are conflicts. It feels like we should be taking one or the other, i.e. overriding things, rather than doing this merge operation.

Also, there is still no warning when one thing overrides another.
| enforce_eager=True, | ||
| dtype="bfloat16", | ||
| hf_overrides={ | ||
| "quantization_config": {"ignore": ["re:.*self_attn", "re:.*lm_head"]} |
i.e. here:

hf_overrides={
    "llmc_quantization_schema": "fp8_channelwise",
    "llmc_quantization_config": {"ignore": ["re:.*self_attn", "re:.*lm_head"]},
}
| llm = LLM( | ||
| "meta-llama/Llama-3-8B-Instruct", | ||
| quantization="compressed-tensors", | ||
| quantization_schema="fp8_channelwise" |
Shouldn't this match the per-tensor API? It's super weird that
quantization="fp8" does per-tensor while
quantization="compressed-tensors", quantization_schema="fp8_channelwise" does per-channel.
Why not just quantization="fp8_channelwise"?
good point, let's make the adjustment to the interface
HDCharles
left a comment
- At a high level, the API for online per-tensor fp8 should match the one for online per-channel fp8.
- It looks like this PR takes a completely different path than the existing per-tensor fp8 online support, creating entirely new preprocessing steps. Rather than having two unrelated paths for these techniques, it'd be much cleaner to either support both with the new abstractions or just do whatever per-tensor did initially.
| return unfused_matches[0] if all(unfused_matches) else None | ||
| def fp8_channelwise_quantize(x: Tensor, channel_dim: int = -1) -> tuple[Tensor, Tensor]: |
is there a higher-level function in compressed-tensors to do this so we don't have to redefine it here? @kylesayrs (since I'm very much a n00b in compressed-tensors)
@luccafong Can this be closed now that #29196 has landed?
Summary:

vllm-project#29196 implemented streaming weight post-processing for online fp8 quant but did not actually reduce peak memory, because the linear/moe weights were created in bf16 and references to them were held for the entire `load_weights` loop in model loaders. This PR fixes it by changing fp8 online quant to create zero-sized weights in `create_weights`, and materialize them to the correct size just-in-time in `patched_weight_loader`.

I would note that this PR is a bit hacky; there are two more proper ways to fix this that I can think of, both with a much wider blast radius:

- change weight creation in vllm to be materialized just-in-time (same as this PR, just explicit instead of hacky callables)
- or, add an extension point for post-processing the weight before loading it (similar to vllm-project#27280)

fixes vllm-project#31805

Test Plan: inspect memory usage inside of `load_weights` and verify that it increases ~monotonically as weights are loaded

```bash
# dense
python3 examples/offline_inference/basic/generate.py --model facebook/opt-125m --enforce-eager --dtype=bfloat16 --max_model_len=2048 --quantization=fp8
# moe
CUDA_VISIBLE_DEVICES=7 python3 examples/offline_inference/basic/generate.py --model Qwen/Qwen3-30B-A3B --enforce-eager --dtype=bfloat16 --block-size=64 --max_model_len=2048 --gpu-memory-utilization=0.8 --trust-remote-code --quantization=fp8
```

Signed-off-by: vasiliy <vasiliy@fb.com>
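The zero-sized-then-materialize idea described above can be sketched roughly as follows. Names like `create_deferred_weight`, `_deferred_shape`, and the `quantize` callback are illustrative only, not the actual vLLM or PR code:

```python
import torch


def create_deferred_weight(
    shape: tuple[int, ...], dtype: torch.dtype = torch.bfloat16
) -> torch.nn.Parameter:
    # Register a zero-sized placeholder so no bf16 memory is held up front.
    param = torch.nn.Parameter(torch.empty(0, dtype=dtype), requires_grad=False)
    param._deferred_shape = tuple(shape)  # illustrative attribute, not vLLM API
    return param


def patched_weight_loader(
    param: torch.nn.Parameter, loaded: torch.Tensor, quantize
):
    # Materialize at full size only now, quantize, and let the bf16 copy die.
    assert tuple(loaded.shape) == param._deferred_shape
    q, scale = quantize(loaded)
    param.data = q
    return scale
```

The point of the sketch is that the full-precision tensor only exists transiently inside the loader call, instead of being allocated for every layer at `create_weights` time.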
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
Purpose
Support pre-load online quantization for compressed-tensors W8A8 channel-wise schema on bf16 ckpt.
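As a rough illustration of the pre-load idea (illustrative helper names, not the PR's actual code), each weight is quantized as it streams in, so the full-precision copies never accumulate:

```python
import torch


def load_with_preload_quant(params: dict, weight_iter, quantize):
    """Quantize each streamed weight before binding it to the module.

    Illustrative sketch: `params` maps names to placeholder Parameters,
    `weight_iter` yields (name, bf16 tensor), and `quantize` returns
    (quantized_tensor, scale). Peak memory stays near the quantized footprint.
    """
    scales = {}
    for name, w in weight_iter:
        q, scale = quantize(w)
        params[name].data = q
        scales[name] = scale
        del w  # drop the bf16 copy immediately
    return scales
```

This is the opposite of `process_weights_after_loading`, which only quantizes once every bf16 weight is already resident.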
Instead of quantizing weights in `process_weights_after_loading`, we add quantization through `process_weights_before_loading` for cases where the GPU cannot hold the original-dtype weights and CPU offloading is too slow. This approach quantizes each weight while loading, which enables online dynamic quantization of Llama4 Maverick raw BF16 on H100, not doable before.

Test Plan
Online Serving
UT/CI Tests
Added both llama3 and Qwen to guard linear and MOE models.
Test Result
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` updated for a new model.