Merged
kylesayrs commented on Oct 29, 2025
kylesayrs commented on Nov 14, 2025
kylesayrs commented on Nov 15, 2025
kylesayrs (Collaborator, Author) left a comment:
Approve from my side
HDCharles previously approved these changes on Nov 17, 2025
fynnsu previously approved these changes on Nov 17, 2025
fynnsu (Collaborator) left a comment:
Looks good, added a couple comments below!
HDCharles reviewed on Nov 25, 2025
Force-pushed from bdcdca4 to 4e480ce
HDCharles reviewed on Dec 8, 2025
kylesayrs commented on Dec 9, 2025
Force-pushed from 9d2d033 to 57ade6b
kylesayrs commented on Dec 10, 2025
kylesayrs (Collaborator, Author) left a comment:
LGTM, approved from my side
Force-pushed from 57ade6b to 157cf48
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Signed-off-by: Brian Dellabetta <bdellabe@redhat.com>
Summary Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
Force-pushed from 157cf48 to 7d79953
HDCharles approved these changes on Dec 10, 2025
fynnsu approved these changes on Dec 11, 2025
dsikka pushed a commit that referenced this pull request on Jan 23, 2026:
We identified a number of inefficiencies and fixes after the AWQ generalization [PR](#1961); this PR largely implements them, see details below. Note: I previously made this [speed improvements PR](#2188), which had some issues that have been fixed in this one; that PR is going to be closed.

# BENCHMARKS

To iterate more quickly, I ran these tests on models with most of their [layers removed](https://github.com/vllm-project/llm-compressor/blob/1f036248f0310b8e95d488fc5d20831bcc9b62b7/examples/awq/llama_example.py#L69); the actual improvement should be a bit better, since the layers which were removed are where the improvement happens. To replicate these numbers, see the [first commit](1f03624#diff-208ced55cba2d38b1bc9b03b5f79ae3483c0849cc708a6c1131c231aae5d4b3dR69).

| Runtime (min) | Improvement | PR | Base |
|---------------|-------------|-------|-------|
| llama_no_off | 8.7% | 6.17 | 6.76 |
| llama_off | 5.6% | 10.46 | 11.08 |
| moe_no_off\* | 2.8% | 6.67 | 6.86 |
| moe_off\* | 1.9% | 7.57 | 7.72 |

| Memory (GB) | Improvement | PR | Base |
|-----------------|--------|-------|-------|
| llama_PR_no_off | 7.8% | 9.22 | 10 |
| llama_PR_off | 17.6% | 3.66 | 4.44 |
| moe_PR_no_off | -5.2% | 11.61 | 11.04 |
| moe_PR_off\*\* | -24.3% | 2.92 | 2.35 |

\*The actual speedup for MoE is going to be higher than this. These numbers are for a single layer being quantized, so the calibration overhead attenuates the gains.

\*\*This worsening of memory is due to the weights being cached on device: the layernorm -> up + gate proj mapping has to cache the up + gate linears for the entire MLP layer, which is fairly large.
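For clarity, the Improvement column in both tables appears to be derived as (Base - PR) / Base; the helper below is hypothetical (not part of the benchmark scripts), just reproducing the arithmetic against the rows above.

```python
# Hypothetical helper reproducing the Improvement column: positive means the
# PR run used less time/memory than Base, negative means it used more.
def improvement(pr: float, base: float) -> float:
    return round(100 * (base - pr) / base, 1)

# Spot-check against the tables: llama_no_off runtime and llama memory rows.
runtime_llama_no_off = improvement(6.17, 6.76)   # -> 8.7
memory_llama_no_off = improvement(9.22, 10)      # -> 7.8
```

The negative MoE memory rows fall out of the same formula, e.g. `improvement(11.61, 11.04)` gives -5.2.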
# SUMMARY

Changes:
- Targeted weight cache, no offloading: previously, in compute_best_scale we would record the entire state dict of the parent and store it on CPU; now we only record the balance layer weights and store those on device, since they are generally small.
- Reduce load/write/rewrite of weights: during grid search we have to repeatedly update the weight to use scaled and fake-quantized versions of the weight. Previously this was done by writing the original value, calculating the scaled value, and then writing that (2 writes); we instead calculate the scaled value directly from the on-device cached value and write it once.
- Fake-quantize only the on-device weight: previously the on/offloaded balance layer weight was updated; we now just update the on-device value.
- Compute loss: we slightly optimize compute_loss to reduce device movement. Note that a number of approaches to improve the loss computation were attempted, including:
  - progressively calculating the loss while running the samples to get int_w_outputs, to avoid storing the whole int_w_output on device (was slower, and seems to save the same memory as deleting int_w_outputs after it's used)
  - using torch.cat to combine the fp16 outputs and int_w_outputs into a single tensor so we can do only a single MSE calculation (torch.cat briefly doubles memory usage, so it created a significant memory increase)
  - avoiding torch.cat by preallocating a flat tensor of the needed size and progressively storing chunks into it (slow)
  - torch.compile on compute_loss and/or run samples (would likely speed up the run-samples code, but the offloading framework doesn't work well with it)
- del int_w_outputs: simply deleting this intermediate value after it's used saves a significant amount of memory.
- Change default offload device behavior: normally None; now by default we check whether we use the default MoE Mapping and, if so, offload to CPU by default.

TEST PLAN: (see first commit tests)

---------

Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
Signed-off-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
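The "write it once" change above can be sketched in miniature. This is an illustrative toy, not the llm-compressor implementation: `fake_quantize` is a bare-bones symmetric round-trip quantizer, and `candidate_weight` shows each grid-search candidate being derived straight from the cached weight rather than restoring the original and then scaling in place.

```python
# Toy sketch of deriving each grid-search candidate from an on-device cache
# with a single write, instead of restore-then-scale (two writes).

def fake_quantize(w, num_bits=4):
    # Symmetric round-trip quantization of a list of floats (toy version).
    qmax = 2 ** (num_bits - 1) - 1
    scale = (max(abs(x) for x in w) or 1.0) / qmax
    return [round(x / scale) * scale for x in w]

def candidate_weight(cached_weight, scales):
    # Compute scaled + fake-quantized candidate directly from the cached
    # original weight; the module weight is then written exactly once.
    return fake_quantize([w * s for w, s in zip(cached_weight, scales)])

cached = [14.0, -7.0, 3.5, 7.0]        # cached once, stays "on device"
for ratio in (0.25, 0.5, 0.75):        # grid-search candidates
    scales = [ratio] * len(cached)     # toy stand-in for AWQ's scale search
    module_weight = candidate_weight(cached, scales)  # the single write
```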
Etelis pushed a commit to Etelis/llm-compressor that referenced this pull request on Jan 24, 2026
Etelis pushed a commit to Etelis/llm-compressor that referenced this pull request on Jan 25, 2026
cajeonrh pushed a commit to cajeonrh/llm-compressor that referenced this pull request on Feb 10, 2026
Summary
To allow for arbitrary heterogeneous quantization schemes, this PR switches several helpers from AutoAWQ to the observer and QDQ logic. AWQ no longer requires that the quantization config use the same group_size, symmetric, and num_bits settings for every config_group.
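The reason heterogeneous config_groups become possible once the scale search goes through generic quant-dequant can be sketched as follows. This is a toy illustration, not llm-compressor's API: each layer's QDQ simply reads its own (num_bits, group_size), so nothing forces the settings to match across groups.

```python
# Toy group-wise quantize-dequantize: one scale per group of `group_size`
# values, symmetric range. Each config_group supplies its own settings.

def qdq(row, num_bits, group_size):
    qmax = 2 ** (num_bits - 1) - 1
    out = []
    for i in range(0, len(row), group_size):
        group = row[i:i + group_size]
        scale = (max(abs(v) for v in group) or 1.0) / qmax
        out.extend(round(v / scale) * scale for v in group)
    return out

# Two hypothetical config groups with different settings, applied to
# different layers independently:
config_groups = {
    "attention": {"num_bits": 8, "group_size": 4},
    "mlp": {"num_bits": 4, "group_size": 2},
}
weights = {"attention": [0.1, -0.5, 0.3, 0.9], "mlp": [1.0, -2.0, 0.5, 0.25]}
qdq_weights = {name: qdq(w, **config_groups[name]) for name, w in weights.items()}
```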
Resolves #1657
Prerequisites:
Test plan
- Running `llm-compressor/examples/awq/llama_example.py` with this branch (with `duo_scaling="both"`) and logging the best configuration of `(ratio, duo_scaling)`, I see a good mix of Falses and Trues, i.e. a good percentage of best_scales were found with `duo_scaling=False` and a good percentage were found with `duo_scaling=True`. Generated model output looks good.
- With `awq_one_shot.py` (pasted below), Wikitext PPL is consistent for w4a16 and w4a16_asym on this branch when compared to main, and better than what was reported in a previous AWQ PR, though those might have been configured differently. For W4A16_ASYM, the results are 13.41 for both main and this branch. This is what we've historically been using to test regressions.
- I ran `CADENCE=weekly TEST_DATA_FILE=~/projects/llm-compressor/tests/lmeval/configs/w4a16_awq_sym.yaml pytest -s ~/projects/llm-compressor/tests/lmeval/test_lmeval.py` on this branch, which causes the test to fail. This persists even when using `pseudo_quantize_tensor` instead of `call_observer`/`forward_quantize`, as shown in this diff. I get the same result in this diff, so at least that means the quantization logic in CT is consistent with AutoAWQ.

Output:
This is already a pretty high drop in recovery, should we revisit this test?
Further regression testing against main was done in this commit; see run.sh as of that commit (it was removed in the final PR). Results comparing this branch and main look reasonable: some up, some down, within the margin of error.
Test Group Quantization (w4a16_awq_sym)
Test Tensor Quantization (int8_tensor)
Test Channel Quantization (fp8_dynamic)
Test Block Quantization (fp8_block)
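The four tests above exercise different scale granularities. As rough intuition for what each strategy implies for an `(out_features, in_features)` linear weight, here is a hypothetical helper (not the compressed-tensors implementation) counting scale parameters per strategy:

```python
# Back-of-the-envelope scale counts per quantization strategy for a linear
# layer weight of shape (out_features, in_features). Illustrative only.

def num_scales(strategy, out_features, in_features,
               group_size=128, block_shape=(128, 128)):
    if strategy == "tensor":    # one scale for the whole tensor
        return 1
    if strategy == "channel":   # one scale per output channel
        return out_features
    if strategy == "group":     # one scale per group of input features
        return out_features * (in_features // group_size)
    if strategy == "block":     # one scale per 2-D block of the weight
        return (out_features // block_shape[0]) * (in_features // block_shape[1])
    raise ValueError(f"unknown strategy: {strategy}")
```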
awq_oneshot.py script
```python
import os

os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from llmcompressor import oneshot, active_session
from llmcompressor.utils import dispatch_for_generation
from llmcompressor.modifiers.awq import AWQModifier, AWQMapping
from llmcompressor.modifiers.quantization import QuantizationModifier
from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationScheme,
    QuantizationStrategy,
    QuantizationType,
)
MODEL_ID = "meta-llama/Llama-3.2-3B-Instruct"
SAVE_DIR = MODEL_ID.split("/")[-1] + "-awq-asym"
# Configure the quantization algorithm to run.
recipe = [
    AWQModifier(
        ignore=[
            "lm_head",
            "re:.*mlp.gate$",
            "re:.*mlp.shared_expert_gate$",
            "re:visual.*",
        ],
        scheme="W4A16_ASYM",
        duo_scaling="both",
        targets=["Linear"],
        # offload_device=torch.device("cpu"),
    ),
]
# Select calibration dataset.
DATASET_ID = "mit-han-lab/pile-val-backup"
DATASET_SPLIT = "validation"
# Select number of samples. 256 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 512
def get_calib_dataset(tokenizer):
    from datasets import load_dataset

if __name__ == "__main__":
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", trust_remote_code=True
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)