[AWQ] Modifier Speedups#2265

Merged
dsikka merged 11 commits into main from 97_awq_followups_offloading
Jan 23, 2026

Conversation

@HDCharles
Collaborator

@HDCharles HDCharles commented Jan 20, 2026

We identified a number of inefficiencies and fixes after the AWQ generalization PR (#1961); this PR largely implements them, see details below. Note: I previously opened a speed improvements PR (#2188) which had some issues that have been fixed in this one; that PR is going to be closed.

BENCHMARKS

To iterate more quickly, I ran these tests on models with most of their layers removed, so the actual improvement should be a bit better, since the removed layers are where the improvement happens. To replicate these numbers, see the first commit.

| Runtime (min)   | Improvement | PR    | Base  |
|-----------------|-------------|-------|-------|
| llama_no_off    | 8.7%        | 6.17  | 6.76  |
| llama_off       | 5.6%        | 10.46 | 11.08 |
| moe_no_off*     | 2.8%        | 6.67  | 6.86  |
| moe_off*        | 1.9%        | 7.57  | 7.72  |

| Memory (GB)     | Improvement | PR    | Base  |
|-----------------|-------------|-------|-------|
| llama_PR_no_off | 7.8%        | 9.22  | 10.00 |
| llama_PR_off    | 17.6%       | 3.66  | 4.44  |
| moe_PR_no_off   | -5.2%       | 11.61 | 11.04 |
| moe_PR_off**    | -24.3%      | 2.92  | 2.35  |
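The "Improvement" column can be read as the relative reduction versus the baseline, (base - PR) / base, so a positive value means the PR uses less time or memory. A minimal, hypothetical helper (not part of llm-compressor) reproduces the table's numbers:

```python
def improvement(pr: float, base: float) -> float:
    """Relative improvement of PR over baseline; positive = PR is better."""
    return (base - pr) / base

# Runtime rows from the table above (minutes)
assert round(improvement(6.17, 6.76) * 100, 1) == 8.7    # llama_no_off
assert round(improvement(10.46, 11.08) * 100, 1) == 5.6  # llama_off
# Memory rows (GB); a negative value is a regression
assert round(improvement(9.22, 10.00) * 100, 1) == 7.8   # llama_PR_no_off
assert round(improvement(2.92, 2.35) * 100, 1) == -24.3  # moe_PR_off
```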

*The actual speedup for MoE will be higher than this: these numbers are for a single layer being quantized, so the calibration overhead attenuates the gains.

**This memory regression is due to the weights being cached on device: the layernorm -> up + gate proj mapping has to cache the up and gate linears for the entire MLP layer, which is ~0.6 GB (the same as when there's no offloading) and is large relative to the other device usage when offloading is enabled.

SUMMARY:

Changes:

  • targeted weight cache (no offloading)
    • previously, compute_best_scale recorded the entire state dict of the parent module and stored it on cpu
    • now we record only the balance layer weights and store them on device, since they are generally small
  • reduced load/write/rewrite of weights
    • during grid search we repeatedly update the weight to a scaled, fake-quantized version of itself. Previously this was done by writing back the original value, calculating the scaled value, and then writing that (2 writes)
    • we now calculate the scaled value directly from the on-device cached value and write it once
  • fake quantize only the on-device weight
    • previously both the on-device and offloaded balance layer weights were updated
    • we now update just the on-device value
  • compute loss
    • we slightly optimize the loss computation to reduce device movement
    • note: a number of approaches to improving the loss computation were attempted, including
      • progressively calculating the loss while running the samples to get int_w_outputs, to avoid storing the whole int_w_output on device (slower, and appears to save the same memory as deleting int_w_outputs after it's used)
      • using torch.cat to combine the fp16 outputs and int_w_outputs into a single tensor so only a single mse calculation is needed (torch.cat briefly doubles memory usage, so it created a significant memory increase)
      • avoiding torch.cat by preallocating a flat tensor of the needed size and progressively copying chunks into it (slow)
      • torch.compile on compute loss and/or run samples (would likely speed up the run-samples code, but the offloading framework doesn't work well with it)
  • del int_w_outputs: simply deleting this intermediate value after it's used saves a significant amount of memory
  • changed default offload device behavior: previously None; now, by default, we check whether the default MoE mapping is in use and, if so, offload to cpu
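The cached-weight and single-write ideas above can be illustrated with a self-contained sketch. Names, the toy fake-quant, and the AWQ-style scale formula are illustrative assumptions, not llm-compressor's actual API; the point is that each candidate weight is computed directly from the one on-device cached copy (one write), and the intermediate output is deleted as soon as the loss is computed:

```python
import torch

def fake_quantize(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    # Toy symmetric per-row fake quantization, standing in for the real kernel
    q_max = 2 ** (n_bits - 1) - 1
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / q_max
    return (w / scale).round().clamp(-q_max - 1, q_max) * scale

def grid_search_scales(w_cached, x, fp16_out, n_grid=20):
    """Hypothetical single-write grid search. `w_cached` is the balance-layer
    weight cached once on device; candidates are derived straight from it
    rather than restoring the original weight and then rescaling (2 writes)."""
    x_mean = x.abs().mean(dim=0)          # per input-channel activation scale
    w_mean = w_cached.abs().mean(dim=0)   # per input-channel weight scale
    best_loss, best_scales = float("inf"), None
    for i in range(n_grid):
        ratio = i / n_grid
        scales = (x_mean.pow(ratio) / w_mean.pow(1 - ratio)).clamp(min=1e-4)
        scales = scales / (scales.max() * scales.min()).sqrt()
        # One write: scale + fake-quantize straight from the cached copy,
        # touching only the on-device tensor (any offloaded copy untouched)
        w_q = fake_quantize(w_cached * scales.view(1, -1))
        int_w_out = (x / scales.view(1, -1)) @ w_q.t()
        loss = (fp16_out - int_w_out).float().pow(2).mean().item()
        del int_w_out  # free the intermediate immediately, as in the PR
        if loss < best_loss:
            best_loss, best_scales = loss, scales.clone()
    return best_scales, best_loss
```

Since a finer grid is a superset of a coarser one here, searching more points can only find an equal or better loss.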
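The new offload-device default in the last bullet amounts to a small resolution rule. A sketch under assumed names (this is not llm-compressor's actual signature): an explicit user choice always wins; otherwise we offload to cpu only when the default MoE mappings are in use, since those cache large up/gate weights:

```python
from typing import Any, Optional

def resolve_offload_device(
    user_offload_device: Optional[str],
    mappings: Any,
    default_moe_mappings: Any,
) -> Optional[str]:
    """Hypothetical default-resolution rule for the offload device."""
    if user_offload_device is not None:
        return user_offload_device  # explicit user choice always wins
    # Only the default MoE mappings trigger cpu offload by default
    return "cpu" if mappings == default_moe_mappings else None
```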

TEST PLAN:
(see first commit tests)

@HDCharles HDCharles marked this pull request as draft January 20, 2026 15:31
@gemini-code-assist
Contributor

Summary of Changes

Hello @HDCharles, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on enhancing the efficiency of the AWQ quantization process. It introduces architectural improvements to how intermediate FP16 outputs are managed and how optimal scaling factors are determined, resulting in faster quantization times. The changes aim to streamline the workflow and reduce computational overhead for quantizing large language models.

Highlights

  • Significant Speedups for AWQ Quantization: Achieved notable reductions in the time required for AWQ quantization across tested models like LLAMA and QWEN3 MOE, as demonstrated by the provided benchmarks.
  • Optimized FP16 Baseline Caching: Implemented a caching mechanism for FP16 baseline outputs, eliminating redundant forward passes during the iterative smoothing and scaling process, which is a key driver of the performance improvements.
  • Improved Weight Scaling Precision: Enhanced the grid search for optimal scaling factors by ensuring each iteration operates on the original weights, preventing cumulative errors and improving the accuracy of the scaling application.
  • Enhanced Memory Management: Incorporated explicit memory cleanup for intermediate tensors and caches, contributing to more efficient GPU memory utilization during quantization.


@mergify mergify bot added the documentation Improvements or additions to documentation label Jan 20, 2026
@mergify
Contributor

mergify bot commented Jan 20, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces significant performance improvements to the AWQ algorithm by caching FP16 baseline outputs and optimizing the grid search process. The core logic changes in src/llmcompressor/modifiers/awq/base.py are well-implemented and include thoughtful memory optimizations. The modifications to the example files appear to be for benchmarking purposes, which is suitable for a performance-oriented change. My review focuses on a few areas for improvement, including a misplaced import, a type hint correction, and a potential issue related to parameter offloading that could affect correctness.

@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@mergify mergify bot removed the quality-failed label Jan 20, 2026
@HDCharles HDCharles force-pushed the 97_awq_followups_offloading branch from b31a851 to afa7350 on January 20, 2026 15:50
@HDCharles HDCharles added enhancement New feature or request ready When a PR is ready for review awq For any issue / PR related to AWQ support and removed documentation Improvements or additions to documentation labels Jan 20, 2026
@HDCharles HDCharles marked this pull request as ready for review January 20, 2026 16:33
@dsikka dsikka mentioned this pull request Jan 20, 2026
Collaborator

@brian-dellabetta brian-dellabetta left a comment


One super nit, and a question on the memory increase for MoE models. That seems like a big memory hit, and users frequently report OOM issues with AWQ; I wonder if it is worth the tradeoff. Did you try with batched calibration? It seems like the fp16 cache introduced here saves us some forward passes, but with batched calibration a lot of that runtime is reduced.

Other than that, I like the changes to clean things up and not having to re-load the entire state dict on the parent during the grid search.


HDCharles and others added 4 commits January 20, 2026 16:46
Summary

LLAMA::

PR W OFFLOAD:
============================================================
AWQ Quantization Complete
============================================================
Time: 12.10 minutes (725.93 seconds)
Peak GPU Memory: 3.66 GB
============================================================

PR W/O OFFLOAD:
============================================================
AWQ Quantization Complete
============================================================
Time: 6.83 minutes (409.62 seconds)
Peak GPU Memory: 10.02 GB
============================================================

BASE W OFFLOAD
============================================================
AWQ Quantization Complete
============================================================
Time: 13.62 minutes (817.02 seconds)
Peak GPU Memory: 4.44 GB

BASE W/O OFFLOAD
============================================================
AWQ Quantization Complete
============================================================
Time: 7.36 minutes (441.76 seconds)
Peak GPU Memory: 10.00 GB
============================================================

QWEN3 MOE::
PR W OFFLOAD:
============================================================
AWQ Quantization Complete
============================================================
Time: 7.65 minutes (458.96 seconds)
Peak GPU Memory: 2.93 GB
============================================================

PR W/O OFFLOAD:
============================================================
AWQ Quantization Complete
============================================================
Time: 6.64 minutes (398.51 seconds)
Peak GPU Memory: 34.10 GB
============================================================

BASE W OFFLOAD:
============================================================
AWQ Quantization Complete
============================================================
Time: 7.77 minutes (466.00 seconds)
Peak GPU Memory: 2.36 GB
============================================================

BASE W/O OFFLOAD:
============================================================
AWQ Quantization Complete
============================================================
Time: 7.46 minutes (447.34 seconds)
Peak GPU Memory: 11.03 GB
============================================================

Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
Summary

Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
Co-authored-by: Brian Dellabetta <brian-dellabetta@users.noreply.github.com>
Signed-off-by: HDCharles <39544797+HDCharles@users.noreply.github.com>
@HDCharles HDCharles force-pushed the 97_awq_followups_offloading branch from 8be33e2 to f0bd899 on January 20, 2026 21:46
@dsikka dsikka changed the title [AWQ] speedups [AWQ] Modifier Speedups Jan 21, 2026
Summary

Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>

Summary

Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
Summary

Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
Summary

Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>

Summary

Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
@mergify mergify bot removed the quality-failed label Jan 21, 2026
Collaborator

@brian-dellabetta brian-dellabetta left a comment


Looks good! Thanks for updating

Summary

Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
Collaborator

@kylesayrs kylesayrs left a comment


Looks good, thanks for listening to my thoughts here. I think these changes have only benefits; I agree with your point that we should consider simplifying the memory/runtime tradeoff interface of AWQ in the future.

@dsikka dsikka enabled auto-merge (squash) January 23, 2026 15:27
@dsikka dsikka merged commit 5a1640a into main Jan 23, 2026
12 checks passed
@dsikka dsikka deleted the 97_awq_followups_offloading branch January 23, 2026 15:44
Etelis pushed a commit to Etelis/llm-compressor that referenced this pull request Jan 24, 2026
Etelis pushed a commit to Etelis/llm-compressor that referenced this pull request Jan 25, 2026
cajeonrh pushed a commit to cajeonrh/llm-compressor that referenced this pull request Feb 10, 2026