
[4/N] Quantization Refactor: AWQ schemes and kernel call and weight init split #21126

Merged
sglang-npu-bot merged 38 commits into sgl-project:main from Alisehen:hyc/awq-scheme-refactor on Apr 30, 2026

Conversation

@Alisehen (Contributor) commented Mar 22, 2026:

Motivation

Add a scheme layer to AWQ instead of storing all classes in a single file, and split the kernel call from weight initialization. Follow-up to #17503.
Images and motivation for this PR can be viewed in our roadmap: #15194.

Modifications

  • Refactored AWQ to align with the scheme-based quantization structure used by modelslim and compressed_tensors.
  • Moved the AWQ implementations out of the monolithic quantization/awq.py into a new package under quantization/awq/, with scheme implementations split into quantization/awq/schemes/.
  • Added get_linear_scheme and get_moe_scheme to awq/awq.py so the linear and MoE paths select concrete schemes explicitly.
  • Unified the AWQ quant methods into thin wrappers that delegate to layer.scheme, matching the compressed_tensors call pattern (sketched below).
  • Moved the AWQ Triton helpers into quantization/awq/awq_triton.py and removed the old top-level quantization/awq_triton.py.
  • Split backend-specific kernel logic into:
      • hardware_backend/gpu/quantization/awq_kernels.py
      • hardware_backend/npu/quantization/awq_kernels.py

This keeps awq.py focused on config, method dispatch, and scheme selection, while concrete weight handling and execution live in the schemes and backend kernels.
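As a rough sketch of the wrapper/scheme split (illustrative only; method names follow the compressed_tensors pattern rather than the exact signatures in this PR):

    import torch

    class AWQLinearMethod:
        """Thin wrapper: weight init and the kernel call are both delegated
        to the scheme the config selects (e.g. CUDA vs. Ascend)."""

        def __init__(self, quant_config):
            self.quant_config = quant_config

        def create_weights(self, layer: torch.nn.Module, *args, **kwargs):
            # The config picks the concrete scheme for this layer.
            layer.scheme = self.quant_config.get_linear_scheme(layer)
            layer.scheme.create_weights(layer, *args, **kwargs)

        def process_weights_after_loading(self, layer: torch.nn.Module):
            # Weight init/repacking lives in the scheme, not the wrapper.
            layer.scheme.process_weights_after_loading(layer)

        def apply(self, layer: torch.nn.Module, x: torch.Tensor, bias=None):
            # The kernel call is likewise dispatched through the scheme.
            return layer.scheme.apply(layer, x, bias)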

Accuracy Tests

GPU tests: (accuracy results screenshot attached in the PR)
NPU tests: (accuracy results screenshot attached in the PR)

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist (Contributor) commented:

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the AWQ quantization framework by introducing a more organized and extensible architecture. The changes aim to improve code clarity and facilitate the integration of diverse hardware backends and quantization methods, moving towards a more unified and maintainable quantization system.

Highlights

  • AWQ Refactoring: The AWQ quantization implementation has been refactored into a modular, scheme-based structure, separating configuration, weight initialization, and kernel execution logic.
  • Directory Restructuring: AWQ-related files have been moved into a new quantization/awq/ package, with specific scheme implementations now residing in quantization/awq/schemes/.
  • Backend-Specific Kernels: Backend-specific kernel logic for AWQ linear and MoE layers has been split into dedicated files for GPU (gpu/quantization/awq_kernels.py) and NPU (npu/quantization/awq_kernels.py).
  • Scheme Abstraction: New abstract base classes (AWQLinearSchemeBase, AWQMoESchemeBase) and concrete scheme implementations (e.g., AWQLinearScheme, AWQMarlinLinearScheme, AWQMoEScheme, AWQAscendLinearScheme, AWQAscendMoEScheme) were introduced to encapsulate quantization logic.
  • Centralized Scheme Dispatch: The AWQConfig and AWQMarlinConfig now dynamically select the appropriate quantization scheme via get_linear_scheme and get_moe_scheme methods, delegating operations to the chosen scheme.


@Alisehen (Contributor, Author) commented:

We kept both AWQConfig and AWQMarlinConfig for now because awq and awq_marlin are still exposed as separate quantization entry points with distinct compatibility and fallback behavior.

@Alisehen (Contributor, Author) commented:

@ping1jing2

@gemini-code-assist (Bot) left a comment:

Code Review

This pull request is a significant refactoring of the AWQ quantization logic, moving to a more modular scheme-based architecture. This is a great improvement for maintainability and extensibility. The changes also include splitting backend-specific kernels for GPU and NPU, which is a clean separation of concerns. I've found a critical bug in the GPU kernel logic and a few areas for improvement in the new NPU kernels and scheme definitions.


    marlin_w13_scales = marlin_moe_permute_scales(
        s=layer.w13_scales,
        size_k=layer.intermediate_size_per_partition,

critical

The size_k parameter for marlin_moe_permute_scales appears to be incorrect for w13_scales. The k dimension of the w13 weight matrix is hidden_size, and the scales are grouped along this dimension. Therefore, size_k should be layer.hidden_size instead of layer.intermediate_size_per_partition.

Suggested change:
    -    size_k=layer.intermediate_size_per_partition,
    +    size_k=layer.hidden_size,

Comment on lines +33 to +35
    qweight_tmp.bitwise_or_(
        ((layer.qweight.data >> shift_num) * (2 ** (4 * i))) & (0xF << (4 * i))
    )

medium

The bitwise operation can be simplified for clarity and potentially better performance. The multiplication by a power of two can be replaced with a left bit shift. Also, the final bitwise AND is redundant since the shifted value is already a nibble.

Suggested change:
    -    qweight_tmp.bitwise_or_(
    -        ((layer.qweight.data >> shift_num) * (2 ** (4 * i))) & (0xF << (4 * i))
    -    )
    +    qweight_tmp.bitwise_or_(
    +        ((layer.qweight.data >> shift_num) & 0xF) << (4 * i)
    +    )
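For the record, the identity is easy to verify with plain Python integers (arbitrary precision, so there is no int32 wraparound; the wraparound does not affect the masked nibble either way):

    # Check that mask-then-shift equals the original multiply-then-mask
    # for every nibble position of a packed 32-bit word.
    for y in range(1 << 12):    # y stands in for (layer.qweight.data >> shift_num)
        for i in range(8):      # eight 4-bit nibbles per int32
            original = (y * (2 ** (4 * i))) & (0xF << (4 * i))
            simplified = (y & 0xF) << (4 * i)
            assert original == simplified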

Collaborator reply: make sense

Comment on lines +93 to +100
    w13_qweight_tmp.bitwise_or_(
        ((layer.w13_qweight.data >> shift_num) * (2 ** (4 * i)))
        & (0xF << (4 * i))
    )
    w2_qweight_tmp.bitwise_or_(
        ((layer.w2_qweight.data >> shift_num) * (2 ** (4 * i)))
        & (0xF << (4 * i))
    )

medium

The bitwise operations can be simplified for clarity and potentially better performance. The multiplication by a power of two can be replaced with a left bit shift. Also, the final bitwise AND is redundant since the shifted value is already a nibble.

            w13_qweight_tmp.bitwise_or_(
                ((layer.w13_qweight.data >> shift_num) & 0xF) << (4 * i)
            )
            w2_qweight_tmp.bitwise_or_(
                ((layer.w2_qweight.data >> shift_num) & 0xF) << (4 * i)
            )

Comment on lines +116 to +127
    layer.register_parameter(
        "w13_qzeros", torch.nn.Parameter(w13_qzeros_tmp, requires_grad=False)
    )
    layer.register_parameter(
        "w13_qweight", torch.nn.Parameter(w13_qweight_tmp, requires_grad=False)
    )
    layer.register_parameter(
        "w2_qzeros", torch.nn.Parameter(w2_qzeros_tmp, requires_grad=False)
    )
    layer.register_parameter(
        "w2_qweight", torch.nn.Parameter(w2_qweight_tmp, requires_grad=False)
    )

medium

For consistency with other parts of the codebase (e.g., the GPU AWQ kernel) and to improve maintainability, it's better to use the replace_parameter utility function for replacing module parameters. You'll need to add from sglang.srt.layers.quantization.utils import replace_parameter to the imports, and then replace these layer.register_parameter calls with replace_parameter(layer, "param_name", new_tensor).
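Concretely, the suggestion is a drop-in replacement for the four calls above (a sketch; it assumes the import path quoted in this comment is correct for this tree):

    from sglang.srt.layers.quantization.utils import replace_parameter

    # Replaces the layer.register_parameter(...) calls shown above.
    replace_parameter(layer, "w13_qzeros", w13_qzeros_tmp)
    replace_parameter(layer, "w13_qweight", w13_qweight_tmp)
    replace_parameter(layer, "w2_qzeros", w2_qzeros_tmp)
    replace_parameter(layer, "w2_qweight", w2_qweight_tmp)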

    self.quant_config = quant_config
    self.kernel = AWQMoEKernel(quant_config)
    if self.quant_config.weight_bits != 4:
        raise ValueError("AWQMoEMethod only supports 4bit now.")

medium

There's a typo in the error message. The class is AWQMoEScheme, but the error message refers to AWQMoEMethod. This should be corrected for clarity.

Suggested change:
    -    raise ValueError("AWQMoEMethod only supports 4bit now.")
    +    raise ValueError("AWQMoEScheme only supports 4bit now.")

Collaborator reply: make sense

ping1jing2 self-assigned this on Mar 23, 2026
        top_k=topk_ids.shape[1],
        use_wna16=True,
    )
    return StandardCombineInput(hidden_states=output)

Since we are not gated on a specific kernel implementation here, could you check whether it is possible to call NPUW4A16Int4DynamicMoEMethod as the kernel?
Example from one of our MoE refactoring PRs: https://github.com/sgl-project/sglang/pull/17361/changes#diff-34cc9aacc2ffaa0ad8351300aad66099bcbc2451d9a0a2c089aab5926d4f5e01
It should work for both apply and process_weights.
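A possible shape for that delegation (purely a sketch: it assumes NPUW4A16Int4DynamicMoEMethod can be constructed from the quant config and exposes layer-based process_weights_after_loading/apply methods, which should be verified against the linked PR):

    class AWQAscendMoEKernel:
        def __init__(self, quant_config):
            # Reuse the existing NPU MoE method instead of reimplementing it;
            # the constructor signature here is an assumption.
            self.inner = NPUW4A16Int4DynamicMoEMethod(quant_config)

        def process_weights_after_loading(self, layer):
            self.inner.process_weights_after_loading(layer)

        def apply(self, layer, dispatch_output):
            return self.inner.apply(layer, dispatch_output)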

        self, layer: torch.nn.Module, moe_runner_config: MoeRunnerConfig
    ):
        self.moe_runner_config = moe_runner_config
        self.kernel.moe_runner_config = moe_runner_config

Can we merge the awq_moe.py file and this file together, as is done for the Linear schemes in awq_w4a16.py?

@b8zhong (Collaborator) left a comment:

Hi, can you leave the GPU code outside of hardware_backend? I think that structure is designed for other hardware backends, if I'm not misunderstanding?

@TamirBaydasov (Contributor) replied:

> Hi, can you leave the GPU code outside of hardware_backend? I think that structure is designed for other hardware backends, if I'm not misunderstanding?

Hi! From our discussion with @rainj-me here, we thought it would be OK to move kernel-related files from the quantization folder into the hardware_backend structure and create one for GPU.

@b8zhong
Copy link
Copy Markdown
Collaborator

b8zhong commented Mar 25, 2026

@TamirBaydasov Thanks. I saw your comment, but I also saw #15194 (comment).

@TamirBaydasov (Contributor) replied:

> @TamirBaydasov Thanks. I saw your comment, but I also saw #15194 (comment).

That comment was about moving all CUDA kernels into the hardware_backend structure, i.e., creating a structure similar to the NPU's with more than just quantization kernels present there.
Our overall goal is to clear out the quantization folder so that navigating between files becomes easier and their structure more uniform. If hardware_backend for GPU is not suitable, we can, for example, create a kernels/utils folder inside quantization and move the GPU kernels there.

Comment on lines +28 to +30
    _is_cuda = is_cuda()
    _is_hip = is_hip()
    _is_xpu = is_xpu()

If you have a separate gpu folder, I think we should reduce or delete all the is_xxx checks, right?

Collaborator reply (on lines +33 to +35 above): make sense

Collaborator reply (on the AWQMoEMethod error message above): make sense


Comment on lines +37 to +60
    try:
        from sglang.jit_kernel.awq_dequantize import awq_dequantize
        from sglang.jit_kernel.awq_marlin_repack import (
            awq_marlin_moe_repack,
            awq_marlin_repack,
        )
        from sglang.srt.utils.custom_op import register_custom_op_from_extern

        awq_dequantize = register_custom_op_from_extern(
            awq_dequantize,
            fake_impl=lambda qweight, scales, qzeros: qweight.new_empty(
                qweight.shape[:-1] + (qweight.shape[-1] * 8,), dtype=scales.dtype
            ),
        )
    except ImportError:
        try:
            from sglang.srt.layers.quantization.awq.awq_triton import (
                awq_dequantize_triton as awq_dequantize,
            )
        except ImportError:
            try:
                from sgl_kernel import awq_dequantize
            except ImportError:
                pass
@alexnails (Collaborator) commented Apr 28, 2026:

I believe there is now a regression for XPU here, since it will go directly to the Triton path?
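One way to keep XPU on the prebuilt kernel (a sketch only, with assumed import paths; the fix that actually landed may differ):

    from sglang.srt.utils import is_xpu  # assumed location of the helper

    if is_xpu():
        # XPU: go straight to sgl_kernel instead of the Triton fallback.
        from sgl_kernel import awq_dequantize
    else:
        try:
            from sglang.jit_kernel.awq_dequantize import awq_dequantize
        except ImportError:
            from sglang.srt.layers.quantization.awq.awq_triton import (
                awq_dequantize_triton as awq_dequantize,
            )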


    class AWQAscendMoEScheme(AWQMoEScheme):
        def __init__(self, quant_config: "AWQConfig"):
            super().__init__(quant_config)

super().__init__(quant_config) constructs the GPU AWQMoEKernel, which transitively imports MarlinMoeQuantInfo and marlin_utils. None of that is needed on NPU.

Suggest skipping the parent init or refactoring to a _init_kernel() hook:

class AWQAscendMoEScheme(AWQMoEScheme):
    def __init__(self, quant_config: "AWQConfig"):
        # skip AWQMoEScheme.__init__
        from sglang.srt.hardware_backend.npu.quantization.awq_kernels import (
            AWQAscendMoEKernel,
        )
        self.quant_config = quant_config
        if self.quant_config.weight_bits != 4:
            raise ValueError("AWQAscendMoEScheme only supports 4bit now.")
        self.kernel = AWQAscendMoEKernel(quant_config)

Let's talk about this a bit further as I think the cleaner long-term fix is making self.kernel come from a platform factory (see plugin integration note on awq.py), at which point this subclass disappears entirely.
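For reference, the _init_kernel() hook variant would look roughly like this (a sketch; the NPU import path is the one quoted above, and the GPU path mirrors the layout in this PR):

    class AWQMoEScheme:
        def __init__(self, quant_config: "AWQConfig"):
            self.quant_config = quant_config
            if self.quant_config.weight_bits != 4:
                raise ValueError("AWQMoEScheme only supports 4bit now.")
            self.kernel = self._init_kernel(quant_config)

        def _init_kernel(self, quant_config):
            # Default GPU/Marlin kernel, imported lazily so that subclasses
            # overriding this hook never pull in the Marlin dependencies.
            from sglang.srt.hardware_backend.gpu.quantization.awq_kernels import (
                AWQMoEKernel,
            )
            return AWQMoEKernel(quant_config)

    class AWQAscendMoEScheme(AWQMoEScheme):
        def _init_kernel(self, quant_config):
            from sglang.srt.hardware_backend.npu.quantization.awq_kernels import (
                AWQAscendMoEKernel,
            )
            return AWQAscendMoEKernel(quant_config)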

    class AWQLinearIntelAMXMethod(AWQLinearMethod):
        """Linear method for AWQ on Intel CPU with AMX."""

        def __init__(self, quant_config: "AWQConfig"):
            self.quant_config = quant_config

AWQIntelAMXLinearScheme overrides __init__ but doesn't call super().__init__() and doesn't set self.kernel. This follows from the other comment, but let's clean this up or make it less brittle?

Comment thread on python/sglang/srt/layers/linear.py (outdated):
"CompressedTensorsLinearMethod",
"AWQMarlinLinearMethod",
"AWQLinearMethod",
"AWQLinearAscendMethod",

AWQLinearAscendMethod was deleted in this PR; Ascend now goes through the unified AWQLinearMethod + AWQAscendLinearScheme. Since WEIGHT_LOADER_V2_SUPPORTED is matched by class-name string, we should remove this entry?

Comment on lines +159 to +172
    def get_linear_scheme(self, layer: torch.nn.Module):
        assert isinstance(layer, LinearBase)
        if _is_npu:
            return AWQAscendLinearScheme(self)
        return AWQLinearScheme(self)

    def get_moe_scheme(self, layer: torch.nn.Module):
        from sglang.srt.layers.moe.fused_moe_triton import FusedMoE

        assert isinstance(layer, FusedMoE)
        if _is_npu:
            return AWQAscendMoEScheme(self)
        raise NotImplementedError("AWQConfig only supports MoE scheme on NPU.")


Plugin-integration note (#21388 follow-up): a TODO is needed here for the multiplatform plugin, even if you just mark that it needs to integrate current_platform.is_out_of_tree.

something like:

def get_linear_scheme(self, layer: torch.nn.Module):
    assert isinstance(layer, LinearBase)
    from sglang.srt.platforms import current_platform
    cls = current_platform.get_awq_linear_scheme_cls()
    if cls is not None:
        return cls(self)
    return AWQLinearScheme(self)  # in-tree CUDA default

def get_moe_scheme(self, layer: torch.nn.Module):
    from sglang.srt.platforms import current_platform
    cls = current_platform.get_awq_moe_scheme_cls()
    if cls is None:
        raise NotImplementedError(
            f"AWQ MoE not provided by platform {current_platform.get_dispatch_key_name()!r}."
        )
    return cls(self)

With SRTPlatform extended to expose get_awq_linear_scheme_cls() / get_awq_moe_scheme_cls() / get_awq_marlin_linear_scheme_cls() (returning None by default; concrete platforms override). This matches how PR #21388 already exposes get_mha_kv_pool_cls(), get_graph_runner_cls(), etc.

Another option is to push the platform factory down to the .kernel layer (AWQLinearScheme.__init__ calls current_platform.get_awq_linear_kernel_cls()). This would eliminate AWQAscendLinearScheme / AWQIntelAMXLinearScheme as subclasses entirely, since they become kernel registrations on the OOT platform plugin. It also fixes the super().__init__ side-effect issues I mentioned earlier.
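That kernel-level variant might look roughly like this (a sketch; get_awq_linear_kernel_cls is the hypothetical platform hook proposed above, and AWQLinearKernel is an assumed name for the in-tree default):

    class AWQLinearScheme:
        def __init__(self, quant_config: "AWQConfig"):
            from sglang.srt.platforms import current_platform

            self.quant_config = quant_config
            kernel_cls = current_platform.get_awq_linear_kernel_cls()
            if kernel_cls is None:
                # In-tree CUDA default when the platform registers no override.
                from sglang.srt.hardware_backend.gpu.quantization.awq_kernels import (
                    AWQLinearKernel,
                )
                kernel_cls = AWQLinearKernel
            self.kernel = kernel_cls(quant_config)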

@Alisehen (Contributor, Author) replied Apr 28, 2026:

Thanks for the review. I addressed the AWQ scheme issues:

  • Restored the XPU AWQ dequant path to use sgl_kernel.awq_dequantize instead of falling through to Triton.
  • Refactored AWQ Linear/MoE schemes to use _init_kernel() hooks, so Ascend no longer initializes the default GPU/Marlin kernel before replacing it.
  • Updated the CPU AMX AWQ path to use CPU-specific kernel objects behind the scheme, avoiding the brittle subclass-without-super().__init__() pattern.
  • Removed the stale AWQLinearAscendMethod entry from WEIGHT_LOADER_V2_SUPPORTED.
  • Added a TODO for moving AWQ scheme/kernel selection into the multiplatform plugin factory once quantization hooks are available.

            return AWQAscendLinearScheme(self)
        return AWQLinearScheme(self)

    def get_moe_scheme(self, layer: torch.nn.Module):

nit: get_moe_scheme raising NotImplementedError for non-NPU is unreachable today (caller returns None for FusedMoE on non-NPU before consulting it). Either remove the raise or document the intent.

github-actions (Bot) added the quant (LLM Quantization) label on Apr 28, 2026
@ping1jing2 (Collaborator) commented:

I merged it since several committers have already reviewed, and we confirmed that the single failing GPU CI job is unrelated to our change.

sglang-npu-bot merged commit 577dbc4 into sgl-project:main on Apr 30, 2026
100 of 114 checks passed
vguduruTT pushed a commit to vguduruTT/sglang that referenced this pull request on May 2, 2026

Labels

quant (LLM Quantization), run-ci


7 participants