
[Quantization] - Consolidate experts_int8 with FP8 Modular Kernels #33178

Closed
Josephasafg wants to merge 55 commits into vllm-project:main from Josephasafg:int8_to_fp8

Conversation

Contributor

@Josephasafg Josephasafg commented Jan 27, 2026

Purpose

Consolidates experts_int8 and Fp8OnlineMoEMethod to share common weight loading and quantization logic through a new MoeOnlineWeightLoader class.

Key changes:

  • Introduced MoeOnlineWeightLoader in moe_weight_loader.py that handles weight loading for MoE layers
  • Created MoeQuantizationCallbacks protocol that both int8 and fp8 methods implement
  • Both quantization methods now only need to provide:
    • get_quantized_dtype() - target dtype (int8 or fp8)
    • quantize_expert(weight) - single expert quantization logic
    • create_scale_tensors() - scale tensor creation (per-channel for int8, per-tensor for fp8)
    • setup_kernel() - kernel setup after quantization
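The callback surface above can be sketched as a small Protocol. Everything below is illustrative: the method names come from the bullet list, but the signatures and the `Int8Callbacks` body are hypothetical stand-ins for the real vLLM code (which operates on torch tensors), using plain Python lists to show per-channel symmetric int8 scaling.

```python
from typing import List, Protocol, Tuple


class MoeQuantizationCallbacks(Protocol):
    """Hypothetical sketch of the protocol named in this PR."""

    def get_quantized_dtype(self) -> str: ...
    def quantize_expert(
        self, weight: List[List[float]]
    ) -> Tuple[List[List[int]], List[float]]: ...
    def create_scale_tensors(
        self, num_experts: int, rows: int
    ) -> List[List[float]]: ...
    def setup_kernel(self) -> None: ...


class Int8Callbacks:
    """Illustrative per-channel symmetric int8 implementation."""

    def get_quantized_dtype(self) -> str:
        return "int8"

    def quantize_expert(self, weight):
        # One scale per output channel (row): max-abs mapped onto [-127, 127].
        quantized, scales = [], []
        for row in weight:
            scale = (max(abs(v) for v in row) / 127.0) or 1.0
            scales.append(scale)
            quantized.append([round(v / scale) for v in row])
        return quantized, scales

    def create_scale_tensors(self, num_experts, rows):
        # Per-channel: one scale per output row, per expert.
        return [[1.0] * rows for _ in range(num_experts)]

    def setup_kernel(self) -> None:
        pass  # the real method would select/configure the fused MoE kernel
```

An fp8 implementation would differ mainly in `get_quantized_dtype()` and in returning a single per-tensor scale from `quantize_expert`.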

[Diagram: "MoE Weight Loading Pipeline" (Feb 4, 2026), showing the consolidation and inheritance between fp8 and experts_int8]

  • This PR relies on another PR and needs it to be merged first.

Test Plan + Results

Ran lm_eval with fp8 quantization, with experts_int8 quantization, and without quantization. Model: ai21labs/AI21-Jamba-Mini-1.7
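The exact invocation isn't shown in the PR; a run of roughly this shape would produce tables like the ones below. The model name is from the PR, but the flags are a plausible sketch of an lm_eval-over-vLLM command, not a verified command line.

```shell
# Hedged sketch: load the model through lm_eval's vllm backend with the
# quantization mode under test, then evaluate HumanEval pass@1.
# Flag spellings are assumptions, not copied from the PR.
lm_eval --model vllm \
  --model_args pretrained=ai21labs/AI21-Jamba-Mini-1.7,quantization=experts_int8 \
  --tasks humaneval \
  --batch_size auto \
  --confirm_run_unsafe_code
```

Repeating with `quantization=fp8` and with the `quantization` argument omitted gives the other two rows of each comparison.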

This PR
experts_int8

|  Tasks  |Version|  Filter   |n-shot|Metric|   |Value |   |Stderr|
|---------|------:|-----------|-----:|------|---|-----:|---|-----:|
|humaneval|      1|create_test|     0|pass@1|   |0.3963|±  |0.0383|

fp8

|  Tasks  |Version|  Filter   |n-shot|Metric|   |Value|   |Stderr|
|---------|------:|-----------|-----:|------|---|----:|---|-----:|
|humaneval|      1|create_test|     0|pass@1|   |0.378|±  | 0.038|

no quantization

|  Tasks  |Version|  Filter   |n-shot|Metric|   |Value |   |Stderr|
|---------|------:|-----------|-----:|------|---|-----:|---|-----:|
|humaneval|      1|create_test|     0|pass@1|   |0.3902|±  |0.0382|

vllm main

no quantization

|  Tasks  |Version|  Filter   |n-shot|Metric|   |Value |   |Stderr|
|---------|------:|-----------|-----:|------|---|-----:|---|-----:|
|humaneval|      1|create_test|     0|pass@1|   |0.3902|±  |0.0382|

experts_int8

|  Tasks  |Version|  Filter   |n-shot|Metric|   |Value |   |Stderr|
|---------|------:|-----------|-----:|------|---|-----:|---|-----:|
|humaneval|      1|create_test|     0|pass@1|   |0.3841|±  |0.0381|

fp8

|  Tasks  |Version|  Filter   |n-shot|Metric|   |Value |   |Stderr|
|---------|------:|-----------|-----:|------|---|-----:|---|-----:|
|humaneval|      1|create_test|     0|pass@1|   |0.3598|±  |0.0376|

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request successfully consolidates the weight loading and quantization logic for MoE layers into a new MoeOnlineWeightLoader class and a MoeQuantizationCallbacks protocol. This significantly improves code modularity and reusability, as both INT8 and FP8 online quantization methods now implement the same interface. The changes also include necessary adjustments for handling dummy weights and deferred materialization, ensuring compatibility and correctness across different quantization schemes. The introduction of kInt8StaticChannelSym and its integration into the _supports_quant_scheme function correctly enables INT8 support. Overall, the changes are well-structured and contribute positively to the codebase's maintainability and extensibility.


mergify bot commented Jan 27, 2026

Documentation preview: https://vllm--33178.org.readthedocs.build/en/33178/

@mergify mergify bot added documentation Improvements or additions to documentation ci/build deepseek Related to DeepSeek models frontend llama Related to Llama models multi-modality Related to multi-modality (#4194) new-model Requests to new models performance Performance-related issues qwen Related to Qwen models gpt-oss Related to GPT-OSS models nvidia labels Jan 27, 2026
@mergify mergify bot added the rocm Related to AMD ROCm label Jan 27, 2026
@mergify mergify bot added v1 tpu Related to Google TPUs kv-connector labels Jan 27, 2026

mergify bot commented Jan 27, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Josephasafg.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 27, 2026
Signed-off-by: Josephasafg <ajgard7@gmail.com>
Contributor

@kylesayrs kylesayrs left a comment


Nice job, this is roughly what I'd expect. @vkuzo and I have had a longer discussion about the overlap between online quantization and layerwise reload logic, but that's probably out of scope for this PR.

@vkuzo We definitely need to start standardizing how we're going to handle online weights. Whatever strategy we go with, we should at least make sure all online quantization implementations use the same strategy and reuse utilities wherever possible.

Comment on lines +190 to +194
# Refresh the reference to `param` to reflect JIT materialization
if id(param) == layer._w13_weight_orig_id:
param = layer.w13_weight
elif id(param) == layer._w2_weight_orig_id:
param = layer.w2_weight
Contributor


I don't fully understand this logic. Why does param need to be refreshed? Why not just get the name of the parameter, rather than using ids?

Contributor Author


I kept it from the original fp8 implementation, but I believe it's there because when we register_parameter("w13_weight", new_tensor), the layer.w13_weight (or w2_weight) reference updates while the param argument still points to the old meta tensor. ids are used because param is just a tensor reference; it doesn't carry the parameter name, so we save the original ids to identify which parameter it was.
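A torch-free toy reproduces the failure mode being described: a caller-held reference goes stale when the attribute is rebound to a new object, and identity comparison is one way to detect and refresh it. Names mirror the snippet under review but the types are illustrative placeholders.

```python
class Layer:
    """Stand-in for an nn.Module; w13_weight begins as a 'meta' placeholder."""

    def __init__(self):
        self.w13_weight = ["meta"]  # plays the role of the meta tensor
        self._w13_weight_orig_id = id(self.w13_weight)


def materialize(layer):
    # Plays the role of register_parameter("w13_weight", new_tensor):
    # the attribute is rebound to a brand-new object.
    layer.w13_weight = [1.0, 2.0]


layer = Layer()
param = layer.w13_weight        # reference held before materialization
materialize(layer)

assert param == ["meta"]        # stale: still points at the old object
if id(param) == layer._w13_weight_orig_id:
    param = layer.w13_weight    # the refresh trick from the reviewed snippet
assert param == [1.0, 2.0]      # now tracks the materialized weight
```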

Contributor


Hm, so this seems to be because, when you materialize the parameter, the newly materialized tensor overrides the original meta tensor. This is an artifact of having a weight_loader definition which is shared between multiple parameters.

In the reloading implementation, we handle this by just creating separate, generic weight loaders for each parameter.
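The per-parameter alternative being suggested can be sketched as a closure factory. This is a hypothetical illustration of the idea, not the actual reloading implementation: each loader is bound to one attribute name, so it never caches a tensor reference that can go stale.

```python
def make_weight_loader(layer, name):
    # One loader per parameter, bound to its name. The loader resolves the
    # current object through getattr/setattr on every call, so rebinding the
    # attribute elsewhere cannot leave it holding a stale reference.
    def load(new_tensor):
        setattr(layer, name, new_tensor)
        return getattr(layer, name)
    return load


class Layer:
    def __init__(self):
        self.w13_weight = None
        self.w2_weight = None


layer = Layer()
load_w13 = make_weight_loader(layer, "w13_weight")
load_w2 = make_weight_loader(layer, "w2_weight")
load_w13([1, 2])
load_w2([3, 4])
```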

self.process_weights_after_loading(layer)

# Prevent the usual `process_weights_after_loading` call
layer._already_called_process_weights_after_loading = True
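The flag in the snippet above is a simple idempotence guard. A generic version of the pattern (function and flag names here are hypothetical) might look like:

```python
def process_weights_once(layer, process_fn):
    # Run process_fn(layer) at most once per layer, using a flag attribute,
    # mirroring _already_called_process_weights_after_loading above.
    if getattr(layer, "_already_called_process_weights_after_loading", False):
        return False
    process_fn(layer)
    layer._already_called_process_weights_after_loading = True
    return True
```

Returning a bool makes double invocation observable, which is handy when testing that the usual post-loading hook was in fact suppressed.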
Contributor


@vkuzo We should probably try to standardize on a strategy here. I think that this is fine

Signed-off-by: Josephasafg <ajgard7@gmail.com>
Contributor Author

Josephasafg commented Feb 16, 2026

@kylesayrs @vkuzo I opened #34645 to easily add support for experts_int8 when loading with dummy weights. LMK what you think, and if it's merged, I'll add this change here as well.

Signed-off-by: Josephasafg <ajgard7@gmail.com>
@Josephasafg Josephasafg requested a review from 22quinn as a code owner February 16, 2026 21:15

mergify bot commented Feb 16, 2026

Hi @Josephasafg, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Josephasafg and others added 6 commits February 16, 2026 23:31
Signed-off-by: Josephasafg <ajgard7@gmail.com>
@Josephasafg Josephasafg requested a review from vkuzo February 18, 2026 12:48
Contributor Author

Added uses_meta_device to experts_int8 after merging #34645


mergify bot commented Feb 24, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Josephasafg.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 24, 2026
@mergify mergify bot removed the needs-rebase label Feb 24, 2026

mergify bot commented Feb 24, 2026

Hi @Josephasafg, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.


Signed-off-by: Josephasafg <ajgard7@gmail.com>
Contributor Author

Closing in favor of #38463.

@github-project-automation github-project-automation bot moved this from Todo to Done in AMD Mar 29, 2026
@github-project-automation github-project-automation bot moved this from To Triage to Done in gpt-oss Issues & Enhancements Mar 29, 2026
@github-project-automation github-project-automation bot moved this to Done in NVIDIA Mar 29, 2026