
[Quantization] - Consolidate experts_int8 with FP8 Modular Kernels #33178

Closed
Josephasafg wants to merge 55 commits into vllm-project:main from Josephasafg:int8_to_fp8

Conversation

Contributor

@Josephasafg Josephasafg commented Jan 27, 2026

Purpose

Consolidates experts_int8 and Fp8OnlineMoEMethod to share common weight loading and quantization logic through a new MoeOnlineWeightLoader class.

Key changes:

  • Introduced MoeOnlineWeightLoader in moe_weight_loader.py that handles weight loading for MoE layers
  • Created MoeQuantizationCallbacks protocol that both int8 and fp8 methods implement
  • Both quantization methods now only need to provide:
    • get_quantized_dtype() - target dtype (int8 or fp8)
    • quantize_expert(weight) - single expert quantization logic
    • create_scale_tensors() - scale tensor creation (per-channel for int8, per-tensor for fp8)
    • setup_kernel() - kernel setup after quantization
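The callback surface above can be sketched as a small Protocol. Everything below is illustrative: the method names come from the bullet list, but the signatures and the `Int8Callbacks` body are hypothetical stand-ins for the real vLLM code (which operates on torch tensors), using plain Python lists to show per-channel symmetric int8 scaling.

```python
from typing import List, Protocol, Tuple


class MoeQuantizationCallbacks(Protocol):
    """Hypothetical sketch of the protocol named in this PR."""

    def get_quantized_dtype(self) -> str: ...
    def quantize_expert(
        self, weight: List[List[float]]
    ) -> Tuple[List[List[int]], List[float]]: ...
    def create_scale_tensors(
        self, num_experts: int, rows: int
    ) -> List[List[float]]: ...
    def setup_kernel(self) -> None: ...


class Int8Callbacks:
    """Illustrative per-channel symmetric int8 implementation."""

    def get_quantized_dtype(self) -> str:
        return "int8"

    def quantize_expert(self, weight):
        # One scale per output channel (row): max-abs mapped onto [-127, 127].
        quantized, scales = [], []
        for row in weight:
            scale = (max(abs(v) for v in row) / 127.0) or 1.0
            scales.append(scale)
            quantized.append([round(v / scale) for v in row])
        return quantized, scales

    def create_scale_tensors(self, num_experts, rows):
        # Per-channel: one scale per output row, per expert.
        return [[1.0] * rows for _ in range(num_experts)]

    def setup_kernel(self) -> None:
        pass  # the real method would select/configure the fused MoE kernel
```

An fp8 implementation would differ mainly in `get_quantized_dtype()` and in returning a single per-tensor scale from `quantize_expert`.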

[Diagram: "MoE Weight Loading Pipeline" (Feb 4, 2026), showing the consolidation and inheritance between fp8 and experts_int8]

  • This PR relies on another PR and needs it to be merged first.

Test Plan + Results

Ran lm_eval with fp8 quantization, with experts_int8 quantization, and without quantization. Model: ai21labs/AI21-Jamba-Mini-1.7
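The exact invocation isn't shown in the PR; a run of roughly this shape would produce tables like the ones below. The model name is from the PR, but the flags are a plausible sketch of an lm_eval-over-vLLM command, not a verified command line.

```shell
# Hedged sketch: load the model through lm_eval's vllm backend with the
# quantization mode under test, then evaluate HumanEval pass@1.
# Flag spellings are assumptions, not copied from the PR.
lm_eval --model vllm \
  --model_args pretrained=ai21labs/AI21-Jamba-Mini-1.7,quantization=experts_int8 \
  --tasks humaneval \
  --batch_size auto \
  --confirm_run_unsafe_code
```

Repeating with `quantization=fp8` and with the `quantization` argument omitted gives the other two rows of each comparison.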

This PR
experts_int8

|  Tasks  |Version|  Filter   |n-shot|Metric|   |Value |   |Stderr|
|---------|------:|-----------|-----:|------|---|-----:|---|-----:|
|humaneval|      1|create_test|     0|pass@1|   |0.3963|±  |0.0383|

fp8

|  Tasks  |Version|  Filter   |n-shot|Metric|   |Value|   |Stderr|
|---------|------:|-----------|-----:|------|---|----:|---|-----:|
|humaneval|      1|create_test|     0|pass@1|   |0.378|±  | 0.038|

no quantization

|  Tasks  |Version|  Filter   |n-shot|Metric|   |Value |   |Stderr|
|---------|------:|-----------|-----:|------|---|-----:|---|-----:|
|humaneval|      1|create_test|     0|pass@1|   |0.3902|±  |0.0382|

vllm main

no quantization

|  Tasks  |Version|  Filter   |n-shot|Metric|   |Value |   |Stderr|
|---------|------:|-----------|-----:|------|---|-----:|---|-----:|
|humaneval|      1|create_test|     0|pass@1|   |0.3902|±  |0.0382|

experts_int8

|  Tasks  |Version|  Filter   |n-shot|Metric|   |Value |   |Stderr|
|---------|------:|-----------|-----:|------|---|-----:|---|-----:|
|humaneval|      1|create_test|     0|pass@1|   |0.3841|±  |0.0381|

fp8

|  Tasks  |Version|  Filter   |n-shot|Metric|   |Value |   |Stderr|
|---------|------:|-----------|-----:|------|---|-----:|---|-----:|
|humaneval|      1|create_test|     0|pass@1|   |0.3598|±  |0.0376|

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request successfully consolidates the weight loading and quantization logic for MoE layers into a new MoeOnlineWeightLoader class and a MoeQuantizationCallbacks protocol. This significantly improves code modularity and reusability, as both INT8 and FP8 online quantization methods now implement the same interface. The changes also include necessary adjustments for handling dummy weights and deferred materialization, ensuring compatibility and correctness across different quantization schemes. The introduction of kInt8StaticChannelSym and its integration into the _supports_quant_scheme function correctly enables INT8 support. Overall, the changes are well-structured and contribute positively to the codebase's maintainability and extensibility.


mergify bot commented Jan 27, 2026

Documentation preview: https://vllm--33178.org.readthedocs.build/en/33178/

@mergify mergify bot added documentation Improvements or additions to documentation ci/build deepseek Related to DeepSeek models frontend llama Related to Llama models multi-modality Related to multi-modality (#4194) new-model Requests to new models performance Performance-related issues qwen Related to Qwen models gpt-oss Related to GPT-OSS models nvidia labels Jan 27, 2026
@mergify mergify bot added the rocm Related to AMD ROCm label Jan 27, 2026
@mergify mergify bot added v1 tpu Related to Google TPUs kv-connector labels Jan 27, 2026

mergify bot commented Jan 27, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Josephasafg.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 27, 2026
Signed-off-by: Josephasafg <ajgard7@gmail.com>
Contributor

@kylesayrs kylesayrs left a comment


Nice job, this is roughly what I'd expect. @vkuzo and I have had a longer discussion about the overlap between online quantization and layerwise reload logic, but that's probably out of scope for this PR.

@vkuzo We definitely need to start standardizing how we're going to handle online weights. Whatever strategy we go with, we should at least make sure all online quantization implementations use the same strategy and reuse utilities wherever possible.

Comment on lines +190 to +194
# Refresh the reference to `param` to reflect JIT materialization
if id(param) == layer._w13_weight_orig_id:
param = layer.w13_weight
elif id(param) == layer._w2_weight_orig_id:
param = layer.w2_weight
Contributor


I don't fully understand this logic. Why does param need to be refreshed? Why not just get the name of the parameter, rather than using ids?

Contributor Author


I kept it from the original fp8 implementation, but I believe it's there because when we register_parameter("w13_weight", new_tensor), the layer.w13_weight (or w2_weight) reference updates while the param argument still points to the old meta tensor. ids are used because param is just a tensor reference; it doesn't carry the parameter name, so we save the original ids to identify which parameter it was.
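A torch-free toy reproduces the failure mode being described: a caller-held reference goes stale when the attribute is rebound to a new object, and identity comparison is one way to detect and refresh it. Names mirror the snippet under review but the types are illustrative placeholders.

```python
class Layer:
    """Stand-in for an nn.Module; w13_weight begins as a 'meta' placeholder."""

    def __init__(self):
        self.w13_weight = ["meta"]  # plays the role of the meta tensor
        self._w13_weight_orig_id = id(self.w13_weight)


def materialize(layer):
    # Plays the role of register_parameter("w13_weight", new_tensor):
    # the attribute is rebound to a brand-new object.
    layer.w13_weight = [1.0, 2.0]


layer = Layer()
param = layer.w13_weight        # reference held before materialization
materialize(layer)

assert param == ["meta"]        # stale: still points at the old object
if id(param) == layer._w13_weight_orig_id:
    param = layer.w13_weight    # the refresh trick from the reviewed snippet
assert param == [1.0, 2.0]      # now tracks the materialized weight
```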

Contributor


Hm, so this seems to be because, when you materialize the parameter, the newly materialized tensor overrides the original meta tensor. This is an artifact of having a weight_loader definition which is shared between multiple parameters.

In the reloading implementation, we handle this by just creating separate, generic weight loaders for each parameter.
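The per-parameter alternative being suggested can be sketched as a closure factory. This is a hypothetical illustration of the idea, not the actual reloading implementation: each loader is bound to one attribute name, so it never caches a tensor reference that can go stale.

```python
def make_weight_loader(layer, name):
    # One loader per parameter, bound to its name. The loader resolves the
    # current object through getattr/setattr on every call, so rebinding the
    # attribute elsewhere cannot leave it holding a stale reference.
    def load(new_tensor):
        setattr(layer, name, new_tensor)
        return getattr(layer, name)
    return load


class Layer:
    def __init__(self):
        self.w13_weight = None
        self.w2_weight = None


layer = Layer()
load_w13 = make_weight_loader(layer, "w13_weight")
load_w2 = make_weight_loader(layer, "w2_weight")
load_w13([1, 2])
load_w2([3, 4])
```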

self.process_weights_after_loading(layer)

# Prevent the usual `process_weights_after_loading` call
layer._already_called_process_weights_after_loading = True
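The flag in the snippet above is a simple idempotence guard. A generic version of the pattern (function and flag names here are hypothetical) might look like:

```python
def process_weights_once(layer, process_fn):
    # Run process_fn(layer) at most once per layer, using a flag attribute,
    # mirroring _already_called_process_weights_after_loading above.
    if getattr(layer, "_already_called_process_weights_after_loading", False):
        return False
    process_fn(layer)
    layer._already_called_process_weights_after_loading = True
    return True
```

Returning a bool makes double invocation observable, which is handy when testing that the usual post-loading hook was in fact suppressed.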
Contributor


@vkuzo We should probably try to standardize on a strategy here. I think that this is fine

Signed-off-by: Josephasafg <ajgard7@gmail.com>
Contributor Author

Josephasafg commented Feb 16, 2026

@kylesayrs @vkuzo I opened #34645 to easily add support for experts_int8 when loading with dummy weights. LMK what you think, and if it's merged, I'll add this change here as well.

Signed-off-by: Josephasafg <ajgard7@gmail.com>
@Josephasafg Josephasafg requested a review from 22quinn as a code owner February 16, 2026 21:15

mergify bot commented Feb 16, 2026

Hi @Josephasafg, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

Josephasafg and others added 6 commits February 16, 2026 23:31
Signed-off-by: Josephasafg <ajgard7@gmail.com>
@Josephasafg Josephasafg requested a review from vkuzo February 18, 2026 12:48
Contributor Author

Added uses_meta_device to experts_int8 after merging #34645


mergify bot commented Feb 24, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Josephasafg.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Feb 24, 2026
@mergify mergify bot removed the needs-rebase label Feb 24, 2026

mergify bot commented Feb 24, 2026

Hi @Josephasafg, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.


Signed-off-by: Josephasafg <ajgard7@gmail.com>
Contributor Author

Closing in favor of #38463.

@github-project-automation github-project-automation bot moved this from Todo to Done in AMD Mar 29, 2026
@github-project-automation github-project-automation bot moved this from To Triage to Done in gpt-oss Issues & Enhancements Mar 29, 2026
@github-project-automation github-project-automation bot moved this to Done in NVIDIA Mar 29, 2026