[Quantization] - Consolidate experts_int8 with FP8 Modular Kernels#33178
Josephasafg wants to merge 55 commits into vllm-project:main
Conversation
Force-pushed from dfd4e13 to 31e1b8a
Code Review
The pull request successfully consolidates the weight loading and quantization logic for MoE layers into a new MoeOnlineWeightLoader class and a MoeQuantizationCallbacks protocol. This significantly improves code modularity and reusability, as both INT8 and FP8 online quantization methods now implement the same interface. The changes also include necessary adjustments for handling dummy weights and deferred materialization, ensuring compatibility and correctness across different quantization schemes. The introduction of kInt8StaticChannelSym and its integration into the _supports_quant_scheme function correctly enables INT8 support. Overall, the changes are well-structured and contribute positively to the codebase's maintainability and extensibility.
Documentation preview: https://vllm--33178.org.readthedocs.build/en/33178/
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: Josephasafg <ajgard7@gmail.com>
kylesayrs left a comment
Nice job, this is roughly what I'd expect. @vkuzo and I have had a longer discussion about the overlap between online quantization and layerwise reload logic, but that's probably out of scope for this PR.
@vkuzo We definitely need to start standardizing how we're going to handle online weights. Whatever strategy we go with, we should at least make sure all online quantization implementations use the same strategy and reuse utilities wherever possible.
vllm/model_executor/layers/quantization/utils/moe_online_weight_quantizer.py
```python
# Refresh the reference to `param` to reflect JIT materialization
if id(param) == layer._w13_weight_orig_id:
    param = layer.w13_weight
elif id(param) == layer._w2_weight_orig_id:
    param = layer.w2_weight
```
I don't fully understand this logic. Why does param need to be refreshed? Why not just get the name of the parameter rather than using ids?
I kept it from the original fp8 implementation, but I believe it's there because when we `register_parameter("w13_weight", new_tensor)`, the `layer.w13_weight` (or `w2_weight`) reference updates, while the `param` argument still points to the old meta tensor. Ids are used because `param` is just a tensor reference; it doesn't carry the parameter name, so we save the original ids to identify which parameter it was.
Hm, so this seems to be because, when you materialize the parameter, the newly materialized tensor overrides the original meta tensor. This is an artifact of having a weight_loader definition which is shared between multiple parameters.
In the reloading implementation, we handle this by just creating separate, generic weight loaders for each parameter.
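The stale-reference problem described above can be shown with a toy sketch (plain Python objects standing in for tensors; this is an illustration, not vLLM code): rebinding a module attribute leaves earlier references pointing at the old object, so a shared loader must re-resolve `param` via the saved ids.

```python
class Layer:
    pass

layer = Layer()
layer.w13_weight = object()  # stands in for the original meta tensor
layer._w13_weight_orig_id = id(layer.w13_weight)

# Reference captured before materialization, as a shared weight_loader would hold it.
param = layer.w13_weight

# "Materialization" rebinds the attribute to a new tensor-like object,
# but the local `param` still points at the old meta tensor.
layer.w13_weight = object()
assert param is not layer.w13_weight

# The id-based refresh re-resolves the stale reference:
if id(param) == layer._w13_weight_orig_id:
    param = layer.w13_weight
assert param is layer.w13_weight
```

A per-parameter weight loader, as used in the reloading implementation, avoids this entirely because each loader closes over its own parameter name instead of a possibly stale tensor reference.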
vllm/model_executor/layers/quantization/utils/moe_online_weight_quantizer.py
vllm/model_executor/layers/quantization/utils/moe_online_weight_quantizer.py
```python
self.process_weights_after_loading(layer)

# Prevent the usual `process_weights_after_loading` call
layer._already_called_process_weights_after_loading = True
```
@vkuzo We should probably try to standardize on a strategy here. I think that this is fine.
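The guard pattern in the diff above can be sketched as follows (the surrounding class and method names besides the flag are hypothetical; the flag name follows the diff): online quantization runs `process_weights_after_loading` eagerly, then sets a flag so the normal post-load pass skips the layer instead of quantizing it twice.

```python
class QuantMethodSketch:
    def quantize_online(self, layer):
        # Online path: quantize now, then mark the layer so the usual
        # post-load hook does not run a second time.
        self.process_weights_after_loading(layer)
        # Prevent the usual `process_weights_after_loading` call
        layer._already_called_process_weights_after_loading = True

    def maybe_process_weights_after_loading(self, layer):
        # The normal post-load pass checks the flag first.
        if getattr(layer, "_already_called_process_weights_after_loading", False):
            return  # already quantized during online loading
        self.process_weights_after_loading(layer)

    def process_weights_after_loading(self, layer):
        # Stand-in for the real quantization work; counts invocations.
        layer.calls = getattr(layer, "calls", 0) + 1


class Layer:
    pass


layer = Layer()
method = QuantMethodSketch()
method.quantize_online(layer)
method.maybe_process_weights_after_loading(layer)  # skipped: flag already set
assert layer.calls == 1
```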
vllm/model_executor/layers/quantization/utils/moe_online_weight_quantizer.py
Signed-off-by: Josephasafg <ajgard7@gmail.com>
@kylesayrs @vkuzo I opened #34645 to easily add support for
Signed-off-by: Josephasafg <ajgard7@gmail.com>
Hi @Josephasafg, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Signed-off-by: Josephasafg <ajgard7@gmail.com>
Added
This pull request has merge conflicts that must be resolved before it can be merged.
Hi @Josephasafg, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
Signed-off-by: Josephasafg <ajgard7@gmail.com>
Closing in favor of #38463
Purpose
Consolidates `experts_int8` and `Fp8OnlineMoEMethod` to share common weight loading and quantization logic through a new `MoeOnlineWeightLoader` class.

Key changes:

- `MoeOnlineWeightLoader` in `moe_weight_loader.py` that handles weight loading for MoE layers
- `MoeQuantizationCallbacks` protocol that both `int8` and `fp8` methods implement

Diagram of the consolidation and inheritance between fp8 and experts_int8

Test Plan + Results
Ran lm_eval with and without fp8 and experts_int8 quantization. Model: `ai21labs/AI21-Jamba-Mini-1.7`

This PR

- experts_int8
- fp8
- no quantization

vllm main

- no quantization
- experts_int8
- fp8
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.