[Quantization] Consolidate experts_int8 with fp8 online quantization #38463
Josephasafg wants to merge 8 commits into vllm-project:main
Conversation
Signed-off-by: Josephasafg <ajgard7@gmail.com>
Code Review
This pull request refactors the online MoE quantization infrastructure by introducing a common base class, `OnlineMoEMethodBase`, and a mixin, `Fp8MoEKernelMixin`, to share logic between different quantization methods. It migrates `ExpertsInt8MoEMethod` and `Fp8OnlineMoEMethod` to this new architecture, which uses meta-device weight allocation and deferred quantization after model loading. Review feedback identifies potential division-by-zero issues in the int8 quantization loop when encountering zero-valued weight rows, and highlights inefficient cross-device tensor allocations for scale parameters that should instead be created on the same device as the weights.
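The division-by-zero concern in the review can be addressed by clamping the per-row max-abs value before dividing. A minimal sketch (this is a hypothetical helper, not the PR's actual code; the clamp epsilon is an assumption):

```python
import torch

def quantize_int8_rows(weight: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    # Per-row symmetric int8 quantization: scale each row by its max-abs / 127.
    max_abs = weight.abs().max(dim=-1, keepdim=True).values
    # Guard against all-zero rows: the clamp prevents a zero divisor,
    # and zero rows still quantize to all-zero int8 values.
    scale = (max_abs / 127.0).clamp(min=1e-10)
    q = torch.round(weight / scale).clamp(-128, 127).to(torch.int8)
    return q, scale.squeeze(-1)

# Zero-valued rows no longer produce inf/NaN scales.
q, s = quantize_int8_rows(torch.zeros(2, 4))
assert torch.isfinite(s).all() and (q == 0).all()
```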
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
Following up on #38032, this PR consolidates experts_int8 with fp8's online quantization infrastructure (QeRL). It extracts shared online MoE quantization logic into a common base class and refactors fp8's MoE kernel infrastructure into a reusable mixin.

Test Plan
experts_int8 and fp8 tests should pass
Test Result
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.