[Feature]: Use QuantFp8 CustomOp-abstraction for MoE layers #20711

@ProExpertProg

Description

🚀 The feature, motivation and pitch

#19830 added QuantFp8, which uses the CustomOp abstraction to implement fp8 quantization in both CUDA and native torch, letting Inductor outperform the CUDA ops (which are unoptimized and also do not fuse by default). However, the class has to be instantiated at init time, while the MoE code currently performs quantization through free utility functions nested many call levels deep. Those call sites need to be mildly rearchitected to take advantage of the new abstraction.

The call site to be rearchitected is here: https://github.com/vllm-project/vllm/blob/c7a00e6e6716f45db09e39cb21a8f91f741f10b9/vllm/model_executor/layers/fused_moe/utils.py#L37-L40

The free functions should be converted to class instances with separate init and forward steps.
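A minimal sketch of the proposed shape, assuming a hypothetical `MoEInputQuantizer` wrapper and a torch-only stand-in for `QuantFp8` (the real class from #19830 is a CustomOp that dispatches between a CUDA kernel and a torch implementation; its import path and constructor arguments are not reproduced here, so every name below other than `QuantFp8` itself is illustrative):

```python
# Sketch only: `_Fp8QuantStandIn` emulates the torch path of QuantFp8 so the
# example runs without vLLM installed.
from typing import Optional

import torch


class _Fp8QuantStandIn:
    """Stand-in for QuantFp8: dynamic per-tensor fp8 quantization in torch."""

    def __init__(self, static: bool = False):
        self.static = static
        self.fp8_max = torch.finfo(torch.float8_e4m3fn).max

    def forward(self, x: torch.Tensor, scale: Optional[torch.Tensor] = None):
        if scale is None:
            # Dynamic quantization: derive the scale from the input.
            scale = x.abs().max().clamp(min=1e-12) / self.fp8_max
        x_q = (x / scale).clamp(-self.fp8_max, self.fp8_max).to(torch.float8_e4m3fn)
        return x_q, scale


class MoEInputQuantizer:
    """Hypothetical replacement for the free quantization helpers in
    fused_moe/utils.py: the op is built once at init and only invoked in
    forward, so the torch path is a stable target for Inductor fusion."""

    def __init__(self, static: bool = False):
        # Init step: instantiate the CustomOp-backed quantizer up front.
        self.quant_fp8 = _Fp8QuantStandIn(static=static)

    def forward(self, hidden_states: torch.Tensor,
                scale: Optional[torch.Tensor] = None):
        # Forward step: no construction in the hot path, just the call.
        return self.quant_fp8.forward(hidden_states, scale)


# Usage: construct during layer init, call inside the MoE forward pass.
quantizer = MoEInputQuantizer()
x_q, x_scale = quantizer.forward(torch.randn(4, 128))
```

The point of the split is that instantiation happens once, outside the per-token path, which is what the CustomOp dispatch and Inductor fusion need.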

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you have already searched for relevant issues and asked the chatbot at the bottom right corner of the documentation page, which can answer many frequently asked questions.
