
[Tencent][FlashInfer][GroupGemm] Integrate H20 W4A8 Grouped Gemm Kernel into FlashInfer. #1987

@DwenGu

Description


We are using vLLM/FlashInfer to optimize LLM models. Low latency and throughput@latency are the two scenarios customers care about most, and W4A8 Grouped Gemm kernel performance is the key factor for both.

The W4A8 Grouped Gemm kernel in TRT-LLM outperforms vLLM's FP8 per-tensor path when m < 256, so we would like to integrate this kernel into FlashInfer.

The kernel in TRT-LLM: https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws_mixed_dtype.h
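
For reference, here is a minimal PyTorch sketch of the computation such a kernel performs: per-group (per-expert) GEMMs with int4 weights and 8-bit activations, each dequantized via scales. All names and shapes below are illustrative assumptions, not FlashInfer's or TRT-LLM's actual API; the real kernel fuses the dequantization into the GEMM rather than materializing float tensors.

```python
import torch

def w4a8_grouped_gemm_ref(acts, act_scales, qweights, w_scales, group_sizes):
    """Reference (unfused) W4A8 grouped GEMM.

    acts:        [total_m, k] int8 activations, rows grouped by expert
    act_scales:  [num_groups] per-group activation dequant scales
    qweights:    [num_groups, k, n] int4 weights stored in int8 (values in [-8, 7])
    w_scales:    [num_groups, n] per-output-channel weight dequant scales
    group_sizes: number of activation rows belonging to each group
    """
    outs, start = [], 0
    for g, m in enumerate(group_sizes):
        a = acts[start:start + m].float() * act_scales[g]  # dequantize A
        w = qweights[g].float() * w_scales[g]              # dequantize W
        outs.append(a @ w)                                 # [m, k] @ [k, n]
        start += m
    return torch.cat(outs, dim=0)

# Illustrative usage with two groups of sizes 3 and 5:
k, n, groups = 128, 64, (3, 5)
acts = torch.randint(-128, 128, (sum(groups), k), dtype=torch.int8)
act_scales = torch.rand(len(groups))
qweights = torch.randint(-8, 8, (len(groups), k, n), dtype=torch.int8)
w_scales = torch.rand(len(groups), n)
out = w4a8_grouped_gemm_ref(acts, act_scales, qweights, w_scales, groups)
```

The m < 256 regime mentioned above corresponds to small per-group row counts, where the mixed-dtype TMA warp-specialized kernel's reduced weight-memory traffic pays off.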
