We use vLLM with FlashInfer to optimize LLM inference.
Low latency and throughput@latency are the two scenarios customers care about most, and W4A8 Grouped GEMM kernel performance is the key factor for both.
The W4A8 Grouped GEMM kernel in TRT-LLM outperforms vLLM's FP8 per-tensor path when m < 256, so we would like to import this kernel through FlashInfer.
The kernel in TensorRT-LLM: https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws_mixed_dtype.h
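
For context, below is a minimal PyTorch reference of the computation such a W4A8 grouped GEMM performs: per-expert GEMMs over a token batch partitioned by expert, with 4-bit weights dequantized via group-wise scales and 8-bit quantized activations. The function name, weight layout, and scale granularity here are illustrative assumptions for readability, not the TRT-LLM kernel's actual packed format or API:

```python
# Reference of the math only, not the kernel. Assumptions for illustration:
# int4 weight values stored unpacked as int8 in [-8, 7], per-group scales along K,
# int8 activations with a single per-tensor scale, fp32 compute for clarity.
import torch

def grouped_gemm_w4a8_ref(a_q, a_scale, w_q, w_scale, expert_offsets, group_size=128):
    """a_q: [M, K] int8 activations; a_scale: per-tensor activation scale.
    w_q: [E, K, N] int4 values stored as int8; w_scale: [E, K // group_size, N] fp32.
    expert_offsets: [E + 1] row offsets partitioning the M rows by expert."""
    M, K = a_q.shape
    E, _, N = w_q.shape
    out = torch.zeros(M, N, dtype=torch.float32)
    for e in range(E):
        lo, hi = expert_offsets[e].item(), expert_offsets[e + 1].item()
        if lo == hi:
            continue  # no tokens routed to this expert
        # Dequantize this expert's weights: expand per-group scales along K.
        scales = w_scale[e].repeat_interleave(group_size, dim=0)  # [K, N]
        w = w_q[e].to(torch.float32) * scales
        a = a_q[lo:hi].to(torch.float32) * a_scale
        out[lo:hi] = a @ w
    return out

# Tiny usage example: 2 experts, rows 0-2 routed to expert 0, rows 3-4 to expert 1.
M, K, N, E, G = 5, 256, 64, 2, 128
a_q = torch.randint(-127, 128, (M, K), dtype=torch.int8)
w_q = torch.randint(-8, 8, (E, K, N), dtype=torch.int8)
w_scale = torch.rand(E, K // G, N) * 0.01
offsets = torch.tensor([0, 3, 5])
y = grouped_gemm_w4a8_ref(a_q, 0.02, w_q, w_scale, offsets, G)
print(y.shape)  # torch.Size([5, 64])
```

The small-m regime (m < 256 above) is exactly where these per-expert GEMMs are skinny and weight loading dominates, which is why a mixed-dtype W4A8 kernel helps there.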