We use vLLM with FlashInfer to optimize LLM inference.
Low latency and throughput@latency are the two scenarios customers care about most, and W4A8 Grouped GEMM kernel performance is the key factor for both.
The W4A8 Grouped GEMM kernel in TRT-LLM outperforms vLLM's FP8 per-tensor path when m < 256, so we would like to import this kernel through FlashInfer.
The kernel in TensorRT-LLM: https://github.com/NVIDIA/TensorRT-LLM/blob/main/cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_template_dispatch_tma_ws_mixed_dtype.h
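
For context, below is a minimal PyTorch reference of the computation such a W4A8 grouped GEMM performs: per-expert GEMMs over a token batch partitioned by expert, with 4-bit weights dequantized via group-wise scales and 8-bit quantized activations. The function name, weight layout, and scale granularity here are illustrative assumptions for readability, not the TRT-LLM kernel's actual packed format or API:

```python
# Reference of the math only, not the kernel. Assumptions for illustration:
# int4 weight values stored unpacked as int8 in [-8, 7], per-group scales along K,
# int8 activations with a single per-tensor scale, fp32 compute for clarity.
import torch

def grouped_gemm_w4a8_ref(a_q, a_scale, w_q, w_scale, expert_offsets, group_size=128):
    """a_q: [M, K] int8 activations; a_scale: per-tensor activation scale.
    w_q: [E, K, N] int4 values stored as int8; w_scale: [E, K // group_size, N] fp32.
    expert_offsets: [E + 1] row offsets partitioning the M rows by expert."""
    M, K = a_q.shape
    E, _, N = w_q.shape
    out = torch.zeros(M, N, dtype=torch.float32)
    for e in range(E):
        lo, hi = expert_offsets[e].item(), expert_offsets[e + 1].item()
        if lo == hi:
            continue  # no tokens routed to this expert
        # Dequantize this expert's weights: expand per-group scales along K.
        scales = w_scale[e].repeat_interleave(group_size, dim=0)  # [K, N]
        w = w_q[e].to(torch.float32) * scales
        a = a_q[lo:hi].to(torch.float32) * a_scale
        out[lo:hi] = a @ w
    return out

# Tiny usage example: 2 experts, rows 0-2 routed to expert 0, rows 3-4 to expert 1.
M, K, N, E, G = 5, 256, 64, 2, 128
a_q = torch.randint(-127, 128, (M, K), dtype=torch.int8)
w_q = torch.randint(-8, 8, (E, K, N), dtype=torch.int8)
w_scale = torch.rand(E, K // G, N) * 0.01
offsets = torch.tensor([0, 3, 5])
y = grouped_gemm_w4a8_ref(a_q, 0.02, w_q, w_scale, offsets, G)
print(y.shape)  # torch.Size([5, 64])
```

The small-m regime (m < 256 above) is exactly where these per-expert GEMMs are skinny and weight loading dominates, which is why a mixed-dtype W4A8 kernel helps there.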