We would like feedback from the community on this rough plan for Q4. This is of course a work in progress, and we welcome feedback at any time. Please add comments below or on any specific issues. We will edit this description as plans change.
Also, if you're interested in contributing, feel free to dive into any of the unassigned issues!
Soul
These are the broad areas of focus for the quarter. Items in the roadmap below are tagged by “soul item”.
- [Testing]: Better testing and CI infrastructure to prevent build breaks and accuracy issues at the framework level
- [Model Optimization]: DeepSeek-R1, GPT-OSS, Qwen3, Qwen3-Next, MiniCPM4.1-8B, and others
- [API Usability]: API cleanup and refactoring for better user experience
October
- [Model Optimization] Update the routing for TRTLLMGEN to support Kimi K2 and Qwen #1831
- [Model Optimization] GPT-OSS perf improvements for max throughput case
- [Testing] Expanded CI coverage per-PR (ability to trigger tests on NVIDIA-internal test infrastructure, including various Blackwell devices)
- [Model Optimization] Non-gated MoE with squared ReLU activation (Support Relu2 activation in fused MoE #1954); a reference sketch of the activation follows this list
- [Model Optimization] [Perf] FP4 MoE on B200 (latency) #1734, perf: Speed up fp4 quantization for small batch with swizzling for cutlass MoE #2025
- [Model Optimization] Add RoPE, RoPE+Q, RoPE+Q+KVCacheUpdate fused kernels for MLA/GQA/MHA (un-fused RoPE reference sketch after this list)
- [API Usability] Support checks PoC #1809
- [API Usability] refactor: using tvm-ffi for multi-platform bindings #1641
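As a point of reference for the Relu2 / non-gated MoE item above, the sketch below shows the activation being fused, written as plain PyTorch rather than FlashInfer's fused kernel; the expert shapes and function names are illustrative only, not library API.

```python
import torch

def relu2(x: torch.Tensor) -> torch.Tensor:
    """Squared ReLU: relu(x) ** 2, the activation used by non-gated experts."""
    return torch.square(torch.relu(x))

def nongated_expert(x: torch.Tensor, w_up: torch.Tensor, w_down: torch.Tensor) -> torch.Tensor:
    """Un-fused reference for one non-gated expert MLP: down(relu2(up(x)))."""
    return relu2(x @ w_up.t()) @ w_down.t()

# Toy shapes for a single expert: hidden=8, intermediate=16, 4 tokens.
x = torch.randn(4, 8)
w_up = torch.randn(16, 8)
w_down = torch.randn(8, 16)
out = nongated_expert(x, w_up, w_down)  # (4, 8)
```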
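Similarly, the RoPE fusion item targets folding the following un-fused steps into a single kernel. This reference assumes the common "rotate-half" layout with a precomputed cos/sin cache gathered by position id; the exact layout and fusion boundaries in FlashInfer may differ.

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Split the head dim in two and rotate: (x1, x2) -> (-x2, x1).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q, k, cos, sin):
    """Un-fused RoPE reference. q/k: (num_tokens, num_heads, head_dim);
    cos/sin: (num_tokens, head_dim) gathered from a cache by position id."""
    cos, sin = cos.unsqueeze(1), sin.unsqueeze(1)  # broadcast over heads
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin

# Toy example: 3 tokens, 2 heads, head_dim 64, positions 0..2.
head_dim, positions = 64, torch.arange(3)
inv_freq = 1.0 / (10000.0 ** (torch.arange(0, head_dim, 2).float() / head_dim))
freqs = positions[:, None].float() * inv_freq[None, :]        # (3, 32)
cos_cache = torch.cat((freqs.cos(), freqs.cos()), dim=-1)     # (3, 64)
sin_cache = torch.cat((freqs.sin(), freqs.sin()), dim=-1)
q, k = torch.randn(3, 2, head_dim), torch.randn(3, 2, head_dim)
q_rot, k_rot = apply_rope(q, k, cos_cache, sin_cache)
```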
November
- [Feature Request] Support all head dims in BatchQKApplyRotaryPosIdsCosSinCache #2104
- [feature request] add optimal gb200 moe comm kernels #2094
- [Feature Request] "auto" backend for mm_fp4 #1722
- FlashInfer 0.2.3+ does not support per-request generators #1104
- [Model Optimization] DSR1 improvements (details TBD)
- [Model Optimization] [Feature Request] Gated Delta Net #1690
- [Testing] Initial integration testing: e2e functional sanity checks
- [Testing] [API Usability] Minimal example of deploying LLM with flashinfer APIs (e.g. through gpt-fast). #1811
- [Model Optimization] Cosmos Reasoning 7B (details TBD)
- [Model Optimization] MXFP4 gemm perf improvements
- [API Usability] Support FP8-qkv FP8/FP4-output trtllm-gen in FlashInfer prefill/decode wrapper
- [API Usability] Unify qk_scale and o_scale Behavior Between trtllm-gen Attention and flashinfer-jit Attention (scale-semantics sketch after this list)
- [API Usability] Fused MoE general improvements (details TBD)
- [API Usability] Inaccurate API Docstrings for Attention Prefill #1709
- [API Usability] Support checks: follow-up to Support checks PoC #1809, scaling the checks across the library
- [Testing] Add comprehensive test xfails tracking system and analysis report #1733
- [Model Optimization] Native Sparse Attention
- [Model Optimization] [Feature Request] TopK Sparse Attention #1691 (dense top-k reference sketch after this list)
- [API Usability] [Feature Request] "auto" backend for mm_fp4 #1722
- [Model Optimization] [Perf] FP4 GEMM on B200 #1732
- [API Usability] Unifying quantization-related modules (fp4 quantize/quantize): silu_and_mul nvfp4 quantization fusion rework #1927
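For the qk_scale / o_scale unification item, one common set of semantics is sketched below: qk_scale multiplies the pre-softmax logits and o_scale multiplies the output (often folding in FP8/FP4 output dequantization). Whether both backends apply the scales exactly this way, and with which defaults, is precisely what the item aims to pin down, so treat this as an assumption rather than the settled API.

```python
import math
import torch

def attention_reference(q, k, v, qk_scale=None, o_scale=1.0):
    """Single-head reference: softmax(q @ k^T * qk_scale) @ v * o_scale.
    q: (m, d), k/v: (n, d). qk_scale defaults to 1/sqrt(d)."""
    if qk_scale is None:
        qk_scale = 1.0 / math.sqrt(q.shape[-1])
    logits = (q @ k.t()) * qk_scale       # pre-softmax scaling
    probs = torch.softmax(logits, dim=-1)
    return (probs @ v) * o_scale          # post-attention output scaling

q, k, v = torch.randn(5, 64), torch.randn(7, 64), torch.randn(7, 64)
out = attention_reference(q, k, v)  # (5, 64)
```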
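The TopK Sparse Attention item can likewise be read against a dense reference of the underlying math: keep only the k highest-scoring keys per query and softmax over those. A real kernel would avoid materializing the full score matrix and would operate on paged KV; this sketch only shows the approximation being computed.

```python
import math
import torch

def topk_sparse_attention(q, k, v, topk: int):
    """Per-query top-k attention reference. q: (m, d), k/v: (n, d)."""
    scale = 1.0 / math.sqrt(q.shape[-1])
    scores = (q @ k.t()) * scale                      # (m, n) dense scores
    kth = scores.topk(topk, dim=-1).values[:, -1:]    # k-th largest score per query
    # Mask everything below the k-th score (ties at the threshold are kept).
    scores = scores.masked_fill(scores < kth, float("-inf"))
    probs = torch.softmax(scores, dim=-1)             # weight only the kept keys
    return probs @ v

q, k, v = torch.randn(4, 64), torch.randn(128, 64), torch.randn(128, 64)
out = topk_sparse_attention(q, k, v, topk=16)  # (4, 64)
```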
December
- [API Usability] Attention API consolidation
- [Testing] Improved unit testing based on escape analysis
- [Testing] Improved integration testing based on escape analysis