We would like feedback from the community on this rough plan for Q4. This is of course a work in progress, and we welcome feedback at any time. Please add comments below or on any specific issues. We will edit this description as plans change.
Also, if you're interested in contributing, feel free to dive into any of the unassigned issues!
Soul
These are the broad areas of focus for the quarter. Items in the roadmap below are tagged by “soul item”.
- [Testing]: Better testing and CI infrastructure to prevent build breaks and accuracy issues at the framework level
- [Model Optimization]: DeepSeek-R1, GPT-OSS, Qwen3, Qwen3-Next, MiniCPM4.1-8B, and others
- [API Usability]: API cleanup and refactoring for better user experience
October
- [Model Optimization] Update the routing for TRTLLMGEN to support Kimi K2 and Qwen #1831
- [Model Optimization] GPT-OSS perf improvements for max throughput case
- [Testing] Expanded CI coverage per-PR (ability to trigger tests on NVIDIA-internal test infrastructure, including various Blackwell devices)
- [Model Optimization] Non-gated MoE with squared ReLU activation (Support Relu2 activation in fused MoE #1954); a reference sketch of the activation follows this list
- [Model Optimization] [Perf] FP4 MoE on B200 (latency) #1734, perf: Speed up fp4 quantization for small batch with swizzling for cutlass MoE #2025
- [Model Optimization] Add RoPE, RoPE+Q, RoPE+Q+KVCacheUpdate fused kernels for MLA/GQA/MHA (un-fused RoPE reference sketch after this list)
- [API Usability] Support checks PoC #1809
- [API Usability] refactor: using tvm-ffi for multi-platform bindings #1641
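As a point of reference for the Relu2 / non-gated MoE item above, the sketch below shows the activation being fused, written as plain PyTorch rather than FlashInfer's fused kernel; the expert shapes and function names are illustrative only, not library API.

```python
import torch

def relu2(x: torch.Tensor) -> torch.Tensor:
    """Squared ReLU: relu(x) ** 2, the activation used by non-gated experts."""
    return torch.square(torch.relu(x))

def nongated_expert(x: torch.Tensor, w_up: torch.Tensor, w_down: torch.Tensor) -> torch.Tensor:
    """Un-fused reference for one non-gated expert MLP: down(relu2(up(x)))."""
    return relu2(x @ w_up.t()) @ w_down.t()

# Toy shapes for a single expert: hidden=8, intermediate=16, 4 tokens.
x = torch.randn(4, 8)
w_up = torch.randn(16, 8)
w_down = torch.randn(8, 16)
out = nongated_expert(x, w_up, w_down)  # (4, 8)
```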
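Similarly, the RoPE fusion item targets folding the following un-fused steps into a single kernel. This reference assumes the common "rotate-half" layout with a precomputed cos/sin cache gathered by position id; the exact layout and fusion boundaries in FlashInfer may differ.

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Split the head dim in two and rotate: (x1, x2) -> (-x2, x1).
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q, k, cos, sin):
    """Un-fused RoPE reference. q/k: (num_tokens, num_heads, head_dim);
    cos/sin: (num_tokens, head_dim) gathered from a cache by position id."""
    cos, sin = cos.unsqueeze(1), sin.unsqueeze(1)  # broadcast over heads
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin

# Toy example: 3 tokens, 2 heads, head_dim 64, positions 0..2.
head_dim, positions = 64, torch.arange(3)
inv_freq = 1.0 / (10000.0 ** (torch.arange(0, head_dim, 2).float() / head_dim))
freqs = positions[:, None].float() * inv_freq[None, :]        # (3, 32)
cos_cache = torch.cat((freqs.cos(), freqs.cos()), dim=-1)     # (3, 64)
sin_cache = torch.cat((freqs.sin(), freqs.sin()), dim=-1)
q, k = torch.randn(3, 2, head_dim), torch.randn(3, 2, head_dim)
q_rot, k_rot = apply_rope(q, k, cos_cache, sin_cache)
```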
November
- [Feature Request] Support all head dims in BatchQKApplyRotaryPosIdsCosSinCache #2104
- [feature request] add optimal gb200 moe comm kernels #2094
- [Feature Request] "auto" backend for mm_fp4 #1722
- FlashInfer 0.2.3+ does not support per-request generators #1104
- [Model Optimization] DSR1 improvements (details TBD)
- [Model Optimization] [Feature Request] Gated Delta Net #1690
- [Testing] Initial integration testing: e2e functional sanity checks
- [Testing] [API Usability] Minimal example of deploying LLM with flashinfer APIs (e.g. through gpt-fast). #1811
- [Model Optimization] Cosmos Reasoning 7B (details TBD)
- [Model Optimization] MXFP4 gemm perf improvements
- [API Usability] Support FP8-qkv FP8/FP4-output trtllm-gen in FlashInfer prefill/decode wrapper
- [API Usability] Unify qk_scale and o_scale Behavior Between trtllm-gen Attention and flashinfer-jit Attention (scale-semantics sketch after this list)
- [API Usability] Fused MoE general improvements (details TBD)
- [API Usability] Inaccurate API Docstrings for Attention Prefill #1709
- [API Usability] Support checks: follow-up to Support checks PoC #1809, scaling the checks across the library
- [Testing] Add comprehensive test xfails tracking system and analysis report #1733
- [Model Optimization] Native Sparse Attention
- [Model Optimization] [Feature Request] TopK Sparse Attention #1691 (dense top-k reference sketch after this list)
- [API Usability] [Feature Request] "auto" backend for mm_fp4 #1722
- [Model Optimization] [Perf] FP4 GEMM on B200 #1732
- [API Usability] Unifying quantization-related modules (fp4 quantize/quantize): silu_and_mul nvfp4 quantization fusion rework #1927
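For the qk_scale / o_scale unification item, one common set of semantics is sketched below: qk_scale multiplies the pre-softmax logits and o_scale multiplies the output (often folding in FP8/FP4 output dequantization). Whether both backends apply the scales exactly this way, and with which defaults, is precisely what the item aims to pin down, so treat this as an assumption rather than the settled API.

```python
import math
import torch

def attention_reference(q, k, v, qk_scale=None, o_scale=1.0):
    """Single-head reference: softmax(q @ k^T * qk_scale) @ v * o_scale.
    q: (m, d), k/v: (n, d). qk_scale defaults to 1/sqrt(d)."""
    if qk_scale is None:
        qk_scale = 1.0 / math.sqrt(q.shape[-1])
    logits = (q @ k.t()) * qk_scale       # pre-softmax scaling
    probs = torch.softmax(logits, dim=-1)
    return (probs @ v) * o_scale          # post-attention output scaling

q, k, v = torch.randn(5, 64), torch.randn(7, 64), torch.randn(7, 64)
out = attention_reference(q, k, v)  # (5, 64)
```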
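The TopK Sparse Attention item can likewise be read against a dense reference of the underlying math: keep only the k highest-scoring keys per query and softmax over those. A real kernel would avoid materializing the full score matrix and would operate on paged KV; this sketch only shows the approximation being computed.

```python
import math
import torch

def topk_sparse_attention(q, k, v, topk: int):
    """Per-query top-k attention reference. q: (m, d), k/v: (n, d)."""
    scale = 1.0 / math.sqrt(q.shape[-1])
    scores = (q @ k.t()) * scale                      # (m, n) dense scores
    kth = scores.topk(topk, dim=-1).values[:, -1:]    # k-th largest score per query
    # Mask everything below the k-th score (ties at the threshold are kept).
    scores = scores.masked_fill(scores < kth, float("-inf"))
    probs = torch.softmax(scores, dim=-1)             # weight only the kept keys
    return probs @ v

q, k, v = torch.randn(4, 64), torch.randn(128, 64), torch.randn(128, 64)
out = topk_sparse_attention(q, k, v, topk=16)  # (4, 64)
```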
December
- [API Usability] Attention API consolidation
- [Testing] Improved unit testing based on escape analysis
- [Testing] Improved integration testing based on escape analysis