
Version bump to 0.6.5 #2668

Merged
aleozlx merged 8 commits into flashinfer-ai:main from aleozlx:version_bump
Mar 4, 2026

Conversation

@aleozlx
Collaborator

@aleozlx aleozlx commented Mar 2, 2026

📌 Description

🔍 "Gated by" PR list

https://github.com/flashinfer-ai/flashinfer/pulls?q=is%3Apr+is%3Aopen+label%3Av0.6.5

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or via your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

API changes review

$ git diff v0.6.4 | grep -A20 @flashinfer_api
 @flashinfer_api
 def trtllm_fp8_per_tensor_scale_moe(
@@ -2257,10 +2323,11 @@ def trtllm_fp8_per_tensor_scale_moe(
     routed_scaling_factor: Optional[float],
     use_routing_scales_on_input: bool,
     routing_method_type: int = 0,
+    do_finalize: bool = True,
     enable_pdl: Optional[bool] = None,
     tune_max_num_tokens: int = 8192,
     activation_type: int = ActivationType.Swiglu.value,
-) -> torch.Tensor:
+) -> Union[List[torch.Tensor], torch.Tensor]:
     """FP8 per tensor scale MoE operation.

     Args:
@@ -2282,22 +2349,21 @@ def trtllm_fp8_per_tensor_scale_moe(
         routed_scaling_factor: Scaling factor for routing
         use_routing_scales_on_input: Whether to use routing scales on input
         routing_method_type: Type of routing method to use (default: 0)
+        do_finalize: Whether to finalize the output (default: True).
         enable_pdl: Whether to enable Programmatic Dependent Launch (PDL). Auto-enabled for >= sm90.
--
 @flashinfer_api
 def trtllm_fp8_block_scale_moe(
@@ -2343,10 +2418,11 @@ def trtllm_fp8_block_scale_moe(
     routing_method_type: int = 0,
     use_shuffled_weight: bool = False,
     weight_layout: int = 0,
+    do_finalize: bool = True,
     enable_pdl: Optional[bool] = None,
     tune_max_num_tokens: int = 8192,
     fp8_quantization_type: Fp8QuantizationType = Fp8QuantizationType.DeepSeekFp8,
-) -> torch.Tensor:
+) -> Union[List[torch.Tensor], torch.Tensor]:
     """FP8 block scale MoE operation.

     Args:
@@ -2374,16 +2450,18 @@ def trtllm_fp8_block_scale_moe(
         weight_layout: Weight layout format (default: WeightLayout.MajorK). Supported layouts:
             - 0: MajorK - K-major layout [Mn, K]
             - 2: BlockMajorK - Blocked along K dimension [K/blockK, Mn, blockK]
+        do_finalize: Whether to finalize the output (default: True).
         enable_pdl: Whether to enable Programmatic Dependent Launch (PDL). Auto-enabled for >= sm90.
--
 @flashinfer_api
 def trtllm_fp8_block_scale_routed_moe(
@@ -2433,11 +2520,12 @@ def trtllm_fp8_block_scale_routed_moe(
     routing_method_type: int = 0,
     use_shuffled_weight: bool = False,
     weight_layout: int = 0,
+    do_finalize: bool = True,
     enable_pdl: Optional[bool] = None,
     output: Optional[torch.Tensor] = None,
     tune_max_num_tokens: int = 8192,
     fp8_quantization_type: Fp8QuantizationType = Fp8QuantizationType.DeepSeekFp8,
-) -> torch.Tensor:
+) -> Union[List[torch.Tensor], torch.Tensor]:
     """FP8 block scale MoE operation with pre-computed routing (packed format).

     This function is used when routing decisions have already been computed
@@ -2468,14 +2556,16 @@ def trtllm_fp8_block_scale_routed_moe(
         use_shuffled_weight: Whether to use shuffled weights
         weight_layout: Weight layout (0 = MajorK, 1 = BlockMajorK)
         enable_pdl: Whether to enable Programmatic Dependent Launch (PDL). Auto-enabled for >= sm90.
+        do_finalize: Whether to finalize the output (default: True).
--
 @flashinfer_api
 def trtllm_fp4_block_scale_moe(
@@ -2589,12 +2688,8 @@ def trtllm_fp4_block_scale_moe(
         do_finalize (bool): Whether to finalize the output (default: False)
         enable_pdl (Optional[bool]): Whether to enable Programmatic Dependent Launch (PDL). Auto-enabled for >= sm90.
         activation_type (int): Type of activation function (default: 3 - Swiglu)
-            - 0: Gelu
-            - 1: Relu
-            - 2: Silu
             - 3: Swiglu
             - 4: Geglu
-            - 5: SwigluBias
             - 6: Relu2
             - 7: Identity
         tune_max_num_tokens(int): Maximum number of tokens for tuning. (default: 8192)
@@ -2726,12 +2821,8 @@ def trtllm_fp4_block_scale_routed_moe(
             - 4: RenormalizeNaive (Softmax -> TopK -> Renormalize)
         do_finalize (bool): Whether to finalize the output (default: False)
         activation_type (int): Type of activation function (default: 3 - Swiglu)
-            - 0: Gelu
-            - 1: Relu
--
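The diff above widens the return type of the three FP8 MoE entry points from torch.Tensor to Union[List[torch.Tensor], torch.Tensor], matching the new do_finalize flag: finalized output comes back as a single tensor, while skipping finalization yields a list of intermediate tensors. Callers upgrading across this change can normalize the result with a small shim; a minimal sketch (the helper name and the "first element is the main output" convention are illustrative assumptions, not documented API):

```python
from typing import Any, List, Union


def unwrap_moe_output(result: Union[List[Any], Any]) -> Any:
    """Normalize the widened MoE return type.

    With do_finalize=True the kernel returns a single tensor; with
    do_finalize=False it returns a list of unfinalized tensors. Treating
    the first list element as the primary output is an assumption made
    for this sketch, not documented behavior.
    """
    if isinstance(result, list):
        return result[0]
    return result


# Duck-typed stand-ins for torch.Tensor keep the sketch dependency-free.
print(unwrap_moe_output("t0"))          # single-tensor (finalized) path
print(unwrap_moe_output(["t0", "t1"]))  # list (unfinalized) path
```

This keeps downstream code working regardless of which do_finalize setting a caller passes.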
+@flashinfer_api
+def tinygemm_bf16(
+    input: torch.Tensor,
+    weight: torch.Tensor,
+    out: torch.Tensor,
+    bias: Optional[torch.Tensor] = None,
+    use_pdl: bool = False,
+) -> None:
+    """SM90+ optimized small GEMM: out = input @ weight.T + bias (equivalent to F.linear).
+
+    A latency-optimized, warp-specialized GEMM designed for tiny batch sizes (ideally
+    1-8 rows, where a single TILE_N=8 tile covers the entire batch dimension) using
+    Ampere-style HMMA instructions. Uses TMA for async bulk data loads and
+    mma.sync.aligned.m16n8k16 tensor core instructions with BF16 input/weight/bias/output
+    and FP32 internal accumulation. The warp-specialized design (384 threads: 4 compute +
+    8 DMA warps) with 16 pipeline stages and 4x stage unroll trades off peak throughput
+    in favor of minimal latency.
+
+    From TensorRT-LLM tinygemm2 kernel.
+
+    Args:
--
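The tinygemm_bf16 docstring pins its contract to out = input @ weight.T + bias, i.e. the same semantics as F.linear. A NumPy reference of that math (a stand-in for the CUDA kernel, illustrating shapes and the FP32-accumulation contract only; it makes no claim about the kernel's BF16 numerics):

```python
import numpy as np


def tinygemm_bf16_reference(inp, weight, bias=None):
    """Reference for out = inp @ weight.T + bias (the F.linear contract).

    Shapes follow torch.nn.functional.linear: inp is [M, K], weight is
    [N, K], bias is [N]. The real kernel targets tiny M (1-8 rows) and
    accumulates in FP32, which the explicit casts below mimic.
    """
    out = inp.astype(np.float32) @ weight.astype(np.float32).T
    if bias is not None:
        out += bias.astype(np.float32)
    return out


x = np.ones((2, 4))        # M=2 rows, within the kernel's 1-8 row sweet spot
w = np.full((3, 4), 2.0)   # N=3 output features, K=4
b = np.ones(3)
print(tinygemm_bf16_reference(x, w, b))  # each entry: 4 * (1 * 2) + 1 = 9
```

Note the weight is passed in [N, K] layout (transposed inside), exactly as with F.linear.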
 @flashinfer_api
@@ -78,6 +99,7 @@ def selective_state_update(
     intermediate_states_buffer: Optional[torch.Tensor] = None,
     intermediate_state_indices: Optional[torch.Tensor] = None,
     cache_steps: int = 0,
+    algorithm: str = "auto",
 ) -> torch.Tensor:
     r"""Selective state update operation for Mamba layers (the generation phase).

@@ -126,6 +148,10 @@ def selective_state_update(
         with shape (batch,)
     cache_steps : int
         Number of steps/tokens to cache for speculative decoding
+    algorithm : str
+        Algorithm to use: "auto" (default, picks the best kernel based on GPU arch,
+        data types, and problem size), "simple" (all GPUs), "vertical" and "horizontal"
+        (SM90+ only). MTP mode only supports "auto" or "simple".

     Returns
     -------
@@ -178,6 +204,30 @@ def selective_state_update(
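The new algorithm keyword on selective_state_update documents three constraints: "simple" runs on all GPUs, "vertical" and "horizontal" need SM90+, and MTP mode only accepts "auto" or "simple". A toy dispatcher sketching those documented rules (the function name is illustrative, and the concrete rule "auto" uses to pick a kernel is a placeholder assumption, since the docstring only says it picks the best kernel for the arch, dtypes, and problem size):

```python
def resolve_ssu_algorithm(algorithm: str, sm_version: int, mtp_mode: bool) -> str:
    """Validate and resolve the `algorithm` choice per the docstring rules.

    "simple" works everywhere; "vertical"/"horizontal" require SM90+;
    MTP mode only supports "auto" or "simple". The "auto" resolution
    below (prefer "vertical" on SM90+) is a placeholder, not the real
    heuristic.
    """
    valid = {"auto", "simple", "vertical", "horizontal"}
    if algorithm not in valid:
        raise ValueError(f"unknown algorithm {algorithm!r}")
    if mtp_mode and algorithm not in ("auto", "simple"):
        raise ValueError("MTP mode only supports 'auto' or 'simple'")
    if algorithm in ("vertical", "horizontal") and sm_version < 90:
        raise ValueError(f"{algorithm!r} requires SM90+")
    if algorithm == "auto":
        return "vertical" if sm_version >= 90 and not mtp_mode else "simple"
    return algorithm


print(resolve_ssu_algorithm("auto", 90, False))  # placeholder rule picks "vertical"
print(resolve_ssu_algorithm("auto", 80, False))  # falls back to "simple"
```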

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request updates the project's version number to 0.6.5. This change is a standard procedure for marking a new release or development iteration, ensuring that all components reflect the current state of the codebase and align with release management practices.

Highlights

  • Version Update: The project version has been incremented from 0.6.4 to 0.6.5.


Changelog
  • version.txt
    • Updated the version string to 0.6.5.

@coderabbitai
Contributor

coderabbitai bot commented Mar 2, 2026

📝 Walkthrough

Version number incremented in version.txt from 0.6.4 to 0.6.5. This is a metadata-only change affecting the project version string with no modifications to source code or functionality.

Changes

  • Version Bump (version.txt): Incremented version from 0.6.4 to 0.6.5

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Possibly related PRs

Suggested labels

run-ci

Suggested reviewers

  • yongwww
  • yzh119

Poem

🐰 A whisker-twitch bump from point-four to five,
The version marches on, code stays alive!
In version.txt we trust, oh what a delight,
Numbers dance upward, everything's tight! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Description check ⚠️ Warning: The PR description is incomplete. The Description section is empty, and the Related Issues section only contains a link to a filtered PR list without specific issue references. While checklist items are marked complete, the core description is missing. Resolution: add a clear description of what the version bump includes (API changes, features, bug fixes, etc.); the provided diff shows significant API changes that should be documented in the PR description.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: The title 'Version bump to 0.6.5' is concise, clear, and directly summarizes the main change of updating the version from 0.6.4 to 0.6.5.
  • Docstring Coverage ✅ Passed: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.



Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request updates the version number in version.txt from 0.6.4 to 0.6.5. This change is straightforward and aligns with the pull request title and description. No functional code changes were introduced.

@aleozlx
Collaborator Author

aleozlx commented Mar 3, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !367 has been created, and the CI pipeline #45263242 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #45263242: 8/20 passed

@aleozlx
Copy link
Collaborator Author

aleozlx commented Mar 4, 2026

tests clean

@aleozlx aleozlx merged commit cb593c8 into flashinfer-ai:main Mar 4, 2026
32 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Mar 9, 2026
5 tasks