
Version bump to 0.6.5 #2668

Merged
aleozlx merged 8 commits into flashinfer-ai:main from aleozlx:version_bump
Mar 4, 2026

Conversation

@aleozlx
Collaborator

@aleozlx aleozlx commented Mar 2, 2026

📌 Description

🔍 "Gated by" PR list

https://github.com/flashinfer-ai/flashinfer/pulls?q=is%3Apr+is%3Aopen+label%3Av0.6.5

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or via your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

API changes review

$ git diff v0.6.4 | grep -A20 @flashinfer_api
 @flashinfer_api
 def trtllm_fp8_per_tensor_scale_moe(
@@ -2257,10 +2323,11 @@ def trtllm_fp8_per_tensor_scale_moe(
     routed_scaling_factor: Optional[float],
     use_routing_scales_on_input: bool,
     routing_method_type: int = 0,
+    do_finalize: bool = True,
     enable_pdl: Optional[bool] = None,
     tune_max_num_tokens: int = 8192,
     activation_type: int = ActivationType.Swiglu.value,
-) -> torch.Tensor:
+) -> Union[List[torch.Tensor], torch.Tensor]:
     """FP8 per tensor scale MoE operation.

     Args:
@@ -2282,22 +2349,21 @@ def trtllm_fp8_per_tensor_scale_moe(
         routed_scaling_factor: Scaling factor for routing
         use_routing_scales_on_input: Whether to use routing scales on input
         routing_method_type: Type of routing method to use (default: 0)
+        do_finalize: Whether to finalize the output (default: True).
         enable_pdl: Whether to enable Programmatic Dependent Launch (PDL). Auto-enabled for >= sm90.
--
 @flashinfer_api
 def trtllm_fp8_block_scale_moe(
@@ -2343,10 +2418,11 @@ def trtllm_fp8_block_scale_moe(
     routing_method_type: int = 0,
     use_shuffled_weight: bool = False,
     weight_layout: int = 0,
+    do_finalize: bool = True,
     enable_pdl: Optional[bool] = None,
     tune_max_num_tokens: int = 8192,
     fp8_quantization_type: Fp8QuantizationType = Fp8QuantizationType.DeepSeekFp8,
-) -> torch.Tensor:
+) -> Union[List[torch.Tensor], torch.Tensor]:
     """FP8 block scale MoE operation.

     Args:
@@ -2374,16 +2450,18 @@ def trtllm_fp8_block_scale_moe(
         weight_layout: Weight layout format (default: WeightLayout.MajorK). Supported layouts:
             - 0: MajorK - K-major layout [Mn, K]
             - 2: BlockMajorK - Blocked along K dimension [K/blockK, Mn, blockK]
+        do_finalize: Whether to finalize the output (default: True).
         enable_pdl: Whether to enable Programmatic Dependent Launch (PDL). Auto-enabled for >= sm90.
--
 @flashinfer_api
 def trtllm_fp8_block_scale_routed_moe(
@@ -2433,11 +2520,12 @@ def trtllm_fp8_block_scale_routed_moe(
     routing_method_type: int = 0,
     use_shuffled_weight: bool = False,
     weight_layout: int = 0,
+    do_finalize: bool = True,
     enable_pdl: Optional[bool] = None,
     output: Optional[torch.Tensor] = None,
     tune_max_num_tokens: int = 8192,
     fp8_quantization_type: Fp8QuantizationType = Fp8QuantizationType.DeepSeekFp8,
-) -> torch.Tensor:
+) -> Union[List[torch.Tensor], torch.Tensor]:
     """FP8 block scale MoE operation with pre-computed routing (packed format).

     This function is used when routing decisions have already been computed
@@ -2468,14 +2556,16 @@ def trtllm_fp8_block_scale_routed_moe(
         use_shuffled_weight: Whether to use shuffled weights
         weight_layout: Weight layout (0 = MajorK, 1 = BlockMajorK)
         enable_pdl: Whether to enable Programmatic Dependent Launch (PDL). Auto-enabled for >= sm90.
+        do_finalize: Whether to finalize the output (default: True).
--
 @flashinfer_api
 def trtllm_fp4_block_scale_moe(
@@ -2589,12 +2688,8 @@ def trtllm_fp4_block_scale_moe(
         do_finalize (bool): Whether to finalize the output (default: False)
         enable_pdl (Optional[bool]): Whether to enable Programmatic Dependent Launch (PDL). Auto-enabled for >= sm90.
         activation_type (int): Type of activation function (default: 3 - Swiglu)
-            - 0: Gelu
-            - 1: Relu
-            - 2: Silu
             - 3: Swiglu
             - 4: Geglu
-            - 5: SwigluBias
             - 6: Relu2
             - 7: Identity
         tune_max_num_tokens(int): Maximum number of tokens for tuning. (default: 8192)
@@ -2726,12 +2821,8 @@ def trtllm_fp4_block_scale_routed_moe(
             - 4: RenormalizeNaive (Softmax -> TopK -> Renormalize)
         do_finalize (bool): Whether to finalize the output (default: False)
         activation_type (int): Type of activation function (default: 3 - Swiglu)
-            - 0: Gelu
-            - 1: Relu
--
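The diff above widens the return type of the three FP8 MoE entry points from torch.Tensor to Union[List[torch.Tensor], torch.Tensor], matching the new do_finalize flag: finalized output comes back as a single tensor, while skipping finalization yields a list of intermediate tensors. Callers upgrading across this change can normalize the result with a small shim; a minimal sketch (the helper name and the "first element is the main output" convention are illustrative assumptions, not documented API):

```python
from typing import Any, List, Union


def unwrap_moe_output(result: Union[List[Any], Any]) -> Any:
    """Normalize the widened MoE return type.

    With do_finalize=True the kernel returns a single tensor; with
    do_finalize=False it returns a list of unfinalized tensors. Treating
    the first list element as the primary output is an assumption made
    for this sketch, not documented behavior.
    """
    if isinstance(result, list):
        return result[0]
    return result


# Duck-typed stand-ins for torch.Tensor keep the sketch dependency-free.
print(unwrap_moe_output("t0"))          # single-tensor (finalized) path
print(unwrap_moe_output(["t0", "t1"]))  # list (unfinalized) path
```

This keeps downstream code working regardless of which do_finalize setting a caller passes.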
+@flashinfer_api
+def tinygemm_bf16(
+    input: torch.Tensor,
+    weight: torch.Tensor,
+    out: torch.Tensor,
+    bias: Optional[torch.Tensor] = None,
+    use_pdl: bool = False,
+) -> None:
+    """SM90+ optimized small GEMM: out = input @ weight.T + bias (equivalent to F.linear).
+
+    A latency-optimized, warp-specialized GEMM designed for tiny batch sizes (ideally
+    1-8 rows, where a single TILE_N=8 tile covers the entire batch dimension) using
+    Ampere-style HMMA instructions. Uses TMA for async bulk data loads and
+    mma.sync.aligned.m16n8k16 tensor core instructions with BF16 input/weight/bias/output
+    and FP32 internal accumulation. The warp-specialized design (384 threads: 4 compute +
+    8 DMA warps) with 16 pipeline stages and 4x stage unroll trades off peak throughput
+    in favor of minimal latency.
+
+    From TensorRT-LLM tinygemm2 kernel.
+
+    Args:
--
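The tinygemm_bf16 docstring pins its contract to out = input @ weight.T + bias, i.e. the same semantics as F.linear. A NumPy reference of that math (a stand-in for the CUDA kernel, illustrating shapes and the FP32-accumulation contract only; it makes no claim about the kernel's BF16 numerics):

```python
import numpy as np


def tinygemm_bf16_reference(inp, weight, bias=None):
    """Reference for out = inp @ weight.T + bias (the F.linear contract).

    Shapes follow torch.nn.functional.linear: inp is [M, K], weight is
    [N, K], bias is [N]. The real kernel targets tiny M (1-8 rows) and
    accumulates in FP32, which the explicit casts below mimic.
    """
    out = inp.astype(np.float32) @ weight.astype(np.float32).T
    if bias is not None:
        out += bias.astype(np.float32)
    return out


x = np.ones((2, 4))        # M=2 rows, within the kernel's 1-8 row sweet spot
w = np.full((3, 4), 2.0)   # N=3 output features, K=4
b = np.ones(3)
print(tinygemm_bf16_reference(x, w, b))  # each entry: 4 * (1 * 2) + 1 = 9
```

Note the weight is passed in [N, K] layout (transposed inside), exactly as with F.linear.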
 @flashinfer_api
@@ -78,6 +99,7 @@ def selective_state_update(
     intermediate_states_buffer: Optional[torch.Tensor] = None,
     intermediate_state_indices: Optional[torch.Tensor] = None,
     cache_steps: int = 0,
+    algorithm: str = "auto",
 ) -> torch.Tensor:
     r"""Selective state update operation for Mamba layers (the generation phase).

@@ -126,6 +148,10 @@ def selective_state_update(
         with shape (batch,)
     cache_steps : int
         Number of steps/tokens to cache for speculative decoding
+    algorithm : str
+        Algorithm to use: "auto" (default, picks the best kernel based on GPU arch,
+        data types, and problem size), "simple" (all GPUs), "vertical" and "horizontal"
+        (SM90+ only). MTP mode only supports "auto" or "simple".

     Returns
     -------
@@ -178,6 +204,30 @@ def selective_state_update(
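The new algorithm keyword on selective_state_update documents three constraints: "simple" runs on all GPUs, "vertical" and "horizontal" need SM90+, and MTP mode only accepts "auto" or "simple". A toy dispatcher sketching those documented rules (the function name is illustrative, and the concrete rule "auto" uses to pick a kernel is a placeholder assumption, since the docstring only says it picks the best kernel for the arch, dtypes, and problem size):

```python
def resolve_ssu_algorithm(algorithm: str, sm_version: int, mtp_mode: bool) -> str:
    """Validate and resolve the `algorithm` choice per the docstring rules.

    "simple" works everywhere; "vertical"/"horizontal" require SM90+;
    MTP mode only supports "auto" or "simple". The "auto" resolution
    below (prefer "vertical" on SM90+) is a placeholder, not the real
    heuristic.
    """
    valid = {"auto", "simple", "vertical", "horizontal"}
    if algorithm not in valid:
        raise ValueError(f"unknown algorithm {algorithm!r}")
    if mtp_mode and algorithm not in ("auto", "simple"):
        raise ValueError("MTP mode only supports 'auto' or 'simple'")
    if algorithm in ("vertical", "horizontal") and sm_version < 90:
        raise ValueError(f"{algorithm!r} requires SM90+")
    if algorithm == "auto":
        return "vertical" if sm_version >= 90 and not mtp_mode else "simple"
    return algorithm


print(resolve_ssu_algorithm("auto", 90, False))  # placeholder rule picks "vertical"
print(resolve_ssu_algorithm("auto", 80, False))  # falls back to "simple"
```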

@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request updates the project's version number to 0.6.5. This change is a standard procedure for marking a new release or development iteration, ensuring that all components reflect the current state of the codebase and align with release management practices.

Highlights

  • Version Update: The project version has been incremented from 0.6.4 to 0.6.5.


Changelog
  • version.txt
    • Updated the version string to 0.6.5.

@coderabbitai
Contributor

coderabbitai bot commented Mar 2, 2026

📝 Walkthrough

Version number incremented in version.txt from 0.6.4 to 0.6.5. This is a metadata-only change affecting the project version string with no modifications to source code or functionality.

Changes

  • Version Bump (version.txt): Incremented version from 0.6.4 to 0.6.5

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Possibly related PRs

Suggested labels

run-ci

Suggested reviewers

  • yongwww
  • yzh119

Poem

🐰 A whisker-twitch bump from point-four to five,
The version marches on, code stays alive!
In version.txt we trust, oh what a delight,
Numbers dance upward, everything's tight! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Description check ⚠️ Warning: The PR description is incomplete. The Description section is empty, and the Related Issues section only contains a link to a filtered PR list without specific issue references. While checklist items are marked complete, the core description is missing. Resolution: add a clear description of what the version bump includes (API changes, features, bug fixes, etc.); the provided diff shows significant API changes that should be documented in the PR description.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: The title 'Version bump to 0.6.5' is concise, clear, and directly summarizes the main change of updating the version from 0.6.4 to 0.6.5.
  • Docstring Coverage ✅ Passed: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.



Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request updates the version number in version.txt from 0.6.4 to 0.6.5. This change is straightforward and aligns with the pull request title and description. No functional code changes were introduced.

@aleozlx
Collaborator Author

aleozlx commented Mar 3, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !367 has been created, and the CI pipeline #45263242 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #45263242: 8/20 passed

@aleozlx
Copy link
Collaborator Author

aleozlx commented Mar 4, 2026

tests clean

@aleozlx aleozlx merged commit cb593c8 into flashinfer-ai:main Mar 4, 2026
32 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Mar 9, 2026
5 tasks