feat(gdn): port pooled decode kernel to f16 backend #2634

Closed

xutizhou wants to merge 5 commits into flashinfer-ai:main from xutizhou:feat/gdn-decode-pooled-f16

Conversation

@xutizhou
Contributor

@xutizhou commented on Feb 25, 2026

📌 Description

This PR ports the pooled decode kernel (originally from #2521) to the new F16 backend introduced in #2498. It enables zero-copy state updates using indirect indexing, eliminating the need for manual gather/scatter operations in SGLang.

Key changes:

  • Ported the feat/gdn-decode-pooled commits to the latest main (post-F16 merge).
  • Merged the decode kernel to support both the new F16 path and the pooled indexing logic.
  • Updated docstrings and shape validation to handle both [B, HV, V, K] (direct) and [pool_size, HV, V, K] (pooled) state layouts.
  • Ensured compatibility with the new gdn_decode_klast_bf16_state backend.
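The zero-copy idea above can be illustrated with a minimal pure-Python model (lookup_state and its names are illustrative, not this PR's API; the real kernel performs the same lookup on GPU tensors inside the decode kernel):

```python
def lookup_state(state, batch_idx, state_indices=None):
    """Return the state slot for one batch element.

    Direct mode: state has one slot per batch element (state[batch_idx]).
    Pooled mode: state is a shared pool and state_indices[batch_idx]
    selects the slot; -1 marks a padding slot with no state.
    """
    if state_indices is None:  # direct [B, ...] layout
        return state[batch_idx]
    pool_idx = state_indices[batch_idx]
    if pool_idx < 0:  # padding slot: the kernel writes zeros, touches no state
        return None
    return state[pool_idx]  # pooled [pool_size, ...] layout


pool = ["slot0", "slot1", "slot2", "slot3"]  # pool_size = 4
indices = [2, 0, -1]  # batch of 3; the last element is padding
```

Because the kernel reads and writes state[pool_idx] in place, no gather into a per-batch buffer (and no scatter back) is needed.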

🔍 Related Issues

#2521 (pooled decode kernel), #2498 (F16 backend)

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • New Features

    • Optional pool-indexed state mode: supply per-batch indices to read/write shared state slots.
    • New hybrid/linear attention backend integrations for improved stateful decode workflows.
  • Improvements

    • Consistent pooled vs per-batch state handling across decode paths.
    • Stronger validation for state shapes, dtypes and indexing to prevent invalid reads/writes.
  • Tools

    • Local and Python benchmarking scripts for pooled vs gather/scatter throughput.
  • Tests

    • New tests validating pooled-state decode correctness and padding behavior.

@gemini-code-assist
Contributor

Summary of Changes

Hello @xutizhou, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates the pooled decode kernel with the recently introduced F16 backend, enabling a more efficient, zero-copy mechanism for state updates. By leveraging indirect indexing, the system can now directly access and modify state within a shared pool, significantly reducing memory overhead and improving overall performance. The changes also refine the API and internal logic to seamlessly support both traditional and pooled state representations.

Highlights

  • F16 Backend Port: The pooled decode kernel has been successfully ported to utilize the new F16 backend, enhancing performance and compatibility with recent optimizations.
  • Zero-Copy State Updates: Implemented zero-copy state updates through indirect indexing, which eliminates the need for manual gather/scatter operations in SGLang and improves efficiency.
  • Unified Kernel Logic: The gated_delta_rule_decode_pretranspose function was merged and updated to support both the new F16 kernel path and the pooled indexing logic within a single implementation.
  • Enhanced State Layout Handling: Docstrings and shape validation have been updated to correctly handle both direct [B, HV, V, K] and pooled [pool_size, HV, V, K] state layouts, ensuring flexibility and robustness.
  • Performance Improvement: Benchmarking shows consistent performance improvements of up to +1.9% speedup due to the reduction in memory copy overhead with the new pooled indexing approach.


Changelog
  • flashinfer/gdn_decode.py
    • Added a use_pool_indexing boolean parameter to gdn_decode_kernel_small_batch_pretranspose and gdn_decode_kernel_big_batch_pretranspose functions.
    • Modified kernel logic to conditionally use h0_indices for state lookup and handle padding slots when use_pool_indexing is True.
    • Updated run_gdn_decode_kernel_small_batch_pretranspose and run_gdn_decode_kernel_big_batch_pretranspose to accept and pass the new use_pool_indexing parameter, and adjusted grid batch size calculation.
    • Modified the _get_compiled_decode_kernel cache key to include pool_size and use_pool_indexing for proper kernel caching.
    • Updated the gated_delta_rule_decode_pretranspose function signature to include an optional state_indices tensor.
    • Expanded the docstring for gated_delta_rule_decode_pretranspose to explain the new pooled indexing mode and its benefits.
    • Adjusted state shape validation in gated_delta_rule_decode_pretranspose to support both [B, HV, V, K] and [pool_size, HV, V, K] layouts, and added specific validation for state_indices.
    • Modified h0_source reshaping logic and h0_indices assignment within gated_delta_rule_decode_pretranspose to adapt to pooled indexing.
    • Updated the conditional logic for copying state back to handle non-contiguous states only when not in pooled indexing mode.
    • Relocated the _get_compiled_decode_kernel_nontranspose function definition.
Activity
  • The pull request was created by xutizhou.
  • A detailed description was provided, outlining the purpose, key changes, related issues, and a checklist for pre-commit and tests.
  • Performance benchmarks were included, demonstrating speedups of up to 1.9% for pooled indexing compared to manual gather/scatter.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai
Contributor

coderabbitai bot commented Feb 25, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds optional pooled state indexing to gated-delta-rule decode: public APIs accept state_indices, runtime validates pooled vs per-batch layouts, and kernel/launcher signatures propagate initial_state_indices and use_pool_indexing to enable indirect per-batch state lookup from a state pool.

Changes

Public API / Dispatcher — flashinfer/gdn_decode.py
  Added optional state_indices: Optional[torch.Tensor] to decode entrypoints; validate shapes/dtypes/contiguity; derive use_pool_indexing and forward initial_state_indices to backend kernels.

Kernels & Launchers — flashinfer/gdn_kernels/gdn_decode_bf16_state.py
  Extended kernel/launcher signatures with gState_indices/mState_indices and use_pool_indexing; implemented indirect addressing (compute pool_idx/state_batch_idx), conditional SMEM use, padding/-1 handling, and propagated indexing across seqlen1 and seqlen234/unified paths.

Kernel Cache & Selection — flashinfer/gdn_kernels/...
  Updated cache keys and selection to include use_pool_indexing so compiled kernels distinguish direct vs pool-indexed modes.

State Layout & Validation — flashinfer/gdn_decode.py, flashinfer/gdn_kernels/...
  Support per-batch [B, HV, V, K] and pooled [pool, HV, V, K] layouts; validate and adapt indexing, computed pool_size, and flatten/reshape semantics for pooled mapping.

Benchmarks & Scripts — bench_gdn_kernel.py, run_bench_local.sh
  Added benchmark and local run orchestration comparing pool-indexed (zero-copy) vs baseline gather/scatter; includes warmups and kernel-cache management.

Tests — tests/gdn/test_gdn_decode_pooled.py
  New tests validating pooled BF16-state decode (including -1 padding) against per-sample references; skip on unsupported architectures.

sglang shadow tree — sglang_shadow/python/sglang/...
  Large collection of new placeholder modules and a substantial hybrid_linear_attn_backend.py added to the shadow tree (many files are placeholders; the hybrid backend contains extensive new logic).

Sequence Diagram(s)

sequenceDiagram
    participant User as User Code
    participant Dispatch as Dispatcher
    participant Kernel as CUDA Kernel
    participant Pool as State Pool

    rect rgba(120,180,240,0.5)
        Note over User,Dispatch: Pool-indexed flow
        User->>Dispatch: call gated_delta_rule_decode(..., state=pool, state_indices)
        Dispatch->>Dispatch: validate `state_indices` (contiguous int32 [B])\nset use_pool_indexing = true
        Dispatch->>Kernel: launch kernel with use_pool_indexing=true\npass state (pool), state_indices
        Kernel->>Pool: pool_idx = gState_indices[batch_idx]
        alt pool_idx >= 0
            Kernel->>Pool: load state[pool_idx,...]
            Kernel->>Kernel: compute outputs
            Kernel->>Pool: write back state[pool_idx,...]
        else pool_idx < 0
            Kernel->>Kernel: treat as padding / produce zeros
        end
    end

    rect rgba(220,160,100,0.5)
        Note over User,Dispatch: Direct per-batch flow
        User->>Dispatch: call gated_delta_rule_decode(..., state=per-batch, state_indices=None)
        Dispatch->>Dispatch: set use_pool_indexing = false
        Dispatch->>Kernel: launch kernel with use_pool_indexing=false\npass per-batch state
        Kernel->>Kernel: load state[batch_idx,...], compute, write outputs
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

run-ci

Suggested reviewers

  • yzh119
  • cyx-6
  • bkryu
  • nvmbreughe
  • jimmyzho

Poem

🐰 I hop through pools of indexed states,
Kernels fetch my tiny gates and weights,
Some slots are empty, some glow soft and bright,
I nibble batch by batch, compute, then write,
Homebound with outputs snug and light.

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 31.58%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check — ✅ Passed. The title "feat(gdn): port pooled decode kernel to f16 backend" clearly and concisely describes the main change: porting a pooled decode kernel feature to the F16 backend.
  • Description check — ✅ Passed. The PR description is comprehensive: it explains what the PR does, references related issues (#2521, #2498), details key changes, and confirms that pre-commit checks ran and tests were added and pass.



@gemini-code-assist bot left a comment (Contributor)

Code Review

This pull request successfully ports the pooled decode kernel to the F16 backend, introducing zero-copy state updates via indirect indexing. The changes are well-implemented, propagating the use_pool_indexing logic through the CUDA kernels and their corresponding Python wrappers. The documentation and input validation have been updated to reflect these new capabilities. I have one suggestion to refactor a small piece of duplicated code within one of the CUDA kernels to improve maintainability.

Comment on lines +244 to +435

if pool_idx >= 0:
    # Get current batch
    gSrc_batch = h0_source[(state_idx, None, None)]  # (V, K)
    gDst = cute.local_tile(h0_source, (1, TILE_V, TILE_K), (state_idx, None, 0))

    # Partition for load
    thr_copy_load = tiled_copy_load.get_slice(tidx)
    # Split into tiles along the V direction
    gSrc = cute.local_tile(
        gSrc_batch, (TILE_V, TILE_K), (None, 0)
    )  # (TILE_V, TILE_K, num_v_tiles)

    # ===================================================================
    # Prefetch: All threads participate in cp.async load
    # ===================================================================
    start_v_tiles = batch_inner * num_v_tiles_per_block
    prefetch_count = cutlass.min(NUM_STAGES - 1, num_v_tiles_per_block)
    for v_tiles in range(start_v_tiles, start_v_tiles + prefetch_count):
        stage = (v_tiles - start_v_tiles) % NUM_STAGES

        gSrc_tile = gSrc[(None, None, v_tiles)]
        sData_stage = sData[(None, None, stage)]

        thr_gSrc = thr_copy_load.partition_S(gSrc_tile)
        thr_sData = thr_copy_load.partition_D(sData_stage)

        cute.copy(tiled_copy_load, thr_gSrc, thr_sData)
        cute.arch.cp_async_commit_group()

    # Load q, k into BF16 registers using autovec_copy (contiguous pattern)
    q_tile = cute.local_tile(q, (1, 1, 1, vec_size), (i_n, i_t, i_h, lane_id))
    k_tile = cute.local_tile(k, (1, 1, 1, vec_size), (i_n, i_t, i_h, lane_id))
    cute.autovec_copy(q_tile, r_q_bf16)
    cute.autovec_copy(k_tile, r_k_bf16)

    # Convert BF16 to FP32
    for i in cutlass.range_constexpr(vec_size):
        r_q[i] = cutlass.Float32(r_q_bf16[i])
        r_k[i] = cutlass.Float32(r_k_bf16[i])

    # Load v into BF16 registers using autovec_copy, convert to FP32, store to sV
    v_tile = cute.local_tile(v, (1, 1, 1, vec_size), (i_n, i_t, i_hv, lane_id))
    cute.autovec_copy(v_tile, r_v_bf16)
    for i in cutlass.range_constexpr(vec_size):
        sV[k_start + i] = cutlass.Float32(r_v_bf16[i])

    cute.arch.barrier()  # Ensure all threads finish writing to sV

    # ===================================================================
    # Compute g and beta (scalar values)
    # ===================================================================
    r_g = 0.0
    r_beta = 0.0
    if lane_id == 0:
        x = r_a + r_dt_bias
        beta_x = softplus_beta * x
        softplus_x = 0.0

        if beta_x <= softplus_threshold:
            # softplus(x) = (1/beta) * log(1 + exp(beta*x)), computed in Float32
            exp_beta_x = cute.exp(beta_x, fastmath=True)
            log_input = cutlass.Float32(1.0 + exp_beta_x)
            log_result = cutlass.Float32(cute.log(log_input, fastmath=True))
            softplus_x = cutlass.Float32(
                (cutlass.Float32(1.0) / softplus_beta) * log_result
            )
        else:
            softplus_x = x

        # Compute g = exp(A_log) * softplus_x
        r_g_value = -cute.exp(r_A_log, fastmath=True) * softplus_x

        # Compute beta = 1 / (1 + exp(-b))
        r_beta = 1.0 / (1.0 + cute.exp(-r_b, fastmath=True))

        # Store to scalar (Float32)
        r_g = cute.exp(r_g_value, fastmath=True)

    r_g = cute.arch.shuffle_sync(r_g, 0)
    r_beta = cute.arch.shuffle_sync(r_beta, 0)

    if use_qk_l2norm:
        # Compute L2 norm of q and k
        sum_q = 0.0
        sum_k = 0.0
        for i in cutlass.range_constexpr(vec_size):
            sum_q += r_q[i] * r_q[i]
            sum_k += r_k[i] * r_k[i]
        # Warp-level reduction using butterfly shuffle
        for offset in [16, 8, 4, 2, 1]:
            sum_q += cute.arch.shuffle_sync_bfly(
                sum_q, offset=offset, mask=-1, mask_and_clamp=31
            )
            sum_k += cute.arch.shuffle_sync_bfly(
                sum_k, offset=offset, mask=-1, mask_and_clamp=31
            )

        inv_norm_q = cute.rsqrt(sum_q + 1e-6, fastmath=True)
        inv_norm_k = cute.rsqrt(sum_k + 1e-6, fastmath=True)
        for i in cutlass.range_constexpr(vec_size):
            r_q[i] = r_q[i] * inv_norm_q
            r_k[i] = r_k[i] * inv_norm_k

    # Apply scaling in Float32
    for i in cutlass.range_constexpr(vec_size):
        r_q[i] = r_q[i] * scale

    # ===================================================================
    # Mainloop: All threads participate
    # ===================================================================
    end_v_tiles = start_v_tiles + num_v_tiles_per_block
    for v_tiles in range(start_v_tiles, end_v_tiles):
        stage = (v_tiles - start_v_tiles) % NUM_STAGES

        # Step 1: Wait for current stage to complete
        cute.arch.cp_async_wait_group(0)
        cute.arch.barrier()

        # Step 2: Issue async load for next tile (after compute)
        next_v_tiles = v_tiles + prefetch_count
        if next_v_tiles < end_v_tiles:
            next_stage = (next_v_tiles - start_v_tiles) % NUM_STAGES

            gSrc_next = gSrc[(None, None, next_v_tiles)]
            sData_next = sData[(None, None, next_stage)]

            thr_gSrc = thr_copy_load.partition_S(gSrc_next)
            thr_sData = thr_copy_load.partition_D(sData_next)

            cute.copy(tiled_copy_load, thr_gSrc, thr_sData)
            cute.arch.cp_async_commit_group()

        # Step 3: Compute using data from current stage (contiguous access pattern)
        for row in cutlass.range_constexpr(0, TILE_V, 4):
            row_offset = tidx // 32
            sum_hk = 0.0

            # Load h from sData using 3D local_tile + autovec_copy (contiguous in K)
            sData_tile = cute.local_tile(
                sData, (1, vec_size, 1), (row + row_offset, lane_id, stage)
            )
            cute.autovec_copy(sData_tile, r_h)

            for i in cutlass.range_constexpr(vec_size):
                r_h[i] = r_h[i] * r_g
                sum_hk += r_h[i] * r_k[i]

            for offset in [16, 8, 4, 2, 1]:
                sum_hk += cute.arch.shuffle_sync_bfly(
                    sum_hk, offset=offset, mask=-1, mask_and_clamp=31
                )

            v_new = sV[v_tiles * TILE_V + row + row_offset] - sum_hk
            v_new = v_new * r_beta

            sum_hq = 0.0
            for i in cutlass.range_constexpr(vec_size):
                r_h[i] += r_k[i] * v_new
                sum_hq += r_h[i] * r_q[i]

            # Write h to gDst using 4D local_tile + autovec_copy (contiguous in K)
            gDst_tile = cute.local_tile(
                gDst, (1, 1, vec_size, 1), (0, row + row_offset, lane_id, v_tiles)
            )
            cute.autovec_copy(r_h, gDst_tile)

            for offset in [16, 8, 4, 2, 1]:
                sum_hq += cute.arch.shuffle_sync_bfly(
                    sum_hq, offset=offset, mask=-1, mask_and_clamp=31
                )

            o_idx = v_tiles * TILE_V + row + row_offset
            if lane_id == 0 and o_idx < V:
                sOutput[o_idx] = cutlass.BFloat16(sum_hq)

    # ===================================================================
    # Final writeback: Copy output from shared memory to global memory
    # All threads write (V=128, NUM_THREADS=128)
    # ===================================================================
    cute.arch.barrier()  # Ensure all writes to sOutput are complete
    if tidx >= start_v_tiles * TILE_V and tidx < end_v_tiles * TILE_V:
        o[(i_n, i_t, i_hv, tidx)] = sOutput[tidx]
else:
    # Padding slot: write zeros to output
    start_v_tiles = batch_inner * num_v_tiles_per_block
    if (
        tidx >= start_v_tiles * TILE_V
        and tidx < (start_v_tiles + num_v_tiles_per_block) * TILE_V
    ):
        o[(i_n, i_t, i_hv, tidx)] = cutlass.BFloat16(0.0)
Severity: medium

The calculation start_v_tiles = batch_inner * num_v_tiles_per_block is duplicated at line 260 (inside the if block) and line 430 (inside the else block). This calculation can be hoisted to before the if pool_idx >= 0: check to avoid redundancy and improve maintainability.
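The suggested refactor can be sketched in plain Python (writeback_before/writeback_after are illustrative stand-ins for the two kernel branches, not the actual kernel code):

```python
def writeback_before(pool_idx, batch_inner, num_v_tiles_per_block):
    # Before: start_v_tiles is computed separately inside each branch.
    if pool_idx >= 0:
        start_v_tiles = batch_inner * num_v_tiles_per_block
        return ("compute", start_v_tiles)
    else:
        start_v_tiles = batch_inner * num_v_tiles_per_block
        return ("pad", start_v_tiles)


def writeback_after(pool_idx, batch_inner, num_v_tiles_per_block):
    # After: the shared computation is hoisted above the branch.
    start_v_tiles = batch_inner * num_v_tiles_per_block
    if pool_idx >= 0:
        return ("compute", start_v_tiles)
    return ("pad", start_v_tiles)
```

Both versions behave identically; the hoisted form simply removes the duplicated expression.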

@coderabbitai bot left a comment (Contributor)

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
flashinfer/gdn_decode.py (1)

1077-1096: ⚠️ Potential issue | 🔴 Critical

Add pooled indexing support to bf16 fast path or guard against it.

When use_pool_indexing=True (i.e., state_indices is provided), the bf16 fast path at lines 1074–1100 can still be selected, but the backend ignores pooled indices entirely. The underlying gated_delta_rule function in flashinfer/gdn_kernels/gdn_decode_bf16_state.py explicitly documents initial_state_indices as "Not used (for compatibility)", meaning the kernel will silently default to direct batch mapping rather than respecting the provided pooled indices.

This causes incorrect state reads/writes for non-identity pooled mappings. Either:

  1. Add and not use_pool_indexing to the dispatch condition at line 1074 to prevent bf16 path when pooled mode is active, or
  2. Implement pooled indexing support in the bf16 backend.
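Option 1 can be sketched as a plain predicate (select_backend and its arguments are illustrative; the real dispatch condition lives in flashinfer/gdn_decode.py):

```python
def select_backend(supports_bf16_fast_path, state_indices):
    # The bf16 fast path ignores state_indices entirely, so it must only
    # be chosen for direct (non-pooled) batch mappings.
    use_pool_indexing = state_indices is not None
    if supports_bf16_fast_path and not use_pool_indexing:
        return "gdn_decode_klast_bf16_state"  # fast path, direct mapping only
    return "pretranspose"  # generic path handles pooled indexing
```

With this guard, supplying state_indices always routes to the path that respects pooled indices, preventing silent state corruption.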
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/gdn_decode.py` around lines 1077 - 1096, The bf16 fast path
(_gated_delta_rule_gdn_decode_klast_bf16_state) ignores pooled indexing, so
either prevent that path when pooling is used or implement pooled support; the
minimal fix is to change the dispatch condition that uses
use_gdn_decode_klast_bf16_state to also require not use_pool_indexing (or
equivalently check state_indices is None/identity) so the bf16 kernel is not
chosen when pooled indices (use_pool_indexing/state_indices) are provided,
ensuring the scalar bf16 kernel is only used for direct batch mappings.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@flashinfer/gdn_decode.py`:
- Around line 1056-1067: When use_pool_indexing is true, state_indices currently
only has shape/dtype checks; add a runtime bounds check to ensure all entries in
state_indices are within [0, pool_size) (and optionally non-negative) before
kernel launch to prevent OOB in h0_source address calculations; locate the
pool-mode validation block around use_pool_indexing/state.is_contiguous() and
add a check using state_indices (and B/pool_size) to raise a clear error if any
value >= pool_size (or < 0).
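The suggested bounds check can be sketched in pure Python over a plain list (in the real code this would be a tensor-level check, e.g. comparisons on the int32 state_indices tensor, run before kernel launch):

```python
def check_state_indices(indices, pool_size):
    # -1 padding entries stay legal; anything >= pool_size would cause an
    # out-of-bounds read/write in the h0_source address calculation.
    bad = [i for i in indices if i >= pool_size]
    if bad:
        raise ValueError(
            f"state_indices contains {bad}, out of range for pool_size={pool_size}"
        )
```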


ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 41545e5 and 63c5650.

📒 Files selected for processing (1)
  • flashinfer/gdn_decode.py

@xutizhou force-pushed the feat/gdn-decode-pooled-f16 branch from 63c5650 to 834f1e6 on February 25, 2026 at 04:20
@coderabbitai bot left a comment (Contributor)

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
flashinfer/gdn_kernels/gdn_decode_bf16_state.py (1)

2111-2150: ⚠️ Potential issue | 🔴 Critical

Fix argument order in cute.compile() and execute call: stream must come before si_ and use_pool_indexing_c.

The launch wrapper functions (e.g., gated_delta_rule_launch_seqlen1, gated_delta_rule_launch_seqlen2) declare parameters in this order:

  • eps: cutlass.Float32
  • stream: cuda.CUstream
  • mState_indices: cute.Tensor
  • use_pool_indexing: cutlass.Constexpr[bool]

However, both the cute.compile() call (line 2111) and execute call (line 2133) pass arguments as:

  • eps_f32
  • si_ (mState_indices)
  • use_pool_indexing_c
  • stream

With positional argument mapping, this causes si_ to bind to the stream parameter (type mismatch: cute.Tensor vs cuda.CUstream), use_pool_indexing_c to bind to mState_indices (type mismatch: bool vs cute.Tensor), and stream to bind to use_pool_indexing (type mismatch: cuda.CUstream vs bool). This is a critical correctness bug.

Fix: reorder arguments to match function signature
 _compiled_kernels[cache_key] = cute.compile(
     launch_fn,
     q_, k_, v_, a_, b_, A_log_, dt_bias_, h_, o_,
     scale_f32, softplus_beta_f32, softplus_threshold_f32, eps_f32,
+    stream,
     si_,
     use_pool_indexing_c,
-    stream,
     options="--enable-tvm-ffi --generate-line-info",
 )

 # Execute
 _compiled_kernels[cache_key](
     q_, k_, v_, a_, b_, A_log_, dt_bias_, h_, o_,
     scale_f32, softplus_beta_f32, softplus_threshold_f32, eps_f32,
+    stream,
     si_,
     use_pool_indexing_c,
-    stream,
 )
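The misbinding described above is ordinary Python positional-argument semantics; a standalone sketch below illustrates it (the parameter names mirror the launch wrapper from the review, but the values are plain stand-ins, not real CUDA objects):

```python
# Signature order of the launch wrappers, per the review:
# (..., eps, stream, mState_indices, use_pool_indexing)
def launch_wrapper(eps, stream, state_indices, use_pool_indexing):
    # Return the bindings so they can be inspected.
    return {"eps": eps, "stream": stream,
            "state_indices": state_indices,
            "use_pool_indexing": use_pool_indexing}

eps, stream, si, flag = 1e-6, "cuda_stream", [0, 1], True

buggy = launch_wrapper(eps, si, flag, stream)  # order used at the call sites
fixed = launch_wrapper(eps, stream, si, flag)  # order the signature expects

assert buggy["stream"] == si                 # index tensor lands in the stream slot
assert buggy["use_pool_indexing"] == stream  # stream lands in the constexpr slot
assert fixed["state_indices"] == si          # correct binding after the fix
```

In the real kernel the mismatch would surface as type errors or undefined behavior at compile/launch time rather than clean assertions, which is why the reorder is flagged as critical.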
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/gdn_kernels/gdn_decode_bf16_state.py` around lines 2111 - 2150,
The compile and execute calls pass arguments in the wrong order causing
positional binding errors; reorder the final three args so that stream comes
before si_ and use_pool_indexing_c in both the cute.compile(...) invocation and
the subsequent _compiled_kernels[cache_key](...) call to match the launch
wrapper signatures (e.g., gated_delta_rule_launch_seqlen1,
gated_delta_rule_launch_seqlen2) which expect (eps, stream, mState_indices,
use_pool_indexing); update the argument lists for cute.compile and the runtime
call accordingly so stream is placed before si_ and use_pool_indexing_c.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@flashinfer/gdn_kernels/gdn_decode_bf16_state.py`:
- Around line 2065-2075: When use_pool_indexing is true, add a defensive
contiguity assertion for the pooled source tensor: assert
initial_state_source.is_contiguous(), with a clear message (e.g. "Expected
contiguous initial_state_source for pooled indexing"), placed alongside the
existing shape and dtype checks that reference initial_state_indices; this
ensures the kernel's async H 128-bit tiled copies are safe when computing
state_indices_t and using initial_state_source in the pooling path.


ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 63c5650 and 834f1e6.

📒 Files selected for processing (2)
  • flashinfer/gdn_decode.py
  • flashinfer/gdn_kernels/gdn_decode_bf16_state.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • flashinfer/gdn_decode.py

@xutizhou xutizhou force-pushed the feat/gdn-decode-pooled-f16 branch from 834f1e6 to d207428 on February 25, 2026 at 05:41
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
flashinfer/gdn_decode.py (1)

1832-1875: ⚠️ Potential issue | 🟡 Minor

state_indices parameter is absent from the docstring of gated_delta_rule_decode.

The new state_indices parameter added at line 1830 has no documentation entry in the Args section. This creates a gap between the pretranspose and nontranspose API docs.

📝 Suggested addition to Args section
         use_qk_l2norm (bool):
             Whether to apply L2 normalization to q and k. Default: ``True``.
+        state_indices (Optional[torch.Tensor]):
+            Optional int32 tensor of shape ``[B]`` mapping each batch entry to its
+            slot in the state pool. Negative values are treated as padding and the
+            corresponding output is zeroed. If ``None``, direct per-batch indexing
+            is used and ``state.shape[0]`` must equal ``B``.

     Returns:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/gdn_decode.py` around lines 1832 - 1875, The docstring for
gated_delta_rule_decode is missing documentation for the new state_indices
parameter; update the Args section to describe state_indices (type torch.Tensor
or Optional[torch.Tensor]), its shape (e.g. [B] or [B, HV] depending on usage),
dtype expectations, purpose (indices mapping into the state for batched or
streaming decode), whether it can be None and how that behavior differs, and any
in-place/update semantics; reference the gated_delta_rule_decode function and
the state_indices parameter so the API docs for both pretranspose and
nontranspose variants remain consistent.
flashinfer/gdn_kernels/gdn_decode_bf16_state.py (1)

1980-1994: 🛠️ Refactor suggestion | 🟠 Major

gated_delta_rule is missing the @flashinfer_api decorator.

All public decode API functions are expected to carry @flashinfer_api for API logging and crash-safe input capture. This function is the only public entry point in this file, yet it lacks the decorator.

🔧 Proposed fix
+@flashinfer_api
 def gated_delta_rule(
     A_log: torch.Tensor,

You will also need to import flashinfer_api (or provide a fallback) at the top of this file, as is done in gdn_decode.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/gdn_kernels/gdn_decode_bf16_state.py` around lines 1980 - 1994,
The public function gated_delta_rule is missing the `@flashinfer_api` decorator;
add `@flashinfer_api` immediately above def gated_delta_rule to enable API logging
and crash-safe input capture, and ensure flashinfer_api is imported (or a
fallback defined) at the top of this module using the same import/fallback
pattern used in gdn_decode.py so the decorator is available at runtime.
♻️ Duplicate comments (2)
flashinfer/gdn_kernels/gdn_decode_bf16_state.py (1)

2059-2069: ⚠️ Potential issue | 🟠 Major

Missing contiguity assertion on initial_state_source in pool-indexing path.

The async H loads use 128-bit tiled copies that assume a contiguous layout. A non-contiguous initial_state_source silently corrupts the pointer arithmetic used in gH[(state_batch_idx, value_head_idx, None, None)].

🛡️ Proposed fix
     use_pool_indexing = initial_state_indices is not None
     if use_pool_indexing:
         assert initial_state_indices.shape == (B,), (
             f"Expected shape [{B}], got {initial_state_indices.shape}"
         )
         assert initial_state_indices.dtype == torch.int32, (
             f"Expected int32, got {initial_state_indices.dtype}"
         )
+        assert initial_state_source.is_contiguous(), (
+            "initial_state_source must be contiguous for correct pointer arithmetic in pooled mode"
+        )
         state_indices_t = initial_state_indices
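The underlying failure mode can be illustrated without CUDA: flat (row-major) offset arithmetic only agrees with stride-aware indexing when the tensor is contiguous. A pure-Python stand-in (not the kernel code) showing the divergence on a transposed view:

```python
# Row-major storage for a contiguous (2, 3) tensor.
storage = [0, 1, 2, 3, 4, 5]
shape, strides = (2, 3), (3, 1)      # contiguous layout
t_shape, t_strides = (3, 2), (1, 3)  # transposed view: same storage, swapped strides

def at(storage, strides, idx):
    """Stride-aware indexing: always correct."""
    return storage[sum(i * s for i, s in zip(idx, strides))]

def flat_at(storage, shape, idx):
    """Flat offset arithmetic that ASSUMES row-major contiguity."""
    return storage[idx[0] * shape[1] + idx[1]]

# On the contiguous tensor the two agree...
assert at(storage, strides, (1, 2)) == flat_at(storage, shape, (1, 2)) == 5
# ...on the non-contiguous view they do not: element (1, 0) of the
# transpose lives at storage[1], but the flat formula reads storage[2].
assert at(storage, t_strides, (1, 0)) == 1
assert flat_at(storage, t_shape, (1, 0)) == 2
```

The kernel's 128-bit tiled copies behave like `flat_at`, which is why asserting `is_contiguous()` (or calling `.contiguous()`) before the pooled path is the safe option.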
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/gdn_kernels/gdn_decode_bf16_state.py` around lines 2059 - 2069,
When use_pool_indexing is true the code doesn't verify that initial_state_source
is contiguous which breaks the 128-bit tiled async H loads (used when indexing
gH[(state_batch_idx, value_head_idx, None, None)]); update the pool-indexing
branch (the path that sets state_indices_t from initial_state_indices) to either
assert initial_state_source.is_contiguous() or replace it with a contiguous copy
(e.g., initial_state_source = initial_state_source.contiguous()) before any
indexing into gH so pointer arithmetic remains valid; reference
initial_state_source, use_pool_indexing, state_indices_t and the
gH[(state_batch_idx, value_head_idx, None, None)] access to locate the change.
flashinfer/gdn_decode.py (1)

1881-1891: ⚠️ Potential issue | 🟠 Major

Missing bounds check: non-negative state_indices values not validated against pool_size.

Values ≥ pool_size in state_indices flow directly into flat_idx = pool_idx * HV + i_hv inside the kernel, causing out-of-bounds reads/writes on h0_source. A guard analogous to the one needed in gated_delta_rule_decode_pretranspose is required here.
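A minimal sketch of such a guard, in pure Python (hypothetical helper names; the real check would operate on the torch tensor, e.g. via a vectorized comparison, rather than a Python loop):

```python
def validate_state_indices(state_indices, pool_size, batch_size):
    """Reject out-of-range pool slots before they reach the kernel.

    Hypothetical stand-in for the guard the review asks for: negative
    entries are padding, everything else must fall in [0, pool_size).
    """
    assert len(state_indices) == batch_size, (
        f"expected [{batch_size}] indices, got [{len(state_indices)}]"
    )
    bad = [i for i in state_indices if i >= pool_size]
    assert not bad, (
        f"state_indices values {bad} exceed pool_size={pool_size}; "
        "flat_idx = pool_idx * HV + i_hv would index h0_source out of bounds"
    )
    return True

validate_state_indices([0, 3, -1], pool_size=4, batch_size=3)  # passes
```

Note the check runs once per call on the host, so the cost is negligible next to the kernel launch it protects.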

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/gdn_decode.py` around lines 1881 - 1891, The code currently lacks
validation that entries in state_indices are within [0, pool_size-1], which
allows values >= pool_size to drive flat_idx = pool_idx * HV + i_hv and cause
OOB access on h0_source; add a bounds check similar to
gated_delta_rule_decode_pretranspose: when state_indices is not None, verify all
values are integers >= 0 and < pool_size (and fail with an informative assertion
or raised error referencing pool_size and B), or alternatively explicitly
clamp/validate before computing flat_idx; ensure you reference state_indices,
pool_size, flat_idx and h0_source when adding the check so the guard prevents
any out-of-bounds indexing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@flashinfer/gdn_decode.py`:
- Around line 1881-1891: gated_delta_rule_decode is missing the pool-indexing
validations present in gated_delta_rule_decode_pretranspose; add a guard when
use_pool_indexing (i.e., state_indices is not None) that asserts
state.is_contiguous(), state_indices.shape == (B,), state_indices.dtype ==
torch.int32, and state_indices.is_contiguous(), mirroring the checks in
gated_delta_rule_decode_pretranspose so malformed/non-contiguous or wrong-dtype
state_indices do not reach the kernel.

In `@flashinfer/gdn_kernels/gdn_decode_bf16_state.py`:
- Around line 2059-2069: initial_state_indices values are never bounds-checked
against the pool size (initial_state_source.shape[0]) which can cause
out-of-bounds accesses when used as indices (e.g., in gH[...] indexing); update
the use_pool_indexing branch that sets state_indices_t to validate every entry
is >= 0 and < pool_size (and still dtype torch.int32 and on q.device), raising a
clear error or clamping/adjusting as appropriate; reference the symbols
initial_state_indices, initial_state_source.shape[0], state_indices_t,
use_pool_indexing and ensure the check happens before assigning state_indices_t
(or convert and move to device first) so downstream kernels cannot read/write
out-of-range indices.


ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 834f1e6 and d207428.

📒 Files selected for processing (2)
  • flashinfer/gdn_decode.py
  • flashinfer/gdn_kernels/gdn_decode_bf16_state.py

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 28

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
flashinfer/gdn_decode.py (1)

1076-1107: ⚠️ Potential issue | 🟠 Major

Legacy pretranspose path (float32 state) doesn't guard against pool indexing — will crash on reshape.

When state_indices is provided with float32 state (i.e. use_gdn_decode_klast_bf16_state is false), execution falls through to the legacy path. Line 1107 does state.reshape(B * HV, V, K), but state.shape[0] is pool_size (which differs from B in pooled mode), causing a reshape failure. The cached h0_indices at line 1115 is always zeros, so even if the reshape didn't crash, the kernel wouldn't use the real state_indices.

Either add an assertion rejecting pool indexing in the legacy path or implement proper support.

Minimal guard (recommended)
     # Legacy path: T=1 only, float32 state
     assert T == 1, f"Decode only supports T=1, got T={T}"
     assert state.dtype == torch.float32, f"state must be float32, got {state.dtype}"
+    assert not use_pool_indexing, (
+        "Pool indexing (state_indices) is not supported with float32 state in the "
+        "pretranspose legacy path. Use bfloat16 state for pool indexing support."
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/gdn_decode.py` around lines 1076 - 1107, The legacy pretranspose
path performs state.reshape(B * HV, V, K) and always uses a zeroed h0_indices,
which will crash or mis-index when state_indices (pooled mode) is provided;
update the logic in the branch guarded by use_gdn_decode_klast_bf16_state ==
False to either (a) assert that state_indices is None (reject pooled indexing)
by checking state_indices is None before the reshape and raising a clear
assertion mentioning state_indices and the legacy path, or (b) implement proper
pooled support by using state_indices to build h0_indices and reshape/reindex
state into h0_source correctly (use state.shape[0] as pool_size, gather the
per-batch entries into [B*HV, V, K] using state_indices, and replace the zeroed
h0_indices usage), ensuring h0_source and h0_indices are consistent for the
kernel call.
♻️ Duplicate comments (20)
sglang_shadow/python/sglang/bench_offline_throughput.py (1)

1-1: Same broken symlink / raw path issue as the other sglang_shadow/ files.

See the consolidated comment on vision_utils.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/bench_offline_throughput.py` at line 1, This file
uses the same broken absolute symlink/raw-path pattern found in other
sglang_shadow modules; replicate the fix applied in vision_utils.py by replacing
hard-coded filesystem paths with package-relative resource loading (e.g.,
importlib.resources or pkgutil) or by resolving symlinks programmatically, and
update any path construction/usage in bench_offline_throughput.py to call the
same helper or utility function used in vision_utils.py so the file no longer
relies on raw absolute paths or broken symlinks.
sglang_shadow/python/sglang/srt/layers/attention/torch_flex_backend.py (1)

1-1: Same broken symlink / raw path issue as the other sglang_shadow/ files.

See the consolidated comment on vision_utils.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/layers/attention/torch_flex_backend.py` at
line 1, The file torch_flex_backend.py in the sglang_shadow package contains a
hardcoded/incorrect raw filesystem path or broken symlink reference (same issue
as other sglang_shadow files); update this module to use package-relative
resource resolution instead of the absolute/raw path—locate and replace any
occurrences of the raw path or symlink usage in torch_flex_backend.py with a
stable approach (e.g., importlib.resources / pathlib with package-relative
__file__ resolution or a proper package data lookup) so the module loads
resources consistently across environments; ensure the fix mirrors the solution
applied in vision_utils.py and adjust any import or resource-loading calls in
torch_flex_backend.py accordingly.
sglang_shadow/python/sglang/srt/layers/attention/cutlass_mla_backend.py (1)

1-1: Same broken symlink / raw path issue as the other sglang_shadow/ files.

See the consolidated comment on vision_utils.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/layers/attention/cutlass_mla_backend.py` at
line 1, The file sglang.srt.layers.attention.cutlass_mla_backend contains the
same broken symlink/raw filesystem path as other sglang_shadow modules; fix it
by replacing the hard-coded path usage with a proper package/module import or a
valid repository-relative symlink so the module can be imported normally (e.g.,
ensure cutlass_mla_backend is accessible via the sglang package namespace);
update any import statements in cutlass_mla_backend.py to use package-relative
imports (import sglang.srt.layers.attention...) or remove the raw path and add
the file into the package so it resolves at runtime.
sglang_shadow/python/sglang/srt/layers/attention/tbo_backend.py (1)

1-1: Same broken symlink / raw path issue as the other sglang_shadow/ files.

See the consolidated comment on vision_utils.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/layers/attention/tbo_backend.py` at line 1,
This module still contains the same broken absolute/symlink path usage as other
sglang_shadow files; locate any hard-coded path strings or os.path calls in
tbo_backend.py (top-level variables, functions or helpers that reference
filesystem paths) and replace them with package-relative resource access (e.g.,
use Path(__file__).resolve().parent / "<relative_name>" or
importlib.resources.read_binary/text to load bundled files) so the code no
longer depends on /home/... raw paths or broken symlinks; ensure any path
resolution uses .resolve() and relative joins and update the import/loading
logic in the relevant functions/classes in tbo_backend.py to use those
package-relative resources.
sglang_shadow/python/sglang/srt/layers/attention/utils.py (1)

1-1: Same broken symlink / raw path issue as the other sglang_shadow/ files.

See the consolidated comment on vision_utils.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/layers/attention/utils.py` at line 1, The
file sglang_shadow/python/sglang/srt/layers/attention/utils.py contains the same
broken symlink/raw absolute path issue as vision_utils; replace the raw absolute
path entry with the proper module content or a package-relative import so the
module sglang.srt.layers.attention.utils resolves correctly. Specifically,
remove the hardcoded "/home/..." path or symlink placeholder at the top of
utils.py and restore the actual implementation (or copy the corresponding
implementation from the non-shadow module), ensuring imports use
package-relative names (e.g., from sglang.srt.layers.attention import ...) and
that the file is a real Python file rather than a broken symlink.
sglang_shadow/python/sglang/srt/layers/radix_attention.py (1)

1-1: Same broken symlink / raw path issue as the other sglang_shadow/ files.

See the consolidated comment on vision_utils.py — this file also contains only a local filesystem path instead of Python source.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/layers/radix_attention.py` at line 1, The
file radix_attention.py currently contains only a raw filesystem path (broken
symlink) rather than Python source; replace the placeholder content with the
actual implementation of the RadixAttention module (restore the real
class/function definitions that should live in radix_attention.py), or remove
the stray file if it was accidentally added and import the real module from the
correct package; ensure the module defines the expected public symbols (e.g.,
RadixAttention class or radix_attention_* functions used elsewhere) so imports
elsewhere in the codebase resolve correctly.
sglang_shadow/python/sglang/__init__.py (1)

1-1: Same broken symlink / raw path issue as the other sglang_shadow/ files.

See the consolidated comment on vision_utils.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/__init__.py` at line 1, The
sglang_shadow/python/sglang/__init__.py file is a broken symlink/raw path copy
like the other sglang_shadow files; replace it with a proper package-init that
re-exports the real package instead of referencing the absolute path. Edit
__init__.py in the sglang_shadow package to remove the symlink/path reference
and instead import and re-export the real sglang symbols (e.g., use explicit
imports or wildcard re-exports from the canonical sglang package) so consumers
of sglang_shadow import via normal package imports rather than a raw filesystem
path.
sglang_shadow/python/sglang/srt/layers/rocm_linear_utils.py (1)

1-1: Same blocker as above: absolute local-path artifact in source.

Line 1 repeats the same non-portable absolute-path artifact and should be replaced by a regular source file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/layers/rocm_linear_utils.py` at line 1, The
file sglang_shadow/python/sglang/srt/layers/rocm_linear_utils.py currently
contains a non-portable absolute filesystem path on its first line; remove that
path-only line and replace it with the real Python module source (or a proper
stub) so the file is a normal importable module. Ensure you restore any original
functions/classes expected from this module (e.g., rocm_linear_* helpers or
utility functions) or, if not available, add a minimal well-documented stub
(module docstring and exported symbols) so imports of rocm_linear_utils succeed
across environments.
sglang_shadow/python/sglang/srt/layers/layernorm.py (1)

1-1: Same blocker as above: absolute local-path artifact in source.

This file has the same non-portable absolute-path artifact and should be replaced with a real tracked module.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/layers/layernorm.py` at line 1, The file
contains a non-portable absolute local-path artifact that should be removed and
replaced with the proper tracked module import; open
sglang/srt/layers/layernorm.py, remove any hard-coded absolute path string or
sys.path manipulation that references a local workspace, and replace it with a
normal import of the shared/packaged LayerNorm implementation (e.g., import the
canonical module or class such as LayerNorm from the project package instead of
referencing /home/...); ensure tests/imports elsewhere reference the new package
import and update any top-level variables or factory functions in layernorm.py
to delegate to the packaged implementation.
sglang_shadow/python/sglang/srt/layers/multimodal.py (1)

1-1: Same blocker as above: absolute local-path artifact in source.

Line 1 contains the same machine-local path artifact and should be replaced by a normal module file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/layers/multimodal.py` at line 1, The file
sglang.srt.layers.multimodal (multimodal.py) currently contains an absolute
machine-local filesystem path as its first line; remove that path artifact and
replace it with a normal Python module header (e.g., a module docstring or
relevant imports) so the file is a valid module. Specifically, delete the
literal "/home/.../multimodal.py" line, add an appropriate top-of-file docstring
or imports used by this module (retain or re-add any real definitions like
classes or functions in this file such as multimodal layer classes or helper
functions), and ensure there are no other hard-coded local paths left in
multimodal.py; run linters/tests to confirm the module imports correctly.
sglang_shadow/python/sglang/srt/layers/modelopt_utils.py (1)

1-1: Same blocker as above: absolute local-path artifact in source.

Line 1 repeats the same host-absolute path issue; please replace with a normal module file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/layers/modelopt_utils.py` at line 1, The top
of modelopt_utils.py contains a host-absolute filesystem path string instead of
module code; remove that absolute path line and replace it with a normal module
header (optional module docstring or imports) so the file is a valid Python
module; ensure no other absolute local-path artifacts remain in
sglang/srt/layers/modelopt_utils.py and run tests/import to verify
functions/classes in modelopt_utils (e.g., any functions or classes defined in
that file) are importable after the change.
sglang_shadow/python/sglang/srt/layers/attention/hybrid_attn_backend.py (1)

1-1: Same blocker as above: absolute local-path artifact in source.

Line 1 shows the same host-specific path artifact; this should be replaced with a real in-repo module file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/layers/attention/hybrid_attn_backend.py` at
line 1, The file hybrid_attn_backend.py contains a host-specific absolute path
artifact on line 1; remove that literal path and replace it with a proper
in-repo module or relative import (e.g., import the equivalent module from the
package or add the module file to the repo and use a relative import). Ensure
any references in hybrid_attn_backend.py to the external path are updated to
import the correct in-repo symbol (or class/function) and that the module is
added to the package __init__ if needed so imports resolve at runtime.
sglang_shadow/python/sglang/srt/environ.py (1)

1-1: Same blocker as above: absolute local-path artifact in source.

This file also contains the same machine-local absolute path artifact and needs conversion to a real Python module.

sglang_shadow/python/sglang/srt/multiplex (1)

1-1: Same blocker as above: absolute local-path artifact in tracked path.

This path also appears to be an absolute local-path artifact. Please replace it with the intended in-repo module/package structure.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/multiplex` at line 1, The tracked artifact is
an absolute local filesystem path (sglang_shadow/python/sglang/srt/multiplex)
that must be replaced with the in-repo package/module layout; move the contents
into the repository package (e.g., package path under
python/sglang/srt/multiplex) or update references to use the package import path
(sglang.srt.multiplex) instead of an absolute path, and remove any hard-coded
local-path entries from manifests, setup/config, or import statements that
reference sglang_shadow.
sglang_shadow/python/sglang/srt/layers/sparse_pooler.py (1)

1-1: Broken symlink — same issue as noted on trtllm_mha_backend.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/layers/sparse_pooler.py` at line 1, The file
sglang.srt.layers.sparse_pooler.py is a broken symlink; replace the symlink with
the real source file or restore the missing target so Python can import it.
Locate the symlink currently referenced as sparse_pooler.py in the
sglang.srt.layers package, remove and replace it with the actual module
implementation (or recreate the correct symlink target), then run the same fix
you applied for trtllm_mha_backend.py to ensure imports (e.g., from
sglang.srt.layers.sparse_pooler import ...) resolve successfully.
sglang_shadow/python/sglang/srt/layers/attention/nsa_backend.py (1)

1-1: Broken symlink — same issue as noted on trtllm_mha_backend.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/layers/attention/nsa_backend.py` at line 1,
The repository contains a broken symlink for
sglang/srt/layers/attention/nsa_backend.py (same problem as
trtllm_mha_backend.py); fix it by replacing the broken symlink with the correct
implementation or a valid symlink target: locate the symlink at nsa_backend.py,
remove the broken link (git rm) and either add the real file content for
nsa_backend.py or recreate the symlink to the correct existing file, then commit
so imports referencing nsa_backend.py resolve correctly.
sglang_shadow/python/sglang/srt/layers/attention/triton_backend.py (1)

1-1: Broken symlink — same issue as noted on trtllm_mha_backend.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/layers/attention/triton_backend.py` at line
1, The triton_backend.py file is a broken symlink (same problem as
trtllm_mha_backend.py); replace the broken symlink with the correct
implementation or a valid symlink to the real backend module used in this repo,
or remove the symlink and add the proper Python module content/export so imports
work; ensure the module name triton_backend.py resolves to the actual backend
(or mirrors trtllm_mha_backend.py behaviour) and update any imports referencing
srt.layers.attention.triton_backend to point to the real module.
sglang_shadow/python/sglang/srt/layers/attention/flashinfer_mla_backend.py (1)

1-1: Broken symlink — same issue as noted on trtllm_mha_backend.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/layers/attention/flashinfer_mla_backend.py`
at line 1, The file sglang/srt/layers/attention/flashinfer_mla_backend.py is a
broken symlink (same problem as trtllm_mha_backend.py); replace or remove the
symlink and ensure the real implementation file is present and imported
correctly: either restore the original target file for flashinfer_mla_backend.py
or convert the symlink into a real Python module file with the expected symbols
(functions/classes used by the codebase), and update any imports that referenced
flashinfer_mla_backend.py to point to the correct module; mirror the fix you
applied for trtllm_mha_backend.py so builds and imports no longer fail.
sglang_shadow/python/sglang/bench_serving.py (1)

1-1: Broken symlink — same issue as noted on trtllm_mha_backend.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/bench_serving.py` at line 1, The bench_serving.py
entry is a broken symlink (same problem as trtllm_mha_backend.py); fix it by
replacing the symlink at sglang/bench_serving.py with the actual Python file (or
update the symlink target to the correct, existing path) so imports and CI can
find it; ensure the new file content matches the intended implementation, add
and commit the real file (or corrected symlink) to the repo so bench_serving.py
is no longer a dangling link.
sglang_shadow/python/sglang/srt/layers/attention/base_attn_backend.py (1)

1-1: Broken symlink — same issue as noted on trtllm_mha_backend.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/layers/attention/base_attn_backend.py` at
line 1, The file sglang/srt/layers/attention/base_attn_backend.py is a broken
symlink (same problem as trtllm_mha_backend.py); replace the symlink with a real
Python module file or recreate the symlink to point to the correct target so
imports succeed. Specifically, ensure base_attn_backend.py is a regular file
(not a dangling link) containing the expected backend implementation (or
correctly points to the actual implementation used by attention backends), and
verify any imports referencing base_attn_backend.py and trtllm_mha_backend.py
resolve without errors.
🟠 Major comments (6)
flashinfer/gdn_decode.py-1959-1966 (1)

1959-1966: ⚠️ Potential issue | 🟠 Major

Add pool_size to the cache_key to handle variable pool sizes when using state_indices.

When use_pool_indexing=True (state_indices provided), pool_size can vary independently of B per the assertion at lines 1903-1906. However, the cache_key at line 1966 omits pool_size, allowing the same compiled kernel to be reused for different h0_source shapes [pool_size * HV, K, V]. This contrasts with the similar gated_delta_rule_mtp function (lines 2592-2602), which explicitly includes pool_size in its cache_key. Include pool_size in the cache_key to ensure correct kernel reuse across varying pool sizes.
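A minimal sketch of the intended shape of the fix, assuming hypothetical names that stand in for the real locals in gdn_decode.py:

```python
# Hypothetical sketch: include pool_size in the kernel cache key so that a
# kernel compiled for one [pool_size * HV, K, V] state layout is never reused
# for another. Names mirror the discussion above, not the actual code.
def make_cache_key(B, HV, K, V, pool_size, use_pool_indexing):
    key = (B, HV, K, V)
    if use_pool_indexing:
        # Pooled mode: h0_source is [pool_size * HV, K, V], so pool_size
        # must participate in the kernel cache lookup.
        key += (pool_size,)
    return key
```

With this shape, two runs that differ only in pool_size resolve to different cache entries, matching the gated_delta_rule_mtp behavior.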

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/gdn_decode.py` around lines 1959 - 1966, The kernel cache key
omits pool_size causing kernels compiled for one pool_size to be reused for
different h0_source shapes; update the cache_key tuple (used where cache_key is
assigned in gdn_decode) to include pool_size so it matches the actual h0_source
layout (state_contiguous and h0_source) and mirrors the gated_delta_rule_mtp
behavior when use_pool_indexing/state_indices can vary pool_size.
sglang_shadow/python/sglang/srt/layers/attention/mamba-1-1 (1)

1-1: ⚠️ Potential issue | 🟠 Major

Avoid committing machine-local absolute path targets in the shadow package.

Line 1 embeds a local /home/... path; this will not resolve on other machines and makes the shadow tree brittle.
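As an illustration (all paths below are made up for the demo), a relative symlink target resolves wherever the repository is checked out, while an absolute one bakes in one machine's filesystem layout:

```python
import os
import shutil

# Scratch demo: demo/real/mod.py stands in for the real module file.
os.makedirs("demo/real", exist_ok=True)
os.makedirs("demo/shadow", exist_ok=True)
with open("demo/real/mod.py", "w") as f:
    f.write("payload")

# Relative target: resolved against the symlink's own directory, so it keeps
# working after the repo is moved or cloned elsewhere.
os.symlink("../real/mod.py", "demo/shadow/rel_mod.py")

# An absolute target like os.path.abspath("demo/real/mod.py") would embed
# this machine's /home/... prefix and dangle on every other checkout.
target = os.readlink("demo/shadow/rel_mod.py")
content = open("demo/shadow/rel_mod.py").read()

shutil.rmtree("demo")
```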

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/layers/attention/mamba` at line 1, The shadow
package contains a machine-local absolute path string ("/home/...") in
sglang/srt/layers/attention/mamba which must be removed; edit the mamba shadow
file to delete the hard-coded absolute path and replace it with a relative path
or a proper package/resource reference (e.g., use package-relative imports or
resource lookup APIs) so the package does not embed local filesystem targets,
and ensure any tests/build scripts that generated this entry use relative paths
or package_data instead.
sglang_shadow/python/sglang/srt/distributed-1-1 (1)

1-1: ⚠️ Potential issue | 🟠 Major

Use repo-relative links or real files for shadow distributed package entries.

Line 1 is a workstation-specific absolute path, which will fail on any other machine.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/distributed` at line 1, The first line
contains a workstation-specific absolute path instead of a portable reference
for the shadow distributed package; replace that absolute path with either a
repo-relative reference or commit the real package files and point the shadow
entry to the module sglang.srt.distributed (i.e., update the shadow package
entry so it references the project-relative package/module rather than an
absolute filesystem path).
flashinfer/gdn_kernels/gdn_decode_bf16_state.py-2060-2082 (1)

2060-2082: ⚠️ Potential issue | 🟠 Major

Validate initial_state_indices device before from_dlpack.

initial_state_indices is checked for shape/dtype, but not that it lives on the same device as q. A CPU tensor here can fail later in kernel plumbing with less actionable errors.

🛡️ Suggested fix
     if use_pool_indexing:
         assert initial_state_indices.shape == (B,), (
             f"Expected shape [{B}], got {initial_state_indices.shape}"
         )
         assert initial_state_indices.dtype == torch.int32, (
             f"Expected int32, got {initial_state_indices.dtype}"
         )
+        assert initial_state_indices.device == q.device, (
+            f"initial_state_indices must be on {q.device}, got {initial_state_indices.device}"
+        )
         assert initial_state_source.is_contiguous(), (
             "initial_state_source must be contiguous for correct pointer arithmetic in pooled mode"
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/gdn_kernels/gdn_decode_bf16_state.py` around lines 2060 - 2082,
The code validates shape/dtype for initial_state_indices but not its device
relative to q, which can cause cryptic failures later (e.g., during
from_dlpack); add a device check and coerce or fail fast: verify
initial_state_indices.device == q.device (where q is the query tensor used
later) and if not either transfer initial_state_indices to q.device
(initial_state_indices = initial_state_indices.to(q.device)) or raise an assert
mentioning q.device and initial_state_indices.device; update the block around
the use of initial_state_indices, initial_state_source, and any subsequent
from_dlpack usage to perform this check before proceeding with pointer
arithmetic or DLPack conversions.
bench_gdn_kernel.py-70-75 (1)

70-75: ⚠️ Potential issue | 🟠 Major

Remove the duplicate time_ported computation.

Line 74 recomputes time_ported using the old start, inflating the reported ported latency.

🐛 Suggested fix
     torch.cuda.synchronize()
     time_ported = (time.perf_counter() - start) / repeat

     # Clear cache to avoid shape mismatch on h0_source (pool vs gathered)
-
-    time_ported = (time.perf_counter() - start) / repeat
-
-    # Clear cache to avoid shape mismatch on h0_source (pool vs gathered)
     _get_compiled_decode_kernel_nontranspose.cache_clear()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bench_gdn_kernel.py` around lines 70 - 75, The duplicate computation of
time_ported is causing an inflated ported latency; remove the second
reassignment "time_ported = (time.perf_counter() - start) / repeat" (the one
after the cache-clear comment) so time_ported is only computed once using the
original start and repeat variables; keep the first computation and the
cache-clear comment related to h0_source intact.
bench_gdn_kernel.py-29-31 (1)

29-31: ⚠️ Potential issue | 🟠 Major

State tensor must use BF16 dtype to test the optimized kernel path.

Line 30 allocates state_pool with hardcoded dtype=torch.float32, but the BF16-state kernel dispatch (line 1040 in flashinfer/gdn_decode.py) requires state.dtype == torch.bfloat16. This benchmark cannot hit the new optimized kernel path as written.

Suggested fix
    state_pool = torch.randn(
-       pool_size, num_value_heads, K, V, dtype=torch.float32, device=device
+       pool_size, num_value_heads, K, V, dtype=dtype, device=device
    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bench_gdn_kernel.py` around lines 29 - 31, The benchmark allocates state_pool
with dtype=torch.float32 so the BF16 optimized kernel in
flashinfer/gdn_decode.py never triggers; change the allocation for state_pool to
use torch.bfloat16 (or cast it immediately after creation) so state_pool.dtype
== torch.bfloat16, ensuring the BF16-state kernel dispatch path can be
exercised; ensure the device supports bfloat16 if necessary and keep the same
shape/variables (state_pool, K, V, num_value_heads, pool_size) when changing the
dtype.
🟡 Minor comments (3)
flashinfer/gdn_decode.py-953-953 (1)

953-953: ⚠️ Potential issue | 🟡 Minor

state_indices parameter is missing from the docstring.

The Args section (lines 960–998) documents all parameters up through use_qk_l2norm but omits the newly added state_indices. The nontranspose variant at line 1874 does document it — please mirror that here for consistency.
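For reference, a hedged sketch of the kind of Args entry being asked for; the signature and wording are illustrative, not the actual flashinfer docstring:

```python
def gated_delta_rule_decode_sketch(q, state, state_indices=None):
    """Illustrative signature only; not the real flashinfer API.

    Args:
        state_indices (Optional[torch.Tensor]): Optional int32 tensor of
            shape [B] mapping each batch element to a slot in the state
            pool. When provided, ``state`` is treated as a pool of shape
            [pool_size, HV, V, K] and is read and written indirectly;
            when omitted, ``state`` is per-batch with shape [B, HV, V, K].
    """
    ...
```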

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/gdn_decode.py` at line 953, The docstring's Args section is
missing documentation for the newly added parameter state_indices; update the
Args block in the function's docstring to include a description for
state_indices (type Optional[torch.Tensor]) matching the wording and semantics
used in the nontranspose variant's docstring (see state_indices entry around
line 1874) so both variants are consistent; locate the function in
flashinfer/gdn_decode.py that declares state_indices: Optional[torch.Tensor] =
None and add the same parameter description to the Args list.
flashinfer/gdn_decode.py-1896-1902 (1)

1896-1902: ⚠️ Potential issue | 🟡 Minor

Misleading comment: says "K-last" but the assertion checks K-major layout (HV, K, V).

The comment on line 1899 says "Qwen-style K-last" but the assertion verifies state.shape[1:] == (HV, K, V), which is K-major (V-last). The pretranspose variant at line 1006 correctly says "K-last" for (HV, V, K). This is confusing for anyone cross-referencing the two paths.

Suggested fix
-    # Validate state shape (Qwen-style K-last: [*, HV, V, K])
+    # Validate state shape (K-major: [*, HV, K, V])
     assert state.shape[1:] == (HV, K, V), (
         f"Expected state shape [*, HV={HV}, K={K}, V={V}], got {state.shape}"
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/gdn_decode.py` around lines 1896 - 1902, The comment "Qwen-style
K-last" is incorrect for the post-transpose path: the assertion checks
state.shape[1:] == (HV, K, V) which is K-major (V-last), not K-last. Update the
inline comment near use_pool_indexing/state to describe the actual layout —
e.g., "Qwen-style K-major (V-last): [*, HV, K, V]" — and optionally add a note
contrasting it with the pretranspose variant that uses (HV, V, K) to avoid
confusion; keep variable names (state, HV, K, V, use_pool_indexing,
state_indices) as reference.
bench_gdn_kernel.py-82-83 (1)

82-83: ⚠️ Potential issue | 🟡 Minor

Use _ for intentionally unused outputs.

out is assigned but never read in warmup/baseline loops.

🧹 Suggested cleanup
-        out, temp_state = gated_delta_rule_decode(
+        _, temp_state = gated_delta_rule_decode(
@@
-        out, temp_state = gated_delta_rule_decode(
+        _, temp_state = gated_delta_rule_decode(

Also applies to: 107-108

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bench_gdn_kernel.py` around lines 82 - 83, The variable out from the
gated_delta_rule_decode calls is assigned but never used; change the tuple
unpacking to discard the unused output by replacing "out, temp_state =" with "_,
temp_state =" (and similarly for the other occurrence around lines 107-108) so
only temp_state is kept; keep the function call and q argument the same and
ensure no other logic depends on out.
🧹 Nitpick comments (7)
flashinfer/gdn_decode.py (1)

1690-1691: Dead expression: h0_indices.layout.shape[0] result is discarded.

Line 1690 evaluates h0_indices.layout.shape[0] but the result is not assigned to anything. Same issue at line 1771 in run_gdn_decode_kernel_big_batch_nontranspose. These appear to be leftover debug statements.

Remove dead expressions
     # h0_source is flattened to [B*HV, K, V] to ensure proper alignment for SIMT async copy
     batch_hv_dim, k_dim, v_dim = h0_source.layout.shape
-    h0_indices.layout.shape[0]
     batch_size = B * HV  # Use actual batch size, not pool size

Apply the same removal at line 1771.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/gdn_decode.py` around lines 1690 - 1691, Remove the dead
expression evaluations of h0_indices.layout.shape[0] that are left as standalone
statements; in the function where batch_size is computed (the block setting
batch_size = B * HV) delete the preceding unused line
"h0_indices.layout.shape[0]" and apply the same removal in
run_gdn_decode_kernel_big_batch_nontranspose (the analogous stray
h0_indices.layout.shape[0] at line ~1771) so the code only computes and uses
batch_size without the discarded expression.
tests/gdn/test_gdn_decode_pooled.py (1)

19-22: Make the SM gate future-proof.

Line 21 hardcodes [9, 10, 11, 12]; this will skip tests on newer supported SM majors.

♻️ Suggested change
 def _skip_if_not_sm90_or_later():
     cc = get_compute_capability(torch.device("cuda"))
-    if cc[0] not in [9, 10, 11, 12]:
+    if cc[0] < 9:
         pytest.skip(f"GDN decode requires SM90+, got SM{cc[0]}{cc[1]}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gdn/test_gdn_decode_pooled.py` around lines 19 - 22, The helper
_skip_if_not_sm90_or_later currently hardcodes supported SM majors
([9,10,11,12]) causing future SM majors to be skipped; update the check in
_skip_if_not_sm90_or_later (which calls get_compute_capability) to allow any
major >= 9 (e.g., use cc[0] < 9 to decide pytest.skip) and keep the pytest.skip
message but format it generically (e.g., "GDN decode requires SM90+, got
SM{cc[0]}{cc[1]}") so newer SM majors pass without needing further edits.
run_bench_local.sh (1)

249-251: restore_to_ported is called but never defined.

Line 250 calls an undefined function; the cleanup hook only appears to work because the resulting "command not found" error is swallowed by `|| true`.

♻️ Suggested guard
 cleanup() {
-    restore_to_ported 2>/dev/null || true
+    if command -v restore_to_ported >/dev/null 2>&1; then
+        restore_to_ported 2>/dev/null || true
+    fi
     kill_server 2>/dev/null || true
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@run_bench_local.sh` around lines 249 - 251, The cleanup function calls
restore_to_ported which is not defined; either add a small restore_to_ported
implementation (e.g., a function that reverses any port forwarding or is a
no-op) or guard the call so missing-command errors aren't relied on—e.g., check
command -v or type for restore_to_ported before calling it (or define
restore_to_ported() {}), and keep kill_server as-is; update the cleanup block to
call the defined or conditionally-invoked restore_to_ported to ensure the
cleanup hook is deterministic.
sglang_shadow/python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py (4)

67-71: Inconsistent cleanup of global handles in the except block.

On import failure, _flashinfer_chunk_gated_delta_rule and _flashinfer_gated_delta_rule_mtp are not explicitly reset to None, unlike the decode handles. While they happen to remain None from module-level initialization, this is fragile if a future code path partially sets them before failure.

Proposed fix
         except (ImportError, RuntimeError) as e:
             logger.warning(f"FlashInfer GDN kernels not available: {e}")
             _flashinfer_gdn_available = False
+            _flashinfer_chunk_gated_delta_rule = None
+            _flashinfer_gated_delta_rule_mtp = None
             _flashinfer_gated_delta_rule_decode = None
             _flashinfer_gated_delta_rule_decode_nontranspose = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@sglang_shadow/python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py`
around lines 67 - 71, The except block that handles ImportError/RuntimeError
leaves _flashinfer_chunk_gated_delta_rule and _flashinfer_gated_delta_rule_mtp
untouched, which is fragile; update the except block in
hybrid_linear_attn_backend.py (the handler that currently sets
_flashinfer_gdn_available, _flashinfer_gated_delta_rule_decode, and
_flashinfer_gated_delta_rule_decode_nontranspose) to also explicitly set
_flashinfer_chunk_gated_delta_rule and _flashinfer_gated_delta_rule_mtp to None
so all FlashInfer kernel handles are consistently cleared on failure.

1334-1337: Environment variable lookup on every decode forward call.

os.environ.get("SGLANG_TEST_FORCE_GATHER_SCATTER", "0") is evaluated on every forward_decode invocation in V-last FlashInfer mode. While the dict lookup itself is cheap, for a hot decode path this could be cached once during __init__ instead.

Proposed fix

In __init__, cache the flag:

self._force_gather_scatter = os.environ.get("SGLANG_TEST_FORCE_GATHER_SCATTER", "0") == "1"

Then in forward_decode:

-                force_gather_scatter = (
-                    os.environ.get("SGLANG_TEST_FORCE_GATHER_SCATTER", "0") == "1"
-                )
+                force_gather_scatter = self._force_gather_scatter
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@sglang_shadow/python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py`
around lines 1334 - 1337, The environment lookup for
SGLANG_TEST_FORCE_GATHER_SCATTER is performed on every forward_decode call; move
that check into the class initializer by adding a cached boolean (e.g. set
self._force_gather_scatter = os.environ.get("SGLANG_TEST_FORCE_GATHER_SCATTER",
"0") == "1" inside __init__) and then update forward_decode to read that cached
self._force_gather_scatter instead of calling os.environ.get or defining a local
force_gather_scatter each invocation; ensure references to force_gather_scatter
in forward_decode use the new instance attribute name.

817-819: Local re-import of causal_conv1d_fn shadows module-level CUDA override.

Inside forward_extend, the Triton version of causal_conv1d_fn is explicitly re-imported, overriding the module-level CUDA variant (set at line 114). If this is intentional (e.g., the Triton version supports extra kwargs like has_initial_state), a brief comment would prevent future confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@sglang_shadow/python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py`
around lines 817 - 819, The local re-import of causal_conv1d_fn inside
forward_extend is shadowing the module-level CUDA-assigned variant (set earlier)
and causing confusion; either remove the local import so forward_extend uses the
module-level causal_conv1d_fn, or if you intentionally need the Triton-specific
implementation (e.g., for has_initial_state), keep the import but add a concise
comment above it explaining why the Triton override is required and/or rename
the local binding to triton_causal_conv1d_fn to make the override explicit;
update all references in forward_extend to the chosen identifier.

579-584: In-place mutation of req_pool_indices during CUDA graph replay.

Line 581 mutates req_pool_indices directly (req_pool_indices[bs - num_padding :] = 0). If this tensor is shared or reused by the caller, this side effect could cause subtle bugs. This is likely intentional for CUDA graph replay (where buffers are pre-allocated), but worth a brief comment to document the contract.
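The side-effect risk can be illustrated without CUDA; this pure-Python sketch (a list stands in for the tensor, and both function names are hypothetical) contrasts the in-place pattern with a defensive copy, which is what tensor.clone() would give in torch:

```python
def zero_padding_inplace(req_pool_indices, bs, num_padding):
    # Mirrors the reviewed pattern: mutates the caller's buffer in place.
    req_pool_indices[bs - num_padding:] = [0] * num_padding
    return req_pool_indices

def zero_padding_copy(req_pool_indices, bs, num_padding):
    # Defensive variant: zero the padding tail on a local copy
    # (the torch equivalent would be req_pool_indices.clone()).
    local = list(req_pool_indices)
    local[bs - num_padding:] = [0] * num_padding
    return local
```

With the copy variant the caller's buffer is untouched; with the in-place variant the tail is zeroed in the shared tensor, which is fine for pre-allocated CUDA-graph buffers but should be documented as part of the contract.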

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@sglang_shadow/python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py`
around lines 579 - 584, The code is performing an in-place mutation of
req_pool_indices (req_pool_indices[bs - num_padding :] = 0) which can leak side
effects to callers; either avoid mutating the caller-visible tensor by making a
local copy before zeroing (e.g., local_req_pool_indices =
req_pool_indices.clone() or .detach()) and then call
self.req_to_token_pool.get_mamba_indices(local_req_pool_indices), or if the
in-place write is intentional for CUDA-graph buffer reuse, add a clear comment
above the mutation explaining the contract (that req_pool_indices is mutated on
purpose for replay and must not be reused by callers) and ensure subsequent uses
operate on a non-shared tensor (mamba_indices, get_mamba_indices, and
self.state_indices_list usage).

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d207428 and 36b5674.

📒 Files selected for processing (123)
  • bench_gdn_kernel.py
  • flashinfer/gdn_decode.py
  • flashinfer/gdn_kernels/gdn_decode_bf16_state.py
  • run_bench_local.sh
  • sglang_shadow/python/sglang/README.md
  • sglang_shadow/python/sglang/__init__.py
  • sglang_shadow/python/sglang/__pycache__
  • sglang_shadow/python/sglang/_version.py
  • sglang_shadow/python/sglang/bench_offline_throughput.py
  • sglang_shadow/python/sglang/bench_one_batch.py
  • sglang_shadow/python/sglang/bench_one_batch_server.py
  • sglang_shadow/python/sglang/bench_serving.py
  • sglang_shadow/python/sglang/check_env.py
  • sglang_shadow/python/sglang/cli
  • sglang_shadow/python/sglang/compile_deep_gemm.py
  • sglang_shadow/python/sglang/eval
  • sglang_shadow/python/sglang/global_config.py
  • sglang_shadow/python/sglang/jit_kernel
  • sglang_shadow/python/sglang/lang
  • sglang_shadow/python/sglang/launch_server.py
  • sglang_shadow/python/sglang/multimodal_gen
  • sglang_shadow/python/sglang/profiler.py
  • sglang_shadow/python/sglang/srt/__pycache__
  • sglang_shadow/python/sglang/srt/batch_invariant_ops
  • sglang_shadow/python/sglang/srt/batch_overlap
  • sglang_shadow/python/sglang/srt/checkpoint_engine
  • sglang_shadow/python/sglang/srt/compilation
  • sglang_shadow/python/sglang/srt/configs
  • sglang_shadow/python/sglang/srt/connector
  • sglang_shadow/python/sglang/srt/constants.py
  • sglang_shadow/python/sglang/srt/constrained
  • sglang_shadow/python/sglang/srt/debug_utils
  • sglang_shadow/python/sglang/srt/disaggregation
  • sglang_shadow/python/sglang/srt/distributed
  • sglang_shadow/python/sglang/srt/dllm
  • sglang_shadow/python/sglang/srt/elastic_ep
  • sglang_shadow/python/sglang/srt/entrypoints
  • sglang_shadow/python/sglang/srt/environ.py
  • sglang_shadow/python/sglang/srt/eplb
  • sglang_shadow/python/sglang/srt/function_call
  • sglang_shadow/python/sglang/srt/grpc
  • sglang_shadow/python/sglang/srt/hardware_backend
  • sglang_shadow/python/sglang/srt/layers/__pycache__
  • sglang_shadow/python/sglang/srt/layers/activation.py
  • sglang_shadow/python/sglang/srt/layers/amx_utils.py
  • sglang_shadow/python/sglang/srt/layers/attention/__pycache__
  • sglang_shadow/python/sglang/srt/layers/attention/aiter_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/attention_registry.py
  • sglang_shadow/python/sglang/srt/layers/attention/base_attn_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/cutlass_mla_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/double_sparsity_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/dual_chunk_flashattention_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/fla
  • sglang_shadow/python/sglang/srt/layers/attention/flashattention_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/flashinfer_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/flashinfer_mla_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/flashmla_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/hybrid_attn_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/intel_amx_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/mamba
  • sglang_shadow/python/sglang/srt/layers/attention/merge_state.py
  • sglang_shadow/python/sglang/srt/layers/attention/nsa
  • sglang_shadow/python/sglang/srt/layers/attention/nsa_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/tbo_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/torch_flex_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/torch_native_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/triton_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/triton_ops
  • sglang_shadow/python/sglang/srt/layers/attention/trtllm_mha_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/trtllm_mla_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/utils.py
  • sglang_shadow/python/sglang/srt/layers/attention/vision.py
  • sglang_shadow/python/sglang/srt/layers/attention/vision_utils.py
  • sglang_shadow/python/sglang/srt/layers/attention/wave_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/wave_ops
  • sglang_shadow/python/sglang/srt/layers/attention/xpu_backend.py
  • sglang_shadow/python/sglang/srt/layers/communicator.py
  • sglang_shadow/python/sglang/srt/layers/communicator_nsa_cp.py
  • sglang_shadow/python/sglang/srt/layers/deep_gemm_wrapper
  • sglang_shadow/python/sglang/srt/layers/dp_attention.py
  • sglang_shadow/python/sglang/srt/layers/elementwise.py
  • sglang_shadow/python/sglang/srt/layers/flashinfer_comm_fusion.py
  • sglang_shadow/python/sglang/srt/layers/layernorm.py
  • sglang_shadow/python/sglang/srt/layers/linear.py
  • sglang_shadow/python/sglang/srt/layers/logits_processor.py
  • sglang_shadow/python/sglang/srt/layers/model_parallel.py
  • sglang_shadow/python/sglang/srt/layers/modelopt_utils.py
  • sglang_shadow/python/sglang/srt/layers/moe
  • sglang_shadow/python/sglang/srt/layers/multimodal.py
  • sglang_shadow/python/sglang/srt/layers/parameter.py
  • sglang_shadow/python/sglang/srt/layers/pooler.py
  • sglang_shadow/python/sglang/srt/layers/quantization
  • sglang_shadow/python/sglang/srt/layers/radix_attention.py
  • sglang_shadow/python/sglang/srt/layers/rocm_linear_utils.py
  • sglang_shadow/python/sglang/srt/layers/rotary_embedding.py
  • sglang_shadow/python/sglang/srt/layers/sampler.py
  • sglang_shadow/python/sglang/srt/layers/sparse_pooler.py
  • sglang_shadow/python/sglang/srt/layers/torchao_utils.py
  • sglang_shadow/python/sglang/srt/layers/utils
  • sglang_shadow/python/sglang/srt/layers/vocab_parallel_embedding.py
  • sglang_shadow/python/sglang/srt/lora
  • sglang_shadow/python/sglang/srt/managers
  • sglang_shadow/python/sglang/srt/mem_cache
  • sglang_shadow/python/sglang/srt/metrics
  • sglang_shadow/python/sglang/srt/model_executor
  • sglang_shadow/python/sglang/srt/model_loader
  • sglang_shadow/python/sglang/srt/models
  • sglang_shadow/python/sglang/srt/multimodal
  • sglang_shadow/python/sglang/srt/multiplex
  • sglang_shadow/python/sglang/srt/parser
  • sglang_shadow/python/sglang/srt/sampling
  • sglang_shadow/python/sglang/srt/server_args.py
  • sglang_shadow/python/sglang/srt/server_args_config_parser.py
  • sglang_shadow/python/sglang/srt/speculative
  • sglang_shadow/python/sglang/srt/tokenizer
  • sglang_shadow/python/sglang/srt/tracing
  • sglang_shadow/python/sglang/srt/utils
  • sglang_shadow/python/sglang/srt/weight_sync
  • sglang_shadow/python/sglang/test
  • sglang_shadow/python/sglang/utils.py
  • sglang_shadow/python/sglang/version.py
  • tests/gdn/test_gdn_decode_pooled.py
✅ Files skipped from review due to trivial changes (48)
  • sglang_shadow/python/sglang/srt/layers/attention/__pycache__
  • sglang_shadow/python/sglang/srt/entrypoints
  • sglang_shadow/python/sglang/srt/configs
  • sglang_shadow/python/sglang/jit_kernel
  • sglang_shadow/python/sglang/srt/layers/utils
  • sglang_shadow/python/sglang/srt/constrained
  • sglang_shadow/python/sglang/srt/metrics
  • sglang_shadow/python/sglang/srt/managers
  • sglang_shadow/python/sglang/srt/layers/__pycache__
  • sglang_shadow/python/sglang/srt/debug_utils
  • sglang_shadow/python/sglang/srt/tokenizer
  • sglang_shadow/python/sglang/srt/layers/quantization
  • sglang_shadow/python/sglang/srt/function_call
  • sglang_shadow/python/sglang/srt/connector
  • sglang_shadow/python/sglang/srt/elastic_ep
  • sglang_shadow/python/sglang/srt/dllm
  • sglang_shadow/python/sglang/srt/compilation
  • sglang_shadow/python/sglang/srt/grpc
  • sglang_shadow/python/sglang/srt/layers/moe
  • sglang_shadow/python/sglang/srt/layers/attention/triton_ops
  • sglang_shadow/python/sglang/README.md
  • sglang_shadow/python/sglang/lang
  • sglang_shadow/python/sglang/srt/mem_cache
  • sglang_shadow/python/sglang/srt/__pycache__
  • sglang_shadow/python/sglang/test
  • sglang_shadow/python/sglang/srt/weight_sync
  • sglang_shadow/python/sglang/srt/utils
  • sglang_shadow/python/sglang/srt/batch_overlap
  • sglang_shadow/python/sglang/srt/layers/attention/fla
  • sglang_shadow/python/sglang/eval
  • sglang_shadow/python/sglang/srt/models
  • sglang_shadow/python/sglang/srt/model_executor
  • sglang_shadow/python/sglang/__pycache__
  • sglang_shadow/python/sglang/srt/layers/attention/wave_ops
  • sglang_shadow/python/sglang/multimodal_gen
  • sglang_shadow/python/sglang/srt/multimodal
  • sglang_shadow/python/sglang/cli
  • sglang_shadow/python/sglang/srt/eplb
  • sglang_shadow/python/sglang/srt/checkpoint_engine
  • sglang_shadow/python/sglang/srt/lora
  • sglang_shadow/python/sglang/srt/sampling
  • sglang_shadow/python/sglang/srt/layers/attention/nsa
  • sglang_shadow/python/sglang/srt/batch_invariant_ops
  • sglang_shadow/python/sglang/srt/parser
  • sglang_shadow/python/sglang/srt/speculative
  • sglang_shadow/python/sglang/srt/hardware_backend
  • sglang_shadow/python/sglang/srt/tracing
  • sglang_shadow/python/sglang/srt/layers/deep_gemm_wrapper


h_out = gH[(batch_idx, value_head_idx, None, None)]
o_head = gO[(batch_idx, 0, value_head_idx, None)]
if pool_idx >= 0:

⚠️ Potential issue | 🔴 Critical

Add in-kernel upper-bound guard for pool_idx to prevent OOB state access.

Current guards only check pool_idx >= 0. Because host max-index validation is skipped during graph capture (Line 2073), positive out-of-range indices can still read/write gH out of bounds.

🛡️ Suggested fix
-    if pool_idx >= 0:
+    if pool_idx >= 0 and pool_idx < gH.shape[0]:
@@
-    if pool_idx >= 0:
+    if pool_idx >= 0 and pool_idx < gH.shape[0]:
@@
-    if pool_idx >= 0:
+    if pool_idx >= 0 and pool_idx < gH.shape[0]:

Also applies to: 1158-1158, 1621-1621

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/gdn_kernels/gdn_decode_bf16_state.py` at line 763, The kernel
currently only checks "if pool_idx >= 0" before accessing the shared state (gH),
allowing positive out-of-range indices to OOB read/write; update the guard to
also enforce an upper bound (e.g., "if 0 <= pool_idx < pool_len") using the
actual pool length (use the kernel's pool size constant or gH.shape[0]/len(gH))
so accesses to gH are only performed when pool_idx is within [0, pool_len-1];
apply the same change at the other occurrences mentioned (around the checks at
lines corresponding to pool_idx usage such as the blocks referencing pool_idx,
gH in gdn_decode_bf16_state.py, including the other spots flagged at the same
pattern).
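
The guarded-update pattern the suggestion above describes can be sketched host-side in plain Python. This is an illustrative model, not FlashInfer code: `state_pool`, `pool_idx`, and `guarded_state_update` are hypothetical names standing in for `gH` and the in-kernel branch.

```python
# Host-side reference of the bounded pooled-state guard suggested above.
# Names (state_pool, pool_idx, guarded_state_update) are illustrative,
# not FlashInfer APIs.

def guarded_state_update(state_pool, pool_idx, delta):
    """Update the pooled state slot only when pool_idx is in range.

    Mirrors the proposed in-kernel guard
    `if pool_idx >= 0 and pool_idx < gH.shape[0]`: a negative index means
    "no slot assigned", and a positive out-of-range index (possible when
    host validation is skipped during graph capture) must not read or
    write past the pool.
    """
    if 0 <= pool_idx < len(state_pool):
        state_pool[pool_idx] += delta
        return True
    return False

pool = [0.0, 0.0, 0.0]                          # pool_size = 3
assert guarded_state_update(pool, 1, 2.5)       # valid slot: updated
assert not guarded_state_update(pool, -1, 9.0)  # "unassigned" sentinel: skipped
assert not guarded_state_update(pool, 3, 9.0)   # OOB index: skipped, no crash
assert pool == [0.0, 2.5, 0.0]
```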

Comment on lines +1 to +3
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

#!/bin/bash

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Inspect the first 5 lines of run_bench_local.sh
head -n 5 run_bench_local.sh | cat -n

Repository: flashinfer-ai/flashinfer

Length of output: 303


Move the shebang to Line 1.

The shebang is only honored when `#!` occupies the first two bytes of the file. On line 3 it is parsed as an ordinary comment, so the script runs under whatever default shell invokes it rather than bash.

Suggested fix
-export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
-
-#!/bin/bash
+#!/bin/bash
+export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
#!/bin/bash
#!/bin/bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
🧰 Tools
🪛 Shellcheck (0.11.0)

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@run_bench_local.sh` around lines 1 - 3, Move the shebang "#!/bin/bash" to be
the very first line of the script and place the environment export statement
(export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True) after it so the shell
interpreter is correctly recognized; update the file so the first line is the
shebang and subsequent lines contain the export and other commands.
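
A minimal pre-commit style check for this class of bug, sketched in Python (`has_valid_shebang` is a hypothetical helper, not part of this repo):

```python
# The kernel only honors "#!" when it is the first two bytes of the file,
# so a shebang anywhere after line 1 is inert. This check flags that case.

def has_valid_shebang(script_text):
    """Return True only if the script starts with a shebang on line 1."""
    return script_text.startswith("#!")

broken = "export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True\n\n#!/bin/bash\n"
fixed = "#!/bin/bash\nexport PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True\n"

assert not has_valid_shebang(broken)  # shebang on line 3 is just a comment
assert has_valid_shebang(fixed)
```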

Comment on lines +217 to +223


--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--mamba-ssm-k-last \
> "${OUTPUT_DIR}/${name}_server.log" 2>&1 &

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's check if the file exists and examine the relevant lines
git ls-files | grep -E "run_bench_local\.sh"

Repository: flashinfer-ai/flashinfer

Length of output: 86


🏁 Script executed:

sed -n '200,250p' run_bench_local.sh | cat -n

Repository: flashinfer-ai/flashinfer

Length of output: 1784


🏁 Script executed:

# Get the exact line numbers of the problematic lines
wc -l run_bench_local.sh
# Check the exact lines mentioned in the review
sed -n '215,225p' run_bench_local.sh | cat -n
echo "---"
sed -n '237,245p' run_bench_local.sh | cat -n

Repository: flashinfer-ai/flashinfer

Length of output: 795


🏁 Script executed:

#!/bin/bash
set -euo pipefail
awk '
  prev_bslash && $0 ~ /^[[:space:]]*$/ {print "blank line after trailing backslash at line " NR-1}
  {prev_bslash = ($0 ~ /\\[[:space:]]*$/)}
' run_bench_local.sh

Repository: flashinfer-ai/flashinfer

Length of output: 163


Remove blank lines after line continuations to fix broken multiline commands.

Lines 216 and 239 have trailing backslashes followed by blank lines, which breaks shell line continuation. This causes subsequent lines to be parsed as command names instead of arguments, resulting in execution errors (SC2215).

Suggested fix
     python -u -m sglang.launch_server \
         --model-path ${MODEL_PATH} \
         --tp ${TP_SIZE} \
         --host 127.0.0.1 \
         --port ${SERVER_PORT} \
         --trust-remote-code \
         --mem-fraction-static 0.20 \
         --disable-radix-cache \
         --disable-cuda-graph \
-
-
         --speculative-algorithm EAGLE \
         --speculative-num-steps 3 \
         --speculative-eagle-topk 1 \
         --mamba-ssm-k-last \
         > "${OUTPUT_DIR}/${name}_server.log" 2>&1 &
@@
     python -m sglang.bench_one_batch_server \
         --model "${MODEL_PATH}" \
         --base-url "http://127.0.0.1:${SERVER_PORT}" \
         --batch-size 1 32 64 128 \
-
         --output-len ${OUTPUT_LEN} \
         --result-filename "${OUTPUT_DIR}/${name}_result.jsonl" \
         2>&1 | tee "${OUTPUT_DIR}/${name}_output.txt"
🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 219-219: This flag is used as a command name. Bad line break or missing [ .. ]?

(SC2215)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@run_bench_local.sh` around lines 217 - 223, There are blank lines after shell
line-continuation backslashes in the multiline command that includes flags like
--speculative-algorithm, --speculative-num-steps, --speculative-eagle-topk and
--mamba-ssm-k-last (the command redirecting to
"${OUTPUT_DIR}/${name}_server.log" 2>&1 &); remove the empty lines immediately
following any line that ends with a trailing backslash so the backslash actually
escapes the newline and the next flag/redirect is treated as the same command.
Ensure every backslash is followed directly by the next argument line (no
intervening blank lines) so the shell continuation is not broken.
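
The failure mode is easy to reproduce. The snippet below (an illustrative repro, run through bash via subprocess) shows that a blank line after a trailing backslash terminates the command, so the following `--flag` line is executed as a command name, which is exactly what ShellCheck's SC2215 warns about:

```python
# Repro of SC2215: backslash-newline joins a line to the NEXT line only.
# If that next line is blank, the command ends there, and the following
# "--flag" line is parsed as a new command (-> command not found).
import subprocess

broken = "echo a \\\n\n--flag b\n"   # blank line after the continuation
fixed = "echo a \\\n--flag b\n"      # continuation followed directly by args

r_broken = subprocess.run(["bash", "-c", broken], capture_output=True, text=True)
r_fixed = subprocess.run(["bash", "-c", fixed], capture_output=True, text=True)

assert r_broken.returncode != 0                  # "--flag" is not a command
assert r_fixed.returncode == 0
assert r_fixed.stdout.strip() == "a --flag b"    # one command, all args kept
```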

@@ -0,0 +1 @@
/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/_version.py No newline at end of file

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Find the _version.py file in the repository
find . -name "_version.py" -type f

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# List top-level structure
ls -la

# Search for any version-related files
find . -name "*version*" -type f 2>/dev/null | head -20

# Search for any Python files with "sglang" in path
find . -path "*sglang*" -type f 2>/dev/null | head -20

Repository: flashinfer-ai/flashinfer

Length of output: 3082


🏁 Script executed:

# List contents of sglang_shadow directory
ls -la sglang_shadow/

# List contents of sglang_shadow/python/sglang/
ls -la sglang_shadow/python/sglang/

# Check if _version.py exists
ls -la sglang_shadow/python/sglang/_version.py 2>&1

Repository: flashinfer-ai/flashinfer

Length of output: 3362


🏁 Script executed:

# Check if the symlink target exists
ls -la /home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/_version.py 2>&1

# Try to read the symlink
cat sglang_shadow/python/sglang/_version.py 2>&1

# Check what the symlink resolves to
readlink -f sglang_shadow/python/sglang/_version.py 2>&1

Repository: flashinfer-ai/flashinfer

Length of output: 254


Remove the broken symbolic link to /home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/_version.py.

sglang_shadow/python/sglang/_version.py is a symbolic link pointing to a nonexistent absolute filesystem path. This breaks imports and prevents the package from being distributed or installed. Replace the symlink with actual version code or a proper relative reference.

🧰 Tools
🪛 Ruff (0.15.2)

[error] 1-1: No such file or directory (os error 2)

(E902)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/_version.py` at line 1, The file
sglang_shadow/python/sglang/_version.py is a broken absolute symlink; remove the
symlink and replace it with a real module that exports the package version
(e.g., define __version__ = "x.y.z" or implement a dynamic lookup using
importlib.metadata.version in a function like get_version()); ensure the new
_version.py is a regular file (not a symlink) and that any imports referencing
sglang._version or from sglang._version import __version__ continue to work, or
alternatively change imports to use a relative/resource-based lookup so no
absolute filesystem links are required.

@@ -0,0 +1 @@
/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/bench_one_batch_server.py No newline at end of file

⚠️ Potential issue | 🔴 Critical

Raw filesystem path committed as Python code breaks module parsing.

Line 1 is not valid Python and will fail import/linting. Replace this with actual module content (or remove this file from Python package inputs if it is only a path marker artifact).

Suggested minimal fix (keep file syntactically valid until real implementation lands)
-/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/bench_one_batch_server.py
+"""Shadow module placeholder.
+
+TODO: replace with actual `bench_one_batch_server.py` implementation.
+"""
🧰 Tools
🪛 Ruff (0.15.2)

[error] 1-1: No such file or directory (os error 2)

(E902)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/bench_one_batch_server.py` at line 1, The file
sglang_shadow/python/sglang/bench_one_batch_server.py currently contains a raw
filesystem path on line 1 which is invalid Python and breaks imports; replace
that line with valid module content (e.g., a module docstring and a minimal
placeholder such as an empty function or "pass" in a named function/class) or
remove the file from package inputs if it was only a path marker artifact, then
run lint/import to verify the module now parses cleanly as requested in
[request_verification].

@@ -0,0 +1 @@
/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/srt/layers/torchao_utils.py No newline at end of file

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

git ls-files | grep -i "torchao_utils"

Repository: flashinfer-ai/flashinfer

Length of output: 123


🏁 Script executed:

git ls-files | grep -E "(sglang|torchao)" | head -20

Repository: flashinfer-ai/flashinfer

Length of output: 923


🏁 Script executed:

cat -n sglang_shadow/python/sglang/srt/layers/torchao_utils.py | head -20

Repository: flashinfer-ai/flashinfer

Length of output: 155


🏁 Script executed:

ls -la sglang_shadow/python/sglang/srt/layers/torchao_utils.py

Repository: flashinfer-ai/flashinfer

Length of output: 257


🏁 Script executed:

git ls-files -s | grep torchao_utils

Repository: flashinfer-ai/flashinfer

Length of output: 173


🏁 Script executed:

file sglang_shadow/python/sglang/srt/layers/torchao_utils.py

Repository: flashinfer-ai/flashinfer

Length of output: 112


🏁 Script executed:

git show HEAD:sglang_shadow/python/sglang/srt/layers/torchao_utils.py | head -20

Repository: flashinfer-ai/flashinfer

Length of output: 151


Remove hardcoded symlink; replace with actual Python module.

The file is a symbolic link pointing to an absolute path (/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/srt/layers/torchao_utils.py) instead of containing Python code. This cannot be imported and is not portable across environments. Replace the symlink with the actual module file.

🧰 Tools
🪛 Ruff (0.15.2)

[error] 1-1: No such file or directory (os error 2)

(E902)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/layers/torchao_utils.py` at line 1, The file
sglang.srt.layers.torchao_utils.py is currently a filesystem symlink to an
absolute path and thus not importable; replace the symlink with a real Python
module file named torchao_utils.py that implements the expected public API for
this package (module sglang.srt.layers.torchao_utils). Create the module
in-place (not a symlink), implement or forward the expected functions/classes
used elsewhere (e.g., any helpers referenced as torchao_utils.* in imports), add
an __all__ export list and proper imports, and ensure it contains tests or
minimal runtime-safe stubs if full functionality depends on optional
dependencies so the package can be imported across environments.

@@ -0,0 +1 @@
/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/srt/model_loader No newline at end of file

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, check if the file exists and what it contains
if [ -f "sglang_shadow/python/sglang/srt/model_loader" ]; then
  echo "=== File content ==="
  cat -n "sglang_shadow/python/sglang/srt/model_loader"
  echo ""
  echo "=== File type ==="
  file "sglang_shadow/python/sglang/srt/model_loader"
  echo ""
  echo "=== File size ==="
  wc -c "sglang_shadow/python/sglang/srt/model_loader"
fi

# Check if it's a symlink
if [ -L "sglang_shadow/python/sglang/srt/model_loader" ]; then
  echo "=== Symlink target ==="
  readlink -f "sglang_shadow/python/sglang/srt/model_loader"
fi

# List directory structure
echo ""
echo "=== Directory structure of sglang_shadow/ ==="
find sglang_shadow -type f -name "model_loader*" 2>/dev/null
find sglang_shadow -type l 2>/dev/null | head -20

Repository: flashinfer-ai/flashinfer

Length of output: 993


🏁 Script executed:

# Search for files with absolute paths as content
echo "=== Checking for files with absolute paths ==="
fd --type f . sglang_shadow/ 2>/dev/null | while read -r file; do
  # Skip binary and large files
  if file "$file" | grep -q "text"; then
    lines=$(wc -l < "$file" 2>/dev/null || echo "999")
    if [ "$lines" -le 5 ]; then
      content=$(cat "$file" 2>/dev/null)
      if echo "$content" | grep -qE "^/home/|^/usr/|^/opt/|^/var/"; then
        echo "SUSPECT: $file"
        echo "Content: $content"
        echo "---"
      fi
    fi
  fi
done

# Check for symlinks with absolute targets
echo ""
echo "=== Checking for absolute symlinks ==="
find sglang_shadow -type l 2>/dev/null | while read -r link; do
  target=$(readlink "$link" 2>/dev/null)
  if echo "$target" | grep -qE "^/"; then
    echo "ABSOLUTE SYMLINK: $link -> $target"
  fi
done

Repository: flashinfer-ai/flashinfer

Length of output: 18079


🏁 Script executed:

# Explore the sglang_shadow directory structure
echo "=== sglang_shadow directory tree ==="
tree -L 3 sglang_shadow 2>/dev/null || find sglang_shadow -type d | head -20

Repository: flashinfer-ai/flashinfer

Length of output: 2238


🏁 Script executed:

# Check if there's any documentation or configuration about sglang_shadow
find . -maxdepth 3 -name "*README*" -o -name "*.md" | grep -i sglang | head -10

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Check if there's any gitignore or setup documentation
cat .gitignore 2>/dev/null | grep -A5 -B5 sglang_shadow || echo "No sglang_shadow entries in .gitignore"

Repository: flashinfer-ai/flashinfer

Length of output: 106


Critical: Absolute symlinks committed throughout sglang_shadow/ — will break for all contributors and CI.

The sglang_shadow/ directory contains 150+ symlinks with hardcoded absolute paths to a developer's local workspace (/home/xutingz/workspace/gitsrc/sglang_gdn/...). This affects all subdirectories: python/sglang/, srt/, layers/, models/, and more. These symlinks will not resolve on any other machine, breaking builds for all contributors and CI pipelines. It also leaks developer-specific information (username and workspace structure).

Convert all absolute symlinks to relative symlinks (e.g., ../../../sglang_gdn/python/sglang/...), remove this directory entirely and wire integration via pyproject.toml path dependencies or editable installs, or add the entire directory to .gitignore if it's only needed locally.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/model_loader` at line 1, The sglang_shadow
directory contains many absolute symlinks pointing to /home/xutingz/... which
will fail for other contributors and CI; replace every absolute symlink in
sglang_shadow (including subdirs python/sglang/, srt/, layers/, models/) with
relative symlinks (e.g., ../../../sglang_gdn/...), or remove the directory and
wire the package via pyproject.toml path dependencies or an editable install, or
add sglang_shadow/ to .gitignore if it’s only local; search the repo for
symlinks containing the absolute prefix
"/home/xutingz/workspace/gitsrc/sglang_gdn" and update each link target to a
correct relative path or eliminate the symlink and use proper packaging instead.

@@ -0,0 +1 @@
/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/srt/server_args.py No newline at end of file

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, find the actual file path in the repository
fd "server_args.py" -t f

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Check repository structure and look for sglang references
git ls-files | head -30
echo "---"
git ls-files | grep -i sglang
echo "---"
git ls-files | grep -i server

Repository: flashinfer-ai/flashinfer

Length of output: 7356


🏁 Script executed:

# Check file size and read the content
wc -l sglang_shadow/python/sglang/srt/server_args.py
echo "---"
head -20 sglang_shadow/python/sglang/srt/server_args.py

Repository: flashinfer-ai/flashinfer

Length of output: 272


🏁 Script executed:

# Check the file using git directly
git ls-files -s sglang_shadow/python/sglang/srt/server_args.py
echo "---"
# Try to read it via git
git cat-file -p HEAD:sglang_shadow/python/sglang/srt/server_args.py | head -20

Repository: flashinfer-ai/flashinfer

Length of output: 243


Fix invalid symlink in module file.

The file sglang_shadow/python/sglang/srt/server_args.py is a symbolic link pointing to an absolute path (/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/srt/server_args.py), which is invalid Python and breaks module imports. Replace the symlink with actual server_args definitions or an intentional placeholder docstring.

🧰 Tools
🪛 Ruff (0.15.2)

[error] 1-1: No such file or directory (os error 2)

(E902)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/server_args.py` at line 1, The server_args.py
file is currently a broken symlink; replace it with a real Python module instead
of a symlink: remove the symlink and create a new server_args.py containing
either the actual argument parsing/constants/classes used elsewhere (e.g.,
functions or names referenced like parse_server_args, ServerArgs, DEFAULT_PORT)
or, if the real implementations live elsewhere, create a clear placeholder
module with a top-level docstring and exported names (e.g., __all__ =
["parse_server_args", "ServerArgs"]) so imports succeed; ensure the module
defines the same public symbols other code expects and is valid Python (no
absolute-path symlink).

@@ -0,0 +1 @@
/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/utils.py No newline at end of file

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

head -n 5 sglang_shadow/python/sglang/utils.py | cat -n

Repository: flashinfer-ai/flashinfer

Length of output: 163


🏁 Script executed:

fd -type f -name "utils.py" | head -20

Repository: flashinfer-ai/flashinfer

Length of output: 238


🏁 Script executed:

fd "utils.py"

Repository: flashinfer-ai/flashinfer

Length of output: 1155


🏁 Script executed:

head -n 10 sglang_shadow/python/sglang/utils.py | cat -n

Repository: flashinfer-ai/flashinfer

Length of output: 163


🏁 Script executed:

git show HEAD:sglang_shadow/python/sglang/utils.py | head -n 10

Repository: flashinfer-ai/flashinfer

Length of output: 132


🏁 Script executed:

git show HEAD:sglang_shadow/python/sglang/utils.py | wc -l

Repository: flashinfer-ai/flashinfer

Length of output: 69


🏁 Script executed:

git show HEAD:sglang_shadow/python/sglang/utils.py | xxd | head -20

Repository: flashinfer-ai/flashinfer

Length of output: 109


🏁 Script executed:

git show HEAD:sglang_shadow/python/sglang/utils.py | od -c

Repository: flashinfer-ai/flashinfer

Length of output: 110


sglang.utils is currently unparsable due to Line 1.

The absolute path literal is invalid Python and must be removed/replaced.

Proposed fix
-/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/utils.py
+"""Shared utility helpers (placeholder)."""
🧰 Tools
🪛 Ruff (0.15.2)

[error] 1-1: No such file or directory (os error 2)

(E902)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/utils.py` at line 1, The first line of
sglang.utils contains an invalid absolute path literal which makes the module
unparsable; remove that path string from the top of the file and replace it with
a valid construct (e.g., a module docstring, a shebang, or the intended
import/definition). Ensure the file begins with valid Python (no bare filesystem
paths) and that any intended metadata is expressed as a string literal docstring
or comment rather than a raw path.

@@ -0,0 +1 @@
/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/version.py No newline at end of file

⚠️ Potential issue | 🔴 Critical

Accidentally committed local filesystem path instead of file content.

This file (and every other file added under sglang_shadow/) contains only the developer's local absolute machine path as its content:

/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/version.py

These are not valid Python modules. They were almost certainly created by a tooling artifact (e.g., a find/echo redirect or a broken symlink materialisation) and then accidentally staged. The same problem affects all 8 files in this batch:

  • sglang_shadow/python/sglang/version.py
  • sglang_shadow/python/sglang/srt/disaggregation (also not a .py file — may be a directory committed as a regular file)
  • sglang_shadow/python/sglang/srt/layers/vocab_parallel_embedding.py
  • sglang_shadow/python/sglang/profiler.py
  • sglang_shadow/python/sglang/srt/layers/parameter.py
  • sglang_shadow/python/sglang/srt/constants.py
  • sglang_shadow/python/sglang/srt/layers/attention/intel_amx_backend.py
  • sglang_shadow/python/sglang/global_config.py

Please:

  1. Determine whether these shadow stubs should be empty files, copies/symlinks of the real SGLang source, or omitted entirely.
  2. Replace the path-string content with the correct content (or remove the files if not needed).
  3. For sglang_shadow/python/sglang/srt/disaggregation, verify whether this is meant to be a directory (package) — if so, a __init__.py inside that directory is needed rather than a file at that path.

The Ruff E902 errors on all of these are a direct downstream symptom of this same root cause.

🧰 Tools
🪛 Ruff (0.15.2)

[error] 1-1: No such file or directory (os error 2)

(E902)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/version.py` at line 1, Several files under
sglang_shadow contain only the developer's local absolute path instead of valid
Python content (e.g., sglang_shadow/python/sglang/version.py and the listed
eight files); you should remove or replace these stub files with the correct
artifacts: first determine whether each shadow entry (version.py, profiler.py,
global_config.py, constants.py, parameter.py, vocab_parallel_embedding.py,
intel_amx_backend.py) should be omitted, be a real copy/symlink of the
corresponding sglang module, or be an empty placeholder; for
sglang_shadow/python/sglang/srt/disaggregation verify if it should be a package
— if so replace the file with a directory named disaggregation and add an
__init__.py (or otherwise remove the stray file); ensure the final commit
removes the single-line path contents and changes the files to valid Python
modules or deletes them, then re-run linting to confirm Ruff E902 is resolved.

@xutizhou xutizhou force-pushed the feat/gdn-decode-pooled-f16 branch 2 times, most recently from c627a21 to 06b94e7 on February 25, 2026 09:44
@xutizhou xutizhou force-pushed the feat/gdn-decode-pooled-f16 branch from 702da5b to 34121a1 on February 25, 2026 10:22
@xutizhou xutizhou mentioned this pull request Feb 26, 2026
10 tasks
Collaborator

@yzh119 yzh119 left a comment


LGTM, please fix the typo. Also cc @kaixih, since your #2619 works on similar features (but on different code paths).

@yzh119

yzh119 commented Mar 4, 2026

/bot run

@yzh119 yzh119 added the run-ci label Mar 4, 2026
@flashinfer-bot

GitLab MR !372 has been created, and the CI pipeline #45304753 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot

[FAILED] Pipeline #45304753: 9/20 passed

@xutizhou xutizhou closed this Mar 13, 2026