feat(gdn): port pooled decode kernel to f16 backend #2634

Closed

xutizhou wants to merge 5 commits into flashinfer-ai:main from xutizhou:feat/gdn-decode-pooled-f16

Conversation

@xutizhou
Contributor

@xutizhou commented on Feb 25, 2026

📌 Description

This PR ports the pooled decode kernel (originally from #2521) to the new F16 backend introduced in #2498. It enables zero-copy state updates using indirect indexing, eliminating the need for manual gather/scatter operations in SGLang.

Key changes:

  • Ported the feat/gdn-decode-pooled commits to the latest main (post-F16 merge).
  • Merged the decode kernel to support both the new F16 path and the pooled indexing logic.
  • Updated docstrings and shape validation to handle both [B, HV, V, K] (direct) and [pool_size, HV, V, K] (pooled) state layouts.
  • Ensured compatibility with the new gdn_decode_klast_bf16_state backend.
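The zero-copy idea above can be illustrated with a minimal pure-Python model (lookup_state and its names are illustrative, not this PR's API; the real kernel performs the same lookup on GPU tensors inside the decode kernel):

```python
def lookup_state(state, batch_idx, state_indices=None):
    """Return the state slot for one batch element.

    Direct mode: state has one slot per batch element (state[batch_idx]).
    Pooled mode: state is a shared pool and state_indices[batch_idx]
    selects the slot; -1 marks a padding slot with no state.
    """
    if state_indices is None:  # direct [B, ...] layout
        return state[batch_idx]
    pool_idx = state_indices[batch_idx]
    if pool_idx < 0:  # padding slot: the kernel writes zeros, touches no state
        return None
    return state[pool_idx]  # pooled [pool_size, ...] layout


pool = ["slot0", "slot1", "slot2", "slot3"]  # pool_size = 4
indices = [2, 0, -1]  # batch of 3; the last element is padding
```

Because the kernel reads and writes state[pool_idx] in place, no gather into a per-batch buffer (and no scatter back) is needed.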

🔍 Related Issues

#2521 (pooled decode kernel), #2498 (F16 backend)

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • New Features

    • Optional pool-indexed state mode: supply per-batch indices to read/write shared state slots.
    • New hybrid/linear attention backend integrations for improved stateful decode workflows.
  • Improvements

    • Consistent pooled vs per-batch state handling across decode paths.
    • Stronger validation for state shapes, dtypes and indexing to prevent invalid reads/writes.
  • Tools

    • Local and Python benchmarking scripts for pooled vs gather/scatter throughput.
  • Tests

    • New tests validating pooled-state decode correctness and padding behavior.

@gemini-code-assist
Contributor

Summary of Changes

Hello @xutizhou, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates the pooled decode kernel with the recently introduced F16 backend, enabling a more efficient, zero-copy mechanism for state updates. By leveraging indirect indexing, the system can now directly access and modify state within a shared pool, significantly reducing memory overhead and improving overall performance. The changes also refine the API and internal logic to seamlessly support both traditional and pooled state representations.

Highlights

  • F16 Backend Port: The pooled decode kernel has been successfully ported to utilize the new F16 backend, enhancing performance and compatibility with recent optimizations.
  • Zero-Copy State Updates: Implemented zero-copy state updates through indirect indexing, which eliminates the need for manual gather/scatter operations in SGLang and improves efficiency.
  • Unified Kernel Logic: The gated_delta_rule_decode_pretranspose function was merged and updated to support both the new F16 kernel path and the pooled indexing logic within a single implementation.
  • Enhanced State Layout Handling: Docstrings and shape validation have been updated to correctly handle both direct [B, HV, V, K] and pooled [pool_size, HV, V, K] state layouts, ensuring flexibility and robustness.
  • Performance Improvement: Benchmarking shows consistent performance improvements of up to +1.9% speedup due to the reduction in memory copy overhead with the new pooled indexing approach.


Changelog
  • flashinfer/gdn_decode.py
    • Added a use_pool_indexing boolean parameter to gdn_decode_kernel_small_batch_pretranspose and gdn_decode_kernel_big_batch_pretranspose functions.
    • Modified kernel logic to conditionally use h0_indices for state lookup and handle padding slots when use_pool_indexing is True.
    • Updated run_gdn_decode_kernel_small_batch_pretranspose and run_gdn_decode_kernel_big_batch_pretranspose to accept and pass the new use_pool_indexing parameter, and adjusted grid batch size calculation.
    • Modified the _get_compiled_decode_kernel cache key to include pool_size and use_pool_indexing for proper kernel caching.
    • Updated the gated_delta_rule_decode_pretranspose function signature to include an optional state_indices tensor.
    • Expanded the docstring for gated_delta_rule_decode_pretranspose to explain the new pooled indexing mode and its benefits.
    • Adjusted state shape validation in gated_delta_rule_decode_pretranspose to support both [B, HV, V, K] and [pool_size, HV, V, K] layouts, and added specific validation for state_indices.
    • Modified h0_source reshaping logic and h0_indices assignment within gated_delta_rule_decode_pretranspose to adapt to pooled indexing.
    • Updated the conditional logic for copying state back to handle non-contiguous states only when not in pooled indexing mode.
    • Relocated the _get_compiled_decode_kernel_nontranspose function definition.
Activity
  • The pull request was created by xutizhou.
  • A detailed description was provided, outlining the purpose, key changes, related issues, and a checklist for pre-commit and tests.
  • Performance benchmarks were included, demonstrating speedups of up to 1.9% for pooled indexing compared to manual gather/scatter.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai
Contributor

coderabbitai bot commented Feb 25, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds optional pooled state indexing to gated-delta-rule decode: public APIs accept state_indices, runtime validates pooled vs per-batch layouts, and kernel/launcher signatures propagate initial_state_indices and use_pool_indexing to enable indirect per-batch state lookup from a state pool.

Changes

Public API / Dispatcher — flashinfer/gdn_decode.py
  Added optional state_indices: Optional[torch.Tensor] to decode entrypoints; validate shapes/dtypes/contiguity; derive use_pool_indexing and forward initial_state_indices to backend kernels.

Kernels & Launchers — flashinfer/gdn_kernels/gdn_decode_bf16_state.py
  Extended kernel/launcher signatures with gState_indices/mState_indices and use_pool_indexing; implemented indirect addressing (compute pool_idx/state_batch_idx), conditional SMEM use, padding/-1 handling, and propagated indexing across seqlen1 and seqlen234/unified paths.

Kernel Cache & Selection — flashinfer/gdn_kernels/...
  Updated cache keys and selection to include use_pool_indexing so compiled kernels distinguish direct vs pool-indexed modes.

State Layout & Validation — flashinfer/gdn_decode.py, flashinfer/gdn_kernels/...
  Support per-batch [B, HV, V, K] and pooled [pool, HV, V, K] layouts; validate and adapt indexing, computed pool_size, and flatten/reshape semantics for pooled mapping.

Benchmarks & Scripts — bench_gdn_kernel.py, run_bench_local.sh
  Added benchmark and local run orchestration comparing pool-indexed (zero-copy) vs baseline gather/scatter; includes warmups and kernel-cache management.

Tests — tests/gdn/test_gdn_decode_pooled.py
  New tests validating pooled BF16-state decode (including -1 padding) against per-sample references; skip on unsupported architectures.

sglang shadow tree — sglang_shadow/python/sglang/...
  Large collection of new placeholder modules and a substantial hybrid_linear_attn_backend.py added to the shadow tree (many files are placeholders; the hybrid backend contains extensive new logic).

Sequence Diagram(s)

sequenceDiagram
    participant User as User Code
    participant Dispatch as Dispatcher
    participant Kernel as CUDA Kernel
    participant Pool as State Pool

    rect rgba(120,180,240,0.5)
        Note over User,Dispatch: Pool-indexed flow
        User->>Dispatch: call gated_delta_rule_decode(..., state=pool, state_indices)
        Dispatch->>Dispatch: validate `state_indices` (contiguous int32 [B])\nset use_pool_indexing = true
        Dispatch->>Kernel: launch kernel with use_pool_indexing=true\npass state (pool), state_indices
        Kernel->>Pool: pool_idx = gState_indices[batch_idx]
        alt pool_idx >= 0
            Kernel->>Pool: load state[pool_idx,...]
            Kernel->>Kernel: compute outputs
            Kernel->>Pool: write back state[pool_idx,...]
        else pool_idx < 0
            Kernel->>Kernel: treat as padding / produce zeros
        end
    end

    rect rgba(220,160,100,0.5)
        Note over User,Dispatch: Direct per-batch flow
        User->>Dispatch: call gated_delta_rule_decode(..., state=per-batch, state_indices=None)
        Dispatch->>Dispatch: set use_pool_indexing = false
        Dispatch->>Kernel: launch kernel with use_pool_indexing=false\npass per-batch state
        Kernel->>Kernel: load state[batch_idx,...], compute, write outputs
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

run-ci

Suggested reviewers

  • yzh119
  • cyx-6
  • bkryu
  • nvmbreughe
  • jimmyzho

Poem

🐰 I hop through pools of indexed states,
Kernels fetch my tiny gates and weights,
Some slots are empty, some glow soft and bright,
I nibble batch by batch, compute, then write,
Homebound with outputs snug and light.

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 31.58%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check — ✅ Passed. The title "feat(gdn): port pooled decode kernel to f16 backend" clearly and concisely describes the main change: porting a pooled decode kernel feature to the F16 backend.
  • Description check — ✅ Passed. The PR description is comprehensive: it explains what the PR does, references related issues (#2521, #2498), details key changes, and confirms that pre-commit checks ran and tests were added and pass.



@gemini-code-assist bot left a comment (Contributor)

Code Review

This pull request successfully ports the pooled decode kernel to the F16 backend, introducing zero-copy state updates via indirect indexing. The changes are well-implemented, propagating the use_pool_indexing logic through the CUDA kernels and their corresponding Python wrappers. The documentation and input validation have been updated to reflect these new capabilities. I have one suggestion to refactor a small piece of duplicated code within one of the CUDA kernels to improve maintainability.

Comment on lines +244 to +435

if pool_idx >= 0:
    # Get current batch
    gSrc_batch = h0_source[(state_idx, None, None)]  # (V, K)
    gDst = cute.local_tile(h0_source, (1, TILE_V, TILE_K), (state_idx, None, 0))

    # Partition for load
    thr_copy_load = tiled_copy_load.get_slice(tidx)
    # Split into tiles along the V direction
    gSrc = cute.local_tile(
        gSrc_batch, (TILE_V, TILE_K), (None, 0)
    )  # (TILE_V, TILE_K, num_v_tiles)

    # ===================================================================
    # Prefetch: All threads participate in cp.async load
    # ===================================================================
    start_v_tiles = batch_inner * num_v_tiles_per_block
    prefetch_count = cutlass.min(NUM_STAGES - 1, num_v_tiles_per_block)
    for v_tiles in range(start_v_tiles, start_v_tiles + prefetch_count):
        stage = (v_tiles - start_v_tiles) % NUM_STAGES

        gSrc_tile = gSrc[(None, None, v_tiles)]
        sData_stage = sData[(None, None, stage)]

        thr_gSrc = thr_copy_load.partition_S(gSrc_tile)
        thr_sData = thr_copy_load.partition_D(sData_stage)

        cute.copy(tiled_copy_load, thr_gSrc, thr_sData)
        cute.arch.cp_async_commit_group()

    # Load q, k into BF16 registers using autovec_copy (contiguous pattern)
    q_tile = cute.local_tile(q, (1, 1, 1, vec_size), (i_n, i_t, i_h, lane_id))
    k_tile = cute.local_tile(k, (1, 1, 1, vec_size), (i_n, i_t, i_h, lane_id))
    cute.autovec_copy(q_tile, r_q_bf16)
    cute.autovec_copy(k_tile, r_k_bf16)

    # Convert BF16 to FP32
    for i in cutlass.range_constexpr(vec_size):
        r_q[i] = cutlass.Float32(r_q_bf16[i])
        r_k[i] = cutlass.Float32(r_k_bf16[i])

    # Load v into BF16 registers using autovec_copy, convert to FP32, store to sV
    v_tile = cute.local_tile(v, (1, 1, 1, vec_size), (i_n, i_t, i_hv, lane_id))
    cute.autovec_copy(v_tile, r_v_bf16)
    for i in cutlass.range_constexpr(vec_size):
        sV[k_start + i] = cutlass.Float32(r_v_bf16[i])

    cute.arch.barrier()  # Ensure all threads finish writing to sV

    # ===================================================================
    # Compute g and beta (scalar values)
    # ===================================================================
    r_g = 0.0
    r_beta = 0.0
    if lane_id == 0:
        x = r_a + r_dt_bias
        beta_x = softplus_beta * x
        softplus_x = 0.0

        if beta_x <= softplus_threshold:
            # softplus(x) = (1/beta) * log(1 + exp(beta*x)), computed in Float32
            exp_beta_x = cute.exp(beta_x, fastmath=True)
            log_input = cutlass.Float32(1.0 + exp_beta_x)
            log_result = cutlass.Float32(cute.log(log_input, fastmath=True))
            softplus_x = cutlass.Float32(
                (cutlass.Float32(1.0) / softplus_beta) * log_result
            )
        else:
            softplus_x = x

        # Compute g = exp(A_log) * softplus_x
        r_g_value = -cute.exp(r_A_log, fastmath=True) * softplus_x

        # Compute beta = 1 / (1 + exp(-b))
        r_beta = 1.0 / (1.0 + cute.exp(-r_b, fastmath=True))

        # Store to scalar (Float32)
        r_g = cute.exp(r_g_value, fastmath=True)

    r_g = cute.arch.shuffle_sync(r_g, 0)
    r_beta = cute.arch.shuffle_sync(r_beta, 0)

    if use_qk_l2norm:
        # Compute L2 norm of q and k
        sum_q = 0.0
        sum_k = 0.0
        for i in cutlass.range_constexpr(vec_size):
            sum_q += r_q[i] * r_q[i]
            sum_k += r_k[i] * r_k[i]
        # Warp-level reduction using butterfly shuffle
        for offset in [16, 8, 4, 2, 1]:
            sum_q += cute.arch.shuffle_sync_bfly(
                sum_q, offset=offset, mask=-1, mask_and_clamp=31
            )
            sum_k += cute.arch.shuffle_sync_bfly(
                sum_k, offset=offset, mask=-1, mask_and_clamp=31
            )

        inv_norm_q = cute.rsqrt(sum_q + 1e-6, fastmath=True)
        inv_norm_k = cute.rsqrt(sum_k + 1e-6, fastmath=True)
        for i in cutlass.range_constexpr(vec_size):
            r_q[i] = r_q[i] * inv_norm_q
            r_k[i] = r_k[i] * inv_norm_k

    # Apply scaling in Float32
    for i in cutlass.range_constexpr(vec_size):
        r_q[i] = r_q[i] * scale

    # ===================================================================
    # Mainloop: All threads participate
    # ===================================================================
    end_v_tiles = start_v_tiles + num_v_tiles_per_block
    for v_tiles in range(start_v_tiles, end_v_tiles):
        stage = (v_tiles - start_v_tiles) % NUM_STAGES

        # Step 1: Wait for current stage to complete
        cute.arch.cp_async_wait_group(0)
        cute.arch.barrier()

        # Step 2: Issue async load for next tile (after compute)
        next_v_tiles = v_tiles + prefetch_count
        if next_v_tiles < end_v_tiles:
            next_stage = (next_v_tiles - start_v_tiles) % NUM_STAGES

            gSrc_next = gSrc[(None, None, next_v_tiles)]
            sData_next = sData[(None, None, next_stage)]

            thr_gSrc = thr_copy_load.partition_S(gSrc_next)
            thr_sData = thr_copy_load.partition_D(sData_next)

            cute.copy(tiled_copy_load, thr_gSrc, thr_sData)
            cute.arch.cp_async_commit_group()

        # Step 3: Compute using data from current stage (contiguous access pattern)
        for row in cutlass.range_constexpr(0, TILE_V, 4):
            row_offset = tidx // 32
            sum_hk = 0.0

            # Load h from sData using 3D local_tile + autovec_copy (contiguous in K)
            sData_tile = cute.local_tile(
                sData, (1, vec_size, 1), (row + row_offset, lane_id, stage)
            )
            cute.autovec_copy(sData_tile, r_h)

            for i in cutlass.range_constexpr(vec_size):
                r_h[i] = r_h[i] * r_g
                sum_hk += r_h[i] * r_k[i]

            for offset in [16, 8, 4, 2, 1]:
                sum_hk += cute.arch.shuffle_sync_bfly(
                    sum_hk, offset=offset, mask=-1, mask_and_clamp=31
                )

            v_new = sV[v_tiles * TILE_V + row + row_offset] - sum_hk
            v_new = v_new * r_beta

            sum_hq = 0.0
            for i in cutlass.range_constexpr(vec_size):
                r_h[i] += r_k[i] * v_new
                sum_hq += r_h[i] * r_q[i]

            # Write h to gDst using 4D local_tile + autovec_copy (contiguous in K)
            gDst_tile = cute.local_tile(
                gDst, (1, 1, vec_size, 1), (0, row + row_offset, lane_id, v_tiles)
            )
            cute.autovec_copy(r_h, gDst_tile)

            for offset in [16, 8, 4, 2, 1]:
                sum_hq += cute.arch.shuffle_sync_bfly(
                    sum_hq, offset=offset, mask=-1, mask_and_clamp=31
                )

            o_idx = v_tiles * TILE_V + row + row_offset
            if lane_id == 0 and o_idx < V:
                sOutput[o_idx] = cutlass.BFloat16(sum_hq)

    # ===================================================================
    # Final writeback: Copy output from shared memory to global memory
    # All threads write (V=128, NUM_THREADS=128)
    # ===================================================================
    cute.arch.barrier()  # Ensure all writes to sOutput are complete
    if tidx >= start_v_tiles * TILE_V and tidx < end_v_tiles * TILE_V:
        o[(i_n, i_t, i_hv, tidx)] = sOutput[tidx]
else:
    # Padding slot: write zeros to output
    start_v_tiles = batch_inner * num_v_tiles_per_block
    if (
        tidx >= start_v_tiles * TILE_V
        and tidx < (start_v_tiles + num_v_tiles_per_block) * TILE_V
    ):
        o[(i_n, i_t, i_hv, tidx)] = cutlass.BFloat16(0.0)
Severity: medium

The calculation start_v_tiles = batch_inner * num_v_tiles_per_block is duplicated at line 260 (inside the if block) and line 430 (inside the else block). This calculation can be hoisted to before the if pool_idx >= 0: check to avoid redundancy and improve maintainability.
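The suggested refactor can be sketched in plain Python (writeback_before/writeback_after are illustrative stand-ins for the two kernel branches, not the actual kernel code):

```python
def writeback_before(pool_idx, batch_inner, num_v_tiles_per_block):
    # Before: start_v_tiles is computed separately inside each branch.
    if pool_idx >= 0:
        start_v_tiles = batch_inner * num_v_tiles_per_block
        return ("compute", start_v_tiles)
    else:
        start_v_tiles = batch_inner * num_v_tiles_per_block
        return ("pad", start_v_tiles)


def writeback_after(pool_idx, batch_inner, num_v_tiles_per_block):
    # After: the shared computation is hoisted above the branch.
    start_v_tiles = batch_inner * num_v_tiles_per_block
    if pool_idx >= 0:
        return ("compute", start_v_tiles)
    return ("pad", start_v_tiles)
```

Both versions behave identically; the hoisted form simply removes the duplicated expression.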

@coderabbitai bot left a comment (Contributor)

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
flashinfer/gdn_decode.py (1)

1077-1096: ⚠️ Potential issue | 🔴 Critical

Add pooled indexing support to bf16 fast path or guard against it.

When use_pool_indexing=True (i.e., state_indices is provided), the bf16 fast path at lines 1074–1100 can still be selected, but the backend ignores pooled indices entirely. The underlying gated_delta_rule function in flashinfer/gdn_kernels/gdn_decode_bf16_state.py explicitly documents initial_state_indices as "Not used (for compatibility)", meaning the kernel will silently default to direct batch mapping rather than respecting the provided pooled indices.

This causes incorrect state reads/writes for non-identity pooled mappings. Either:

  1. Add and not use_pool_indexing to the dispatch condition at line 1074 to prevent bf16 path when pooled mode is active, or
  2. Implement pooled indexing support in the bf16 backend.
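Option 1 can be sketched as a plain predicate (select_backend and its arguments are illustrative; the real dispatch condition lives in flashinfer/gdn_decode.py):

```python
def select_backend(supports_bf16_fast_path, state_indices):
    # The bf16 fast path ignores state_indices entirely, so it must only
    # be chosen for direct (non-pooled) batch mappings.
    use_pool_indexing = state_indices is not None
    if supports_bf16_fast_path and not use_pool_indexing:
        return "gdn_decode_klast_bf16_state"  # fast path, direct mapping only
    return "pretranspose"  # generic path handles pooled indexing
```

With this guard, supplying state_indices always routes to the path that respects pooled indices, preventing silent state corruption.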
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/gdn_decode.py` around lines 1077 - 1096, The bf16 fast path
(_gated_delta_rule_gdn_decode_klast_bf16_state) ignores pooled indexing, so
either prevent that path when pooling is used or implement pooled support; the
minimal fix is to change the dispatch condition that uses
use_gdn_decode_klast_bf16_state to also require not use_pool_indexing (or
equivalently check state_indices is None/identity) so the bf16 kernel is not
chosen when pooled indices (use_pool_indexing/state_indices) are provided,
ensuring the scalar bf16 kernel is only used for direct batch mappings.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@flashinfer/gdn_decode.py`:
- Around line 1056-1067: When use_pool_indexing is true, state_indices currently
only has shape/dtype checks; add a runtime bounds check to ensure all entries in
state_indices are within [0, pool_size) (and optionally non-negative) before
kernel launch to prevent OOB in h0_source address calculations; locate the
pool-mode validation block around use_pool_indexing/state.is_contiguous() and
add a check using state_indices (and B/pool_size) to raise a clear error if any
value >= pool_size (or < 0).
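The suggested bounds check can be sketched in pure Python over a plain list (in the real code this would be a tensor-level check, e.g. comparisons on the int32 state_indices tensor, run before kernel launch):

```python
def check_state_indices(indices, pool_size):
    # -1 padding entries stay legal; anything >= pool_size would cause an
    # out-of-bounds read/write in the h0_source address calculation.
    bad = [i for i in indices if i >= pool_size]
    if bad:
        raise ValueError(
            f"state_indices contains {bad}, out of range for pool_size={pool_size}"
        )
```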


ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 41545e5 and 63c5650.

📒 Files selected for processing (1)
  • flashinfer/gdn_decode.py

@xutizhou force-pushed the feat/gdn-decode-pooled-f16 branch from 63c5650 to 834f1e6 on February 25, 2026 at 04:20
@coderabbitai bot left a comment (Contributor)

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
flashinfer/gdn_kernels/gdn_decode_bf16_state.py (1)

2111-2150: ⚠️ Potential issue | 🔴 Critical

Fix argument order in cute.compile() and execute call: stream must come before si_ and use_pool_indexing_c.

The launch wrapper functions (e.g., gated_delta_rule_launch_seqlen1, gated_delta_rule_launch_seqlen2) declare parameters in this order:

  • eps: cutlass.Float32
  • stream: cuda.CUstream
  • mState_indices: cute.Tensor
  • use_pool_indexing: cutlass.Constexpr[bool]

However, both the cute.compile() call (line 2111) and execute call (line 2133) pass arguments as:

  • eps_f32
  • si_ (mState_indices)
  • use_pool_indexing_c
  • stream

With positional argument mapping, this causes si_ to bind to the stream parameter (type mismatch: cute.Tensor vs cuda.CUstream), use_pool_indexing_c to bind to mState_indices (type mismatch: bool vs cute.Tensor), and stream to bind to use_pool_indexing (type mismatch: cuda.CUstream vs bool). This is a critical correctness bug.

Fix: reorder arguments to match function signature
 _compiled_kernels[cache_key] = cute.compile(
     launch_fn,
     q_, k_, v_, a_, b_, A_log_, dt_bias_, h_, o_,
     scale_f32, softplus_beta_f32, softplus_threshold_f32, eps_f32,
+    stream,
     si_,
     use_pool_indexing_c,
-    stream,
     options="--enable-tvm-ffi --generate-line-info",
 )

 # Execute
 _compiled_kernels[cache_key](
     q_, k_, v_, a_, b_, A_log_, dt_bias_, h_, o_,
     scale_f32, softplus_beta_f32, softplus_threshold_f32, eps_f32,
+    stream,
     si_,
     use_pool_indexing_c,
-    stream,
 )
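The misbinding described above is ordinary Python positional-argument semantics; a standalone sketch below illustrates it (the parameter names mirror the launch wrapper from the review, but the values are plain stand-ins, not real CUDA objects):

```python
# Signature order of the launch wrappers, per the review:
# (..., eps, stream, mState_indices, use_pool_indexing)
def launch_wrapper(eps, stream, state_indices, use_pool_indexing):
    # Return the bindings so they can be inspected.
    return {"eps": eps, "stream": stream,
            "state_indices": state_indices,
            "use_pool_indexing": use_pool_indexing}

eps, stream, si, flag = 1e-6, "cuda_stream", [0, 1], True

buggy = launch_wrapper(eps, si, flag, stream)  # order used at the call sites
fixed = launch_wrapper(eps, stream, si, flag)  # order the signature expects

assert buggy["stream"] == si                 # index tensor lands in the stream slot
assert buggy["use_pool_indexing"] == stream  # stream lands in the constexpr slot
assert fixed["state_indices"] == si          # correct binding after the fix
```

In the real kernel the mismatch would surface as type errors or undefined behavior at compile/launch time rather than clean assertions, which is why the reorder is flagged as critical.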
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/gdn_kernels/gdn_decode_bf16_state.py` around lines 2111 - 2150,
The compile and execute calls pass arguments in the wrong order causing
positional binding errors; reorder the final three args so that stream comes
before si_ and use_pool_indexing_c in both the cute.compile(...) invocation and
the subsequent _compiled_kernels[cache_key](...) call to match the launch
wrapper signatures (e.g., gated_delta_rule_launch_seqlen1,
gated_delta_rule_launch_seqlen2) which expect (eps, stream, mState_indices,
use_pool_indexing); update the argument lists for cute.compile and the runtime
call accordingly so stream is placed before si_ and use_pool_indexing_c.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@flashinfer/gdn_kernels/gdn_decode_bf16_state.py`:
- Around line 2065-2075: When use_pool_indexing is true, add a defensive
contiguity assertion for the pooled source tensor: assert
initial_state_source.is_contiguous(), with a clear message (e.g. "Expected
contiguous initial_state_source for pooled indexing"), placed alongside the
existing shape and dtype checks that reference initial_state_indices; this
ensures the kernel's async H 128-bit tiled copies are safe when computing
state_indices_t and using initial_state_source in the pooling path.


ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 63c5650 and 834f1e6.

📒 Files selected for processing (2)
  • flashinfer/gdn_decode.py
  • flashinfer/gdn_kernels/gdn_decode_bf16_state.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • flashinfer/gdn_decode.py

@xutizhou xutizhou force-pushed the feat/gdn-decode-pooled-f16 branch from 834f1e6 to d207428 on February 25, 2026 at 05:41
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
flashinfer/gdn_decode.py (1)

1832-1875: ⚠️ Potential issue | 🟡 Minor

state_indices parameter is absent from the docstring of gated_delta_rule_decode.

The new state_indices parameter added at line 1830 has no documentation entry in the Args section. This creates a gap between the pretranspose and nontranspose API docs.

📝 Suggested addition to Args section
         use_qk_l2norm (bool):
             Whether to apply L2 normalization to q and k. Default: ``True``.
+        state_indices (Optional[torch.Tensor]):
+            Optional int32 tensor of shape ``[B]`` mapping each batch entry to its
+            slot in the state pool. Negative values are treated as padding and the
+            corresponding output is zeroed. If ``None``, direct per-batch indexing
+            is used and ``state.shape[0]`` must equal ``B``.

     Returns:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/gdn_decode.py` around lines 1832 - 1875, The docstring for
gated_delta_rule_decode is missing documentation for the new state_indices
parameter; update the Args section to describe state_indices (type torch.Tensor
or Optional[torch.Tensor]), its shape (e.g. [B] or [B, HV] depending on usage),
dtype expectations, purpose (indices mapping into the state for batched or
streaming decode), whether it can be None and how that behavior differs, and any
in-place/update semantics; reference the gated_delta_rule_decode function and
the state_indices parameter so the API docs for both pretranspose and
nontranspose variants remain consistent.
flashinfer/gdn_kernels/gdn_decode_bf16_state.py (1)

1980-1994: 🛠️ Refactor suggestion | 🟠 Major

gated_delta_rule is missing the @flashinfer_api decorator.

All public decode API functions are expected to carry @flashinfer_api for API logging and crash-safe input capture. This function is the only public entry point in this file, yet it lacks the decorator.

🔧 Proposed fix
+@flashinfer_api
 def gated_delta_rule(
     A_log: torch.Tensor,

You will also need to import flashinfer_api (or provide a fallback) at the top of this file, as is done in gdn_decode.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/gdn_kernels/gdn_decode_bf16_state.py` around lines 1980 - 1994,
The public function gated_delta_rule is missing the `@flashinfer_api` decorator;
add `@flashinfer_api` immediately above def gated_delta_rule to enable API logging
and crash-safe input capture, and ensure flashinfer_api is imported (or a
fallback defined) at the top of this module using the same import/fallback
pattern used in gdn_decode.py so the decorator is available at runtime.
♻️ Duplicate comments (2)
flashinfer/gdn_kernels/gdn_decode_bf16_state.py (1)

2059-2069: ⚠️ Potential issue | 🟠 Major

Missing contiguity assertion on initial_state_source in pool-indexing path.

The async H loads use 128-bit tiled copies that assume a contiguous layout. A non-contiguous initial_state_source silently corrupts the pointer arithmetic used in gH[(state_batch_idx, value_head_idx, None, None)].

🛡️ Proposed fix
     use_pool_indexing = initial_state_indices is not None
     if use_pool_indexing:
         assert initial_state_indices.shape == (B,), (
             f"Expected shape [{B}], got {initial_state_indices.shape}"
         )
         assert initial_state_indices.dtype == torch.int32, (
             f"Expected int32, got {initial_state_indices.dtype}"
         )
+        assert initial_state_source.is_contiguous(), (
+            "initial_state_source must be contiguous for correct pointer arithmetic in pooled mode"
+        )
         state_indices_t = initial_state_indices
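The underlying failure mode can be illustrated without CUDA: flat (row-major) offset arithmetic only agrees with stride-aware indexing when the tensor is contiguous. A pure-Python stand-in (not the kernel code) showing the divergence on a transposed view:

```python
# Row-major storage for a contiguous (2, 3) tensor.
storage = [0, 1, 2, 3, 4, 5]
shape, strides = (2, 3), (3, 1)      # contiguous layout
t_shape, t_strides = (3, 2), (1, 3)  # transposed view: same storage, swapped strides

def at(storage, strides, idx):
    """Stride-aware indexing: always correct."""
    return storage[sum(i * s for i, s in zip(idx, strides))]

def flat_at(storage, shape, idx):
    """Flat offset arithmetic that ASSUMES row-major contiguity."""
    return storage[idx[0] * shape[1] + idx[1]]

# On the contiguous tensor the two agree...
assert at(storage, strides, (1, 2)) == flat_at(storage, shape, (1, 2)) == 5
# ...on the non-contiguous view they do not: element (1, 0) of the
# transpose lives at storage[1], but the flat formula reads storage[2].
assert at(storage, t_strides, (1, 0)) == 1
assert flat_at(storage, t_shape, (1, 0)) == 2
```

The kernel's 128-bit tiled copies behave like `flat_at`, which is why asserting `is_contiguous()` (or calling `.contiguous()`) before the pooled path is the safe option.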
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/gdn_kernels/gdn_decode_bf16_state.py` around lines 2059 - 2069,
When use_pool_indexing is true the code doesn't verify that initial_state_source
is contiguous which breaks the 128-bit tiled async H loads (used when indexing
gH[(state_batch_idx, value_head_idx, None, None)]); update the pool-indexing
branch (the path that sets state_indices_t from initial_state_indices) to either
assert initial_state_source.is_contiguous() or replace it with a contiguous copy
(e.g., initial_state_source = initial_state_source.contiguous()) before any
indexing into gH so pointer arithmetic remains valid; reference
initial_state_source, use_pool_indexing, state_indices_t and the
gH[(state_batch_idx, value_head_idx, None, None)] access to locate the change.
flashinfer/gdn_decode.py (1)

1881-1891: ⚠️ Potential issue | 🟠 Major

Missing bounds check: non-negative state_indices values not validated against pool_size.

Values ≥ pool_size in state_indices flow directly into flat_idx = pool_idx * HV + i_hv inside the kernel, causing out-of-bounds reads/writes on h0_source. A guard analogous to the one needed in gated_delta_rule_decode_pretranspose is required here.
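A minimal sketch of such a guard, in pure Python (hypothetical helper names; the real check would operate on the torch tensor, e.g. via a vectorized comparison, rather than a Python loop):

```python
def validate_state_indices(state_indices, pool_size, batch_size):
    """Reject out-of-range pool slots before they reach the kernel.

    Hypothetical stand-in for the guard the review asks for: negative
    entries are padding, everything else must fall in [0, pool_size).
    """
    assert len(state_indices) == batch_size, (
        f"expected [{batch_size}] indices, got [{len(state_indices)}]"
    )
    bad = [i for i in state_indices if i >= pool_size]
    assert not bad, (
        f"state_indices values {bad} exceed pool_size={pool_size}; "
        "flat_idx = pool_idx * HV + i_hv would index h0_source out of bounds"
    )
    return True

validate_state_indices([0, 3, -1], pool_size=4, batch_size=3)  # passes
```

Note the check runs once per call on the host, so the cost is negligible next to the kernel launch it protects.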

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/gdn_decode.py` around lines 1881 - 1891, The code currently lacks
validation that entries in state_indices are within [0, pool_size-1], which
allows values >= pool_size to drive flat_idx = pool_idx * HV + i_hv and cause
OOB access on h0_source; add a bounds check similar to
gated_delta_rule_decode_pretranspose: when state_indices is not None, verify all
values are integers >= 0 and < pool_size (and fail with an informative assertion
or raised error referencing pool_size and B), or alternatively explicitly
clamp/validate before computing flat_idx; ensure you reference state_indices,
pool_size, flat_idx and h0_source when adding the check so the guard prevents
any out-of-bounds indexing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@flashinfer/gdn_decode.py`:
- Around line 1881-1891: gated_delta_rule_decode is missing the pool-indexing
validations present in gated_delta_rule_decode_pretranspose; add a guard when
use_pool_indexing (i.e., state_indices is not None) that asserts
state.is_contiguous(), state_indices.shape == (B,), state_indices.dtype ==
torch.int32, and state_indices.is_contiguous(), mirroring the checks in
gated_delta_rule_decode_pretranspose so malformed/non-contiguous or wrong-dtype
state_indices do not reach the kernel.

In `@flashinfer/gdn_kernels/gdn_decode_bf16_state.py`:
- Around line 2059-2069: initial_state_indices values are never bounds-checked
against the pool size (initial_state_source.shape[0]) which can cause
out-of-bounds accesses when used as indices (e.g., in gH[...] indexing); update
the use_pool_indexing branch that sets state_indices_t to validate every entry
is >= 0 and < pool_size (and still dtype torch.int32 and on q.device), raising a
clear error or clamping/adjusting as appropriate; reference the symbols
initial_state_indices, initial_state_source.shape[0], state_indices_t,
use_pool_indexing and ensure the check happens before assigning state_indices_t
(or convert and move to device first) so downstream kernels cannot read/write
out-of-range indices.


ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 834f1e6 and d207428.

📒 Files selected for processing (2)
  • flashinfer/gdn_decode.py
  • flashinfer/gdn_kernels/gdn_decode_bf16_state.py

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 28

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
flashinfer/gdn_decode.py (1)

1076-1107: ⚠️ Potential issue | 🟠 Major

Legacy pretranspose path (float32 state) doesn't guard against pool indexing — will crash on reshape.

When state_indices is provided with float32 state (i.e. use_gdn_decode_klast_bf16_state is false), execution falls through to the legacy path. Line 1107 does state.reshape(B * HV, V, K), but state.shape[0] is pool_size (which differs from B in pooled mode), causing a reshape failure. The cached h0_indices at line 1115 is always zeros, so even if the reshape didn't crash, the kernel wouldn't use the real state_indices.

Either add an assertion rejecting pool indexing in the legacy path or implement proper support.

Minimal guard (recommended)
     # Legacy path: T=1 only, float32 state
     assert T == 1, f"Decode only supports T=1, got T={T}"
     assert state.dtype == torch.float32, f"state must be float32, got {state.dtype}"
+    assert not use_pool_indexing, (
+        "Pool indexing (state_indices) is not supported with float32 state in the "
+        "pretranspose legacy path. Use bfloat16 state for pool indexing support."
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/gdn_decode.py` around lines 1076 - 1107, The legacy pretranspose
path performs state.reshape(B * HV, V, K) and always uses a zeroed h0_indices,
which will crash or mis-index when state_indices (pooled mode) is provided;
update the logic in the branch guarded by use_gdn_decode_klast_bf16_state ==
False to either (a) assert that state_indices is None (reject pooled indexing)
by checking state_indices is None before the reshape and raising a clear
assertion mentioning state_indices and the legacy path, or (b) implement proper
pooled support by using state_indices to build h0_indices and reshape/reindex
state into h0_source correctly (use state.shape[0] as pool_size, gather the
per-batch entries into [B*HV, V, K] using state_indices, and replace the zeroed
h0_indices usage), ensuring h0_source and h0_indices are consistent for the
kernel call.
♻️ Duplicate comments (20)
sglang_shadow/python/sglang/bench_offline_throughput.py (1)

1-1: Same broken symlink / raw path issue as the other sglang_shadow/ files.

See the consolidated comment on vision_utils.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/bench_offline_throughput.py` at line 1, This file
uses the same broken absolute symlink/raw-path pattern found in other
sglang_shadow modules; replicate the fix applied in vision_utils.py by replacing
hard-coded filesystem paths with package-relative resource loading (e.g.,
importlib.resources or pkgutil) or by resolving symlinks programmatically, and
update any path construction/usage in bench_offline_throughput.py to call the
same helper or utility function used in vision_utils.py so the file no longer
relies on raw absolute paths or broken symlinks.
sglang_shadow/python/sglang/srt/layers/attention/torch_flex_backend.py (1)

1-1: Same broken symlink / raw path issue as the other sglang_shadow/ files.

See the consolidated comment on vision_utils.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/layers/attention/torch_flex_backend.py` at
line 1, The file torch_flex_backend.py in the sglang_shadow package contains a
hardcoded/incorrect raw filesystem path or broken symlink reference (same issue
as other sglang_shadow files); update this module to use package-relative
resource resolution instead of the absolute/raw path—locate and replace any
occurrences of the raw path or symlink usage in torch_flex_backend.py with a
stable approach (e.g., importlib.resources / pathlib with package-relative
__file__ resolution or a proper package data lookup) so the module loads
resources consistently across environments; ensure the fix mirrors the solution
applied in vision_utils.py and adjust any import or resource-loading calls in
torch_flex_backend.py accordingly.
sglang_shadow/python/sglang/srt/layers/attention/cutlass_mla_backend.py (1)

1-1: Same broken symlink / raw path issue as the other sglang_shadow/ files.

See the consolidated comment on vision_utils.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/layers/attention/cutlass_mla_backend.py` at
line 1, The file sglang.srt.layers.attention.cutlass_mla_backend contains the
same broken symlink/raw filesystem path as other sglang_shadow modules; fix it
by replacing the hard-coded path usage with a proper package/module import or a
valid repository-relative symlink so the module can be imported normally (e.g.,
ensure cutlass_mla_backend is accessible via the sglang package namespace);
update any import statements in cutlass_mla_backend.py to use package-relative
imports (import sglang.srt.layers.attention...) or remove the raw path and add
the file into the package so it resolves at runtime.
sglang_shadow/python/sglang/srt/layers/attention/tbo_backend.py (1)

1-1: Same broken symlink / raw path issue as the other sglang_shadow/ files.

See the consolidated comment on vision_utils.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/layers/attention/tbo_backend.py` at line 1,
This module still contains the same broken absolute/symlink path usage as other
sglang_shadow files; locate any hard-coded path strings or os.path calls in
tbo_backend.py (top-level variables, functions or helpers that reference
filesystem paths) and replace them with package-relative resource access (e.g.,
use Path(__file__).resolve().parent / "<relative_name>" or
importlib.resources.read_binary/text to load bundled files) so the code no
longer depends on /home/... raw paths or broken symlinks; ensure any path
resolution uses .resolve() and relative joins and update the import/loading
logic in the relevant functions/classes in tbo_backend.py to use those
package-relative resources.
sglang_shadow/python/sglang/srt/layers/attention/utils.py (1)

1-1: Same broken symlink / raw path issue as the other sglang_shadow/ files.

See the consolidated comment on vision_utils.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/layers/attention/utils.py` at line 1, The
file sglang_shadow/python/sglang/srt/layers/attention/utils.py contains the same
broken symlink/raw absolute path issue as vision_utils; replace the raw absolute
path entry with the proper module content or a package-relative import so the
module sglang.srt.layers.attention.utils resolves correctly. Specifically,
remove the hardcoded "/home/..." path or symlink placeholder at the top of
utils.py and restore the actual implementation (or copy the corresponding
implementation from the non-shadow module), ensuring imports use
package-relative names (e.g., from sglang.srt.layers.attention import ...) and
that the file is a real Python file rather than a broken symlink.
sglang_shadow/python/sglang/srt/layers/radix_attention.py (1)

1-1: Same broken symlink / raw path issue as the other sglang_shadow/ files.

See the consolidated comment on vision_utils.py — this file also contains only a local filesystem path instead of Python source.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/layers/radix_attention.py` at line 1, The
file radix_attention.py currently contains only a raw filesystem path (broken
symlink) rather than Python source; replace the placeholder content with the
actual implementation of the RadixAttention module (restore the real
class/function definitions that should live in radix_attention.py), or remove
the stray file if it was accidentally added and import the real module from the
correct package; ensure the module defines the expected public symbols (e.g.,
RadixAttention class or radix_attention_* functions used elsewhere) so imports
elsewhere in the codebase resolve correctly.
sglang_shadow/python/sglang/__init__.py (1)

1-1: Same broken symlink / raw path issue as the other sglang_shadow/ files.

See the consolidated comment on vision_utils.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/__init__.py` at line 1, The
sglang_shadow/python/sglang/__init__.py file is a broken symlink/raw path copy
like the other sglang_shadow files; replace it with a proper package-init that
re-exports the real package instead of referencing the absolute path. Edit
__init__.py in the sglang_shadow package to remove the symlink/path reference
and instead import and re-export the real sglang symbols (e.g., use explicit
imports or wildcard re-exports from the canonical sglang package) so consumers
of sglang_shadow import via normal package imports rather than a raw filesystem
path.
sglang_shadow/python/sglang/srt/layers/rocm_linear_utils.py (1)

1-1: Same blocker as above: absolute local-path artifact in source.

Line 1 repeats the same non-portable absolute-path artifact and should be replaced by a regular source file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/layers/rocm_linear_utils.py` at line 1, The
file sglang_shadow/python/sglang/srt/layers/rocm_linear_utils.py currently
contains a non-portable absolute filesystem path on its first line; remove that
path-only line and replace it with the real Python module source (or a proper
stub) so the file is a normal importable module. Ensure you restore any original
functions/classes expected from this module (e.g., rocm_linear_* helpers or
utility functions) or, if not available, add a minimal well-documented stub
(module docstring and exported symbols) so imports of rocm_linear_utils succeed
across environments.
sglang_shadow/python/sglang/srt/layers/layernorm.py (1)

1-1: Same blocker as above: absolute local-path artifact in source.

This file has the same non-portable absolute-path artifact and should be replaced with a real tracked module.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/layers/layernorm.py` at line 1, The file
contains a non-portable absolute local-path artifact that should be removed and
replaced with the proper tracked module import; open
sglang/srt/layers/layernorm.py, remove any hard-coded absolute path string or
sys.path manipulation that references a local workspace, and replace it with a
normal import of the shared/packaged LayerNorm implementation (e.g., import the
canonical module or class such as LayerNorm from the project package instead of
referencing /home/...); ensure tests/imports elsewhere reference the new package
import and update any top-level variables or factory functions in layernorm.py
to delegate to the packaged implementation.
sglang_shadow/python/sglang/srt/layers/multimodal.py (1)

1-1: Same blocker as above: absolute local-path artifact in source.

Line 1 contains the same machine-local path artifact and should be replaced by a normal module file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/layers/multimodal.py` at line 1, The file
sglang.srt.layers.multimodal (multimodal.py) currently contains an absolute
machine-local filesystem path as its first line; remove that path artifact and
replace it with a normal Python module header (e.g., a module docstring or
relevant imports) so the file is a valid module. Specifically, delete the
literal "/home/.../multimodal.py" line, add an appropriate top-of-file docstring
or imports used by this module (retain or re-add any real definitions like
classes or functions in this file such as multimodal layer classes or helper
functions), and ensure there are no other hard-coded local paths left in
multimodal.py; run linters/tests to confirm the module imports correctly.
sglang_shadow/python/sglang/srt/layers/modelopt_utils.py (1)

1-1: Same blocker as above: absolute local-path artifact in source.

Line 1 repeats the same host-absolute path issue; please replace with a normal module file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/layers/modelopt_utils.py` at line 1, The top
of modelopt_utils.py contains a host-absolute filesystem path string instead of
module code; remove that absolute path line and replace it with a normal module
header (optional module docstring or imports) so the file is a valid Python
module; ensure no other absolute local-path artifacts remain in
sglang/srt/layers/modelopt_utils.py and run tests/import to verify
functions/classes in modelopt_utils (e.g., any functions or classes defined in
that file) are importable after the change.
sglang_shadow/python/sglang/srt/layers/attention/hybrid_attn_backend.py (1)

1-1: Same blocker as above: absolute local-path artifact in source.

Line 1 shows the same host-specific path artifact; this should be replaced with a real in-repo module file.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/layers/attention/hybrid_attn_backend.py` at
line 1, The file hybrid_attn_backend.py contains a host-specific absolute path
artifact on line 1; remove that literal path and replace it with a proper
in-repo module or relative import (e.g., import the equivalent module from the
package or add the module file to the repo and use a relative import). Ensure
any references in hybrid_attn_backend.py to the external path are updated to
import the correct in-repo symbol (or class/function) and that the module is
added to the package __init__ if needed so imports resolve at runtime.
sglang_shadow/python/sglang/srt/environ.py (1)

1-1: Same blocker as above: absolute local-path artifact in source.

This file also contains the same machine-local absolute path artifact and needs conversion to a real Python module.

sglang_shadow/python/sglang/srt/multiplex (1)

1-1: Same blocker as above: absolute local-path artifact in tracked path.

This path also appears to be an absolute local-path artifact. Please replace it with the intended in-repo module/package structure.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/multiplex` at line 1, The tracked artifact is
an absolute local filesystem path (sglang_shadow/python/sglang/srt/multiplex)
that must be replaced with the in-repo package/module layout; move the contents
into the repository package (e.g., package path under
python/sglang/srt/multiplex) or update references to use the package import path
(sglang.srt.multiplex) instead of an absolute path, and remove any hard-coded
local-path entries from manifests, setup/config, or import statements that
reference sglang_shadow.
sglang_shadow/python/sglang/srt/layers/sparse_pooler.py (1)

1-1: Broken symlink — same issue as noted on trtllm_mha_backend.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/layers/sparse_pooler.py` at line 1, The file
sglang.srt.layers.sparse_pooler.py is a broken symlink; replace the symlink with
the real source file or restore the missing target so Python can import it.
Locate the symlink currently referenced as sparse_pooler.py in the
sglang.srt.layers package, remove and replace it with the actual module
implementation (or recreate the correct symlink target), then run the same fix
you applied for trtllm_mha_backend.py to ensure imports (e.g., from
sglang.srt.layers.sparse_pooler import ...) resolve successfully.
sglang_shadow/python/sglang/srt/layers/attention/nsa_backend.py (1)

1-1: Broken symlink — same issue as noted on trtllm_mha_backend.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/layers/attention/nsa_backend.py` at line 1,
The repository contains a broken symlink for
sglang/srt/layers/attention/nsa_backend.py (same problem as
trtllm_mha_backend.py); fix it by replacing the broken symlink with the correct
implementation or a valid symlink target: locate the symlink at nsa_backend.py,
remove the broken link (git rm) and either add the real file content for
nsa_backend.py or recreate the symlink to the correct existing file, then commit
so imports referencing nsa_backend.py resolve correctly.
sglang_shadow/python/sglang/srt/layers/attention/triton_backend.py (1)

1-1: Broken symlink — same issue as noted on trtllm_mha_backend.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/layers/attention/triton_backend.py` at line
1, The triton_backend.py file is a broken symlink (same problem as
trtllm_mha_backend.py); replace the broken symlink with the correct
implementation or a valid symlink to the real backend module used in this repo,
or remove the symlink and add the proper Python module content/export so imports
work; ensure the module name triton_backend.py resolves to the actual backend
(or mirrors trtllm_mha_backend.py behaviour) and update any imports referencing
srt.layers.attention.triton_backend to point to the real module.
sglang_shadow/python/sglang/srt/layers/attention/flashinfer_mla_backend.py (1)

1-1: Broken symlink — same issue as noted on trtllm_mha_backend.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/layers/attention/flashinfer_mla_backend.py`
at line 1, The file sglang/srt/layers/attention/flashinfer_mla_backend.py is a
broken symlink (same problem as trtllm_mha_backend.py); replace or remove the
symlink and ensure the real implementation file is present and imported
correctly: either restore the original target file for flashinfer_mla_backend.py
or convert the symlink into a real Python module file with the expected symbols
(functions/classes used by the codebase), and update any imports that referenced
flashinfer_mla_backend.py to point to the correct module; mirror the fix you
applied for trtllm_mha_backend.py so builds and imports no longer fail.
sglang_shadow/python/sglang/bench_serving.py (1)

1-1: Broken symlink — same issue as noted on trtllm_mha_backend.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/bench_serving.py` at line 1, The bench_serving.py
entry is a broken symlink (same problem as trtllm_mha_backend.py); fix it by
replacing the symlink at sglang/bench_serving.py with the actual Python file (or
update the symlink target to the correct, existing path) so imports and CI can
find it; ensure the new file content matches the intended implementation, add
and commit the real file (or corrected symlink) to the repo so bench_serving.py
is no longer a dangling link.
sglang_shadow/python/sglang/srt/layers/attention/base_attn_backend.py (1)

1-1: Broken symlink — same issue as noted on trtllm_mha_backend.py.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/layers/attention/base_attn_backend.py` at
line 1, The file sglang/srt/layers/attention/base_attn_backend.py is a broken
symlink (same problem as trtllm_mha_backend.py); replace the symlink with a real
Python module file or recreate the symlink to point to the correct target so
imports succeed. Specifically, ensure base_attn_backend.py is a regular file
(not a dangling link) containing the expected backend implementation (or
correctly points to the actual implementation used by attention backends), and
verify any imports referencing base_attn_backend.py and trtllm_mha_backend.py
resolve without errors.
🟠 Major comments (6)
flashinfer/gdn_decode.py-1959-1966 (1)

1959-1966: ⚠️ Potential issue | 🟠 Major

Add pool_size to the cache_key to handle variable pool sizes when using state_indices.

When use_pool_indexing=True (state_indices provided), pool_size can vary independently of B per the assertion at lines 1903-1906. However, the cache_key at line 1966 omits pool_size, allowing the same compiled kernel to be reused for different h0_source shapes [pool_size * HV, K, V]. This contrasts with the similar gated_delta_rule_mtp function (lines 2592-2602), which explicitly includes pool_size in its cache_key. Include pool_size in the cache_key to ensure correct kernel reuse across varying pool sizes.
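A minimal sketch of the intended shape of the fix, assuming hypothetical names that stand in for the real locals in gdn_decode.py:

```python
# Hypothetical sketch: include pool_size in the kernel cache key so that a
# kernel compiled for one [pool_size * HV, K, V] state layout is never reused
# for another. Names mirror the discussion above, not the actual code.
def make_cache_key(B, HV, K, V, pool_size, use_pool_indexing):
    key = (B, HV, K, V)
    if use_pool_indexing:
        # Pooled mode: h0_source is [pool_size * HV, K, V], so pool_size
        # must participate in the kernel cache lookup.
        key += (pool_size,)
    return key
```

With this shape, two runs that differ only in pool_size resolve to different cache entries, matching the gated_delta_rule_mtp behavior.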

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/gdn_decode.py` around lines 1959 - 1966, The kernel cache key
omits pool_size causing kernels compiled for one pool_size to be reused for
different h0_source shapes; update the cache_key tuple (used where cache_key is
assigned in gdn_decode) to include pool_size so it matches the actual h0_source
layout (state_contiguous and h0_source) and mirrors the gated_delta_rule_mtp
behavior when use_pool_indexing/state_indices can vary pool_size.
sglang_shadow/python/sglang/srt/layers/attention/mamba-1-1 (1)

1-1: ⚠️ Potential issue | 🟠 Major

Avoid committing machine-local absolute path targets in the shadow package.

Line 1 embeds a local /home/... path; this will not resolve on other machines and makes the shadow tree brittle.
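As an illustration (all paths below are made up for the demo), a relative symlink target resolves wherever the repository is checked out, while an absolute one bakes in one machine's filesystem layout:

```python
import os
import shutil

# Scratch demo: demo/real/mod.py stands in for the real module file.
os.makedirs("demo/real", exist_ok=True)
os.makedirs("demo/shadow", exist_ok=True)
with open("demo/real/mod.py", "w") as f:
    f.write("payload")

# Relative target: resolved against the symlink's own directory, so it keeps
# working after the repo is moved or cloned elsewhere.
os.symlink("../real/mod.py", "demo/shadow/rel_mod.py")

# An absolute target like os.path.abspath("demo/real/mod.py") would embed
# this machine's /home/... prefix and dangle on every other checkout.
target = os.readlink("demo/shadow/rel_mod.py")
content = open("demo/shadow/rel_mod.py").read()

shutil.rmtree("demo")
```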

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/layers/attention/mamba` at line 1, The shadow
package contains a machine-local absolute path string ("/home/...") in
sglang/srt/layers/attention/mamba which must be removed; edit the mamba shadow
file to delete the hard-coded absolute path and replace it with a relative path
or a proper package/resource reference (e.g., use package-relative imports or
resource lookup APIs) so the package does not embed local filesystem targets,
and ensure any tests/build scripts that generated this entry use relative paths
or package_data instead.
sglang_shadow/python/sglang/srt/distributed-1-1 (1)

1-1: ⚠️ Potential issue | 🟠 Major

Use repo-relative links or real files for shadow distributed package entries.

Line 1 is a workstation-specific absolute path, which will fail on any other machine.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/distributed` at line 1, The first line
contains a workstation-specific absolute path instead of a portable reference
for the shadow distributed package; replace that absolute path with either a
repo-relative reference or commit the real package files and point the shadow
entry to the module sglang.srt.distributed (i.e., update the shadow package
entry so it references the project-relative package/module rather than an
absolute filesystem path).
flashinfer/gdn_kernels/gdn_decode_bf16_state.py-2060-2082 (1)

2060-2082: ⚠️ Potential issue | 🟠 Major

Validate initial_state_indices device before from_dlpack.

initial_state_indices is checked for shape/dtype, but not that it lives on the same device as q. A CPU tensor here can fail later in kernel plumbing with less actionable errors.

🛡️ Suggested fix
     if use_pool_indexing:
         assert initial_state_indices.shape == (B,), (
             f"Expected shape [{B}], got {initial_state_indices.shape}"
         )
         assert initial_state_indices.dtype == torch.int32, (
             f"Expected int32, got {initial_state_indices.dtype}"
         )
+        assert initial_state_indices.device == q.device, (
+            f"initial_state_indices must be on {q.device}, got {initial_state_indices.device}"
+        )
         assert initial_state_source.is_contiguous(), (
             "initial_state_source must be contiguous for correct pointer arithmetic in pooled mode"
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/gdn_kernels/gdn_decode_bf16_state.py` around lines 2060 - 2082,
The code validates shape/dtype for initial_state_indices but not its device
relative to q, which can cause cryptic failures later (e.g., during
from_dlpack); add a device check and coerce or fail fast: verify
initial_state_indices.device == q.device (where q is the query tensor used
later) and if not either transfer initial_state_indices to q.device
(initial_state_indices = initial_state_indices.to(q.device)) or raise an assert
mentioning q.device and initial_state_indices.device; update the block around
the use of initial_state_indices, initial_state_source, and any subsequent
from_dlpack usage to perform this check before proceeding with pointer
arithmetic or DLPack conversions.
bench_gdn_kernel.py-70-75 (1)

70-75: ⚠️ Potential issue | 🟠 Major

Remove the duplicate time_ported computation.

Line 74 recomputes time_ported using the old start, inflating the reported ported latency.

🐛 Suggested fix
     torch.cuda.synchronize()
     time_ported = (time.perf_counter() - start) / repeat

     # Clear cache to avoid shape mismatch on h0_source (pool vs gathered)
-
-    time_ported = (time.perf_counter() - start) / repeat
-
-    # Clear cache to avoid shape mismatch on h0_source (pool vs gathered)
     _get_compiled_decode_kernel_nontranspose.cache_clear()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bench_gdn_kernel.py` around lines 70 - 75, The duplicate computation of
time_ported is causing an inflated ported latency; remove the second
reassignment "time_ported = (time.perf_counter() - start) / repeat" (the one
after the cache-clear comment) so time_ported is only computed once using the
original start and repeat variables; keep the first computation and the
cache-clear comment related to h0_source intact.
bench_gdn_kernel.py-29-31 (1)

29-31: ⚠️ Potential issue | 🟠 Major

State tensor must use BF16 dtype to test the optimized kernel path.

Line 30 allocates state_pool with hardcoded dtype=torch.float32, but the BF16-state kernel dispatch (line 1040 in flashinfer/gdn_decode.py) requires state.dtype == torch.bfloat16. This benchmark cannot hit the new optimized kernel path as written.

Suggested fix
    state_pool = torch.randn(
-       pool_size, num_value_heads, K, V, dtype=torch.float32, device=device
+       pool_size, num_value_heads, K, V, dtype=dtype, device=device
    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bench_gdn_kernel.py` around lines 29 - 31, The benchmark allocates state_pool
with dtype=torch.float32 so the BF16 optimized kernel in
flashinfer/gdn_decode.py never triggers; change the allocation for state_pool to
use torch.bfloat16 (or cast it immediately after creation) so state_pool.dtype
== torch.bfloat16, ensuring the BF16-state kernel dispatch path can be
exercised; ensure the device supports bfloat16 if necessary and keep the same
shape/variables (state_pool, K, V, num_value_heads, pool_size) when changing the
dtype.
🟡 Minor comments (3)
flashinfer/gdn_decode.py-953-953 (1)

953-953: ⚠️ Potential issue | 🟡 Minor

state_indices parameter is missing from the docstring.

The Args section (lines 960–998) documents all parameters up through use_qk_l2norm but omits the newly added state_indices. The nontranspose variant at line 1874 does document it — please mirror that here for consistency.
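For reference, a hedged sketch of the kind of Args entry being asked for; the signature and wording are illustrative, not the actual flashinfer docstring:

```python
def gated_delta_rule_decode_sketch(q, state, state_indices=None):
    """Illustrative signature only; not the real flashinfer API.

    Args:
        state_indices (Optional[torch.Tensor]): Optional int32 tensor of
            shape [B] mapping each batch element to a slot in the state
            pool. When provided, ``state`` is treated as a pool of shape
            [pool_size, HV, V, K] and is read and written indirectly;
            when omitted, ``state`` is per-batch with shape [B, HV, V, K].
    """
    ...
```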

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/gdn_decode.py` at line 953, The docstring's Args section is
missing documentation for the newly added parameter state_indices; update the
Args block in the function's docstring to include a description for
state_indices (type Optional[torch.Tensor]) matching the wording and semantics
used in the nontranspose variant's docstring (see state_indices entry around
line 1874) so both variants are consistent; locate the function in
flashinfer/gdn_decode.py that declares state_indices: Optional[torch.Tensor] =
None and add the same parameter description to the Args list.
flashinfer/gdn_decode.py-1896-1902 (1)

1896-1902: ⚠️ Potential issue | 🟡 Minor

Misleading comment: says "K-last" but the assertion checks K-major layout (HV, K, V).

The comment on line 1899 says "Qwen-style K-last" but the assertion verifies state.shape[1:] == (HV, K, V), which is K-major (V-last). The pretranspose variant at line 1006 correctly says "K-last" for (HV, V, K). This is confusing for anyone cross-referencing the two paths.

Suggested fix
-    # Validate state shape (Qwen-style K-last: [*, HV, V, K])
+    # Validate state shape (K-major: [*, HV, K, V])
     assert state.shape[1:] == (HV, K, V), (
         f"Expected state shape [*, HV={HV}, K={K}, V={V}], got {state.shape}"
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/gdn_decode.py` around lines 1896 - 1902, The comment "Qwen-style
K-last" is incorrect for the post-transpose path: the assertion checks
state.shape[1:] == (HV, K, V) which is K-major (V-last), not K-last. Update the
inline comment near use_pool_indexing/state to describe the actual layout —
e.g., "Qwen-style K-major (V-last): [*, HV, K, V]" — and optionally add a note
contrasting it with the pretranspose variant that uses (HV, V, K) to avoid
confusion; keep variable names (state, HV, K, V, use_pool_indexing,
state_indices) as reference.
bench_gdn_kernel.py-82-83 (1)

82-83: ⚠️ Potential issue | 🟡 Minor

Use _ for intentionally unused outputs.

out is assigned but never read in warmup/baseline loops.

🧹 Suggested cleanup
-        out, temp_state = gated_delta_rule_decode(
+        _, temp_state = gated_delta_rule_decode(
@@
-        out, temp_state = gated_delta_rule_decode(
+        _, temp_state = gated_delta_rule_decode(

Also applies to: 107-108

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bench_gdn_kernel.py` around lines 82 - 83, The variable out from the
gated_delta_rule_decode calls is assigned but never used; change the tuple
unpacking to discard the unused output by replacing "out, temp_state =" with "_,
temp_state =" (and similarly for the other occurrence around lines 107-108) so
only temp_state is kept; keep the function call and q argument the same and
ensure no other logic depends on out.
🧹 Nitpick comments (7)
flashinfer/gdn_decode.py (1)

1690-1691: Dead expression: h0_indices.layout.shape[0] result is discarded.

Line 1690 evaluates h0_indices.layout.shape[0] but the result is not assigned to anything. Same issue at line 1771 in run_gdn_decode_kernel_big_batch_nontranspose. These appear to be leftover debug statements.

Remove dead expressions
     # h0_source is flattened to [B*HV, K, V] to ensure proper alignment for SIMT async copy
     batch_hv_dim, k_dim, v_dim = h0_source.layout.shape
-    h0_indices.layout.shape[0]
     batch_size = B * HV  # Use actual batch size, not pool size

Apply the same removal at line 1771.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/gdn_decode.py` around lines 1690 - 1691, Remove the dead
expression evaluations of h0_indices.layout.shape[0] that are left as standalone
statements; in the function where batch_size is computed (the block setting
batch_size = B * HV) delete the preceding unused line
"h0_indices.layout.shape[0]" and apply the same removal in
run_gdn_decode_kernel_big_batch_nontranspose (the analogous stray
h0_indices.layout.shape[0] at line ~1771) so the code only computes and uses
batch_size without the discarded expression.
tests/gdn/test_gdn_decode_pooled.py (1)

19-22: Make the SM gate future-proof.

Line 21 hardcodes [9, 10, 11, 12]; this will skip tests on newer supported SM majors.

♻️ Suggested change
 def _skip_if_not_sm90_or_later():
     cc = get_compute_capability(torch.device("cuda"))
-    if cc[0] not in [9, 10, 11, 12]:
+    if cc[0] < 9:
         pytest.skip(f"GDN decode requires SM90+, got SM{cc[0]}{cc[1]}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/gdn/test_gdn_decode_pooled.py` around lines 19 - 22, The helper
_skip_if_not_sm90_or_later currently hardcodes supported SM majors
([9,10,11,12]) causing future SM majors to be skipped; update the check in
_skip_if_not_sm90_or_later (which calls get_compute_capability) to allow any
major >= 9 (e.g., use cc[0] < 9 to decide pytest.skip) and keep the pytest.skip
message but format it generically (e.g., "GDN decode requires SM90+, got
SM{cc[0]}{cc[1]}") so newer SM majors pass without needing further edits.
run_bench_local.sh (1)

249-251: restore_to_ported is called but never defined.

Line 250 calls an undefined function; the cleanup hook only appears to work because the resulting "command not found" error is swallowed by `|| true`.

♻️ Suggested guard
 cleanup() {
-    restore_to_ported 2>/dev/null || true
+    if command -v restore_to_ported >/dev/null 2>&1; then
+        restore_to_ported 2>/dev/null || true
+    fi
     kill_server 2>/dev/null || true
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@run_bench_local.sh` around lines 249 - 251, The cleanup function calls
restore_to_ported which is not defined; either add a small restore_to_ported
implementation (e.g., a function that reverses any port forwarding or is a
no-op) or guard the call so missing-command errors aren't relied on—e.g., check
command -v or type for restore_to_ported before calling it (or define
restore_to_ported() {}), and keep kill_server as-is; update the cleanup block to
call the defined or conditionally-invoked restore_to_ported to ensure the
cleanup hook is deterministic.
sglang_shadow/python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py (4)

67-71: Inconsistent cleanup of global handles in the except block.

On import failure, _flashinfer_chunk_gated_delta_rule and _flashinfer_gated_delta_rule_mtp are not explicitly reset to None, unlike the decode handles. While they happen to remain None from module-level initialization, this is fragile if a future code path partially sets them before failure.

Proposed fix
         except (ImportError, RuntimeError) as e:
             logger.warning(f"FlashInfer GDN kernels not available: {e}")
             _flashinfer_gdn_available = False
+            _flashinfer_chunk_gated_delta_rule = None
+            _flashinfer_gated_delta_rule_mtp = None
             _flashinfer_gated_delta_rule_decode = None
             _flashinfer_gated_delta_rule_decode_nontranspose = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@sglang_shadow/python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py`
around lines 67 - 71, The except block that handles ImportError/RuntimeError
leaves _flashinfer_chunk_gated_delta_rule and _flashinfer_gated_delta_rule_mtp
untouched, which is fragile; update the except block in
hybrid_linear_attn_backend.py (the handler that currently sets
_flashinfer_gdn_available, _flashinfer_gated_delta_rule_decode, and
_flashinfer_gated_delta_rule_decode_nontranspose) to also explicitly set
_flashinfer_chunk_gated_delta_rule and _flashinfer_gated_delta_rule_mtp to None
so all FlashInfer kernel handles are consistently cleared on failure.

1334-1337: Environment variable lookup on every decode forward call.

os.environ.get("SGLANG_TEST_FORCE_GATHER_SCATTER", "0") is evaluated on every forward_decode invocation in V-last FlashInfer mode. While the dict lookup itself is cheap, for a hot decode path this could be cached once during __init__ instead.

Proposed fix

In __init__, cache the flag:

self._force_gather_scatter = os.environ.get("SGLANG_TEST_FORCE_GATHER_SCATTER", "0") == "1"

Then in forward_decode:

-                force_gather_scatter = (
-                    os.environ.get("SGLANG_TEST_FORCE_GATHER_SCATTER", "0") == "1"
-                )
+                force_gather_scatter = self._force_gather_scatter
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@sglang_shadow/python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py`
around lines 1334 - 1337, The environment lookup for
SGLANG_TEST_FORCE_GATHER_SCATTER is performed on every forward_decode call; move
that check into the class initializer by adding a cached boolean (e.g. set
self._force_gather_scatter = os.environ.get("SGLANG_TEST_FORCE_GATHER_SCATTER",
"0") == "1" inside __init__) and then update forward_decode to read that cached
self._force_gather_scatter instead of calling os.environ.get or defining a local
force_gather_scatter each invocation; ensure references to force_gather_scatter
in forward_decode use the new instance attribute name.

817-819: Local re-import of causal_conv1d_fn shadows module-level CUDA override.

Inside forward_extend, the Triton version of causal_conv1d_fn is explicitly re-imported, overriding the module-level CUDA variant (set at line 114). If this is intentional (e.g., the Triton version supports extra kwargs like has_initial_state), a brief comment would prevent future confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@sglang_shadow/python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py`
around lines 817 - 819, The local re-import of causal_conv1d_fn inside
forward_extend is shadowing the module-level CUDA-assigned variant (set earlier)
and causing confusion; either remove the local import so forward_extend uses the
module-level causal_conv1d_fn, or if you intentionally need the Triton-specific
implementation (e.g., for has_initial_state), keep the import but add a concise
comment above it explaining why the Triton override is required and/or rename
the local binding to triton_causal_conv1d_fn to make the override explicit;
update all references in forward_extend to the chosen identifier.

579-584: In-place mutation of req_pool_indices during CUDA graph replay.

Line 581 mutates req_pool_indices directly (req_pool_indices[bs - num_padding :] = 0). If this tensor is shared or reused by the caller, this side effect could cause subtle bugs. This is likely intentional for CUDA graph replay (where buffers are pre-allocated), but worth a brief comment to document the contract.
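The side-effect risk can be illustrated without CUDA; this pure-Python sketch (a list stands in for the tensor, and both function names are hypothetical) contrasts the in-place pattern with a defensive copy, which is what tensor.clone() would give in torch:

```python
def zero_padding_inplace(req_pool_indices, bs, num_padding):
    # Mirrors the reviewed pattern: mutates the caller's buffer in place.
    req_pool_indices[bs - num_padding:] = [0] * num_padding
    return req_pool_indices

def zero_padding_copy(req_pool_indices, bs, num_padding):
    # Defensive variant: zero the padding tail on a local copy
    # (the torch equivalent would be req_pool_indices.clone()).
    local = list(req_pool_indices)
    local[bs - num_padding:] = [0] * num_padding
    return local
```

With the copy variant the caller's buffer is untouched; with the in-place variant the tail is zeroed in the shared tensor, which is fine for pre-allocated CUDA-graph buffers but should be documented as part of the contract.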

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@sglang_shadow/python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py`
around lines 579 - 584, The code is performing an in-place mutation of
req_pool_indices (req_pool_indices[bs - num_padding :] = 0) which can leak side
effects to callers; either avoid mutating the caller-visible tensor by making a
local copy before zeroing (e.g., local_req_pool_indices =
req_pool_indices.clone() or .detach()) and then call
self.req_to_token_pool.get_mamba_indices(local_req_pool_indices), or if the
in-place write is intentional for CUDA-graph buffer reuse, add a clear comment
above the mutation explaining the contract (that req_pool_indices is mutated on
purpose for replay and must not be reused by callers) and ensure subsequent uses
operate on a non-shared tensor (mamba_indices, get_mamba_indices, and
self.state_indices_list usage).

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d207428 and 36b5674.

📒 Files selected for processing (123)
  • bench_gdn_kernel.py
  • flashinfer/gdn_decode.py
  • flashinfer/gdn_kernels/gdn_decode_bf16_state.py
  • run_bench_local.sh
  • sglang_shadow/python/sglang/README.md
  • sglang_shadow/python/sglang/__init__.py
  • sglang_shadow/python/sglang/__pycache__
  • sglang_shadow/python/sglang/_version.py
  • sglang_shadow/python/sglang/bench_offline_throughput.py
  • sglang_shadow/python/sglang/bench_one_batch.py
  • sglang_shadow/python/sglang/bench_one_batch_server.py
  • sglang_shadow/python/sglang/bench_serving.py
  • sglang_shadow/python/sglang/check_env.py
  • sglang_shadow/python/sglang/cli
  • sglang_shadow/python/sglang/compile_deep_gemm.py
  • sglang_shadow/python/sglang/eval
  • sglang_shadow/python/sglang/global_config.py
  • sglang_shadow/python/sglang/jit_kernel
  • sglang_shadow/python/sglang/lang
  • sglang_shadow/python/sglang/launch_server.py
  • sglang_shadow/python/sglang/multimodal_gen
  • sglang_shadow/python/sglang/profiler.py
  • sglang_shadow/python/sglang/srt/__pycache__
  • sglang_shadow/python/sglang/srt/batch_invariant_ops
  • sglang_shadow/python/sglang/srt/batch_overlap
  • sglang_shadow/python/sglang/srt/checkpoint_engine
  • sglang_shadow/python/sglang/srt/compilation
  • sglang_shadow/python/sglang/srt/configs
  • sglang_shadow/python/sglang/srt/connector
  • sglang_shadow/python/sglang/srt/constants.py
  • sglang_shadow/python/sglang/srt/constrained
  • sglang_shadow/python/sglang/srt/debug_utils
  • sglang_shadow/python/sglang/srt/disaggregation
  • sglang_shadow/python/sglang/srt/distributed
  • sglang_shadow/python/sglang/srt/dllm
  • sglang_shadow/python/sglang/srt/elastic_ep
  • sglang_shadow/python/sglang/srt/entrypoints
  • sglang_shadow/python/sglang/srt/environ.py
  • sglang_shadow/python/sglang/srt/eplb
  • sglang_shadow/python/sglang/srt/function_call
  • sglang_shadow/python/sglang/srt/grpc
  • sglang_shadow/python/sglang/srt/hardware_backend
  • sglang_shadow/python/sglang/srt/layers/__pycache__
  • sglang_shadow/python/sglang/srt/layers/activation.py
  • sglang_shadow/python/sglang/srt/layers/amx_utils.py
  • sglang_shadow/python/sglang/srt/layers/attention/__pycache__
  • sglang_shadow/python/sglang/srt/layers/attention/aiter_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/attention_registry.py
  • sglang_shadow/python/sglang/srt/layers/attention/base_attn_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/cutlass_mla_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/double_sparsity_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/dual_chunk_flashattention_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/fla
  • sglang_shadow/python/sglang/srt/layers/attention/flashattention_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/flashinfer_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/flashinfer_mla_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/flashmla_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/hybrid_attn_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/intel_amx_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/mamba
  • sglang_shadow/python/sglang/srt/layers/attention/merge_state.py
  • sglang_shadow/python/sglang/srt/layers/attention/nsa
  • sglang_shadow/python/sglang/srt/layers/attention/nsa_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/tbo_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/torch_flex_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/torch_native_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/triton_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/triton_ops
  • sglang_shadow/python/sglang/srt/layers/attention/trtllm_mha_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/trtllm_mla_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/utils.py
  • sglang_shadow/python/sglang/srt/layers/attention/vision.py
  • sglang_shadow/python/sglang/srt/layers/attention/vision_utils.py
  • sglang_shadow/python/sglang/srt/layers/attention/wave_backend.py
  • sglang_shadow/python/sglang/srt/layers/attention/wave_ops
  • sglang_shadow/python/sglang/srt/layers/attention/xpu_backend.py
  • sglang_shadow/python/sglang/srt/layers/communicator.py
  • sglang_shadow/python/sglang/srt/layers/communicator_nsa_cp.py
  • sglang_shadow/python/sglang/srt/layers/deep_gemm_wrapper
  • sglang_shadow/python/sglang/srt/layers/dp_attention.py
  • sglang_shadow/python/sglang/srt/layers/elementwise.py
  • sglang_shadow/python/sglang/srt/layers/flashinfer_comm_fusion.py
  • sglang_shadow/python/sglang/srt/layers/layernorm.py
  • sglang_shadow/python/sglang/srt/layers/linear.py
  • sglang_shadow/python/sglang/srt/layers/logits_processor.py
  • sglang_shadow/python/sglang/srt/layers/model_parallel.py
  • sglang_shadow/python/sglang/srt/layers/modelopt_utils.py
  • sglang_shadow/python/sglang/srt/layers/moe
  • sglang_shadow/python/sglang/srt/layers/multimodal.py
  • sglang_shadow/python/sglang/srt/layers/parameter.py
  • sglang_shadow/python/sglang/srt/layers/pooler.py
  • sglang_shadow/python/sglang/srt/layers/quantization
  • sglang_shadow/python/sglang/srt/layers/radix_attention.py
  • sglang_shadow/python/sglang/srt/layers/rocm_linear_utils.py
  • sglang_shadow/python/sglang/srt/layers/rotary_embedding.py
  • sglang_shadow/python/sglang/srt/layers/sampler.py
  • sglang_shadow/python/sglang/srt/layers/sparse_pooler.py
  • sglang_shadow/python/sglang/srt/layers/torchao_utils.py
  • sglang_shadow/python/sglang/srt/layers/utils
  • sglang_shadow/python/sglang/srt/layers/vocab_parallel_embedding.py
  • sglang_shadow/python/sglang/srt/lora
  • sglang_shadow/python/sglang/srt/managers
  • sglang_shadow/python/sglang/srt/mem_cache
  • sglang_shadow/python/sglang/srt/metrics
  • sglang_shadow/python/sglang/srt/model_executor
  • sglang_shadow/python/sglang/srt/model_loader
  • sglang_shadow/python/sglang/srt/models
  • sglang_shadow/python/sglang/srt/multimodal
  • sglang_shadow/python/sglang/srt/multiplex
  • sglang_shadow/python/sglang/srt/parser
  • sglang_shadow/python/sglang/srt/sampling
  • sglang_shadow/python/sglang/srt/server_args.py
  • sglang_shadow/python/sglang/srt/server_args_config_parser.py
  • sglang_shadow/python/sglang/srt/speculative
  • sglang_shadow/python/sglang/srt/tokenizer
  • sglang_shadow/python/sglang/srt/tracing
  • sglang_shadow/python/sglang/srt/utils
  • sglang_shadow/python/sglang/srt/weight_sync
  • sglang_shadow/python/sglang/test
  • sglang_shadow/python/sglang/utils.py
  • sglang_shadow/python/sglang/version.py
  • tests/gdn/test_gdn_decode_pooled.py
✅ Files skipped from review due to trivial changes (48)
  • sglang_shadow/python/sglang/srt/layers/attention/__pycache__
  • sglang_shadow/python/sglang/srt/entrypoints
  • sglang_shadow/python/sglang/srt/configs
  • sglang_shadow/python/sglang/jit_kernel
  • sglang_shadow/python/sglang/srt/layers/utils
  • sglang_shadow/python/sglang/srt/constrained
  • sglang_shadow/python/sglang/srt/metrics
  • sglang_shadow/python/sglang/srt/managers
  • sglang_shadow/python/sglang/srt/layers/__pycache__
  • sglang_shadow/python/sglang/srt/debug_utils
  • sglang_shadow/python/sglang/srt/tokenizer
  • sglang_shadow/python/sglang/srt/layers/quantization
  • sglang_shadow/python/sglang/srt/function_call
  • sglang_shadow/python/sglang/srt/connector
  • sglang_shadow/python/sglang/srt/elastic_ep
  • sglang_shadow/python/sglang/srt/dllm
  • sglang_shadow/python/sglang/srt/compilation
  • sglang_shadow/python/sglang/srt/grpc
  • sglang_shadow/python/sglang/srt/layers/moe
  • sglang_shadow/python/sglang/srt/layers/attention/triton_ops
  • sglang_shadow/python/sglang/README.md
  • sglang_shadow/python/sglang/lang
  • sglang_shadow/python/sglang/srt/mem_cache
  • sglang_shadow/python/sglang/srt/__pycache__
  • sglang_shadow/python/sglang/test
  • sglang_shadow/python/sglang/srt/weight_sync
  • sglang_shadow/python/sglang/srt/utils
  • sglang_shadow/python/sglang/srt/batch_overlap
  • sglang_shadow/python/sglang/srt/layers/attention/fla
  • sglang_shadow/python/sglang/eval
  • sglang_shadow/python/sglang/srt/models
  • sglang_shadow/python/sglang/srt/model_executor
  • sglang_shadow/python/sglang/__pycache__
  • sglang_shadow/python/sglang/srt/layers/attention/wave_ops
  • sglang_shadow/python/sglang/multimodal_gen
  • sglang_shadow/python/sglang/srt/multimodal
  • sglang_shadow/python/sglang/cli
  • sglang_shadow/python/sglang/srt/eplb
  • sglang_shadow/python/sglang/srt/checkpoint_engine
  • sglang_shadow/python/sglang/srt/lora
  • sglang_shadow/python/sglang/srt/sampling
  • sglang_shadow/python/sglang/srt/layers/attention/nsa
  • sglang_shadow/python/sglang/srt/batch_invariant_ops
  • sglang_shadow/python/sglang/srt/parser
  • sglang_shadow/python/sglang/srt/speculative
  • sglang_shadow/python/sglang/srt/hardware_backend
  • sglang_shadow/python/sglang/srt/tracing
  • sglang_shadow/python/sglang/srt/layers/deep_gemm_wrapper


h_out = gH[(batch_idx, value_head_idx, None, None)]
o_head = gO[(batch_idx, 0, value_head_idx, None)]
if pool_idx >= 0:

⚠️ Potential issue | 🔴 Critical

Add in-kernel upper-bound guard for pool_idx to prevent OOB state access.

Current guards only check pool_idx >= 0. Because host max-index validation is skipped during graph capture (Line 2073), positive out-of-range indices can still read/write gH out of bounds.

🛡️ Suggested fix
-    if pool_idx >= 0:
+    if pool_idx >= 0 and pool_idx < gH.shape[0]:
@@
-    if pool_idx >= 0:
+    if pool_idx >= 0 and pool_idx < gH.shape[0]:
@@
-    if pool_idx >= 0:
+    if pool_idx >= 0 and pool_idx < gH.shape[0]:

Also applies to: 1158-1158, 1621-1621

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/gdn_kernels/gdn_decode_bf16_state.py` at line 763, The kernel
currently only checks "if pool_idx >= 0" before accessing the shared state (gH),
allowing positive out-of-range indices to OOB read/write; update the guard to
also enforce an upper bound (e.g., "if 0 <= pool_idx < pool_len") using the
actual pool length (use the kernel's pool size constant or gH.shape[0]/len(gH))
so accesses to gH are only performed when pool_idx is within [0, pool_len-1];
apply the same change at the other occurrences mentioned (around the checks at
lines corresponding to pool_idx usage such as the blocks referencing pool_idx,
gH in gdn_decode_bf16_state.py, including the other spots flagged at the same
pattern).
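
The guarded-update pattern the suggestion above describes can be sketched host-side in plain Python. This is an illustrative model, not FlashInfer code: `state_pool`, `pool_idx`, and `guarded_state_update` are hypothetical names standing in for `gH` and the in-kernel branch.

```python
# Host-side reference of the bounded pooled-state guard suggested above.
# Names (state_pool, pool_idx, guarded_state_update) are illustrative,
# not FlashInfer APIs.

def guarded_state_update(state_pool, pool_idx, delta):
    """Update the pooled state slot only when pool_idx is in range.

    Mirrors the proposed in-kernel guard
    `if pool_idx >= 0 and pool_idx < gH.shape[0]`: a negative index means
    "no slot assigned", and a positive out-of-range index (possible when
    host validation is skipped during graph capture) must not read or
    write past the pool.
    """
    if 0 <= pool_idx < len(state_pool):
        state_pool[pool_idx] += delta
        return True
    return False

pool = [0.0, 0.0, 0.0]                          # pool_size = 3
assert guarded_state_update(pool, 1, 2.5)       # valid slot: updated
assert not guarded_state_update(pool, -1, 9.0)  # "unassigned" sentinel: skipped
assert not guarded_state_update(pool, 3, 9.0)   # OOB index: skipped, no crash
assert pool == [0.0, 2.5, 0.0]
```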

Comment on lines +1 to +3
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

#!/bin/bash

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Inspect the first 5 lines of run_bench_local.sh
head -n 5 run_bench_local.sh | cat -n

Repository: flashinfer-ai/flashinfer

Length of output: 303


Move the shebang to Line 1.

The shebang is only honored when `#!` occupies the first two bytes of the file. On line 3 it is parsed as an ordinary comment, so the script runs under whatever default shell invokes it rather than bash.

Suggested fix
-export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
-
-#!/bin/bash
+#!/bin/bash
+export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
#!/bin/bash
#!/bin/bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
🧰 Tools
🪛 Shellcheck (0.11.0)

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@run_bench_local.sh` around lines 1 - 3, Move the shebang "#!/bin/bash" to be
the very first line of the script and place the environment export statement
(export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True) after it so the shell
interpreter is correctly recognized; update the file so the first line is the
shebang and subsequent lines contain the export and other commands.
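
A minimal pre-commit style check for this class of bug, sketched in Python (`has_valid_shebang` is a hypothetical helper, not part of this repo):

```python
# The kernel only honors "#!" when it is the first two bytes of the file,
# so a shebang anywhere after line 1 is inert. This check flags that case.

def has_valid_shebang(script_text):
    """Return True only if the script starts with a shebang on line 1."""
    return script_text.startswith("#!")

broken = "export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True\n\n#!/bin/bash\n"
fixed = "#!/bin/bash\nexport PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True\n"

assert not has_valid_shebang(broken)  # shebang on line 3 is just a comment
assert has_valid_shebang(fixed)
```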

Comment on lines +217 to +223


--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--mamba-ssm-k-last \
> "${OUTPUT_DIR}/${name}_server.log" 2>&1 &

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's check if the file exists and examine the relevant lines
git ls-files | grep -E "run_bench_local\.sh"

Repository: flashinfer-ai/flashinfer

Length of output: 86


🏁 Script executed:

sed -n '200,250p' run_bench_local.sh | cat -n

Repository: flashinfer-ai/flashinfer

Length of output: 1784


🏁 Script executed:

# Get the exact line numbers of the problematic lines
wc -l run_bench_local.sh
# Check the exact lines mentioned in the review
sed -n '215,225p' run_bench_local.sh | cat -n
echo "---"
sed -n '237,245p' run_bench_local.sh | cat -n

Repository: flashinfer-ai/flashinfer

Length of output: 795


🏁 Script executed:

#!/bin/bash
set -euo pipefail
awk '
  prev_bslash && $0 ~ /^[[:space:]]*$/ {print "blank line after trailing backslash at line " NR-1}
  {prev_bslash = ($0 ~ /\\[[:space:]]*$/)}
' run_bench_local.sh

Repository: flashinfer-ai/flashinfer

Length of output: 163


Remove blank lines after line continuations to fix broken multiline commands.

Lines 216 and 239 have trailing backslashes followed by blank lines, which breaks shell line continuation. This causes subsequent lines to be parsed as command names instead of arguments, resulting in execution errors (SC2215).

Suggested fix
     python -u -m sglang.launch_server \
         --model-path ${MODEL_PATH} \
         --tp ${TP_SIZE} \
         --host 127.0.0.1 \
         --port ${SERVER_PORT} \
         --trust-remote-code \
         --mem-fraction-static 0.20 \
         --disable-radix-cache \
         --disable-cuda-graph \
-
-
         --speculative-algorithm EAGLE \
         --speculative-num-steps 3 \
         --speculative-eagle-topk 1 \
         --mamba-ssm-k-last \
         > "${OUTPUT_DIR}/${name}_server.log" 2>&1 &
@@
     python -m sglang.bench_one_batch_server \
         --model "${MODEL_PATH}" \
         --base-url "http://127.0.0.1:${SERVER_PORT}" \
         --batch-size 1 32 64 128 \
-
         --output-len ${OUTPUT_LEN} \
         --result-filename "${OUTPUT_DIR}/${name}_result.jsonl" \
         2>&1 | tee "${OUTPUT_DIR}/${name}_output.txt"
🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 219-219: This flag is used as a command name. Bad line break or missing [ .. ]?

(SC2215)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@run_bench_local.sh` around lines 217 - 223, There are blank lines after shell
line-continuation backslashes in the multiline command that includes flags like
--speculative-algorithm, --speculative-num-steps, --speculative-eagle-topk and
--mamba-ssm-k-last (the command redirecting to
"${OUTPUT_DIR}/${name}_server.log" 2>&1 &); remove the empty lines immediately
following any line that ends with a trailing backslash so the backslash actually
escapes the newline and the next flag/redirect is treated as the same command.
Ensure every backslash is followed directly by the next argument line (no
intervening blank lines) so the shell continuation is not broken.
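
The failure mode is easy to reproduce. The snippet below (an illustrative repro, run through bash via subprocess) shows that a blank line after a trailing backslash terminates the command, so the following `--flag` line is executed as a command name, which is exactly what ShellCheck's SC2215 warns about:

```python
# Repro of SC2215: backslash-newline joins a line to the NEXT line only.
# If that next line is blank, the command ends there, and the following
# "--flag" line is parsed as a new command (-> command not found).
import subprocess

broken = "echo a \\\n\n--flag b\n"   # blank line after the continuation
fixed = "echo a \\\n--flag b\n"      # continuation followed directly by args

r_broken = subprocess.run(["bash", "-c", broken], capture_output=True, text=True)
r_fixed = subprocess.run(["bash", "-c", fixed], capture_output=True, text=True)

assert r_broken.returncode != 0                  # "--flag" is not a command
assert r_fixed.returncode == 0
assert r_fixed.stdout.strip() == "a --flag b"    # one command, all args kept
```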

@@ -0,0 +1 @@
/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/_version.py No newline at end of file

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Find the _version.py file in the repository
find . -name "_version.py" -type f

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# List top-level structure
ls -la

# Search for any version-related files
find . -name "*version*" -type f 2>/dev/null | head -20

# Search for any Python files with "sglang" in path
find . -path "*sglang*" -type f 2>/dev/null | head -20

Repository: flashinfer-ai/flashinfer

Length of output: 3082


🏁 Script executed:

# List contents of sglang_shadow directory
ls -la sglang_shadow/

# List contents of sglang_shadow/python/sglang/
ls -la sglang_shadow/python/sglang/

# Check if _version.py exists
ls -la sglang_shadow/python/sglang/_version.py 2>&1

Repository: flashinfer-ai/flashinfer

Length of output: 3362


🏁 Script executed:

# Check if the symlink target exists
ls -la /home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/_version.py 2>&1

# Try to read the symlink
cat sglang_shadow/python/sglang/_version.py 2>&1

# Check what the symlink resolves to
readlink -f sglang_shadow/python/sglang/_version.py 2>&1

Repository: flashinfer-ai/flashinfer

Length of output: 254


Remove the broken symbolic link to /home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/_version.py.

sglang_shadow/python/sglang/_version.py is a symbolic link pointing to a nonexistent absolute filesystem path. This breaks imports and prevents the package from being distributed or installed. Replace the symlink with actual version code or a proper relative reference.

🧰 Tools
🪛 Ruff (0.15.2)

[error] 1-1: No such file or directory (os error 2)

(E902)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/_version.py` at line 1, The file
sglang_shadow/python/sglang/_version.py is a broken absolute symlink; remove the
symlink and replace it with a real module that exports the package version
(e.g., define __version__ = "x.y.z" or implement a dynamic lookup using
importlib.metadata.version in a function like get_version()); ensure the new
_version.py is a regular file (not a symlink) and that any imports referencing
sglang._version or from sglang._version import __version__ continue to work, or
alternatively change imports to use a relative/resource-based lookup so no
absolute filesystem links are required.

@@ -0,0 +1 @@
/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/bench_one_batch_server.py No newline at end of file

⚠️ Potential issue | 🔴 Critical

Raw filesystem path committed as Python code breaks module parsing.

Line 1 is not valid Python and will fail import/linting. Replace this with actual module content (or remove this file from Python package inputs if it is only a path marker artifact).

Suggested minimal fix (keep file syntactically valid until real implementation lands)
-/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/bench_one_batch_server.py
+"""Shadow module placeholder.
+
+TODO: replace with actual `bench_one_batch_server.py` implementation.
+"""
🧰 Tools
🪛 Ruff (0.15.2)

[error] 1-1: No such file or directory (os error 2)

(E902)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/bench_one_batch_server.py` at line 1, The file
sglang_shadow/python/sglang/bench_one_batch_server.py currently contains a raw
filesystem path on line 1 which is invalid Python and breaks imports; replace
that line with valid module content (e.g., a module docstring and a minimal
placeholder such as an empty function or "pass" in a named function/class) or
remove the file from package inputs if it was only a path marker artifact, then
run lint/import to verify the module now parses cleanly as requested in
[request_verification].

@@ -0,0 +1 @@
/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/srt/layers/torchao_utils.py No newline at end of file

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

git ls-files | grep -i "torchao_utils"

Repository: flashinfer-ai/flashinfer

Length of output: 123


🏁 Script executed:

git ls-files | grep -E "(sglang|torchao)" | head -20

Repository: flashinfer-ai/flashinfer

Length of output: 923


🏁 Script executed:

cat -n sglang_shadow/python/sglang/srt/layers/torchao_utils.py | head -20

Repository: flashinfer-ai/flashinfer

Length of output: 155


🏁 Script executed:

ls -la sglang_shadow/python/sglang/srt/layers/torchao_utils.py

Repository: flashinfer-ai/flashinfer

Length of output: 257


🏁 Script executed:

git ls-files -s | grep torchao_utils

Repository: flashinfer-ai/flashinfer

Length of output: 173


🏁 Script executed:

file sglang_shadow/python/sglang/srt/layers/torchao_utils.py

Repository: flashinfer-ai/flashinfer

Length of output: 112


🏁 Script executed:

git show HEAD:sglang_shadow/python/sglang/srt/layers/torchao_utils.py | head -20

Repository: flashinfer-ai/flashinfer

Length of output: 151


Remove hardcoded symlink; replace with actual Python module.

The file is a symbolic link pointing to an absolute path (/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/srt/layers/torchao_utils.py) instead of containing Python code. This cannot be imported and is not portable across environments. Replace the symlink with the actual module file.

🧰 Tools
🪛 Ruff (0.15.2)

[error] 1-1: No such file or directory (os error 2)

(E902)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/layers/torchao_utils.py` at line 1, The file
sglang.srt.layers.torchao_utils.py is currently a filesystem symlink to an
absolute path and thus not importable; replace the symlink with a real Python
module file named torchao_utils.py that implements the expected public API for
this package (module sglang.srt.layers.torchao_utils). Create the module
in-place (not a symlink), implement or forward the expected functions/classes
used elsewhere (e.g., any helpers referenced as torchao_utils.* in imports), add
an __all__ export list and proper imports, and ensure it contains tests or
minimal runtime-safe stubs if full functionality depends on optional
dependencies so the package can be imported across environments.

@@ -0,0 +1 @@
/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/srt/model_loader No newline at end of file

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, check if the file exists and what it contains
if [ -f "sglang_shadow/python/sglang/srt/model_loader" ]; then
  echo "=== File content ==="
  cat -n "sglang_shadow/python/sglang/srt/model_loader"
  echo ""
  echo "=== File type ==="
  file "sglang_shadow/python/sglang/srt/model_loader"
  echo ""
  echo "=== File size ==="
  wc -c "sglang_shadow/python/sglang/srt/model_loader"
fi

# Check if it's a symlink
if [ -L "sglang_shadow/python/sglang/srt/model_loader" ]; then
  echo "=== Symlink target ==="
  readlink -f "sglang_shadow/python/sglang/srt/model_loader"
fi

# List directory structure
echo ""
echo "=== Directory structure of sglang_shadow/ ==="
find sglang_shadow -type f -name "model_loader*" 2>/dev/null
find sglang_shadow -type l 2>/dev/null | head -20

Repository: flashinfer-ai/flashinfer

Length of output: 993


🏁 Script executed:

# Search for files with absolute paths as content
echo "=== Checking for files with absolute paths ==="
fd --type f . sglang_shadow/ 2>/dev/null | while read -r file; do
  # Skip binary and large files
  if file "$file" | grep -q "text"; then
    lines=$(wc -l < "$file" 2>/dev/null || echo "999")
    if [ "$lines" -le 5 ]; then
      content=$(cat "$file" 2>/dev/null)
      if echo "$content" | grep -qE "^/home/|^/usr/|^/opt/|^/var/"; then
        echo "SUSPECT: $file"
        echo "Content: $content"
        echo "---"
      fi
    fi
  fi
done

# Check for symlinks with absolute targets
echo ""
echo "=== Checking for absolute symlinks ==="
find sglang_shadow -type l 2>/dev/null | while read -r link; do
  target=$(readlink "$link" 2>/dev/null)
  if echo "$target" | grep -qE "^/"; then
    echo "ABSOLUTE SYMLINK: $link -> $target"
  fi
done

Repository: flashinfer-ai/flashinfer

Length of output: 18079


🏁 Script executed:

# Explore the sglang_shadow directory structure
echo "=== sglang_shadow directory tree ==="
tree -L 3 sglang_shadow 2>/dev/null || find sglang_shadow -type d | head -20

Repository: flashinfer-ai/flashinfer

Length of output: 2238


🏁 Script executed:

# Check if there's any documentation or configuration about sglang_shadow
find . -maxdepth 3 -name "*README*" -o -name "*.md" | grep -i sglang | head -10

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Check if there's any gitignore or setup documentation
cat .gitignore 2>/dev/null | grep -A5 -B5 sglang_shadow || echo "No sglang_shadow entries in .gitignore"

Repository: flashinfer-ai/flashinfer

Length of output: 106


Critical: Absolute symlinks committed throughout sglang_shadow/ — will break for all contributors and CI.

The sglang_shadow/ directory contains 150+ symlinks with hardcoded absolute paths to a developer's local workspace (/home/xutingz/workspace/gitsrc/sglang_gdn/...). This affects all subdirectories: python/sglang/, srt/, layers/, models/, and more. These symlinks will not resolve on any other machine, breaking builds for all contributors and CI pipelines. It also leaks developer-specific information (username and workspace structure).

Convert all absolute symlinks to relative symlinks (e.g., ../../../sglang_gdn/python/sglang/...), remove this directory entirely and wire integration via pyproject.toml path dependencies or editable installs, or add the entire directory to .gitignore if it's only needed locally.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/model_loader` at line 1, The sglang_shadow
directory contains many absolute symlinks pointing to /home/xutingz/... which
will fail for other contributors and CI; replace every absolute symlink in
sglang_shadow (including subdirs python/sglang/, srt/, layers/, models/) with
relative symlinks (e.g., ../../../sglang_gdn/...), or remove the directory and
wire the package via pyproject.toml path dependencies or an editable install, or
add sglang_shadow/ to .gitignore if it’s only local; search the repo for
symlinks containing the absolute prefix
"/home/xutingz/workspace/gitsrc/sglang_gdn" and update each link target to a
correct relative path or eliminate the symlink and use proper packaging instead.

@@ -0,0 +1 @@
/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/srt/server_args.py No newline at end of file

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, find the actual file path in the repository
fd "server_args.py" -t f

Repository: flashinfer-ai/flashinfer

Length of output: 50


🏁 Script executed:

# Check repository structure and look for sglang references
git ls-files | head -30
echo "---"
git ls-files | grep -i sglang
echo "---"
git ls-files | grep -i server

Repository: flashinfer-ai/flashinfer

Length of output: 7356


🏁 Script executed:

# Check file size and read the content
wc -l sglang_shadow/python/sglang/srt/server_args.py
echo "---"
head -20 sglang_shadow/python/sglang/srt/server_args.py

Repository: flashinfer-ai/flashinfer

Length of output: 272


🏁 Script executed:

# Check the file using git directly
git ls-files -s sglang_shadow/python/sglang/srt/server_args.py
echo "---"
# Try to read it via git
git cat-file -p HEAD:sglang_shadow/python/sglang/srt/server_args.py | head -20

Repository: flashinfer-ai/flashinfer

Length of output: 243


Fix invalid symlink in module file.

The file sglang_shadow/python/sglang/srt/server_args.py is a symbolic link pointing to an absolute path (/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/srt/server_args.py), which is invalid Python and breaks module imports. Replace the symlink with actual server_args definitions or an intentional placeholder docstring.

🧰 Tools
🪛 Ruff (0.15.2)

[error] 1-1: No such file or directory (os error 2)

(E902)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/srt/server_args.py` at line 1, The server_args.py
file is currently a broken symlink; replace it with a real Python module instead
of a symlink: remove the symlink and create a new server_args.py containing
either the actual argument parsing/constants/classes used elsewhere (e.g.,
functions or names referenced like parse_server_args, ServerArgs, DEFAULT_PORT)
or, if the real implementations live elsewhere, create a clear placeholder
module with a top-level docstring and exported names (e.g., __all__ =
["parse_server_args", "ServerArgs"]) so imports succeed; ensure the module
defines the same public symbols other code expects and is valid Python (no
absolute-path symlink).

@@ -0,0 +1 @@
/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/utils.py No newline at end of file

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

head -n 5 sglang_shadow/python/sglang/utils.py | cat -n

Repository: flashinfer-ai/flashinfer

Length of output: 163


🏁 Script executed:

fd -type f -name "utils.py" | head -20

Repository: flashinfer-ai/flashinfer

Length of output: 238


🏁 Script executed:

fd "utils.py"

Repository: flashinfer-ai/flashinfer

Length of output: 1155


🏁 Script executed:

head -n 10 sglang_shadow/python/sglang/utils.py | cat -n

Repository: flashinfer-ai/flashinfer

Length of output: 163


🏁 Script executed:

git show HEAD:sglang_shadow/python/sglang/utils.py | head -n 10

Repository: flashinfer-ai/flashinfer

Length of output: 132


🏁 Script executed:

git show HEAD:sglang_shadow/python/sglang/utils.py | wc -l

Repository: flashinfer-ai/flashinfer

Length of output: 69


🏁 Script executed:

git show HEAD:sglang_shadow/python/sglang/utils.py | xxd | head -20

Repository: flashinfer-ai/flashinfer

Length of output: 109


🏁 Script executed:

git show HEAD:sglang_shadow/python/sglang/utils.py | od -c

Repository: flashinfer-ai/flashinfer

Length of output: 110


sglang.utils is currently unparsable due to Line 1.

The absolute path literal is invalid Python and must be removed/replaced.

Proposed fix
-/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/utils.py
+"""Shared utility helpers (placeholder)."""
🧰 Tools
🪛 Ruff (0.15.2)

[error] 1-1: No such file or directory (os error 2)

(E902)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/utils.py` at line 1, The first line of
sglang.utils contains an invalid absolute path literal which makes the module
unparsable; remove that path string from the top of the file and replace it with
a valid construct (e.g., a module docstring, a shebang, or the intended
import/definition). Ensure the file begins with valid Python (no bare filesystem
paths) and that any intended metadata is expressed as a string literal docstring
or comment rather than a raw path.

@@ -0,0 +1 @@
/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/version.py No newline at end of file

⚠️ Potential issue | 🔴 Critical

Accidentally committed local filesystem path instead of file content.

This file (and every other file added under sglang_shadow/) contains only the developer's local absolute machine path as its content:

/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/version.py

These are not valid Python modules. They were almost certainly created by a tooling artifact (e.g., a find/echo redirect or a broken symlink materialisation) and then accidentally staged. The same problem affects all 8 files in this batch:

  • sglang_shadow/python/sglang/version.py
  • sglang_shadow/python/sglang/srt/disaggregation (also not a .py file — may be a directory committed as a regular file)
  • sglang_shadow/python/sglang/srt/layers/vocab_parallel_embedding.py
  • sglang_shadow/python/sglang/profiler.py
  • sglang_shadow/python/sglang/srt/layers/parameter.py
  • sglang_shadow/python/sglang/srt/constants.py
  • sglang_shadow/python/sglang/srt/layers/attention/intel_amx_backend.py
  • sglang_shadow/python/sglang/global_config.py

Please:

  1. Determine whether these shadow stubs should be empty files, copies/symlinks of the real SGLang source, or omitted entirely.
  2. Replace the path-string content with the correct content (or remove the files if not needed).
  3. For sglang_shadow/python/sglang/srt/disaggregation, verify whether this is meant to be a directory (package) — if so, a __init__.py inside that directory is needed rather than a file at that path.

The Ruff E902 errors on all of these are a direct downstream symptom of this same root cause.

🧰 Tools
🪛 Ruff (0.15.2)

[error] 1-1: No such file or directory (os error 2)

(E902)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@sglang_shadow/python/sglang/version.py` at line 1, Several files under
sglang_shadow contain only the developer's local absolute path instead of valid
Python content (e.g., sglang_shadow/python/sglang/version.py and the listed
eight files); you should remove or replace these stub files with the correct
artifacts: first determine whether each shadow entry (version.py, profiler.py,
global_config.py, constants.py, parameter.py, vocab_parallel_embedding.py,
intel_amx_backend.py) should be omitted, be a real copy/symlink of the
corresponding sglang module, or be an empty placeholder; for
sglang_shadow/python/sglang/srt/disaggregation verify if it should be a package
— if so replace the file with a directory named disaggregation and add an
__init__.py (or otherwise remove the stray file); ensure the final commit
removes the single-line path contents and changes the files to valid Python
modules or deletes them, then re-run linting to confirm Ruff E902 is resolved.

@xutizhou xutizhou force-pushed the feat/gdn-decode-pooled-f16 branch 2 times, most recently from c627a21 to 06b94e7 on February 25, 2026 09:44
@xutizhou xutizhou force-pushed the feat/gdn-decode-pooled-f16 branch from 702da5b to 34121a1 on February 25, 2026 10:22
@xutizhou xutizhou mentioned this pull request Feb 26, 2026
10 tasks
Collaborator

@yzh119 yzh119 left a comment


LGTM, please fix the typo. Also cc @kaixih, since your #2619 works on similar features (but on different code paths).

@yzh119

yzh119 commented Mar 4, 2026

/bot run

@yzh119 yzh119 added the run-ci label Mar 4, 2026
@flashinfer-bot

GitLab MR !372 has been created, and the CI pipeline #45304753 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot

[FAILED] Pipeline #45304753: 9/20 passed

@xutizhou xutizhou closed this Mar 13, 2026