feat(gdn): port pooled decode kernel to f16 backend #2634
xutizhou wants to merge 5 commits into flashinfer-ai:main
Conversation
Summary of Changes
Hello @xutizhou, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request integrates the pooled decode kernel with the recently introduced F16 backend, enabling a more efficient, zero-copy mechanism for state updates. By leveraging indirect indexing, the system can now directly access and modify state within a shared pool, significantly reducing memory overhead and improving overall performance. The changes also refine the API and internal logic to seamlessly support both traditional and pooled state representations.
Activity
📝 Walkthrough
Adds optional pooled state indexing to gated-delta-rule decode: public APIs accept an optional state-indices argument that maps each batch entry to a slot in a shared state pool.
Sequence Diagram(s)
sequenceDiagram
participant User as User Code
participant Dispatch as Dispatcher
participant Kernel as CUDA Kernel
participant Pool as State Pool
rect rgba(120,180,240,0.5)
Note over User,Dispatch: Pool-indexed flow
User->>Dispatch: call gated_delta_rule_decode(..., state=pool, state_indices)
    Dispatch->>Dispatch: validate `state_indices` (contiguous int32 [B])<br/>set use_pool_indexing = true
    Dispatch->>Kernel: launch kernel with use_pool_indexing=true<br/>pass state (pool), state_indices
Kernel->>Pool: pool_idx = gState_indices[batch_idx]
alt pool_idx >= 0
Kernel->>Pool: load state[pool_idx,...]
Kernel->>Kernel: compute outputs
Kernel->>Pool: write back state[pool_idx,...]
else pool_idx < 0
        Kernel->>Kernel: treat as padding / produce zeros
end
end
rect rgba(220,160,100,0.5)
Note over User,Dispatch: Direct per-batch flow
User->>Dispatch: call gated_delta_rule_decode(..., state=per-batch, state_indices=None)
Dispatch->>Dispatch: set use_pool_indexing = false
    Dispatch->>Kernel: launch kernel with use_pool_indexing=false<br/>pass per-batch state
Kernel->>Kernel: load state[batch_idx,...], compute, write outputs
end
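The two flows in the diagram can be sketched in plain Python. This is a minimal model of the dispatch semantics only, not the actual CUDA path; the function and variable names are illustrative:

```python
def decode_reference(states, state_indices, batch_size):
    """Model how the kernel resolves each batch entry's state slot.

    states: list of state blocks (the shared pool in pooled mode, or one
    entry per batch in direct mode). state_indices: optional list of int
    slot ids; a negative id marks a padding slot.
    """
    outputs = []
    use_pool_indexing = state_indices is not None
    for batch_idx in range(batch_size):
        if use_pool_indexing:
            pool_idx = state_indices[batch_idx]
            if pool_idx < 0:
                outputs.append(0.0)  # padding: produce zeros, touch no state
                continue
            state = states[pool_idx]  # zero-copy: use the pool slot in place
        else:
            state = states[batch_idx]  # direct per-batch mapping
        outputs.append(state)  # stand-in for the real decode computation
    return outputs

# Pooled: batches 0 and 1 both read slot 2; batch 2 is padding.
print(decode_reference([10, 20, 30], [2, 2, -1], 3))  # [30, 30, 0.0]
# Direct: state_indices=None falls back to identity mapping.
print(decode_reference([10, 20, 30], None, 3))        # [10, 20, 30]
```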
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: 2 passed, 1 failed (1 warning)
Code Review
This pull request successfully ports the pooled decode kernel to the F16 backend, introducing zero-copy state updates via indirect indexing. The changes are well-implemented, propagating the use_pool_indexing logic through the CUDA kernels and their corresponding Python wrappers. The documentation and input validation have been updated to reflect these new capabilities. I have one suggestion to refactor a small piece of duplicated code within one of the CUDA kernels to improve maintainability.
flashinfer/gdn_decode.py
Outdated
if pool_idx >= 0:
    # Get current batch
    gSrc_batch = h0_source[(state_idx, None, None)]  # (V, K)
    gDst = cute.local_tile(h0_source, (1, TILE_V, TILE_K), (state_idx, None, 0))

    # Partition for load
    thr_copy_load = tiled_copy_load.get_slice(tidx)
    # Split into tiles along the V direction
    gSrc = cute.local_tile(
        gSrc_batch, (TILE_V, TILE_K), (None, 0)
    )  # (TILE_V, TILE_K, num_v_tiles)

    # ===================================================================
    # Prefetch: All threads participate in cp.async load
    # ===================================================================
    start_v_tiles = batch_inner * num_v_tiles_per_block
    prefetch_count = cutlass.min(NUM_STAGES - 1, num_v_tiles_per_block)
    for v_tiles in range(start_v_tiles, start_v_tiles + prefetch_count):
        stage = (v_tiles - start_v_tiles) % NUM_STAGES

        gSrc_tile = gSrc[(None, None, v_tiles)]
        sData_stage = sData[(None, None, stage)]

        thr_gSrc = thr_copy_load.partition_S(gSrc_tile)
        thr_sData = thr_copy_load.partition_D(sData_stage)

        cute.copy(tiled_copy_load, thr_gSrc, thr_sData)
        cute.arch.cp_async_commit_group()

    # Load q, k into BF16 registers using autovec_copy (contiguous pattern)
    q_tile = cute.local_tile(q, (1, 1, 1, vec_size), (i_n, i_t, i_h, lane_id))
    k_tile = cute.local_tile(k, (1, 1, 1, vec_size), (i_n, i_t, i_h, lane_id))
    cute.autovec_copy(q_tile, r_q_bf16)
    cute.autovec_copy(k_tile, r_k_bf16)

    # Convert BF16 to FP32
    for i in cutlass.range_constexpr(vec_size):
        r_q[i] = cutlass.Float32(r_q_bf16[i])
        r_k[i] = cutlass.Float32(r_k_bf16[i])

    # Load v into BF16 registers using autovec_copy, convert to FP32, store to sV
    v_tile = cute.local_tile(v, (1, 1, 1, vec_size), (i_n, i_t, i_hv, lane_id))
    cute.autovec_copy(v_tile, r_v_bf16)
    for i in cutlass.range_constexpr(vec_size):
        sV[k_start + i] = cutlass.Float32(r_v_bf16[i])

    cute.arch.barrier()  # Ensure all threads finish writing to sV

    # ===================================================================
    # Compute g and beta (scalar values)
    # ===================================================================
    r_g = 0.0
    r_beta = 0.0
    if lane_id == 0:
        x = r_a + r_dt_bias
        beta_x = softplus_beta * x
        softplus_x = 0.0

        if beta_x <= softplus_threshold:
            # softplus(x) = (1/beta) * log(1 + exp(beta*x))
            # Compute in Float32
            exp_beta_x = cute.exp(beta_x, fastmath=True)
            log_input = cutlass.Float32(1.0 + exp_beta_x)
            log_result = cutlass.Float32(cute.log(log_input, fastmath=True))
            softplus_x = cutlass.Float32(
                (cutlass.Float32(1.0) / softplus_beta) * log_result
            )
        else:
            softplus_x = x

        # Compute g = exp(A_log) * softplus_x
        r_g_value = -cute.exp(r_A_log, fastmath=True) * softplus_x

        # Compute beta = 1 / (1 + exp(-b))
        r_beta = 1.0 / (1.0 + cute.exp(-r_b, fastmath=True))

        # Store to scalar (Float32)
        r_g = cute.exp(r_g_value, fastmath=True)

    r_g = cute.arch.shuffle_sync(r_g, 0)
    r_beta = cute.arch.shuffle_sync(r_beta, 0)

    if use_qk_l2norm:
        # Compute L2 norm of q and k
        sum_q = 0.0
        sum_k = 0.0
        for i in cutlass.range_constexpr(vec_size):
            sum_q += r_q[i] * r_q[i]
            sum_k += r_k[i] * r_k[i]
        # Warp-level reduction using butterfly shuffle
        for offset in [16, 8, 4, 2, 1]:
            sum_q += cute.arch.shuffle_sync_bfly(
                sum_q, offset=offset, mask=-1, mask_and_clamp=31
            )
            sum_k += cute.arch.shuffle_sync_bfly(
                sum_k, offset=offset, mask=-1, mask_and_clamp=31
            )

        inv_norm_q = cute.rsqrt(sum_q + 1e-6, fastmath=True)
        inv_norm_k = cute.rsqrt(sum_k + 1e-6, fastmath=True)
        for i in cutlass.range_constexpr(vec_size):
            r_q[i] = r_q[i] * inv_norm_q
            r_k[i] = r_k[i] * inv_norm_k

    # Apply scaling in Float32
    for i in cutlass.range_constexpr(vec_size):
        r_q[i] = r_q[i] * scale

    # ===================================================================
    # Mainloop: All threads participate
    # ===================================================================
    end_v_tiles = start_v_tiles + num_v_tiles_per_block
    for v_tiles in range(start_v_tiles, end_v_tiles):
        stage = (v_tiles - start_v_tiles) % NUM_STAGES

        # Step 1: Wait for current stage to complete
        cute.arch.cp_async_wait_group(0)
        cute.arch.barrier()

        # Step 2: Issue async load for next tile (after compute)
        next_v_tiles = v_tiles + prefetch_count
        if next_v_tiles < end_v_tiles:
            next_stage = (next_v_tiles - start_v_tiles) % NUM_STAGES

            gSrc_next = gSrc[(None, None, next_v_tiles)]
            sData_next = sData[(None, None, next_stage)]

            thr_gSrc = thr_copy_load.partition_S(gSrc_next)
            thr_sData = thr_copy_load.partition_D(sData_next)

            cute.copy(tiled_copy_load, thr_gSrc, thr_sData)
            cute.arch.cp_async_commit_group()

        # Step 3: Compute using data from current stage (contiguous access pattern)
        for row in cutlass.range_constexpr(0, TILE_V, 4):
            row_offset = tidx // 32
            sum_hk = 0.0

            # Load h from sData using 3D local_tile + autovec_copy (contiguous in K)
            sData_tile = cute.local_tile(
                sData, (1, vec_size, 1), (row + row_offset, lane_id, stage)
            )
            cute.autovec_copy(sData_tile, r_h)

            for i in cutlass.range_constexpr(vec_size):
                r_h[i] = r_h[i] * r_g
                sum_hk += r_h[i] * r_k[i]

            for offset in [16, 8, 4, 2, 1]:
                sum_hk += cute.arch.shuffle_sync_bfly(
                    sum_hk, offset=offset, mask=-1, mask_and_clamp=31
                )

            v_new = sV[v_tiles * TILE_V + row + row_offset] - sum_hk
            v_new = v_new * r_beta

            sum_hq = 0.0
            for i in cutlass.range_constexpr(vec_size):
                r_h[i] += r_k[i] * v_new
                sum_hq += r_h[i] * r_q[i]

            # Write h to gDst using 4D local_tile + autovec_copy (contiguous in K)
            gDst_tile = cute.local_tile(
                gDst, (1, 1, vec_size, 1), (0, row + row_offset, lane_id, v_tiles)
            )
            cute.autovec_copy(r_h, gDst_tile)

            for offset in [16, 8, 4, 2, 1]:
                sum_hq += cute.arch.shuffle_sync_bfly(
                    sum_hq, offset=offset, mask=-1, mask_and_clamp=31
                )

            o_idx = v_tiles * TILE_V + row + row_offset
            if lane_id == 0 and o_idx < V:
                sOutput[o_idx] = cutlass.BFloat16(sum_hq)

    # ===================================================================
    # Final writeback: Copy output from shared memory to global memory
    # All threads write (V=128, NUM_THREADS=128)
    # ===================================================================
    cute.arch.barrier()  # Ensure all writes to sOutput are complete
    if tidx >= start_v_tiles * TILE_V and tidx < end_v_tiles * TILE_V:
        o[(i_n, i_t, i_hv, tidx)] = sOutput[tidx]
else:
    # Padding slot: write zeros to output
    start_v_tiles = batch_inner * num_v_tiles_per_block
    if (
        tidx >= start_v_tiles * TILE_V
        and tidx < (start_v_tiles + num_v_tiles_per_block) * TILE_V
    ):
        o[(i_n, i_t, i_hv, tidx)] = cutlass.BFloat16(0.0)
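For reference, the per-row recurrence that the mainloop implements (decay the state row by g, form v_new from the residual against h·k, apply the rank-1 update, emit o = H q) can be written as a tiny scalar model. This is pure Python and purely illustrative; g and beta stand for the gate and mixing scalars the kernel computes earlier:

```python
def gated_delta_rule_step(H, q, k, v, g, beta):
    """Scalar reference for one decode step of the kernel's row loop.

    H: state matrix as a list of V rows, each a list of K floats (mutated
    in place, mirroring the kernel's state writeback).
    q, k: length-K lists; v: length-V list; g, beta: scalars.
    """
    o = []
    for row in range(len(H)):
        h = [x * g for x in H[row]]                       # r_h[i] *= r_g
        sum_hk = sum(h[i] * k[i] for i in range(len(k)))  # h · k
        v_new = (v[row] - sum_hk) * beta                  # residual, scaled
        h = [h[i] + k[i] * v_new for i in range(len(k))]  # rank-1 update
        H[row] = h                                        # state writeback
        o.append(sum(h[i] * q[i] for i in range(len(q)))) # sum_hq
    return o

# One row, K=2: h decays to [0.5, 0], v_new = 2, row becomes [0.5, 2].
print(gated_delta_rule_step([[1.0, 0.0]], [1.0, 1.0], [0.0, 1.0], [2.0], 0.5, 1.0))
# [2.5]
```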
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
flashinfer/gdn_decode.py (1)
1077-1096: ⚠️ Potential issue | 🔴 Critical
Add pooled indexing support to bf16 fast path or guard against it.
When `use_pool_indexing=True` (i.e., `state_indices` is provided), the bf16 fast path at lines 1074–1100 can still be selected, but the backend ignores pooled indices entirely. The underlying `gated_delta_rule` function in `flashinfer/gdn_kernels/gdn_decode_bf16_state.py` explicitly documents `initial_state_indices` as "Not used (for compatibility)", meaning the kernel will silently default to direct batch mapping rather than respecting the provided pooled indices.
This causes incorrect state reads/writes for non-identity pooled mappings. Either:
- Add `and not use_pool_indexing` to the dispatch condition at line 1074 to prevent the bf16 path when pooled mode is active, or
- Implement pooled indexing support in the bf16 backend.
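The first option can be sketched as a guard in the dispatcher. This is a minimal model only; the flag and backend names mirror the review text and are illustrative, not the actual flashinfer dispatch code:

```python
def select_backend(use_gdn_decode_klast_bf16_state, state_indices):
    """Pick a decode backend.

    The bf16 fast path ignores pooled indices, so it may only be chosen
    for direct per-batch mappings (state_indices is None).
    """
    use_pool_indexing = state_indices is not None
    if use_gdn_decode_klast_bf16_state and not use_pool_indexing:
        return "bf16_fast_path"
    return "f16_pooled_path"

assert select_backend(True, None) == "bf16_fast_path"
assert select_backend(True, [2, 0, 1]) == "f16_pooled_path"  # pooled mode wins
```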
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/gdn_decode.py` around lines 1077 - 1096, The bf16 fast path (_gated_delta_rule_gdn_decode_klast_bf16_state) ignores pooled indexing, so either prevent that path when pooling is used or implement pooled support; the minimal fix is to change the dispatch condition that uses use_gdn_decode_klast_bf16_state to also require not use_pool_indexing (or equivalently check state_indices is None/identity) so the bf16 kernel is not chosen when pooled indices (use_pool_indexing/state_indices) are provided, ensuring the scalar bf16 kernel is only used for direct batch mappings.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@flashinfer/gdn_decode.py`:
- Around line 1056-1067: When use_pool_indexing is true, state_indices currently
only has shape/dtype checks; add a runtime bounds check to ensure all entries in
state_indices are within [0, pool_size) (and optionally non-negative) before
kernel launch to prevent OOB in h0_source address calculations; locate the
pool-mode validation block around use_pool_indexing/state.is_contiguous() and
add a check using state_indices (and B/pool_size) to raise a clear error if any
value >= pool_size (or < 0).
---
Outside diff comments:
In `@flashinfer/gdn_decode.py`:
- Around line 1077-1096: The bf16 fast path
(_gated_delta_rule_gdn_decode_klast_bf16_state) ignores pooled indexing, so
either prevent that path when pooling is used or implement pooled support; the
minimal fix is to change the dispatch condition that uses
use_gdn_decode_klast_bf16_state to also require not use_pool_indexing (or
equivalently check state_indices is None/identity) so the bf16 kernel is not
chosen when pooled indices (use_pool_indexing/state_indices) are provided,
ensuring the scalar bf16 kernel is only used for direct batch mappings.
Force-pushed: 63c5650 to 834f1e6
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
flashinfer/gdn_kernels/gdn_decode_bf16_state.py (1)
2111-2150: ⚠️ Potential issue | 🔴 Critical
Fix argument order in `cute.compile()` and the execute call: `stream` must come before `si_` and `use_pool_indexing_c`.
The launch wrapper functions (e.g., `gated_delta_rule_launch_seqlen1`, `gated_delta_rule_launch_seqlen2`) declare parameters in this order:
- eps: cutlass.Float32
- stream: cuda.CUstream
- mState_indices: cute.Tensor
- use_pool_indexing: cutlass.Constexpr[bool]
However, both the `cute.compile()` call (line 2111) and the execute call (line 2133) pass arguments as:
- eps_f32
- si_ (mState_indices)
- use_pool_indexing_c
- stream
With positional argument mapping, this causes `si_` to bind to the `stream` parameter (type mismatch: `cute.Tensor` vs `cuda.CUstream`), `use_pool_indexing_c` to bind to `mState_indices` (type mismatch: `bool` vs `cute.Tensor`), and `stream` to bind to `use_pool_indexing` (type mismatch: `cuda.CUstream` vs `bool`). This is a critical correctness bug.
Fix: reorder arguments to match the function signature:
 _compiled_kernels[cache_key] = cute.compile(
     launch_fn,
     q_, k_, v_, a_, b_, A_log_, dt_bias_, h_, o_,
     scale_f32, softplus_beta_f32, softplus_threshold_f32, eps_f32,
+    stream,
     si_,
     use_pool_indexing_c,
-    stream,
     options="--enable-tvm-ffi --generate-line-info",
 )
 # Execute
 _compiled_kernels[cache_key](
     q_, k_, v_, a_, b_, A_log_, dt_bias_, h_, o_,
     scale_f32, softplus_beta_f32, softplus_threshold_f32, eps_f32,
+    stream,
     si_,
     use_pool_indexing_c,
-    stream,
 )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/gdn_kernels/gdn_decode_bf16_state.py` around lines 2111 - 2150, The compile and execute calls pass arguments in the wrong order causing positional binding errors; reorder the final three args so that stream comes before si_ and use_pool_indexing_c in both the cute.compile(...) invocation and the subsequent _compiled_kernels[cache_key](...) call to match the launch wrapper signatures (e.g., gated_delta_rule_launch_seqlen1, gated_delta_rule_launch_seqlen2) which expect (eps, stream, mState_indices, use_pool_indexing); update the argument lists for cute.compile and the runtime call accordingly so stream is placed before si_ and use_pool_indexing_c.
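The positional-binding hazard described in this comment is easy to reproduce in plain Python. The objects below are toy stand-ins, not the real `cute.compile` signature; the point is that misordered positional arguments bind silently, while keyword arguments do not:

```python
def launch_fn(eps, stream, mState_indices, use_pool_indexing):
    """Toy launch wrapper with the same parameter order as in the review."""
    return {"eps": eps, "stream": stream,
            "mState_indices": mState_indices,
            "use_pool_indexing": use_pool_indexing}

eps_f32, stream = 1e-6, "cuda_stream_0"
si_, use_pool_indexing_c = [2, 0, 1], True

# Buggy order (as in the original call): every value lands one slot off.
bad = launch_fn(eps_f32, si_, use_pool_indexing_c, stream)
assert bad["stream"] == [2, 0, 1]        # a "tensor" bound to the stream slot

# Keyword arguments make the call robust to parameter-order mistakes.
good = launch_fn(eps=eps_f32, stream=stream,
                 mState_indices=si_, use_pool_indexing=use_pool_indexing_c)
assert good["stream"] == "cuda_stream_0"
```

Passing these by keyword (where the compile API allows it) would have prevented the bug outright.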
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@flashinfer/gdn_kernels/gdn_decode_bf16_state.py`:
- Around line 2065-2075: When use_pool_indexing is true, add a defensive
contiguity assertion for the pooled source tensor: assert
initial_state_source.is_contiguous(), with a clear message (e.g. "Expected
contiguous initial_state_source for pooled indexing"), placed alongside the
existing shape and dtype checks that reference initial_state_indices; this
ensures the kernel's async H 128-bit tiled copies are safe when computing
state_indices_t and using initial_state_source in the pooling path.
---
Outside diff comments:
In `@flashinfer/gdn_kernels/gdn_decode_bf16_state.py`:
- Around line 2111-2150: The compile and execute calls pass arguments in the
wrong order causing positional binding errors; reorder the final three args so
that stream comes before si_ and use_pool_indexing_c in both the
cute.compile(...) invocation and the subsequent
_compiled_kernels[cache_key](...) call to match the launch wrapper signatures
(e.g., gated_delta_rule_launch_seqlen1, gated_delta_rule_launch_seqlen2) which
expect (eps, stream, mState_indices, use_pool_indexing); update the argument
lists for cute.compile and the runtime call accordingly so stream is placed
before si_ and use_pool_indexing_c.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- flashinfer/gdn_decode.py
- flashinfer/gdn_kernels/gdn_decode_bf16_state.py
🚧 Files skipped from review as they are similar to previous changes (1)
- flashinfer/gdn_decode.py
Force-pushed: 834f1e6 to d207428
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
flashinfer/gdn_decode.py (1)
1832-1875: ⚠️ Potential issue | 🟡 Minor
`state_indices` parameter is absent from the docstring of `gated_delta_rule_decode`.
The new `state_indices` parameter added at line 1830 has no documentation entry in the Args section. This creates a gap between the pretranspose and nontranspose API docs.
📝 Suggested addition to Args section
     use_qk_l2norm (bool): Whether to apply L2 normalization to q and k.
         Default: ``True``.
+    state_indices (Optional[torch.Tensor]):
+        Optional int32 tensor of shape ``[B]`` mapping each batch entry to its
+        slot in the state pool. Negative values are treated as padding and the
+        corresponding output is zeroed. If ``None``, direct per-batch indexing
+        is used and ``state.shape[0]`` must equal ``B``.

 Returns:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/gdn_decode.py` around lines 1832 - 1875, The docstring for gated_delta_rule_decode is missing documentation for the new state_indices parameter; update the Args section to describe state_indices (type torch.Tensor or Optional[torch.Tensor]), its shape (e.g. [B] or [B, HV] depending on usage), dtype expectations, purpose (indices mapping into the state for batched or streaming decode), whether it can be None and how that behavior differs, and any in-place/update semantics; reference the gated_delta_rule_decode function and the state_indices parameter so the API docs for both pretranspose and nontranspose variants remain consistent.
flashinfer/gdn_kernels/gdn_decode_bf16_state.py (1)
1980-1994: 🛠️ Refactor suggestion | 🟠 Major
`gated_delta_rule` is missing the `@flashinfer_api` decorator.
All public decode API functions are expected to carry `@flashinfer_api` for API logging and crash-safe input capture. This function is the only public entry point in this file, yet it lacks the decorator.
🔧 Proposed fix
+@flashinfer_api
 def gated_delta_rule(
     A_log: torch.Tensor,
You will also need to import `flashinfer_api` (or provide a fallback) at the top of this file, as is done in `gdn_decode.py`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/gdn_kernels/gdn_decode_bf16_state.py` around lines 1980 - 1994, The public function gated_delta_rule is missing the `@flashinfer_api` decorator; add `@flashinfer_api` immediately above def gated_delta_rule to enable API logging and crash-safe input capture, and ensure flashinfer_api is imported (or a fallback defined) at the top of this module using the same import/fallback pattern used in gdn_decode.py so the decorator is available at runtime.
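The import-with-fallback pattern this comment refers to can be sketched as follows. The import path below is a deliberate placeholder, not the real flashinfer module; only the shape of the pattern (real decorator if importable, no-op otherwise) is the point:

```python
try:
    # Placeholder import path -- the real decorator ships with flashinfer.
    from some_flashinfer_module import flashinfer_api  # type: ignore
except ImportError:
    def flashinfer_api(fn):
        # No-op fallback keeps the module importable when the real
        # decorator (and its logging machinery) is unavailable.
        return fn

@flashinfer_api
def gated_delta_rule_stub(x):
    """Hypothetical public entry point, decorated as the review suggests."""
    return x + 1
```

With the fallback in place, the decorated function behaves identically whether or not the logging decorator could be imported.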
♻️ Duplicate comments (2)
flashinfer/gdn_kernels/gdn_decode_bf16_state.py (1)
2059-2069: ⚠️ Potential issue | 🟠 Major
Missing contiguity assertion on `initial_state_source` in pool-indexing path.
The async H loads use 128-bit tiled copies that assume a contiguous layout. A non-contiguous `initial_state_source` silently corrupts the pointer arithmetic used in `gH[(state_batch_idx, value_head_idx, None, None)]`.
🛡️ Proposed fix
 use_pool_indexing = initial_state_indices is not None
 if use_pool_indexing:
     assert initial_state_indices.shape == (B,), (
         f"Expected shape [{B}], got {initial_state_indices.shape}"
     )
     assert initial_state_indices.dtype == torch.int32, (
         f"Expected int32, got {initial_state_indices.dtype}"
     )
+    assert initial_state_source.is_contiguous(), (
+        "initial_state_source must be contiguous for correct pointer arithmetic in pooled mode"
+    )
     state_indices_t = initial_state_indices
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/gdn_kernels/gdn_decode_bf16_state.py` around lines 2059 - 2069, When use_pool_indexing is true the code doesn't verify that initial_state_source is contiguous which breaks the 128-bit tiled async H loads (used when indexing gH[(state_batch_idx, value_head_idx, None, None)]); update the pool-indexing branch (the path that sets state_indices_t from initial_state_indices) to either assert initial_state_source.is_contiguous() or replace it with a contiguous copy (e.g., initial_state_source = initial_state_source.contiguous()) before any indexing into gH so pointer arithmetic remains valid; reference initial_state_source, use_pool_indexing, state_indices_t and the gH[(state_batch_idx, value_head_idx, None, None)] access to locate the change.flashinfer/gdn_decode.py (1)
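Why a non-contiguous source breaks flat pointer arithmetic can be shown with a NumPy analogue (illustrative only; the real kernel performs the equivalent arithmetic inside CUDA tiled copies):

```python
import numpy as np

# A 3x4 "pool" and a transposed view of it: same memory, different layout.
pool = np.arange(12, dtype=np.float32).reshape(3, 4)
view = pool.T

assert pool.flags["C_CONTIGUOUS"]
assert not view.flags["C_CONTIGUOUS"]

# Flat indexing (row * ncols + col) only matches 2-D indexing on a
# contiguous array; raw pointer math against `view` reads wrong elements.
assert pool.ravel()[1 * 4 + 2] == pool[1, 2]

# The fix mirrors the suggested assertion: require (or make) a contiguous
# buffer before doing flat-index arithmetic on it.
dense = np.ascontiguousarray(view)
assert dense.flags["C_CONTIGUOUS"]
assert dense.ravel()[1 * 3 + 2] == view[1, 2]
```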
1881-1891: ⚠️ Potential issue | 🟠 Major
Missing bounds check: non-negative `state_indices` values not validated against `pool_size`.
Values ≥ `pool_size` in `state_indices` flow directly into `flat_idx = pool_idx * HV + i_hv` inside the kernel, causing out-of-bounds reads/writes on `h0_source`. A guard analogous to the one needed in `gated_delta_rule_decode_pretranspose` is required here.
🤖 Prompt for AI Agents
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@flashinfer/gdn_decode.py`:
- Around line 1881-1891: gated_delta_rule_decode is missing the pool-indexing
validations present in gated_delta_rule_decode_pretranspose; add a guard when
use_pool_indexing (i.e., state_indices is not None) that asserts
state.is_contiguous(), state_indices.shape == (B,), state_indices.dtype ==
torch.int32, and state_indices.is_contiguous(), mirroring the checks in
gated_delta_rule_decode_pretranspose so malformed/non-contiguous or wrong-dtype
state_indices do not reach the kernel.
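The four mirrored checks listed above can be sketched like this; a tiny stand-in object replaces a real torch tensor so the guard logic runs anywhere, and the helper name is illustrative, not the project's API:

```python
from dataclasses import dataclass


@dataclass
class TensorMeta:
    """Stand-in carrying just the attributes the guard inspects."""
    shape: tuple
    dtype: str
    _contiguous: bool = True

    def is_contiguous(self):
        return self._contiguous


def validate_pool_inputs(state, state_indices, B):
    """Mirror of the pool-indexing checks described in the review comment."""
    if state_indices is None:
        return  # use_pool_indexing is off; nothing to validate
    assert state.is_contiguous(), "state must be contiguous in pooled mode"
    assert state_indices.shape == (B,), (
        f"state_indices must have shape [{B}], got {state_indices.shape}"
    )
    assert state_indices.dtype == "int32", (
        f"state_indices must be int32, got {state_indices.dtype}"
    )
    assert state_indices.is_contiguous(), "state_indices must be contiguous"


# Well-formed pooled inputs pass; a wrong dtype is caught before the kernel:
validate_pool_inputs(TensorMeta((8, 4, 64, 128), "bfloat16"), TensorMeta((2,), "int32"), B=2)
try:
    validate_pool_inputs(TensorMeta((8, 4, 64, 128), "bfloat16"), TensorMeta((2,), "int64"), B=2)
except AssertionError as err:
    print("rejected:", err)
```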
In `@flashinfer/gdn_kernels/gdn_decode_bf16_state.py`:
- Around line 2059-2069: initial_state_indices values are never bounds-checked
against the pool size (initial_state_source.shape[0]) which can cause
out-of-bounds accesses when used as indices (e.g., in gH[...] indexing); update
the use_pool_indexing branch that sets state_indices_t to validate every entry
is >= 0 and < pool_size (and still dtype torch.int32 and on q.device), raising a
clear error or clamping/adjusting as appropriate; reference the symbols
initial_state_indices, initial_state_source.shape[0], state_indices_t,
use_pool_indexing and ensure the check happens before assigning state_indices_t
(or convert and move to device first) so downstream kernels cannot read/write
out-of-range indices.
---
Outside diff comments:
In `@flashinfer/gdn_decode.py`:
- Around line 1832-1875: The docstring for gated_delta_rule_decode is missing
documentation for the new state_indices parameter; update the Args section to
describe state_indices (type torch.Tensor or Optional[torch.Tensor]), its shape
(e.g. [B] or [B, HV] depending on usage), dtype expectations, purpose (indices
mapping into the state for batched or streaming decode), whether it can be None
and how that behavior differs, and any in-place/update semantics; reference the
gated_delta_rule_decode function and the state_indices parameter so the API docs
for both pretranspose and nontranspose variants remain consistent.
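An illustrative Args entry for the missing parameter (the wording is a suggestion, not the project's actual docstring; shape and dtype follow the assertions quoted elsewhere in this review):

```python
def gated_delta_rule_decode_sketch():
    """(existing description...)

    Args:
        state_indices (Optional[torch.Tensor]): Optional ``[B]`` int32 tensor of
            slot indices into the state pool (``state.shape[0]``). When given,
            batch ``b`` reads and updates ``state[state_indices[b]]`` in place
            (zero-copy pooled mode), and ``state.shape[0]`` may exceed ``B``.
            When ``None``, ``state`` is indexed directly by batch and must
            satisfy ``state.shape[0] == B``.
    """
```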
In `@flashinfer/gdn_kernels/gdn_decode_bf16_state.py`:
- Around line 1980-1994: The public function gated_delta_rule is missing the
`@flashinfer_api` decorator; add `@flashinfer_api` immediately above def
gated_delta_rule to enable API logging and crash-safe input capture, and ensure
flashinfer_api is imported (or a fallback defined) at the top of this module
using the same import/fallback pattern used in gdn_decode.py so the decorator is
available at runtime.
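The import-with-fallback pattern the comment refers to can be sketched as follows; the import path is hypothetical (mirror whatever `gdn_decode.py` actually uses), and the no-op fallback keeps the module importable when the decorator is unavailable:

```python
try:
    # Hypothetical import path — mirror whatever gdn_decode.py actually imports.
    from flashinfer.api_logging import flashinfer_api
except ImportError:
    def flashinfer_api(func):
        """No-op fallback so the decorator name is always defined."""
        return func


@flashinfer_api
def gated_delta_rule_stub(*args, **kwargs):
    """Stands in for the public entry point that should carry the decorator."""
    return "ok"


print(gated_delta_rule_stub())
```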
---
Duplicate comments:
In `@flashinfer/gdn_decode.py`:
- Around line 1881-1891: The code currently lacks validation that entries in
state_indices are within [0, pool_size-1], which allows values >= pool_size to
drive flat_idx = pool_idx * HV + i_hv and cause OOB access on h0_source; add a
bounds check similar to gated_delta_rule_decode_pretranspose: when state_indices
is not None, verify all values are integers >= 0 and < pool_size (and fail with
an informative assertion or raised error referencing pool_size and B), or
alternatively explicitly clamp/validate before computing flat_idx; ensure you
reference state_indices, pool_size, flat_idx and h0_source when adding the check
so the guard prevents any out-of-bounds indexing.
In `@flashinfer/gdn_kernels/gdn_decode_bf16_state.py`:
- Around line 2059-2069: When use_pool_indexing is true the code doesn't verify
that initial_state_source is contiguous which breaks the 128-bit tiled async H
loads (used when indexing gH[(state_batch_idx, value_head_idx, None, None)]);
update the pool-indexing branch (the path that sets state_indices_t from
initial_state_indices) to either assert initial_state_source.is_contiguous() or
replace it with a contiguous copy (e.g., initial_state_source =
initial_state_source.contiguous()) before any indexing into gH so pointer
arithmetic remains valid; reference initial_state_source, use_pool_indexing,
state_indices_t and the gH[(state_batch_idx, value_head_idx, None, None)] access
to locate the change.
Actionable comments posted: 28
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
flashinfer/gdn_decode.py (1)
1076-1107: ⚠️ Potential issue | 🟠 Major
Legacy pretranspose path (float32 state) doesn't guard against pool indexing — will crash on reshape.
When `state_indices` is provided with float32 state (i.e. `use_gdn_decode_klast_bf16_state` is false), execution falls through to the legacy path. Line 1107 does `state.reshape(B * HV, V, K)`, but `state.shape[0]` is `pool_size` (which differs from `B` in pooled mode), causing a reshape failure. The cached `h0_indices` at line 1115 is always zeros, so even if the reshape didn't crash, the kernel wouldn't use the real `state_indices`. Either add an assertion rejecting pool indexing in the legacy path or implement proper support.
Minimal guard (recommended)
```diff
  # Legacy path: T=1 only, float32 state
  assert T == 1, f"Decode only supports T=1, got T={T}"
  assert state.dtype == torch.float32, f"state must be float32, got {state.dtype}"
+ assert not use_pool_indexing, (
+     "Pool indexing (state_indices) is not supported with float32 state in the "
+     "pretranspose legacy path. Use bfloat16 state for pool indexing support."
+ )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/gdn_decode.py` around lines 1076 - 1107, The legacy pretranspose path performs state.reshape(B * HV, V, K) and always uses a zeroed h0_indices, which will crash or mis-index when state_indices (pooled mode) is provided; update the logic in the branch guarded by use_gdn_decode_klast_bf16_state == False to either (a) assert that state_indices is None (reject pooled indexing) by checking state_indices is None before the reshape and raising a clear assertion mentioning state_indices and the legacy path, or (b) implement proper pooled support by using state_indices to build h0_indices and reshape/reindex state into h0_source correctly (use state.shape[0] as pool_size, gather the per-batch entries into [B*HV, V, K] using state_indices, and replace the zeroed h0_indices usage), ensuring h0_source and h0_indices are consistent for the kernel call.
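Option (b) above — gathering pooled entries into a batch-indexed layout the legacy kernel expects — looks roughly like this pure-Python sketch (lists stand in for tensors; the real code would use torch indexing such as `state[state_indices]`, and the helper name is hypothetical):

```python
def gather_pooled_state(state_pool, state_indices):
    """Gather per-batch entries from a state pool.

    state_pool: pool_size entries, each holding one request's state payload.
    state_indices: for batch b, which pool slot holds its state.
    Returns the B gathered entries plus h0_indices as identity (0..B-1),
    since after gathering the kernel can index by batch position directly.
    """
    gathered = [state_pool[i] for i in state_indices]
    h0_indices = list(range(len(state_indices)))
    return gathered, h0_indices


pool = ["slot0", "slot1", "slot2", "slot3"]               # pool_size = 4
gathered, h0_indices = gather_pooled_state(pool, [2, 0])  # B = 2
print(gathered)     # ['slot2', 'slot0']
print(h0_indices)   # [0, 1]
```

Note the trade-off: gathering copies state, which is exactly the overhead the pooled zero-copy path in this PR is meant to avoid, so the assertion (option (a)) is the cheaper fix for the legacy path.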
♻️ Duplicate comments (20)
sglang_shadow/python/sglang/bench_offline_throughput.py (1)
1-1: Same broken symlink / raw path issue as the other `sglang_shadow/` files. See the consolidated comment on `vision_utils.py`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sglang_shadow/python/sglang/bench_offline_throughput.py` at line 1, This file uses the same broken absolute symlink/raw-path pattern found in other sglang_shadow modules; replicate the fix applied in vision_utils.py by replacing hard-coded filesystem paths with package-relative resource loading (e.g., importlib.resources or pkgutil) or by resolving symlinks programmatically, and update any path construction/usage in bench_offline_throughput.py to call the same helper or utility function used in vision_utils.py so the file no longer relies on raw absolute paths or broken symlinks.
sglang_shadow/python/sglang/srt/layers/attention/torch_flex_backend.py (1)
1-1: Same broken symlink / raw path issue as the other `sglang_shadow/` files. See the consolidated comment on `vision_utils.py`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sglang_shadow/python/sglang/srt/layers/attention/torch_flex_backend.py` at line 1, The file torch_flex_backend.py in the sglang_shadow package contains a hardcoded/incorrect raw filesystem path or broken symlink reference (same issue as other sglang_shadow files); update this module to use package-relative resource resolution instead of the absolute/raw path — locate and replace any occurrences of the raw path or symlink usage in torch_flex_backend.py with a stable approach (e.g., importlib.resources / pathlib with package-relative __file__ resolution or a proper package data lookup) so the module loads resources consistently across environments; ensure the fix mirrors the solution applied in vision_utils.py and adjust any import or resource-loading calls in torch_flex_backend.py accordingly.
sglang_shadow/python/sglang/srt/layers/attention/cutlass_mla_backend.py (1)
1-1: Same broken symlink / raw path issue as the other `sglang_shadow/` files. See the consolidated comment on `vision_utils.py`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sglang_shadow/python/sglang/srt/layers/attention/cutlass_mla_backend.py` at line 1, The file sglang.srt.layers.attention.cutlass_mla_backend contains the same broken symlink/raw filesystem path as other sglang_shadow modules; fix it by replacing the hard-coded path usage with a proper package/module import or a valid repository-relative symlink so the module can be imported normally (e.g., ensure cutlass_mla_backend is accessible via the sglang package namespace); update any import statements in cutlass_mla_backend.py to use package-relative imports (import sglang.srt.layers.attention...) or remove the raw path and add the file into the package so it resolves at runtime.
sglang_shadow/python/sglang/srt/layers/attention/tbo_backend.py (1)
1-1: Same broken symlink / raw path issue as the other `sglang_shadow/` files. See the consolidated comment on `vision_utils.py`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sglang_shadow/python/sglang/srt/layers/attention/tbo_backend.py` at line 1, This module still contains the same broken absolute/symlink path usage as other sglang_shadow files; locate any hard-coded path strings or os.path calls in tbo_backend.py (top-level variables, functions or helpers that reference filesystem paths) and replace them with package-relative resource access (e.g., use Path(__file__).resolve().parent / "<relative_name>" or importlib.resources.read_binary/text to load bundled files) so the code no longer depends on /home/... raw paths or broken symlinks; ensure any path resolution uses .resolve() and relative joins and update the import/loading logic in the relevant functions/classes in tbo_backend.py to use those package-relative resources.
sglang_shadow/python/sglang/srt/layers/attention/utils.py (1)
1-1: Same broken symlink / raw path issue as the other `sglang_shadow/` files. See the consolidated comment on `vision_utils.py`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sglang_shadow/python/sglang/srt/layers/attention/utils.py` at line 1, The file sglang_shadow/python/sglang/srt/layers/attention/utils.py contains the same broken symlink/raw absolute path issue as vision_utils; replace the raw absolute path entry with the proper module content or a package-relative import so the module sglang.srt.layers.attention.utils resolves correctly. Specifically, remove the hardcoded "/home/..." path or symlink placeholder at the top of utils.py and restore the actual implementation (or copy the corresponding implementation from the non-shadow module), ensuring imports use package-relative names (e.g., from sglang.srt.layers.attention import ...) and that the file is a real Python file rather than a broken symlink.
sglang_shadow/python/sglang/srt/layers/radix_attention.py (1)
1-1: Same broken symlink / raw path issue as the other `sglang_shadow/` files. See the consolidated comment on `vision_utils.py` — this file also contains only a local filesystem path instead of Python source.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sglang_shadow/python/sglang/srt/layers/radix_attention.py` at line 1, The file radix_attention.py currently contains only a raw filesystem path (broken symlink) rather than Python source; replace the placeholder content with the actual implementation of the RadixAttention module (restore the real class/function definitions that should live in radix_attention.py), or remove the stray file if it was accidentally added and import the real module from the correct package; ensure the module defines the expected public symbols (e.g., RadixAttention class or radix_attention_* functions used elsewhere) so imports elsewhere in the codebase resolve correctly.
sglang_shadow/python/sglang/__init__.py (1)
1-1: Same broken symlink / raw path issue as the other `sglang_shadow/` files. See the consolidated comment on `vision_utils.py`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sglang_shadow/python/sglang/__init__.py` at line 1, The sglang_shadow/python/sglang/__init__.py file is a broken symlink/raw path copy like the other sglang_shadow files; replace it with a proper package-init that re-exports the real package instead of referencing the absolute path. Edit __init__.py in the sglang_shadow package to remove the symlink/path reference and instead import and re-export the real sglang symbols (e.g., use explicit imports or wildcard re-exports from the canonical sglang package) so consumers of sglang_shadow import via normal package imports rather than a raw filesystem path.
sglang_shadow/python/sglang/srt/layers/rocm_linear_utils.py (1)
1-1: Same blocker as above: absolute local-path artifact in source. Line 1 repeats the same non-portable absolute-path artifact and should be replaced by a regular source file.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sglang_shadow/python/sglang/srt/layers/rocm_linear_utils.py` at line 1, The file sglang_shadow/python/sglang/srt/layers/rocm_linear_utils.py currently contains a non-portable absolute filesystem path on its first line; remove that path-only line and replace it with the real Python module source (or a proper stub) so the file is a normal importable module. Ensure you restore any original functions/classes expected from this module (e.g., rocm_linear_* helpers or utility functions) or, if not available, add a minimal well-documented stub (module docstring and exported symbols) so imports of rocm_linear_utils succeed across environments.
sglang_shadow/python/sglang/srt/layers/layernorm.py (1)
1-1: Same blocker as above: absolute local-path artifact in source. This file has the same non-portable absolute-path artifact and should be replaced with a real tracked module.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sglang_shadow/python/sglang/srt/layers/layernorm.py` at line 1, The file contains a non-portable absolute local-path artifact that should be removed and replaced with the proper tracked module import; open sglang/srt/layers/layernorm.py, remove any hard-coded absolute path string or sys.path manipulation that references a local workspace, and replace it with a normal import of the shared/packaged LayerNorm implementation (e.g., import the canonical module or class such as LayerNorm from the project package instead of referencing /home/...); ensure tests/imports elsewhere reference the new package import and update any top-level variables or factory functions in layernorm.py to delegate to the packaged implementation.
sglang_shadow/python/sglang/srt/layers/multimodal.py (1)
1-1: Same blocker as above: absolute local-path artifact in source. Line 1 contains the same machine-local path artifact and should be replaced by a normal module file.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sglang_shadow/python/sglang/srt/layers/multimodal.py` at line 1, The file sglang.srt.layers.multimodal (multimodal.py) currently contains an absolute machine-local filesystem path as its first line; remove that path artifact and replace it with a normal Python module header (e.g., a module docstring or relevant imports) so the file is a valid module. Specifically, delete the literal "/home/.../multimodal.py" line, add an appropriate top-of-file docstring or imports used by this module (retain or re-add any real definitions like classes or functions in this file such as multimodal layer classes or helper functions), and ensure there are no other hard-coded local paths left in multimodal.py; run linters/tests to confirm the module imports correctly.
sglang_shadow/python/sglang/srt/layers/modelopt_utils.py (1)
1-1: Same blocker as above: absolute local-path artifact in source. Line 1 repeats the same host-absolute path issue; please replace with a normal module file.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sglang_shadow/python/sglang/srt/layers/modelopt_utils.py` at line 1, The top of modelopt_utils.py contains a host-absolute filesystem path string instead of module code; remove that absolute path line and replace it with a normal module header (optional module docstring or imports) so the file is a valid Python module; ensure no other absolute local-path artifacts remain in sglang/srt/layers/modelopt_utils.py and run tests/import to verify functions/classes in modelopt_utils (e.g., any functions or classes defined in that file) are importable after the change.
sglang_shadow/python/sglang/srt/layers/attention/hybrid_attn_backend.py (1)
1-1: Same blocker as above: absolute local-path artifact in source. Line 1 shows the same host-specific path artifact; this should be replaced with a real in-repo module file.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sglang_shadow/python/sglang/srt/layers/attention/hybrid_attn_backend.py` at line 1, The file hybrid_attn_backend.py contains a host-specific absolute path artifact on line 1; remove that literal path and replace it with a proper in-repo module or relative import (e.g., import the equivalent module from the package or add the module file to the repo and use a relative import). Ensure any references in hybrid_attn_backend.py to the external path are updated to import the correct in-repo symbol (or class/function) and that the module is added to the package __init__ if needed so imports resolve at runtime.
sglang_shadow/python/sglang/srt/environ.py (1)
1-1: Same blocker as above: absolute local-path artifact in source. This file also contains the same machine-local absolute path artifact and needs conversion to a real Python module.
sglang_shadow/python/sglang/srt/multiplex (1)
1-1: Same blocker as above: absolute local-path artifact in tracked path. This path also appears to be an absolute local-path artifact. Please replace it with the intended in-repo module/package structure.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sglang_shadow/python/sglang/srt/multiplex` at line 1, The tracked artifact is an absolute local filesystem path (sglang_shadow/python/sglang/srt/multiplex) that must be replaced with the in-repo package/module layout; move the contents into the repository package (e.g., package path under python/sglang/srt/multiplex) or update references to use the package import path (sglang.srt.multiplex) instead of an absolute path, and remove any hard-coded local-path entries from manifests, setup/config, or import statements that reference sglang_shadow.
sglang_shadow/python/sglang/srt/layers/sparse_pooler.py (1)
1-1: Broken symlink — same issue as noted on `trtllm_mha_backend.py`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sglang_shadow/python/sglang/srt/layers/sparse_pooler.py` at line 1, The file sglang.srt.layers.sparse_pooler.py is a broken symlink; replace the symlink with the real source file or restore the missing target so Python can import it. Locate the symlink currently referenced as sparse_pooler.py in the sglang.srt.layers package, remove and replace it with the actual module implementation (or recreate the correct symlink target), then run the same fix you applied for trtllm_mha_backend.py to ensure imports (e.g., from sglang.srt.layers.sparse_pooler import ...) resolve successfully.
sglang_shadow/python/sglang/srt/layers/attention/nsa_backend.py (1)
1-1: Broken symlink — same issue as noted on `trtllm_mha_backend.py`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sglang_shadow/python/sglang/srt/layers/attention/nsa_backend.py` at line 1, The repository contains a broken symlink for sglang/srt/layers/attention/nsa_backend.py (same problem as trtllm_mha_backend.py); fix it by replacing the broken symlink with the correct implementation or a valid symlink target: locate the symlink at nsa_backend.py, remove the broken link (git rm) and either add the real file content for nsa_backend.py or recreate the symlink to the correct existing file, then commit so imports referencing nsa_backend.py resolve correctly.
sglang_shadow/python/sglang/srt/layers/attention/triton_backend.py (1)
1-1: Broken symlink — same issue as noted on `trtllm_mha_backend.py`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sglang_shadow/python/sglang/srt/layers/attention/triton_backend.py` at line 1, The triton_backend.py file is a broken symlink (same problem as trtllm_mha_backend.py); replace the broken symlink with the correct implementation or a valid symlink to the real backend module used in this repo, or remove the symlink and add the proper Python module content/export so imports work; ensure the module name triton_backend.py resolves to the actual backend (or mirrors trtllm_mha_backend.py behaviour) and update any imports referencing srt.layers.attention.triton_backend to point to the real module.
sglang_shadow/python/sglang/srt/layers/attention/flashinfer_mla_backend.py (1)
1-1: Broken symlink — same issue as noted on `trtllm_mha_backend.py`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sglang_shadow/python/sglang/srt/layers/attention/flashinfer_mla_backend.py` at line 1, The file sglang/srt/layers/attention/flashinfer_mla_backend.py is a broken symlink (same problem as trtllm_mha_backend.py); replace or remove the symlink and ensure the real implementation file is present and imported correctly: either restore the original target file for flashinfer_mla_backend.py or convert the symlink into a real Python module file with the expected symbols (functions/classes used by the codebase), and update any imports that referenced flashinfer_mla_backend.py to point to the correct module; mirror the fix you applied for trtllm_mha_backend.py so builds and imports no longer fail.
sglang_shadow/python/sglang/bench_serving.py (1)
1-1: Broken symlink — same issue as noted on `trtllm_mha_backend.py`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sglang_shadow/python/sglang/bench_serving.py` at line 1, The bench_serving.py entry is a broken symlink (same problem as trtllm_mha_backend.py); fix it by replacing the symlink at sglang/bench_serving.py with the actual Python file (or update the symlink target to the correct, existing path) so imports and CI can find it; ensure the new file content matches the intended implementation, add and commit the real file (or corrected symlink) to the repo so bench_serving.py is no longer a dangling link.
sglang_shadow/python/sglang/srt/layers/attention/base_attn_backend.py (1)
1-1: Broken symlink — same issue as noted on `trtllm_mha_backend.py`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sglang_shadow/python/sglang/srt/layers/attention/base_attn_backend.py` at line 1, The file sglang/srt/layers/attention/base_attn_backend.py is a broken symlink (same problem as trtllm_mha_backend.py); replace the symlink with a real Python module file or recreate the symlink to point to the correct target so imports succeed. Specifically, ensure base_attn_backend.py is a regular file (not a dangling link) containing the expected backend implementation (or correctly points to the actual implementation used by attention backends), and verify any imports referencing base_attn_backend.py and trtllm_mha_backend.py resolve without errors.
🟠 Major comments (6)
flashinfer/gdn_decode.py-1959-1966 (1)
1959-1966: ⚠️ Potential issue | 🟠 Major
Add `pool_size` to the cache_key to handle variable pool sizes when using `state_indices`.
When `use_pool_indexing=True` (state_indices provided), `pool_size` can vary independently of `B` per the assertion at line 1903-1906. However, the cache_key at line 1966 omits `pool_size`, allowing the same compiled kernel to be reused for different `h0_source` shapes `[pool_size * HV, K, V]`. This inconsistency contrasts with the similar `gated_delta_rule_mtp` function (lines 2592-2602), which explicitly includes `pool_size` in its cache_key. Include `pool_size` in the cache_key to ensure correct kernel reuse across varying pool sizes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/gdn_decode.py` around lines 1959 - 1966, The kernel cache key omits pool_size causing kernels compiled for one pool_size to be reused for different h0_source shapes; update the cache_key tuple (used where cache_key is assigned in gdn_decode) to include pool_size so it matches the actual h0_source layout (state_contiguous and h0_source) and mirrors the gated_delta_rule_mtp behavior when use_pool_indexing/state_indices can vary pool_size.
sglang_shadow/python/sglang/srt/layers/attention/mamba-1-1 (1)
1-1: ⚠️ Potential issue | 🟠 Major
Avoid committing machine-local absolute path targets in the shadow package.
Line 1 embeds a local `/home/...` path; this will not resolve on other machines and makes the shadow tree brittle.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sglang_shadow/python/sglang/srt/layers/attention/mamba` at line 1, The shadow package contains a machine-local absolute path string ("/home/...") in sglang/srt/layers/attention/mamba which must be removed; edit the mamba shadow file to delete the hard-coded absolute path and replace it with a relative path or a proper package/resource reference (e.g., use package-relative imports or resource lookup APIs) so the package does not embed local filesystem targets, and ensure any tests/build scripts that generated this entry use relative paths or package_data instead.
sglang_shadow/python/sglang/srt/distributed-1-1 (1)
1-1: ⚠️ Potential issue | 🟠 Major
Use repo-relative links or real files for shadow distributed package entries.
Line 1 is a workstation-specific absolute path, which will fail on any other machine.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sglang_shadow/python/sglang/srt/distributed` at line 1, The first line contains a workstation-specific absolute path instead of a portable reference for the shadow distributed package; replace that absolute path with either a repo-relative reference or commit the real package files and point the shadow entry to the module sglang.srt.distributed (i.e., update the shadow package entry so it references the project-relative package/module rather than an absolute filesystem path).
flashinfer/gdn_kernels/gdn_decode_bf16_state.py-2060-2082 (1)
2060-2082: ⚠️ Potential issue | 🟠 Major
Validate `initial_state_indices` device before `from_dlpack`.
`initial_state_indices` is checked for shape/dtype, but not that it lives on the same device as `q`. A CPU tensor here can fail later in kernel plumbing with less actionable errors.
🛡️ Suggested fix
```diff
  if use_pool_indexing:
      assert initial_state_indices.shape == (B,), (
          f"Expected shape [{B}], got {initial_state_indices.shape}"
      )
      assert initial_state_indices.dtype == torch.int32, (
          f"Expected int32, got {initial_state_indices.dtype}"
      )
+     assert initial_state_indices.device == q.device, (
+         f"initial_state_indices must be on {q.device}, got {initial_state_indices.device}"
+     )
      assert initial_state_source.is_contiguous(), (
          "initial_state_source must be contiguous for correct pointer arithmetic in pooled mode"
      )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/gdn_kernels/gdn_decode_bf16_state.py` around lines 2060 - 2082, The code validates shape/dtype for initial_state_indices but not its device relative to q, which can cause cryptic failures later (e.g., during from_dlpack); add a device check and coerce or fail fast: verify initial_state_indices.device == q.device (where q is the query tensor used later) and if not either transfer initial_state_indices to q.device (initial_state_indices = initial_state_indices.to(q.device)) or raise an assert mentioning q.device and initial_state_indices.device; update the block around the use of initial_state_indices, initial_state_source, and any subsequent from_dlpack usage to perform this check before proceeding with pointer arithmetic or DLPack conversions.
bench_gdn_kernel.py-70-75 (1)
70-75: ⚠️ Potential issue | 🟠 Major
Remove the duplicate `time_ported` computation.
Line 74 recomputes `time_ported` using the old `start`, inflating the reported ported latency.
🐛 Suggested fix
```diff
  torch.cuda.synchronize()
  time_ported = (time.perf_counter() - start) / repeat
  # Clear cache to avoid shape mismatch on h0_source (pool vs gathered)
-
- time_ported = (time.perf_counter() - start) / repeat
-
- # Clear cache to avoid shape mismatch on h0_source (pool vs gathered)
  _get_compiled_decode_kernel_nontranspose.cache_clear()
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bench_gdn_kernel.py` around lines 70 - 75, The duplicate computation of time_ported is causing an inflated ported latency; remove the second reassignment "time_ported = (time.perf_counter() - start) / repeat" (the one after the cache-clear comment) so time_ported is only computed once using the original start and repeat variables; keep the first computation and the cache-clear comment related to h0_source intact.
bench_gdn_kernel.py-29-31 (1)
29-31: ⚠️ Potential issue | 🟠 Major
State tensor must use BF16 dtype to test the optimized kernel path.
Line 30 allocates `state_pool` with hardcoded `dtype=torch.float32`, but the BF16-state kernel dispatch (line 1040 in `flashinfer/gdn_decode.py`) requires `state.dtype == torch.bfloat16`. This benchmark cannot hit the new optimized kernel path as written.
Suggested fix
```diff
  state_pool = torch.randn(
-     pool_size, num_value_heads, K, V, dtype=torch.float32, device=device
+     pool_size, num_value_heads, K, V, dtype=dtype, device=device
  )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bench_gdn_kernel.py` around lines 29 - 31, The benchmark allocates state_pool with dtype=torch.float32 so the BF16 optimized kernel in flashinfer/gdn_decode.py never triggers; change the allocation for state_pool to use torch.bfloat16 (or cast it immediately after creation) so state_pool.dtype == torch.bfloat16, ensuring the BF16-state kernel dispatch path can be exercised; ensure the device supports bfloat16 if necessary and keep the same shape/variables (state_pool, K, V, num_value_heads, pool_size) when changing the dtype.
🟡 Minor comments (3)
flashinfer/gdn_decode.py-953-953 (1)
953-953: ⚠️ Potential issue | 🟡 Minor
`state_indices` parameter is missing from the docstring.
The Args section (lines 960–998) documents all parameters up through `use_qk_l2norm` but omits the newly added `state_indices`. The nontranspose variant at line 1874 does document it — please mirror that here for consistency.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/gdn_decode.py` at line 953, The docstring's Args section is missing documentation for the newly added parameter state_indices; update the Args block in the function's docstring to include a description for state_indices (type Optional[torch.Tensor]) matching the wording and semantics used in the nontranspose variant's docstring (see state_indices entry around line 1874) so both variants are consistent; locate the function in flashinfer/gdn_decode.py that declares state_indices: Optional[torch.Tensor] = None and add the same parameter description to the Args list.
flashinfer/gdn_decode.py-1896-1902 (1)
1896-1902: ⚠️ Potential issue | 🟡 Minor

Misleading comment: says "K-last" but the assertion checks K-major layout `(HV, K, V)`.

The comment on line 1899 says "Qwen-style K-last" but the assertion verifies `state.shape[1:] == (HV, K, V)`, which is K-major (V-last). The pretranspose variant at line 1006 correctly says "K-last" for `(HV, V, K)`. This is confusing for anyone cross-referencing the two paths.

Suggested fix

```diff
-    # Validate state shape (Qwen-style K-last: [*, HV, V, K])
+    # Validate state shape (K-major: [*, HV, K, V])
     assert state.shape[1:] == (HV, K, V), (
         f"Expected state shape [*, HV={HV}, K={K}, V={V}], got {state.shape}"
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/gdn_decode.py` around lines 1896 - 1902, The comment "Qwen-style K-last" is incorrect for the post-transpose path: the assertion checks state.shape[1:] == (HV, K, V) which is K-major (V-last), not K-last. Update the inline comment near use_pool_indexing/state to describe the actual layout — e.g., "Qwen-style K-major (V-last): [*, HV, K, V]" — and optionally add a note contrasting it with the pretranspose variant that uses (HV, V, K) to avoid confusion; keep variable names (state, HV, K, V, use_pool_indexing, state_indices) as reference.

bench_gdn_kernel.py-82-83 (1)
82-83: ⚠️ Potential issue | 🟡 Minor

Use `_` for intentionally unused outputs.

`out` is assigned but never read in warmup/baseline loops.

🧹 Suggested cleanup

```diff
-out, temp_state = gated_delta_rule_decode(
+_, temp_state = gated_delta_rule_decode(
@@
-out, temp_state = gated_delta_rule_decode(
+_, temp_state = gated_delta_rule_decode(
```

Also applies to: 107-108
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bench_gdn_kernel.py` around lines 82 - 83, The variable out from the gated_delta_rule_decode calls is assigned but never used; change the tuple unpacking to discard the unused output by replacing "out, temp_state =" with "_, temp_state =" (and similarly for the other occurrence around lines 107-108) so only temp_state is kept; keep the function call and q argument the same and ensure no other logic depends on out.
🧹 Nitpick comments (7)
flashinfer/gdn_decode.py (1)
1690-1691: Dead expression: the `h0_indices.layout.shape[0]` result is discarded.

Line 1690 evaluates `h0_indices.layout.shape[0]` but the result is not assigned to anything. Same issue at line 1771 in `run_gdn_decode_kernel_big_batch_nontranspose`. These appear to be leftover debug statements.

Remove dead expressions

```diff
 # h0_source is flattened to [B*HV, K, V] to ensure proper alignment for SIMT async copy
 batch_hv_dim, k_dim, v_dim = h0_source.layout.shape
-h0_indices.layout.shape[0]
 batch_size = B * HV  # Use actual batch size, not pool size
```

Apply the same removal at line 1771.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/gdn_decode.py` around lines 1690 - 1691, Remove the dead expression evaluations of h0_indices.layout.shape[0] that are left as standalone statements; in the function where batch_size is computed (the block setting batch_size = B * HV) delete the preceding unused line "h0_indices.layout.shape[0]" and apply the same removal in run_gdn_decode_kernel_big_batch_nontranspose (the analogous stray h0_indices.layout.shape[0] at line ~1771) so the code only computes and uses batch_size without the discarded expression.tests/gdn/test_gdn_decode_pooled.py (1)
19-22: Make the SM gate future-proof.Line 21 hardcodes
[9, 10, 11, 12]; this will skip tests on newer supported SM majors.♻️ Suggested change
def _skip_if_not_sm90_or_later(): cc = get_compute_capability(torch.device("cuda")) - if cc[0] not in [9, 10, 11, 12]: + if cc[0] < 9: pytest.skip(f"GDN decode requires SM90+, got SM{cc[0]}{cc[1]}")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/gdn/test_gdn_decode_pooled.py` around lines 19 - 22, The helper _skip_if_not_sm90_or_later currently hardcodes supported SM majors ([9,10,11,12]) causing future SM majors to be skipped; update the check in _skip_if_not_sm90_or_later (which calls get_compute_capability) to allow any major >= 9 (e.g., use cc[0] < 9 to decide pytest.skip) and keep the pytest.skip message but format it generically (e.g., "GDN decode requires SM90+, got SM{cc[0]}{cc[1]}") so newer SM majors pass without needing further edits.

run_bench_local.sh (1)
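The open-ended gate can be sanity-checked with a tiny torch-free sketch (the function name is hypothetical; the real helper queries the CUDA device):

```python
def is_supported_sm(major: int) -> bool:
    # Open-ended check from the suggested fix: any SM major >= 9 passes,
    # so a future architecture is not skipped the way a hardcoded
    # [9, 10, 11, 12] allowlist would skip it.
    return major >= 9

assert not is_supported_sm(8)   # pre-Hopper: skip
assert is_supported_sm(9)       # SM90: run
assert is_supported_sm(13)      # hypothetical future major: still runs
```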
249-251: `restore_to_ported` is called but never defined.

Line 250 adds a cleanup hook that currently does nothing except rely on `command not found` being swallowed.

♻️ Suggested guard

```diff
 cleanup() {
-    restore_to_ported 2>/dev/null || true
+    if command -v restore_to_ported >/dev/null 2>&1; then
+        restore_to_ported 2>/dev/null || true
+    fi
     kill_server 2>/dev/null || true
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@run_bench_local.sh` around lines 249 - 251, The cleanup function calls restore_to_ported which is not defined; either add a small restore_to_ported implementation (e.g., a function that reverses any port forwarding or is a no-op) or guard the call so missing-command errors aren't relied on—e.g., check command -v or type for restore_to_ported before calling it (or define restore_to_ported() {}), and keep kill_server as-is; update the cleanup block to call the defined or conditionally-invoked restore_to_ported to ensure the cleanup hook is deterministic.

sglang_shadow/python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py (4)
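A minimal standalone sketch of the `command -v` guard behavior (the `restore_to_ported` stub below is illustrative, not the real helper): the hook is a deterministic no-op while the function is undefined and runs it once it exists.

```shell
#!/bin/sh
# Guarded cleanup: only invoke restore_to_ported if it is actually defined.
cleanup() {
  if command -v restore_to_ported >/dev/null 2>&1; then
    restore_to_ported 2>/dev/null || true
  else
    echo "restore_to_ported not defined, skipping"
  fi
}

cleanup                                    # helper not yet defined: skips
restore_to_ported() { echo "restored"; }   # illustrative stub definition
cleanup                                    # helper now defined: runs it
```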
67-71: Inconsistent cleanup of global handles in the `except` block.

On import failure, `_flashinfer_chunk_gated_delta_rule` and `_flashinfer_gated_delta_rule_mtp` are not explicitly reset to `None`, unlike the decode handles. While they happen to remain `None` from module-level initialization, this is fragile if a future code path partially sets them before failure.

Proposed fix

```diff
 except (ImportError, RuntimeError) as e:
     logger.warning(f"FlashInfer GDN kernels not available: {e}")
     _flashinfer_gdn_available = False
+    _flashinfer_chunk_gated_delta_rule = None
+    _flashinfer_gated_delta_rule_mtp = None
     _flashinfer_gated_delta_rule_decode = None
     _flashinfer_gated_delta_rule_decode_nontranspose = None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sglang_shadow/python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py` around lines 67 - 71, The except block that handles ImportError/RuntimeError leaves _flashinfer_chunk_gated_delta_rule and _flashinfer_gated_delta_rule_mtp untouched, which is fragile; update the except block in hybrid_linear_attn_backend.py (the handler that currently sets _flashinfer_gdn_available, _flashinfer_gated_delta_rule_decode, and _flashinfer_gated_delta_rule_decode_nontranspose) to also explicitly set _flashinfer_chunk_gated_delta_rule and _flashinfer_gated_delta_rule_mtp to None so all FlashInfer kernel handles are consistently cleared on failure.
1334-1337: Environment variable lookup on every decode forward call.
`os.environ.get("SGLANG_TEST_FORCE_GATHER_SCATTER", "0")` is evaluated on every `forward_decode` invocation in V-last FlashInfer mode. While the dict lookup itself is cheap, for a hot decode path this could be cached once during `__init__` instead.

Proposed fix

In `__init__`, cache the flag:

```python
self._force_gather_scatter = os.environ.get("SGLANG_TEST_FORCE_GATHER_SCATTER", "0") == "1"
```

Then in `forward_decode`:

```diff
-force_gather_scatter = (
-    os.environ.get("SGLANG_TEST_FORCE_GATHER_SCATTER", "0") == "1"
-)
+force_gather_scatter = self._force_gather_scatter
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sglang_shadow/python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py` around lines 1334 - 1337, The environment lookup for SGLANG_TEST_FORCE_GATHER_SCATTER is performed on every forward_decode call; move that check into the class initializer by adding a cached boolean (e.g. set self._force_gather_scatter = os.environ.get("SGLANG_TEST_FORCE_GATHER_SCATTER", "0") == "1" inside __init__) and then update forward_decode to read that cached self._force_gather_scatter instead of calling os.environ.get or defining a local force_gather_scatter each invocation; ensure references to force_gather_scatter in forward_decode use the new instance attribute name.
817-819: Local re-import of `causal_conv1d_fn` shadows module-level CUDA override.

Inside `forward_extend`, the Triton version of `causal_conv1d_fn` is explicitly re-imported, overriding the module-level CUDA variant (set at line 114). If this is intentional (e.g., the Triton version supports extra kwargs like `has_initial_state`), a brief comment would prevent future confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sglang_shadow/python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py` around lines 817 - 819, The local re-import of causal_conv1d_fn inside forward_extend is shadowing the module-level CUDA-assigned variant (set earlier) and causing confusion; either remove the local import so forward_extend uses the module-level causal_conv1d_fn, or if you intentionally need the Triton-specific implementation (e.g., for has_initial_state), keep the import but add a concise comment above it explaining why the Triton override is required and/or rename the local binding to triton_causal_conv1d_fn to make the override explicit; update all references in forward_extend to the chosen identifier.
579-584: In-place mutation of `req_pool_indices` during CUDA graph replay.

Line 581 mutates `req_pool_indices` directly (`req_pool_indices[bs - num_padding :] = 0`). If this tensor is shared or reused by the caller, this side effect could cause subtle bugs. This is likely intentional for CUDA graph replay (where buffers are pre-allocated), but worth a brief comment to document the contract.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@sglang_shadow/python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py` around lines 579 - 584, The code is performing an in-place mutation of req_pool_indices (req_pool_indices[bs - num_padding :] = 0) which can leak side effects to callers; either avoid mutating the caller-visible tensor by making a local copy before zeroing (e.g., local_req_pool_indices = req_pool_indices.clone() or .detach()) and then call self.req_to_token_pool.get_mamba_indices(local_req_pool_indices), or if the in-place write is intentional for CUDA-graph buffer reuse, add a clear comment above the mutation explaining the contract (that req_pool_indices is mutated on purpose for replay and must not be reused by callers) and ensure subsequent uses operate on a non-shared tensor (mamba_indices, get_mamba_indices, and self.state_indices_list usage).
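The trade-off can be illustrated torch-free with plain lists (the helper name is hypothetical):

```python
def zero_padding(indices, num_padding, in_place=False):
    # CUDA-graph replay wants in_place=True (the pre-allocated buffer is
    # reused across replays); in_place=False avoids side effects on the
    # caller at the cost of a copy.
    buf = indices if in_place else list(indices)
    if num_padding > 0:
        for i in range(len(buf) - num_padding, len(buf)):
            buf[i] = 0
    return buf

orig = [5, 6, 7, 8]
out = zero_padding(orig, 2, in_place=False)
assert out == [5, 6, 0, 0]
assert orig == [5, 6, 7, 8]          # caller's buffer untouched
zero_padding(orig, 2, in_place=True)
assert orig == [5, 6, 0, 0]          # graph-replay style: mutated in place
```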
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (123)
- bench_gdn_kernel.py
- flashinfer/gdn_decode.py
- flashinfer/gdn_kernels/gdn_decode_bf16_state.py
- run_bench_local.sh
- sglang_shadow/python/sglang/README.md
- sglang_shadow/python/sglang/__init__.py
- sglang_shadow/python/sglang/__pycache__
- sglang_shadow/python/sglang/_version.py
- sglang_shadow/python/sglang/bench_offline_throughput.py
- sglang_shadow/python/sglang/bench_one_batch.py
- sglang_shadow/python/sglang/bench_one_batch_server.py
- sglang_shadow/python/sglang/bench_serving.py
- sglang_shadow/python/sglang/check_env.py
- sglang_shadow/python/sglang/cli
- sglang_shadow/python/sglang/compile_deep_gemm.py
- sglang_shadow/python/sglang/eval
- sglang_shadow/python/sglang/global_config.py
- sglang_shadow/python/sglang/jit_kernel
- sglang_shadow/python/sglang/lang
- sglang_shadow/python/sglang/launch_server.py
- sglang_shadow/python/sglang/multimodal_gen
- sglang_shadow/python/sglang/profiler.py
- sglang_shadow/python/sglang/srt/__pycache__
- sglang_shadow/python/sglang/srt/batch_invariant_ops
- sglang_shadow/python/sglang/srt/batch_overlap
- sglang_shadow/python/sglang/srt/checkpoint_engine
- sglang_shadow/python/sglang/srt/compilation
- sglang_shadow/python/sglang/srt/configs
- sglang_shadow/python/sglang/srt/connector
- sglang_shadow/python/sglang/srt/constants.py
- sglang_shadow/python/sglang/srt/constrained
- sglang_shadow/python/sglang/srt/debug_utils
- sglang_shadow/python/sglang/srt/disaggregation
- sglang_shadow/python/sglang/srt/distributed
- sglang_shadow/python/sglang/srt/dllm
- sglang_shadow/python/sglang/srt/elastic_ep
- sglang_shadow/python/sglang/srt/entrypoints
- sglang_shadow/python/sglang/srt/environ.py
- sglang_shadow/python/sglang/srt/eplb
- sglang_shadow/python/sglang/srt/function_call
- sglang_shadow/python/sglang/srt/grpc
- sglang_shadow/python/sglang/srt/hardware_backend
- sglang_shadow/python/sglang/srt/layers/__pycache__
- sglang_shadow/python/sglang/srt/layers/activation.py
- sglang_shadow/python/sglang/srt/layers/amx_utils.py
- sglang_shadow/python/sglang/srt/layers/attention/__pycache__
- sglang_shadow/python/sglang/srt/layers/attention/aiter_backend.py
- sglang_shadow/python/sglang/srt/layers/attention/attention_registry.py
- sglang_shadow/python/sglang/srt/layers/attention/base_attn_backend.py
- sglang_shadow/python/sglang/srt/layers/attention/cutlass_mla_backend.py
- sglang_shadow/python/sglang/srt/layers/attention/double_sparsity_backend.py
- sglang_shadow/python/sglang/srt/layers/attention/dual_chunk_flashattention_backend.py
- sglang_shadow/python/sglang/srt/layers/attention/fla
- sglang_shadow/python/sglang/srt/layers/attention/flashattention_backend.py
- sglang_shadow/python/sglang/srt/layers/attention/flashinfer_backend.py
- sglang_shadow/python/sglang/srt/layers/attention/flashinfer_mla_backend.py
- sglang_shadow/python/sglang/srt/layers/attention/flashmla_backend.py
- sglang_shadow/python/sglang/srt/layers/attention/hybrid_attn_backend.py
- sglang_shadow/python/sglang/srt/layers/attention/hybrid_linear_attn_backend.py
- sglang_shadow/python/sglang/srt/layers/attention/intel_amx_backend.py
- sglang_shadow/python/sglang/srt/layers/attention/mamba
- sglang_shadow/python/sglang/srt/layers/attention/merge_state.py
- sglang_shadow/python/sglang/srt/layers/attention/nsa
- sglang_shadow/python/sglang/srt/layers/attention/nsa_backend.py
- sglang_shadow/python/sglang/srt/layers/attention/tbo_backend.py
- sglang_shadow/python/sglang/srt/layers/attention/torch_flex_backend.py
- sglang_shadow/python/sglang/srt/layers/attention/torch_native_backend.py
- sglang_shadow/python/sglang/srt/layers/attention/triton_backend.py
- sglang_shadow/python/sglang/srt/layers/attention/triton_ops
- sglang_shadow/python/sglang/srt/layers/attention/trtllm_mha_backend.py
- sglang_shadow/python/sglang/srt/layers/attention/trtllm_mla_backend.py
- sglang_shadow/python/sglang/srt/layers/attention/utils.py
- sglang_shadow/python/sglang/srt/layers/attention/vision.py
- sglang_shadow/python/sglang/srt/layers/attention/vision_utils.py
- sglang_shadow/python/sglang/srt/layers/attention/wave_backend.py
- sglang_shadow/python/sglang/srt/layers/attention/wave_ops
- sglang_shadow/python/sglang/srt/layers/attention/xpu_backend.py
- sglang_shadow/python/sglang/srt/layers/communicator.py
- sglang_shadow/python/sglang/srt/layers/communicator_nsa_cp.py
- sglang_shadow/python/sglang/srt/layers/deep_gemm_wrapper
- sglang_shadow/python/sglang/srt/layers/dp_attention.py
- sglang_shadow/python/sglang/srt/layers/elementwise.py
- sglang_shadow/python/sglang/srt/layers/flashinfer_comm_fusion.py
- sglang_shadow/python/sglang/srt/layers/layernorm.py
- sglang_shadow/python/sglang/srt/layers/linear.py
- sglang_shadow/python/sglang/srt/layers/logits_processor.py
- sglang_shadow/python/sglang/srt/layers/model_parallel.py
- sglang_shadow/python/sglang/srt/layers/modelopt_utils.py
- sglang_shadow/python/sglang/srt/layers/moe
- sglang_shadow/python/sglang/srt/layers/multimodal.py
- sglang_shadow/python/sglang/srt/layers/parameter.py
- sglang_shadow/python/sglang/srt/layers/pooler.py
- sglang_shadow/python/sglang/srt/layers/quantization
- sglang_shadow/python/sglang/srt/layers/radix_attention.py
- sglang_shadow/python/sglang/srt/layers/rocm_linear_utils.py
- sglang_shadow/python/sglang/srt/layers/rotary_embedding.py
- sglang_shadow/python/sglang/srt/layers/sampler.py
- sglang_shadow/python/sglang/srt/layers/sparse_pooler.py
- sglang_shadow/python/sglang/srt/layers/torchao_utils.py
- sglang_shadow/python/sglang/srt/layers/utils
- sglang_shadow/python/sglang/srt/layers/vocab_parallel_embedding.py
- sglang_shadow/python/sglang/srt/lora
- sglang_shadow/python/sglang/srt/managers
- sglang_shadow/python/sglang/srt/mem_cache
- sglang_shadow/python/sglang/srt/metrics
- sglang_shadow/python/sglang/srt/model_executor
- sglang_shadow/python/sglang/srt/model_loader
- sglang_shadow/python/sglang/srt/models
- sglang_shadow/python/sglang/srt/multimodal
- sglang_shadow/python/sglang/srt/multiplex
- sglang_shadow/python/sglang/srt/parser
- sglang_shadow/python/sglang/srt/sampling
- sglang_shadow/python/sglang/srt/server_args.py
- sglang_shadow/python/sglang/srt/server_args_config_parser.py
- sglang_shadow/python/sglang/srt/speculative
- sglang_shadow/python/sglang/srt/tokenizer
- sglang_shadow/python/sglang/srt/tracing
- sglang_shadow/python/sglang/srt/utils
- sglang_shadow/python/sglang/srt/weight_sync
- sglang_shadow/python/sglang/test
- sglang_shadow/python/sglang/utils.py
- sglang_shadow/python/sglang/version.py
- tests/gdn/test_gdn_decode_pooled.py
✅ Files skipped from review due to trivial changes (48)
- sglang_shadow/python/sglang/srt/layers/attention/__pycache__
- sglang_shadow/python/sglang/srt/entrypoints
- sglang_shadow/python/sglang/srt/configs
- sglang_shadow/python/sglang/jit_kernel
- sglang_shadow/python/sglang/srt/layers/utils
- sglang_shadow/python/sglang/srt/constrained
- sglang_shadow/python/sglang/srt/metrics
- sglang_shadow/python/sglang/srt/managers
- sglang_shadow/python/sglang/srt/layers/__pycache__
- sglang_shadow/python/sglang/srt/debug_utils
- sglang_shadow/python/sglang/srt/tokenizer
- sglang_shadow/python/sglang/srt/layers/quantization
- sglang_shadow/python/sglang/srt/function_call
- sglang_shadow/python/sglang/srt/connector
- sglang_shadow/python/sglang/srt/elastic_ep
- sglang_shadow/python/sglang/srt/dllm
- sglang_shadow/python/sglang/srt/compilation
- sglang_shadow/python/sglang/srt/grpc
- sglang_shadow/python/sglang/srt/layers/moe
- sglang_shadow/python/sglang/srt/layers/attention/triton_ops
- sglang_shadow/python/sglang/README.md
- sglang_shadow/python/sglang/lang
- sglang_shadow/python/sglang/srt/mem_cache
- sglang_shadow/python/sglang/srt/__pycache__
- sglang_shadow/python/sglang/test
- sglang_shadow/python/sglang/srt/weight_sync
- sglang_shadow/python/sglang/srt/utils
- sglang_shadow/python/sglang/srt/batch_overlap
- sglang_shadow/python/sglang/srt/layers/attention/fla
- sglang_shadow/python/sglang/eval
- sglang_shadow/python/sglang/srt/models
- sglang_shadow/python/sglang/srt/model_executor
- sglang_shadow/python/sglang/__pycache__
- sglang_shadow/python/sglang/srt/layers/attention/wave_ops
- sglang_shadow/python/sglang/multimodal_gen
- sglang_shadow/python/sglang/srt/multimodal
- sglang_shadow/python/sglang/cli
- sglang_shadow/python/sglang/srt/eplb
- sglang_shadow/python/sglang/srt/checkpoint_engine
- sglang_shadow/python/sglang/srt/lora
- sglang_shadow/python/sglang/srt/sampling
- sglang_shadow/python/sglang/srt/layers/attention/nsa
- sglang_shadow/python/sglang/srt/batch_invariant_ops
- sglang_shadow/python/sglang/srt/parser
- sglang_shadow/python/sglang/srt/speculative
- sglang_shadow/python/sglang/srt/hardware_backend
- sglang_shadow/python/sglang/srt/tracing
- sglang_shadow/python/sglang/srt/layers/deep_gemm_wrapper
```python
h_out = gH[(batch_idx, value_head_idx, None, None)]
o_head = gO[(batch_idx, 0, value_head_idx, None)]
if pool_idx >= 0:
```
Add in-kernel upper-bound guard for pool_idx to prevent OOB state access.
Current guards only check pool_idx >= 0. Because host max-index validation is skipped during graph capture (Line 2073), positive out-of-range indices can still read/write gH out of bounds.
🛡️ Suggested fix
- if pool_idx >= 0:
+ if pool_idx >= 0 and pool_idx < gH.shape[0]:
@@
- if pool_idx >= 0:
+ if pool_idx >= 0 and pool_idx < gH.shape[0]:
@@
- if pool_idx >= 0:
+ if pool_idx >= 0 and pool_idx < gH.shape[0]:

Also applies to: 1158-1158, 1621-1621
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@flashinfer/gdn_kernels/gdn_decode_bf16_state.py` at line 763, The kernel
currently only checks "if pool_idx >= 0" before accessing the shared state (gH),
allowing positive out-of-range indices to OOB read/write; update the guard to
also enforce an upper bound (e.g., "if 0 <= pool_idx < pool_len") using the
actual pool length (use the kernel's pool size constant or gH.shape[0]/len(gH))
so accesses to gH are only performed when pool_idx is within [0, pool_len-1];
apply the same change at the other occurrences mentioned (around the checks at
lines corresponding to pool_idx usage such as the blocks referencing pool_idx,
gH in gdn_decode_bf16_state.py, including the other spots flagged at the same
pattern).
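The double-sided guard generalizes like this plain-Python sketch of the index check (the real kernel applies it to `gH` inside the GPU kernel, not to Python lists):

```python
def guarded_pool_read(pool, pool_idx):
    # Negative indices mark padding slots; indices >= len(pool) would be
    # out-of-bounds reads, so both sides must be checked when host-side
    # max-index validation is skipped (e.g., during graph capture).
    if 0 <= pool_idx < len(pool):
        return pool[pool_idx]
    return None

pool = [10, 20, 30]
assert guarded_pool_read(pool, 1) == 20
assert guarded_pool_read(pool, -1) is None    # padding slot: skipped
assert guarded_pool_read(pool, 99) is None    # stale/garbage index: skipped
```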
run_bench_local.sh
Outdated
```bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

#!/bin/bash
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Inspect the first 5 lines of run_bench_local.sh
head -n 5 run_bench_local.sh | cat -n

Repository: flashinfer-ai/flashinfer
Length of output: 303
Move the shebang to Line 1.
The shebang must be the first line of the script for proper shell interpretation. Currently, the export statement runs before the shebang is recognized.
Suggested fix
-export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
-
-#!/bin/bash
+#!/bin/bash
+export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```bash
#!/bin/bash
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```
🧰 Tools
🪛 Shellcheck (0.11.0)
[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.
(SC2148)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@run_bench_local.sh` around lines 1 - 3, Move the shebang "#!/bin/bash" to be
the very first line of the script and place the environment export statement
(export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True) after it so the shell
interpreter is correctly recognized; update the file so the first line is the
shebang and subsequent lines contain the export and other commands.
run_bench_local.sh
Outdated
```bash
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--mamba-ssm-k-last \
> "${OUTPUT_DIR}/${name}_server.log" 2>&1 &
```
🧩 Analysis chain
🏁 Script executed:
# First, let's check if the file exists and examine the relevant lines
git ls-files | grep -E "run_bench_local\.sh"

Repository: flashinfer-ai/flashinfer
Length of output: 86
🏁 Script executed:
sed -n '200,250p' run_bench_local.sh | cat -n

Repository: flashinfer-ai/flashinfer
Length of output: 1784
🏁 Script executed:
# Get the exact line numbers of the problematic lines
wc -l run_bench_local.sh
# Check the exact lines mentioned in the review
sed -n '215,225p' run_bench_local.sh | cat -n
echo "---"
sed -n '237,245p' run_bench_local.sh | cat -n

Repository: flashinfer-ai/flashinfer
Length of output: 795
🏁 Script executed:
#!/bin/bash
set -euo pipefail
awk '
prev_bslash && $0 ~ /^[[:space:]]*$/ {print "blank line after trailing backslash at line " NR-1}
{prev_bslash = ($0 ~ /\\[[:space:]]*$/)}
' run_bench_local.sh

Repository: flashinfer-ai/flashinfer
Length of output: 163
Remove blank lines after line continuations to fix broken multiline commands.
Lines 216 and 239 have trailing backslashes followed by blank lines, which breaks shell line continuation. This causes subsequent lines to be parsed as command names instead of arguments, resulting in execution errors (SC2215).
Suggested fix
python -u -m sglang.launch_server \
--model-path ${MODEL_PATH} \
--tp ${TP_SIZE} \
--host 127.0.0.1 \
--port ${SERVER_PORT} \
--trust-remote-code \
--mem-fraction-static 0.20 \
--disable-radix-cache \
--disable-cuda-graph \
-
-
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--mamba-ssm-k-last \
> "${OUTPUT_DIR}/${name}_server.log" 2>&1 &
@@
python -m sglang.bench_one_batch_server \
--model "${MODEL_PATH}" \
--base-url "http://127.0.0.1:${SERVER_PORT}" \
--batch-size 1 32 64 128 \
-
--output-len ${OUTPUT_LEN} \
--result-filename "${OUTPUT_DIR}/${name}_result.jsonl" \
2>&1 | tee "${OUTPUT_DIR}/${name}_output.txt"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```bash
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--mamba-ssm-k-last \
> "${OUTPUT_DIR}/${name}_server.log" 2>&1 &
```
🧰 Tools
🪛 Shellcheck (0.11.0)
[warning] 219-219: This flag is used as a command name. Bad line break or missing [ .. ]?
(SC2215)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@run_bench_local.sh` around lines 217 - 223, There are blank lines after shell
line-continuation backslashes in the multiline command that includes flags like
--speculative-algorithm, --speculative-num-steps, --speculative-eagle-topk and
--mamba-ssm-k-last (the command redirecting to
"${OUTPUT_DIR}/${name}_server.log" 2>&1 &); remove the empty lines immediately
following any line that ends with a trailing backslash so the backslash actually
escapes the newline and the next flag/redirect is treated as the same command.
Ensure every backslash is followed directly by the next argument line (no
intervening blank lines) so the shell continuation is not broken.
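For reference, a trailing backslash only joins lines when nothing, not even a blank line, sits between them; a minimal sketch:

```shell
#!/bin/sh
# The backslash escapes the newline, so all three lines form ONE command.
# Inserting a blank line after either backslash would end the command early
# and make the following argument be parsed as a command name (SC2215).
msg=$(printf '%s %s' \
  hello \
  world)
echo "$msg"
```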
```diff
@@ -0,0 +1 @@
+/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/_version.py
```
No newline at end of file
🧩 Analysis chain
🏁 Script executed:
# Find the _version.py file in the repository
find . -name "_version.py" -type f

Repository: flashinfer-ai/flashinfer
Length of output: 50
🏁 Script executed:
# List top-level structure
ls -la
# Search for any version-related files
find . -name "*version*" -type f 2>/dev/null | head -20
# Search for any Python files with "sglang" in path
find . -path "*sglang*" -type f 2>/dev/null | head -20

Repository: flashinfer-ai/flashinfer
Length of output: 3082
🏁 Script executed:
# List contents of sglang_shadow directory
ls -la sglang_shadow/
# List contents of sglang_shadow/python/sglang/
ls -la sglang_shadow/python/sglang/
# Check if _version.py exists
ls -la sglang_shadow/python/sglang/_version.py 2>&1

Repository: flashinfer-ai/flashinfer
Length of output: 3362
🏁 Script executed:
# Check if the symlink target exists
ls -la /home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/_version.py 2>&1
# Try to read the symlink
cat sglang_shadow/python/sglang/_version.py 2>&1
# Check what the symlink resolves to
readlink -f sglang_shadow/python/sglang/_version.py 2>&1

Repository: flashinfer-ai/flashinfer
Length of output: 254
Remove the broken symbolic link to /home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/_version.py.
sglang_shadow/python/sglang/_version.py is a symbolic link pointing to a nonexistent absolute filesystem path. This breaks imports and prevents the package from being distributed or installed. Replace the symlink with actual version code or a proper relative reference.
🧰 Tools
🪛 Ruff (0.15.2)
[error] 1-1: No such file or directory (os error 2)
(E902)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sglang_shadow/python/sglang/_version.py` at line 1, The file
sglang_shadow/python/sglang/_version.py is a broken absolute symlink; remove the
symlink and replace it with a real module that exports the package version
(e.g., define __version__ = "x.y.z" or implement a dynamic lookup using
importlib.metadata.version in a function like get_version()); ensure the new
_version.py is a regular file (not a symlink) and that any imports referencing
sglang._version or from sglang._version import __version__ continue to work, or
alternatively change imports to use a relative/resource-based lookup so no
absolute filesystem links are required.
```diff
@@ -0,0 +1 @@
+/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/bench_one_batch_server.py
```
No newline at end of file
Raw filesystem path committed as Python code breaks module parsing.
Line 1 is not valid Python and will fail import/linting. Replace this with actual module content (or remove this file from Python package inputs if it is only a path marker artifact).
Suggested minimal fix (keep file syntactically valid until real implementation lands)
-/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/bench_one_batch_server.py
+"""Shadow module placeholder.
+
+TODO: replace with actual `bench_one_batch_server.py` implementation.
+"""📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
"""Shadow module placeholder.

TODO: replace with actual `bench_one_batch_server.py` implementation.
"""
```
🧰 Tools
🪛 Ruff (0.15.2)
[error] 1-1: No such file or directory (os error 2)
(E902)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sglang_shadow/python/sglang/bench_one_batch_server.py` at line 1, The file
sglang_shadow/python/sglang/bench_one_batch_server.py currently contains a raw
filesystem path on line 1 which is invalid Python and breaks imports; replace
that line with valid module content (e.g., a module docstring and a minimal
placeholder such as an empty function or "pass" in a named function/class) or
remove the file from package inputs if it was only a path marker artifact, then
run lint/import to verify the module now parses cleanly as requested in
[request_verification].
```diff
@@ -0,0 +1 @@
+/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/srt/layers/torchao_utils.py
```
No newline at end of file
🧩 Analysis chain
🏁 Script executed:
git ls-files | grep -i "torchao_utils"

Repository: flashinfer-ai/flashinfer
Length of output: 123
🏁 Script executed:
git ls-files | grep -E "(sglang|torchao)" | head -20

Repository: flashinfer-ai/flashinfer
Length of output: 923
🏁 Script executed:
cat -n sglang_shadow/python/sglang/srt/layers/torchao_utils.py | head -20

Repository: flashinfer-ai/flashinfer
Length of output: 155
🏁 Script executed:
ls -la sglang_shadow/python/sglang/srt/layers/torchao_utils.py

Repository: flashinfer-ai/flashinfer
Length of output: 257
🏁 Script executed:
git ls-files -s | grep torchao_utils

Repository: flashinfer-ai/flashinfer
Length of output: 173
🏁 Script executed:
file sglang_shadow/python/sglang/srt/layers/torchao_utils.py

Repository: flashinfer-ai/flashinfer
Length of output: 112
🏁 Script executed:
git show HEAD:sglang_shadow/python/sglang/srt/layers/torchao_utils.py | head -20

Repository: flashinfer-ai/flashinfer
Length of output: 151
Remove hardcoded symlink; replace with actual Python module.
The file is a symbolic link pointing to an absolute path (/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/srt/layers/torchao_utils.py) instead of containing Python code. This cannot be imported and is not portable across environments. Replace the symlink with the actual module file.
🧰 Tools
🪛 Ruff (0.15.2)
[error] 1-1: No such file or directory (os error 2)
(E902)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sglang_shadow/python/sglang/srt/layers/torchao_utils.py` at line 1, The file
sglang.srt.layers.torchao_utils.py is currently a filesystem symlink to an
absolute path and thus not importable; replace the symlink with a real Python
module file named torchao_utils.py that implements the expected public API for
this package (module sglang.srt.layers.torchao_utils). Create the module
in-place (not a symlink), implement or forward the expected functions/classes
used elsewhere (e.g., any helpers referenced as torchao_utils.* in imports), add
an __all__ export list and proper imports, and ensure it contains tests or
minimal runtime-safe stubs if full functionality depends on optional
dependencies so the package can be imported across environments.
@@ -0,0 +1 @@
+/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/srt/model_loader
No newline at end of file
🧩 Analysis chain
🏁 Script executed:
# First, check if the file exists and what it contains
if [ -f "sglang_shadow/python/sglang/srt/model_loader" ]; then
echo "=== File content ==="
cat -n "sglang_shadow/python/sglang/srt/model_loader"
echo ""
echo "=== File type ==="
file "sglang_shadow/python/sglang/srt/model_loader"
echo ""
echo "=== File size ==="
wc -c "sglang_shadow/python/sglang/srt/model_loader"
fi
# Check if it's a symlink
if [ -L "sglang_shadow/python/sglang/srt/model_loader" ]; then
echo "=== Symlink target ==="
readlink -f "sglang_shadow/python/sglang/srt/model_loader"
fi
# List directory structure
echo ""
echo "=== Directory structure of sglang_shadow/ ==="
find sglang_shadow -type f -name "model_loader*" 2>/dev/null
find sglang_shadow -type l 2>/dev/null | head -20
Length of output: 993
🏁 Script executed:
# Search for files with absolute paths as content
echo "=== Checking for files with absolute paths ==="
fd --type f . sglang_shadow/ 2>/dev/null | while read -r file; do
# Skip binary and large files
if file "$file" | grep -q "text"; then
lines=$(wc -l < "$file" 2>/dev/null || echo "999")
if [ "$lines" -le 5 ]; then
content=$(cat "$file" 2>/dev/null)
if echo "$content" | grep -qE "^/home/|^/usr/|^/opt/|^/var/"; then
echo "SUSPECT: $file"
echo "Content: $content"
echo "---"
fi
fi
fi
done
# Check for symlinks with absolute targets
echo ""
echo "=== Checking for absolute symlinks ==="
find sglang_shadow -type l 2>/dev/null | while read -r link; do
target=$(readlink "$link" 2>/dev/null)
if echo "$target" | grep -qE "^/"; then
echo "ABSOLUTE SYMLINK: $link -> $target"
fi
done
Length of output: 18079
🏁 Script executed:
# Explore the sglang_shadow directory structure
echo "=== sglang_shadow directory tree ==="
tree -L 3 sglang_shadow 2>/dev/null || find sglang_shadow -type d | head -20
Length of output: 2238
🏁 Script executed:
# Check if there's any documentation or configuration about sglang_shadow
find . -maxdepth 3 -name "*README*" -o -name "*.md" | grep -i sglang | head -10
Length of output: 50
🏁 Script executed:
# Check if there's any gitignore or setup documentation
cat .gitignore 2>/dev/null | grep -A5 -B5 sglang_shadow || echo "No sglang_shadow entries in .gitignore"
Length of output: 106
Critical: Absolute symlinks committed throughout sglang_shadow/ — will break for all contributors and CI.
The sglang_shadow/ directory contains 150+ symlinks with hardcoded absolute paths to a developer's local workspace (/home/xutingz/workspace/gitsrc/sglang_gdn/...). This affects all subdirectories: python/sglang/, srt/, layers/, models/, and more. These symlinks will not resolve on any other machine, breaking builds for all contributors and CI pipelines. It also leaks developer-specific information (username and workspace structure).
Convert all absolute symlinks to relative symlinks (e.g., ../../../sglang_gdn/python/sglang/...), remove this directory entirely and wire integration via pyproject.toml path dependencies or editable installs, or add the entire directory to .gitignore if it's only needed locally.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sglang_shadow/python/sglang/srt/model_loader` at line 1, The sglang_shadow
directory contains many absolute symlinks pointing to /home/xutingz/... which
will fail for other contributors and CI; replace every absolute symlink in
sglang_shadow (including subdirs python/sglang/, srt/, layers/, models/) with
relative symlinks (e.g., ../../../sglang_gdn/...), or remove the directory and
wire the package via pyproject.toml path dependencies or an editable install, or
add sglang_shadow/ to .gitignore if it’s only local; search the repo for
symlinks containing the absolute prefix
"/home/xutingz/workspace/gitsrc/sglang_gdn" and update each link target to a
correct relative path or eliminate the symlink and use proper packaging instead.
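As a sketch of the first remediation option (converting absolute links to relative ones), assuming all targets stay inside the checkout; `relativize_symlinks` is an illustrative helper, not existing repo tooling:

```python
# Illustrative sketch: rewrite every absolute symlink under a root as a
# relative one. Hypothetical helper, not existing repo tooling.
import os
import tempfile

def relativize_symlinks(root):
    fixed = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) and os.path.isabs(os.readlink(path)):
                rel = os.path.relpath(os.readlink(path), start=dirpath)
                os.remove(path)          # drop the absolute link...
                os.symlink(rel, path)    # ...and recreate it relatively
                fixed.append(path)
    return fixed

# Tiny demo tree: shadow/real.txt is an absolute symlink to base/real.txt.
base = tempfile.mkdtemp()
with open(os.path.join(base, "real.txt"), "w") as f:
    f.write("content")
os.makedirs(os.path.join(base, "shadow"))
link = os.path.join(base, "shadow", "real.txt")
os.symlink(os.path.join(base, "real.txt"), link)   # absolute target

fixed = relativize_symlinks(base)
print(os.readlink(link))   # now a relative target such as ../real.txt
```

Note that git stores a symlink's literal target string, so the rewritten relative links would still need to be re-staged to take effect for other contributors.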
@@ -0,0 +1 @@
+/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/srt/server_args.py
No newline at end of file
🧩 Analysis chain
🏁 Script executed:
# First, find the actual file path in the repository
fd "server_args.py" -t f
Length of output: 50
🏁 Script executed:
# Check repository structure and look for sglang references
git ls-files | head -30
echo "---"
git ls-files | grep -i sglang
echo "---"
git ls-files | grep -i server
Length of output: 7356
🏁 Script executed:
# Check file size and read the content
wc -l sglang_shadow/python/sglang/srt/server_args.py
echo "---"
head -20 sglang_shadow/python/sglang/srt/server_args.py
Length of output: 272
🏁 Script executed:
# Check the file using git directly
git ls-files -s sglang_shadow/python/sglang/srt/server_args.py
echo "---"
# Try to read it via git
git cat-file -p HEAD:sglang_shadow/python/sglang/srt/server_args.py | head -20
Length of output: 243
Fix invalid symlink in module file.
The file sglang_shadow/python/sglang/srt/server_args.py is a symbolic link pointing to an absolute path (/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/srt/server_args.py), which is invalid Python and breaks module imports. Replace the symlink with actual server_args definitions or an intentional placeholder docstring.
🧰 Tools
🪛 Ruff (0.15.2)
[error] 1-1: No such file or directory (os error 2)
(E902)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sglang_shadow/python/sglang/srt/server_args.py` at line 1, The server_args.py
file is currently a broken symlink; replace it with a real Python module instead
of a symlink: remove the symlink and create a new server_args.py containing
either the actual argument parsing/constants/classes used elsewhere (e.g.,
functions or names referenced like parse_server_args, ServerArgs, DEFAULT_PORT)
or, if the real implementations live elsewhere, create a clear placeholder
module with a top-level docstring and exported names (e.g., __all__ =
["parse_server_args", "ServerArgs"]) so imports succeed; ensure the module
defines the same public symbols other code expects and is valid Python (no
absolute-path symlink).
sglang_shadow/python/sglang/utils.py
Outdated
@@ -0,0 +1 @@
+/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/utils.py
No newline at end of file
🧩 Analysis chain
🏁 Script executed:
head -n 5 sglang_shadow/python/sglang/utils.py | cat -n
Length of output: 163
🏁 Script executed:
fd -type f -name "utils.py" | head -20
Length of output: 238
🏁 Script executed:
fd "utils.py"
Length of output: 1155
🏁 Script executed:
head -n 10 sglang_shadow/python/sglang/utils.py | cat -n
Length of output: 163
🏁 Script executed:
git show HEAD:sglang_shadow/python/sglang/utils.py | head -n 10
Length of output: 132
🏁 Script executed:
git show HEAD:sglang_shadow/python/sglang/utils.py | wc -l
Length of output: 69
🏁 Script executed:
git show HEAD:sglang_shadow/python/sglang/utils.py | xxd | head -20
Length of output: 109
🏁 Script executed:
git show HEAD:sglang_shadow/python/sglang/utils.py | od -c
Length of output: 110
The module sglang.utils is currently unparsable because of line 1: the absolute path literal is invalid Python and must be removed or replaced.
Proposed fix
-/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/utils.py
+"""Shared utility helpers (placeholder)."""

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| /home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/utils.py | |
| """Shared utility helpers (placeholder).""" |
🧰 Tools
🪛 Ruff (0.15.2)
[error] 1-1: No such file or directory (os error 2)
(E902)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sglang_shadow/python/sglang/utils.py` at line 1, The first line of
sglang.utils contains an invalid absolute path literal which makes the module
unparsable; remove that path string from the top of the file and replace it with
a valid construct (e.g., a module docstring, a shebang, or the intended
import/definition). Ensure the file begins with valid Python (no bare filesystem
paths) and that any intended metadata is expressed as a string literal docstring
or comment rather than a raw path.
@@ -0,0 +1 @@
+/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/version.py
No newline at end of file
Accidentally committed local filesystem path instead of file content.
This file (and every other file added under sglang_shadow/) contains only the developer's local absolute machine path as its content:
/home/xutingz/workspace/gitsrc/sglang_gdn/python/sglang/version.py
These are not valid Python modules. They were almost certainly created by a tooling artifact (e.g., a find/echo redirect or a broken symlink materialisation) and then accidentally staged. The same problem affects all 8 files in this batch:
- sglang_shadow/python/sglang/version.py
- sglang_shadow/python/sglang/srt/disaggregation (also not a .py file; may be a directory committed as a regular file)
- sglang_shadow/python/sglang/srt/layers/vocab_parallel_embedding.py
- sglang_shadow/python/sglang/profiler.py
- sglang_shadow/python/sglang/srt/layers/parameter.py
- sglang_shadow/python/sglang/srt/constants.py
- sglang_shadow/python/sglang/srt/layers/attention/intel_amx_backend.py
- sglang_shadow/python/sglang/global_config.py
Please:
- Determine whether these shadow stubs should be empty files, copies/symlinks of the real SGLang source, or omitted entirely.
- Replace the path-string content with the correct content (or remove the files if not needed).
- For sglang_shadow/python/sglang/srt/disaggregation, verify whether this is meant to be a directory (package); if so, an __init__.py inside that directory is needed rather than a file at that path.
The Ruff E902 errors on all of these are a direct downstream symptom of this same root cause.
🧰 Tools
🪛 Ruff (0.15.2)
[error] 1-1: No such file or directory (os error 2)
(E902)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@sglang_shadow/python/sglang/version.py` at line 1, Several files under
sglang_shadow contain only the developer's local absolute path instead of valid
Python content (e.g., sglang_shadow/python/sglang/version.py and the listed
eight files); you should remove or replace these stub files with the correct
artifacts: first determine whether each shadow entry (version.py, profiler.py,
global_config.py, constants.py, parameter.py, vocab_parallel_embedding.py,
intel_amx_backend.py) should be omitted, be a real copy/symlink of the
corresponding sglang module, or be an empty placeholder; for
sglang_shadow/python/sglang/srt/disaggregation verify if it should be a package
— if so replace the file with a directory named disaggregation and add an
__init__.py (or otherwise remove the stray file); ensure the final commit
removes the single-line path contents and changes the files to valid Python
modules or deletes them, then re-run linting to confirm Ruff E902 is resolved.
c627a21 to
06b94e7
Compare
702da5b to
34121a1
Compare
/bot run

[FAILED] Pipeline #45304753: 9/20 passed
📌 Description
This PR ports the pooled decode kernel (originally from #2521) to the new F16 backend introduced in #2498. It enables zero-copy state updates using indirect indexing, eliminating the need for manual gather/scatter operations in SGLang.
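The zero-copy idea can be sketched in a few lines of plain Python (scalar state per slot for brevity; names such as `slot_idx` are illustrative, not the kernel's actual parameters):

```python
# Illustrative contrast between the two state layouts. Not the real
# kernel API; slot_idx and the scalar "state" are simplifications.

def decode_step_direct(state, delta):
    # Direct [B, ...] layout: the caller gathered per-request state
    # into a contiguous batch beforehand and must scatter it back.
    for b in range(len(state)):
        state[b] += delta[b]

def decode_step_pooled(state_pool, slot_idx, delta):
    # Pooled [pool_size, ...] layout: each request's state is read and
    # updated in place through its slot index, so no gather/scatter
    # pass is needed in the caller.
    for b, slot in enumerate(slot_idx):
        state_pool[slot] += delta[b]

pool = [0.0] * 8        # pool_size = 8
slots = [5, 2]          # two in-flight requests own slots 5 and 2
decode_step_pooled(pool, slots, [1.5, 2.5])
print(pool[5], pool[2])  # 1.5 2.5
```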
Key changes:
- Rebased the feat/gdn-decode-pooled commits onto the latest main (post-F16 merge) and ported the indirect indexing logic.
- Supports both [B, HV, V, K] (direct) and [pool_size, HV, V, K] (pooled) state layouts in the gdn_decode_klast_bf16_state backend.

🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- Installed pre-commit by running pip install pre-commit (or used your preferred method).
- Installed the hooks with pre-commit install.
- Ran pre-commit run --all-files and fixed any reported issues.

🧪 Tests
- Tests added or updated as needed (unittest, etc.).

Reviewer Notes
Summary by CodeRabbit
New Features
Improvements
Tools
Tests