feat: support more head dim in RoPE kernel #2109
Conversation
Signed-off-by: Raayan Dhar [email protected] <[email protected]>
**Note:** CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

**Walkthrough:** Added a device helper to handle partial RoPE quantization chunks, replaced per-element vector loads/stores with guarded partial-chunk writes, refactored dynamic kernel dispatch, routed RoPE-quantized flows through `RopeQuantize`, and expanded cos/sin cache tests with four new configurations.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Host
    participant Dispatch as KernelDispatch
    participant GPU
    Note over Host,Dispatch: Host prepares params (head_dim, rotary_dim, no_rope_dim, ...)
    Host->>Dispatch: call launch routine
    Dispatch-->>GPU: select & launch kernel (RopeQuantize / other) with computed vec_size/bdx/bdy
    alt head_dim < rotary_dim
        GPU-->>Host: error return
    else RoPE-quantized path
        GPU->>GPU: RopeQuantizeKernel runs
        GPU->>GPU: call scale_store_partial_chunk for tail lanes (zero-pad, scale, store)
    else Non-RoPE / full-chunk
        GPU->>GPU: regular vector loads/stores
    end
    GPU-->>Host: results / completion
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~35 minutes
Summary of Changes

Hello @raayandhar, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the flexibility and robustness of the Rotary Positional Embedding (RoPE) implementation by enabling support for arbitrary head dimensions within the `RopeQuantizeKernel`. It introduces a mechanism to gracefully handle partial data chunks in non-RoPE dimensions and refactors the `BatchQKApplyRotaryPosIdsCosSinCache` function to use this improved kernel. These changes ensure correct processing for a broader range of model configurations and simplify future maintenance.
Code Review
This pull request adds support for arbitrary head dimensions in the RoPE kernel by introducing a new helper function `scale_store_partial_chunk` to handle partial memory chunks and refactoring `BatchQKApplyRotaryPosIdsCosSinCache` to use the more general `RopeQuantize` kernel. This is a good simplification that reduces code duplication.

However, I've found a critical issue in how the non-RoPE tensor slices are handled. The pointer arithmetic used to create `q_nope_in` and `k_nope_in` is incorrect for multi-dimensional tensors, which will lead to incorrect memory accesses. I've also included a couple of suggestions to improve code clarity in the new helper function.
The added tests are good, but they seem to be passing despite the critical issue, which might indicate a problem with the test setup or reference implementation.
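To make the behavior under discussion concrete, here is a minimal NumPy model of what a guarded partial-chunk store does: load up to `vec_size` elements, zero-pad past the valid count, scale, and write only the valid lanes. This is an illustrative sketch of the semantics, not the actual CUDA device function or its signature.

```python
import numpy as np

def scale_store_partial_chunk(dst, src, chunk_offset, chunk_valid, vec_size, scale):
    """Illustrative model of the guarded store: read up to vec_size
    elements, zero-pad lanes past chunk_valid, scale, and write back
    only the valid lanes."""
    vec = np.zeros(vec_size, dtype=src.dtype)
    n = max(0, min(chunk_valid, vec_size))
    vec[:n] = src[chunk_offset:chunk_offset + n]
    vec *= scale  # scaling zero-padded lanes is harmless (0 * scale == 0)
    dst[chunk_offset:chunk_offset + n] = vec[:n]

# Tail chunk: only 2 of 4 lanes are valid, the rest are never written.
src = np.arange(10, dtype=np.float32)
dst = np.zeros(10, dtype=np.float32)
scale_store_partial_chunk(dst, src, chunk_offset=8, chunk_valid=2, vec_size=4, scale=2.0)
```

The key property is that lanes beyond `chunk_valid` never touch `dst`, which is what prevents out-of-bounds writes in the tail chunk.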
Actionable comments posted: 0
🧹 Nitpick comments (1)
include/flashinfer/pos_enc.cuh (1)
1052-1120: Consider clarifying the `bdx` template parameter usage.

The kernel dispatch sets the template parameter `bdx=1` (line 1097) while computing a runtime `bdx` value (line 1054). This works because the `rotary_dim` argument is explicitly passed to the RoPE functions, overriding the default `vec_size * bdx`. However, this discrepancy could be confusing for maintainability. Consider either:

- Using the computed `bdx` value as the template parameter (would require a DISPATCH_BDX macro), or
- Adding a comment explaining why the template `bdx` is set to 1 while the runtime `bdx` varies.

Example comment:

```cpp
// Template bdx=1 because rotary_dim is explicitly passed to RoPE functions
auto kernel = RopeQuantizeKernel<INTERLEAVE, vec_size, 1, DType, IdType, QuantType>;
```
📒 Files selected for processing (2)

- `include/flashinfer/pos_enc.cuh` (7 hunks)
- `tests/attention/test_rope.py` (1 hunk)
🔇 Additional comments (5)

`tests/attention/test_rope.py` (1)

303-306: LGTM! Test coverage expanded appropriately.

The new test configurations effectively validate the partial chunk handling introduced in this PR. They cover various scenarios where `no_rope_dim < rope_dim`, which exercises the new `scale_store_partial_chunk` logic for tail chunks.

`include/flashinfer/pos_enc.cuh` (4)
546-551: Correct usage of partial chunk handling.

The `chunk_valid` calculation properly handles tail chunks where `no_rope_dim` is not a multiple of `rope_dim`. The logic correctly computes the number of valid elements in the current chunk and handles the case where `elem_offset >= no_rope_dim` by setting `chunk_valid = 0`.

Also applies to: 566-571
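The `chunk_valid` rule described here can be modeled in a few lines. This is a hypothetical host-side sketch that mirrors the review's description, not the exact kernel code:

```python
def chunk_valid(elem_offset, no_rope_dim, rope_dim):
    # Valid elements in the current no-RoPE chunk: full chunks get
    # rope_dim, the tail chunk gets the remainder, and chunks past
    # the end of the no-RoPE region get 0.
    if elem_offset >= no_rope_dim:
        return 0
    return min(rope_dim, no_rope_dim - elem_offset)

# head_dim=192, rope_dim=128 gives no_rope_dim=64: a single partial chunk
print(chunk_valid(0, 64, 128))  # → 64
```

A chunk whose offset starts at or beyond `no_rope_dim` contributes nothing, which is the guard that makes arbitrary head dimensions safe.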
1140-1196: Dispatch logic is consistent across variants.

The dynamic dispatch logic for `RopeQuantizeAppendPagedKVCache` (GQA/MHA) and `RopeQuantizeAppendPagedMLACache` follows the same pattern as `RopeQuantize`. The `total_blocks_y` calculation correctly accounts for the differences between GQA/MHA (includes V blocks) and MLA (no V blocks).

Note: The same `bdx` template parameter concern mentioned in the previous comment applies here as well.

Also applies to: 1214-1274
236-292: Verify performance impact with targeted benchmarks for partial chunk scenarios.

The code logic is correct with proper boundary checks and zero-padding. However, verification confirms the author's concern: no performance data exists for this code path. The existing benchmark uses `head_size = rotary_dim` (both 128), meaning `no_rope_dim = 0`, so it doesn't exercise the partial chunk handling that this function addresses.

Before merging, run benchmarks with configurations where `no_rope_dim > 0` and `no_rope_dim < rope_dim` (e.g., `head_dim=192, rope_dim=128, no_rope_dim=64`) to quantify the performance impact of the element-by-element fallback path and zero-padding logic.
1286-1312: Performance verification requires manual benchmarking; the routing change logic is correct and well-tested.

The routing to `RopeQuantize` is intentional, uniform across all callers, and thoroughly validated for correctness. Existing tests in `tests/attention/test_rope.py` verify the output against reference implementations for all relevant configurations (head_dim: 64/128/256, partial_rotary_factor: 0.25–1.0). However, the original review specifically requests performance profiling to detect regressions, which cannot be completed automatically in this environment; you must run performance benchmarks locally to measure kernel execution time and throughput across representative workloads.
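For readers unfamiliar with the reference implementations the tests compare against, a partial-rotary RoPE reference can be sketched as follows. This is an illustrative NumPy model assuming the non-interleaved (half-rotated) layout; it is not the repository's actual reference code:

```python
import numpy as np

def apply_partial_rope(x, pos, rotary_dim, theta=1e4):
    """Reference partial RoPE (non-interleaved layout): rotate the first
    rotary_dim elements of the last axis, copy the tail unchanged."""
    half = rotary_dim // 2
    freqs = theta ** (-np.arange(half) / half)  # per-pair rotation frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:rotary_dim]
    out = x.copy()
    out[..., :half] = x1 * cos - x2 * sin
    out[..., half:rotary_dim] = x2 * cos + x1 * sin
    return out  # elements [rotary_dim:] pass through untouched

x = np.random.default_rng(0).standard_normal((2, 192)).astype(np.float32)
y = apply_partial_rope(x, pos=5, rotary_dim=128)
```

The pass-through tail is exactly the `no_rope_dim` region that the new partial-chunk handling covers on the device side.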
yzh119 left a comment:
LGTM overall, cc @kahyunnam for another look
/bot run
Signed-off-by: Raayan Dhar [email protected] <[email protected]>
Is there an issue with CI? It seems like it has been running for 2 days now 😅
Hi @raayandhar, the CI is finished (the result was not returned here for some reason). The PR itself does not bring any regressions and should be ready to merge. I'm running the benchmarks and will merge it as long as there is no performance regression.
This LGTM! I do wonder if we're still adding some unnecessary overhead with the pointwise multiply by 1. I agree we can merge when benchmarking looks OK @yzh119. Thanks @raayandhar for the contribution!
There are indeed some performance regressions @raayandhar @kahyunnam. On H100, before this PR:

```
rope-latency:
    seq_len  FlashInfer    Native      vLLM
0       2.0    0.005936  0.062576  0.007968
1       4.0    0.005952  0.064256  0.008160
2       8.0    0.005888  0.069376  0.008128
3      16.0    0.006112  0.066160  0.008352
4      32.0    0.006240  0.066784  0.008576
5      64.0    0.006752  0.068608  0.009056
6     128.0    0.007808  0.075328  0.010464
7     256.0    0.009664  0.088256  0.012832
8     512.0    0.013472  0.115648  0.019904
9    1024.0    0.020896  0.170496  0.033728
10   2048.0    0.035712  0.290272  0.060896
11   4096.0    0.066240  0.523520  0.114400
12   8192.0    0.129952  0.985888  0.221632
13  16384.0    0.255168  1.897296  0.436032
14  32768.0    0.486576  3.715232  0.864640
15  65536.0    0.953376  7.342368  1.722112
```

After:

```
    seq_len  FlashInfer    Native      vLLM
0       2.0    0.005952  0.063488  0.007968
1       4.0    0.005952  0.064112  0.008128
2       8.0    0.005920  0.069440  0.008128
3      16.0    0.006272  0.067104  0.008384
4      32.0    0.006400  0.067552  0.008576
5      64.0    0.006688  0.068512  0.009056
6     128.0    0.007744  0.075424  0.010464
7     256.0    0.009760  0.088224  0.012832
8     512.0    0.013632  0.115712  0.019872
9    1024.0    0.021120  0.170720  0.033696
10   2048.0    0.036064  0.289760  0.060864
11   4096.0    0.066976  0.524288  0.114528
12   8192.0    0.128800  0.985664  0.221760
13  16384.0    0.259968  1.899248  0.435840
14  32768.0    0.621312  3.711968  0.864608
15  65536.0    1.758672  7.343424  1.722016
```
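The regression in the FlashInfer column is concentrated at large sequence lengths. A quick check of the slowdown at the two largest sizes, with the numbers copied from the benchmark output above:

```python
# FlashInfer latency (ms) at the two largest seq_len values, from the tables above
before = {32768.0: 0.486576, 65536.0: 0.953376}
after = {32768.0: 0.621312, 65536.0: 1.758672}

for seq_len in sorted(before):
    ratio = after[seq_len] / before[seq_len]
    print(f"seq_len={seq_len:.0f}: {ratio:.2f}x slower")
```

Below seq_len 16384 the before/after numbers are within noise of each other, so the regression is specific to the large-batch path.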
Oof ok, I will go and investigate. Could you share your benchmarking scripts?
Signed-off-by: Raayan Dhar [email protected] <[email protected]>
Actionable comments posted: 0
🧹 Nitpick comments (2)
include/flashinfer/pos_enc.cuh (2)
236-290: Minor optimization opportunity in `scale_store_partial_chunk`.

Lines 273-276 scale all `vec_size` elements, including zero-padded ones, when the chunk is partial. While harmless (0 × scale = 0) and likely optimized by the compiler, you could skip scaling invalid elements for marginal gains:

```diff
 #pragma unroll
 for (uint32_t i = 0; i < vec_size; ++i) {
-  vec[i] = vec[i] * scale;
+  uint32_t elem_idx = lane_elem_offset + i;
+  if (elem_idx < chunk_valid) {
+    vec[i] = vec[i] * scale;
+  }
 }
```
1213-1282: MLA dispatch looks correct; minor cleanup possible.

The dispatch correctly handles MLA-specific requirements (`num_kv_heads=1`, no V processing, MLA cache type). Lines 1245-1246 and 1262-1264 introduce duplicate stride variables:

```cpp
size_t k_rope_in_stride_h_dup = k_rope_in_stride;
size_t k_nope_in_stride_h_dup = k_nope_in_stride;
```

These can be removed by directly assigning `k_rope_in_stride` and `k_nope_in_stride` to the params struct fields. This minor cleanup would reduce verbosity.
📒 Files selected for processing (1)

- `include/flashinfer/pos_enc.cuh` (7 hunks)
🔇 Additional comments (5)
include/flashinfer/pos_enc.cuh (5)
544-549: LGTM: Partial chunk handling for non-RoPE dimensions.

The usage of `scale_store_partial_chunk` correctly guards writes when `no_rope_dim` is not a multiple of `rope_dim`. The `chunk_valid` calculation ensures only valid elements are written, preventing out-of-bounds access.

Also applies to: 564-569
1294-1298: LGTM: Guard against invalid head_dim.

The check ensures `head_dim >= rotary_dim` before proceeding, preventing undefined behavior. The clear error message aids debugging.
1361-1380: Routing to RopeQuantize introduces a known performance trade-off.

For arbitrary head dimensions, the code routes through `RopeQuantize` with `quant_scale_q=1.0f` and `quant_scale_kv=1.0f`. This:

- Adds a multiply-by-1.0 operation per element (minor, likely optimized by the compiler)
- Uses a more general kernel instead of the optimized fast-path kernels for standard dimensions

The pointer arithmetic (lines 1367-1372) is correct despite past review concerns. The base pointer offset by `rotary_dim` combined with full-tensor strides produces the correct element addresses for all (idx, head) combinations. Based on PR comments, performance regressions on H100 are known and under investigation with benchmark results.
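The base-pointer-plus-strides claim can be checked numerically. The following NumPy sketch (illustrative shapes and names, not the actual tensor layout code) shows that offsetting a flat base index by `rotary_dim` while keeping the full-tensor strides addresses exactly the no-RoPE tail of each head:

```python
import numpy as np

# q has shape (nnz, num_heads, head_dim); the no-RoPE slice is the tail
# [rotary_dim:] of the last axis. With C-contiguous layout, offsetting the
# base by rotary_dim and using the full-tensor element strides reproduces
# an explicit slice for every (idx, head) pair.
nnz, num_heads, head_dim, rotary_dim = 3, 2, 192, 128
q = np.arange(nnz * num_heads * head_dim, dtype=np.float32).reshape(nnz, num_heads, head_dim)

flat = q.ravel()
stride_n, stride_h = num_heads * head_dim, head_dim  # element strides
base = rotary_dim                                    # "pointer" offset

for idx in range(nnz):
    for head in range(num_heads):
        start = base + idx * stride_n + head * stride_h
        np.testing.assert_array_equal(
            flat[start:start + head_dim - rotary_dim],
            q[idx, head, rotary_dim:])
```

Because the last axis is contiguous, each tail slice is a single contiguous run, so the offset-base addressing never crosses into a neighboring head.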
1050-1118: LGTM: Dynamic dispatch supports arbitrary dimensions.

The refactored dispatch logic correctly computes:

- Thread block dimensions ensuring at least `bdx` threads in the x-dimension to cover `rope_dim` with vectorization
- At least 128 threads per block for occupancy
- A dynamic `no_rope_chunks` based on the `no_rope_dim / rope_dim` ratio

The launch configuration with the programmatic stream serialization attribute is properly constructed.
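The three computed quantities listed above can be sketched as host-side arithmetic. This is a hypothetical model of the dispatch math (parameter defaults are illustrative, not the actual dispatch code):

```python
def launch_params(rope_dim, no_rope_dim, vec_size=8, min_threads=128):
    # bdx lanes cover rope_dim with vec_size-wide vector accesses,
    # bdy is padded so each block has at least min_threads threads,
    # and the no-RoPE region is processed in rope_dim-sized chunks
    # (rounding up yields a trailing partial chunk when needed).
    bdx = (rope_dim + vec_size - 1) // vec_size
    bdy = max(1, min_threads // bdx)
    no_rope_chunks = (no_rope_dim + rope_dim - 1) // rope_dim
    return bdx, bdy, no_rope_chunks

print(launch_params(128, 64))  # → (16, 8, 1)
```

The ceiling division in `no_rope_chunks` is what creates the partial tail chunk that `scale_store_partial_chunk` must guard.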
1138-1195: LGTM: Consistent dispatch pattern for paged KV cache.

The dispatch logic mirrors `RopeQuantize` with appropriate adjustments for cache append operations. The `total_blocks_y` correctly includes V processing blocks for GQA/MHA.
I think the fundamental issue that leads to this perf gap is that the
Let me know your thoughts.
📌 Description
With the new changes we should be able to support arbitrary head dims using the `RopeQuantizeKernel`, and I have routed `BatchQKApplyRotaryPosIdsCosSinCache` to do so.

🔍 Related Issues
#2104
🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

- I have installed `pre-commit` by running `pip install pre-commit` (or used my preferred method).
- I have installed the hooks with `pre-commit install`.
- I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests
- Tests have been added or updated as needed (unittest, etc.). NOTE: There was a set of tests where I got an error which I know is related to my system. Unfortunately I do not manage this system and it does not have Docker, so trying to fix this is a bit difficult. Hopefully someone else can verify my tests or run CI. All other tests were passing, and all the failing tests had that error.
Reviewer Notes
Please let me know if there's a smarter way to get around this hack or if other tests should be updated. Also I think we should remove the older kernel but let me know if we should do otherwise. I also need to test perf.