
Conversation


@cicirori cicirori commented Oct 26, 2025

📌 Description

1. Fixed Parameter Alignment

  • Issue: The stream parameter was being passed to the wrong position in the RopeQuantize function call because the enable_pdl parameter was missing from the signature; SGLang would hang before this PR.
  • Fix: Added the enable_pdl parameter to the function signature and aligned all parameters correctly.

2. Fixed PDL Launch Configuration

  • Issue: When enable_pdl=true, the kernel would throw CUDA errors due to incorrect handling of the PDL launch attribute.
  • Fix: Aligned the implementation with the pattern used in csrc/fmhaReduction.cu (see the sketch below).
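
For context, here is a minimal sketch of the corrected launch pattern, assuming the cudaLaunchKernelEx style used in csrc/fmhaReduction.cu; the kernel and function names below are illustrative, not the actual FlashInfer code.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel; a PDL-aware kernel would typically call
// cudaGridDependencySynchronize() before consuming upstream results.
__global__ void RopeQuantizeSketchKernel(float* out) {
  out[threadIdx.x] = static_cast<float>(threadIdx.x);
}

cudaError_t LaunchRopeQuantizeSketch(float* out, bool enable_pdl,
                                     cudaStream_t stream) {
  cudaLaunchConfig_t config = {};
  config.gridDim = dim3(1);
  config.blockDim = dim3(128);
  config.dynamicSmemBytes = 0;
  config.stream = stream;

  // A single attribute is always supplied; PDL is toggled purely through its
  // value, so there is no conditional branching around the attribute array.
  cudaLaunchAttribute attrs[1];
  attrs[0].id = cudaLaunchAttributeProgrammaticStreamSerialization;
  attrs[0].val.programmaticStreamSerializationAllowed = enable_pdl ? 1 : 0;
  config.attrs = attrs;
  config.numAttrs = 1;

  return cudaLaunchKernelEx(&config, RopeQuantizeSketchKernel, out);
}
```

Always supplying the attribute and toggling PDL only through its value keeps the launch path branch-free and avoids lifetime mistakes with conditionally declared attributes.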

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • New Features

    • Added PDL (Programmatic Dependent Launch) benchmarking capability for rope quantization operations.
    • Extended configuration options to enable or disable PDL functionality.
  • Tests

    • Updated test suite to validate PDL enabled and disabled scenarios in rope quantization workflows.


coderabbitai bot commented Oct 26, 2025

Walkthrough

This PR introduces PDL (Programmatic Dependent Launch) support to the RoPE quantization pipeline by adding an enable_pdl boolean parameter across the entire stack. The parameter is threaded from benchmark code through Python APIs, C++ bindings, and CUDA kernel launch configurations, with corresponding test parametrization added.

Changes

| Cohort / File(s) | Change Summary |
|---|---|
| **Benchmarking Infrastructure**<br/>`benchmarks/bench_rope_quantize_fp8.py` | Added an `enable_pdl` parameter to `benchmark_config()` (default `False`) and introduced a new `benchmark_pdl()` function with a Triton `perf_report` decorator. Extended the execution flow to run the PDL benchmark and emit a `rope-pdl-benchmark.png` plot. |
| **C++ CUDA Bindings**<br/>`csrc/flashinfer_rope_binding.cu`, `csrc/rope.cu` | Extended the `rope_quantize()` signature with an `enable_pdl: bool` parameter after the existing `interleave` parameter; the call site in `rope.cu` now forwards `enable_pdl` to the `RopeQuantize` kernel dispatch (see the alignment sketch below the table). |
| **Python API Layer**<br/>`flashinfer/rope.py` | Added an `enable_pdl: bool` parameter to `_rope_quantize()`, `_fake_rope_quantize()`, `rope_quantize_fp8()`, and `mla_rope_quantize_fp8()`, propagated it through the entire quantization call chain, and updated the docstrings. |
| **CUDA Launch Configuration**<br/>`include/flashinfer/pos_enc.cuh` | Simplified the PDL launch configuration to always use a single `cudaLaunchAttribute` entry whose `programmaticStreamSerializationAllowed` value is derived from the `enable_pdl` flag (1 if true, 0 if false), eliminating conditional branching. |
| **Test Infrastructure**<br/>`tests/attention/test_rope.py` | Added `enable_pdl` via `pytest.mark.parametrize` to `test_generalized_rope_quantize()` and `test_mla_rope_quantize()`, and threaded the parameter through the quantization calls in the test bodies. |
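
To make the parameter-alignment issue concrete, here is a small hypothetical C++ sketch; the signatures are illustrative assumptions, not the actual FlashInfer functions. It shows one plausible way a missing enable_pdl slot lets a call still compile while the stream argument lands in the wrong position (cudaStream_t is a pointer, so it converts implicitly to bool):

```cuda
#include <cuda_runtime.h>

// Hypothetical launcher signature (illustrative, not the actual FlashInfer
// code): enable_pdl sits between interleave and the defaulted stream.
cudaError_t RopeQuantizeSketch(bool interleave, bool enable_pdl,
                               cudaStream_t stream = nullptr) {
  (void)interleave;
  (void)enable_pdl;
  (void)stream;
  return cudaSuccess;
}

// One plausible failure mode behind fix 1: a call site written against the
// old, shorter signature still compiles, because cudaStream_t is a pointer
// and converts implicitly to bool. `stream` then lands in the enable_pdl
// slot and the real stream argument falls back to its default, so the kernel
// would be launched on the wrong stream.
cudaError_t OldCallSiteSketch(bool interleave, cudaStream_t stream) {
  return RopeQuantizeSketch(interleave, stream);  // stream -> enable_pdl (!)
}

// Corrected call site: every argument is passed explicitly and in order.
cudaError_t FixedCallSiteSketch(bool interleave, bool enable_pdl,
                                cudaStream_t stream) {
  return RopeQuantizeSketch(interleave, enable_pdl, stream);
}
```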

Sequence Diagram

```mermaid
sequenceDiagram
    participant Test/Bench as Test or Benchmark
    participant Python as Python API<br/>(rope.py)
    participant Binding as C++ Binding<br/>(rope_binding.cu)
    participant Kernel as CUDA Kernel<br/>(rope.cu)
    participant Launch as Launch Config<br/>(pos_enc.cuh)

    Test/Bench->>Python: rope_quantize_fp8(..., enable_pdl)
    activate Python
    Python->>Python: _rope_quantize(..., enable_pdl)
    Python->>Binding: rope_quantize(..., enable_pdl)
    activate Binding
    Binding->>Kernel: rope_quantize(..., enable_pdl)
    activate Kernel
    Kernel->>Launch: Prepare launch config with enable_pdl
    activate Launch
    Launch->>Launch: Set programmaticStreamSerializationAllowed<br/>= enable_pdl ? 1 : 0
    Note over Launch: Single cudaLaunchAttribute<br/>always supplied
    deactivate Launch
    Kernel->>Kernel: Launch RopeQuantize kernel<br/>with attribute config
    deactivate Kernel
    deactivate Binding
    deactivate Python
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Areas requiring extra attention:
    • Verify enable_pdl parameter propagation is complete and consistent across all quantization code paths (benchmark, test, production code)
    • Review the CUDA launch attribute logic in pos_enc.cuh to ensure the boolean-to-integer conversion and single-attribute configuration works correctly with the kernel dispatcher
    • Confirm backward compatibility: existing call sites without enable_pdl must handle the new default parameter correctly
    • Validate that parameter order changes (particularly in C++ binding signatures) don't introduce binding/ABI issues

Possibly related PRs

Suggested reviewers

  • pavanimajety
  • nvpohanh
  • yzh119

Poem

🐰 A PDL hop through the RoPE so fine,
Stream serialization flows by design,
From Python down to CUDA's core,
Parameters march through every door,
Benchmarks dance, tests sing along,
One flag makes the kernel strong! 🎯

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00%, which is below the required threshold of 80.00%. | Run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title Check | ✅ Passed | The title "fix: correct PDL parameter handling in RopeQuantize kernel" is directly related to the main changes: it captures both fixes (the missing enable_pdl parameter that misaligned the stream argument and the corrected PDL launch configuration) and is concise and specific. |
| Description Check | ✅ Passed | The description follows the required template and explains both the problems and the solutions. The checklist sections are completed; the Related Issues section is present but empty, and Reviewer Notes is optional per the template. The content is specific enough to understand the PR's purpose and changes. |

@gemini-code-assist
Contributor

Summary of Changes

Hello @cicirori, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses critical issues related to the handling of Programmatic Dependent Launch (PDL) within the RopeQuantize kernel. It rectifies parameter alignment problems and corrects the CUDA launch configuration for PDL, preventing runtime errors. Additionally, it introduces new benchmarks and expands existing test coverage to thoroughly validate the enable_pdl functionality, ensuring robust and correct operation.

Highlights

  • Parameter Alignment: Corrected the RopeQuantize function signature to include the enable_pdl parameter, ensuring proper alignment of all parameters, which previously caused the stream parameter to be misplaced.
  • PDL Launch Configuration Fix: Resolved CUDA errors that occurred when enable_pdl was true by aligning the kernel's PDL launch configuration with established patterns found in csrc/fmhaReduction.cu.
  • PDL Benchmarking: Introduced a new benchmark specifically for evaluating the performance of the RopeQuantize kernel with and without Programmatic Dependent Launch (PDL) enabled.
  • Test Coverage Expansion: Extended existing tests for generalized_rope_quantize and mla_rope_quantize to include enable_pdl as a parameter, ensuring comprehensive validation of the new functionality.
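
Since PDL comes up throughout this PR, here is a minimal device-side sketch of what Programmatic Dependent Launch means in CUDA. It is illustrative background only, not FlashInfer code; it assumes a Hopper-class GPU (compute capability 9.0) and that the dependent kernel is launched with programmaticStreamSerializationAllowed = 1 via cudaLaunchKernelEx.

```cuda
#include <cuda_runtime.h>

// With PDL, the dependent grid may start before the primary grid finishes;
// the two device-side calls below define the actual synchronization point.

__global__ void PrimarySketchKernel(float* buf) {
  buf[threadIdx.x] = 1.0f;
  // Signal that the dependent grid may begin executing early.
  cudaTriggerProgrammaticLaunchCompletion();
  // ... epilogue work the dependent kernel does not depend on ...
}

__global__ void DependentSketchKernel(const float* buf, float* out) {
  // Prologue work that does not touch `buf` can overlap with the primary grid.
  float local = static_cast<float>(threadIdx.x);
  // Block until the primary grid has signaled (or finished) before reading buf.
  cudaGridDependencySynchronize();
  out[threadIdx.x] = buf[threadIdx.x] + local;
}
```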


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly addresses a critical bug in the RopeQuantize kernel related to PDL parameter handling. The primary issue was a dangling pointer caused by incorrect variable scope in the CUDA kernel launch configuration, which has been resolved effectively. The new enable_pdl parameter has been consistently propagated through the Python API, CUDA bindings, and kernel implementation. Additionally, comprehensive updates to the benchmarks and tests have been included to validate the fix and measure its performance impact. The changes are well-executed and improve the robustness of the kernel.
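
To make the dangling-pointer remark concrete, here is a hypothetical before/after sketch of the scoping issue; it is not the actual FlashInfer diff.

```cuda
#include <cuda_runtime.h>

__global__ void DummySketchKernel() {}

cudaError_t LaunchScopeSketch(bool enable_pdl, cudaStream_t stream) {
  cudaLaunchConfig_t config = {};
  config.gridDim = dim3(1);
  config.blockDim = dim3(32);
  config.stream = stream;

  // BUGGY pattern (shown only as a comment): `attr` dies at the closing
  // brace, so config.attrs dangles by the time cudaLaunchKernelEx reads it.
  //
  //   if (enable_pdl) {
  //     cudaLaunchAttribute attr;
  //     attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
  //     attr.val.programmaticStreamSerializationAllowed = 1;
  //     config.attrs = &attr;
  //     config.numAttrs = 1;
  //   }

  // FIXED pattern: the attribute lives in the same scope as the launch, is
  // always supplied, and PDL is toggled only through its value.
  cudaLaunchAttribute attr;
  attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
  attr.val.programmaticStreamSerializationAllowed = enable_pdl ? 1 : 0;
  config.attrs = &attr;
  config.numAttrs = 1;

  return cudaLaunchKernelEx(&config, DummySketchKernel);
}
```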

@cicirori
Contributor Author

```
pytest -v tests/attention/test_rope.py::test_generalized_rope_quantize

================================= warnings summary =================================
../../../../usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63
  /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
    import pynvml  # type: ignore[import]

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================= 624 passed, 1 warning in 19.17s ==========================
```

I'm not very familiar with PDL, so I'm not sure whether a PDL on/off benchmark is meaningful.
Tested on B200:

```
python benchmarks/bench_rope_quantize_fp8.py
...
PDL on/off
num_tokens,enable_pdl=False,enable_pdl=True
1.000000,0.002864,0.002458
2.000000,0.002864,0.002451
4.000000,0.002790,0.002451
8.000000,0.002790,0.002454
16.000000,0.002998,0.002531
32.000000,0.003466,0.003402
64.000000,0.004829,0.004813
128.000000,0.007446,0.007450
256.000000,0.012493,0.012501
384.000000,0.018074,0.017619
512.000000,0.025194,0.024576
768.000000,0.036861,0.036246
```
(Attached plot: rope-pdl-benchmark.png, PDL on/off timings vs. num_tokens.)


@yzh119 yzh119 left a comment


Thanks for the bugfix @cicirori !

@yzh119 yzh119 merged commit d4a3ff4 into flashinfer-ai:main Oct 26, 2025
4 checks passed