
Conversation


@cicirori cicirori commented Oct 26, 2025

📌 Description

1. Fixed Parameter Alignment

  • Issue: The stream parameter was being passed to the wrong position in the RopeQuantize function call because the enable_pdl parameter was missing from the signature; SGLang would hang before this PR.
  • Fix: Added the enable_pdl parameter to the function signature and aligned all parameters correctly.

2. Fixed PDL Launch Configuration

  • Issue: When enable_pdl=true, the kernel would throw CUDA errors due to incorrect handling of the PDL launch attribute.
  • Fix: Aligned the implementation with the pattern used in csrc/fmhaReduction.cu (see the sketch below).
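
For context, here is a minimal sketch of the corrected launch pattern, assuming the cudaLaunchKernelEx style used in csrc/fmhaReduction.cu; the kernel and function names below are illustrative, not the actual FlashInfer code.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel; a PDL-aware kernel would typically call
// cudaGridDependencySynchronize() before consuming upstream results.
__global__ void RopeQuantizeSketchKernel(float* out) {
  out[threadIdx.x] = static_cast<float>(threadIdx.x);
}

cudaError_t LaunchRopeQuantizeSketch(float* out, bool enable_pdl,
                                     cudaStream_t stream) {
  cudaLaunchConfig_t config = {};
  config.gridDim = dim3(1);
  config.blockDim = dim3(128);
  config.dynamicSmemBytes = 0;
  config.stream = stream;

  // A single attribute is always supplied; PDL is toggled purely through its
  // value, so there is no conditional branching around the attribute array.
  cudaLaunchAttribute attrs[1];
  attrs[0].id = cudaLaunchAttributeProgrammaticStreamSerialization;
  attrs[0].val.programmaticStreamSerializationAllowed = enable_pdl ? 1 : 0;
  config.attrs = attrs;
  config.numAttrs = 1;

  return cudaLaunchKernelEx(&config, RopeQuantizeSketchKernel, out);
}
```

Always supplying the attribute and toggling PDL only through its value keeps the launch path branch-free and avoids lifetime mistakes with conditionally declared attributes.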

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • New Features

    • Added PDL (Programmatic Dependent Launch) benchmarking capability for rope quantization operations.
    • Extended configuration options to enable or disable PDL functionality.
  • Tests

    • Updated test suite to validate PDL enabled and disabled scenarios in rope quantization workflows.


coderabbitai bot commented Oct 26, 2025

Walkthrough

This PR introduces PDL (Programmatic Dependent Launch) support to the RoPE quantization pipeline by adding an enable_pdl boolean parameter across the entire stack. The parameter is threaded from benchmark code through Python APIs, C++ bindings, and CUDA kernel launch configurations, with corresponding test parametrization added.

Changes

| Cohort / File(s) | Change Summary |
|---|---|
| **Benchmarking Infrastructure**<br/>`benchmarks/bench_rope_quantize_fp8.py` | Added an `enable_pdl` parameter to `benchmark_config()` (default `False`) and introduced a new `benchmark_pdl()` function with a Triton `perf_report` decorator. Extended the execution flow to run the PDL benchmark and emit a `rope-pdl-benchmark.png` plot. |
| **C++ CUDA Bindings**<br/>`csrc/flashinfer_rope_binding.cu`, `csrc/rope.cu` | Extended the `rope_quantize()` signature with an `enable_pdl: bool` parameter after the existing `interleave` parameter; the call site in `rope.cu` now forwards `enable_pdl` to the `RopeQuantize` kernel dispatch (see the alignment sketch below the table). |
| **Python API Layer**<br/>`flashinfer/rope.py` | Added an `enable_pdl: bool` parameter to `_rope_quantize()`, `_fake_rope_quantize()`, `rope_quantize_fp8()`, and `mla_rope_quantize_fp8()`, propagated it through the entire quantization call chain, and updated the docstrings. |
| **CUDA Launch Configuration**<br/>`include/flashinfer/pos_enc.cuh` | Simplified the PDL launch configuration to always use a single `cudaLaunchAttribute` entry whose `programmaticStreamSerializationAllowed` value is derived from the `enable_pdl` flag (1 if true, 0 if false), eliminating conditional branching. |
| **Test Infrastructure**<br/>`tests/attention/test_rope.py` | Added `enable_pdl` via `pytest.mark.parametrize` to `test_generalized_rope_quantize()` and `test_mla_rope_quantize()`, and threaded the parameter through the quantization calls in the test bodies. |
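
To make the parameter-alignment issue concrete, here is a small hypothetical C++ sketch; the signatures are illustrative assumptions, not the actual FlashInfer functions. It shows one plausible way a missing enable_pdl slot lets a call still compile while the stream argument lands in the wrong position (cudaStream_t is a pointer, so it converts implicitly to bool):

```cuda
#include <cuda_runtime.h>

// Hypothetical launcher signature (illustrative, not the actual FlashInfer
// code): enable_pdl sits between interleave and the defaulted stream.
cudaError_t RopeQuantizeSketch(bool interleave, bool enable_pdl,
                               cudaStream_t stream = nullptr) {
  (void)interleave;
  (void)enable_pdl;
  (void)stream;
  return cudaSuccess;
}

// One plausible failure mode behind fix 1: a call site written against the
// old, shorter signature still compiles, because cudaStream_t is a pointer
// and converts implicitly to bool. `stream` then lands in the enable_pdl
// slot and the real stream argument falls back to its default, so the kernel
// would be launched on the wrong stream.
cudaError_t OldCallSiteSketch(bool interleave, cudaStream_t stream) {
  return RopeQuantizeSketch(interleave, stream);  // stream -> enable_pdl (!)
}

// Corrected call site: every argument is passed explicitly and in order.
cudaError_t FixedCallSiteSketch(bool interleave, bool enable_pdl,
                                cudaStream_t stream) {
  return RopeQuantizeSketch(interleave, enable_pdl, stream);
}
```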

Sequence Diagram

```mermaid
sequenceDiagram
    participant Test/Bench as Test or Benchmark
    participant Python as Python API<br/>(rope.py)
    participant Binding as C++ Binding<br/>(rope_binding.cu)
    participant Kernel as CUDA Kernel<br/>(rope.cu)
    participant Launch as Launch Config<br/>(pos_enc.cuh)

    Test/Bench->>Python: rope_quantize_fp8(..., enable_pdl)
    activate Python
    Python->>Python: _rope_quantize(..., enable_pdl)
    Python->>Binding: rope_quantize(..., enable_pdl)
    activate Binding
    Binding->>Kernel: rope_quantize(..., enable_pdl)
    activate Kernel
    Kernel->>Launch: Prepare launch config with enable_pdl
    activate Launch
    Launch->>Launch: Set programmaticStreamSerializationAllowed<br/>= enable_pdl ? 1 : 0
    Note over Launch: Single cudaLaunchAttribute<br/>always supplied
    deactivate Launch
    Kernel->>Kernel: Launch RopeQuantize kernel<br/>with attribute config
    deactivate Kernel
    deactivate Binding
    deactivate Python
```

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Areas requiring extra attention:
    • Verify enable_pdl parameter propagation is complete and consistent across all quantization code paths (benchmark, test, production code)
    • Review the CUDA launch attribute logic in pos_enc.cuh to ensure the boolean-to-integer conversion and single-attribute configuration works correctly with the kernel dispatcher
    • Confirm backward compatibility: existing call sites without enable_pdl must handle the new default parameter correctly
    • Validate that parameter order changes (particularly in C++ binding signatures) don't introduce binding/ABI issues

Possibly related PRs

Suggested reviewers

  • pavanimajety
  • nvpohanh
  • yzh119

Poem

🐰 A PDL hop through the RoPE so fine,
Stream serialization flows by design,
From Python down to CUDA's core,
Parameters march through every door,
Benchmarks dance, tests sing along,
One flag makes the kernel strong! 🎯

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00%, which is below the required threshold of 80.00%. | Run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title Check | ✅ Passed | The title "fix: correct PDL parameter handling in RopeQuantize kernel" is directly related to the main changes: it captures both fixes (the missing enable_pdl parameter that misaligned the stream argument and the corrected PDL launch configuration) and is concise and specific. |
| Description Check | ✅ Passed | The description follows the required template and explains both the problems and the solutions. The checklist sections are completed; the Related Issues section is present but empty, and Reviewer Notes is optional per the template. The content is specific enough to understand the PR's purpose and changes. |

@gemini-code-assist
Contributor

Summary of Changes

Hello @cicirori, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses critical issues related to the handling of Programmatic Dependent Launch (PDL) within the RopeQuantize kernel. It rectifies parameter alignment problems and corrects the CUDA launch configuration for PDL, preventing runtime errors. Additionally, it introduces new benchmarks and expands existing test coverage to thoroughly validate the enable_pdl functionality, ensuring robust and correct operation.

Highlights

  • Parameter Alignment: Corrected the RopeQuantize function signature to include the enable_pdl parameter, ensuring proper alignment of all parameters, which previously caused the stream parameter to be misplaced.
  • PDL Launch Configuration Fix: Resolved CUDA errors that occurred when enable_pdl was true by aligning the kernel's PDL launch configuration with established patterns found in csrc/fmhaReduction.cu.
  • PDL Benchmarking: Introduced a new benchmark specifically for evaluating the performance of the RopeQuantize kernel with and without Programmatic Dependent Launch (PDL) enabled.
  • Test Coverage Expansion: Extended existing tests for generalized_rope_quantize and mla_rope_quantize to include enable_pdl as a parameter, ensuring comprehensive validation of the new functionality.
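
Since PDL comes up throughout this PR, here is a minimal device-side sketch of what Programmatic Dependent Launch means in CUDA. It is illustrative background only, not FlashInfer code; it assumes a Hopper-class GPU (compute capability 9.0) and that the dependent kernel is launched with programmaticStreamSerializationAllowed = 1 via cudaLaunchKernelEx.

```cuda
#include <cuda_runtime.h>

// With PDL, the dependent grid may start before the primary grid finishes;
// the two device-side calls below define the actual synchronization point.

__global__ void PrimarySketchKernel(float* buf) {
  buf[threadIdx.x] = 1.0f;
  // Signal that the dependent grid may begin executing early.
  cudaTriggerProgrammaticLaunchCompletion();
  // ... epilogue work the dependent kernel does not depend on ...
}

__global__ void DependentSketchKernel(const float* buf, float* out) {
  // Prologue work that does not touch `buf` can overlap with the primary grid.
  float local = static_cast<float>(threadIdx.x);
  // Block until the primary grid has signaled (or finished) before reading buf.
  cudaGridDependencySynchronize();
  out[threadIdx.x] = buf[threadIdx.x] + local;
}
```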


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly addresses a critical bug in the RopeQuantize kernel related to PDL parameter handling. The primary issue was a dangling pointer caused by incorrect variable scope in the CUDA kernel launch configuration, which has been resolved effectively. The new enable_pdl parameter has been consistently propagated through the Python API, CUDA bindings, and kernel implementation. Additionally, comprehensive updates to the benchmarks and tests have been included to validate the fix and measure its performance impact. The changes are well-executed and improve the robustness of the kernel.
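
To make the dangling-pointer remark concrete, here is a hypothetical before/after sketch of the scoping issue; it is not the actual FlashInfer diff.

```cuda
#include <cuda_runtime.h>

__global__ void DummySketchKernel() {}

cudaError_t LaunchScopeSketch(bool enable_pdl, cudaStream_t stream) {
  cudaLaunchConfig_t config = {};
  config.gridDim = dim3(1);
  config.blockDim = dim3(32);
  config.stream = stream;

  // BUGGY pattern (shown only as a comment): `attr` dies at the closing
  // brace, so config.attrs dangles by the time cudaLaunchKernelEx reads it.
  //
  //   if (enable_pdl) {
  //     cudaLaunchAttribute attr;
  //     attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
  //     attr.val.programmaticStreamSerializationAllowed = 1;
  //     config.attrs = &attr;
  //     config.numAttrs = 1;
  //   }

  // FIXED pattern: the attribute lives in the same scope as the launch, is
  // always supplied, and PDL is toggled only through its value.
  cudaLaunchAttribute attr;
  attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
  attr.val.programmaticStreamSerializationAllowed = enable_pdl ? 1 : 0;
  config.attrs = &attr;
  config.numAttrs = 1;

  return cudaLaunchKernelEx(&config, DummySketchKernel);
}
```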

@cicirori
Contributor Author

```
pytest -v tests/attention/test_rope.py::test_generalized_rope_quantize

================================= warnings summary =================================
../../../../usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63
  /usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
    import pynvml  # type: ignore[import]

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================= 624 passed, 1 warning in 19.17s ==========================
```

I'm not very familiar with PDL, so I'm not sure whether a PDL on/off benchmark is meaningful.
Tested on B200:

```
python benchmarks/bench_rope_quantize_fp8.py
...
PDL on/off
num_tokens,enable_pdl=False,enable_pdl=True
1.000000,0.002864,0.002458
2.000000,0.002864,0.002451
4.000000,0.002790,0.002451
8.000000,0.002790,0.002454
16.000000,0.002998,0.002531
32.000000,0.003466,0.003402
64.000000,0.004829,0.004813
128.000000,0.007446,0.007450
256.000000,0.012493,0.012501
384.000000,0.018074,0.017619
512.000000,0.025194,0.024576
768.000000,0.036861,0.036246
```
(Attached plot: rope-pdl-benchmark.png, PDL on/off timings vs. num_tokens.)


@yzh119 yzh119 left a comment


Thanks for the bugfix @cicirori !

@yzh119 yzh119 merged commit d4a3ff4 into flashinfer-ai:main Oct 26, 2025
4 checks passed