feat(autotuner): enable per-op autotune bypass for faster framework warmup by qiching · Pull Request #3396 · flashinfer-ai/flashinfer

qiching · 2026-05-23T22:22:41Z

Add skip_ops to let frameworks exclude specific ops from autotuning, falling back to heuristic tactics without kernel compilation overhead. Addresses #3295: mm_fp4 cute-dsl autotuning is slow due to compilation.

📌 Description

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

I have installed pre-commit by running pip install pre-commit (or used your preferred method).
I have installed the hooks with pre-commit install.
I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

Tests have been added or updated as needed.
All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

New Features
- Added an option to exclude specific custom operations from profiling via the tuning context, with nested contexts combining exclusions and restoring prior state on exit.
- Excluded operations immediately use fallback behavior and bypass profiling and cache writes.
Tests
- Added tests validating skip behavior, nesting/restore semantics, fallback selection, and that skipped ops don’t affect profiling cache.

coderabbitai · 2026-05-23T22:22:48Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ba7ebded-8989-42c5-a257-281d29664bff

📥 Commits

Reviewing files that changed from the base of the PR and between 43d842f and e4194e2.

📒 Files selected for processing (1)

flashinfer/autotuner.py

🚧 Files skipped from review as they are similar to previous changes (1)

flashinfer/autotuner.py

📝 Walkthrough

Adds an optional skip_ops parameter to autotune() that maintains a per-thread, unioned stack of op names to skip; skipped ops cause AutoTuner.choose_one() to return the fallback runner/tactic immediately and avoid profiling or cache writes. Tests cover nested behavior and cache isolation.

Changes

Skip-ops feature for autotuner

| Layer / File(s) | Summary |
| --- | --- |
| Skip-ops parameter contract and documentation `flashinfer/autotuner.py` | `autotune()` gains optional `skip_ops: Optional[Union[str, Set[str]]]` with docstring, semantics for nested unioning, and example usage (e.g., `skip_ops={"fp4_gemm"}`). |
| Thread-local skip-ops state infrastructure `flashinfer/autotuner.py` | `AutoTuner` adds `_skip_ops_local`, `_get_skip_ops_stack()`, and `_effective_skip_ops` to maintain and compute per-thread cumulative skip sets. |
| Context manager entry and exit behavior `flashinfer/autotuner.py` | On entry `autotune()` normalizes `skip_ops` (string→set), unions with current top of the thread stack, and pushes it; on both normal and exceptional exit the pushed entry is popped to restore prior state. |
| Skip-ops enforcement in choose_one() `flashinfer/autotuner.py` | Before acquiring locks or profiling, `AutoTuner.choose_one()` checks `_effective_skip_ops`; if `custom_op` is present it logs and returns `(runners[0], -1)` immediately, erroring if no runners are provided. |
| Comprehensive skip-ops test coverage `tests/autotuner/test_autotuner_core.py` | Adds eight tests validating skipped-op immediate fallback without profiling, unaffected non-skipped ops, nested union/restoration semantics, first-runner selection for skipped ops, empty-set no-op, nested-scope restoration, cache non-pollution, and full cleanup after context exit. |
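
The stack behavior summarized in the table can be illustrated with a minimal, standalone sketch. This is not the FlashInfer implementation: `skip_scope`, `effective_skip_ops`, and the module-level `_local` are hypothetical names chosen for illustration, and only the push-union-pop pattern is taken from the walkthrough.

```python
import threading
from contextlib import contextmanager
from typing import FrozenSet, Iterator, List, Optional, Set, Union

# Hypothetical stand-in for the tuner's per-thread state (not FlashInfer code).
_local = threading.local()


def _stack() -> List[FrozenSet[str]]:
    """Return this thread's stack of cumulative skip sets, creating it lazily."""
    if not hasattr(_local, "stack"):
        _local.stack = []
    return _local.stack


def effective_skip_ops() -> FrozenSet[str]:
    """The skip set currently in force: top of the stack, or empty."""
    stack = _stack()
    return stack[-1] if stack else frozenset()


@contextmanager
def skip_scope(skip_ops: Optional[Union[str, Set[str]]] = None) -> Iterator[None]:
    """Push the union of the current skip set and `skip_ops`; pop on exit."""
    if skip_ops is None:
        new: FrozenSet[str] = frozenset()
    elif isinstance(skip_ops, str):
        new = frozenset({skip_ops})  # wrap the string; do not iterate over it
    else:
        new = frozenset(skip_ops)
    stack = _stack()
    stack.append(effective_skip_ops() | new)
    try:
        yield
    finally:
        stack.pop()  # runs on both normal and exceptional exit


# Nested scopes accumulate, and each exit restores the previous state.
with skip_scope({"op_a"}):
    with skip_scope("op_b"):
        inner = effective_skip_ops()
    outer = effective_skip_ops()
after = effective_skip_ops()
```

Because each pushed entry already holds the cumulative union, looking up the effective set is a single read of the top of the stack, and restoring the prior state is a single pop.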

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

run-ci, op: moe

Suggested reviewers

yyihuang
aleozlx
yzh119

Poem

🐰 A rabbit hops through stacks of skips,
unioning names on tiny tips.
Nested frames, then pop—restore,
fallback found, no profiling chore.
Tests nibble carrots, all checks pass—hip!

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 71.43% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ❓ Inconclusive | The description includes the issue reference (`#3295`) and brief context, but the required template sections are mostly incomplete with only checkboxes left unchecked and placeholder comments unfilled. | Fill in the Description section with detailed information about what the PR does and why, complete the Related Issues section, and explicitly confirm the checklist items or explain the current status. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: adding a skip_ops feature to enable per-op autotune bypass for faster framework warmup. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

gemini-code-assist

Code Review

This pull request introduces a skip_ops parameter to the autotune context manager, enabling specific operations to bypass profiling and utilize fallback heuristics. The implementation uses a thread-local stack to manage these overrides, supporting nested contexts through set unions. Extensive tests have been added to ensure correct behavior across various scenarios, including nested contexts and cache integrity. Review feedback suggests enhancing usability by allowing single string inputs for skip_ops, fixing a potential bug where strings are incorrectly split into characters during set conversion, and optimizing performance by moving the skip check outside the global lock to minimize thread contention.

…armup

Add skip_ops parameter to autotune() context manager, allowing frameworks to exclude specific ops from autotuning and use heuristic fallback instead. This eliminates kernel compilation overhead for ops like mm_fp4(cute-dsl) whose heuristics are already near-optimal.

- Support both str and Set[str] input for skip_ops
- Handle single string safely to prevent frozenset character splitting
- Move skip check before global lock to reduce thread contention
- Add runners empty check for defensive safety

Addresses #3295
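
The "frozenset character splitting" pitfall named in the commit message comes from Python treating a string as an iterable of characters. The sketch below shows the problem and one way to guard against it; `normalize_skip_ops` is a hypothetical helper written for illustration and is not claimed to match the code in `flashinfer/autotuner.py`.

```python
from typing import FrozenSet, Iterable, Optional, Union


def normalize_skip_ops(
    skip_ops: Optional[Union[str, Iterable[str]]],
) -> FrozenSet[str]:
    """Turn None, a single op name, or a collection of names into a frozenset."""
    if skip_ops is None:
        return frozenset()
    if isinstance(skip_ops, str):
        # Wrap in a tuple first so the whole name becomes one element.
        return frozenset((skip_ops,))
    return frozenset(skip_ops)


# Passing the string straight to frozenset() iterates over its characters.
naive = frozenset("fp4_gemm")
safe = normalize_skip_ops("fp4_gemm")

print("fp4_gemm" in naive)  # False: the set holds single characters
print("fp4_gemm" in safe)   # True: the set holds the op name
```

Checking `isinstance(skip_ops, str)` before the generic iterable branch is what keeps a call such as `skip_ops="fp4_gemm"` equivalent to `skip_ops={"fp4_gemm"}`.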

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

tests/autotuner/test_autotuner_core.py (1)

848-1017: ⚡ Quick win

Add a dedicated test for skip_ops string input.

Line 619 in production normalizes skip_ops when passed as str; current tests only exercise set forms. Please add one case like skip_ops="skip_me" to lock this contract.

Suggested test addition

+def test_skip_ops_string_input_prevents_profiling(monkeypatch):
+    tuner = reset_autotuner()
+    runner = DummyRunner(valid_tactics=(0, 1))
+    inputs = [torch.empty((8, 16), dtype=torch.float32)]
+    config = TuningConfig()
+
+    called = []
+
+    def fake_profile(self, runner_obj, prof_inputs, tactic, tuning_config=None, **kwargs):
+        called.append(tactic)
+        return 1.0
+
+    monkeypatch.setattr(AutoTuner, "_profile_single_kernel", fake_profile)
+
+    with autotune(tune_mode=True, skip_ops="skip_me"):
+        _, tactic = tuner.choose_one("skip_me", [runner], config, inputs)
+
+    assert tactic == -1
+    assert called == []

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/autotuner/test_autotuner_core.py` around lines 848 - 1017, Add a new
unit test that verifies skip_ops accepts a string by calling
autotune(tune_mode=True, skip_ops="skip_me") and asserting it behaves like the
set form: use reset_autotuner() and a DummyRunner, monkeypatch
AutoTuner._profile_single_kernel to track/no-op profiling, call
tuner.choose_one("skip_me", [runner], config, inputs) and assert the returned
tactic is -1, the chosen runner is the first, no profiling was invoked, and that
tuner._effective_skip_ops contains "skip_me" (normalized to a frozenset) to lock
the contract for string inputs.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@flashinfer/autotuner.py`:
- Around line 1126-1138: The method currently only checks for an empty runners
list inside the skip_ops branch, which leaves non-skipped paths trying to index
runners (e.g., runners[runner_id]) and can raise IndexError; at the start of the
enclosing method (the function that references custom_op, _effective_skip_ops,
runners and runner_id), validate that runners is non-empty and raise a clear
ValueError if empty (same style as the skip path) before any use of runners or
early returns, so both skipped and non-skipped flows are safe.

In `@tests/autotuner/test_autotuner_core.py`:
- Line 957: The test assigns unused unpacked values causing Ruff RUF059: when
calling tuner.choose_one in tests/autotuner/test_autotuner_core.py (the call
that currently does "chosen, tactic = tuner.choose_one(...)" and the similar
assignment that creates "tactic_b2"), discard the unused value(s) by replacing
the unused variable with an underscore (e.g., use "_, tactic =
tuner.choose_one(...)" or "tactic_b2, _ = ..." as appropriate) or by only
assigning the needed element (e.g., assign the return to a single name and index
into it), keeping the call to choose_one intact but removing the unused
symbol(s).

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3cbb7f00-cc3d-48d9-9489-21e3057e7345

📥 Commits

Reviewing files that changed from the base of the PR and between bff85f3 and 43d842f.

📒 Files selected for processing (2)

flashinfer/autotuner.py
tests/autotuner/test_autotuner_core.py

coderabbitai · 2026-05-26T23:29:43Z

+        # Skip profiling for ops in the skip_ops set — return fallback
+        # immediately.  The fallback runner (runners[0], tactic=-1) uses
+        # the op's built-in heuristic, avoiding kernel compilation.
+        # Checked before acquiring the lock since _effective_skip_ops is
+        # thread-local and does not touch shared state.
+        if custom_op in self._effective_skip_ops:
+            logger.debug(
+                f"[AutoTuner]: Skipping autotuning for '{custom_op}' "
+                f"(in skip_ops). Using fallback tactic."
+            )
+            if not runners:
+                raise ValueError(f"No runners provided for op '{custom_op}'")
+            return runners[0], -1


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate runners once at method entry, not only in the skip path.

Line 1136 adds an empty-list guard only for skipped ops. For non-skipped ops, Line 1152 still does runners[runner_id], so runners=[] can raise IndexError in inference mode.

Suggested fix

     def choose_one(
         self,
         custom_op: str,
         runners: List[TunableRunner],
         tuning_config: TuningConfig,
         inputs: List[torch.Tensor],
         **kwargs,
     ) -> Tuple[TunableRunner, int]:
+        if not runners:
+            raise ValueError(f"No runners provided for op '{custom_op}'")
+        if not all(isinstance(r, TunableRunner) for r in runners):
+            raise TypeError("All given runners must be subclasses of TunableRunner")
+
         # Skip profiling for ops in the skip_ops set — return fallback
         # immediately.
         if custom_op in self._effective_skip_ops:
             logger.debug(
                 f"[AutoTuner]: Skipping autotuning for '{custom_op}' "
                 f"(in skip_ops). Using fallback tactic."
             )
-            if not runners:
-                raise ValueError(f"No runners provided for op '{custom_op}'")
             return runners[0], -1
 ...
-        assert len(runners) > 0, "At least one runner is required"
-        assert all([isinstance(r, TunableRunner) for r in runners]), (
-            "All Given runners must be subclass of TunableRunner"
-        )
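
The effect of validating at method entry can be seen in a simplified, standalone sketch. Everything here is an assumption made for illustration: `DummyRunner`, the explicit `skip_ops` argument, and the `cache` dictionary stand in for the real `AutoTuner` state, which this sketch does not reproduce.

```python
from typing import Dict, FrozenSet, List, Tuple


class DummyRunner:
    """Hypothetical stand-in for a TunableRunner."""


def choose_one(
    custom_op: str,
    runners: List[DummyRunner],
    skip_ops: FrozenSet[str],
    cache: Dict[str, Tuple[int, int]],
) -> Tuple[DummyRunner, int]:
    # Validate once, before any branch can index into `runners`.
    if not runners:
        raise ValueError(f"No runners provided for op '{custom_op}'")

    # Skipped ops take the fallback: first runner, tactic -1.
    if custom_op in skip_ops:
        return runners[0], -1

    # Non-skipped ops: use a cached (runner_id, tactic) if present,
    # otherwise fall back to the first runner as well.
    runner_id, tactic = cache.get(custom_op, (0, -1))
    return runners[runner_id], tactic
```

With the guard at the top, an empty `runners` list raises the same `ValueError` whether or not the op is skipped, instead of an `IndexError` surfacing only on the non-skipped path.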

coderabbitai · 2026-05-26T23:29:43Z

+    monkeypatch.setattr(AutoTuner, "_profile_single_kernel", fake_profile)
+
+    with autotune(tune_mode=True, skip_ops=set()):
+        chosen, tactic = tuner.choose_one("should_tune", [runner], config, inputs)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix Ruff RUF059 by discarding unused unpacked values.

Line 957 (chosen) and Line 986 (tactic_b2) are assigned but unused.

Suggested cleanup

-    with autotune(tune_mode=True, skip_ops=set()):
-        chosen, tactic = tuner.choose_one("should_tune", [runner], config, inputs)
+    with autotune(tune_mode=True, skip_ops=set()):
+        _, tactic = tuner.choose_one("should_tune", [runner], config, inputs)
 ...
-    _, tactic_b2 = tuner.choose_one("op_b", [runner], config, inputs)
+    _, _ = tuner.choose_one("op_b", [runner], config, inputs)

Also applies to: 986-986

🧰 Tools

🪛 Ruff (0.15.14)

[warning] 957-957: Unpacked variable chosen is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

bkryu · 2026-05-27T17:54:54Z

/bot run

flashinfer-bot · 2026-05-27T17:56:14Z

GitLab MR !719 has been created, and the CI pipeline #52812346 is currently running. I'll report back once the pipeline job completes.

feat(autotuner): add skip_ops parameter to autotune() context manager

006078a

Add skip_ops to let frameworks exclude specific ops from autotuning, falling back to heuristic tactics without kernel compilation overhead. addresses #3295: mm_fp4 cute-dsl autotuning is slow due to compilation.

gemini-code-assist Bot reviewed May 23, 2026

qiching mentioned this pull request May 23, 2026

The autotune speed of mm_fp4 with backend=cute-dsl is slow #3295

Closed

qiching marked this pull request as ready for review May 26, 2026 23:25

qiching requested review from aleozlx, bkryu, cyx-6, dhiraj113, jimmyzho, kahyunnam, nv-yunzheq, saltyminty, samuellees, sricketts, yongwww, yyihuang and yzh119 as code owners May 26, 2026 23:25

coderabbitai Bot reviewed May 26, 2026

kpham-sgl mentioned this pull request May 27, 2026

Fix flashinfer autotune oom glm51 sgl-project/sglang#24195

Merged

docs(autotuner): list common op names in skip_ops docstring

e4194e2

bkryu added the run-ci label May 27, 2026

LopezCastroRoberto mentioned this pull request May 28, 2026

[Kernel][Performance] Add FlashInfer cutedsl NVFP4 GEMM backend vllm-project/vllm#42235

Open

bkryu approved these changes May 28, 2026

bkryu merged commit 8eb6154 into main May 28, 2026
42 of 44 checks passed

bkryu deleted the feat/skip-ops branch May 28, 2026 16:07

bkryu merged 3 commits into main from feat/skip-ops

3 participants
