[Perf] Add VLLM_TRITON_FORCE_FIRST_CONFIG to skip Triton autotuning#42425
fuscof-ibm wants to merge 1 commit
When VLLM_TRITON_FORCE_FIRST_CONFIG=1, monkeypatch triton.runtime.autotuner.Autotuner.run to always select configs[0] and skip benchmarking, eliminating autotuning variability when measuring kernel performance.
Log one line per unique kernel showing the number of candidate configs and the picked config, so it is easy to verify which kernels the patch intercepts.
Gated on HAS_TRITON so it is a no-op on builds without Triton; default off preserves normal autotuning behavior.
Signed-off-by: Francesco Fusco <ffu@zurich.ibm.com>
Code Review
This pull request introduces a mechanism to disable Triton's autotuning by monkey-patching the Autotuner.run method to always select the first configuration when the VLLM_TRITON_FORCE_FIRST_CONFIG environment variable is enabled. Feedback indicates that the implementation should handle cases where self.configs is empty to avoid an IndexError and should call self.fn directly instead of self.fn.run to ensure compatibility with Triton 3.x.
    seen_kernels: set[str] = set()


    def _run_first_config(self, *args, **kwargs):
        config = self.configs[0]
The code assumes self.configs is non-empty. If a kernel is defined with an empty list of configurations (which is allowed in Triton), this will raise an IndexError. The original Triton Autotuner.run implementation handles this by checking if self.configs is empty and falling back to a direct call.
Suggested change:
    -     config = self.configs[0]
    +     if not self.configs:
    +         return self.fn(*args, **kwargs)
    +     config = self.configs[0]
            **config.all_kwargs(),
        }
        config.pre_hook(full_nargs)
        return self.fn.run(*args, **kwargs, **config.all_kwargs())
Using self.fn.run(...) will cause an AttributeError on Triton 3.x when the autotuner wraps a Heuristics object, as Heuristics does not have a run method in newer Triton versions. It is safer and more compatible to call self.fn(...) directly, which is what the upstream Triton Autotuner.run does.
Suggested change:
    -     return self.fn.run(*args, **kwargs, **config.all_kwargs())
    +     return self.fn(*args, **kwargs, **config.all_kwargs())
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
Triton autotuning can change the tile sizes used for matrix multiplies from run to run, which introduces substantial non-determinism.
When debugging accuracy regressions, it is useful to run against fixed configurations that do not depend on runtime fluctuations or cached tuning results. Editing every kernel involved or clearing caches by hand is inconvenient and error-prone.
This PR upstreams a commit that has proven useful for debugging speculative decoding performance issues in #40172.
When VLLM_TRITON_FORCE_FIRST_CONFIG=1, monkeypatch triton.runtime.autotuner.Autotuner.run to always select configs[0] and skip benchmarking, eliminating autotuning variability when measuring kernel performance. Log one line per unique kernel showing the number of candidate configs and the picked config, so it is easy to verify which kernels the patch intercepts. Gated on HAS_TRITON so it is a no-op on builds without Triton; default off preserves normal autotuning behavior.
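The patching pattern described above can be sketched without Triton installed. This is a minimal illustration, not the code in this PR: FakeConfig, FakeAutotuner, and install_force_first_config are hypothetical stand-ins invented for the example, the exact parsing of the environment variable (comparing against the string "1") is an assumption, and the HAS_TRITON gate is omitted because the sketch never imports Triton.

```python
import logging
import os

logger = logging.getLogger("force_first_config_sketch")


class FakeConfig:
    """Hypothetical stand-in for a Triton config (not the real class)."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs

    def all_kwargs(self):
        return dict(self.kwargs)


class FakeAutotuner:
    """Hypothetical stand-in for an autotuner that wraps a kernel."""

    def __init__(self, fn, configs):
        self.fn = fn
        self.configs = configs
        self.bench_calls = 0

    def run(self, *args, **kwargs):
        # Pretend to benchmark every candidate, then pick the last one.
        self.bench_calls += len(self.configs)
        config = self.configs[-1]
        return self.fn(*args, **kwargs, **config.all_kwargs())


def install_force_first_config(autotuner_cls, env=os.environ):
    """Patch autotuner_cls.run to always use configs[0] when the flag is set.

    Returns True if the patch was installed, False otherwise.
    Assumption: the flag is enabled only by the exact string "1".
    """
    if env.get("VLLM_TRITON_FORCE_FIRST_CONFIG", "0") != "1":
        return False  # default off: normal autotuning is left untouched

    seen_kernels = set()

    def _run_first_config(self, *args, **kwargs):
        config = self.configs[0]
        name = getattr(self.fn, "__name__", repr(self.fn))
        if name not in seen_kernels:
            # Log one line per unique kernel.
            seen_kernels.add(name)
            logger.info("%s: %d candidate configs, picked %s",
                        name, len(self.configs), config.all_kwargs())
        # The stand-in kernel is a plain Python callable, so it is called
        # directly here; launching a real Triton kernel works differently.
        return self.fn(*args, **kwargs, **config.all_kwargs())

    autotuner_cls.run = _run_first_config
    return True


def scaled_add(x, y, BLOCK=1):
    """Toy 'kernel' that reports which BLOCK value it was launched with."""
    return (x + y, BLOCK)
```

With the flag unset, FakeAutotuner.run keeps its benchmarking behavior; once install_force_first_config is called with the flag set to 1, every call goes straight to the first config and no further benchmarking is recorded.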
Test Plan
This code was exercised while debugging #40172.