[Performance] Enable Triton autotuning disk cache by default #37188
zou3519 merged 3 commits into vllm-project:main
Conversation
Code Review
This pull request effectively addresses a significant performance issue by enabling Triton's autotuning disk cache by default. The use of os.environ.setdefault is appropriate, allowing existing user configurations to take precedence while providing a beneficial default. The accompanying comments clearly explain the purpose and impact of this change, including how users can override the setting. This is a valuable improvement for vLLM serving workloads, reducing latency on subsequent server starts.
Triton's `@triton.autotune` decorator re-runs kernel autotuning on every process restart because `TRITON_CACHE_AUTOTUNING` defaults to disabled. For vLLM serving workloads this adds significant latency to the first inference request after each server start. Set `TRITON_CACHE_AUTOTUNING=1` via `os.environ.setdefault` so that autotuning results are persisted to `TRITON_CACHE_DIR` and reused across restarts. Users can still opt out by explicitly setting `TRITON_CACHE_AUTOTUNING=0`.

Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Made-with: Cursor
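The change boils down to a one-line environment default. A minimal sketch of the `os.environ.setdefault` behavior it relies on (the helper name is illustrative, not vLLM's actual code):

```python
import os

def enable_triton_autotune_cache() -> None:
    # Hypothetical helper mirroring the described change. setdefault only
    # writes the value when the variable is unset, so an explicit user
    # setting (e.g. "0" to opt out) always takes precedence.
    os.environ.setdefault("TRITON_CACHE_AUTOTUNING", "1")

# Default case: variable unset, so the cache gets enabled.
os.environ.pop("TRITON_CACHE_AUTOTUNING", None)
enable_triton_autotune_cache()
print(os.environ["TRITON_CACHE_AUTOTUNING"])  # prints 1

# Opt-out case: a pre-existing user value is left untouched.
os.environ["TRITON_CACHE_AUTOTUNING"] = "0"
enable_triton_autotune_cache()
print(os.environ["TRITON_CACHE_AUTOTUNING"])  # prints 0
```

This is why the default is safe to ship: it never clobbers a deployment that already pins the variable either way.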
Force-pushed 52945c0 to 965eaa7
Your perf data was likely collected without #36599. With that merged there should be no difference in runtime, but the server startup speedup should be sufficient on its own. Could you please re-measure after syncing with main?
@mgoin and @robertgshaw2-redhat
mgoin left a comment:
Let's see if it breaks anything :)
Keep the related TORCHINDUCTOR_COMPILE_THREADS env var and torch._inductor.config.compile_threads assignment adjacent.

Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
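A sketch of the adjacency this commit asks for (the values and module-level placement are assumptions for illustration, not vLLM's exact code):

```python
import os

# Keep the two compile-related environment defaults next to each other so the
# Triton autotune-cache default and the Inductor thread setting read as one
# unit. Both values here are illustrative, not vLLM's exact choices.
os.environ.setdefault("TRITON_CACHE_AUTOTUNING", "1")
os.environ.setdefault("TORCHINDUCTOR_COMPILE_THREADS", "1")

# The paired torch._inductor.config.compile_threads assignment would sit right
# here in vLLM's code; it is elided so this sketch runs without torch installed.
print(os.environ["TORCHINDUCTOR_COMPILE_THREADS"])
```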
[Performance] Enable Triton autotuning disk cache by default (vllm-project#37188)

Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com>
Purpose
Triton's `@triton.autotune` decorator re-runs kernel autotuning on every process restart because `TRITON_CACHE_AUTOTUNING` defaults to disabled. For vLLM serving workloads this adds significant latency to the first inference request after each server start.

Set `TRITON_CACHE_AUTOTUNING=1` via `os.environ.setdefault` so that autotuning results are persisted to `TRITON_CACHE_DIR` and reused across restarts. Users can still opt out by explicitly setting `TRITON_CACHE_AUTOTUNING=0`.

Test description
Setup: 8×B200, Qwen/Qwen3.5-397B-A17B-FP8, dp=8, expert parallelism enabled.
Server command:
Benchmark command:
Before this change
1st run — started the server, ran the benchmark. Autotuning messages appeared in the log:
2nd run — restarted the server with the exact same command, ran the same benchmark. The same autotuning messages appeared again. Triton did not reuse any cached results because `TRITON_CACHE_AUTOTUNING` is `False` by default — without it Triton ignores the disk cache entirely and re-runs autotuning from scratch on every process restart.

After this change
1st run — started the server, ran the benchmark. Autotuning messages appeared in the log (cache is cold, expected).
2nd run — restarted the server with the exact same command, ran the same benchmark. No autotuning messages in the log. Triton successfully loaded the cached autotuning results from disk and skipped re-autotuning.
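A quick way to see what the 2nd run is reusing is to inspect the cache directory itself (assuming Triton's default location of `~/.triton/cache` when `TRITON_CACHE_DIR` is unset):

```shell
# Resolve the cache directory the same way Triton does: the env var wins,
# otherwise fall back to the default per-user location.
CACHE_DIR="${TRITON_CACHE_DIR:-$HOME/.triton/cache}"

# Each cached kernel gets its own hash-named subdirectory; after the 1st run
# these entries are what the 2nd run loads instead of re-autotuning.
ls "$CACHE_DIR" 2>/dev/null || echo "cache not populated yet: $CACHE_DIR"
```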
Test result
Setup: 8× NVIDIA B200, Qwen/Qwen3.5-397B-A17B-FP8, dp=8, expert parallelism, 128 prompts, random input of 8000 tokens, output of 1 token, concurrency 8.
By caching Triton kernel autotuning results to disk, the 2nd benchmark run achieved a 3x speedup.