[None][infra] Mount triton cache #11301

coderabbitai · 2026-02-05T04:04:11Z

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the file to confirm the context
head -n 920 jenkins/L0_Test.groovy | tail -n 25

Repository: NVIDIA/TensorRT-LLM

Length of output: 1170

🌐 Web query:

Triton compiler cache concurrent writes safety thread safety

💡 Result:

Triton’s compiler cache has historically not been fully safe under concurrent multi-process writes to the same cache directory (common in distributed training / TP>1 on a shared filesystem).

Triton’s cache put() uses a “write temp → os.replace() to final name” pattern (atomic rename/replace), but other processes can still observe and act on transient temp files (e.g., via directory glob/listing) and then fail when those temp files are renamed to their final names, causing FileNotFoundError in multi-process runs. This exact failure mode is reported in triton-lang/triton issue #2688. (github.com)

Two upstream fixes addressed concurrency hazards:

PR #3544: avoids temp-file name collisions across distributed setups (e.g., same PID/seed across nodes) by using UUIDs. (github.com)

PR #4295: changes the cache manager to write into a process-private temp directory, so directory scans don’t pick up another process’s temp artifacts (and explains the race in detail). (github.com)
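The idea behind the two fixes can be illustrated with a minimal Python sketch. This is not Triton's actual cache manager code: the function name `atomic_put` and the `tmp.pid_…` directory naming are assumptions made for illustration. It stages the file in a process-private, UUID-named temp directory and only then moves it to its final name with `os.replace()`, so a reader scanning the cache directory should never see a half-written entry under the final name.

```python
import os
import tempfile
import uuid


def atomic_put(cache_dir: str, filename: str, data: bytes) -> str:
    """Write `data` to cache_dir/filename via a private staging directory.

    Hypothetical helper for illustration; not part of Triton's API.
    """
    os.makedirs(cache_dir, exist_ok=True)

    # Private staging directory. The UUID keeps the name unique even if two
    # nodes happen to share the same PID. It lives inside cache_dir so the
    # final os.replace() stays on one filesystem, where the rename is atomic.
    tmp_dir = os.path.join(cache_dir, f"tmp.pid_{os.getpid()}_{uuid.uuid4().hex}")
    os.makedirs(tmp_dir)

    tmp_path = os.path.join(tmp_dir, filename)
    final_path = os.path.join(cache_dir, filename)
    try:
        with open(tmp_path, "wb") as f:
            f.write(data)
        # Atomically publish the finished file under its final name.
        os.replace(tmp_path, final_path)
    finally:
        # Clean up the staging area whether or not the write succeeded.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        os.rmdir(tmp_dir)
    return final_path


if __name__ == "__main__":
    demo_dir = tempfile.mkdtemp()
    print(atomic_put(demo_dir, "kernel.bin", b"compiled-bytes"))
```

Readers that list the cache directory would still need to skip the `tmp.*` staging directories; the sketch only shows why partially written files no longer appear next to finished ones.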

NVIDIA Megatron-Core ships a patched ParallelFileCacheManager explicitly to “prevent errors related to the building of the Triton compiler cache” under parallel ranks and on filesystems like Lustre, and notes it implements Triton PRs #3544 and #4295 (and suggests these changes should land in Triton 3.2). (docs.nvidia.com)

Practical guidance

If you see cache FileNotFoundError / odd behavior under distributed compilation: upgrade Triton to a version that includes PRs #3544 and #4295, or use Megatron-Core’s ParallelFileCacheManager, or ensure each process/rank uses a distinct TRITON_CACHE_DIR (ideally on local disk, not shared NFS/Lustre). (docs.nvidia.com)

Citations:

1: Encountering FileNotFoundError while Compiling Triton Kernel in Distributed Training triton-lang/triton#2688

2: [RUNTIME] Replace random number with UUID triton-lang/triton#3544

3: [RUNTIME] Replace random number with UUID triton-lang/triton#3544

4: https://docs.nvidia.com/megatron-core/developer-guide/0.15.0/apidocs/core/core.ssm.triton_cache_manager.html

5: https://docs.nvidia.com/megatron-core/developer-guide/0.15.0/apidocs/core/core.ssm.triton_cache_manager.html

🏁 Script executed:

# Let's also search for any other references to the .triton mount or cache handling in the codebase
rg "\.triton|triton.*cache" --type groovy -i -A 2 -B 2

Repository: NVIDIA/TensorRT-LLM

Length of output: 339

Address Triton compiler cache concurrency safety with proper mitigation.

The Triton compiler cache mount at line 905 does not specify an access mode, so multiple SLURM jobs can write to it concurrently. This is a known issue in Triton's cache implementation: concurrent multi-process writes to the same cache directory can cause race conditions (FileNotFoundError) and cache corruption, particularly in distributed settings.

Simply adding :ro is insufficient if the cache needs to be populated dynamically. The proper upstream solutions are:

Upgrade Triton to a version that includes concurrency fixes (PRs #3544 and #4295, expected in Triton 3.2+)

Use NVIDIA Megatron-Core's ParallelFileCacheManager (explicitly designed to prevent cache corruption under parallel ranks)

Use per-process cache directories on local disk by setting distinct TRITON_CACHE_DIR per job, rather than sharing a single cache

If this cache is intended as a pre-populated shared read-only resource, then adding :ro is appropriate. If jobs populate it dynamically, clarify the concurrency handling strategy and implement one of the upstream mitigations above.
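The per-process cache directory option can be sketched as follows. This is a minimal sketch under stated assumptions, not code from this PR: the helper names and the `/tmp/triton_cache` base path are hypothetical, and it assumes the jobs run under SLURM, which exports `SLURM_JOB_ID` and `SLURM_PROCID`. `TRITON_CACHE_DIR` is the environment variable named in the guidance above.

```python
import os


def per_job_triton_cache_dir(env, base="/tmp/triton_cache"):
    """Build a cache path unique to this SLURM job and rank.

    Hypothetical helper; `base` is an assumed node-local location.
    """
    job_id = env.get("SLURM_JOB_ID", "nojob")
    rank = env.get("SLURM_PROCID", "0")
    return os.path.join(base, f"job_{job_id}", f"rank_{rank}")


def configure_triton_cache(env=None, base="/tmp/triton_cache"):
    """Create the per-job directory and point TRITON_CACHE_DIR at it."""
    if env is None:
        env = os.environ
    path = per_job_triton_cache_dir(env, base)
    os.makedirs(path, exist_ok=True)
    env["TRITON_CACHE_DIR"] = path
    return path


if __name__ == "__main__":
    # Call this before any Triton kernel is compiled in the process.
    print(configure_triton_cache())
```

The trade-off is that each job starts with a cold cache, so this avoids the write race at the cost of the compile-time savings a shared cache would provide.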

🤖 Prompt for AI Agents

In `@jenkins/L0_Test.groovy` around lines 903 - 906, The Triton compiler cache mount (the entry that maps "/home/svc_tensorrt/.triton" into the container via the mounts list using cluster.scratchPath and the explicit "/home/svc_tensorrt/.triton:/root/.triton" mapping) can cause concurrent write races; either make the mount read-only if it is a pre-populated shared cache (add :ro to "/home/svc_tensorrt/.triton:/root/.triton") or, if jobs may populate the cache, implement one of the recommended mitigations: upgrade Triton to a version with the concurrency fixes, switch to NVIDIA Megatron-Core’s ParallelFileCacheManager, or configure per-job unique TRITON_CACHE_DIRs (so each job uses a local unique cache path) and remove the shared writable mount—apply the chosen strategy consistently wherever mounts is constructed and where TRITON_CACHE_DIR is set.

             }
             mounts += [
                 "${cluster.scratchPath}:/scratch.trt_llm_data:ro",
+                "/home/svc_tensorrt/.triton:/root/.triton",
             ]
         } else {
             throw new Exception("Unsupported container runtime: ${cluster.containerRuntime}")