Closed
1 change: 1 addition & 0 deletions jenkins/L0_Test.groovy
@@ -902,6 +902,7 @@ def getMountListForSlurmTest(SlurmCluster cluster, boolean useSbatch = false)
}
mounts += [
"${cluster.scratchPath}:/scratch.trt_llm_data:ro",
"/home/svc_tensorrt/.triton:/root/.triton",
]
Comment on lines 903 to 906

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the file to confirm the context
head -n 920 jenkins/L0_Test.groovy | tail -n 25

Repository: NVIDIA/TensorRT-LLM

Length of output: 1170


🌐 Web query:

Triton compiler cache concurrent writes safety thread safety

💡 Result:

Triton’s compiler cache has historically not been fully safe under concurrent multi-process writes to the same cache directory (common in distributed training / TP>1 on a shared filesystem).

  • Triton’s cache put() uses a “write temp → os.replace() to final name” pattern (atomic rename/replace), but other processes can still observe transient temp files (e.g., via directory glob/listing) and then fail with FileNotFoundError once those temp files have been renamed to their final names. This failure mode is reported in triton-lang/triton issue #2688. (github.com)
  • Two upstream fixes addressed concurrency hazards:
    1. PR #3544: avoids temp-file name collisions across distributed setups (e.g., same PID/seed across nodes) by using UUIDs. (github.com)
    2. PR #4295: changes the cache manager to write into a process-private temp directory, so directory scans don’t pick up another process’s temp artifacts (and explains the race in detail). (github.com)
  • NVIDIA Megatron-Core ships a patched ParallelFileCacheManager explicitly to “prevent errors related to the building of the Triton compiler cache” under parallel ranks and on filesystems like Lustre, and notes it implements Triton PRs #3544 and #4295 (and suggests these changes should land in Triton 3.2). (docs.nvidia.com)
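
As an illustration of the pattern the bullets above describe, here is a minimal Python sketch of "write to a private staging location, then os.replace() into place". It is not Triton's actual cache code; the function name `atomic_cache_put` and the staging-directory naming scheme are hypothetical. It assumes the staging directory lives on the same filesystem as the cache so the rename stays atomic.

```python
import os
import uuid


def atomic_cache_put(cache_dir: str, filename: str, data: bytes) -> str:
    """Publish data as cache_dir/filename without exposing a partial file.

    Hypothetical helper: illustrates the staging + os.replace() pattern only.
    """
    os.makedirs(cache_dir, exist_ok=True)
    final_path = os.path.join(cache_dir, filename)

    # A uniquely named staging directory per call. The UUID avoids collisions
    # when two nodes happen to share a PID; keeping it under cache_dir keeps
    # the final rename on a single filesystem.
    staging_dir = os.path.join(
        cache_dir, f"tmp.pid_{os.getpid()}_{uuid.uuid4().hex}"
    )
    os.makedirs(staging_dir)
    staged_path = os.path.join(staging_dir, filename)

    try:
        with open(staged_path, "wb") as f:
            f.write(data)
        # Atomic on POSIX when source and destination share a filesystem:
        # readers see either the old file or the complete new one.
        os.replace(staged_path, final_path)
    finally:
        # Clean up the staging area whether or not the write succeeded.
        if os.path.exists(staged_path):
            os.remove(staged_path)
        os.rmdir(staging_dir)

    return final_path
```

Readers of such a cache should look up exact final file names rather than globbing the directory, since a directory listing can still show another writer's staging directory while a write is in flight.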

Practical guidance

  • If you see cache FileNotFoundError / odd behavior under distributed compilation: upgrade Triton to a version that includes PRs #3544 and #4295, or use Megatron-Core’s ParallelFileCacheManager, or ensure each process/rank uses a distinct TRITON_CACHE_DIR (ideally on local disk, not shared NFS/Lustre). (docs.nvidia.com)
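
The "distinct TRITON_CACHE_DIR per process" option can be sketched in Python as follows. This is a rough example under stated assumptions: the helper names are hypothetical, the directory naming scheme is arbitrary, and it assumes the job runs under SLURM so that `SLURM_JOB_ID` and `SLURM_PROCID` are set (fallback values are used otherwise). The environment variable has to be set before Triton compiles its first kernel.

```python
import os
import tempfile


def per_rank_triton_cache_dir(base_dir=None, env=None) -> str:
    """Build a cache path that is unique per SLURM job and per rank.

    Hypothetical helper; the "triton-cache-<job>-rank<rank>" layout is an
    assumption, not a Triton or SLURM convention.
    """
    env = os.environ if env is None else env
    # Default to node-local temp storage rather than a shared filesystem.
    base_dir = base_dir or tempfile.gettempdir()
    job_id = env.get("SLURM_JOB_ID", "nojob")
    rank = env.get("SLURM_PROCID", "0")
    return os.path.join(base_dir, f"triton-cache-{job_id}-rank{rank}")


def configure_triton_cache(base_dir=None) -> str:
    """Create the per-rank directory and point TRITON_CACHE_DIR at it."""
    cache_dir = per_rank_triton_cache_dir(base_dir)
    os.makedirs(cache_dir, exist_ok=True)
    os.environ["TRITON_CACHE_DIR"] = cache_dir
    return cache_dir
```

The trade-off is that each rank recompiles its kernels from scratch, so this buys isolation at the cost of losing cache reuse across jobs.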



🏁 Script executed:

# Let's also search for any other references to the .triton mount or cache handling in the codebase
rg "\.triton|triton.*cache" --type groovy -i -A 2 -B 2

Repository: NVIDIA/TensorRT-LLM

Length of output: 339


Address Triton compiler cache concurrency safety with proper mitigation.

The Triton compiler cache mount added at line 905 does not specify an access mode, so multiple SLURM jobs can write to it concurrently. This is a known weakness in Triton's cache implementation: concurrent multi-process writes to the same cache directory can race (surfacing as FileNotFoundError) and corrupt the cache, particularly in distributed settings.

Simply adding :ro is insufficient if the cache needs to be populated dynamically. The proper upstream solutions are:

  1. Upgrade Triton to a version that includes concurrency fixes (PRs #3544 and #4295, expected in Triton 3.2+)
  2. Use NVIDIA Megatron-Core's ParallelFileCacheManager (explicitly designed to prevent cache corruption under parallel ranks)
  3. Use per-process cache directories on local disk by setting distinct TRITON_CACHE_DIR per job, rather than sharing a single cache

If this cache is intended as a pre-populated shared read-only resource, then adding :ro is appropriate. If jobs populate it dynamically, clarify the concurrency handling strategy and implement one of the upstream mitigations above.
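
One way to combine the two cases, sketched in Python: keep the shared cache mounted read-only and copy it into a writable job-local directory at startup, so jobs start warm but never write to the shared path. This is only a sketch under assumptions: `seed_local_cache` is a hypothetical helper, and it presumes the shared cache is visible inside the container (for example at `/root/.triton`, the mount target in the diff) and is small enough to copy per job.

```python
import os
import shutil


def seed_local_cache(shared_cache: str, local_cache: str) -> str:
    """Copy a read-only shared cache into a writable job-local directory.

    Hypothetical helper: the shared directory is only read, never written.
    """
    os.makedirs(local_cache, exist_ok=True)
    if os.path.isdir(shared_cache):
        # dirs_exist_ok (Python 3.8+) lets the copy merge into a directory
        # that already exists, e.g. when a job step is retried.
        shutil.copytree(shared_cache, local_cache, dirs_exist_ok=True)
    # New compilations now land in the job-local copy only.
    os.environ["TRITON_CACHE_DIR"] = local_cache
    return local_cache
```

With this arrangement the `:ro` flag on the shared mount is safe even though jobs compile new kernels, because all writes go to the local copy; refreshing the shared cache then becomes a separate, single-writer step.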

🤖 Prompt for AI Agents
In `@jenkins/L0_Test.groovy` around lines 903 - 906, The Triton compiler cache
mount (the entry that maps "/home/svc_tensorrt/.triton" into the container via
the mounts list using cluster.scratchPath and the explicit
"/home/svc_tensorrt/.triton:/root/.triton" mapping) can cause concurrent write
races; either make the mount read-only if it is a pre-populated shared cache
(add :ro to "/home/svc_tensorrt/.triton:/root/.triton") or, if jobs may populate
the cache, implement one of the recommended mitigations: upgrade Triton to a
version with the concurrency fixes, switch to NVIDIA Megatron-Core’s
ParallelFileCacheManager, or configure per-job unique TRITON_CACHE_DIRs (so each
job uses a local unique cache path) and remove the shared writable mount—apply
the chosen strategy consistently wherever mounts is constructed and where
TRITON_CACHE_DIR is set.

} else {
throw new Exception("Unsupported container runtime: ${cluster.containerRuntime}")