[FRONTEND] let CacheManager write to temp dir instead of temp file #4295
Merged
Conversation
ThomasRaoux
approved these changes
Jul 10, 2024
Collaborator
ThomasRaoux
left a comment
Looks reasonable to me
awesome thanks @yundai424 - this looks like it should fix the issue we have in vLLM. is there a rough ETA for when this would be in a Triton release? trying to plan whether we need to merge a work-around in the meantime.
bertmaher
pushed a commit
to bertmaher/triton
that referenced
this pull request
Dec 10, 2024
…riton-lang#4295)

# Summary

There have been multiple issues discussing the `FileNotFoundError` raised during compilation when `CompiledKernel` tries to read the listed ASM files: triton-lang#2688, triton-lang#4002, vllm-project/vllm#6103, etc. There have also been some attempts to address it, such as triton-lang#3544. This PR attempts to explain the root cause and suggest a fix.

# Why

When a kernel is being compiled, Triton first writes IRs to the Triton cache dir ([ref](https://github.com/triton-lang/triton/blob/78091647fccb6825ed9956ff7c0300859856d261/python/triton/compiler/compiler.py#L289)). Inside the write operation, the process first writes to a temp file unique to the current process (plus a uuid to distinguish between multiple processes with the same PID on different hosts sharing the same underlying FS) ([ref](https://github.com/triton-lang/triton/blob/c14b033cd979d5c39e5fdb3847c022fa5d71a0c1/python/triton/runtime/cache.py#L124-L130)) and then atomically moves it to the final file name with `os.replace`. Afterwards, `CompiledKernel` lists all the IRs and reads them ([ref](https://github.com/triton-lang/triton/blob/78091647fccb6825ed9956ff7c0300859856d261/python/triton/compiler/compiler.py#L362-L367)).

In a multiprocess setup, however, this may result in a race condition. Let's focus on a case where there's one host with 2 processes on it.

At the time when `pid 1` lists ASMs, the dir may contain temp files generated by another process, `pid 2`. However, by the time `pid 1` proceeds to read bytes from the listed files, `pid 2` may have already moved its temp files away with `os.replace`, so `pid 1` will encounter `FileNotFoundError` when trying to read the temp file generated by `pid 2`. IBM/vllm#35 (comment) also believes this is the root cause.
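The interleaving described above can be replayed deterministically in a single process. This is only a sketch: the file names (`kernel.ttir`, `kernel.ttir.tmp.pid_2_abcd`) and the `simulate_race` helper are hypothetical stand-ins, not Triton's actual names or code.

```python
import os
import tempfile


def simulate_race():
    """Replay the list-then-read interleaving from the description."""
    with tempfile.TemporaryDirectory() as cache_dir:
        final_path = os.path.join(cache_dir, "kernel.ttir")
        # "pid 2" is mid-write: its temp file sits in the shared cache dir.
        # The temp-file name is illustrative, not Triton's exact pattern.
        temp_path = os.path.join(cache_dir, "kernel.ttir.tmp.pid_2_abcd")
        with open(temp_path, "w") as f:
            f.write("ir produced by pid 2")

        # "pid 1" lists the cache dir and picks up the temp file.
        listed = sorted(os.listdir(cache_dir))

        # "pid 2" finishes: the temp file is renamed to its final name.
        os.replace(temp_path, final_path)

        # "pid 1" now reads what it listed a moment ago.
        missing = []
        for name in listed:
            try:
                with open(os.path.join(cache_dir, name)) as f:
                    f.read()
            except FileNotFoundError:
                missing.append(name)
        return listed, missing


listed, missing = simulate_race()
print("listed: ", listed)
print("missing:", missing)
```

Every name that "pid 1" listed is gone by the time it reads, which mirrors the `FileNotFoundError` reported in the linked issues.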
# How

There are multiple potential solutions, as also mentioned in IBM/vllm#35 (comment):

- let each process write to a private temp dir instead, so `glob` won't take the temp files into consideration
- or, exclude `tmp.pid_*` from `glob`

This PR goes with the 1st approach to avoid adding an assumption about the tmp file pattern (which is only used in `runtime/cache.py`) to `compiler/compiler.py`, but is open to any suggestion. Thanks!

Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them.

- [x] I am not making a trivial change, such as fixing a typo in a comment.
- [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- Select one of the following.
  - [ ] I have added tests.
    - `/test` for `lit` tests
    - `/unittest` for C++ tests
    - `/python/test` for end-to-end tests
  - [x] This PR does not need a test because `not applicable`.
- Select one of the following.
  - [x] I have not added any `lit` tests.
  - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)
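The first approach under "How" (a private temp dir per write) might look roughly like the sketch below. It is not the code from this PR: `put`, `list_ir_files`, `demo`, and the directory and file names are hypothetical, and the listing pattern `<kernel_name>.*` is an assumption about how the reader side globs the cache dir.

```python
import glob
import os
import tempfile
import uuid


def put(cache_dir, filename, data):
    """Write data to cache_dir/filename via a private temp dir (hypothetical helper)."""
    # Illustrative dir name: unique per process and per call.
    temp_dir = os.path.join(cache_dir, f"tmp.pid_{os.getpid()}_{uuid.uuid4()}")
    os.makedirs(temp_dir, exist_ok=True)
    temp_path = os.path.join(temp_dir, filename)
    with open(temp_path, "w") as f:
        f.write(data)
    # The temp dir lives inside cache_dir, so the rename stays on one filesystem.
    final_path = os.path.join(cache_dir, filename)
    os.replace(temp_path, final_path)
    os.rmdir(temp_dir)  # empty once the file has been moved out
    return final_path


def list_ir_files(cache_dir, kernel_name):
    """List entries named '<kernel_name>.*' directly inside cache_dir."""
    pattern = os.path.join(glob.escape(cache_dir), f"{kernel_name}.*")
    return sorted(glob.glob(pattern))


def demo():
    with tempfile.TemporaryDirectory() as cache_dir:
        put(cache_dir, "kernel.ttir", "ir text")
        # Simulate another process that is still mid-write in its own temp dir.
        in_flight = os.path.join(cache_dir, "tmp.pid_2_example")
        os.makedirs(in_flight)
        with open(os.path.join(in_flight, "kernel.llir"), "w") as f:
            f.write("partial")
        return [os.path.basename(p) for p in list_ir_files(cache_dir, "kernel")]


visible = demo()
print(visible)
```

Because the in-flight file sits one directory level down, a non-recursive listing of the cache dir never sees it, so the reader needs no knowledge of the temp naming pattern.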
fulvius31
added a commit
to fulvius31/vllm
that referenced
this pull request
Feb 24, 2025
since vllm uses torch 2.5.1 which uses triton 3.1.0 that includes triton-lang/triton#4295

Signed-off-by: Alessandro Sangiorgi <asangior@redhat.com>
Signed-off-by: fulvius31 <asangior@redhat.com>