Add option to use torch._inductor.standalone_compile #17057

houseroad merged 1 commit into vllm-project:main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
houseroad left a comment:
Overall looks reasonable to me.
Let's rebase?
@zou3519 Could you explain this comment? I'd like to understand the sketchiness.
This is a pre-existing problem; the existing InductorAdaptor has the same code in it.
If a function being compiled returns a single tensor, e.g. f(x) = x.sin(), and we compile it, then Inductor is always passed a graph that returns a (Tensor,), and Inductor returns a compiled artifact that also returns a (Tensor,). Dynamo is responsible for unpacking this back into a single tensor via the bytecode it generates.
vLLM takes the bytecode that Dynamo generates and turns it into Python code that wraps the compiled artifact. However, since we also need to do the unpacking manually here, I suspect that vLLM is not doing that process correctly.
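The tuple-wrapping convention described above can be sketched in plain Python. All names here are illustrative stand-ins, not vLLM's or PyTorch's real code:

```python
# Hedged sketch: Inductor always compiles a graph whose output is a tuple,
# even when the user's function returns a single tensor; someone downstream
# must unpack the 1-tuple back into a single value.

def user_fn(x):
    # The user's function returns a single value.
    return x * 2

def as_inductor_graph(fn):
    # Stand-in for what Inductor sees: a graph that returns (out,).
    def graph(x):
        return (fn(x),)
    return graph

def dynamo_wrapper(compiled_graph):
    # Stand-in for Dynamo's generated bytecode, which unpacks the 1-tuple.
    def wrapper(x):
        (out,) = compiled_graph(x)
        return out
    return wrapper

compiled = as_inductor_graph(user_fn)
assert compiled(3) == (6,)                 # raw artifact returns a tuple
assert dynamo_wrapper(compiled)(3) == 6    # wrapper restores the original shape
```

If vLLM wraps the compiled artifact directly instead of going through the Dynamo-produced wrapper, it has to replicate this unpacking itself, which is the step being questioned above.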
Makes sense, thanks for the explanation.
> since we also need to manually do the unpacking here, I suspect that vLLM is not doing that process correctly

What does this mean? As you mentioned, we have special handling logic for the case where the original graph returns a single tensor, and I think vLLM is correct here.
> what does this mean? as you mentioned, we have special handling logic for the case when the original graph returns a single tensor, and I think vLLM is correct here.

In torch.compile, the handling logic for what happens when the original graph returns a single tensor lives in the Dynamo-produced bytecode. In vLLM, the handling logic is in the InductorAdaptor. I would expect it to be in the Dynamo-produced bytecode.
@youkaichao mentioned to me that vLLM does use the Dynamo-produced bytecode directly, so... this needs more investigation.
Let's track it in a follow-up PR? (e.g. by creating an issue?)
vllm/config.py
Outdated
This is a user-facing config, and I don't think we should add it here. If you want to switch the behavior, you can use an env var like VLLM_TEST_XXX; that's less user-facing and shows it is not intended for external use cases.
Thanks for pointing that out, I'll update this.
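A minimal sketch of the env-var pattern being suggested. The flag name matches the one this PR ends up using; the parsing style is an assumption, not vLLM's exact envs.py code:

```python
import os

# Hedged sketch: vLLM centralizes flags like this in vllm/envs.py; the
# exact parsing below is illustrative, not copied from vLLM.
VLLM_TEST_STANDALONE_COMPILE: bool = (
    os.environ.get("VLLM_TEST_STANDALONE_COMPILE", "0") == "1"
)

if VLLM_TEST_STANDALONE_COMPILE:
    print("using torch._inductor.standalone_compile")
```

Because the flag never appears in the user-facing config class, it stays discoverable only to developers who go looking for it.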
vllm/compilation/backends.py
Outdated
I think it should be the compiler's responsibility to summarize the graph and generate the key. The compiler manager just calls the compiler. It should not provide the graph index.
@youkaichao The "key" is the file that the compiled artifact gets saved to. I don't think it's the compiler's responsibility to specify the key. Consider gcc: by default it writes to a.out; otherwise it's the user's responsibility to pick a name for their binary.
From the Inductor side, the Inductor hash key computation is not public API and shouldn't be.
Btw, do we need to use the key for hash computation to validate the cache?
> Btw, do we need to use the key for hash computation to validate the cache?

We do not use this key to validate the vLLM cache. The "key" here is just a file path to the compiled artifact.
The "key" returned by InductorAdaptor has semantic meaning to torch.compile. But vLLM manages its own cache and compilation, so this key has no semantic meaning to vLLM.
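The division of responsibility being discussed can be sketched as follows. The class and method names are hypothetical, not vLLM's real code; the point is that the caller picks the artifact path, like passing `-o` to gcc, and the compiler merely writes to whatever path it is given:

```python
import os

# Hedged sketch (illustrative names): the compiler manager owns the cache
# layout, so it computes where each per-graph artifact lives; the compiler
# backend never needs to invent a key of its own.
class CompilerManager:
    def __init__(self, cache_dir: str):
        self.cache_dir = cache_dir

    def artifact_path(self, graph_index: int) -> str:
        # The "key" is just a file location; it carries no semantic
        # meaning for cache validation.
        return os.path.join(self.cache_dir, f"graph_{graph_index}.bin")

mgr = CompilerManager("torch_compile_cache")
assert mgr.artifact_path(3).endswith("graph_3.bin")
```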
We should add a unit test for this adaptor. Feel free to do it in a follow-up PR.
In the follow-up PR I will turn on InductorStandaloneAdaptor for PyTorch >= 2.8, so it will get tested in the "torch nightly" vLLM CI.
Do we need to include the non-PyTorch source code in the hash?
What do you mean? The compute_hash function here is the same as the compute_hash function in InductorAdaptor. I can factor it into a helper function for better code reuse.
Synced offline: this will be called by vLLM to compute the overall cache key.
This PR adds the option to use torch._inductor.standalone_compile to perform compilation instead of compile_fx. The goal of standalone_compile is to remove the hacks around vLLM's usage of compile_fx; we want to migrate to using it in PyTorch 2.8.

standalone_compile replaces how vLLM interacts with the torch.compile caches. Instead of vLLM trying to redirect them into its torch_compile_cache folder, vLLM can pass standalone_compile a filepath inside the torch_compile_cache folder, and standalone_compile will write the full precompiled artifact to it.

Right now this option is hidden behind an env var (VLLM_TEST_STANDALONE_COMPILE). It is also not tested in vLLM CI (vLLM CI only tests against PyTorch 2.6). This option also needs more testing before we turn it on by default for PyTorch 2.8+. I am putting this PR out so that we can merge something that we can keep developing on top of.

Test Plan:
- Run https://gist.github.com/zou3519/aebb622714e80f4cd4c369472f2372cd with or without VLLM_TEST_STANDALONE_COMPILE

Signed-off-by: rzou <zou3519@gmail.com>
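The caching change described above can be sketched conceptually. The helper names and the artifact object below are illustrative stubs, not the real torch or vLLM API:

```python
import os

# Hedged sketch of the two strategies (stub names, not real APIs):

def old_style(cache_dir, compile_fx, graph):
    # compile_fx path: vLLM redirects torch.compile's internal caches into
    # its torch_compile_cache folder indirectly, e.g. via an env var such
    # as TORCHINDUCTOR_CACHE_DIR, and hopes the artifacts land there.
    os.environ["TORCHINDUCTOR_CACHE_DIR"] = cache_dir
    return compile_fx(graph)

def new_style(cache_dir, standalone_compile, graph, index):
    # standalone_compile path: vLLM hands over an explicit filepath inside
    # torch_compile_cache, and the full precompiled artifact is written
    # directly to that path.
    path = os.path.join(cache_dir, f"artifact_{index}.bin")
    artifact = standalone_compile(graph)
    artifact.save(path)
    return path
```

The second style makes vLLM the sole owner of the cache layout: there is one file per compiled graph, at a path vLLM chose, rather than an opaque cache directory whose structure belongs to Inductor.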
This includes the current PyTorch nightlies. It also renames the VLLM_TEST_STANDALONE_COMPILE env var to VLLM_USE_STANDALONE_COMPILE to make it clearer.

Test Plan:
- In vllm-project#17057, I verified that running https://gist.github.com/zou3519/aebb622714e80f4cd4c369472f2372cd with or without VLLM_TEST_STANDALONE_COMPILE resulted in Inductor producing the exact same output code (via tlparse). I did this for the cold-start case and the warm-start case.
- There are vllm x torch nightly tests in CI that I will trigger on this PR.

Signed-off-by: rzou <zou3519@gmail.com>