[core][torch.compile] not compile for profiling #7796
Conversation
My measurement on Gemma-2B shows this can reduce the Dynamo overhead by about 0.1~0.2 ms.
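For context, one minimal way to take this kind of measurement (illustrative only; `run_step` is a hypothetical stand-in for a single decode iteration, not a vLLM API):

```python
import time

import torch


def average_step_time(run_step, warmup: int = 10, iters: int = 100) -> float:
    """Average wall-clock time of one decode step, in seconds.

    `run_step` is a hypothetical zero-argument callable that executes a
    single decode iteration; the real vLLM entry point differs.
    """
    for _ in range(warmup):
        run_step()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        run_step()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters
```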
Thanks for the PR! Left minor comments.
@youkaichao On Gemma-2B, this PR reduced the KV cache space from 34552 to 25752, a 25% drop.
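The drop follows from how the cache is sized: vLLM runs one profiling forward pass and allocates the GPU memory left over to the KV cache, so a profiling run whose memory footprint differs from the real execution path shifts the result. A simplified sketch of that sizing logic (names and the 0.9 utilization default are illustrative, not vLLM's actual code):

```python
import torch


def estimate_kv_cache_blocks(profile_run, block_bytes: int,
                             gpu_memory_utilization: float = 0.9) -> int:
    """Sketch: size the KV cache from memory left after a profiling pass.

    `profile_run` is a hypothetical callable that executes the one-off
    profiling forward pass with a maximal dummy batch.
    """
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    profile_run()
    torch.cuda.synchronize()
    peak_bytes = torch.cuda.max_memory_allocated()
    total_bytes = torch.cuda.get_device_properties(0).total_memory
    free_for_cache = total_bytes * gpu_memory_utilization - peak_bytes
    # If the profiling pass is not compiled, peak_bytes can differ from
    # the compiled execution path, changing the block count computed here.
    return int(free_for_cache // block_bytes)
```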
@youkaichao Can we do this instead?

```python
# Compiled model only used for initial memory profiling.
tmp_for_profile = torch.compile(model, ...)
# Compiled model used for actual execution.
self.model = torch.compile(model, ...)
```
You mean you also want the compilation for the profiling stage?
Yeah, while I'm not sure, there's a chance that not using torch.compile for the profiling run makes the profiled memory usage differ from the compiled execution path.
@WoosukKwon how do you like the current implementation? Compilation and optimization still happen for the profiling run, but the result is discarded and does not affect later runs.
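A self-contained sketch of that pattern on a toy module (illustrative only; the actual PR wires this into the vLLM worker, and `torch._dynamo.reset()` here stands in for however the guards are actually dropped):

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)
x = torch.randn(8, 16)

# Profiling run: compile, so memory use matches the optimized path.
compiled_for_profile = torch.compile(model)
compiled_for_profile(x)

# Discard that compilation so its guards and code cache do not affect
# later runs.
del compiled_for_profile
torch._dynamo.reset()

# Fresh compilation for actual execution.
compiled_model = torch.compile(model)
compiled_model(x)
```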
@youkaichao Looks very good to me! Thanks for the quick fix! |
For Gemma-2B, I didn't see much performance difference: the latency decreased from 1.59 s to 1.58 s (batch size 8, input len 1024, output len 128).
I'm measuring the time spent in the decode run, and I see a clear overhead reduction. Previously, it took 8 ms for every step; now it takes only 7.5~7.6 ms.
@youkaichao I see. How does it work without recompilation?
Yes, you already marked the shapes as symbolic, and all shapes match the symbolic shapes, so no recompilation is triggered.
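For illustration, this is the kind of behavior being described, using `torch._dynamo.mark_dynamic` on a toy module (not the vLLM code itself):

```python
import torch
import torch.nn as nn

model = torch.compile(nn.Linear(16, 16))

x = torch.randn(8, 16)
torch._dynamo.mark_dynamic(x, 0)  # mark the batch dimension as symbolic
model(x)  # compiled once, with a symbolic batch size

# A different batch size still matches the symbolic shape, so this call
# reuses the existing compiled code instead of recompiling.
model(torch.randn(32, 16))
```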
We only profile once, to determine the space for the KV cache, and never run the profiling pass again. Compiling this run only adds guards and code-cache entries for code that is never reused. Removing this compilation reduces the overhead of Dynamo.
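In toy form, the idea the description proposes (before the review discussion above led to compiling a temporary copy instead) could look like this; the module is illustrative, not vLLM code:

```python
import torch
import torch.nn as nn

model = nn.Linear(16, 16)

# One-off profiling pass: plain eager execution, so Dynamo records no
# guards or code-cache entries for a code path that never runs again.
model(torch.randn(8, 16))

# Only the model used for the actual decode loop is compiled.
compiled_model = torch.compile(model)
compiled_model(torch.randn(8, 16))
```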