[Core][CUDA Graph] add output buffer for cudagraph to reduce memory footprint#5074
Conversation
Performance-wise, this adds one memory copy operation into the cudagraph. This should be negligible: the copy is also recorded in the cudagraph, and the copy size is quite small. I ran the benchmark and found the throughput difference is within run-to-run variation. Before this PR: After this PR:
I observe that when cuda_graph is enabled, the pp branch uses significantly more memory. Might this issue be related?
Yes, it is possible. To use cudagraph with multiple sizes efficiently, we need a technique like buffer sharing, just as we do for inputs and as this PR does for outputs.
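The buffer-sharing idea mentioned above can be sketched in plain Python (a toy stand-in for GPU tensors; `MAX_BATCH`, `HIDDEN`, and the function name are made up for illustration, not vLLM internals): allocate one output buffer sized for the largest batch, and have every captured graph copy its result into a prefix of that buffer.

```python
# Toy sketch of output-buffer sharing across capture sizes.
# All names and sizes are illustrative, not vLLM internals.
MAX_BATCH = 8
HIDDEN = 4

# One flat buffer sized for the largest batch (stand-in for a GPU tensor).
output_buffer = [0.0] * (MAX_BATCH * HIDDEN)

def replay_graph(batch_size, model_output):
    """Simulate replaying a captured graph whose final op copies its
    output into the shared buffer, so no per-graph output tensor is kept."""
    n = batch_size * HIDDEN
    assert len(model_output) == n
    output_buffer[:n] = model_output   # the extra copy this PR records in the graph
    return output_buffer[:n]           # callers read from the shared buffer
```

Every graph, regardless of its batch size, writes into the same buffer, so the output memory cost is bounded by the largest capture size.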
If this PR is merged, more cuda graphs can be captured at no additional memory cost, which leads to further questions:
vllm/benchmarks/kernels/benchmark_mixtral_moe.py, lines 19 to 22 in fbdb7b3
cc @WoosukKwon for cudagraph
Update: I realize that cudagraph will cost some memory anyway. Therefore we don't need to actively capture many cudagraphs.
For llama2-7b:
before this PR: the first graph takes 11 MB of memory, the second takes 2 MB, and in total the cudagraphs take 45 MB.
after this PR: the first graph takes 8 MB of memory, the second takes 0 MB, and in total the cudagraphs take 8 MB.
We can see the memory used after capturing all graphs (19464.05 MB) is almost the same as after capturing just one graph previously (19464.02 MB).
@rkooo567 I can take this PR if you are ok with that. Just please let me know!
oh yeah please go ahead! |
WoosukKwon
left a comment
@youkaichao Thanks for the PR! This is a good finding. Please check my comments.
@youkaichao Is the CI failure related to this PR?
investigating.
[Core][CUDA Graph] add output buffer for cudagraph to reduce memory footprint (vllm-project#5074)
Currently, each cudagraph for one batch size incurs an additional memory cost of
TP * BATCH_SIZE * HIDDEN_SIZE, so the total memory footprint for all cudagraphs is TP * SUM(BATCH_SIZES) * HIDDEN_SIZE. If we create an output buffer too, the total memory footprint for all cudagraphs becomes
TP * MAX(BATCH_SIZES) * HIDDEN_SIZE, which is constant regardless of how many graphs we capture.
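As a sanity check of the SUM-vs-MAX claim, here is a small Python sketch comparing the two footprints. The `hidden_size`, `batch_sizes`, and fp16 element size are hypothetical values chosen for illustration, not numbers taken from this PR.

```python
# Hypothetical model and capture sizes -- illustrative only, not from the PR.
tp = 1                                   # tensor-parallel output factor
hidden_size = 4096                       # e.g. a llama2-7b-sized hidden dim
batch_sizes = [1, 2, 4, 8, 16, 24, 32]   # cudagraph capture sizes
bytes_per_elem = 2                       # fp16

def graph_output_bytes(bs):
    return tp * bs * hidden_size * bytes_per_elem

# Without a shared output buffer: every graph owns its own output tensor,
# so the cost grows with SUM(BATCH_SIZES).
before = sum(graph_output_bytes(bs) for bs in batch_sizes)

# With a shared output buffer: one buffer sized for the largest batch,
# so the cost is fixed at MAX(BATCH_SIZES).
after = graph_output_bytes(max(batch_sizes))

print(before, after)
```

With these sample sizes, the shared-buffer footprint stays constant no matter how many capture sizes are added to `batch_sizes`, while the per-graph footprint keeps growing.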