[Core][CUDA Graph] add output buffer for cudagraph to reduce memory footprint#5074
Conversation
Performance-wise, this adds one memory copy operation into the cudagraph. This should be negligible: the copy is also recorded in the cudagraph, and the copy size is quite small. I ran the benchmark and found the throughput difference is within run-to-run variation. Before this PR: After this PR:
I observe that when cuda_graph is enabled, the pp branch uses significantly more memory. Might this issue be related?
Yes, it is possible. To use cudagraph with multiple sizes efficiently, we need a technique like buffer sharing, just as we do for inputs and as this PR does for outputs.
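The buffer-sharing idea mentioned above can be sketched in plain Python (a toy stand-in for GPU tensors; `MAX_BATCH`, `HIDDEN`, and the function name are made up for illustration, not vLLM internals): allocate one output buffer sized for the largest batch, and have every captured graph copy its result into a prefix of that buffer.

```python
# Toy sketch of output-buffer sharing across capture sizes.
# All names and sizes are illustrative, not vLLM internals.
MAX_BATCH = 8
HIDDEN = 4

# One flat buffer sized for the largest batch (stand-in for a GPU tensor).
output_buffer = [0.0] * (MAX_BATCH * HIDDEN)

def replay_graph(batch_size, model_output):
    """Simulate replaying a captured graph whose final op copies its
    output into the shared buffer, so no per-graph output tensor is kept."""
    n = batch_size * HIDDEN
    assert len(model_output) == n
    output_buffer[:n] = model_output   # the extra copy this PR records in the graph
    return output_buffer[:n]           # callers read from the shared buffer
```

Every graph, regardless of its batch size, writes into the same buffer, so the output memory cost is bounded by the largest capture size.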
If this PR is merged, more cuda graphs can be captured at no additional memory cost, which leads to further questions:
vllm/benchmarks/kernels/benchmark_mixtral_moe.py, lines 19 to 22 in fbdb7b3
cc @WoosukKwon for cudagraph
Update: I realize that cudagraph will cost some memory anyway. Therefore we don't need to actively capture many cudagraphs.
For llama2-7b:
before this PR: the first graph takes 11 MB of memory, the second takes 2 MB, and in total the cudagraphs take 45 MB.
after this PR: the first graph takes 8 MB of memory, the second takes 0 MB, and in total the cudagraphs take 8 MB.
We can see the memory used after capturing all graphs (19464.05 MB) is almost the same as after capturing just one graph previously (19464.02 MB).
@rkooo567 I can take this PR if you are ok with that. Just please let me know!
oh yeah please go ahead! |
WoosukKwon
left a comment
@youkaichao Thanks for the PR! This is a good finding. Please check my comments.
@youkaichao Is the CI failure related to this PR?
investigating.
[Core][CUDA Graph] add output buffer for cudagraph to reduce memory footprint (vllm-project#5074)
Currently, each cudagraph for one batch size incurs an additional memory cost of
TP * BATCH_SIZE * HIDDEN_SIZE, so the total memory footprint for all cudagraphs is TP * SUM(BATCH_SIZES) * HIDDEN_SIZE. If we create an output buffer too, the total memory footprint for all cudagraphs becomes
TP * MAX(BATCH_SIZES) * HIDDEN_SIZE, which is constant regardless of how many graphs we capture.
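As a sanity check of the SUM-vs-MAX claim, here is a small Python sketch comparing the two footprints. The `hidden_size`, `batch_sizes`, and fp16 element size are hypothetical values chosen for illustration, not numbers taken from this PR.

```python
# Hypothetical model and capture sizes -- illustrative only, not from the PR.
tp = 1                                   # tensor-parallel output factor
hidden_size = 4096                       # e.g. a llama2-7b-sized hidden dim
batch_sizes = [1, 2, 4, 8, 16, 24, 32]   # cudagraph capture sizes
bytes_per_elem = 2                       # fp16

def graph_output_bytes(bs):
    return tp * bs * hidden_size * bytes_per_elem

# Without a shared output buffer: every graph owns its own output tensor,
# so the cost grows with SUM(BATCH_SIZES).
before = sum(graph_output_bytes(bs) for bs in batch_sizes)

# With a shared output buffer: one buffer sized for the largest batch,
# so the cost is fixed at MAX(BATCH_SIZES).
after = graph_output_bytes(max(batch_sizes))

print(before, after)
```

With these sample sizes, the shared-buffer footprint stays constant no matter how many capture sizes are added to `batch_sizes`, while the per-graph footprint keeps growing.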