[core] cudagraph output with tensor weak reference #9724
youkaichao merged 4 commits into vllm-project:main
Conversation
Signed-off-by: youkaichao <youkaichao@gmail.com>
WoosukKwon
left a comment
LGTM! Very clever implementation!
@youkaichao why only 1,2,3 change to 6 and not 0,1,2,3?
@alexm-neuralmagic because:
replaying the 3 GiB graph will modify the buffers of the 2 GiB graph and the 1 GiB graph, but not the 4 GiB buffer.
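This aliasing can be illustrated with overlapping views of a single buffer. Note this is only a CPU stand-in: the real layout is decided by the CUDA caching allocator, and we simply assume here that the 1, 2, and 3 GiB outputs alias one region while the 4 GiB output lives in a separate block:

```python
pool = bytearray(8)            # stand-in for the shared memory pool (all zeros)
out1 = memoryview(pool)[0:1]   # "1 GiB" graph output
out2 = memoryview(pool)[0:2]   # "2 GiB" graph output (overlaps out1)
out3 = memoryview(pool)[0:3]   # "3 GiB" graph output (overlaps out1 and out2)
out4 = memoryview(pool)[4:8]   # "4 GiB" graph output, assumed to be a separate block

for i in range(len(out3)):     # replaying the 3 GiB graph overwrites its output
    out3[i] = 6

assert bytes(out1) == b"\x06"        # the 1 GiB output changed too
assert bytes(out2) == b"\x06\x06"    # the 2 GiB output changed too
assert bytes(out4) == b"\x00" * 4    # the 4 GiB output is untouched
```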
To correctly use cudagraph, we have to keep the input and output buffers alive for as long as the graph may be replayed.
Say we have a simple graph that performs a series of operations, and we want to use cudagraph to record those operations at several different sizes:
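A minimal sketch of this capture pattern (a hypothetical helper using the public `torch.cuda.graph` API, not the PR's exact test code) might look like:

```python
import torch

def capture_graphs(sizes):
    """Capture one CUDA graph per size, all sharing one memory pool.

    Hypothetical sketch of the pattern described above.
    """
    pool = torch.cuda.graph_pool_handle()   # shared memory pool for all captures
    graphs = []
    for n in sizes:
        x = torch.zeros(n, device="cuda")

        # Warm up once on a side stream so lazy CUDA initialization
        # happens outside of graph capture.
        s = torch.cuda.Stream()
        s.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(s):
            x + 1
        torch.cuda.current_stream().wait_stream(s)

        g = torch.cuda.CUDAGraph()
        with torch.cuda.graph(g, pool=pool):
            y = x + 1                       # the recorded operation
        graphs.append((g, y))               # holding y pins its output buffer
    return graphs

# Only runs on a machine with a GPU.
if torch.cuda.is_available():
    graphs = capture_graphs([1, 2, 3, 4])
    graphs[0][0].replay()
```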
The output is:
To understand the output:
Since we know we will only execute one graph at a time, can we somehow make the outputs of the 4 cudagraphs share the same buffer as well?
This is not possible in normal PyTorch, because we have to hold the output buffer for every output: as long as we hold a reference to a tensor produced by a cudagraph, that part of memory is reserved and will not be recycled.
With this PR, we introduce a tensor weak reference, so that we can take a weak reference to the output tensor. This way, cudagraph can also recycle and reuse the output buffer.
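Python's standard `weakref` module illustrates the semantics. This is only the concept: the PR adds an analogous weak reference for tensors, so that holding it does not pin the underlying cudagraph memory.

```python
import gc
import weakref

class OutputBuffer:
    """Stand-in for a cudagraph output tensor (illustration only)."""

buf = OutputBuffer()
weak = weakref.ref(buf)     # a weak reference does not keep the buffer alive
assert weak() is buf        # still reachable while a strong reference exists

del buf                     # drop the last strong reference
gc.collect()                # (CPython's refcounting already frees it here)
assert weak() is None       # the memory can now be recycled and reused
```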
When we change use_weak_ref = True in the code, we will get:
Note that the 4 graphs use 4 GiB of memory in total.
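The arithmetic behind that note, assuming output sizes of 1, 2, 3, and 4 GiB and that without weak references each graph must keep its own output buffer alive:

```python
sizes_gib = [1, 2, 3, 4]            # assumed output sizes of the four graphs

# Without weak references, every output buffer is held alive separately.
separate_total = sum(sizes_gib)     # 10 GiB

# With weak references, the pool can reuse memory, so the total is
# bounded by the largest single output.
shared_total = max(sizes_gib)       # 4 GiB, matching the note above

assert separate_total == 10
assert shared_total == 4
```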
Although we only replay graph 1, we can see that the outputs of graphs 2 and 3 also change, because they share the same buffer. This does not matter here, because we never execute two cudagraphs concurrently.
To understand the output: