
[Model Runner V2] Add full cuda graph support for eagle prefill #37588

Open — TheEpicDolphin wants to merge 1 commit into vllm-project:main from TheEpicDolphin:gdelfin/mrv2-eagle-full-pw-cudagraph-support

Conversation

TheEpicDolphin (Collaborator) commented Mar 19, 2026

Purpose

FULL cudagraphs are currently only used for the position 1+ drafting phase. In this PR, I apply FULL cudagraphs to the Eagle prefill path as well to reduce the CPU dispatch overhead in EagleSpeculator.propose.

Benchmarks

H200

I ran an exhaustive set of accuracy and performance benchmarks across several models (Llama3, Qwen3, Mimo, GLM 4.7 Flash), parallelizations (TP, EP, DP), and spec decode types (Eagle-1, Eagle-3, MTP), and compared main (baseline) with this PR. Here are the full results for both commits: https://docs.google.com/spreadsheets/d/1EY4OO9TrPOg4qQPTr6lKmeMpQL1EqCpi633hqU6SboA/edit?usp=sharing.

That spreadsheet is difficult to read due to the size, so I vibe-coded an HTML visualization here: https://gistpreview.github.io/?4a6fc01a426c25560fbbb03a389906ec

NOTE: In the HTML visualization, "ol" means output length and "c" means concurrency.

In summary, the results show significantly more improvements than regressions, particularly in TPOT.

GB300

Using vigil, I benchmarked with the following MiniMax M2.5 config:

```yaml
model: lukealonso/MiniMax-M2.5-NVFP4
mode: local
precheck: true
collect_env: true

pre_serve:
  - cmd: nvidia-smi

serving:
  roles:
    - role: worker
      vllm_engine:
        repo_path: /home/gdelfin/vllm
        env:
          HF_HOME: /home/hf-models/
          VLLM_FLASHINFER_MOE_BACKEND: latency
          VLLM_SERVER_DEV_MODE: "1"
          VLLM_USE_V2_MODEL_RUNNER: "1"
        cmd: >-
          vllm serve {model}
          -tp 4
          --performance-mode interactivity
          --trust-remote-code
          --max-num-seq 64
          --kv-cache-dtype fp8
          --compilation-config '{"mode":3,"pass_config":{"fuse_norm_quant":true,"fuse_act_quant":true,"fuse_gemm_comms":true}}'
          --speculative-config '{"method": "eagle3", "model": "novita/Eagle3-Spec-Minimax-M2.5-Exp15", "num_speculative_tokens": 3, "rejection_sample_method": "synthetic", "synthetic_acceptance_rate": 0.5}'
        health_check:
          url: http://localhost:8000/health
          timeout_s: 1200
          poll_interval_s: 5

post_serve:
  - cmd: >-
      vllm-bench
      --backend openai-chat
      --base-url http://127.0.0.1:8000
      --model {model}
      --dataset-name speed-bench
      --speed-bench-config throughput_16k
      --speed-bench-max-input-len 10240
      --speed-bench-category low_entropy
      --num-prompts 50
      --output-len 256
  - cmd: >-
      vllm-bench
      --backend openai-chat
      --base-url http://127.0.0.1:8000
      --model {model}
      --dataset-name speed-bench
      --speed-bench-config throughput_16k
      --speed-bench-max-input-len 10240
      --speed-bench-category low_entropy
      --num-warmups 50
      --num-prompts 1000
      --output-len 1536
      --sweep-max-concurrency 1,2,4,8,16,32,64
      --sweep-num-prompts-factor 10
      --reset-prefix-cache
      --save-result
```
And here is the comparison of results for eager vs cudagraph draft prefill:

| Concurrency | Req/s (Eager) | Req/s (CUDA) | Output tok/s (Eager) | Output tok/s (CUDA) | Total tok/s (Eager) | Total tok/s (CUDA) | TTFT ms (Eager) | TTFT ms (CUDA) | TPOT ms (Eager) | TPOT ms (CUDA) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.13 | 0.17 | 205.95 | 254.79 | 1,578.96 | 1,953.37 | 303.82 | 341.87 | 4.66 | 3.70 |
| 2 | 0.26 | 0.31 | 401.75 | 483.79 | 3,080.10 | 3,709.09 | 377.62 | 358.72 | 4.71 | 3.89 |
| 4 | 0.48 | 0.56 | 732.15 | 859.16 | 5,613.12 | 6,586.87 | 384.88 | 403.47 | 5.19 | 4.37 |
| 8 | 0.78 | 0.87 | 1,196.03 | 1,338.52 | 9,169.57 | 10,262.00 | 417.58 | 407.14 | 6.40 | 5.70 |
| 16 | 1.26 | 1.33 | 1,941.84 | 2,041.97 | 14,887.43 | 15,655.13 | 455.93 | 474.35 | 7.89 | 7.48 |
| 32 | 1.95 | 2.00 | 2,992.65 | 3,074.22 | 22,943.65 | 23,569.02 | 593.04 | 616.02 | 10.23 | 9.94 |
| 64 | 2.79 | 2.79 | 4,280.37 | 4,281.85 | 32,816.14 | 32,827.58 | 762.89 | 755.76 | 14.36 | 14.37 |

For smaller concurrencies, this PR yields better TPOT at the cost of slightly higher TTFT, but the tradeoff seems worth it given the improvement in output tok/s.
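Concretely, the concurrency-1 row of the table works out to roughly a 20% per-token latency reduction:

```python
# Concurrency-1 TPOT values (ms) from the GB300 table above.
tpot_eager, tpot_cudagraph = 4.66, 3.70

# Relative per-token latency reduction from cudagraph draft prefill.
improvement_pct = 100 * (tpot_eager - tpot_cudagraph) / tpot_eager
print(f"{improvement_pct:.1f}%")  # -> 20.6%
```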

NOTE: I used synthetic_acceptance_rate = 0.5 to isolate the performance improvement of Eagle prefill cudagraph.
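With a fixed per-position acceptance probability p and k speculative tokens, the expected number of tokens emitted per verification step is 1 + p + p² + … + pᵏ: one guaranteed target token, plus draft token i surviving only if all i positions up to it are accepted. This assumes per-position independence, which is what a synthetic acceptance rate approximates. A quick sketch:

```python
def expected_acceptance_length(p: float, k: int) -> float:
    """Expected tokens per verification step under a fixed per-position
    acceptance probability p with k speculative tokens: one guaranteed
    target token plus the geometric sum of surviving draft tokens."""
    return 1.0 + sum(p ** i for i in range(1, k + 1))

# With synthetic_acceptance_rate=0.5 and num_speculative_tokens=3:
# 1 + 0.5 + 0.25 + 0.125 = 1.875 tokens per step on average.
```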

DP + EP Edge Case

I also verified that there are no regressions for DP + EP by running the repro from #35294 and confirming that no hangs occur:
Server

```bash
VLLM_USE_V2_MODEL_RUNNER=1 vllm serve zai-org/GLM-4.7-Flash --trust-remote-code --no-enable-prefix-caching --tensor-parallel-size 1 --data-parallel-size 2 --enable-expert-parallel --seed 42 --gpu-memory-utilization 0.8 --speculative-config '{"method": "mtp", "num_speculative_tokens": 2, "model": "zai-org/GLM-4.7-Flash"}'
```

Client

```bash
vllm bench serve --model zai-org/GLM-4.7-Flash --host 0.0.0.0 --dataset-name hf --dataset-path philschmid/mt-bench --ignore-eos --request-rate inf --max-concurrency 16 --temperature 0
```

Results

| Metric | main | #37588 |
| --- | --- | --- |
| Successful requests | 1,000 | 1,000 |
| Failed requests | 0 | 0 |
| Max request concurrency | 16 | 16 |
| Benchmark duration (s) | 117.81 | 116.21 |
| Total input tokens | 75,028 | 75,028 |
| Total generated tokens | 256,000 | 256,000 |
| Request throughput (req/s) | 8.49 | 8.61 |
| Output token throughput (tok/s) | 2,172.98 | 2,202.99 |
| Peak output token throughput (tok/s) | 1,117.00 | 1,136.00 |
| Peak concurrent requests | 32.00 | 32.00 |
| Total token throughput (tok/s) | 2,809.83 | 2,848.64 |
| **Time to First Token** | | |
| Mean TTFT (ms) | 50.20 | 50.86 |
| Median TTFT (ms) | 46.23 | 46.32 |
| P99 TTFT (ms) | 256.56 | 293.25 |
| **Time per Output Token (excl. 1st)** | | |
| Mean TPOT (ms) | 7.15 | 7.05 |
| Median TPOT (ms) | 7.19 | 7.08 |
| P99 TPOT (ms) | 8.05 | 7.89 |
| **Inter-token Latency** | | |
| Mean ITL (ms) | 14.83 | 14.68 |
| Median ITL (ms) | 14.55 | 14.34 |
| P99 ITL (ms) | 20.20 | 20.57 |
| **Speculative Decoding** | | |
| Acceptance rate (%) | 53.91 | 54.36 |
| Acceptance length | 2.08 | 2.09 |
| Drafts | 122,972 | 122,454 |
| Draft tokens | 245,944 | 244,908 |
| Accepted tokens | 132,593 | 133,129 |
| Position 0 acceptance (%) | 86.79 | 86.70 |
| Position 1 acceptance (%) | 21.03 | 22.01 |
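The spec-decode counters above are internally consistent: the acceptance rate is accepted tokens over draft tokens, and the acceptance length is one target token plus accepted drafts per draft step. Checking the main-branch column:

```python
# main-branch counters from the results table above
drafts, draft_tokens, accepted = 122_972, 245_944, 132_593

acceptance_rate = 100 * accepted / draft_tokens  # percent of draft tokens accepted
acceptance_length = 1 + accepted / drafts        # tokens emitted per draft step
print(round(acceptance_rate, 2), round(acceptance_length, 2))  # -> 53.91 2.08
```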

Profiling

Server

```bash
VLLM_USE_V2_MODEL_RUNNER=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct --no-enable-prefix-caching --tensor-parallel-size 1 --data-parallel-size 1 --speculative-config '{"method": "eagle", "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 3}' --profiler-config '{"profiler": "torch", "torch_profiler_dir": "~/traces/"}'
```

Client

```bash
vllm bench serve --model meta-llama/Meta-Llama-3-8B-Instruct --tokenizer meta-llama/Meta-Llama-3-8B-Instruct --dataset-name random --random-input-len 16 --random-output-len 1024 --num-prompts 8 --max-concurrency 8 --request-rate inf --ignore-eos --temperature 0
```

Profiling revealed a 70% decrease in the propose CPU dispatch overhead:
Before: (profiler trace screenshot omitted)
After: (profiler trace screenshot omitted)

Testing

Manually verified that the outputs for the following prompts remained unchanged, using meta-llama/Meta-Llama-3-8B-Instruct with Eagle-1:

Before

[0] "Explain the theory of relativity in simple terms."

Response: The theory of relativity! It's a mind-bending concept, but I

Token Logprob Top Alternatives
The -0.1073 The: -0.107, A: -2.482, Albert: -4.357
theory -0.0027 theory: -0.003, Theory: -6.003, famous: -9.128
of 0.0000 of: 0.000
rel -0.0000 rel: -0.000, special: -13.375, relative: -14.625
ativity -0.0000 ativity: -0.000
! -0.2347 !: -0.235, ,: -1.985, is: -2.860
It -0.4301 It: -0.430, One: -1.555, Albert: -2.930
's -0.0043 's: -0.004, can: -6.129, may: -6.629
a -0.0226 a: -0.023, actually: -4.023, one: -6.148
mind -0.4806 mind: -0.481, complex: -1.981, big: -1.981
-b -0.0010 -b: -0.001, -st: -7.501, -bl: -8.501
ending -0.2530 ending: -0.253, low: -1.503, ender: -7.003
concept -0.0258 concept: -0.026, idea: -3.776, topic: -6.151
, -0.5774 ,: -0.577, that: -0.827, developed: -6.702
but -0.0000 but: -0.000
I -0.0801 I: -0.080, don: -2.580, fear: -7.205
[1] "What is the capital of France?"

Response: The capital of France is Paris.

Token Logprob Top Alternatives
The -0.0586 The: -0.059, That: -2.934, Easy: -6.309
capital -0.0000 capital: -0.000, answer: -11.125, capital: -14.125
of -0.0000 of: -0.000, city: -14.750, and: -18.750
France -0.0000 France: -0.000, France: -14.375, Franc: -18.250
is 0.0000 is: 0.000, was: -18.375, adalah: -18.625
Paris -0.0001 Paris: -0.000, Paris: -9.625, PAR: -12.125
. -0.2090 .: -0.209, !: -1.709, (: -5.084
<|eot_id|> -0.0001 <|eot_id|>: -0.000, It: -11.125, Paris: -11.750
[2] "Write a haiku about coding."

Response: Here is a haiku about coding:

Lines of code unfold
Logic flows like

Token Logprob Top Alternatives
Here -0.1743 Here: -0.174, Lines: -2.674, Code: -3.174
is -0.0381 is: -0.038, 's: -3.288, 's: -12.163
a -0.0000 a: -0.000
ha -0.0001 ha: -0.000, short: -9.750
iku 0.0000 iku: 0.000
about -0.0000 about: -0.000
coding -0.0000 coding: -0.000
:\n\n -0.0000 :\n\n: -0.000
Lines -0.4447 Lines: -0.445, Code: -1.320, Lines: -3.695
of -0.0013 of: -0.001, dance: -8.001, and: -8.251
code -0.0006 code: -0.001, logic: -8.501, ones: -9.251
unfold -0.5843 unfold: -0.584, flow: -1.334, dance: -2.459
\n -0.0000 \n: -0.000
Logic -1.4517 Logic: -1.452, Bug: -2.077, Mean: -2.202
flows -1.2300 flows: -1.230, 's: -1.355, and: -1.480
like -0.4286 like: -0.429, ,: -1.179, from: -3.429
[3] "List three benefits of regular exercise."

Response: Here are three benefits of regular exercise:

  1. Improves Physical Health:
Token Logprob Top Alternatives
Here -0.0001 Here: -0.000, Regular: -9.750, A: -10.500
are -0.0000 are: -0.000
three -0.0000 three: -0.000
benefits -0.0000 benefits: -0.000
of 0.0000 of: 0.000
regular -0.0000 regular: -0.000
exercise -0.0000 exercise: -0.000, Exercise: -14.375, exercises: -14.750
:\n\n -0.0000 :\n\n: -0.000
1 -0.0000 1: -0.000
. 0.0000 .: 0.000
** -0.0000 **: -0.000
Impro -0.4097 Impro: -0.410, Improved: -1.410, Weight: -2.410
ves -0.0005 ves: -0.000, vements: -8.750, ving: -9.125
Physical -0.4622 Physical: -0.462, Cardio: -1.337, Mental: -2.837
Health -0.0001 Health: -0.000
**: -0.0000 **:: -0.000
[4] "How does a refrigerator keep food cold?"

Response: A refrigerator keeps food cold by using a combination of several technologies and principles to remove

Token Logprob Top Alternatives
A -0.2289 A: -0.229, Re: -1.604, The: -5.854
refrigerator -0.0010 refrigerator: -0.001
keeps -0.0742 keeps: -0.074, is: -3.449, ,: -3.699
food -0.0000 food: -0.000
cold -0.0000 cold: -0.000
by -0.4406 by: -0.441, through: -1.066, using: -4.441
using -0.0041 using: -0.004, utilizing: -5.879
a -0.0047 a: -0.005, refriger: -5.755
combination -0.0713 combination: -0.071, refriger: -3.321, process: -3.821
of 0.0000 of: 0.000
several -1.3681 several: -1.368, technologies: -1.368, principles: -2.118
technologies -0.6389 technologies: -0.639, components: -1.389
and -0.5262 and: -0.526, to: -0.901
principles -0.4496 principles: -0.450, mechanisms: -1.575
to -0.2880 to: -0.288, .: -1.413
remove -0.7531 remove: -0.753, transfer: -1.378, maintain: -1.878
[5] "What is the difference between HTTP and HTTPS?"

Response: HTTP (Hypertext Transfer Protocol) and HTTPS (Hypertext Transfer Protocol

Token Logprob Top Alternatives
HTTP -0.0221 HTTP: -0.022, The: -3.897, HTTPS: -6.897
( -0.0013 (: -0.001, and: -6.626
H -0.0142 H: -0.014, Hyper: -4.264
yp -0.0002 yp: -0.000
ertext -0.0000 ertext: -0.000
Transfer -0.0001 Transfer: -0.000, Transport: -10.125
Protocol -0.0000 Protocol: -0.000
) -0.0000 ): -0.000
and -0.0549 and: -0.055, is: -2.930
HTTPS -0.0000 HTTPS: -0.000
( -0.0000 (: -0.000
H -0.0000 H: -0.000, Hyper: -11.000, Secure: -12.000
yp -0.0000 yp: -0.000
ertext -0.0000 ertext: -0.000
Transfer -0.0001 Transfer: -0.000, Transport: -9.125
Protocol -0.0000 Protocol: -0.000
[6] "Suggest a short book to read on a rainy day."

Response: A rainy day is the perfect excuse to curl up with a good book! Here

Token Logprob Top Alternatives
A -0.1214 A: -0.121, What: -2.371, Perfect: -4.746
rainy -0.4776 rainy: -0.478, perfect: -1.103, cozy: -3.603
day -0.0000 day: -0.000
is -0.0017 is: -0.002, !: -7.502
the -0.0008 the: -0.001, a: -7.376
perfect -0.0006 perfect: -0.001, pur: -7.626
excuse -0.0001 excuse: -0.000, opportunity: -10.375
to -0.0000 to: -0.000
curl -0.4008 curl: -0.401, cozy: -1.401, sn: -3.151
up -0.0000 up: -0.000
with -0.0000 with: -0.000
a 0.0000 a: 0.000
good -0.0087 good: -0.009, great: -4.759
book -0.0000 book: -0.000
! -0.1882 !: -0.188, !\n\n: -1.813
Here -0.0143 Here: -0.014, I: -4.389
[7] "2+2=?"

Response: The answer is 4!

Token Logprob Top Alternatives
The -0.3403 The: -0.340, 2: -1.965, 4: -2.090
answer -0.0022 answer: -0.002, correct: -6.127
is -0.2519 is: -0.252, to: -1.502
-0.4650 : -0.465, ...: -1.215, :: -2.715
4 -0.0000 4: -0.000
! -0.0919 !: -0.092, .: -2.467
<|eot_id|> -0.0003 <|eot_id|>: -0.000

After

[0] "Explain the theory of relativity in simple terms."

Response: The theory of relativity! It's a mind-bending concept, but I

Token Logprob Top Alternatives
The -0.1074 The: -0.107, A: -2.482, Albert: -4.357
theory -0.0027 theory: -0.003, Theory: -6.003, famous: -9.128
of 0.0000 of: 0.000
rel -0.0000 rel: -0.000, special: -13.375, relative: -14.625
ativity -0.0000 ativity: -0.000
! -0.2348 !: -0.235, ,: -1.985, is: -2.860
It -0.4299 It: -0.430, One: -1.555, Albert: -2.930
's -0.0040 's: -0.004, can: -6.254, may: -6.629
a -0.0226 a: -0.023, actually: -4.023, one: -6.148
mind -0.4989 mind: -0.499, complex: -1.874, big: -1.999
-b -0.0009 -b: -0.001, -st: -7.501, -bl: -8.626
ending -0.2266 ending: -0.227, low: -1.602, ender: -6.852
concept -0.0258 concept: -0.026, idea: -3.776, topic: -6.151
, -0.5774 ,: -0.577, that: -0.827, developed: -6.702
but -0.0000 but: -0.000
I -0.0800 I: -0.080, don: -2.580, fear: -7.330
[1] "What is the capital of France?"

Response: The capital of France is Paris.

Token Logprob Top Alternatives
The -0.0586 The: -0.059, That: -2.934, Easy: -6.309
capital -0.0000 capital: -0.000, answer: -11.125
of -0.0000 of: -0.000
France -0.0000 France: -0.000
is 0.0000 is: 0.000
Paris -0.0001 Paris: -0.000, Paris: -9.500
. -0.2083 .: -0.208, !: -1.708, (: -5.208
<|eot_id|> -0.0000 <|eot_id|>: -0.000
[2] "Write a haiku about coding."

Response: Here is a haiku about coding:

Lines of code unfold
Logic flows like

Token Logprob Top Alternatives
Here -0.1551 Here: -0.155, Lines: -2.780, Code: -3.280
is -0.0337 is: -0.034, 's: -3.409
a -0.0000 a: -0.000
ha -0.0001 ha: -0.000, short: -9.750
iku 0.0000 iku: 0.000
about -0.0000 about: -0.000
coding -0.0000 coding: -0.000
:\n\n -0.0000 :\n\n: -0.000
Lines -0.4432 Lines: -0.443, Code: -1.318
of -0.0013 of: -0.001, dance: -7.876
code -0.0006 code: -0.001, logic: -8.626
unfold -0.5850 unfold: -0.585, flow: -1.335, dance: -2.460
\n -0.0000 \n: -0.000
Logic -1.3611 Logic: -1.361, Bug: -2.174, Mean: -2.174
flows -1.2459 flows: -1.246, 's: -1.371, and: -1.496
like -0.3868 like: -0.387, ,: -1.262, from: -3.512
[3] "List three benefits of regular exercise."

Response: Here are three benefits of regular exercise:

  1. Improves Physical Health:
Token Logprob Top Alternatives
Here -0.0001 Here: -0.000, Regular: -9.875, A: -10.500
are -0.0000 are: -0.000
three -0.0000 three: -0.000
benefits -0.0000 benefits: -0.000
of 0.0000 of: 0.000
regular -0.0000 regular: -0.000
exercise -0.0000 exercise: -0.000
:\n\n -0.0000 :\n\n: -0.000
1 -0.0000 1: -0.000
. 0.0000 .: 0.000
** -0.0000 **: -0.000
Impro -0.4096 Impro: -0.410, Improved: -1.410, Weight: -2.410
ves -0.0006 ves: -0.001, vements: -8.501
Physical -0.4182 Physical: -0.418, Cardio: -1.418, Mental: -2.918
Health -0.0001 Health: -0.000
**: -0.0000 **:: -0.000
[4] "How does a refrigerator keep food cold?"

Response: A refrigerator keeps food cold by using a combination of several technologies and principles to remove

Token Logprob Top Alternatives
A -0.2293 A: -0.229, Re: -1.604, The: -5.729
refrigerator -0.0010 refrigerator: -0.001
keeps -0.0784 keeps: -0.078, is: -3.328, ,: -3.703
food -0.0000 food: -0.000
cold -0.0000 cold: -0.000
by -0.3993 by: -0.399, through: -1.149, using: -4.399
using -0.0041 using: -0.004, utilizing: -5.879
a -0.0047 a: -0.005, refriger: -5.755
combination -0.0806 combination: -0.081, refriger: -3.206, process: -3.706
of 0.0000 of: 0.000
several -1.3180 several: -1.318, technologies: -1.318, principles: -2.193
technologies -0.6518 technologies: -0.652, components: -1.402
and -0.5261 and: -0.526, to: -0.901
principles -0.4470 principles: -0.447, mechanisms: -1.572
to -0.2880 to: -0.288, .: -1.413
remove -0.7062 remove: -0.706, transfer: -1.456, maintain: -1.831
[5] "What is the difference between HTTP and HTTPS?"

Response: HTTP (Hypertext Transfer Protocol) and HTTPS (Hypertext Transfer Protocol

Token Logprob Top Alternatives
HTTP -0.0196 HTTP: -0.020, The: -4.020, HTTPS: -7.020
( -0.0013 (: -0.001, and: -6.626
H -0.0161 H: -0.016, Hyper: -4.141
yp -0.0002 yp: -0.000
ertext -0.0000 ertext: -0.000
Transfer -0.0001 Transfer: -0.000, Transport: -10.125
Protocol -0.0000 Protocol: -0.000
) -0.0000 ): -0.000
and -0.0620 and: -0.062, is: -2.812
HTTPS -0.0000 HTTPS: -0.000
( -0.0000 (: -0.000
H -0.0000 H: -0.000, Hyper: -11.000, Secure: -12.000
yp -0.0000 yp: -0.000
ertext -0.0000 ertext: -0.000
Transfer -0.0001 Transfer: -0.000, Transport: -9.250
Protocol -0.0000 Protocol: -0.000
[6] "Suggest a short book to read on a rainy day."

Response: A rainy day is the perfect excuse to curl up with a good book! Here

Token Logprob Top Alternatives
A -0.1221 A: -0.122, What: -2.372, Perfect: -4.747
rainy -0.4375 rainy: -0.438, perfect: -1.188, cozy: -3.563
day -0.0000 day: -0.000
is -0.0017 is: -0.002, !: -7.502
the -0.0008 the: -0.001, a: -7.376
perfect -0.0006 perfect: -0.001, pur: -7.626
excuse -0.0001 excuse: -0.000, opportunity: -10.375
to -0.0000 to: -0.000
curl -0.4305 curl: -0.431, cozy: -1.306, sn: -3.306
up -0.0000 up: -0.000
with -0.0000 with: -0.000
a 0.0000 a: 0.000
good -0.0087 good: -0.009, great: -4.759
book -0.0000 book: -0.000
! -0.1882 !: -0.188, !\n\n: -1.813
Here -0.0129 Here: -0.013, I: -4.513
[7] "2+2=?"

Response: The answer is 4!

Token Logprob Top Alternatives
The -0.3413 The: -0.341, 2: -1.966, 4: -2.091
answer -0.0022 answer: -0.002, correct: -6.127
is -0.2519 is: -0.252, to: -1.502
-0.4214 : -0.421, ...: -1.296, :: -2.796
4 -0.0000 4: -0.000
! -0.0923 !: -0.092, .: -2.467
<|eot_id|> -0.0003 <|eot_id|>: -0.000

gemini-code-assist bot (Contributor) left a comment:
Code Review

This pull request adds full and piecewise CUDA graph support for the Eagle speculator's prefill phase. This is a significant improvement that should boost performance. The changes are well-structured, introducing a new EaglePrefillCudaGraphManager and a dispatch_cudagraph helper method in the EagleSpeculator to cleanly manage graph execution. However, I've identified a critical issue in the memory allocation logic within the new EaglePrefillCudaGraphManager that could lead to runtime errors during CUDA graph capture. The fix is included in the review comments.

@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-eagle-full-pw-cudagraph-support branch 3 times, most recently from ee4f68d to 63ce471 Compare March 20, 2026 03:30
@TheEpicDolphin TheEpicDolphin changed the title [Model Runner V2] Add full/piecewise cuda graph support for eagle pre… [WIP][Model Runner V2] Add full/piecewise cuda graph support for eagle pre… Mar 20, 2026
@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-eagle-full-pw-cudagraph-support branch from 63ce471 to 9847bdf Compare March 20, 2026 05:10
mergify bot commented Mar 20, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @TheEpicDolphin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 20, 2026
@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-eagle-full-pw-cudagraph-support branch from 9847bdf to b9d5e5f Compare March 20, 2026 17:42
@mergify mergify bot removed the needs-rebase label Mar 20, 2026
@WoosukKwon WoosukKwon added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 20, 2026
@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-eagle-full-pw-cudagraph-support branch from b9d5e5f to b6db027 Compare March 21, 2026 01:07
mergify bot commented Mar 23, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @TheEpicDolphin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 23, 2026
@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-eagle-full-pw-cudagraph-support branch from b6db027 to 75e06a4 Compare March 24, 2026 03:11
@mergify mergify bot removed the needs-rebase label Mar 24, 2026
@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-eagle-full-pw-cudagraph-support branch 3 times, most recently from 7faafd1 to 07b3afc Compare March 24, 2026 23:35
@TheEpicDolphin TheEpicDolphin changed the title [WIP][Model Runner V2] Add full/piecewise cuda graph support for eagle pre… [Model Runner V2] Add full cuda graph support for eagle prefill Mar 25, 2026
@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-eagle-full-pw-cudagraph-support branch 2 times, most recently from 2ccfebc to 002b02f Compare March 25, 2026 05:24
@TheEpicDolphin TheEpicDolphin changed the title [Model Runner V2] Add full cuda graph support for eagle prefill [WIP][Model Runner V2] Add full cuda graph support for eagle prefill Mar 25, 2026
@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-eagle-full-pw-cudagraph-support branch 2 times, most recently from 41e8319 to b0eae6a Compare March 25, 2026 17:18
@TheEpicDolphin TheEpicDolphin marked this pull request as ready for review March 25, 2026 17:18
@TheEpicDolphin TheEpicDolphin changed the title [WIP][Model Runner V2] Add full cuda graph support for eagle prefill [Model Runner V2] Add full cuda graph support for eagle prefill Mar 25, 2026
@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-eagle-full-pw-cudagraph-support branch from b0eae6a to d3febbc Compare March 26, 2026 18:52
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai>
@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-eagle-full-pw-cudagraph-support branch from d3febbc to a396b2b Compare March 27, 2026 23:46
Comment on lines +86 to +88:

```python
self.max_num_reqs,
dtype=torch.int64,
device=device,
```

A Member commented — nit:

Suggested change:

```python
self.max_num_reqs, dtype=torch.int64, device=device
```

Labels

nvidia · ready (ONLY add when PR is ready to merge/full CI is needed) · v1

Projects

Status: No status

3 participants