
[Model Runner V2] Add full cuda graph support for eagle prefill #37588

Open — TheEpicDolphin wants to merge 1 commit into vllm-project:main from TheEpicDolphin:gdelfin/mrv2-eagle-full-pw-cudagraph-support

Conversation

TheEpicDolphin (Collaborator) commented Mar 19, 2026

Purpose

FULL cudagraphs are currently only used for the position 1+ drafting phase. In this PR, I apply FULL cudagraphs to the Eagle prefill path as well to reduce the CPU dispatch overhead in EagleSpeculator.propose.

Benchmarks

H200

I ran an exhaustive set of accuracy and performance benchmarks across several models (Llama3, Qwen3, Mimo, GLM 4.7 Flash), parallelizations (TP, EP, DP), and spec decode types (Eagle-1, Eagle-3, MTP), and compared main (baseline) with this PR. Here are the full results for both commits: https://docs.google.com/spreadsheets/d/1EY4OO9TrPOg4qQPTr6lKmeMpQL1EqCpi633hqU6SboA/edit?usp=sharing.

That spreadsheet is difficult to read due to the size, so I vibe-coded an HTML visualization here: https://gistpreview.github.io/?4a6fc01a426c25560fbbb03a389906ec

NOTE: In the HTML visualization, "ol" means output length and "c" means concurrency.

In summary, the results show significantly more improvements than regressions, particularly in TPOT.

GB300

Using vigil, I benchmarked with the following MiniMax M2.5 config:

```yaml
model: lukealonso/MiniMax-M2.5-NVFP4
mode: local
precheck: true
collect_env: true

pre_serve:
  - cmd: nvidia-smi

serving:
  roles:
    - role: worker
      vllm_engine:
        repo_path: /home/gdelfin/vllm
        env:
          HF_HOME: /home/hf-models/
          VLLM_FLASHINFER_MOE_BACKEND: latency
          VLLM_SERVER_DEV_MODE: "1"
          VLLM_USE_V2_MODEL_RUNNER: "1"
        cmd: >-
          vllm serve {model}
          -tp 4
          --performance-mode interactivity
          --trust-remote-code
          --max-num-seq 64
          --kv-cache-dtype fp8
          --compilation-config '{"mode":3,"pass_config":{"fuse_norm_quant":true,"fuse_act_quant":true,"fuse_gemm_comms":true}}'
          --speculative-config '{"method": "eagle3", "model": "novita/Eagle3-Spec-Minimax-M2.5-Exp15", "num_speculative_tokens": 3, "rejection_sample_method": "synthetic", "synthetic_acceptance_rate": 0.5}'
        health_check:
          url: http://localhost:8000/health
          timeout_s: 1200
          poll_interval_s: 5

post_serve:
  - cmd: >-
      vllm-bench
      --backend openai-chat
      --base-url http://127.0.0.1:8000
      --model {model}
      --dataset-name speed-bench
      --speed-bench-config throughput_16k
      --speed-bench-max-input-len 10240
      --speed-bench-category low_entropy
      --num-prompts 50
      --output-len 256
  - cmd: >-
      vllm-bench
      --backend openai-chat
      --base-url http://127.0.0.1:8000
      --model {model}
      --dataset-name speed-bench
      --speed-bench-config throughput_16k
      --speed-bench-max-input-len 10240
      --speed-bench-category low_entropy
      --num-warmups 50
      --num-prompts 1000
      --output-len 1536
      --sweep-max-concurrency 1,2,4,8,16,32,64
      --sweep-num-prompts-factor 10
      --reset-prefix-cache
      --save-result
```
And here is the comparison of results for eager vs cudagraph draft prefill:

| Concurrency | Req/s (Eager) | Req/s (CUDA) | Output tok/s (Eager) | Output tok/s (CUDA) | Total tok/s (Eager) | Total tok/s (CUDA) | TTFT ms (Eager) | TTFT ms (CUDA) | TPOT ms (Eager) | TPOT ms (CUDA) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.13 | 0.17 | 205.95 | 254.79 | 1,578.96 | 1,953.37 | 303.82 | 341.87 | 4.66 | 3.70 |
| 2 | 0.26 | 0.31 | 401.75 | 483.79 | 3,080.10 | 3,709.09 | 377.62 | 358.72 | 4.71 | 3.89 |
| 4 | 0.48 | 0.56 | 732.15 | 859.16 | 5,613.12 | 6,586.87 | 384.88 | 403.47 | 5.19 | 4.37 |
| 8 | 0.78 | 0.87 | 1,196.03 | 1,338.52 | 9,169.57 | 10,262.00 | 417.58 | 407.14 | 6.40 | 5.70 |
| 16 | 1.26 | 1.33 | 1,941.84 | 2,041.97 | 14,887.43 | 15,655.13 | 455.93 | 474.35 | 7.89 | 7.48 |
| 32 | 1.95 | 2.00 | 2,992.65 | 3,074.22 | 22,943.65 | 23,569.02 | 593.04 | 616.02 | 10.23 | 9.94 |
| 64 | 2.79 | 2.79 | 4,280.37 | 4,281.85 | 32,816.14 | 32,827.58 | 762.89 | 755.76 | 14.36 | 14.37 |

For smaller concurrencies, this PR yields better TPOT at the cost of slightly higher TTFT, but the tradeoff seems worth it given the improvement in output tok/s.
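Concretely, the concurrency-1 row of the table works out to roughly a 20% per-token latency reduction:

```python
# Concurrency-1 TPOT values (ms) from the GB300 table above.
tpot_eager, tpot_cudagraph = 4.66, 3.70

# Relative per-token latency reduction from cudagraph draft prefill.
improvement_pct = 100 * (tpot_eager - tpot_cudagraph) / tpot_eager
print(f"{improvement_pct:.1f}%")  # -> 20.6%
```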

NOTE: I used synthetic_acceptance_rate = 0.5 to isolate the performance improvement of Eagle prefill cudagraph.
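With a fixed per-position acceptance probability p and k speculative tokens, the expected number of tokens emitted per verification step is 1 + p + p² + … + pᵏ: one guaranteed target token, plus draft token i surviving only if all i positions up to it are accepted. This assumes per-position independence, which is what a synthetic acceptance rate approximates. A quick sketch:

```python
def expected_acceptance_length(p: float, k: int) -> float:
    """Expected tokens per verification step under a fixed per-position
    acceptance probability p with k speculative tokens: one guaranteed
    target token plus the geometric sum of surviving draft tokens."""
    return 1.0 + sum(p ** i for i in range(1, k + 1))

# With synthetic_acceptance_rate=0.5 and num_speculative_tokens=3:
# 1 + 0.5 + 0.25 + 0.125 = 1.875 tokens per step on average.
```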

DP + EP Edge Case

I also verified that there are no regressions for DP + EP by running the repro from #35294 and confirming that no hangs occur:
Server

```bash
VLLM_USE_V2_MODEL_RUNNER=1 vllm serve zai-org/GLM-4.7-Flash --trust-remote-code --no-enable-prefix-caching --tensor-parallel-size 1 --data-parallel-size 2 --enable-expert-parallel --seed 42 --gpu-memory-utilization 0.8 --speculative-config '{"method": "mtp", "num_speculative_tokens": 2, "model": "zai-org/GLM-4.7-Flash"}'
```

Client

```bash
vllm bench serve --model zai-org/GLM-4.7-Flash --host 0.0.0.0 --dataset-name hf --dataset-path philschmid/mt-bench --ignore-eos --request-rate inf --max-concurrency 16 --temperature 0
```

Results

| Metric | main | #37588 |
| --- | --- | --- |
| Successful requests | 1,000 | 1,000 |
| Failed requests | 0 | 0 |
| Max request concurrency | 16 | 16 |
| Benchmark duration (s) | 117.81 | 116.21 |
| Total input tokens | 75,028 | 75,028 |
| Total generated tokens | 256,000 | 256,000 |
| Request throughput (req/s) | 8.49 | 8.61 |
| Output token throughput (tok/s) | 2,172.98 | 2,202.99 |
| Peak output token throughput (tok/s) | 1,117.00 | 1,136.00 |
| Peak concurrent requests | 32.00 | 32.00 |
| Total token throughput (tok/s) | 2,809.83 | 2,848.64 |
| **Time to First Token** | | |
| Mean TTFT (ms) | 50.20 | 50.86 |
| Median TTFT (ms) | 46.23 | 46.32 |
| P99 TTFT (ms) | 256.56 | 293.25 |
| **Time per Output Token (excl. 1st)** | | |
| Mean TPOT (ms) | 7.15 | 7.05 |
| Median TPOT (ms) | 7.19 | 7.08 |
| P99 TPOT (ms) | 8.05 | 7.89 |
| **Inter-token Latency** | | |
| Mean ITL (ms) | 14.83 | 14.68 |
| Median ITL (ms) | 14.55 | 14.34 |
| P99 ITL (ms) | 20.20 | 20.57 |
| **Speculative Decoding** | | |
| Acceptance rate (%) | 53.91 | 54.36 |
| Acceptance length | 2.08 | 2.09 |
| Drafts | 122,972 | 122,454 |
| Draft tokens | 245,944 | 244,908 |
| Accepted tokens | 132,593 | 133,129 |
| Position 0 acceptance (%) | 86.79 | 86.70 |
| Position 1 acceptance (%) | 21.03 | 22.01 |
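The spec-decode counters above are internally consistent: the acceptance rate is accepted tokens over draft tokens, and the acceptance length is one target token plus accepted drafts per draft step. Checking the main-branch column:

```python
# main-branch counters from the results table above
drafts, draft_tokens, accepted = 122_972, 245_944, 132_593

acceptance_rate = 100 * accepted / draft_tokens  # percent of draft tokens accepted
acceptance_length = 1 + accepted / drafts        # tokens emitted per draft step
print(round(acceptance_rate, 2), round(acceptance_length, 2))  # -> 53.91 2.08
```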

Profiling

Server

```bash
VLLM_USE_V2_MODEL_RUNNER=1 vllm serve meta-llama/Meta-Llama-3-8B-Instruct --no-enable-prefix-caching --tensor-parallel-size 1 --data-parallel-size 1 --speculative-config '{"method": "eagle", "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 3}' --profiler-config '{"profiler": "torch", "torch_profiler_dir": "~/traces/"}'
```

Client

```bash
vllm bench serve --model meta-llama/Meta-Llama-3-8B-Instruct --tokenizer meta-llama/Meta-Llama-3-8B-Instruct --dataset-name random --random-input-len 16 --random-output-len 1024 --num-prompts 8 --max-concurrency 8 --request-rate inf --ignore-eos --temperature 0
```

Profiling revealed a 70% decrease in the propose CPU dispatch overhead:
Before: (profiler trace screenshot omitted)
After: (profiler trace screenshot omitted)

Testing

Manually verified that the outputs for the following prompts remained unchanged, using meta-llama/Meta-Llama-3-8B-Instruct with Eagle-1:

Before

[0] "Explain the theory of relativity in simple terms."

Response: The theory of relativity! It's a mind-bending concept, but I

Token Logprob Top Alternatives
The -0.1073 The: -0.107, A: -2.482, Albert: -4.357
theory -0.0027 theory: -0.003, Theory: -6.003, famous: -9.128
of 0.0000 of: 0.000
rel -0.0000 rel: -0.000, special: -13.375, relative: -14.625
ativity -0.0000 ativity: -0.000
! -0.2347 !: -0.235, ,: -1.985, is: -2.860
It -0.4301 It: -0.430, One: -1.555, Albert: -2.930
's -0.0043 's: -0.004, can: -6.129, may: -6.629
a -0.0226 a: -0.023, actually: -4.023, one: -6.148
mind -0.4806 mind: -0.481, complex: -1.981, big: -1.981
-b -0.0010 -b: -0.001, -st: -7.501, -bl: -8.501
ending -0.2530 ending: -0.253, low: -1.503, ender: -7.003
concept -0.0258 concept: -0.026, idea: -3.776, topic: -6.151
, -0.5774 ,: -0.577, that: -0.827, developed: -6.702
but -0.0000 but: -0.000
I -0.0801 I: -0.080, don: -2.580, fear: -7.205
[1] "What is the capital of France?"

Response: The capital of France is Paris.

Token Logprob Top Alternatives
The -0.0586 The: -0.059, That: -2.934, Easy: -6.309
capital -0.0000 capital: -0.000, answer: -11.125, capital: -14.125
of -0.0000 of: -0.000, city: -14.750, and: -18.750
France -0.0000 France: -0.000, France: -14.375, Franc: -18.250
is 0.0000 is: 0.000, was: -18.375, adalah: -18.625
Paris -0.0001 Paris: -0.000, Paris: -9.625, PAR: -12.125
. -0.2090 .: -0.209, !: -1.709, (: -5.084
<|eot_id|> -0.0001 <|eot_id|>: -0.000, It: -11.125, Paris: -11.750
[2] "Write a haiku about coding."

Response: Here is a haiku about coding:

Lines of code unfold
Logic flows like

Token Logprob Top Alternatives
Here -0.1743 Here: -0.174, Lines: -2.674, Code: -3.174
is -0.0381 is: -0.038, 's: -3.288, 's: -12.163
a -0.0000 a: -0.000
ha -0.0001 ha: -0.000, short: -9.750
iku 0.0000 iku: 0.000
about -0.0000 about: -0.000
coding -0.0000 coding: -0.000
:\n\n -0.0000 :\n\n: -0.000
Lines -0.4447 Lines: -0.445, Code: -1.320, Lines: -3.695
of -0.0013 of: -0.001, dance: -8.001, and: -8.251
code -0.0006 code: -0.001, logic: -8.501, ones: -9.251
unfold -0.5843 unfold: -0.584, flow: -1.334, dance: -2.459
\n -0.0000 \n: -0.000
Logic -1.4517 Logic: -1.452, Bug: -2.077, Mean: -2.202
flows -1.2300 flows: -1.230, 's: -1.355, and: -1.480
like -0.4286 like: -0.429, ,: -1.179, from: -3.429
[3] "List three benefits of regular exercise."

Response: Here are three benefits of regular exercise:

  1. Improves Physical Health:
Token Logprob Top Alternatives
Here -0.0001 Here: -0.000, Regular: -9.750, A: -10.500
are -0.0000 are: -0.000
three -0.0000 three: -0.000
benefits -0.0000 benefits: -0.000
of 0.0000 of: 0.000
regular -0.0000 regular: -0.000
exercise -0.0000 exercise: -0.000, Exercise: -14.375, exercises: -14.750
:\n\n -0.0000 :\n\n: -0.000
1 -0.0000 1: -0.000
. 0.0000 .: 0.000
** -0.0000 **: -0.000
Impro -0.4097 Impro: -0.410, Improved: -1.410, Weight: -2.410
ves -0.0005 ves: -0.000, vements: -8.750, ving: -9.125
Physical -0.4622 Physical: -0.462, Cardio: -1.337, Mental: -2.837
Health -0.0001 Health: -0.000
**: -0.0000 **:: -0.000
[4] "How does a refrigerator keep food cold?"

Response: A refrigerator keeps food cold by using a combination of several technologies and principles to remove

Token Logprob Top Alternatives
A -0.2289 A: -0.229, Re: -1.604, The: -5.854
refrigerator -0.0010 refrigerator: -0.001
keeps -0.0742 keeps: -0.074, is: -3.449, ,: -3.699
food -0.0000 food: -0.000
cold -0.0000 cold: -0.000
by -0.4406 by: -0.441, through: -1.066, using: -4.441
using -0.0041 using: -0.004, utilizing: -5.879
a -0.0047 a: -0.005, refriger: -5.755
combination -0.0713 combination: -0.071, refriger: -3.321, process: -3.821
of 0.0000 of: 0.000
several -1.3681 several: -1.368, technologies: -1.368, principles: -2.118
technologies -0.6389 technologies: -0.639, components: -1.389
and -0.5262 and: -0.526, to: -0.901
principles -0.4496 principles: -0.450, mechanisms: -1.575
to -0.2880 to: -0.288, .: -1.413
remove -0.7531 remove: -0.753, transfer: -1.378, maintain: -1.878
[5] "What is the difference between HTTP and HTTPS?"

Response: HTTP (Hypertext Transfer Protocol) and HTTPS (Hypertext Transfer Protocol

Token Logprob Top Alternatives
HTTP -0.0221 HTTP: -0.022, The: -3.897, HTTPS: -6.897
( -0.0013 (: -0.001, and: -6.626
H -0.0142 H: -0.014, Hyper: -4.264
yp -0.0002 yp: -0.000
ertext -0.0000 ertext: -0.000
Transfer -0.0001 Transfer: -0.000, Transport: -10.125
Protocol -0.0000 Protocol: -0.000
) -0.0000 ): -0.000
and -0.0549 and: -0.055, is: -2.930
HTTPS -0.0000 HTTPS: -0.000
( -0.0000 (: -0.000
H -0.0000 H: -0.000, Hyper: -11.000, Secure: -12.000
yp -0.0000 yp: -0.000
ertext -0.0000 ertext: -0.000
Transfer -0.0001 Transfer: -0.000, Transport: -9.125
Protocol -0.0000 Protocol: -0.000
[6] "Suggest a short book to read on a rainy day."

Response: A rainy day is the perfect excuse to curl up with a good book! Here

Token Logprob Top Alternatives
A -0.1214 A: -0.121, What: -2.371, Perfect: -4.746
rainy -0.4776 rainy: -0.478, perfect: -1.103, cozy: -3.603
day -0.0000 day: -0.000
is -0.0017 is: -0.002, !: -7.502
the -0.0008 the: -0.001, a: -7.376
perfect -0.0006 perfect: -0.001, pur: -7.626
excuse -0.0001 excuse: -0.000, opportunity: -10.375
to -0.0000 to: -0.000
curl -0.4008 curl: -0.401, cozy: -1.401, sn: -3.151
up -0.0000 up: -0.000
with -0.0000 with: -0.000
a 0.0000 a: 0.000
good -0.0087 good: -0.009, great: -4.759
book -0.0000 book: -0.000
! -0.1882 !: -0.188, !\n\n: -1.813
Here -0.0143 Here: -0.014, I: -4.389
[7] "2+2=?"

Response: The answer is 4!

Token Logprob Top Alternatives
The -0.3403 The: -0.340, 2: -1.965, 4: -2.090
answer -0.0022 answer: -0.002, correct: -6.127
is -0.2519 is: -0.252, to: -1.502
-0.4650 : -0.465, ...: -1.215, :: -2.715
4 -0.0000 4: -0.000
! -0.0919 !: -0.092, .: -2.467
<|eot_id|> -0.0003 <|eot_id|>: -0.000

After

[0] "Explain the theory of relativity in simple terms."

Response: The theory of relativity! It's a mind-bending concept, but I

Token Logprob Top Alternatives
The -0.1074 The: -0.107, A: -2.482, Albert: -4.357
theory -0.0027 theory: -0.003, Theory: -6.003, famous: -9.128
of 0.0000 of: 0.000
rel -0.0000 rel: -0.000, special: -13.375, relative: -14.625
ativity -0.0000 ativity: -0.000
! -0.2348 !: -0.235, ,: -1.985, is: -2.860
It -0.4299 It: -0.430, One: -1.555, Albert: -2.930
's -0.0040 's: -0.004, can: -6.254, may: -6.629
a -0.0226 a: -0.023, actually: -4.023, one: -6.148
mind -0.4989 mind: -0.499, complex: -1.874, big: -1.999
-b -0.0009 -b: -0.001, -st: -7.501, -bl: -8.626
ending -0.2266 ending: -0.227, low: -1.602, ender: -6.852
concept -0.0258 concept: -0.026, idea: -3.776, topic: -6.151
, -0.5774 ,: -0.577, that: -0.827, developed: -6.702
but -0.0000 but: -0.000
I -0.0800 I: -0.080, don: -2.580, fear: -7.330
[1] "What is the capital of France?"

Response: The capital of France is Paris.

Token Logprob Top Alternatives
The -0.0586 The: -0.059, That: -2.934, Easy: -6.309
capital -0.0000 capital: -0.000, answer: -11.125
of -0.0000 of: -0.000
France -0.0000 France: -0.000
is 0.0000 is: 0.000
Paris -0.0001 Paris: -0.000, Paris: -9.500
. -0.2083 .: -0.208, !: -1.708, (: -5.208
<|eot_id|> -0.0000 <|eot_id|>: -0.000
[2] "Write a haiku about coding."

Response: Here is a haiku about coding:

Lines of code unfold
Logic flows like

Token Logprob Top Alternatives
Here -0.1551 Here: -0.155, Lines: -2.780, Code: -3.280
is -0.0337 is: -0.034, 's: -3.409
a -0.0000 a: -0.000
ha -0.0001 ha: -0.000, short: -9.750
iku 0.0000 iku: 0.000
about -0.0000 about: -0.000
coding -0.0000 coding: -0.000
:\n\n -0.0000 :\n\n: -0.000
Lines -0.4432 Lines: -0.443, Code: -1.318
of -0.0013 of: -0.001, dance: -7.876
code -0.0006 code: -0.001, logic: -8.626
unfold -0.5850 unfold: -0.585, flow: -1.335, dance: -2.460
\n -0.0000 \n: -0.000
Logic -1.3611 Logic: -1.361, Bug: -2.174, Mean: -2.174
flows -1.2459 flows: -1.246, 's: -1.371, and: -1.496
like -0.3868 like: -0.387, ,: -1.262, from: -3.512
[3] "List three benefits of regular exercise."

Response: Here are three benefits of regular exercise:

  1. Improves Physical Health:
Token Logprob Top Alternatives
Here -0.0001 Here: -0.000, Regular: -9.875, A: -10.500
are -0.0000 are: -0.000
three -0.0000 three: -0.000
benefits -0.0000 benefits: -0.000
of 0.0000 of: 0.000
regular -0.0000 regular: -0.000
exercise -0.0000 exercise: -0.000
:\n\n -0.0000 :\n\n: -0.000
1 -0.0000 1: -0.000
. 0.0000 .: 0.000
** -0.0000 **: -0.000
Impro -0.4096 Impro: -0.410, Improved: -1.410, Weight: -2.410
ves -0.0006 ves: -0.001, vements: -8.501
Physical -0.4182 Physical: -0.418, Cardio: -1.418, Mental: -2.918
Health -0.0001 Health: -0.000
**: -0.0000 **:: -0.000
[4] "How does a refrigerator keep food cold?"

Response: A refrigerator keeps food cold by using a combination of several technologies and principles to remove

Token Logprob Top Alternatives
A -0.2293 A: -0.229, Re: -1.604, The: -5.729
refrigerator -0.0010 refrigerator: -0.001
keeps -0.0784 keeps: -0.078, is: -3.328, ,: -3.703
food -0.0000 food: -0.000
cold -0.0000 cold: -0.000
by -0.3993 by: -0.399, through: -1.149, using: -4.399
using -0.0041 using: -0.004, utilizing: -5.879
a -0.0047 a: -0.005, refriger: -5.755
combination -0.0806 combination: -0.081, refriger: -3.206, process: -3.706
of 0.0000 of: 0.000
several -1.3180 several: -1.318, technologies: -1.318, principles: -2.193
technologies -0.6518 technologies: -0.652, components: -1.402
and -0.5261 and: -0.526, to: -0.901
principles -0.4470 principles: -0.447, mechanisms: -1.572
to -0.2880 to: -0.288, .: -1.413
remove -0.7062 remove: -0.706, transfer: -1.456, maintain: -1.831
[5] "What is the difference between HTTP and HTTPS?"

Response: HTTP (Hypertext Transfer Protocol) and HTTPS (Hypertext Transfer Protocol

Token Logprob Top Alternatives
HTTP -0.0196 HTTP: -0.020, The: -4.020, HTTPS: -7.020
( -0.0013 (: -0.001, and: -6.626
H -0.0161 H: -0.016, Hyper: -4.141
yp -0.0002 yp: -0.000
ertext -0.0000 ertext: -0.000
Transfer -0.0001 Transfer: -0.000, Transport: -10.125
Protocol -0.0000 Protocol: -0.000
) -0.0000 ): -0.000
and -0.0620 and: -0.062, is: -2.812
HTTPS -0.0000 HTTPS: -0.000
( -0.0000 (: -0.000
H -0.0000 H: -0.000, Hyper: -11.000, Secure: -12.000
yp -0.0000 yp: -0.000
ertext -0.0000 ertext: -0.000
Transfer -0.0001 Transfer: -0.000, Transport: -9.250
Protocol -0.0000 Protocol: -0.000
[6] "Suggest a short book to read on a rainy day."

Response: A rainy day is the perfect excuse to curl up with a good book! Here

Token Logprob Top Alternatives
A -0.1221 A: -0.122, What: -2.372, Perfect: -4.747
rainy -0.4375 rainy: -0.438, perfect: -1.188, cozy: -3.563
day -0.0000 day: -0.000
is -0.0017 is: -0.002, !: -7.502
the -0.0008 the: -0.001, a: -7.376
perfect -0.0006 perfect: -0.001, pur: -7.626
excuse -0.0001 excuse: -0.000, opportunity: -10.375
to -0.0000 to: -0.000
curl -0.4305 curl: -0.431, cozy: -1.306, sn: -3.306
up -0.0000 up: -0.000
with -0.0000 with: -0.000
a 0.0000 a: 0.000
good -0.0087 good: -0.009, great: -4.759
book -0.0000 book: -0.000
! -0.1882 !: -0.188, !\n\n: -1.813
Here -0.0129 Here: -0.013, I: -4.513
[7] "2+2=?"

Response: The answer is 4!

Token Logprob Top Alternatives
The -0.3413 The: -0.341, 2: -1.966, 4: -2.091
answer -0.0022 answer: -0.002, correct: -6.127
is -0.2519 is: -0.252, to: -1.502
-0.4214 : -0.421, ...: -1.296, :: -2.796
4 -0.0000 4: -0.000
! -0.0923 !: -0.092, .: -2.467
<|eot_id|> -0.0003 <|eot_id|>: -0.000

gemini-code-assist bot (Contributor) left a comment:
Code Review

This pull request adds full and piecewise CUDA graph support for the Eagle speculator's prefill phase. This is a significant improvement that should boost performance. The changes are well-structured, introducing a new EaglePrefillCudaGraphManager and a dispatch_cudagraph helper method in the EagleSpeculator to cleanly manage graph execution. However, I've identified a critical issue in the memory allocation logic within the new EaglePrefillCudaGraphManager that could lead to runtime errors during CUDA graph capture. The fix is included in the review comments.

@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-eagle-full-pw-cudagraph-support branch 3 times, most recently from ee4f68d to 63ce471 Compare March 20, 2026 03:30
@TheEpicDolphin TheEpicDolphin changed the title [Model Runner V2] Add full/piecewise cuda graph support for eagle pre… [WIP][Model Runner V2] Add full/piecewise cuda graph support for eagle pre… Mar 20, 2026
@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-eagle-full-pw-cudagraph-support branch from 63ce471 to 9847bdf Compare March 20, 2026 05:10
mergify bot commented Mar 20, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @TheEpicDolphin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 20, 2026
@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-eagle-full-pw-cudagraph-support branch from 9847bdf to b9d5e5f Compare March 20, 2026 17:42
@mergify mergify bot removed the needs-rebase label Mar 20, 2026
@WoosukKwon WoosukKwon added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 20, 2026
@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-eagle-full-pw-cudagraph-support branch from b9d5e5f to b6db027 Compare March 21, 2026 01:07
mergify bot commented Mar 23, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @TheEpicDolphin.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Mar 23, 2026
@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-eagle-full-pw-cudagraph-support branch from b6db027 to 75e06a4 Compare March 24, 2026 03:11
@mergify mergify bot removed the needs-rebase label Mar 24, 2026
@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-eagle-full-pw-cudagraph-support branch 3 times, most recently from 7faafd1 to 07b3afc Compare March 24, 2026 23:35
@TheEpicDolphin TheEpicDolphin changed the title [WIP][Model Runner V2] Add full/piecewise cuda graph support for eagle pre… [Model Runner V2] Add full cuda graph support for eagle prefill Mar 25, 2026
@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-eagle-full-pw-cudagraph-support branch 2 times, most recently from 2ccfebc to 002b02f Compare March 25, 2026 05:24
@TheEpicDolphin TheEpicDolphin changed the title [Model Runner V2] Add full cuda graph support for eagle prefill [WIP][Model Runner V2] Add full cuda graph support for eagle prefill Mar 25, 2026
@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-eagle-full-pw-cudagraph-support branch 2 times, most recently from 41e8319 to b0eae6a Compare March 25, 2026 17:18
@TheEpicDolphin TheEpicDolphin marked this pull request as ready for review March 25, 2026 17:18
@TheEpicDolphin TheEpicDolphin changed the title [WIP][Model Runner V2] Add full cuda graph support for eagle prefill [Model Runner V2] Add full cuda graph support for eagle prefill Mar 25, 2026
@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-eagle-full-pw-cudagraph-support branch from b0eae6a to d3febbc Compare March 26, 2026 18:52
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai>
@TheEpicDolphin TheEpicDolphin force-pushed the gdelfin/mrv2-eagle-full-pw-cudagraph-support branch from d3febbc to a396b2b Compare March 27, 2026 23:46
Comment on lines +86 to +88:

```python
self.max_num_reqs,
dtype=torch.int64,
device=device,
```

A Member commented — nit:

Suggested change:

```python
self.max_num_reqs, dtype=torch.int64, device=device
```

Labels

nvidia · ready (ONLY add when PR is ready to merge/full CI is needed) · v1

Projects

Status: No status

3 participants