[Model Runner V2] Add full cuda graph support for eagle prefill #37588
TheEpicDolphin wants to merge 1 commit into vllm-project:main
Code Review
This pull request adds full and piecewise CUDA graph support for the Eagle speculator's prefill phase. This is a significant improvement that should boost performance. The changes are well-structured, introducing a new EaglePrefillCudaGraphManager and a dispatch_cudagraph helper method in the EagleSpeculator to cleanly manage graph execution. However, I've identified a critical issue in the memory allocation logic within the new EaglePrefillCudaGraphManager that could lead to runtime errors during CUDA graph capture. The fix is included in the review comments.
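As a rough illustration of the kind of dispatch decision a helper like dispatch_cudagraph has to make, the sketch below pads a batch up to the nearest captured graph size and falls back to eager execution when nothing fits. This is not the PR's code: `pick_capture_size` and `CAPTURE_SIZES` are hypothetical names, and the size list is an assumption chosen for the example.

```python
from bisect import bisect_left
from typing import Optional, Sequence

# Hypothetical set of batch sizes for which a CUDA graph was captured,
# sorted ascending. The real sizes come from the engine configuration.
CAPTURE_SIZES = [1, 2, 4, 8, 16, 32]


def pick_capture_size(num_tokens: int, capture_sizes: Sequence[int]) -> Optional[int]:
    """Return the smallest captured size that can hold num_tokens.

    Returns None when the batch is larger than every captured size,
    which a caller would treat as "run eagerly instead of replaying".
    """
    idx = bisect_left(capture_sizes, num_tokens)
    if idx < len(capture_sizes):
        return capture_sizes[idx]
    return None


if __name__ == "__main__":
    for n in (5, 8, 33):
        size = pick_capture_size(n, CAPTURE_SIZES)
        mode = f"replay graph padded to {size}" if size is not None else "eager fallback"
        print(f"num_tokens={n}: {mode}")
```

Padding to a captured size trades a little wasted compute for a single graph launch, which is where the CPU dispatch savings come from.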
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai>
    self.max_num_reqs,
    dtype=torch.int64,
    device=device,
nit: suggested change:
    self.max_num_reqs, dtype=torch.int64, device=device
Purpose
FULL cudagraphs are currently only used for the position 1+ drafting phase. In this PR, I apply FULL cudagraphs to the Eagle prefill path as well to reduce the CPU dispatch overhead in EagleSpeculator.propose.
Benchmarks
H200
I ran an exhaustive set of accuracy and performance benchmarks across several models (Llama3, Qwen3, Mimo, GLM 4.7 Flash), parallelizations (TP, EP, DP), and spec decode types (Eagle-1, Eagle-3, MTP), and compared main (baseline) with this PR. Here are the full results for both commits: https://docs.google.com/spreadsheets/d/1EY4OO9TrPOg4qQPTr6lKmeMpQL1EqCpi633hqU6SboA/edit?usp=sharing.
That spreadsheet is difficult to read due to the size, so I vibe-coded an HTML visualization here: https://gistpreview.github.io/?4a6fc01a426c25560fbbb03a389906ec
NOTE: "ol" means output length, and "c" stands for concurrency in the HTML visualization
In summary, we see significantly more improvements than regressions, particularly with TPOT.
GB300
Using vigil I benchmarked with the following MiniMax M2.5 config:
And here is the comparison of results for eager vs cudagraph draft prefill:
For smaller concurrencies, this PR yields better TPOT at the cost of TTFT. But the tradeoff seems worth it given the improvement in output tok/s.
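For readers less familiar with the two metrics being traded off, here is a minimal sketch of how TTFT and TPOT are commonly derived from per-token arrival times. It assumes the usual definitions (TTFT is the delay until the first output token; TPOT is the mean gap between subsequent output tokens) and is not the benchmark tool's implementation. The function name and the timestamps are made up for the example.

```python
from typing import Sequence, Tuple


def ttft_and_tpot(request_start: float, token_times: Sequence[float]) -> Tuple[float, float]:
    """Compute (TTFT, TPOT) in the same time unit as the inputs.

    request_start: time the request was sent.
    token_times:   arrival time of each output token, in order.
    """
    if not token_times:
        raise ValueError("need at least one output token")
    ttft = token_times[0] - request_start
    n = len(token_times)
    # TPOT averages the gaps after the first token; undefined for one token,
    # so report 0.0 in that case.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return ttft, tpot


if __name__ == "__main__":
    # Illustrative timestamps in seconds, not measured data.
    ttft, tpot = ttft_and_tpot(0.0, [0.5, 0.6, 0.7, 0.8])
    print(f"TTFT={ttft:.3f}s TPOT={tpot:.3f}s")
```

Under these definitions, a change that adds a fixed cost before the first token but shortens every later step raises TTFT and lowers TPOT, which matches the tradeoff described above.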
NOTE: I used synthetic_acceptance_rate = 0.5 to isolate the performance improvement of Eagle prefill cudagraph.
DP + EP Edge Case
I also verified that there are no regressions for DP + EP by testing the case from #35294 and not observing any hangs:
Server
Client
Results
Profiling
Server
Client
Profiling revealed a 70% decrease in the propose CPU dispatch overhead:
Before
After
Testing
Manually verified that the generated tokens for the following prompts remained unchanged (the reported logprobs differ only slightly), using meta-llama/Meta-Llama-3-8B-Instruct with Eagle-1:
Before
[0] "Explain the theory of relativity in simple terms."
TheThe: -0.107,A: -2.482,Albert: -4.357theorytheory: -0.003,Theory: -6.003,famous: -9.128ofof: 0.000relrel: -0.000,special: -13.375,relative: -14.625ativityativity: -0.000!!: -0.235,,: -1.985,is: -2.860ItIt: -0.430,One: -1.555,Albert: -2.930's's: -0.004,can: -6.129,may: -6.629aa: -0.023,actually: -4.023,one: -6.148mindmind: -0.481,complex: -1.981,big: -1.981-b-b: -0.001,-st: -7.501,-bl: -8.501endingending: -0.253,low: -1.503,ender: -7.003conceptconcept: -0.026,idea: -3.776,topic: -6.151,,: -0.577,that: -0.827,developed: -6.702butbut: -0.000II: -0.080,don: -2.580,fear: -7.205
[1] "What is the capital of France?"
TheThe: -0.059,That: -2.934,Easy: -6.309capitalcapital: -0.000,answer: -11.125,capital: -14.125ofof: -0.000,city: -14.750,and: -18.750FranceFrance: -0.000,France: -14.375,Franc: -18.250isis: 0.000,was: -18.375,adalah: -18.625ParisParis: -0.000,Paris: -9.625,PAR: -12.125..: -0.209,!: -1.709,(: -5.084<|eot_id|><|eot_id|>: -0.000,It: -11.125,Paris: -11.750
[2] "Write a haiku about coding."
HereHere: -0.174,Lines: -2.674,Code: -3.174isis: -0.038,'s: -3.288,'s: -12.163aa: -0.000haha: -0.000,short: -9.750ikuiku: 0.000aboutabout: -0.000codingcoding: -0.000:\n\n:\n\n: -0.000LinesLines: -0.445,Code: -1.320,Lines: -3.695ofof: -0.001,dance: -8.001,and: -8.251codecode: -0.001,logic: -8.501,ones: -9.251unfoldunfold: -0.584,flow: -1.334,dance: -2.459\n\n: -0.000LogicLogic: -1.452,Bug: -2.077,Mean: -2.202flowsflows: -1.230,'s: -1.355,and: -1.480likelike: -0.429,,: -1.179,from: -3.429
[3] "List three benefits of regular exercise."
HereHere: -0.000,Regular: -9.750,A: -10.500areare: -0.000threethree: -0.000benefitsbenefits: -0.000ofof: 0.000regularregular: -0.000exerciseexercise: -0.000,Exercise: -14.375,exercises: -14.750:\n\n:\n\n: -0.00011: -0.000..: 0.000****: -0.000ImproImpro: -0.410,Improved: -1.410,Weight: -2.410vesves: -0.000,vements: -8.750,ving: -9.125PhysicalPhysical: -0.462,Cardio: -1.337,Mental: -2.837HealthHealth: -0.000**:**:: -0.000
[4] "How does a refrigerator keep food cold?"
AA: -0.229,Re: -1.604,The: -5.854refrigeratorrefrigerator: -0.001keepskeeps: -0.074,is: -3.449,,: -3.699foodfood: -0.000coldcold: -0.000byby: -0.441,through: -1.066,using: -4.441usingusing: -0.004,utilizing: -5.879aa: -0.005,refriger: -5.755combinationcombination: -0.071,refriger: -3.321,process: -3.821ofof: 0.000severalseveral: -1.368,technologies: -1.368,principles: -2.118technologiestechnologies: -0.639,components: -1.389andand: -0.526,to: -0.901principlesprinciples: -0.450,mechanisms: -1.575toto: -0.288,.: -1.413removeremove: -0.753,transfer: -1.378,maintain: -1.878
[5] "What is the difference between HTTP and HTTPS?"
HTTPHTTP: -0.022,The: -3.897,HTTPS: -6.897((: -0.001,and: -6.626HH: -0.014,Hyper: -4.264ypyp: -0.000ertextertext: -0.000TransferTransfer: -0.000,Transport: -10.125ProtocolProtocol: -0.000)): -0.000andand: -0.055,is: -2.930HTTPSHTTPS: -0.000((: -0.000HH: -0.000,Hyper: -11.000,Secure: -12.000ypyp: -0.000ertextertext: -0.000TransferTransfer: -0.000,Transport: -9.125ProtocolProtocol: -0.000
[6] "Suggest a short book to read on a rainy day."
AA: -0.121,What: -2.371,Perfect: -4.746rainyrainy: -0.478,perfect: -1.103,cozy: -3.603dayday: -0.000isis: -0.002,!: -7.502thethe: -0.001,a: -7.376perfectperfect: -0.001,pur: -7.626excuseexcuse: -0.000,opportunity: -10.375toto: -0.000curlcurl: -0.401,cozy: -1.401,sn: -3.151upup: -0.000withwith: -0.000aa: 0.000goodgood: -0.009,great: -4.759bookbook: -0.000!!: -0.188,!\n\n: -1.813HereHere: -0.014,I: -4.389
[7] "2+2=?"
TheThe: -0.340,2: -1.965,4: -2.090answeranswer: -0.002,correct: -6.127isis: -0.252,to: -1.502: -0.465,...: -1.215,:: -2.71544: -0.000!!: -0.092,.: -2.467<|eot_id|><|eot_id|>: -0.000
After
[0] "Explain the theory of relativity in simple terms."
TheThe: -0.107,A: -2.482,Albert: -4.357theorytheory: -0.003,Theory: -6.003,famous: -9.128ofof: 0.000relrel: -0.000,special: -13.375,relative: -14.625ativityativity: -0.000!!: -0.235,,: -1.985,is: -2.860ItIt: -0.430,One: -1.555,Albert: -2.930's's: -0.004,can: -6.254,may: -6.629aa: -0.023,actually: -4.023,one: -6.148mindmind: -0.499,complex: -1.874,big: -1.999-b-b: -0.001,-st: -7.501,-bl: -8.626endingending: -0.227,low: -1.602,ender: -6.852conceptconcept: -0.026,idea: -3.776,topic: -6.151,,: -0.577,that: -0.827,developed: -6.702butbut: -0.000II: -0.080,don: -2.580,fear: -7.330
[1] "What is the capital of France?"
TheThe: -0.059,That: -2.934,Easy: -6.309capitalcapital: -0.000,answer: -11.125ofof: -0.000FranceFrance: -0.000isis: 0.000ParisParis: -0.000,Paris: -9.500..: -0.208,!: -1.708,(: -5.208<|eot_id|><|eot_id|>: -0.000
[2] "Write a haiku about coding."
HereHere: -0.155,Lines: -2.780,Code: -3.280isis: -0.034,'s: -3.409aa: -0.000haha: -0.000,short: -9.750ikuiku: 0.000aboutabout: -0.000codingcoding: -0.000:\n\n:\n\n: -0.000LinesLines: -0.443,Code: -1.318ofof: -0.001,dance: -7.876codecode: -0.001,logic: -8.626unfoldunfold: -0.585,flow: -1.335,dance: -2.460\n\n: -0.000LogicLogic: -1.361,Bug: -2.174,Mean: -2.174flowsflows: -1.246,'s: -1.371,and: -1.496likelike: -0.387,,: -1.262,from: -3.512
[3] "List three benefits of regular exercise."
HereHere: -0.000,Regular: -9.875,A: -10.500areare: -0.000threethree: -0.000benefitsbenefits: -0.000ofof: 0.000regularregular: -0.000exerciseexercise: -0.000:\n\n:\n\n: -0.00011: -0.000..: 0.000****: -0.000ImproImpro: -0.410,Improved: -1.410,Weight: -2.410vesves: -0.001,vements: -8.501PhysicalPhysical: -0.418,Cardio: -1.418,Mental: -2.918HealthHealth: -0.000**:**:: -0.000
[4] "How does a refrigerator keep food cold?"
AA: -0.229,Re: -1.604,The: -5.729refrigeratorrefrigerator: -0.001keepskeeps: -0.078,is: -3.328,,: -3.703foodfood: -0.000coldcold: -0.000byby: -0.399,through: -1.149,using: -4.399usingusing: -0.004,utilizing: -5.879aa: -0.005,refriger: -5.755combinationcombination: -0.081,refriger: -3.206,process: -3.706ofof: 0.000severalseveral: -1.318,technologies: -1.318,principles: -2.193technologiestechnologies: -0.652,components: -1.402andand: -0.526,to: -0.901principlesprinciples: -0.447,mechanisms: -1.572toto: -0.288,.: -1.413removeremove: -0.706,transfer: -1.456,maintain: -1.831
[5] "What is the difference between HTTP and HTTPS?"
HTTPHTTP: -0.020,The: -4.020,HTTPS: -7.020((: -0.001,and: -6.626HH: -0.016,Hyper: -4.141ypyp: -0.000ertextertext: -0.000TransferTransfer: -0.000,Transport: -10.125ProtocolProtocol: -0.000)): -0.000andand: -0.062,is: -2.812HTTPSHTTPS: -0.000((: -0.000HH: -0.000,Hyper: -11.000,Secure: -12.000ypyp: -0.000ertextertext: -0.000TransferTransfer: -0.000,Transport: -9.250ProtocolProtocol: -0.000
[6] "Suggest a short book to read on a rainy day."
AA: -0.122,What: -2.372,Perfect: -4.747rainyrainy: -0.438,perfect: -1.188,cozy: -3.563dayday: -0.000isis: -0.002,!: -7.502thethe: -0.001,a: -7.376perfectperfect: -0.001,pur: -7.626excuseexcuse: -0.000,opportunity: -10.375toto: -0.000curlcurl: -0.431,cozy: -1.306,sn: -3.306upup: -0.000withwith: -0.000aa: 0.000goodgood: -0.009,great: -4.759bookbook: -0.000!!: -0.188,!\n\n: -1.813HereHere: -0.013,I: -4.513
[7] "2+2=?"
TheThe: -0.341,2: -1.966,4: -2.091answeranswer: -0.002,correct: -6.127isis: -0.252,to: -1.502: -0.421,...: -1.296,:: -2.79644: -0.000!!: -0.092,.: -2.467<|eot_id|><|eot_id|>: -0.000
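The manual comparison above could be automated along the following lines. This is only a sketch: the (token, logprob) pairs are the top-1 entries for prompt [1] copied from the Before and After dumps, while `compare_runs` and the 0.05 tolerance are hypothetical choices made for this example, not part of the PR.

```python
from typing import List, Tuple

TokenLogprob = Tuple[str, float]

# Top-1 (token, logprob) per position for prompt [1], taken from the dumps above.
BEFORE: List[TokenLogprob] = [
    ("The", -0.059), ("capital", -0.000), ("of", -0.000), ("France", -0.000),
    ("is", 0.000), ("Paris", -0.000), (".", -0.209), ("<|eot_id|>", -0.000),
]
AFTER: List[TokenLogprob] = [
    ("The", -0.059), ("capital", -0.000), ("of", -0.000), ("France", -0.000),
    ("is", 0.000), ("Paris", -0.000), (".", -0.208), ("<|eot_id|>", -0.000),
]


def compare_runs(before: List[TokenLogprob], after: List[TokenLogprob],
                 atol: float = 0.05) -> Tuple[bool, float, bool]:
    """Return (tokens_match, max_logprob_delta, within_tolerance).

    atol is an assumed tolerance for numerical drift, not a project threshold.
    """
    if len(before) != len(after):
        raise ValueError("runs produced a different number of tokens")
    tokens_match = all(b[0] == a[0] for b, a in zip(before, after))
    max_delta = max((abs(b[1] - a[1]) for b, a in zip(before, after)), default=0.0)
    return tokens_match, max_delta, max_delta <= atol


if __name__ == "__main__":
    match, delta, ok = compare_runs(BEFORE, AFTER)
    print(f"tokens_match={match} max_delta={delta:.3f} within_tolerance={ok}")
```

Checking token identity separately from logprob drift keeps the test strict about what the model outputs while tolerating small numerical differences between execution paths.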