Closed

Commits (64)
b5adae6
run eagle with full cudagraph
zixi-qi Jun 27, 2025
53223d5
run eagle with full cudagraph
zixi-qi Jun 27, 2025
5f5683e
Merge branch 'eagle-full-cudagraph' of github.com:zixi-qi/vllm into e…
zixi-qi Jul 23, 2025
c38e003
rebase and add unit test
zixi-qi Jul 23, 2025
f36f8c1
Fix bad lm-eval fork (#21318)
mgoin Jul 21, 2025
41d76db
[perf] Speed up align sum kernels (#21079)
hj-mistral Jul 21, 2025
302677b
[v1][sampler] Inplace logprobs comparison to get the token rank (#21283)
houseroad Jul 21, 2025
b60e53c
[XPU] Enable external_launcher to serve as an executor via torchrun (…
chaojun-zhang Jul 22, 2025
657be61
[Doc] Fix CPU doc format (#21316)
bigPYJ1151 Jul 22, 2025
f226a8b
[Intel GPU] Ray Compiled Graph avoid NCCL for Intel GPU (#21338)
ratnampa Jul 22, 2025
e780c7d
Revert "[Performance] Performance improvements in non-blockwise fp8 C…
minosfuture Jul 22, 2025
49ad485
[Core] Minimize number of dict lookup in _maybe_evict_cached_block (#…
Jialin Jul 22, 2025
0e5124d
[V1] [Hybrid] Add new test to verify that hybrid views into KVCacheTe…
tdoublep Jul 22, 2025
d96a375
[Refactor] Fix Compile Warning #1444-D (#21208)
yewentao256 Jul 22, 2025
8a8f6bd
Fix kv_cache_dtype handling for out-of-tree HPU plugin (#21302)
kzawora-intel Jul 22, 2025
b1373c2
[Misc] DeepEPHighThroughtput - Enable Inductor pass (#21311)
varun-sundar-rabindranath Jul 22, 2025
8b8a283
[Bug] DeepGemm: Fix Cuda Init Error (#21312)
yewentao256 Jul 22, 2025
f0ea54f
Update fp4 quantize API (#21327)
wenscarl Jul 22, 2025
759f3ba
[Feature][eplb] add verify ep or tp or dp (#21102)
lengrongfu Jul 22, 2025
7aea174
Add arcee model (#21296)
alyosha-swamy Jul 22, 2025
7aa2bac
[Bugfix] Fix eviction cached blocked logic (#21357)
simon-mo Jul 22, 2025
fae5235
[Misc] Remove deprecated args in v0.10 (#21349)
kebe7jun Jul 22, 2025
25d0c72
[Core] Optimize update checks in LogitsProcessor (#21245)
Jialin Jul 22, 2025
c7f963b
[benchmark] Port benchmark request sent optimization to benchmark_ser…
Jialin Jul 22, 2025
40ab4c4
[Core] Introduce popleft_n and append_n in FreeKVCacheBlockQueue to f…
Jialin Jul 22, 2025
80634e8
[Misc] unify variable for LLM instance v2 (#21356)
andyxning Jul 22, 2025
46b75f4
[perf] Add fused MLA QKV + strided layernorm (#21116)
mickaelseznec Jul 22, 2025
29646b5
[feat]: add SM100 support for cutlass FP8 groupGEMM (#20447)
djmmoss Jul 22, 2025
a7cae7c
[Perf] Cuda Kernel for Per Token Group Quant (#21083)
yewentao256 Jul 22, 2025
db98d04
Adds parallel model weight loading for runai_streamer (#21330)
bbartels Jul 22, 2025
6666593
[feat] Enable mm caching for transformers backend (#21358)
zucchini-nlp Jul 22, 2025
3eb125c
Revert "[Refactor] Fix Compile Warning #1444-D (#21208)" (#21384)
yewentao256 Jul 22, 2025
6728377
Add tokenization_kwargs to encode for embedding model truncation (#21…
Receiling Jul 22, 2025
5aafc16
[Bugfix] Decode Tokenized IDs to Strings for `hf_processor` in `llm.c…
ariG23498 Jul 22, 2025
8fcfe36
[CI/Build] Fix test failure due to updated model repo (#21375)
DarkLight1337 Jul 22, 2025
faf8b1a
Fix Flashinfer Allreduce+Norm enable disable calculation based on `fi…
xinli-sw Jul 22, 2025
b3dead9
[Model] Add Qwen3CoderToolParser (#21396)
ranpox Jul 22, 2025
d64c0ff
[Misc] Copy HF_TOKEN env var to Ray workers (#21406)
ruisearch42 Jul 22, 2025
b2f7613
[BugFix] Fix ray import error mem cleanup bug (#21381)
joerunde Jul 22, 2025
74d8cbc
[CI/Build] Fix model executor tests (#21387)
DarkLight1337 Jul 23, 2025
701a331
[Bugfix][ROCm][Build] Fix build regression on ROCm (#21393)
gshtras Jul 23, 2025
7c61321
Simplify weight loading in Transformers backend (#21382)
hmellor Jul 23, 2025
f8e8456
[BugFix] Update python to python3 calls for image; fix prefix & input…
ericehanley Jul 23, 2025
062ac71
[BUGFIX] deepseek-v2-lite failed due to fused_qkv_a_proj name update …
xuechendi Jul 23, 2025
44653f8
[Bugfix][CUDA] fixes CUDA FP8 kv cache dtype supported (#21420)
elvischenv Jul 23, 2025
071801e
Changing "amdproduction" allocation. (#21409)
Alexei-V-Ivanov-AMD Jul 23, 2025
a7f791d
[Bugfix] Fix nightly transformers CI failure (#21427)
Isotr0py Jul 23, 2025
c6e12ff
[Core] Add basic unit test for maybe_evict_cached_block (#21400)
Jialin Jul 23, 2025
97c24f6
[Cleanup] Only log MoE DP setup warning if DP is enabled (#21315)
mgoin Jul 23, 2025
d840c8a
add clear messages for deprecated models (#21424)
youkaichao Jul 23, 2025
0790c5e
[Bugfix] ensure tool_choice is popped when `tool_choice:null` is pass…
gcalmettes Jul 23, 2025
8c5ed35
Fixed typo in profiling logs (#21441)
sergiopaniego Jul 23, 2025
5d860d9
[Docs] Fix bullets and grammars in tool_calling.md (#21440)
windsonsea Jul 23, 2025
c8ea28a
[Sampler] Introduce logprobs mode for logging (#21398)
houseroad Jul 23, 2025
54e6fce
Mamba V2 Test not Asserting Failures. (#21379)
fabianlim Jul 23, 2025
507f651
[Misc] fixed nvfp4_moe test failures due to invalid kwargs (#21246)
Jul 23, 2025
d698fd2
[Docs] Clean up v1/metrics.md (#21449)
windsonsea Jul 23, 2025
169cb78
[Model] add Hunyuan V1 Dense Model support. (#21368)
Jul 23, 2025
94a6358
[V1] Check all pooling tasks during profiling (#21299)
DarkLight1337 Jul 23, 2025
a1fb3aa
[Bugfix][Qwen][DCA] fixes bug in dual-chunk-flash-attn backend for qw…
sighingnow Jul 23, 2025
09c7ebb
[Tests] Add tests for headless internal DP LB (#21450)
njhill Jul 23, 2025
5d0155d
[Core][Model] PrithviMAE Enablement on vLLM v1 engine (#20577)
christian-pinto Jul 23, 2025
49b48be
Add test case for compiling multiple graphs (#21044)
sarckk Jul 23, 2025
f08230a
[TPU][TEST] Fix the downloading issue in TPU v1 test 11. (#21418)
QiliangCui Jul 23, 2025
4 changes: 4 additions & 0 deletions examples/offline_inference/eagle.py
@@ -48,6 +48,7 @@ def parse_args():
parser.add_argument("--enable_chunked_prefill", action="store_true")
parser.add_argument("--max_num_batched_tokens", type=int, default=2048)
parser.add_argument("--temp", type=float, default=0)
parser.add_argument("--compilation_config", type=str, default="")
return parser.parse_args()


@@ -94,6 +95,9 @@ def main():
"max_model_len": max_model_len,
},
disable_log_stats=False,
compilation_config=(
json.loads(args.compilation_config) if args.compilation_config else None
),
Contributor

medium
The direct call to json.loads can cause the script to crash with a json.JSONDecodeError if an invalid JSON string is passed to the --compilation_config argument. Consider adding a try-except block to handle potential parsing errors gracefully.

compilation_config = None
if args.compilation_config:
    try:
        compilation_config = json.loads(args.compilation_config)
    except json.JSONDecodeError as e:
        raise ValueError(f"Invalid JSON for --compilation_config: {e}") from e
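
For illustration, here is a standalone snippet (not part of the diff; the key name simply mirrors the flag's expected JSON) showing the failure mode the try-except guards against:

import json

json.loads('{"full_cuda_graph": true}')  # valid JSON, returns a dict
json.loads("{full_cuda_graph: true}")    # raises json.JSONDecodeError (unquoted key)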

)

sampling_params = SamplingParams(temperature=args.temp, max_tokens=256)
24 changes: 23 additions & 1 deletion vllm/v1/spec_decode/eagle.py
@@ -1,5 +1,7 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
from typing import Any, Optional

import torch
import torch.nn as nn

@@ -74,6 +76,7 @@ def __init__(
1,
device=device,
dtype=torch.int32)
self.draft_attn_metadata = None

def propose(
self,
@@ -169,6 +172,13 @@ def propose(
self.positions[:num_tokens] = target_positions
self.hidden_states[:num_tokens] = target_hidden_states

# copy attention metadata for full cudagraph mode
if self.draft_attn_metadata is not None and num_tokens <= self.cudagraph_batch_sizes[-1]:
self.draft_attn_metadata.seq_lens[:attn_metadata.seq_lens.shape[0]].copy_(attn_metadata.seq_lens.clone())
self.draft_attn_metadata.slot_mapping[:attn_metadata.slot_mapping.shape[0]].copy_(attn_metadata.slot_mapping.clone())
self.draft_attn_metadata.query_start_loc[:attn_metadata.query_start_loc.shape[0]].copy_(attn_metadata.query_start_loc.clone())
self.draft_attn_metadata.block_table[:attn_metadata.block_table.shape[0]].copy_(attn_metadata.block_table.clone())

with set_forward_context(per_layer_attn_metadata,
self.vllm_config,
num_tokens=num_input_tokens):
@@ -254,6 +264,13 @@ def propose(
self.positions[:batch_size] = clamped_positions
self.hidden_states[:batch_size] = hidden_states

# copy attention metadata for full cudagraph mode
if self.draft_attn_metadata is not None:
self.draft_attn_metadata.seq_lens[:attn_metadata.seq_lens.shape[0]].copy_(attn_metadata.seq_lens.clone())
self.draft_attn_metadata.slot_mapping[:attn_metadata.slot_mapping.shape[0]].copy_(attn_metadata.slot_mapping.clone())
self.draft_attn_metadata.query_start_loc[:attn_metadata.query_start_loc.shape[0]].copy_(attn_metadata.query_start_loc.clone())
self.draft_attn_metadata.block_table[:attn_metadata.block_table.shape[0]].copy_(attn_metadata.block_table.clone())

# Run the model.
with set_forward_context(per_layer_attn_metadata,
self.vllm_config,
@@ -369,8 +386,13 @@ def load_model(self, target_model: nn.Module) -> None:
def dummy_run(
self,
num_tokens: int,
attn_metadata: Optional[dict[str, Any]],
) -> None:
with set_forward_context(None, self.vllm_config,
if attn_metadata is not None and self.draft_attn_metadata is None:
attn_metadata[self.attn_layer_names[0]].scheduler_metadata = None
self.draft_attn_metadata = attn_metadata[self.attn_layer_names[0]] # assume only one draft layer
with set_forward_context(attn_metadata,
self.vllm_config,
num_tokens=num_tokens):
self.model(
self.input_ids[:num_tokens],
4 changes: 2 additions & 2 deletions vllm/v1/worker/gpu_model_runner.py
@@ -1860,7 +1860,7 @@ def maybe_randomize_inputs(self, input_ids: torch.Tensor):
Randomize input_ids if VLLM_RANDOMIZE_DP_DUMMY_INPUTS is set.
This is to help balance expert-selection
- during profile_run
- during DP rank dummy run
- during DP rank dummy run
"""
dp_size = self.vllm_config.parallel_config.data_parallel_size
randomize_inputs = envs.VLLM_RANDOMIZE_DP_DUMMY_INPUTS and dp_size > 1
@@ -1982,7 +1982,7 @@ def _dummy_run(

if self.speculative_config and self.speculative_config.use_eagle():
assert isinstance(self.drafter, EagleProposer)
self.drafter.dummy_run(num_tokens)
self.drafter.dummy_run(num_tokens, attn_metadata)
Collaborator

Here's my hypothesis:

  • the attn_metadata contains tensors
  • cudagraphs is baking in the addresses of those tensors
  • during runtime, the captured cudagraphs still read from these tensors.

Does the eagle forward pass use the tensors in the attn_metadata? If so, every time we invoke the eagle head, we may need to copy data into the tensors in the attn_metadata.
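
To make the concern concrete, here is a minimal standalone sketch (plain PyTorch on a CUDA device, not vLLM code) of how a captured graph keeps reading from the exact buffers it was captured with, so fresh data must be copied into those buffers before each replay:

import torch

static_input = torch.zeros(8, device="cuda")
static_output = torch.zeros(8, device="cuda")

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    # Capture records the addresses of static_input/static_output.
    static_output.copy_(static_input * 2)

new_data = torch.arange(8, dtype=torch.float32, device="cuda")
static_input.copy_(new_data)  # refresh the captured buffer in place
g.replay()                    # replays against the same addresses
# static_output now holds new_data * 2; rebinding static_input to a new tensor
# instead of copying into it would leave the replay reading stale data.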

Collaborator Author

You are right, this is partially the reason for the numerical gap. As an experiment, I copied the attn_metadata constructed for eager mode into the captured attn_metadata in the latest commit:

# copy attention metadata for full cudagraph mode
if self.draft_attn_metadata is not None:
    self.draft_attn_metadata.seq_lens[:attn_metadata.seq_lens.shape[0]].copy_(attn_metadata.seq_lens.clone())
    self.draft_attn_metadata.slot_mapping[:attn_metadata.slot_mapping.shape[0]].copy_(attn_metadata.slot_mapping.clone())
    self.draft_attn_metadata.query_start_loc[:attn_metadata.query_start_loc.shape[0]].copy_(attn_metadata.query_start_loc.clone())
    self.draft_attn_metadata.block_table[:attn_metadata.block_table.shape[0]].copy_(attn_metadata.block_table.clone())

As a result, I got better numerics, but there is still a gap compared with piecewise mode:

  • VLLM_USE_V1=1 python examples/offline_inference/eagle.py --num_spec_tokens 7 --num_prompts 1 --compilation_config '{"full_cuda_graph": true, "cudagraph_capture_sizes": [1]}'
--------------------------------------------------
mean acceptance length: 2.46
--------------------------------------------------
acceptance at token 0:0.69
acceptance at token 1:0.38
acceptance at token 2:0.20
acceptance at token 3:0.12
acceptance at token 4:0.06
acceptance at token 5:0.00
acceptance at token 6:0.00
  • VLLM_USE_V1=1 python examples/offline_inference/eagle.py --num_spec_tokens 7 --num_prompts 1 --compilation_config '{"full_cuda_graph": false, "cudagraph_capture_sizes": [1]}'
--------------------------------------------------
mean acceptance length: 2.82
--------------------------------------------------
acceptance at token 0:0.77
acceptance at token 1:0.51
acceptance at token 2:0.28
acceptance at token 3:0.13
acceptance at token 4:0.05
acceptance at token 5:0.03
acceptance at token 6:0.03
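
As a sanity check on these summaries (assuming the per-token lines are acceptance rates per drafting step, which is how the example script prints them), the mean acceptance length should equal 1 plus the sum of the per-token rates, and both runs are consistent with that up to rounding:

rates_full = [0.69, 0.38, 0.20, 0.12, 0.06, 0.00, 0.00]
rates_piecewise = [0.77, 0.51, 0.28, 0.13, 0.05, 0.03, 0.03]
print(f"{1 + sum(rates_full):.2f}")       # 2.45 vs. reported 2.46
print(f"{1 + sum(rates_piecewise):.2f}")  # 2.80 vs. reported 2.82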

Collaborator Author

So it seems there might still be some discrepancy in attention computation between eager mode and cudagraph mode. I will try to investigate more, and would also appreciate any suggestions on what to check from the torch.compile perspective.

Contributor

I am wondering if we can directly reuse the persistent buffers from attn_metadata.

One more thing: I think you should also consider the padding issue if possible. The buffers from attn_metadata should have correctly filled values in the padding region.
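
A minimal sketch of that suggestion (hypothetical helper, not an existing vLLM API): write the fresh values into the graph-captured persistent buffer in place and give the padded tail a benign fill, instead of keeping a separate copy of the metadata:

import torch

def refresh_persistent_buffer(persistent: torch.Tensor,
                              fresh: torch.Tensor,
                              pad_value: int = 0) -> None:
    # Copy current values into the buffer the CUDA graph was captured with,
    # then fill the padded region so replay never reads stale entries.
    n = fresh.shape[0]
    persistent[:n].copy_(fresh)
    persistent[n:].fill_(pad_value)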

Collaborator Author

Thanks for the suggestion; currently the issue seems to be Inductor-related: #20190 (comment)


logit_indices = np.cumsum(num_scheduled_tokens) - 1
return hidden_states, hidden_states[logit_indices]