[Spec Decode][Benchmark] Add Blitzedit dataset #23605

ekagra-ranjan · 2025-08-26T02:11:02Z

I have been looking for datasets where Ngram outperforms EAGLE, to explore the idea of combining Ngram and EAGLE (#18633). InstructCoder, being an editing task, was the go-to dataset in vLLM for Ngram until I found that fixing the prompt made EAGLE quite strong, and better than Ngram on InstructCoder. An ideal dataset would be one where the overlap between input and output is high. The Blazedit dataset is promising because it allows observing the acceptance length (AL) of Ngram across different levels of input-output overlap.
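To illustrate why high input-output overlap favors Ngram, here is a minimal sketch of n-gram (prompt lookup) drafting. It is not vLLM's implementation; the function name `ngram_propose` and the parameters `n` and `k` are hypothetical, chosen only for illustration.

```python
from typing import List


def ngram_propose(tokens: List[int], n: int = 3, k: int = 3) -> List[int]:
    """Propose up to k draft tokens by finding an earlier copy of the last n tokens.

    Illustrative only: names and defaults are assumptions, not vLLM's API.
    """
    if len(tokens) < n:
        return []
    suffix = tokens[-n:]
    # Scan from right to left, skipping the suffix's own position,
    # so that the most recent earlier match wins.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            # The tokens that followed the earlier match become the draft.
            return tokens[start + n:start + n + k]
    return []


# Context whose tail repeats an earlier span, as in an editing task
# where the output copies large parts of the input.
context = [1, 2, 3, 4, 5, 6, 1, 2, 3]
draft = ngram_propose(context, n=3, k=3)
```

When the output mostly copies the input, such lookups hit often and the drafted tokens are likely to be accepted, which is the regime where Ngram should be compared against EAGLE.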

This PR adds the Blazedit dataset.
Compared to the InstructCoder dataset that we used for Ngram, it has longer sequences, and each sample is tagged with a normalized Levenshtein distance in [0.0, 1.0], which helps with observing the gain of Ngram with respect to the overlap between input and output.
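For readers unfamiliar with the metric, here is a minimal sketch of a normalized Levenshtein distance. Normalizing by the length of the longer string is an assumption made here for illustration; the dataset may normalize differently, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Distance scaled to [0.0, 1.0]; assumes division by the longer length."""
    longest = max(len(a), len(b))
    return 0.0 if longest == 0 else levenshtein(a, b) / longest
```

Under this definition, 0.0 means the input and output are identical (maximum overlap) and 1.0 means nothing is shared position-wise, so low values mark the samples where Ngram is expected to do best.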
Source:
- Blog: https://huggingface.co/blog/ganler/blazedit
- HF dataset: https://huggingface.co/vdaita
- Available in 2 versions: 5k char (vdaita/edit_5k_char) and 10k char (vdaita/edit_10k_char)

The 5k char variant needs a model that supports a sequence length greater than 3k. Llama 3.1 8B only supports 2048, so I couldn't run this dataset completely, but it will be useful for models that support longer sequence lengths.

Sample Cmd:
time VLLM_USE_V1=1 python3 examples/offline_inference/spec_decode.py --method eagle --num_spec_tokens 3 --tp 1 --dataset-name hf --dataset-path vdaita/edit_5k_char --num-prompts 90 --hf-output-len 2048 --blazedit-min-distance 0.01 --blazedit-max-distance 0.99 --print-output
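The `--blazedit-min-distance` and `--blazedit-max-distance` flags select samples by their normalized distance. A rough sketch of such a filter is below; the record field name `norm_distance`, the inclusive bounds, and the helper name are assumptions for illustration and may not match the code in this PR.

```python
from typing import Any, Dict, Iterable, List


def filter_by_distance(
    records: Iterable[Dict[str, Any]],
    min_distance: float,
    max_distance: float,
    field: str = "norm_distance",  # hypothetical field name
) -> List[Dict[str, Any]]:
    """Keep records whose distance lies in [min_distance, max_distance]."""
    return [r for r in records if min_distance <= r[field] <= max_distance]


# Toy records standing in for dataset rows.
rows = [
    {"id": 0, "norm_distance": 0.0},   # output identical to input
    {"id": 1, "norm_distance": 0.05},  # small edit
    {"id": 2, "norm_distance": 0.6},   # large edit
    {"id": 3, "norm_distance": 1.0},   # fully rewritten
]
kept = filter_by_distance(rows, min_distance=0.01, max_distance=0.99)
kept_ids = [r["id"] for r in kept]
```

With the bounds from the sample command (0.01 and 0.99), a filter like this would drop exact copies and full rewrites. Narrowing the range lets AL be measured per overlap bucket.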

gemini-code-assist

Code Review

This PR adds support for the Blitzedit dataset for benchmarking. The changes correctly add command-line arguments, integrate the new dataset class into the factory function, and implement the dataset loading and sampling logic. My review focuses on cleaning up some leftover debugging code and unused variables in the new BlazeditDataset implementation to improve code quality and ensure the dataset is not unnecessarily filtered.

vllm/benchmarks/datasets.py

Signed-off-by: Ekagra Ranjan <[email protected]>

mergify · 2025-09-04T15:51:52Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ekagra-ranjan.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

LiuXiaoxuanPKU · 2025-09-04T21:34:03Z

Can you share some numbers of InstructCoder if any?

ekagra-ranjan · 2025-09-05T14:25:57Z

I have some numbers here: #18971

Signed-off-by: Ekagra Ranjan <[email protected]> Co-authored-by: Roger Wang <[email protected]>

Signed-off-by: Ekagra Ranjan <[email protected]> Co-authored-by: Roger Wang <[email protected]> Signed-off-by: xuebwang-amd <[email protected]>

mergify bot added the performance label Aug 26, 2025

gemini-code-assist bot reviewed Aug 26, 2025

ekagra-ranjan changed the title ~~[Spec Dec][Benchmark] Add Blitzedit dataset~~ [Spec Decode][Benchmark] Add Blitzedit dataset Sep 3, 2025

ywang96 approved these changes Sep 3, 2025

ywang96 added the ready label Sep 3, 2025

ywang96 enabled auto-merge (squash) September 3, 2025 23:26

add blitzedit

75103ca

Signed-off-by: Ekagra Ranjan <[email protected]>

auto-merge was automatically disabled September 4, 2025 15:51
Head branch was pushed to by a user without write access

ekagra-ranjan force-pushed the er-blazeit-data branch from 65672e8 to 75103ca Compare September 4, 2025 15:51

mergify bot added the needs-rebase label Sep 4, 2025

Merge branch 'main' into er-blazeit-data

1c8d903

Signed-off-by: Ekagra Ranjan <[email protected]>

mergify bot removed the needs-rebase label Sep 4, 2025

ywang96 and others added 2 commits September 5, 2025 09:51

Merge branch 'main' into er-blazeit-data

afcb97d

Merge branch 'main' into er-blazeit-data

558dbf4

ekagra-ranjan mentioned this pull request Sep 5, 2025

[Spec Decode][Hybrid] Add ngram-eagle SD method #24344

Open

3 tasks

Merge branch 'main' into er-blazeit-data

ac0969b

ekagra-ranjan mentioned this pull request Sep 8, 2025

[Benchmark] Update bench doc with mtbench, blazedit, spec bench #24450

Merged

ywang96 merged commit cd08636 into vllm-project:main Sep 8, 2025
38 checks passed

eicherseiji pushed a commit to eicherseiji/vllm that referenced this pull request Sep 9, 2025

[Spec Decode][Benchmark] Add Blitzedit dataset (vllm-project#23605)

832f314

Signed-off-by: Ekagra Ranjan <[email protected]> Co-authored-by: Roger Wang <[email protected]>

skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025

[Spec Decode][Benchmark] Add Blitzedit dataset (vllm-project#23605)

e96319c

Signed-off-by: Ekagra Ranjan <[email protected]> Co-authored-by: Roger Wang <[email protected]>

FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025

[Spec Decode][Benchmark] Add Blitzedit dataset (vllm-project#23605)

e3f4ebe

Signed-off-by: Ekagra Ranjan <[email protected]> Co-authored-by: Roger Wang <[email protected]>

sducouedic pushed a commit to sducouedic/vllm that referenced this pull request Oct 16, 2025

[Spec Decode][Benchmark] Add Blitzedit dataset (vllm-project#23605)

1f46c29

Signed-off-by: Ekagra Ranjan <[email protected]> Co-authored-by: Roger Wang <[email protected]>
