feat: add TxtSlicesDataset to allow sampling slices from txt file for benchmarking by jdebache · Pull Request #30156 · vllm-project/vllm

jdebache · 2025-12-05T21:22:48Z

Purpose

Sampling random tokens directly from a tokenizer's vocabulary produces inputs that are not representative enough for benchmarking with speculative decoding or expert parallelism, since the behavior of both features depends on the data.

On the other hand, random datasets are very flexible and offer complete control over the input and output sequence lengths, which is desirable for creating reproducible benchmarks.

This PR introduces a new type of benchmarking dataset, TxtSlicesDataset, which offers a compromise between the flexibility of a random dataset and the fidelity of a real dataset. It samples slices from a user-provided txt file.
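The idea of sampling slices from a text file can be sketched roughly as follows. This is not the PR's implementation: `sample_slices` and its parameters are hypothetical names, and a whitespace split stands in for a real tokenizer. It only illustrates how slices of real text keep exact control over the input length.

```python
import random


def sample_slices(token_ids, num_requests, input_len, seed=0):
    """Draw `num_requests` contiguous slices of `input_len` tokens each.

    Hypothetical helper for illustration; not the vLLM API.
    """
    if len(token_ids) < input_len:
        raise ValueError("source text is shorter than the requested input length")
    rng = random.Random(seed)  # private RNG: the same seed gives the same slices
    max_start = len(token_ids) - input_len
    starts = [rng.randint(0, max_start) for _ in range(num_requests)]
    return [token_ids[s : s + input_len] for s in starts]


# Stand-in "tokenizer" (an assumption): one token id per whitespace-separated word.
text = "the quick brown fox jumps over the lazy dog " * 20
vocab = {}
token_ids = [vocab.setdefault(word, len(vocab)) for word in text.split()]

slices = sample_slices(token_ids, num_requests=4, input_len=16, seed=42)
```

Every prompt has exactly the requested length, as with a random dataset, but its token sequence comes from real text.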

Content

The implementation of TxtSlicesDataset
Fixes to typing in datasets.py
A unit test for the new dataset type

gemini-code-assist

Code Review

This pull request introduces TxtSlicesDataset for benchmarking, which samples data from a text file. It also includes significant refactoring by moving utility functions from datasets.py to a new dataset_utils.py file and improving typing throughout. The changes are well-structured. My review focuses on improving the robustness and reproducibility of the new TxtSlicesDataset and its tests. I've pointed out a resource leak in the tests and potential for non-reproducible behavior due to the use of the global random module. I've also identified a missing check that could lead to a crash with certain input files.
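The reproducibility concern about the global random module can be illustrated with a small sketch. The function names below are hypothetical and are not taken from the PR. The sketch contrasts the shared module-level generator with a private `random.Random` instance.

```python
import random


def pick_starts_global(n, upper):
    # Shared module-level RNG: the result depends on every other caller
    # that has drawn from `random` since the last seed.
    return [random.randint(0, upper) for _ in range(n)]


def pick_starts_local(n, upper, seed):
    # Private RNG: the result depends only on `seed`.
    rng = random.Random(seed)
    return [rng.randint(0, upper) for _ in range(n)]


random.seed(0)
first = pick_starts_global(5, 1000)

random.seed(0)
random.random()  # unrelated code consumes global state
second = pick_starts_global(5, 1000)  # may no longer match `first`

local_a = pick_starts_local(5, 1000, seed=0)
random.random()  # unrelated draws do not affect the private RNG
local_b = pick_starts_local(5, 1000, seed=0)
```

With a private generator, two runs with the same seed produce the same sample regardless of what other code does with the `random` module.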

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

mergify · 2025-12-06T11:00:11Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @hypdeb.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

cursor

Cursor Bugbot has reviewed your changes and found 4 potential issues.

mergify · 2026-01-27T13:49:08Z

Documentation preview: https://vllm--30156.org.readthedocs.build/en/30156/

mergify · 2026-01-27T13:52:47Z

Hi @hypdeb, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?

mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint

jdebache · 2026-01-27T14:06:44Z

Hey, I'm trying to run the doc generation locally to figure out what is wrong. How long is it supposed to take?

I still can't reproduce locally, but from the error logs, it looks like it's trying to mock vllm over and over again for some reason.

jdebache · 2026-01-27T14:15:17Z

Cannot reproduce the documentation generation error locally. I get:

INFO    -  Doc file 'features/spec_decode.md' contains an unrecognized relative link '../../tests/spec_decode/e2e', it was left as is.
INFO    -  Doc file 'design/fused_moe_modular_kernel.md' contains a link './moe_kernel_features.md#fused-moe-experts-kernels', but the doc 'design/moe_kernel_features.md' does not contain an anchor '#fused-moe-experts-kernels'.
INFO    -  Doc file 'getting_started/installation/README.md' contains a link 'gpu.md#nvidia-cuda', but the doc 'getting_started/installation/gpu.md' does not contain an anchor '#nvidia-cuda'.
INFO    -  Doc file 'getting_started/installation/README.md' contains a link 'gpu.md#amd-rocm', but the doc 'getting_started/installation/gpu.md' does not contain an anchor '#amd-rocm'.
INFO    -  Doc file 'getting_started/installation/README.md' contains a link 'gpu.md#intel-xpu', but the doc 'getting_started/installation/gpu.md' does not contain an anchor '#intel-xpu'.
INFO    -  Doc file 'getting_started/installation/README.md' contains a link 'cpu.md#intelamd-x86', but the doc 'getting_started/installation/cpu.md' does not contain an anchor '#intelamd-x86'.
INFO    -  Doc file 'getting_started/installation/README.md' contains a link 'cpu.md#arm-aarch64', but the doc 'getting_started/installation/cpu.md' does not contain an anchor '#arm-aarch64'.
INFO    -  Doc file 'getting_started/installation/README.md' contains a link 'cpu.md#apple-silicon', but the doc 'getting_started/installation/cpu.md' does not contain an anchor '#apple-silicon'.
INFO    -  Doc file 'getting_started/installation/README.md' contains a link 'cpu.md#ibm-z-s390x', but the doc 'getting_started/installation/cpu.md' does not contain an anchor '#ibm-z-s390x'.

Aborted with 7 warnings in strict mode!
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute

mergify · 2026-01-27T14:47:44Z

Documentation preview: https://vllm--30156.org.readthedocs.build/en/30156/

mergify · 2026-01-28T06:41:47Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @hypdeb.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2026-01-28T13:30:43Z

Hi @hypdeb, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

mergify · 2026-01-28T14:30:21Z

Hi @hypdeb, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

jdebache · 2026-01-29T08:00:21Z

Hey @hmellor, would you mind having a look? We found this dataset generation useful while benchmarking data-dependent features such as EP and speculative decoding.

jdebache · 2026-01-29T08:01:24Z

Note that I wanted to make additional fixes to the typing in datasets.py. However, they caused a cascade of issues, in particular in the documentation generator, and keeping them would have made the changes too large.

jdebache · 2026-04-14T07:09:23Z

Okay, I have now addressed all review comments.

DarkLight1337

LGTM now, thanks for your patience!

jdebache · 2026-04-14T08:41:34Z

Thanks for your thorough review!

… benchmarking (vllm-project#30156) Signed-off-by: jdebache <jdebache@nvidia.com> Signed-off-by: zengxian <xiangdong.zeng@intel.com>

… benchmarking (vllm-project#30156) Signed-off-by: jdebache <jdebache@nvidia.com>

… benchmarking (vllm-project#30156) Signed-off-by: jdebache <jdebache@nvidia.com> Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>

… benchmarking (vllm-project#30156) Signed-off-by: jdebache <jdebache@nvidia.com>

… benchmarking (vllm-project#30156) Signed-off-by: jdebache <jdebache@nvidia.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>

mergify Bot added the performance (Performance-related issues) label Dec 5, 2025

gemini-code-assist Bot reviewed Dec 5, 2025

Comment thread tests/benchmarks/test_txt_slices_dataset.py Outdated

Comment thread vllm/benchmarks/datasets.py Outdated

Comment thread vllm/benchmarks/datasets.py Outdated

Comment thread vllm/benchmarks/datasets.py Outdated

chatgpt-codex-connector Bot reviewed Dec 5, 2025

Comment thread vllm/benchmarks/datasets.py Outdated

mergify Bot added the needs-rebase label Dec 6, 2025

jdebache force-pushed the datasets_refactor branch from d1ba173 to 5c92be2 Compare December 6, 2025 11:20

mergify Bot removed the needs-rebase label Dec 6, 2025

jdebache force-pushed the datasets_refactor branch from 5c92be2 to 5e2e032 Compare January 27, 2026 12:59

cursor Bot reviewed Jan 27, 2026

Comment thread vllm/benchmarks/datasets/datasets.py

Comment thread vllm/benchmarks/datasets/datasets.py

Comment thread vllm/benchmarks/datasets/datasets.py

Comment thread vllm/benchmarks/datasets.py Outdated

jdebache force-pushed the datasets_refactor branch from 5e2e032 to 615803d Compare January 27, 2026 13:15

jdebache requested a review from hmellor as a code owner January 27, 2026 13:48

mergify Bot added the documentation (Improvements or additions to documentation) label Jan 27, 2026

jdebache requested review from aarnphm and chaunceyjiang as code owners January 27, 2026 14:05

mergify Bot added the frontend label Jan 27, 2026

mergify Bot added the needs-rebase label Jan 28, 2026

jdebache force-pushed the datasets_refactor branch from 5a5e79d to efba577 Compare January 28, 2026 13:26

mergify Bot removed the needs-rebase label Jan 28, 2026

jdebache force-pushed the datasets_refactor branch 2 times, most recently from 4fb80eb to db25703 Compare January 29, 2026 07:56

jdebache added 5 commits April 14, 2026 09:08

support different distribution between ISL and OSL

12b55ed

Signed-off-by: jdebache <jdebache@nvidia.com>

address review comments

dedb739

Signed-off-by: jdebache <jdebache@nvidia.com>

improve doc a bit

7e49d99

Signed-off-by: jdebache <jdebache@nvidia.com>

address review comments

7f44682

Signed-off-by: jdebache <jdebache@nvidia.com>

address review comments

63b4470

Signed-off-by: jdebache <jdebache@nvidia.com>

jdebache force-pushed the datasets_refactor branch from 1e747e2 to 63b4470 Compare April 14, 2026 07:08

DarkLight1337 reviewed Apr 14, 2026

Comment thread vllm/benchmarks/datasets/utils.py

DarkLight1337 reviewed Apr 14, 2026

Comment thread vllm/benchmarks/throughput.py Outdated

jdebache added 2 commits April 14, 2026 07:18

apply changes to input/output range ratio changes to throughput.py by…

f857a0c

… reverting changes made there Signed-off-by: jdebache <jdebache@nvidia.com>

rename datasets sampling shared logic file to utils.py

2cbd9a0

Signed-off-by: jdebache <jdebache@nvidia.com>

DarkLight1337 reviewed Apr 14, 2026

Comment thread vllm/benchmarks/datasets/datasets.py Outdated

adddress review comments

8204e31

Signed-off-by: jdebache <jdebache@nvidia.com>

DarkLight1337 approved these changes Apr 14, 2026

DarkLight1337 enabled auto-merge (squash) April 14, 2026 07:54

github-actions Bot added the ready (ONLY add when PR is ready to merge/full CI is needed) label Apr 14, 2026

DarkLight1337 merged commit 893b2af into vllm-project:main Apr 14, 2026
49 of 50 checks passed

whk-lab pushed a commit to whk-lab/vllm that referenced this pull request Apr 23, 2026

feat: add TxtSlicesDataset to allow sampling slices from txt file for…

e226638

… benchmarking (vllm-project#30156) Signed-off-by: jdebache <jdebache@nvidia.com>

izhuhaoran mentioned this pull request Apr 26, 2026

Bugfix: fix SpecBench sample argument error #40927

Merged

mystous pushed a commit to mystous/vllm_hybrid that referenced this pull request May 10, 2026

feat: add TxtSlicesDataset to allow sampling slices from txt file for…

b296bdf

… benchmarking (vllm-project#30156) Signed-off-by: jdebache <jdebache@nvidia.com>

my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026

feat: add TxtSlicesDataset to allow sampling slices from txt file for…

5090db1

… benchmarking (vllm-project#30156) Signed-off-by: jdebache <jdebache@nvidia.com>

my-other-github-account pushed a commit to my-other-github-account/vllm that referenced this pull request May 15, 2026

feat: add TxtSlicesDataset to allow sampling slices from txt file for…

258c6c1

… benchmarking (vllm-project#30156) Signed-off-by: jdebache <jdebache@nvidia.com>

jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request May 20, 2026

feat: add TxtSlicesDataset to allow sampling slices from txt file for…

4c52482

… benchmarking (vllm-project#30156) Signed-off-by: jdebache <jdebache@nvidia.com>
