refactoring of the benchmark dataset capabilities and added a random txt slices dataset option #30150

hypdeb wants to merge 1 commit into vllm-project:main
Conversation
Commit message: …txt slices dataset option

Signed-off-by: Julien Debache <jdebache@cpu-0005.cm.cluster>
Code Review
This pull request introduces a new vllm.benchmarks.datasets module, centralizing benchmark dataset handling. It defines a BenchmarkDataset abstract base class and a SampleRequest dataclass for standardized request representation.

The module includes implementations for various datasets: CustomDataset (for JSONL prompts), SpecBench (for Spec-Bench JSONL), BurstGPTDataset (from CSV), ShareGPTDataset (from JSON), SonnetDataset (from text files), PrefixRepetitionRandomDataset (synthetic with repeated prefixes), TxtSlicesDataset (from text files/URLs), and several HuggingFaceDataset subclasses (ConversationDataset, MultiModalConversationDataset, VisionArenaDataset, MMVUDataset, InstructCoderDataset, MTBenchDataset, BlazeditDataset, AIMODataset, NextEditPredictionDataset, ASRDataset, MLPerfDataset, MMStarDataset). A RandomDataset and its multimodal (RandomMultiModalDataset) and reranking (RandomDatasetForReranking) variants are also added for synthetic data generation. The arg_utils.py file provides a comprehensive argument parser for configuring these datasets.

Review comments highlighted the need to refine the maybe_oversample_requests logic in abstractions.py to correctly handle None request_ids, fix an incorrect super().sample(**kwargs) call in SpecBench, and correct the __all__ list in __init__.py by removing duplicates and adding missing exports. Additionally, a comment suggested ensuring request_ids are always assigned in PrefixRepetitionRandomDataset for consistency.
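As a rough illustration of the pattern the module follows (an abstract base class producing a standardized request dataclass), here is a minimal, hypothetical sketch; the field names and signatures are illustrative, not the PR's exact API:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class SampleRequest:
    # Hypothetical field set; the real dataclass in the PR may differ.
    prompt: str
    prompt_len: int
    expected_output_len: int
    request_id: Optional[str] = None


class BenchmarkDataset(ABC):
    @abstractmethod
    def sample(self, num_requests: int, **kwargs) -> list[SampleRequest]:
        """Return num_requests standardized requests."""


class FixedPromptDataset(BenchmarkDataset):
    """Toy concrete dataset: every prompt is the same string."""

    def sample(self, num_requests: int, **kwargs) -> list[SampleRequest]:
        return [
            SampleRequest(
                prompt="hello world",
                prompt_len=2,
                expected_output_len=16,
                request_id=f"fixed-{i}",
            )
            for i in range(num_requests)
        ]


reqs = FixedPromptDataset().sample(3)
print([r.request_id for r in reqs])
```

The benefit of this design is that every dataset, synthetic or file-backed, yields the same SampleRequest shape, so the benchmark driver does not need per-dataset code paths.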
```python
ids = [req.request_id for req in requests]
if len(ids) != len(set(ids)):
    raise ValueError(
        "Duplicate request_id found in the sampled "
        "requests. Please ensure that each request_id "
        "is unique."
    )
```
The check for duplicate request_ids is too strict. If multiple requests have request_id=None (which is allowed by the SampleRequest type hint), this check will incorrectly raise a ValueError for duplicate IDs. The check should only enforce uniqueness among non-None request_ids.
Suggested change:

```diff
-ids = [req.request_id for req in requests]
+ids = [req.request_id for req in requests if req.request_id is not None]
 if len(ids) != len(set(ids)):
     raise ValueError(
         "Duplicate request_id found in the sampled "
         "requests. Please ensure that each request_id "
         "is unique."
     )
```
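The behavior the suggestion is after can be shown with a small self-contained sketch (the Req class here is a minimal stand-in for the PR's SampleRequest, not its actual definition):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Req:
    # Minimal stand-in, just enough to demonstrate the check.
    prompt: str
    request_id: Optional[str] = None


def check_unique_ids(requests: list[Req]) -> None:
    # Only enforce uniqueness among non-None ids: several requests
    # with request_id=None must not count as duplicates.
    ids = [req.request_id for req in requests if req.request_id is not None]
    if len(ids) != len(set(ids)):
        raise ValueError("Duplicate request_id found in the sampled requests.")


# Multiple None ids pass without error.
check_unique_ids([Req("a"), Req("b")])

# Genuine duplicates are still rejected.
try:
    check_unique_ids([Req("a", "r0"), Req("b", "r0")])
except ValueError as e:
    print("rejected:", e)
```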
```python
    **kwargs,
) -> list[SampleRequest]:
    # leverage CustomDataset sample
    return super().sample(**kwargs)
```
The call super().sample(**kwargs) is incorrect. It only passes the kwargs dictionary to the parent method, omitting all other named arguments like tokenizer, num_requests, etc. This will lead to a TypeError at runtime.
You should pass all arguments to the parent's sample method. A better long-term solution would be to remove this overridden sample method entirely and let SpecBench inherit it directly from CustomDataset.
Suggested change:

```diff
-    return super().sample(**kwargs)
+    return super().sample(
+        tokenizer=tokenizer,
+        num_requests=num_requests,
+        request_id_prefix=request_id_prefix,
+        no_oversample=no_oversample,
+        lora_path=lora_path,
+        max_loras=max_loras,
+        output_len=output_len,
+        enable_multimodal_chat=enable_multimodal_chat,
+        skip_chat_template=skip_chat_template,
+        **kwargs,
+    )
```
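The underlying Python pitfall can be demonstrated in isolation; the class and method names below are illustrative, not the PR's actual classes:

```python
class Base:
    def sample(self, tokenizer, num_requests, **kwargs):
        return f"tokenizer={tokenizer}, num_requests={num_requests}, extra={kwargs}"


class BrokenChild(Base):
    def sample(self, tokenizer, num_requests, **kwargs):
        # Bug: forwards only **kwargs. The named parameters tokenizer and
        # num_requests are captured by this method's signature, so they are
        # NOT inside kwargs, and Base.sample raises TypeError for the
        # missing positional arguments.
        return super().sample(**kwargs)


class FixedChild(Base):
    def sample(self, tokenizer, num_requests, **kwargs):
        # Fix: forward every named parameter explicitly alongside **kwargs.
        return super().sample(tokenizer=tokenizer, num_requests=num_requests, **kwargs)


try:
    BrokenChild().sample(tokenizer="tok", num_requests=3)
except TypeError as e:
    print("broken:", e)

print("fixed:", FixedChild().sample(tokenizer="tok", num_requests=3))
```

This is why dropping the override entirely, as the review suggests, is the safer long-term fix: the inherited method then receives all arguments directly.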
```python
__all__ = [
    "add_dataset_parser",
    "get_samples",
    "BenchmarkDataset",
    "SampleRequest",
    "CustomDataset",
    "SpecBench",
    "HuggingFaceDataset",
    "InstructCoderDataset",
    "MLPerfDataset",
    "MMStarDataset",
    "MMVUDataset",
    "MTBenchDataset",
    "MultiModalConversationDataset",
    "NextEditPredictionDataset",
    "VisionArenaDataset",
    "PrefixRepetitionRandomDataset",
    "RandomDataset",
    "RandomDatasetForReranking",
    "RandomMultiModalDataset",
    "ShareGPTDataset",
    "SonnetDataset",
    "BurstGPTDataset",
    "CustomDataset",
    "SpecBench",
    "HuggingFaceDataset",
    "InstructCoderDataset",
    "MLPerfDataset",
    "MMStarDataset",
    "MMVUDataset",
    "MTBenchDataset",
    "MultiModalConversationDataset",
    "NextEditPredictionDataset",
    "VisionArenaDataset",
    "TxtSlicesDataset",
]
```
The __all__ list has several issues:

- It contains duplicate entries for many dataset classes (e.g., CustomDataset, SpecBench, HuggingFaceDataset and its subclasses).
- It is missing some imported classes like AIMODataset, ASRDataset, BlazeditDataset, and ConversationDataset.

To improve correctness and maintainability, the list should be cleaned up to remove duplicates and include all public objects from this module. Sorting the list alphabetically is also a good practice.
Suggested change:

```python
__all__ = [
    "add_dataset_parser",
    "AIMODataset",
    "ASRDataset",
    "BenchmarkDataset",
    "BlazeditDataset",
    "BurstGPTDataset",
    "ConversationDataset",
    "CustomDataset",
    "get_samples",
    "HuggingFaceDataset",
    "InstructCoderDataset",
    "MLPerfDataset",
    "MMStarDataset",
    "MMVUDataset",
    "MTBenchDataset",
    "MultiModalConversationDataset",
    "NextEditPredictionDataset",
    "PrefixRepetitionRandomDataset",
    "RandomDataset",
    "RandomDatasetForReranking",
    "RandomMultiModalDataset",
    "SampleRequest",
    "ShareGPTDataset",
    "SonnetDataset",
    "SpecBench",
    "TxtSlicesDataset",
    "VisionArenaDataset",
]
```

```python
requests.append(
    SampleRequest(
        prompt=prompt,
        prompt_len=prompt_len,
        expected_output_len=output_len,
    )
)
```
The SampleRequest objects are created without a request_id. While this may not cause an immediate bug if this dataset isn't used with features that require unique IDs (like oversampling), it's inconsistent with other dataset implementations and good practice. The request_id_prefix argument is available but unused. Assigning a unique request_id to each request is important for tracking and debugging.
Suggested change:

```diff
 requests.append(
     SampleRequest(
         prompt=prompt,
         prompt_len=prompt_len,
         expected_output_len=output_len,
+        request_id=request_id_prefix + str(len(requests)),
     )
 )
```
This is too large.
Purpose
The goal of this dataset is to provide a way to generate a benchmarking dataset with sequences that are more realistic than the standard random option. This is useful when benchmarking expert parallel and speculative decoding based deployments.
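A very rough illustration of the "txt slices" idea, as I understand it from the PR title: instead of sampling uniformly random token ids, take random contiguous slices of a real text so prompts have natural-language statistics. This sketch is hypothetical and character-based for simplicity; the actual TxtSlicesDataset operates on tokenized input and has a different interface:

```python
import random


def sample_text_slices(text: str, num_requests: int, slice_len_chars: int,
                       seed: int = 0) -> list[str]:
    # Draw random contiguous character slices from the source text.
    # Each slice is a "realistic" prompt compared to random token ids.
    rng = random.Random(seed)
    max_start = max(len(text) - slice_len_chars, 0)
    return [
        text[start : start + slice_len_chars]
        for start in (rng.randint(0, max_start) for _ in range(num_requests))
    ]


corpus = "The quick brown fox jumps over the lazy dog. " * 50
prompts = sample_text_slices(corpus, num_requests=4, slice_len_chars=60)
print(prompts[0])
```

Realistic token distributions matter for the deployments mentioned above: expert routing and draft-model acceptance rates both depend on input statistics that uniformly random tokens do not reproduce.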
I took the opportunity to refactor the datasets, as this single file was growing huge and becoming hard to maintain. I also fixed some type-hinting issues.
TODO: add some more unit tests.
Note that I would like to merge this after #30009.