
refactoring of the benchmark dataset capabilities and added a random …#30150

Closed
hypdeb wants to merge 1 commit into vllm-project:main from hypdeb:datasets_refactor

Conversation


@hypdeb hypdeb commented Dec 5, 2025

Purpose

  • Added a new dataset option which:
    • Downloads/reads a txt file
    • Tokenizes it
    • Generates samples by randomly slicing the resulting token sequence

The goal of this dataset is to provide a way to generate a benchmarking dataset with sequences that are more realistic than the standard random option. This is useful when benchmarking expert parallel and speculative decoding based deployments.
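The slicing idea described above can be sketched roughly as follows. This is a minimal illustration of the sampling step only, not the PR's actual `TxtSlicesDataset` implementation; the function name and parameters here are assumptions for the sketch, and tokenization of the source text is presumed to have already happened:

```python
import random


def sample_txt_slices(token_ids, num_requests, input_len, seed=0):
    """Generate benchmark prompts by randomly slicing a token sequence.

    token_ids: the full tokenized corpus (a list of token ids).
    Returns num_requests contiguous slices of length input_len.
    """
    rng = random.Random(seed)
    max_start = len(token_ids) - input_len
    if max_start < 0:
        raise ValueError("Corpus is shorter than the requested input length")
    samples = []
    for _ in range(num_requests):
        start = rng.randint(0, max_start)
        samples.append(token_ids[start:start + input_len])
    return samples
```

Because each slice is a contiguous run of real text tokens, the resulting prompts preserve natural token statistics, which is what makes them more realistic than uniformly random token ids for expert-parallel and speculative-decoding benchmarks.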

I took the opportunity to refactor the datasets, as this single file was growing huge and becoming hard to maintain. I also fixed type hinting issues.

TODO: add some more unit tests.

Note that I would like to merge this after #30009.

…txt slices dataset option

Signed-off-by: Julien Debache <jdebache@cpu-0005.cm.cluster>

@mergify mergify bot added the performance Performance-related issues label Dec 5, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new vllm.benchmarks.datasets module, centralizing benchmark dataset handling. It defines a BenchmarkDataset abstract base class and a SampleRequest dataclass for standardized request representation.

The module includes implementations for various datasets: CustomDataset (for JSONL prompts), SpecBench (for Spec-Bench JSONL), BurstGPTDataset (from CSV), ShareGPTDataset (from JSON), SonnetDataset (from text files), PrefixRepetitionRandomDataset (synthetic with repeated prefixes), TxtSlicesDataset (from text files/URLs), and several HuggingFaceDataset subclasses (ConversationDataset, MultiModalConversationDataset, VisionArenaDataset, MMVUDataset, InstructCoderDataset, MTBenchDataset, BlazeditDataset, AIMODataset, NextEditPredictionDataset, ASRDataset, MLPerfDataset, MMStarDataset). A RandomDataset and its multimodal (RandomMultiModalDataset) and reranking (RandomDatasetForReranking) variants are also added for synthetic data generation. The arg_utils.py file provides a comprehensive argument parser for configuring these datasets.

Review comments highlighted the need to refine the maybe_oversample_requests logic in abstractions.py to correctly handle None request_ids, fix an incorrect super().sample(**kwargs) call in SpecBench, and correct the __all__ list in __init__.py by removing duplicates and adding missing exports. Additionally, a comment suggested ensuring request_ids are always assigned in PrefixRepetitionRandomDataset for consistency.

Comment on lines +135 to +141
```python
ids = [req.request_id for req in requests]
if len(ids) != len(set(ids)):
    raise ValueError(
        "Duplicate request_id found in the sampled "
        "requests. Please ensure that each request_id "
        "is unique."
    )
```


critical

The check for duplicate request_ids is too strict. If multiple requests have request_id=None (which is allowed by the SampleRequest type hint), this check will incorrectly raise a ValueError for duplicate IDs. The check should only enforce uniqueness among non-None request_ids.

Suggested change

```diff
-ids = [req.request_id for req in requests]
+ids = [req.request_id for req in requests if req.request_id is not None]
 if len(ids) != len(set(ids)):
     raise ValueError(
         "Duplicate request_id found in the sampled "
         "requests. Please ensure that each request_id "
         "is unique."
     )
```

```python
    **kwargs,
) -> list[SampleRequest]:
    # leverage CustomDataset sample
    return super().sample(**kwargs)
```


critical

The call super().sample(**kwargs) is incorrect. It only passes the kwargs dictionary to the parent method, omitting all other named arguments like tokenizer, num_requests, etc. This will lead to a TypeError at runtime.

You should pass all arguments to the parent's sample method. A better long-term solution would be to remove this overridden sample method entirely and let SpecBench inherit it directly from CustomDataset.

Suggested change

```diff
-return super().sample(**kwargs)
+return super().sample(
+    tokenizer=tokenizer,
+    num_requests=num_requests,
+    request_id_prefix=request_id_prefix,
+    no_oversample=no_oversample,
+    lora_path=lora_path,
+    max_loras=max_loras,
+    output_len=output_len,
+    enable_multimodal_chat=enable_multimodal_chat,
+    skip_chat_template=skip_chat_template,
+    **kwargs,
+)
```
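To see why forwarding only `**kwargs` drops the explicitly named parameters, consider this minimal sketch. The `Base`/`Child` classes here are hypothetical stand-ins, not the PR's code: arguments bound to named parameters never appear in `kwargs`, so the `super()` call receives nothing.

```python
class Base:
    def sample(self, tokenizer, num_requests, **kwargs):
        return f"{tokenizer}:{num_requests}"


class Child(Base):
    def sample(self, tokenizer, num_requests, **kwargs):
        # Bug: tokenizer and num_requests are bound to named parameters,
        # so they are NOT inside kwargs; this call passes neither of them.
        try:
            return super().sample(**kwargs)
        except TypeError as exc:
            return f"TypeError: {exc}"
```

Calling `Child().sample("tok", 3)` hits the `TypeError` branch, which is exactly the runtime failure the review comment predicts for the SpecBench override.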

Comment on lines +333 to +368
```python
__all__ = [
    "add_dataset_parser",
    "get_samples",
    "BenchmarkDataset",
    "SampleRequest",
    "CustomDataset",
    "SpecBench",
    "HuggingFaceDataset",
    "InstructCoderDataset",
    "MLPerfDataset",
    "MMStarDataset",
    "MMVUDataset",
    "MTBenchDataset",
    "MultiModalConversationDataset",
    "NextEditPredictionDataset",
    "VisionArenaDataset",
    "PrefixRepetitionRandomDataset",
    "RandomDataset",
    "RandomDatasetForReranking",
    "RandomMultiModalDataset",
    "ShareGPTDataset",
    "SonnetDataset",
    "BurstGPTDataset",
    "CustomDataset",
    "SpecBench",
    "HuggingFaceDataset",
    "InstructCoderDataset",
    "MLPerfDataset",
    "MMStarDataset",
    "MMVUDataset",
    "MTBenchDataset",
    "MultiModalConversationDataset",
    "NextEditPredictionDataset",
    "VisionArenaDataset",
    "TxtSlicesDataset",
]
```


high

The __all__ list has several issues:

  1. It contains duplicate entries for many dataset classes (e.g., CustomDataset, SpecBench, HuggingFaceDataset and its subclasses).
  2. It is missing some imported classes like AIMODataset, ASRDataset, BlazeditDataset, and ConversationDataset.

To improve correctness and maintainability, the list should be cleaned up to remove duplicates and include all public objects from this module. Sorting the list alphabetically is also a good practice.

```python
__all__ = [
    "add_dataset_parser",
    "AIMODataset",
    "ASRDataset",
    "BenchmarkDataset",
    "BlazeditDataset",
    "BurstGPTDataset",
    "ConversationDataset",
    "CustomDataset",
    "get_samples",
    "HuggingFaceDataset",
    "InstructCoderDataset",
    "MLPerfDataset",
    "MMStarDataset",
    "MMVUDataset",
    "MTBenchDataset",
    "MultiModalConversationDataset",
    "NextEditPredictionDataset",
    "PrefixRepetitionRandomDataset",
    "RandomDataset",
    "RandomDatasetForReranking",
    "RandomMultiModalDataset",
    "SampleRequest",
    "ShareGPTDataset",
    "SonnetDataset",
    "SpecBench",
    "TxtSlicesDataset",
    "VisionArenaDataset",
]
```
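A small guard along these lines could catch duplicate or missing `__all__` entries mechanically in a unit test. The helper below is a hypothetical sketch mirroring the reviewer's two checks, not part of the PR:

```python
def check_all_list(all_list, expected_public_names):
    """Return (duplicates, missing) for a module's __all__ list.

    duplicates: names listed more than once in __all__.
    missing: expected public names absent from __all__.
    """
    duplicates = {name for name in all_list if all_list.count(name) > 1}
    missing = set(expected_public_names) - set(all_list)
    return duplicates, missing
```

In a test, `expected_public_names` could be derived from the module's imports, so that a freshly added dataset class missing from `__all__` fails CI immediately.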

Comment on lines +79 to +85
```python
requests.append(
    SampleRequest(
        prompt=prompt,
        prompt_len=prompt_len,
        expected_output_len=output_len,
    )
)
```


high

The SampleRequest objects are created without a request_id. While this may not cause an immediate bug if this dataset isn't used with features that require unique IDs (like oversampling), it's inconsistent with other dataset implementations and good practice. The request_id_prefix argument is available but unused. Assigning a unique request_id to each request is important for tracking and debugging.

Suggested change

```diff
 requests.append(
     SampleRequest(
         prompt=prompt,
         prompt_len=prompt_len,
         expected_output_len=output_len,
+        request_id=request_id_prefix + str(len(requests)),
     )
 )
```


hypdeb commented Dec 5, 2025

This is too large.

@hypdeb hypdeb closed this Dec 5, 2025

Labels

performance Performance-related issues
