Skip to content

Support multi-modal datasets#495

Merged
shanjiaz merged 99 commits into
vllm-project:mainfrom
DarkLight1337:mm-dataset
May 26, 2026
Merged

Support multi-modal datasets#495
shanjiaz merged 99 commits into
vllm-project:mainfrom
DarkLight1337:mm-dataset

Conversation

@DarkLight1337
Copy link
Copy Markdown
Member

@DarkLight1337 DarkLight1337 commented Apr 30, 2026

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.

Purpose

FIX #290

By forwarding the conversations directly into Chat Completions API, and also observing that AutoProcessor returns tokenizer instance for text-only models, we can simplify the code a lot compared to #344.

cc @shx2005

Description

  • Enable preprocessing of multimodal datasets. No user flag is required; we detect this automatically based on isinstance(..., ProcessorMixin).
  • Add --trust-remote-code flag to data preparation and training scripts.
  • Add hf_name and filter_fn options to DatasetConfig.
  • Add support for ShareGPT4V dataset.
    • Only samples from COCO are supported right now.
    • You need to download the images separately since it's not handled by HF Datasets. Use the COCO_DIR environment variable to control where COCO images are read from.
    • To avoid copying the images when storing the preprocessed dataset and HTTP transfer to vLLM, we express image inputs in terms of file URLs instead of base64-encoded images. For vLLM to access those files, you should pass --allowed-media-domain-paths /path/to/coco when serving it.
  • Disable caching for dataset normalization as it makes debugging normalize_fn more difficult. To reduce overhead, this step is now executed after raw_dataset.select.
  • Use --enforce-eager for vLLM in e2e tests by default to reduce startup time.
  • Add torchaudio and torchvision to dependencies.
  • Fix whitespace issues in the help text of the scripts.

Related Issue

#290

Tests

Add integration and e2e tests to ensure MM support. Note:

  • e2e smoke tests use a dummy COCO image so they can be run in CI.
  • e2e regression tests are only run if the real COCO images are downloaded.

I have filled in:

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan/results, such as providing test command and pasting the results.
  • (Optional) The necessary documentation update.
  • I (a human) have written or reviewed the code in this pr to the best of my ability.

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 30, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8de7675f-2ba2-4ba6-9c77-4146c89f99dd

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Walkthrough

This PR extends the data generation and training pipeline to support multimodal models (text+image) by introducing processor-based preprocessing, dataset normalization functions, COCO image filtering, vLLM Chat Completions API support for multimodal inputs, and corresponding CLI flags for remote code execution.

Changes

Cohort / File(s) Summary
Documentation
docs/cli/prepare_data.md, docs/cli/train.md
Added --trust-remote-code CLI flag documentation for both prepare_data and train scripts, allowing remote code execution from HuggingFace Hub during processor/tokenizer loading.
Configuration & Linting
pyproject.toml
Enabled PLR0915 linting rule for scripts to suppress "too many statements" warnings on the updated parse_args functions.
CLI Scripts
scripts/prepare_data.py, scripts/train.py, scripts/data_generation_offline.py
Added --trust-remote-code CLI flags and wired them through the pipeline; added optional _vllm_messages field extraction and forwarding to vLLM generation.
Data Generation Configuration
src/speculators/data_generation/configs.py
Made DatasetConfig keyword-only, added optional hf_name and filter_fn fields, introduced COCO dataset configuration with image filtering and conversation normalization for ShareGPT4V data.
Data Generation Preprocessing
src/speculators/data_generation/preprocessing.py
Major refactoring: switched from tokenizer-only to processor-based (ProcessorLike interface), added dataset normalization functions, implemented vLLM message conversion for multimodal inputs, added trust_remote_code parameter, and refactored loss-mask generation to use HF assistant token masks.
vLLM Client Integration
src/speculators/data_generation/vllm_client.py
Extended generate_hidden_states and generate_hidden_states_async with optional messages parameter to support Chat Completions API for multimodal inputs alongside existing Completions API path.
Training Integration
src/speculators/train/data.py, src/speculators/train/utils.py
Updated training data pipeline to forward optional _vllm_messages to hidden states generation; added trust_remote_code parameter to tokenizer loading in resolve_mask_token_id.
Test Updates
tests/integration/datagen/test_preprocessing.py, tests/integration/datagen/test_regex_patterns.py
Replaced tokenizer-based tests with processor-based tests, added unit tests for vLLM message conversion (_hf_to_vllm_conv), added multimodal integration test with image processing, and strengthened preprocessing assertions.

Sequence Diagram(s)

sequenceDiagram
    participant User as User/CLI
    participant Prepare as prepare_data.py
    participant Config as configs.DatasetConfig
    participant Preprocess as preprocessing.py
    participant Processor as AutoProcessor
    participant vLLM as vLLM Client
    
    User->>Prepare: --trust-remote-code flag
    Prepare->>Config: Load dataset with config
    Config->>Preprocess: load_and_preprocess_dataset(trust_remote_code=True)
    Preprocess->>Processor: Load processor with trust_remote_code
    Processor-->>Preprocess: Processor instance
    Preprocess->>Preprocess: Normalize dataset & extract _vllm_messages
    Preprocess->>Preprocess: apply_chat_template for text+images
    Preprocess-->>Prepare: (HFDataset, Processor)
    Prepare->>Prepare: save preprocessed data
Loading
sequenceDiagram
    participant Training as train.py
    participant DataPipeline as train/data.py
    participant HiddenStates as vllm_client.py
    participant vLLM as vLLM Server
    
    Training->>Training: parse_args(--trust-remote-code)
    Training->>DataPipeline: Build dataset with processor
    DataPipeline->>DataPipeline: Extract item._vllm_messages
    DataPipeline->>HiddenStates: generate_hidden_states(messages=_vllm_messages)
    alt Messages provided (multimodal)
        HiddenStates->>vLLM: Chat Completions API with messages
    else No messages (text-only)
        HiddenStates->>vLLM: Completions API with token_ids
    end
    vLLM-->>HiddenStates: Hidden states & token_ids
    HiddenStates-->>DataPipeline: Extracted output
    DataPipeline-->>Training: Training batch with embeddings
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

  • PR #433: Modifies scripts/data_generation_offline.py for vLLM message forwarding and shares upstream data generation pipeline changes.
  • PR #354: Modifies scripts/train.py CLI argument parsing and training script initialization logic.

Suggested labels

enhancement, data-generation, training, two-reviews

Suggested reviewers

  • shanjiaz
  • rahul-tuli
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 62.71% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Support multi-modal datasets' directly describes the main objective of the PR, which adds multimodal dataset preprocessing capabilities.
Linked Issues check ✅ Passed The PR addresses the preprocessing and dataset-processing components of issue #290, implementing AutoProcessor support, multimodal conversation handling, and ShareGPT4V COCO dataset support as required.
Out of Scope Changes check ✅ Passed All changes are scoped to multimodal dataset preprocessing, dataset configuration, and related infrastructure (--trust-remote-code flags, vLLM chat API integration). No unrelated modifications detected.
Description check ✅ Passed The PR description clearly relates to the changeset, covering multimodal dataset support, new CLI flags, dataset configuration options, and ShareGPT4V dataset integration.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
@mergify
Copy link
Copy Markdown

mergify Bot commented Apr 30, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/speculators/blob/main/CONTRIBUTING.md

@mergify
Copy link
Copy Markdown

mergify Bot commented May 1, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/speculators/blob/main/CONTRIBUTING.md

@DarkLight1337 DarkLight1337 changed the title Support multi-modal datasets (preprocessing part) Support multi-modal datasets May 1, 2026
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
@mergify mergify Bot removed the quality-failed label May 1, 2026
@mergify
Copy link
Copy Markdown

mergify Bot commented May 1, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/speculators/blob/main/CONTRIBUTING.md

@shx2005
Copy link
Copy Markdown

shx2005 commented May 1, 2026

@DarkLight1337 Thanks a lot for working on this and for cc'ing me!

I haven't had a chance to look through this PR carefully yet, but I'll review it in more detail shortly. I also just organized a related implementation in #497, where I have been working on the full multimodal E2E training flow and am currently wrapping up the final testing/validation.

There may be some overlap between the two PRs. I'll take a closer look at the concrete implementations in both PRs and see how we can best align them, combine the useful parts, or avoid duplicated work.

Thanks again for the help and for pushing this forward!

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
@mergify mergify Bot removed the quality-failed label May 1, 2026
@mergify
Copy link
Copy Markdown

mergify Bot commented May 1, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/speculators/blob/main/CONTRIBUTING.md

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
@mergify mergify Bot removed the quality-failed label May 1, 2026
@mergify
Copy link
Copy Markdown

mergify Bot commented May 1, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/speculators/blob/main/CONTRIBUTING.md

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
@mergify mergify Bot removed the quality-failed label May 1, 2026
@mergify
Copy link
Copy Markdown

mergify Bot commented May 1, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/speculators/blob/main/CONTRIBUTING.md

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
@mergify mergify Bot removed the quality-failed label May 22, 2026
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
@mergify
Copy link
Copy Markdown

mergify Bot commented May 22, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/speculators/blob/main/CONTRIBUTING.md

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
@mergify
Copy link
Copy Markdown

mergify Bot commented May 22, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/speculators/blob/main/CONTRIBUTING.md

@mergify
Copy link
Copy Markdown

mergify Bot commented May 22, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/speculators/blob/main/CONTRIBUTING.md

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
@mergify mergify Bot removed the quality-failed label May 22, 2026
@mergify
Copy link
Copy Markdown

mergify Bot commented May 22, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/speculators/blob/main/CONTRIBUTING.md

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
@mergify mergify Bot removed the quality-failed label May 22, 2026
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
@mergify
Copy link
Copy Markdown

mergify Bot commented May 22, 2026

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/speculators/blob/main/CONTRIBUTING.md

@DarkLight1337
Copy link
Copy Markdown
Member Author

I have addressed your comments, please take another look!

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Copy link
Copy Markdown
Collaborator

@shanjiaz shanjiaz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for updating! Looks good to me.

@DarkLight1337
Copy link
Copy Markdown
Member Author

@fynnsu could you take another look as well?

Copy link
Copy Markdown
Collaborator

@fynnsu fynnsu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

data-generation documentation Improvements or additions to documentation enhancement New feature or request training two-reviews

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[RFC]: VL model (Qwen3-VL) image-text pair input support

4 participants