Support multi-modal datasets by DarkLight1337 · Pull Request #495 · vllm-project/speculators

DarkLight1337 · 2026-04-30T17:55:24Z

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.

Purpose

By forwarding the conversations directly into Chat Completions API, and also observing that AutoProcessor returns tokenizer instance for text-only models, we can simplify the code a lot compared to #344.

cc @shx2005

Description

Enable preprocessing of multimodal datasets. No user flag is required; we detect this automatically based on isinstance(..., ProcessorMixin).
Add --trust-remote-code flag to data preparation and training scripts.
Add hf_name and filter_fn options to DatasetConfig.
Add support for ShareGPT4V dataset.
- Only samples from COCO are supported right now.
- You need to download the images separately since it's not handled by HF Datasets. Use the COCO_DIR environment variable to control where COCO images are read from.
- To avoid copying the images when storing the preprocessed dataset and HTTP transfer to vLLM, we express image inputs in terms of file URLs instead of base64-encoded images. For vLLM to access those files, you should pass --allowed-media-domain-paths /path/to/coco when serving it.
Disable caching for dataset normalization as it makes debugging normalize_fn more difficult. To reduce overhead, this step is now executed after raw_dataset.select.
Use --enforce-eager for vLLM in e2e tests by default to reduce startup time.
Add torchaudio and torchvision to dependencies.
Fix whitespace issues in the help text of the scripts.

Related Issue

#290

Tests

Add integration and e2e tests to ensure MM support. Note:

e2e smoke tests use a dummy COCO image so they can be run in CI.
e2e regression tests are only run if the real COCO images are downloaded.

I have filled in:

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan/results, such as providing test command and pasting the results.
(Optional) The necessary documentation update.
I (a human) have written or reviewed the code in this pr to the best of my ability.

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

coderabbitai · 2026-04-30T17:56:41Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8de7675f-2ba2-4ba6-9c77-4146c89f99dd

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

Walkthrough

This PR extends the data generation and training pipeline to support multimodal models (text+image) by introducing processor-based preprocessing, dataset normalization functions, COCO image filtering, vLLM Chat Completions API support for multimodal inputs, and corresponding CLI flags for remote code execution.

Changes

Cohort / File(s)	Summary
Documentation `docs/cli/prepare_data.md`, `docs/cli/train.md`	Added `--trust-remote-code` CLI flag documentation for both prepare_data and train scripts, allowing remote code execution from HuggingFace Hub during processor/tokenizer loading.
Configuration & Linting `pyproject.toml`	Enabled PLR0915 linting rule for scripts to suppress "too many statements" warnings on the updated `parse_args` functions.
CLI Scripts `scripts/prepare_data.py`, `scripts/train.py`, `scripts/data_generation_offline.py`	Added `--trust-remote-code` CLI flags and wired them through the pipeline; added optional `_vllm_messages` field extraction and forwarding to vLLM generation.
Data Generation Configuration `src/speculators/data_generation/configs.py`	Made `DatasetConfig` keyword-only, added optional `hf_name` and `filter_fn` fields, introduced COCO dataset configuration with image filtering and conversation normalization for ShareGPT4V data.
Data Generation Preprocessing `src/speculators/data_generation/preprocessing.py`	Major refactoring: switched from tokenizer-only to processor-based (`ProcessorLike` interface), added dataset normalization functions, implemented vLLM message conversion for multimodal inputs, added `trust_remote_code` parameter, and refactored loss-mask generation to use HF assistant token masks.
vLLM Client Integration `src/speculators/data_generation/vllm_client.py`	Extended `generate_hidden_states` and `generate_hidden_states_async` with optional `messages` parameter to support Chat Completions API for multimodal inputs alongside existing Completions API path.
Training Integration `src/speculators/train/data.py`, `src/speculators/train/utils.py`	Updated training data pipeline to forward optional `_vllm_messages` to hidden states generation; added `trust_remote_code` parameter to tokenizer loading in `resolve_mask_token_id`.
Test Updates `tests/integration/datagen/test_preprocessing.py`, `tests/integration/datagen/test_regex_patterns.py`	Replaced tokenizer-based tests with processor-based tests, added unit tests for vLLM message conversion (`_hf_to_vllm_conv`), added multimodal integration test with image processing, and strengthened preprocessing assertions.

Sequence Diagram(s)

sequenceDiagram
    participant User as User/CLI
    participant Prepare as prepare_data.py
    participant Config as configs.DatasetConfig
    participant Preprocess as preprocessing.py
    participant Processor as AutoProcessor
    participant vLLM as vLLM Client
    
    User->>Prepare: --trust-remote-code flag
    Prepare->>Config: Load dataset with config
    Config->>Preprocess: load_and_preprocess_dataset(trust_remote_code=True)
    Preprocess->>Processor: Load processor with trust_remote_code
    Processor-->>Preprocess: Processor instance
    Preprocess->>Preprocess: Normalize dataset & extract _vllm_messages
    Preprocess->>Preprocess: apply_chat_template for text+images
    Preprocess-->>Prepare: (HFDataset, Processor)
    Prepare->>Prepare: save preprocessed data

sequenceDiagram
    participant Training as train.py
    participant DataPipeline as train/data.py
    participant HiddenStates as vllm_client.py
    participant vLLM as vLLM Server
    
    Training->>Training: parse_args(--trust-remote-code)
    Training->>DataPipeline: Build dataset with processor
    DataPipeline->>DataPipeline: Extract item._vllm_messages
    DataPipeline->>HiddenStates: generate_hidden_states(messages=_vllm_messages)
    alt Messages provided (multimodal)
        HiddenStates->>vLLM: Chat Completions API with messages
    else No messages (text-only)
        HiddenStates->>vLLM: Completions API with token_ids
    end
    vLLM-->>HiddenStates: Hidden states & token_ids
    HiddenStates-->>DataPipeline: Extracted output
    DataPipeline-->>Training: Training batch with embeddings

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

PR #433: Modifies scripts/data_generation_offline.py for vLLM message forwarding and shares upstream data generation pipeline changes.
PR #354: Modifies scripts/train.py CLI argument parsing and training script initialization logic.

Suggested labels

enhancement, data-generation, training, two-reviews

Suggested reviewers

shanjiaz
rahul-tuli

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 62.71% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Support multi-modal datasets' directly describes the main objective of the PR, which adds multimodal dataset preprocessing capabilities.
Linked Issues check	✅ Passed	The PR addresses the preprocessing and dataset-processing components of issue `#290`, implementing AutoProcessor support, multimodal conversation handling, and ShareGPT4V COCO dataset support as required.
Out of Scope Changes check	✅ Passed	All changes are scoped to multimodal dataset preprocessing, dataset configuration, and related infrastructure (--trust-remote-code flags, vLLM chat API integration). No unrelated modifications detected.
Description check	✅ Passed	The PR description clearly relates to the changeset, covering multimodal dataset support, new CLI flags, dataset configuration options, and ShareGPT4V dataset integration.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

mergify · 2026-04-30T18:01:33Z

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/speculators/blob/main/CONTRIBUTING.md

mergify · 2026-05-01T02:16:54Z

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/speculators/blob/main/CONTRIBUTING.md

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

mergify · 2026-05-01T06:45:38Z

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/speculators/blob/main/CONTRIBUTING.md

shx2005 · 2026-05-01T06:58:53Z

@DarkLight1337 Thanks a lot for working on this and for cc'ing me!

I haven't had a chance to look through this PR carefully yet, but I'll review it in more detail shortly. I also just organized a related implementation in #497, where I have been working on the full multimodal E2E training flow and am currently wrapping up the final testing/validation.

There may be some overlap between the two PRs. I'll take a closer look at the concrete implementations in both PRs and see how we can best align them, combine the useful parts, or avoid duplicated work.

Thanks again for the help and for pushing this forward!

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

mergify · 2026-05-01T08:50:20Z

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/speculators/blob/main/CONTRIBUTING.md

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

mergify · 2026-05-01T08:52:12Z

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/speculators/blob/main/CONTRIBUTING.md

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

mergify · 2026-05-01T09:51:07Z

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/speculators/blob/main/CONTRIBUTING.md

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

mergify · 2026-05-22T03:57:18Z

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/speculators/blob/main/CONTRIBUTING.md

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

mergify · 2026-05-22T03:58:30Z

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/speculators/blob/main/CONTRIBUTING.md

mergify · 2026-05-22T03:59:30Z

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/speculators/blob/main/CONTRIBUTING.md

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

mergify · 2026-05-22T04:01:06Z

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/speculators/blob/main/CONTRIBUTING.md

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

mergify · 2026-05-22T04:02:41Z

The quality checks have failed. Please run make style and make quality under
the root directory to address the lint failures. You will need to install the
dev optional install to get the required linting packages:
https://github.com/vllm-project/speculators/blob/main/CONTRIBUTING.md

DarkLight1337 · 2026-05-22T04:02:53Z

I have addressed your comments, please take another look!

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

shanjiaz

Thanks for updating! Looks good to me.

DarkLight1337 · 2026-05-26T14:45:10Z

@fynnsu could you take another look as well?

fynnsu

Thank you!

DarkLight1337 added 2 commits April 30, 2026 17:52

Support multi-modal datasets (preprocessing part)

f07f575

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

Doc

76a7a2d

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

Doc

b8f3f4a

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

mergify Bot added the quality-failed label Apr 30, 2026

DarkLight1337 force-pushed the mm-dataset branch from 8cfc213 to b8f3f4a Compare May 1, 2026 02:15

mergify Bot removed the quality-failed label May 1, 2026

mergify Bot added the quality-failed label May 1, 2026

DarkLight1337 changed the title ~~Support multi-modal datasets (preprocessing part)~~ Support multi-modal datasets May 1, 2026

Iterate

8688e1b

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

mergify Bot removed the quality-failed label May 1, 2026

mergify Bot added the quality-failed label May 1, 2026

DarkLight1337 added 2 commits May 1, 2026 08:34

Fix

ddabe1c

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

Simplify

c9f845e

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

mergify Bot removed the quality-failed label May 1, 2026

mergify Bot added the quality-failed label May 1, 2026

Clean

7b44b1a

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

mergify Bot removed the quality-failed label May 1, 2026

mergify Bot added the quality-failed label May 1, 2026

Use ShareGPT4V to avoid outdated version of datasets

ce2a072

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

mergify Bot removed the quality-failed label May 1, 2026

mergify Bot added the quality-failed label May 1, 2026

Improve UX

a81744a

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

mergify Bot added the quality-failed label May 22, 2026

DarkLight1337 added 2 commits May 22, 2026 03:55

Use graph by default for server

7f0c317

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

Change default

be339a8

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

mergify Bot removed the quality-failed label May 22, 2026

Fix llm args

b6336ed

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

mergify Bot added quality-failed and removed quality-failed labels May 22, 2026

Be more robust

1d31818

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

mergify Bot added quality-failed and removed quality-failed labels May 22, 2026

mergify Bot added the quality-failed label May 22, 2026

Fix whitespace

5362c7a

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

mergify Bot removed the quality-failed label May 22, 2026

mergify Bot added the quality-failed label May 22, 2026

Doc

e509c6d

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

mergify Bot removed the quality-failed label May 22, 2026

Format

2d39e73

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

mergify Bot added the quality-failed label May 22, 2026

mypy

ee95bd6

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

shanjiaz approved these changes May 24, 2026

View reviewed changes

fynnsu approved these changes May 26, 2026

View reviewed changes

Merge branch 'main' into mm-dataset

f86b93e

coderabbitai Bot mentioned this pull request Jun 3, 2026

[Features] Qwen3.6 PartialRoPE supports #568

Open

4 tasks

Conversation

DarkLight1337 commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Description

Related Issue

Tests

Uh oh!

coderabbitai Bot commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

mergify Bot commented Apr 30, 2026

Uh oh!

mergify Bot commented May 1, 2026

Uh oh!

mergify Bot commented May 1, 2026

Uh oh!

shx2005 commented May 1, 2026

Uh oh!

mergify Bot commented May 1, 2026

Uh oh!

mergify Bot commented May 1, 2026

Uh oh!

mergify Bot commented May 1, 2026

Uh oh!

mergify Bot commented May 22, 2026

Uh oh!

mergify Bot commented May 22, 2026

Uh oh!

mergify Bot commented May 22, 2026

Uh oh!

mergify Bot commented May 22, 2026

Uh oh!

mergify Bot commented May 22, 2026

Uh oh!

DarkLight1337 commented May 22, 2026

Uh oh!

shanjiaz left a comment

Choose a reason for hiding this comment

Uh oh!

DarkLight1337 commented May 26, 2026

Uh oh!

fynnsu left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

DarkLight1337 commented Apr 30, 2026 •

edited

Loading

coderabbitai Bot commented Apr 30, 2026 •

edited

Loading