Support multi-modal datasets#495
Conversation
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
WalkthroughThis PR extends the data generation and training pipeline to support multimodal models (text+image) by introducing processor-based preprocessing, dataset normalization functions, COCO image filtering, vLLM Chat Completions API support for multimodal inputs, and corresponding CLI flags for remote code execution. Changes
Sequence Diagram(s)sequenceDiagram
participant User as User/CLI
participant Prepare as prepare_data.py
participant Config as configs.DatasetConfig
participant Preprocess as preprocessing.py
participant Processor as AutoProcessor
participant vLLM as vLLM Client
User->>Prepare: --trust-remote-code flag
Prepare->>Config: Load dataset with config
Config->>Preprocess: load_and_preprocess_dataset(trust_remote_code=True)
Preprocess->>Processor: Load processor with trust_remote_code
Processor-->>Preprocess: Processor instance
Preprocess->>Preprocess: Normalize dataset & extract _vllm_messages
Preprocess->>Preprocess: apply_chat_template for text+images
Preprocess-->>Prepare: (HFDataset, Processor)
Prepare->>Prepare: save preprocessed data
sequenceDiagram
participant Training as train.py
participant DataPipeline as train/data.py
participant HiddenStates as vllm_client.py
participant vLLM as vLLM Server
Training->>Training: parse_args(--trust-remote-code)
Training->>DataPipeline: Build dataset with processor
DataPipeline->>DataPipeline: Extract item._vllm_messages
DataPipeline->>HiddenStates: generate_hidden_states(messages=_vllm_messages)
alt Messages provided (multimodal)
HiddenStates->>vLLM: Chat Completions API with messages
else No messages (text-only)
HiddenStates->>vLLM: Completions API with token_ids
end
vLLM-->>HiddenStates: Hidden states & token_ids
HiddenStates-->>DataPipeline: Extracted output
DataPipeline-->>Training: Training batch with embeddings
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
The quality checks have failed. Please run |
|
The quality checks have failed. Please run |
|
The quality checks have failed. Please run |
|
@DarkLight1337 Thanks a lot for working on this and for cc'ing me! I haven't had a chance to look through this PR carefully yet, but I'll review it in more detail shortly. I also just organized a related implementation in #497, where I have been working on the full multimodal E2E training flow and am currently wrapping up the final testing/validation. There may be some overlap between the two PRs. I'll take a closer look at the concrete implementations in both PRs and see how we can best align them, combine the useful parts, or avoid duplicated work. Thanks again for the help and for pushing this forward! |
|
The quality checks have failed. Please run |
|
The quality checks have failed. Please run |
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
The quality checks have failed. Please run |
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
The quality checks have failed. Please run |
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
The quality checks have failed. Please run |
|
The quality checks have failed. Please run |
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
|
The quality checks have failed. Please run |
|
The quality checks have failed. Please run |
|
I have addressed your comments, please take another look! |
shanjiaz
left a comment
There was a problem hiding this comment.
Thanks for updating! Looks good to me.
|
@fynnsu could you take another look as well? |
PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.
Purpose
FIX #290
By forwarding the conversations directly into Chat Completions API, and also observing that
AutoProcessorreturns tokenizer instance for text-only models, we can simplify the code a lot compared to #344.cc @shx2005
Description
isinstance(..., ProcessorMixin).--trust-remote-codeflag to data preparation and training scripts.hf_nameandfilter_fnoptions toDatasetConfig.COCO_DIRenvironment variable to control where COCO images are read from.--allowed-media-domain-paths /path/to/cocowhen serving it.normalize_fnmore difficult. To reduce overhead, this step is now executed afterraw_dataset.select.--enforce-eagerfor vLLM in e2e tests by default to reduce startup time.torchaudioandtorchvisionto dependencies.Related Issue
#290
Tests
Add integration and e2e tests to ensure MM support. Note:
I have filled in: