Qwen2.5-VL-7B eagle3 train #102
Conversation
Summary of Changes
Hello @Lzhang-hub, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
I've implemented initial support for training the Qwen2.5-VL-7B model using the Eagle3 method. This involves significant changes across data preparation, model definition, and training scripts to accommodate the unique requirements of vision-language models, such as handling image inputs and specialized rotary embeddings. The goal is to enable efficient training of this multimodal model within our existing framework.
Highlights
- Qwen2.5-VL-7B Model Support: I've added comprehensive support for training the Qwen2.5-VL-7B model, enabling it to leverage its multimodal capabilities within our framework.
- Eagle3 Draft Model Integration for VLMs: I've integrated the Eagle3 draft model specifically for Qwen2.5-VL, including a new `Qwen2_5_VLForCausalLMEagle3` model and `QwenVLOnlineEagle3Model` for online training, which handles the unique multimodal rotary embedding of Qwen2.5-VL.
- Enhanced VLM Data Preparation: I've updated the data preparation scripts to support `sharegpt4v` and `allava4v` datasets, ensuring that `pixel_values` and `image_grid_thw` are correctly processed for VLM training.
- VLM Training Script Adaptations: I've modified the training script to incorporate VLM-specific logic, such as loading `AutoProcessor` and passing image-related inputs (`pixel_values`, `image_grid_thw`) through the training loop.
- New Configuration and Training Script: I've added a new configuration file and a dedicated shell script to streamline the setup and execution of Qwen2.5-VL-7B Eagle3 training runs.
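The "input embeds that integrate image embeds" idea behind `QwenVLOnlineEagle3Model` can be sketched roughly as follows; the helper name, the placeholder token id, and the tensor shapes are illustrative assumptions, not the PR's actual code:

```python
import torch

def merge_image_embeds(inputs_embeds, input_ids, image_embeds, image_token_id):
    # Replace embeddings at image-placeholder positions with the vision
    # tower's outputs, leaving text-token embeddings untouched.
    mask = (input_ids == image_token_id).unsqueeze(-1).expand_as(inputs_embeds)
    return inputs_embeds.masked_scatter(mask, image_embeds)

B, S, H = 1, 6, 4
inputs_embeds = torch.zeros(B, S, H)            # stand-in text embeddings
input_ids = torch.tensor([[5, 9, 9, 9, 7, 8]])  # 9 = assumed image placeholder id
image_embeds = torch.ones(3, H)                 # one vector per placeholder token
merged = merge_image_embeds(inputs_embeds, input_ids, image_embeds, image_token_id=9)
```

The draft model then consumes `merged` instead of raw `input_ids`, so image information flows through the same path as text embeddings.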
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.
Code Review
This pull request introduces support for training the Qwen2.5-VL-7B model, which is a significant and complex addition. The changes are extensive, touching data preparation, model definition, training scripts, and core components. While the overall approach seems sound, I've identified several critical issues that need to be addressed. These include a missing dependency, incorrect handling of position_ids in the multimodal rotary position embedding logic, which will likely cause runtime errors and incorrect model behavior, and bugs in the training and testing scripts. I've also included some suggestions for refactoring to improve code maintainability. Addressing these points will be crucial for the stability and correctness of the new VLM training capabilities.
```python
input_ids = padding(input_ids, left=False)
target = padding(target, left=False)
loss_mask = padding(loss_mask, left=False)
```
The position_ids are computed once before the TTT loop but are not updated within the loop. As input_ids, target, and loss_mask are shifted in each iteration using padding(..., left=False), position_ids should also be updated similarly to maintain correct positional information for subsequent TTT steps. Without this, the rotary embeddings will be computed with stale position information.
Suggested change:

```python
input_ids = padding(input_ids, left=False)
target = padding(target, left=False)
loss_mask = padding(loss_mask, left=False)
position_ids = padding(position_ids, left=False)
```
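To make the point concrete, here is a minimal sketch of shifting `position_ids` together with the other tensors inside the TTT loop; the `padding` helper below is a stand-in assumption, and the repo's real helper may behave differently:

```python
import torch

def padding(x, left=False):
    # Stand-in for the repo's `padding` helper (assumption): with
    # left=False it drops the first position and pads a zero on the
    # right, shifting the sequence by one for the next TTT step.
    pad = torch.zeros_like(x[..., :1])
    if left:
        return torch.cat([pad, x[..., :-1]], dim=-1)
    return torch.cat([x[..., 1:], pad], dim=-1)

# Shifting position_ids together with input_ids keeps the rotary
# embeddings aligned with the shifted tokens in each iteration:
input_ids = torch.arange(8).unsqueeze(0)
position_ids = torch.arange(8).unsqueeze(0)
input_ids = padding(input_ids, left=False)
position_ids = padding(position_ids, left=False)
```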
```python
from tqdm import tqdm
from transformers import PreTrainedTokenizer, ImageProcessingMixin
from qwen_vl_utils import process_vision_info
```
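As a side note on the data-preparation path these imports serve: `image_grid_thw` must stay consistent with the number of image placeholder tokens in `input_ids`. A quick sanity-check sketch, assuming Qwen2.5-VL's default spatial merge size of 2 (treat the constant as an assumption):

```python
# image_grid_thw stores the (t, h, w) patch grid per image; after the
# 2x2 spatial merge the vision tower emits t*h*w / merge_size**2
# embeddings, which must equal the number of image placeholder tokens.
def num_image_tokens(grid_thw, merge_size=2):
    t, h, w = grid_thw
    return (t * h * w) // (merge_size ** 2)

grid = [1, 16, 16]  # one image patched into a 16x16 grid
print(num_image_tokens(grid))  # 64 placeholder tokens expected
```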
```python
else:
    lck = len(cache_hidden[0])
    cos, sin = self.rotary_emb(query_states, position_ids + lck)
```
The logic for updating position_ids within the TTT loop by adding lck is incorrect for multimodal inputs. position_ids has a shape of (3, batch_size, seq_len) for Qwen-VL, where each of the 3 components corresponds to different modalities (text, image height, image width). Adding a scalar lck will broadcast incorrectly. Only the text-related position IDs (at index 0) should be offset.
A correct update would be to modify only the text-related part of the position IDs. However, a better approach would be to handle the position updates in the QwenVLOnlineEagle3Model.forward loop, which would simplify the logic here.
Suggested change:

```python
cos, sin = self.rotary_emb(query_states, position_ids)
```
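A small sketch of why the scalar offset misbehaves, following the (3, batch_size, seq_len) layout described above (illustrative values, not the repo's code):

```python
import torch

# Qwen2.5-VL's multimodal RoPE position_ids: index 0 is the text/temporal
# plane, indices 1 and 2 are the image height and width planes. A scalar
# offset broadcasts over all three planes; per the review, only the text
# plane should advance.
batch, seq_len, lck = 2, 5, 3
position_ids = torch.zeros(3, batch, seq_len, dtype=torch.long)
position_ids[0] = torch.arange(seq_len)  # text/temporal positions

wrong = position_ids + lck   # offsets the spatial planes too
right = position_ids.clone()
right[0] += lck              # offsets only the text plane
```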
scripts/train_eagle3_online.py
Outdated
```python
eval_logdict[f"train/ploss_{i}"] = plosses[i].item()
for i in range(len(acces)):
    eval_logdict[f"train/acc_{i}"] = acces[i]
```
The wandb logging for evaluation metrics seems to be using the wrong keys. The metrics are logged under train/ploss_{i} and train/acc_{i} which is misleading during the evaluation phase. This should be corrected to eval/ploss_{i} and eval/acc_{i} to accurately reflect that these are evaluation metrics.
Suggested change:

```python
eval_logdict[f"eval/ploss_{i}"] = plosses[i].item()
for i in range(len(acces)):
    eval_logdict[f"eval/acc_{i}"] = acces[i]
```
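One way to avoid this class of bug is to build the log dict through a helper parameterized by phase; this is a sketch, and the helper name and call sites are assumptions, not the PR's code:

```python
def metric_logdict(plosses, acces, prefix):
    # Build a wandb-style log dict keyed by phase so "train/" and
    # "eval/" metrics never collide; plosses/acces are plain floats here.
    d = {f"{prefix}/ploss_{i}": p for i, p in enumerate(plosses)}
    d.update({f"{prefix}/acc_{i}": a for i, a in enumerate(acces)})
    return d

print(metric_logdict([0.5, 0.4], [0.8, 0.9], "eval"))
```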
```python
class Qwen2_5_VLForCausalLMEagle3(Eagle3DraftModel):

    config_class = LlamaConfig
```
```python
print(f"Loss mask sum: {processed['loss_mask'][0].sum()}")
loss_mask = processed["loss_mask"][0].squeeze(0).tolist()
input_ids = input_ids.squeeze(0)
current_mask = input_ids[0]
```
```python
if args.is_vlm:
    plosses, _, acces = eagle3_model(
        input_ids=data["input_ids"].cuda(),
        attention_mask=data["attention_mask"].cuda(),
        loss_mask=data["loss_mask"].cuda(),
        pixel_values=data["pixel_values"].cuda(),
        image_grid_thw=data["image_grid_thw"].cuda(),
    )
else:
    plosses, _, acces = eagle3_model(
        input_ids=data["input_ids"].cuda(),
        attention_mask=data["attention_mask"].cuda(),
        loss_mask=data["loss_mask"].cuda(),
    )
```
The model call is duplicated in the training loop for the VLM and non-VLM cases. This can be refactored to reduce code duplication and improve readability by constructing a dictionary of model inputs and then unpacking it for the model call.
```python
model_inputs = {
    "input_ids": data["input_ids"].cuda(),
    "attention_mask": data["attention_mask"].cuda(),
    "loss_mask": data["loss_mask"].cuda(),
}
if args.is_vlm:
    model_inputs["pixel_values"] = data["pixel_values"].cuda()
    model_inputs["image_grid_thw"] = data["image_grid_thw"].cuda()
plosses, _, acces = eagle3_model(**model_inputs)
```
Great job!!!!

The dataset.map step is very slow, and it hangs when num_proc is greater than 1.
Can you provide your training script? I processed 30,000 samples with a max image size of 2K, which takes about 15 minutes.

@Lzhang-hub I use the default command with my own data
no, but the overall acc seems correct
* support qwen2_5_vl online
* delete nohup
* add qwen2.5-vl eagle model
* add todo
* clean dev code
* support batch and fix position_ids bug
* add eval wandb metrics
* fix eval bug
* fix eval dataloader bug
* add comment
* merge main
* rename vlm online eagle3 model name
* clean code
* fix ttt input embeds bug (Co-authored-by: Yingyi Huang <yingyihuang2000@outlook.com>)
* fix eval metrics bug
* merge qwen-vl draft model to llama3 (Co-authored-by: Yingyi Huang <yingyihuang2000@outlook.com>)
* fix qwen vl train shell
* add timeout config (Co-authored-by: Yingyi Huang <yingyihuang2000@outlook.com>)
* qwenvl draft input without image embedding (Co-authored-by: Yingyi Huang <yingyihuang2000@outlook.com>)
* qwenvl draft input without image embedding (Co-authored-by: Yingyi Huang <yingyihuang2000@outlook.com>)
* Revert "qwenvl draft input without image embedding" (reverts commit 1e8eab8)
* fix gitignore
* fix wandb error
* fix lint

Co-authored-by: Yingyi Huang <yingyihuang2000@outlook.com>
Does it work for qwen2.5-vl-3B?
@mmdbhs I haven't tried it, but theoretically it should work. You can give it a try, and if you encounter any problems, feel free to provide feedback at any time.

Great job, does it work for qwen2.5-vl-32B?
It may need TP, which is not supported yet.

@oswen You need to install sglang from source; v0.5.1 does not support qwen-vl eagle3 inference.
Thanks for replying, so which branch of sglang should I choose? The master?

```shell
pip uninstall sglang
pip install sglang==0.5.3
```
After following your procedure above, the ACC of my draft model reached around 0.6. Similarly, I deployed it using SGLang, but when I tested the Accept Length, I found that with the model you provided, the Accept Length could exceed 3.0, whereas mine only reached about 2.2. Do you have any idea what might be causing this difference? @Lzhang-hub

Train Script

prepare data:
```shell
python scripts/prepare_data.py --dataset allava4v --sample-size 100000 --split-eval
```
train:
```shell
bash examples/run_qwen2_5_vl_eagle3_online.sh
```
Acc result

Test Script

Test your model:
```shell
python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct --speculative-draft-model Rayzl/qwen2.5-vl-7b-eagle3-sgl --trust-remote-code --chat-template qwen2-vl --chunked-prefill-size -1 --cuda-graph-max-bs 1 --speculative-algo EAGLE3 --speculative-num-steps 4 --speculative-eagle-topk 6 --speculative-num-draft-tokens 24 --tp 1 --mem-fraction-static 0.7 --host localhost --port 9001
python run_mmstar.py --host http://localhost --port 9001 --parallel 1 --num-questions 100
```
Result:

Test my trained model:
```shell
python -m sglang.launch_server --model-path Qwen/Qwen2.5-VL-7B-Instruct --speculative-draft-model ${my_local_dir}/epoch_9 --trust-remote-code --chat-template qwen2-vl --chunked-prefill-size -1 --cuda-graph-max-bs 1 --speculative-algo EAGLE3 --speculative-num-steps 4 --speculative-eagle-topk 6 --speculative-num-draft-tokens 24 --tp 1 --mem-fraction-static 0.7 --host localhost --port 9001
python run_mmstar.py --host http://localhost --port 9001 --parallel 1 --num-questions 100
```
Latency: 60.094 s
Resolved. It was caused by wrong data preprocessing.
I have the same issue, have you solved it? Thanks @LugerW-A
No idea. Just accept it, use num_proc=0.
Okay, thanks
May I ask if this model can be trained with a 48GB A6000 GPU? I encountered a resource limit exceeded issue during kernel compilation using this GPU.
@LugerW-A @icicle4 |



Motivation
This is a draft PR to support training the qwen2.5-vl-7b model.
Modifications
prepare data
- `pixel_values` and `image_grid_thw` were added, in addition to `input_ids`, `loss_mask`, and `attention_mask`.
train
- `QwenVLOnlineEagle3Model` in core/eagle3.py; the main difference is that the input to the draft model is not `input_ids` but input embeds that integrate the image embeds.

acc
benchmark: [results screenshot]

loss metrics: [loss curves screenshot]

acc metrics: [accuracy curves screenshot]
speedup
server: sglang for qwen-2.5-vl eagle3 infer
benchmark scripts: use mmstar benchmark
Note: the draft model `Rayzl/qwen2.5-vl-7b-eagle3-sgl` is trained on only 30k VQA samples; training with more data is still in progress.

server cmd:
benchmark:
```shell
python run_mmstar.py --host http://0.0.0.0 --port 8080 --parallel 1 --num-questions 100
```
result:
server cmd:
benchmark:
```shell
python run_mmstar.py --host http://0.0.0.0 --port 8080 --parallel 1 --num-questions 100
```
result:
e2e speedup 1.5x
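For intuition on why the end-to-end speedup (1.5x) is lower than the raw accept length (above 3), here is a back-of-envelope model that charges each draft step a fraction of a target forward pass. The cost constants are assumptions for illustration, not measurements from this PR:

```python
def est_speedup(accept_len, num_steps, c_draft=0.1, c_verify=1.0):
    # Per verify cycle the target emits `accept_len` tokens on average,
    # while paying for `num_steps` draft forwards plus one verification
    # pass (costs are relative to one target forward pass).
    return accept_len / (num_steps * c_draft + c_verify)

print(round(est_speedup(3.0, 4), 2))  # ~2.14 under these assumed costs
```

Scheduling overhead and non-generation time in the server push the realized end-to-end number below this upper-bound style estimate.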
Train scripts
Note:
TODO
Checklist