[FSDP, VLM] feat: add vlm training for FSDP #501
Conversation
slime/rollout/sglang_rollout.py
Outdated
```python
# Process images for training (like tokenization for images)
if images_for_training and state.processor is not None:
    processed = state.processor(images=images_for_training, return_tensors="pt")
    sample.pixel_values = processed["pixel_values"]
```
we shouldn't try to pass the pixel values from sglang to megatron, but instead maybe re-process the image from the training side is better.
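A minimal sketch of that suggestion, assuming the rollout keeps only the raw images on each sample and the trainer re-runs its own processor (the `Sample` dataclass and `reprocess_images` helper below are illustrative names, not slime's actual API):

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class Sample:
    images: List[Any] = field(default_factory=list)  # raw images (e.g. PIL), not tensors
    pixel_values: Any = None  # filled in on the training side only

def reprocess_images(samples: List[Sample], processor) -> None:
    """Re-run an HF-style processor on the training side instead of
    shipping pixel_values tensors over from the sglang rollout."""
    for sample in samples:
        if sample.images:
            out = processor(images=sample.images, return_tensors="pt")
            sample.pixel_values = out["pixel_values"]
```

This keeps the rollout payload small and makes the trainer's processor the single source of truth for `pixel_values`, so rollout-side and training-side image preprocessing can never drift apart.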
Hi @nanjiangwill, glad to see this important feature! I'd love to help with this PR if you're open to collaborating. I can help with supporting qwen-vl models and geo3k training example. Please LMK if you are happy with collaborating! |
hi @coding-famer, that will be amazing! can i have ur email/contact number so i can reach out to you? |
Have sent you an email! |
Hi @nanjiangwill and @coding-famer, I'd love to help with this PR about the multi-turn part if you are open to this. |
heyy thanks for reaching out! can i have your email? |
Have sent the email! |
My god, @nanjiangwill 😭 |
Awesome! |
Roughly when will this be merged?
Definitely before the Stargate cluster is finished.
This PR will be merged in the next few days. Please give us some time to do a final check and review. |
Invincible! |
We ran three experiments comparing different reward model configurations on 8×H100.
Results showed that all three configurations perform similarly.
I will remove them later; just leaving a note here to mark this. |
zhuzilin left a comment:
LGTM! Left some minor comments.
```python
        flat_rollout_log_probs, dtype=torch.float32, device=torch.cuda.current_device()
    ),
    "multimodal_inputs": multimodal_data,
    "multimodal_num_items": multimodal_num_items,
```
here we need sth like:

```python
packed_batch = {
    "tokens": ...,
}
if multimodal_inputs:
    for key, mm_tensor in multimodal_inputs[i].items():
        ...
packed_batch.extend({
    "multimodal_inputs": multimodal_data,
    "multimodal_num_items": multimodal_num_items,
})
result.append(packed_batch)
```
Used `.update()` instead of `.extend()`, since dictionaries merge key-value pairs with `update` (and have no `extend` method).
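A tiny illustration of the point, with placeholder values standing in for the real tensors:

```python
# dict has no .extend(); .update() is the method that merges key/value
# pairs from another mapping into the dict, in place.
packed_batch = {"tokens": [1, 2, 3]}
multimodal_extras = {
    "multimodal_inputs": {"pixel_values": "tensor-placeholder"},
    "multimodal_num_items": 2,
}
packed_batch.update(multimodal_extras)  # merges in place, returns None
```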
slime/ray/rollout.py
Outdated
```diff
 ):
     # group norm
-    rewards = torch.tensor(raw_rewards, dtype=torch.float)
+    rewards = torch.tensor(raw_rewards, dtype=torch.float16)
```
hmm.. it seems better to move this type conversion into the custom reward model? Otherwise it may affect other users who are using a dense RM for RLHF.
Moved the float16 conversion into the notes of the geo3k_vlm example specifically.
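A sketch of that resolution, assuming an example-specific reward post-processing hook (the `postprocess_rewards` name is illustrative, not slime's actual API):

```python
import torch

def postprocess_rewards(raw_rewards):
    # Cast to float16 inside the geo3k_vlm-specific reward path, so the
    # shared group-norm code in slime/ray/rollout.py can keep its float32
    # default and dense-RM RLHF users are unaffected.
    return torch.tensor(raw_rewards, dtype=torch.float16)
```

Keeping dtype choices in example-specific code is the design choice the review settled on: the shared path stays conservative, and each example opts into narrower precision only where it has verified it is safe.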
Hi @nanjiangwill, this is awesome! I am also working on RL with VLMs. Would love to contribute and collaborate on further works! My email is simondong0919 at gmail.com; love to get in touch!
Co-authored-by: Chenhe Gu <chenhegu0109@gmail.com> Co-authored-by: Jin Pan <jpan236@wisc.edu> Co-authored-by: Jinn <47354855+jhinpan@users.noreply.github.com>
It looks like this commit changed the behavior of the `apply_chat_template` parameter. As a result, this parameter becomes useless during dataset construction: instead of applying the chat template, the data from `prompt_key` is fed directly into something like:

```python
messages = [{"role": "user", "content": messages}]
```

In custom generate scenarios, do you realize what this can lead to? You're effectively replacing what used to be a `str` with a `dict` (or even a list of dicts). If downstream code doesn't explicitly validate types, it will just get stuffed into the prompt and you end up with outputs like:

```
First rollout sample: ['<|im_start|>user\n[{'role': 'user', 'content':
```

This kind of change alters the default behavior in a non-backward-compatible way. At the very least, compatibility with existing users should be considered, and there should be warnings or clear notices. In my case, it indirectly caused training quality to degrade. I wouldn't have noticed this logic change if it weren't for a new training job that made the issue obvious.
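One defensive pattern that would have surfaced this regression early, sketched with an illustrative `build_prompt` helper (not slime's actual API): validate the input type explicitly, and only apply the chat template when the input really is a message list.

```python
def build_prompt(prompt_or_messages, apply_chat_template):
    """Accept either a ready prompt string or a chat-message list; never
    silently stringify a list of {'role': ..., 'content': ...} dicts."""
    if isinstance(prompt_or_messages, str):
        # Already a rendered prompt: pass through unchanged.
        return prompt_or_messages
    if isinstance(prompt_or_messages, list):
        # A list of chat-message dicts: render via the chat template.
        return apply_chat_template(prompt_or_messages)
    raise TypeError(f"unsupported prompt type: {type(prompt_or_messages)!r}")
```

Failing loudly on an unexpected type turns a silent training-quality regression into an immediate, debuggable error.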
@adol001 really sorry about this, we will revert this change. |

Goal: Support VLM training on slime with FSDP
TODO
- class: `AutoModelForVision2Seq` / `AutoModelForImageTextToText`
- Qwen2.5-VL / Qwen3-VL (we just use the default hf implementation)
- new-feature TBD @WindowsXp-Beta
- examples