[worker] fix: get all `multi_modal_inputs` keys with in a microbatch by HollowMan6 · Pull Request #3315 · verl-project/verl

HollowMan6 · 2025-09-02T23:15:55Z

What does this PR do?


Address the first issue in #3281 (comment)

More work on top of #1999

Currently, the code takes the keys from the first row of the microbatch. This can go wrong when the dataset mixes pure-text and multi-modal samples: if the first sample in the microbatch is pure text (its keys contain no pixel_values or image_grid_thw), the multi-modal data in the rest of the microbatch is ignored.

This PR fixes the issue by collecting all available keys for multi_modal_inputs within the microbatch, so that those multi-modal tensors can be concatenated without ignoring any of them in the situation above.
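
The failure mode and the fix described above can be sketched in a few lines. This is a minimal illustration rather than the PR's actual code: plain Python lists stand in for tensors, list concatenation stands in for `torch.cat`, and the helper names `keys_from_first_row`, `keys_from_all_rows`, and `concat_by_key` are hypothetical.

```python
# Hypothetical microbatch: the first sample is pure text, the second has an image.
# Plain lists stand in for tensors; this is an illustration, not verl's code.
micro_batch = [
    {},  # pure-text sample: no multi-modal keys
    {"pixel_values": [[0.1, 0.2]], "image_grid_thw": [[1, 2, 2]]},
]


def keys_from_first_row(batch):
    """Old behaviour: only the first sample decides which keys exist."""
    return set(batch[0].keys())


def keys_from_all_rows(batch):
    """Fixed behaviour: take the union of keys over every sample."""
    keys = set()
    for sample in batch:
        keys.update(sample.keys())
    return keys


def concat_by_key(batch, keys):
    """Concatenate the values of each key across the samples that have it."""
    merged = {}
    for key in sorted(keys):
        # List flattening is a stand-in for torch.cat(values, dim=0).
        merged[key] = [row for sample in batch if key in sample for row in sample[key]]
    return merged


old = concat_by_key(micro_batch, keys_from_first_row(micro_batch))
new = concat_by_key(micro_batch, keys_from_all_rows(micro_batch))
print(old)  # the image inputs are silently dropped
print(new)  # both multi-modal keys survive
```

With the first-row approach the result is empty even though the second sample carries image data; taking the union of keys keeps both `pixel_values` and `image_grid_thw`.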


gemini-code-assist

Code Review

This pull request correctly fixes a bug where multi_modal_inputs keys were only sourced from the first item in a microbatch, potentially missing keys from other items. The fix involves iterating through all items to collect all possible keys. My review focuses on improving the performance of this new logic. The updated implementation for collecting keys involves nested loops, which can be inefficient. I've suggested a more performant approach that iterates through the inputs only once to collect all values, which should be more efficient, especially for large microbatches.
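
The reviewer's suggested code is not reproduced in this thread. As a rough sketch of what a single-pass collection could look like, assuming each sample is a dict and using strings as stand-ins for tensors (the function name `collect_values_single_pass` is hypothetical):

```python
from collections import defaultdict


def collect_values_single_pass(batch):
    """Group values by key while visiting each sample exactly once."""
    collected = defaultdict(list)
    for sample in batch:
        for key, value in sample.items():
            collected[key].append(value)
    return dict(collected)


batch = [
    {},  # pure-text sample
    {"pixel_values": "pv0", "image_grid_thw": "thw0"},
    {"pixel_values": "pv1", "image_grid_thw": "thw1"},
]
collected = collect_values_single_pass(batch)
print(collected)
```

Compared with first computing the set of keys and then looping over the batch once per key, this visits every sample only once; in real code each collected list would then be handed to `torch.cat`.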

verl/workers/actor/dp_actor.py

verl/workers/actor/megatron_actor.py

verl/workers/critic/dp_critic.py

verl/workers/engine/fsdp/engine_impl.py

verl/workers/engine/megatron/engine_impl.py

verl/workers/reward_model/megatron/reward_model.py


HollowMan6 · 2025-09-03T22:39:48Z

Currently, the code seems to be problematic, and some of the ranks will hang at:

    _to_dtype_if_needed (torch/distributed/fsdp/_fully_shard/_fsdp_common.py:170)
    all_gather_inputs (torch/distributed/fsdp/_fully_shard/_fsdp_param.py:744)
    _get_param_all_gather_inputs (torch/distributed/fsdp/_fully_shard/_fsdp_collectives.py:256)
    decorate_context (torch/utils/_contextlib.py:120)
    foreach_all_gather (torch/distributed/fsdp/_fully_shard/_fsdp_collectives.py:176)
    decorate_context (torch/utils/_contextlib.py:120)
    unshard (torch/distributed/fsdp/_fully_shard/_fsdp_param_group.py:279)
    pre_forward (torch/distributed/fsdp/_fully_shard/_fsdp_param_group.py:351)
    _pre_forward (torch/distributed/fsdp/_fully_shard/_fsdp_state.py:248)
    _fn (torch/_dynamo/eval_frame.py:929)
    fsdp_hook_wrapper (torch/distributed/fsdp/_fully_shard/_fsdp_state.py:62)
    inner (torch/nn/modules/module.py:1806)
    _call_impl (torch/nn/modules/module.py:1879)
    _wrapped_call_impl (torch/nn/modules/module.py:1773)
    __call__ (transformers/modeling_layers.py:94)
    forward (transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py:903)
    ulysses_wrapped_decoder_forward (verl/models/transformers/monkey_patch.py:194)
    ulysses_wrapped_decoder_forward (verl/models/transformers/monkey_patch.py:194)
    _call_impl (torch/nn/modules/module.py:1784)
    _wrapped_call_impl (torch/nn/modules/module.py:1773)
    forward (transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py:1314)
    _call_impl (torch/nn/modules/module.py:1784)
    _wrapped_call_impl (torch/nn/modules/module.py:1773)
    forward (transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py:1486)
    wrapper (transformers/utils/generic.py:940)
    inner (torch/nn/modules/module.py:1827)
    _call_impl (torch/nn/modules/module.py:1879)
    _wrapped_call_impl (torch/nn/modules/module.py:1773)
    _forward_micro_batch (verl/workers/actor/dp_actor.py:172)
    compute_log_prob (verl/workers/actor/dp_actor.py:339)
    log (verl/utils/profiler/performance.py:118)
    f (verl/utils/profiler/performance.py:105)
    compute_log_prob (verl/workers/fsdp_workers.py:896)
    wrapper (verl/utils/profiler/profile.py:256)
    inner (verl/single_controller/base/decorator.py:430)
    func (verl/single_controller/ray/base.py:701)
    _resume_span (ray/util/tracing/tracing_helper.py:461)
    actor_method_executor (ray/_private/function_manager.py:689)
    main_loop (ray/_private/worker.py:964)
    <module> (ray/_private/workers/default_worker.py:321)

I'm not entirely sure what's going on here. Any help would be much appreciated!

@gemini-code-assist

gemini-code-assist

Code Review

This pull request refactors the handling of multi_modal_inputs to correctly process microbatches with mixed pure-text and multi-modal data. It introduces a new utility function, extract_multi_modal_inputs, to gather all keys from all samples in a microbatch, fixing a bug where keys were only taken from the first sample. My review focuses on ensuring the new utility function is robust. I've identified a potential regression where None values are not handled, which could lead to a TypeError. I've provided a suggestion to fix this.
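
The concrete suggestion is not shown in this thread. One way to guard against `None` values, sketched under the assumption that a sample may either omit a key or map it to `None` (the helper name `gather_non_null` is hypothetical), is to filter them out before concatenation, since `torch.cat` raises a `TypeError` when its list contains `None`:

```python
def gather_non_null(batch, key):
    """Collect the values for `key`, skipping missing entries and explicit None."""
    return [sample[key] for sample in batch if sample.get(key) is not None]


batch = [
    {"pixel_values": None},        # key present but unset
    {"pixel_values": [1.0, 2.0]},  # list stands in for a tensor
    {},                            # key missing entirely
]
values = gather_non_null(batch, "pixel_values")
print(values)
```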

verl/utils/model.py

ririOuO · 2025-09-11T10:50:07Z

Quoting @HollowMan6: "Currently, the code seems to be problematic, and some of the ranks will hang at: [same stack trace as in the comment of 2025-09-03 above]"

Hi @HollowMan6, I ran into the same problem. Have you found the cause? I use Qwen2.5-VL. When I train on pure-text data, the process does not hang, but when I add some multimodal data and use the code from #3315 (comment) to process multi_modal_inputs, I get the same hang.

gemini-code-assist · 2025-09-11T10:50:40Z

Hi @ririOuO and @HollowMan6,

Thank you for bringing this issue to my attention and for the additional context, @ririOuO. The hang you're experiencing with multimodal data, specifically when using Qwen2.5-VL and incorporating multimodal inputs, points to a potential mismatch in how multi_modal_inputs are being prepared for the model, especially in a distributed FSDP environment.

The stack trace you provided, particularly the calls to _to_dtype_if_needed, all_gather_inputs, foreach_all_gather, and unshard within torch.distributed.fsdp, strongly suggests a communication deadlock or an inconsistency in tensor shapes/types across distributed ranks during FSDP's collective operations.
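
The thread has not established a root cause at this point. As a generic illustration of how a collective can hang when ranks disagree about whether to call it, the sketch below simulates two ranks with threads and uses `threading.Barrier` as a stand-in for an all-gather. The scenario (one rank skipping the collective because its microbatch is pure text) is an assumption made for illustration, not a diagnosis of this bug.

```python
import threading

WORLD_SIZE = 2
# A barrier is a stand-in for a collective: it completes only if every rank calls it.
collective = threading.Barrier(WORLD_SIZE)
results = {}


def rank_fn(rank, has_images):
    # Hypothetical: only ranks whose microbatch contains images enter the
    # code path that issues the collective.
    if not has_images:
        results[rank] = "skipped collective"
        return
    try:
        # A real all-gather would block here indefinitely; the timeout only
        # exists so that this simulation terminates.
        collective.wait(timeout=0.5)
        results[rank] = "completed"
    except threading.BrokenBarrierError:
        results[rank] = "timed out waiting for peers"


threads = [
    threading.Thread(target=rank_fn, args=(0, True)),   # multi-modal microbatch
    threading.Thread(target=rank_fn, args=(1, False)),  # pure-text microbatch
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

Rank 0 waits for a peer that never arrives, which is the simulated counterpart of a rank stuck inside an FSDP all-gather.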

Looking at the changes in this pull request, the newly introduced extract_multi_modal_inputs function in verl/utils/model.py is the most likely source of this behavior. Let me explain why:

Original Logic (before this PR, in dp_actor.py and similar files):

            if "image_bound" in micro_batch["multi_modal_inputs"][0]:  # minicpm-o logic
                # ... keep as list ...
            else:
                # ... concatenate ...

This logic determined whether to keep multi_modal_inputs as a list of tensors or concatenate them based only on whether the first item in the micro_batch["multi_modal_inputs"] contained an "image_bound" key.

New Logic (in extract_multi_modal_inputs):

    for inputs in batch_data:
        if "image_bound" in inputs:
            has_image_bound = True
    # ...
    if has_image_bound:  # minicpm-o logic
        multi_modal_inputs[key] = values # list of tensors
    else:
        multi_modal_inputs[key] = torch.cat(values, dim=0) # concatenated tensor

The new extract_multi_modal_inputs function now sets has_image_bound = True if any dictionary within the batch_data (i.e., any sample in the microbatch) contains an "image_bound" key. If has_image_bound is True, the function returns multi_modal_inputs where each key's value is a list[torch.Tensor]. Otherwise, it concatenates them into a single torch.Tensor.
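
The difference between the two checks can be seen on a small hypothetical batch in which only the second sample carries "image_bound" (strings stand in for tensors):

```python
batch_data = [
    {"pixel_values": "pv0"},                       # no image_bound
    {"pixel_values": "pv1", "image_bound": "b1"},  # image_bound present
]

# Original check: only the first sample is inspected.
first_item_check = "image_bound" in batch_data[0]

# New check as described above: any sample can set the flag.
any_item_check = any("image_bound" in inputs for inputs in batch_data)

print(first_item_check, any_item_check)
```

For this batch the first-item check is false while the any-item check is true, so the two versions would take different branches.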

The Problem:

Your observation that the hang occurs with multimodal data but not pure-text data is key. The original PR aimed to correctly handle mixed batches where the first item might be pure-text but subsequent items are multimodal. While the new function correctly identifies all multimodal keys, the change in the has_image_bound check means that if even one sample in a microbatch contains "image_bound", then all multi_modal_inputs for that entire microbatch will be returned as list[torch.Tensor].

If the Qwen2.5-VL model (or other models you are using) expects a single, concatenated torch.Tensor for its multimodal inputs (like pixel_values or image_grid_thw) when processed by FSDP, but it now receives a list[torch.Tensor] due to the image_bound flag being set by any sample in the batch, this type or structure mismatch can cause FSDP's collective operations (like all_gather_inputs) to hang because ranks might be expecting different data structures or sizes.

Suggested Next Steps:

Verify Model Expectation: Confirm whether the Qwen2.5-VL model's forward pass, especially when used with FSDP, expects multi_modal_inputs to be a single concatenated torch.Tensor or a list[torch.Tensor] for non-minicpm-o cases. The minicpm-o logic comment suggests that list[torch.Tensor] is specific to that model type.
Refine extract_multi_modal_inputs: If other models (like Qwen2.5-VL) expect concatenated tensors by default, you might need to adjust extract_multi_modal_inputs to be more granular. For example, it could:
- Concatenate tensors for keys like pixel_values and image_grid_thw by default.
- Only keep values as a list for specific keys or under specific conditions (e.g., if the model explicitly supports and expects a list for that input, or if image_bound is present and the model is indeed minicpm-o).
A possible modification could involve passing a flag to extract_multi_modal_inputs indicating whether the current model expects concatenated tensors or lists for its multimodal inputs, or making the decision based on the specific keys rather than a global has_image_bound flag for the entire batch.

I hope this analysis helps in debugging the hang. Let me know if you have further questions!

HollowMan6 · 2025-09-11T10:57:46Z

Hi @ririOuO! Unfortunately, I haven't figured out the cause of the GPU hang. I now suspect the issue may be rooted in Hugging Face's transformers library. If you have any ideas or findings, please let us know here. (BTW, I don't think the above comments by @gemini-code-assist are valid, as we won't have "image_bound" in the keys for this setup.)

techkang · 2025-09-15T10:15:18Z

Maybe it's a bug related to Qwen VL: QwenLM/Qwen3-VL#346

HollowMan6 · 2025-09-15T10:30:05Z

Maybe it's a bug related to Qwen VL: QwenLM/Qwen2.5-VL#346

Thanks for your debugging! It's very likely, but I'm not completely sure if it's the root cause.

techkang · 2025-09-16T07:31:22Z

I think this is a bug related to the FSDP wrap policy in verl. After commenting out these lines, https://github.com/volcengine/verl/blob/5c98ed1b313be5ae3de10a3e800cbb769374e1a5/verl/workers/fsdp_workers.py#L446-L450, I can successfully train on a mixed RL dataset. I opened an issue to discuss this: #3491

hiyouga

That's right, we need a special way to process the multimodal inputs for mixed text-image data

hiyouga · 2025-09-16T14:02:08Z

@techkang @HollowMan6 The hanging problem will be solved by this PR #3496

hiyouga · 2025-09-16T14:05:58Z

@HollowMan6 Could you please fix the precommit problems by running pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always?

HollowMan6 · 2025-09-16T14:16:31Z

Thank you for reviewing and helping debug the issue! I just fixed the linting issue @hiyouga

hiyouga · 2025-09-18T02:55:42Z

The CI failed. Could you please check the reward model inputs?

HollowMan6 · 2025-09-18T08:06:14Z

Hi @hiyouga! I suspect this PR might not be the direct cause of the error you just mentioned, because the last time I ran it, before rebasing onto the latest commits, the Megatron backend was still hanging for Qwen2.5-VL:

Anyway, I saw your fix #3496 has just been merged, so I did another round of rebase. Would you mind retriggering the CI so that I can confirm if the hanging has already been solved and target the root cause for this shape mismatch?

hiyouga · 2025-09-18T10:34:27Z

Sure, thanks for your response. I hope it passes the CI checks.

hiyouga · 2025-09-18T16:18:37Z

The CI is still failing :(

It runs correctly in the main branch: https://github.com/volcengine/verl/actions/runs/17815858894/job/50648814397

HollowMan6 · 2025-09-18T20:00:51Z

https://github.com/volcengine/verl/compare/00c0bb9998f7715e018913305fc97f3e960d25a6..3517288597372fda37381005bb4129faf7c006ba

I just added multi_modal_inputs_idx back for the Megatron backend, though I'm not sure whether this will fix the issue (as I previously understood it, multi_modal_inputs_idx is unnecessary and is not used elsewhere). Since I currently don't have any GPUs available for debugging, this is the only place I can think of that could break the Megatron backend. @hiyouga Would you mind retriggering the CI? If it still fails, I will try to figure out what's going on once I get access to GPUs.

@gemini-code-assist please also suggest.

gemini-code-assist · 2025-09-18T20:01:20Z

Hi @HollowMan6,

Thank you for the update and for adding multi_modal_inputs_idx back. I understand your point about image_bound not being present in your setup, which means my previous hypothesis about extract_multi_modal_inputs returning a list[torch.Tensor] would indeed be incorrect for your specific use case, as it would always default to concatenating the tensors.

Given that the initial hang seems to be resolved (likely by the merged PR you mentioned), and the current issue is a "shape mismatch" in the CI for the Megatron backend, let's focus on that. The extract_multi_modal_inputs function, when has_image_bound is False (as you've indicated for your setup), will always attempt to concatenate the collected values along dim=0 using torch.cat.

This "shape mismatch" error during torch.cat typically occurs when the tensors being concatenated do not have compatible dimensions (i.e., all dimensions except the concatenation dimension must be the same). For example, if you're concatenating images, they must all have the same height and width.
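
The compatibility rule stated above can be expressed as a small shape check. This is a sketch that works on shape tuples rather than real tensors; the function name `cat_compatible` and the example shapes are hypothetical, and it ignores any special handling a framework may apply to empty tensors.

```python
def cat_compatible(shapes, dim=0):
    """Return True if tensors with these shapes could be concatenated along `dim`.

    `dim` is assumed to be a non-negative index.
    """
    if not shapes:
        return False  # concatenating an empty list is an error
    ndim = len(shapes[0])
    if any(len(s) != ndim for s in shapes):
        return False  # all tensors need the same number of dimensions
    # Drop the concatenation dimension and require the rest to be identical.
    rest = {tuple(d for i, d in enumerate(s) if i != dim) for s in shapes}
    return len(rest) == 1


same_width = cat_compatible([(4, 1176), (16, 1176)])  # only dim 0 differs
diff_width = cat_compatible([(4, 1176), (16, 588)])   # dim 1 differs
print(same_width, diff_width)
```

Only the size along the concatenation dimension may vary between inputs; a difference in any other dimension makes the concatenation fail.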

Here are some debugging steps I'd suggest for the Megatron backend, especially since you've re-introduced multi_modal_inputs_idx:

Inspect multi_modal_inputs_idx:
- Verify the contents of batch["multi_modal_inputs_idx"] when the CI fails. Are the indices correct and within the bounds of batch["multi_modal_inputs"]? While extract_multi_modal_inputs has a check (if i < len(batch_data)), an incorrect multi_modal_inputs_idx could lead to an unexpected subset of data being processed.
- Is it possible that multi_modal_inputs_idx is sometimes empty or contains indices that point to entries where multi_modal_inputs are not consistently structured?
Examine batch["multi_modal_inputs"] content:
- Before calling extract_multi_modal_inputs, log the shape and dtype of the tensors within batch["multi_modal_inputs"] for the relevant keys (e.g., pixel_values, image_grid_thw).
- Specifically, check the shapes of the tensors that are being passed to torch.cat within extract_multi_modal_inputs. You could add temporary print statements inside extract_multi_modal_inputs like this:
```
# Inside extract_multi_modal_inputs, before the torch.cat line
for key, values in multi_modal_inputs_collected.items():
    if not has_image_bound: # Only for the concatenation path
        print(f"Key: {key}")
        for i, val in enumerate(values):
            print(f"  Value {i} shape: {val.shape if isinstance(val, torch.Tensor) else 'Not a tensor'}")
        # ... then the torch.cat call
```
This will help confirm if the individual tensors within values (for a given key) have consistent shapes before concatenation. If they don't, that's the root cause of the shape mismatch.
Data Preprocessing Consistency:
- If the shapes are inconsistent, the issue likely lies in the data loading or preprocessing pipeline that generates batch["multi_modal_inputs"]. Ensure that all multimodal inputs (e.g., images) are resized or padded to a uniform shape before being batched together, especially when mixing different types of data or sources.

Since you don't have GPUs for debugging right now, these logging steps in the CI environment (if possible) would be invaluable to pinpoint the exact tensors causing the shape mismatch. I hope this helps you narrow down the problem!

Currently, the code takes the keys from the first row of the microbatch. This can go wrong when the dataset mixes pure-text and multi-modal samples: if the first sample in the microbatch is pure text (its keys contain no `pixel_values` or `image_grid_thw`), the multi-modal data in the rest of the microbatch is ignored. This PR fixes the issue by collecting all available keys for `multi_modal_inputs` within the microbatch, so that those multi-modal tensors can be concatenated without ignoring any of them in the situation above. Signed-off-by: Hollow Man <hollowman@opensuse.org>

HollowMan6 · 2025-09-19T06:23:47Z

Looks like adding back multi_modal_inputs_idx is the solution and CI now is fixed @hiyouga @vermouth1992

hiyouga

LGTM!

…andling `multi_modal_inputs` (#3641)

Follow up for #3553

Without those changes in #3315, the error when we train the mixture modal dataset will remain unresolved, so it would be a good idea to add them back.

```logs
  File "verl/workers/actor/megatron_actor.py", line 639, in update_policy
    metric_micro_batch = self.forward_backward_batch(
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "verl/workers/actor/megatron_actor.py", line 587, in forward_backward_batch
    losses_reduced = forward_backward_func(
                     ^^^^^^^^^^^^^^^^^^^^^^
  File "megatron/core/pipeline_parallel/schedules.py", line 595, in forward_backward_no_pipelining
    output_tensor, num_tokens = forward_step(
                                ^^^^^^^^^^^^^
  File "megatron/core/pipeline_parallel/schedules.py", line 402, in forward_step
    output_tensor, loss_func = forward_step_func(data_iterator, model)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "verl/workers/actor/megatron_actor.py", line 497, in forward_step
    multi_modal_inputs[key] = torch.cat(
                              ^^^^^^^^^^
RuntimeError: torch.cat(): expected a non-empty list of Tensors
```

Signed-off-by: Hollow Man <hollowman@opensuse.org>
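
The traceback shows `torch.cat` being handed an empty list. How #3641 avoids this is not shown in this thread; one generic guard, sketched here with nested lists standing in for tensors and a hypothetical helper name `merge_multi_modal`, is to skip keys that have nothing to concatenate:

```python
def merge_multi_modal(collected):
    """Concatenate per-key value lists, skipping keys with nothing to concatenate."""
    merged = {}
    for key, values in collected.items():
        if not values:
            # torch.cat([]) raises "expected a non-empty list of Tensors",
            # so an empty list is skipped instead of concatenated.
            continue
        # List flattening is a stand-in for torch.cat(values, dim=0).
        merged[key] = [row for value in values for row in value]
    return merged


collected = {
    "pixel_values": [],                             # no sample provided this key
    "image_grid_thw": [[[1, 2, 2]], [[1, 4, 4]]],   # two samples, one row each
}
merged = merge_multi_modal(collected)
print(merged)
```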


- Revert the Megatron actor changes from the PR that causes perf degradation: verl-project#3206
- We have to revert the following PRs that modify the same files too: verl-project#3513 and verl-project#3315
- We will add them back when we fix the problem

…andling `multi_modal_inputs` (verl-project#3641)

Follow up for verl-project#3553

### What does this PR do?

> Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

Without the changes from verl-project#3315, the following error still occurs when training on a mixed-modality dataset, so this PR adds them back.

```logs
  File "verl/workers/actor/megatron_actor.py", line 639, in update_policy
    metric_micro_batch = self.forward_backward_batch(
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "verl/workers/actor/megatron_actor.py", line 587, in forward_backward_batch
    losses_reduced = forward_backward_func(
                     ^^^^^^^^^^^^^^^^^^^^^^
  File "megatron/core/pipeline_parallel/schedules.py", line 595, in forward_backward_no_pipelining
    output_tensor, num_tokens = forward_step(
                                ^^^^^^^^^^^^^
  File "megatron/core/pipeline_parallel/schedules.py", line 402, in forward_step
    output_tensor, loss_func = forward_step_func(data_iterator, model)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "verl/workers/actor/megatron_actor.py", line 497, in forward_step
    multi_modal_inputs[key] = torch.cat(
                              ^^^^^^^^^^
RuntimeError: torch.cat(): expected a non-empty list of Tensors
```

### Checklist Before Starting

- [X] Search for similar PRs. Paste at least one query link here: ...
- [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [X] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [X] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)

Signed-off-by: Hollow Man <hollowman@opensuse.org>

…erl-project#3315)

### What does this PR do?

> Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

Address the first issue in verl-project#3281 (comment)

More work on top of verl-project#1999

Currently, the code gets the keys from the first row within the microbatch. This can go wrong if the dataset is a mixture of pure-text and multi-modal data, where the first row in the microbatch is pure text (no `pixel_values` or `image_grid_thw` key exists) while the microbatch still contains multi-modal data.

This PR fixes the issue by collecting all available keys for `multi_modal_inputs` within the microbatch, so that the multi-modal tensors can be concatenated without dropping any of them in that situation.

### Checklist Before Starting

- [X] Search for similar PRs. Paste at least one query link here: ...
- [X] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [X] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [X] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [X] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)

Signed-off-by: Hollow Man <hollowman@opensuse.org>
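The difference between reading keys from the first row and collecting them across the whole microbatch can be illustrated with a minimal sketch. This is not the code from the PR: the helper names `collect_multi_modal_keys` and `merge_multi_modal_inputs` are hypothetical, the microbatch is assumed to be a list of row dicts, and plain Python lists stand in for tensors so the snippet runs without PyTorch. With real tensors, the per-key merge would presumably be a `torch.cat(parts, dim=0)` call.

```python
from __future__ import annotations

from typing import Any


def collect_multi_modal_keys(rows: list[dict[str, Any]]) -> list[str]:
    """Return every key found in any row's multi_modal_inputs, in first-seen order."""
    keys: dict[str, None] = {}  # dict used as an ordered set
    for row in rows:
        for key in row.get("multi_modal_inputs", {}):
            keys.setdefault(key, None)
    return list(keys)


def merge_multi_modal_inputs(rows: list[dict[str, Any]]) -> dict[str, list]:
    """Concatenate each key's values, skipping rows that lack the key."""
    merged: dict[str, list] = {}
    for key in collect_multi_modal_keys(rows):
        parts = [
            row["multi_modal_inputs"][key]
            for row in rows
            if key in row.get("multi_modal_inputs", {})
        ]
        # List stand-in for concatenating tensors along dim 0 (assumption).
        merged[key] = [item for part in parts for item in part]
    return merged


# Hypothetical microbatch: the first row is pure text, the rest carry image data.
micro_batch = [
    {"multi_modal_inputs": {}},
    {"multi_modal_inputs": {"pixel_values": [[0.1, 0.2]], "image_grid_thw": [[1, 2, 2]]}},
    {"multi_modal_inputs": {"pixel_values": [[0.3, 0.4]], "image_grid_thw": [[1, 2, 2]]}},
]

# Reading keys from the first row only finds nothing for this microbatch.
first_row_keys = list(micro_batch[0]["multi_modal_inputs"])

# Collecting keys across all rows keeps the image data from the later rows.
merged = merge_multi_modal_inputs(micro_batch)
```

Here `first_row_keys` is empty because the first row is pure text, while `merged` still contains both `pixel_values` and `image_grid_thw` from the two multi-modal rows.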

### What does this PR do?

- Revert the megatron actor changes in this PR that cause perf degradation: verl-project#3206
- We have to revert the following PRs that modify the same files too: verl-project#3513 and verl-project#3315
- We will add them back when we fix the problem

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [ ] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)