
Conversation

@popomen (Contributor) commented May 28, 2025

Checklist Before Starting

  • Search for similar PR(s).

What does this PR do?

Only apply Ulysses sequence parallelism to the LLM part of the VLM model, which is the main component, to prevent the `Image features and image tokens do not match` error from occurring before `masked_scatter`.

High-Level Design

Demonstrate the high-level design if this PR is complex.

Specific Changes

  1. For the VLM model, pad the inputs before the forward pass without slicing them; slicing is performed after the embedding stage.
  2. Where the ViT and the LLM share the same FlashAttention implementation, detect the ViT case and skip the Ulysses logic.
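The pad-then-slice order above can be sketched in a few lines of plain Python. This is a simplified illustration, not verl's actual API: `ulysses_pad` here has a reduced signature, and `slice_for_rank` is a hypothetical helper.

```python
# Simplified sketch of "pad before forward, slice after embedding".
# ulysses_pad's signature is reduced here; slice_for_rank is hypothetical.

def ulysses_pad(tokens, sp_size, pad_id=0):
    """Pad the full sequence so its length divides evenly across SP ranks."""
    pad_len = (-len(tokens)) % sp_size
    return tokens + [pad_id] * pad_len, pad_len

def slice_for_rank(embeds, rank, sp_size):
    """After the embedding stage, each SP rank keeps only its own shard."""
    chunk = len(embeds) // sp_size
    return embeds[rank * chunk : (rank + 1) * chunk]

tokens = list(range(10))                 # 10 tokens, sp_size=4 -> pad to 12
padded, pad_len = ulysses_pad(tokens, sp_size=4)
embeds = [float(t) for t in padded]      # stand-in for the embedding lookup
shards = [slice_for_rank(embeds, r, sp_size=4) for r in range(4)]
```

Because the image features are merged into the embeddings before slicing, `masked_scatter` always sees the full, consistent sequence; only the already-merged embeddings are sharded.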

API

Demonstrate how the API changes if any.

Usage Example

Provide usage example(s) for easier usage.

# Add code snippet or script demonstrating how to use this 

Test

python -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=/mnt/hdfs/zhudelin123/data/geo3k/train.parquet \
    data.val_files=/mnt/hdfs/zhudelin123/data/geo3k/test.parquet \
    data.train_batch_size=64 \
    data.max_prompt_length=2048 \
    data.max_response_length=2048 \
    data.filter_overlong_prompts=True \
    data.truncation=error \
    data.image_key=images \
    actor_rollout_ref.model.path=/mnt/hdfs/Qwen2.5-VL-7B-Instruct \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=64 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.01 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.model.use_fused_kernels=True \
    actor_rollout_ref.actor.ulysses_sequence_parallel_size=2 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=4 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
    actor_rollout_ref.rollout.enable_chunked_prefill=False \
    actor_rollout_ref.rollout.enforce_eager=False \
    actor_rollout_ref.rollout.free_cache_engine=False \
    actor_rollout_ref.rollout.n=4 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger=[console,wandb] \
    trainer.project_name=nanzhe_verl_grpo_example_geo3k \
    trainer.experiment_name=qwen2_5_vl_7b_sp2_test \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=2 \
    trainer.save_freq=-1 \
    trainer.test_freq=-1 \
    trainer.default_hdfs_dir=null \
    trainer.total_epochs=1 \
    trainer.resume_mode=disable
[screenshot: test run results]

Additional Info.

  • Issue Number: Fixes issue # or discussion # if any.
  • Training: [Note which backend this PR will affect: FSDP, Megatron, both, or none]
  • Inference: [Note which backend this PR will affect: vLLM, SGLang, both, or none]

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add [BREAKING] to the PR title if it breaks any API.
  • Update the documentation about your changes in the docs.
  • Add CI test(s) if necessary.

@CLAassistant commented May 28, 2025

CLA assistant check
All committers have signed the CLA.

@vermouth1992 vermouth1992 requested a review from hiyouga May 28, 2025 13:09
@hiyouga (Collaborator) previously approved these changes May 28, 2025 and left a comment:

overall LGTM!

@vermouth1992 (Collaborator)

Could you modify the vlm test to use sp=2? Thanks

@vermouth1992 (Collaborator)

> overall LGTM!

Do we need per-model modification for ulysses?

@hiyouga (Collaborator) commented May 29, 2025

@vermouth1992 True. To implement SP for VLMs, we need to insert some operations into the top-level module. Since the definition of this module (e.g., `Qwen2VLModel`) varies across models, we currently need to apply model-specific patches for Ulysses. The transformers library is being refactored to unify VLMs, so we'll be able to adopt a more unified approach once that work is done.
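The kind of model-specific patch described here can be sketched as a plain monkey patch. The stub class and `apply_ulysses_patch` below are hypothetical, not verl's actual code:

```python
# Hypothetical sketch of a per-model Ulysses patch: wrap the top-level
# module's forward and shard its output for the local SP rank.

class Qwen2VLModelStub:
    """Stand-in for a top-level VLM module such as Qwen2VLModel."""
    def forward(self, hidden):
        return [h + 1 for h in hidden]    # pretend computation

def apply_ulysses_patch(model_cls, sp_rank, sp_size):
    original_forward = model_cls.forward

    def patched_forward(self, hidden):
        out = original_forward(self, hidden)
        chunk = len(out) // sp_size       # slice after the original forward
        return out[sp_rank * chunk : (sp_rank + 1) * chunk]

    model_cls.forward = patched_forward   # model-specific monkey patch

apply_ulysses_patch(Qwen2VLModelStub, sp_rank=1, sp_size=2)
model = Qwen2VLModelStub()
sharded = model.forward([0, 1, 2, 3])     # rank 1 keeps the second half
```

Because the patch targets the concrete class, each model family whose top-level module differs needs its own patch, which is the per-model cost discussed above.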

@hiyouga (Collaborator) left a comment:

Please also enable SP for the unittest:

TRAIN_FILES=$HOME/data/geo3k/train.parquet VAL_FILES=$HOME/data/geo3k/test.parquet \
MAX_PROMPT_LEN=1536 MAX_RESPONSE_LEN=1536 \
MODEL_ID=Qwen/Qwen2-VL-2B-Instruct \
ADV_ESTIMATOR=grpo RM_PAD=True USE_KL=True ENABLE_CHUNKED_PREFILL=False \
bash tests/e2e/ppo_trainer/run_function_reward.sh

@popomen popomen changed the title [Bugfix] Fix ulysses sequence parallelism for vlm [vlm] Support ulysses sequence parallelism for vlm May 29, 2025
@popomen (Contributor, Author) commented May 29, 2025

> overall LGTM!
>
> Do we need per-model modification for ulysses?

I've attempted a less invasive approach. For slicing the inputs, you only need to inject the Ulysses logic with a single line of code. However, for the all-to-all operation inside the attention block, you still need to patch the entire method.
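For context, that all-to-all redistributes data from a sequence-sharded to a head-sharded layout so each rank can run full attention over a subset of heads. A toy pure-Python simulation (not the actual patched method; shapes and the function name are illustrative):

```python
# Toy simulation of the Ulysses all-to-all inside attention: before it,
# each of sp_size ranks holds ALL heads for seq_len/sp_size tokens;
# after it, each rank holds ALL tokens for n_heads/sp_size heads.

def ulysses_all_to_all(per_rank, sp_size):
    n_heads = len(per_rank[0][0])
    heads_per_rank = n_heads // sp_size
    out = []
    for r in range(sp_size):                 # receiving rank r
        gathered = []
        for src in range(sp_size):           # tokens arrive from every rank
            for tok in per_rank[src]:
                gathered.append(tok[r * heads_per_rank : (r + 1) * heads_per_rank])
        out.append(gathered)
    return out

sp_size, seq_len, n_heads = 2, 4, 4
# Each entry (t, h) marks token t, head h.
full = [[(t, h) for h in range(n_heads)] for t in range(seq_len)]
per_rank = [full[:2], full[2:]]              # sequence sharded across ranks
out = ulysses_all_to_all(per_rank, sp_size)  # now head sharded, full sequence
```

Since this exchange happens mid-forward, between the QKV projection and the attention computation, it cannot be injected from outside with one line; the whole attention method has to be patched.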

@hiyouga (Collaborator) left a comment:

LGTM

@hiyouga hiyouga merged commit 9c50ffd into volcengine:main May 30, 2025
18 of 36 checks passed
wwwjn pushed a commit to wwwjn/verl that referenced this pull request Jun 10, 2025
vermouth1992 pushed a commit that referenced this pull request Jun 10, 2025
### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?
Fix a sequence parallelism conflict in the KimiVL patch.

Background:
A recent VLM-related PR (#1739) changed the sequence parallelism logic for VLMs: `inputs_embeds` is now split after the model's embedding layer, instead of splitting `input_ids` and `position_ids` before the forward pass. However, the SP logic I implemented in KimiVL's PR (#1639) still followed the old approach, splitting at the point where image tokens and text tokens are combined in order to avoid the `Image features and image tokens do not match` problem. Because the two PRs were developed in parallel, merging both introduced a logical conflict.

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

- Delete the patch for `_merge_with_image_features`, which assigned image tokens to the corresponding SP rank.
- Adjust the `position_ids` handling in `_ulysses_flash_attn_forward`.

### API

> Demonstrate how the API changes if any.

### Usage Example

> Provide usage example(s) for easier usage.

```python
# Add code snippet or script demonstrating how to use this 
```

### Test


![image](https://github.com/user-attachments/assets/82ef7a74-66f8-4bb0-a0fc-3702b215c8c0)


### Additional Info.

- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: [Note which backend this PR will affect: FSDP, Megatron,
both, or none]
- **Inference**: [Note which backend this PR will affect: vLLM, SGLang,
both, or none]

### Checklist Before Submitting

- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [ ] New CI unit test(s) are added to cover the code path.
- [ ] Rely on existing unit tests on CI that covers the code path.

---------

Signed-off-by: ShareLer <[email protected]>
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Jun 10, 2025
Comment on lines +113 to +120
if is_vlm_model:
    # vlm model's inputs will be sliced after embedding
    input_ids_rmpad, position_ids_rmpad, pad_size = ulysses_pad(
        input_ids_rmpad,
        position_ids_rmpad=position_ids_rmpad,
        sp_size=self.ulysses_sequence_parallel_size,
    )
else:
whatadayG pushed a commit to whatadayG/verl that referenced this pull request Sep 5, 2025
chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025