[feat] support kimi_vl VLM model #1639
Merged
hiyouga merged 5 commits into verl-project:main, May 30, 2025
Conversation
Signed-off-by: ShareLer <ShareLe@163.com>
hiyouga
approved these changes
May 28, 2025
Collaborator
hiyouga
left a comment
Overall looks good to me!
hiyouga
requested changes
May 28, 2025
Signed-off-by: ShareLer <ShareLe@163.com>
Contributor
Author
@hiyouga It seems that there are some problems with the CI environment. Could you help retrigger the failed CI?
hiyouga
approved these changes
May 30, 2025
Collaborator
hiyouga
left a comment
No problem, let's merge
wwwjn pushed a commit to wwwjn/verl that referenced this pull request, Jun 10, 2025
### Checklist Before Starting

- [x] Search for similar PR(s). Some code will conflict with PR verl-project#1613.

### What does this PR do?

Add initial support for Kimi_vl; add a sequence parallelism (SP) patch for kimi_vl.

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

- Add some minor changes to be compatible with kimi_vl
- Add a patch to support ulysses_sequence_parallel

### API

> Demonstrate how the API changes if any.

### Usage Example

```bash
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=$DATA_PATH/geo3k/test.parquet \
    data.val_files=$DATA_PATH/geo3k/test.parquet \
    data.train_batch_size=16 \
    data.max_prompt_length=2048 \
    data.max_response_length=4096 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    data.image_key=images \
    data.shuffle=False \
    +data.trust_remote_code=True \
    actor_rollout_ref.model.path=moonshotai/Kimi-VL-A3B-Instruct \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ulysses_sequence_parallel_size=2 \
    actor_rollout_ref.actor.ppo_mini_batch_size=8 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.01 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.model.enable_gradient_checkpointing=False \
    actor_rollout_ref.model.trust_remote_code=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=8 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
    actor_rollout_ref.rollout.enable_chunked_prefill=False \
    actor_rollout_ref.rollout.enforce_eager=False \
    actor_rollout_ref.rollout.free_cache_engine=False \
    actor_rollout_ref.rollout.n=8 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.use_kl_in_reward=False \
    trainer.val_before_train=False \
    trainer.critic_warmup=0 \
    trainer.logger=['console','wandb'] \
    trainer.project_name='Kimi_VL_test' \
    trainer.experiment_name='kimi_vl_grpo_geo3k_cp2' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.save_freq=50 \
    trainer.test_freq=5 \
    trainer.total_epochs=15 $@
```

### Test & Problem

During development I discovered some issues, but they do not affect the code in this PR.

Existing problems (with vllm==0.8.5.post1):

- Occasional errors from vllm:

  ```python
  File "/home/sharele/anaconda3/lib/python3.11/site-packages/vllm/v1/attention/backends/mla/common.py", line 504, in build
      self.page_size)
      ^^^^^^^^^^^^^^
  AttributeError: 'MLACommonMetadataBuilder' object has no attribute 'page_size'
  ```

  related: vllm-project/vllm#16908. A temporary workaround is described in vllm-project/vllm#16908 (comment).

- Garbled output from vllm under specific circumstances. During testing I found that when `SamplingParams.n > 1`, vllm's output is meaningless characters or keeps repeating, which affects GRPO. related: vllm-project/vllm#18378. Note: using a Hopper-architecture GPU avoids this problem, but it is not clear whether potential issues remain.

Training curve: the training curve will be added once the second problem is solved.

### Additional Info.

- **Issue Number**: verl-project#1428
- **Training**: FSDP
- **Inference**: vLLM

### Checklist Before Submitting

- [ ] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add CI test(s) if necessary.

---------

Signed-off-by: ShareLer <ShareLe@163.com>
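For readers unfamiliar with what the ulysses_sequence_parallel patch does at a high level, the core idea can be sketched in plain Python: each SP rank keeps a contiguous slice of the sequence dimension, with padding first so the length divides evenly across ranks. This is a minimal illustration only; the function names below are hypothetical and not verl's actual API.

```python
# Illustrative sketch of sequence-parallel sharding (hypothetical names, not verl's API).

def pad_to_multiple(seq, sp_size, pad_value=0):
    """Right-pad a sequence so its length is divisible by sp_size."""
    remainder = len(seq) % sp_size
    if remainder:
        seq = seq + [pad_value] * (sp_size - remainder)
    return seq

def slice_for_rank(seq, rank, sp_size):
    """Return the contiguous chunk of the (padded) sequence owned by `rank`."""
    seq = pad_to_multiple(seq, sp_size)
    chunk = len(seq) // sp_size
    return seq[rank * chunk : (rank + 1) * chunk]

tokens = list(range(10))   # a 10-token sequence
sp_size = 4                # e.g. ulysses_sequence_parallel_size=4
shards = [slice_for_rank(tokens, r, sp_size) for r in range(sp_size)]
# shards == [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 0, 0]]; the last shard holds the padding
```

In real implementations the shards are tensors and attention uses all-to-all communication to gather the full sequence per attention head; the slicing rule above is just the layout.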
vermouth1992 pushed a commit that referenced this pull request, Jun 10, 2025
### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

Fix a sequence parallelism conflict in the KimiVL patch.

Background: a recent VLM-related PR (#1739) changed the sequence parallelism logic for VLMs: split `inputs_embeds` after the model's embedding layer instead of splitting `input_ids` and `position_ids` before the forward pass. However, the SP logic I implemented in KimiVL's PR (#1639) still followed the old logic, and split the image tokens at the point where image tokens and text tokens are combined, to avoid the "image features and image tokens do not match" error. Because the two PRs were developed in parallel, a logical conflict arose after both were merged.

### High-Level Design

> Demonstrate the high-level design if this PR is complex.

### Specific Changes

- Delete the patch for `_merge_with_image_features`, which assigned image tokens to the corresponding SP rank.
- Adjust the `position_ids` handling in `_ulysses_flash_attn_forward`.

### API

> Demonstrate how the API changes if any.

### Usage Example

> Provide usage example(s) for easier usage.

```python
# Add code snippet or script demonstrating how to use this
```

### Test

### Additional Info.

- **Issue Number**: Fixes issue # or discussion # if any.
- **Training**: [Note which backend this PR will affect: FSDP, Megatron, both, or none]
- **Inference**: [Note which backend this PR will affect: vLLM, SGLang, both, or none]

### Checklist Before Submitting

- [ ] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [ ] Add `[BREAKING]` to the PR title if it breaks any API.
- [ ] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs).
- [ ] New CI unit test(s) are added to cover the code path.
- [ ] Rely on existing unit tests on CI that covers the code path.

---------

Signed-off-by: ShareLer <ShareLe@163.com>
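The ordering change described above can be illustrated with a simplified sketch (all names here are hypothetical, not verl's or KimiVL's actual code): merge the image features into the token embeddings first, then shard the fused `inputs_embeds` across SP ranks, so the split can never misalign image features with image-token positions.

```python
# Hypothetical illustration: fuse image features into embeddings, THEN shard.

def merge_image_features(text_embeds, image_embeds, image_mask):
    """Replace embedding slots flagged by image_mask with image features, in order."""
    fused, img_iter = [], iter(image_embeds)
    for emb, is_img in zip(text_embeds, image_mask):
        fused.append(next(img_iter) if is_img else emb)
    return fused

def shard(embeds, rank, sp_size):
    """Contiguous per-rank slice (length assumed divisible by sp_size)."""
    chunk = len(embeds) // sp_size
    return embeds[rank * chunk : (rank + 1) * chunk]

text_embeds = ["t0", "t1", "<img>", "<img>", "t2", "t3"]
image_embeds = ["i0", "i1"]
image_mask = [False, False, True, True, False, False]

fused = merge_image_features(text_embeds, image_embeds, image_mask)
halves = [shard(fused, r, 2) for r in range(2)]  # sp_size=2
# halves == [["t0", "t1", "i0"], ["i1", "t2", "t3"]]
```

Sharding before the merge, by contrast, would require each rank to know exactly which image tokens fall inside its slice, which is the bookkeeping the deleted `_merge_with_image_features` patch was doing.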
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request, Jun 10, 2025
whatadayG pushed a commit to whatadayG/verl that referenced this pull request, Sep 5, 2025
chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request, Nov 14, 2025
chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request, Nov 14, 2025
TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request, Dec 20, 2025
TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request, Dec 20, 2025
vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request, Jan 22, 2026
vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request, Jan 22, 2026