
Conversation

Contributor

@ShareLer ShareLer commented Jun 7, 2025

Checklist Before Starting

  • Search for similar PR(s).

What does this PR do?

Fix sequence parallelism conflict in kimiVL patch.

Background:
A recent VLM-related PR (#1739) changed the sequence parallelism logic for VLMs: inputs_embeds is now split after the model's embedding layer, instead of splitting input_ids and position_ids before the forward pass.
However, the SP logic I implemented in KimiVL's PR (#1639) still followed the old approach, splitting the sequence at the point where image tokens and text tokens are combined in order to avoid the 'Image features and image tokens do not match' error.
Because these two PRs were developed in parallel, they conflicted logically once both were merged.
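
To make the conflict concrete, here is a minimal, hedged sketch of the newer approach described above: each SP rank keeps only its shard of inputs_embeds after the embedding layer (image features already merged in), so no special handling of image tokens is needed. The helper name and shapes are illustrative assumptions, not verl's actual implementation.

```python
import torch
import torch.distributed as dist

def shard_inputs_embeds(inputs_embeds: torch.Tensor, sp_group) -> torch.Tensor:
    """Illustrative only: keep this SP rank's shard of the sequence dimension.

    inputs_embeds: [batch, seq_len, hidden], produced by the embedding layer with
    image features already merged in, so the image/text boundary is irrelevant here.
    """
    sp_size = dist.get_world_size(sp_group)
    sp_rank = dist.get_rank(sp_group)
    seq_len = inputs_embeds.size(1)
    assert seq_len % sp_size == 0, "pad the sequence to a multiple of sp_size first"
    shard = seq_len // sp_size
    return inputs_embeds[:, sp_rank * shard:(sp_rank + 1) * shard, :]
```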

High-Level Design

Demonstrate the high-level design if this PR is complex.

Specific Changes

  • Delete the patch for _merge_with_image_features that assigned image tokens to the corresponding SP rank.
  • Adjust the position_ids handling in _ulysses_flash_attn_forward (see the sketch below).
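
For orientation only, a hedged sketch of what the second bullet involves: once Ulysses slices the sequence dimension, position_ids must be sliced the same way so it matches the local shard inside the attention patch. The function name and shapes below are assumptions for illustration, not verl's actual code.

```python
import torch

def slice_position_ids(position_ids: torch.Tensor, sp_rank: int, sp_size: int) -> torch.Tensor:
    """Illustrative helper: keep this SP rank's slice of position_ids.

    Works for both [batch, seq_len] and mrope-style [3, batch, seq_len] layouts,
    since the sequence is the last dimension in either case.
    """
    seq_dim = position_ids.dim() - 1
    seq_len = position_ids.size(seq_dim)
    assert seq_len % sp_size == 0, "pad the sequence to a multiple of sp_size first"
    shard = seq_len // sp_size
    return position_ids.narrow(seq_dim, sp_rank * shard, shard)
```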

API

Demonstrate how the API changes if any.

Usage Example

Provide usage example(s) for easier usage.

# Add code snippet or script demonstrating how to use this 

Test

![image](https://github.com/user-attachments/assets/82ef7a74-66f8-4bb0-a0fc-3702b215c8c0)

Additional Info.

  • Issue Number: Fixes issue # or discussion # if any.
  • Training: [Note which backend this PR will affect: FSDP, Megatron, both, or none]
  • Inference: [Note which backend this PR will affect: vLLM, SGLang, both, or none]

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add [BREAKING] to the PR title if it breaks any API.
  • Update the documentation about your changes in the docs.
  • New CI unit test(s) are added to cover the code path.
  • Rely on existing unit tests on CI that covers the code path.

@vermouth1992 vermouth1992 requested a review from hiyouga June 7, 2025 02:04
Collaborator

@eric-haibin-lin eric-haibin-lin left a comment

thanks! would u mind adding a unit test in tests/models/test_transformers_ulysses.py that reproduces this error?

Collaborator

@hiyouga hiyouga left a comment

LGTM. Now we should split only the input ids, not the image features.

@vermouth1992 vermouth1992 merged commit ea121f0 into volcengine:main Jun 10, 2025
34 of 35 checks passed
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Jun 10, 2025
@ShareLer
Contributor Author

> thanks! would u mind adding a unit test in tests/models/test_transformers_ulysses.py that reproduces this error?

It might not be possible to add a unit test for KimiVL, because the unit tests in tests/models/test_transformers_ulysses.py rely on importing the model's config from transformers (e.g. 'KimiVLConfig') to create a mock model via from_config, and transformers does not define KimiVL's model or config classes.

Ulysses-related unit tests for VLMs are still needed, though. Perhaps I can start by adding a Ulysses unit test for QwenVL.
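
For context on the from_config pattern mentioned above, a small sketch (my own illustration, not the test file's code) of why the mock-model approach only works for configs that transformers itself defines:

```python
from transformers import AutoModelForCausalLM, LlamaConfig

# Works: LlamaConfig ships with transformers, so a tiny mock model can be
# instantiated offline from a hand-written config.
config = LlamaConfig(
    hidden_size=64,
    intermediate_size=128,
    num_hidden_layers=2,
    num_attention_heads=4,
    num_key_value_heads=4,
    vocab_size=128,
)
model = AutoModelForCausalLM.from_config(config)

# Does not work for KimiVL: transformers defines neither a KimiVLConfig nor the
# matching model class, so there is nothing to import and nothing to build from.
# from transformers import KimiVLConfig  # ImportError
```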

@hiyouga
Collaborator

hiyouga commented Jun 10, 2025

@ShareLer The unit test for qwen2vl Ulysses already exists:

- name: Running Geo3k VLM PPO E2E training tests on 8 L20 GPUs with rmpad using function rm
  run: |
    ray stop --force
    TRAIN_FILES=$HOME/data/geo3k/train.parquet VAL_FILES=$HOME/data/geo3k/test.parquet \
      MAX_PROMPT_LEN=1536 MAX_RESPONSE_LEN=1536 \
      MODEL_ID=Qwen/Qwen2-VL-2B-Instruct \
      ADV_ESTIMATOR=gae RM_PAD=True USE_KL=True ENABLE_CHUNKED_PREFILL=False \
      SP_SIZE=2 \
      bash tests/e2e/ppo_trainer/run_function_reward.sh

I think the current state is sufficient.

whatadayG pushed a commit to whatadayG/verl that referenced this pull request Sep 5, 2025
chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025