
Conversation

@Arvin-xiong

Recently, I have been trying to use FSDP to train llava_next, but I need to find the minimum splitting module, which is _no_split_modules in the code. However, I did not find the corresponding class implementation in modeling_llava_next.py (LlavaNextVisionAttention in v4.51.3 and LlamaDecoderLayer in the latest code).
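
For reference, _no_split_modules is a class-level attribute, so it can be inspected without loading any weights. A minimal sketch; the printed value depends on the installed transformers version:

from transformers import LlavaNextForConditionalGeneration

# _no_split_modules is defined on the model class itself, so no checkpoint
# download is needed. The value changed between releases, e.g.
# ["LlavaNextVisionAttention"] in v4.51.3 vs. ["LlamaDecoderLayer"] later.
print(LlavaNextForConditionalGeneration._no_split_modules)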

@github-actions
Contributor

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

@github-actions github-actions bot marked this pull request as draft May 15, 2025 08:13
@Arvin-xiong Arvin-xiong marked this pull request as ready for review May 15, 2025 08:21
@github-actions github-actions bot requested a review from zucchini-nlp May 15, 2025 08:22
Member

@zucchini-nlp zucchini-nlp left a comment


Makes sense to me! Though I would expect _no_split_modules to be correctly propagated from the module's children. Since LLaVA can support any vision/LM backbone, we can't list all possible module class names here.

A scalable solution would be to check why CLIP's _no_split_modules didn't get propagated, since I remember it worked fine a while ago. @Arvin-xiong would you like to investigate it further?
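
As an aside, the backbones already declare the attribute themselves, so the expected propagation can be checked directly (a sketch; the exact lists depend on the transformers version):

from transformers import CLIPVisionModel, LlavaNextForConditionalGeneration

# Each backbone carries its own _no_split_modules; the expectation above is
# that the composite LLaVA-NeXT model propagates these from its children.
print(CLIPVisionModel._no_split_modules)
print(LlavaNextForConditionalGeneration._no_split_modules)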

@Arvin-xiong
Author

A scalable solution would be to check why CLIP's _no_split_modules didn't get propagated, since I remember it worked fine a while ago. @Arvin-xiong would you like to investigate it further?

I did some research. For llava-next, _no_split_modules was initially set to ["LlavaNextVisionAttention"] and then changed to ["LlamaDecoderLayer"] in a later PR, but CLIP was never mentioned. I also found some comments from the previous llava-next PR, FYI.

get_module_class_from_name in the fsdp_auto_wrap_policy function recursively traverses the model's children to match the names listed in _no_split_modules, so it should not be able to pick up any CLIP-related layers.

I suspect this is a legacy issue from before, but if there is any other relevant information you can provide, I will continue to investigate further.
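
For reference, a simplified paraphrase of accelerate's get_module_class_from_name (reconstructed from memory; the real implementation may differ in detail). It matches submodule class names against a given name, so a name that never appears among the model's children is simply not found:

import torch.nn as nn

def get_module_class_from_name(module: nn.Module, name: str):
    # Return the class of the first (sub)module whose class name matches `name`,
    # searching depth-first through the module tree; None if nothing matches.
    if module.__class__.__name__ == name:
        return module.__class__
    for child in module.children():
        found = get_module_class_from_name(child, name)
        if found is not None:
            return found
    return None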

@zucchini-nlp
Member

zucchini-nlp commented May 16, 2025

@Arvin-xiong ah, then that comes from a different repo, because AFAIK the code block below is able to grab all class attributes when loading, and it is what we use for splitting layers per device in multi-GPU inference.

I hope this helps; we probably need to update PEFT in this case, so that each no_split_module can be wrapped in its own instance.

def _get_no_split_modules(self, device_map: str):
    """
    Get the modules of the model that should not be split when using device_map. We iterate through the modules
    to get the underlying `_no_split_modules`.

    Args:
        device_map (`str`):
            The device map value. Options are ["auto", "balanced", "balanced_low_0", "sequential"]

    Returns:
        `List[str]`: List of modules that should not be split
    """

@zucchini-nlp
Member

cc @BenjaminBossan from PEFT as well

@Arvin-xiong
Author

Thanks for your patient reply.

@BenjaminBossan
Member

I tried to understand the issue but I'm not very familiar with the mechanics of _no_split_modules. What is PEFT supposed to do in this case?

@zucchini-nlp
Member

@BenjaminBossan the _no_split_modules attribute indicates which modules should not be split across different devices when auto-splitting to multi-GPU with accelerate. I didn't know it was used by PEFT to wrap each module in its own instance, but it makes sense.

So I think recursing over the model's children to collect all _no_split_modules will solve the issue for VLM models.
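
For reference, this is the multi-GPU inference path where the attribute matters (a usage sketch; the checkpoint name is just an example):

from transformers import LlavaNextForConditionalGeneration

# device_map="auto" lets accelerate shard the model across the available GPUs;
# classes listed in _no_split_modules are kept whole on a single device.
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",  # example checkpoint
    device_map="auto",
)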

@BenjaminBossan
Member

So I think recursing over the model's children to collect all _no_split_modules will solve the issue for VLM models.

Okay, so here, instead of checking only the model itself, all children would need to be visited:

https://github.com/huggingface/peft/blob/6d133307adf28e11d061c20aa1f040c1d26d2012/src/peft/utils/other.py#L959-L961
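
Conceptually, the change is to collect the attribute from every submodule rather than only from the top-level model; a hedged sketch of that idea (not the actual PEFT patch):

import torch.nn as nn

def collect_no_split_modules(model: nn.Module) -> set[str]:
    # Visit the model itself and all of its submodules (model.modules() includes
    # `model`), instead of reading _no_split_modules from the top level only.
    return {
        name
        for module in model.modules()
        for name in (getattr(module, "_no_split_modules", None) or [])
    }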

BenjaminBossan added a commit to BenjaminBossan/peft that referenced this pull request Jun 4, 2025
See discussion in huggingface/transformers#38141
for context.

In the PEFT fsdp_auto_wrap_policy, we determine the _no_split_modules.
However, this currently neglects to visit the children of the model,
which can be required for some architectures. This PR fixes that.

Note that the _get_no_split_modules function is largely copied from
transformers. One change is that it doesn't take the device_map
argument. That argument is used in transformers inside an error message
but not for the logic proper. I think it's safe to remove.

Moreover, I made an unrelated change to fsdp_auto_wrap_policy, namely
making local imports global (there was no reason for them to be local).
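
For completeness, this is roughly how the PEFT helper is wired into an accelerate FSDP setup, following the pattern in the PEFT FSDP docs (a usage sketch; `model` stands for the PEFT-wrapped model from the training script):

from accelerate import Accelerator
from peft.utils.other import fsdp_auto_wrap_policy

def apply_peft_fsdp_policy(accelerator: Accelerator, model) -> None:
    # Assumes FSDP was enabled via `accelerate config`, so fsdp_plugin is set.
    # With the fix above, fsdp_auto_wrap_policy also picks up _no_split_modules
    # declared by the model's children (e.g. the vision tower in LLaVA-NeXT).
    accelerator.state.fsdp_plugin.auto_wrap_policy = fsdp_auto_wrap_policy(model)
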
@BenjaminBossan
Member

I created a PR which I think should address the issue: huggingface/peft#2570. Could you please check?

BenjaminBossan added a commit to huggingface/peft that referenced this pull request Jun 16, 2025
efraimdahl pushed a commit to efraimdahl/peft that referenced this pull request Jul 12, 2025