
Conversation

@Arvin-xiong

Recently, I have been trying to use FSDP to train llava_next, but I need to find the minimum splitting module, which is _no_split_modules in the code. However, I did not find the corresponding class implementation in modeling_llava_next.py (LlavaNextVisionAttention in v4.51.3 and LlamaDecoderLayer in the latest code).
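
For reference, _no_split_modules is a class-level attribute, so it can be inspected without loading any weights. A minimal sketch; the printed value depends on the installed transformers version:

from transformers import LlavaNextForConditionalGeneration

# _no_split_modules is defined on the model class itself, so no checkpoint
# download is needed. The value changed between releases, e.g.
# ["LlavaNextVisionAttention"] in v4.51.3 vs. ["LlamaDecoderLayer"] later.
print(LlavaNextForConditionalGeneration._no_split_modules)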

@github-actions
Contributor

Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the Ready for review button (at the bottom of the PR page). This will assign reviewers and trigger CI.

@github-actions github-actions bot marked this pull request as draft May 15, 2025 08:13
@Arvin-xiong Arvin-xiong marked this pull request as ready for review May 15, 2025 08:21
@github-actions github-actions bot requested a review from zucchini-nlp May 15, 2025 08:22
Member

@zucchini-nlp zucchini-nlp left a comment


Makes sense to me! Though I would expect _no_split_modules to be correctly propagated from the module's children. Since LLaVA can support any vision/LM backbone, we can't list all possible module class names here.

A scalable solution would be to check why CLIP's _no_split_modules didn't get propagated, since I remember it worked fine a while ago. @Arvin-xiong would you like to investigate it further?
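
As an aside, the backbones already declare the attribute themselves, so the expected propagation can be checked directly (a sketch; the exact lists depend on the transformers version):

from transformers import CLIPVisionModel, LlavaNextForConditionalGeneration

# Each backbone carries its own _no_split_modules; the expectation above is
# that the composite LLaVA-NeXT model propagates these from its children.
print(CLIPVisionModel._no_split_modules)
print(LlavaNextForConditionalGeneration._no_split_modules)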

@Arvin-xiong
Author

A scalable solution would be to check why CLIP's _no_split_modules didn't get propagated, since I remember it worked fine a while ago. @Arvin-xiong would you like to investigate it further?

I did some research. For llava-next, _no_split_modules was initially set to ["LlavaNextVisionAttention"] and then changed to ["LlamaDecoderLayer"] in a later PR, but CLIP was never mentioned. I also found some comments from the previous llava-next PR, FYI.

get_module_class_from_name in the fsdp_auto_wrap_policy function recursively traverses the model's children to match the names listed in _no_split_modules, so it should not be able to pick up any CLIP-related layers.

I suspect this is a legacy issue from before, but if there is any other relevant information you can provide, I will continue to investigate further.
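
For reference, a simplified paraphrase of accelerate's get_module_class_from_name (reconstructed from memory; the real implementation may differ in detail). It matches submodule class names against a given name, so a name that never appears among the model's children is simply not found:

import torch.nn as nn

def get_module_class_from_name(module: nn.Module, name: str):
    # Return the class of the first (sub)module whose class name matches `name`,
    # searching depth-first through the module tree; None if nothing matches.
    if module.__class__.__name__ == name:
        return module.__class__
    for child in module.children():
        found = get_module_class_from_name(child, name)
        if found is not None:
            return found
    return None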

@zucchini-nlp
Member

zucchini-nlp commented May 16, 2025

@Arvin-xiong ah, then that comes from a different repo, because AFAIK the code block below is able to grab all class attributes when loading, and it is what we use for splitting layers per device in multi-GPU inference.

I hope this helps; we probably need to update PEFT in this case, so that each no_split_module can be wrapped in its own instance.

def _get_no_split_modules(self, device_map: str):
    """
    Get the modules of the model that should not be split when using device_map. We iterate through the modules
    to get the underlying `_no_split_modules`.

    Args:
        device_map (`str`):
            The device map value. Options are ["auto", "balanced", "balanced_low_0", "sequential"]

    Returns:
        `List[str]`: List of modules that should not be split
    """

@zucchini-nlp
Member

cc @BenjaminBossan from PEFT as well

@Arvin-xiong
Author

Thanks for your patient reply.

@BenjaminBossan
Member

I tried to understand the issue but I'm not very familiar with the mechanics of _no_split_modules. What is PEFT supposed to do in this case?

@zucchini-nlp
Member

@BenjaminBossan the _no_split_modules attribute indicates which modules should not be split across different devices when auto-splitting to multi-GPU with accelerate. I didn't know it was used by PEFT to wrap each module in its own instance, but it makes sense.

So I think recursing over the model's children to collect all _no_split_modules will solve the issue for VLM models.
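
For reference, this is the multi-GPU inference path where the attribute matters (a usage sketch; the checkpoint name is just an example):

from transformers import LlavaNextForConditionalGeneration

# device_map="auto" lets accelerate shard the model across the available GPUs;
# classes listed in _no_split_modules are kept whole on a single device.
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",  # example checkpoint
    device_map="auto",
)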

@BenjaminBossan
Member

So I think recursing over the model's children to collect all _no_split_modules will solve the issue for VLM models.

Okay, so here, instead of checking only the model itself, all children would need to be visited:

https://github.com/huggingface/peft/blob/6d133307adf28e11d061c20aa1f040c1d26d2012/src/peft/utils/other.py#L959-L961
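
Conceptually, the change is to collect the attribute from every submodule rather than only from the top-level model; a hedged sketch of that idea (not the actual PEFT patch):

import torch.nn as nn

def collect_no_split_modules(model: nn.Module) -> set[str]:
    # Visit the model itself and all of its submodules (model.modules() includes
    # `model`), instead of reading _no_split_modules from the top level only.
    return {
        name
        for module in model.modules()
        for name in (getattr(module, "_no_split_modules", None) or [])
    }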

BenjaminBossan added a commit to BenjaminBossan/peft that referenced this pull request Jun 4, 2025
See discussion in huggingface/transformers#38141
for context.

In the PEFT fsdp_auto_wrap_policy, we determine the _no_split_modules.
However, this currently neglects to visit the children of the model,
which can be required for some architectures. This PR fixes that.

Note that the _get_no_split_modules function is largely copied from
transformers. One change is that it doesn't take the device_map
argument. That argument is used in transformers inside an error message
but not for the logic proper. I think it's safe to remove.

Moreover, I made an unrelated change to fsdp_auto_wrap_policy, namely
making local imports global (there was no reason for them to be local).
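
For completeness, this is roughly how the PEFT helper is wired into an accelerate FSDP setup, following the pattern in the PEFT FSDP docs (a usage sketch; `model` stands for the PEFT-wrapped model from the training script):

from accelerate import Accelerator
from peft.utils.other import fsdp_auto_wrap_policy

def apply_peft_fsdp_policy(accelerator: Accelerator, model) -> None:
    # Assumes FSDP was enabled via `accelerate config`, so fsdp_plugin is set.
    # With the fix above, fsdp_auto_wrap_policy also picks up _no_split_modules
    # declared by the model's children (e.g. the vision tower in LLaVA-NeXT).
    accelerator.state.fsdp_plugin.auto_wrap_policy = fsdp_auto_wrap_policy(model)
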
@BenjaminBossan
Member

I created a PR which I think should address the issue: huggingface/peft#2570. Could you please check?

BenjaminBossan added a commit to huggingface/peft that referenced this pull request Jun 16, 2025
efraimdahl pushed a commit to efraimdahl/peft that referenced this pull request Jul 12, 2025