Skip to content

[FSDPCheckpointManager] feat: save huggingface model when 'hf_model' in checkpoint_contents#1288

Merged
ETOgaosion merged 12 commits intoverl-project:mainfrom
0x404:fsdp_hfmodel
May 7, 2025
Merged

[FSDPCheckpointManager] feat: save huggingface model when 'hf_model' in checkpoint_contents#1288
ETOgaosion merged 12 commits intoverl-project:mainfrom
0x404:fsdp_hfmodel

Conversation

@0x404
Copy link
Copy Markdown
Collaborator

@0x404 0x404 commented Apr 28, 2025

Before, FSDPCheckpointManager will not save hf model when hf_model is given in checkpoint_contents, instead, it only save the hf model's config.

This PR correctly save the huggingface model when 'hf_model' is in checkpoint_contents.

@ETOgaosion
Copy link
Copy Markdown
Collaborator

Thanks a lot for contribution! It is a troubling bug.

@0x404
Copy link
Copy Markdown
Collaborator Author

0x404 commented May 5, 2025

Hi @ETOgaosion, I have merged the main branch and resolve some conflicts since FSDP2 is merged. Please take a look when you have time :)

@0x404
Copy link
Copy Markdown
Collaborator Author

0x404 commented May 6, 2025

looks like there are several CI failed due to Error: Connection timed out (api.github.com:443)

@ETOgaosion
Copy link
Copy Markdown
Collaborator

@ETOgaosion
Copy link
Copy Markdown
Collaborator

looks like there are several CI failed due to Error: Connection timed out (api.github.com:443)

Yeah, working on it, some CI machines got network error

@0x404
Copy link
Copy Markdown
Collaborator Author

0x404 commented May 6, 2025

Seems here is a bug? https://github.com/volcengine/verl/actions/runs/14829253212/job/41690118956?pr=1288

Yes, in test_fsdp_ckpt the model is initialized using from_config thus it doesn't have the name_or_path and therefore it doesn't have the pretrained generation config. In this case, we don't need to save the generation config. I have fixed this and let's rerun the CI.

@ETOgaosion ETOgaosion merged commit ba6a2e0 into verl-project:main May 7, 2025
28 of 34 checks passed
ETOgaosion pushed a commit that referenced this pull request May 16, 2025
…lity (#1468)

### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

This PR refactors `model_merge`, making the code cleaner and more
maintainable:

- now verl checkpointer manager will save model config and
processor/tokenizer (introduced in
#1288), so there is no need for
`hf_model_path`. This PR deprecates this argument and keeps it for
backward compatibility.
- the current `model_merge` has two purposes, merge checkpoints and test
checkpoints (mainly for CI). This PR separates these two purposes into
two sub-commands to better manage user input argument for improved user
experience.
- generally cleans up the code and makes it look better.

### Test
Our current CI hasn't tested DDP+FSDP e2e training. This PR also adds
DDP+FSDP e2e into CI and tests merging DDP+FSDP checkpoints.

The current CI should test this PR correctly.


### Additional Info.

- **Training**: both
- **Inference**: none

### Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if neccessary.
chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
…in checkpoint_contents (verl-project#1288)

Before, `FSDPCheckpointManager` will not save hf model when `hf_model`
is given in `checkpoint_contents`, instead, it only save the hf model's
config.

This PR correctly save the huggingface model when 'hf_model' is in
`checkpoint_contents`.
chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
…lity (verl-project#1468)

### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

This PR refactors `model_merge`, making the code cleaner and more
maintainable:

- now verl checkpointer manager will save model config and
processor/tokenizer (introduced in
verl-project#1288), so there is no need for
`hf_model_path`. This PR deprecates this argument and keeps it for
backward compatibility.
- the current `model_merge` has two purposes, merge checkpoints and test
checkpoints (mainly for CI). This PR separates these two purposes into
two sub-commands to better manage user input argument for improved user
experience.
- generally cleans up the code and makes it look better.

### Test
Our current CI hasn't tested DDP+FSDP e2e training. This PR also adds
DDP+FSDP e2e into CI and tests merging DDP+FSDP checkpoints.

The current CI should test this PR correctly.


### Additional Info.

- **Training**: both
- **Inference**: none

### Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if neccessary.
paolo328 added a commit to paolo328/Verl that referenced this pull request Nov 27, 2025
…lity (#1468)

### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

This PR refactors `model_merge`, making the code cleaner and more
maintainable:

- now verl checkpointer manager will save model config and
processor/tokenizer (introduced in
verl-project/verl#1288), so there is no need for
`hf_model_path`. This PR deprecates this argument and keeps it for
backward compatibility.
- the current `model_merge` has two purposes, merge checkpoints and test
checkpoints (mainly for CI). This PR separates these two purposes into
two sub-commands to better manage user input argument for improved user
experience.
- generally cleans up the code and makes it look better.

### Test
Our current CI hasn't tested DDP+FSDP e2e training. This PR also adds
DDP+FSDP e2e into CI and tests merging DDP+FSDP checkpoints.

The current CI should test this PR correctly.


### Additional Info.

- **Training**: both
- **Inference**: none

### Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if neccessary.
TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
…in checkpoint_contents (verl-project#1288)

Before, `FSDPCheckpointManager` will not save hf model when `hf_model`
is given in `checkpoint_contents`, instead, it only save the hf model's
config.

This PR correctly save the huggingface model when 'hf_model' is in
`checkpoint_contents`.
TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
…lity (verl-project#1468)

### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

This PR refactors `model_merge`, making the code cleaner and more
maintainable:

- now verl checkpointer manager will save model config and
processor/tokenizer (introduced in
verl-project#1288), so there is no need for
`hf_model_path`. This PR deprecates this argument and keeps it for
backward compatibility.
- the current `model_merge` has two purposes, merge checkpoints and test
checkpoints (mainly for CI). This PR separates these two purposes into
two sub-commands to better manage user input argument for improved user
experience.
- generally cleans up the code and makes it look better.

### Test
Our current CI hasn't tested DDP+FSDP e2e training. This PR also adds
DDP+FSDP e2e into CI and tests merging DDP+FSDP checkpoints.

The current CI should test this PR correctly.


### Additional Info.

- **Training**: both
- **Inference**: none

### Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if neccessary.
vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026
…in checkpoint_contents (verl-project#1288)

Before, `FSDPCheckpointManager` will not save hf model when `hf_model`
is given in `checkpoint_contents`, instead, it only save the hf model's
config.

This PR correctly save the huggingface model when 'hf_model' is in
`checkpoint_contents`.
vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026
…lity (verl-project#1468)

### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

This PR refactors `model_merge`, making the code cleaner and more
maintainable:

- now verl checkpointer manager will save model config and
processor/tokenizer (introduced in
verl-project#1288), so there is no need for
`hf_model_path`. This PR deprecates this argument and keeps it for
backward compatibility.
- the current `model_merge` has two purposes, merge checkpoints and test
checkpoints (mainly for CI). This PR separates these two purposes into
two sub-commands to better manage user input argument for improved user
experience.
- generally cleans up the code and makes it look better.

### Test
Our current CI hasn't tested DDP+FSDP e2e training. This PR also adds
DDP+FSDP e2e into CI and tests merging DDP+FSDP checkpoints.

The current CI should test this PR correctly.


### Additional Info.

- **Training**: both
- **Inference**: none

### Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if neccessary.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants