[FSDPCheckpointManager] feat: save huggingface model when 'hf_model' in checkpoint_contents #1288
Thanks a lot for the contribution! This is a troubling bug.
Hi @ETOgaosion, I have merged the main branch and resolved some conflicts now that FSDP2 is merged. Please take a look when you have time :)
looks like there are several CI failed due to |
Yeah, working on it; some CI machines hit network errors.
Yes, in |
…lity (#1468)

### Checklist Before Starting
- [x] Search for similar PR(s).

### What does this PR do?
This PR refactors `model_merge`, making the code cleaner and more maintainable:
- The verl checkpoint manager now saves the model config and processor/tokenizer (introduced in #1288), so `hf_model_path` is no longer needed. This PR deprecates the argument but keeps it for backward compatibility.
- The current `model_merge` serves two purposes: merging checkpoints and testing checkpoints (mainly for CI). This PR separates them into two sub-commands so that user input arguments are easier to manage.
- General code cleanup.

### Test
Our current CI does not test DDP+FSDP e2e training. This PR adds DDP+FSDP e2e to CI and tests merging DDP+FSDP checkpoints. The current CI should test this PR correctly.

### Additional Info.
- **Training**: both
- **Inference**: none

### Checklist Before Submitting
- [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.
…in checkpoint_contents (verl-project#1288) Previously, `FSDPCheckpointManager` did not save the HF model when `hf_model` was given in `checkpoint_contents`; it only saved the HF model's config. This PR correctly saves the Hugging Face model when `hf_model` is in `checkpoint_contents`.
Previously, `FSDPCheckpointManager` did not save the HF model when `hf_model` was given in `checkpoint_contents`; it only saved the HF model's config.

This PR correctly saves the Hugging Face model when `hf_model` is in `checkpoint_contents`.
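
For readers who want to see the behavior change in isolation, here is a minimal sketch of the gating logic. It is not the actual `FSDPCheckpointManager` code: `ToyModel`, `save_checkpoint`, the `huggingface` subdirectory, and the file names are all hypothetical stand-ins, and plain JSON files take the place of real model weights.

```python
import json
import os
import tempfile


class ToyModel:
    """Hypothetical stand-in for a Hugging Face model (not a transformers class)."""

    def __init__(self, config, weights):
        self.config = config
        self.weights = weights

    def save_config(self, path):
        with open(os.path.join(path, "config.json"), "w") as f:
            json.dump(self.config, f)

    def save_weights(self, path):
        with open(os.path.join(path, "weights.json"), "w") as f:
            json.dump(self.weights, f)


def save_checkpoint(model, local_path, checkpoint_contents):
    """Save the config always; save the weights only when 'hf_model' is requested.

    Returns the sorted list of files written to the (assumed) huggingface dir.
    """
    hf_dir = os.path.join(local_path, "huggingface")  # assumed directory name
    os.makedirs(hf_dir, exist_ok=True)

    # The config was already being saved before this PR, per the description.
    model.save_config(hf_dir)

    # The fix described above: also write the full model when asked to.
    if "hf_model" in checkpoint_contents:
        model.save_weights(hf_dir)

    return sorted(os.listdir(hf_dir))


if __name__ == "__main__":
    toy = ToyModel({"hidden_size": 8}, {"w": [0.1, 0.2]})
    with tempfile.TemporaryDirectory() as tmp:
        print(save_checkpoint(toy, os.path.join(tmp, "a"), ["model", "optimizer"]))
        print(save_checkpoint(toy, os.path.join(tmp, "b"), ["model", "hf_model"]))
```

In this sketch the config is written unconditionally and the weights are gated on `'hf_model'`; whether the real manager writes the config in every case is an assumption here.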