[model merger] refactor model merger for better usage and maintainability #1468
ETOgaosion merged 9 commits into verl-project:main from
Conversation
Excellent work! This is very helpful.
…till use --hf_model_path
The CI failed because the Megatron checkpoint manager currently doesn't save the hf model config. As a temporary solution, we can still use the `--hf_model_path` argument.
ccclyu
left a comment
LGTM. Thanks for the great contribution!
https://github.com/volcengine/verl/actions/runs/15061995464/job/42342396903 — this CI run failed due to the breaking change in the model merger, which now requires a sub-command. I have updated the Qwen3 e2e CI; let's re-run the CI.
Yes, I will take a look later.
### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

This PR fixes several issues (#1484, #1255) that cause the error "Cannot copy out of meta tensor; no data!". The related code on our side is:

https://github.com/volcengine/verl/blob/d36b5e81d6de598a87eecc6edf658ece0eb43582/scripts/model_merger.py#L131-L132

The `torch.device("meta")` context manager sets the current global torch device to "meta". During `auto_model_class.from_config`, various import statements load third-party libraries whose `__init__.py` files may contain module-level statements that use torch for computation. For example, transformers imports [torchao](https://github.com/pytorch/ao/blob/5549da8af975be6ff14330feb56c4abe3405b6f9/torchao/optim/subclass_4bit.py#L33), which executes the following during initialization:

```python
QMAP_UNSIGNED = torch.linspace(0, 1, 17)[1:].tolist()  # no zero
```

Under the `torch.device("meta")` context manager, `torch.linspace(0, 1, 17)` is created on the meta device, which only carries metadata and cannot be moved to CPU. The `.tolist()` call therefore fails with "Cannot copy out of meta tensor; no data!".

To fix this, we now use `init_empty_weights` from `accelerate`, which patches `nn.Module.register_parameter` instead of patching torch's global device (https://github.com/huggingface/accelerate/blob/417bc529654a70e61013fd21263826a2f1f9e1a6/src/accelerate/big_modeling.py#L96-L170), thus avoiding the issue. Here's a simple illustration:

```python
>>> import torch
>>> from accelerate import init_empty_weights
>>> with init_empty_weights():
...     QMAP_UNSIGNED = torch.linspace(0, 1, 17)[1:].tolist()
...
>>> QMAP_UNSIGNED
[0.0625, 0.125, 0.1875, 0.25, 0.3125, 0.375, 0.4375, 0.5, 0.5625, 0.625, 0.6875, 0.75, 0.8125, 0.875, 0.9375, 1.0]
>>> with torch.device("meta"):
...     QMAP_UNSIGNED = torch.linspace(0, 1, 17)[1:].tolist()
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 104, in __torch_function__
    return func(*args, **kwargs)
NotImplementedError: Cannot copy out of meta tensor; no data!
```

cc @ETOgaosion

### Additional Info.

- **Issue Number**: Fixes issue #1484, #1255, #1468 (comment)
- **Training**: both
- **Inference**: none

### Checklist Before Submitting

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.
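To see why patching `register_parameter` sidesteps the global-device problem, here is a minimal, self-contained sketch of the idea behind `accelerate`'s `init_empty_weights` (this is an illustrative re-implementation, not verl's or accelerate's actual code): module parameters land on the meta device without allocating storage, while ordinary tensor math keeps running on the default device.

```python
import torch
import torch.nn as nn
from contextlib import contextmanager


@contextmanager
def init_on_meta():
    """Sketch of the init_empty_weights idea: patch nn.Module.register_parameter
    so parameters land on "meta", leaving torch's default device untouched."""
    old_register = nn.Module.register_parameter

    def register_empty_parameter(module, name, param):
        old_register(module, name, param)
        if param is not None:
            # Re-create the parameter on the meta device: shape/dtype only, no storage.
            meta_param = module._parameters[name].detach().to("meta")
            module._parameters[name] = nn.Parameter(
                meta_param, requires_grad=param.requires_grad
            )

    nn.Module.register_parameter = register_empty_parameter
    try:
        yield
    finally:
        nn.Module.register_parameter = old_register


with init_on_meta():
    layer = nn.Linear(4, 4)                  # parameters become meta tensors
    qmap = torch.linspace(0, 1, 5).tolist()  # still works: created on CPU

print(layer.weight.is_meta)  # True
print(qmap)                  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Because only parameter registration is intercepted, module-level code in third-party `__init__.py` files keeps allocating real CPU tensors and `.tolist()` succeeds.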
…1562)

### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

This PR enables the Megatron backend checkpoint manager to save the hf model config into verl checkpoints and simplifies our CI, since `--hf_model_path` was deprecated in #1468; fixes the comment in #1468 (comment).

Note: several changed lines in `verl/utils/megatron_utils.py` are unrelated to this PR; they were automatically reformatted by pre-commit hooks.

### Test

The current CI e2e tests should sufficiently cover this PR.

### Additional Info.

- **Training**: Megatron
- **Inference**: none

### Checklist Before Submitting

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.
…lity (verl-project#1468)

### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

This PR refactors `model_merger`, making the code cleaner and more maintainable:

- The verl checkpoint manager now saves the model config and processor/tokenizer (introduced in verl-project#1288), so `hf_model_path` is no longer needed. This PR deprecates the argument but keeps it for backward compatibility.
- The current `model_merger` serves two purposes: merging checkpoints and testing checkpoints (mainly for CI). This PR separates them into two sub-commands to better manage user input arguments and improve the user experience.
- General code cleanup.

### Test

Our current CI hasn't tested DDP+FSDP e2e training. This PR also adds a DDP+FSDP e2e test to CI and tests merging DDP+FSDP checkpoints. The current CI should cover this PR correctly.

### Additional Info.

- **Training**: both
- **Inference**: none

### Checklist Before Submitting

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.
Checklist Before Starting

- Search for similar PR(s).

What does this PR do?

This PR refactors `model_merger`, making the code cleaner and more maintainable:

- The verl checkpoint manager now saves the model config and processor/tokenizer, so `hf_model_path` is no longer needed. This PR deprecates the argument but keeps it for backward compatibility.
- The current `model_merger` serves two purposes: merging checkpoints and testing checkpoints (mainly for CI). This PR separates them into two sub-commands to better manage user input arguments and improve the user experience.
- General code cleanup.

Test

Our current CI hasn't tested DDP+FSDP e2e training. This PR also adds a DDP+FSDP e2e test to CI and tests merging DDP+FSDP checkpoints.

The current CI should test this PR correctly.

Additional Info.

- Training: both
- Inference: none

Checklist Before Submitting

- Read the Contribute Guide.
- Apply pre-commit checks.
- Add `[BREAKING]` to the PR title if it breaks any API.
- Update the documentation about your changes in the docs.
- Add CI test(s) if necessary.
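For illustration, the merge/test sub-command split described above could be wired up with `argparse` roughly as below. This is a generic sketch of the pattern; the flag names (`--local_dir`, `--target_dir`, `--test_hf_dir`) are illustrative assumptions, not necessarily verl's actual interface.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Generic sketch: one CLI split into "merge" and "test" sub-commands,
    # each with only the arguments relevant to its purpose.
    parser = argparse.ArgumentParser(prog="model_merger")
    subparsers = parser.add_subparsers(dest="command", required=True)

    merge = subparsers.add_parser("merge", help="merge sharded checkpoints into a hf model")
    merge.add_argument("--local_dir", required=True, help="path to the sharded checkpoint")
    merge.add_argument("--target_dir", default="tmp", help="output directory for the merged model")

    test = subparsers.add_parser("test", help="check a merged model against a reference (CI)")
    test.add_argument("--local_dir", required=True, help="path to the sharded checkpoint")
    test.add_argument("--test_hf_dir", required=True, help="reference hf model to compare against")

    return parser


args = build_parser().parse_args(["merge", "--local_dir", "ckpt/global_step_1/actor"])
print(args.command, args.target_dir)  # merge tmp
```

The benefit over a single flat argument list is that each sub-command validates only its own required flags, so a user running `merge` never needs to know about test-only options.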