
[model merger] refactor model merger for better usage and maintainability#1468

Merged

ETOgaosion merged 9 commits into verl-project:main from 0x404:model_merger on May 16, 2025

Conversation

@0x404 (Collaborator) commented May 10, 2025

Checklist Before Starting

  • Search for similar PR(s).

What does this PR do?

This PR refactors `model_merge`, making the code cleaner and more maintainable:

  • The verl checkpoint manager now saves the model config and processor/tokenizer (introduced in [FSDPCheckpointManager] feat: save huggingface model when 'hf_model' in checkpoint_contents #1288), so `hf_model_path` is no longer needed. This PR deprecates the argument but keeps it for backward compatibility.
  • The current `model_merge` serves two purposes: merging checkpoints and testing checkpoints (mainly for CI). This PR separates them into two sub-commands to better manage user input arguments and improve the user experience.
  • It also generally cleans up the code and improves readability.
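The sub-command split described above can be sketched with `argparse`; this is an illustrative sketch only, and the flag names below are assumptions, not the actual `model_merger` CLI:

```python
import argparse

def build_parser():
    # Hypothetical sketch of a two-sub-command CLI in the spirit of the
    # refactored model_merger: "merge" writes a merged checkpoint, "test"
    # verifies one (mainly for CI). Flag names are illustrative.
    parser = argparse.ArgumentParser(prog="model_merger")
    sub = parser.add_subparsers(dest="command", required=True)

    merge = sub.add_parser("merge", help="merge sharded checkpoints")
    merge.add_argument("--backend", choices=["fsdp", "megatron"], required=True)
    merge.add_argument("--local_dir", required=True)
    merge.add_argument("--target_dir", default="tmp")

    test = sub.add_parser("test", help="check a merged checkpoint against a reference")
    test.add_argument("--backend", choices=["fsdp", "megatron"], required=True)
    test.add_argument("--local_dir", required=True)
    test.add_argument("--test_hf_dir", required=True)
    return parser

args = build_parser().parse_args(
    ["merge", "--backend", "fsdp", "--local_dir", "ckpt/global_step_10/actor"]
)
print(args.command, args.backend)  # merge fsdp
```

Each sub-command declares only the arguments it needs, so merge-only and test-only options no longer share one flat namespace.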

Test

Our current CI hasn't tested DDP+FSDP e2e training. This PR also adds DDP+FSDP e2e into CI and tests merging DDP+FSDP checkpoints.

The current CI should test this PR correctly.

Additional Info.

  • Training: both
  • Inference: none

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add [BREAKING] to the PR title if it breaks any API.
  • Update the documentation about your changes in the docs.
  • Add CI test(s) if necessary.

@ETOgaosion (Collaborator)

Excellent work! This is very helpful.

@ETOgaosion ETOgaosion requested a review from ccclyu May 15, 2025 06:01
@0x404 (Collaborator, Author) commented May 15, 2025

The CI failed because the Megatron checkpoint manager currently doesn't save the hf model config. As a temporary workaround, we can use `--hf_model_path`. Hi @ETOgaosion, could you please re-trigger the CI to verify correctness?

@ccclyu (Collaborator) left a comment

LGTM. Thanks for the great contribution!

@ETOgaosion (Collaborator) commented May 16, 2025

There might be an issue with Qwen3 `model_merger`, but it doesn't seem to be our bug. Could you also track this and see whether there is a good solution when you are free? @0x404

#1484

@0x404 (Collaborator, Author) commented May 16, 2025

This CI run (https://github.com/volcengine/verl/actions/runs/15061995464/job/42342396903) failed due to the breaking change in model merge, which now requires a sub-command. I have updated the Qwen3 e2e CI; let's re-run the CI.

There might be an issue with Qwen3 `model_merger`, but it doesn't seem to be our bug. Could you also track this and see whether there is a good solution when you are free? @0x404

Yes, I will take a look later.

@ETOgaosion ETOgaosion merged commit 3f4647f into verl-project:main May 16, 2025
29 of 31 checks passed
vermouth1992 pushed a commit that referenced this pull request May 18, 2025
### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

This PR fixes several issues
(#1484,
#1255) that cause the error:
"Cannot copy out of meta tensor; no data!".

The related code in our part is:

https://github.com/volcengine/verl/blob/d36b5e81d6de598a87eecc6edf658ece0eb43582/scripts/model_merger.py#L131-L132

The `torch.device("meta")` context manager sets the current global torch
device to "meta". During `auto_model_class.from_config`, various import
statements load third-party libraries, whose `__init__.py` files may
contain global statements that use torch for calculations.

For example, transformers imports
[torchao](https://github.com/pytorch/ao/blob/5549da8af975be6ff14330feb56c4abe3405b6f9/torchao/optim/subclass_4bit.py#L33),
which executes the following during initialization:

```python
QMAP_UNSIGNED = torch.linspace(0, 1, 17)[1:].tolist()  # no zero
```

In this case, when using the `torch.device("meta")` context manager,
`torch.linspace(0, 1, 17)` gets created on the meta device, which only
assigns metadata and cannot be moved to CPU. This causes the `.tolist()`
call to fail with the error "Cannot copy out of meta tensor; no data!"

To fix this, we're now using `init_empty_weights` from `accelerate`,
which patches `nn.Module.register_parameter` instead of patching torch's
global device
(https://github.com/huggingface/accelerate/blob/417bc529654a70e61013fd21263826a2f1f9e1a6/src/accelerate/big_modeling.py#L96-L170),
thus avoiding this issue.

Here's a simple illustration:

```python
>>> import torch
>>> from accelerate import init_empty_weights
>>> with init_empty_weights():
...     QMAP_UNSIGNED = torch.linspace(0, 1, 17)[1:].tolist()
... 
>>> QMAP_UNSIGNED
[0.0625, 0.125, 0.1875, 0.25, 0.3125, 0.375, 0.4375, 0.5, 0.5625, 0.625, 0.6875, 0.75, 0.8125, 0.875, 0.9375, 1.0]
>>> with torch.device("meta"):
...     QMAP_UNSIGNED = torch.linspace(0, 1, 17)[1:].tolist()
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 104, in __torch_function__
    return func(*args, **kwargs)
NotImplementedError: Cannot copy out of meta tensor; no data!
```
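The `register_parameter`-patching idea used by `accelerate` can be mimicked in a few lines. This is a minimal sketch of the technique, not the real `init_empty_weights` implementation: the patch moves only registered parameters to the meta device, so ordinary tensor math elsewhere is unaffected.

```python
from contextlib import contextmanager

import torch
import torch.nn as nn

@contextmanager
def init_empty_weights_sketch():
    # Sketch of the idea behind accelerate's init_empty_weights: patch
    # nn.Module.register_parameter so parameters land on the meta device,
    # instead of patching torch's global default device.
    old_register = nn.Module.register_parameter

    def register_on_meta(module, name, param):
        old_register(module, name, param)
        if param is not None:
            meta = param.to("meta")
            module._parameters[name] = nn.Parameter(meta, requires_grad=param.requires_grad)

    nn.Module.register_parameter = register_on_meta
    try:
        yield
    finally:
        nn.Module.register_parameter = old_register

with init_empty_weights_sketch():
    layer = nn.Linear(4, 2)  # parameters are meta: shapes only, no storage
    side_effect = torch.linspace(0, 1, 5).tolist()  # still a real CPU tensor

print(layer.weight.device)  # meta
print(side_effect)          # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Because only parameter registration is intercepted, module-level code that runs during imports (like the torchao `QMAP_UNSIGNED` line above) keeps producing real CPU tensors.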

cc @ETOgaosion 

### Additional Info.

- **Issue Number**: Fixes issue
#1484,
#1255,
#1468 (comment)
- **Training**: both
- **Inference**: none

### Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.
ETOgaosion pushed a commit that referenced this pull request May 23, 2025
…1562)

### Checklist Before Starting

- [x] Search for similar PR(s).

### What does this PR do?

This PR enables the Megatron backend checkpoint manager to save the hf
model config into verl checkpoints, and simplifies our CI, since
`--hf_model_path` was deprecated in
#1468; it also addresses the comment in
#1468 (comment).

Note: several changed lines in `verl/utils/megatron_utils.py` are
unrelated to this PR; they were automatically reformatted by pre-commit
hooks.

### Test

The current CI e2e tests should sufficiently cover this PR.

### Additional Info.

- **Training**: Megatron
- **Inference**: none

### Checklist Before Submitting

- [x] Read the [Contribute
Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [x] Apply [pre-commit
checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [x] Add `[BREAKING]` to the PR title if it breaks any API.
- [x] Update the documentation about your changes in the
[docs](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add CI test(s) if necessary.
ETOgaosion pushed a commit to Jianbing-D/verl that referenced this pull request Jun 8, 2025
…erl-project#1562)

wwwjn pushed a commit to wwwjn/verl that referenced this pull request Jun 10, 2025
…erl-project#1562)

chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
…lity (verl-project#1468)

chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
…ct#1564)

chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
…erl-project#1562)

paolo328 added a commit to paolo328/Verl that referenced this pull request Nov 27, 2025
paolo328 added a commit to paolo328/Verl that referenced this pull request Nov 27, 2025
…(#1562)

TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
…lity (verl-project#1468)

TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
…ct#1564)

TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
…erl-project#1562)

vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026
…lity (verl-project#1468)

vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026
…ct#1564)

vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026
…erl-project#1562)
