[merger] fix: avoid setting torch's global device to meta #1564

Merged

vermouth1992 merged 1 commit into verl-project:main from 0x404:fix_meta on May 18, 2025

Conversation

0x404 (Collaborator) commented May 18, 2025

Checklist Before Starting

  • Search for similar PR(s).

What does this PR do?

This PR fixes several issues (#1484, #1255) that cause the error: "Cannot copy out of meta tensor; no data!".

The relevant code on our side is:
https://github.com/volcengine/verl/blob/d36b5e81d6de598a87eecc6edf658ece0eb43582/scripts/model_merger.py#L131-L132

The torch.device("meta") context manager sets torch's global default device to "meta". During auto_model_class.from_config, various import statements load third-party libraries, and their __init__.py files may run module-level code that uses torch for computation.
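This global effect is easy to observe directly (a pure-torch illustration; the variable names here are just for demonstration):

```python
import torch

# Inside the context manager, every tensor factory call defaults to the
# meta device -- including module-level code run by any import that
# happens to execute inside the block.
with torch.device("meta"):
    x = torch.empty(2, 2)  # shape/dtype metadata only, no real storage

print(x.device.type)  # meta

# Outside the block, factory calls go back to the normal default device.
y = torch.empty(2, 2)
print(y.device.type)  # cpu
```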

For example, transformers imports [torchao](https://github.com/pytorch/ao/blob/5549da8af975be6ff14330feb56c4abe3405b6f9/torchao/optim/subclass_4bit.py#L33), which executes the following during initialization:

```python
QMAP_UNSIGNED = torch.linspace(0, 1, 17)[1:].tolist()  # no zero
```

In this case, when using the torch.device("meta") context manager, torch.linspace(0, 1, 17) gets created on the meta device, which only assigns metadata and cannot be moved to CPU. This causes the .tolist() call to fail with the error "Cannot copy out of meta tensor; no data!"

To fix this, the merger now uses init_empty_weights from accelerate, which patches nn.Module.register_parameter instead of patching torch's global device (https://github.com/huggingface/accelerate/blob/417bc529654a70e61013fd21263826a2f1f9e1a6/src/accelerate/big_modeling.py#L96-L170), thus avoiding the issue.
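To make the difference concrete, here is a minimal sketch of the register_parameter-patching idea. This is a simplified stand-in for accelerate's init_empty_weights, not its actual code, and the helper name init_on_meta is made up for illustration:

```python
import contextlib

import torch
import torch.nn as nn


@contextlib.contextmanager
def init_on_meta():
    """Simplified sketch: patch nn.Module.register_parameter so newly
    registered parameters are swapped onto the meta device, while torch's
    global device (and thus unrelated tensor math) is left untouched."""
    orig_register = nn.Module.register_parameter

    def register_meta(module, name, param):
        orig_register(module, name, param)
        if param is not None:
            module._parameters[name] = nn.Parameter(
                param.to("meta"), requires_grad=param.requires_grad
            )

    nn.Module.register_parameter = register_meta
    try:
        yield
    finally:
        nn.Module.register_parameter = orig_register


with init_on_meta():
    layer = nn.Linear(4, 4)                         # parameters land on meta
    probe = torch.linspace(0, 1, 17)[1:].tolist()   # plain tensor math still works

print(layer.weight.device.type)  # meta
print(probe[0])                  # 0.0625
```

Because only parameter registration is intercepted, the torchao-style module-level computation above runs on the real default device and succeeds.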

Here's a simple illustration:

```python
>>> import torch
>>> from accelerate import init_empty_weights
>>> with init_empty_weights():
...     QMAP_UNSIGNED = torch.linspace(0, 1, 17)[1:].tolist()
...
>>> QMAP_UNSIGNED
[0.0625, 0.125, 0.1875, 0.25, 0.3125, 0.375, 0.4375, 0.5, 0.5625, 0.625, 0.6875, 0.75, 0.8125, 0.875, 0.9375, 1.0]
>>> with torch.device("meta"):
...     QMAP_UNSIGNED = torch.linspace(0, 1, 17)[1:].tolist()
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 104, in __torch_function__
    return func(*args, **kwargs)
NotImplementedError: Cannot copy out of meta tensor; no data!
```

cc @ETOgaosion

Additional Info.

  • Issue Number: Fixes issue #1484, #1255, #1468 (comment)
  • Training: both
  • Inference: none

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add [BREAKING] to the PR title if it breaks any API.
  • Update the documentation about your changes in the docs.
  • Add CI test(s) if necessary.

@vermouth1992 vermouth1992 merged commit 530154e into verl-project:main May 18, 2025
31 checks passed
chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026