[merger] fix: avoid setting torch's global device to meta #1564

Merged

vermouth1992 merged 1 commit into verl-project:main from 0x404:fix_meta on May 18, 2025

Conversation

0x404 (Collaborator) commented May 18, 2025

Checklist Before Starting

  • Search for similar PR(s).

What does this PR do?

This PR fixes several issues (#1484, #1255) that cause the error: "Cannot copy out of meta tensor; no data!".

The relevant code on our side is:
https://github.com/volcengine/verl/blob/d36b5e81d6de598a87eecc6edf658ece0eb43582/scripts/model_merger.py#L131-L132

The torch.device("meta") context manager sets torch's global default device to "meta". During auto_model_class.from_config, various import statements load third-party libraries, and their __init__.py files may run module-level code that uses torch for computation.
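This global effect is easy to observe directly (a pure-torch illustration; the variable names here are just for demonstration):

```python
import torch

# Inside the context manager, every tensor factory call defaults to the
# meta device -- including module-level code run by any import that
# happens to execute inside the block.
with torch.device("meta"):
    x = torch.empty(2, 2)  # shape/dtype metadata only, no real storage

print(x.device.type)  # meta

# Outside the block, factory calls go back to the normal default device.
y = torch.empty(2, 2)
print(y.device.type)  # cpu
```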

For example, transformers imports [torchao](https://github.com/pytorch/ao/blob/5549da8af975be6ff14330feb56c4abe3405b6f9/torchao/optim/subclass_4bit.py#L33), which executes the following during initialization:

```python
QMAP_UNSIGNED = torch.linspace(0, 1, 17)[1:].tolist()  # no zero
```

In this case, when using the torch.device("meta") context manager, torch.linspace(0, 1, 17) gets created on the meta device, which only assigns metadata and cannot be moved to CPU. This causes the .tolist() call to fail with the error "Cannot copy out of meta tensor; no data!"

To fix this, the merger now uses init_empty_weights from accelerate, which patches nn.Module.register_parameter instead of patching torch's global device (https://github.com/huggingface/accelerate/blob/417bc529654a70e61013fd21263826a2f1f9e1a6/src/accelerate/big_modeling.py#L96-L170), thus avoiding the issue.
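To make the difference concrete, here is a minimal sketch of the register_parameter-patching idea. This is a simplified stand-in for accelerate's init_empty_weights, not its actual code, and the helper name init_on_meta is made up for illustration:

```python
import contextlib

import torch
import torch.nn as nn


@contextlib.contextmanager
def init_on_meta():
    """Simplified sketch: patch nn.Module.register_parameter so newly
    registered parameters are swapped onto the meta device, while torch's
    global device (and thus unrelated tensor math) is left untouched."""
    orig_register = nn.Module.register_parameter

    def register_meta(module, name, param):
        orig_register(module, name, param)
        if param is not None:
            module._parameters[name] = nn.Parameter(
                param.to("meta"), requires_grad=param.requires_grad
            )

    nn.Module.register_parameter = register_meta
    try:
        yield
    finally:
        nn.Module.register_parameter = orig_register


with init_on_meta():
    layer = nn.Linear(4, 4)                         # parameters land on meta
    probe = torch.linspace(0, 1, 17)[1:].tolist()   # plain tensor math still works

print(layer.weight.device.type)  # meta
print(probe[0])                  # 0.0625
```

Because only parameter registration is intercepted, the torchao-style module-level computation above runs on the real default device and succeeds.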

Here's a simple illustration:

```python
>>> import torch
>>> from accelerate import init_empty_weights
>>> with init_empty_weights():
...     QMAP_UNSIGNED = torch.linspace(0, 1, 17)[1:].tolist()
...
>>> QMAP_UNSIGNED
[0.0625, 0.125, 0.1875, 0.25, 0.3125, 0.375, 0.4375, 0.5, 0.5625, 0.625, 0.6875, 0.75, 0.8125, 0.875, 0.9375, 1.0]
>>> with torch.device("meta"):
...     QMAP_UNSIGNED = torch.linspace(0, 1, 17)[1:].tolist()
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_device.py", line 104, in __torch_function__
    return func(*args, **kwargs)
NotImplementedError: Cannot copy out of meta tensor; no data!
```

cc @ETOgaosion

Additional Info.

  • Issue Number: Fixes issue #1484, #1255, #1468 (comment)
  • Training: both
  • Inference: none

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add [BREAKING] to the PR title if it breaks any API.
  • Update the documentation about your changes in the docs.
  • Add CI test(s) if necessary.

@vermouth1992 vermouth1992 merged commit 530154e into verl-project:main May 18, 2025
31 checks passed
chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026