
[BUG] failed to find frozen {param} in named params #6620

Closed
ssklzx opened this issue Oct 11, 2024 · 4 comments
Assignees: jomayeri
Labels: bug (Something isn't working), training

Comments

ssklzx commented Oct 11, 2024

Describe the bug
failed to find frozen {param} in named params

To Reproduce
Use Accelerate with DeepSpeed to train FLUX.1:

accelerator = Accelerator()
model, optimizer, data = accelerator.prepare(model, optimizer, data)
device_map = {}
model = accelerate.dispatch_model(model, device_map=device_map)
accelerator.save_state(save_path)

When I call accelerate.dispatch_model after accelerator.prepare, saving the model state fails with the following error:

Traceback (most recent call last):
File "/root/paddlejob/workspace/env_run/x-flux/train_flux_lora_deepspeed.py", line 391, in
main()
File "/root/paddlejob/workspace/env_run/x-flux/train_flux_lora_deepspeed.py", line 311, in main
accelerator.save_state(save_path)
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/accelerate/accelerator.py", line 2952, in save_state
model.save_checkpoint(output_dir, ckpt_id, **save_model_func_kwargs)
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3164, in save_checkpoint
self._save_checkpoint(save_dir,
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3365, in _save_checkpoint
frozen_param_shapes=self._get_zero_frozen_param_attributes(self._get_param_shape_func)
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3432, in _get_zero_frozen_param_attributes
raise ValueError(f"failed to find frozen {param} in named params")
ValueError: failed to find frozen Parameter containing:
tensor([[-0.0520, 0.0588, 0.0430, ..., -0.0391, 0.0083, -0.0104],
[-0.0320, 0.0408, 0.0112, ..., -0.0121, -0.0264, -0.0081],
[ 0.0055, 0.0493, 0.0488, ..., -0.0088, -0.0187, 0.0135],
...,
[-0.0957, 0.0220, 0.0087, ..., -0.0021, -0.0474, -0.0645],
[-0.0742, -0.0574, 0.0247, ..., -0.0396, 0.0041, 0.0233],
[-0.0225, 0.0225, 0.0299, ..., 0.0131, -0.0094, -0.0386]],
device='cuda:1', dtype=torch.bfloat16) in named params
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/paddlejob/workspace/env_run/x-flux/train_flux_lora_deepspeed.py", line 391, in
[rank0]: main()
[rank0]: File "/root/paddlejob/workspace/env_run/x-flux/train_flux_lora_deepspeed.py", line 311, in main
[rank0]: accelerator.save_state(save_path)
[rank0]: File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/accelerate/accelerator.py", line 2952, in save_state
[rank0]: model.save_checkpoint(output_dir, ckpt_id, **save_model_func_kwargs)
[rank0]: File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3164, in save_checkpoint
[rank0]: self._save_checkpoint(save_dir,
[rank0]: File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3365, in _save_checkpoint
[rank0]: frozen_param_shapes=self._get_zero_frozen_param_attributes(self._get_param_shape_func)
[rank0]: File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3432, in _get_zero_frozen_param_attributes
[rank0]: raise ValueError(f"failed to find frozen {param} in named params")
[rank0]: ValueError: failed to find frozen Parameter containing:
[rank0]: tensor([[-0.0520, 0.0588, 0.0430, ..., -0.0391, 0.0083, -0.0104],
[rank0]: [-0.0320, 0.0408, 0.0112, ..., -0.0121, -0.0264, -0.0081],
[rank0]: [ 0.0055, 0.0493, 0.0488, ..., -0.0088, -0.0187, 0.0135],
[rank0]: ...,
[rank0]: [-0.0957, 0.0220, 0.0087, ..., -0.0021, -0.0474, -0.0645],
[rank0]: [-0.0742, -0.0574, 0.0247, ..., -0.0396, 0.0041, 0.0233],
[rank0]: [-0.0225, 0.0225, 0.0299, ..., 0.0131, -0.0094, -0.0386]],
[rank0]: device='cuda:1', dtype=torch.bfloat16) in named params
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /root/paddlejob/workspace/env_run/x-flux/wandb/offline-run-20241011_145931-2vi5cs6v
wandb: Find logs at: wandb/offline-run-20241011_145931-2vi5cs6v/logs
E1011 15:00:14.591000 140520213088000 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 30158) of binary: /root/paddlejob/workspace/env_run/xflux_train_python3/bin/python3
Traceback (most recent call last):
File "/root/paddlejob/workspace/env_run/x-flux/../xflux_train_python3/bin/accelerate", line 8, in
sys.exit(main())
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 50, in main
args.func(args)
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1091, in launch_command
deepspeed_launcher(args)
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 787, in deepspeed_launcher
distrib_run.run(args)
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train_flux_lora_deepspeed.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-10-11_15:00:14
host : yq01-sys-hic-k8s-v100-box-a225-0075.yq01.baidu.com
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 30158)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Reading the source code, I found the cause of the error: when accelerator.prepare runs, DeepSpeed builds a mapping from parameter objects to their names in this line:

self.param_names = {param: name for name, param in model.named_parameters()}

But accelerate.dispatch_model replaces the model's parameter objects (their identities change when they are moved), so when the model is saved, looking up a parameter's name in param_names fails.
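
Below is a minimal sketch of that identity mismatch. It is an illustration only (it simulates the parameter replacement with a plain nn.Linear instead of calling dispatch_model), but it shows why an identity-keyed dict stops matching afterwards:

# Illustration only: simulate "parameters get replaced after the name mapping is built".
import torch.nn as nn

model = nn.Linear(4, 4)
model.weight.requires_grad_(False)  # a "frozen" parameter, as in LoRA training

# What the DeepSpeed engine records at construction time:
param_names = {param: name for name, param in model.named_parameters()}

# Simulate the effect of dispatch_model: the weight becomes a new Parameter object.
model.weight = nn.Parameter(model.weight.data.clone(), requires_grad=False)

for param in model.parameters():
    if not param.requires_grad:
        # Dict lookup is by tensor identity, so the replaced parameter is not found,
        # which is exactly the condition that raises the ValueError in the traceback.
        print(param in param_names)  # prints: False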

Here is the function that raises the error:
def _get_zero_frozen_param_attributes(self, attr_func):
    frozen_param_fragments = OrderedDict()
    for param in self.module.parameters():
        if param.requires_grad:
            continue
        if param not in self.param_names:
            raise ValueError(f"failed to find frozen {param} in named params")
        name = self.param_names[param]
        frozen_param_fragments[name] = attr_func(param)
    return frozen_param_fragments

Why can't this function be written as follows? It would be more concise and would also avoid this problem:

def _get_zero_frozen_param_attributes(self, attr_func):
    frozen_param_fragments = OrderedDict()
    for name, param in self.module.named_parameters():
        if param.requires_grad:
            continue
        frozen_param_fragments[name] = attr_func(param)
    return frozen_param_fragments

ssklzx added the bug (Something isn't working) and training labels on Oct 11, 2024
ssklzx changed the title from "failed to find frozen {param} in named params" to "[BUG] failed to find frozen {param} in named params" on Oct 11, 2024
@jomayeri
Contributor

@ssklzx Please make a PR with this change and we will review it.

jomayeri self-assigned this on Oct 11, 2024
@tjruwase
Contributor

Why can't this function be written as follows? It would be more concise and would also avoid this problem.

@ssklzx, DeepSpeed needs to know all the model parameters for correct functionality:
https://github.com/microsoft/DeepSpeed/blob/85b7469ea00f7719a27e3e8d1ffaa8765575f820/deepspeed/runtime/engine.py#L271-L272

So, I think the right question here is why some model parameters are missing in self.param_names

@ssklzx
Author

ssklzx commented Oct 16, 2024

Why can't this function be written as follows? It would be more concise and would also avoid this problem.

@ssklzx, DeepSpeed needs to know all the model parameters for correct functionality (DeepSpeed/deepspeed/runtime/engine.py, lines 271 to 272 at 85b7469):

# needed for zero_to_fp32 weights reconstruction to remap nameless data to state_dict
self.param_names = {param: name for name, param in model.named_parameters()}

So, I think the right question here is why some model parameters are missing in self.param_names

I know the reason. After self.param_names is initialized, I move some parameters to a different device (for example from cuda:0 to cuda:1), so those relocated parameters can no longer be found in the mapping.

For example:

model, optimizer, data = accelerator.prepare(model, optimizer, data)  # initializes self.param_names
model = accelerate.dispatch_model(model, device_map=device_map)       # moves parameters to other devices
accelerator.save_state(save_path)                                     # raises the error
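
For reference, a possible workaround sketch (only an assumption on my side: it rebuilds the engine's param_names attribute shown above after dispatch_model, and I have not verified it beyond this scenario):

model, optimizer, data = accelerator.prepare(model, optimizer, data)   # initializes param_names
model = accelerate.dispatch_model(model, device_map=device_map)        # replaces parameter objects

# Rebuild the identity-keyed mapping so it refers to the post-dispatch parameters.
# model is the DeepSpeed engine here; model.module is the wrapped nn.Module.
model.param_names = {param: name for name, param in model.module.named_parameters()}

accelerator.save_state(save_path)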

@tjruwase
Contributor

@ssklzx, are you able to provide a self-contained repro for this?
