[BUG] failed to find frozen {param} in named params #6620
Comments
@ssklzx Please make a PR with this change and we will review it.
@ssklzx, DeepSpeed needs to know all the model parameters for correct functionality. So, I think the right question here is why some model parameters are missing in self.param_names.
I know the reason, for example:
@ssklzx, are you able to provide a self-contained repro for this?
Describe the bug
failed to find frozen {param} in named params
To Reproduce
Using accelerate with DeepSpeed to train FLUX.1:
from accelerate import Accelerator
import accelerate

accelerator = Accelerator()
model, optimizer, data = accelerator.prepare(model, optimizer, data)

device_map = {}  # actual device placement omitted in the report
model = accelerate.dispatch_model(model, device_map=device_map)
accelerator.save_state(save_path)
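For reference, here is a self-contained toy approximation of the same call sequence (my sketch, untested; the model, device_map, and checkpoint path are placeholders, and it assumes the script is launched under a DeepSpeed-enabled accelerate config):

import torch
import torch.nn as nn
from accelerate import Accelerator, dispatch_model

net = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 2))
net[0].requires_grad_(False)  # frozen parameters are what trigger the failing lookup
opt = torch.optim.AdamW(p for p in net.parameters() if p.requires_grad)

accelerator = Accelerator()
net, opt = accelerator.prepare(net, opt)  # records the param -> name mapping

net = dispatch_model(net, device_map={"": 0})  # replaces parameter objects

accelerator.save_state("ckpt")  # raises: failed to find frozen {param} in named params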
When I use accelerate.dispatch_model after accelerator.prepare, saving the model fails with the following error:
Traceback (most recent call last):
File "/root/paddlejob/workspace/env_run/x-flux/train_flux_lora_deepspeed.py", line 391, in
main()
File "/root/paddlejob/workspace/env_run/x-flux/train_flux_lora_deepspeed.py", line 311, in main
accelerator.save_state(save_path)
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/accelerate/accelerator.py", line 2952, in save_state
model.save_checkpoint(output_dir, ckpt_id, **save_model_func_kwargs)
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3164, in save_checkpoint
self._save_checkpoint(save_dir,
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3365, in _save_checkpoint
frozen_param_shapes=self._get_zero_frozen_param_attributes(self._get_param_shape_func)
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3432, in _get_zero_frozen_param_attributes
raise ValueError(f"failed to find frozen {param} in named params")
ValueError: failed to find frozen Parameter containing:
tensor([[-0.0520, 0.0588, 0.0430, ..., -0.0391, 0.0083, -0.0104],
[-0.0320, 0.0408, 0.0112, ..., -0.0121, -0.0264, -0.0081],
[ 0.0055, 0.0493, 0.0488, ..., -0.0088, -0.0187, 0.0135],
...,
[-0.0957, 0.0220, 0.0087, ..., -0.0021, -0.0474, -0.0645],
[-0.0742, -0.0574, 0.0247, ..., -0.0396, 0.0041, 0.0233],
[-0.0225, 0.0225, 0.0299, ..., 0.0131, -0.0094, -0.0386]],
device='cuda:1', dtype=torch.bfloat16) in named params
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /root/paddlejob/workspace/env_run/x-flux/wandb/offline-run-20241011_145931-2vi5cs6v
wandb: Find logs at: wandb/offline-run-20241011_145931-2vi5cs6v/logs
E1011 15:00:14.591000 140520213088000 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 30158) of binary: /root/paddlejob/workspace/env_run/xflux_train_python3/bin/python3
Traceback (most recent call last):
File "/root/paddlejob/workspace/env_run/x-flux/../xflux_train_python3/bin/accelerate", line 8, in
sys.exit(main())
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 50, in main
args.func(args)
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1091, in launch_command
deepspeed_launcher(args)
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 787, in deepspeed_launcher
distrib_run.run(args)
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train_flux_lora_deepspeed.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-10-11_15:00:14
host : yq01-sys-hic-k8s-v100-box-a225-0075.yq01.baidu.com
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 30158)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
By reading the source code, I found that the cause of the error is that executing accelerator.prepare records a mapping from each param object to its name. That's the code below:
self.param_names = {param: name for name, param in model.named_parameters()}
But when I execute accelerate.dispatch_model afterwards, the model's parameter objects are replaced (their identities change), so when saving the model the lookup of param in param_names can no longer find the name corresponding to the param.
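A minimal sketch of the identity mismatch (my toy illustration, not code from either library): a dict keyed by parameter objects hashes by object identity, so recreating a parameter breaks the lookup even when its values are unchanged.

import torch
import torch.nn as nn

model = nn.Linear(4, 2)

# What prepare() records: keys are the parameter *objects*, hashed by id.
param_names = {param: name for name, param in model.named_parameters()}

# Simulate a re-dispatch that recreates a parameter (e.g. on another device):
model.weight = nn.Parameter(model.weight.data.clone())

print(model.weight in param_names)  # False: new object, new hash
print(model.bias in param_names)    # True: this object was not replaced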
Here is the function that raises the error:
def _get_zero_frozen_param_attributes(self, attr_func):
    frozen_param_fragments = OrderedDict()
    for param in self.module.parameters():
        if param.requires_grad:
            continue
        if param not in self.param_names:
            raise ValueError(f"failed to find frozen {param} in named params")
        name = self.param_names[param]
        frozen_param_fragments[name] = attr_func(param)
    return frozen_param_fragments
Why can't this function be written as follows? It is more concise and also solves this problem, because named_parameters() always yields the current parameter objects together with their names, so no stale param-to-name lookup is needed (at the cost of dropping the explicit check that every frozen parameter is known to the engine):
def _get_zero_frozen_param_attributes(self, attr_func):
    frozen_param_fragments = OrderedDict()
    for name, param in self.module.named_parameters():
        if param.requires_grad:
            continue
        frozen_param_fragments[name] = attr_func(param)
    return frozen_param_fragments
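A standalone approximation of the rewrite, just to show what it collects on a toy module (get_frozen_attributes and the lambda here stand in for the method and self._get_param_shape_func, which are my placeholders):

from collections import OrderedDict
import torch.nn as nn

def get_frozen_attributes(module, attr_func):
    # Name-based variant: no lookup into a possibly stale param -> name dict.
    fragments = OrderedDict()
    for name, param in module.named_parameters():
        if param.requires_grad:
            continue
        fragments[name] = attr_func(param)
    return fragments

m = nn.Linear(4, 2)
m.weight.requires_grad_(False)  # freeze one parameter
print(get_frozen_attributes(m, lambda p: p.shape))
# OrderedDict([('weight', torch.Size([2, 4]))])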