[BUG] failed to find frozen {param} in named params #6620
Comments
@ssklzx Please make a PR with this change and we will review it.
@ssklzx, DeepSpeed needs to know all the model parameters for correct functionality. So, I think the right question here is why some model parameters are missing in self.param_names.
I know the reason, for example:
@ssklzx, are you able to provide a self-contained repro for this?
Describe the bug
failed to find frozen {param} in named params
To Reproduce
Using accelerate with DeepSpeed to train FLUX.1:
from accelerate import Accelerator
import accelerate

accelerator = Accelerator()
model, optimizer, data = accelerator.prepare(model, optimizer, data)

device_map = {}  # actual device placement omitted in the report
model = accelerate.dispatch_model(model, device_map=device_map)
accelerator.save_state(save_path)
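For reference, here is a self-contained toy approximation of the same call sequence (my sketch, untested; the model, device_map, and checkpoint path are placeholders, and it assumes the script is launched under a DeepSpeed-enabled accelerate config):

import torch
import torch.nn as nn
from accelerate import Accelerator, dispatch_model

net = nn.Sequential(nn.Linear(8, 8), nn.Linear(8, 2))
net[0].requires_grad_(False)  # frozen parameters are what trigger the failing lookup
opt = torch.optim.AdamW(p for p in net.parameters() if p.requires_grad)

accelerator = Accelerator()
net, opt = accelerator.prepare(net, opt)  # records the param -> name mapping

net = dispatch_model(net, device_map={"": 0})  # replaces parameter objects

accelerator.save_state("ckpt")  # raises: failed to find frozen {param} in named params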
When I use accelerate.dispatch_model after accelerator.prepare, saving the model fails with the following error:
Traceback (most recent call last):
File "/root/paddlejob/workspace/env_run/x-flux/train_flux_lora_deepspeed.py", line 391, in
main()
File "/root/paddlejob/workspace/env_run/x-flux/train_flux_lora_deepspeed.py", line 311, in main
accelerator.save_state(save_path)
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/accelerate/accelerator.py", line 2952, in save_state
model.save_checkpoint(output_dir, ckpt_id, **save_model_func_kwargs)
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3164, in save_checkpoint
self._save_checkpoint(save_dir,
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3365, in _save_checkpoint
frozen_param_shapes=self._get_zero_frozen_param_attributes(self._get_param_shape_func)
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3432, in _get_zero_frozen_param_attributes
raise ValueError(f"failed to find frozen {param} in named params")
ValueError: failed to find frozen Parameter containing:
tensor([[-0.0520, 0.0588, 0.0430, ..., -0.0391, 0.0083, -0.0104],
[-0.0320, 0.0408, 0.0112, ..., -0.0121, -0.0264, -0.0081],
[ 0.0055, 0.0493, 0.0488, ..., -0.0088, -0.0187, 0.0135],
...,
[-0.0957, 0.0220, 0.0087, ..., -0.0021, -0.0474, -0.0645],
[-0.0742, -0.0574, 0.0247, ..., -0.0396, 0.0041, 0.0233],
[-0.0225, 0.0225, 0.0299, ..., 0.0131, -0.0094, -0.0386]],
device='cuda:1', dtype=torch.bfloat16) in named params
wandb:
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /root/paddlejob/workspace/env_run/x-flux/wandb/offline-run-20241011_145931-2vi5cs6v
wandb: Find logs at: wandb/offline-run-20241011_145931-2vi5cs6v/logs
E1011 15:00:14.591000 140520213088000 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 30158) of binary: /root/paddlejob/workspace/env_run/xflux_train_python3/bin/python3
Traceback (most recent call last):
File "/root/paddlejob/workspace/env_run/x-flux/../xflux_train_python3/bin/accelerate", line 8, in
sys.exit(main())
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 50, in main
args.func(args)
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1091, in launch_command
deepspeed_launcher(args)
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 787, in deepspeed_launcher
distrib_run.run(args)
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train_flux_lora_deepspeed.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-10-11_15:00:14
host : yq01-sys-hic-k8s-v100-box-a225-0075.yq01.baidu.com
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 30158)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
By reading the source code, I found that the cause of the error is that executing accelerator.prepare records a mapping from each param object to its name. That's the code below:
self.param_names = {param: name for name, param in model.named_parameters()}
But when I execute accelerate.dispatch_model afterwards, the model's parameter objects are replaced (their identities change), so when saving the model the lookup of param in param_names can no longer find the name corresponding to the param.
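A minimal sketch of the identity mismatch (my toy illustration, not code from either library): a dict keyed by parameter objects hashes by object identity, so recreating a parameter breaks the lookup even when its values are unchanged.

import torch
import torch.nn as nn

model = nn.Linear(4, 2)

# What prepare() records: keys are the parameter *objects*, hashed by id.
param_names = {param: name for name, param in model.named_parameters()}

# Simulate a re-dispatch that recreates a parameter (e.g. on another device):
model.weight = nn.Parameter(model.weight.data.clone())

print(model.weight in param_names)  # False: new object, new hash
print(model.bias in param_names)    # True: this object was not replaced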
Here is the function that raises the error:
def _get_zero_frozen_param_attributes(self, attr_func):
    frozen_param_fragments = OrderedDict()
    for param in self.module.parameters():
        if param.requires_grad:
            continue
        if param not in self.param_names:
            raise ValueError(f"failed to find frozen {param} in named params")
        name = self.param_names[param]
        frozen_param_fragments[name] = attr_func(param)
    return frozen_param_fragments
Why can't this function be written as follows? It is more concise and also solves this problem, because named_parameters() always yields the current parameter objects together with their names, so no stale param-to-name lookup is needed (at the cost of dropping the explicit check that every frozen parameter is known to the engine):
def _get_zero_frozen_param_attributes(self, attr_func):
    frozen_param_fragments = OrderedDict()
    for name, param in self.module.named_parameters():
        if param.requires_grad:
            continue
        frozen_param_fragments[name] = attr_func(param)
    return frozen_param_fragments
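A standalone approximation of the rewrite, just to show what it collects on a toy module (get_frozen_attributes and the lambda here stand in for the method and self._get_param_shape_func, which are my placeholders):

from collections import OrderedDict
import torch.nn as nn

def get_frozen_attributes(module, attr_func):
    # Name-based variant: no lookup into a possibly stale param -> name dict.
    fragments = OrderedDict()
    for name, param in module.named_parameters():
        if param.requires_grad:
            continue
        fragments[name] = attr_func(param)
    return fragments

m = nn.Linear(4, 2)
m.weight.requires_grad_(False)  # freeze one parameter
print(get_frozen_attributes(m, lambda p: p.shape))
# OrderedDict([('weight', torch.Size([2, 4]))])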