
Modules with uninitialized parameters can't be used with `DistributedDataParallel` #686

Closed
magicwang1111 opened this issue Aug 9, 2024 · 3 comments

Comments

@magicwang1111

> see what 'glt log' says at the top?

Originally posted by @bghira in #676 (comment)

I have successfully installed bitsandbytes 0.43.3, but now I'm hitting a new error.

```
2024-08-09 10:56:17,298 [INFO] (main) Loading our accelerator...
Modules with uninitialized parameters can't be used with DistributedDataParallel. Run a dummy forward pass to correctly initialize the modules
Traceback (most recent call last):
  File "/mnt/data1/wangxi/SimpleTuner/train.py", line 2762, in <module>
    main()
  File "/mnt/data1/wangxi/SimpleTuner/train.py", line 1341, in main
    results = accelerator.prepare(
  File "/home/wangxi/miniconda3/envs/flux/lib/python3.10/site-packages/accelerate/accelerator.py", line 1299, in prepare
    result = tuple(
  File "/home/wangxi/miniconda3/envs/flux/lib/python3.10/site-packages/accelerate/accelerator.py", line 1300, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/wangxi/miniconda3/envs/flux/lib/python3.10/site-packages/accelerate/accelerator.py", line 1176, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/wangxi/miniconda3/envs/flux/lib/python3.10/site-packages/accelerate/accelerator.py", line 1435, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/home/wangxi/miniconda3/envs/flux/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 784, in __init__
    self._log_and_throw(
  File "/home/wangxi/miniconda3/envs/flux/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1127, in _log_and_throw
    raise err_type(err_msg)
RuntimeError: Modules with uninitialized parameters can't be used with DistributedDataParallel. Run a dummy forward pass to correctly initialize the modules

[rank0]:[W809 10:56:25.219082534 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
```
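For context, this RuntimeError is raised by PyTorch when `DistributedDataParallel` is asked to wrap a model that still contains lazy, uninitialized parameters (e.g. `nn.LazyLinear`); the generic remedy PyTorch suggests is to run one forward pass so the parameters are materialized before `accelerator.prepare` wraps the model. A minimal sketch of that pattern, using made-up module shapes rather than anything from SimpleTuner:

```python
# Minimal sketch (not SimpleTuner code): DDP refuses to wrap a model whose
# parameters are still uninitialized. Running a single dummy forward pass
# materializes lazy parameters so DDP / accelerator.prepare can inspect them.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.LazyLinear(64),   # parameters stay uninitialized until the first forward
    nn.ReLU(),
    nn.Linear(64, 10),
)

# Dummy forward pass with an input of the expected shape (128 features here
# is an arbitrary example); this initializes the lazy parameters.
with torch.no_grad():
    model(torch.zeros(1, 128))

# After this, wrapping the model in torch.nn.parallel.DistributedDataParallel
# (or passing it to accelerator.prepare) no longer raises the error above.
```

Whether this applies here depends on which module in the training pipeline is left uninitialized, which is what the linked duplicate issue tracks.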

@bghira
Owner

bghira commented Aug 9, 2024

duplicated by #644

@bghira closed this as not planned (duplicate) on Aug 9, 2024
@magicwang1111
Author

> duplicated by #644

I haven't used multi-GPU, only one GPU.

@bghira
Owner

bghira commented Aug 9, 2024

It's still the same error.
