
Modules with uninitialized parameters can't be used with `DistributedDataParallel` #686

Closed
magicwang1111 opened this issue Aug 9, 2024 · 3 comments

Comments

@magicwang1111

> see what 'glt log' says at the top?

Originally posted by @bghira in #676 (comment)

I have successfully installed bitsandbytes 0.43.3, but now I'm hitting a new error.

```
2024-08-09 10:56:17,298 [INFO] (main) Loading our accelerator...
Modules with uninitialized parameters can't be used with DistributedDataParallel. Run a dummy forward pass to correctly initialize the modules
Traceback (most recent call last):
  File "/mnt/data1/wangxi/SimpleTuner/train.py", line 2762, in <module>
    main()
  File "/mnt/data1/wangxi/SimpleTuner/train.py", line 1341, in main
    results = accelerator.prepare(
  File "/home/wangxi/miniconda3/envs/flux/lib/python3.10/site-packages/accelerate/accelerator.py", line 1299, in prepare
    result = tuple(
  File "/home/wangxi/miniconda3/envs/flux/lib/python3.10/site-packages/accelerate/accelerator.py", line 1300, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/wangxi/miniconda3/envs/flux/lib/python3.10/site-packages/accelerate/accelerator.py", line 1176, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/wangxi/miniconda3/envs/flux/lib/python3.10/site-packages/accelerate/accelerator.py", line 1435, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/home/wangxi/miniconda3/envs/flux/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 784, in __init__
    self._log_and_throw(
  File "/home/wangxi/miniconda3/envs/flux/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 1127, in _log_and_throw
    raise err_type(err_msg)
RuntimeError: Modules with uninitialized parameters can't be used with DistributedDataParallel. Run a dummy forward pass to correctly initialize the modules

[rank0]:[W809 10:56:25.219082534 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
```
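For context, this RuntimeError is raised by PyTorch when `DistributedDataParallel` is asked to wrap a model that still contains lazy, uninitialized parameters (e.g. `nn.LazyLinear`); the generic remedy PyTorch suggests is to run one forward pass so the parameters are materialized before `accelerator.prepare` wraps the model. A minimal sketch of that pattern, using made-up module shapes rather than anything from SimpleTuner:

```python
# Minimal sketch (not SimpleTuner code): DDP refuses to wrap a model whose
# parameters are still uninitialized. Running a single dummy forward pass
# materializes lazy parameters so DDP / accelerator.prepare can inspect them.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.LazyLinear(64),   # parameters stay uninitialized until the first forward
    nn.ReLU(),
    nn.Linear(64, 10),
)

# Dummy forward pass with an input of the expected shape (128 features here
# is an arbitrary example); this initializes the lazy parameters.
with torch.no_grad():
    model(torch.zeros(1, 128))

# After this, wrapping the model in torch.nn.parallel.DistributedDataParallel
# (or passing it to accelerator.prepare) no longer raises the error above.
```

Whether this applies here depends on which module in the training pipeline is left uninitialized, which is what the linked duplicate issue tracks.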

@bghira
Owner

bghira commented Aug 9, 2024

duplicated by #644

@bghira closed this as not planned (duplicate) on Aug 9, 2024
@magicwang1111
Author

> duplicated by #644

I haven't used multi-GPU, only one GPU.

@bghira
Owner

bghira commented Aug 9, 2024

It's still the same error.
