cogvideo training error #10315

Open
linwenzhao1 opened this issue Dec 20, 2024 · 3 comments
Labels: bug (Something isn't working), training

Comments

@linwenzhao1

Describe the bug

Fine-tuning the model on two GPUs fails with the following error: RuntimeError: CUDA driver error: invalid argument.
Do you know what the problem is?

Reproduction

rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
rank1: return self._call_impl(*args, **kwargs)
rank1: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
rank1: return forward_call(*args, **kwargs)
rank1: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/diffusers-0.32.0.dev0-py3.11.egg/diffusers/models/transformers/cogvideox_transformer_3d.py", line 148, in forward
rank1: ff_output = self.ff(norm_hidden_states)
rank1: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
rank1: return self._call_impl(*args, **kwargs)
rank1: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
rank1: return forward_call(*args, **kwargs)
rank1: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/diffusers-0.32.0.dev0-py3.11.egg/diffusers/models/attention.py", line 1242, in forward
rank1: hidden_states = module(hidden_states)
rank1: ^^^^^^^^^^^^^^^^^^^^^
rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
rank1: return self._call_impl(*args, **kwargs)
rank1: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
rank1: return forward_call(*args, **kwargs)
rank1: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/diffusers-0.32.0.dev0-py3.11.egg/diffusers/models/activations.py", line 88, in forward
rank1: hidden_states = self.proj(hidden_states)
rank1: ^^^^^^^^^^^^^^^^^^^^^^^^
rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
rank1: return self._call_impl(*args, **kwargs)
rank1: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
rank1: return forward_call(*args, **kwargs)
rank1: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
rank1: File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/nn/modules/linear.py", line 125, in forward
rank1: return F.linear(input, self.weight, self.bias)
rank1: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
rank1: RuntimeError: CUDA driver error: invalid argument
Steps: 0%| | 0/133600000 [00:12<?, ?it/s]
[rank0]:[W1220 14:39:33.155016577 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present, but this warning has only been added since PyTorch 2.4 (function operator())
W1220 14:39:35.723000 381051 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 381224 closing signal SIGTERM
E1220 14:39:36.039000 381051 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 381223) of binary: /home/conda_env/controlnet/bin/python
Traceback (most recent call last):
File "/home/conda_env/controlnet/bin/accelerate", line 8, in
sys.exit(main())
^^^^^^
File "/home/conda_env/controlnet/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/conda_env/controlnet/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1159, in launch_command
multi_gpu_launcher(args)
File "/home/conda_env/controlnet/lib/python3.11/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
distrib_run.run(args)
File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in call
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/conda_env/controlnet/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train_controlnet.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-12-20_14:39:35
host : robot
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 381223)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Logs

No response

System Info

ubuntu20.04
cuda 12.0
torch 2.5
diffusers 0.32.0.dev0
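
(For reference, a minimal, hypothetical snippet, not part of the original report, that could be run in the same conda environment to capture the exact versions and visible GPUs:)

# env_report.py -- hypothetical helper for collecting version/GPU info
import torch
import diffusers

print("torch         :", torch.__version__)
print("built w/ CUDA :", torch.version.cuda)            # CUDA toolkit PyTorch was compiled against
print("cuDNN         :", torch.backends.cudnn.version())
print("diffusers     :", diffusers.__version__)
print("CUDA usable   :", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}         :", torch.cuda.get_device_name(i))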

Who can help?

No response

@linwenzhao1 added the bug label Dec 20, 2024
@hlky
Collaborator

hlky commented Dec 20, 2024

cc @linoytsaban @sayakpaul for training.

@hlky added the training label Dec 20, 2024
@sayakpaul
Member

Cc: @a-r-r-o-w for training

@a-r-r-o-w
Member

Unable to deduce what exactly caused this error. I see it happens in the attention feed-forward projection, but nothing hints at why. Could you maybe run with CUDA_LAUNCH_BLOCKING=1 and share your results? Does it also happen with the PyTorch nightly? I'm able to run the CogVideoX scripts in https://github.com/a-r-r-o-w/finetrainers just fine.
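
(For reference, the variable can be exported when launching, e.g. CUDA_LAUNCH_BLOCKING=1 accelerate launch train_controlnet.py ...; alternatively, as a hypothetical sketch, it can be set at the very top of the training script before CUDA is initialized:)

# hypothetical sketch: place at the very top of train_controlnet.py, before
# torch initializes CUDA, so kernel launches run synchronously and the failing
# launch is reported with an accurate Python stack trace.
import os
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")

import torch  # import torch only after the variable has been set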
