After successfully deploying the JARK stack and connecting to the dogbooth Jupyter notebook to follow the execution steps, the launch training step (step 15) fails with an error and training stops immediately.
Notebook dreambooth training step:
# Launch the training and push the output model to huggingface
! accelerate launch train_dreambooth.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--output_dir=$OUTPUT_DIR \
--instance_prompt="a photo of [v]dog" \
--resolution=768 \
--train_batch_size=1 \
--gradient_accumulation_steps=1 \
--gradient_checkpointing \
--learning_rate=1e-6 \
--lr_scheduler="constant" \
--enable_xformers_memory_efficient_attention \
--use_8bit_adam \
--lr_warmup_steps=0 \
--max_train_steps=800 \
--push_to_hub
Error output:
Steps: 0%| | 0/800 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/jovyan/diffusers/examples/dreambooth/train_dreambooth.py", line 1443, in <module>
    main(args)
  File "/home/jovyan/diffusers/examples/dreambooth/train_dreambooth.py", line 1224, in main
    for step, batch in enumerate(train_dataloader):
  File "/opt/conda/lib/python3.10/site-packages/accelerate/data_loader.py", line 454, in __iter__
    current_batch = next(dataloader_iter)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/jovyan/diffusers/examples/dreambooth/train_dreambooth.py", line 673, in __getitem__
    instance_image = Image.open(self.instance_images_path[index % self.num_instance_images])
  File "/opt/conda/lib/python3.10/site-packages/PIL/Image.py", line 3227, in open
    fp = builtins.open(filename, "rb")
IsADirectoryError: [Errno 21] Is a directory: '/home/jovyan/diffusers/examples/dreambooth/dog/.huggingface'
Steps: 0%| | 0/800 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/opt/conda/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1082, in launch_command
    simple_launcher(args)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 688, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python3.10', 'train_dreambooth.py', '--pretrained_model_name_or_path=stabilityai/stable-diffusion-2-1', '--instance_data_dir=dog', '--output_dir=dogbooth', '--instance_prompt=a photo of [v]dog', '--resolution=768', '--train_batch_size=1', '--gradient_accumulation_steps=1', '--gradient_checkpointing', '--learning_rate=1e-6', '--lr_scheduler=constant', '--enable_xformers_memory_efficient_attention', '--use_8bit_adam', '--lr_warmup_steps=0', '--max_train_steps=800', '--push_to_hub']' returned non-zero exit status 1.
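The last frame of the first traceback shows Image.open being called on dog/.huggingface, which is a directory rather than an image. That is consistent with the training script treating every entry of --instance_data_dir as an image, hidden folders included; this is an inference from the traceback, not something verified against the script source here. A minimal sketch for checking which entries would trip the loader (the helper name find_non_file_entries is hypothetical, not part of diffusers):

```python
from pathlib import Path


def find_non_file_entries(instance_dir):
    """Return the entries of instance_dir that are not regular files, sorted by path."""
    return sorted(p for p in Path(instance_dir).iterdir() if not p.is_file())


if __name__ == "__main__":
    # "dog" is the --instance_data_dir value shown in the failing command.
    root = Path("dog")
    if root.is_dir():
        for entry in find_non_file_entries(root):
            print(f"not an image file: {entry}")
```

If this prints dog/.huggingface, the dataset folder contains something other than training images, which would explain the IsADirectoryError.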
✋ I have searched the open/closed issues and my issue is not listed.
I ran into the same problem: IsADirectoryError: [Errno 21] Is a directory: '/content/diffusers/examples/dreambooth/dog/.huggingface'. Have you solved it?
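One possible workaround, not confirmed by anyone in this thread: remove the .huggingface folder from the dataset directory before launching training, so that only image files remain. This assumes the folder holds download metadata rather than training data, which is worth checking before deleting anything. A sketch, using a hypothetical helper named remove_subdirectories:

```python
import shutil
from pathlib import Path


def remove_subdirectories(instance_dir):
    """Delete every real subdirectory of instance_dir, leaving files untouched.

    Returns the sorted names of the directories that were removed.
    """
    removed = []
    for entry in Path(instance_dir).iterdir():
        # Skip symlinks so that only directories physically inside the folder are deleted.
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)
            removed.append(entry.name)
    return sorted(removed)


if __name__ == "__main__":
    # "dog" is the --instance_data_dir value from the failing command.
    if Path("dog").is_dir():
        print("removed:", remove_subdirectories("dog"))
```

Running this in a notebook cell before the accelerate launch cell should leave only the instance images in dog. Whether training then proceeds has not been tested here.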
This issue has been automatically marked as stale because it has been open 30 days
with no activity. Remove stale label or comment or this issue will be closed in 10 days
Versions
Module version [Required]:
Terraform version:
Terraform v1.8.3
Provider version(s):
Terraform v1.8.3
Reproduction Code [Required]
Steps to reproduce the behavior:
Expected behavior
The dreambooth training process completes and the model is created, so that inference can proceed.
Actual behavior
Training fails at step 0 of 800 with IsADirectoryError on dog/.huggingface, and accelerate exits with status 1.
Terminal Output screenshots