
CudaOutOfMemory on Flux Lora Training #9156

Closed
m-pektas opened this issue Aug 12, 2024 · 13 comments
Labels: bug (Something isn't working), stale (Issues that haven't received updates)

Comments

m-pektas commented Aug 12, 2024

Describe the bug

I tried to train the flux-dev model with LoRA on an A100 40GB, but it raises a CudaOutOfMemory exception.

Reproduction

# Accelerate command
export MODEL_NAME="black-forest-labs/FLUX.1-dev"
export INSTANCE_DIR="woman"
export OUTPUT_DIR="trained-flux-lora-woman"

accelerate launch train_dreambooth_lora_flux.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --mixed_precision="fp16" \
  --instance_prompt="ohwx woman" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --learning_rate=1e-5 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --validation_prompt="professional photography of ohwx woman" \
  --validation_epochs=25 \
  --seed="0" \
  --use_8bit_adam
# Accelerate Config
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: 'NO'
downcast_bf16: 'no'
enable_cpu_affinity: true
gpu_ids: all
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
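
One thing worth noting for comparison: the command above does not enable gradient checkpointing, and it generates validation images every 25 epochs, both of which raise peak memory. The sketch below is a hedged variation, assuming train_dreambooth_lora_flux.py exposes the --gradient_checkpointing flag as the other diffusers DreamBooth LoRA scripts do; it is not the fix confirmed for this report. (FLUX.1 weights ship in bfloat16, so bf16 mixed precision is also commonly used, but the sketch keeps fp16 to change one thing at a time.)

accelerate launch train_dreambooth_lora_flux.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --mixed_precision="fp16" \
  --instance_prompt="ohwx woman" \
  --resolution=512 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --use_8bit_adam \
  --learning_rate=1e-5 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=500 \
  --seed="0"
# validation flags omitted: running validation loads the full inference pipeline
# and adds to peak memory during training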

Logs

[Previous line repeated 4 more times]
  File "/home/muhammed_pektas/anaconda3/envs/hflora/lib/python3.12/site-packages/torch/nn/modules/module.py", line 805, in _apply
    param_applied = fn(param)
                    ^^^^^^^^^
  File "/home/muhammed_pektas/anaconda3/envs/hflora/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1160, in convert
    return t.to(
           ^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 0 has a total capacity of 39.39 GiB of which 14.31 MiB is free. Including non-PyTorch memory, this process has 39.37 GiB memory in use. Of the allocated memory 38.79 GiB is allocated by PyTorch, and 73.66 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.
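
The traceback ends inside torch.nn.Module._apply / t.to(), i.e. the allocation fails while model weights are still being moved or cast to the GPU, before any training step runs, and with 38.79 GiB already allocated the model likely just does not fit on a 40 GB card without further memory savers. The error message itself suggests one allocator-level mitigation; a minimal way to try it, assuming the same launch command as in the Reproduction section:

# Suggested by the error message above: allow the CUDA caching allocator to use
# expandable segments, which reduces fragmentation-related OOMs.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
accelerate launch train_dreambooth_lora_flux.py ...  # same arguments as above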

System Info

Pip
req_flux.txt

Hardware
NVIDIA A100-SXM4-40GB

Who can help?

No response

m-pektas added the bug label on Aug 12, 2024
tolgacangoz (Contributor) commented Aug 13, 2024

Did you examine its README_flux.md file?

m-pektas (Author) commented

> Did you examine its README_flux.md file?

Sorry for the late reply @tolgacangoz. I can train a Flux LoRA with the SimpleTuner repository, as you mentioned in the document you shared. I created this issue because I can't when I follow this document.

jcarioti commented Sep 4, 2024

Have you had any success @m-pektas?

I am trying to train Dreambooth using Flux fp8, but I'd like to avoid using SimpleTuner if possible.

asomoza (Member) commented Sep 5, 2024

cc: @linoytsaban

m-pektas (Author) commented

> Have you had any success @m-pektas?
>
> I am trying to train Dreambooth using Flux fp8, but I'd like to avoid using SimpleTuner if possible.

Currently, I am using SimpleTuner. I have not tried this repo with the latest updates. Do you still have the memory problem, @jcarioti?

jcarioti commented

> Have you had any success @m-pektas?
> I am trying to train Dreambooth using Flux fp8, but I'd like to avoid using SimpleTuner if possible.
>
> Currently, I am using SimpleTuner. I have not tried this repo with the latest updates. Do you still have the memory problem, @jcarioti?

No, but I haven't tried with this repo. I was able to run Flux FP8 with ComfyUI, but I was hoping to do LoRA training directly in the diffusers library without ComfyUI or SimpleTuner as overhead.


github-actions bot commented Oct 8, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot added the stale label on Oct 8, 2024
linoytsaban removed the stale label on Oct 11, 2024
linoytsaban (Collaborator) commented

Hey @m-pektas, is this still an issue? I'm unable to reproduce it (also training on an A100). Could you share your env, specifically the versions of diffusers, transformers, accelerate, and peft?
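
For reference, one quick way to capture those versions (a sketch; diffusers-cli is the helper installed alongside diffusers, and the pip fallback works in any environment):

# Environment report requested in diffusers bug reports
diffusers-cli env
# Or just the packages relevant to this thread
pip list | grep -Ei "diffusers|transformers|accelerate|peft|torch"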

m-pektas (Author) commented

Hey @linoytsaban, I haven't tried it again recently, but I shared my package versions in req_flux.txt. If you can't reproduce it even with the same versions, maybe it has been fixed or I had another problem.

leisuzz (Contributor) commented Nov 1, 2024

Can you try #9829? I saved memory by implementing this :)

a-r-r-o-w (Member) commented

Gentle bump in this thread to figure out whether this issue has been resolved yet. I see a merged PR, so if yes, I think we can close it :)

github-actions bot commented

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions bot added the stale label on Dec 13, 2024
m-pektas (Author) commented

I didn't have the chance to check whether it was resolved, but it is stale and, as we can see, there are merged PRs related to it. I am closing it; if it occurs again, we can reopen it. Thanks for your comments and contributions.
