add boft support in stable-diffusion#1295

Merged
regisss merged 17 commits into main from boft on Jul 11, 2025

Conversation

@sywangyi
Collaborator

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
@sywangyi sywangyi requested a review from regisss as a code owner August 28, 2024 08:50
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
@imangohari1
Contributor

Hi @sywangyi
Thanks for this PR.
Could you rebase this on OH main and make sure `make style` is applied?
Please share the CI test results for test_diffusers.py with and without these changes.
Thanks.

@sywangyi
Collaborator Author

Same result as with the latest main: 1 case fails.

FAILED tests/test_diffusers.py::GaudiStableDiffusionXLImg2ImgPipelineTests::test_stable_diffusion_xl_img2img_euler - AssertionError: 0.21911774845123289 not less than 0.01
================================ 1 failed, 108 passed, 46 skipped, 274 warnings in 776.40s (0:12:56) =================================

Contributor

@imangohari1 imangohari1 left a comment

Hi @sywangyi
I spent some time on this PR and did some testing:

  • I've reworked the README file to make it easier to read. Please apply the changes with the attached patch using git am < 000* (please don't copy-paste the changes; apply the patch).
    0001-fea-dreambooth-reworked-the-readme.patch

  • I've tested the PEFT example with both lora and boft. The lora example finishes in about 6min (5m47.993s), but the boft one has been running for ~80min and has only completed 24% (Steps: 24%|██▎ | 188/800 [1:21:18<3:53:56, 22.94s/it, loss=0.0225, lr=0.0001]). Any thoughts on why boft is so significantly slower than lora? Is this because of the lack of HPU graphs? Let's investigate this a bit more.

    • I've provided the tested cmd below.
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export INSTANCE_DIR="dog"
export CLASS_DIR="path-to-class-images"
export OUTPUT_DIR="out"

logfile=pr1295.$(date -u +%Y%m%d%H%M).$(hostname).log

# lora run (HPU graphs enabled)
time python ../../gaudi_spawn.py --world_size 8 --use_mpi train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR --class_data_dir=$CLASS_DIR \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --instance_prompt="a photo of sks dog" --class_prompt="a photo of dog" \
  --resolution=512 --train_batch_size=1 --num_class_images=200 \
  --gradient_accumulation_steps=1 --learning_rate=1e-4 \
  --lr_scheduler="constant" --lr_warmup_steps=0 --max_train_steps=800 \
  --mixed_precision=bf16 --use_hpu_graphs_for_training --use_hpu_graphs_for_inference \
  --gaudi_config_name Habana/stable-diffusion \
  lora --unet_r 8 --unet_alpha 8 2>&1 | tee $logfile

# boft run (no HPU graph flags); it reuses $logfile, so tee overwrites the lora log unless logfile is reset first
time python ../../gaudi_spawn.py --world_size 8 --use_mpi train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR --class_data_dir=$CLASS_DIR \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --instance_prompt="a photo of sks dog" --class_prompt="a photo of dog" \
  --resolution=512 --train_batch_size=1 --num_class_images=200 \
  --gradient_accumulation_steps=1 --learning_rate=1e-4 \
  --lr_scheduler="constant" --lr_warmup_steps=0 --max_train_steps=800 \
  --mixed_precision=bf16 \
  --gaudi_config_name Habana/stable-diffusion \
  boft 2>&1 | tee $logfile
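
For scale, the figures quoted in this comment can be extrapolated with a little arithmetic. This is only a rough sketch: it assumes the 22.94 s/it rate shown in the progress bar holds for all 800 steps, and the variable names are illustrative.

```python
# Rough extrapolation from the numbers quoted in the comment above.
# Assumption: the boft rate of 22.94 s/it stays constant for all 800 steps.
lora_total_s = 5 * 60 + 47.993   # lora wall time reported as 5m47.993s
boft_s_per_it = 22.94            # boft rate taken from the progress bar
max_train_steps = 800            # matches --max_train_steps=800

boft_total_s = boft_s_per_it * max_train_steps
slowdown = boft_total_s / lora_total_s

# Round before splitting into h/m/s to avoid floating-point truncation.
hours, rem = divmod(round(boft_total_s), 3600)
minutes, seconds = divmod(rem, 60)
print(f"projected boft wall time: {hours}h {minutes:02d}m {seconds:02d}s")
print(f"slowdown vs lora: ~{slowdown:.0f}x")
```

Under that assumption the boft run would take a little over five hours, roughly 50 times the lora run.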

@sywangyi
Collaborator Author

Yes. I have filed a bug with the PyTorch training team about the perf issue and will cc you on the Jira ticket.

@imangohari1
Contributor

Overall LGTM, although there is a performance issue with boft. Thanks @sywangyi
@regisss how do you suggest we proceed?

@libinta
Collaborator

libinta commented Sep 18, 2024

@sywangyi please test with the latest Synapse SW. If there is still an issue, we don't need to merge this change for the next Synapse release, as it's not functional.

@sywangyi
Collaborator Author

sywangyi commented Sep 18, 2024

Which version do you mean? I think the Habana PyTorch training team is still working on it.

@imangohari1
Contributor

@sywangyi 1.18.0 release build id 410.

@libinta
Collaborator

libinta commented Sep 24, 2024

@sywangyi do you have test results?

@kaixuanliu
Contributor

Hi @libinta, we ran a test using a docker image with build 438 for the 1.18.0 release from @yeonsily. boft is still significantly slower than lora: for single-card fine-tuning, lora takes ~4min 30s, while boft needs more than 2 hours.

@imangohari1
Contributor

I have tested this with driver 1.18.0-460 and the corresponding docker. It still shows the same behavior.
Our team is looking at rewriting the code that is causing the recompilation.

@yao-matrix
Contributor

@imangohari1, do we have an update on this?

@imangohari1
Contributor

I am not sure whether the issue is resolved or not.
@sywangyi WDYT?

@sywangyi
Collaborator Author

according to https://habana.atlassian.net/browse/HS-3208, it has not been resolved yet

@Luca-Calabria
Contributor

Luca-Calabria commented Dec 3, 2024

Just to update: an R&D engineer found the low-level issue that causes the slow compilation when the "torch.block_diag" operation runs. You can find all the details in the ticket.
The development time to support operations like "torch.block_diag" is estimated at around 2-3 weeks.
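
For readers unfamiliar with the operation, the sketch below shows what a block-diagonal assembly computes, using plain Python lists rather than tensors. It only illustrates the semantics of torch.block_diag for 2-D inputs; the helper name block_diag_lists is hypothetical, and nothing here reflects how the operation is compiled or executed on HPU.

```python
from typing import List

Matrix = List[List[float]]


def block_diag_lists(*blocks: Matrix) -> Matrix:
    """Place each non-empty 2-D block on the diagonal of a larger zero matrix."""
    total_rows = sum(len(b) for b in blocks)
    total_cols = sum(len(b[0]) for b in blocks)
    out = [[0.0] * total_cols for _ in range(total_rows)]
    row_off = col_off = 0
    for block in blocks:
        # Copy the block into its slot, then advance both offsets.
        for i, row in enumerate(block):
            for j, value in enumerate(row):
                out[row_off + i][col_off + j] = value
        row_off += len(block)
        col_off += len(block[0])
    return out


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0]]
result = block_diag_lists(a, b)
for row in result:
    print(row)
```

Everything outside the diagonal blocks is zero, so the output grows with the sum of the block sizes in each dimension.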

@Luca-Calabria
Contributor

Update: RnD guys said the new feature is targeted for release 1.21.0

@Luca-Calabria
Contributor

The fix is under PR review, I'll test it when it is ready

@Luca-Calabria
Contributor

@sywangyi the build with the fix, 1.21.0-376, is ready. I tested it. Please check the ticket for all the details.

@Luca-Calabria
Contributor

Luca-Calabria commented Apr 4, 2025

@sywangyi @libinta build 1.21.0-376 solved the long compilation time of the torch.block_diag() operation.
I suggest rebasing this PR and starting the final review.

Performance on 1 epoch and 10 classes:
Lazy Mode

04/03/2025 13:28:12 - INFO - main - ***** Running training *****
04/03/2025 13:28:12 - INFO - main - Num examples = 10
04/03/2025 13:28:12 - INFO - main - Num batches each epoch = 2
04/03/2025 13:28:12 - INFO - main - Num Epochs = 1
04/03/2025 13:28:12 - INFO - main - Instantaneous batch size per device = 1
04/03/2025 13:28:12 - INFO - main - Total train batch size (w. parallel, distributed & accumulation) = 8
04/03/2025 13:28:12 - INFO - main - Gradient Accumulation steps = 1
04/03/2025 13:28:12 - INFO - main - Total optimization steps = 1
Steps: 100%|███| 1/1 [06:20<00:00, 359.46s/it, loss=0.229, lr=5e-6]
HPU Memory before entering the train : 31292
HPU Memory consumed at the end of the train (end-begin): 1413
HPU Peak Memory consumed during the train (max-begin): 54332
HPU Total Peak Memory consumed during the train (max): 85624
CPU Memory before entering the train : 28540
CPU Memory consumed at the end of the train (end-begin): 498
CPU Peak Memory consumed during the train (max-begin): 498
CPU Total Peak Memory consumed during the train (max): 29038
Steps: 100%|███| 1/1 [07:01<00:00, 421.52s/it, loss=0.229, lr=5e-6]
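
The memory figures in this log appear to be related by simple addition: each "Total Peak" value equals the "before entering the train" value plus the "Peak ... (max-begin)" value. A quick check, with the numbers copied from the log above (the log does not state units, so none are assumed, and the variable names are illustrative):

```python
# Values copied from the log above; the log does not state units.
hpu_before, hpu_peak_delta, hpu_total_peak = 31292, 54332, 85624
cpu_before, cpu_peak_delta, cpu_total_peak = 28540, 498, 29038

# "Total Peak (max)" should equal "before" + "Peak (max-begin)".
hpu_ok = hpu_before + hpu_peak_delta == hpu_total_peak
cpu_ok = cpu_before + cpu_peak_delta == cpu_total_peak
print(f"HPU figures consistent: {hpu_ok}")
print(f"CPU figures consistent: {cpu_ok}")
```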

regisss and others added 2 commits April 6, 2025 18:01
@sywangyi
Collaborator Author

sywangyi commented Apr 29, 2025

@libinta will we merge this PR in 1.21, since 1.21 fixes the slow compilation when the "torch.block_diag" operation runs?

@Luca-Calabria
Contributor

> yes, 32min in my env, is the perf expected? @Luca-Calabria

@sywangyi to improve the performance further we need some extra work. I think the current PR can be merged because the fix for "torch.block_diag" is done and it allows us to run the training without getting stuck.
I'll prepare a Jira epic with the next steps we could take to improve perf further.

@libinta libinta added run-test Run CI for PRs from external contributors and removed synapse1.22 labels May 14, 2025
@regisss
Collaborator

regisss commented Jun 26, 2025

Does this PR work with Synapse 1.21?

@sywangyi
Collaborator Author

yes

sywangyi added 2 commits June 27, 2025 09:23
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
@yao-matrix
Contributor

@regisss, maybe we can merge it? It works on 1.21, and it has been a 10-month journey with multiple rebases.

Comment thread tests/test_diffusers.py Outdated
Comment thread examples/stable-diffusion/text_to_image_generation.py Outdated
Comment thread examples/stable-diffusion/text_to_image_generation.py Outdated
@Luca-Calabria
Contributor

@sywangyi it seems there is an incompatibility with NumPy 2.0 and later, which appears to be the version in use now.

[rank0]:   File "/usr/local/lib/python3.10/dist-packages/numpy/__init__.py", line 400, in __getattr__
[rank0]:     raise AttributeError(
[rank0]: AttributeError: `np.string_` was removed in the NumPy 2.0 release. Use `np.bytes_` instead.

Maybe it is solved in the latest commit on main? Please merge the main branch again to check this.
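
The traceback comes from code that still references the removed np.string_ alias. The sketch below illustrates that failure mode without requiring NumPy: FakeNumPy2 is a hypothetical stand-in that mimics how NumPy 2.0 and later reject the old name, and bytes_scalar_type is an illustrative helper, not part of any library. It is not the fix applied here; the thread itself resolves the error by upgrading tensorboard.

```python
class FakeNumPy2:
    """Hypothetical stand-in mimicking NumPy >= 2.0 for this demo."""

    bytes_ = bytes  # placeholder for the real np.bytes_ scalar type

    def __getattr__(self, name):
        # Only reached for attributes that normal lookup does not find.
        raise AttributeError(f"`np.{name}` was removed in the NumPy 2.0 release.")


def bytes_scalar_type(np_module):
    """Prefer np.bytes_; fall back to the legacy np.string_ alias if needed."""
    try:
        return np_module.bytes_
    except AttributeError:
        return np_module.string_


np2 = FakeNumPy2()
resolved = bytes_scalar_type(np2)        # resolves via bytes_
has_old_alias = hasattr(np2, "string_")  # the removed alias is unavailable
print(resolved, has_old_alias)
```

Since np.bytes_ is also available in NumPy 1.x, code that refers to it directly should not need the fallback branch at all.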

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
@sywangyi
Collaborator Author

Resolved by upgrading tensorboard.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Collaborator

@regisss regisss left a comment

It looks good to me. Can you also check that the Diffusers CI passes please? CI is offline and I don't have access to any Gaudi server at the moment.

Comment thread examples/stable-diffusion/text_to_image_generation.py
@sywangyi
Collaborator Author

sywangyi commented Jul 11, 2025

I checked the GaudiStableDiffusionPipelineTester and the DreamBooth tests (the diffusers + PEFT tests); all of them pass.

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
@regisss regisss merged commit 3f67ca8 into main Jul 11, 2025
6 of 8 checks passed
@regisss regisss deleted the boft branch July 11, 2025 09:51
pbielak pushed a commit to HabanaAI/optimum-habana-fork that referenced this pull request Jul 11, 2025
Due to the error `AttributeError: \`np.string_\` was removed in the NumPy
2.0 release. Use \`np.bytes_\` instead.`, we need to pin the Tensorboard
version to `tensorboard==2.19.0` - similarly to the fix in [1].

[1] huggingface#1295
@pbielak pbielak mentioned this pull request Jul 11, 2025
astachowiczhabana pushed a commit to HabanaAI/optimum-habana-fork that referenced this pull request Jul 14, 2025
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Iman Gohari <s.m.iman.gohari@intel.com>
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>
astachowiczhabana pushed a commit that referenced this pull request Sep 10, 2025
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Iman Gohari <s.m.iman.gohari@intel.com>
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>
gplutop7 pushed a commit to HabanaAI/optimum-habana-fork that referenced this pull request Oct 15, 2025
)

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Co-authored-by: Wang, Yi <yi.a.wang@intel.com>
Co-authored-by: Iman Gohari <s.m.iman.gohari@intel.com>
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>
Labels

run-test Run CI for PRs from external contributors synapse1.22

9 participants