add boft support in stable-diffusion by sywangyi · Pull Request #1295 · huggingface/optimum-habana

sywangyi · 2024-08-28T08:50:09Z

What does this PR do?

Fixes # (issue)

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

HuggingFaceDocBuilderDev · 2024-08-28T08:54:25Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

imangohari1 · 2024-09-12T15:17:04Z

Hi @sywangyi
Thanks for this PR.
Could you rebase this on OH main and make sure `make style` has been applied?
Please share the CI test results for test_diffusers.py with and without these changes.
Thanks.

sywangyi · 2024-09-13T14:08:26Z

Same result as with latest main: 1 case fails.

FAILED tests/test_diffusers.py::GaudiStableDiffusionXLImg2ImgPipelineTests::test_stable_diffusion_xl_img2img_euler - AssertionError: 0.21911774845123289 not less than 0.01
================================ 1 failed, 108 passed, 46 skipped, 274 warnings in 776.40s (0:12:56) =================================

imangohari1

Hi @sywangyi
I spent some time on this PR and did some testing:

- I've reworked the README file for readability. Please apply the attached patch with git am < 000* (please apply the patch rather than copy-pasting the changes): 0001-fea-dreambooth-reworked-the-readme.patch
- I've tested the PEFT example with both lora and boft. The lora run finishes in about 6 min (5m47.993s), but the boft run had been going for ~80 min and had only completed 24% (Steps: 24%|██▎ | 188/800 [1:21:18<3:53:56, 22.94s/it, loss=0.0225, lr=0.0001]). Any thoughts on why boft is so much slower than lora? Is this because of the lack of HPU graphs? Let's investigate this a bit more.
- The commands I tested are below.

export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export INSTANCE_DIR="dog"
export CLASS_DIR="path-to-class-images"
export OUTPUT_DIR="out"

logfile=pr1295.$(date -u +%Y%m%d%H%M).$(hostname).log

time python ../../gaudi_spawn.py --world_size 8 --use_mpi train_dreambooth.py   --pretrained_model_name_or_path=$MODEL_NAME    --instance_data_dir=$INSTANCE_DIR   --output_dir=$OUTPUT_DIR   --class_data_dir=$CLASS_DIR   --with_prior_preservation --prior_loss_weight=1.0   --instance_prompt="a photo of sks dog"   --class_prompt="a photo of dog"   --resolution=512   --train_batch_size=1   --num_class_images=200   --gradient_accumulation_steps=1   --learning_rate=1e-4   --lr_scheduler="constant"   --lr_warmup_steps=0   --max_train_steps=800   --mixed_precision=bf16   --use_hpu_graphs_for_training   --use_hpu_graphs_for_inference   --gaudi_config_name Habana/stable-diffusion   lora --unet_r 8 --unet_alpha 8 2>&1 | tee $logfile

time python ../../gaudi_spawn.py --world_size 8 --use_mpi train_dreambooth.py   --pretrained_model_name_or_path=$MODEL_NAME    --instance_data_dir=$INSTANCE_DIR   --output_dir=$OUTPUT_DIR   --class_data_dir=$CLASS_DIR   --with_prior_preservation --prior_loss_weight=1.0   --instance_prompt="a photo of sks dog"   --class_prompt="a photo of dog"   --resolution=512   --train_batch_size=1   --num_class_images=200   --gradient_accumulation_steps=1   --learning_rate=1e-4   --lr_scheduler="constant"   --lr_warmup_steps=0   --max_train_steps=800   --mixed_precision=bf16     --gaudi_config_name Habana/stable-diffusion   boft 2>&1 | tee $logfile
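
For a rough sense of scale, the figures quoted in this comment can be extrapolated with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes the boft rate stays at the reported 22.94 s/it for all 800 steps, and `parse_minutes_seconds` is a hypothetical helper, not part of the example scripts. Note also that the two runs are not like-for-like, since the lora command enables HPU graphs and the boft command does not.

```python
def parse_minutes_seconds(text):
    """Convert a `time`-style duration such as '5m47.993s' to seconds."""
    minutes, seconds = text.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


# Figures reported in the comment above.
lora_seconds = parse_minutes_seconds("5m47.993s")
boft_seconds_per_step = 22.94
max_train_steps = 800

# Assumption: the per-step time stays constant for the whole run.
boft_projected_seconds = boft_seconds_per_step * max_train_steps
slowdown = boft_projected_seconds / lora_seconds

print(f"lora wall time:      {lora_seconds:.0f} s")
print(f"boft projected time: {boft_projected_seconds / 3600:.1f} h")
print(f"projected slowdown:  {slowdown:.0f}x")
```

Under that assumption the boft run would take roughly five hours, about fifty times the lora wall time.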

sywangyi · 2024-09-14T08:29:42Z

Yes. I have filed a bug with the PyTorch training team about the perf issue and will cc you on the Jira ticket.

imangohari1 · 2024-09-17T15:21:08Z

Overall LGTM, although there is a performance issue with boft. Thanks @sywangyi
@regisss how do you suggest we proceed?

libinta · 2024-09-18T21:05:14Z

@sywangyi please test with the latest Synapse SW. If the issue is still there, we don't need to merge this change for the next Synapse release, as it's not functional.

sywangyi · 2024-09-18T23:10:17Z

Which version do you mean? I think the Habana PyTorch training team is still working on it.

imangohari1 · 2024-09-20T20:18:19Z

@sywangyi 1.18.0 release build id 410.

libinta · 2024-09-24T22:02:12Z

@sywangyi do you have test results?

kaixuanliu · 2024-09-26T01:53:53Z

Hi @libinta, we tested with a docker image of build 438 for the 1.18.0 release from @yeonsily. boft is still significantly slower than lora: for single-card fine-tuning, lora takes ~4 min 30 s, while boft needs more than 2 hours.

imangohari1 · 2024-09-26T16:32:10Z

I have tested this with driver 1.18.0-460 and the corresponding docker image. It still shows the same behavior.
Our team is looking at rewriting the code that causes the recompilation.

yao-matrix · 2024-11-11T07:55:13Z

@imangohari1, do we have an update on this?

imangohari1 · 2024-11-12T04:28:04Z

I am not sure whether the issue is resolved.
@sywangyi WDYT?

sywangyi · 2024-11-12T05:04:51Z

According to https://habana.atlassian.net/browse/HS-3208, it has not been resolved yet.

Luca-Calabria · 2024-12-03T11:08:13Z

Just to update: R&D found the low-level issue that causes the slow compilation when the "torch.block_diag" operation runs. You can find all the details in the ticket.
The development time to support operations like "torch.block_diag" is estimated at around 2-3 weeks.
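
For readers unfamiliar with the operation: torch.block_diag places its input matrices along the diagonal of a larger matrix and fills the rest with zeros. The pure-Python sketch below only illustrates that layout. It is neither the PyTorch implementation nor the PEFT BOFT code, and `block_diag_lists` is a hypothetical helper introduced for this example.

```python
def block_diag_lists(*blocks):
    """Place 2-D blocks (lists of rows) along the diagonal of a zero matrix."""
    total_cols = sum(len(block[0]) for block in blocks)
    result = []
    col_offset = 0
    for block in blocks:
        width = len(block[0])
        for row in block:
            # Start from a row of zeros and copy the block row into its slot.
            new_row = [0] * total_cols
            new_row[col_offset:col_offset + width] = row
            result.append(new_row)
        col_offset += width
    return result


a = [[1, 2], [3, 4]]
b = [[5]]
combined = block_diag_lists(a, b)
for row in combined:
    print(row)
# [1, 2, 0]
# [3, 4, 0]
# [0, 0, 5]
```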

Luca-Calabria · 2025-01-14T13:46:14Z

Update: R&D said the new feature is targeted for release 1.21.0.

Luca-Calabria · 2025-03-21T09:21:10Z

The fix is under PR review. I'll test it when it is ready.

Luca-Calabria · 2025-04-01T11:26:44Z

@sywangyi the build with the fix, 1.21.0-376, is ready. I tested it. Please check the ticket for all the details.

Luca-Calabria · 2025-04-04T08:45:56Z

@sywangyi @libinta build 1.21.0-376 fixes the long compilation time of the torch.block_diag() operation.
I suggest rebasing this PR and starting the final review.

Performance for 1 epoch and 10 classes:
Lazy Mode

04/03/2025 13:28:12 - INFO - main - ***** Running training *****
04/03/2025 13:28:12 - INFO - main - Num examples = 10
04/03/2025 13:28:12 - INFO - main - Num batches each epoch = 2
04/03/2025 13:28:12 - INFO - main - Num Epochs = 1
04/03/2025 13:28:12 - INFO - main - Instantaneous batch size per device = 1
04/03/2025 13:28:12 - INFO - main - Total train batch size (w. parallel, distributed & accumulation) = 8
04/03/2025 13:28:12 - INFO - main - Gradient Accumulation steps = 1
04/03/2025 13:28:12 - INFO - main - Total optimization steps = 1
Steps: 100%|███| 1/1 [06:20<00:00, 359.46s/it, loss=0.229, lr=5e-6]
HPU Memory before entering the train : 31292
HPU Memory consumed at the end of the train (end-begin): 1413
HPU Peak Memory consumed during the train (max-begin): 54332
HPU Total Peak Memory consumed during the train (max): 85624
CPU Memory before entering the train : 28540
CPU Memory consumed at the end of the train (end-begin): 498
CPU Peak Memory consumed during the train (max-begin): 498
CPU Total Peak Memory consumed during the train (max): 29038
Steps: 100%|███| 1/1 [07:01<00:00, 421.52s/it, loss=0.229, lr=5e-6]

sywangyi · 2025-04-29T02:00:56Z

@libinta will we merge the PR in 1.21, since 1.21 fixes the slow compilation when the "torch.block_diag" operation runs?

Luca-Calabria · 2025-04-29T08:14:50Z

yes, 32min in my env, is the perf expected? @Luca-Calabria

@sywangyi improving the performance further will need some extra work. I think the current PR can be merged, because the fix for "torch.block_diag" is done and it lets us run the training without getting stuck.
I'll prepare a Jira epic with the next steps we could take to improve perf further.

regisss · 2025-06-26T13:55:52Z

Does this PR work with Synapse 1.21?

sywangyi · 2025-06-27T01:21:17Z

yes

yao-matrix · 2025-07-03T06:54:50Z

@regisss, maybe we can merge it? It works on 1.21, and it has been a 10-month journey with multiple rebases.

Luca-Calabria · 2025-07-07T15:01:42Z

@sywangyi there seems to be an incompatibility with NumPy >= 2.0, which appears to be the version in use now.

[rank0]:   File "/usr/local/lib/python3.10/dist-packages/numpy/__init__.py", line 400, in __getattr__
[rank0]:     raise AttributeError(
[rank0]: AttributeError: `np.string_` was removed in the NumPy 2.0 release. Use `np.bytes_` instead.

Maybe it is solved in the latest main commit? Please merge the main branch again to check.

sywangyi · 2025-07-10T11:14:36Z

Resolved by upgrading tensorboard.
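
To check whether a given environment is exposed to this error, one option is to look at the installed package versions. The sketch below uses only the standard library. The helper names are hypothetical, and the only fact it encodes is the one stated in the traceback: `np.string_` was removed in the NumPy 2.0 release.

```python
from importlib import metadata


def major_version(version_string):
    """Return the leading integer of a version string such as '2.0.1'."""
    return int(version_string.split(".")[0])


def numpy_removed_string_alias(numpy_version):
    """True if this NumPy version no longer provides np.string_ (2.0 and later)."""
    return major_version(numpy_version) >= 2


def installed_version(package):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for package in ("numpy", "tensorboard"):
    print(package, installed_version(package))
```

If NumPy reports 2.x and the traceback still appears, the tensorboard version printed here is the one to compare against a release known to work.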

regisss

It looks good to me. Can you also check that the Diffusers CI passes please? CI is offline and I don't have access to any Gaudi server at the moment.

sywangyi · 2025-07-11T00:29:15Z

I checked the GaudiStableDiffusionPipelineTester and the DreamBooth test, which is the diffusers+peft test. All of them pass.

Due to the error "AttributeError: `np.string_` was removed in the NumPy 2.0 release. Use `np.bytes_` instead.", we need to pin the Tensorboard version to `tensorboard==2.19.0`, similarly to the fix in [1].
[1] huggingface#1295

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Iman Gohari <s.m.iman.gohari@intel.com> Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>

add boft support in stable-diffusion

8394c03

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

sywangyi requested a review from regisss as a code owner August 28, 2024 08:50

add testcase

7f81362

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

Merge branch 'main' into boft

1ef942a

imangohari1 suggested changes Sep 14, 2024

fea(dreambooth): reworked the readme

3827489

libinta added the synapse1.20 label Nov 21, 2024

libinta added synapse 1.21 and removed synapse1.20 labels Feb 3, 2025

regisss and others added 2 commits April 6, 2025 18:01

Update PR doc build workflow (#1904)

c677785

Merge branch 'main' into boft

26acc10

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

Merge branch 'main' into boft

413c001

libinta added the synapse1.22 label May 7, 2025

Merge branch 'main' into boft

418625f

libinta added run-test Run CI for PRs from external contributors and removed synapse1.22 labels May 14, 2025

sywangyi added 2 commits June 27, 2025 09:23

Merge branch 'main' into boft

dedeaf1

minor fix

d3b32a7

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

regisss reviewed Jul 7, 2025

Comment thread tests/test_diffusers.py Outdated

Comment thread examples/stable-diffusion/text_to_image_generation.py Outdated

Comment thread examples/stable-diffusion/text_to_image_generation.py Outdated

astachowiczhabana self-assigned this Jul 9, 2025

astachowiczhabana added the synapse1.22 label Jul 9, 2025

boft refine and using peft 0.16.0

6d19d89

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

fix oft issue in peft 0.16.0

dfee7db

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

regisss reviewed Jul 10, 2025

Comment thread examples/stable-diffusion/text_to_image_generation.py

add requirement install in testcase

454946a

Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>

regisss approved these changes Jul 11, 2025

regisss merged commit 3f67ca8 into main Jul 11, 2025
6 of 8 checks passed

regisss deleted the boft branch July 11, 2025 09:51

pbielak mentioned this pull request Jul 11, 2025

Pin Tensorboard version #2135

Closed
