Conversation
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
|
Hi @sywangyi
|
Same result as with the latest main: 1 case fails. FAILED tests/test_diffusers.py::GaudiStableDiffusionXLImg2ImgPipelineTests::test_stable_diffusion_xl_img2img_euler - AssertionError: 0.21911774845123289 not less than 0.01
imangohari1
left a comment
Hi @sywangyi
I spent some time on this PR and did some testing:
- I've reworked the README file for a better read. Please apply the changes with the attached patch using `git am < 000*` (don't copy-paste the changes; please apply the patch).
  0001-fea-dreambooth-reworked-the-readme.patch
- I've tested the PEFT example with both `lora` and `boft`. The `lora` example finishes in about 6 min (5m47.993s), but the `boft` one has been running for ~80 min and has only completed 24% (`Steps: 24%|██▎ | 188/800 [1:21:18<3:53:56, 22.94s/it, loss=0.0225, lr=0.0001]`). Any thoughts on why `boft` is so much slower than `lora`? Is this because of the lack of HPU graphs? Let's investigate this a bit more.
  - I've provided the tested `cmd` below.
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export INSTANCE_DIR="dog"
export CLASS_DIR="path-to-class-images"
export OUTPUT_DIR="out"
logfile=pr1295.$(date -u +%Y%m%d%H%M).$(hostname).log
time python ../../gaudi_spawn.py --world_size 8 --use_mpi train_dreambooth.py --pretrained_model_name_or_path=$MODEL_NAME --instance_data_dir=$INSTANCE_DIR --output_dir=$OUTPUT_DIR --class_data_dir=$CLASS_DIR --with_prior_preservation --prior_loss_weight=1.0 --instance_prompt="a photo of sks dog" --class_prompt="a photo of dog" --resolution=512 --train_batch_size=1 --num_class_images=200 --gradient_accumulation_steps=1 --learning_rate=1e-4 --lr_scheduler="constant" --lr_warmup_steps=0 --max_train_steps=800 --mixed_precision=bf16 --use_hpu_graphs_for_training --use_hpu_graphs_for_inference --gaudi_config_name Habana/stable-diffusion lora --unet_r 8 --unet_alpha 8 2>&1 | tee $logfile
time python ../../gaudi_spawn.py --world_size 8 --use_mpi train_dreambooth.py --pretrained_model_name_or_path=$MODEL_NAME --instance_data_dir=$INSTANCE_DIR --output_dir=$OUTPUT_DIR --class_data_dir=$CLASS_DIR --with_prior_preservation --prior_loss_weight=1.0 --instance_prompt="a photo of sks dog" --class_prompt="a photo of dog" --resolution=512 --train_batch_size=1 --num_class_images=200 --gradient_accumulation_steps=1 --learning_rate=1e-4 --lr_scheduler="constant" --lr_warmup_steps=0 --max_train_steps=800 --mixed_precision=bf16 --gaudi_config_name Habana/stable-diffusion boft 2>&1 | tee $logfile
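To put the gap in numbers, here is a rough back-of-the-envelope sketch in Python that uses only the figures quoted in this comment (188/800 steps, 1:21:18 elapsed, 22.94 s/it, and 5m47.993s for `lora`). It assumes the per-step time stays constant for the remaining steps, which may not hold in practice; the variable names are illustrative only.

```python
# Figures copied from the progress bar and timing quoted in this comment.
TOTAL_STEPS = 800
DONE_STEPS = 188
ELAPSED_S = 1 * 3600 + 21 * 60 + 18   # 1:21:18 elapsed so far
SEC_PER_STEP = 22.94                  # rate reported by the progress bar
LORA_TOTAL_S = 5 * 60 + 47.993        # 5m47.993s for the full lora run

# Assumption: the remaining steps run at the currently reported rate.
remaining_s = (TOTAL_STEPS - DONE_STEPS) * SEC_PER_STEP
boft_total_s = ELAPSED_S + remaining_s
slowdown = boft_total_s / LORA_TOTAL_S

print(f"boft projected total: {boft_total_s / 3600:.2f} h")
print(f"projected slowdown vs lora: {slowdown:.0f}x")
```

Under that assumption the `boft` run would need a little over five hours, roughly 50 times the `lora` run. Note that the two commands differ in their HPU graph flags, so this is not a like-for-like comparison.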
Yes. I have filed a bug with the PyTorch training team about the perf issue and will cc you on the Jira ticket.
|
@sywangyi please test with the latest Synapse SW. If the issue is still there, we don't need to merge this change for the next Synapse release, as it's not functional.
Which version do you mean? I think the Habana PyTorch training team is still working on it.
|
@sywangyi do you have a test result?
I have tested this with driver 1.18.0-460 and the corresponding docker. It still shows the same behavior.
@imangohari1, do we have an update on this?
I am not sure whether the issue is resolved or not.
|
According to https://habana.atlassian.net/browse/HS-3208, it has not been resolved yet.
Just to update: the R&D team found the low-level issue that causes the slow compilation when the `torch.block_diag` operation runs. You can find all the details in the ticket.
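For readers unfamiliar with the operation, `torch.block_diag` places its input matrices along the diagonal of a larger, otherwise zero matrix. Below is a minimal pure-Python sketch of that semantics, restricted to non-empty rectangular 2-D inputs. `block_diag_2d` is a hypothetical helper written for illustration; it is not the PyTorch or Habana implementation, and it says nothing about why compilation was slow on HPU.

```python
from typing import List

Matrix = List[List[float]]


def block_diag_2d(*blocks: Matrix) -> Matrix:
    """Place each 2-D block on the diagonal of a zero-filled matrix.

    Assumes every block is non-empty and rectangular.
    """
    total_rows = sum(len(block) for block in blocks)
    total_cols = sum(len(block[0]) for block in blocks)
    out = [[0.0] * total_cols for _ in range(total_rows)]

    row_off = col_off = 0
    for block in blocks:
        # Copy the block into its slot, then advance both offsets
        # so the next block starts below and to the right of this one.
        for i, row in enumerate(block):
            for j, value in enumerate(row):
                out[row_off + i][col_off + j] = value
        row_off += len(block)
        col_off += len(block[0])
    return out


result = block_diag_2d([[1.0, 2.0], [3.0, 4.0]], [[5.0]])
for row in result:
    print(row)
```

The output size grows with the sum of the block sizes, so everything outside the diagonal blocks is zero padding.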
|
Update: the R&D team said the new feature is targeted for release 1.21.0.
|
The fix is under PR review; I'll test it when it is ready.
|
@sywangyi the build with the fix, 1.21.0-376, is ready. I tested it. Please check the ticket for all the details.
|
@sywangyi @libinta build 1.21.0-376 fixed the long compilation time during the `torch.block_diag()` operation. Performance on 1 epoch and 10 classes: 04/03/2025 13:28:12 - INFO - main - ***** Running training *****
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
|
@libinta will we merge the PR in 1.21, since 1.21 fixes the slow compilation when the `torch.block_diag` operation runs?
@sywangyi to improve the performance further we need some extra work. I think the current PR can be merged because the `torch.block_diag` fix is done and it allows us to run the training without getting stuck.
|
Does this PR work with Synapse 1.21? |
Yes.
|
@regisss, maybe we can merge it? It works on 1.21, and it has been a 10-month journey with repeated rebases.
|
@sywangyi it seems there is an incompatibility with numpy >2.0, which seems to be the version used now. Maybe it is solved in the latest main commit? Please merge the main branch again to check this.
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Resolved by upgrading tensorboard.
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
regisss
left a comment
It looks good to me. Can you also check that the Diffusers CI passes please? CI is offline and I don't have access to any Gaudi server at the moment.
I checked GaudiStableDiffusionPipelineTester and the DreamBooth test, which is the diffusers+PEFT test. All of them pass.
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com>
Due to the error `AttributeError: \`np.string_\` was removed in the NumPy 2.0 release. Use \`np.bytes_\` instead.`, we need to pin the Tensorboard version to `tensorboard==2.19.0` - similarly to the fix in [1]. [1] huggingface#1295
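As a rough illustration of how one might check whether an environment is exposed to this error, the sketch below reads the installed package versions with the standard library's `importlib.metadata`. The helper names `installed_version` and `numpy_lacks_string_alias` are hypothetical, and the check relies only on the error message's statement that `np.string_` was removed in the NumPy 2.0 release; it does not verify which Tensorboard versions are affected.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def numpy_lacks_string_alias(numpy_version: str) -> bool:
    """True for NumPy 2.0 and later, where np.string_ no longer exists."""
    return int(numpy_version.split(".")[0]) >= 2


numpy_version = installed_version("numpy")
tensorboard_version = installed_version("tensorboard")
if numpy_version and numpy_lacks_string_alias(numpy_version):
    print(
        f"numpy {numpy_version}: np.string_ is unavailable; "
        f"tensorboard is {tensorboard_version or 'not installed'}"
    )
```

If the message is printed and the Tensorboard version differs from the pinned `tensorboard==2.19.0`, the pin described in this commit is the fix the thread settled on.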
Signed-off-by: Wang, Yi A <yi.a.wang@intel.com> Co-authored-by: Iman Gohari <s.m.iman.gohari@intel.com> Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>
What does this PR do?
Fixes # (issue)
Before submitting