
Fuse loras #4473
Merged (60 commits into main), Aug 29, 2023

Conversation

@patrickvonplaten (Contributor) commented Aug 4, 2023

What does this PR do?

Great idea from @williamberman to allow fusing LoRA weights into the original weights.

You can try it out with:

```python
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16)
pipe.load_lora_weights("stabilityai/stable-diffusion-xl-base-1.0", weight_name="sd_xl_offset_example-lora_1.0.safetensors")
pipe.fuse_lora()

pipe.to(torch_dtype=torch.float16)
pipe.to("cuda")

torch.manual_seed(0)

prompt = "beautiful scenery nature glass bottle landscape, , purple galaxy bottle"
negative_prompt = "text, watermark"

image = pipe(prompt, negative_prompt=negative_prompt, num_inference_steps=25).images[0]
```

You can reverse the effect of fuse_lora() by calling pipe.unfuse_lora(). Refer to the test cases to get a better handle on the implications.
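For intuition, fusing folds the LoRA update directly into the base weight so that inference needs no extra matmuls, and unfusing subtracts it again. A minimal sketch of the idea on a single weight matrix (illustrative only, not the diffusers implementation; `w_up`, `w_down` and `scale` are assumed names):

```python
import torch

def fuse(weight: torch.Tensor, w_up: torch.Tensor, w_down: torch.Tensor, scale: float = 1.0):
    # W_fused = W + scale * (up @ down), where up/down are the low-rank LoRA matrices
    return weight + scale * (w_up @ w_down)

def unfuse(fused_weight: torch.Tensor, w_up: torch.Tensor, w_down: torch.Tensor, scale: float = 1.0):
    # Exact inverse in infinite precision; in fp16 a small numerical residue can remain
    return fused_weight - scale * (w_up @ w_down)

W = torch.randn(64, 64)
A, B = torch.randn(4, 64), torch.randn(64, 4)  # rank-4 LoRA: down (A) and up (B)
W_fused = fuse(W, B, A)
assert torch.allclose(unfuse(W_fused, B, A), W, atol=1e-5)
```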

TODOS:

  • Conv layers
  • Attention linear layers
  • Text encoder LoRA fusion

BTW @sayakpaul we should probably refactor the attention LoRA processors to completely remove them and instead just work with the compatible linear layer class in the other attention processors (we'll have to do this anyway for the PEFT refactor). Edit: done in #4765 by @patrickvonplaten.

@sayakpaul (Member)

BTW @sayakpaul we should probably refactor the attention lora processors to completely remove them and instead just work with the compatible linear layer class in the other attention processors (we'll have to do this anyways for the peft refactor)

Could you please elaborate? I didn't get it at all.


@apolinario (Collaborator) commented Aug 4, 2023

Kohya has similar functionality, with a ratio param: https://github.com/kohya-ss/sd-scripts/blob/main/networks/merge_lora.py#L33

```bash
python networks\merge_lora.py --sd_model ..\model\model.ckpt --save_to ..\lora_train1\model-char1-merged.safetensors --models ..\lora_train1\last.safetensors ..\lora_train2\last.safetensors --ratios 0.8 0.5
```

Would it make sense to enable this kind of weighting in the (single-LoRA) loading we are doing too?

@apolinario (Collaborator)

Also, as mentioned internally on Slack, it would be great to have a pipe.unet.unfuse_lora() for use cases where one may want to keep a model warm on infra and swap out LoRAs.

@sayakpaul (Member)

@apolinario FWIW, this PR doesn't concern loading multiple LoRAs and fusing them. It simply fuses the LoRA params and the original model params in a way that doesn't affect the rest of our attention processing mechanism.

We already have issue requests for multiple LoRAs on GH.

@apolinario (Collaborator) commented Aug 5, 2023

@sayakpaul sorry, I was not suggesting that we should tackle loading multiple LoRAs in this PR. I just used Kohya's multiple-LoRA loading example to showcase the ratios param; what I wanted to ask is whether we should do something analogous in this PR.

I have edited my comment for clarity

@sayakpaul (Member)

IIUC, our scale parameter can be used to control how much of the LoRA params gets merged into the corresponding original params. What am I missing?

@apolinario (Collaborator) commented Aug 5, 2023

Yup, you are right; it sounds redundant then if there's a general/global scale param for loading the LoRA.

Edit: as discussed offline, while we do have `cross_attention_kwargs={"scale": 0.5}` as stated here, that happens during inference, so in this PR we would need some sort of scale param for the model merging.
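For illustration, continuing from the example in the PR description above, these are the two places a scale could apply. The inference-time scale via `cross_attention_kwargs` exists today; a merge-time `lora_scale` argument on `fuse_lora()` is hypothetical here, shown only to make the distinction concrete:

```python
# Inference-time scaling (existing behavior): the LoRA contribution is weighted per call.
image = pipe(prompt, cross_attention_kwargs={"scale": 0.5}).images[0]

# Merge-time scaling (hypothetical signature): the scale would be baked into the fused
# weights, e.g. W_fused = W + 0.5 * (up @ down), and could no longer vary per call.
pipe.fuse_lora(lora_scale=0.5)  # illustrative only, not an argument this PR adds
image = pipe(prompt).images[0]
```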

@patrickvonplaten (Contributor, Author)

BTW @sayakpaul we should probably refactor the attention lora processors to completely remove them and instead just work with the compatible linear layer class in the other attention processors (we'll have to do this anyways for the peft refactor)

Could you please elaborate? I didn't get it at all.

For this PR we need to add a specific `fuse_lora` function to `LoRAXFormersAttnProcessor(nn.Module)` and `LoRAAttnProcessor2_0(nn.Module)` to get it working. That's OK for now. However, in the midterm we should try to fully delete these two classes and instead adapt `self.to_k = nn.Linear(cross_attention_dim, inner_dim, bias=bias)` to use `LoRACompatibleLinear(nn.Linear)` (this will require some changes to LoRA loading and training, but I believe we can make it work in a fully backwards-compatible way; also cc @williamberman).
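Roughly, the idea is that a LoRA-compatible linear layer wraps a plain `nn.Linear`, optionally carries a LoRA adapter whose output is added on the fly, and can fold that adapter into its own `.weight` on demand. A minimal sketch of the pattern (assumed names, not the exact diffusers classes):

```python
import torch.nn as nn

class LoRALinearLayer(nn.Module):
    def __init__(self, in_features: int, out_features: int, rank: int = 4):
        super().__init__()
        self.down = nn.Linear(in_features, rank, bias=False)
        self.up = nn.Linear(rank, out_features, bias=False)
        nn.init.zeros_(self.up.weight)  # start as a no-op adapter

    def forward(self, x):
        return self.up(self.down(x))

class LoRACompatibleLinear(nn.Linear):
    """nn.Linear that can host a LoRA adapter and fuse it into its weight."""

    def __init__(self, *args, lora_layer: LoRALinearLayer = None, **kwargs):
        super().__init__(*args, **kwargs)
        self.lora_layer = lora_layer

    def fuse_lora(self):
        if self.lora_layer is None:
            return
        # Fold the low-rank delta into the dense weight: W += up @ down
        delta = self.lora_layer.up.weight @ self.lora_layer.down.weight
        self.weight.data += delta.to(self.weight.dtype)
        self.lora_layer = None  # the adapter is now baked into the weight

    def forward(self, x):
        out = super().forward(x)
        if self.lora_layer is not None:
            out = out + self.lora_layer(x)
        return out
```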

@apolinario mentioned this pull request Aug 17, 2023
@patrickvonplaten (Contributor, Author)

@sayakpaul do you want to give this PR a try or do you prefer me to finish it?

@sayakpaul (Member)

Giving it a try.

@sayakpaul (Member)

@patrickvonplaten up for a review here.

@sayakpaul (Member)

With this benchmarking script, I get the following on a V100:

{'fuse': False, 'total_time (ms)': '95874.1', 'memory (mb)': 13572}

{'fuse': True, 'total_time (ms)': '83744.8', 'memory (mb)': 13543}
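The referenced script itself is not reproduced in this thread; a comparable benchmark could look roughly like the sketch below (assumed structure: timing a few SDXL generations with and without `fuse_lora()` and reporting peak CUDA memory):

```python
import time
import torch
from diffusers import DiffusionPipeline

def benchmark(fuse: bool, n_runs: int = 3) -> dict:
    pipe = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")
    pipe.load_lora_weights(
        "stabilityai/stable-diffusion-xl-base-1.0",
        weight_name="sd_xl_offset_example-lora_1.0.safetensors",
    )
    if fuse:
        pipe.fuse_lora()

    torch.cuda.reset_peak_memory_stats()
    start = time.time()
    for _ in range(n_runs):
        pipe("a mecha robot", num_inference_steps=25)
    elapsed_ms = (time.time() - start) * 1000
    mem_mb = int(torch.cuda.max_memory_allocated() / 2**20)
    return {"fuse": fuse, "total_time (ms)": f"{elapsed_ms:.1f}", "memory (mb)": mem_mb}

print(benchmark(fuse=False))
print(benchmark(fuse=True))
```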

@apolinario (Collaborator)

Two questions:

  • Are we supporting a scale param for fusing the LoRA as previously discussed? Or would cross_attention_kwargs={"scale": 0.5} still work with fused LoRAs? 
  • What happens if I fuse two LoRAs and then call unfuse_lora? Does it go back to the original state?

@sayakpaul (Member)

I am observing some inconsistencies in the tests and unfuse_lora(). I will update the PR once it's ready for review.

@patrickvonplaten (Contributor, Author)

Ah, I think I've found the problem. We currently don't fuse the text encoder weights, so when calling unload_lora_weights(), this line:

self._remove_text_encoder_monkey_patch()

is hit and causes the inference to be different.

The PR as it currently stands already works correctly for non-text-encoder LoRAs such as: https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/sd_xl_offset_example-lora_1.0.safetensors

You can try it by doing:

```python
from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16)
pipe.load_lora_weights("stabilityai/stable-diffusion-xl-base-1.0", weight_name="sd_xl_offset_example-lora_1.0.safetensors")
pipe.unet.fuse_lora()
# This should have no effect as the LoRA is fused to the `unet` already, but instead it is removing the LoRA's effect
pipe.unload_lora_weights()

pipe.to(torch_dtype=torch.float16)
pipe.to("cuda")

torch.manual_seed(0)
prompt = "a mecha robot"
image = pipe(prompt, num_inference_steps=25).images[0]
image
```

I think we can actually also fuse the text encoder weights, no? I don't think that should be too difficult; what do you think @sayakpaul?

@patrickvonplaten (Contributor, Author)

Apart from this the PR looks very nice! The tests that have been added are great.

@sayakpaul (Member)

@apolinario https://gist.github.com/sayakpaul/cd0395669002ae82634e57e5d26cc0e0.

Regular:

image

With fused LoRA:

image

With unload_lora_weights() (no-op):

image

With unfuse_lora():

image

@sayakpaul (Member) commented Aug 29, 2023

@patrickvonplaten PR is ready for a review. Would be great if you could do some testing as well.

My only question at this point is how we allow for scale once fuse_lora() has been called. We need to think about that, and once we have a way, we should write thorough test cases.

@patrickvonplaten changed the title from "[WIP] Fuse loras" to "Fuse loras" on Aug 29, 2023
@patrickvonplaten (Contributor, Author)

Very cool! Everything works fine and the implementation is clean - nice job!

Yes, I think you touch upon a good point here regarding "lora_scale"; we should probably handle this in a new PR.
We currently also have this 🚧 regarding the "lora_scale": #4751

=> I'd suggest doing the following in a follow-up PR:

  • a) Make sure that "lora_scale" is applied to all LoRA layers. Instead of passing it via cross_attention_kwargs through the attention processors, it's probably better to work with a "setter" method that iterates through all layers and, if a layer is a LoRA layer, sets the "lora_scale" on that specific class (see the sketch after this list).
  • b) Also solve:

My only question at this point is how we allow for scale once fuse_lora() has been called. We need to think about that, and once we have a way, we should write thorough test cases.

=> I think we can do this by just multiplying the fused weight with that "lora_scale" and it should work.

  • c) We should also throw a warning if a user fuses more than one LoRA, IMO.
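As referenced in point a), a rough sketch of such a setter, plus the scaled fuse from point b). The attribute names follow the LoRA-compatible-linear idea discussed earlier in the thread and are assumptions, not an existing API:

```python
import torch.nn as nn

def set_lora_scale(model: nn.Module, lora_scale: float) -> None:
    # Point a): walk every submodule and stamp the scale onto LoRA-capable layers,
    # instead of threading it through cross_attention_kwargs on every call.
    for module in model.modules():
        if getattr(module, "lora_layer", None) is not None:
            module.lora_scale = lora_scale

def fuse_lora_scaled(linear: nn.Linear, lora_scale: float = 1.0) -> None:
    # Point b): bake the scale into the fused weight, W_fused = W + lora_scale * (up @ down).
    lora = linear.lora_layer  # assumed attribute, as in the sketch earlier in the thread
    delta = lora.up.weight @ lora.down.weight
    linear.weight.data += lora_scale * delta.to(linear.weight.dtype)
```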

Feel free to take this PR - otherwise happy to take a look myself :-)

Merging this one now!

@patrickvonplaten merged commit c583f3b into main Aug 29, 2023
@patrickvonplaten deleted the fuse_loras branch August 29, 2023 07:14
@apolinario (Collaborator) commented Aug 29, 2023

Very nice @sayakpaul! However, the robot after pipe.unfuse_lora() is different from the robot before fusing the LoRA:

From your own gist and example (and I reproduced it here):

| Before fusing the LoRA | After fusing and unfusing the LoRA |
| --- | --- |
| Image 1 | Image 2 |
| prompt: a mecha robot, seed: 0 | prompt: a mecha robot, seed: 0 |

Is this expected? Do we know why? I think this can affect hot-unloading use cases: after pipe.unfuse_lora(), my understanding is that users should expect the same image as if the LoRA had never been fused.

Comment on lines +120 to +123
```python
self.w_up = self.w_up.to(device=device, dtype=dtype)
self.w_down = self.w_down.to(device, dtype=dtype)
unfused_weight = fused_weight - torch.bmm(self.w_up[None, :], self.w_down[None, :])[0]
self.regular_linear_layer.weight.data = unfused_weight.to(device=device, dtype=dtype)
```
@patrickvonplaten (Contributor, Author)
Actually, the reason unfuse gives different results might be that we don't do the computation in full fp32 precision here. In the _fuse_lora function we do the computation in full fp32 precision, but here I think we don't. Can we try to make sure that

unfused_weight = fused_weight - torch.bmm(self.w_up[None, :], self.w_down[None, :])[0]

is always computed in full fp32 precision, and only then potentially lower it to fp16?

I guess we could also easily check this by casting the whole model to fp32 before doing fuse and unfuse. cc @apolinario
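A minimal sketch of that suggestion, upcasting the subtraction to fp32 and only casting back at the end (illustrative, not the actual patch applied in the PR):

```python
import torch

def unfuse_in_fp32(fused_weight: torch.Tensor, w_up: torch.Tensor, w_down: torch.Tensor) -> torch.Tensor:
    dtype, device = fused_weight.dtype, fused_weight.device
    # Do the subtraction in fp32 to avoid fp16 rounding residue ...
    unfused = fused_weight.float() - torch.bmm(
        w_up.float()[None, :], w_down.float()[None, :]
    )[0]
    # ... and only then cast back to the original (possibly fp16) dtype.
    return unfused.to(device=device, dtype=dtype)
```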

@apolinario (Collaborator) commented Aug 29, 2023

This is a good hypothesis, but would it explain why this happens with some LoRAs and not others, and why some residual style is kept?

  1. davizca87/vulcan → different image after unloading
  2. ostris/crayon_style_lora_sdxl → same image after unloading
  3. davizca87/sun-flower → different image after unloading (and also different from the image unloaded in 1)
  4. TheLastBen/Papercut_SDXL→ same image after unloading

@apolinario (Collaborator) commented Aug 29, 2023

Failed cases ❌

davizca87/sun-flower seems to keep some of the sunflower vibe in the background

| original generation | with lora fused | after unfusing (residual effects from the LoRA?) |
| --- | --- | --- |
| image1 | image2 | image3 |

davizca87/vulcan seems to keep some of the vulcan style in the outlines of the robot

| original generation | with lora fused | after unfusing (residual effects from the LoRA?) |
| --- | --- | --- |
| image4 | image5 | image6 |

Success cases ✅

ostris/crayon_style_lora_sdxl seems to produce a perceptually identical image after unfusing

| original generation | with lora fused | after unfusing (same as original generation) |
| --- | --- | --- |
| image4 | image | image |

TheLastBen/Papercut_SDXL and nerijs/pixel-art-xl also exhibit the same correct behavior

@sayakpaul (Member)

I personally always worry about how numerical precision propagates through a network and affects the end results. I have seen enough such cases to not sleep well at night. So, I will start with what @patrickvonplaten suggested.

FWIW, though, there's actually a fast test that ensures the following round trip doesn't have side effects:

load_lora_weights() -> fuse_lora() -> unload_lora_weights() gives you the outputs you would expect after doing fuse_lora().

Let me know if anything is unclear.

@apolinario (Collaborator) commented Aug 29, 2023

Very nice! I guess an analogous test could be made to address what I reported for a future PR

Namely, asserting that the two no-LoRA generations match:

generate with no LoRA (before) → load_lora_weights() → fuse_lora() → unfuse_lora() → generate with no LoRA (after unfusing)

This is the workflow I reported above; the unfused UNet seemingly still contains some residue of the LoRA.
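A sketch of what such a test could look like, in the spirit of the existing fast tests (the helper name and tolerance are assumptions):

```python
import numpy as np
import torch

def test_unfuse_restores_original_outputs(pipe, lora_repo_id, prompt="a mecha robot"):
    # Generation before any LoRA is loaded.
    before = pipe(prompt, num_inference_steps=2, output_type="np",
                  generator=torch.manual_seed(0)).images

    # Load, fuse, and unfuse the LoRA.
    pipe.load_lora_weights(lora_repo_id)
    pipe.fuse_lora()
    pipe.unfuse_lora()

    # Generation after the fuse/unfuse round trip should match the original.
    after = pipe(prompt, num_inference_steps=2, output_type="np",
                 generator=torch.manual_seed(0)).images
    assert np.allclose(before, after, atol=1e-3)
```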

@patrickvonplaten (Contributor, Author)

Should be fixed in #4833. It was a tricky issue caused by the patched text encoder LoRA layers being fully removed when doing unload_lora, and therefore losing their ability to unfuse. Hope it was OK to help out a bit here @sayakpaul.

@sayakpaul (Member)

I had started fuse-lora-pt2 and mentioned on Slack that I am looking into it. But okay.

@patrickvonplaten (Contributor, Author)


@sayakpaul if you have some time to look into this before next week that would be incredible! This way we can make a super nice SDXL LoRA release :-)

@sayakpaul (Member)

@sayakpaul if you have some time to look into this before next week that would be incredible! This way we can make a super nice SDXL LoRA release :-)

Prioritizing it.

@xhinker (Contributor) commented Sep 23, 2023

I see. My understanding was that we would get that for "free": once a LoRA is fused into the UNet, loading another LoRA would be the same as loading a LoRA into a base model (given it's loading into a fused UNet). What am I missing?

@apolinario your understanding is correct. However, I think we need to think about that design a bit to not break the API consistency and consider the repercussions of that. I am not suggesting we shouldn't allow it, of course, we should. But the API and/or docs for it need to be very tight to not introduce any confusion for the users.

I agree! We should be careful about introducing new APIs here. A suggestion: there is no harm in allowing the user to fuse multiple LoRAs, but as soon as two LoRAs are fused, the unfuse-LoRA function no longer works. => What do you think about keeping an internal counter of how many LoRAs are fused and, when this number is > 1, throwing an error on unfuse stating that multiple LoRAs have been fused and that the user needs to reload the whole model instead?

I'd say only once we have a full PEFT integration should we have full support for fusing and unfusing multiple LoRAs by name.

Thoughts?

@patrickvonplaten @sayakpaul how about adding a LoRA state dictionary to remember the LoRAs and their scale values? For example:

```python
lora_state = [
    {
        "lora_name": "LoRA1",
        "lora_scale": 0.7,
    },
    {
        "lora_name": "LoRA2",
        "lora_scale": 0.5,
    },
]
```

When a new LoRA is added to the pipeline, apply the scale delta to the pipeline if the LoRA already exists, and update the lora_state object at the same time; add a new record if it has never been added before.

Now, whenever the user calls the unfuse() function, the pipeline removes the LoRA weights multiplied by the lora_state's lora_scale. In theory it should work; thoughts?
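A hedged sketch of that bookkeeping idea: the pipeline keeps a record of every fused LoRA together with its scale, and unfusing subtracts each recorded, scaled delta (names and structure are illustrative, not an existing diffusers API):

```python
import torch

class FusedLoRARegistry:
    """Tracks which LoRAs are fused into a weight, and at what scale."""

    def __init__(self):
        # Each entry: {"lora_name": str, "lora_scale": float, "w_up": Tensor, "w_down": Tensor}
        self.entries = []

    def fuse(self, weight: torch.Tensor, name: str, scale: float,
             w_up: torch.Tensor, w_down: torch.Tensor) -> None:
        # Bake the scaled delta into the weight and remember how it was applied.
        weight.data += scale * (w_up @ w_down).to(weight.dtype)
        self.entries.append({"lora_name": name, "lora_scale": scale,
                             "w_up": w_up, "w_down": w_down})

    def unfuse_all(self, weight: torch.Tensor) -> None:
        # Remove every recorded LoRA delta, scaled by the scale it was fused with.
        for entry in self.entries:
            delta = entry["lora_scale"] * (entry["w_up"] @ entry["w_down"])
            weight.data -= delta.to(weight.dtype)
        self.entries.clear()
```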

yoonseokjin pushed a commit to yoonseokjin/diffusers that referenced this pull request Dec 25, 2023
* Fuse loras

* initial implementation.

* add slow test one.

* styling

* add: test for checking efficiency

* print

* position

* place model offload correctly

* style

* style.

* unfuse test.

* final checks

* remove warning test

* remove warnings altogether

* debugging

* tighten up tests.

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* denugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debuging

* debugging

* debugging

* debugging

* suit up the generator initialization a bit.

* remove print

* update assertion.

* debugging

* remove print.

* fix: assertions.

* style

* can generator be a problem?

* generator

* correct tests.

* support text encoder lora fusion.

* tighten up tests.

---------

Co-authored-by: Sayak Paul <[email protected]>
AmericanPresidentJimmyCarter pushed a commit to AmericanPresidentJimmyCarter/diffusers that referenced this pull request Apr 26, 2024
(same commit list as above)