Support inference with OFT networks #13692

v0xie · 2023-10-19T20:09:47Z

Description

This PR adds support for inference of OFT networks trained with kohya-ss sd-scripts. The implementation is based on kohya's implementation here: https://github.com/kohya-ss/sd-scripts/blob/main/networks/oft.py

~~This is a draft PR because of these major issues:~~

~~The network remains affects all generations after loading even when supposed to be unloaded~~
~~It noticeably slows down inference~~

The current implementation replaces sd_module's forward method with a custom forward method, which I believe is the cause of the network continuing to affect future generations and the speed of inference.

Any suggestions on how to fix these issues is greatly appreciated!

Related links:
OFT project page: https://oft.wyliu.com
OFT official repository: https://github.com/Zeju1997/oft

Screenshots/videos:

Checklist:

I have read contributing wiki page
I have performed a self-review of my own code
My code follows the style guidelines
My code passes tests

v0xie · 2023-10-20T14:27:26Z

Here's a .safetensors OFT network I trained on part of the Dreambooth dataset: https://huggingface.co/v0xie/sdxl-oft_monster_toy/resolve/main/monster_toy_oft.safetensors
It's a bit under-baked but it should do fine for testing purposes.

SDXL OFT network trained with kohya-ss sd-scripts
Unique Token / Class Token: monster_toy toy
Linear/Alpha: 8/8
Conv/Alpha: 4/4
Trained for 16 epochs / 400 total steps

v0xie · 2023-10-21T23:25:24Z

The slowdown during inference is fixed now; anecdotally, it's at least as fast as using any other network!

Instead of replacing the forward method, now we're merging the weights into the weights of the original module (merge_weight function). I'm trying to figure out how to undo this replacement or if it's even necessary to replace the weights.

v0xie · 2023-10-22T16:53:20Z

The remaining issue of weights continuing to be loaded is fixed, and this PR is ready to merge.

Also fixed is another bug where the multiplier was being applied again in finalize_updown which unpredictably increased the effect of U-Net / Text Encoder multipliers. Kohya's implementation applies the multiplier in the get_weight function, so I've maintained the same behavior for this implementation.

KohakuBlueleaf · 2023-10-27T04:18:02Z

@v0xie The method kohya implemented is too unefficient.
I will try to make some change based on your PR
(you can check my implementation in lycoris dev branch)

… diag_oft has MultiheadAttn which kohya's doesn't?, attempt create new module based off network_lora.py, errors about tensor dim mismatch

…r kohya and kb

Support LyCORIS diag-oft OFT implementation (minus MultiheadAttention layer), maintains support for kohya-ss OFT

v0xie · 2023-11-04T02:56:00Z

@KohakuBlueleaf I just added support for diag-oft module.

I don't know how to handle the Linear MultiheadAttention layers yet so those are skipped until I can better understand how they work. The implementation is a bit complicated since I'm maintaining support for kohya OFT module, so maybe it would be better to have two separate files?

KohakuBlueleaf · 2023-11-04T08:00:56Z

@KohakuBlueleaf I just added support for diag-oft module.

I don't know how to handle the Linear MultiheadAttention layers yet so those are skipped until I can better understand how they work. The implementation is a bit complicated since I'm maintaining support for kohya OFT module, so maybe it would be better to have two separate files?

Kohya's OFT is mathmetically identical with my implementation (but unefficient)
You just need to ensure the key name have ability to handle 2 formats
And math part just use my implementation.

Use same updown implementation for LyCORIS OFT as kohya-ss OFT

KohakuBlueleaf · 2023-11-11T13:03:27Z

@v0xie I have do more investigation on it
it looks like what kohya did is actually "Re-scaled Constrained OFT", and what I have done is "OFT".
and in their Re-scaled COFT implementation, the weight is actually "Q". and the weight in my implementation is actually "R".

I will add constrained OFT in the future and I will use Kohya's naming if I use Q in here.
And sry you still need 2 (or, a if-else) different rebuild weight method.

Basically the idea is:

if using_lycoris:
  # direct use weight + I as R
else:
  # use kohya's implementation to rebuild R first
# rebuild weight from R use my implementation

And a warning here.
Kohya's implementation is wrong since their training doesn't have "(RW)X" but "RX"
Will open an issue on sd-scripts

KohakuBlueleaf · 2023-11-11T15:48:44Z

@v0xie I changed my implementation
it will only produce "oft_diag" as "R" (or, DR)

so you can just use my logic mentioned above

v0xie · 2023-11-12T05:50:04Z

@v0xie I changed my implementation it will only produce "oft_diag" as "R" (or, DR)

so you can just use my logic mentioned above

Thank you for this, I'll take a look at reworking the implementation.

FurkanGozukara · 2023-11-12T17:12:34Z

did you guys got better results (quality not speed) than proper dreambooth training?

I don't believe it but would like to get your opinion

KohakuBlueleaf · 2023-11-13T05:23:28Z

did you guys got better results (quality not speed) than proper dreambooth training?

I don't believe it but would like to get your opinion

It is hardly depends on how you defined "better"
And hardly depends on your configs too.

It is quite like "high rank lokr" vs "low rank lokr" vs "dreambooth"
low rank lokr/large block oft can be very closed to or better then dreambooth on fidelity (since sometime you want some constrain), but high rank lokr/small block oft will more close to low rank lora, which is better on diversity.

The differences is more introduced by the model capacity (most of time it is basically the trainable param count)

You may want to check the paper of LyCORIS, which discuss on different metrics and it is very clear that when you want a better "fidelity" you need larger model capacity, and when you want better "diversity" you need smaller capacity.

Models with same capacity have subtle differences, really hard to say which one is "better" in this case.
In our experiments it is more like, "some algo is more robust on the required property so it is better" or "some algo is faster/more efficient in time", but not "it actually produce siginificant better result than others"

KohakuBlueleaf · 2023-11-13T05:25:27Z

And we(LyCORIS team) don't have any metric evaluation on OFT yet.
I think it is planned (will ask other member in LyCORIS later)
You can wait for few days or weeks for the result.

cyber-meow · 2023-11-15T19:41:59Z

I would like to draw attention on the fact that the inference with OFT (or more generally, any method based on multiplication of additional matrix rather than addition) could be much more complicated if we consider the following two widely adopted practices in the community: using an additional network with a multiplier different from 1 and using multiple networks together.

Using additional network with smaller weight

Let us write $\gamma$ for the scaling factor to apply OFT with smaller/larger weight. I can at least think of three ways to achieve this.

The current implementation pretty much regards OFT as LoRA, and simply does $W' = (I + \gamma(R-I))W$
Provided that $R$ is parameterized as $R=(I+Q)(I-Q)^{-1}$, another natural idea is to define $R_\gamma = (I+\gamma Q)(I-\gamma Q)^{-1}$ and set $W' = I + R_\gamma W$. This may not be so different from the first method if we suppose that $(I+\gamma Q)(I-\gamma Q)^{-1}\approx I + 2\gamma Q$
However, what really sets OFT apart is this viewpoint of fine-tuning some rotation / orthogonal transformation instead of a weight difference. Therefore, what aligns more with this viewpoint is to apply a fractional power $W'=R^{\gamma}W$. In the case of OFT, we have Cayley parameterization that produces matrices from the special orthogonal group. I think to implement matrix power efficiently we may need to rather store it in a different format where we have directly the $2\times 2$ rotational matrices (see e.g. https://math.stackexchange.com/questions/4353632/decomposition-of-son-into-so2-and-inversions). I am not totally sure here, but I believe this approach should be considered as well.

Using multiple additional networks

For this I can think of two different approaches:

Either we apply it directly to the current weight no matter what other networks have been used
Either we apply it to the base weight with no additional networks and then add the difference to the current weight

v0xie · 2023-11-16T01:20:20Z

@cyber-meow Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization https://arxiv.org/abs/2311.06243

Found this new technical report while looking up more about OFT; maybe it has some insight on how weights should be applied?

v0xie · 2023-11-16T02:47:31Z

I think this is now ready to merge! Any issues you see @KohakuBlueleaf ?

This PR now has:

Support for networks trained with LyCORIS OFT/COFT network implementation (https://github.com/KohakuBlueleaf/LyCORIS/blob/dev/lycoris/modules/diag_oft.py).
Support for networks trained with kohya-ss' COFT network implementation (https://github.com/kohya-ss/sd-scripts/blob/main/networks/oft.py).
Both networks use the same optimized code for calc_updown() (from LyCORIS).

Some quirks in this implementation:

LyCORIS has MultiheadAttention layers which are not supported
- I think this is an up-stream thing to handle; other OFT implementations handle attention layers as a special case (see attention_processor.py in https://github.com/tripplyons/oft).
kohya-ss' COFT implementation has alpha (which is the constraint), and LyCORIS does not.
- In either case, constraint is ignored for inference if we assume the weights are pre-constrained / rescaled (which they should be).

KohakuBlueleaf · 2023-12-13T17:16:49Z

@v0xie Sorry I haven't checked this things
I need to notice that your implementation for LyCORIS' OFT is completely wrong
You should not do the skew things and should not take it as "oft_blocks"
It is not "oft_blocks" because IT IS the "R"

Don't do any extra operation on it

KohakuBlueleaf · 2023-12-13T17:30:30Z

Actually You also do something wrong in Kohya.
You only rebuilt the Q and take it as R.

v0xie added 7 commits October 17, 2023 23:35

wip incorrect OFT implementation

ec718f7

inference working but SLOW

1c6efdb

faster by using cached R in forward

853e21d

faster by calculating R in updown and using cached R in forward

eb01d7f

refactor: fix constraint, re-use get_weight

321680c

style: formatting

d10c4db

style: fix ambiguous variable name

0550659

v0xie added 4 commits October 21, 2023 13:43

refactor: use forward hook instead of custom forward

2d8c894

fix: return orig weights during updown, merge weights before forward

7683547

fix: support multiplier, no forward pass hook

fce86ab

style: cleanup oft

76f5abd

v0xie added 4 commits October 21, 2023 17:37

fix: use merge_weight to cache value

de8ee92

refactor: remove used OFT functions

4a50c96

fix: multiplier applied twice in finalize_updown

3b8515d

style: conform style

6523edb

v0xie marked this pull request as ready for review October 22, 2023 16:53

v0xie requested a review from AUTOMATIC1111 as a code owner October 22, 2023 16:53

v0xie added 6 commits November 1, 2023 22:34

test implementation based on kohaku diag-oft implementation

a2fad6e

detect diag_oft type

65ccd63

no idea what i'm doing, trying to support both type of OFT, kblueleaf…

d727ddf

… diag_oft has MultiheadAttn which kohya's doesn't?, attempt create new module based off network_lora.py, errors about tensor dim mismatch

skip multihead attn for now

fe1967a

refactor: move factorization to lyco_helpers, separate calc_updown fo…

f6c8201

…r kohya and kb

Merge pull request #1 from v0xie/oft-faster

1dd25be

Support LyCORIS diag-oft OFT implementation (minus MultiheadAttention layer), maintains support for kohya-ss OFT

v0xie marked this pull request as draft November 4, 2023 21:52

v0xie added 3 commits November 4, 2023 14:54

refactor: use same updown for both kohya OFT and LyCORIS diag-oft

329c8ba

refactor: remove unused function

bbf00a9

Merge pull request #2 from v0xie/network-oft-change-impl

7edd50f

Use same updown implementation for LyCORIS OFT as kohya-ss OFT

fix: ignore calc_scale() for COFT which has very small alpha

d6d0b22

feat: LyCORIS/kohya OFT network support

eb667e7

v0xie marked this pull request as ready for review November 16, 2023 02:47

AUTOMATIC1111 approved these changes Nov 19, 2023

View reviewed changes

AUTOMATIC1111 merged commit 2207ef3 into AUTOMATIC1111:dev Nov 19, 2023
3 checks passed

w-e-w mentioned this pull request Dec 4, 2023

1.7.0-RC #14196

Closed

KohakuBlueleaf mentioned this pull request Dec 13, 2023

Fix wrong implementation in network_oft #14300

Merged

4 tasks

w-e-w mentioned this pull request Dec 16, 2023

1.7.0 #14323

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support inference with OFT networks #13692

Support inference with OFT networks #13692

v0xie commented Oct 19, 2023 •

edited

Loading

v0xie commented Oct 20, 2023

v0xie commented Oct 21, 2023

v0xie commented Oct 22, 2023

KohakuBlueleaf commented Oct 27, 2023

v0xie commented Nov 4, 2023

KohakuBlueleaf commented Nov 4, 2023

KohakuBlueleaf commented Nov 11, 2023 •

edited

Loading

KohakuBlueleaf commented Nov 11, 2023

v0xie commented Nov 12, 2023

FurkanGozukara commented Nov 12, 2023

KohakuBlueleaf commented Nov 13, 2023 •

edited

Loading

KohakuBlueleaf commented Nov 13, 2023 •

edited

Loading

cyber-meow commented Nov 15, 2023 •

edited

Loading

v0xie commented Nov 16, 2023

v0xie commented Nov 16, 2023

KohakuBlueleaf commented Dec 13, 2023

KohakuBlueleaf commented Dec 13, 2023

Support inference with OFT networks #13692

Support inference with OFT networks #13692

Conversation

v0xie commented Oct 19, 2023 • edited Loading

Description

Any suggestions on how to fix these issues is greatly appreciated!

Screenshots/videos:

Checklist:

v0xie commented Oct 20, 2023

v0xie commented Oct 21, 2023

v0xie commented Oct 22, 2023

KohakuBlueleaf commented Oct 27, 2023

v0xie commented Nov 4, 2023

KohakuBlueleaf commented Nov 4, 2023

KohakuBlueleaf commented Nov 11, 2023 • edited Loading

KohakuBlueleaf commented Nov 11, 2023

v0xie commented Nov 12, 2023

FurkanGozukara commented Nov 12, 2023

KohakuBlueleaf commented Nov 13, 2023 • edited Loading

KohakuBlueleaf commented Nov 13, 2023 • edited Loading

cyber-meow commented Nov 15, 2023 • edited Loading

Using additional network with smaller weight

Using multiple additional networks

v0xie commented Nov 16, 2023

v0xie commented Nov 16, 2023

KohakuBlueleaf commented Dec 13, 2023

KohakuBlueleaf commented Dec 13, 2023

v0xie commented Oct 19, 2023 •

edited

Loading

KohakuBlueleaf commented Nov 11, 2023 •

edited

Loading

KohakuBlueleaf commented Nov 13, 2023 •

edited

Loading

KohakuBlueleaf commented Nov 13, 2023 •

edited

Loading

cyber-meow commented Nov 15, 2023 •

edited

Loading