offload device and main device have a huge impact on lora #245


Open · wwwffbf opened this issue Jan 2, 2025 · 33 comments

@wwwffbf commented Jan 2, 2025

I trained a character lora using diffusion-pipe, and found that in the default hunyuan video wrapper workflow the offload device gives a better lora effect, while the main device deviates significantly from the lora training set.

I'm very confused. Does anyone know why?

@kijai (Owner) commented Jan 2, 2025

It's possible the LoRA simply fails to load unless the model is initialized on the offload device.

@wwwffbf (Author) commented Jan 3, 2025

@kijai Thanks for the quick reply! Should I always use the offload device instead of the main device?
I've found that both are basically the same speed with auto cpu offload turned off.

@0000111100001111

Loras don't really seem to work properly here. With main device they don't do anything at all, but even with offload device they're not doing what they're supposed to I think.

They work without issue in native comfyui, but the same base model, loras and prompt don't do much when used with the wrapper nodes, even though it claims to load successfully. There's definitely an effect, as it does change the image and you can create artifacts with >= 2.0 lora strength, but I don't think it's recognizing any movement from the Lora. Maybe it's related to how the wrapper is handling the prompt/clip? Thought initially it was the block swap maybe killing the lora, but the result seems similar without it.

Sadly native comfyui's hunyuan is somewhat useless without the blockswap for my card. Is that feature one that could be added as a custom node maybe, or is it too integral to the process to be easily adaptable?

@kijai (Owner) commented Jan 5, 2025

> Loras don't really seem to work properly here. With main device they don't do anything at all, but even with offload device they're not doing what they're supposed to I think.
>
> They work without issue in native comfyui, but the same base model, loras and prompt don't do much when used with the wrapper nodes, even though it claims to load successfully. There's definitely an effect, as it does change the image and you can create artifacts with >= 2.0 lora strength, but I don't think it's recognizing any movement from the Lora. Maybe it's related to how the wrapper is handling the prompt/clip? Thought initially it was the block swap maybe killing the lora, but the result seems similar without it.
>
> Sadly native comfyui's hunyuan is somewhat useless without the blockswap for my card. Is that feature one that could be added as a custom node maybe, or is it too integral to the process to be easily adaptable?

I've had no issues with LoRAs, or seen any particular LoRA-related difference between the wrapper and native; both use comfy LoRA loading too.

@0000111100001111

Hmm, maybe it's a local issue somehow.

I'm trying now with a fully updated comfyui/wrapper and a clean and simple workflow for both. Take for instance the beautiful 'Titty drop' lora (tittydrop_v1) from civitai, since that's a very recognizable movement. In native it's working instantly and accurately every time, no matter the seed and other settings. She's always performing that particular movement, even though the scene and details may change. In the wrapper, with the same resolution, frame count, flow, guidance and prompt, it's yet to succeed once, and there's usually not even any real hint of that movement going on even though the actual scene is similar in both cases.

Disconnecting the select lora node in the wrapper instantly produces a somewhat different scene however, one where the woman is now further away, or not facing the camera, or some other variation. So it's doing something.

@0000111100001111

Actually, I just managed to reproduce the movement itself in the wrapper by increasing the lora strength to 1.8. It's greatly distorting the image, artifacts everywhere, and she's often lifting empty air instead of a shirt, but the movement is now consistently there every time.

In native I'm just using a strength of 1.0 though. Even 0.8 is working fine there.

Is it possible that native comfyui is somehow using different block weights for the hunyuan lora or some such?

@kijai (Owner) commented Jan 5, 2025

> Actually, I just managed to reproduce the movement itself in the wrapper by increasing the lora strength to 1.8. It's greatly distorting the image, artifacts everywhere, and she's often lifting empty air instead of a shirt, but the movement is now consistently there every time.
>
> In native I'm just using a strength of 1.0 though. Even 0.8 is working fine there.
>
> Is it possible that native comfyui is somehow using different block weights for the hunyuan lora or some such?

No, there's no such difference. The native version is different in many respects since comfy implemented the model his own way, which is why we can't compare the same seed 1:1 either. I'm not sure about block swap's effect on LoRAs as I rarely use it myself; have you tried the new auto CPU offloading instead?

There may of course be things happening at low VRAM that I've not been able to experience myself; native comfy does use LoRAs differently in low VRAM mode too.

@0000111100001111

I see. Sadly I can't seem to make it work properly no matter what I try.

CPU offloading instead of block swapping doesn't seem to change anything. Using block swapping but changing the amount of blocks swapped does not appear to affect the final result either. If block swapping or cpu offload is what's killing the lora, then it seems to be a binary effect rather than one based on which blocks are being swapped.

I'm also trying to run it without either, but that's quite a challenge since it looks like it's circumventing even comfyui's normal memory management and so fills up vram with the model. I'm forced to run it at very low res and frames to finish in acceptable time, and with those settings I can't really tell if it's working or not... The resolution's making the entire image unreliable.

@kijai (Owner) commented Jan 5, 2025

> I see. Sadly I can't seem to make it work properly no matter what I try.
>
> CPU offloading instead of block swapping doesn't seem to change anything. Using block swapping but changing the amount of blocks swapped does not appear to affect the final result either. If block swapping or cpu offload is what's killing the lora, then it seems to be a binary effect rather than one based on which blocks are being swapped.
>
> I'm also trying to run it without either, but that's quite a challenge since it looks like it's circumventing even comfyui's normal memory management and so fills up vram with the model. I'm forced to run it at very low res and frames to finish in acceptable time, and with those settings I can't really tell if it's working or not... The resolution's making the entire image unreliable.

I only really have experience with the early LoRAs (I've been without a PC for weeks now due to hardware failure), but the Arcane LoRAs I mostly tested with worked perfectly for me.

@0000111100001111

I tried that one just now. It worked instantly in native but failed to apply any style in the wrapper, just giving a photorealistic video instead. However, since that one applies a complete style and not just a particular movement, it was a bit easier to tell if it was working even at low resolution, so I tried it with no offloading. From what I can tell, it's working. Poorly, since it's very noisy and it's morphing a lot, and the style is not entirely consistent with native, but it might just be the low res (128x256) and frame count (13) doing it. The image is definitely stylized though, across two different seeds.

So it looks like the conclusion has to be that both block swapping and cpu offloading at least partially kill the effect of loras.

@0000111100001111

Think I might have found a solution. I noticed this in one of your recent commits, where the ModelLoader node loads loras:

comfy.model_management.load_models_gpu([patcher], force_full_load=True, force_patch_weights=True)

was changed to:

comfy.model_management.load_models_gpu([patcher])

Thought force_patch_weights sounded promising, and indeed it was. Forcing that one back to True seems to have re-enabled lora effects for both block swap and cpu offload. And forcing full load back to True seems to have improved it further (possibly because it would otherwise just patch the weights currently loaded, which in a low-memory environment would be less than all of them?).

I guess the force full load flag is going to make vram usage spike for a few seconds while patching the weights, until the swap/offload takes place later, but that's a fair price to pay for loras working.
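
For reference, a minimal sketch of what re-adding those arguments looks like. The wrapper function and its comments here are assumptions based on this thread rather than the wrapper's actual code; only the load_models_gpu call and its two keyword arguments are taken from the commit quoted above:

```python
import comfy.model_management

def load_patched_model(patcher):
    """Load a ModelPatcher (with LoRA key patches already added) onto the GPU.

    The force_* arguments are the ones quoted from the earlier commit; the rest
    of this function is an illustrative assumption, not the wrapper's code.
    """
    # Previous behaviour (LoRA patching could be skipped when VRAM looked tight):
    # comfy.model_management.load_models_gpu([patcher])

    comfy.model_management.load_models_gpu(
        [patcher],
        force_full_load=True,       # load all blocks before block swap / offload kicks in
        force_patch_weights=True,   # apply LoRA patches even on low-VRAM code paths
    )
    return patcher
```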

@kijai (Owner) commented Jan 5, 2025

> Think I might have found a solution. I noticed this in one of your recent commits, where the ModelLoader node loads loras:
>
> comfy.model_management.load_models_gpu([patcher], force_full_load=True, force_patch_weights=True)
>
> was changed to:
>
> comfy.model_management.load_models_gpu([patcher])
>
> Thought force_patch_weights sounded promising, and indeed it was. Forcing that one back to True seems to have re-enabled lora effects for both block swap and cpu offload. And forcing full load back to True seems to have improved it further (possibly because it would otherwise just patch the weights currently loaded, which in a low-memory environment would be less than all of them?).
>
> I guess the force full load flag is going to make vram usage spike for a few seconds while patching the weights, until the swap/offload takes place later, but that's a fair price to pay for loras working.

Interesting, I reverted it as it made zero difference for me and someone was indeed complaining about the memory use. I think this issue is about the comfy LoRA loading skipping the LoRA application if it determines there's not enough memory, and either of the force args bypasses that. Can you test whether you need both, or if either of them is enough?

@0000111100001111

PatchWeights alone was making it do something different, but it didn't look entirely proper. Only when I also activated FullLoad did it look right. I just now checked FullLoad alone, and that seemingly produces the same result as having both active, as far as Lora behavior is concerned.

However, there's definitely something wrong here. The lora effect seems to disappear from one generation to the next. It's not just an unlucky seed or anything like that, since identical settings will work fine after a restart of comfyui or switching between block swap and cpu offload.

I'm thinking maybe the forced model offloading is somehow messing up the lora-patched weight, making them revert to unpatched/non-lora.

@0000111100001111

Well, with ForceFullLoad active, loras are working using either blockswap or cpuoffload, but not more than once. On the second generation it's lost the lora effect. At that point, the way to re-activate the lora is to re-run the HyVideoModelLoader node, e.g. by changing the lora weight by a tiny fraction.

It's as if on subsequent runs it's using a cached version of the model weights that's from before the lora was patched in...

main_device vs offload_device doesn't seem to matter, nor does cpuoffload vs blockswap. Was thinking maybe the sampler node's force_offload flag set to false would help, but doesn't look like it. Adding in the force_patch_weights=True doesn't seem to help with this one either, though I need to retest it.

@kijai (Owner) commented Jan 5, 2025

> Well, with ForceFullLoad active, loras are working using either blockswap or cpuoffload, but not more than once. On the second generation it's lost the lora effect. At that point, the way to re-activate the lora is to re-run the HyVideoModelLoader node, e.g. by changing the lora weight by a tiny fraction.
>
> It's as if on subsequent runs it's using a cached version of the model weights that's from before the lora was patched in...
>
> main_device vs offload_device doesn't seem to matter, nor does cpuoffload vs blockswap. Was thinking maybe the sampler node's force_offload flag set to false would help, but doesn't look like it. Adding in the force_patch_weights=True doesn't seem to help with this one either, though I need to retest it.

That's pretty weird, as there shouldn't be anything unloading it; after the model loader the model isn't modified at all, just moved between devices... something must still be doing that if the effect disappears like that.

@0000111100001111

Ah, looks like that part was my fault. I was passing through a node that had mm.UnloadAllModels in it, after the sampler node, which broke the full-load lora application for the next run. Granted, I would have still expected it to be able to (re-)load the full lora-patched weights even if a node or the user forces a full model unload, so there might be a bug here nonetheless.
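
For anyone hitting the same thing: a rough sketch of why an "unload everything" node interacts badly here, assuming it simply calls ComfyUI's model-management helpers (the exact node implementation may differ):

```python
import comfy.model_management as mm

# What an "unload all models" style node effectively does after the sampler:
mm.unload_all_models()   # evicts the wrapper's patcher, LoRA-patched weights included

# On the next queue, ComfyUI's node caching serves the model loader's previous
# output (its inputs didn't change), so nothing re-applies the LoRA to the
# reloaded weights -- the "works once, then stops" behaviour described above.

# If VRAM really needs releasing between runs, this only drops cached CUDA
# allocations and leaves loaded/patched models alone:
mm.soft_empty_cache()
```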

@kijai (Owner) commented Jan 6, 2025

> Ah, looks like that part was my fault. I was passing through a node that had mm.UnloadAllModels in it, after the sampler node, which broke the full-load lora application for the next run. Granted, I would have still expected it to be able to (re-)load the full lora-patched weights even if a node or the user forces a full model unload, so there might be a bug here nonetheless.

That explains it, yes. It is this way because it allows using torch.compile with LoRAs, which currently doesn't work with the native version, and you just experienced the downside.

@0000111100001111

Alright! Well, removing the model unload node and keeping the force_full_load=True parameter, it now seems to work consistently with loras using both the block swap method and cpu offload. The full load does saturate my vram for 15-20 seconds, but that's not really a problem here; as soon as it starts sampling it unloads everything again. I guess it could maybe be a node param, with the warning that not doing a full load will prevent loras from working correctly when doing blockswap/offload.

@pondloso commented Jan 6, 2025

I also had the same issue.
On the first render the lora works just fine; after that it's like there's no lora at all.
I need to reload the model to make it work again.

But I saw kijai's suggestion about the new cpu offload; I tested it 3 times and the lora seems to work fine with no need to reload the model.
However, cpu offload is about 20% slower to generate (it uses less vram and can do higher res).

Maybe block swap just happens to clash with the lora... but why does it work on the first render and then shut itself off after that?

@kijai (Owner) commented Jan 6, 2025

> I also had the same issue.
> On the first render the lora works just fine; after that it's like there's no lora at all.
> I need to reload the model to make it work again.
>
> But I saw kijai's suggestion about the new cpu offload; I tested it 3 times and the lora seems to work fine with no need to reload the model.
> However, cpu offload is about 20% slower to generate (it uses less vram and can do higher res).
>
> Maybe block swap just happens to clash with the lora... but why does it work on the first render and then shut itself off after that?

Are you sure you're not using any unloading nodes or such? They would unload the LoRA, and unless the model loader is run again, it would stay unloaded. Such nodes are completely unnecessary with these wrapper nodes, as they already include force offload options that won't interfere with the loaded LoRAs.

@ioritree commented Jan 6, 2025

Same here: the first time the lora works, the second time it doesn't, and adjusting the lora's strength a little makes it work again.
The only other node used is "Clean VRAM Used".

@kijai (Owner) commented Jan 6, 2025

> Same here: the first time the lora works, the second time it doesn't, and adjusting the lora's strength a little makes it work again. The only other node used is "Clean VRAM Used".

Anything that does "unload_all_models" will cause this, and it's also completely unnecessary.

@TingTingin

I'm having the same issue: loras seem to not load unless you have the force_full_load=True argument added in the code. It would be nice if this was a node parameter.
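
A minimal sketch of how such a toggle could be exposed, using the standard ComfyUI custom-node pattern; the class, input names and the assumption that the incoming model is already a LoRA-patched ModelPatcher are illustrative, not the wrapper's actual loader:

```python
import comfy.model_management

class ExampleModelLoaderWithForceLoad:
    """Illustrative only: exposes force_full_load as a boolean node input."""

    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                "model": ("MODEL",),
                "force_full_load": ("BOOLEAN", {"default": False}),
            }
        }

    RETURN_TYPES = ("MODEL",)
    FUNCTION = "load"
    CATEGORY = "examples"

    def load(self, model, force_full_load):
        # 'model' is assumed to be a ModelPatcher with LoRA patches already added.
        comfy.model_management.load_models_gpu(
            [model],
            force_full_load=force_full_load,
            force_patch_weights=force_full_load,
        )
        return (model,)
```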

@rkfg commented Jan 21, 2025

I'm making my own UI frontend for ComfyUI and hit this issue I think. At first I thought it's something with the way I submit the JSON to the API because loras seem to work once and then turn off. Then I reproduced it in ComfyUI itself, it's very easy and the workflow is the most basic one (from the examples).

  1. Attach a lora, generate something
  2. Change the prompt only (add a period at the end for example)
  3. Generate again
  4. The output would be very different
  5. Detach the lora, generate again
  6. Get the same output, which means in 4) the lora was not applied

It looks like a caching issue of sorts. If I change the lora weight, the graph re-executes from the beginning and the output is correct. However, if the model loading node is not executed due to caching, lora evaporates for some reason. I suppose the force_full_load=True parameter from above masks the actual reason for this behavior but if it works I'll try it out as a temporary measure. EDIT: I added force_full_load=True, force_patch_weights=True to both instances of comfy.model_management.load_models_gpu([patcher]) and this behavior is unchanged. So it's probably something else. I can open a separate issue if needed.

I have to note, this issue does not exist in another workflow that uses the stock ComfyUI nodes (a big mess of SD3/Flux conditioners and simple model loaders). However, those nodes don't support Enhance-A-Video and other useful tricks.
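
Since the LoRA comes back whenever the loader branch actually re-executes, one API-side workaround for a custom frontend like the one described above is to make the loader's inputs differ slightly on every submission, so ComfyUI's caching marks that branch dirty. A rough sketch: the /prompt endpoint and payload shape are standard ComfyUI API, while the node/field matching is an assumption about your particular workflow JSON:

```python
import json
import urllib.request
import uuid

COMFY_URL = "http://127.0.0.1:8188"

def nudge_lora_strength(workflow: dict, run_index: int, epsilon: float = 1e-6) -> dict:
    """Alternate the LoRA strength by a visually negligible epsilon so the loader
    branch is re-executed (and the LoRA re-applied) on every run. Assumes the
    LoRA node exposes a numeric 'strength' input; adapt to your graph."""
    for node in workflow.values():
        if "lora" in node.get("class_type", "").lower():
            inputs = node.get("inputs", {})
            if isinstance(inputs.get("strength"), (int, float)):
                inputs["strength"] = inputs["strength"] + epsilon * (run_index % 2)
    return workflow

def queue_prompt(workflow: dict) -> dict:
    """Submit an API-format workflow to ComfyUI's /prompt endpoint."""
    payload = json.dumps({"prompt": workflow, "client_id": str(uuid.uuid4())}).encode()
    req = urllib.request.Request(f"{COMFY_URL}/prompt", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# Usage, per generation, with a workflow exported in API format:
#   workflow = json.load(open("workflow_api.json"))
#   queue_prompt(nudge_lora_strength(workflow, run_index=i))
```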

@EduSouza-programmer

> I'm making my own UI frontend for ComfyUI and hit this issue I think. At first I thought it's something with the way I submit the JSON to the API because loras seem to work once and then turn off. Then I reproduced it in ComfyUI itself, it's very easy and the workflow is the most basic one (from the examples).
>
>   1. Attach a lora, generate something
>   2. Change the prompt only (add a period at the end for example)
>   3. Generate again
>   4. The output would be very different
>   5. Detach the lora, generate again
>   6. Get the same output, which means in 4) the lora was not applied
>
> It looks like a caching issue of sorts. If I change the lora weight, the graph re-executes from the beginning and the output is correct. However, if the model loading node is not executed due to caching, lora evaporates for some reason. I suppose the force_full_load=True parameter from above masks the actual reason for this behavior but if it works I'll try it out as a temporary measure. EDIT: I added force_full_load=True, force_patch_weights=True to both instances of comfy.model_management.load_models_gpu([patcher]) and this behavior is unchanged. So it's probably something else. I can open a separate issue if needed.
>
> I have to note, this issue does not exist in another workflow that uses the stock ComfyUI nodes (a big mess of SD3/Flux conditioners and simple model loaders). However, those nodes don't support Enhance-A-Video and other useful tricks.

I'm facing the same problem and would like to add more information from my tests yesterday. It does seem that the problem is in comfyUI... everything is very strange to me: simply after some settings change, on the next generation the lora no longer works. I tested the same thing in the wrapper after I tested it in comfyUI, and the same thing happened; I don't understand it. I'm using an RTX 3090, 64gb of ram and everything fully updated.

@EduSouza-programmer

The only way I've seen to generate with the lora consistently, without losing the "lora", is to clear the caches and unload everything. The problem is that you need to do this every time you generate again. I'm going to open an issue in the comfyui repository; this is important for publicizing the problem.

[screenshot attached]
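
If clearing everything between generations is the route you take, it can at least be scripted; a sketch assuming a ComfyUI build that exposes the /free endpoint, with the trade-off already discussed above (after freeing, the model loader has to run again before the LoRA is applied):

```python
import json
import urllib.request

def free_models_and_cache(server: str = "http://127.0.0.1:8188") -> None:
    """Ask ComfyUI to unload models and free memory/cache between generations,
    roughly what the 'clean VRAM' style buttons and nodes do."""
    payload = json.dumps({"unload_models": True, "free_memory": True}).encode()
    req = urllib.request.Request(f"{server}/free", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        resp.read()

# Call this between queued generations; the next run then reloads and re-patches
# the model (LoRA included), at the cost of a full reload every time.
```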

@rkfg commented Jan 21, 2025

I'm not sure it's a ComfyUI issue as it only happens with Kijai's nodes. As I said, if you use only the stock nodes and stock LoraLoaderModelOnly it works fine every time.

@kijai (Owner) commented Jan 21, 2025

> I'm making my own UI frontend for ComfyUI and hit this issue I think. At first I thought it's something with the way I submit the JSON to the API because loras seem to work once and then turn off. Then I reproduced it in ComfyUI itself, it's very easy and the workflow is the most basic one (from the examples).
>
>   1. Attach a lora, generate something
>   2. Change the prompt only (add a period at the end for example)
>   3. Generate again
>   4. The output would be very different
>   5. Detach the lora, generate again
>   6. Get the same output, which means in 4) the lora was not applied
>
> It looks like a caching issue of sorts. If I change the lora weight, the graph re-executes from the beginning and the output is correct. However, if the model loading node is not executed due to caching, lora evaporates for some reason. I suppose the force_full_load=True parameter from above masks the actual reason for this behavior but if it works I'll try it out as a temporary measure. EDIT: I added force_full_load=True, force_patch_weights=True to both instances of comfy.model_management.load_models_gpu([patcher]) and this behavior is unchanged. So it's probably something else. I can open a separate issue if needed.
>
> I have to note, this issue does not exist in another workflow that uses the stock ComfyUI nodes (a big mess of SD3/Flux conditioners and simple model loaders). However, those nodes don't support Enhance-A-Video and other useful tricks.

And you don't have ANY extra nodes in the workflow? I can't reproduce this with my examples at all; any unload_all_models call would remove the LoRA, and there are other nodes that use those calls for whatever reason.

@rkfg commented Jan 21, 2025

Here's the workflow I use; it only has your nodes:

Hunyuan T2V (2).json

The first run:

HunyuanVideo_1024_00001.mp4

The second run (added two more periods in the prompt):

HunyuanVideo_1024_00002.mp4

The change is very funny tbh, almost intentional.

@EduSouza-programmer

As I mentioned, this behavior is unusual because I can reproduce it even with the ComfyUI nodes. Additionally, if the LoRA is not entirely removed during the second generation—for instance, due to modifications in the prompt—the LoRA's effect gradually diminishes. Eventually, it reaches a point where it has no impact at all.

@kijai (Owner) commented Jan 22, 2025

> As I mentioned, this behavior is unusual because I can reproduce it even with the ComfyUI nodes. Additionally, if the LoRA is not entirely removed during the second generation—for instance, due to modifications in the prompt—the LoRA's effect gradually diminishes. Eventually, it reaches a point where it has no impact at all.

You mean even with the native nodes? That sounds bizarre... issues like that could possibly transfer to the wrapper too, as I'm using ComfyUI's LoRA loading as well.

@Krishicall

I am also experiencing this, but for me the lora has no effect at all, even from the first gen. The lora works very well with the comfy native nodes, but I'd like to use the Enhance-A-Video node. Will try the code change tonight.

@Krishicall commented Jan 24, 2025

> I am also experiencing this, but for me the lora has no effect at all, even from the first gen. The lora works very well with the comfy native nodes, but I'd like to use the Enhance-A-Video node. Will try the code change tonight.

I was able to both fix and recreate the issue.

  1. I was using the native comfy 'Load VAE' and 'VAE Decode (Tiled)' nodes.
  2. Switching these to 'HunyuanVideo Vae Loader' and 'HunyuanVideo Decode', combined with the above code change, fixed the issue.
  3. Code change + native comfy Vae nodes created the lora offloading after first generation issue (if I remember correctly).
  4. No code change + Hunyuan Vae nodes = lora has no effect.
  5. Memory usage is quite high with the fix, and limits resolution/number of frames severely.
