Replies: 2 comments
-
It's a 12B-parameter model. An H100 doesn't hit the issue because of its dispatch layer.
-
Not sure why SD3 would behave like that; it's only 2B. You should see per-step time grow sublinearly with batch size while per-step throughput grows linearly. On Flux with a 4090, a batch size of 1 at 1024px takes about 3.5 seconds per step, while a batch size of 2 takes about 6.5 seconds. That's less than 2x the time for 2x the throughput. Maybe you're looking at the runtime estimate in the progress bar? Or are you setting a max step count and then expecting the estimate to reflect full epochs?
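For reference, here is a minimal, self-contained sketch of that kind of measurement. The model is a toy transformer standing in for the actual diffusion transformer (none of this is the repo's training loop); it times full forward/backward/optimizer steps and prints seconds per step and samples per second for batch sizes 1 and 2.

```python
import time
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

# Toy stand-in model: a small transformer, not the real diffusion model.
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=1024, nhead=16, dim_feedforward=4096,
                               batch_first=True, device=device, dtype=dtype),
    num_layers=8,
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def time_steps(batch_size, steps=10, seq_len=1024):
    # Random inputs/targets so no data pipeline is involved in the timing.
    x = torch.randn(batch_size, seq_len, 1024, device=device, dtype=dtype)
    target = torch.randn_like(x)

    def step():
        optimizer.zero_grad(set_to_none=True)
        loss = (model(x) - target).pow(2).mean()
        loss.backward()
        optimizer.step()

    step()  # warm-up so one-time kernel selection is excluded from timing
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(steps):
        step()
    if device == "cuda":
        torch.cuda.synchronize()
    per_step = (time.perf_counter() - start) / steps
    print(f"bs={batch_size}: {per_step:.3f} s/step, "
          f"{batch_size / per_step:.2f} samples/s")

for bs in (1, 2):
    time_steps(bs)
```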
-
I have been using this repo for a while, but here is a problem I have seen from day 1: no matter what model I use (SD3/FLUX), what training I do (LoRA, full model, or some of my customized architectures), what GPUs I use (A6000s, A100s), or what precision (mostly bf16, but I tested fp16 too), training time always scales linearly with batch size. I tried to debug this but have gotten nowhere so far. Is this a common issue?
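One way to narrow this down (a hedged sketch, not the repo's code; `dataloader` and `train_step` here are placeholders for whatever the training script actually runs) is to time the data fetch and the GPU step separately. If the fetch share grows with batch size while the GPU step stays sublinear, the bottleneck is the input pipeline rather than the model.

```python
import time
import torch

def profile_iters(dataloader, train_step, max_iters=20, device="cuda"):
    """Split each iteration into data-fetch time and GPU-step time.

    `dataloader` yields batches and `train_step(batch)` is assumed to run
    forward/backward/optimizer for one batch; both are placeholders.
    """
    fetch_total, step_total = 0.0, 0.0
    it = iter(dataloader)
    for _ in range(max_iters):
        t0 = time.perf_counter()
        batch = next(it)              # CPU-side fetch: decode, augment, collate
        t1 = time.perf_counter()
        train_step(batch)             # forward / backward / optimizer update
        if device == "cuda":
            torch.cuda.synchronize()  # wait for queued GPU work to finish
        t2 = time.perf_counter()
        fetch_total += t1 - t0
        step_total += t2 - t1
    print(f"data fetch: {fetch_total / max_iters:.3f} s/iter, "
          f"GPU step:   {step_total / max_iters:.3f} s/iter")
```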