
Time taken to run one iteration differs significantly depending on the step or setup #5

Open
Junggy opened this issue Oct 9, 2023 · 3 comments

Comments

Junggy commented Oct 9, 2023

Hello, thanks for uploading the working code!

Unfortunately, when I was running the code, I found that the training time differs significantly depending on the setup / step.

With some setups, one epoch took more than an hour (and the time increased even further per step). With other setups, one epoch took 5 minutes, but at some point each epoch started to take about an hour or more, which made it impossible to run enough epochs to reproduce the good results.

I am not sure what is causing the issue. It is possible that some artifact is happening internally and the sparse conv takes more time, but I am not really sure. As far as I can see, torch.nn.utils.clip_grad_value_(self.network.parameters(), 10) plays an important role in keeping training fast, but at some point it starts to take forever again...
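
For context, here is a minimal sketch (with a toy model and data, not this repo's actual training loop) of where that clip_grad_value_ call usually sits, between backward() and the optimizer step:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real network/optimizer/data, just to show placement.
network = nn.Linear(16, 1)
optimizer = torch.optim.Adam(network.parameters(), lr=1e-3)
x, y = torch.randn(32, 16), torch.randn(32, 1)

for step in range(100):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(network(x), y)
    loss.backward()
    # Clamp every gradient element to [-10, 10] before the update,
    # matching the clip_grad_value_(..., 10) call quoted above.
    torch.nn.utils.clip_grad_value_(network.parameters(), 10)
    optimizer.step()
```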

Did you also have a similar experience during training?
And what is the best option to prevent this from happening?

Looking forward to hearing back from you.

fanegg (Owner) commented Oct 14, 2023

@Junggy Hi, we didn't have similar problems with the dataset used in our experiments. Could you tell me your data setting? In particular, the training time mainly depends on the image resolution. Considering that image masks significantly impact the sparsity of the sparse conv, we recommend you manually check the masks of your data.
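
As a quick way to do that check (just a sketch; the mask directory and file layout below are assumptions, so adjust them to your data), you can compute the foreground occupancy of each mask:

```python
import numpy as np
from pathlib import Path
from PIL import Image

# Assumed layout: one binary mask PNG per frame. Adjust the path/glob to your data.
mask_dir = Path("data/zju_mocap/CoreView_377/mask")

for mask_path in sorted(mask_dir.glob("**/*.png")):
    mask = np.array(Image.open(mask_path)) > 0
    # Fraction of foreground pixels the sparse conv has to process.
    occupancy = mask.mean()
    print(f"{mask_path.name}: {occupancy:.2%} foreground")
```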

Junggy (Author) commented Oct 22, 2023

Thanks for your reply!

Actually, I trained with the ZJU-MoCap dataset, CoreView315, for which I eventually found the right setup, trained fairly fast, and got a reasonable result.

However, I have now started training with CoreView377 using the same setup, and training takes around 10 times longer: one epoch takes around 6000 seconds, while inference during evaluation was fairly fast and the result was also reasonable... It's hard to tell what is wrong, though.

What was the average time per epoch on your side when training on the ZJU dataset?
Also, I see that one epoch is defined as 500 iterations. How many actual epochs (meaning # train images = 1 epoch, not 500 iterations = 1 epoch) are usually required for the result to converge?

Looking forward to hearing back from you!

Junggy (Author) commented Nov 2, 2023

Actually, I found out this is an issue with GPU memory.
Somehow, instead of throwing a CUDA out-of-memory error, this code just allocates everything in GPU shared memory (CPU memory), which significantly slows down progress. I just lowered the resolution to 0.3 (previously 0.5) so that everything fits on the GPU, and now one epoch takes 211 seconds again instead of 3000-5000 seconds.
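
For anyone hitting the same problem, a small helper like this (a sketch, not part of this repo) can be called once per epoch to catch the point where allocations stop fitting in dedicated GPU memory and start spilling into shared/system memory:

```python
import torch

def log_cuda_memory(tag: str = "") -> None:
    """Print currently allocated / reserved CUDA memory in GiB."""
    if not torch.cuda.is_available():
        return
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"[{tag}] allocated {allocated:.2f} GiB, reserved {reserved:.2f} GiB")

# Example: call at the end of each epoch; if these numbers approach the card's
# capacity, reduce the resolution (e.g. 0.5 -> 0.3) or the batch size.
log_cuda_memory("epoch end")
```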
