
Time taken to run one iteration differs significantly depending on the step or setup #5

Open
Junggy opened this issue Oct 9, 2023 · 3 comments

Comments

Junggy commented Oct 9, 2023

Hello, thanks for uploading the working code!

Unfortunately, when I was running the code, I found that the training time differs significantly depending on the setup / step.

With some setups, one epoch took more than an hour (and the time increased even further per step). With other setups, one epoch took 5 minutes, but at some point each epoch started to take about an hour or more, which made it impossible to run enough epochs to reproduce the good results.

I am not sure what is causing the issue. It is possible that some artifact is happening internally and the sparse conv takes more time, but I am not really sure. As far as I can see, torch.nn.utils.clip_grad_value_(self.network.parameters(), 10) plays an important role in keeping training fast, but at some point it starts to take forever again...
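
For context, here is a minimal sketch (with a toy model and data, not this repo's actual training loop) of where that clip_grad_value_ call usually sits, between backward() and the optimizer step:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the real network/optimizer/data, just to show placement.
network = nn.Linear(16, 1)
optimizer = torch.optim.Adam(network.parameters(), lr=1e-3)
x, y = torch.randn(32, 16), torch.randn(32, 1)

for step in range(100):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(network(x), y)
    loss.backward()
    # Clamp every gradient element to [-10, 10] before the update,
    # matching the clip_grad_value_(..., 10) call quoted above.
    torch.nn.utils.clip_grad_value_(network.parameters(), 10)
    optimizer.step()
```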

Did you also have a similar experience during training?
And what is the best option to prevent this from happening?

Looking forward to hearing back from you.

fanegg (Owner) commented Oct 14, 2023

@Junggy Hi, we didn't have similar problems with the dataset used in our experiments. Could you tell me your data setting? In particular, the training time mainly depends on the image resolution. Considering that image masks significantly impact the sparsity of the sparse conv, we recommend you manually check the masks of your data.
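
As a quick way to do that check (just a sketch; the mask directory and file layout below are assumptions, so adjust them to your data), you can compute the foreground occupancy of each mask:

```python
import numpy as np
from pathlib import Path
from PIL import Image

# Assumed layout: one binary mask PNG per frame. Adjust the path/glob to your data.
mask_dir = Path("data/zju_mocap/CoreView_377/mask")

for mask_path in sorted(mask_dir.glob("**/*.png")):
    mask = np.array(Image.open(mask_path)) > 0
    # Fraction of foreground pixels the sparse conv has to process.
    occupancy = mask.mean()
    print(f"{mask_path.name}: {occupancy:.2%} foreground")
```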

Junggy (Author) commented Oct 22, 2023

Thanks for your reply!

Actually, I trained with the ZJU-MoCap dataset, CoreView315, for which I eventually found the right setup, trained fairly fast, and got a reasonable result.

However, I have now started training with CoreView377 using the same setup, and training takes around 10 times longer: one epoch takes around 6000 seconds, while inference during evaluation was fairly fast and the result was also reasonable... It's hard to tell what is wrong, though.

What was the average time per epoch on your side when training on the ZJU dataset?
Also, I see that one epoch is defined as 500 iterations. How many actual epochs (meaning # train images = 1 epoch, not 500 iterations = 1 epoch) are usually required for the result to converge?

Looking forward to hearing back from you!

Junggy (Author) commented Nov 2, 2023

Actually, I found out this is an issue with GPU memory.
Somehow, instead of throwing a CUDA out-of-memory error, this code just allocates everything in GPU shared memory (CPU memory), which significantly slows down progress. I just lowered the resolution to 0.3 (previously 0.5) so that everything fits on the GPU, and now one epoch takes 211 seconds again instead of 3000-5000 seconds.
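
For anyone hitting the same problem, a small helper like this (a sketch, not part of this repo) can be called once per epoch to catch the point where allocations stop fitting in dedicated GPU memory and start spilling into shared/system memory:

```python
import torch

def log_cuda_memory(tag: str = "") -> None:
    """Print currently allocated / reserved CUDA memory in GiB."""
    if not torch.cuda.is_available():
        return
    allocated = torch.cuda.memory_allocated() / 2**30
    reserved = torch.cuda.memory_reserved() / 2**30
    print(f"[{tag}] allocated {allocated:.2f} GiB, reserved {reserved:.2f} GiB")

# Example: call at the end of each epoch; if these numbers approach the card's
# capacity, reduce the resolution (e.g. 0.5 -> 0.3) or the batch size.
log_cuda_memory("epoch end")
```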
