Time taken to run one iteration differs significantly depending on the step or setup #5
Comments
@Junggy Hi, we didn't have similar problems with the dataset used in our experiments. Could you tell me your data setting? In particular, the training time mainly depends on the image resolution. Since the image masks significantly affect the sparsity of the sparse convolution, we recommend that you manually check the masks of your data.
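For reference, a minimal sketch (not from this repository) of the kind of mask check suggested above; the mask directory layout, file pattern, and mask format are assumptions:

```python
# Hypothetical mask check: estimate how much foreground each mask contains,
# since denser masks mean more occupied sites for the sparse convolution.
# The path pattern below is an assumption, not the repository's actual layout.
import glob
import cv2
import numpy as np

mask_paths = sorted(glob.glob("data/CoreView377/mask/*/*.png"))  # hypothetical path
ratios = []
for path in mask_paths:
    mask = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if mask is None:
        continue  # skip unreadable files
    ratios.append((mask > 0).mean())  # fraction of foreground pixels

ratios = np.array(ratios)
print(f"frames: {len(ratios)}, mean occupancy: {ratios.mean():.3f}, max: {ratios.max():.3f}")
# A frame with unusually high occupancy (e.g. a corrupted, mostly-white mask)
# would inflate the number of active sites and slow down every iteration that uses it.
```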
Thanks for your reply! I actually trained on the ZJU-MoCap dataset, CoreView315, for which I eventually found the right setup; training was fairly fast and the results were reasonable. However, I have now started training on CoreView377 with the same setup, and training takes around 10 times longer: one epoch took about 6000 seconds, while inference during evaluation was fairly fast and the results were also reasonable. It is hard to tell what is wrong, though. What was the average time per epoch on your side when training on the ZJU dataset? Looking forward to hearing back from you!
Actually, I found out that this was an issue with GPU memory.
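A minimal, generic sketch of one way to surface such a slowdown in the training log, by recording per-step time together with CUDA memory; `StepTimer` is a hypothetical helper, not part of this repository:

```python
# Hypothetical helper: log per-step wall time and CUDA memory so that a
# memory-related slowdown shows up directly in the training log.
import time
import torch

class StepTimer:
    def __init__(self):
        self.last = time.time()

    def log(self, step: int) -> None:
        if not torch.cuda.is_available():
            return
        torch.cuda.synchronize()  # finish queued GPU work so the timing is meaningful
        now = time.time()
        allocated = torch.cuda.memory_allocated() / 2**20  # MiB
        reserved = torch.cuda.memory_reserved() / 2**20    # MiB
        print(f"step {step}: {now - self.last:.2f}s, "
              f"allocated {allocated:.0f} MiB, reserved {reserved:.0f} MiB")
        self.last = now

# Usage (inside the training loop): create one StepTimer and call timer.log(step)
# after optimizer.step(); steadily growing memory alongside growing step times
# would point at the kind of GPU memory issue described above.
```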
Hello, thanks for uploading the working code!
Unfortunately, when I was running the code, I found that the training time differs significantly depending on the setup and the step.
With some setups, one epoch took more than an hour (and the time increased further with each step).
With other setups, one epoch took about 5 minutes at first, but at some point each epoch started to take an hour or more, which made it impossible to run enough epochs to reproduce the good results.
I am not sure what is causing the issue. It is possible that some artifact builds up internally and the sparse convolution takes more time, but I am not really sure. As far as I can see, torch.nn.utils.clip_grad_value_(self.network.parameters(), 10) does an important job in keeping training fast, but at some point it starts to take forever again (see the sketch after this post).
Did you also have a similar experience during training?
And what is the best option to prevent this from happening?
Looking forward to hearing back from you.
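For context, a generic sketch (not the repository's actual training loop) of where torch.nn.utils.clip_grad_value_ usually sits in a PyTorch training step; `network`, `optimizer`, `batch`, and `compute_loss` are placeholders:

```python
# Generic PyTorch training step showing the usual placement of gradient value
# clipping: after backward() has populated the gradients, before optimizer.step().
# All names here are placeholders; the observation that clipping affects training
# speed comes from the issue above, not from PyTorch itself.
import torch

def train_step(network, optimizer, batch, compute_loss, clip_value=10.0):
    optimizer.zero_grad()
    loss = compute_loss(network, batch)  # forward pass returning a scalar loss
    loss.backward()                      # compute gradients
    torch.nn.utils.clip_grad_value_(network.parameters(), clip_value)
    optimizer.step()
    return loss.item()
```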