Restart tracker #235
* use timestamp and project_name to update row with pd
* update sentinel to fixed uuid (to prevent autoreload from messing with object())
* some logging improvements
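The sentinel change in the second commit can be illustrated with a small sketch. It assumes that the sentinel marks something like "value not provided" and that autoreload re-executes a module's top-level code, so a fresh `object()` is created on every reload. The two loader functions and the UUID value below are hypothetical stand-ins for illustration, not names or values taken from the PR.

```python
import uuid

# Hypothetical fixed value; any constant UUID string works for this illustration.
FIXED_SENTINEL = "a1b2c3d4-0000-4000-8000-000000000000"


def load_with_object_sentinel():
    """Simulate one (re)load of a module that defines `_sentinel = object()`."""
    return object()


def load_with_uuid_sentinel():
    """Simulate one (re)load of a module that defines a fixed-UUID sentinel."""
    return str(uuid.UUID(FIXED_SENTINEL))


# Two loads of the same module, as happens when autoreload re-executes it.
old_obj, new_obj = load_with_object_sentinel(), load_with_object_sentinel()
old_uid, new_uid = load_with_uuid_sentinel(), load_with_uuid_sentinel()

# Identity checks against an object() sentinel break across reloads,
# while equality checks against a fixed UUID string keep working.
object_sentinel_survives = old_obj is new_obj
uuid_sentinel_survives = old_uid == new_uid
```

Code that captured the old sentinel before a reload would no longer recognize the new one by identity, which is presumably the problem the fixed UUID avoids.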
May I suggest adding an actual To remind: our use case is not to stop / do something / start, but to do intermediate logging, so that we don't lose data if the program is killed by SLURM, crashes because of GPU corruption, etc. Because the program gets restarted from the same checkpoint (this is ML training), we thought that saving the data at the same time as the checkpoint would give the best synchronization. Note that the checkpoint is also saved when the program exits normally. After writing the above, this is not a restart that we actually want, but
So @stas00, you want a method to write the CSV without stopping CodeCarbon, and when stop is called the CSV entry is updated so that you have only one line. Is that what you need? You don't need to pause the process, right? Do you have internet access in your environment? The new API could keep track of all your training runs, and you could aggregate data per training run or per project as needed. We hope to release the new package in September, but we will be happy to discuss your use case with you before then.
What you said. I don't think the number of lines matters, since the training process gets restarted many times, i.e. it may run for weeks or months before it completes. So if it's faster to append, it's probably better.
The compute nodes have no internet. We sync the data to an outside server periodically.
OK, I've just pushed a new method, flush(), that computes emissions and adds them to the CSV, and/or calls the API if you use it.
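To make the intended usage concrete, here is a minimal sketch of writing emissions at every checkpoint without stopping the tracker. It does not use CodeCarbon itself: `ToyTracker`, its made-up `rate_per_second`, the CSV columns, and the `train` loop are all hypothetical stand-ins, and only the idea of a `flush()` that appends a row while the tracker keeps running comes from the discussion above.

```python
import csv
import os
import tempfile
import time


class ToyTracker:
    """Hypothetical stand-in for an emissions tracker; not CodeCarbon's code."""

    FIELDS = ["timestamp", "project_name", "duration", "emissions"]

    def __init__(self, project_name, csv_path, rate_per_second=0.001):
        self.project_name = project_name
        self.csv_path = csv_path
        self.rate_per_second = rate_per_second  # made-up emission rate
        self._start = None

    def start(self):
        self._start = time.monotonic()

    def _append_row(self):
        # Append one row with the totals measured since start().
        duration = time.monotonic() - self._start
        write_header = not os.path.exists(self.csv_path)
        with open(self.csv_path, "a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=self.FIELDS)
            if write_header:
                writer.writeheader()
            writer.writerow({
                "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
                "project_name": self.project_name,
                "duration": f"{duration:.6f}",
                "emissions": f"{duration * self.rate_per_second:.9f}",
            })

    def flush(self):
        """Persist the emissions so far; the tracker keeps running."""
        self._append_row()

    def stop(self):
        """Persist a final row and stop tracking."""
        self._append_row()
        self._start = None


def train(tracker, steps, checkpoint_every):
    tracker.start()
    for step in range(1, steps + 1):
        # ... one training step would go here ...
        if step % checkpoint_every == 0:
            # ... save the checkpoint here, then persist emissions with it ...
            tracker.flush()
    tracker.stop()


csv_path = os.path.join(tempfile.mkdtemp(), "emissions.csv")
train(ToyTracker("demo", csv_path), steps=6, checkpoint_every=2)
with open(csv_path, newline="") as f:
    rows = list(csv.DictReader(f))
```

In this sketch each flush appends a line, so a job killed between checkpoints loses at most the emissions accumulated since the last checkpoint.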
This is perfect, @benoit-cty. Thank you! If this larger-scope PR will take a while to complete, perhaps we could merge just the
We could merge my part into master (see #236). Do you need a new version of the package on pip?
That would be ideal on both accounts if it's not too much trouble. Thank you, @benoit-cty!
FWIW,
Yeah I think it's not worth thinking too hard about this while there's no explicit need. Thanks! |
Addressing #217: calling the (`start()`, `stop()`) pair multiple times.

WIP -> this may delete or mess with existing CSV files.

How it works: the gist of it is that `tracker.stop()` stores an `_end_time`, so calling `tracker.start()` after `stop()` will use this to detect that it was re-started. Each `persistence_obj` should handle this appropriately. As of now, only the CSV output is implemented (see issue below).

Issue: I could not get the scheduler to just `start()` again, so I create a new one if the tracker is re-started.

Issue: There is currently no ID in the CSV file (there used to be one but now it's gone), so the current PR expects `(timestamp, project_name)` to be a unique pair.

Warning: only the latest `emissions_data` is written. This means the `timestamp` is that of the latest write, not the first one. Is this an issue?

To do: `duration` implementation.
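The restart mechanism described above can be sketched roughly as follows. This is a stdlib-only illustration under stated assumptions, not the PR's implementation: `RestartableTracker`, `CsvOutput`, and the `out(row, restarted)` signature are hypothetical, the `csv` module is used instead of pandas, the sketch fixes the `timestamp` at the first `start()` so that it can serve as a stable key (the PR notes that its own timestamp is that of the latest write), and accumulating `duration` across restarts is a guess at the open to-do item.

```python
import csv
import os
import tempfile
import time

FIELDS = ["timestamp", "project_name", "duration"]


class CsvOutput:
    """Hypothetical persistence object: appends a row, or replaces it on restart."""

    def __init__(self, path):
        self.path = path

    def _read(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path, newline="") as f:
            return list(csv.DictReader(f))

    def out(self, row, restarted):
        rows = self._read()
        key = (row["timestamp"], row["project_name"])
        if restarted:
            # Drop the earlier row with the same key so the new row replaces it.
            rows = [r for r in rows if (r["timestamp"], r["project_name"]) != key]
        rows.append(row)
        with open(self.path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=FIELDS)
            writer.writeheader()
            writer.writerows(rows)


class RestartableTracker:
    """Hypothetical tracker that detects a restart through `_end_time`."""

    def __init__(self, project_name, output):
        self.project_name = project_name
        self.output = output
        self._timestamp = None   # set on the first start(); part of the row key
        self._start_time = None
        self._end_time = None    # set by stop(); its presence marks a restart
        self._restarted = False
        self._duration = 0.0

    def start(self):
        if self._end_time is not None:
            self._restarted = True  # stop() already ran once: this is a restart
        else:
            self._timestamp = time.strftime("%Y-%m-%dT%H:%M:%S")
        self._start_time = time.monotonic()

    def stop(self):
        self._end_time = time.monotonic()
        self._duration += self._end_time - self._start_time
        row = {
            "timestamp": self._timestamp,
            "project_name": self.project_name,
            "duration": f"{self._duration:.6f}",
        }
        self.output.out(row, self._restarted)


csv_path = os.path.join(tempfile.mkdtemp(), "emissions.csv")
tracker = RestartableTracker("demo", CsvOutput(csv_path))
for _ in range(2):  # two start()/stop() pairs on the same tracker
    tracker.start()
    tracker.stop()
with open(csv_path, newline="") as f:
    rows = list(csv.DictReader(f))
```

With this keying, repeated `start()`/`stop()` pairs leave a single row per tracker, and two trackers that share both timestamp and project name would overwrite each other, which is the uniqueness concern raised in the second issue.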