Conversation

@vict0rsch (Contributor) commented Aug 18, 2021

Addressing #217

  • call the (start(), stop()) pair multiple times
  • use pandas to read/update the csv output

WIP -> this may delete or mess with existing CSV files

How it works: the gist of it is that tracker.stop() stores an _end_time, so calling tracker.start() after stop() uses it to detect that the tracker was re-started. Each persistence_obj should handle this appropriately. As of now, only the CSV output is implemented (see issues below).
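A toy sketch of that detection logic, assuming the attribute names from the description above (this is an illustration, not codecarbon's actual class):

```python
import time

class Tracker:
    """Stand-in for the restart detection described above: stop() records
    an end time, and a later start() uses it to see that the tracker was
    re-started (so persistence objects can update rather than append)."""

    def __init__(self):
        self._start_time = None
        self._end_time = None
        self.restarted = False

    def start(self):
        # If stop() ran before, this start() is a re-start.
        self.restarted = self._end_time is not None
        self._start_time = time.time()

    def stop(self):
        self._end_time = time.time()

tracker = Tracker()
tracker.start()
tracker.stop()
tracker.start()           # second start(): detected as a re-start
print(tracker.restarted)  # True
```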

Issue: I could not get the scheduler to simply start() again, so I create a new one if the tracker is re-started.

Issue: there is currently no ID in the CSV file (there used to be one, but it is now gone), so the current PR expects (timestamp, project_name) to be a unique pair.
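A minimal pandas sketch of what treating (timestamp, project_name) as a unique key implies for the CSV update; column names are illustrative, not necessarily codecarbon's exact schema:

```python
import io
import pandas as pd

# A tiny in-memory stand-in for the emissions CSV.
csv_data = io.StringIO(
    "timestamp,project_name,emissions\n"
    "2021-08-18T10:00:00,my-project,0.5\n"
)
df = pd.read_csv(csv_data)

# Locate the row for this run by the (timestamp, project_name) pair;
# update it in place if found, append a new row otherwise.
mask = (df["timestamp"] == "2021-08-18T10:00:00") & (
    df["project_name"] == "my-project"
)
if mask.any():
    df.loc[mask, "emissions"] = 0.9  # overwrite with the latest emissions_data
else:
    df.loc[len(df)] = ["2021-08-18T10:00:00", "my-project", 0.9]

# df.to_csv("emissions.csv", index=False) would then persist the result.
print(len(df))  # 1
```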

Warning: only the latest emissions_data is written, which means the timestamp is that of the latest write, not the first one. Is this an issue?

To do:

  • check duration implementation
  • Doc
  • re-start tests (csv)
  • Handle CodeCarbon API
  • re-start tests (API)

* use timestamp and project_name to update row with pd
* update sentinel to fixed uuid (to prevent autoreload from messing with object())
* some logging improvements
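The sentinel change in the commit notes refers to a known pitfall: a module-level `object()` sentinel gets a new identity each time the module is (auto)reloaded, so identity checks against stale references break; a fixed value compares equal across reloads. A minimal sketch (names and the sentinel value are illustrative, not codecarbon's):

```python
# A fixed, improbable string is stable across module (auto)reloads,
# unlike object(), which is recreated with a new identity on reload.
_SENTINEL = "sentinel-3f9a2b1c"  # illustrative fixed value

def duration(value=_SENTINEL):
    # Distinguish "no argument passed" from an explicit None.
    if value == _SENTINEL:
        return "not provided"
    return value

print(duration())      # 'not provided'
print(duration(None))  # None
```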
@stas00 (Contributor) commented Aug 20, 2021

May I suggest adding an actual restart method, which saves the user the trouble of implementing it? I also think it will help in the future, when a specific use case is common enough to warrant a method of its own.

To remind: our use case is not to stop / do something / start, but to do intermediary logging to avoid losing data should the program be killed by SLURM or crash because of GPU corruption, etc. Because the program gets restarted from the same checkpoint (this is ML training), we thought that saving the data at the same time the checkpoint is saved would provide the best synchronization. Note that the checkpoint is also saved when the program exits normally.

After writing the above: it is not a restart that we actually want, but cc.flush. Perhaps that is a much better logical name for the functionality we are asking for? restart was just the only thing I could think of to trigger intermediary data saving.
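The workflow being described can be sketched with stand-in names (FlushLog and train are hypothetical; the point is only where flush() is called relative to checkpointing):

```python
class FlushLog:
    """Stand-in tracker that just records flush calls."""
    def __init__(self):
        self.flushes = 0

    def flush(self):
        self.flushes += 1

def train(steps, checkpoint_every, tracker):
    # Flush emissions alongside each checkpoint, so a SLURM kill or
    # GPU crash loses at most one interval of measurements.
    for step in range(1, steps + 1):
        if step % checkpoint_every == 0:
            # save_checkpoint(...) would go here in real training code
            tracker.flush()

log = FlushLog()
train(10, 3, log)   # checkpoints at steps 3, 6, 9
print(log.flushes)  # 3
```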

@benoit-cty (Contributor) commented Aug 21, 2021

So @stas00, you want a method that writes the CSV without stopping CodeCarbon, and when stop is called the CSV entry is updated so you have only one line. Is that what you need? You don't need to pause the process, right?

Do you have internet access in your environment? The new API could keep track of all your trainings, and you could aggregate data per training or project as needed. We hope to release the new package in September, but we would be happy to discuss your use case with you before then.

@stas00 (Contributor) commented Aug 21, 2021

So @stas00, you want a method that writes the CSV without stopping CodeCarbon, and when stop is called the CSV entry is updated so you have only one line. Is that what you need? You don't need to pause the process, right?

What you said.

I don't think the number of lines matters, since the training process gets restarted many times; it may run for weeks or months before it completes. So if it is faster to append, appending is probably better.

Do you have internet access in your environment? The new API could keep track of all your trainings, and you could aggregate data per training or project as needed. We hope to release the new package in September, but we would be happy to discuss your use case with you before then.

The compute nodes have no internet. We sync the data to an outside server periodically.

@benoit-cty (Contributor)
OK, I've just pushed a new flush() method that computes emissions and adds them to the CSV, and/or calls the API if you use it.
I use it in a Keras callback, and it does the job of adding a line to the CSV after each epoch.
You can see it in mnist_callback.py
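The callback idea can be sketched like this. In real code the class would subclass tensorflow.keras.callbacks.Callback (as mnist_callback.py presumably does); here it is duck-typed so the snippet has no TensorFlow dependency, and the tracker is a counting stand-in:

```python
class EmissionsFlushCallback:
    """Keras-style callback: flush emissions after every epoch."""
    def __init__(self, tracker):
        self.tracker = tracker

    def on_epoch_end(self, epoch, logs=None):
        # Persist emissions measured so far without stopping the tracker.
        self.tracker.flush()

class CountingTracker:
    """Stand-in tracker that just counts flush calls."""
    def __init__(self):
        self.flushes = 0

    def flush(self):
        self.flushes += 1

tracker = CountingTracker()
cb = EmissionsFlushCallback(tracker)
for epoch in range(3):
    cb.on_epoch_end(epoch)   # Keras would call this for us each epoch
print(tracker.flushes)       # 3
```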

@stas00 (Contributor) commented Aug 21, 2021

This is perfect, @benoit-cty. Thank you!

If this larger-scope PR will take a while to complete, perhaps we could merge just the flush you added, along with the example? Then we can start using it right away. I have modified my side of things to use it and tested that it works.

@benoit-cty (Contributor) commented Aug 21, 2021

We could merge my part into master (see #236). Do you need a new version of the package on pip?

@stas00 (Contributor) commented Aug 22, 2021

That would be ideal on both accounts if it's not too much trouble.

Thank you, @benoit-cty!

@vict0rsch (Contributor, Author) commented Sep 9, 2021

tracker.flush() is now implemented in #236; do we really need restart?

@stas00 (Contributor) commented Sep 9, 2021

FWIW, the flush() implementation addressed our needs, so our group doesn't need restart anymore.

@vict0rsch (Contributor, Author)
Yeah I think it's not worth thinking too hard about this while there's no explicit need. Thanks!

@vict0rsch closed this Sep 9, 2021