Conversation

@vict0rsch (Contributor) commented Aug 18, 2021

Addressing #217

  • call the (start(), stop()) pair multiple times
  • use pandas to read/update the csv output

WIP -> this may delete or mess with existing CSV files

How it works: the gist of it is that tracker.stop() stores an _end_time, so calling tracker.start() after stop() uses it to detect that the tracker was re-started. Each persistence_obj should handle this appropriately. As of now, only the CSV output is implemented (see issues below).
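A toy sketch of that detection logic, assuming the attribute names from the description above (this is an illustration, not codecarbon's actual class):

```python
import time

class Tracker:
    """Stand-in for the restart detection described above: stop() records
    an end time, and a later start() uses it to see that the tracker was
    re-started (so persistence objects can update rather than append)."""

    def __init__(self):
        self._start_time = None
        self._end_time = None
        self.restarted = False

    def start(self):
        # If stop() ran before, this start() is a re-start.
        self.restarted = self._end_time is not None
        self._start_time = time.time()

    def stop(self):
        self._end_time = time.time()

tracker = Tracker()
tracker.start()
tracker.stop()
tracker.start()           # second start(): detected as a re-start
print(tracker.restarted)  # True
```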

Issue: I could not get the scheduler to simply start() again, so I create a new one if the tracker is re-started.

Issue: there is currently no ID in the CSV file (there used to be one, but it is now gone), so the current PR expects (timestamp, project_name) to be a unique pair.
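A minimal pandas sketch of what treating (timestamp, project_name) as a unique key implies for the CSV update; column names are illustrative, not necessarily codecarbon's exact schema:

```python
import io
import pandas as pd

# A tiny in-memory stand-in for the emissions CSV.
csv_data = io.StringIO(
    "timestamp,project_name,emissions\n"
    "2021-08-18T10:00:00,my-project,0.5\n"
)
df = pd.read_csv(csv_data)

# Locate the row for this run by the (timestamp, project_name) pair;
# update it in place if found, append a new row otherwise.
mask = (df["timestamp"] == "2021-08-18T10:00:00") & (
    df["project_name"] == "my-project"
)
if mask.any():
    df.loc[mask, "emissions"] = 0.9  # overwrite with the latest emissions_data
else:
    df.loc[len(df)] = ["2021-08-18T10:00:00", "my-project", 0.9]

# df.to_csv("emissions.csv", index=False) would then persist the result.
print(len(df))  # 1
```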

Warning: only the latest emissions_data is written, which means the timestamp is that of the latest write, not the first one. Is this an issue?

To do:

  • check duration implementation
  • Doc
  • re-start tests (csv)
  • Handle CodeCarbon API
  • re-start tests (API)

* use timestamp and project_name to update row with pd
* update sentinel to fixed uuid (to prevent autoreload from messing with object())
* some logging improvements
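The sentinel change in the commit notes refers to a known pitfall: a module-level `object()` sentinel gets a new identity each time the module is (auto)reloaded, so identity checks against stale references break; a fixed value compares equal across reloads. A minimal sketch (names and the sentinel value are illustrative, not codecarbon's):

```python
# A fixed, improbable string is stable across module (auto)reloads,
# unlike object(), which is recreated with a new identity on reload.
_SENTINEL = "sentinel-3f9a2b1c"  # illustrative fixed value

def duration(value=_SENTINEL):
    # Distinguish "no argument passed" from an explicit None.
    if value == _SENTINEL:
        return "not provided"
    return value

print(duration())      # 'not provided'
print(duration(None))  # None
```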
@stas00 (Contributor) commented Aug 20, 2021

May I suggest adding an actual restart method, which saves the user the trouble of implementing it? I also think it will help in the future, when a specific use case is common enough to warrant a method of its own.

To remind: our use case is not to stop / do something / start, but to do intermediary logging to avoid losing data should the program be killed by SLURM or crash because of GPU corruption, etc. Because the program gets restarted from the same checkpoint (this is ML training), we thought that saving the data at the same time the checkpoint is saved would provide the best synchronization. Note that the checkpoint is also saved when the program exits normally.

After writing the above: it is not a restart that we actually want, but cc.flush. Perhaps that is a much better logical name for the functionality we are asking for? restart was just the only thing I could think of to trigger intermediary data saving.
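The workflow being described can be sketched with stand-in names (FlushLog and train are hypothetical; the point is only where flush() is called relative to checkpointing):

```python
class FlushLog:
    """Stand-in tracker that just records flush calls."""
    def __init__(self):
        self.flushes = 0

    def flush(self):
        self.flushes += 1

def train(steps, checkpoint_every, tracker):
    # Flush emissions alongside each checkpoint, so a SLURM kill or
    # GPU crash loses at most one interval of measurements.
    for step in range(1, steps + 1):
        if step % checkpoint_every == 0:
            # save_checkpoint(...) would go here in real training code
            tracker.flush()

log = FlushLog()
train(10, 3, log)   # checkpoints at steps 3, 6, 9
print(log.flushes)  # 3
```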

@benoit-cty (Contributor) commented Aug 21, 2021

So @stas00, you want a method that writes the CSV without stopping CodeCarbon, and when stop is called the CSV entry is updated so you have only one line. Is that what you need? You don't need to pause the process, right?

Do you have internet access in your environment? The new API could keep track of all your trainings, and you could aggregate data per training or project as needed. We hope to release the new package in September, but we would be happy to discuss your use case with you before then.

@stas00 (Contributor) commented Aug 21, 2021

So @stas00, you want a method that writes the CSV without stopping CodeCarbon, and when stop is called the CSV entry is updated so you have only one line. Is that what you need? You don't need to pause the process, right?

What you said.

I don't think the number of lines matters, since the training process gets restarted many times; it may run for weeks or months before it completes. So if it is faster to append, appending is probably better.

Do you have internet access in your environment? The new API could keep track of all your trainings, and you could aggregate data per training or project as needed. We hope to release the new package in September, but we would be happy to discuss your use case with you before then.

The compute nodes have no internet. We sync the data to an outside server periodically.

@benoit-cty (Contributor)
OK, I've just pushed a new flush() method that computes emissions and adds them to the CSV, and/or calls the API if you use it.
I use it in a Keras callback, and it does the job of adding a line to the CSV after each epoch.
You can see it in mnist_callback.py
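The callback idea can be sketched like this. In real code the class would subclass tensorflow.keras.callbacks.Callback (as mnist_callback.py presumably does); here it is duck-typed so the snippet has no TensorFlow dependency, and the tracker is a counting stand-in:

```python
class EmissionsFlushCallback:
    """Keras-style callback: flush emissions after every epoch."""
    def __init__(self, tracker):
        self.tracker = tracker

    def on_epoch_end(self, epoch, logs=None):
        # Persist emissions measured so far without stopping the tracker.
        self.tracker.flush()

class CountingTracker:
    """Stand-in tracker that just counts flush calls."""
    def __init__(self):
        self.flushes = 0

    def flush(self):
        self.flushes += 1

tracker = CountingTracker()
cb = EmissionsFlushCallback(tracker)
for epoch in range(3):
    cb.on_epoch_end(epoch)   # Keras would call this for us each epoch
print(tracker.flushes)       # 3
```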

@stas00 (Contributor) commented Aug 21, 2021

This is perfect, @benoit-cty. Thank you!

If this larger-scope PR will take a while to complete, perhaps we could merge just the flush you added, along with the example? Then we can start using it right away. I have modified my side of things to use it and tested that it works.

@benoit-cty (Contributor) commented Aug 21, 2021

We could merge my part into master (see #236). Do you need a new version of the package on pip?

@stas00 (Contributor) commented Aug 22, 2021

That would be ideal on both accounts if it's not too much trouble.

Thank you, @benoit-cty!

@vict0rsch (Contributor, Author) commented Sep 9, 2021

tracker.flush() is now implemented in #236; do we really need restart?

@stas00 (Contributor) commented Sep 9, 2021

FWIW, the flush() implementation addressed our needs, so our group doesn't need restart anymore.

@vict0rsch (Contributor, Author)
Yeah I think it's not worth thinking too hard about this while there's no explicit need. Thanks!

@vict0rsch closed this Sep 9, 2021