Allow tracker pause and resume #1
Conversation
Your PR resolves the need to be able to make intermediate checkpoints, @jaketae - thank you!
I tested that it works correctly (thank you for the proposed Megatron code - you made it super easy for me).
So if there are 4 checkpoint-saving events during a training run, this approach leads to 4 csv files per gpu, which get appended to.
I think for the consumer of these checkpoints it might be easier if it were one file per gpu, but your solution definitely works for me. I will let others comment on how they want the results to be.
Do you think we should just use your fork for now after you merge this, while you propose the change to the upstream?
When it's ready to be used please let me know and I will put it through tests on multi-node and then in production.
I updated bigscience-workshop/Megatron-DeepSpeed#15 and will await your feedback to deploy.
As an example, here is the output on 1 node / 2 gpus, with 3 scheduled checkpoint saves and a 4th on exit:

So the only remaining part is to remove the CC info logs, since each debug print will get multiplied 256 times (or more).

OK, fixed in the meg-ds PR.
@stas00 Thanks for the review and feedback!
```python
def out(self, data: EmissionsData):
    file_exists: bool = os.path.isfile(self.save_file_path)
    if file_exists and not self.has_valid_headers(data):
        logger.info("Backing up old emission file")
        os.rename(self.save_file_path, self.save_file_path + ".bak")
        file_exists = False
    with open(self.save_file_path, "a+") as f:
        writer = csv.DictWriter(f, fieldnames=data.values.keys())
        if not file_exists:
            writer.writeheader()
        writer.writerow(data.values)
```

So long story short, there was no need to create a CSV file for each pause for fear of overwriting. Codecarbon appends new logs as the final row of the existing CSV and updates the file in place. The implication is that we won't end up with a ton of CSV files in a multi-node, multi-GPU setup. I pushed a commit that removes the code that used to dump a file on each pause.
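To make the append behaviour concrete, here is a small standalone sketch (a simplified, hypothetical stand-in for codecarbon's `out` method, using only the stdlib): opening the file in `"a+"` mode adds one row per call and writes the header only once, so repeated pauses grow a single file instead of creating new ones.

```python
import csv
import os
import tempfile

# Hypothetical demo path; codecarbon would use its configured save_file_path.
path = os.path.join(tempfile.mkdtemp(), "emissions.csv")

def out(values: dict) -> None:
    # Write the header only when the file does not exist yet,
    # then append one data row per call.
    file_exists = os.path.isfile(path)
    with open(path, "a+", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=values.keys())
        if not file_exists:
            writer.writeheader()
        writer.writerow(values)

# Two "pause" events append to the same file:
out({"duration": 1.0, "emissions": 0.01})
out({"duration": 2.0, "emissions": 0.02})

with open(path) as f:
    rows = f.read().splitlines()
# One header plus one row per pause: a single file with 3 lines.
```

The key point for the multi-GPU setup is that the second call hits the `file_exists` branch, so no backup or extra file is created.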
You can use `tracker = EmissionsTracker(..., log_level="error")` to suppress the info logs. On my local machine, this seems to take care of them.
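As an alternative to the constructor argument, the level can also be raised through the standard `logging` module. This is a sketch that assumes codecarbon emits its messages through a logger named `"codecarbon"`; if the logger name differs, substitute accordingly.

```python
import logging

# Assumption: codecarbon logs through a logger named "codecarbon".
# Raising its level to ERROR hides INFO/DEBUG records without
# touching the rest of the application's logging configuration.
logging.getLogger("codecarbon").setLevel(logging.ERROR)

# INFO records are now filtered out for this logger:
info_enabled = logging.getLogger("codecarbon").isEnabledFor(logging.INFO)
```

This variant is useful when the tracker is constructed in third-party code and you cannot pass `log_level` yourself.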
Yes, indeed - I forgot I already had the logging level figured out in the first place, so we are good there. Thank you for improving your solution, @jaketae - that's even better. Meanwhile, we did get a response from the maintainers of cc, and a different proposal was made in mlco2#217 (comment), to which I made some suggestions. If you get a chance, please have a look and recommend which of the approaches seems more efficient for our needs. It was also mentioned that distributed environments might need some special treatment; I asked in the issue for details.
Frankly speaking, the proposed solution in the upstream seems more elegant. As much as I'd want my solution to be chosen, I would vote for mlco2#217 (comment) for its simplicity, as long as it works as expected in our distributed setup. If you confirm that it works, I can open another PR that implements what you suggest, i.e. keeping the
Sounds like a plan to me, @jaketae - thank you for taking care of this!
Hi everyone, I'm one of the maintainers of CC, coming back from vacation :) Regarding logging, I've spent some time working on this issue here: mlco2#235. It's not finished, but you're welcome to use it / comment / suggest / contribute there.
Hey @vict0rsch, thanks for addressing the issue. @stas00 is currently on vacation, but he should be able to get back to you later this week. In the meantime, I have a few questions:
Thanks again for your work and clarification!
Hey @jaketae, the issue with the … For instance, in a … On the other hand, the … But in spirit my implementation is very similar, and you just have to:

```python
from codecarbon import EmissionsTracker

cc = EmissionsTracker(**kwargs)
cc.start()
do_something()
cc.stop()
do_something_else()
cc.start()
one_more_thing()
cc.stop()
...
```

This process will update one single line with incremental emissions and energy outputs, since the object was not destroyed and the routines just do ~ If you instantiate a new …
That being said, you're the first ones to ask for this behaviour, so you get to choose: does updating rows (as I've done) make sense, or would you expect brand-new row(s) for the resumed part(s)?
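To make the two options concrete, here is a small stdlib-only sketch of what each behaviour does to the output file (the `run_id`/`emissions` columns are hypothetical, not codecarbon's actual schema):

```python
import csv
import io

def render(rows):
    """Render a list of row dicts as CSV lines, for comparison."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["run_id", "emissions"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue().splitlines()

# Option A (updating rows): resuming accumulates emissions into the same
# row, so the file always holds one data row per tracker instance.
updated = render([{"run_id": "abc", "emissions": 0.03}])

# Option B (brand-new rows): each resumed segment appends its own row,
# so consumers must sum rows sharing a run_id to get the total.
appended = render([
    {"run_id": "abc", "emissions": 0.01},
    {"run_id": "abc", "emissions": 0.02},
])
```

Option A keeps the file small and totals trivial to read; Option B preserves per-segment detail at the cost of a post-processing sum.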
Thank you for working on mlco2#235, @vict0rsch!
Won't it be faster to append rather than rewrite? I will defer to @jaketae's choice, as he is the actual user of these logs. I'm the integrator, so my main concerns are speed and no clutter on the filesystem.
It would be slightly faster, indeed. It depends on your bottlenecks. Given that this event happens only once, at the `stop()` call, and that it's a quick pandas lookup in a rather small table, I wouldn't consider it an expensive difference.
I also followed up in the PR, and I think this is actually
* add codecarbon
* switch to offline
* rework to also restart the tracker at each checkpoint save to ensure as little data as possible is lost
* adjust API to match bigscience-workshop/codecarbon#1
* fix logging
* new implementation based on mlco2/codecarbon#236
* add test
* update requirements
Superseded by mlco2#236.
This PR allows pausing and restarting of the carbon emissions tracker.

Each `.pause()` call will dump tracked statistics to a CSV file. If `emissions.csv` already exists due to a previous pause, the next CSV will be titled `emissions_1.csv`, the one after that `emissions_2.csv`, etc.

In the Megatron-DeepSpeed codebase, we could do