
@jaketae
Member

@jaketae jaketae commented Aug 13, 2021

This PR allows pausing and restarting of the carbon emissions tracker.

from codecarbon import EmissionsTracker

tracker = EmissionsTracker()
# start
tracker.start()
# pause
tracker.pause()
# resume
tracker.resume()
# pause again
tracker.pause()
# resume again
tracker.resume()
# stop
tracker.stop()

Each .pause() call will dump the tracked statistics to a CSV file. If emissions.csv already exists due to a previous pause, the next CSV will be titled emissions_1.csv, the one after that emissions_2.csv, and so on.
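
For reference, a minimal sketch of such a non-clobbering naming scheme (this is only an illustration of the behaviour described above, not the actual code in this PR):

import os

def next_emissions_path(base="emissions.csv"):
    # Return the first of emissions.csv, emissions_1.csv, emissions_2.csv, ...
    # that does not exist yet, so earlier dumps are never overwritten.
    if not os.path.isfile(base):
        return base
    root, ext = os.path.splitext(base)
    i = 1
    while os.path.isfile(f"{root}_{i}{ext}"):
        i += 1
    return f"{root}_{i}{ext}"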

In the Megatron-DeepSpeed codebase, we could do

def codecarbon_tracker_pause():
    global _GLOBAL_CODECARBON_TRACKER
    if _GLOBAL_CODECARBON_TRACKER is None:
        return

    print('codecarbon PAUSE')
    _GLOBAL_CODECARBON_TRACKER.pause()


def codecarbon_tracker_resume():
    global _GLOBAL_CODECARBON_TRACKER
    if _GLOBAL_CODECARBON_TRACKER is None:
        return

    print('codecarbon RESUME')
    _GLOBAL_CODECARBON_TRACKER.resume()

def codecarbon_tracker_restart():
    global _GLOBAL_CODECARBON_TRACKER
    if _GLOBAL_CODECARBON_TRACKER is None:
        return

    codecarbon_tracker_pause()
    codecarbon_tracker_resume()
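
To make the intended call site concrete, here is a hypothetical sketch of how these helpers might be wired into a checkpoint save (save_checkpoint is a placeholder, not the real Megatron-DeepSpeed signature):

def save_checkpoint_and_flush_emissions(iteration, model, optimizer, lr_scheduler):
    # Placeholder wrapper: save the checkpoint as usual, then pause/resume the
    # tracker so the emissions gathered so far are dumped and tracking continues.
    save_checkpoint(iteration, model, optimizer, lr_scheduler)
    codecarbon_tracker_restart()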

@jaketae jaketae changed the title feat: allow tracker pause and restart Allow tracker pause and resume Aug 13, 2021
@jaketae jaketae self-assigned this Aug 13, 2021
@jaketae jaketae requested a review from stas00 August 13, 2021 19:12
@jaketae jaketae requested a review from JetRunner August 13, 2021 19:25
stas00 added a commit to stas00/Megatron-DeepSpeed that referenced this pull request Aug 13, 2021

@stas00 stas00 left a comment

Your PR resolves the need to be able to make intermediate checkpoints, @jaketae - thank you!

I tested that it works correctly (thank you for the proposed Megatron code - you made it super easy for me).

So if there are 4 checkpoint-saving events during a training run, this approach leads to 4 CSV files per GPU, which get appended to.

I think for the consumer of these CSV files it might be easier if it were one file per GPU, but your solution definitely works for me. I will let others comment on how they want the results to be.

Do you think we should just use your fork for now after you merge this, while you propose this to the upstream?

When it's ready to be used, please let me know and I will put it through tests on multi-node and then in production.

I updated bigscience-workshop/Megatron-DeepSpeed#15 and will await your feedback to deploy.

@stas00

stas00 commented Aug 13, 2021

As an example, here is the output on 1 node with 2 GPUs, with 3 scheduled checkpoint saves and a 4th on exit:

$ ls -1sh
total 40K
4.0K emissions-000_1.csv
4.0K emissions-000_2.csv
4.0K emissions-000_3.csv
4.0K emissions-000_4.csv
4.0K emissions-000.csv
4.0K emissions-001_1.csv
4.0K emissions-001_2.csv
4.0K emissions-001_3.csv
4.0K emissions-001_4.csv
4.0K emissions-001.csv

@stas00

stas00 commented Aug 14, 2021

So the only remaining part is to remove the CC info logs, since each debug print will get multiplied 256 times (or more).

OK, fixed in the meg-ds PR

@jaketae
Member Author

jaketae commented Aug 14, 2021

@stas00 Thanks for the review and feedback!

  1. Something I didn't realize last night is that codecarbon is actually smart about file management and does not overwrite existing files. I found this snippet of code in the FileOutput class:
    def out(self, data: EmissionsData):
        file_exists: bool = os.path.isfile(self.save_file_path)
        if file_exists and not self.has_valid_headers(data):
            logger.info("Backing up old emission file")
            os.rename(self.save_file_path, self.save_file_path + ".bak")
            file_exists = False

        with open(self.save_file_path, "a+") as f:
            writer = csv.DictWriter(f, fieldnames=data.values.keys())
            if not file_exists:
                writer.writeheader()
            writer.writerow(data.values)

So, long story short, there was no need to create a new CSV file on each pause out of fear of overwriting. Codecarbon will append new logs as the final row of the existing CSV and update the file in place. The implication is that we won't end up with a ton of CSV files in the multi-node, multi-GPU setup. I pushed a commit that removes the code that used to dump a new file on each pause.

  2. As for logging, EmissionsTracker seems to take a log_level argument at initialization, so I wonder if we can just do
tracker = EmissionsTracker(..., log_level="error")

to suppress info logs. Locally, this seems to take care of them.

  3. What I got from the carbon footprint WG meeting is that the majority of the codecarbon maintainers are on vacation. Given this, I wouldn't expect an upstream PR to be merged anytime soon, so I second your suggestion of using this fork for now while I propose this solution to the original repo. When that gets merged, we can switch back to the upstream codecarbon.

@stas00

stas00 commented Aug 15, 2021

Yes, indeed, I forgot that I had already figured out the logging level in the first place, so we are good there.

Thank you for improving your solution, @jaketae - that's even better.

Meanwhile we did get a response from the maintainers of CC, and a different proposal was made in mlco2#217 (comment), to which I made some suggestions.

If you get a chance, please have a look and recommend which of the approaches seems more efficient for our needs.

It was also mentioned that a distributed env might need some special treatment; I asked in the issue for details.

@jaketae
Member Author

jaketae commented Aug 15, 2021

Frankly speaking, the proposed solution in the upstream seems more elegant. As much as I'd want my solution to be chosen, I would vote for mlco2#217 (comment) for its simplicity, as long as it works as expected in our distributed setup. If you confirm that it works, I can open another PR that implements what you suggested, i.e. keeping the run_id internally somewhere, though this might now be of lower priority. If for some reason it doesn't work, we can return to this solution and see if it satisfies our needs. What do you think, @stas00?

@stas00

stas00 commented Aug 15, 2021

Sounds like a plan to me, @jaketae - thank you for taking care of this!

@vict0rsch

Hi everyone, I'm one of the maintainers of CC, just coming back from vacation :)

Regarding logging, log_level="error" would indeed work. You can also use configuration files and/or env variables, as described here: https://github.com/mlco2/codecarbon#configuration
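
For illustration, the config-file route might look like this (a sketch based on the linked configuration docs; the exact file name and key should be double-checked there):

# .codecarbon.config in the working directory (or ~/.codecarbon.config)
[codecarbon]
log_level = error

The docs also describe environment variables along the lines of CODECARBON_LOG_LEVEL=error (again, assumed here rather than verified).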

I've spent some time working on this issue here: mlco2#235. It's not finished, but you're welcome to use it / comment / suggest / contribute there.

@jaketae
Member Author

jaketae commented Aug 18, 2021

Hey @vict0rsch, thanks for addressing the issue. @stas00 is currently on vacation, but he should be able to get back to you later this week.

In the meantime, I have a few questions:

  1. What is the difference between your proposed solution and "need to be able to restart the tracker" mlco2/codecarbon#217 (comment)? From the comment, it seemed like restarting would be possible very simply by keeping a run_id variable returned by the initial tracker.start() call. Is there something additional in your implementation that is not taken care of by that solution?
  2. Am I correct in saying that each "pause" of the tracker would create a new CSV file in your PR? In other words, if I start, stop, restart, and finally stop a tracker, would it result in two logs?

Thanks again for your work and clarification!

@vict0rsch

vict0rsch commented Aug 19, 2021

Hey @jaketae, the issue with the run_id is that it simply does not exist by itself. It actually depends on the persistence_objects, which are basically output classes, and those handle the concept of an id as they like.

For instance, in a CSV file there currently is no id column (there used to be a uuid, which we might want to bring back for this very purpose), and in this PR I considered (timestamp, project_name) to be a unique id.

On the other hand, the CodeCarbon API we're working on handles ids very differently: they are created on the backend and sent to the codecarbon client, as they are attached to a user, an organization, a project, an experiment.

But in spirit my implementation is very similar and you just have to

from codecarbon import EmissionsTracker

cc = EmissionsTracker(**kwargs)

cc.start()
do_something()
cc.stop()

do_something_else()

cc.start()
one_more_thing()
cc.stop()

...

This process will update a single line with incremental emissions and energy outputs, since the object was not destroyed and the routines just do ~ total_energy += measurement.

If you instantiate a new cc object, then you get a new line in the CSV file (assuming it has not moved).
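
Putting the two cases side by side (a hypothetical contrast; do_something and do_more are placeholders):

from codecarbon import EmissionsTracker

cc = EmissionsTracker(project_name="demo")
cc.start(); do_something(); cc.stop()    # writes one row
cc.start(); do_more(); cc.stop()         # updates that same row in place

cc2 = EmissionsTracker(project_name="demo")
cc2.start(); do_more(); cc2.stop()       # fresh object: a new row is appended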

@vict0rsch

That being said, you're the first ones to ask for this behaviour, so you get to choose: does updating rows (as I've done) make sense, or would you expect brand-new row(s) for the resumed part(s)?

@stas00

stas00 commented Aug 20, 2021

Thank you for working on mlco2#235, @vict0rsch!

That being said, you're the first ones to ask for this behaviour, so you get to choose: does updating rows (as I've done) make sense, or would you expect brand-new row(s) for the resumed part(s)?

Won't it be faster to append, rather than rewrite?

I will defer to @jaketae's choice who is the actual user of these logs.

I'm the integrator, so my main concerns are speed and no clutter on the filesystem.

@vict0rsch

It would be slightly faster indeed. It depends on your bottlenecks. Given that this event happens only once, at the stop() call, and that it's a quick pandas lookup in a rather small table, I wouldn't consider it an expensive difference.

@stas00

stas00 commented Aug 20, 2021

I also followed up in the PR, and I think it is actually cc.flush that we are after (i.e. not needing an actual restart), so appending is probably the most straightforward solution, since this sort of emulates flush() on a file handle.

mlco2#235 (comment)
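
For context, the flush-style usage discussed there would look roughly like this (a sketch; flush() is the API proposed in that PR, not something the released codecarbon supported at the time of this thread, and the train_* calls are placeholders):

from codecarbon import EmissionsTracker

tracker = EmissionsTracker()
tracker.start()

train_some_steps()
tracker.flush()   # write the emissions measured so far, keep tracking

train_some_more_steps()
tracker.flush()   # write again, e.g. at the next checkpoint save

tracker.stop()    # final measurement and write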

stas00 added a commit to bigscience-workshop/Megatron-DeepSpeed that referenced this pull request Aug 25, 2021
* add codecarbon

* switch to offline

* rework to also restart the tracker at each checkpoint save to ensure as little as possible data is lost

* adjust API to match bigscience-workshop/codecarbon#1

* fix logging

* new implementation based on mlco2/codecarbon#236

* add test

* update requirements

@stas00

stas00 commented Aug 25, 2021

superseded by mlco2#236

@stas00 stas00 closed this Aug 25, 2021