Allow tracker pause and resume #1
Conversation
Your PR resolves the need to be able to make intermediate checkpoints, @jaketae - thank you!
I tested that it works correctly (thank you for the proposed Megatron code - you made it super easy for me).
So if there are 4 checkpoint-saving events during a training run, this approach leads to 4 csv files per gpu, which get appended to.
I think for the consumer of these checkpoints it might be easier if it were one file per gpu, but your solution definitely works for me. I will let others comment on how they want the results to be.
Do you think we should just use your fork for now after you merge this, while you propose the change to the upstream?
When it's ready to be used please let me know and I will put it through tests on multi-node and then in production.
I updated bigscience-workshop/Megatron-DeepSpeed#15 and will await your feedback to deploy.
As an example, here is the output on 1 node / 2 gpus, with 3 scheduled checkpoint saves and a 4th on exit:

So the only remaining part is to remove the CC info logs, since each debug print will get multiplied 256 times (or more).

OK, fixed in the meg-ds PR.
@stas00 Thanks for the review and feedback!
```python
def out(self, data: EmissionsData):
    file_exists: bool = os.path.isfile(self.save_file_path)
    if file_exists and not self.has_valid_headers(data):
        logger.info("Backing up old emission file")
        os.rename(self.save_file_path, self.save_file_path + ".bak")
        file_exists = False
    with open(self.save_file_path, "a+") as f:
        writer = csv.DictWriter(f, fieldnames=data.values.keys())
        if not file_exists:
            writer.writeheader()
        writer.writerow(data.values)
```

So long story short, there was no need to create a CSV file for each pause for fear of overwriting. Codecarbon appends new logs as the final row of the existing CSV and updates the file in place. The implication is that we won't end up with a ton of CSV files in a multi-node, multi-GPU setup. I pushed a commit that removes the code that used to dump a file on each pause.
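To make the append behaviour concrete, here is a small standalone sketch (a simplified, hypothetical stand-in for codecarbon's `out` method, using only the stdlib): opening the file in `"a+"` mode adds one row per call and writes the header only once, so repeated pauses grow a single file instead of creating new ones.

```python
import csv
import os
import tempfile

# Hypothetical demo path; codecarbon would use its configured save_file_path.
path = os.path.join(tempfile.mkdtemp(), "emissions.csv")

def out(values: dict) -> None:
    # Write the header only when the file does not exist yet,
    # then append one data row per call.
    file_exists = os.path.isfile(path)
    with open(path, "a+", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=values.keys())
        if not file_exists:
            writer.writeheader()
        writer.writerow(values)

# Two "pause" events append to the same file:
out({"duration": 1.0, "emissions": 0.01})
out({"duration": 2.0, "emissions": 0.02})

with open(path) as f:
    rows = f.read().splitlines()
# One header plus one row per pause: a single file with 3 lines.
```

The key point for the multi-GPU setup is that the second call hits the `file_exists` branch, so no backup or extra file is created.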
You can use `tracker = EmissionsTracker(..., log_level="error")` to suppress the info logs. On my local machine, this seems to take care of them.
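As an alternative to the constructor argument, the level can also be raised through the standard `logging` module. This is a sketch that assumes codecarbon emits its messages through a logger named `"codecarbon"`; if the logger name differs, substitute accordingly.

```python
import logging

# Assumption: codecarbon logs through a logger named "codecarbon".
# Raising its level to ERROR hides INFO/DEBUG records without
# touching the rest of the application's logging configuration.
logging.getLogger("codecarbon").setLevel(logging.ERROR)

# INFO records are now filtered out for this logger:
info_enabled = logging.getLogger("codecarbon").isEnabledFor(logging.INFO)
```

This variant is useful when the tracker is constructed in third-party code and you cannot pass `log_level` yourself.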
Yes, indeed - I forgot I already had the logging level figured out in the first place, so we are good there. Thank you for improving your solution, @jaketae - that's even better. Meanwhile, we did get a response from the maintainers of cc, and a different proposal was made in mlco2#217 (comment), to which I made some suggestions. If you get a chance, please have a look and recommend which of the approaches seems more efficient for our needs. It was also mentioned that distributed environments might need some special treatment; I asked in the issue for details.
Frankly speaking, the proposed solution in the upstream seems more elegant. As much as I'd want my solution to be chosen, I would vote for mlco2#217 (comment) for its simplicity, as long as it works as expected in our distributed setup. If you confirm that it works, I can open another PR that implements what you suggest, i.e. keeping the
Sounds like a plan to me, @jaketae - thank you for taking care of this!
Hi everyone, I'm one of the maintainers of CC, coming back from vacation :) Regarding logging, I've spent some time working on this issue here: mlco2#235. It's not finished, but you're welcome to use it / comment / suggest / contribute there.
Hey @vict0rsch, thanks for addressing the issue. @stas00 is currently on vacation, but he should be able to get back to you later this week. In the meantime, I have a few questions:
Thanks again for your work and clarification!
Hey @jaketae, the issue with the … For instance, in a … On the other hand, the … But in spirit my implementation is very similar, and you just have to:

```python
from codecarbon import EmissionsTracker

cc = EmissionsTracker(**kwargs)
cc.start()
do_something()
cc.stop()
do_something_else()
cc.start()
one_more_thing()
cc.stop()
...
```

This process will update one single line with incremental emissions and energy outputs, since the object was not destroyed and the routines just do ~ If you instantiate a new …
That being said, you're the first ones to ask for this behaviour, so you get to choose: does updating rows (as I've done) make sense, or would you expect brand-new row(s) for the resumed part(s)?
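To make the two options concrete, here is a small stdlib-only sketch of what each behaviour does to the output file (the `run_id`/`emissions` columns are hypothetical, not codecarbon's actual schema):

```python
import csv
import io

def render(rows):
    """Render a list of row dicts as CSV lines, for comparison."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["run_id", "emissions"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue().splitlines()

# Option A (updating rows): resuming accumulates emissions into the same
# row, so the file always holds one data row per tracker instance.
updated = render([{"run_id": "abc", "emissions": 0.03}])

# Option B (brand-new rows): each resumed segment appends its own row,
# so consumers must sum rows sharing a run_id to get the total.
appended = render([
    {"run_id": "abc", "emissions": 0.01},
    {"run_id": "abc", "emissions": 0.02},
])
```

Option A keeps the file small and totals trivial to read; Option B preserves per-segment detail at the cost of a post-processing sum.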
Thank you for working on mlco2#235, @vict0rsch!
Won't it be faster to append rather than rewrite? I will defer to @jaketae's choice, as he is the actual user of these logs. I'm the integrator, so my main concerns are speed and no clutter on the filesystem.
It would be slightly faster, indeed. It depends on your bottlenecks. Given that this event happens only once, at the `stop()` call, and that it's a quick pandas lookup in a rather small table, I wouldn't consider it an expensive difference.
I also followed up in the PR, and I think this is actually
* add codecarbon
* switch to offline
* rework to also restart the tracker at each checkpoint save to ensure as little data as possible is lost
* adjust API to match bigscience-workshop/codecarbon#1
* fix logging
* new implementation based on mlco2/codecarbon#236
* add test
* update requirements
Superseded by mlco2#236.
This PR allows pausing and restarting of the carbon emissions tracker.

Each `.pause()` call will dump tracked statistics to a CSV file. If `emissions.csv` already exists due to a previous pause, the next CSV will be titled `emissions_1.csv`, the one after that `emissions_2.csv`, etc.

In the Megatron-DeepSpeed codebase, we could do