Add callback for profiling GPU memory usage #249
My editor is just cleaning these up on save and I think we do want it cleaned up, but trying to avoid clutter in PR review.
Opt-in via --record-memory-snapshot or `experiment_settings.record_memory_snapshot`. When enabled, logs peak GPU memory and dumps a torch.cuda.memory._record_memory_history snapshot for each predict batch to <output_dir>/<query_id>/seed_<n>/mem_snapshot.pkl. Registered after PredictTimer so the snapshot dump runs outside the timer's measurement window.
Basic test mirroring the existing test_use_msa_cli / test_use_templates_cli pattern.
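The ordering point in the commit description above (register the memory callback after PredictTimer so the dump falls outside the timed window) can be illustrated with a small stand-alone sketch. It assumes hooks are dispatched to callbacks in registration order. The classes and the `run_predict_batch` driver here are hypothetical stand-ins, not the code in this PR.

```python
# Hypothetical stand-ins: these classes only log events to show hook ordering.
class PredictTimer:
    def __init__(self, log):
        self.log = log

    def on_predict_batch_start(self):
        self.log.append("timer:start")

    def on_predict_batch_end(self):
        self.log.append("timer:stop")


class MemorySnapshot:
    def __init__(self, log):
        self.log = log

    def on_predict_batch_start(self):
        self.log.append("memory:record")

    def on_predict_batch_end(self):
        # The (potentially slow) snapshot dump would happen here.
        self.log.append("memory:dump")


def run_predict_batch(callbacks, log):
    """Assumed dispatch model: every hook visits callbacks in list order."""
    for cb in callbacks:
        cb.on_predict_batch_start()
    log.append("predict")
    for cb in callbacks:
        cb.on_predict_batch_end()


events = []
run_predict_batch([PredictTimer(events), MemorySnapshot(events)], events)
print(events)
```

With the timer registered first, its stop hook fires before the dump hook, so the time spent writing the snapshot is not counted in the measurement.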
@christinaflo Sorry, I think I misremembered and thought your comment in #227 said I should share my version, but you actually said you already had one 😬 If this isn't helpful or you like yours better, we can just close this 😄 or if you want to share the version you've got, I'm happy to see if there's any meaningful diff between them and clean up or test
Yeah, I just wanted to avoid conflicts when mine eventually gets merged in, but I can share mine and you can add some edits on top of it? Mine has some extra functionality I'd like to keep.
jnwei left a comment
This is a nice Callback to have, especially for profiling!
One nit: in general, we try to limit the settings configurable from the command line to those we expect users to need frequently (e.g. templates, colabfold settings).
Since most users will probably not use memory profiling in their inference runs, can we keep this setting configurable only through the experiment_settings header in the runner_yaml?
Very handy!
@GMNGeoffrey I'll tag you in the PR later today. Mine was mostly training-focused, so I think we can add my additions on top of the general snapshot that you have in this PR.
Summary
It's useful to be able to keep track of where GPU memory is being used in the model and where the peaks are, so we can try to fit larger inputs (or run on smaller GPUs).
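For readers unfamiliar with the underlying PyTorch utilities, here is a minimal sketch of recording an allocation history, logging the peak, and dumping a snapshot around one unit of work. It is not the callback from this PR: `snapshot_path`, `profile_batch`, and the `max_entries` value are hypothetical names and choices for illustration, and the underscore-prefixed `torch.cuda.memory` functions are private APIs that may change between PyTorch releases.

```python
from pathlib import Path


def snapshot_path(output_dir, query_id, seed):
    """Build <output_dir>/<query_id>/seed_<n>/mem_snapshot.pkl (hypothetical helper)."""
    return Path(output_dir) / str(query_id) / f"seed_{seed}" / "mem_snapshot.pkl"


def profile_batch(run_batch, path):
    """Run one batch while recording CUDA allocations (requires a CUDA device)."""
    import torch  # imported lazily so the path helper works without torch installed

    torch.cuda.reset_peak_memory_stats()
    # Start recording allocation events; max_entries bounds the history size.
    torch.cuda.memory._record_memory_history(max_entries=100_000)
    try:
        result = run_batch()
    finally:
        peak_gib = torch.cuda.max_memory_allocated() / 2**30
        print(f"peak GPU memory: {peak_gib:.2f} GiB")
        path.parent.mkdir(parents=True, exist_ok=True)
        torch.cuda.memory._dump_snapshot(str(path))
        # Passing enabled=None stops recording.
        torch.cuda.memory._record_memory_history(enabled=None)
    return result


print(snapshot_path("out", "query_a", 0))
```

On a machine with a GPU, a call might look like `profile_batch(lambda: model(batch), snapshot_path("out", "query_a", 0))`, where `model` and `batch` are placeholders. The resulting pickle can then be inspected with PyTorch's memory visualizer.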
Changes
torch.cuda.memory utilities. This outputs a pkl dump of all allocations over time as well as logging the peak memory usage.
Testing
Other Notes
My editor was stripping trailing whitespace from the files I touched and introducing distracting diffs. Rather than fight with it, I pre-factored that change, so you can just view the last two commits for the substantive change (or use the view that hides whitespace diffs). I can revert that if you don't like formatting changes of untouched lines mixed into PRs though.