Add callback for profiling GPU memory usage#249

Closed
GMNGeoffrey wants to merge 3 commits into aqlaboratory:main from GMNGeoffrey:memsnap-callback

@GMNGeoffrey

Contributor

Summary
It's useful to be able to track where GPU memory is being used in the model and where the peaks are, so we can try to fit larger inputs (or run on smaller GPUs).

Changes

  • Adds an optional callback that tracks GPU memory usage with torch.cuda.memory utilities. This outputs a pkl dump of all allocations over time as well as logging the peak memory usage.

Testing

  • Basic unit test matching the ones for other experiment runner config settings.

Other Notes
My editor was stripping trailing whitespace from the files I touched and introducing distracting diffs. Rather than fight with it, I pre-factored that change, so you can just view the last two commits for the substantive change (or use the view that hides whitespace diffs). I can revert that if you don't like formatting changes of untouched lines mixed into PRs though.

My editor is just cleaning these up on save and I think we do want it
cleaned up, but trying to avoid clutter in PR review.

Opt-in via --record-memory-snapshot or
`experiment_settings.record_memory_snapshot`. When enabled, logs peak
GPU memory and dumps a torch.cuda.memory._record_memory_history snapshot
for each predict batch to
<output_dir>/<query_id>/seed_<n>/mem_snapshot.pkl.
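
A minimal sketch of what such a callback could look like, assuming a recent PyTorch where the private `torch.cuda.memory._record_memory_history` and `torch.cuda.memory._dump_snapshot` helpers are available. The class name `MemorySnapshotSketch`, the helper `snapshot_path`, the hook signatures, and the `max_entries` value are hypothetical and are not taken from the PR's diff; only the output path layout follows the description above.

```python
from pathlib import Path


def snapshot_path(output_dir, query_id, seed):
    """Build <output_dir>/<query_id>/seed_<n>/mem_snapshot.pkl."""
    return Path(output_dir) / str(query_id) / f"seed_{seed}" / "mem_snapshot.pkl"


class MemorySnapshotSketch:
    """Hypothetical stand-in for the callback; hook names are illustrative."""

    def __init__(self, output_dir):
        self.output_dir = Path(output_dir)

    def on_predict_batch_start(self):
        import torch  # imported lazily so snapshot_path works without torch

        if not torch.cuda.is_available():
            return
        # Reset the peak counter so the reported peak is per batch.
        torch.cuda.reset_peak_memory_stats()
        # Private API: start recording allocation events (cap is arbitrary).
        torch.cuda.memory._record_memory_history(max_entries=100_000)

    def on_predict_batch_end(self, query_id, seed):
        import torch

        if not torch.cuda.is_available():
            return None
        peak_gib = torch.cuda.max_memory_allocated() / 2**30
        print(f"Peak GPU memory for {query_id} seed {seed}: {peak_gib:.2f} GiB")
        path = snapshot_path(self.output_dir, query_id, seed)
        path.parent.mkdir(parents=True, exist_ok=True)
        # Private API: write the recorded history as a pickle file.
        torch.cuda.memory._dump_snapshot(str(path))
        # Passing enabled=None stops recording until the next batch.
        torch.cuda.memory._record_memory_history(enabled=None)
        return path
```

The resulting `.pkl` file can be loaded into PyTorch's memory visualizer to inspect allocations over time.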

Registered after PredictTimer so the snapshot dump runs outside the
timer's measurement window.
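
The ordering point can be illustrated with a toy model. This is only a sketch: it assumes end-of-batch hooks fire in registration order, and `FakeClock`, `TimerCallback`, `SlowDumpCallback`, and `measured_seconds` are hypothetical names, not the repository's `PredictTimer` or runner code.

```python
class FakeClock:
    """Deterministic clock so the example does not depend on wall time."""

    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


class TimerCallback:
    """Measures the time between batch start and its own end hook."""

    def __init__(self, clock):
        self.clock = clock
        self.start = 0.0
        self.elapsed = 0.0

    def on_batch_start(self):
        self.start = self.clock.now

    def on_batch_end(self):
        self.elapsed = self.clock.now - self.start


class SlowDumpCallback:
    """Simulates a snapshot dump that takes dump_seconds at batch end."""

    def __init__(self, clock, dump_seconds):
        self.clock = clock
        self.dump_seconds = dump_seconds

    def on_batch_start(self):
        pass

    def on_batch_end(self):
        self.clock.advance(self.dump_seconds)


def measured_seconds(timer_first, work_seconds=2.0, dump_seconds=5.0):
    clock = FakeClock()
    timer = TimerCallback(clock)
    dump = SlowDumpCallback(clock, dump_seconds)
    callbacks = [timer, dump] if timer_first else [dump, timer]
    for cb in callbacks:  # assumption: hooks run in registration order
        cb.on_batch_start()
    clock.advance(work_seconds)  # the prediction itself
    for cb in callbacks:
        cb.on_batch_end()
    return timer.elapsed


print(measured_seconds(timer_first=True))   # 2.0: dump excluded from timing
print(measured_seconds(timer_first=False))  # 7.0: dump inflates the timing
```

With the timer registered first, its end hook runs before the dump, so the measured time covers only the prediction work.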

Basic test mirroring the existing test_use_msa_cli /
test_use_templates_cli pattern.
@GMNGeoffrey

Contributor Author

@christinaflo Sorry, I think I misremembered and thought your comment in #227 said I should share my version, but you actually said you already had one 😬 If this isn't helpful or you like yours better, we can just close this 😄 or if you want to share the version you've got, I'm happy to see if there's any meaningful diff between them and clean up or test

@christinaflo

Collaborator

Yeah, I just wanted to avoid the conflicts for when mine eventually gets merged in, but I can share mine and you can add some edits on top of it? Mine has some extra functionality I'd like to keep.

@jnwei jnwei left a comment

Contributor

This is a nice Callback to have, especially for profiling!

One nit: In general, we try to limit the configuration settings available from the command line to only those which we expect the user to need frequently (e.g. templates, colabfold settings).

Since most users will probably not use memory profiling with their inference run, can we keep the settings configurable only through the experiment_settings header in the runner_yaml?

@jandom

jandom commented Jun 9, 2026

Collaborator

Very handy!

@jandom jandom added the safe-to-test Internal only label used to indicate PRs that are ready for automated CI testing. label Jun 9, 2026
@christinaflo

Collaborator

@GMNGeoffrey I'll tag you in the PR later today. Mine was mostly training focused, so I think we can add in the additions on top of the general snapshot that you have in this PR.

4 participants