add model num params display, gpu memory metrics (#56) #541

H-Huang · 2024-08-20T19:21:47Z

Stack from ghstack (oldest at bottom):

(to be filled)

This PR is the start of adding perf-related metrics.
1 - Adds a function for logging the total number of unique model
params, with an option to count only trainable params (for
future PEFT/QLoRA-type work).
2 - Logs the count with comma formatting and the model name, a la:
[Screenshot 2024-02-12 at 4 12 22 PM]

This also helps demystify, for example, the size of our debug model:
[Screenshot 2024-02-12 at 4 10 17 PM]
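
As a rough illustration of the parameter count described above, here is a minimal sketch. The function name `get_num_params`, its `only_trainable` flag, the toy model, and the exact log wording are assumptions made for this example and are not taken from the PR's code. It relies on `nn.Module.parameters()`, which yields each shared (tied) parameter only once, so the sum counts unique params.

```python
import torch.nn as nn


def get_num_params(model: nn.Module, only_trainable: bool = False) -> int:
    """Count unique parameters; optionally only those with requires_grad=True.

    Hypothetical helper for illustration, not the PR's actual implementation.
    """
    # parameters() de-duplicates tied tensors, so each is counted once.
    params = model.parameters()
    if only_trainable:
        params = (p for p in params if p.requires_grad)
    return sum(p.numel() for p in params)


# Toy stand-in for a debug model (assumption, for illustration only).
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
model[0].weight.requires_grad_(False)  # freeze one tensor, as PEFT-style work might

total = get_num_params(model)
trainable = get_num_params(model, only_trainable=True)

# The ":," format spec inserts thousands separators (comma formatting).
message = f"Model debug_model size: {total:,} total parameters ({trainable:,} trainable)"
print(message)
```

Counting over `parameters()` rather than `state_dict()` keeps buffers out of the total, which is usually what a "model size" log line is meant to convey.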

Additional updates: added GPU memory tracking. We want to show the
user peak memory stats, and to monitor and alert on any CUDA caching
allocator retries, which are a perf hindrance.

Thus, added class GPUMemoryMonitor.
Usage:
1 - instantiate [Screenshot 2024-02-13 at 9 32 11 AM]
2 - start of training: call start_monitoring()
3 - end of training: call stop_monitoring()
4 - show results: call get_peak_stats_str() and rank0_log the result.
[Screenshot 2024-02-13 at 9 12 45 AM]
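
The four usage steps above could look roughly like the following sketch. Only the class name and the three method names come from the PR description. The constructor argument, the internals, and the output format are assumptions, built on `torch.cuda.reset_peak_memory_stats`, `torch.cuda.max_memory_allocated`, `torch.cuda.max_memory_reserved`, and the `num_alloc_retries` counter in `torch.cuda.memory_stats`. On a machine without CUDA the sketch simply reports zeros.

```python
import torch

_GIB = 1024 ** 3


class GPUMemoryMonitor:
    """Illustrative sketch only; the PR's real class may differ."""

    def __init__(self, device: str = "cuda:0") -> None:  # argument is an assumption
        self.enabled = torch.cuda.is_available()
        self.device = torch.device(device)
        self.peak_allocated = 0
        self.peak_reserved = 0
        self.num_retries = 0
        self._retries_at_start = 0

    def _alloc_retries(self) -> int:
        # memory_stats() may return an empty dict before CUDA is initialized.
        return torch.cuda.memory_stats(self.device).get("num_alloc_retries", 0)

    def start_monitoring(self) -> None:
        if self.enabled:
            torch.cuda.reset_peak_memory_stats(self.device)
            # The retry counter is cumulative, so remember its starting value.
            self._retries_at_start = self._alloc_retries()

    def stop_monitoring(self) -> None:
        if self.enabled:
            self.peak_allocated = torch.cuda.max_memory_allocated(self.device)
            self.peak_reserved = torch.cuda.max_memory_reserved(self.device)
            self.num_retries = self._alloc_retries() - self._retries_at_start

    def get_peak_stats_str(self) -> str:
        return (
            f"peak allocated: {self.peak_allocated / _GIB:.2f} GiB, "
            f"peak reserved: {self.peak_reserved / _GIB:.2f} GiB, "
            f"alloc retries: {self.num_retries}"
        )


monitor = GPUMemoryMonitor()          # 1 - instantiate
monitor.start_monitoring()            # 2 - start of training
# ... training loop would run here ...
monitor.stop_monitoring()             # 3 - end of training
stats_str = monitor.get_peak_stats_str()
print(stats_str)                      # 4 - the PR logs this via rank0_log instead
```

A nonzero retry count means the caching allocator had to free cached blocks and retry an allocation, which is the perf hindrance the description wants surfaced.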


facebook-github-bot added the CLA Signed label Aug 20, 2024

H-Huang closed this Aug 20, 2024
