add model num params display, gpu memory metrics #56
Conversation
Wow, nice work! As I'm preparing TensorBoard support (in PR #57), it would be great if we could just use this tool to plot memory-related metrics to TB. Do you think I can just call it? One related thing I'd like your advice on: when, or for which metrics, do we want to log/show rank-level metrics, and when global-level metrics?
Yes, agreed - I'll consolidate this into the same file (will modify it to use metrics.py), and I can expose a per-iteration API to supply the desired memory stats for TB logging.
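For reference, a hypothetical sketch of what that per-iteration hook could look like; the function name and metric tags below are assumptions, not the PR's actual API - only the `torch.cuda` and `SummaryWriter` calls are standard:

```python
import torch
from torch.utils.tensorboard import SummaryWriter


def log_memory_metrics(writer: SummaryWriter, step: int, device: int = 0) -> None:
    """Log rank-local GPU memory metrics for one training iteration (illustrative)."""
    stats = torch.cuda.memory_stats(device)
    writer.add_scalar("memory/allocated_gib", torch.cuda.memory_allocated(device) / 2**30, step)
    writer.add_scalar("memory/reserved_gib", torch.cuda.memory_reserved(device) / 2**30, step)
    # Caching-allocator retries are a strong signal of memory pressure / fragmentation.
    writer.add_scalar("memory/num_alloc_retries", stats.get("num_alloc_retries", 0), step)
```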
Looks pretty good! I have some comments.
This PR is the start of adding perf-related metrics.
1 - Adds a function for logging the total number of unique model parameters, with an option to count only trainable parameters (for future PEFT/QLoRA-type work); see the sketch below.
2 - Logs it with comma-formatted numbers and the model name, like so:
This also helps demystify, for example, the size of our debug model:
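A minimal sketch of what such a parameter-counting helper could look like; the name `get_num_params` and its signature are illustrative, not necessarily the PR's exact API:

```python
import torch.nn as nn


def get_num_params(model: nn.Module, only_trainable: bool = False) -> int:
    """Count unique model parameters, optionally restricted to trainable ones."""
    params = list(model.parameters())
    if only_trainable:
        params = [p for p in params if p.requires_grad]
    # Deduplicate by storage pointer so tied/shared weights are counted once.
    unique = {p.data_ptr(): p for p in params}.values()
    return sum(p.numel() for p in unique)


# Illustrative comma-formatted logging, roughly as described above:
# rank0_log(f"{model_name} model has {get_num_params(model):,} params")
```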
Additional updates - added GPU memory tracking. We want to show the user peak memory stats, as well as monitor and alert on any CUDA caching-allocator retries, which are a perf hindrance.
Thus, added class GPUMemoryMonitor:
Usage:
1 - instantiate
2 - start of training = start_monitoring()
3 - end of training = stop_monitoring()
4 - show results = get_peak_stats_str() and rank0_log it.
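A rough sketch of how this monitor could be backed by `torch.cuda` memory statistics; the method names mirror the PR description, but the internals here are assumptions for illustration:

```python
import torch


class GPUMemoryMonitor:
    def __init__(self, device: str = "cuda:0") -> None:
        self.device = device
        self.peak_stats: dict = {}

    def start_monitoring(self) -> None:
        # Reset peak counters so the stats cover only the monitored window.
        torch.cuda.reset_peak_memory_stats(self.device)

    def stop_monitoring(self) -> None:
        self.peak_stats = torch.cuda.memory_stats(self.device)

    def get_peak_stats_str(self) -> str:
        peak_gib = torch.cuda.max_memory_allocated(self.device) / 2**30
        retries = self.peak_stats.get("num_alloc_retries", 0)
        return f"peak allocated memory: {peak_gib:.2f} GiB | alloc retries: {retries}"


# Usage, mirroring steps 1-4 above:
# monitor = GPUMemoryMonitor()
# monitor.start_monitoring()               # start of training
# ...                                      # training loop
# monitor.stop_monitoring()                # end of training
# rank0_log(monitor.get_peak_stats_str())  # show results
```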