add model num params display, gpu memory metrics #56

Merged: 11 commits merged into pytorch:main from add_metrics on Feb 15, 2024

Conversation

@lessw2020 (Contributor) commented Feb 13, 2024

This PR is the start of adding perf-related metrics.
1 - Adds a function for logging the total number of unique model params, with an option to count only trainable params (for future PEFT/QLoRA-type work).
2 - Logs the count with comma formatting and the model name, à la:
[Screenshot 2024-02-12 at 4 12 22 PM]

This also helps demystify, for example, the size of our debug model:
[Screenshot 2024-02-12 at 4 10 17 PM]
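
The parameter counting described above could look roughly like the sketch below. It is not the code from this PR: the names `get_num_params` and `format_num_params` and the exact log wording are assumptions for illustration. The function relies only on the `parameters()`, `numel()`, and `requires_grad` members that a `torch.nn.Module` and its parameters expose, so it does not import torch itself.

```python
from typing import Any


def get_num_params(model: Any, only_trainable: bool = False) -> int:
    """Count unique parameters of a torch.nn.Module-like object.

    Hypothetical helper: `model` only needs a `parameters()` method whose
    items have `numel()` and `requires_grad`, as torch parameters do.
    """
    seen = set()  # ids of parameters already counted (guards shared weights)
    total = 0
    for p in model.parameters():
        if id(p) in seen:
            continue
        seen.add(id(p))
        if only_trainable and not p.requires_grad:
            continue
        total += p.numel()
    return total


def format_num_params(model_name: str, num_params: int) -> str:
    """Comma-formatted log line; the wording is an assumption."""
    return f"Model {model_name} size: {num_params:,} total parameters"
```

With a real module, `get_num_params(torch.nn.Linear(4, 2))` would count the 8 weights plus 2 biases. Passing `only_trainable=True` skips frozen parameters, which is the case the PEFT/QLoRA note has in mind.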

Additional updates: added GPU memory tracking. We want to show the user peak memory stats, as well as monitor and alert on any CUDA caching allocator retries, which hurt performance.

Thus, added class GPUMemoryMonitor.
Usage:
1 - instantiate
[Screenshot 2024-02-13 at 9 32 11 AM]

2 - start of training = start_monitoring()
3 - end of training = stop_monitoring()
4 - show results = get_peak_stats_str() and rank0_log it.
[Screenshot 2024-02-13 at 9 12 45 AM]
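
As a rough illustration of the four usage steps above, here is a minimal sketch of such a monitor. Only the class name and the three method names come from this PR description; the constructor argument, the helper `summarize_peak_stats`, and the output wording are assumptions. It reads the counters that `torch.cuda.memory_stats()` reports (`active_bytes.all.peak`, `reserved_bytes.all.peak`, `num_alloc_retries`, `num_ooms`), and imports torch lazily so the formatting helper can be exercised without a GPU.

```python
from typing import Mapping

_GIB = 1024 ** 3


def summarize_peak_stats(stats: Mapping[str, int], capacity_bytes: int) -> str:
    """Format a dict shaped like the output of torch.cuda.memory_stats()."""
    peak_active = stats.get("active_bytes.all.peak", 0)
    peak_reserved = stats.get("reserved_bytes.all.peak", 0)
    retries = stats.get("num_alloc_retries", 0)
    ooms = stats.get("num_ooms", 0)
    pct = 100.0 * peak_reserved / capacity_bytes if capacity_bytes else 0.0
    lines = [
        f"peak active: {peak_active / _GIB:.2f} GiB",
        f"peak reserved: {peak_reserved / _GIB:.2f} GiB ({pct:.1f}% of device)",
    ]
    # Retries mean the caching allocator had to free cached blocks and try again.
    if retries:
        lines.append(f"WARNING: {retries} CUDA caching allocator retries")
    if ooms:
        lines.append(f"WARNING: {ooms} CUDA OOM events")
    return "\n".join(lines)


class GPUMemoryMonitor:
    """Sketch only; internals are assumptions, not the PR's implementation."""

    def __init__(self, device: str = "cuda:0"):
        import torch  # lazy import: the helper above stays usable without torch

        self._torch = torch
        self.device = torch.device(device)
        props = torch.cuda.get_device_properties(self.device)
        self.capacity_bytes = props.total_memory
        self._stats = {}

    def start_monitoring(self) -> None:
        # Clear peak and accumulated counters so stats cover only this run.
        self._torch.cuda.reset_peak_memory_stats(self.device)
        self._torch.cuda.reset_accumulated_memory_stats(self.device)

    def stop_monitoring(self) -> None:
        self._stats = self._torch.cuda.memory_stats(self.device)

    def get_peak_stats_str(self) -> str:
        return summarize_peak_stats(self._stats, self.capacity_bytes)
```

Keeping the formatting in a plain function separates it from the CUDA calls, so the warning logic can be checked on a machine without a GPU.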

@facebook-github-bot added the CLA Signed label Feb 13, 2024
@lessw2020 changed the title from "add model num params display, logging" to "add model num params display, gpu memory metrics" Feb 13, 2024
@tianyu-l (Contributor) commented:

Wow nice work!

As I'm preparing TensorBoard (in PR #57), it would be great if we could use this tool to plot memory-related metrics to TB. Do you think I can just call gpu_metrics.get_current_status() and then read gpu_metrics.device_memory_usage and gpu_metrics.device_memory_utilization to plot them at each TB logging step?

One related thing I'd like your advice on: when, and for which metrics, do we want to log/show rank-level metrics, and when global-level metrics?

@lessw2020 (Contributor, Author) commented:

Yes, agreed. I'll consolidate this into the same file (will modify to use metrics.py), and I can expose a per-iteration API to supply the desired memory stats for TB logging.
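
A per-iteration hook of the kind discussed here might be sketched as follows. Everything in it is hypothetical: the function names, the tag names, and the choice of reserved memory as the plotted quantity are assumptions, not the API that was eventually exposed. The writer only needs an `add_scalar(tag, scalar_value, global_step)` method, which is the signature `torch.utils.tensorboard.SummaryWriter` provides.

```python
from typing import Dict, Protocol


class ScalarWriter(Protocol):
    """Anything with SummaryWriter-style add_scalar, e.g. a TB writer."""

    def add_scalar(self, tag: str, scalar_value: float, global_step: int) -> None:
        ...


def memory_scalars(reserved_bytes: int, capacity_bytes: int) -> Dict[str, float]:
    """Turn raw byte counts into the two scalars to plot (tag names assumed)."""
    gib = 1024 ** 3
    return {
        "memory/reserved_gib": reserved_bytes / gib,
        "memory/reserved_pct": 100.0 * reserved_bytes / capacity_bytes,
    }


def log_memory_step(
    writer: ScalarWriter, step: int, reserved_bytes: int, capacity_bytes: int
) -> None:
    """Write one data point per scalar for the given training step."""
    for tag, value in memory_scalars(reserved_bytes, capacity_bytes).items():
        writer.add_scalar(tag, value, global_step=step)
```

In a training loop, `reserved_bytes` could come from `torch.cuda.memory_reserved()` and `capacity_bytes` from `torch.cuda.get_device_properties(device).total_memory`, called once per TB logging step.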

@wanchaol (Contributor) left a comment:

Looks pretty good! Have some comments

torchtrain/train_configs/train_config.toml (outdated, resolved)
torchtrain/metrics_utils.py (outdated, resolved)
run_llama_train.sh (outdated, resolved)
@lessw2020 merged commit 40c93e9 into pytorch:main Feb 15, 2024
3 checks passed
@lessw2020 deleted the add_metrics branch February 15, 2024 00:34
@tianyu-l linked an issue Feb 16, 2024 that may be closed by this pull request
lessw2020 added a commit that referenced this pull request Apr 18, 2024
philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request Aug 17, 2024
awgu added a commit that referenced this pull request Aug 19, 2024
H-Huang added a commit that referenced this pull request Aug 20, 2024
H-Huang added a commit that referenced this pull request Aug 20, 2024
Labels
CLA Signed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add metrics to collect during training
4 participants