

add TensorBoard logging with loss and wps #57

Merged
merged 3 commits into gh/tianyu-l/1/base from gh/tianyu-l/1/head on Feb 15, 2024

Conversation

tianyu-l
Contributor

@tianyu-l tianyu-l commented Feb 13, 2024

Stack from ghstack (oldest at bottom):

Each rank builds its own TensorBoard writer. The global loss is communicated among all ranks before logging.

To visualize using SSH tunneling:
`ssh -L 6006:127.0.0.1:6006 your_user_name@my_server_ip`
then, in the torchtrain repo, run
`tensorboard --logdir=./torchtrain/outputs/tb`
and go to http://localhost:6006/ in a web browser.

Screenshot 2024-02-12 at 6 39 28 PM
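The logging pattern described above (each rank builds its own writer, and the loss is averaged across ranks before logging) can be sketched in plain Python. In the real code the averaging would use `torch.distributed.all_reduce` and `torch.utils.tensorboard.SummaryWriter`; this standalone sketch simulates both so it runs without a distributed setup, and all helper names here are hypothetical, not torchtrain's actual API.

```python
# Hypothetical sketch of the PR's logging pattern; not the actual
# torchtrain code. The reduction is simulated in plain Python.

def global_avg_loss(per_rank_losses):
    # What all_reduce(SUM) followed by division by world_size computes.
    world_size = len(per_rank_losses)
    return sum(per_rank_losses) / world_size

class FakeWriter:
    """Stand-in for torch.utils.tensorboard.SummaryWriter."""
    def __init__(self, log_dir):
        self.log_dir = log_dir
        self.scalars = []

    def add_scalar(self, tag, value, step):
        self.scalars.append((tag, value, step))

def log_metrics(step, all_losses, writer):
    # Every rank logs the *global* average loss, not its local one.
    avg = global_avg_loss(all_losses)
    writer.add_scalar("loss/global_avg", avg, step)
    return avg

# Example: three ranks with local losses 2.0, 4.0, 6.0 at step 10.
writer = FakeWriter(log_dir="./outputs/tb/rank0")
avg = log_metrics(step=10, all_losses=[2.0, 4.0, 6.0], writer=writer)
print(avg)  # 4.0
```

The key design point is that the communication happens before `add_scalar`, so every rank's TensorBoard log shows the same global curve.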

tianyu-l added a commit that referenced this pull request Feb 13, 2024
ghstack-source-id: 297bec2b7acdf83c0af32dbf89dda3c6672095c9
Pull Request resolved: #57
@facebook-github-bot facebook-github-bot added the "CLA Signed" label (managed by the Meta Open Source bot) Feb 13, 2024
@tianyu-l tianyu-l linked an issue Feb 13, 2024 that may be closed by this pull request
tianyu-l added a commit that referenced this pull request Feb 13, 2024
ghstack-source-id: cdfe4c2c496feae23399019ec2a63b443fb3b6a9
Pull Request resolved: #57
train.py Outdated

time_delta = timer() - time_last_log
wps = nwords_since_last_log / (
    time_delta * parallel_dims.sp * parallel_dims.pp
)
Contributor

A neater way would be to define a `model_parallel_size` in the `parallel_dims` class that returns this number directly (i.e., a cached property).
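The reviewer's suggestion could look roughly like the sketch below. The field names follow the snippet under review (`sp`, `pp`), but this is a hypothetical standalone version, not the actual torchtrain `ParallelDims` class.

```python
from dataclasses import dataclass
from functools import cached_property

@dataclass
class ParallelDims:
    # Hypothetical sketch; field names follow the snippet under review.
    dp: int = 1  # data parallel degree
    sp: int = 1  # sequence parallel degree
    pp: int = 1  # pipeline parallel degree

    @cached_property
    def model_parallel_size(self) -> int:
        # The factor dividing wps in the snippet above, computed once
        # and cached on the instance after the first access.
        return self.sp * self.pp

dims = ParallelDims(dp=8, sp=2, pp=4)
print(dims.model_parallel_size)  # 8

# The wps computation then reads:
# wps = nwords_since_last_log / (time_delta * dims.model_parallel_size)
```

Using `functools.cached_property` means the product is computed once per instance rather than on every logging step, and the call site no longer needs to know which dimensions count as model parallelism.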

Contributor

@wanchaol wanchaol left a comment


This looks great! One minor comment, and please update the README to include how to set up and use TensorBoard.

tianyu-l added a commit that referenced this pull request Feb 15, 2024
ghstack-source-id: d0828f16c06747a5af2586630e5205bf786de1c4
Pull Request resolved: #57
@tianyu-l tianyu-l merged commit 377eab2 into gh/tianyu-l/1/base Feb 15, 2024
3 checks passed
tianyu-l added a commit that referenced this pull request Feb 15, 2024
ghstack-source-id: d0828f16c06747a5af2586630e5205bf786de1c4
Pull Request resolved: #57
@tianyu-l tianyu-l deleted the gh/tianyu-l/1/head branch February 15, 2024 18:38
@tianyu-l tianyu-l linked an issue Feb 16, 2024 that may be closed by this pull request
lessw2020 pushed a commit that referenced this pull request Apr 18, 2024
ghstack-source-id: d0828f16c06747a5af2586630e5205bf786de1c4
Pull Request resolved: #57
philippguevorguian pushed a commit to YerevaNN/YNNtitan that referenced this pull request Aug 17, 2024
ghstack-source-id: d0828f16c06747a5af2586630e5205bf786de1c4
Pull Request resolved: pytorch#57
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Development

Successfully merging this pull request may close these issues.

Add Tensorboard
Add metrics to collect during training
3 participants