
new feature for profiling training runs #782

Merged
merged 24 commits into master on Feb 7, 2020

Conversation

@jeremyjordan (Contributor) commented Feb 2, 2020

Before submitting

  • Was this discussed/approved via a GitHub issue? (no need for typos, doc improvements)
  • Did you read the contributor guideline?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes #753

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

@jeremyjordan (Contributor, Author)

Added tests for the simple profiler. I'm not sure of the best way to test the advanced profiler; the current output is a string, which is somewhat tricky to parse. There was a recent PR that allows getting profile stats as a dict, but it was only merged into Python a couple of weeks ago.
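For context, a minimal standard-library sketch of both sides of this: capturing the classic pstats report as a string (awkward to assert against in tests), versus the structured stats added in Python 3.9. None of this is the PR's code; the profiled call is a stand-in.

import cProfile
import io
import pstats

# run a profiled action (the sorted() call is just a stand-in)
profile = cProfile.Profile()
profile.enable()
sorted(range(1000), key=lambda x: -x)
profile.disable()

# classic approach: dump the report into a string via pstats
stream = io.StringIO()
pstats.Stats(profile, stream=stream).sort_stats("cumulative").print_stats()
report_text = stream.getvalue()  # free-form text, tricky to parse in a test

# Python 3.9+ exposes structured stats that are easier to assert against:
#   stats_profile = pstats.Stats(profile).get_stats_profile()
#   assert stats_profile.total_tt >= 0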

@jeremyjordan (Contributor, Author)

I'm planning to add more documentation regarding usage, but would like to get some input to make sure this is generally an acceptable solution.

@Borda (Member) commented Feb 3, 2020

looks promising, I just need to have a deeper look... :]

@jeremyjordan (Contributor, Author)

Great! I'll go ahead and work on more documentation this evening.

@Borda (Member) commented Feb 3, 2020

maybe wait for @williamFalcon's first screening opinion before you start the fine polishing... ;]

@Borda Borda added the feature Is an improvement or enhancement label Feb 3, 2020
@Borda Borda added this to the 0.7.0 milestone Feb 3, 2020
Review thread on pytorch_lightning/trainer/training_loop.py (resolved)

def start(self, action_name):
    if action_name in self.current_actions:
        raise ValueError(
Member:

I would consider some kind of softer notification (just a logger warning), since it is a complementary service...

@jeremyjordan (Contributor, Author):

I gave it some thought and would prefer to keep the exception. For the default profiled actions, this will be tested and we'll ensure that everything works properly (verifying that no tests raise the exception). If the user is profiling additional actions, I would think they would want to know when their code is incorrect. The only time this error is raised is if someone calls profiler.stop() without calling profiler.start() or vice versa.

In reality, I don't even think we should document the start() and stop() interfaces to the user (since it can quickly get very messy), following the principle of having one way to do things.

Force User Decisions To Best Practices
There are 1,000 ways to do something. However, something eventually becomes standard practice that everyone does. Thus we pick one way of doing it and force everyone to do it this way.
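As an aside, the usual way to guarantee that start()/stop() stay paired is to expose only a context manager built on top of them. A minimal sketch (the name profile here is an assumption, not necessarily this PR's API):

from contextlib import contextmanager

@contextmanager
def profile(profiler, action_name):
    # start/stop are always paired, so users never hit the mismatch ValueError
    profiler.start(action_name)
    try:
        yield action_name
    finally:
        profiler.stop(action_name)

# usage sketch:
#   with profile(my_profiler, "my_custom_action"):
#       run_my_custom_action()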


def describe(self):
    def print_row(action, mean, std_dev):
        print(f"{action:<20s}\t| {mean:<15}\t| {std_dev:<15}")
Member:

use logging
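A rough sketch of what that suggestion amounts to, assuming a module-level logger (the helper name log_row is illustrative, not the PR's code):

import logging

log = logging.getLogger(__name__)

def log_row(action, mean, std_dev):
    # same formatting as the print-based row above, but routed through the
    # logging module so verbosity and output destination stay configurable
    log.info(f"{action:<20s}\t| {mean:<15}\t| {std_dev:<15}")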

"""

def __init__(self, output_filename=None):
'''
Member:

''' >> """ (i.e. use triple double-quoted docstrings)

@jeremyjordan (Contributor, Author) commented Feb 4, 2020

I used the LightningTemplateModel to verify that the profiler behaves and outputs as expected. The printed summary could definitely be presented better. Should we keep the reporting in seconds or use something like milliseconds? Any other tips to clean up this summary?

Also, we should still print the summary when the user exits early using Ctrl-C to trigger a KeyboardInterrupt. Where would I need to make a change to make sure that's the case?

>>> from pl_examples.basic_examples.lightning_module_template import LightningTemplateModel
>>> from pytorch_lightning import Trainer
>>> from pytorch_lightning.utilities.profiler import Profiler
>>> from argparse import Namespace
>>> hparams = {
...     "batch_size": 32,
...     "in_features": 784,
...     "hidden_dim": 128,
...     "drop_prob": 0.2,
...     "out_features": 10,
...     "learning_rate": 1e-3,
...     "data_root": "data"
... }
>>> hparams = Namespace(**hparams)
>>> model = LightningTemplateModel(hparams)
>>> profiler = Profiler()
>>> trainer = Trainer(profiler=profiler, max_epochs=1)
>>> trainer.fit(model)
INFO:root:Training data loader called.
INFO:root:Test data loader called.
INFO:root:Validation data loader called.
INFO:root:
  | Name      | Type        | Params | In sizes | Out sizes
-------------------------------------------------------------
0 | c_d1      | Linear      | 100 K  | [5, 784] | [5, 128] 
1 | c_d1_bn   | BatchNorm1d | 256    | [5, 128] | [5, 128] 
2 | c_d1_drop | Dropout     | 0      | [5, 128] | [5, 128] 
3 | c_d2      | Linear      | 1 K    | [5, 128] | [5, 10]  
Epoch 1: 100%|█| 2188/2188 [00:28<00:00, 153.56batch/s, batch_idx=1874, loss=0.210, train_loss=0.183, v_num=2, val_acc=0.952, val_loss=0.1/
Epoch 1: 100%|█| 2188/2188 [00:28<00:00, 76.55batch/s, batch_idx=1874, loss=0.210, train_loss=0.183, v_num=2, val_acc=0.952, val_loss=0.171
INFO:profiler:
Profiler Report

Action                  |  Mean duration (s)    |  Std dev.       
------------------------------------------------------------
on_epoch_start          |  7.863e-06            |  0.0            
on_batch_start          |  5.4336e-06           |  1.5331e-06     
model_forward           |  0.0016781            |  0.0002032      
model_backward          |  0.0017407            |  0.00023147     
on_after_backward       |  4.2085e-06           |  1.5902e-06     
optimizer_step          |  0.0010742            |  0.00016276     
on_batch_end            |  4.4134e-06           |  1.1413e-06     
on_epoch_end            |  5.039e-06            |  0.0            
on_train_end            |  4.06e-06             |  0.0 
1
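On the Ctrl-C question, the usual pattern is to catch KeyboardInterrupt around the training loop and emit the summary in a finally block. A self-contained sketch with stand-ins for the Trainer and profiler (not the actual Lightning internals):

import time

class DemoProfiler:
    """Stand-in for the profiler in this PR; only describe() matters here."""
    def describe(self):
        print("Profiler Report ...")

def run_training_loop():
    # stand-in for Trainer.fit(); a real run is what Ctrl-C would interrupt
    for _ in range(3):
        time.sleep(0.01)

profiler = DemoProfiler()
try:
    run_training_loop()
except KeyboardInterrupt:
    # the user pressed Ctrl-C: stop training early, but still fall through
    pass
finally:
    # the summary is emitted whether the run finished or was interrupted
    profiler.describe()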

@jeremyjordan (Contributor, Author)

OK, I added a method to the base profiler that we can use when we need to profile calling next() on iterables (e.g. getting the next training batch).

I also stopped reporting standard deviations for durations and instead report the total time spent per action. I think this will be more informative.

Profiler Report

Action                  |  Mean duration (s)    |  Total time (s) 
-----------------------------------------------------------------
on_epoch_start          |  5.993e-06            |  5.993e-06      
get_train_batch         |  0.0087412            |  16.398         
on_batch_start          |  5.0865e-06           |  0.0095372      
model_forward           |  0.0017818            |  3.3408         
model_backward          |  0.0018283            |  3.4282         
on_after_backward       |  4.2862e-06           |  0.0080366      
optimizer_step          |  0.0011072            |  2.0759         
on_batch_end            |  4.5202e-06           |  0.0084753      
on_epoch_end            |  3.919e-06            |  3.919e-06      
on_train_end            |  5.449e-06            |  5.449e-06  
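For illustration, a minimal sketch of the kind of iterable-profiling helper described above (the name profile_iterable and the bookkeeping are assumptions, not the merged code):

import time
from collections import defaultdict

class SketchProfiler:
    def __init__(self):
        self.recorded_durations = defaultdict(list)

    def profile_iterable(self, iterable, action_name):
        iterator = iter(iterable)
        while True:
            start = time.monotonic()
            try:
                value = next(iterator)
            except StopIteration:
                return
            # only the time spent inside next() is recorded; time spent by the
            # consumer is excluded because the timer restarts on each resume
            self.recorded_durations[action_name].append(time.monotonic() - start)
            yield value

# usage sketch: wrap a dataloader so every batch fetch is timed
profiler = SketchProfiler()
for _batch in profiler.profile_iterable(range(5), "get_train_batch"):
    pass
total = sum(profiler.recorded_durations["get_train_batch"])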

@jeremyjordan jeremyjordan changed the title WIP: new feature for profiling training runs new feature for profiling training runs Feb 5, 2020
@williamFalcon (Contributor)

@jeremyjordan this is so sick, haha. Great job!

@jeremyjordan (Contributor, Author) commented Feb 5, 2020

@williamFalcon thanks! I'm also excited about the AdvancedProfiler, as it may help people who really want to get into the details. I'm envisioning that they'll start with Profiler and switch to AdvancedProfiler if they decide they want to try optimizing a step. The output for AdvancedProfiler looks something like this (for each action):

Profile stats for: get_train_batch
         4869394 function calls (4863767 primitive calls) in 19.107 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
3752/1876    0.012    0.000   19.100    0.010 {built-in method builtins.next}
     1876    0.008    0.000   19.090    0.010 dataloader.py:344(__next__)
     1876    0.075    0.000   19.082    0.010 dataloader.py:383(_next_data)
     1875    0.013    0.000   18.933    0.010 fetch.py:42(fetch)
     1875    0.086    0.000   18.494    0.010 fetch.py:44(<listcomp>)
    60000    1.797    0.000   18.408    0.000 mnist.py:80(__getitem__)
    60000    0.268    0.000   13.126    0.000 transforms.py:68(__call__)
    60000    0.184    0.000    7.104    0.000 transforms.py:93(__call__)
    60000    1.679    0.000    6.921    0.000 functional.py:42(to_tensor)
    60000    0.258    0.000    5.754    0.000 transforms.py:167(__call__)
    60000    2.856    0.000    5.495    0.000 functional.py:191(normalize)
    60000    0.727    0.000    3.128    0.000 Image.py:2612(fromarray)
    60000    0.398    0.000    2.362    0.000 Image.py:2552(frombuffer)
    60000    0.410    0.000    1.395    0.000 Image.py:734(tobytes)
    60000    1.197    0.000    1.197    0.000 {method 'div' of 'torch._C._TensorBase' objects}
    60000    0.246    0.000    1.069    0.000 Image.py:2469(new)
   120000    0.990    0.000    0.990    0.000 {built-in method as_tensor}
   120000    0.795    0.000    0.795    0.000 {method 'transpose' of 'torch._C._TensorBase' objects}
    60000    0.694    0.000    0.694    0.000 {method 'view' of 'torch._C._TensorBase' objects}
    60000    0.650    0.000    0.650    0.000 {method 'float' of 'torch._C._TensorBase' objects}
    60000    0.551    0.000    0.551    0.000 {method 'sub_' of 'torch._C._TensorBase' objects}
   120000    0.326    0.000    0.539    0.000 Image.py:572(_new)
    60000    0.500    0.000    0.500    0.000 {method 'clone' of 'torch._C._TensorBase' objects}
    60000    0.298    0.000    0.450    0.000 Image.py:461(_getencoder)
5625/1875    0.049    0.000    0.426    0.000 collate.py:42(default_collate)
    60000    0.397    0.000    0.397    0.000 {method 'div_' of 'torch._C._TensorBase' objects}
   180000    0.150    0.000    0.372    0.000 Image.py:629(__del__)
     1875    0.010    0.000    0.363    0.000 collate.py:79(<listcomp>)
   180000    0.315    0.000    0.315    0.000 Image.py:547(__init__)
     1875    0.299    0.000    0.299    0.000 {built-in method stack}
    60000    0.284    0.000    0.284    0.000 {built-in method PIL._imaging.map_buffer}
   120000    0.184    0.000    0.252    0.000 Image.py:2451(_check_size)
   560625    0.222    0.000    0.246    0.000 {built-in method builtins.isinstance}
    60000    0.239    0.000    0.239    0.000 {built-in method PIL._imaging.fill}
   180000    0.152    0.000    0.222    0.000 Image.py:591(__exit__)
    60000    0.220    0.000    0.220    0.000 {method 'encode' of 'ImagingEncoder' objects}
    60000    0.201    0.000    0.201    0.000 {built-in method from_buffer}
    60000    0.102    0.000    0.201    0.000 functional.py:30(_is_tensor_image)
    60000    0.189    0.000    0.189    0.000 {method 'numpy' of 'torch._C._TensorBase' objects}
    60000    0.090    0.000    0.126    0.000 Image.py:838(load)
420003/420002    0.088    0.000    0.088    0.000 {built-in method builtins.len}
     1876    0.002    0.000    0.073    0.000 dataloader.py:338(_next_index)
    60000    0.053    0.000    0.072    0.000 functional.py:34(_is_numpy)
   181875    0.071    0.000    0.071    0.000 {built-in method builtins.hasattr}
    60000    0.053    0.000    0.070    0.000 __init__.py:116(is_tensor)
    60000    0.070    0.000    0.070    0.000 {built-in method builtins.max}
     1876    0.047    0.000    0.070    0.000 sampler.py:198(__iter__)
    60000    0.054    0.000    0.069    0.000 functional.py:23(_is_pil_image)
    60000    0.067    0.000    0.067    0.000 {built-in method PIL._imaging.raw_encoder}
    60000    0.066    0.000    0.066    0.000 {built-in method builtins.getattr}
    60000    0.038    0.000    0.056    0.000 _util.py:9(isStringType)
    60000    0.054    0.000    0.054    0.000 {method 'setimage' of 'ImagingEncoder' objects}
   180000    0.052    0.000    0.052    0.000 Image.py:568(size)
    60000    0.041    0.000    0.041    0.000 {method 'contiguous' of 'torch._C._TensorBase' objects}
    60000    0.037    0.000    0.037    0.000 {method 'pixel_access' of 'ImagingCore' objects}
     1875    0.030    0.000    0.030    0.000 {built-in method tensor}
   120000    0.030    0.000    0.030    0.000 {method 'copy' of 'dict' objects}
    60000    0.029    0.000    0.029    0.000 {method 'ndimension' of 'torch._C._TensorBase' objects}
   120000    0.027    0.000    0.027    0.000 {method 'append' of 'list' objects}
     3750    0.003    0.000    0.024    0.000 abc.py:137(__instancecheck__)
     3750    0.012    0.000    0.022    0.000 {built-in method _abc._abc_instancecheck}
    61876    0.021    0.000    0.021    0.000 {method 'get' of 'dict' objects}
    60000    0.016    0.000    0.016    0.000 {method 'join' of 'bytes' objects}
     1875    0.002    0.000    0.010    0.000 abc.py:141(__subclasscheck__)
     1875    0.006    0.000    0.008    0.000 {built-in method _abc._abc_subclasscheck}
     1876    0.005    0.000    0.007    0.000 profiler.py:204(stop)
        1    0.000    0.000    0.004    0.004 sampler.py:103(__iter__)
        1    0.002    0.002    0.002    0.002 {method 'tolist' of 'torch._C._TensorBase' objects}
     1875    0.002    0.000    0.002    0.000 _collections_abc.py:302(__subclasshook__)
        1    0.002    0.002    0.002    0.002 {built-in method randperm}
     1875    0.001    0.000    0.001    0.000 worker.py:74(get_worker_info)
     1876    0.001    0.000    0.001    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 mnist.py:102(__len__)
        1    0.000    0.000    0.000    0.000 tensor.py:443(__len__)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.iter}
        1    0.000    0.000    0.000    0.000 {method 'dim' of 'torch._C._TensorBase' objects}
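For readers following along, a rough sketch of how a per-action cProfile report like the one above can be produced: one Profile object per action, rendered with pstats on demand. Class and attribute names are illustrative, not the merged implementation.

import cProfile
import io
import pstats

class SketchAdvancedProfiler:
    def __init__(self):
        self.profiled_actions = {}

    def start(self, action_name):
        # reuse one Profile per action so repeated calls accumulate stats
        profile = self.profiled_actions.setdefault(action_name, cProfile.Profile())
        profile.enable()

    def stop(self, action_name):
        self.profiled_actions[action_name].disable()

    def describe(self):
        for action_name, profile in self.profiled_actions.items():
            stream = io.StringIO()
            pstats.Stats(profile, stream=stream).sort_stats("cumulative").print_stats()
            print(f"Profile stats for: {action_name}\n{stream.getvalue()}")

profiler = SketchAdvancedProfiler()
profiler.start("get_train_batch")
sum(i * i for i in range(100_000))  # stand-in for fetching a batch
profiler.stop("get_train_batch")
profiler.describe()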

@Borda (Member) left a review:

Great work, Thx!

Review thread on pytorch_lightning/utilities/profiler.py (outdated, resolved)
import numpy as np


def test_simple_profiler():
Contributor:

need a test for the advanced profiler
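One possible shape for such a test, sketched with assumptions about the import path (the module still lives under utilities at this point in the thread) and about a profiled_actions attribute; the merged test may differ:

import time

from pytorch_lightning.utilities.profiler import AdvancedProfiler

def test_advanced_profiler():
    profiler = AdvancedProfiler()

    profiler.start("test_action")
    time.sleep(0.01)
    profiler.stop("test_action")

    # assumes the profiler keeps per-action cProfile stats in a
    # profiled_actions mapping; adjust to however the results are exposed
    assert "test_action" in profiler.profiled_actions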

.. role:: hidden
    :class: hidden-section

Profiler
Contributor:

Maybe "Performance Profiler" ?

Review thread on pytorch_lightning/trainer/trainer.py (resolved)
@williamFalcon (Contributor)

@jeremyjordan amazing job! Let's make sure the docs and tests are thorough here.

In addition, I'd love to get a Medium post about this. I can co-publish with you if you want, so we can get it into a lot of publications. Email me to discuss.

@Borda (Member) commented Feb 5, 2020

You can make a #publishing channel on Slack ;)
Btw, do you think that a Medium post is a good way for citing?

@williamFalcon (Contributor)

It's good for discovery because we can also publish in Towards Data Science, KDnuggets, etc...

@jeremyjordan (Contributor, Author)

Yeah, I'd love to do a blog post on this! I agree that great documentation (including blog posts) is important for user adoption. Happy to cross-post on my blog as well.

@jeremyjordan (Contributor, Author)

@williamFalcon I've added more documentation, a quick test for the AdvancedProfiler, and updated the Trainer to allow the usage of profiler=True. Ready for review :)
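As a usage note, the profiler=True path mentioned above amounts to something like this (a sketch against this PR's Trainer changes; argument handling may differ in later releases):

from pytorch_lightning import Trainer

# profiler=True enables the default simple profiler without constructing one
# explicitly; passing a profiler instance (as in the earlier example) still works
trainer = Trainer(profiler=True, max_epochs=1)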

@williamFalcon (Contributor) commented Feb 6, 2020

@jeremyjordan can we make the import:

from pytorch_lightning.profilers import ProfilerA, ProfilerB

@Borda (Member) commented Feb 6, 2020

the profiler is in pytorch_lightning/utilities/profiler.py
so more likely from pytorch_lightning import ProfilerA, ProfilerB

@williamFalcon (Contributor)

@Borda we need organization in the package. I think we need a collection called profilers because it allows other profilers to enter Lightning as well, much like we have loggers and callbacks.

@jeremyjordan (Contributor, Author)

should we move pytorch_lightning/utilities/profiler.py to pytorch_lightning/profiler/?

@williamFalcon (Contributor)

Yes. I kind of hate utility folders; it usually means we didn't think hard about organization :)

@jeremyjordan (Contributor, Author)

Awesome! I'll make the change.

Haha, I feel the same way; utils.py is my pet peeve.
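The move being agreed on here roughly amounts to a layout like the following (a sketch; exact file and symbol names are assumptions, not a copy of the final code):

# pytorch_lightning/profiler/__init__.py
# the profiler gets its own package that re-exports its public classes,
# mirroring how loggers and callbacks are organized
from pytorch_lightning.profiler.profiler import AdvancedProfiler, Profiler

__all__ = ["Profiler", "AdvancedProfiler"]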

@Borda (Member) commented Feb 6, 2020

> @Borda we need organization in the package. I think we need a collection called profilers because it allows other profilers to enter Lightning as well, much like we have loggers and callbacks.

Sure, do you want to do it in 0.6.1 or in the next release, 0.7?

@williamFalcon (Contributor)

We can add organization over time; right now, since profilers is new, we need to do it now.

@Borda (Member) commented Feb 6, 2020

Sure, just think about versioning, since restructuring the utils API would be better in a new release, like 0.7: https://github.com/PyTorchLightning/pytorch-lightning/milestone/5

@williamFalcon (Contributor)

I don't think we have any utils API right now... the point is to avoid utils.

@williamFalcon williamFalcon merged commit 1cf430f into Lightning-AI:master Feb 7, 2020
@Borda Borda modified the milestones: 0.7.0, 0.6.1 Feb 15, 2020
Labels: feature (Is an improvement or enhancement)
Linked issue: Feature: Profiling for a training run
3 participants