Speed fused_op compilation by caching ptx and jit-compiled device functions #16783

DickJC123 · 2019-11-12T03:43:08Z

Description

This PR speeds up the runtime NVRTC compilation of fused_ops, in response to @rondogency's comment #15167 (comment). As reported in that comment, the combined runtime of the 3 unit tests it mentions had grown to 17.5 minutes with fusion enabled. With this PR, the combined runtime drops to about 1 minute, compared with about 30 seconds with fusion turned off.

Runtime compilation of NVIDIA GPU kernels involves 2 steps:
- compiling the CUDA code to PTX assembly (performed once per GPU architecture)
- translating the PTX assembly to binary and loading it into a GPU's set of runnable kernels (performed once per GPU device). This latter step produces the CUfunction needed to execute the kernel on the device.

After realizing that the slowed-down unit tests were creating many identical fused ops, I added a cache of the PTX and CUfunctions. For each GPU arch, the cache maps the CUDA source code to its PTX and to any CUfunctions created from that PTX.
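To illustrate the cache layout described above, here is a minimal Python model of it. This is only a sketch and not the code in this PR: `KernelCache`, `CacheEntry`, `fake_compile` and `fake_load` are hypothetical names, and the two callables are stand-ins for the real source-to-PTX compile and the per-device load that yields a CUfunction.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class CacheEntry:
    """PTX for one source string, plus the functions loaded from it per device."""
    ptx: str
    functions: Dict[int, object] = field(default_factory=dict)


class KernelCache:
    """Hypothetical model of the cache: GPU arch -> CUDA source -> CacheEntry."""

    def __init__(self,
                 compile_to_ptx: Callable[[str, int], str],
                 load_function: Callable[[str, int], object]):
        self._compile_to_ptx = compile_to_ptx  # stand-in for the source -> PTX step
        self._load_function = load_function    # stand-in for the PTX -> function step
        self._by_arch: Dict[int, Dict[str, CacheEntry]] = {}

    def get_function(self, source: str, arch: int, device_id: int) -> object:
        entries = self._by_arch.setdefault(arch, {})
        entry = entries.get(source)
        if entry is None:
            # Step 1: compile to PTX, only on the first request for this arch.
            entry = CacheEntry(ptx=self._compile_to_ptx(source, arch))
            entries[source] = entry
        func = entry.functions.get(device_id)
        if func is None:
            # Step 2: load the PTX, only on the first request for this device.
            func = self._load_function(entry.ptx, device_id)
            entry.functions[device_id] = func
        return func


# Usage with fake compile/load steps that count how often they run.
calls = {"ptx": 0, "load": 0}


def fake_compile(source: str, arch: int) -> str:
    calls["ptx"] += 1
    return f"ptx(sm_{arch}):{source}"


def fake_load(ptx: str, device_id: int) -> object:
    calls["load"] += 1
    return (device_id, ptx)


cache = KernelCache(fake_compile, fake_load)
src = "__global__ void fused(float* x) {}"
f0 = cache.get_function(src, arch=70, device_id=0)
f0_again = cache.get_function(src, arch=70, device_id=0)  # fully cached
f1 = cache.get_function(src, arch=70, device_id=1)        # reuses the PTX only
print(calls)
```

In this model, identical source requested again on the same device triggers neither step, and a second device of the same arch repeats only the load step. A real implementation would presumably also need a lock around the lookups if ops can be created from several threads; that is left out here.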

Keep in mind that the fusion framework targets the typical scenario of creating a model's graph and then executing it many times. The CI was adversely impacted because it often executes a model's graph just once after creation.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

- [X] The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
- [X] Changes are complete (i.e. I finished coding on this PR)
- [X] All changes have test coverage:
  - Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  - Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  - Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
- [X] Code is well-documented:
  - For user-facing API changes, API doc string has been updated.
  - For new C++ functions in header files, their functionalities and arguments are documented.
  - For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  - Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
- [X] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change


DickJC123 · 2019-11-12T21:01:49Z

As originally reported, with pointwise fusion enabled:
test_operator_gpu.test_sparse_mathematical_core: ~13s to ~350s
test_operator_gpu.test_lstm_bidirectional: ~15s to ~450s
test_operator_gpu.test_rnnrelu_bidirectional: ~4s to ~250s

As timed on this latest passing CI run for centos-gpu:
test_operator_gpu.test_lstm_bidirectional: 39s
test_operator_gpu.test_sparse_mathematical_core: 18s
test_operator_gpu.test_rnnrelu_bidirectional: 6s
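For reference, the totals and per-test ratios implied by the timings quoted above can be checked with a few lines of Python. This is just arithmetic on those numbers; the dictionary keys are labels chosen here. Note that the inputs are approximate and that the cached timings come from a different CI run than the originally reported ones.

```python
# Timings in seconds, copied from the figures quoted above (all approximate).
timings = {
    "test_sparse_mathematical_core": {"no_fusion": 13, "fusion": 350, "fusion_cached": 18},
    "test_lstm_bidirectional": {"no_fusion": 15, "fusion": 450, "fusion_cached": 39},
    "test_rnnrelu_bidirectional": {"no_fusion": 4, "fusion": 250, "fusion_cached": 6},
}

# Combined runtime of the three tests in each configuration.
totals = {
    key: sum(t[key] for t in timings.values())
    for key in ("no_fusion", "fusion", "fusion_cached")
}

# Per-test ratio of fusion runtime without the cache to runtime with it.
speedups = {name: t["fusion"] / t["fusion_cached"] for name, t in timings.items()}

print(totals)
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")
```

The totals match the 17.5 minutes (1050 s), about 1 minute (63 s) and about 30 seconds (32 s) given in the description.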

DickJC123 · 2019-11-12T21:28:54Z

Using the centos-gpu unittest runtime now as a metric:

Before op fusion: 40 minutes
With op fusion (but before this PR): 1hr 40 minutes
With op fusion and this PR to cache compiles: 44 minutes
@larroy @samskalicky
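Put another way, the suite-level numbers above give the following overhead for fusion relative to the no-fusion baseline. This is a small sketch of the arithmetic only; `overhead` is a helper name introduced here.

```python
# centos-gpu unittest runtimes in minutes, from the list above.
baseline_min = 40        # before op fusion
fusion_min = 100         # with op fusion, before this PR (1hr 40 minutes)
fusion_cached_min = 44   # with op fusion and this PR


def overhead(total_min, base_min):
    """Return (extra minutes, extra as a percentage of the baseline)."""
    extra = total_min - base_min
    return extra, 100.0 * extra / base_min


before_pr = overhead(fusion_min, baseline_min)
with_pr = overhead(fusion_cached_min, baseline_min)
print(before_pr, with_pr)
```

So the extra time attributable to fusion goes from 60 minutes (150% of the baseline) to 4 minutes (10% of the baseline) on this run.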

ptrendx

LGTM

Speed fused_op compilation by caching ptx and jit-compiled functions

9b2c2d7

DickJC123 requested a review from ptrendx November 12, 2019 03:43

ptrendx approved these changes Nov 12, 2019

ptrendx merged commit 2c02bff into apache:master Nov 12, 2019

ptrendx pushed a commit to ptrendx/mxnet that referenced this pull request Nov 15, 2019

Speed fused_op compilation by caching ptx and jit-compiled functions (a…

1f570c7

…pache#16783)

ptrendx mentioned this pull request Nov 15, 2019

Backport to 1.6 (#16773, #16781, #16783, #16716, #16699, #16728, #16769, #16792) #16832

Merged

apeforest pushed a commit that referenced this pull request Nov 19, 2019

Speed fused_op compilation by caching ptx and jit-compiled functions (#…

169ed69

…16783)

ptrendx mentioned this pull request Dec 18, 2019

Bug with fusion #17105

Closed

ptrendx mentioned this pull request May 11, 2020

[Performance Regression] GPU memory increase for training and inference models #18280

Open
