This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[FEATURE] Use RTC for reduction ops #19426

Merged
merged 29 commits into apache:master on May 25, 2021
Conversation

ptrendx
Member

@ptrendx ptrendx commented Oct 26, 2020

Description

This PR is a continuation of the work started in #18622. It changes the reduction operations to be compiled with runtime compilation (RTC).

As the work progresses I will update the description.

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Changed reduction kernels to RTC
  • Removed duplication (broadcast_reduce-inl.cuh, broadcast_reduce_customized-inl.h)
  • Removed allocations/memcpies/synchronizations from forward of NumPy norm operator (ReduceAxesComputeImplWithReducer)
  • Fixed handling of workspace in backward of kron operator
  • Fixed handling of workspace in NumPy argmin/argmax operators
  • Fixed the tie-breaking logic of argmin/argmax (when values are equal, the lower index should be returned)
  • While working on this PR, I noticed a bug in workspace handling in the tensordot operator and filed issue #19458 (Wrong handling of workspace in multiple numpy operators).
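The argmin/argmax tie-breaking rule fixed above (on equal values, the lower index wins, matching NumPy's behavior) can be modeled as the merge step of a reduction over (value, index) pairs. This is an illustrative Python sketch, not the actual CUDA reducer code:

```python
from functools import reduce

def argmax_merge(best, candidate):
    """Merge step for an argmax reduction over (value, index) pairs.

    On equal values the pair with the lower index wins, so the result
    matches np.argmax regardless of the order partials are combined in.
    """
    best_val, best_idx = best
    cand_val, cand_idx = candidate
    if cand_val > best_val or (cand_val == best_val and cand_idx < best_idx):
        return candidate
    return best

def argmax(values):
    # Reduce the whole sequence with the merge step above.
    return reduce(argmax_merge, ((v, i) for i, v in enumerate(values)))[1]
```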

@ptrendx ptrendx added the pr-work-in-progress PR is still work in progress label Oct 26, 2020
@mxnet-bot

Hey @ptrendx, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [windows-gpu, unix-cpu, website, centos-gpu, sanity, clang, windows-cpu, miscellaneous, centos-cpu, edge, unix-gpu]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels Oct 29, 2020
@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-work-in-progress PR is still work in progress labels Oct 31, 2020
@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress and removed pr-awaiting-review PR is waiting for code review pr-awaiting-testing PR is reviewed and waiting CI build and test labels Apr 22, 2021
@ptrendx
Member Author

ptrendx commented Apr 23, 2021

@mxnet-bot run ci [unix-cpu]

@mxnet-bot

Jenkins CI successfully triggered : [unix-cpu]

@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-awaiting-review PR is waiting for code review and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels Apr 23, 2021
@DickJC123
Contributor

Nice work! This will be an important and complementary addition to the work you already PR'd in #18622. Some high-level questions:

Do you have any data on the overheads involved in RTC launch vs. compiled kernel launch, e.g. on the first iteration and thereafter (perhaps for both hybridized and unhybridized models)?

I'm sorry to see all those floating point constants in the MXNet RTC code. Are there no compiler-defined constants that can be used, or is there a motivation for avoiding them?

Having worked on these reduce functions quite a bit, you probably have a good sense of the level of testing. Do you feel it's adequate? Can RTC-based reduction invoke any new regions of the operator parameter space?

@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-awaiting-review PR is waiting for code review labels May 19, 2021
@ptrendx
Member Author

ptrendx commented May 19, 2021

Do you have any data on the overheads involved in RTC launch vs. compiled kernel launch, e.g. on the first iteration and thereafter (perhaps for both hybridized and unhybridized models)?

There is an overhead of 10 ms to 100 ms on the first launch of a given kernel, since it needs to be compiled before use. After compilation the kernel is stored in a cache and any subsequent call is fast: I measured ~2 us of overhead for constructing the kernel code and the cache lookup, which is comparable to the cost of cudaLaunchKernel itself. There is not really any difference between hybridized and non-hybridized models, since the functionality works irrespective of hybridization.

I'm sorry to see all those floating point constants in the MXNet RTC code. Are there no compiler-defined constants that can be used, or is there a motivation for avoiding them?

No floating point constants are compiler-defined; they all come from header files (e.g. ). The motivation for avoiding external headers is twofold: it sidesteps the potential issues of locating the headers, and NVRTC cannot include any header that contains host-only code.

Having worked on these reduce functions quite a bit, you probably have a good sense of the level of testing. Do you feel it's adequate? Can RTC-based reduction invoke any new regions of the operator parameter space?

I think the level of testing is generally adequate, and the change to RTC does not introduce any additional parameters to be tested. It actually consolidates the functionality, which improves test coverage: previously some functions used customized versions of the kernel (e.g. from src/operator/numpy/linalg/broadcast_reduce_customized-inl.cuh), while now all use cases are handled by the same kernel code.
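The consolidation described above, where one kernel skeleton serves many reductions by plugging in a reducer, can be illustrated with a Python sketch. The reducer interface here is hypothetical, only loosely modeled on MXNet's reducer structs:

```python
import math

class SumReducer:
    """Plain sum: identity map, addition merge, identity finalize."""
    init = 0.0
    @staticmethod
    def map(x): return x
    @staticmethod
    def merge(a, b): return a + b
    @staticmethod
    def finalize(a): return a

class L2NormReducer:
    """L2 norm: square each element, sum, then take the square root."""
    init = 0.0
    @staticmethod
    def map(x): return x * x
    @staticmethod
    def merge(a, b): return a + b
    @staticmethod
    def finalize(a): return math.sqrt(a)

def reduce_all(values, reducer):
    # One generic reduction loop shared by every reducer,
    # analogous to one RTC kernel body specialized per op.
    acc = reducer.init
    for v in values:
        acc = reducer.merge(acc, reducer.map(v))
    return reducer.finalize(acc)
```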

@mseth10 mseth10 added pr-work-in-progress PR is still work in progress and removed pr-awaiting-testing PR is reviewed and waiting CI build and test labels May 19, 2021
@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels May 21, 2021
@ptrendx
Member Author

ptrendx commented May 24, 2021

@mxnet-bot run ci [centos-gpu, unix-cpu]

@mxnet-bot

Jenkins CI successfully triggered : [unix-cpu, centos-gpu]

@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-awaiting-review PR is waiting for code review and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels May 24, 2021
Contributor

@DickJC123 DickJC123 left a comment


My questions have been answered previously to my satisfaction.

As pointed out by the author, this PR is a continuation of the work started in #18622, which has seen ample use by the community without issue. I feel the benefits of a smaller libmxnet.so and a smaller global-memory footprint outweigh the penalty of slower kernel execution on first use. Our ever-growing body of kernels is more maintainable with this RTC framework, and perf-enhancing fusions become possible.

LGTM.

@DickJC123 DickJC123 merged commit 57d0ace into apache:master May 25, 2021