This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[FEATURE] Use RTC for reduction ops #19426

Merged
merged 29 commits into apache:master on May 25, 2021
Conversation

ptrendx
Member

@ptrendx ptrendx commented Oct 26, 2020

Description

This PR is a continuation of the work started in #18622. It changes the reduction operations to be compiled with runtime compilation (RTC).

As the work progresses I will update the description.

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

Changes

  • Changed reduction kernels to RTC
  • Removed duplication (broadcast_reduce-inl.cuh, broadcast_reduce_customized-inl.h)
  • Removed allocations/memcpies/synchronizations from forward of NumPy norm operator (ReduceAxesComputeImplWithReducer)
  • Fixed handling of workspace in backward of kron operator
  • Fixed handling of workspace in NumPy argmin/argmax operators
  • Fixed the tie-breaking logic of argmin/argmax (when values are equal, the lower index should be returned)
  • While working on this PR, I noticed a bug in workspace handling in the tensordot operator and filed issue #19458 (Wrong handling of workspace in multiple numpy operators).
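The argmin/argmax tie-breaking rule fixed above (on equal values, the lower index wins, matching NumPy's behavior) can be modeled as the merge step of a reduction over (value, index) pairs. This is an illustrative Python sketch, not the actual CUDA reducer code:

```python
from functools import reduce

def argmax_merge(best, candidate):
    """Merge step for an argmax reduction over (value, index) pairs.

    On equal values the pair with the lower index wins, so the result
    matches np.argmax regardless of the order partials are combined in.
    """
    best_val, best_idx = best
    cand_val, cand_idx = candidate
    if cand_val > best_val or (cand_val == best_val and cand_idx < best_idx):
        return candidate
    return best

def argmax(values):
    # Reduce the whole sequence with the merge step above.
    return reduce(argmax_merge, ((v, i) for i, v in enumerate(values)))[1]
```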

@ptrendx ptrendx added the pr-work-in-progress PR is still work in progress label Oct 26, 2020
@mxnet-bot

Hey @ptrendx, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [windows-gpu, unix-cpu, website, centos-gpu, sanity, clang, windows-cpu, miscellaneous, centos-cpu, edge, unix-gpu]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels Oct 29, 2020
@lanking520 lanking520 added pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-work-in-progress PR is still work in progress labels Oct 31, 2020
@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress and removed pr-awaiting-review PR is waiting for code review pr-awaiting-testing PR is reviewed and waiting CI build and test labels Apr 22, 2021
@ptrendx
Member Author

ptrendx commented Apr 23, 2021

@mxnet-bot run ci [unix-cpu]

@mxnet-bot

Jenkins CI successfully triggered : [unix-cpu]

@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-awaiting-review PR is waiting for code review and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels Apr 23, 2021
@DickJC123
Contributor

Nice work! This will be an important and complementary addition to the work you already PR'd in #18622. Some high-level questions:

Do you have any data on the overheads involved in RTC launch vs. compiled kernel launch, e.g. on the first iteration and thereafter (perhaps for both hybridized and unhybridized models)?

I'm sorry to see all those floating point constants in the MXNet RTC code. Are there no compiler-defined constants that can be used, or is there a motivation for avoiding them?

Having worked on these reduce functions quite a bit, you probably have a good sense of the level of testing. Do you feel it's adequate? Can RTC-based reduction invoke any new regions of the operator parameter space?

@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test and removed pr-awaiting-review PR is waiting for code review labels May 19, 2021
@ptrendx
Member Author

ptrendx commented May 19, 2021

Do you have any data on the overheads involved in RTC launch vs. compiled kernel launch, e.g. on the first iteration and thereafter (perhaps for both hybridized and unhybridized models)?

There is an overhead of 10 ms to 100 ms on the first launch of a given kernel, since it needs to be compiled before use. After compilation the kernel is stored in a cache and any subsequent call is fast: I measured ~2 us of overhead for constructing the kernel code and the cache lookup, which is comparable to the cost of cudaLaunchKernel itself. There is not really any difference between hybridized and non-hybridized models, since the functionality works irrespective of hybridization.

I'm sorry to see all those floating point constants in the MXNet RTC code. Are there no compiler-defined constants that can be used, or is there a motivation for avoiding them?

No floating point constants are compiler-defined; they all come from header files (e.g. ). The motivation for avoiding external headers is twofold: it sidesteps the potential issues of locating the headers, and NVRTC cannot include any header that contains host-only code.

Having worked on these reduce functions quite a bit, you probably have a good sense of the level of testing. Do you feel it's adequate? Can RTC-based reduction invoke any new regions of the operator parameter space?

I think the level of testing is generally adequate, and the change to RTC does not introduce any additional parameters to be tested. It actually consolidates the functionality, which improves test coverage: previously some functions used customized versions of the kernel (e.g. from src/operator/numpy/linalg/broadcast_reduce_customized-inl.cuh), while now all use cases are handled by the same kernel code.
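The consolidation described above, where one kernel skeleton serves many reductions by plugging in a reducer, can be illustrated with a Python sketch. The reducer interface here is hypothetical, only loosely modeled on MXNet's reducer structs:

```python
import math

class SumReducer:
    """Plain sum: identity map, addition merge, identity finalize."""
    init = 0.0
    @staticmethod
    def map(x): return x
    @staticmethod
    def merge(a, b): return a + b
    @staticmethod
    def finalize(a): return a

class L2NormReducer:
    """L2 norm: square each element, sum, then take the square root."""
    init = 0.0
    @staticmethod
    def map(x): return x * x
    @staticmethod
    def merge(a, b): return a + b
    @staticmethod
    def finalize(a): return math.sqrt(a)

def reduce_all(values, reducer):
    # One generic reduction loop shared by every reducer,
    # analogous to one RTC kernel body specialized per op.
    acc = reducer.init
    for v in values:
        acc = reducer.merge(acc, reducer.map(v))
    return reducer.finalize(acc)
```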

@mseth10 mseth10 added pr-work-in-progress PR is still work in progress and removed pr-awaiting-testing PR is reviewed and waiting CI build and test labels May 19, 2021
@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-work-in-progress PR is still work in progress and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels May 21, 2021
@ptrendx
Member Author

ptrendx commented May 24, 2021

@mxnet-bot run ci [centos-gpu, unix-cpu]

@mxnet-bot

Jenkins CI successfully triggered : [unix-cpu, centos-gpu]

@mseth10 mseth10 added pr-awaiting-testing PR is reviewed and waiting CI build and test pr-awaiting-review PR is waiting for code review and removed pr-work-in-progress PR is still work in progress pr-awaiting-testing PR is reviewed and waiting CI build and test labels May 24, 2021
Contributor

@DickJC123 DickJC123 left a comment


My questions have been answered previously to my satisfaction.

As pointed out by the author, this PR is a continuation of the work started in #18622, which has seen ample use by the community without issue. I feel the benefits of a smaller libmxnet.so and a smaller global-memory footprint outweigh the penalty of slower kernel execution on first use. Our ever-growing body of kernels is more maintainable with this RTC framework, and perf-enhancing fusions become possible.

LGTM.

@DickJC123 DickJC123 merged commit 57d0ace into apache:master May 25, 2021