
Infra for tvm op runtime dispatch #16100

Merged · 10 commits · Oct 28, 2019

Conversation

@hzfan (Contributor) commented Sep 5, 2019

Description

This PR implements infrastructure that lets users dispatch the execution of a tvm operator to different schedules according to the runtime input shapes, which helps with acceleration.

A gemm example can be found in

  • Kernel definition: contrib/tvmop/core/multiarray.py
  • Operator registry and dispatch: src/operator/contrib/tvmop/dot.cc
  • Benchmark: benchmark/python/tvmop/benchmark_tvmop.py

The following are some experimental results for matrix multiplication between two n × n matrices. Note that the benchmark results cannot be reproduced until this PR gets merged.

n    | After Dispatch (ms) | Before Dispatch (ms)
1024 | 177                 | 482
1056 | 190                 | 366
1088 | 200                 | 424

The example schedule is roughly equivalent to the Blocking optimization. More optimizations (such as vectorization, loop permutation, array packing, write caching for blocks, and parallelization) can be applied for further acceleration.
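To make the Blocking reference concrete, here is a minimal, self-contained sketch of a blocked GEMM schedule. This is not the code in this PR; it assumes a TVM version with the te API where tvm.build accepts a te schedule, and bn/factor are illustrative values.

import tvm
from tvm import te

def blocked_gemm(n, bn=32, factor=4):
    # C = A x B, where all matrices are n x n.
    A = te.placeholder((n, n), name="A")
    B = te.placeholder((n, n), name="B")
    k = te.reduce_axis((0, n), name="k")
    C = te.compute((n, n), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")

    s = te.create_schedule(C.op)
    # Blocking: tile the output into bn x bn blocks and split the reduction
    # axis, so each block works on data that stays in cache.
    xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn)
    ko, ki = s[C].split(k, factor=factor)
    s[C].reorder(xo, yo, ko, ki, xi, yi)
    return tvm.build(s, [A, B, C], target="llvm")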

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
      • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
      • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
      • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
      • For user-facing API changes, the API doc string has been updated.
      • For new C++ functions in header files, their functionality and arguments are documented.
      • For new examples, a README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
      • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Add dispatch infra
  • Add an example dot operator

Comments

@ZhennanQin (Contributor)

Just curious: as far as I know, the tvm op kernels are pre-compiled and then linked into MXNet. How can they be configured according to the runtime input shapes?

@hzfan (Contributor, Author) commented Sep 6, 2019

> Just curious: as far as I know, the tvm op kernels are pre-compiled and then linked into MXNet. How can they be configured according to the runtime input shapes?

Yes, the kernels are pre-compiled. At compile time, several different schedules (kernels) are defined and compiled for a single op. At runtime, the most suitable kernel is chosen based on the runtime input shape.

In other words, although each kernel is pre-compiled, multiple kernels are available for a single op, so we can choose the most efficient one given the runtime input shape.

@ZhennanQin (Contributor)

@hzfan Thanks for the explanation. My next question is: how do we know which schedule is the best one for a certain input shape? Is it defined by static rules or tuned at runtime?

@hzfan (Contributor, Author) commented Sep 9, 2019

> @hzfan Thanks for the explanation. My next question is: how do we know which schedule is the best one for a certain input shape? Is it defined by static rules or tuned at runtime?

That's a good question. We have actually considered both options, and for now we use simple static rules. To be more specific, I currently require the size of a for-loop to be a multiple of its splitting factor (if the for-loop is split). This eliminates an if-condition inside the loop and thus makes the kernel faster.
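As an illustration, here is a minimal sketch of such a static rule. The names below are hypothetical and not the actual helpers in this PR.

def select_kernel(shape, factor=4):
    # The specialized kernel assumes every dimension is a multiple of the
    # split factor, so the loop split produces no remainder and the kernel
    # needs no if-condition for a tail block.
    if all(dim % factor == 0 for dim in shape):
        return "dot_specialized"
    # Otherwise fall back to a generic kernel that handles arbitrary sizes.
    return "dot_fallback"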

Runtime tuning has also been considered, but it is not implemented in this version. The idea is to try all the available schedules for every runtime shape, measure their performance, and cache the best choice, which is quite similar to autotvm (a rough sketch follows the list below).

  • Pros: the choice is optimal
  • Cons: the first run with a shape that has not been encountered before will be slow
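A minimal sketch of that measure-and-cache idea (not part of this PR; all names here are hypothetical, and shapes are assumed to be passed as tuples so they can be used as dictionary keys):

import time

_best_kernel = {}  # cache: input shape -> fastest kernel seen so far

def dispatch(shape, candidates, run):
    if shape not in _best_kernel:
        timings = {}
        for kernel in candidates:
            start = time.perf_counter()
            run(kernel, shape)  # execute this candidate once and time it
            timings[kernel] = time.perf_counter() - start
        _best_kernel[shape] = min(timings, key=timings.get)
    return _best_kernel[shape]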

@yzhliu (Member) commented Sep 17, 2019

Also cc @icemelon9 @kevinthesun

return diff / repeat


def test_tvm_dot():
Review comment (Member):

who uses this? for testing only?

@hzfan (Contributor, Author):

Yes, for reproducing the benchmark results. The other code under benchmark/ serves only this purpose as well.

conf_path = [p for p in candidates_path if os.path.exists(p) and os.path.isfile(p)]
if len(conf_path) == 0:
    raise RuntimeError('Cannot find the TVM op config.\n' +
                       'List of candidates:\n' + str('\n'.join(candidates_path)))
Review comment (Member):

Can we fall back to the default behavior if the config file is missing?

@hzfan (Contributor, Author) commented Sep 18, 2019

Yes, that would just take a little more code, I think.

In which case would the config file be missing? It is generated at compile time (even if no tunable parameters are needed, a nearly empty config is still generated).
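If such a fallback were added, it could look roughly like the following. This is a hypothetical sketch continuing from the excerpt above; load_config and the empty-config default are assumptions, not part of this PR.

conf_path = [p for p in candidates_path if os.path.isfile(p)]
if conf_path:
    config = load_config(conf_path[0])  # use the generated TVM op config
else:
    config = {}  # config missing: fall back to the default schedules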

contrib/tvmop/compile.py — outdated review thread, resolved
def dot(dtype, fallback):
    cfg = autotvm.get_config()
    cfg.define_knob("bn", [64] if fallback else [64, 32])
    cfg.define_knob("factor", [4] if fallback else [4])
Review comment (Contributor):

Seems it's always [4] no matter what fallback is?

@hzfan (Contributor, Author):

Yes. The difference is that when fallback is false, the shape comes with a hint indicating that it is a multiple of 4.

This factor means that in any case I want to split the loop by a factor of 4. With fallback there is no guarantee that the loop size is a multiple of 4, while without fallback there is.

Review comment (Contributor):

I understand what you are trying to get at here. My point is that this line is equivalent to the following, correct?

cfg.define_knob("factor", [4])

@hzfan (Contributor, Author):

Yes.

src/operator/tvmop/op_module.h — outdated review thread, resolved
from collections import OrderedDict
import numpy as _np

class OtherOptionSpace(object):
Review comment (Contributor):

Can we call this GeneralOptionSpace? Same for other places: other -> general.

@hzfan (Contributor, Author):

Actually, OtherOptionSpace comes from tvm/python/tvm/autotvm/task/space.py. Besides OtherOptionSpace there are SplitSpace, ReorderSpace and AnnotateSpace. The other three spaces may be needed in the future, so I keep the name consistent with tvm.

Review comment (Contributor):

Fair enough.

if op.dispatch is True:
    config_space = autotvm.ConfigSpace()
    with autotvm.task.ApplyConfig(config_space):
        sch, args = op.func(fallback=False, **each_kwargs)
Review comment (Contributor):

This requires fallback to be a mandatory parameter of op.func, which is not ideal in terms of usability, in my opinion. We should support compiling whatever users define and treat the fallback knob as an advanced feature for performance tuning.

One way to achieve this is to inspect the signature of op.func for the keyword fallback. If the keyword does not exist, we just compile the op using the default schedule, e.g.
if 'fallback' in str(inspect.signature(op.func)):
    sch, args = op.func(fallback=False, **each_kwargs)
else:
    sch, args = op.func(**each_kwargs)

@yzhliu What do you think?

@hzfan (Contributor, Author):

Done. I set self.dispatchable = 'fallback' in inspect.signature(self.func).parameters in opdef.py.
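For reference, a minimal sketch of what that check might look like; the class and constructor here are illustrative, not the exact code in opdef.py.

import inspect

class OpDef(object):
    def __init__(self, func, name, dispatch=False):
        self.func = func
        self.name = name
        self.dispatch = dispatch
        # The op is dispatchable only if its kernel definition accepts a
        # `fallback` keyword argument.
        self.dispatchable = 'fallback' in inspect.signature(func).parameters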

@hzfan force-pushed the autotvm_pr branch 2 times, most recently from 2fe9f81 to 587812d, on October 26, 2019.
@yzhliu (Member) left a comment:

LGTM

@reminisce merged commit 9322864 into apache:master on Oct 28, 2019.
yajiedesign pushed a commit to yajiedesign/mxnet that referenced this pull request Nov 6, 2019
* infra for dispatch tvm op

* fix ci and sanity error

* disable shape with hint and fix coding style

* rename to avoid conflict with original dot

* update tvm and use soft link

* config file moves to lib/ when using Makefile

* add tvmop.conf to ci

* fix rebase

* fix rebase

* use inspect to detect dispatchable func
@eric-haibin-lin (Member):

Do we have a developer guide for using tvm op?

@hzfan (Contributor, Author) commented Dec 23, 2019

> Do we have a developer guide for using tvm op?

It seems @yzhliu is working on one.
