
[QNN][Legalize] Specialize for Platforms w/o fast Int8 support #4307

Merged: 1 commit merged into apache:master on Nov 13, 2019

Conversation

anijain2305 (Contributor) commented on Nov 11, 2019:

More details at https://discuss.tvm.ai/t/qnn-conv2d-dense-legalize-for-platforms-with-no-fast-int8-units/4698

QNN op lowering is currently optimized for HW platforms that have fast Int8 instructions. This PR adds a different lowering for platforms without any fast Int8 units, which helps Raspberry Pi and older Intel servers.

@jackwish @FrozenGene @yzhliu @tqchen @ajtulloch
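As a rough sketch of the mechanism (the decorator and is_fast_int8_hw_present appear in the diff below; the two helper names are illustrative assumptions, not necessarily the PR's exact ones):

```python
@qnn_conv2d_legalize.register('arm_cpu')
def _qnn_conv2d_legalize_arm_cpu(attrs, inputs, types):
    # ARM prefers both operands in the same dtype.
    if is_fast_int8_hw_present():
        # Fast int8 units (v8.2a + dotprod): keep the int8-friendly lowering.
        return helper_change_dtypes_to_be_same(attrs, inputs, types)  # assumed helper
    # No fast int8 units: upcast operands (e.g. to int16) so the fallback
    # schedules avoid slow int8 arithmetic.
    return helper_no_fast_int8_hw_legalization(attrs, inputs, types)  # assumed helper
```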

anijain2305 force-pushed the qnn_arm_int8 branch 2 times, most recently from 7025a08 to 5014e6f on November 11, 2019 at 20:42.
```python
@qnn_conv2d_legalize.register('cpu')
def _qnn_conv2d_legalize_intel_cpu(attrs, inputs, types):
    # The VNNI transformations prefer uint8 x int8 datatypes.
    if is_fast_int8_hw_present():
```
Contributor:
As we are already on Intel CPU here, I think the HW feature check can target Intel CPU directly.

Contributor Author:
This function is used twice, for conv2d and dense, even for Intel CPU, so I decided to factor it out into a function. I think this is ok: we get one place where we filter out the targets that have fast int8 HW.
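A minimal sketch of such a shared predicate, assuming the TVM 0.6-era target API (tvm.target.current_target, target.options, both of which appear elsewhere in this thread); the exact -mcpu/-mattr strings are illustrative, not necessarily the PR's exact list:

```python
import tvm

def is_fast_int8_hw_present():
    """Single place to decide whether the current target has fast int8 HW."""
    target = tvm.target.current_target(allow_none=False)
    opts = ' '.join(target.options)
    # Intel servers with VNNI/AVX512 int8 instructions (illustrative check).
    intel_fast_int8 = 'skylake-avx512' in opts or 'cascadelake' in opts
    # ARM v8.2a CPUs with the dot-product extension (illustrative check).
    arm_fast_int8 = '+v8.2a,+dotprod' in opts
    return intel_fast_int8 or arm_fast_int8
```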

Contributor:
Thanks for the reply. Yes, I have seen it being used by both dense and conv2d. What I mean is that we could split it per target rather than merging ARM and x86 into one. It feels a bit odd to run target-dependent logic when we already know the target, even though the code here guarantees correctness... What do you say?

```python
@qnn_conv2d_legalize.register('arm_cpu')
def _qnn_conv2d_legalize_arm_cpu(attrs, inputs, types):
    # ARM prefers the dtypes to be same.
    if is_fast_int8_hw_present():
```
Contributor:
Similar to Intel CPU.

anijain2305 (Contributor Author):

@jackwish Let me know if you have more comments.

@zhiics Can you please review as well?

```python
    new_attrs['input_zero_point'] = input_zp
    return relay_op(data, kernel, **new_attrs)

def is_fast_int8_hw_present():
```
Member:
Maybe we could break this into isolated functions for Intel and ARM, which would make the code cleaner. For example, if we have PowerPC support in the future, I would like to have one isolated function such as ppc_int8_hw_support. However, the current way is acceptable too.

zhenhuaw-me (Contributor) left a comment:
Thanks for the ping @anijain2305, several minor comments :)

I didn't check the tests in detail, but I wonder whether they are strong enough?

```python
is_present_arm = False
for opt in target.options:
    if arm_supported_attr in opt:
        is_present_arm = True
```
Contributor:
A break here may help :)

Or rewrite this to be something like:

```python
is_present_arm = '+v8.2a,+dotprod' in ' '.join(target.options)
```
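Spelled out, the reviewer's first variant (a sketch; arm_supported_attr is assumed to hold the '+v8.2a,+dotprod' attribute string, as in the diff above):

```python
is_present_arm = False
for opt in target.options:
    if arm_supported_attr in opt:
        is_present_arm = True
        break  # stop scanning once the attribute is found

# Equivalent single membership test over the joined option strings:
# is_present_arm = arm_supported_attr in ' '.join(target.options)
```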

```python
    new_attrs['input_zero_point'] = input_zp
    return relay_op(data, kernel, **new_attrs)

def is_fast_int8_hw_present():
```
Contributor:
In https://github.com/apache/incubator-tvm/pull/4307/files#r345009110, I meant rewriting this function into something like is_fast_int8_on_arm and is_fast_int8_on_x86. Or maybe

```python
def is_fast_int8_hw_present(key):
    if key == 'arm':
        pass  # ARM-specific check
    elif key == 'x86':
        pass  # x86-specific check
    else:
        pass  # fall through
```

to unify the check.

Comment on lines 214 to 215:

```python
assert 'int8' in data_dtype and 'int8' in kernel_dtype, \
    "Qnn Conv2D only accepts uint8 or int8 inputs"
```
Contributor:
Is this assertion consistent with its description?
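For context on the question: the substring check does admit both dtypes, because 'int8' is a substring of 'uint8':

```python
assert 'int8' in 'uint8'  # True, so uint8 passes
assert 'int8' in 'int8'   # True, so int8 passes
# Caveat: any dtype string containing 'int8' would also pass this check,
# which may be what the reviewer is probing at.
```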

"""

def _shift(data, out_dtype):
"""Shifts (add/subtracts) the qnn tensor with +/-128)"""
Contributor:
I guess this is add or subtract with 128 :)
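A sketch of what _shift plausibly does, consistent with the snippet above but not necessarily the PR's exact code: flip signedness by adding or subtracting 128 in a wider dtype, then cast to the requested dtype.

```python
from tvm import relay

def _shift(data, out_dtype):
    """Shifts (adds/subtracts) the qnn tensor by +/-128 to flip signedness."""
    if out_dtype == 'uint8':
        shift = 128    # int8 -> uint8: add 128
    elif out_dtype == 'int8':
        shift = -128   # uint8 -> int8: subtract 128
    else:
        raise RuntimeError("Unsupported out dtype.")
    # Do the arithmetic in int32 to avoid overflow, then narrow back down.
    shifted = relay.cast(data, 'int32')
    shifted = relay.add(shifted, relay.const(shift, 'int32'))
    return relay.cast(shifted, out_dtype)
```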

python/tvm/relay/qnn/op/legalizations.py: a resolved review comment is hidden.
Comment on lines 218 to 225:

```python
    input_zp = attrs['input_zero_point']
    data = _shift(data, kernel_dtype)
    if data_dtype == 'int8':
        input_zp = input_zp + 128
    elif data_dtype == 'uint8':
        input_zp = input_zp - 128
    else:
        raise RuntimeError("Qnn Conv2D only accepts uint8 or int8 inputs")
```
Contributor:
What about rewriting _shift() such that it also gets the zero point shifting done?
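A sketch of the suggested refactor (hypothetical signature, not the PR's final code): _shift returns both the shifted tensor and the adjusted zero point, so callers don't repeat the +/-128 bookkeeping.

```python
from tvm import relay

def _shift(data, zero_point, out_dtype):
    """Flip signedness and adjust the zero point by the same +/-128 shift."""
    if out_dtype == 'uint8':
        shift = 128
    elif out_dtype == 'int8':
        shift = -128
    else:
        raise RuntimeError("Unsupported out dtype.")
    shifted = relay.cast(data, 'int32')
    shifted = relay.add(shifted, relay.const(shift, 'int32'))
    return relay.cast(shifted, out_dtype), zero_point + shift

# Caller side then collapses to something like:
# data, input_zp = _shift(data, attrs['input_zero_point'], kernel_dtype)
```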

anijain2305 (Contributor Author):

Addressed the comments :)

zhenhuaw-me (Contributor) left a comment:
LGTM. Thank you!

zhiics (Member) left a comment:
Overall LGTM.

I will merge since everyone has approved.

zhiics merged commit 3486e2c into apache:master on Nov 13, 2019.
ajtulloch (Contributor):
Great stuff.

zxy844288792 pushed a commit to zxy844288792/tvm that referenced this pull request on Nov 15, 2019.
kevinthesun pushed a commit to neo-ai/tvm that referenced this pull request on Nov 25, 2019.