
[RELAY] Add primal gradients for Relay operators. #2562

Closed
9 of 86 tasks
jroesch opened this issue Feb 4, 2019 · 14 comments
@jroesch (Member) commented Feb 4, 2019

Relay's automatic differentiation is still missing primal gradients. It would be interesting to integrate with the tensor-level AD at some point, but for the time being we should focus on adding primal gradients. I will open a PR adding the basic set, but we should work towards completion for all Relay operators. Help from those with expertise on the less straightforward gradient computations would be appreciated.

The gradients should be implemented in C++ and come with tests; see below for the complete list.

Level 1

  • tvm.relay.log
  • tvm.relay.sqrt
  • tvm.relay.exp
  • tvm.relay.sigmoid
  • tvm.relay.add
  • tvm.relay.subtract
  • tvm.relay.multiply
  • tvm.relay.divide
  • tvm.relay.mod
  • tvm.relay.tanh
  • tvm.relay.concatenate
  • tvm.relay.expand_dims
  • tvm.relay.nn.softmax
  • tvm.relay.nn.log_softmax
  • tvm.relay.nn.relu
  • tvm.relay.nn.dropout
  • tvm.relay.nn.batch_norm
  • tvm.relay.nn.bias_add

Level 2

  • tvm.relay.nn.conv2d
  • tvm.relay.nn.conv2d_transpose
  • tvm.relay.nn.dense
  • tvm.relay.nn.max_pool2d
  • tvm.relay.nn.avg_pool2d
  • tvm.relay.nn.global_max_pool2d
  • tvm.relay.nn.global_avg_pool2d
  • tvm.relay.nn.upsampling
  • tvm.relay.nn.batch_flatten
  • tvm.relay.nn.pad
  • tvm.relay.nn.lrn
  • tvm.relay.nn.l2_normalize
  • tvm.relay.nn.contrib_conv2d_winograd_without_weight_transform
  • tvm.relay.nn.contrib_conv2d_winograd_weight_transform

Level 3

  • tvm.relay.nn.leaky_relu
  • tvm.relay.nn.prelu
  • tvm.relay.reshape
  • tvm.relay.reshape_like
  • tvm.relay.copy
  • tvm.relay.transpose
  • tvm.relay.squeeze
  • tvm.relay.floor
  • tvm.relay.ceil
  • tvm.relay.trunc
  • tvm.relay.clip
  • tvm.relay.round
  • tvm.relay.abs
  • tvm.relay.negative
  • tvm.relay.take
  • tvm.relay.zeros
  • tvm.relay.zeros_like
  • tvm.relay.ones
  • tvm.relay.ones_like
  • tvm.relay.full
  • tvm.relay.full_like
  • tvm.relay.cast
  • tvm.relay.split

Level 4

  • tvm.relay.right_shift
  • tvm.relay.left_shift
  • tvm.relay.equal
  • tvm.relay.not_equal
  • tvm.relay.greater
  • tvm.relay.greater_equal
  • tvm.relay.less
  • tvm.relay.less_equal
  • tvm.relay.maximum
  • tvm.relay.minimum
  • tvm.relay.power
  • tvm.relay.where
  • tvm.relay.argmax
  • tvm.relay.argmin
  • tvm.relay.sum
  • tvm.relay.max
  • tvm.relay.min
  • tvm.relay.mean
  • tvm.relay.prod
  • tvm.relay.strided_slice
  • tvm.relay.broadcast_to

Level 5

  • tvm.relay.image.resize
  • tvm.relay.vision.multibox_prior
  • tvm.relay.vision.multibox_transform_loc
  • tvm.relay.vision.nms

Level 10

  • tvm.relay.broadcast_to_like
  • tvm.relay.collapse_sum_like
  • tvm.relay.slice_like
  • tvm.relay.layout_transform
  • tvm.relay.device_copy
  • tvm.relay.annotation.on_device
@masahi (Member) commented Feb 5, 2019

Does this mean we need to write all the gradient ops in TOPI (conv2d_grad, etc.)? That would be a major undertaking.

@reminisce (Contributor) commented:

To ease the work of implementing so many gradient expressions, I think we can take advantage of PR #2498 for simple operators and attach appropriate schedules. For complicated operators such as convolution, we will probably need to implement the gradient expressions manually.

@sergei-mironov (Contributor) commented:

We think that a portion of the above operations may indeed be handled by #2498. We will test tensor-level AD for compatibility with the listed operations and publish the results. Meanwhile, we are working on integrating AD with Relay. We plan to provide a layer similar in spirit to our NNVM draft https://github.com/sgrechanik-h/tvm/blob/87d6f319f74360b9dfd0578b68214d1309b208fe/nnvm/src/top/tensor/gradient.cc .

@ajtulloch (Contributor) commented:

@jroesch given how many of these are simple elementwise ops (log, etc.) or reductions (broadcast, etc.), would it be possible for you (or someone familiar with how you want this work done) to first implement one of them as a template, i.e. showing the desired code location (alongside the op or in a separate file?), the primal gradient registration, and both direct and gradient-checking unit tests? That would let others efficiently use it as a model for the similar work.
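
For readers looking for a starting point, here is a minimal sketch of what such a Python-side template might look like, using the register_gradient hook that appears later in this thread; the import path and the log example are illustrative assumptions, not the committed implementation:

from tvm.relay.op import register_gradient  # assumed import path for the registration hook


@register_gradient("log")
def log_grad(orig, grad):
    """d/dx log(x) = 1/x, so the adjoint flowing to x is grad / x."""
    x = orig.args[0]
    return [grad / x]

A matching unit test would then compare the output of Relay's AD pass against a finite-difference estimate (see the numeric-check sketch later in this thread).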

@jroesch (Member, Author) commented Feb 6, 2019

@ajtulloch yes, there are a few basic ones committed to the repo; I will try to open a PR with multiple examples from level 1 this week. I've been busy prototyping other Relay features for training and execution, which I hope to RFC in the coming weeks.

@reminisce @grwlf I think it would be great if we could get default behavior for Relay, and if a generated gradient's performance isn't sufficient we can hand-implement it. @tqchen what do you think about this approach?

@sergei-mironov (Contributor) commented Feb 8, 2019

@jroesch , dear all. We made a quick check of AD-Relay compatibility: For every relay operation from the above list, we (a) Look at its FTVMCompute attribute (b) determine which TOPI function corresponds to it and (c) Compare the gradients of this function calculated by AD with their numerical estimations. The results are in the table below.
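
To make step (c) concrete, here is a rough NumPy sketch of the kind of central-difference comparison involved; the AD-computed gradient is stubbed out with its analytic value, since the actual harness ("the following test" linked below) drives the tensor-level AD from #2498:

import numpy as np

def numeric_grad(f, x, eps=1e-3):
    """Central-difference estimate of df/dx for a scalar-valued f."""
    x = x.astype("float64")  # copy, so the caller's array is untouched
    g = np.zeros_like(x)
    it = np.nditer(x, flags=["multi_index"])
    while not it.finished:
        i = it.multi_index
        orig = x[i]
        x[i] = orig + eps
        fp = f(x)
        x[i] = orig - eps
        fm = f(x)
        x[i] = orig
        g[i] = (fp - fm) / (2 * eps)
        it.iternext()
    return g

# Example: the gradient of sum(exp(x)) is exp(x); an AD result should match the estimate.
x = np.random.uniform(0.1, 1.0, size=(3, 4)).astype("float32")
ad_grad = np.exp(x)  # placeholder for the gradient produced by tensor-level AD
np.testing.assert_allclose(ad_grad, numeric_grad(lambda a: np.sum(np.exp(a)), x), rtol=1e-2, atol=1e-2)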

Additional notes:

  • The numerical check in this test may need adjustment; we saw rare random failures due to precision problems.
  • Some functions dispatch to different implementations depending on the parameters passed. We attempted to include the most common cases, but some combinations may be missing.
  • Checking the performance of all operations would require additional effort; we do not do it here.
  • For cases marked with an 'Integer gradients' comment: we need to clarify the gradient semantics for such operations. One possible solution is to simply return zeros, but we think that may be incorrect for some tasks.
  • To reproduce, apply [TVM] Automatic differentiation for tensor expressions #2498 to the 427bdcc26 commit of TVM and use the following test.

PS: We are thinking about writing a TVM Python codegen to pretty-print TVM IR code. Is anybody working on this?

Legend:

  • green Supported, numerical check passed
  • yellow Missing by accident/easy to add
  • orange Need to think first
  • red Need to debug
  • grey Unable to check
Status Name Comment
Level 1
orange tvm.relay.log Currently we do not assert on negative inputs, which may be incorrect
yellow tvm.relay.sqrt Missing by accident, easy to fix
green tvm.relay.exp
green tvm.relay.sigmoid
green tvm.relay.add
green tvm.relay.subtract
green tvm.relay.multiply
green tvm.relay.divide
orange tvm.relay.mod 🔢 Integer gradients
green tvm.relay.tanh
green tvm.relay.concatenate
green tvm.relay.expand_dims
green tvm.relay.softmax
green tvm.relay.log_softmax
green tvm.relay.relu
grey tvm.relay.dropout 💻 Missing FTVMCompute attribute
grey tvm.relay.batch_norm 💻 Missing FTVMCompute attribute
green tvm.relay.bias_add
Level 2
green tvm.relay.conv2d
green tvm.relay.conv2d_transpose
green tvm.relay.dense
green tvm.relay.max_pool
green tvm.relay.avg_pool
green tvm.relay.global_max_pool
green tvm.relay.global_avg_pool
green tvm.relay.upsampling
green tvm.relay.flatten
green tvm.relay.pad
yellow tvm.relay.lrn Blocked by missing pow intrinsic
yellow tvm.relay.l2_normalize Blocked by missing sqrt intrinsic
grey tvm.relay.conv2d_winograd_without_weight_transform Missing TOPI implementation
green tvm.relay.conv2d_winograd_weight_transform
Level 3
green tvm.relay.leaky_relu
green tvm.relay.prelu
green tvm.relay.reshape
green tvm.relay.reshape_like
green tvm.relay.copy_identity
green tvm.relay.transpose
green tvm.relay.squeeze
orange tvm.relay.floor 🔢 Integer gradients
orange tvm.relay.ceil 🔢 Integer gradients
orange tvm.relay.trunc 🔢 Integer gradients
red tvm.relay.clip Missing Not operation
orange tvm.relay.round 🔢 Integer gradients
green tvm.relay.abs
green tvm.relay.negative
green tvm.relay.take
green tvm.relay.zeros
green tvm.relay.zeros_like
green tvm.relay.ones
green tvm.relay.ones_like
green tvm.relay.full
green tvm.relay.full_like
grey tvm.relay.cast Currently, differentiate returns zeros for non-float32 inputs
green tvm.relay.split
Level 4
orange tvm.relay.right_shift 🔢 Integer gradients
orange tvm.relay.left_shift 🔢 Integer gradients
orange tvm.relay.equal 🔢 Integer gradients
orange tvm.relay.not_equal 🔢 Integer gradients
orange tvm.relay.greater 🔢 Integer gradients
orange tvm.relay.greater_equal 🔢 Integer gradients
orange tvm.relay.less 🔢 Integer gradients
orange tvm.relay.less_equal 🔢 Integer gradients
green tvm.relay.maximum
green tvm.relay.minimum
yellow tvm.relay.power Missing by accident, should be easy to fix.
grey tvm.relay.where 🐍 Missing Python API
red tvm.relay.argmax
red tvm.relay.argmin
green tvm.relay.sum
green tvm.relay.max
green tvm.relay.min
green tvm.relay.mean
green tvm.relay.prod
green tvm.relay.strided_slice
green tvm.relay.broadcast_to
Level 5
orange tvm.relay.resize Blocked by missing floor intrinsic
red tvm.relay.multibox_prior
red tvm.relay.multibox_transform_loc
red tvm.relay.nms
Level 10
green tvm.relay.broadcast_to_like
grey tvm.relay.collapse_sum_like 🐍 Missing Python API
green tvm.relay.slice_like
grey tvm.relay.layout_transform 🐍 Missing Python API
grey tvm.relay.device_copy 💻 Missing FTVMCompute attribute
grey tvm.relay.on_device 💻 Missing FTVMCompute attribute

@tqchen (Member) commented Feb 8, 2019

While it is great to have tensor expression gradient support, I recommend we provide the primal gradients in the form of Relay operators at this moment.

The main reason is that a Relay -> Relay transformation makes it easier to do follow-up analysis and transformations in Relay; it also makes sure that each op can easily generate different variants (Winograd, spatial pack for conv2d).

This does not eliminate the value of expression-level gradients, though, as they could be a nice complement when a user defines a custom op, and a topic of research in the long run if integrated properly with Relay.

@sergei-mironov (Contributor) commented:

Expressing gradients in Relay would be a good design test. My thoughts regarding this design choice are as follows:

  • I am not sure that all listed operations currently have gradients that can be expressed in the Relay language. Ideally, we should move towards a list of basic operations that form a closed set, in the sense that their gradients are expressible in terms of those same operations.
  • Expressing gradients in Relay may also foster Relay's C++ API.
  • As an option, one may implement an operation in Relay itself (in addition to providing its FTVMCompute attribute). This way it becomes subject to Relay's differentiation engine; dense and softmax are possible candidates for this approach (see the sketch below).
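
As an illustration of that last option (assuming only public Relay primitives; this is a sketch for intuition, not an actual re-registration of the op), softmax could be composed from ops that already have, or will get, gradients:

from tvm import relay

def softmax_from_primitives(x, axis=-1):
    """Numerically stabilized softmax built out of existing Relay primitives,
    so a Relay-level AD pass can differentiate it without a hand-written gradient."""
    shifted = x - relay.max(x, axis=axis, keepdims=True)
    e = relay.exp(shifted)
    return e / relay.sum(e, axis=axis, keepdims=True)

# Usage: wrap it in a function; Relay's AD pass could then be applied to `func`.
x = relay.var("x", shape=(4, 10), dtype="float32")
func = relay.Function([x], softmax_from_primitives(x))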

@sgrechanik-h (Contributor) commented:

I've updated the tensor expression AD PR with a Relay integration, here. The commit itself is here.

@ZihengJiang (Contributor) commented:

I am working on adding gradient definitions for some level 1/2 operators; see #2633 for details.

@SWu (Contributor) commented May 30, 2019

I'm interested in helping contribute gradient implementations, but I'm finding it a bit difficult to understand what orientation the original op arguments are in and what role collapse_sum_like plays (its documentation, "Return a scalar value array with the same shape and type as the input array.", is identical to broadcast_to_like's, and I'm not really understanding the mathematical operation it performs).

As an example, by trial and error I arrived at the following for nn.dense:

@register_gradient("nn.dense")
def dense_grad(orig, grad):
    data, weight = orig.args
    return [collapse_sum_like(transpose(transpose(weight) * grad), data),
            collapse_sum_like(transpose(grad * transpose(data)), weight)]

I'm verifying this by checking gradient values numerically against a toy TensorFlow model with a dense layer that I converted. I would not have expected to need the outer transpose here, but without it, collapse_sum_like seemed to broadcast the sum along the wrong axis.

Would it be possible to provide a more detailed tutorial on how to translate a known mathematical form of a gradient into a Relay implementation, to make it easier for the community to contribute some of these implementations?
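
Until such a tutorial exists, here is one hedged sketch of how the textbook form could map onto Relay ops: with Y = X W^T (as nn.dense computes) and G = dL/dY, the math gives dL/dX = G W and dL/dW = G^T X, which can be written with dense/transpose instead of elementwise multiplies. This is an illustration under those assumptions, not the implementation that eventually landed:

from tvm import relay

# `data`, `weight`, `grad` play the same roles as in the snippet above:
# data is (batch, in), weight is (units, in), grad is (batch, units).
def dense_grads(data, weight, grad):
    # dL/dX = G @ W, expressed via nn.dense(a, b) = a @ b^T
    d_data = relay.nn.dense(grad, relay.transpose(weight))
    # dL/dW = G^T @ X
    d_weight = relay.nn.dense(relay.transpose(grad), relay.transpose(data))
    # collapse_sum_like is only needed to fold away broadcast axes, if any
    return [relay.collapse_sum_like(d_data, data),
            relay.collapse_sum_like(d_weight, weight)]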

@jroesch (Member, Author) commented May 31, 2019

@altanh could you maybe further improve the docs when you open your PR and try to address some of @SWu's comments?

Altan has revived this work in the past few weeks and we have been working on a library for using Relay for training; he will hopefully follow up on this thread with more details.

@altanh (Contributor) commented Jun 1, 2019

@SWu this is an issue that I've run into as well. I believe the specific documentation issue you hit is indeed a copy-paste error, which we should fix. Overall, though, the documentation is lacking, as @jroesch said, and we (who implement more grads) should definitely update it with better descriptions as we work through them.

For collapse_sum_like, I dug into the TOPI code, and it looks like the general idea is to match up tensor dimensions (starting from the last dimension of both) and reduce them (using sum) until the shape matches the target shape. If two dims don't match, the input dim is reduced and squeezed, and we continue trying to match. If they are equal, nothing is done. If the target dimension is 1, the input dimension is reduced down to 1.

For example, if A.shape() = (4,5,3) and B.shape() = (5, 1), then collapse_sum_like(A, B) will reduce the 3rd dim of A to 1 (i.e. keepdims=True), not reduce the 2nd dimension, and then reduce and squeeze (i.e. keepdims=False) the 1st dimension. It's unclear to me how this will work for 'mismatched' shapes like (4,4,4) and (3,2), since the input will just be completely squeezed (and from what I can tell, there's no error check for this, so maybe this is correct behavior that I don't understand).
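
For intuition, here is a rough NumPy reference model of collapse_sum_like under the "adjoint of broadcasting" interpretation described above; it reproduces the (4,5,3) vs (5,1) example but makes no claim about how TOPI handles genuinely mismatched shapes:

import numpy as np

def collapse_sum_like_ref(a, like_shape):
    """Sum-reduce `a` so the result has shape `like_shape`; conceptually the
    adjoint of broadcasting an array of shape `like_shape` up to `a.shape`."""
    like_shape = tuple(like_shape)
    # Sum away leading axes that the target shape does not have at all.
    while a.ndim > len(like_shape):
        a = a.sum(axis=0)
    # For the remaining axes, sum with keepdims wherever the target dim is 1.
    for axis, (da, dl) in enumerate(zip(a.shape, like_shape)):
        if dl == 1 and da != 1:
            a = a.sum(axis=axis, keepdims=True)
    return a

A = np.arange(4 * 5 * 3).reshape(4, 5, 3).astype("float32")
B = np.zeros((5, 1), dtype="float32")
print(collapse_sum_like_ref(A, B.shape).shape)  # -> (5, 1), as in the example above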

We also need to think about the best way to verify correctness of these implementations, since currently the numerical tests in TVM are somewhat arbitrary. Your approach seems solid for ensuring correct behavior with respect to existing frameworks. This problem is more general than just for gradients though, and I think we should have a TVM-wide discussion.

As for your last point, I think this would be a good idea. I'll try to type up a tutorial of sorts walking through my implementation of softmax once I'm done with my current work. I don't want to write too much more here (and maybe this is already too much), but hopefully this helped. I'll make a more comprehensive post once the PR is ready.

@tqchen (Member) commented Sep 3, 2020

Closing for now due to inactivity; let us open a new thread for new gradient TODOs.

tqchen closed this as completed Sep 3, 2020