This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Wrong gradients with C API? #20293

Closed

matteosal opened this issue May 21, 2021 · 10 comments · Fixed by #20462

Comments

@matteosal
Contributor

The following program performs a forward + backward pass on the sum of two length-1 vectors (via the elemwise_add operator)

#include <cstdio>   // printf
#include <cstdlib>  // exit
#include <iostream>

#include "mxnet/c_api.h"
#include "nnvm/c_api.h"

#define checkedMXCall(func, ...)                              \
  {                                                           \
    if (func(__VA_ARGS__) != 0) {                             \
      printf("MX call %s failed at line %d:\n%s",             \
            #func, __LINE__, MXGetLastError());               \
      exit(1);                                                \
    }                                                         \
  }

int main() {

  /* Create symbol */
  const char json[] = "{\"nodes\": [{\"op\":\"null\",\"name\":\".Inputs.Input1\",\"inputs\":[]},{\"op\":\"null\",\"name\":\".Inputs.Input2\",\"inputs\":[]},{\"op\":\"elemwise_add\",\"name\":\".$0\",\"inputs\":[[0,0,0],[1,0,0]]},{\"op\":\"_copy\",\"name\":\".Outputs.Output\",\"inputs\":[[2,0,0]]}],\"arg_nodes\":[0,1],\"heads\":[[3,0,0]]}";
  SymbolHandle sym;
  checkedMXCall(MXSymbolCreateFromJSON, json, &sym);

  /* Create NDArray for arguments */
  int dev_type = 1;
  int dev_id = 0; 
  mx_uint shape[1] = {1};
  void *data;
  NDArrayHandle in_arr_1, in_arr_2;
  checkedMXCall(MXNDArrayCreate, shape, 1, dev_type, dev_id, 0, 0, &in_arr_1);
  checkedMXCall(MXNDArrayCreate, shape, 1, dev_type, dev_id, 0, 0, &in_arr_2);
  checkedMXCall(MXNDArrayGetData, in_arr_1, &data);
  /* Set values for arguments */
  *reinterpret_cast<float*>(data) = 0.4;
  checkedMXCall(MXNDArrayGetData, in_arr_2, &data);
  *reinterpret_cast<float*>(data) = 0.5;

  /* Create NDArray for gradients */
  NDArrayHandle grad_arr_1, grad_arr_2;
  checkedMXCall(MXNDArrayCreate, shape, 1, dev_type, dev_id, 0, 0, &grad_arr_1);
  checkedMXCall(MXNDArrayCreate, shape, 1, dev_type, dev_id, 0, 0, &grad_arr_2);
  /* Set values for gradients */
  checkedMXCall(MXNDArrayGetData, grad_arr_1, &data);
  *reinterpret_cast<float*>(data) = 0;
  checkedMXCall(MXNDArrayGetData, grad_arr_2, &data);
  *reinterpret_cast<float*>(data) = 0;
  /* Attach gradients to arguments */
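  /* grad_req values follow OpReqType: 0 = "null" (no gradient), 1 = "write" */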
  uint32_t grad_req_1[1] = {1};
  uint32_t grad_req_2[1] = {1}; 
  checkedMXCall(MXAutogradMarkVariables, 1, &in_arr_1, grad_req_1, &grad_arr_1);
  checkedMXCall(MXAutogradMarkVariables, 1, &in_arr_2, grad_req_2, &grad_arr_2);

  /* Create cached op */
  const char *cachedop_keys[1] = {"static_alloc"};
  const char *cachedop_vals[1] = {"true"};
  CachedOpHandle cached_op;
  checkedMXCall(MXCreateCachedOp, sym, 1, cachedop_keys, cachedop_vals, &cached_op, false);

  /* Set autograd to record & set training mode */
  int dummy_prev;
  checkedMXCall(MXAutogradSetIsRecording, 1, &dummy_prev);
  checkedMXCall(MXAutogradSetIsTraining, 1, &dummy_prev);

  /* Run forward */
  int n_outs;
  NDArrayHandle *out_arr_p = nullptr;
  const int *dummy_stypes = nullptr;
  NDArrayHandle inputs[] = {in_arr_1, in_arr_2};
  checkedMXCall(MXInvokeCachedOp, cached_op, 2, inputs, dev_type, dev_id, &n_outs, &out_arr_p, &dummy_stypes);
  checkedMXCall(MXNDArrayWaitToRead, *out_arr_p);

  /* Create NDArray for outgrad */
  NDArrayHandle outgrad_arr;
  checkedMXCall(MXNDArrayCreate, shape, 1, dev_type, dev_id, 0, 0, &outgrad_arr);
  /* Set values for outgrad */
  checkedMXCall(MXNDArrayGetData, outgrad_arr, &data);
  *reinterpret_cast<float*>(data) = 1;

  /* Run backward */
  checkedMXCall(MXAutogradBackward, 1, out_arr_p, &outgrad_arr, false);
  checkedMXCall(MXNDArrayWaitToRead, grad_arr_1);
  checkedMXCall(MXNDArrayWaitToRead, grad_arr_2);

  /* Check results */
  checkedMXCall(MXNDArrayGetData, in_arr_1, &data);
  std::cout << "INPUT 1: " << *reinterpret_cast<float*>(data) << "\n"; 
  checkedMXCall(MXNDArrayGetData, in_arr_2, &data);
  std::cout << "INPUT 2: " << *reinterpret_cast<float*>(data) << "\n"; 
  checkedMXCall(MXNDArrayGetData, *out_arr_p, &data);
  std::cout << "OUTPUT: " << *reinterpret_cast<float*>(data) << "\n"; 
  checkedMXCall(MXNDArrayGetData, outgrad_arr, &data);
  std::cout << "OUTGRAD: " << *reinterpret_cast<float*>(data) << "\n"; 
  checkedMXCall(MXNDArrayGetData, grad_arr_1, &data);
  std::cout << "GRAD 1: " << *reinterpret_cast<float*>(data) << "\n"; 
  checkedMXCall(MXNDArrayGetData, grad_arr_2, &data);
  std::cout << "GRAD 2: " << *reinterpret_cast<float*>(data) << "\n"; 

  return 0;
}

It behaves as expected, producing

INPUT 1: 0.4
INPUT 2: 0.5
OUTPUT: 0.9
OUTGRAD: 1
GRAD 1: 1
GRAD 2: 1

Switching off the gradient request for the second input, i.e. defining the grad_req arrays as

  uint32_t grad_req_1[1] = {1};
  uint32_t grad_req_2[1] = {0}; 

also behaves as expected, producing 0 for GRAD 2 (gradient arrays are initialized to 0):

INPUT 1: 0.4
INPUT 2: 0.5
OUTPUT: 0.9
OUTGRAD: 1
GRAD 1: 1
GRAD 2: 0

But switching off the first gradient with

  uint32_t grad_req_1[1] = {0};
  uint32_t grad_req_2[1] = {1}; 

seems to switch off both GRAD 1 and GRAD 2!

INPUT 1: 0.4
INPUT 2: 0.5
OUTPUT: 0.9
OUTGRAD: 1
GRAD 1: 0
GRAD 2: 0

I must be doing something wrong here, because the equivalent python example works just fine with gradient requests 0 and 1 ('null' and 'write'):

import mxnet as mx

json = "{\"nodes\": [{\"op\":\"null\",\"name\":\".Inputs.Input1\",\"inputs\":[]},{\"op\":\"null\",\"name\":\".Inputs.Input2\",\"inputs\":[]},{\"op\":\"elemwise_add\",\"name\":\".$0\",\"inputs\":[[0,0,0],[1,0,0]]},{\"op\":\"_copy\",\"name\":\".Outputs.Output\",\"inputs\":[[2,0,0]]}],\"arg_nodes\":[0,1],\"heads\":[[3,0,0]]}"
sym = mx.symbol.fromjson(json)

ex = sym._bind(
	mx.cpu(), 
	{'.Inputs.Input1': mx.nd.array([0.4]), '.Inputs.Input2': mx.nd.array([0.5])},
	args_grad={
		'.Inputs.Input1': mx.ndarray.zeros((1)), 
		'.Inputs.Input2': mx.ndarray.zeros((1))
	},
	grad_req={'.Inputs.Input1': 'null', '.Inputs.Input2': 'write'}
)
ex.forward(is_train=True)
print(ex.outputs)
ex.backward(out_grads=mx.nd.array([1]))
print(ex.grad_arrays)
[
[0.9]
<NDArray 1 @cpu(0)>]
[None, 
[1.]
<NDArray 1 @cpu(0)>]

Can anyone spot the problem with the C program with gradient requests 0 and 1?

@matteosal
Contributor Author

This seems to be a problem specific to the elemwise_add operator. Replacing it with elemwise_mul in the JSON string produces correct gradients for all request combinations.
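For reference, the only change needed in the C program above is the op name inside the symbol JSON (everything else stays the same):

  /* Same graph as before, with elemwise_mul in place of elemwise_add; with this
     symbol all grad_req combinations produce correct gradients */
  const char json[] = "{\"nodes\": [{\"op\":\"null\",\"name\":\".Inputs.Input1\",\"inputs\":[]},{\"op\":\"null\",\"name\":\".Inputs.Input2\",\"inputs\":[]},{\"op\":\"elemwise_mul\",\"name\":\".$0\",\"inputs\":[[0,0,0],[1,0,0]]},{\"op\":\"_copy\",\"name\":\".Outputs.Output\",\"inputs\":[[2,0,0]]}],\"arg_nodes\":[0,1],\"heads\":[[3,0,0]]}";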

@matteosal
Contributor Author

Any help on this?

@KexinFeng
Contributor

KexinFeng commented Jun 23, 2021

Analysis:

The issue was tracked down mainly by comparing elemwise_add with elemwise_mul at operator registration. Thanks to @barry-jin for pointing out an important discrepancy between these two operators today. The discrepancy is in the function registered for creating the operator's node in the backward pass:
https://github.com/apache/incubator-mxnet/blob/da4ff3a4dc0bd6a54af3d75c492021d18ba1867b/src/operator/tensor/elemwise_binary_op_basic.cc#L111 https://github.com/apache/incubator-mxnet/blob/da4ff3a4dc0bd6a54af3d75c492021d18ba1867b/src/operator/tensor/elemwise_binary_op_basic.cc#L238

elemwise_mul uses ElemwiseGradUseIn, since the calculation of _backward_mul depends on the input values. elemwise_add, however, uses CloneGradient, the reason for which is not clear yet. ElemwiseGradUseNone is another valid option for elemwise_add, so I tested it by replacing CloneGradient with ElemwiseGradUseNone.
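For concreteness, a sketch of the tested change (registration abbreviated; the attribute lines paraphrase what is registered in the elemwise_binary_op_basic.cc links above):

// src/operator/tensor/elemwise_binary_op_basic.cc (abbreviated sketch; only the
// FGradient attribute of elemwise_add is shown)
NNVM_REGISTER_OP(elemwise_add)
// before: the head-gradient NodeEntry is handed directly to both inputs
// .set_attr<nnvm::FGradient>("FGradient", CloneGradient{"_backward_add"});
// tested replacement: route the head gradient through a _backward_add node,
// which needs neither the forward inputs nor the forward outputs
.set_attr<nnvm::FGradient>("FGradient", ElemwiseGradUseNone{"_backward_add"});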

Result:

The issue is solved! ./demo2 0 1 specifies the gradient requests as follows, and 2021 is the hardcoded head gradient.

 uint32_t grad_req_1[1] = {0};
 uint32_t grad_req_2[1] = {1};
$~/mxnet/issue2/build2$ ./demo2 0 1
[19:45:23] ../src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v0.8.0. Attempting to upgrade...
[19:45:23] ../src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
[19:45:23] ../src/engine/engine.cc:55: MXNet start using engine: NaiveEngine
[19:45:23] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for CPU
INPUT 1: 0.4
INPUT 2: 0.5
OUTPUT: 0.9
OUTGRAD: 2021
GRAD 1: 0
GRAD 2: 2021

$~/mxnet/issue2/build2$ ./demo2 1 0
[19:49:59] ../src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v0.8.0. Attempting to upgrade...
[19:49:59] ../src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
[19:49:59] ../src/engine/engine.cc:55: MXNet start using engine: NaiveEngine
[19:49:59] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for CPU
INPUT 1: 0.4
INPUT 2: 0.5
OUTPUT: 0.9
OUTGRAD: 2021
GRAD 1: 2021
GRAD 2: 0

$~/mxnet/issue2/build2$ ./demo2 1 1
[19:50:11] ../src/nnvm/legacy_json_util.cc:209: Loading symbol saved by previous version v0.8.0. Attempting to upgrade...
[19:50:11] ../src/nnvm/legacy_json_util.cc:217: Symbol successfully upgraded!
[19:50:11] ../src/engine/engine.cc:55: MXNet start using engine: NaiveEngine
[19:50:11] ../src/storage/storage.cc:199: Using Pooled (Naive) StorageManager for CPU
INPUT 1: 0.4
INPUT 2: 0.5
OUTPUT: 0.9
OUTGRAD: 2021
GRAD 1: 2021
GRAD 2: 2021

Further investigation:

This result shows that CloneGradient is the most likely cause of this issue. So the remaining things to check are:

  1. In the bug report, the sym is built by:

    {
        "nodes": [
            {
                "op": "null",
                "name": ".Inputs.Input1",
                "inputs": []
            },
            {
                "op": "null",
                "name": ".Inputs.Input2",
                "inputs": []
            },
            {
                "op": "elemwise_add",
                "name": ".$0",
                "inputs": [[0,0,0],[1,0,0]]
            },
            {
                "op": "_copy",
                "name": ".Outputs.Output",
                "inputs": [[2,0,0]]
            }
        ],
        "arg_nodes": [0,1],
        "heads": [[3,0,0]]
    }

It is possible that the "_copy" operation triggers this issue in CloneGradient. This would help explain why it fails.

  2. Other usages of CloneGradient, in mxnet/src/operator/contrib/stes_op.cc:

CloneGradient{"_backward_round_ste"}
CloneGradient{"_backward_sign_ste"}

It is possible they have a similar bug.

@matteosal
Contributor Author

matteosal commented Jun 23, 2021

@KexinFeng thanks for looking into this!

It is possible that "_copy" operation intrigues this issue in CloneGradient. This helps understand why it fails.

There must be other factors besides this, because the equivalent Python example in my original report works fine and it also has the _copy operator. This also means that there is a particular combination of C API calls (the one used by Python) which is able to work around this bug.

@KexinFeng
Contributor

KexinFeng commented Jun 23, 2021

I see.

because the equivalent python example in my original report works fine and it also has the _copy operator

Actually, I notice a discrepancy between the Python call and the C++ call. In the Python code, CachedOp::Forward() ends up calling CachedOp::DynamicForward(); in the C++ code, CachedOp::Forward() calls CachedOp::StaticForward(). I currently don't know if this discrepancy triggers the bug, but it can be tested quickly by setting something like hybridize(static_alloc=True) in Python, or setting static_alloc=False in the C++ code.

Update
As a temporary solution, I find that changing static allocation to false avoids this bug.

/* Create cached op */
const char *cachedop_keys[1] = {"static_alloc"};
const char *cachedop_vals[1] = {"false"};

I'm wondering if dynamic allocation is applicable in your usage?

@matteosal
Contributor Author

Unfortunately dynamic allocation is not an option for me, at least for now.

Anyway, I have verified that static_alloc = false indeed fixes the issue. It also explains why it works in Python: no flag is passed to the cached op from Python, and static_alloc is false by default. Hardcoding flags = {"static_alloc": "true"} here exposes the issue in the Python script as well.

Thanks for submitting the PR.

@ptrendx
Member

ptrendx commented Jul 16, 2021

@szha @leezu @sandeep-krishnamurthy Bringing this to your attention - this issue is most probably a bug in the handling of reqs in CachedOp when static_alloc is set to True, which seems pretty major. Could you assign somebody to take a look?

@barry-jin
Contributor

barry-jin commented Jul 16, 2021

I don't think the root cause is in CachedOp. As I was debugging this issue, I found that elemwise_add uses CloneGradient, which means the ograds are copied multiple times for the inputs.

For cached_op, if static_alloc is on, it will construct the backward graph from the grad_graph outputs:
https://github.com/apache/incubator-mxnet/blob/3480ba2c6df02bb907d3a975d354efa8697c4e71/src/imperative/cached_op.cc#L270-L281
In the case of elemwise_add(a, b), the grad_graph will look like this:
[screenshot of the grad_graph omitted]
The gradient of b will be a copy of the gradient of a, so there will be a divergence between (case 1: a.grad_req = null, b.grad_req = write) and (case 2: a.grad_req = write, b.grad_req = null) when constructing the new graph based on the grad_graph.

From my point of view, the solution to this bug is to change the elemwise_add gradient function to this:

.set_attr<nnvm::FGradient>("FGradient",
  [](const nnvm::ObjectPtr& n, const std::vector<nnvm::NodeEntry>& ograds) {
    std::vector<nnvm::NodeEntry> ret;
    const size_t input_count = n->inputs.size();
    ret.reserve(input_count);
    for (size_t i = 0; i < input_count; ++i) {
      ret.emplace_back(MakeNode("ones_like", n->attrs.name + "_grad_ones", {n->inputs[i]}, nullptr, &n));
    }
    return ret;
});

@KexinFeng FYI

@ptrendx
Member

ptrendx commented Jul 17, 2021

The graph is exactly the same, whether you have static or dynamic alloc. The problem I believe lies here:
https://github.com/apache/incubator-mxnet/blob/3480ba2c6df02bb907d3a975d354efa8697c4e71/src/imperative/cached_op.cc#L986-L1006

especially lines 992 and 1002 - they set the reqs for the NodeEntries of the graph outputs based on the user's preferences. This is only fine if those NodeEntries are not actually used in other places in the computation, which is true for most cases. However, the gradient of elemwise_add is just a split. The CloneGradient method does not actually do any copying; it just reuses the NodeEntry and passes it to both downstream operators - this is the right behavior. But when one of those downstream places is an output of the graph marked with req=null, you get a problem - because of those lines in the StaticBackward function, that edge is marked as kNullOp even though it is used in the rest of the computation.
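For reference, a paraphrased sketch of what CloneGradient does (not the verbatim MXNet source; the real definition lives in the operator registration headers):

// Paraphrased sketch of CloneGradient: every input receives the *same*
// output-gradient NodeEntry; no copy node is inserted into the graph.
struct CloneGradientSketch {
  const char* op_name;  // name of the nominal backward op, unused when cloning
  std::vector<nnvm::NodeEntry> operator()(const nnvm::ObjectPtr& n,
                                          const std::vector<nnvm::NodeEntry>& ograds) const {
    std::vector<nnvm::NodeEntry> ret;
    ret.reserve(n->inputs.size());
    for (size_t i = 0; i < n->inputs.size(); ++i)
      ret.emplace_back(ograds[0]);  // a's grad and b's grad alias the same entry
    return ret;
  }
};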

The proper fix therefore would be to change the logic in StaticBackward to take into account all usages of those NodeEntries and only mark one as null if the graph output is its only usage. There could also be an additional improvement here (although the real-world impact is probably pretty minimal): make the kNullOp requirement "traverse" the graph - right now, if you have a chain of (single input/output) operations A -> B -> C which ultimately produces an unnecessary output, then only the C operation will know about this, and A and B will be performed (and their outputs allocated) no matter what.
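A rough sketch of the first suggestion, assuming a hypothetical helper CountEntryUses (illustrative only, not the actual cached_op.cc code):

#include <nnvm/graph.h>

// Hypothetical helper: count how many node inputs in the graph consume `entry`.
inline size_t CountEntryUses(const nnvm::Graph& graph, const nnvm::NodeEntry& entry) {
  size_t uses = 0;
  nnvm::DFSVisit(graph.outputs, [&](const nnvm::ObjectPtr& node) {
    for (const auto& input : node->inputs) {
      if (input.node == entry.node && input.index == entry.index) ++uses;
    }
  });
  return uses;
}

// Idea for the req assignment in StaticBackward:
//   if (user_req == kNullOp && CountEntryUses(full_graph, output_entry) > 0)
//     req = kWriteTo;   // the entry still feeds other nodes, so it must be produced
//   else
//     req = user_req;   // the graph output is the only consumer; safe to honour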

@ptrendx
Member

ptrendx commented Jul 17, 2021

Also, you definitely do not want to make the gradient of add be ones_like ;-).

barry-jin mentioned this issue Jul 20, 2021
szha pushed a commit that referenced this issue Oct 27, 2021
* fix 20293

* avoid state.array_reqs being overrided by reqs

* update

* fix AddTo grad_req in staticbackward

* fix lint

* fix executor