From ddbe0b174591b499a5719a86673a1901cf65d30d Mon Sep 17 00:00:00 2001
From: Zhennan Qin <zhennan.qin@intel.com>
Date: Mon, 28 Oct 2019 10:55:00 +0800
Subject: [PATCH] [mkldnn-v1.0]rebase with master (#16649)

* fixed broken links across multiple files (#16581)

* fix missing docs due to git add issues (#16496)

* Create SECURITY.md (#16573)

* Create SECURITY.md

* Update SECURITY.md

* [Numpy] Support N_D(N>=3) batch_dot (#16586)

* Support N_D(N>=3) batch_dot

* use 1E-4

* fix lint

* remove unnecessary comment

* Update test_numpy_op.py

* Large Vector tests for DGL Ops Part 2 (#16497)

* add hyperbolic, logical, sign and regression tests for large vector

* changed hyperbolic functions into existing trigonometric functions

* fix trigo and simple bind needs shape as tuple

* fix logical ops, add with_seed

* fix arcosh in largearray, remove regression from largevector

* [Numpy] Loading numpy-incompatible NDArray in numpy-compatible mode (#16597)

* Make MXIsNumpyShape return enum

* address the comment

* Suppress subgraph log in CI (#16607)

Change-Id: Ia2ed6fdbb1d2cb5cc607a8856ca13ee338e27eac

* Fix dequantize memory corruption (#16606)

Change-Id: I51b62a32987bdbcf96f04b1bc6617e66796f648b

* [MKLDNN]Fix reorder2default (#16602)

* Fix reorder2default

Change-Id: I74c87af9535f6264e6d1ea7eaed089a6480a3358

* fix

Change-Id: I6d07b43b520a47e7c78bd4b4b6390f5fb95e6957

* Fix

Change-Id: Id72f25c34291be4711f55569c6d61467edd6113d

* Fix CI

Change-Id: I8c33a82555d5ace2d0b682c1e3eefa13f3a44768

* Run CI

Change-Id: Ie8a6dab80ef91c0337cafbae4e3db277e0c7ebf7

* second round of fixing broken links in multiple files (#16598)

* Python Docstring Convention (#16550)

* Docstring convention for

* Docstring convention for

* Docstring convention for

* Docstring convention for

* Docstring convention for

* Docstring convention for

* Docstring convention

* Revert removing new line

* Remove white space

* [MXNET-1434] Fix a broken link for basic C++ tutorial (#16461)

* Fix for wrong reqs set after switching from training to inference (#16553)

* Debugging reqs

* Move literal strings to const static members

* Fix lint

* julia/docs: more DRY on page rendering (#16396)

* Disables test_bulking_operator_gpu due to flakiness (#16611)

* C Api for simplebind, fix comment for trigoops, add atol to assert (#16585)

* C Api for simplebind, fix comment for trigoops, add atol to assert

* fix build issues

* fix lint and add regression test

* fix indent

* api doc and function name change

* fix lint and add infer shape test

* Imagenet inference to nightly fix (#16599)

* split to cd and shell

* comment

* lots of prints

* copy binary at correct location

* remove comments

* add mkl lib

* update docker run build function

* set nvidia docker true to run imagenet inference on GPU

* Revert "set nvidia docker true to run imagenet inference on GPU"

This reverts commit 98f8eef2057351d7964f1e9326ea6772c216f0af.
As we don't need GPU for compilation.

* Fix python doc build issue (#16630)

* pin the pip versions

* remove nbconvert comment

* Faster general take (#16615)

* Sped up perf of take op when axis != 0

* Formatting and syntax fixes

* Rename Take to specify axis

* Fix line length lint errors

* [Gluon] Don't serialize shared parameters twice (#16582)

Add deduplicate argument (default of False) to save_parameters.

* Fix index overflow bug in einsum (#16589)

* fix index overflow

* check index overflow

* fix index overflow in einsum path

* fix indent

* reduce NPY_MAXARGS

* safe accumulate

* Move some subgraph verbose to MXNET_SUBGRAPH_VERBOSE=2 (#16622)

* Move subgraph pass log to verbose=2

* Run CI

* add npx reshape (#16640)

* RNNOp only call cuda/cudnn if GPU ctx is requested (#16632)

* fix bad encode (#16641)

* [Perl] - ndarray to native array conversion fix (#16635)

* fixing broken links in multiple files - round 3 (#16634)

* add type switch to weight tensor (#16543)

* numpy doc enhancement (#16637)

* Change NDArray to ndarray for npx ops

Add nonzero

boolean mask supports boolean ndarray

Add argmin op and interoperability test for nonzero

Fix vdot, inner, outer docs

Add nonzero to mx.nd.np

Add docs

Fix

* Fix lint

* Fix

* Fix

* Fix get_constant

* Disable float16 test (#16643)

* Fix GetMKLDNNData for delay alloc (#16618)

* Fix GetMKLDNNData for delay alloc

* Run CI

* Run CI

* Run CI

* Run CI

* Run CI

Change-Id: I7ac2796e0ee8439c92fd2bd7a70a23a359b76b12

* Revert "[mkldnn-1.0]Rebase to master (#16648)"

This reverts commit dea3dd23d1982c913b3af6cfc7f4115c2cfa7244.
---
 benchmark/python/einsum/benchmark_einsum.py   |   9 +
 ci/docker/runtime_functions.sh                |   5 +-
 docs/python_docs/environment.yml              |  19 +-
 .../python/tutorials/extend/custom_layer.md   |   2 +-
 .../gluon_from_experiment_to_deployment.md    |   4 +-
 .../gluon/training/fit_api_tutorial.md        |   2 +-
 .../packages/ndarray/sparse/train.md          |   4 +-
 .../packages/ndarray/sparse/train_gluon.md    |  35 +-
 .../packages/onnx/fine_tuning_gluon.md        |   2 +-
 .../python/tutorials/packages/viz/index.rst   |   2 +-
 .../backend/mkldnn/mkldnn_quantization.md     |   4 +-
 .../tutorials/performance/backend/profiler.md |   2 +-
 .../performance/backend/tensorrt/tensorrt.md  |   4 +-
 docs/static_site/src/pages/api/api.html       |   2 +-
 .../tutorials/mxnet_cpp_inference_tutorial.md |  16 +-
 docs/static_site/src/pages/api/faq/float16.md |   2 +-
 docs/static_site/src/pages/api/faq/perf.md    |   6 +-
 .../pages/get_started/build_from_source.md    |   2 +-
 include/mxnet/c_api.h                         |  38 ++
 julia/docs/src/api/io.md                      |   2 +-
 julia/docs/src/tutorial/char-lstm.md          |   2 +-
 julia/docs/src/tutorial/mnist.md              |   4 +-
 perl-package/AI-MXNet/lib/AI/MXNet/NDArray.pm |   6 +-
 perl-package/AI-MXNet/t/test_ndarray.t        |  19 +-
 python/mxnet/_numpy_op_doc.py                 | 126 ++++
 python/mxnet/base.py                          |   3 +
 python/mxnet/gluon/block.py                   |  35 +-
 python/mxnet/gluon/parameter.py               |   3 +-
 python/mxnet/ndarray/numpy/_op.py             | 156 ++++-
 python/mxnet/ndarray/numpy/random.py          |  12 +-
 python/mxnet/numpy/linalg.py                  |  25 +
 python/mxnet/numpy/multiarray.py              | 545 +++++++++++++++++-
 python/mxnet/numpy/random.py                  |  80 ++-
 python/mxnet/numpy/stride_tricks.py           |   9 +
 python/mxnet/numpy/utils.py                   |   4 +-
 python/mxnet/numpy_dispatch_protocol.py       |   2 +
 python/mxnet/numpy_extension/random.py        |   2 +-
 python/mxnet/symbol/numpy/_symbol.py          |  55 +-
 python/mxnet/symbol/numpy/random.py           |   8 +-
 python/mxnet/symbol/symbol.py                 | 110 ++--
 python/mxnet/util.py                          |  63 +-
 src/c_api/c_api_executor.cc                   | 231 ++++++--
 src/ndarray/ndarray.cc                        |   2 +
 src/operator/contrib/allclose_op-inl.h        |   4 +-
 src/operator/contrib/boolean_mask.cc          |   2 +-
 src/operator/contrib/boolean_mask.cu          |   2 +-
 src/operator/mxnet_op.h                       |  12 +
 .../numpy/np_broadcast_reduce_op_index.cc     |  11 +
 .../numpy/np_broadcast_reduce_op_index.cu     |   3 +
 src/operator/numpy/np_einsum_op-inl.h         | 172 +++---
 src/operator/numpy/np_einsum_op.cc            |  11 +
 src/operator/numpy/np_einsum_path_op-inl.h    | 114 ++--
 src/operator/numpy/np_matrix_op-inl.h         |  54 +-
 src/operator/numpy/np_matrix_op.cc            | 206 ++++++-
 src/operator/numpy/np_matrix_op.cu            |   3 +
 src/operator/numpy/np_nonzero_op.cc           |   3 +-
 src/operator/numpy/np_nonzero_op.cu           |   2 +-
 src/operator/numpy/random/np_choice_op.h      |  20 +-
 src/operator/rnn-inl.h                        |   6 +-
 src/operator/subgraph/build_subgraph.cc       |   2 +-
 src/operator/tensor/indexing_op.cc            |  59 +-
 src/operator/tensor/indexing_op.cu            |  61 +-
 src/operator/tensor/indexing_op.h             |  28 +-
 tests/nightly/JenkinsfileForBinaries          |  10 +-
 tests/nightly/test_large_array.py             |   6 +-
 tests/nightly/test_large_vector.py            |  49 +-
 tests/python/gpu/test_operator_gpu.py         |   1 +
 tests/python/unittest/test_gluon.py           |  40 ++
 tests/python/unittest/test_numpy_gluon.py     |  23 +
 .../unittest/test_numpy_interoperability.py   |  62 ++
 tests/python/unittest/test_numpy_op.py        | 180 ++++--
 71 files changed, 2284 insertions(+), 526 deletions(-)

diff --git a/benchmark/python/einsum/benchmark_einsum.py b/benchmark/python/einsum/benchmark_einsum.py
index 3593de2db9e1..6de8223287da 100644
--- a/benchmark/python/einsum/benchmark_einsum.py
+++ b/benchmark/python/einsum/benchmark_einsum.py
@@ -48,6 +48,15 @@ def test_np_einsum():
     cost = measure_cost(500, np.einsum, *args, optimize=True)
     print("Greedy einsum: {} ms".format(cost * 1000))
 
+    print("RNN Use Case:")
+    a = np.random.uniform(0, 1, size=(64, 128, 512))
+    b = np.random.uniform(0, 1, size=(128, 512, 2, 2))
+    args = ['bij, ijkl->bkl', a, b]
+    cost = measure_cost(2, np.einsum, *args, optimize=True)
+    print('Greedy einsum: {} ms'.format(cost * 1000))
+    cost = measure_cost(2, np.einsum, *args)
+    print('Basic einsum: {} ms'.format(cost * 1000))
+
     print('Inner Product:')
     a = np.ones(6000000)
     b = np.ones(6000000)
diff --git a/ci/docker/runtime_functions.sh b/ci/docker/runtime_functions.sh
index c2acc0f40d7d..581bb2fd5280 100755
--- a/ci/docker/runtime_functions.sh
+++ b/ci/docker/runtime_functions.sh
@@ -1482,8 +1482,9 @@ nightly_test_installation() {
 nightly_test_imagenet_inference() {
     set -ex
     echo $PWD
-    cp /work/mxnet/build/cpp-package/example/imagenet_inference .
-    /work/mxnet/cpp-package/example/inference/unit_test_imagenet_inference.sh
+    cp /work/mxnet/build/cpp-package/example/imagenet_inference /work/mxnet/cpp-package/example/inference/
+    cd /work/mxnet/cpp-package/example/inference/
+    ./unit_test_imagenet_inference.sh
 }
 
 #Runs a simple MNIST training example
diff --git a/docs/python_docs/environment.yml b/docs/python_docs/environment.yml
index 11e43a1733f3..5f66d7715af9 100644
--- a/docs/python_docs/environment.yml
+++ b/docs/python_docs/environment.yml
@@ -27,13 +27,12 @@ dependencies:
 - matplotlib
 - notebook
 - pip:
-  # using nbconvert master until v5.5 comes out
-  - git+https://github.com/jupyter/nbconvert@master
-  - nbsphinx>=0.4.2
-  - recommonmark
-  - notedown
-  - pypandoc
-  - breathe
-  - mock
-  - awscli
-  - autodocsumm
+  - nbconvert==5.6.1
+  - nbsphinx==0.4.3
+  - recommonmark==0.6.0
+  - notedown==1.5.1
+  - pypandoc==1.4
+  - breathe==4.13.1
+  - mock==3.0.5
+  - awscli==1.16.266
+  - autodocsumm==0.1.11
diff --git a/docs/python_docs/python/tutorials/extend/custom_layer.md b/docs/python_docs/python/tutorials/extend/custom_layer.md
index 6002a7812ec7..2fe795ba5439 100644
--- a/docs/python_docs/python/tutorials/extend/custom_layer.md
+++ b/docs/python_docs/python/tutorials/extend/custom_layer.md
@@ -57,7 +57,7 @@ The rest of methods of the `Block` class are already implemented, and majority o
 
 Looking into implementation of [existing layers](https://mxnet.apache.org/api/python/gluon/nn.html), one may find that more often a block inherits from a [HybridBlock](https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/gluon/block.py#L428), instead of directly inheriting from `Block`.
 
-The reason for that is that `HybridBlock` allows to write custom layers that can be used in imperative programming as well as in symbolic programming. It is convinient to support both ways, because the imperative programming eases the debugging of the code and the symbolic one provides faster execution speed. You can learn more about the difference between symbolic vs. imperative programming from [this article](https://mxnet.apache.org/architecture/program_model.html).
+The reason for that is that `HybridBlock` allows you to write custom layers that can be used in imperative programming as well as in symbolic programming. It is convenient to support both ways, because the imperative programming eases the debugging of the code and the symbolic one provides faster execution speed. You can learn more about the difference between symbolic vs. imperative programming from [this article](/api/architecture/program_model).
 
 Hybridization is a process that Apache MxNet uses to create a symbolic graph of a forward computation. This allows to increase computation performance by optimizing the computational symbolic graph. Once the symbolic graph is created, Apache MxNet caches and reuses it for subsequent computations.
 
diff --git a/docs/python_docs/python/tutorials/getting-started/gluon_from_experiment_to_deployment.md b/docs/python_docs/python/tutorials/getting-started/gluon_from_experiment_to_deployment.md
index b1f65e682263..47b629991650 100644
--- a/docs/python_docs/python/tutorials/getting-started/gluon_from_experiment_to_deployment.md
+++ b/docs/python_docs/python/tutorials/getting-started/gluon_from_experiment_to_deployment.md
@@ -99,14 +99,14 @@ ctx = [mx.gpu(i) for i in range(num_gpus)] if num_gpus > 0 else [mx.cpu()]
 batch_size = per_device_batch_size * max(num_gpus, 1)
 ```
 
-Now we will apply data augmentations on training images. This makes minor alterations on the training images, and our model will consider them as distinct images. This can be very useful for fine-tuning on a relatively small dataset, and it will help improve the model. We can use the Gluon [DataSet API](https://mxnet.apache.org/tutorials/gluon/datasets.html), [DataLoader API](https://mxnet.apache.org/tutorials/gluon/datasets.html), and [Transform API](https://mxnet.apache.org/tutorials/gluon/data_augmentation.html) to load the images and apply the following data augmentations:
+Now we will apply data augmentations on training images. This makes minor alterations on the training images, and our model will consider them as distinct images. This can be very useful for fine-tuning on a relatively small dataset, and it will help improve the model. We can use the Gluon [DataSet API](/api/python/docs/api/gluon/data/index.html#mxnet.gluon.data.Dataset), [DataLoader API](/api/python/docs/api/gluon/data/index.html#mxnet.gluon.data.DataLoader), and [Transform API](/api/python/docs/api/gluon/data/index.html#mxnet.gluon.data.Dataset.transform) to load the images and apply the following data augmentations:
 1. Randomly crop the image and resize it to 224x224
 2. Randomly flip the image horizontally
 3. Randomly jitter color and add noise
 4. Transpose the data from `[height, width, num_channels]` to `[num_channels, height, width]`, and map values from [0, 255] to [0, 1]
 5. Normalize with the mean and standard deviation from the ImageNet dataset.
 
-For validation and inference, we only need to apply step 1, 4, and 5. We also need to save the mean and standard deviation values for [inference using C++](https://mxnet.apache.org/versions/master/tutorials/c++/mxnet_cpp_inference_tutorial.html).
+For validation and inference, we only need to apply step 1, 4, and 5. We also need to save the mean and standard deviation values for [inference using C++](/api/cpp/docs/tutorials/cpp_inference).
 
 ```python
 jitter_param = 0.4
diff --git a/docs/python_docs/python/tutorials/packages/gluon/training/fit_api_tutorial.md b/docs/python_docs/python/tutorials/packages/gluon/training/fit_api_tutorial.md
index 9e4cbe2f5114..896e5f217aa3 100644
--- a/docs/python_docs/python/tutorials/packages/gluon/training/fit_api_tutorial.md
+++ b/docs/python_docs/python/tutorials/packages/gluon/training/fit_api_tutorial.md
@@ -252,7 +252,7 @@ with warnings.catch_warnings():
     Epoch 2, loss 0.3229 <!--notebook-skip-line-->
 ```
 
-You can load the saved model, by using the `load_parameters` API in Gluon. For more details refer to the [Loading model parameters from file tutorial](../blocks/save_load_params.html#saving-model-parameters-to-file)
+You can load the saved model, by using the `load_parameters` API in Gluon. For more details refer to the [Loading model parameters from file tutorial](/api/python/docs/tutorials/packages/gluon/blocks/save_load_params.html#saving-model-parameters-to-file)
 
 
 ```python
diff --git a/docs/python_docs/python/tutorials/packages/ndarray/sparse/train.md b/docs/python_docs/python/tutorials/packages/ndarray/sparse/train.md
index 336185cf7583..23654fc6a33a 100644
--- a/docs/python_docs/python/tutorials/packages/ndarray/sparse/train.md
+++ b/docs/python_docs/python/tutorials/packages/ndarray/sparse/train.md
@@ -240,8 +240,8 @@ The function you will explore is: *y = x<sub>1</sub>  +  2x<sub>2</sub> + ... 10
 
 ### Preparing the Data
 
-In MXNet, both [mx.io.LibSVMIter](https://mxnet.apache.org/versions/master/api/python/io/io.html#mxnet.io.LibSVMIter)
-and [mx.io.NDArrayIter](https://mxnet.apache.org/versions/master/api/python/io/io.html#mxnet.io.NDArrayIter)
+In MXNet, both [mx.io.LibSVMIter](/api/python/docs/api/mxnet/io/index.html#mxnet.io.LibSVMIter)
+and [mx.io.NDArrayIter](/api/python/docs/api/mxnet/io/index.html#mxnet.io.NDArrayIter)
 support loading sparse data in CSR format. In this example, we'll use the `NDArrayIter`.
 
 You may see some warnings from SciPy. You don't need to worry about those for this example.
diff --git a/docs/python_docs/python/tutorials/packages/ndarray/sparse/train_gluon.md b/docs/python_docs/python/tutorials/packages/ndarray/sparse/train_gluon.md
index 402cc2aeb739..688071062e20 100644
--- a/docs/python_docs/python/tutorials/packages/ndarray/sparse/train_gluon.md
+++ b/docs/python_docs/python/tutorials/packages/ndarray/sparse/train_gluon.md
@@ -20,7 +20,7 @@
 
 When working on machine learning problems, you may encounter situations where the input data is sparse (i.e. the majority of values are zero). One example of this is in recommendation systems. You could have millions of user and product features, but only a few of these features are present for each sample. Without special treatment, the sheer magnitude of the feature space can lead to out-of-memory situations and cause significant slowdowns when training and making predictions.
 
-MXNet supports a number of sparse storage types (often called 'stype' for short) for these situations. In this tutorial, we'll start by generating some sparse data, write it to disk in the LibSVM format and then read back using the [`LibSVMIter`](https://mxnet.apache.org/api/python/io/io.html) for training. We use the Gluon API to train the model and leverage sparse storage types such as [`CSRNDArray`](https://mxnet.apache.org/api/python/ndarray/sparse.html?highlight=csrndarray#mxnet.ndarray.sparse.CSRNDArray) and [`RowSparseNDArray`](https://mxnet.apache.org/api/python/ndarray/sparse.html?highlight=rowsparsendarray#mxnet.ndarray.sparse.RowSparseNDArray) to maximise performance and memory efficiency.
+MXNet supports a number of sparse storage types (often called 'stype' for short) for these situations. In this tutorial, we'll start by generating some sparse data, write it to disk in the LibSVM format and then read back using the [`LibSVMIter`](/api/python/docs/api/mxnet/io/index.html#mxnet.io.LibSVMIter) for training. We use the Gluon API to train the model and leverage sparse storage types such as [`CSRNDArray`](/api/python/docs/api/ndarray/sparse/index.html#mxnet.ndarray.sparse.CSRNDArray) and [`RowSparseNDArray`](/api/python/docs/api/ndarray/sparse/index.html#mxnet.ndarray.sparse.RowSparseNDArray) to maximise performance and memory efficiency.
 
 
 ```python
@@ -63,7 +63,7 @@ print('{:,.0f} non-zero elements'.format(data.data.size))
 10,000 non-zero elements
 ```
 
-Our storage type is CSR (Compressed Sparse Row) which is the ideal type for sparse data along multiple axes. See [this in-depth tutorial](https://mxnet.apache.org/versions/master/tutorials/sparse/csr.html) for more information. Just to confirm the generation process ran correctly, we can see that the vast majority of values are indeed zero. One of the first questions to ask would be how much memory is saved by storing this data in a [`CSRNDArray`](https://mxnet.apache.org/api/python/ndarray/sparse.html?highlight=csrndarray#mxnet.ndarray.sparse.CSRNDArray) versus a standard [`NDArray`](https://mxnet.apache.org/versions/master/api/python/ndarray/sparse.html?highlight=ndarray#module-mxnet.ndarray). Since sparse arrays are constructed from many components (e.g. `data`, `indices` and `indptr`) we define a function called `get_nbytes` to calculate the number of bytes taken in memory to store an array. We compare the same data stored in a standard [`NDArray`](https://mxnet.apache.org/versions/master/api/python/ndarray/sparse.html?highlight=ndarray#module-mxnet.ndarray) (with `data.tostype('default')`) to the [`CSRNDArray`](https://mxnet.apache.org/api/python/ndarray/sparse.html?highlight=csrndarray#mxnet.ndarray.sparse.CSRNDArray).
+Our storage type is CSR (Compressed Sparse Row) which is the ideal type for sparse data along multiple axes. See [this in-depth tutorial](https://mxnet.apache.org/versions/master/tutorials/sparse/csr.html) for more information. Just to confirm the generation process ran correctly, we can see that the vast majority of values are indeed zero. One of the first questions to ask would be how much memory is saved by storing this data in a [`CSRNDArray`](/api/python/docs/api/ndarray/sparse/index.html#mxnet.ndarray.sparse.CSRNDArray) versus a standard [`NDArray`](/api/python/docs/api/ndarray/ndarray.html#module-mxnet.ndarray). Since sparse arrays are constructed from many components (e.g. `data`, `indices` and `indptr`) we define a function called `get_nbytes` to calculate the number of bytes taken in memory to store an array. We compare the same data stored in a standard [`NDArray`](/api/python/docs/api/ndarray/ndarray.html#module-mxnet.ndarray) (with `data.tostype('default')`) to the [`CSRNDArray`](/api/python/docs/api/ndarray/sparse/index.html#mxnet.ndarray.sparse.CSRNDArray).
 
 
 ```python
@@ -94,9 +94,9 @@ Given the extremely high sparsity of the data, we observe a huge memory saving h
 
 ### Writing Sparse Data
 
-Since there is such a large size difference between dense and sparse storage formats here, we ideally want to store the data on disk in a sparse storage format too. MXNet supports a format called LibSVM and has a data iterator called [`LibSVMIter`](https://mxnet.apache.org/api/python/io/io.html?highlight=libsvmiter) specifically for data formatted this way.
+Since there is such a large size difference between dense and sparse storage formats here, we ideally want to store the data on disk in a sparse storage format too. MXNet supports a format called LibSVM and has a data iterator called [`LibSVMIter`](/api/python/docs/api/mxnet/io/index.html#mxnet.io.LibSVMIter) specifically for data formatted this way.
 
-A LibSVM file has a row for each sample, and each row starts with the label: in this case `0.0` or `1.0` since we have a classification task. After this we have a variable number of `key:value` pairs separated by spaces, where the key is column/feature index and the value is the value of that feature. When working with your own sparse data in a custom format you should try to convert your data into this format. We define a `save_as_libsvm` function to save the `data` ([`CSRNDArray`](https://mxnet.apache.org/versions/master/api/python/ndarray/sparse.html?highlight=csrndarray#mxnet.ndarray.sparse.CSRNDArray)) and `label` (`NDArray`) to disk in LibSVM format.
+A LibSVM file has a row for each sample, and each row starts with the label: in this case `0.0` or `1.0` since we have a classification task. After this we have a variable number of `key:value` pairs separated by spaces, where the key is column/feature index and the value is the value of that feature. When working with your own sparse data in a custom format you should try to convert your data into this format. We define a `save_as_libsvm` function to save the `data` ([`CSRNDArray`](/api/python/docs/api/ndarray/sparse/index.html#mxnet.ndarray.sparse.CSRNDArray)) and `label` (`NDArray`) to disk in LibSVM format.
 
 
 ```python
@@ -148,10 +148,9 @@ Some storage overhead is introduced by serializing the data as characters (with
 
 ### Reading Sparse Data
 
-Using [`LibSVMIter`](https://mxnet.apache.org/api/python/io/io.html?highlight=libsvmiter), we can quickly and easily load data into batches ready for training. Although Gluon [`Dataset`](https://mxnet.apache.org/versions/master/api/python/gluon/data.html?highlight=dataset#mxnet.gluon.data.Dataset)s can be written to return sparse arrays, Gluon [`DataLoader`](https://mxnet.apache.org/versions/master/api/python/gluon/data.html?highlight=dataloader#mxnet.gluon.data.DataLoader)s currently convert each sample to dense before stacking up to create the batch. As a result, [`LibSVMIter`](https://mxnet.apache.org/api/python/io/io.html?highlight=libsvmiter) is the recommended method of loading sparse data in batches.
-
-Similar to using a [`DataLoader`](https://mxnet.apache.org/versions/master/api/python/gluon/data.html?highlight=dataloader#mxnet.gluon.data.DataLoader), you must specify the required `batch_size`. Since we're dealing with sparse data and the column shape isn't explicitly stored in the LibSVM file, we additionally need to provide the shape of the data and label. Our [`LibSVMIter`](https://mxnet.apache.org/api/python/io/io.html?highlight=libsvmiter) returns batches in a slightly different form to a [`DataLoader`](https://mxnet.apache.org/versions/master/api/python/gluon/data.html?highlight=dataloader#mxnet.gluon.data.DataLoader). We get `DataBatch` objects instead of `tuple`. See the [appendix of this tutorial](https://mxnet.apache.org/versions/master/tutorials/gluon/datasets.html) for more information.
+Using [`LibSVMIter`](/api/python/docs/api/mxnet/io/index.html#mxnet.io.LibSVMIter), we can quickly and easily load data into batches ready for training. Although Gluon [`Dataset`](/api/python/docs/api/gluon/data/index.html#mxnet.gluon.data.Dataset)s can be written to return sparse arrays, Gluon [`DataLoader`](/api/python/docs/api/gluon/data/index.html#mxnet.gluon.data.DataLoader)s currently convert each sample to dense before stacking up to create the batch. As a result, [`LibSVMIter`](/api/python/docs/api/mxnet/io/index.html#mxnet.io.LibSVMIter) is the recommended method of loading sparse data in batches.
 
+Similar to using a [`DataLoader`](/api/python/docs/api/gluon/data/index.html#mxnet.gluon.data.DataLoader), you must specify the required `batch_size`. Since we're dealing with sparse data and the column shape isn't explicitly stored in the LibSVM file, we additionally need to provide the shape of the data and label. Our [`LibSVMIter`](/api/python/docs/api/mxnet/io/index.html#mxnet.io.LibSVMIter) returns batches in a slightly different form to a [`DataLoader`](/api/python/docs/api/gluon/data/index.html#mxnet.gluon.data.DataLoader). We get `DataBatch` objects instead of `tuple`. 
 
 ```python
 data_iter = mx.io.LibSVMIter(data_libsvm=filepath, data_shape=(num_features,), label_shape=(1,), batch_size=10)
@@ -215,7 +214,7 @@ Although results will change depending on system specifications and degree of sp
 
 Our next step is to define a network. We have an input of 1,000,000 features and we want to make a binary prediction. We don't have any spatial or temporal relationships between features, so we'll use a 3 layer fully-connected network where the last layer has 1 output unit (with sigmoid activation). Since we're working with sparse data, we'd ideally like to use network operators that can exploit this sparsity for improved performance and memory efficiency.
 
-Gluon's [`nn.Dense`](https://mxnet.apache.org/versions/master/api/python/gluon/nn.html?highlight=dense#mxnet.gluon.nn.Dense) block can used with [`CSRNDArray`](https://mxnet.apache.org/api/python/ndarray/sparse.html?highlight=csrndarray#mxnet.ndarray.sparse.CSRNDArray) input arrays but it doesn't exploit the sparsity. Under the hood, [`Dense`](https://mxnet.apache.org/versions/master/api/python/gluon/nn.html?highlight=dense#mxnet.gluon.nn.Dense) uses the [`FullyConnected`](https://mxnet.apache.org/versions/master/api/python/ndarray/ndarray.html?highlight=fullyconnected#mxnet.ndarray.FullyConnected) operator which isn't optimized for [`CSRNDArray`](https://mxnet.apache.org/api/python/ndarray/sparse.html?highlight=csrndarray#mxnet.ndarray.sparse.CSRNDArray) arrays. We'll implement a `Block` that does exploit this sparsity, *but first*, let's just remind ourselves of the [`Dense`](https://mxnet.apache.org/versions/master/api/python/gluon/nn.html?highlight=dense#mxnet.gluon.nn.Dense) implementation by creating an equivalent `Block` called `FullyConnected`.
+Gluon's [`nn.Dense`](/api/python/docs/api/gluon/nn/index.html#mxnet.gluon.nn.Dense) block can be used with [`CSRNDArray`](/api/python/docs/api/ndarray/sparse/index.html#mxnet.ndarray.sparse.CSRNDArray) input arrays but it doesn't exploit the sparsity. Under the hood, [`Dense`](/api/python/docs/api/gluon/nn/index.html#mxnet.gluon.nn.Dense) uses the [`FullyConnected`](/api/python/docs/api/ndarray/ndarray.html#mxnet.ndarray.FullyConnected) operator which isn't optimized for [`CSRNDArray`](/api/python/docs/api/ndarray/sparse/index.html#mxnet.ndarray.sparse.CSRNDArray) arrays. We'll implement a `Block` that does exploit this sparsity, *but first*, let's just remind ourselves of the [`Dense`](/api/python/docs/api/gluon/nn/index.html#mxnet.gluon.nn.Dense) implementation by creating an equivalent `Block` called `FullyConnected`.
 
 
 ```python
@@ -235,11 +234,11 @@ class FullyConnected(mx.gluon.HybridBlock):
         return F.FullyConnected(x, weight, bias, num_hidden=self._units)
 ```
 
-Our `weight` and `bias` parameters are dense (see `stype='default'`) and so are their gradients (see `grad_stype='default'`). Our `weight` parameter has shape `(units, in_units)` because the [`FullyConnected`](https://mxnet.apache.org/versions/master/api/python/ndarray/ndarray.html?highlight=fullyconnected#mxnet.ndarray.FullyConnected) operator performs the following calculation:
+Our `weight` and `bias` parameters are dense (see `stype='default'`) and so are their gradients (see `grad_stype='default'`). Our `weight` parameter has shape `(units, in_units)` because the [`FullyConnected`](/api/python/docs/api/ndarray/ndarray.html#mxnet.ndarray.FullyConnected) operator performs the following calculation:
 
 $$Y = XW^T + b$$
 
-We could instead have created our parameter with shape `(in_units, units)` and avoid the transpose of the weight matrix. We'll see why this is so important later on. And instead of [`FullyConnected`](https://mxnet.apache.org/versions/master/api/python/ndarray/ndarray.html?highlight=fullyconnected#mxnet.ndarray.FullyConnected) we could have used [`mx.sparse.dot`](https://mxnet.apache.org/versions/master/api/python/ndarray/sparse.html?highlight=sparse.dot#mxnet.ndarray.sparse.dot) to fully exploit the sparsity of the [`CSRNDArray`](https://mxnet.apache.org/api/python/ndarray/sparse.html?highlight=csrndarray#mxnet.ndarray.sparse.CSRNDArray) input arrays. We'll now implement an alternative `Block` called `FullyConnectedSparse` using these ideas. We take `grad_stype` of the `weight` as an argument (called `weight_grad_stype`), since we're going to change this later on.
+We could instead have created our parameter with shape `(in_units, units)` and avoid the transpose of the weight matrix. We'll see why this is so important later on. And instead of [`FullyConnected`](/api/python/docs/api/ndarray/ndarray.html#mxnet.ndarray.FullyConnected) we could have used [`mx.sparse.dot`](/api/python/docs/api/ndarray/sparse/index.html?#mxnet.ndarray.sparse.dot) to fully exploit the sparsity of the [`CSRNDArray`](/api/python/docs/api/ndarray/sparse/index.html#mxnet.ndarray.sparse.CSRNDArray) input arrays. We'll now implement an alternative `Block` called `FullyConnectedSparse` using these ideas. We take `grad_stype` of the `weight` as an argument (called `weight_grad_stype`), since we're going to change this later on.
 
 
 ```python
@@ -261,7 +260,7 @@ class FullyConnectedSparse(mx.gluon.HybridBlock):
 
 Once again, we're using a dense `weight`, so both `FullyConnected` and `FullyConnectedSparse` will return dense array outputs. When constructing a multi-layer network therefore, only the first layer needs to be optimized for sparse inputs. Our first layer is often responsible for reducing the feature dimension dramatically (e.g. 1,000,000 features down to 128 features). We'll set the number of units in our 3 layers to be 128, 8 and 1.
 
-We will use [`timeit`](https://docs.python.org/2/library/timeit.html) to check the performance of these two variants, and analyse some [MXNet Profiler](https://mxnet.apache.org/versions/master/tutorials/python/profiler.html) traces that have been created from these benchmarks. Additionally, we will inspect the memory usage of the weights (and gradients) using the `print_memory_allocation` function defined below:
+We will use [`timeit`](https://docs.python.org/2/library/timeit.html) to check the performance of these two variants, and analyse some [MXNet Profiler](/api/python/docs/tutorials/performance/backend/profiler.html) traces that have been created from these benchmarks. Additionally, we will inspect the memory usage of the weights (and gradients) using the `print_memory_allocation` function defined below:
 
 
 ```python
@@ -324,7 +323,7 @@ for batch in data_iter:
 
 ![fully connected](https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/doc/tutorials/ndarray/sparse/fully_connected.png)
 
-We can see the first [`FullyConnected`](https://mxnet.apache.org/versions/master/api/python/ndarray/ndarray.html?highlight=fullyconnected#mxnet.ndarray.FullyConnected) operator takes a significant proportion of time to execute (~25% of the iteration) because there are 1,000,000 input features (to 128). After this, the other [`FullyConnected`](https://mxnet.apache.org/versions/master/api/python/ndarray/ndarray.html?highlight=fullyconnected#mxnet.ndarray.FullyConnected) operators are much faster because they have input features of 128 (to 8) and 8 (to 1). On the backward pass, we see the same pattern (but in reverse). And finally, the parameter update step takes a large amount of time on the weight matrix of the first `FullyConnected` `Block`. When checking the memory allocations below, we can see the weight matrix of the first `FullyConnected` `Block` is responsible for 99.999% of the memory compared to other [`FullyConnected`](https://mxnet.apache.org/versions/master/api/python/ndarray/ndarray.html?highlight=fullyconnected#mxnet.ndarray.FullyConnected) weight matrices.
+We can see the first [`FullyConnected`](/api/python/docs/api/ndarray/ndarray.html#mxnet.ndarray.FullyConnected) operator takes a significant proportion of time to execute (~25% of the iteration) because there are 1,000,000 input features (to 128). After this, the other [`FullyConnected`](/api/python/docs/api/ndarray/ndarray.html#mxnet.ndarray.FullyConnected) operators are much faster because they have input features of 128 (to 8) and 8 (to 1). On the backward pass, we see the same pattern (but in reverse). And finally, the parameter update step takes a large amount of time on the weight matrix of the first `FullyConnected` `Block`. When checking the memory allocations below, we can see the weight matrix of the first `FullyConnected` `Block` is responsible for 99.999% of the memory compared to other [`FullyConnected`](/api/python/docs/api/ndarray/ndarray.html#mxnet.ndarray.FullyConnected) weight matrices.
 
 
 ```python
@@ -384,7 +383,7 @@ for batch in data_iter:
 
 ![fully connected sparse](https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/doc/tutorials/ndarray/sparse/fully_connected_sparse.png)
 
-We see the forward pass of `dot` and `add` (equivalent to [`FullyConnected`](https://mxnet.apache.org/versions/master/api/python/ndarray/ndarray.html?highlight=fullyconnected#mxnet.ndarray.FullyConnected) operator) is much faster now: 1.54ms vs 0.26ms. And this explains the reduction in overall time for the epoch. We didn't gain any benefit on the backward pass or parameter updates though.
+We see the forward pass of `dot` and `add` (equivalent to the [`FullyConnected`](/api/python/docs/api/ndarray/ndarray.html#mxnet.ndarray.FullyConnected) operator) is much faster now: 0.26ms, down from 1.54ms. This explains the reduction in overall time for the epoch. We didn't gain any benefit on the backward pass or parameter updates though.
 
 ![fully connected sparse backward](https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/doc/tutorials/ndarray/sparse/fully_connected_sparse_backward.png)
 
@@ -408,7 +407,7 @@ Memory Allocation for Weight Gradient:
 
 ### Benchmark: `FullyConnectedSparse` with `grad_stype=row_sparse` 
 
-One useful outcome of sparsity in our [`CSRNDArray`](https://mxnet.apache.org/api/python/ndarray/sparse.html?highlight=csrndarray#mxnet.ndarray.sparse.CSRNDArray) input is that our gradients will be row sparse. We can exploit this fact to give us potentially huge memory savings and speed improvements. Creating our `weight` parameter with shape `(units, in_units)` and not transposing in the forward pass are important pre-requisite for obtaining row sparse gradients. Using [`nn.Dense`](https://mxnet.apache.org/versions/master/api/python/gluon/nn.html?highlight=dense#mxnet.gluon.nn.Dense) would have led to column sparse gradients which are not supported in MXNet. We previously had `grad_stype` of the `weight` parameter in the first layer set to `'default'` so we were handling the gradient as a dense array. Switching this to `'row_sparse'` can give us these potential improvements.
+One useful outcome of sparsity in our [`CSRNDArray`](/api/python/docs/api/ndarray/sparse/index.html#mxnet.ndarray.sparse.CSRNDArray) input is that our gradients will be row sparse. We can exploit this fact to give us potentially huge memory savings and speed improvements. Creating our `weight` parameter with shape `(in_units, units)` and not transposing in the forward pass are important prerequisites for obtaining row sparse gradients. Using [`nn.Dense`](/api/python/docs/api/gluon/nn/index.html#mxnet.gluon.nn.Dense) would have led to column sparse gradients, which are not supported in MXNet. We previously had `grad_stype` of the `weight` parameter in the first layer set to `'default'`, so we were handling the gradient as a dense array. Switching this to `'row_sparse'` can give us these potential improvements.
 
 
 ```python
@@ -472,12 +471,12 @@ You can optimize this example further by setting the weight's `stype` to `'row_s
 
 ## Conclusion
 
-As part of this tutorial, we learned how to write sparse data to disk in LibSVM format and load it back in sparse batches with the [`LibSVMIter`](https://mxnet.apache.org/api/python/io/io.html?highlight=libsvmiter). We learned how to improve the performance of Gluon's [`nn.Dense`](https://mxnet.apache.org/versions/master/api/python/gluon/nn.html?highlight=dense#mxnet.gluon.nn.Dense) on sparse arrays using `mx.nd.sparse`. And lastly, we set `grad_stype` to `'row_sparse'` to reduce the size of the gradient and speed up the parameter update step.
+As part of this tutorial, we learned how to write sparse data to disk in LibSVM format and load it back in sparse batches with the [`LibSVMIter`](/api/python/docs/api/mxnet/io/index.html#mxnet.io.LibSVMIter). We learned how to improve the performance of Gluon's [`nn.Dense`](/api/python/docs/api/gluon/nn/index.html#mxnet.gluon.nn.Dense) on sparse arrays using `mx.nd.sparse`. And lastly, we set `grad_stype` to `'row_sparse'` to reduce the size of the gradient and speed up the parameter update step.
 
 ## Recommended Next Steps
 
-* More detail on the [`CSRNDArray`](https://mxnet.apache.org/api/python/ndarray/sparse.html?highlight=csrndarray#mxnet.ndarray.sparse.CSRNDArray) sparse array format can be found in [this tutorial](https://mxnet.apache.org/versions/master/tutorials/sparse/csr.html).
-* More detail on the [`RowSparseNDArray`](https://mxnet.apache.org/api/python/ndarray/sparse.html?highlight=rowsparsendarray#mxnet.ndarray.sparse.RowSparseNDArray) sparse array format can be found in [this tutorial](https://mxnet.apache.org/versions/master/tutorials/sparse/row_sparse.html).
-* Users of the Module API can see a symbolic only example in [this tutorial](https://mxnet.apache.org/versions/master/tutorials/sparse/train.html).
+* More detail on the [`CSRNDArray`](/api/python/docs/api/ndarray/sparse/index.html#mxnet.ndarray.sparse.CSRNDArray) sparse array format can be found in [this tutorial](/api/python/docs/tutorials/packages/ndarray/sparse/csr.html).
+* More detail on the [`RowSparseNDArray`](/api/python/docs/api/ndarray/sparse/index.html#mxnet.ndarray.sparse.RowSparseNDArray) sparse array format can be found in [this tutorial](/api/python/docs/tutorials/packages/ndarray/sparse/row_sparse.html).
+* Users of the Module API can see a symbolic-only example in [this tutorial](/api/python/docs/tutorials/packages/ndarray/sparse/train.html).
 
 <!-- INSERT SOURCE DOWNLOAD BUTTONS -->
diff --git a/docs/python_docs/python/tutorials/packages/onnx/fine_tuning_gluon.md b/docs/python_docs/python/tutorials/packages/onnx/fine_tuning_gluon.md
index f77731494215..e1eb3044a9fa 100644
--- a/docs/python_docs/python/tutorials/packages/onnx/fine_tuning_gluon.md
+++ b/docs/python_docs/python/tutorials/packages/onnx/fine_tuning_gluon.md
@@ -36,7 +36,7 @@ To run the tutorial you will need to have installed the following python modules
 - matplotlib
 
 We recommend that you have first followed this tutorial:
-- [Inference using an ONNX model on MXNet Gluon](https://mxnet.apache.org/tutorials/onnx/inference_on_onnx_model.html)
+- [Inference using an ONNX model on MXNet Gluon](/api/python/docs/tutorials/packages/onnx/inference_on_onnx_model.html)
 
 
 ```python
diff --git a/docs/python_docs/python/tutorials/packages/viz/index.rst b/docs/python_docs/python/tutorials/packages/viz/index.rst
index c9254a983824..367c8ecc67fb 100644
--- a/docs/python_docs/python/tutorials/packages/viz/index.rst
+++ b/docs/python_docs/python/tutorials/packages/viz/index.rst
@@ -29,7 +29,7 @@ Visualization
 References
 ----------
 
-- `mxnet.viz <../api/symbol-related/mxnet.visualization.html>`_
+- `mxnet.viz </api/python/docs/api/mxnet/visualization/index.html>`_
 
 .. toctree::
    :hidden:
diff --git a/docs/python_docs/python/tutorials/performance/backend/mkldnn/mkldnn_quantization.md b/docs/python_docs/python/tutorials/performance/backend/mkldnn/mkldnn_quantization.md
index da442e9fb42f..8c15af267cd4 100644
--- a/docs/python_docs/python/tutorials/performance/backend/mkldnn/mkldnn_quantization.md
+++ b/docs/python_docs/python/tutorials/performance/backend/mkldnn/mkldnn_quantization.md
@@ -23,7 +23,7 @@ If you are not familiar with Apache/MXNet quantization flow, please reference [q
 
 ## Installation and Prerequisites
 
-Installing MXNet with MKLDNN backend is an easy and essential process. You can follow [How to build and install MXNet with MKL-DNN backend](https://mxnet.apache.org/tutorials/mkldnn/MKLDNN_README.html) to build and install MXNet from source. Also, you can install the release or nightly version via PyPi and pip directly by running:
+Installing MXNet with the MKL-DNN backend is an easy and essential step. You can follow [How to build and install MXNet with MKL-DNN backend](/api/python/docs/tutorials/performance/backend/mkldnn/mkldnn_readme.html) to build and install MXNet from source. Alternatively, you can install the release or nightly version directly from PyPI with pip by running:
 
 ```
 # release version
@@ -38,7 +38,7 @@ A quantization script [imagenet_gen_qsym_mkldnn.py](https://github.com/apache/in
 
 ## Integrate Quantization Flow to Your Project
 
-Quantization flow works for both symbolic and Gluon models. If you're using Gluon, you can first refer [Saving and Loading Gluon Models](https://mxnet.apache.org/versions/master/tutorials/gluon/save_load_params.html) to hybridize your computation graph and export it as a symbol before running quantization.
+The quantization flow works for both symbolic and Gluon models. If you're using Gluon, you can first refer to [Saving and Loading Gluon Models](/api/python/docs/tutorials/packages/gluon/blocks/save_load_params.html) to hybridize your computation graph and export it as a symbol before running quantization.
 
 In general, the quantization flow includes 4 steps. The user can get the acceptable accuracy from step 1 to 3 with minimum effort. Most of thing in this stage is out-of-box and the data scientists and researchers only need to focus on how to represent data and layers in their model. After a quantized model is generated, you may want to deploy it online and the performance will be the next key point. Thus, step 4, calibration, can improve the performance a lot by reducing lots of runtime calculation.
 
diff --git a/docs/python_docs/python/tutorials/performance/backend/profiler.md b/docs/python_docs/python/tutorials/performance/backend/profiler.md
index 6969517cbd58..f90d5ba9559e 100644
--- a/docs/python_docs/python/tutorials/performance/backend/profiler.md
+++ b/docs/python_docs/python/tutorials/performance/backend/profiler.md
@@ -212,7 +212,7 @@ Let's zoom in to check the time taken by operators
 The above picture visualizes the sequence in which the operators were executed and the time taken by each operator.
 
 ### Profiling Custom Operators
-Should the existing NDArray operators fail to meet all your model's needs, MXNet supports [Custom Operators](https://mxnet.apache.org/versions/master/tutorials/gluon/customop.html) that you can define in Python. In `forward()` and `backward()` of a custom operator, there are two kinds of code: "pure Python" code (NumPy operators included) and "sub-operators" (NDArray operators called within `forward()` and `backward()`). With that said, MXNet can profile the execution time of both kinds without additional setup. Specifically, the MXNet profiler will break a single custom operator call into a pure Python event and several sub-operator events if there are any. Furthermore, all of those events will have a prefix in their names, which is, conveniently, the name of the custom operator you called.
+Should the existing NDArray operators fail to meet all your model's needs, MXNet supports [Custom Operators](/api/python/docs/tutorials/extend/customop.html) that you can define in Python. In `forward()` and `backward()` of a custom operator, there are two kinds of code: "pure Python" code (NumPy operators included) and "sub-operators" (NDArray operators called within `forward()` and `backward()`). With that said, MXNet can profile the execution time of both kinds without additional setup. Specifically, the MXNet profiler will break a single custom operator call into a pure Python event and several sub-operator events if there are any. Furthermore, all of those events will have a prefix in their names, which is, conveniently, the name of the custom operator you called.
 
 Let's try profiling custom operators with the following code example:
 
diff --git a/docs/python_docs/python/tutorials/performance/backend/tensorrt/tensorrt.md b/docs/python_docs/python/tutorials/performance/backend/tensorrt/tensorrt.md
index 8dc19f183729..63dd678f3f5f 100644
--- a/docs/python_docs/python/tutorials/performance/backend/tensorrt/tensorrt.md
+++ b/docs/python_docs/python/tutorials/performance/backend/tensorrt/tensorrt.md
@@ -39,7 +39,7 @@ nvidia-docker run -ti mxnet/tensorrt python
 
 ## Sample Models
 ### Resnet 18
-TensorRT is an inference only library, so for the purposes of this blog post we will be using a pre-trained network, in this case a Resnet 18.  Resnets are a computationally intensive model architecture that are often used as a backbone for various computer vision tasks. Resnets are also commonly used as a reference for benchmarking deep learning library performance.  In this section we'll use a pretrained Resnet 18 from the [Gluon Model Zoo](https://mxnet.apache.org/versions/master/api/python/gluon/model_zoo.html) and compare its inference speed with TensorRT using MXNet with TensorRT integration turned off as a baseline.
+TensorRT is an inference only library, so for the purposes of this blog post we will be using a pre-trained network, in this case a Resnet 18.  Resnets are a computationally intensive model architecture that are often used as a backbone for various computer vision tasks. Resnets are also commonly used as a reference for benchmarking deep learning library performance.  In this section we'll use a pretrained Resnet 18 from the [Gluon Model Zoo](/api/python/docs/api/gluon/model_zoo/index.html) and compare its inference speed with TensorRT using MXNet with TensorRT integration turned off as a baseline.
 
 ## Model Initialization
 ```python
@@ -128,7 +128,7 @@ This means that when an MXNet computation graph is constructed, it will be parse
 
 During this process MXNet will take care of passing along the input to the node and fetching the results.  MXNet will also attempt to remove any duplicated weights (parameters) during the graph initialization to keep memory usage low.  That is, if there are graph weights that are used only in the TensorRT sections of the graph, they will be removed from the MXNet set of parameters, and their memory will be freed.
 
-The examples below shows a Gluon implementation of a Wavenet before and after a TensorRT graph pass. You can see that for this network TensorRT supports a subset of the operators involved. This makes it an interesting example to visualize, as several subgraphs are extracted and replaced with special TensorRT nodes. The Resnet used as an example above would be less interesting to visualization. The entire Resnet graph is supported by TensorRT, and hence the optimized graph would be a single TensorRT node.  If your browser is unable to render svg files you can view the graphs in png format: [unoptimized](_static/tutorials/tensorrt/wavenet_unoptimized.png) and [optimized](_static/tutorials/tensorrt/wavenet_optimized.png).
+The examples below show a Gluon implementation of a Wavenet before and after a TensorRT graph pass. You can see that for this network TensorRT supports a subset of the operators involved. This makes it an interesting example to visualize, as several subgraphs are extracted and replaced with special TensorRT nodes. The Resnet used as an example above would be less interesting to visualize. The entire Resnet graph is supported by TensorRT, and hence the optimized graph would be a single TensorRT node.  If your browser is unable to render svg files you can view the graphs in png format: [unoptimized](wavenet_unoptimized.svg) and [optimized](wavenet_optimized.svg).
 
 ## Before
 ![before](wavenet_unoptimized.svg)
diff --git a/docs/static_site/src/pages/api/api.html b/docs/static_site/src/pages/api/api.html
index a1f4ae140701..824756898606 100644
--- a/docs/static_site/src/pages/api/api.html
+++ b/docs/static_site/src/pages/api/api.html
@@ -52,7 +52,7 @@
 - title: Julia
   guide_link: /api/julia
   api_link: /api/julia/docs/api
-  tutorial_link: https://github.com/apache/incubator-mxnet/tree/master/julia/examples
+  tutorial_link: https://mxnet.incubator.apache.org/api/julia/docs/api/#tutorials
   description:
   icon: /assets/img/julia_logo.svg
   tag: julia
diff --git a/docs/static_site/src/pages/api/cpp/docs/tutorials/mxnet_cpp_inference_tutorial.md b/docs/static_site/src/pages/api/cpp/docs/tutorials/mxnet_cpp_inference_tutorial.md
index 0d96817560d0..6d9998d7a7a9 100644
--- a/docs/static_site/src/pages/api/cpp/docs/tutorials/mxnet_cpp_inference_tutorial.md
+++ b/docs/static_site/src/pages/api/cpp/docs/tutorials/mxnet_cpp_inference_tutorial.md
@@ -28,23 +28,23 @@ tag: cpp
 
 ## Overview
 MXNet provides various useful tools and interfaces for deploying your model for inference. For example, you can use [MXNet Model Server](https://github.com/awslabs/mxnet-model-server) to start a service and host your trained model easily.
-Besides that, you can also use MXNet's different language APIs to integrate your model with your existing service. We provide [Python]({{'/api/python/docs/api/symbol-related/mxnet.module'|relative_url}}),    [Java]({{'/api/java/docs/api'|relative_url}}), [Scala]({{'/api/scala/docs/api'|relative_url}}), and [C++]({{'/api/cpp/docs/api'|relative_url}}) APIs.
+Besides that, you can also use MXNet's different language APIs to integrate your model with your existing service. We provide [Python](/api/python/docs/api/), [Java](/api/java/docs/api/#package), [Scala](/api/scala/docs/api), and [C++](/api/cpp/docs/api/) APIs.
 We will focus on the MXNet C++ API. We have slightly modified the code in [C++ Inference Example](https://github.com/apache/incubator-mxnet/tree/master/cpp-package/example/inference) for our use case.
 
 ## Prerequisites
 
-To complete this tutorial, you need:
-- Complete the training part of [Gluon end to end tutorial]({{'api/python/docs/tutorials/packages/gluon/image-augmentation.html'|relative_url}})
-- Learn the basics about [MXNet C++ API]({{'/api/cpp'|relative_url}})
+To complete this tutorial, you need to:
+- Complete the training part of [Gluon end to end tutorial](/api/python/docs/tutorials/getting-started/gluon_from_experiment_to_deployment.html)
+- Learn the basics about [MXNet C++ API](/api/cpp)
 
 
 ## Setup the MXNet C++ API
-To use the C++ API in MXNet, you need to build MXNet from source with C++ package. Please follow the [built from source guide]({{'/get_started/ubuntu_setup.html'|relative_url}}), and [C++ Package documentation]({{'/api/cpp'|relative_url}})
+To use the C++ API in MXNet, you need to build MXNet from source with the C++ package. Please follow the [build from source guide](/get_started/ubuntu_setup.html) and the [C++ Package documentation](/api/cpp).
 The summary of those two documents is that you need to build MXNet from source with `USE_CPP_PACKAGE` flag set to 1. For example: `make -j USE_CPP_PACKAGE=1`.
 
 ## Load the model and run inference
 
-After you complete [the previous tutorial]({{'/api/python/docs/tutorials/packages/gluon/gluon_from_experiment_to_deployment.html'|relative_url}}), you will get the following output files:
+After you complete [the previous tutorial](/api/python/docs/tutorials/getting-started/gluon_from_experiment_to_deployment.html), you will get the following output files:
 1. Model Architecture stored in `flower-recognition-symbol.json`
 2. Model parameter values stored in `flower-recognition-0040.params` (`0040` is for 40 epochs we ran)
 3. Label names stored in `synset.txt`
@@ -280,8 +280,8 @@ Then it will predict your image:
 
 Now you can explore more ways to run inference and deploy your models:
 1. [Java Inference examples](https://github.com/apache/incubator-mxnet/tree/master/scala-package/examples/src/main/java/org/apache/mxnetexamples/javaapi/infer)
-2. [Scala Inference examples](/api/scala/docs/tutorials)
-3. [ONNX model inference examples](/api/python/docs/tutorials/deploy/index.html)
+2. [Scala Inference examples](https://github.com/apache/incubator-mxnet/tree/master/scala-package/examples/src/main/scala/org/apache/mxnetexamples/infer)
+3. [ONNX model inference examples](/api/python/docs/tutorials/packages/onnx/inference_on_onnx_model.html)
 4. [MXNet Model Server Examples](https://github.com/awslabs/mxnet-model-server/tree/master/examples)
 
 ## References
diff --git a/docs/static_site/src/pages/api/faq/float16.md b/docs/static_site/src/pages/api/faq/float16.md
index e63bf87ac68f..6ffb04054554 100644
--- a/docs/static_site/src/pages/api/faq/float16.md
+++ b/docs/static_site/src/pages/api/faq/float16.md
@@ -133,7 +133,7 @@ if dtype == 'float16':
 output = mx.sym.SoftmaxOutput(data=net_out, name='softmax')
 ```
 
-If you would like to train ResNet50 model on ImageNet using float16 precision, you can find the full script [here](https://github.com/apache/incubator-mxnet/tree/master/example/image-classificatiIfon/train_imagenet.py)
+If you would like to train a ResNet50 model on ImageNet using float16 precision, you can find the full script [here](https://github.com/apache/incubator-mxnet/blob/master/docs/static_site/src/pages/api/faq/float16.md).
 
 If you don't have ImageNet dataset at your disposal, you can still run the script above using synthetic float16 data by providing the following command:
 
diff --git a/docs/static_site/src/pages/api/faq/perf.md b/docs/static_site/src/pages/api/faq/perf.md
index 675304f01241..202a099b324f 100644
--- a/docs/static_site/src/pages/api/faq/perf.md
+++ b/docs/static_site/src/pages/api/faq/perf.md
@@ -64,7 +64,7 @@ Note that _MXNet_ treats all CPUs on a single machine as a single device.
 So whether you specify `cpu(0)` or `cpu()`, _MXNet_ will use all CPU cores on the machine.
 
 ### Scoring results
-The following table shows performance of [MXNet-1.2.0.rc1](https://github.com/apache/incubator-mxnet/releases/download/1.2.0.rc1/apache-mxnet-src-1.2.0.rc1-incubating.tar.gz),
+The following table shows performance of MXNet-1.2.0.rc1,
 namely number of images that can be predicted per second.
 We used [example/image-classification/benchmark_score.py](https://github.com/dmlc/mxnet/blob/master/example/image-classification/benchmark_score.py)
 to measure the performance on different AWS EC2 machines.
@@ -151,7 +151,7 @@ and V100 (EC2 p3.2xlarge).
 
 Based on
 [example/image-classification/benchmark_score.py](https://github.com/dmlc/mxnet/blob/master/example/image-classification/benchmark_score.py)
-and  [MXNet-1.2.0.rc1](https://github.com/apache/incubator-mxnet/releases/download/1.2.0.rc1/apache-mxnet-src-1.2.0.rc1-incubating.tar.gz), with cuDNN 7.0.5
+and  MXNet-1.2.0.rc1, with cuDNN 7.0.5
 
 - K80 (single GPU)
 
@@ -214,7 +214,7 @@ Below is the performance result on V100 using float 16.
 
 Based on
 [example/image-classification/train_imagenet.py](https://github.com/dmlc/mxnet/blob/master/example/image-classification/train_imagenet.py)
-and  [MXNet-1.2.0.rc1](https://github.com/apache/incubator-mxnet/releases/download/1.2.0.rc1/apache-mxnet-src-1.2.0.rc1-incubating.tar.gz), with CUDNN 7.0.5. The benchmark script is available at
+and MXNet-1.2.0.rc1, with cuDNN 7.0.5. The benchmark script is available
 [here](https://github.com/mli/mxnet-benchmark/blob/master/run_vary_batch.sh),
 where the batch size for Alexnet is increased by 16x.
 
diff --git a/docs/static_site/src/pages/get_started/build_from_source.md b/docs/static_site/src/pages/get_started/build_from_source.md
index 20a4542461c4..1dfa95a82ade 100644
--- a/docs/static_site/src/pages/get_started/build_from_source.md
+++ b/docs/static_site/src/pages/get_started/build_from_source.md
@@ -50,7 +50,7 @@ Building from source follows this general two-step flow of building the shared l
             * [non-Intel CPUs](#recommended-for-Systems-with-non-Intel-CPUs)
 2. [Install the language API binding(s)](#installing-mxnet-language-bindings) you would like to use for MXNet.
 MXNet's newest and most popular API is Gluon. Gluon is built into the Python binding. If Python isn't your preference, you still have more options. MXNet supports several other language APIs:
-    - [Python (includes Gluon)]({{'/api/python/index'|relative_url}})
+    - [Python (includes Gluon)]({{'/api/python/docs/api/index.html'|relative_url}})
     - [C++]({{'/api/cpp'|relative_url}})
     - [Clojure]({{'/api/clojure'|relative_url}})
     - [Java]({{'/api/java'|relative_url}})
diff --git a/include/mxnet/c_api.h b/include/mxnet/c_api.h
index ac0c6726f2c7..2463a5b75cfd 100644
--- a/include/mxnet/c_api.h
+++ b/include/mxnet/c_api.h
@@ -2255,6 +2255,44 @@ MXNET_DLL int MXExecutorSimpleBindEx(SymbolHandle symbol_handle,
                                      NDArrayHandle** aux_states,
                                      ExecutorHandle shared_exec_handle,
                                      ExecutorHandle* out);
+
+
+MXNET_DLL int MXExecutorSimpleBindEx64(SymbolHandle symbol_handle,
+                                     int dev_type,
+                                     int dev_id,
+                                     const uint32_t num_g2c_keys,
+                                     const char** g2c_keys,
+                                     const int* g2c_dev_types,
+                                     const int* g2c_dev_ids,
+                                     const uint32_t provided_grad_req_list_len,
+                                     const char** provided_grad_req_names,
+                                     const char** provided_grad_req_types,
+                                     const uint32_t num_provided_arg_shapes,
+                                     const char** provided_arg_shape_names,
+                                     const int64_t* provided_arg_shape_data,
+                                     const uint32_t* provided_arg_shape_idx,
+                                     const uint32_t num_provided_arg_dtypes,
+                                     const char** provided_arg_dtype_names,
+                                     const int* provided_arg_dtypes,
+                                     const uint32_t num_provided_arg_stypes,
+                                     const char** provided_arg_stype_names,
+                                     const int* provided_arg_stypes,
+                                     const uint32_t num_shared_arg_names,
+                                     const char** shared_arg_name_list,
+                                     int* shared_buffer_len,
+                                     const char** shared_buffer_name_list,
+                                     NDArrayHandle* shared_buffer_handle_list,
+                                     const char*** updated_shared_buffer_name_list,
+                                     NDArrayHandle** updated_shared_buffer_handle_list,
+                                     uint32_t* num_in_args,
+                                     NDArrayHandle** in_args,
+                                     NDArrayHandle** arg_grads,
+                                     uint32_t* num_aux_states,
+                                     NDArrayHandle** aux_states,
+                                     ExecutorHandle shared_exec_handle,
+                                     ExecutorHandle* out);
+
+
 /*!
  * \brief DEPRECATED. Use MXExecutorReshapeEx instead.
  * Return a new executor with the same symbol and shared memory,
diff --git a/julia/docs/src/api/io.md b/julia/docs/src/api/io.md
index 34ad3c42bce7..52d172010af4 100644
--- a/julia/docs/src/api/io.md
+++ b/julia/docs/src/api/io.md
@@ -54,7 +54,7 @@ end
 By default, `eachbatch` simply returns the provider itself, so the iterator interface
 is implemented on the provider type itself. But the extra layer of abstraction allows us to
 implement a data provider easily via a Julia `Task` coroutine. See the
-data provider defined in [the char-lstm example](tutorial/char-lstm) for an example of using coroutine to define data
+data provider defined in [the char-lstm example](/api/julia/docs/api/tutorial/char-lstm/) for an example of using a coroutine to define data
 providers.
 
 The detailed interface functions for the iterator API is listed below:
diff --git a/julia/docs/src/tutorial/char-lstm.md b/julia/docs/src/tutorial/char-lstm.md
index ab7e9352b5ab..1109f3554c17 100644
--- a/julia/docs/src/tutorial/char-lstm.md
+++ b/julia/docs/src/tutorial/char-lstm.md
@@ -38,7 +38,7 @@ network models directly.
 
 The most important code snippets of this example is shown and explained
 here. To see and run the complete code, please refer to the
-[examples/char-lstm](https://github.com/dmlc/MXNet.jl/tree/master/examples/char-lstm)
+[examples/char-lstm](https://github.com/apache/incubator-mxnet/blob/master/julia/docs/src/tutorial/char-lstm.md)
 directory. You will need to install
 [Iterators.jl](https://github.com/JuliaLang/Iterators.jl) and
 [StatsBase.jl](https://github.com/JuliaStats/StatsBase.jl) to run this
diff --git a/julia/docs/src/tutorial/mnist.md b/julia/docs/src/tutorial/mnist.md
index a404f75efe12..942752364526 100644
--- a/julia/docs/src/tutorial/mnist.md
+++ b/julia/docs/src/tutorial/mnist.md
@@ -23,7 +23,7 @@ multi-layer perceptron and then a convolutional neural network (the
 LeNet architecture) on the [MNIST handwritten digit
 dataset](http://yann.lecun.com/exdb/mnist/). The code for this tutorial
 could be found in
-[examples/mnist](/api/julia/docs/api/tutorial/mnist/).  There are also two Jupyter notebooks that expand a little more on the [MLP](https://github.com/ultradian/julia_notebooks/blob/master/mxnet/mnistMLP.ipynb) and the [LeNet](https://github.com/ultradian/julia_notebooks/blob/master/mxnet/mnistLenet.ipynb), using the more general `ArrayDataProvider`. 
+[examples/mnist](https://github.com/apache/incubator-mxnet/tree/master/julia/examples/mnist).  There are also two Jupyter notebooks that expand a little more on the [MLP](https://github.com/ultradian/julia_notebooks/blob/master/mxnet/mnistMLP.ipynb) and the [LeNet](https://github.com/ultradian/julia_notebooks/blob/master/mxnet/mnistLenet.ipynb), using the more general `ArrayDataProvider`. 
 
 Simple 3-layer MLP
 ------------------
@@ -36,7 +36,7 @@ using MXNet
 ```
 
 to load the `MXNet` module. Then we are ready to define the network
-architecture via the [symbolic API](../user-guide/overview.md). We start
+architecture via the [symbolic API](/api/julia/docs/api/user-guide/overview/). We start
 with a placeholder `data` symbol,
 
 ```julia
diff --git a/perl-package/AI-MXNet/lib/AI/MXNet/NDArray.pm b/perl-package/AI-MXNet/lib/AI/MXNet/NDArray.pm
index f75cc84b2a8f..1d968c14a487 100644
--- a/perl-package/AI-MXNet/lib/AI/MXNet/NDArray.pm
+++ b/perl-package/AI-MXNet/lib/AI/MXNet/NDArray.pm
@@ -116,7 +116,11 @@ method STORABLE_thaw($cloning, $buf, $writable)
 
 method split_array(@args)
 {
-     $self->shape->[0] > 1 ? $self->split(num_outputs => $self->shape->[0], squeeze_axis => @{ $self->shape } > 1 ? 1 : 0, axis => 0) : [$self];
+    my $shape = $self->shape;
+    return [] if $shape->[0] == 0;
+    my $list = $self->split(num_outputs=>$shape->[0],
+        squeeze_axis=>int(@$shape > 1), axis=>0);
+    $shape->[0] == 1 ? [ $list ] : $list;
 }
 
 method at(Index @indices)
diff --git a/perl-package/AI-MXNet/t/test_ndarray.t b/perl-package/AI-MXNet/t/test_ndarray.t
index a6cd113c3f89..1e290b4bc715 100644
--- a/perl-package/AI-MXNet/t/test_ndarray.t
+++ b/perl-package/AI-MXNet/t/test_ndarray.t
@@ -19,7 +19,7 @@ use strict;
 use warnings;
 use AI::MXNet qw(mx);
 use AI::MXNet::TestUtils qw(almost_equal same rand_ndarray randint zip);
-use Test::More tests => 251;
+use Test::More tests => 261;
 use PDL;
 use File::Temp qw(tempdir);
 use IO::File;
@@ -217,6 +217,22 @@ sub test_histogram
     ok(same($bins->aspdl, pdl([10, 20, 30, 60])));
 }
 
+sub test_array_overload
+{
+    # array conversions are largely calls to mx->nd->split(), but have
+    # special cases around dimensions of length 0 and 1.
+    is_deeply([ @{ mx->nd->array(zeros(7, 0)) } ], []);
+    is_deeply(mx->nd->zeros([3, 7])->[0]->shape, [ 7 ]);
+    is_deeply(mx->nd->zeros([2, 7])->[0]->shape, [ 7 ]);
+    is_deeply(mx->nd->zeros([1, 7])->[0]->shape, [ 7 ]);
+    is_deeply(mx->nd->zeros([3, 7, 11])->[0]->shape, [7, 11]);
+    is_deeply(mx->nd->zeros([2, 7, 11])->[0]->shape, [7, 11]);
+    is_deeply(mx->nd->zeros([1, 7, 11])->[0]->shape, [7, 11]);
+    is_deeply(mx->nd->zeros([3, 7, 11, 13])->[0]->shape, [7, 11, 13]);
+    is_deeply(mx->nd->zeros([2, 7, 11, 13])->[0]->shape, [7, 11, 13]);
+    is_deeply(mx->nd->zeros([1, 7, 11, 13])->[0]->shape, [7, 11, 13]);
+}
+
 test_ndarray_slice();
 test_ndarray_reshape();
 test_moveaxis();
@@ -226,3 +242,4 @@ test_linalg_gemm2();
 test_image_to_tensor();
 test_buffer_load();
 test_histogram();
+test_array_overload();
diff --git a/python/mxnet/_numpy_op_doc.py b/python/mxnet/_numpy_op_doc.py
index d9bb378d3049..33158baf10a5 100644
--- a/python/mxnet/_numpy_op_doc.py
+++ b/python/mxnet/_numpy_op_doc.py
@@ -34,6 +34,24 @@ def _np_ones_like(a):
     -------
     out : ndarray
         Array of ones with the same shape and type as `a`.
+
+    Examples
+    --------
+    >>> x = np.arange(6)
+    >>> x = x.reshape((2, 3))
+    >>> x
+    array([[0., 1., 2.],
+           [3., 4., 5.]])
+    >>> np.ones_like(x)
+    array([[1., 1., 1.],
+           [1., 1., 1.]])
+
+    >>> y = np.arange(3, dtype=float)
+    >>> y
+    array([0., 1., 2.], dtype=float64)
+    >>>
+    >>> np.ones_like(y)
+    array([1., 1., 1.], dtype=float64)
     """
     pass
 
@@ -52,6 +70,23 @@ def _np_zeros_like(a):
     -------
     out : ndarray
         Array of zeros with the same shape and type as `a`.
+
+    Examples
+    --------
+    >>> x = np.arange(6)
+    >>> x = x.reshape((2, 3))
+    >>> x
+    array([[0., 1., 2.],
+           [3., 4., 5.]])
+    >>> np.zeros_like(x)
+    array([[0., 0., 0.],
+           [0., 0., 0.]])
+    >>> y = np.arange(3, dtype=float)
+    >>> y
+    array([0., 1., 2.], dtype=float64)
+    >>>
+    >>> np.zeros_like(y)
+    array([0., 0., 0.], dtype=float64)
     """
     pass
 
@@ -477,6 +512,31 @@ def _np_reshape(a, newshape, order='C', out=None):
     See Also
     --------
     ndarray.reshape : Equivalent method.
+
+    Examples
+    --------
+    >>> a = np.arange(6).reshape((3, 2))
+    >>> a
+    array([[0., 1.],
+           [2., 3.],
+           [4., 5.]])
+
+    >>> np.reshape(a, (2, 3)) # C-like index ordering
+    array([[0., 1., 2.],
+           [3., 4., 5.]])
+
+    >>> np.reshape(np.ravel(a), (2, 3)) # equivalent to C ravel then C reshape
+    array([[0., 1., 2.],
+           [3., 4., 5.]])
+
+    >>> a = np.array([[1,2,3], [4,5,6]])
+    >>> np.reshape(a, 6)
+    array([1., 2., 3., 4., 5., 6.])
+
+    >>> np.reshape(a, (3,-1))       # the unspecified value is inferred to be 2
+    array([[1., 2.],
+           [3., 4.],
+           [5., 6.]])
     """
 
 
@@ -961,3 +1021,69 @@ def _np_broadcast_to(array, shape, out=None):
            [1., 2., 3.]])
     """
     pass
+
+
+def _npx_reshape(a, newshape, reverse=False, order='C'):
+    """
+    Gives a new shape to an array without changing its data.
+    Unlike NumPy's ``reshape``, this function always returns a copy
+    of the input array, never a view.
+
+    Parameters
+    ----------
+    a : ndarray
+        Array to be reshaped.
+    newshape : int or tuple of ints
+        The new shape should be compatible with the original shape.
+        If an integer, then the result will be a 1-D array of that length.
+        One shape dimension can be -1. In this case, the value is inferred
+        from the length of the array and remaining dimensions.
+        The special values -2 to -6 manipulate dimensions as follows:
+
+        - -2 copies this dimension from the input to the output shape.
+        - -3 skips the current dimension if and only if its size is one.
+        - -4 copies all remaining input dimensions to the output shape.
+        - -5 uses the product of two consecutive dimensions of the input
+          shape as the output dimension.
+        - -6 splits one dimension of the input into the two dimensions
+          that follow -6 in the new shape.
+
+    reverse : bool, optional
+        If set to true, the special values will be inferred from right to left.
+    order : {'C'}, optional
+        Read the elements of `a` using this index order, and place the
+        elements into the reshaped array using this index order.  'C'
+        means to read / write the elements using C-like index order,
+        with the last axis index changing fastest, back to the first
+        axis index changing slowest. Other order types such as 'F'/'A'
+        may be added in the future.
+
+    Returns
+    -------
+    reshaped_array : ndarray
+        It is always a copy of the original array. This behavior differs
+        from the official NumPy ``reshape`` operator, which may return a view of
+        the original array.
+
+    Examples
+    --------
+    >>> x = np.ones((2, 3, 8))
+    >>> npx.reshape(x, (-2, -2, 2, -1)).shape
+    (2, 3, 2, 4)
+    >>> x = np.ones((8, 3, 3, 3, 4, 4))
+    >>> npx.reshape(x, (-6, 2, -1, -4)).shape
+    (2, 4, 3, 3, 3, 4, 4)
+    >>> x = np.ones((8, 3, 3, 3, 4, 4))
+    >>> npx.reshape(x, (-5, -4)).shape
+    (24, 3, 3, 4, 4)
+    >>> x = np.ones((8, 1, 1, 1, 3))
+    >>> npx.reshape(x, (-2, -3, -3, -3, -2)).shape
+    (8, 3)
+    >>> x = np.ones((8, 3, 3, 3, 3, 8))
+    >>> npx.reshape(x, (-4, -5), reverse=True).shape
+    (8, 3, 3, 3, 24)
+    >>> x = np.ones((8, 3, 2, 4, 8))
+    >>> npx.reshape(x, (-4, -1, 2, -6), reverse=True).shape
+    (8, 3, 2, 4, 4, 2)
+    """
+    pass
diff --git a/python/mxnet/base.py b/python/mxnet/base.py
index cbd9abe9d754..db1fa29ab9b4 100644
--- a/python/mxnet/base.py
+++ b/python/mxnet/base.py
@@ -20,6 +20,7 @@
 """ctypes library of mxnet and helper functions."""
 from __future__ import absolute_import
 
+import re
 import atexit
 import ctypes
 import os
@@ -853,3 +854,5 @@ def _init_np_op_module(root_module_name, np_module_name, mx_module_name, make_op
 
         if hasattr(_np_op_doc, name):
             function.__doc__ = getattr(_np_op_doc, name).__doc__
+        else:
+            function.__doc__ = re.sub('NDArray', 'ndarray', function.__doc__)
diff --git a/python/mxnet/gluon/block.py b/python/mxnet/gluon/block.py
index eff7dd754572..629ff22ec4e0 100644
--- a/python/mxnet/gluon/block.py
+++ b/python/mxnet/gluon/block.py
@@ -24,7 +24,7 @@
 import copy
 import warnings
 import re
-from collections import OrderedDict
+from collections import OrderedDict, defaultdict
 
 from ..base import mx_real_t, MXNetError
 from .. import symbol, ndarray, initializer
@@ -413,7 +413,7 @@ def _collect_params_with_prefix(self, prefix=''):
             ret.update(child._collect_params_with_prefix(prefix + name))
         return ret
 
-    def save_parameters(self, filename):
+    def save_parameters(self, filename, deduplicate=False):
         """Save parameters to file.
 
         Saved parameters can only be loaded with `load_parameters`. Note that this
@@ -424,6 +424,10 @@ def save_parameters(self, filename):
         ----------
         filename : str
             Path to file.
+        deduplicate : bool, default False
+            If True, save shared parameters only once. Otherwise, if a Block
+            contains multiple sub-blocks that share parameters, each of the
+            shared parameters will be separately saved for every sub-block.
 
         References
         ----------
@@ -431,7 +435,17 @@ def save_parameters(self, filename):
         <https://mxnet.apache.org/api/python/docs/tutorials/packages/gluon/blocks/save_load_params.html>`_
         """
         params = self._collect_params_with_prefix()
-        arg_dict = {key : val._reduce() for key, val in params.items()}
+
+        if deduplicate:
+            # Shared parameters are stored only a single time as of MXNet 1.6.
+            # A shared parameter is registered under multiple prefixes returned
+            # by _collect_params_with_prefix. We keep a single prefix and store
+            # only that entry; in load_parameters it is sufficient to set a
+            # shared parameter through one of its prefixes.
+            reverse_params = {v: k for k, v in params.items()}
+            params = {v: k for k, v in reverse_params.items()}
+
+        arg_dict = {key: val._reduce() for key, val in params.items()}
         save_fn = _mx_npx.save if is_np_array() else ndarray.save
         save_fn(filename, arg_dict)
 
@@ -510,15 +524,24 @@ def load_parameters(self, filename, ctx=None, allow_missing=False,
 
         if not any('.' in i for i in loaded.keys()):
             # legacy loading
-            del loaded
+            loaded = None  # This should be changed to `del loaded` when dropping Python 2
             self.collect_params().load(
                 filename, ctx, allow_missing, ignore_extra, self.prefix,
                 cast_dtype=cast_dtype, dtype_source=dtype_source)
             return
 
         if not allow_missing:
-            for name in params.keys():
-                assert name in loaded, \
+            # Shared parameters are stored only a single time as of MXNet 1.6.
+            # We thus retrieve all prefixes (through _collect_params_with_prefix)
+            # under which a shared parameter is registered, and treat a parameter
+            # as missing only if none of its prefixes appears in the loaded
+            # file.
+            params_inv = defaultdict(list)
+            for k, v in params.items():
+                params_inv[v].append(k)
+
+            for name, param in params.items():
+                assert any(p in loaded for p in params_inv[param]), \
                     "Parameter '%s' is missing in file '%s', which contains parameters: %s. " \
                     "Set allow_missing=True to ignore missing parameters."%(
                         name, filename, _brief_print_list(loaded.keys()))
diff --git a/python/mxnet/gluon/parameter.py b/python/mxnet/gluon/parameter.py
index 8800684ad0b4..957dc2cd69b7 100644
--- a/python/mxnet/gluon/parameter.py
+++ b/python/mxnet/gluon/parameter.py
@@ -674,7 +674,8 @@ def __init__(self, **kwargs):
     """
     def __init__(self, name, value):
         if not isinstance(value, ndarray.NDArray):
-            value = ndarray.array(value)
+            array_fn = _mx_np.array if is_np_array() else ndarray.array
+            value = array_fn(value)
         self.value = value
 
         class Init(initializer.Initializer):
diff --git a/python/mxnet/ndarray/numpy/_op.py b/python/mxnet/ndarray/numpy/_op.py
index cf66e29d6205..fdb9694146b5 100644
--- a/python/mxnet/ndarray/numpy/_op.py
+++ b/python/mxnet/ndarray/numpy/_op.py
@@ -34,11 +34,11 @@
            'log1p', 'rint', 'radians', 'reciprocal', 'square', 'negative', 'fix', 'ceil', 'floor',
            'trunc', 'logical_not', 'arcsinh', 'arccosh', 'arctanh', 'tensordot', 'histogram', 'eye',
            'linspace', 'logspace', 'expand_dims', 'tile', 'arange', 'split', 'vsplit', 'concatenate',
-           'stack', 'vstack', 'dstack', 'mean', 'maximum', 'minimum', 'swapaxes', 'clip', 'argmax',
+           'stack', 'vstack', 'dstack', 'mean', 'maximum', 'minimum', 'swapaxes', 'clip', 'argmax', 'argmin',
            'std', 'var', 'indices', 'copysign', 'ravel', 'hanning', 'hamming', 'blackman', 'flip',
            'around', 'hypot', 'rad2deg', 'deg2rad', 'unique', 'lcm', 'tril', 'identity', 'take',
            'ldexp', 'vdot', 'inner', 'outer', 'equal', 'not_equal', 'greater', 'less', 'greater_equal', 'less_equal',
-           'hsplit', 'rot90', 'einsum', 'true_divide']
+           'hsplit', 'rot90', 'einsum', 'true_divide', 'nonzero']
 
 
 @set_module('mxnet.ndarray.numpy')
@@ -3165,8 +3165,6 @@ def clip(a, a_min, a_max, out=None):
 @set_module('mxnet.ndarray.numpy')
 def argmax(a, axis=None, out=None):
     r"""
-    argmax(a, axis=None, out=None)
-
     Returns the indices of the maximum values along an axis.
 
     Parameters
@@ -3234,6 +3232,75 @@ def argmax(a, axis=None, out=None):
     return _npi.argmax(a, axis=axis, keepdims=False, out=out)
 
 
+@set_module('mxnet.ndarray.numpy')
+def argmin(a, axis=None, out=None):
+    r"""
+    Returns the indices of the minimum values along an axis.
+
+    Parameters
+    ----------
+    a : ndarray
+        Input array. Only supports ndarrays of dtype `float16`, `float32`, and `float64`.
+    axis : int, optional
+        By default, the index is into the flattened array, otherwise
+        along the specified axis.
+    out : ndarray or None, optional
+        If provided, the result will be inserted into this array. It should
+        be of the appropriate shape and dtype.
+
+    Returns
+    -------
+    index_array : ndarray of indices whose dtype is same as the input ndarray.
+        Array of indices into the array. It has the same shape as `a.shape`
+        with the dimension along `axis` removed.
+
+    Notes
+    -----
+    In case of multiple occurrences of the minimum values, the indices
+    corresponding to the first occurrence are returned.
+
+    This function differs from the original `numpy.argmin
+    <https://docs.scipy.org/doc/numpy/reference/generated/numpy.argmin.html>`_ in
+    the following aspects:
+
+    - Input type does not support Python native iterables (list, tuple, ...).
+    - Output has dtype that is same as the input ndarray.
+    - ``out`` param: cannot perform auto broadcasting. ``out`` ndarray's shape must be the same as the expected output.
+    - ``out`` param: cannot perform auto type cast. ``out`` ndarray's dtype must be the same as the expected output.
+    - ``out`` param does not support scalar input case.
+
+    Examples
+    --------
+    >>> a = np.arange(6).reshape(2,3) + 10
+    >>> a
+    array([[10., 11., 12.],
+           [13., 14., 15.]])
+    >>> np.argmin(a)
+    array(0.)
+    >>> np.argmin(a, axis=0)
+    array([0., 0., 0.])
+    >>> np.argmin(a, axis=1)
+    array([0., 0.])
+
+    >>> b = np.arange(6)
+    >>> b[2] = 0
+    >>> b
+    array([0., 1., 0., 3., 4., 5.])
+    >>> np.argmin(b)  # Only the first occurrence is returned.
+    array(0.)
+
+    Specify ``out`` ndarray:
+
+    >>> a = np.arange(6).reshape(2,3) + 10
+    >>> b = np.zeros((2,))
+    >>> np.argmin(a, axis=1, out=b)
+    array([0., 0.])
+    >>> b
+    array([0., 0.])
+    """
+    return _npi.argmin(a, axis=axis, keepdims=False, out=out)
+
+
 @set_module('mxnet.ndarray.numpy')
 def mean(a, axis=None, dtype=None, out=None, keepdims=False):  # pylint: disable=arguments-differ
     """
@@ -4761,3 +4828,84 @@ def einsum(*operands, **kwargs):
     subscripts = operands[0]
     operands = operands[1:]
     return _npi.einsum(*operands, subscripts=subscripts, out=out, optimize=int(optimize_arg))
+
+
+@set_module('mxnet.ndarray.numpy')
+def nonzero(a):
+    """
+    Return the indices of the elements that are non-zero.
+
+    Returns a tuple of arrays, one for each dimension of `a`,
+    containing the indices of the non-zero elements in that
+    dimension. The values in `a` are always returned in
+    row-major, C-style order.
+
+    To group the indices by element, rather than dimension, use `argwhere`,
+    which returns a row for each non-zero element.
+
+    Parameters
+    ----------
+    a : ndarray
+        Input array.
+
+    Returns
+    -------
+    tuple_of_arrays : tuple
+        Indices of elements that are non-zero.
+
+    See Also
+    --------
+    ndarray.nonzero :
+        Equivalent ndarray method.
+
+    Notes
+    -----
+    While the nonzero values can be obtained with ``a[nonzero(a)]``, it is
+    recommended to use ``a[a.astype(bool)]`` or ``a[a != 0]`` instead, which
+    will correctly handle 0-d arrays.
+
+    Examples
+    --------
+    >>> x = np.array([[3, 0, 0], [0, 4, 0], [5, 6, 0]])
+    >>> x
+    array([[3, 0, 0],
+           [0, 4, 0],
+           [5, 6, 0]], dtype=int32)
+    >>> np.nonzero(x)
+    (array([0, 1, 2, 2], dtype=int64), array([0, 1, 0, 1], dtype=int64))
+
+    >>> x[np.nonzero(x)]
+    array([3, 4, 5, 6])
+    >>> np.transpose(np.stack(np.nonzero(x)))
+    array([[0, 0],
+           [1, 1],
+           [2, 0],
+           [2, 1]], dtype=int64)
+
+    A common use for ``nonzero`` is to find the indices of an array where
+    a condition is True.  Given an array `a`, the condition ``a > 3`` is a
+    boolean array, and since False is interpreted as 0, ``np.nonzero(a > 3)``
+    yields the indices of `a` where the condition is true.
+
+    >>> a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.int32)
+    >>> a > 3
+    array([[False, False, False],
+           [ True,  True,  True],
+           [ True,  True,  True]])
+    >>> np.nonzero(a > 3)
+    (array([1, 1, 1, 2, 2, 2], dtype=int64), array([0, 1, 2, 0, 1, 2], dtype=int64))
+
+    Using this result to index `a` is equivalent to using the mask directly:
+
+    >>> a[np.nonzero(a > 3)]
+    array([4, 5, 6, 7, 8, 9], dtype=int32)
+    >>> a[a > 3]
+    array([4, 5, 6, 7, 8, 9], dtype=int32)
+
+    ``nonzero`` can also be called as a method of the array.
+
+    >>> (a > 3).nonzero()
+    (array([1, 1, 1, 2, 2, 2], dtype=int64), array([0, 1, 2, 0, 1, 2], dtype=int64))
+    """
+    out = _npi.nonzero(a).transpose()
+    return tuple([out[i] for i in range(len(out))])
diff --git a/python/mxnet/ndarray/numpy/random.py b/python/mxnet/ndarray/numpy/random.py
index 583f56e046f3..9d1a6f9119ee 100644
--- a/python/mxnet/ndarray/numpy/random.py
+++ b/python/mxnet/ndarray/numpy/random.py
@@ -23,11 +23,11 @@
 from ..ndarray import NDArray
 
 
-__all__ = ['randint', 'uniform', 'normal', "choice", "rand"]
+__all__ = ['randint', 'uniform', 'normal', "choice", "rand", "multinomial"]
 
 
 def randint(low, high=None, size=None, dtype=None, ctx=None, out=None):
-    """Return random integers from `low` (inclusive) to `high` (exclusive).
+    r"""Return random integers from `low` (inclusive) to `high` (exclusive).
 
     Return random integers from the "discrete uniform" distribution of
     the specified dtype in the "half-open" interval [`low`, `high`). If
@@ -88,7 +88,7 @@ def randint(low, high=None, size=None, dtype=None, ctx=None, out=None):
 
 
 def uniform(low=0.0, high=1.0, size=None, dtype=None, ctx=None, out=None):
-    """Draw samples from a uniform distribution.
+    r"""Draw samples from a uniform distribution.
 
     Samples are uniformly distributed over the half-open interval
     ``[low, high)`` (includes low, but excludes high).  In other words,
@@ -143,7 +143,7 @@ def uniform(low=0.0, high=1.0, size=None, dtype=None, ctx=None, out=None):
 
 
 def normal(loc=0.0, scale=1.0, size=None, dtype=None, ctx=None, out=None):
-    """Draw random samples from a normal (Gaussian) distribution.
+    r"""Draw random samples from a normal (Gaussian) distribution.
 
     Samples are distributed according to a normal distribution parametrized
     by *loc* (mean) and *scale* (standard deviation).
@@ -194,7 +194,7 @@ def normal(loc=0.0, scale=1.0, size=None, dtype=None, ctx=None, out=None):
 
 
 def multinomial(n, pvals, size=None):
-    """multinomial(n, pvals, size=None)
+    r"""multinomial(n, pvals, size=None)
 
     Draw samples from a multinomial distribution.
 
@@ -246,7 +246,7 @@ def multinomial(n, pvals, size=None):
 
 
 def choice(a, size=None, replace=True, p=None, ctx=None, out=None):
-    """Generates a random sample from a given 1-D array
+    r"""Generates a random sample from a given 1-D array
 
     Parameters
     -----------
diff --git a/python/mxnet/numpy/linalg.py b/python/mxnet/numpy/linalg.py
index 1ca34716d7d4..9ee5156c3bb1 100644
--- a/python/mxnet/numpy/linalg.py
+++ b/python/mxnet/numpy/linalg.py
@@ -54,10 +54,35 @@ def norm(x, ord=None, axis=None, keepdims=False):
     n : float or ndarray
         Norm of the matrix or vector(s).
 
+    Notes
+    -----
+    This operator differs from NumPy in that it always returns a
+    zero-dim tensor in the cases where NumPy would return a Python
+    float.
+
     References
     ----------
     .. [1] G. H. Golub and C. F. Van Loan, *Matrix Computations*,
            Baltimore, MD, Johns Hopkins University Press, 1985, pg. 15
+
+    Examples
+    --------
+    >>> from mxnet.numpy import linalg as LA
+    >>> a = np.arange(9) - 4
+    >>> a
+    array([-4., -3., -2., -1.,  0.,  1.,  2.,  3.,  4.])
+    >>> b = a.reshape((3, 3))
+    >>> b
+    array([[-4., -3., -2.],
+           [-1.,  0.,  1.],
+           [ 2.,  3.,  4.]])
+    >>> LA.norm(a)
+    array(7.745967)
+    >>>
+    >>> LA.norm(b)
+    array(7.745967)
+    >>> LA.norm(b, 'fro')
+    array(7.745967)
     """
     return _mx_nd_np.linalg.norm(x, ord, axis, keepdims)
 
diff --git a/python/mxnet/numpy/multiarray.py b/python/mxnet/numpy/multiarray.py
index 623b5fc482d7..5c9de8194a74 100644
--- a/python/mxnet/numpy/multiarray.py
+++ b/python/mxnet/numpy/multiarray.py
@@ -52,10 +52,10 @@
            'fix', 'ceil', 'floor', 'trunc', 'logical_not', 'arcsinh', 'arccosh', 'arctanh',
            'tensordot', 'histogram', 'eye', 'linspace', 'logspace', 'expand_dims', 'tile', 'arange',
            'split', 'vsplit', 'concatenate', 'stack', 'vstack', 'dstack', 'mean', 'maximum', 'minimum',
-           'swapaxes', 'clip', 'argmax', 'std', 'var', 'indices', 'copysign', 'ravel', 'hanning', 'hamming',
+           'swapaxes', 'clip', 'argmax', 'argmin', 'std', 'var', 'indices', 'copysign', 'ravel', 'hanning', 'hamming',
            'blackman', 'flip', 'around', 'arctan2', 'hypot', 'rad2deg', 'deg2rad', 'unique', 'lcm', 'tril',
            'identity', 'take', 'ldexp', 'vdot', 'inner', 'outer', 'equal', 'not_equal', 'greater', 'less',
-           'greater_equal', 'less_equal', 'hsplit', 'rot90', 'einsum', 'true_divide']
+           'greater_equal', 'less_equal', 'hsplit', 'rot90', 'einsum', 'true_divide', 'nonzero']
 
 # Return code for dispatching indexing function call
 _NDARRAY_UNSUPPORTED_INDEXING = -1
@@ -478,7 +478,7 @@ def __getitem__(self, key):
             for i in range(key_ndim):
                 if key_shape[i] != shape[i]:
                     raise IndexError('boolean index did not match indexed array along dimension {};'
-                                     'dimension is {} but corresponding boolean dimension is {}'
+                                     ' dimension is {} but corresponding boolean dimension is {}'
                                      .format(i, shape[i], key_shape[i]))
             remaining_dims = shape[key_ndim:]
             data = _reshape_view(self, -1, *remaining_dims)
@@ -831,6 +831,17 @@ def item(self, *args):
         # TODO(junwu): no need to call asnumpy() on the whole array.
         return self.asnumpy().item(*args)
 
+    def nonzero(self):
+        """Return the indices of the elements that are non-zero.
+
+        Refer to `numpy.nonzero` for full documentation.
+
+        See Also
+        --------
+        numpy.nonzero : equivalent function
+        """
+        return nonzero(self)
+
     @property
     # pylint: disable= invalid-name, undefined-variable
     def T(self):
@@ -1369,13 +1380,10 @@ def argmax_channel(self, *args, **kwargs):
         """
         raise AttributeError('mxnet.numpy.ndarray object has no attribute argmax_channel')
 
-    def argmin(self, *args, **kwargs):
-        """Convenience fluent method for :py:func:`argmin`.
-
-        The arguments are the same as for :py:func:`argmin`, with
-        this array as data.
-        """
-        raise NotImplementedError
+    def argmin(self, axis=None, out=None):  # pylint: disable=arguments-differ
+        """Return indices of the minimum values along the given axis.
+        Refer to `mxnet.numpy.argmin` for full documentation."""
+        return argmin(self, axis, out)
 
     def clip(self, min=None, max=None, out=None):  # pylint: disable=arguments-differ
         """Return an array whose values are limited to [min, max].
@@ -1925,6 +1933,16 @@ def empty(shape, dtype=_np.float32, order='C', ctx=None):
     -------
     out : ndarray
         Array of uninitialized (arbitrary) data of the given shape, dtype, and order.
+
+    Examples
+    --------
+    >>> np.empty([2, 2])
+    array([[ 0.000000e+00, -2.524355e-29],
+           [          nan, -8.592023e+09]])  # uninitialized
+
+    >>> np.empty([2, 2], dtype=int)
+    array([[8751743591039004782, 3196766424264760104],
+           [7583328881310196768,     562950123910254]], dtype=int64)  # uninitialized
     """
     if order != 'C':
         raise NotImplementedError('`empty` only supports order equal to `C`, while received {}'
@@ -1958,6 +1976,19 @@ def array(object, dtype=None, ctx=None):
     -------
     out : ndarray
         An array object satisfying the specified requirements.
+
+    Examples
+    --------
+    >>> np.array([1, 2, 3])
+    array([1., 2., 3.])
+
+    >>> np.array([[1, 2], [3, 4]])
+    array([[1., 2.],
+           [3., 4.]])
+
+    >>> np.array([[1, 0], [0, 1]], dtype=bool)
+    array([[ True, False],
+           [False,  True]])
     """
     if ctx is None:
         ctx = current_context()
@@ -2003,6 +2034,18 @@ def zeros(shape, dtype=_np.float32, order='C', ctx=None):
     -------
     out : ndarray
         Array of zeros with the given shape, dtype, and ctx.
+
+    Examples
+    --------
+    >>> np.zeros(5)
+    array([0., 0., 0., 0., 0.])
+
+    >>> np.zeros((5,), dtype=int)
+    array([0, 0, 0, 0, 0], dtype=int64)
+
+    >>> np.zeros((2, 1))
+    array([[0.],
+           [0.]])
     """
     return _mx_nd_np.zeros(shape, dtype, order, ctx)
 
@@ -2032,6 +2075,23 @@ def ones(shape, dtype=_np.float32, order='C', ctx=None):
     -------
     out : ndarray
         Array of ones with the given shape, dtype, and ctx.
+
+    Examples
+    --------
+    >>> np.ones(5)
+    array([1., 1., 1., 1., 1.])
+
+    >>> np.ones((5,), dtype=int)
+    array([1, 1, 1, 1, 1], dtype=int64)
+
+    >>> np.ones((2, 1))
+    array([[1.],
+           [1.]])
+
+    >>> s = (2,2)
+    >>> np.ones(s)
+    array([[1., 1.],
+           [1., 1.]])
     """
     return _mx_nd_np.ones(shape, dtype, order, ctx)
 
@@ -2332,6 +2392,18 @@ def add(x1, x2, out=None, **kwargs):
     -------
     add : ndarray or scalar
         The sum of x1 and x2, element-wise. This is a scalar if both x1 and x2 are scalars.
+
+    Examples
+    --------
+    >>> np.add(1.0, 4.0)
+    5.0
+    >>>
+    >>> x1 = np.arange(9.0).reshape((3, 3))
+    >>> x2 = np.arange(3.0)
+    >>> np.add(x1, x2)
+    array([[ 0.,  2.,  4.],
+           [ 3.,  5.,  7.],
+           [ 6.,  8., 10.]])
     """
     return _mx_nd_np.add(x1, x2, out)
 
@@ -2358,6 +2430,17 @@ def subtract(x1, x2, out=None, **kwargs):
     -------
     subtract : ndarray or scalar
         The difference of x1 and x2, element-wise. This is a scalar if both x1 and x2 are scalars.
+
+    Examples
+    --------
+    >>> np.subtract(1.0, 4.0)
+    -3.0
+    >>> x1 = np.arange(9.0).reshape((3, 3))
+    >>> x2 = np.arange(3.0)
+    >>> np.subtract(x1, x2)
+    array([[0., 0., 0.],
+           [3., 3., 3.],
+           [6., 6., 6.]])
     """
     return _mx_nd_np.subtract(x1, x2, out)
 
@@ -2383,6 +2466,17 @@ def multiply(x1, x2, out=None, **kwargs):
     -------
     out : ndarray or scalar
         The difference of x1 and x2, element-wise. This is a scalar if both x1 and x2 are scalars.
+
+    Examples
+    --------
+    >>> np.multiply(2.0, 4.0)
+    8.0
+    >>> x1 = np.arange(9.0).reshape((3, 3))
+    >>> x2 = np.arange(3.0)
+    >>> np.multiply(x1, x2)
+    array([[ 0.,  1.,  4.],
+           [ 0.,  4., 10.],
+           [ 0.,  7., 16.]])
     """
     return _mx_nd_np.multiply(x1, x2, out)
 
@@ -2410,6 +2504,11 @@ def divide(x1, x2, out=None, **kwargs):
     -------
     out : ndarray or scalar
         This is a scalar if both x1 and x2 are scalars.
+
+    Examples
+    --------
+    >>> np.divide(np.arange(5), 4)
+    array([0.  , 0.25, 0.5 , 0.75, 1.  ])
     """
     return _mx_nd_np.divide(x1, x2, out=out)
 
@@ -2439,6 +2538,12 @@ def true_divide(x1, x2, out=None):
     -------
     out : ndarray or scalar
         This is a scalar if both x1 and x2 are scalars.
+
+    Examples
+    --------
+    >>> x = np.arange(5)
+    >>> np.true_divide(x, 4)
+    array([0.  , 0.25, 0.5 , 0.75, 1.  ])
     """
     return _mx_nd_np.true_divide(x1, x2, out=out)
 
@@ -2466,6 +2571,11 @@ def mod(x1, x2, out=None, **kwargs):
     -------
     out : ndarray or scalar
         This is a scalar if both x1 and x2 are scalars.
+
+    Examples
+    --------
+    >>> np.mod(np.arange(7), 5)
+    array([0., 1., 2., 3., 4., 0., 1.])
     """
     return _mx_nd_np.mod(x1, x2, out=out)
 
@@ -2493,6 +2603,11 @@ def remainder(x1, x2, out=None, **kwargs):
     -------
     out : ndarray or scalar
         This is a scalar if both x1 and x2 are scalars.
+
+    Examples
+    --------
+    >>> np.remainder(np.arange(7), 5)
+    array([0., 1., 2., 3., 4., 0., 1.])
     """
     return _mx_nd_np.remainder(x1, x2, out=out)
 
@@ -2521,6 +2636,29 @@ def power(x1, x2, out=None, **kwargs):
     out : ndarray or scalar
         The bases in x1 raised to the exponents in x2.
         This is a scalar if both x1 and x2 are scalars.
+
+    Examples
+    --------
+    >>> x1 = np.arange(6)
+    >>> np.power(x1, 3)
+    array([  0.,   1.,   8.,  27.,  64., 125.])
+
+    Raise the bases to different exponents.
+
+    >>> x2 = np.array([1.0, 2.0, 3.0, 3.0, 2.0, 1.0])
+    >>> np.power(x1, x2)
+    array([ 0.,  1.,  8., 27., 16.,  5.])
+
+    The effect of broadcasting.
+
+    >>> x2 = np.array([[1, 2, 3, 3, 2, 1], [1, 2, 3, 3, 2, 1]])
+    >>> x2
+    array([[1., 2., 3., 3., 2., 1.],
+           [1., 2., 3., 3., 2., 1.]])
+
+    >>> np.power(x1, x2)
+    array([[ 0.,  1.,  8., 27., 16.,  5.],
+           [ 0.,  1.,  8., 27., 16.,  5.]])
     """
     return _mx_nd_np.power(x1, x2, out=out)
 
@@ -3610,7 +3748,7 @@ def negative(x, out=None, **kwargs):
     y : ndarray or scalar
         Returned array or scalar: y = -x. This is a scalar if x is a scalar.
 
-    Examples:
+    Examples
     --------
     >>> np.negative(1)
     -1
@@ -3637,7 +3775,7 @@ def fix(x, out=None, **kwargs):
     y : ndarray or scalar
     Returned array or scalar: y = -x. This is a scalar if x is a scalar.ndarray of floats
 
-    Examples:
+    Examples
     ---------
     >>> np.fix(3.14)
     3
@@ -3667,10 +3805,10 @@ def tan(x, out=None, **kwargs):
     y : ndarray
     The corresponding tangent values. This is a scalar if x is a scalar.
 
-    Examples:
+    Examples
     ---------
-    >>> np.tan(0.5)
-    0.5463024898437905
+    >>> np.tan(np.array([-np.pi, np.pi/2, np.pi]))
+    array([-8.7422777e-08, -2.2877332e+07,  8.7422777e-08])
     """
 
     return _mx_nd_np.tan(x, out=out, **kwargs)
@@ -4044,7 +4182,7 @@ def histogram(a, bins=10, range=None, normed=None, weights=None, density=None):
     ----------
     a : ndarray
         Input data. The histogram is computed over the flattened array.
-    bins : int or NDArray
+    bins : int or ndarray
         If `bins` is an int, it defines the number of equal-width
         bins in the given range (10, by default). If `bins` is a
         sequence, it defines a monotonically increasing array of bin edges,
@@ -4062,6 +4200,11 @@ def histogram(a, bins=10, range=None, normed=None, weights=None, density=None):
         Not supported yet, coming soon.
     density : bool, optional
         Not supported yet, coming soon.
+
+    Examples
+    --------
+    >>> np.histogram(np.arange(4), bins=np.arange(5))
+    [array([1, 1, 1, 1], dtype=int64), array([0., 1., 2., 3., 4.])]
     """
     return _mx_nd_np.histogram(a, bins=bins, range=range, normed=normed, weights=weights, density=density)
 
@@ -4089,6 +4232,16 @@ def eye(N, M=None, k=0, dtype=_np.float32, **kwargs):
     I : ndarray of shape (N,M)
         An array where all elements are equal to zero,
         except for the k-th diagonal, whose values are equal to one.
+
+    Examples
+    --------
+    >>> np.eye(2, dtype=int)
+    array([[1, 0],
+           [0, 1]], dtype=int64)
+    >>> np.eye(3, k=1)
+    array([[0., 1., 0.],
+           [0., 0., 1.],
+           [0., 0., 0.]])
     """
     return _mx_nd_np.eye(N, M, k, dtype, **kwargs)
 
@@ -4274,6 +4427,37 @@ def expand_dims(a, axis):
     res : ndarray
         Output array. The number of dimensions is one greater than that of
         the input array.
+
+    See Also
+    --------
+    squeeze : The inverse operation, removing singleton dimensions
+    reshape : Insert, remove, and combine dimensions, and resize existing ones
+
+    Examples
+    --------
+    >>> x = np.array([1,2])
+    >>> x.shape
+    (2,)
+
+    >>> y = np.expand_dims(x, axis=0)
+    >>> y
+    array([[1., 2.]])
+
+    >>> y.shape
+    (1, 2)
+
+    >>> y = np.expand_dims(x, axis=1)  # Equivalent to x[:,np.newaxis]
+    >>> y
+    array([[1.],
+           [2.]])
+
+    >>> y.shape
+    (2, 1)
+
+    Note that some examples may use None instead of np.newaxis. These are the same objects:
+
+    >>> np.newaxis is None
+    True
     """
     return _npi.expand_dims(a, axis)
 
@@ -4417,6 +4601,20 @@ def arange(start, stop=None, step=1, dtype=None, ctx=None):
         ``ceil((stop - start)/step)``.  Because of floating point overflow,
         this rule may result in the last element of `out` being greater
         than `stop`.
+
+    Examples
+    --------
+    >>> np.arange(3)
+    array([0., 1., 2.])
+
+    >>> np.arange(3.0)
+    array([0., 1., 2.])
+
+    >>> np.arange(3,7)
+    array([3., 4., 5., 6.])
+
+    >>> np.arange(3,7,2)
+    array([3., 5.])
     """
     return _mx_nd_np.arange(start, stop, step, dtype, ctx)
 
@@ -4424,6 +4622,7 @@ def arange(start, stop=None, step=1, dtype=None, ctx=None):
 @set_module('mxnet.numpy')
 def split(ary, indices_or_sections, axis=0):
     """Split an array into multiple sub-arrays.
+
     Parameters
     ----------
     ary : ndarray
@@ -4442,15 +4641,38 @@ def split(ary, indices_or_sections, axis=0):
         an empty sub-array is returned correspondingly.
     axis : int, optional
         The axis along which to split, default is 0.
+
     Returns
     -------
     sub-arrays : list of ndarrays
         A list of sub-arrays.
+
     Raises
     ------
     ValueError
         If `indices_or_sections` is given as an integer, but
-        a split does not result in equal division."""
+        a split does not result in equal division.
+
+    See Also
+    --------
+    hsplit : Split array into multiple sub-arrays horizontally (column-wise).
+    vsplit : Split array into multiple sub-arrays vertically (row-wise).
+    dsplit : Split array into multiple sub-arrays along the 3rd axis (depth).
+    concatenate : Join a sequence of arrays along an existing axis.
+    stack : Join a sequence of arrays along a new axis.
+    hstack : Stack arrays in sequence horizontally (column wise).
+    vstack : Stack arrays in sequence vertically (row wise).
+    dstack : Stack arrays in sequence depth wise (along third dimension).
+
+    Examples
+    --------
+    >>> x = np.arange(9.0)
+    >>> np.split(x, 3)
+    [array([0., 1., 2.]), array([3., 4., 5.]), array([6., 7., 8.])]
+
+    >>> np.split(x, [3, 5, 6, 8])
+    [array([0., 1., 2.]), array([3., 4.]), array([5.]), array([6., 7.]), array([8.])]
+    """
     return _mx_nd_np.split(ary, indices_or_sections, axis=axis)
 
 
@@ -4533,6 +4755,7 @@ def vsplit(ary, indices_or_sections):
 @set_module('mxnet.numpy')
 def concatenate(seq, axis=0, out=None):
     """Join a sequence of arrays along an existing axis.
+
     Parameters
     ----------
     a1, a2, ... : sequence of array_like
@@ -4545,10 +4768,35 @@ def concatenate(seq, axis=0, out=None):
         If provided, the destination to place the result. The shape must be
         correct, matching that of what concatenate would have returned if no
         out argument were specified.
+
     Returns
     -------
     res : ndarray
         The concatenated array.
+
+    See Also
+    --------
+    split : Split array into a list of multiple sub-arrays of equal size.
+    hsplit : Split array into multiple sub-arrays horizontally (column wise).
+    vsplit : Split array into multiple sub-arrays vertically (row wise).
+    dsplit : Split array into multiple sub-arrays along the 3rd axis (depth).
+    stack : Stack a sequence of arrays along a new axis.
+    hstack : Stack arrays in sequence horizontally (column wise).
+    vstack : Stack arrays in sequence vertically (row wise).
+    dstack : Stack arrays in sequence depth wise (along third dimension).
+
+    Examples
+    --------
+    >>> a = np.array([[1, 2], [3, 4]])
+    >>> b = np.array([[5, 6]])
+    >>> np.concatenate((a, b), axis=0)
+    array([[1., 2.],
+           [3., 4.],
+           [5., 6.]])
+
+    >>> np.concatenate((a, b.T), axis=1)
+    array([[1., 2., 5.],
+           [3., 4., 6.]])
     """
     return _mx_nd_np.concatenate(seq, axis=axis, out=out)
 
@@ -4558,6 +4806,7 @@ def stack(arrays, axis=0, out=None):
     """Join a sequence of arrays along a new axis.
         The axis parameter specifies the index of the new axis in the dimensions of the result.
         For example, if `axis=0` it will be the first dimension and if `axis=-1` it will be the last dimension.
+
     Parameters
     ----------
     arrays : sequence of array_like
@@ -4567,10 +4816,40 @@ def stack(arrays, axis=0, out=None):
     out : ndarray, optional
         If provided, the destination to place the result. The shape must be correct,
         matching that of what stack would have returned if no out argument were specified.
+
     Returns
     -------
     stacked : ndarray
-        The stacked array has one more dimension than the input arrays."""
+        The stacked array has one more dimension than the input arrays.
+
+    See Also
+    --------
+    concatenate : Join a sequence of arrays along an existing axis.
+    split : Split array into a list of multiple sub-arrays of equal size.
+
+    Examples
+    --------
+    >>> arrays = [np.random.rand(3, 4) for _ in range(10)]
+    >>> np.stack(arrays, axis=0).shape
+    (10, 3, 4)
+
+    >>> np.stack(arrays, axis=1).shape
+    (3, 10, 4)
+
+    >>> np.stack(arrays, axis=2).shape
+    (3, 4, 10)
+
+    >>> a = np.array([1, 2, 3])
+    >>> b = np.array([2, 3, 4])
+    >>> np.stack((a, b))
+    array([[1., 2., 3.],
+           [2., 3., 4.]])
+
+    >>> np.stack((a, b), axis=-1)
+    array([[1., 2.],
+           [2., 3.],
+           [3., 4.]])
+    """
     return _mx_nd_np.stack(arrays, axis=axis, out=out)
 
 
@@ -4678,7 +4957,17 @@ def maximum(x1, x2, out=None, **kwargs):
     Returns
     -------
     out : mxnet.numpy.ndarray or scalar
-        The maximum of x1 and x2, element-wise. This is a scalar if both x1 and x2 are scalars."""
+        The maximum of x1 and x2, element-wise. This is a scalar if both x1 and x2 are scalars.
+
+    Examples
+    --------
+    >>> np.maximum(np.array([2, 3, 4]), np.array([1, 5, 2]))
+    array([2., 5., 4.])
+
+    >>> np.maximum(np.eye(2), np.array([0.5, 2])) # broadcasting
+    array([[1. , 2. ],
+           [0.5, 2. ]])
+    """
     return _mx_nd_np.maximum(x1, x2, out=out)
 
 
@@ -4697,7 +4986,17 @@ def minimum(x1, x2, out=None, **kwargs):
     Returns
     -------
     out : mxnet.numpy.ndarray or scalar
-        The minimum of x1 and x2, element-wise. This is a scalar if both x1 and x2 are scalars."""
+        The minimum of x1 and x2, element-wise. This is a scalar if both x1 and x2 are scalars.
+
+    Examples
+    --------
+    >>> np.minimum(np.array([2, 3, 4]), np.array([1, 5, 2]))
+    array([1., 3., 2.])
+
+    >>> np.minimum(np.eye(2), np.array([0.5, 2])) # broadcasting
+    array([[0.5, 0. ],
+           [0. , 1. ]])
+    """
     return _mx_nd_np.minimum(x1, x2, out=out)
 
 
@@ -4718,6 +5017,29 @@ def swapaxes(a, axis1, axis2):
     -------
     a_swapped : ndarray
         Swapped array. This is always a copy of the input array.
+
+    Examples
+    --------
+    >>> x = np.array([[1,2,3]])
+    >>> np.swapaxes(x,0,1)
+    array([[1.],
+           [2.],
+           [3.]])
+
+    >>> x = np.array([[[0,1],[2,3]],[[4,5],[6,7]]])
+    >>> x
+    array([[[0., 1.],
+            [2., 3.]],
+
+           [[4., 5.],
+            [6., 7.]]])
+
+    >>> np.swapaxes(x,0,2)
+    array([[[0., 4.],
+            [2., 6.]],
+
+           [[1., 5.],
+            [3., 7.]]])
     """
     return _npi.swapaxes(a, dim1=axis1, dim2=axis2)
 
@@ -4776,8 +5098,6 @@ def clip(a, a_min, a_max, out=None):
 @set_module('mxnet.numpy')
 def argmax(a, axis=None, out=None):
     r"""
-    argmax(a, axis=None, out=None)
-
     Returns the indices of the maximum values along an axis.
 
     Parameters
@@ -4844,13 +5164,82 @@ def argmax(a, axis=None, out=None):
     return _mx_nd_np.argmax(a, axis, out)
 
 
+@set_module('mxnet.numpy')
+def argmin(a, axis=None, out=None):
+    r"""
+    Returns the indices of the minimum values along an axis.
+
+    Parameters
+    ----------
+    a : ndarray
+        Input array. Only supports ndarrays of dtype `float16`, `float32`, and `float64`.
+    axis : int, optional
+        By default, the index is into the flattened array, otherwise
+        along the specified axis.
+    out : ndarray or None, optional
+        If provided, the result will be inserted into this array. It should
+        be of the appropriate shape and dtype.
+
+    Returns
+    -------
+    index_array : ndarray of indices whose dtype is the same as that of the input ndarray.
+        Array of indices into the array. It has the same shape as `a.shape`
+        with the dimension along `axis` removed.
+
+    Notes
+    -----
+    In case of multiple occurrences of the minimum values, the indices
+    corresponding to the first occurrence are returned.
+
+    This function differs from the original `numpy.argmin
+    <https://docs.scipy.org/doc/numpy/reference/generated/numpy.argmin.html>`_ in
+    the following aspects:
+
+    - Input type does not support Python native iterables (list, tuple, ...).
+    - Output has the same dtype as the input ndarray.
+    - ``out`` param: cannot perform auto broadcasting. ``out`` ndarray's shape must be the same as the expected output.
+    - ``out`` param: cannot perform auto type cast. ``out`` ndarray's dtype must be the same as the expected output.
+    - ``out`` param does not support scalar input case.
+
+    Examples
+    --------
+    >>> a = np.arange(6).reshape(2,3) + 10
+    >>> a
+    array([[10., 11., 12.],
+           [13., 14., 15.]])
+    >>> np.argmin(a)
+    array(0.)
+    >>> np.argmin(a, axis=0)
+    array([0., 0., 0.])
+    >>> np.argmin(a, axis=1)
+    array([0., 0.])
+
+    >>> b = np.arange(6)
+    >>> b[2] = 0
+    >>> b
+    array([0., 1., 0., 3., 4., 5.])
+    >>> np.argmin(b)  # Only the first occurrence is returned.
+    array(0.)
+
+    Specify ``out`` ndarray:
+
+    >>> a = np.arange(6).reshape(2,3) + 10
+    >>> b = np.zeros((2,))
+    >>> np.argmin(a, axis=1, out=b)
+    array([0., 0.])
+    >>> b
+    array([0., 0.])
+    """
+    return _mx_nd_np.argmin(a, axis, out)
+
+
 @set_module('mxnet.numpy')
 def mean(a, axis=None, dtype=None, out=None, keepdims=False):  # pylint: disable=arguments-differ
     """
-    mean(a, axis=None, dtype=None, out=None, keepdims=None)
     Compute the arithmetic mean along the specified axis.
     Returns the average of the array elements.
     The average is taken over the flattened array by default, otherwise over the specified axis.
+
     Parameters
     ----------
     a : ndarray
@@ -4872,11 +5261,13 @@ def mean(a, axis=None, dtype=None, out=None, keepdims=False):  # pylint: disable
         If the default value is passed, then keepdims will not be passed through to the mean
         method of sub-classes of ndarray, however any non-default value will be. If the sub-class
         method does not implement keepdims any exceptions will be raised.
+
     Returns
     -------
     m : ndarray, see dtype parameter above
         If out=None, returns a new array containing the mean values,
         otherwise a reference to the output array is returned.
+
     Notes
     -----
     This function differs from the original `numpy.mean
@@ -4884,6 +5275,7 @@ def mean(a, axis=None, dtype=None, out=None, keepdims=False):  # pylint: disable
     the following way(s):
     - only ndarray is accepted as valid input, python iterables or scalar is not supported
     - default data type for integer input is float32
+
     Examples
     --------
     >>> a = np.array([[1, 2], [3, 4]])
@@ -5758,16 +6150,19 @@ def inner(a, b):
     Examples
     --------
     Ordinary inner product for vectors:
+
     >>> a = np.array([1,2,3])
     >>> b = np.array([0,1,0])
     >>> np.inner(a, b)
-    2
+    array(2.)
+
     A multidimensional example:
+
     >>> a = np.arange(24).reshape((2,3,4))
     >>> b = np.arange(4)
     >>> np.inner(a, b)
-    array([[ 14,  38,  62],
-           [ 86, 110, 134]])
+    array([[ 14.,  38.,  62.],
+           [ 86., 110., 134.]])
     """
     return tensordot(a, b, [-1, -1])
 
@@ -5796,6 +6191,7 @@ def outer(a, b):
     -------
     out : (M, N) ndarray
         ``out[i, j] = a[i] * b[j]``
+
     See also
     --------
     inner
@@ -5812,13 +6208,14 @@ def outer(a, b):
     Examples
     --------
     Make a (*very* coarse) grid for computing a Mandelbrot set:
+
     >>> rl = np.outer(np.ones((5,)), np.linspace(-2, 2, 5))
     >>> rl
     array([[-2., -1.,  0.,  1.,  2.],
-        [-2., -1.,  0.,  1.,  2.],
-        [-2., -1.,  0.,  1.,  2.],
-        [-2., -1.,  0.,  1.,  2.],
-        [-2., -1.,  0.,  1.,  2.]])
+           [-2., -1.,  0.,  1.,  2.],
+           [-2., -1.,  0.,  1.,  2.],
+           [-2., -1.,  0.,  1.,  2.],
+           [-2., -1.,  0.,  1.,  2.]])
     """
     return tensordot(a.flatten(), b.flatten(), 0)
 
@@ -5851,12 +6248,13 @@ def vdot(a, b):
     Examples
     --------
     Note that higher-dimensional arrays are flattened!
+
     >>> a = np.array([[1, 4], [5, 6]])
     >>> b = np.array([[4, 1], [2, 2]])
     >>> np.vdot(a, b)
-    30
+    array(30.)
     >>> np.vdot(b, a)
-    30
+    array(30.)
     >>> 1*4 + 4*1 + 5*2 + 6*2
     30
     """
@@ -6060,6 +6458,7 @@ def rot90(m, k=1, axes=(0, 1)):
     """
     Rotate an array by 90 degrees in the plane specified by axes.
     Rotation direction is from the first towards the second axis.
+
     Parameters
     ----------
     m : ndarray
@@ -6075,9 +6474,11 @@ def rot90(m, k=1, axes=(0, 1)):
     y : ndarray
         A rotated view of `m`.
 
+    Notes
     -----
     rot90(m, k=1, axes=(1,0)) is the reverse of rot90(m, k=1, axes=(0,1))
     rot90(m, k=1, axes=(1,0)) is equivalent to rot90(m, k=-1, axes=(0,1))
+
     Examples
     --------
     >>> m = np.array([[1,2],[3,4]], 'int')
@@ -6419,3 +6820,83 @@ def einsum(*operands, **kwargs):
     ...     np.einsum('ijk,ilm,njm,nlk,abc->',a,a,a,a,a, optimize=True)
     """
     return _mx_nd_np.einsum(*operands, **kwargs)
+
+
+@set_module('mxnet.numpy')
+def nonzero(a):
+    """
+    Return the indices of the elements that are non-zero.
+
+    Returns a tuple of arrays, one for each dimension of `a`,
+    containing the indices of the non-zero elements in that
+    dimension. The values in `a` are always returned in
+    row-major, C-style order.
+
+    To group the indices by element, rather than dimension, use `argwhere`,
+    which returns a row for each non-zero element.
+
+    Parameters
+    ----------
+    a : ndarray
+        Input array.
+
+    Returns
+    -------
+    tuple_of_arrays : tuple
+        Indices of elements that are non-zero.
+
+    See Also
+    --------
+    ndarray.nonzero :
+        Equivalent ndarray method.
+
+    Notes
+    -----
+    While the nonzero values can be obtained with ``a[nonzero(a)]``, it is
+    recommended to use ``a[a.astype(bool)]`` or ``a[a != 0]`` instead, which
+    will correctly handle 0-d arrays.
+
+    Examples
+    --------
+    >>> x = np.array([[3, 0, 0], [0, 4, 0], [5, 6, 0]], dtype=np.int32)
+    >>> x
+    array([[3, 0, 0],
+           [0, 4, 0],
+           [5, 6, 0]], dtype=int32)
+    >>> np.nonzero(x)
+    (array([0, 1, 2, 2], dtype=int64), array([0, 1, 0, 1], dtype=int64))
+
+    >>> x[np.nonzero(x)]
+    array([3, 4, 5, 6], dtype=int32)
+    >>> np.transpose(np.stack(np.nonzero(x)))
+    array([[0, 0],
+           [1, 1],
+           [2, 0],
+           [2, 1]], dtype=int64)
+
+    A common use for ``nonzero`` is to find the indices of an array where
+    a condition is True.  Given an array `a`, the condition `a` > 3 is a
+    boolean array and since False is interpreted as 0, np.nonzero(a > 3)
+    yields the indices of `a` where the condition is true.
+
+    >>> a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=np.int32)
+    >>> a > 3
+    array([[False, False, False],
+           [ True,  True,  True],
+           [ True,  True,  True]])
+    >>> np.nonzero(a > 3)
+    (array([1, 1, 1, 2, 2, 2], dtype=int64), array([0, 1, 2, 0, 1, 2], dtype=int64))
+
+    Using this result to index `a` is equivalent to using the mask directly:
+
+    >>> a[np.nonzero(a > 3)]
+    array([4, 5, 6, 7, 8, 9], dtype=int32)
+    >>> a[a > 3]
+    array([4, 5, 6, 7, 8, 9], dtype=int32)
+
+    ``nonzero`` can also be called as a method of the array.
+
+    >>> (a > 3).nonzero()
+    (array([1, 1, 1, 2, 2, 2], dtype=int64), array([0, 1, 2, 0, 1, 2], dtype=int64))
+    """
+    return _mx_nd_np.nonzero(a)
diff --git a/python/mxnet/numpy/random.py b/python/mxnet/numpy/random.py
index d0ae237a5b92..1cad4a55c466 100644
--- a/python/mxnet/numpy/random.py
+++ b/python/mxnet/numpy/random.py
@@ -20,11 +20,11 @@
 from __future__ import absolute_import
 from ..ndarray import numpy as _mx_nd_np
 
-__all__ = ["randint", "uniform", "normal", "choice", "rand"]
+__all__ = ["randint", "uniform", "normal", "choice", "rand", "multinomial"]
 
 
 def randint(low, high=None, size=None, dtype=None, ctx=None, out=None):
-    """Return random integers from `low` (inclusive) to `high` (exclusive).
+    r"""Return random integers from `low` (inclusive) to `high` (exclusive).
 
     Return random integers from the "discrete uniform" distribution of
     the specified dtype in the "half-open" interval [`low`, `high`). If
@@ -76,7 +76,7 @@ def randint(low, high=None, size=None, dtype=None, ctx=None, out=None):
 
 
 def uniform(low=0.0, high=1.0, size=None, dtype=None, ctx=None, out=None):
-    """Draw samples from a uniform distribution.
+    r"""Draw samples from a uniform distribution.
 
     Samples are uniformly distributed over the half-open interval
     ``[low, high)`` (includes low, but excludes high).  In other words,
@@ -95,7 +95,8 @@ def uniform(low=0.0, high=1.0, size=None, dtype=None, ctx=None, out=None):
         Output shape.  If the given shape is, e.g., ``(m, n, k)``, then
         ``m * n * k`` samples are drawn.  If size is ``None`` (default),
         a scalar tensor containing a single value is returned if
-        ``low`` and ``high`` are both scalars.
+        ``low`` and ``high`` are both scalars. Otherwise,
+        ``np.broadcast(low, high).size`` samples are drawn.
     dtype : {'float16', 'float32', 'float64'}, optional
         Data type of output samples. Default is 'float32'
     ctx : Context, optional
@@ -105,12 +106,33 @@ def uniform(low=0.0, high=1.0, size=None, dtype=None, ctx=None, out=None):
     -------
     out : ndarray
         Drawn samples from the parameterized uniform distribution.
+
+    See Also
+    --------
+    randint : Discrete uniform distribution, yielding integers.
+    rand : Convenience function that accepts dimensions as input, e.g.,
+           ``rand(2,2)`` would generate a 2-by-2 array of floats,
+           uniformly distributed over ``[0, 1)``.
+
+    Notes
+    -----
+    The probability density function of the uniform distribution is
+
+    .. math:: p(x) = \frac{1}{b - a}
+
+    anywhere within the interval ``[a, b)``, and zero elsewhere.
+
+    When ``high`` == ``low``, values of ``low`` will be returned.
+    If ``high`` < ``low``, the results are officially undefined
+    and may eventually raise an error, i.e. do not rely on this
+    function to behave when passed arguments satisfying that
+    inequality condition.
     """
     return _mx_nd_np.random.uniform(low, high, size=size, ctx=ctx, dtype=dtype, out=out)
 
 
 def normal(loc=0.0, scale=1.0, size=None, dtype=None, ctx=None, out=None):
-    """Draw random samples from a normal (Gaussian) distribution.
+    r"""Draw random samples from a normal (Gaussian) distribution.
 
     Samples are distributed according to a normal distribution parametrized
     by *loc* (mean) and *scale* (standard deviation).
@@ -125,7 +147,8 @@ def normal(loc=0.0, scale=1.0, size=None, dtype=None, ctx=None, out=None):
     size : int or tuple of ints, optional
         Output shape. If the given shape is, e.g., `(m, n, k)`, then `m * n * k`
         samples are drawn. If size is `None` (default), a scalar tensor containing
-        a single value is returned if loc and scale are both scalars.
+        a single value is returned if loc and scale are both scalars. Otherwise,
+        ``np.broadcast(loc, scale).size`` samples are drawn.
     dtype : {'float16', 'float32', 'float64'}, optional
         Data type of output samples. Default is 'float32'
     ctx : Context, optional
@@ -137,17 +160,53 @@ def normal(loc=0.0, scale=1.0, size=None, dtype=None, ctx=None, out=None):
     -------
     out : ndarray
         Drawn samples from the parameterized normal distribution.
+
+    Notes
+    -----
+    The probability density for the Gaussian distribution is
+
+    .. math:: p(x) = \frac{1}{\sqrt{ 2 \pi \sigma^2 }}
+                     e^{ - \frac{ (x - \mu)^2 } {2 \sigma^2} },
+
+    where :math:`\mu` is the mean and :math:`\sigma` the standard
+    deviation. The square of the standard deviation, :math:`\sigma^2`,
+    is called the variance.
+
+    The function has its peak at the mean, and its "spread" increases with
+    the standard deviation (the function reaches 0.607 times its maximum at
+    :math:`\mu + \sigma` and :math:`\mu - \sigma` [2]_).  This implies that
+    `numpy.random.normal` is more likely to return samples lying close to
+    the mean, rather than those far away.
+
+    References
+    ----------
+    .. [1] Wikipedia, "Normal distribution",
+           https://en.wikipedia.org/wiki/Normal_distribution
+    .. [2] P. R. Peebles Jr., "Central Limit Theorem" in "Probability,
+           Random Variables and Random Signal Principles", 4th ed., 2001,
+           pp. 51, 51, 125.
+
+    Examples
+    --------
+    >>> mu, sigma = 0, 0.1 # mean and standard deviation
+    >>> s = np.random.normal(mu, sigma, 1000)
+
+    Verify the mean:
+
+    >>> np.abs(mu - np.mean(s)) < 0.01
+    array(True)
     """
     return _mx_nd_np.random.normal(loc, scale, size, dtype, ctx, out)
 
 
 def multinomial(n, pvals, size=None, **kwargs):
-    """multinomial(n, pvals, size=None)
+    r"""
     Draw samples from a multinomial distribution.
     The multinomial distribution is a multivariate generalisation of the binomial distribution.
     Take an experiment with one of ``p`` possible outcomes. An example of such an experiment is throwing a dice,
     where the outcome can be 1 through 6. Each sample drawn from the distribution represents n such experiments.
     Its values, ``X_i = [X_0, X_1, ..., X_p]``, represent the number of times the outcome was ``i``.
+
     Parameters
     ----------
     n : int
@@ -157,18 +216,23 @@ def multinomial(n, pvals, size=None, **kwargs):
     size : int or tuple of ints, optional
         Output shape. If the given shape is, e.g., ``(m, n, k)``, then ``m * n * k`` samples
         are drawn. Default is None, in which case a single value is returned.
+
     Returns
     -------
     out : ndarray
         The drawn samples, of shape size, if that was provided. If not, the shape is ``(N,)``.
         In other words, each entry ``out[i,j,...,:]`` is an N-dimensional value drawn from the distribution.
+
     Examples
     --------
     Throw a dice 1000 times, and 1000 times again:
+
     >>> np.random.multinomial(1000, [1/6.]*6, size=2)
     array([[164, 161, 179, 158, 150, 188],
            [178, 162, 177, 143, 163, 177]])
+
     A loaded die is more likely to land on number 6:
+
     >>> np.random.multinomial(100, [1/7.]*5 + [2/7.])
     array([19, 14, 12, 11, 21, 23])
     >>> np.random.multinomial(100, [1.0 / 3, 2.0 / 3])
@@ -178,7 +242,7 @@ def multinomial(n, pvals, size=None, **kwargs):
 
 
 def choice(a, size=None, replace=True, p=None, ctx=None, out=None):
-    """Generates a random sample from a given 1-D array
+    r"""Generates a random sample from a given 1-D array
 
     Parameters
     -----------
diff --git a/python/mxnet/numpy/stride_tricks.py b/python/mxnet/numpy/stride_tricks.py
index 0b2fe523b0f3..b4a4d0a7b44a 100644
--- a/python/mxnet/numpy/stride_tricks.py
+++ b/python/mxnet/numpy/stride_tricks.py
@@ -46,6 +46,15 @@ def broadcast_arrays(*args):
         These arrays are copies of the original arrays unless that all the input
         arrays have the same shape, the input list of arrays are returned
         instead of a list of copies.
+
+    Examples
+    --------
+    >>> x = np.array([[1,2,3]])
+    >>> y = np.array([[4],[5]])
+    >>> np.broadcast_arrays(x, y)
+    [array([[1., 2., 3.],
+           [1., 2., 3.]]), array([[4., 4., 4.],
+           [5., 5., 5.]])]
     """
     shape = _broadcast_shape(*args)
 
diff --git a/python/mxnet/numpy/utils.py b/python/mxnet/numpy/utils.py
index b2335e29855d..c34650a61f31 100644
--- a/python/mxnet/numpy/utils.py
+++ b/python/mxnet/numpy/utils.py
@@ -23,7 +23,7 @@
 import numpy as onp
 
 __all__ = ['float16', 'float32', 'float64', 'uint8', 'int32', 'int8', 'int64',
-           'bool', 'bool_', 'pi', 'inf', 'nan', 'PZERO', 'NZERO']
+           'bool', 'bool_', 'pi', 'inf', 'nan', 'PZERO', 'NZERO', 'newaxis']
 
 float16 = onp.float16
 float32 = onp.float32
@@ -40,3 +40,5 @@
 nan = onp.nan
 PZERO = onp.PZERO
 NZERO = onp.NZERO
+
+newaxis = None
diff --git a/python/mxnet/numpy_dispatch_protocol.py b/python/mxnet/numpy_dispatch_protocol.py
index 6db44fad7780..cec2f245a5e1 100644
--- a/python/mxnet/numpy_dispatch_protocol.py
+++ b/python/mxnet/numpy_dispatch_protocol.py
@@ -83,6 +83,7 @@ def _run_with_array_ufunc_proto(*args, **kwargs):
 
 
 _NUMPY_ARRAY_FUNCTION_LIST = [
+    'argmin',
     'argmax',
     'around',
     'broadcast_arrays',
@@ -99,6 +100,7 @@ def _run_with_array_ufunc_proto(*args, **kwargs):
     'max',
     'mean',
     'min',
+    'nonzero',
     'ones_like',
     'prod',
     'ravel',
diff --git a/python/mxnet/numpy_extension/random.py b/python/mxnet/numpy_extension/random.py
index 5aa58a0cc69d..9c059ca9ade4 100644
--- a/python/mxnet/numpy_extension/random.py
+++ b/python/mxnet/numpy_extension/random.py
@@ -25,7 +25,7 @@
 
 
 def seed(seed, ctx='all'):  # pylint: disable=redefined-outer-name
-    """Seeds the random number generators in MXNet.
+    r"""Seeds the random number generators in MXNet.
 
     This affects the behavior of modules in MXNet that uses random number generators,
     like the dropout operator and `ndarray`'s random sampling operators.
diff --git a/python/mxnet/symbol/numpy/_symbol.py b/python/mxnet/symbol/numpy/_symbol.py
index aa456c8e5166..ddf2feb30b18 100644
--- a/python/mxnet/symbol/numpy/_symbol.py
+++ b/python/mxnet/symbol/numpy/_symbol.py
@@ -36,7 +36,7 @@
            'rint', 'radians', 'reciprocal', 'square', 'negative', 'fix', 'ceil', 'floor',
            'trunc', 'logical_not', 'arcsinh', 'arccosh', 'arctanh', 'tensordot', 'histogram', 'eye',
            'linspace', 'logspace', 'expand_dims', 'tile', 'arange', 'split', 'vsplit', 'concatenate',
-           'stack', 'vstack', 'dstack', 'mean', 'maximum', 'minimum', 'swapaxes', 'clip', 'argmax',
+           'stack', 'vstack', 'dstack', 'mean', 'maximum', 'minimum', 'swapaxes', 'clip', 'argmax', 'argmin',
            'std', 'var', 'indices', 'copysign', 'ravel', 'hanning', 'hamming', 'blackman', 'flip',
            'around', 'hypot', 'rad2deg', 'deg2rad', 'unique', 'lcm', 'tril', 'identity', 'take',
            'ldexp', 'vdot', 'inner', 'outer', 'equal', 'not_equal', 'greater', 'less', 'greater_equal',
@@ -385,13 +385,10 @@ def argmax_channel(self, *args, **kwargs):
         """
         raise AttributeError('_Symbol object has no attribute argmax_channel')
 
-    def argmin(self, *args, **kwargs):
-        """Convenience fluent method for :py:func:`argmin`.
-
-        The arguments are the same as for :py:func:`argmin`, with
-        this array as data.
-        """
-        raise NotImplementedError
+    def argmin(self, axis=None, out=None):  # pylint: disable=arguments-differ
+        """Return indices of the minimum values along the given axis.
+        Refer to `mxnet.numpy.argmin` for full documentation."""
+        return argmin(self, axis, out)
 
     def clip(self, min=None, max=None, out=None):  # pylint: disable=arguments-differ
         """Return an array whose values are limited to [min, max].
@@ -3187,8 +3184,6 @@ def swapaxes(a, axis1, axis2):
 @set_module('mxnet.symbol.numpy')
 def argmax(a, axis=None, out=None):
     r"""
-    argmax(a, axis=None, out=None)
-
     Returns the indices of the maximum values along an axis.
 
     Parameters
@@ -3226,6 +3221,46 @@ def argmax(a, axis=None, out=None):
     return _npi.argmax(a, axis=axis, keepdims=False, out=out)
 
 
+@set_module('mxnet.symbol.numpy')
+def argmin(a, axis=None, out=None):
+    r"""
+    Returns the indices of the minimum values along an axis.
+
+    Parameters
+    ----------
+    a : _Symbol
+        Input array. Only supports dtypes `float16`, `float32`, and `float64`.
+    axis : int, optional
+        By default, the index is into the flattened array, otherwise
+        along the specified axis.
+    out : _Symbol or None, optional
+        Dummy parameter to keep the consistency with the ndarray counterpart.
+
+    Returns
+    -------
+    index_array : _Symbol of indices whose dtype is the same as the input ndarray.
+        Array of indices into the array. It has the same shape as `a.shape`
+        with the dimension along `axis` removed.
+
+    Notes
+    -----
+    In case of multiple occurrences of the minimum values, the indices
+    corresponding to the first occurrence are returned.
+
+    This function differs from the original `numpy.argmin
+    <https://docs.scipy.org/doc/numpy/reference/generated/numpy.argmin.html>`_ in
+    the following aspects:
+
+    - Input type does not support Python native iterables (list, tuple, ...).
+    - Output has the same dtype as the input ndarray.
+    - ``out`` param: cannot perform auto broadcasting. ``out`` symbol's shape must be the same as the expected output.
+    - ``out`` param: cannot perform auto type cast. ``out`` symbol's dtype must be the same as the expected output.
+    - ``out`` param does not support scalar input case.
+
+    """
+    return _npi.argmin(a, axis=axis, keepdims=False, out=out)
+
+
 @set_module('mxnet.symbol.numpy')
 def mean(a, axis=None, dtype=None, out=None, keepdims=False):  # pylint: disable=arguments-differ
     """
diff --git a/python/mxnet/symbol/numpy/random.py b/python/mxnet/symbol/numpy/random.py
index d891ea0c21a0..48bccb64a2b4 100644
--- a/python/mxnet/symbol/numpy/random.py
+++ b/python/mxnet/symbol/numpy/random.py
@@ -25,7 +25,7 @@
 
 
 def randint(low, high=None, size=None, dtype=None, ctx=None, out=None):
-    """Return random integers from `low` (inclusive) to `high` (exclusive).
+    r"""Return random integers from `low` (inclusive) to `high` (exclusive).
 
     Return random integers from the "discrete uniform" distribution of
     the specified dtype in the "half-open" interval [`low`, `high`). If
@@ -113,7 +113,7 @@ def rand(*size, **kwargs):
 
 
 def uniform(low=0.0, high=1.0, size=None, dtype=None, ctx=None, out=None):
-    """Draw samples from a uniform distribution.
+    r"""Draw samples from a uniform distribution.
 
     Samples are uniformly distributed over the half-open interval
     ``[low, high)`` (includes low, but excludes high).  In other words,
@@ -168,7 +168,7 @@ def uniform(low=0.0, high=1.0, size=None, dtype=None, ctx=None, out=None):
 
 
 def normal(loc=0.0, scale=1.0, size=None, dtype=None, ctx=None, out=None):
-    """Draw random samples from a normal (Gaussian) distribution.
+    r"""Draw random samples from a normal (Gaussian) distribution.
 
     Samples are distributed according to a normal distribution parametrized
     by *loc* (mean) and *scale* (standard deviation).
@@ -217,7 +217,7 @@ def normal(loc=0.0, scale=1.0, size=None, dtype=None, ctx=None, out=None):
 
 
 def choice(a, size=None, replace=True, p=None, ctx=None, out=None):
-    """Generates a random sample from a given 1-D array
+    r"""Generates a random sample from a given 1-D array
 
     Parameters
     -----------
diff --git a/python/mxnet/symbol/symbol.py b/python/mxnet/symbol/symbol.py
index b8e8db57188c..6146ab9dc50e 100644
--- a/python/mxnet/symbol/symbol.py
+++ b/python/mxnet/symbol/symbol.py
@@ -1695,42 +1695,80 @@ def simple_bind(self, ctx, grad_req='write', type_dict=None, stype_dict=None,
         aux_state_handles = ctypes.POINTER(NDArrayHandle)()
 
         try:
-            check_call(_LIB.MXExecutorSimpleBindEx(self.handle,
-                                                   ctypes.c_int(ctx.device_typeid),
-                                                   ctypes.c_int(ctx.device_id),
-                                                   num_ctx_map_keys,
-                                                   ctx_map_keys,
-                                                   ctx_map_dev_types,
-                                                   ctx_map_dev_ids,
-                                                   mx_uint(provided_req_type_list_len),
-                                                   provided_grad_req_names,
-                                                   provided_grad_req_types,
-                                                   mx_uint(len(provided_arg_shape_names)),
-                                                   c_str_array(provided_arg_shape_names),
-                                                   c_array_buf(mx_int,
-                                                               array('I', provided_arg_shape_data)),
-                                                   c_array_buf(mx_uint,
-                                                               array('i', provided_arg_shape_idx)),
-                                                   num_provided_arg_types,
-                                                   provided_arg_type_names,
-                                                   provided_arg_type_data,
-                                                   num_provided_arg_stypes,
-                                                   provided_arg_stype_names,
-                                                   provided_arg_stype_data,
-                                                   mx_uint(len(shared_arg_name_list)),
-                                                   c_str_array(shared_arg_name_list),
-                                                   ctypes.byref(shared_buffer_len),
-                                                   shared_buffer_names,
-                                                   shared_buffer_handles,
-                                                   ctypes.byref(updated_shared_buffer_names),
-                                                   ctypes.byref(updated_shared_buffer_handles),
-                                                   ctypes.byref(num_in_args),
-                                                   ctypes.byref(in_arg_handles),
-                                                   ctypes.byref(arg_grad_handles),
-                                                   ctypes.byref(num_aux_states),
-                                                   ctypes.byref(aux_state_handles),
-                                                   shared_exec_handle,
-                                                   ctypes.byref(exe_handle)))
+            if sys.version_info[0] > 2 and _int64_enabled():
+                check_call(_LIB.MXExecutorSimpleBindEx64(self.handle,
+                                                         ctypes.c_int(ctx.device_typeid),
+                                                         ctypes.c_int(ctx.device_id),
+                                                         num_ctx_map_keys,
+                                                         ctx_map_keys,
+                                                         ctx_map_dev_types,
+                                                         ctx_map_dev_ids,
+                                                         mx_uint(provided_req_type_list_len),
+                                                         provided_grad_req_names,
+                                                         provided_grad_req_types,
+                                                         mx_uint(len(provided_arg_shape_names)),
+                                                         c_str_array(provided_arg_shape_names),
+                                                         c_array_buf(mx_int64,
+                                                                     array('q', provided_arg_shape_data)),
+                                                         c_array_buf(mx_uint,
+                                                                     array('i', provided_arg_shape_idx)),
+                                                         num_provided_arg_types,
+                                                         provided_arg_type_names,
+                                                         provided_arg_type_data,
+                                                         num_provided_arg_stypes,
+                                                         provided_arg_stype_names,
+                                                         provided_arg_stype_data,
+                                                         mx_uint(len(shared_arg_name_list)),
+                                                         c_str_array(shared_arg_name_list),
+                                                         ctypes.byref(shared_buffer_len),
+                                                         shared_buffer_names,
+                                                         shared_buffer_handles,
+                                                         ctypes.byref(updated_shared_buffer_names),
+                                                         ctypes.byref(updated_shared_buffer_handles),
+                                                         ctypes.byref(num_in_args),
+                                                         ctypes.byref(in_arg_handles),
+                                                         ctypes.byref(arg_grad_handles),
+                                                         ctypes.byref(num_aux_states),
+                                                         ctypes.byref(aux_state_handles),
+                                                         shared_exec_handle,
+                                                         ctypes.byref(exe_handle)))
+            else:
+                check_call(_LIB.MXExecutorSimpleBindEx(self.handle,
+                                                       ctypes.c_int(ctx.device_typeid),
+                                                       ctypes.c_int(ctx.device_id),
+                                                       num_ctx_map_keys,
+                                                       ctx_map_keys,
+                                                       ctx_map_dev_types,
+                                                       ctx_map_dev_ids,
+                                                       mx_uint(provided_req_type_list_len),
+                                                       provided_grad_req_names,
+                                                       provided_grad_req_types,
+                                                       mx_uint(len(provided_arg_shape_names)),
+                                                       c_str_array(provided_arg_shape_names),
+                                                       c_array_buf(mx_int,
+                                                                   array('I', provided_arg_shape_data)),
+                                                       c_array_buf(mx_uint,
+                                                                   array('i', provided_arg_shape_idx)),
+                                                       num_provided_arg_types,
+                                                       provided_arg_type_names,
+                                                       provided_arg_type_data,
+                                                       num_provided_arg_stypes,
+                                                       provided_arg_stype_names,
+                                                       provided_arg_stype_data,
+                                                       mx_uint(len(shared_arg_name_list)),
+                                                       c_str_array(shared_arg_name_list),
+                                                       ctypes.byref(shared_buffer_len),
+                                                       shared_buffer_names,
+                                                       shared_buffer_handles,
+                                                       ctypes.byref(updated_shared_buffer_names),
+                                                       ctypes.byref(updated_shared_buffer_handles),
+                                                       ctypes.byref(num_in_args),
+                                                       ctypes.byref(in_arg_handles),
+                                                       ctypes.byref(arg_grad_handles),
+                                                       ctypes.byref(num_aux_states),
+                                                       ctypes.byref(aux_state_handles),
+                                                       shared_exec_handle,
+                                                       ctypes.byref(exe_handle)))
         except MXNetError as e:
             error_msg = "simple_bind error. Arguments:\n"
             for k, v in kwargs.items():
diff --git a/python/mxnet/util.py b/python/mxnet/util.py
index 9e15caae9698..3a85e31e7e43 100644
--- a/python/mxnet/util.py
+++ b/python/mxnet/util.py
@@ -690,15 +690,76 @@ def _set_np_array(active):
 
 def set_np(shape=True, array=True):
     """Setting NumPy shape and array semantics at the same time.
-    It is required to keep NumPy shape semantics active when activating NumPy array semantics.
+    It is required to keep NumPy shape semantics active while activating NumPy array semantics.
     Deactivating NumPy shape semantics while NumPy array semantics is still active is not allowed.
+    It is highly recommended to set these two flags to `True` at the same time to fully enable
+    NumPy-like behaviors. Please refer to the Examples section for a better understanding.
 
     Parameters
     ----------
     shape : bool
         A boolean value indicating whether the NumPy-shape semantics should be turned on or off.
+        When this flag is set to `True`, zero-size and zero-dim shapes are all valid shapes in
+        the shape inference process, instead of being treated as unknown shapes as in legacy mode.
     array : bool
         A boolean value indicating whether the NumPy-array semantics should be turned on or off.
+        When this flag is set to `True`, it enables Gluon code flow to use or generate `mxnet.numpy.ndarray`s
+        instead of `mxnet.ndarray.NDArray`. For example, a `Block` would create parameters of type
+        `mxnet.numpy.ndarray`.
+
+    Examples
+    --------
+    >>> import mxnet as mx
+
+    Creating a zero-dim or zero-size ndarray in legacy mode fails at shape inference.
+
+    >>> mx.nd.ones(shape=())
+    mxnet.base.MXNetError: Operator _ones inferring shapes failed.
+
+    >>> mx.nd.ones(shape=(2, 0, 3))
+    mxnet.base.MXNetError: Operator _ones inferring shapes failed.
+
+    In legacy mode, Gluon layers would create parameters and outputs of type `mx.nd.NDArray`.
+
+    >>> from mxnet.gluon import nn
+    >>> dense = nn.Dense(2)
+    >>> dense.initialize()
+    >>> dense(mx.nd.ones(shape=(3, 2)))
+    [[0.01983214 0.07832371]
+     [0.01983214 0.07832371]
+     [0.01983214 0.07832371]]
+    <NDArray 3x2 @cpu(0)>
+
+    >>> [p.data() for p in dense.collect_params().values()]
+    [
+    [[0.0068339  0.01299825]
+     [0.0301265  0.04819721]]
+    <NDArray 2x2 @cpu(0)>,
+    [0. 0.]
+    <NDArray 2 @cpu(0)>]
+
+    When the `shape` flag is `True`, both shape inferences are successful.
+
+    >>> from mxnet import np, npx
+    >>> npx.set_np()  # this is required to activate NumPy-like behaviors
+
+    >>> np.ones(shape=())
+    array(1.)
+    >>> np.ones(shape=(2, 0, 3))
+    array([], shape=(2, 0, 3))
+
+    When the `array` flag is `True`, Gluon layers would create parameters and outputs of type `mx.np.ndarray`.
+
+    >>> dense = nn.Dense(2)
+    >>> dense.initialize()
+    >>> dense(np.ones(shape=(3, 2)))
+    array([[0.01983214, 0.07832371],
+           [0.01983214, 0.07832371],
+           [0.01983214, 0.07832371]])
+
+    >>> [p.data() for p in dense.collect_params().values()]
+    [array([[0.0068339 , 0.01299825],
+           [0.0301265 , 0.04819721]]), array([0., 0.])]
     """
     if not shape and array:
         raise ValueError('NumPy Shape semantics is required in using NumPy array semantics.')
diff --git a/src/c_api/c_api_executor.cc b/src/c_api/c_api_executor.cc
index ff85b4fd62fa..afc64f73de7c 100644
--- a/src/c_api/c_api_executor.cc
+++ b/src/c_api/c_api_executor.cc
@@ -515,44 +515,11 @@ int MXExecutorSimpleBind(SymbolHandle symbol_handle,
   API_END();
 }
 
-/*!
- * \brief
- * \param symbol_handle symbol handle
- * \param dev_type default device type
- * \param dev_id default device id
- * \param num_g2c_keys number of group2ctx keys
- * \param g2c_keys key list of group2ctx
- * \param g2c_dev_types device type list of group2ctx
- * \param g2c_dev_ids id list of group2ctx
- * \param provided_grad_req_list_len grad_req length provided by users in front-end
- * \param provided_grad_req_names grad_req names provided by users in front-end
- * \param provided_grad_req_types req types provided by users in front-end
- * \param num_provided_arg_shapes number of user provided in_arg and aux_state shapes
- * \param provided_arg_shape_names name list of provided shapes
- * \param provided_arg_shape_data provided shape data
- * \param provided_arg_shape_idx provided shape data index
- * \param num_provided_arg_dtypes number of user provided in_arg and axu_state dtypes
- * \param provided_arg_dtype_names argument name list of provided dtypes
- * \param provided_arg_dtypes data of provided dtypes
- * \param num_provided_arg_stypes number of user provided in_arg and axu_state storage types
- * \param provided_arg_stype_names argument name list of provided storage types
- * \param provided_arg_stypes data of provided storage types
- * \param num_shared_arg_names number of parameter names passed from _bind_ith_exec
- * \param shared_arg_name_list parameter name list passed from _bind_ith_exec
- * \param shared_buffer_len number of shared data arrays passed from _bind_ith_exec
- * \param shared_buffer_name_list shared data array names passed from _bind_ith_exec
- * \param shared_buffer_handle_list shared data array handles passed from _bind_ith_exec
- * \param updated_shared_buffer_name_list updated shared data array names after binding
- * \param updated_shared_buffer_handle_list updated shared data arrays after binding
- * \param num_in_args number of input arguments of this sym
- * \param in_args list_arguments associated with the current executor
- * \param arg_grads list of gradients of in_args associated with the current executor
- * \param num_aux_states number of aux states of this sym
- * \param aux_states list_auxiliary_states associated with the current executor
- * \param shared_exec_handle shared excutor handle passed from _bind_ith_exec
- * \param out the handle of the executor to be created
- */
-int MXExecutorSimpleBindEx(SymbolHandle symbol_handle,
+
+namespace mxnet {
+
+template<typename DType>
+int _SimpleBindImpl(SymbolHandle symbol_handle,
                            int dev_type,
                            int dev_id,
                            const uint32_t num_g2c_keys,
@@ -564,7 +531,7 @@ int MXExecutorSimpleBindEx(SymbolHandle symbol_handle,
                            const char** provided_grad_req_types,
                            const uint32_t num_provided_arg_shapes,
                            const char** provided_arg_shape_names,
-                           const int* provided_arg_shape_data,
+                           const DType* provided_arg_shape_data,
                            const uint32_t* provided_arg_shape_idx,
                            const uint32_t num_provided_arg_dtypes,
                            const char** provided_arg_dtype_names,
@@ -849,6 +816,192 @@ int MXExecutorSimpleBindEx(SymbolHandle symbol_handle,
   API_END();
 }
 
+}  // namespace mxnet
+
+
+/*!
+ * \brief Executor for simple_bind
+ * when USE_INT64_TENSOR_SIZE = OFF
+ * \param symbol_handle symbol handle
+ * \param dev_type default device type
+ * \param dev_id default device id
+ * \param num_g2c_keys number of group2ctx keys
+ * \param g2c_keys key list of group2ctx
+ * \param g2c_dev_types device type list of group2ctx
+ * \param g2c_dev_ids id list of group2ctx
+ * \param provided_grad_req_list_len grad_req length provided by users in front-end
+ * \param provided_grad_req_names grad_req names provided by users in front-end
+ * \param provided_grad_req_types req types provided by users in front-end
+ * \param num_provided_arg_shapes number of user provided in_arg and aux_state shapes
+ * \param provided_arg_shape_names name list of provided shapes
+ * \param provided_arg_shape_data provided shape data
+ * \param provided_arg_shape_idx provided shape data index
+ * \param num_provided_arg_dtypes number of user provided in_arg and aux_state dtypes
+ * \param provided_arg_dtype_names argument name list of provided dtypes
+ * \param provided_arg_dtypes data of provided dtypes
+ * \param num_provided_arg_stypes number of user provided in_arg and aux_state storage types
+ * \param provided_arg_stype_names argument name list of provided storage types
+ * \param provided_arg_stypes data of provided storage types
+ * \param num_shared_arg_names number of parameter names passed from _bind_ith_exec
+ * \param shared_arg_name_list parameter name list passed from _bind_ith_exec
+ * \param shared_buffer_len number of shared data arrays passed from _bind_ith_exec
+ * \param shared_buffer_name_list shared data array names passed from _bind_ith_exec
+ * \param shared_buffer_handle_list shared data array handles passed from _bind_ith_exec
+ * \param updated_shared_buffer_name_list updated shared data array names after binding
+ * \param updated_shared_buffer_handle_list updated shared data arrays after binding
+ * \param num_in_args number of input arguments of this sym
+ * \param in_args list_arguments associated with the current executor
+ * \param arg_grads list of gradients of in_args associated with the current executor
+ * \param num_aux_states number of aux states of this sym
+ * \param aux_states list_auxiliary_states associated with the current executor
+ * \param shared_exec_handle shared executor handle passed from _bind_ith_exec
+ * \param out the handle of the executor to be created
+ */
+int MXExecutorSimpleBindEx(SymbolHandle symbol_handle,
+                           int dev_type,
+                           int dev_id,
+                           const uint32_t num_g2c_keys,
+                           const char** g2c_keys,
+                           const int* g2c_dev_types,
+                           const int* g2c_dev_ids,
+                           const uint32_t provided_grad_req_list_len,
+                           const char** provided_grad_req_names,
+                           const char** provided_grad_req_types,
+                           const uint32_t num_provided_arg_shapes,
+                           const char** provided_arg_shape_names,
+                           const int* provided_arg_shape_data,
+                           const uint32_t* provided_arg_shape_idx,
+                           const uint32_t num_provided_arg_dtypes,
+                           const char** provided_arg_dtype_names,
+                           const int* provided_arg_dtypes,
+                           const uint32_t num_provided_arg_stypes,
+                           const char** provided_arg_stype_names,
+                           const int* provided_arg_stypes,
+                           const uint32_t num_shared_arg_names,
+                           const char** shared_arg_name_list,
+                           int* shared_buffer_len,
+                           const char** shared_buffer_name_list,
+                           NDArrayHandle* shared_buffer_handle_list,
+                           const char*** updated_shared_buffer_name_list,
+                           NDArrayHandle** updated_shared_buffer_handle_list,
+                           uint32_t* num_in_args,
+                           NDArrayHandle** in_args,
+                           NDArrayHandle** arg_grads,
+                           uint32_t* num_aux_states,
+                           NDArrayHandle** aux_states,
+                           ExecutorHandle shared_exec_handle,
+                           ExecutorHandle* out) {
+  return mxnet::_SimpleBindImpl(symbol_handle,
+                            dev_type, dev_id,
+                            num_g2c_keys, g2c_keys, g2c_dev_types, g2c_dev_ids,
+                            provided_grad_req_list_len, provided_grad_req_names,
+                            provided_grad_req_types,
+                            num_provided_arg_shapes, provided_arg_shape_names,
+                            provided_arg_shape_data, provided_arg_shape_idx,
+                            num_provided_arg_dtypes, provided_arg_dtype_names, provided_arg_dtypes,
+                            num_provided_arg_stypes, provided_arg_stype_names, provided_arg_stypes,
+                            num_shared_arg_names, shared_arg_name_list,
+                            shared_buffer_len, shared_buffer_name_list,
+                            shared_buffer_handle_list, updated_shared_buffer_name_list,
+                            updated_shared_buffer_handle_list,
+                            num_in_args, in_args, arg_grads,
+                            num_aux_states, aux_states,
+                            shared_exec_handle, out);
+}
+
+
+// TODO(ChaiBapchya): add API doc for rest of C APIs for int64
+/*!
+ * \brief Large tensor specific implementation for simple_bind executor
+ * when USE_INT64_TENSOR_SIZE = ON
+ * \param symbol_handle symbol handle
+ * \param dev_type default device type
+ * \param dev_id default device id
+ * \param num_g2c_keys number of group2ctx keys
+ * \param g2c_keys key list of group2ctx
+ * \param g2c_dev_types device type list of group2ctx
+ * \param g2c_dev_ids id list of group2ctx
+ * \param provided_grad_req_list_len grad_req length provided by users in front-end
+ * \param provided_grad_req_names grad_req names provided by users in front-end
+ * \param provided_grad_req_types req types provided by users in front-end
+ * \param num_provided_arg_shapes number of user provided in_arg and aux_state shapes
+ * \param provided_arg_shape_names name list of provided shapes
+ * \param provided_arg_shape_data provided shape data
+ * \param provided_arg_shape_idx provided shape data index
+ * \param num_provided_arg_dtypes number of user provided in_arg and aux_state dtypes
+ * \param provided_arg_dtype_names argument name list of provided dtypes
+ * \param provided_arg_dtypes data of provided dtypes
+ * \param num_provided_arg_stypes number of user provided in_arg and aux_state storage types
+ * \param provided_arg_stype_names argument name list of provided storage types
+ * \param provided_arg_stypes data of provided storage types
+ * \param num_shared_arg_names number of parameter names passed from _bind_ith_exec
+ * \param shared_arg_name_list parameter name list passed from _bind_ith_exec
+ * \param shared_buffer_len number of shared data arrays passed from _bind_ith_exec
+ * \param shared_buffer_name_list shared data array names passed from _bind_ith_exec
+ * \param shared_buffer_handle_list shared data array handles passed from _bind_ith_exec
+ * \param updated_shared_buffer_name_list updated shared data array names after binding
+ * \param updated_shared_buffer_handle_list updated shared data arrays after binding
+ * \param num_in_args number of input arguments of this sym
+ * \param in_args list_arguments associated with the current executor
+ * \param arg_grads list of gradients of in_args associated with the current executor
+ * \param num_aux_states number of aux states of this sym
+ * \param aux_states list_auxiliary_states associated with the current executor
+ * \param shared_exec_handle shared executor handle passed from _bind_ith_exec
+ * \param out the handle of the executor to be created
+ */
+int MXExecutorSimpleBindEx64(SymbolHandle symbol_handle,
+                           int dev_type,
+                           int dev_id,
+                           const uint32_t num_g2c_keys,
+                           const char** g2c_keys,
+                           const int* g2c_dev_types,
+                           const int* g2c_dev_ids,
+                           const uint32_t provided_grad_req_list_len,
+                           const char** provided_grad_req_names,
+                           const char** provided_grad_req_types,
+                           const uint32_t num_provided_arg_shapes,
+                           const char** provided_arg_shape_names,
+                           const int64_t* provided_arg_shape_data,
+                           const uint32_t* provided_arg_shape_idx,
+                           const uint32_t num_provided_arg_dtypes,
+                           const char** provided_arg_dtype_names,
+                           const int* provided_arg_dtypes,
+                           const uint32_t num_provided_arg_stypes,
+                           const char** provided_arg_stype_names,
+                           const int* provided_arg_stypes,
+                           const uint32_t num_shared_arg_names,
+                           const char** shared_arg_name_list,
+                           int* shared_buffer_len,
+                           const char** shared_buffer_name_list,
+                           NDArrayHandle* shared_buffer_handle_list,
+                           const char*** updated_shared_buffer_name_list,
+                           NDArrayHandle** updated_shared_buffer_handle_list,
+                           uint32_t* num_in_args,
+                           NDArrayHandle** in_args,
+                           NDArrayHandle** arg_grads,
+                           uint32_t* num_aux_states,
+                           NDArrayHandle** aux_states,
+                           ExecutorHandle shared_exec_handle,
+                           ExecutorHandle* out) {
+  return mxnet::_SimpleBindImpl(symbol_handle,
+                            dev_type, dev_id,
+                            num_g2c_keys, g2c_keys, g2c_dev_types, g2c_dev_ids,
+                            provided_grad_req_list_len, provided_grad_req_names,
+                            provided_grad_req_types,
+                            num_provided_arg_shapes, provided_arg_shape_names,
+                            provided_arg_shape_data, provided_arg_shape_idx,
+                            num_provided_arg_dtypes, provided_arg_dtype_names, provided_arg_dtypes,
+                            num_provided_arg_stypes, provided_arg_stype_names, provided_arg_stypes,
+                            num_shared_arg_names, shared_arg_name_list,
+                            shared_buffer_len, shared_buffer_name_list,
+                            shared_buffer_handle_list, updated_shared_buffer_name_list,
+                            updated_shared_buffer_handle_list,
+                            num_in_args, in_args, arg_grads,
+                            num_aux_states, aux_states,
+                            shared_exec_handle, out);
+}
+
+
 int MXExecutorReshape(int partial_shaping,
                       int allow_up_sizing,
                       int dev_type,
diff --git a/src/ndarray/ndarray.cc b/src/ndarray/ndarray.cc
index 3feccf55b734..55295a42d924 100644
--- a/src/ndarray/ndarray.cc
+++ b/src/ndarray/ndarray.cc
@@ -636,6 +636,7 @@ const mkldnn::memory *NDArray::GetMKLDNNData() const {
     // If this is a view, we can't create a MKLDNN memory for the chunk
     // because we don't have the complete data type and shape information for
     // the chunk.
+    CheckAndAlloc();
     void *off_addr = static_cast<char *>(ptr_->shandle.dptr) + byte_offset_;
     // Create the primitive desc for the new mkldnn memory.
     mkldnn::memory::dims dims(shape().ndim());
@@ -652,6 +653,7 @@ const mkldnn::memory *NDArray::GetMKLDNNData() const {
   } else {
     // If this isn't a view, we can create a MKLDNN memory and store it in the
     // chunk.
+    CheckAndAlloc();
     ptr_->SetMKLMem(shape_, dtype_);
     MKLDNNStream::Get()->RegisterMem(ptr_->mkl_mem_->GetMem());
     return ptr_->mkl_mem_->GetRaw();
diff --git a/src/operator/contrib/allclose_op-inl.h b/src/operator/contrib/allclose_op-inl.h
index a858450f0007..a10c7795e568 100644
--- a/src/operator/contrib/allclose_op-inl.h
+++ b/src/operator/contrib/allclose_op-inl.h
@@ -58,8 +58,8 @@ struct AllCloseParam : public dmlc::Parameter<AllCloseParam> {
       .describe("Absolute tolerance.");
     DMLC_DECLARE_FIELD(equal_nan)
       .set_default(true)
-      .describe("Whether to compare NaN’s as equal. If True, NaN’s in A will be considered equal "
-                "to NaN’s in B in the output array.");
+      .describe("Whether to compare NaN's as equal. If True, NaN's in A will be considered equal "
+                "to NaN's in B in the output array.");
   }
 };
 
diff --git a/src/operator/contrib/boolean_mask.cc b/src/operator/contrib/boolean_mask.cc
index a54cc917776d..cd2fd8e42f8e 100644
--- a/src/operator/contrib/boolean_mask.cc
+++ b/src/operator/contrib/boolean_mask.cc
@@ -129,7 +129,7 @@ inline void BooleanMaskForward<cpu>(const nnvm::NodeAttrs& attrs,
 
   const_cast<NDArray &>(out).Init(s);
   // do the copy
-  MSHADOW_TYPE_SWITCH(data.dtype(), DType, {
+  MSHADOW_TYPE_SWITCH_WITH_BOOL(data.dtype(), DType, {
     size_t input_size = data.shape().Size();
     size_t col_size = input_size / idx_size;
     mshadow::Stream<cpu> *stream = ctx.get_stream<cpu>();
diff --git a/src/operator/contrib/boolean_mask.cu b/src/operator/contrib/boolean_mask.cu
index 71d91c63f64e..f6c1df0c62a8 100644
--- a/src/operator/contrib/boolean_mask.cu
+++ b/src/operator/contrib/boolean_mask.cu
@@ -86,7 +86,7 @@ inline void BooleanMaskForward<gpu>(const nnvm::NodeAttrs& attrs,
   size_t input_size = data.shape().Size();
   size_t col_size = input_size / idx.shape()[0];
   // Do the copy
-  MSHADOW_TYPE_SWITCH(out.dtype(), DType, {
+  MSHADOW_TYPE_SWITCH_WITH_BOOL(out.dtype(), DType, {
     if (valid_num > 0) {
       mxnet_op::Kernel<BooleanMaskForwardKernel, gpu>::Launch(
         s, input_size, out.data().dptr<DType>(), data.data().dptr<DType>(), prefix_sum, col_size);
diff --git a/src/operator/mxnet_op.h b/src/operator/mxnet_op.h
index 8ccc34247b6f..463c71b5b0eb 100644
--- a/src/operator/mxnet_op.h
+++ b/src/operator/mxnet_op.h
@@ -619,6 +619,18 @@ MSHADOW_XINLINE Shape<ndim> calc_stride(const Shape<ndim>& shape) {
   return stride;
 }
 
+/* Increment coordinates */
+template<int ndim>
+MSHADOW_XINLINE bool inc(Shape<ndim>* coord, const Shape<ndim>& shape) {
+  ++(*coord)[ndim-1];
+  #pragma unroll
+  for (int i = ndim - 1; i > 0 && (*coord)[i] >= shape[i]; --i) {
+    (*coord)[i] -= shape[i];
+    ++(*coord)[i-1];
+  }
+  return (*coord)[0] < shape[0];
+}
+
 /* Increment coordinates and modify index */
 template<int ndim>
 MSHADOW_XINLINE void inc(Shape<ndim>* coord, const Shape<ndim>& shape,
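Note on the single-argument `inc` overload added in the hunk above: it advances a multi-dimensional coordinate like an odometer, carrying from the last axis toward the first, and returns false once the first axis runs past its extent. The following is a minimal standalone sketch of that behaviour, not the MXNet code itself. It assumes `std::array` in place of `mshadow::Shape`, and the names `inc_coord` and `count_visits` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the patch's inc(Shape<ndim>*, const Shape<ndim>&),
// written against std::array instead of mshadow::Shape.
template <std::size_t N>
bool inc_coord(std::array<int, N>* coord, const std::array<int, N>& shape) {
  static_assert(N >= 1, "at least one axis is required");
  ++(*coord)[N - 1];  // bump the fastest-varying (last) axis
  // Carry into the previous axis while the current one has overflowed.
  for (std::size_t i = N - 1; i > 0 && (*coord)[i] >= shape[i]; --i) {
    (*coord)[i] -= shape[i];
    ++(*coord)[i - 1];
  }
  // Still in range as long as the first axis has not overflowed.
  return (*coord)[0] < shape[0];
}

// Counts how many coordinates a do/while walk visits, starting at the origin.
template <std::size_t N>
int count_visits(const std::array<int, N>& shape) {
  std::array<int, N> coord{};  // all zeros
  int visits = 0;
  do {
    ++visits;
  } while (inc_coord(&coord, shape));
  return visits;
}
```

For a shape of {2, 3} the walk visits all six coordinates. Because a do/while body always runs at least once, a caller has to handle zero-sized extents separately, which is consistent with the early return on `reduceshape[rdim] == 0` in the einsum kernel later in this patch.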
diff --git a/src/operator/numpy/np_broadcast_reduce_op_index.cc b/src/operator/numpy/np_broadcast_reduce_op_index.cc
index bd6915cc9b27..15831c7e79ba 100644
--- a/src/operator/numpy/np_broadcast_reduce_op_index.cc
+++ b/src/operator/numpy/np_broadcast_reduce_op_index.cc
@@ -57,5 +57,16 @@ NNVM_REGISTER_OP(_npi_argmax)
 .set_attr<nnvm::FGradient>("FGradient", MakeZeroGradNodes)
 .add_arguments(ReduceAxisParam::__FIELDS__());
 
+NNVM_REGISTER_OP(_npi_argmin)
+.set_num_inputs(1)
+.set_num_outputs(1)
+.set_attr_parser(ParamParser<ReduceAxisParam>)
+.set_attr<mxnet::FInferShape>("FInferShape", NumpyReduceAxisShape)
+.set_attr<nnvm::FInferType>("FInferType", ElemwiseType<1, 1>)
+.add_argument("data", "NDArray-or-Symbol", "The input")
+.set_attr<FCompute>("FCompute<cpu>", SearchAxisCompute<cpu, mshadow::red::minimum>)
+.set_attr<nnvm::FGradient>("FGradient", MakeZeroGradNodes)
+.add_arguments(ReduceAxisParam::__FIELDS__());
+
 }  // namespace op
 }  // namespace mxnet
diff --git a/src/operator/numpy/np_broadcast_reduce_op_index.cu b/src/operator/numpy/np_broadcast_reduce_op_index.cu
index a07baa9c070c..0420133ee7c0 100644
--- a/src/operator/numpy/np_broadcast_reduce_op_index.cu
+++ b/src/operator/numpy/np_broadcast_reduce_op_index.cu
@@ -30,5 +30,8 @@ namespace op {
 NNVM_REGISTER_OP(_npi_argmax)
 .set_attr<FCompute>("FCompute<gpu>", SearchAxisCompute<gpu, mshadow::red::maximum>);
 
+NNVM_REGISTER_OP(_npi_argmin)
+.set_attr<FCompute>("FCompute<gpu>", SearchAxisCompute<gpu, mshadow::red::minimum>);
+
 }  // namespace op
 }  // namespace mxnet
diff --git a/src/operator/numpy/np_einsum_op-inl.h b/src/operator/numpy/np_einsum_op-inl.h
index 2145abec682b..d2f399b2533d 100644
--- a/src/operator/numpy/np_einsum_op-inl.h
+++ b/src/operator/numpy/np_einsum_op-inl.h
@@ -73,8 +73,8 @@
 namespace mxnet {
 namespace op {
 
-#define NPY_MAXDIMS 32
-#define NPY_MAXARGS 32
+#define NPY_MAXDIMS 16
+#define NPY_MAXARGS 16
 
 inline TShape get_stride(const TShape& shape) {
   int ndim = shape.ndim(), prod = 1;
@@ -394,7 +394,7 @@ struct NumpyEinsumParam: public dmlc::Parameter<NumpyEinsumParam> {
       .set_default("")
       .describe("Specifies the subscripts for summation as comma separated list"
       " of subscript labels. An implicit (classical Einstein summation) calculation"
-      " is performed unless the explicit indicator ‘->’ is included as well as"
+      " is performed unless the explicit indicator '->' is included as well as"
       " subscript labels of the precise output form.");
     DMLC_DECLARE_FIELD(optimize)
       .set_default(0);
@@ -415,40 +415,45 @@ class EinsumOp {
   }
 };  // class EinsumOp
 
-template<int dimension, int req, bool back>
-struct numpy_einsum {
+template<int dimension, int req, bool back, typename AType>
+struct numpy_einsum{
   template<typename DType>
   MSHADOW_XINLINE static void Map(index_t i, DType* out,
                                   common::StaticArray<DType*, NPY_MAXARGS> op,
                                   mshadow::Shape<dimension> oshape,
-                                  mshadow::Shape<dimension> ostride,
+                                  common::StaticArray<mshadow::Shape<dimension>,
+                                                      NPY_MAXARGS> ostride,
                                   mshadow::Shape<dimension> reduceshape,
-                                  mshadow::Shape<dimension> reducestride,
-                                  mshadow::Shape<dimension> itershape,
                                   common::StaticArray<mshadow::Shape<dimension>,
-                                                      NPY_MAXARGS> iterstride,
+                                                      NPY_MAXARGS> rstride,
                                   int nop,
                                   int iop0,
                                   const DType* out_grad) {
     using namespace mxnet_op;
-    index_t oidx = back ? dot(unravel(dot(unravel(i, oshape), ostride), itershape),
-                              iterstride[iop0]) : i;
+    mshadow::Shape<dimension> oidx = unravel(i, oshape);
+    i = back ? dot(oidx, ostride[iop0]) : i;
     if (req == kWriteTo) {
-      out[oidx] = (DType)0;
+      out[i] = (DType)0;
+    }
+    for (int rdim = 0; rdim < dimension; ++rdim) {
+      if (reduceshape[rdim] == 0) {
+        return;
+      }
     }
-    for (int j = 0; j < reduceshape.Size(); j++) {
-      mshadow::Shape<dimension> idx = unravel(dot(unravel(j, reduceshape), reducestride) +
-                                              dot(unravel(i, oshape), ostride),
-                                              itershape);
-      DType tmp = back ? out_grad[dot(idx, iterstride[nop])] :  (DType)1;
+    mshadow::Shape<dimension> ridx = unravel(0, reduceshape);
+    AType sum = 0;
+    do {
+      AType tmp = back ? static_cast<AType>(out_grad[dot(oidx, ostride[nop]) +
+                                                     dot(ridx, rstride[nop])]): (AType)1;
       for (int iop = 0; iop < nop; ++iop) {
         if (iop != iop0) {
-          index_t k = dot(idx, iterstride[iop]);
-          tmp = tmp * op[iop][k];
+          index_t k = dot(oidx, ostride[iop]) + dot(ridx, rstride[iop]);
+          tmp = tmp * static_cast<AType>(op[iop][k]);
         }
       }
-      out[oidx] = out[oidx] + tmp;
-    }
+      sum = sum + tmp;
+    }while (inc(&ridx, reduceshape));
+    out[i] = out[i] + static_cast<DType>(sum);
   }
 };
 
@@ -603,12 +608,12 @@ inline void NumpyEinsumProcess(const std::vector<TBlob>& inputs,
   }
 
   /* Step 4: Set up the op_axes for the iterator */
-  TShape itershape(ndim_iter, -1), iterstride_true(ndim_iter, -1);
+  TShape itershape(ndim_iter, -1);
+  std::vector<TShape> iterstride(nop + 1, TShape(ndim_iter, 0));
   TShape oshape = back ? inputs[0].shape_ : outputs[0].shape_;
   TShape ostride_true = get_stride(oshape);
-  TShape reduceshape, ostride, reducestride;
-  std::vector<TShape> iterstride(nop + 1, TShape(ndim_iter, 0));
-  std::vector<TShape> remainshape(nop), opstride(nop), remainstride(nop);
+  TShape reduceshape;
+  std::vector<TShape> remainshape(nop);
   int op_axes_arrays[NPY_MAXARGS][NPY_MAXDIMS];
   int *op_axes[NPY_MAXARGS];
 
@@ -632,7 +637,6 @@ inline void NumpyEinsumProcess(const std::vector<TBlob>& inputs,
   for (idim = 0; idim < ndim_output; ++idim) {
     iterstride[nop][idim] = ostride_true[idim];
   }
-  iterstride_true = get_stride(itershape);
   reduceshape = TShape(ndim_iter - ndim_output, 0);
   for (idim = ndim_output; idim < ndim_iter; ++idim) {
     reduceshape[idim - ndim_output] = itershape[idim];
@@ -648,30 +652,6 @@ inline void NumpyEinsumProcess(const std::vector<TBlob>& inputs,
     remainshape[iop] = TShape(rsh.begin(), rsh.end());
   }
 
-  // calculate stride
-  ostride = TShape(ndim_output, 0);
-  for (idim = 0; idim < ndim_output; ++idim) {
-    ostride[idim] = iterstride_true[idim];
-  }
-  reducestride = TShape(ndim_iter - ndim_output, 0);
-  for (idim = ndim_output; idim < ndim_iter; ++idim) {
-    reducestride[idim - ndim_output] = iterstride_true[idim];
-  }
-  for (iop = 0; iop < nop; ++iop) {
-    opstride[iop] = TShape(opshape[iop].ndim(), 0);
-    remainstride[iop] = TShape(remainshape[iop].ndim(), 0);
-    int j = 0;
-    for (idim = 0; idim < ndim_iter; ++idim) {
-      if (op_axes_arrays[iop][idim] != -1 &&
-          itershape[idim] == opshape[iop][op_axes_arrays[iop][idim]]) {
-        opstride[iop][op_axes_arrays[iop][idim]] = iterstride_true[idim];
-      } else {
-        remainstride[iop][j++] = iterstride_true[idim];
-      }
-    }
-    CHECK_EQ(j, remainstride[iop].ndim());
-  }
-
   // exclude the 0-dim case
   if (ndim_iter == 0) {
     ndim_iter = 1;
@@ -681,14 +661,10 @@ inline void NumpyEinsumProcess(const std::vector<TBlob>& inputs,
     iterstride[iop] = pad(iterstride[iop], ndim_iter);
   }
   oshape = pad(oshape, ndim_iter);
-  ostride = pad(ostride, ndim_iter);
   reduceshape = pad(reduceshape, ndim_iter);
-  reducestride = pad(reducestride, ndim_iter);
   for (iop = 0; iop < nop; ++iop) {
     opshape[iop] = pad(opshape[iop], ndim_iter);
-    opstride[iop] = pad(opstride[iop], ndim_iter);
     remainshape[iop] = pad(remainshape[iop], ndim_iter);
-    remainstride[iop] = pad(remainstride[iop], ndim_iter);
   }
 
   if (!back) {
@@ -696,28 +672,33 @@ inline void NumpyEinsumProcess(const std::vector<TBlob>& inputs,
       return;
     }
     const TBlob &out_data = outputs[0];
-    MSHADOW_TYPE_SWITCH(out_data.type_flag_, DType, {
+    MXNET_ACC_TYPE_SWITCH(out_data.type_flag_, DType, AType, {
       mxnet::common::StaticArray<DType*, NPY_MAXARGS> op;
       for (iop = 0; iop < nop; ++iop) {
         op[iop] = inputs[iop].dptr<DType>();
       }
       MXNET_ASSIGN_REQ_SWITCH(req[0], req_type, {
         MXNET_NDIM_SWITCH_EX(ndim_iter, dimension, {
-          mxnet::common::StaticArray<mshadow::Shape<dimension>, NPY_MAXARGS> iterstride_arr;
-          for (iop = 0; iop <= nop; ++iop) {
-            iterstride_arr[iop] = iterstride[iop].get<dimension>();
+          mxnet::common::StaticArray<mshadow::Shape<dimension>, NPY_MAXARGS> ostride_arr;
+          mxnet::common::StaticArray<mshadow::Shape<dimension>, NPY_MAXARGS> rstride_arr;
+          for (iop = 0; iop < nop; ++iop) {
+            mshadow::Shape<dimension> otmp, rtmp;
+            for (idim = 0; idim < dimension; ++idim) {
+              otmp[idim] = idim < ndim_output ? iterstride[iop][idim] : 1;
+              rtmp[idim] = idim < dimension - ndim_output ? iterstride[iop][idim + ndim_output] : 1;
+            }
+            ostride_arr[iop] = otmp;
+            rstride_arr[iop] = rtmp;
           }
-          Kernel<numpy_einsum<dimension, req_type, 0>,
+          Kernel<numpy_einsum<dimension, req_type, 0, AType>,
                  xpu>::Launch(ctx.get_stream<xpu>(),
                               oshape.Size(),
                               out_data.dptr<DType>(),
                               op,
                               oshape.get<dimension>(),
-                              ostride.get<dimension>(),
+                              ostride_arr,
                               reduceshape.get<dimension>(),
-                              reducestride.get<dimension>(),
-                              itershape.get<dimension>(),
-                              iterstride_arr,
+                              rstride_arr,
                               nop,
                               -1,
                               reinterpret_cast<DType*>(NULL));
@@ -743,31 +724,44 @@ inline void NumpyEinsumProcess(const std::vector<TBlob>& inputs,
     for (int i = 0; i < nop; ++i) {
       const TBlob &out_data = outputs[i];
       const TBlob &out_grad = inputs[0];
-      MSHADOW_TYPE_SWITCH(out_data.type_flag_, DType, {
+      std::vector<TShape> opstride(nop + 1, TShape(ndim_iter, 0));
+      std::vector<TShape> remainstride(nop + 1, TShape(ndim_iter, 0));
+      for (iop = 0; iop <= nop; ++iop) {
+        int j = 0;
+        for (idim = 0; idim < ndim_iter; ++idim) {
+          if (op_axes_arrays[i][idim] == -1 ||
+              opshape[i][op_axes_arrays[i][idim]] == 1) {
+            remainstride[iop][j++] = iterstride[iop][idim];
+          } else {
+            opstride[iop][op_axes_arrays[i][idim]] = iterstride[iop][idim];
+          }
+        }
+      }
+      MXNET_ACC_TYPE_SWITCH(out_data.type_flag_, DType, AType, {
         mxnet::common::StaticArray<DType*, NPY_MAXARGS> op;
         for (iop = 0; iop < nop; ++iop) {
           op[iop] = inputs[iop + back].dptr<DType>();
         }
         MXNET_ASSIGN_REQ_SWITCH(req[i], req_type, {
           MXNET_NDIM_SWITCH_EX(ndim_iter, dimension, {
-            mxnet::common::StaticArray<mshadow::Shape<dimension>, NPY_MAXARGS> iterstride_arr;
+            mxnet::common::StaticArray<mshadow::Shape<dimension>, NPY_MAXARGS> opstride_arr;
+            mxnet::common::StaticArray<mshadow::Shape<dimension>, NPY_MAXARGS> remainstride_arr;
             for (iop = 0; iop <= nop; ++iop) {
-              iterstride_arr[iop] = iterstride[iop].get<dimension>();
+              opstride_arr[iop] = opstride[iop].get<dimension>();
+              remainstride_arr[iop] = remainstride[iop].get<dimension>();
             }
-            Kernel<numpy_einsum<dimension, req_type, 1>,
+            Kernel<numpy_einsum<dimension, req_type, 1, AType>,
                   xpu>::Launch(ctx.get_stream<xpu>(),
-                              opshape[i].Size(),
-                              out_data.dptr<DType>(),
-                              op,
-                              opshape[i].get<dimension>(),
-                              opstride[i].get<dimension>(),
-                              remainshape[i].get<dimension>(),
-                              remainstride[i].get<dimension>(),
-                              itershape.get<dimension>(),
-                              iterstride_arr,
-                              nop,
-                              i,
-                              out_grad.dptr<DType>());
+                               opshape[i].Size(),
+                               out_data.dptr<DType>(),
+                               op,
+                               opshape[i].get<dimension>(),
+                               opstride_arr,
+                               remainshape[i].get<dimension>(),
+                               remainstride_arr,
+                               nop,
+                               i,
+                               out_grad.dptr<DType>());
           })
         })
       })
@@ -798,13 +792,14 @@ inline void NumpyEinsumForward(const OpStatePtr& state_ptr,
   std::vector<std::vector<int> > pos;
   std::string string_repr;
   paths = einsum_path(state.subscripts, inputs, true, ctx.run_ctx, &pos, &string_repr);
-  int paths_len = paths.size(), temp_space_size = 0, max_temp_space_size = 0;
+  int paths_len = paths.size();
+  size_t temp_space_size = 0, max_temp_space_size = 0;
   std::vector<TBlob> operands(inputs), tmp_operands, temp_space_vec(paths_len - 1);
   for (int i = 0; i + 1 < paths_len; ++i) {
     temp_space_size += paths[i].oshape.Size();
   }
   for (int i = 0; i < paths_len; ++i) {
-    max_temp_space_size = std::max(max_temp_space_size, static_cast<int>(paths[i].oshape.Size()));
+    max_temp_space_size = std::max(max_temp_space_size, paths[i].oshape.Size());
   }
   temp_space_size += max_temp_space_size;
   MSHADOW_TYPE_SWITCH(outputs[0].type_flag_, DType, {
@@ -813,7 +808,7 @@ inline void NumpyEinsumForward(const OpStatePtr& state_ptr,
                                                false,
                                                outputs[0].type_flag_));
     Tensor<xpu, 1, DType> temp_space = state.tempspace->data().FlatTo1D<xpu, DType>();
-    int begin = max_temp_space_size;
+    size_t begin = max_temp_space_size;
     for (int i = 0; i < paths_len - 1; ++i) {
       TBlob tblob = TBlob(temp_space.Slice(begin, begin + paths[i].oshape.Size()));
       temp_space_vec[i] = tblob.reshape(paths[i].oshape);
@@ -910,12 +905,13 @@ inline void NumpyEinsumBackward(const OpStatePtr& state_ptr,
   }
   // calculate temporary space size for temp_grad
   const std::vector<Step>& paths = state.paths;
-  int paths_len = paths.size(), temp_space_size = 0, max_temp_space_size = 0;
+  int paths_len = paths.size();
+  size_t temp_space_size = 0, max_temp_space_size = 0;
   for (int i = 0; i < paths_len - 1; ++i) {
     temp_space_size += paths[i].oshape.Size();
   }
   for (int i = 0; i < paths_len; ++i) {
-    max_temp_space_size = std::max(max_temp_space_size, static_cast<int>(paths[i].oshape.Size()));
+    max_temp_space_size = std::max(max_temp_space_size, paths[i].oshape.Size());
   }
   temp_space_size += max_temp_space_size;
   // replay the forward process
@@ -936,8 +932,8 @@ inline void NumpyEinsumBackward(const OpStatePtr& state_ptr,
     }
   }
   // calculate temporary space size for tensordot
-  int tensordot_max_tempspace_size = 0;
-  int begin_tensordot_tempspace = 0;
+  size_t tensordot_max_tempspace_size = 0;
+  size_t begin_tensordot_tempspace = 0;
   std::vector<TBlob> temp_inputs, temp_outputs;
   std::vector<OpReqType> temp_req;
   std::vector<size_t> tensordot_tempspace_size;
@@ -999,7 +995,7 @@ inline void NumpyEinsumBackward(const OpStatePtr& state_ptr,
       }
       tensordot_tempspace_size.push_back(cur_tensordot_tempspace_size);
       tensordot_max_tempspace_size = std::max(tensordot_max_tempspace_size,
-                                              static_cast<int>(cur_tensordot_tempspace_size));
+                                              cur_tensordot_tempspace_size);
     }
     begin_tensordot_tempspace = temp_space_size;
     temp_space_size += (tensordot_max_tempspace_size + sizeof(DType) - 1) / sizeof(DType);
@@ -1010,7 +1006,7 @@ inline void NumpyEinsumBackward(const OpStatePtr& state_ptr,
     // allocate temporary space for gradients of intermediate results
     Tensor<xpu, 1, DType> temp_space = ctx.requested[0].get_space_typed<xpu, 1, DType>
       (Shape1(temp_space_size), s);
-    int begin = max_temp_space_size;
+    size_t begin = max_temp_space_size;
     for (int i = 0; i + 1 < paths_len; ++i) {
       TBlob tblob = TBlob(temp_space.Slice(begin, begin + paths[i].oshape.Size()));
       temp_grad[i] = tblob.reshape(paths[i].oshape);
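Note on the `AType` template parameter introduced for `numpy_einsum` in the file above: the kernel now sums products in an accumulation type and casts to `DType` once at the end, instead of adding into `out[...]` on every iteration. The following is a minimal sketch of why that can matter. It assumes `float` as `DType` and `double` as `AType`; the helper `dot_acc` is hypothetical, and the type pairs actually chosen by `MXNET_ACC_TYPE_SWITCH` are not shown in this patch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: multiply-accumulate in AType, then cast once to DType,
// mirroring the "AType sum ... static_cast<DType>(sum)" pattern in the patch.
template <typename DType, typename AType>
DType dot_acc(const std::vector<DType>& a, const std::vector<DType>& b) {
  assert(a.size() == b.size());
  AType sum = 0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    // Widen each operand before multiplying so the product is also wide.
    sum += static_cast<AType>(a[i]) * static_cast<AType>(b[i]);
  }
  return static_cast<DType>(sum);
}
```

With inputs {16777216, 1, 1} and {1, 1, 1}, a `double` accumulator yields 16777218, which is exactly representable as a `float`. Adding 1 to 16777216 directly in `float` cannot change the value, since 16777217 is not representable in single precision, so a narrow accumulator may drop those terms.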
diff --git a/src/operator/numpy/np_einsum_op.cc b/src/operator/numpy/np_einsum_op.cc
index 4d232b9b7c04..522780f5f3ad 100644
--- a/src/operator/numpy/np_einsum_op.cc
+++ b/src/operator/numpy/np_einsum_op.cc
@@ -305,6 +305,17 @@ bool NumpyEinsumShape(const nnvm::NodeAttrs& attrs,
     oshape[i] = dimension_dict[static_cast<int>(output_str[i])];
   }
   SHAPE_ASSIGN_CHECK(*out_attrs, 0, oshape);
+  size_t lim = static_cast<size_t>(std::numeric_limits<index_t>::max());
+  for (int i = 0; i < num_args; ++i) {
+    CHECK_LE(in_attrs->at(i).Size(), lim)
+      << "Size of operand " << i
+      << " exceeds the maximum index."
+      << " Try setting `USE_INT64_TENSOR_SIZE`.";
+  }
+  CHECK_LE(oshape.Size(), lim)
+    << "Size of output"
+    << " exceeds the maximum index."
+    << " Try setting `USE_INT64_TENSOR_SIZE`.";
   return shape_is_known(oshape);
 }
 
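Note on the `CHECK_LE` guards added to `NumpyEinsumShape` above: they reject operands or outputs whose element count cannot be represented by `index_t`. The following is a minimal sketch of the same comparison. It assumes a 32-bit signed index type in the default build, which the error message's hint about `USE_INT64_TENSOR_SIZE` suggests but this patch does not state; the helper `fits_index` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical helper: true when `size` elements can be counted by IndexT.
// The cast mirrors the patch, which converts the index maximum to size_t
// before comparing so that both sides are unsigned.
template <typename IndexT>
bool fits_index(std::size_t size) {
  return size <= static_cast<std::size_t>(std::numeric_limits<IndexT>::max());
}
```

Comparing in `size_t` avoids the signed/unsigned mismatch that would arise if the element count were narrowed to the index type first.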
diff --git a/src/operator/numpy/np_einsum_path_op-inl.h b/src/operator/numpy/np_einsum_path_op-inl.h
index cebd4e8ce9af..968d52106da7 100644
--- a/src/operator/numpy/np_einsum_path_op-inl.h
+++ b/src/operator/numpy/np_einsum_path_op-inl.h
@@ -80,7 +80,7 @@ struct Contraction {
 };
 
 struct Alternative {
-  int cost[2];
+  int64_t cost[2];
   std::vector<int> positions;
   SetVector new_input_sets;
 };
@@ -115,28 +115,28 @@ inline size_t _compute_size_by_dict(const std::bitset<MAXAXIS>& indices,
   return ret;
 }
 
-inline int _flop_count(const std::string& idx_contraction,
-                       bool inner,
-                       int num_terms,
-                       const dim_t size_dictionary[]) {
+inline int64_t _flop_count(const std::string& idx_contraction,
+                           bool inner,
+                           int num_terms,
+                           const dim_t size_dictionary[]) {
   size_t overall_size = _compute_size_by_dict(idx_contraction, size_dictionary);
   int op_factor = std::max(1, num_terms - 1);
   if (inner) {
     ++op_factor;
   }
-  return overall_size * op_factor;
+  return static_cast<int64_t>(overall_size) * op_factor;
 }
 
-inline int _flop_count(const std::bitset<MAXAXIS>& idx_contraction,
-                       bool inner,
-                       int num_terms,
-                       const dim_t size_dictionary[]) {
+inline int64_t _flop_count(const std::bitset<MAXAXIS>& idx_contraction,
+                           bool inner,
+                           int num_terms,
+                           const dim_t size_dictionary[]) {
   size_t overall_size = _compute_size_by_dict(idx_contraction, size_dictionary);
   int op_factor = std::max(1, num_terms - 1);
   if (inner) {
     ++op_factor;
   }
-  return overall_size * op_factor;
+  return static_cast<int64_t>(overall_size) * op_factor;
 }
 
 inline Contraction _find_contraction(const std::vector<int>& positions,
@@ -164,16 +164,16 @@ inline int _parse_possible_contraction(const std::vector<int>& positions,
                                        const SetVector& input_sets,
                                        const std::bitset<MAXAXIS>& output_set,
                                        const dim_t idx_dict[],
-                                       int memory_limit,
-                                       int path_cost,
-                                       int naive_cost,
+                                       size_t memory_limit,
+                                       int64_t path_cost,
+                                       int64_t naive_cost,
                                        Alternative* ret) {
   // Find the contraction
   Contraction contract = _find_contraction(positions, input_sets, output_set);
 
   // Sieve the results based on memory_limit
   size_t new_size = _compute_size_by_dict(contract.new_result, idx_dict);
-  if (new_size > static_cast<size_t>(memory_limit)) {
+  if (new_size > memory_limit) {
     return -1;
   }
 
@@ -182,10 +182,10 @@ inline int _parse_possible_contraction(const std::vector<int>& positions,
   for (auto p : positions) {
     old_sizes += _compute_size_by_dict(input_sets[p], idx_dict);
   }
-  int remove_size = old_sizes - new_size;
+  int64_t remove_size = static_cast<int64_t>(old_sizes) - static_cast<int64_t>(new_size);
 
-  int cost = _flop_count(contract.idx_contract, contract.idx_removed.any(),
-                            positions.size(), idx_dict);
+  int64_t cost = _flop_count(contract.idx_contract, contract.idx_removed.any(),
+                             positions.size(), idx_dict);
   ret->cost[0] = -remove_size;
   ret->cost[1] = cost;
 
@@ -206,7 +206,7 @@ inline void _update_other_results(std::vector<Alternative>* results,
   int bx = best_con[0], by = best_con[1];
   size_t size = results->size();
 
-  for (int i = size - 1; i >= 0; --i) {
+  for (int i = static_cast<int>(size) - 1; i >= 0; --i) {
     int x = results->at(i).positions[0], y = results->at(i).positions[1];
 
     // Ignore results involving tensors just contracted
@@ -233,9 +233,9 @@ inline void _update_other_results(std::vector<Alternative>* results,
 inline std::vector<std::vector<int> > _greedy_path(const SetVector* input_sets,
                                                    const std::bitset<MAXAXIS>& output_set,
                                                    const dim_t idx_dict[],
-                                                   int memory_limit) {
-  size_t isize = input_sets->size();
-  size_t iteration_num = isize;
+                                                   size_t memory_limit) {
+  int isize = static_cast<int>(input_sets->size());
+  int iteration_num = isize;
   // Handle trivial cases that leaked through
   if (isize == 1) {
     return std::vector<std::vector<int> >{std::vector<int>{0}};
@@ -245,23 +245,23 @@ inline std::vector<std::vector<int> > _greedy_path(const SetVector* input_sets,
 
   // Build up a naive cost
   std::vector<int> range(isize);
-  for (size_t i = 0; i < isize; ++i) {
+  for (int i = 0; i < isize; ++i) {
     range[i] = i;
   }
   Contraction contract = _find_contraction(range, *input_sets, output_set);
-  int naive_cost = _flop_count(contract.idx_contract, contract.idx_removed.any(),
-                                  isize, idx_dict);
+  int64_t naive_cost = _flop_count(contract.idx_contract, contract.idx_removed.any(),
+                                   isize, idx_dict);
 
   // Initially iterate over all pairs
   std::vector<Alternative> known_contractions;
   Alternative best;
-  int path_cost = 0;
+  int64_t path_cost = 0;
   std::vector<std::vector<int> > ret;
 
-  for (size_t iteration = 0; iteration + 1 < iteration_num; ++iteration) {
+  for (int iteration = 0; iteration + 1 < iteration_num; ++iteration) {
     if (iteration == 0) {
-      for (int x = 0; x < static_cast<int>(isize); ++x) {
-        for (int y = x + 1; y < static_cast<int>(isize); ++y) {
+      for (int x = 0; x < isize; ++x) {
+        for (int y = x + 1; y < isize; ++y) {
           if (!((input_sets->at(x) & input_sets->at(y)).any())) {
             continue;
           }
@@ -280,7 +280,7 @@ inline std::vector<std::vector<int> > _greedy_path(const SetVector* input_sets,
         }
       }
     } else {
-      for (int x = 0; x < static_cast<int>(isize) - 1; ++x) {
+      for (int x = 0; x < isize - 1; ++x) {
         int y = isize - 1;
         if (!((input_sets->at(x) & input_sets->at(y)).any())) {
             continue;
@@ -303,8 +303,8 @@ inline std::vector<std::vector<int> > _greedy_path(const SetVector* input_sets,
     // If we do not have a inner contraction, rescan pairs including outer products
     if (known_contractions.size() == 0) {
       // Then check the outer productsj
-      for (int x = 0; x < static_cast<int>(isize); ++x) {
-        for (int y = x + 1; y < static_cast<int>(isize); ++y) {
+      for (int x = 0; x < isize; ++x) {
+        for (int y = x + 1; y < isize; ++y) {
           Alternative alternative;
           int result = _parse_possible_contraction(std::vector<int>{x, y},
                                                    *input_sets,
@@ -323,7 +323,7 @@ inline std::vector<std::vector<int> > _greedy_path(const SetVector* input_sets,
       // If we still did not find any remaining contractions, default back to einsum like behavior
       if (known_contractions.size() == 0) {
         std::vector<int> range(isize);
-        for (size_t i = 0; i < isize; ++i) {
+        for (int i = 0; i < isize; ++i) {
           range[i] = i;
         }
         ret.push_back(range);
@@ -332,17 +332,17 @@ inline std::vector<std::vector<int> > _greedy_path(const SetVector* input_sets,
     }
 
     // Sort based on first index
-    int best_cost[2], idx = -1;
-    size_t size = known_contractions.size();
-    for (size_t i = 0; i < size; ++i) {
+    int64_t best_cost[2];
+    int idx = -1, size = static_cast<int>(known_contractions.size());
+    for (int i = 0; i < size; ++i) {
       auto x = known_contractions[i];
       if (idx == -1) {
         best_cost[0] = x.cost[0];
         best_cost[1] = x.cost[1];
         idx = i;
       } else if (x.cost[0] < best_cost[0] ||
-               (x.cost[0] == best_cost[0] &&
-               x.cost[1] < best_cost[1])) {
+                 (x.cost[0] == best_cost[0] &&
+                  x.cost[1] < best_cost[1])) {
         best_cost[0] = x.cost[0];
         best_cost[1] = x.cost[1];
         idx = i;
@@ -356,7 +356,7 @@ inline std::vector<std::vector<int> > _greedy_path(const SetVector* input_sets,
     // Next iteration only compute contractions with the new tensor
     // All other contractions have been accounted for
     input_sets = &best.new_input_sets;
-    isize = input_sets->size();
+    isize = static_cast<int>(input_sets->size());
 
     // Update path and total cost
     ret.push_back(best.positions);
@@ -708,9 +708,9 @@ inline std::vector<Step> einsum_path(const std::string& subscripts,
 
   // Build a few useful list and sets
   std::vector<std::string> input_list = split(parsed_subscripts[0], ",");
-  size_t isize = input_list.size();
+  int isize = static_cast<int>(input_list.size());
   SetVector input_sets;
-  for (int i = 0; i < static_cast<int>(isize); ++i) {
+  for (int i = 0; i < isize; ++i) {
     input_sets.push_back(str2set(input_list[i]));
   }
   std::bitset<MAXAXIS> output_set = str2set(parsed_subscripts[1]);
@@ -721,7 +721,7 @@ inline std::vector<Step> einsum_path(const std::string& subscripts,
   dim_t dimension_dict[MAXAXIS];
   SetVector broadcast_indices(isize);
   memset(dimension_dict, -1, sizeof(dimension_dict));
-  for (size_t i = 0; i < isize; ++i) {
+  for (int i = 0; i < isize; ++i) {
     const std::string& term = input_list[i];
     const TShape& sh = operands[i].shape_;
     CHECK_EQ(sh.ndim(), term.length())
@@ -756,8 +756,8 @@ inline std::vector<Step> einsum_path(const std::string& subscripts,
 
   // Compute size of each input array plus the output array
   std::vector<size_t> size_list(isize + 1);
-  size_t max_size = -1, memory_arg;
-  for (size_t i = 0; i < isize; ++i) {
+  size_t max_size = 0, memory_arg;
+  for (int i = 0; i < isize; ++i) {
     size_list[i] = _compute_size_by_dict(input_list[i], dimension_dict);
     max_size = std::max(max_size, size_list[i]);
   }
@@ -778,7 +778,7 @@ inline std::vector<Step> einsum_path(const std::string& subscripts,
   std::vector<std::vector<int> > path;
   if (optimize == false) {
     path.push_back(std::vector<int>());
-    for (size_t i = 0; i < isize; ++i) {
+    for (int i = 0; i < isize; ++i) {
       path[0].push_back(i);
     }
   } else {
@@ -801,7 +801,7 @@ inline std::vector<Step> einsum_path(const std::string& subscripts,
     Contraction contract = _find_contraction(contract_inds, input_sets, output_set);
     input_sets = contract.remaining;
 
-    int cost = _flop_count(contract.idx_contract,
+    int64_t cost = _flop_count(contract.idx_contract,
                            contract.idx_removed.any(),
                            contract_inds.size(),
                            dimension_dict);
@@ -847,9 +847,9 @@ inline std::vector<Step> einsum_path(const std::string& subscripts,
                          a < b);
                 });
     }
-    size_t len_idx_result = idx_result.length();
+    int len_idx_result = static_cast<int>(idx_result.length());
     ret[i].oshape = TShape(len_idx_result, -1);
-    for (size_t j = 0; j < len_idx_result; ++j) {
+    for (int j = 0; j < len_idx_result; ++j) {
       ret[i].oshape[j] = dimension_dict[static_cast<int>(idx_result[j])];
     }
 
@@ -867,18 +867,18 @@ inline std::vector<Step> einsum_path(const std::string& subscripts,
       std::vector<int> left_pos, right_pos;
       left_pos.reserve(MAXAXIS);
       right_pos.reserve(MAXAXIS);
-      size_t tmp[MAXAXIS] = {0};
-      size_t length_left_input = tmp_inputs[0].length();
-      size_t length_right_input = tmp_inputs[1].length();
-      for (size_t j = 0; j < length_right_input; ++j) {
+      int tmp[MAXAXIS] = {0};
+      int length_left_input = static_cast<int>(tmp_inputs[0].length());
+      int length_right_input = static_cast<int>(tmp_inputs[1].length());
+      for (int j = 0; j < length_right_input; ++j) {
         if (contract.idx_removed.test(static_cast<int>(tmp_inputs[1][j]))) {
           tmp[static_cast<int>(tmp_inputs[1][j])] = j;
         }
       }
-      for (size_t j = 0; j < length_left_input; ++j) {
+      for (int j = 0; j < length_left_input; ++j) {
         if (contract.idx_removed.test(static_cast<int>(tmp_inputs[0][j]))) {
-          left_pos.push_back(static_cast<int>(j));
-          right_pos.push_back(static_cast<int>(tmp[static_cast<int>(tmp_inputs[0][j])]));
+          left_pos.push_back(j);
+          right_pos.push_back(tmp[static_cast<int>(tmp_inputs[0][j])]);
         }
       }
       // Calculate left_pos and right_pos
@@ -887,11 +887,11 @@ inline std::vector<Step> einsum_path(const std::string& subscripts,
       // Calculate do_einsum
       ret[i].do_einsum = (tensor_result != idx_result);
       // Calculate tshape
-      CHECK_EQ(tensor_result.length(), len_idx_result)
+      CHECK_EQ(static_cast<int>(tensor_result.length()), len_idx_result)
         << "tensordot produces dim " << tensor_result.length()
         << ", while einsum produces dim " << len_idx_result << ".";
       ret[i].tshape = TShape(len_idx_result, -1);
-      for (size_t j = 0; j < len_idx_result; ++j) {
+      for (int j = 0; j < len_idx_result; ++j) {
         ret[i].tshape[j] = dimension_dict[static_cast<int>(tensor_result[j])];
       }
       // Calculate blas2einsum_str
diff --git a/src/operator/numpy/np_matrix_op-inl.h b/src/operator/numpy/np_matrix_op-inl.h
index b3206bf4aa75..9ce84835f1a8 100644
--- a/src/operator/numpy/np_matrix_op-inl.h
+++ b/src/operator/numpy/np_matrix_op-inl.h
@@ -27,6 +27,7 @@
 
 #include <vector>
 #include <algorithm>
+#include <string>
 #include "../tensor/matrix_op-inl.h"
 #include "../nn/concat-inl.h"
 #include "../../common/utils.h"
@@ -51,6 +52,58 @@ struct NumpyVstackParam : public dmlc::Parameter<NumpyVstackParam> {
   }
 };
 
+struct NumpyReshapeParam : public dmlc::Parameter<NumpyReshapeParam> {
+  mxnet::TShape newshape;
+  std::string order;
+  DMLC_DECLARE_PARAMETER(NumpyReshapeParam) {
+    DMLC_DECLARE_FIELD(newshape)
+        .describe("The new shape should be compatible with the original shape."
+                  " If an integer, then the result will be a 1-D array of that length."
+                  " One shape dimension can be -1. In this case, the value is inferred"
+                  " from the length of the array and remaining dimensions.");
+    DMLC_DECLARE_FIELD(order)
+        .set_default("C")
+        .describe("Read the elements of a using this index order, and place the elements into"
+                  " the reshaped array using this index order. 'C' means to read/write the elements"
+                  " using C-like index order, with the last axis index changing fastest,"
+                  " back to the first axis index changing slowest."
+                  " Note that currently only C-like order is"
+                  " supported");
+  }
+};
+
+struct NumpyXReshapeParam : public dmlc::Parameter<NumpyXReshapeParam> {
+  mxnet::TShape newshape;
+  bool reverse;
+  std::string order;
+  DMLC_DECLARE_PARAMETER(NumpyXReshapeParam) {
+    DMLC_DECLARE_FIELD(newshape)
+        .describe("The new shape should be compatible with the original shape."
+                  " If an integer, then the result will be a 1-D array of that length."
+                  " One shape dimension can be -1. In this case, the value is inferred"
+                  " from the length of the array and remaining dimensions."
+                  " -2 to -6 are used for data manipulation."
+                  " -2 copy this dimension from the input to the output shape."
+                  " -3 will skip current dimension if and only if the current dim size is one."
+                  " -4 copy all remaining input dimensions to the output shape."
+                  " -5 use the product of two consecutive dimensions of the input"
+                  " shape as the output."
+                  " -6 split one dimension of the input into two dimensions passed"
+                  " subsequent to -6 in the new shape.");
+    DMLC_DECLARE_FIELD(reverse)
+        .set_default(false)
+        .describe("If true then the special values are inferred from right to left");
+    DMLC_DECLARE_FIELD(order)
+        .set_default("C")
+        .describe("Read the elements of a using this index order, and place the elements into"
+                  " the reshaped array using this index order. 'C' means to read/write the elements"
+                  " using C-like index order, with the last axis index changing fastest,"
+                  " back to the first axis index changing slowest."
+                  " Note that currently only C-like order is"
+                  " supported");
+  }
+};
+
 template<typename xpu>
 void NumpyTranspose(const nnvm::NodeAttrs& attrs,
                     const OpContext& ctx,
@@ -731,7 +784,6 @@ inline void HSplitOpBackward(const nnvm::NodeAttrs &attrs,
   }
   SplitOpBackwardImpl<xpu>(attrs, ctx, inputs, req, outputs, real_axis);
 }
-
 }  // namespace op
 }  // namespace mxnet
 
diff --git a/src/operator/numpy/np_matrix_op.cc b/src/operator/numpy/np_matrix_op.cc
index 7bcd6ad27b52..0a6f9a150d8b 100644
--- a/src/operator/numpy/np_matrix_op.cc
+++ b/src/operator/numpy/np_matrix_op.cc
@@ -34,6 +34,9 @@ DMLC_REGISTER_PARAMETER(NumpyTransposeParam);
 DMLC_REGISTER_PARAMETER(NumpyRollParam);
 DMLC_REGISTER_PARAMETER(NumpyMoveaxisParam);
 DMLC_REGISTER_PARAMETER(NumpyRot90Param);
+DMLC_REGISTER_PARAMETER(NumpyReshapeParam);
+DMLC_REGISTER_PARAMETER(NumpyXReshapeParam);
+
 
 bool NumpyTransposeShape(const nnvm::NodeAttrs& attrs,
                          mxnet::ShapeVector *in_attrs,
@@ -126,26 +129,6 @@ NNVM_REGISTER_OP(_np_transpose)
 .add_argument("a", "NDArray-or-Symbol", "Source input")
 .add_arguments(NumpyTransposeParam::__FIELDS__());
 
-struct NumpyReshapeParam : public dmlc::Parameter<NumpyReshapeParam> {
-  mxnet::TShape newshape;
-  std::string order;
-  DMLC_DECLARE_PARAMETER(NumpyReshapeParam) {
-      DMLC_DECLARE_FIELD(newshape)
-          .describe("The new shape should be compatible with the original shape."
-                    " If an integer, then the result will be a 1-D array of that length."
-                    " One shape dimension can be -1. In this case, the value is inferred"
-                    " from the length of the array and remaining dimensions.");
-      DMLC_DECLARE_FIELD(order)
-      .set_default("C")
-      .describe("Read the elements of a using this index order, and place the elements into"
-                " the reshaped array using this index order. 'C' means to read/write the elements"
-                " using C-like index order, with the last axis index changing fastest, back to the"
-                " first axis index changing slowest. Note that currently only C-like order is"
-                " supported");
-  }
-};
-
-DMLC_REGISTER_PARAMETER(NumpyReshapeParam);
 
 bool NumpyReshapeInferShape(const mxnet::TShape& src, mxnet::TShape* dst) {
   if (shape_is_known(src) && shape_is_known(*dst)) {
@@ -202,6 +185,164 @@ bool NumpyReshapeShape(const nnvm::NodeAttrs& attrs,
   return success;
 }
 
+bool NumpyXReshapeInferShape(const mxnet::TShape& src,
+                             const mxnet::TShape& target,
+                             mxnet::TShape* output,
+                             const std::string &default_error_msg) {
+  bool target_shape_is_known = true;
+  dim_t target_size = 1;
+  for (int i = 0; i < target.ndim(); ++i) {
+    if (target[i] < 0) {
+      target_shape_is_known = false;
+      target_size = -1;
+      break;
+    } else {
+      target_size *= target[i];
+    }
+  }
+  if (shape_is_known(src) && target_shape_is_known) {
+    CHECK_EQ(src.Size(), target_size) << default_error_msg;
+    *output = TShape(target.begin(), target.end());
+    return true;
+  } else if (!shape_is_known(src) || target.ndim() == -1) {
+    return false;
+  } else {
+    int unknown_axis = -1;
+    dim_t known_dim_size_prod = 1;
+    std::vector<dim_t> output_shape_vector;
+    int src_inx = 0;
+    for (int i = 0; i < target.ndim(); ++i) {
+      dim_t proposed_dim = target[i];
+      CHECK(proposed_dim >= -6)
+        << "Dimension size must be greater than or equal to -6, received " << proposed_dim;
+      if (proposed_dim == -1) {
+        // infer the unknown dimension
+        CHECK_LT(unknown_axis, 0)
+          << "One and only one dim can be inferred";
+        unknown_axis = output_shape_vector.size();
+        output_shape_vector.push_back(-1);
+        src_inx++;
+      } else if (proposed_dim == -2) {
+        // copy the dimension from src to output
+        CHECK_LT(src_inx, src.ndim())
+          << "Mismatched dimensions in proposed new shape";
+        known_dim_size_prod *= src[src_inx];
+        output_shape_vector.push_back(src[src_inx++]);
+      } else if (proposed_dim == -3) {
+        // skip the source dimension if and only if it is one
+        CHECK_EQ(src[src_inx], 1)
+          << "-3 index should only be used to skip dimension size 1";
+        src_inx++;
+      } else if (proposed_dim == -4) {
+        // copy all remaining dims from source
+        while (src_inx < src.ndim()) {
+          known_dim_size_prod *= src[src_inx];
+          const dim_t dn = src[src_inx++];
+          output_shape_vector.push_back(dn);
+        }
+      } else if (proposed_dim == -5) {
+        // merge two dims from source
+        CHECK_LT(src_inx, src.ndim()-1)
+          << "Not enough dimensions left for the product";
+        const dim_t d1 = src[src_inx++];
+        const dim_t d2 = src[src_inx++];
+        if (!mxnet::dim_size_is_known(d1) || !mxnet::dim_size_is_known(d2)) {
+          CHECK_LT(unknown_axis, 0)
+            << "One and only one dim can be inferred";
+          unknown_axis = output_shape_vector.size();
+          output_shape_vector.push_back(-1);
+        } else {
+          known_dim_size_prod *= d1*d2;
+          output_shape_vector.push_back(d1 * d2);
+        }
+      } else if (proposed_dim == -6) {
+        // split the source dim s into two dims
+        // read the left dim and then the right dim (either can be -1)
+        CHECK_LT(i + 2, target.ndim());
+        CHECK_LT(src_inx, src.ndim());
+        const dim_t d0 = src[src_inx++];
+        dim_t d1 = target[++i];
+        dim_t d2 = target[++i];
+        CHECK(d1 != -1 || d2 != -1) << "Split dims cannot both be -1.";
+        if (d1 == -1 && d0 >= 0) d1 = d0 / d2;  // d0 must be known to do this
+        if (d2 == -1 && d0 >= 0) d2 = d0 / d1;  // d0 must be known to do this
+        CHECK(d1 * d2 == static_cast<dim_t>(d0) || static_cast<dim_t>(d0) == dim_t(-1))
+          << "Split dims " << d1 << ", " << d2 << " do not divide original dim " << d0;
+        if (d1 == -1) {
+          CHECK_LT(unknown_axis, 0)
+            << "One and only one dim can be inferred";
+          unknown_axis = output_shape_vector.size();
+        } else if (d2 == -1) {
+          CHECK_LT(unknown_axis, 0)
+            << "One and only one dim can be inferred";
+          unknown_axis = output_shape_vector.size() + 1;
+        }
+        known_dim_size_prod *= d0 == -1 ? 1 : d0;
+        output_shape_vector.push_back(d1);
+        output_shape_vector.push_back(d2);
+      } else {
+        // non-negative value: use it directly as the new dimension
+        known_dim_size_prod *= proposed_dim;
+        output_shape_vector.push_back(proposed_dim);
+        src_inx++;
+      }
+    }
+
+    if (unknown_axis > -1) {
+      // if the input is a zero-size tensor, the output must be a known shape of zero size
+      CHECK_NE(known_dim_size_prod, 0) << default_error_msg;
+      CHECK(src.Size() % known_dim_size_prod == 0) << default_error_msg;
+      output_shape_vector[unknown_axis] = src.Size() / known_dim_size_prod;
+    }
+
+    *output = mxnet::TShape(output_shape_vector.begin(), output_shape_vector.end());
+    CHECK_EQ((*output).Size(), src.Size()) << default_error_msg;
+    return true;
+  }
+}
+
+bool NumpyXReshapeShape(const nnvm::NodeAttrs& attrs,
+                       mxnet::ShapeVector* in_attrs,
+                       mxnet::ShapeVector* out_attrs) {
+  CHECK_EQ(in_attrs->size(), 1U) << "Input: [data]";
+  CHECK_EQ(out_attrs->size(), 1U);
+  const NumpyXReshapeParam& param = nnvm::get<NumpyXReshapeParam>(attrs.parsed);
+  // sanity check
+  bool has_unknown_dim_size = false;
+  for (int i = 0; i < param.newshape.ndim(); ++i) {
+    if (param.newshape[i] < 0) {
+      CHECK_GE(param.newshape[i], -6)
+        << "Dimension size must be greater than or equal to -6";
+      if (param.newshape[i] == -1) {
+        CHECK(!has_unknown_dim_size) << "Can only specify one unknown dimension";
+        has_unknown_dim_size = true;
+      }
+    }
+  }
+
+  mxnet::TShape output_shape;
+  bool success;
+  std::stringstream ss;
+  ss << "Cannot reshape array of shape " << in_attrs->at(0)
+     << " into shape " << param.newshape
+     << ", reverse = " << param.reverse;
+  std::string err_msg = ss.str();
+  if (!param.reverse) {
+    success = NumpyXReshapeInferShape(in_attrs->at(0),
+                                      param.newshape, &output_shape, err_msg);
+  } else {
+    mxnet::TShape rev_in_shape = in_attrs->at(0);
+    mxnet::TShape rev_newshape = param.newshape;
+    std::reverse(rev_in_shape.begin(), rev_in_shape.end());
+    std::reverse(rev_newshape.begin(), rev_newshape.end());
+    success = NumpyXReshapeInferShape(rev_in_shape,
+                                      rev_newshape, &output_shape, err_msg);
+    std::reverse(output_shape.begin(), output_shape.end());
+  }
+  SHAPE_ASSIGN_CHECK(*out_attrs, 0, output_shape);
+  return success;
+}
+
 NNVM_REGISTER_OP(_np_reshape)
 .describe(R"code()code" ADD_FILELINE)
 .add_alias("_npi_reshape")
@@ -227,6 +368,31 @@ NNVM_REGISTER_OP(_np_reshape)
 .add_argument("a", "NDArray-or-Symbol", "Array to be reshaped.")
 .add_arguments(NumpyReshapeParam::__FIELDS__());
 
+
+NNVM_REGISTER_OP(_npx_reshape)
+.describe(R"code()code" ADD_FILELINE)
+.set_num_inputs(1)
+.set_num_outputs(1)
+.set_attr_parser(ParamParser<NumpyXReshapeParam>)
+.set_attr<mxnet::FInferShape>("FInferShape", NumpyXReshapeShape)
+.set_attr<nnvm::FInferType>("FInferType", ElemwiseType<1, 1>)
+.set_attr<nnvm::FGradient>("FGradient", ElemwiseGradUseNone{"_backward_reshape"})
+.set_attr<FCompute>("FCompute<cpu>", UnaryOp::IdentityCompute<cpu>)
+.set_attr<nnvm::FInplaceOption>("FInplaceOption",
+  [](const NodeAttrs& attrs) {
+    return std::vector<std::pair<int, int> >{{0, 0}};
+  })
+.set_attr<nnvm::FInplaceIdentity>("FInplaceIdentity",
+  [](const NodeAttrs& attrs) {
+    return std::vector<bool>{true};
+  })
+.set_attr<nnvm::FListInputNames>("FListInputNames",
+  [](const NodeAttrs& attrs) {
+    return std::vector<std::string>{"a"};
+  })
+.add_argument("a", "NDArray-or-Symbol", "Array to be reshaped.")
+.add_arguments(NumpyXReshapeParam::__FIELDS__());
+
 bool NumpySqueezeShape(const nnvm::NodeAttrs& attrs,
                        mxnet::ShapeVector *in_attrs,
                        mxnet::ShapeVector *out_attrs) {
diff --git a/src/operator/numpy/np_matrix_op.cu b/src/operator/numpy/np_matrix_op.cu
index 8c8301bb3bbf..6b4f7a11a9a2 100644
--- a/src/operator/numpy/np_matrix_op.cu
+++ b/src/operator/numpy/np_matrix_op.cu
@@ -109,5 +109,8 @@ NNVM_REGISTER_OP(_npi_hsplit)
 NNVM_REGISTER_OP(_npi_hsplit_backward)
 .set_attr<FCompute>("FCompute<gpu>", HSplitOpBackward<gpu>);
 
+NNVM_REGISTER_OP(_npx_reshape)
+.set_attr<FCompute>("FCompute<gpu>", UnaryOp::IdentityCompute<gpu>);
+
 }  // namespace op
 }  // namespace mxnet
diff --git a/src/operator/numpy/np_nonzero_op.cc b/src/operator/numpy/np_nonzero_op.cc
index 00f9081ba984..0eaf0878a24a 100644
--- a/src/operator/numpy/np_nonzero_op.cc
+++ b/src/operator/numpy/np_nonzero_op.cc
@@ -91,7 +91,7 @@ void NonzeroForwardCPU(const nnvm::NodeAttrs& attrs,
   std::vector<int32_t> prefix_sum(in_size, 0);
   size_t valid_num = 0;
   // Calculate prefix sum
-  MSHADOW_TYPE_SWITCH(in.dtype(), DType, {
+  MSHADOW_TYPE_SWITCH_WITH_BOOL(in.dtype(), DType, {
     DType* in_dptr = in.data().dptr<DType>();
     for (size_t i = 0; i < in_size; i++) {
       prefix_sum[i] = (i == 0) ? 0 : prefix_sum[i - 1];
@@ -113,6 +113,7 @@ void NonzeroForwardCPU(const nnvm::NodeAttrs& attrs,
 }
 
 NNVM_REGISTER_OP(_npx_nonzero)
+.add_alias("_npi_nonzero")
 .set_num_inputs(1)
 .set_num_outputs(1)
 .set_attr<nnvm::FListInputNames>("FListInputNames",
diff --git a/src/operator/numpy/np_nonzero_op.cu b/src/operator/numpy/np_nonzero_op.cu
index 33925ea2e156..c732d2c78493 100644
--- a/src/operator/numpy/np_nonzero_op.cu
+++ b/src/operator/numpy/np_nonzero_op.cu
@@ -80,7 +80,7 @@ void NonzeroForwardGPU(const nnvm::NodeAttrs& attrs,
     ctx.requested[0].get_space_typed<gpu, 1, char>(Shape1(temp_storage_bytes), stream);
   prefix_sum = reinterpret_cast<int32_t*>(workspace.dptr_);
   d_temp_storage = workspace.dptr_ + buffer_size;
-  MSHADOW_TYPE_SWITCH(in.dtype(), DType, {
+  MSHADOW_TYPE_SWITCH_WITH_BOOL(in.dtype(), DType, {
     mxnet_op::Kernel<PrefixSumInit, gpu>::Launch(
       stream, in_size, prefix_sum, in.data().dptr<DType>());
   });
diff --git a/src/operator/numpy/random/np_choice_op.h b/src/operator/numpy/random/np_choice_op.h
index 335cc2741759..a6a7cecfefd5 100644
--- a/src/operator/numpy/random/np_choice_op.h
+++ b/src/operator/numpy/random/np_choice_op.h
@@ -118,15 +118,17 @@ struct random_indices {
 
 // Weighted sample without replacement.
 // Use perturbed Gumbel variates as keys.
+template <typename IType>
 struct generate_keys {
-  MSHADOW_XINLINE static void Map(index_t i, float *uniforms, float *weights) {
+  MSHADOW_XINLINE static void Map(index_t i, float *uniforms, IType *weights) {
     uniforms[i] = -logf(-logf(uniforms[i])) + logf(weights[i]);
   }
 };
 
 // Weighted sample with replacement.
+template <typename IType>
 struct categorical_sampling {
-  MSHADOW_XINLINE static void Map(index_t i, float *weights, size_t length,
+  MSHADOW_XINLINE static void Map(index_t i, IType *weights, size_t length,
                                   float *uniforms, int64_t *outs) {
     outs[i] = 0;
     float acc = 0.0;
@@ -179,15 +181,19 @@ void NumpyChoiceForward(const nnvm::NodeAttrs &attrs, const OpContext &ctx,
     prnd->SampleUniform(&random_numbers, 0, 1);
     workspace_ptr += ((random_tensor_size * sizeof(float) / 7 + 1) * 8);
     if (replace) {
-      Kernel<categorical_sampling, xpu>::Launch(
-          s, output_size, inputs[weight_index].dptr<float>(), input_size,
-          random_numbers.dptr_, outputs[0].dptr<int64_t>());
+      MSHADOW_REAL_TYPE_SWITCH(inputs[weight_index].type_flag_, IType, {
+        Kernel<categorical_sampling<IType>, xpu>::Launch(
+            s, output_size, inputs[weight_index].dptr<IType>(), input_size,
+            random_numbers.dptr_, outputs[0].dptr<int64_t>());
+      });
     } else {
       Tensor<xpu, 1, int64_t> indices = Tensor<xpu, 1, int64_t>(
           reinterpret_cast<int64_t *>(workspace_ptr), Shape1(indices_size), s);
       indices = expr::range((int64_t)0, input_size);
-      Kernel<generate_keys, xpu>::Launch(s, input_size, random_numbers.dptr_,
-                                         inputs[weight_index].dptr<float>());
+      MSHADOW_REAL_TYPE_SWITCH(inputs[weight_index].type_flag_, IType, {
+        Kernel<generate_keys<IType>, xpu>::Launch(s, input_size, random_numbers.dptr_,
+                                           inputs[weight_index].dptr<IType>());
+      });
       _sort<xpu>(random_numbers.dptr_, indices.dptr_, input_size);
       Copy(outputs[0].FlatTo1D<xpu, int64_t>(s), indices.Slice(0, output_size), s);
     }
diff --git a/src/operator/rnn-inl.h b/src/operator/rnn-inl.h
index f69b5415501a..9019726ebcc4 100644
--- a/src/operator/rnn-inl.h
+++ b/src/operator/rnn-inl.h
@@ -181,7 +181,7 @@ inline int GetRnnBiasSize(int num_layer,
  *  - wh[1...Ngates] * h[t] time by time(sz: NxHxNgates)
  *  - output -> h[t](, c[t] additionally with Lstm) time by time(sz: NxH(x2))
  *  - intermediate y[1...T] as next layer's inputs(sz: TxNxHxD)
- */ 
+ */
 inline size_t GetRNNWorkspaceSize(int seq_length,
                                   int batch_size,
                                   int hidden_size,
@@ -420,6 +420,7 @@ class RNNOp {
     this->param_ = param;
     this->ctx_ = ctx;
 
+    if (ctx_.dev_type == kGPU) {
 #if MXNET_USE_CUDNN == 1
     init_cudnn_ = false;
     dtype_ = mshadow::DataType<DType>::kCudnnFlag;
@@ -503,6 +504,7 @@ class RNNOp {
       LOG(FATAL) << "RNN on GPU is only available for cuDNN at the moment.";
     }
 #endif  // MXNET_USE_CUDNN == 1
+    }
 
     if (ctx_.dev_type == kCPU) {
       this->init_space_ = false;
@@ -521,6 +523,7 @@ class RNNOp {
   }
 
   ~RNNOp() {
+    if (ctx_.dev_type == kGPU) {
 #if MXNET_USE_CUDNN == 1
     CUDNN_CALL(cudnnDestroyTensorDescriptor(hx_desc_));
     CUDNN_CALL(cudnnDestroyTensorDescriptor(cx_desc_));
@@ -555,6 +558,7 @@ class RNNOp {
     CUDNN_CALL(cudnnDestroyRNNDataDescriptor(dy_data_desc_));
 #endif  // MXNET_USE_CUDNN_GE_7200
 #endif  // MXNET_USE_CUDNN
+    }
   }
 
   void Forward(const OpContext &ctx, const std::vector<TBlob> &in_data,
diff --git a/src/operator/subgraph/build_subgraph.cc b/src/operator/subgraph/build_subgraph.cc
index 0f4c570331a2..8e7617d57c44 100644
--- a/src/operator/subgraph/build_subgraph.cc
+++ b/src/operator/subgraph/build_subgraph.cc
@@ -717,7 +717,7 @@ nnvm::Graph BuildSubgraph(nnvm::Graph&& g) {
   using namespace sg;
 
   const SubgraphPropertyPtr& subg_prop = g.GetAttr<SubgraphPropertyPtr>("subgraph_property");
-  if (verbose) {
+  if (verbose > 1) {
     const std::string& prop_name = subg_prop->HasAttr("property_name")
                                        ? subg_prop->GetAttr<std::string>("property_name")
                                        : "partition graph";
diff --git a/src/operator/tensor/indexing_op.cc b/src/operator/tensor/indexing_op.cc
index 9961218b5482..470abee71a59 100644
--- a/src/operator/tensor/indexing_op.cc
+++ b/src/operator/tensor/indexing_op.cc
@@ -29,7 +29,7 @@ namespace mxnet {
 namespace op {
 
 template<bool clip = true>
-struct TakeCPU {
+struct TakeZeroAxisCPU {
   // assume that idx have been flattened to a 1-D tensor (N,)
   // assume that out_data and in_data have been flattened to 2-D tensors, (N, M) and (K, M)
   // M is the number of columns of in_data and out_data
@@ -88,8 +88,9 @@ void EmbeddingOpForwardDnsImpl<cpu>(mshadow::Stream<cpu>* s,
       Tensor<cpu, 2, DType> wmat = weight.get<cpu, 2, DType>(s);
       Tensor<cpu, 2, DType> out = output.get_with_shape<cpu, 2, DType>(
         Shape2(oshape.ProdShape(0, oshape.ndim()-1), oshape[oshape.ndim()-1]), s);
-      Kernel<TakeCPU<true>, cpu>::Launch(s, oshape.Size() / wmat.shape_[1], out.dptr_, wmat.dptr_,
-                                         idx.dptr_, wmat.shape_[1], wmat.shape_[0]);
+      Kernel<TakeZeroAxisCPU<true>, cpu>::Launch(s, oshape.Size() / wmat.shape_[1], out.dptr_,
+                                                 wmat.dptr_, idx.dptr_,
+                                                 wmat.shape_[1], wmat.shape_[0]);
     });
   });
 }
@@ -308,17 +309,17 @@ void TakeOpForward<cpu>(const nnvm::NodeAttrs& attrs,
       }
       if (actual_axis == 0) {
         if (param.mode == take_::kClip) {
-          Kernel<TakeCPU<true>, cpu>::Launch(s, idxshape.Size(),
-                                             outputs[take_::kOut].dptr<DType>(),
-                                             inputs[take_::kArr].dptr<DType>(),
-                                             inputs[take_::kIdx].dptr<IType>(),
-                                             oshape.Size()/idxshape.Size(), arrshape[0]);
+          Kernel<TakeZeroAxisCPU<true>, cpu>::Launch(s, idxshape.Size(),
+                                                     outputs[take_::kOut].dptr<DType>(),
+                                                     inputs[take_::kArr].dptr<DType>(),
+                                                     inputs[take_::kIdx].dptr<IType>(),
+                                                     oshape.Size()/idxshape.Size(), arrshape[0]);
         } else {
-          Kernel<TakeCPU<false>, cpu>::Launch(s, idxshape.Size(),
-                                              outputs[take_::kOut].dptr<DType>(),
-                                              inputs[take_::kArr].dptr<DType>(),
-                                              inputs[take_::kIdx].dptr<IType>(),
-                                              oshape.Size()/idxshape.Size(), arrshape[0]);
+          Kernel<TakeZeroAxisCPU<false>, cpu>::Launch(s, idxshape.Size(),
+                                                      outputs[take_::kOut].dptr<DType>(),
+                                                      inputs[take_::kArr].dptr<DType>(),
+                                                      inputs[take_::kIdx].dptr<IType>(),
+                                                      oshape.Size()/idxshape.Size(), arrshape[0]);
         }
       } else {
         mshadow::Shape<10> in_strides;
@@ -332,21 +333,25 @@ void TakeOpForward<cpu>(const nnvm::NodeAttrs& attrs,
           out_strides[i] = stride;
         }
         if (param.mode == take_::kClip) {
-          Kernel<Take<true>, cpu>::Launch(s, oshape.Size(),
-                                          outputs[take_::kOut].dptr<DType>(),
-                                          inputs[take_::kArr].dptr<DType>(),
-                                          inputs[take_::kIdx].dptr<IType>(),
-                                          in_strides, out_strides, arrshape.ndim(),
-                                          oshape.ndim(), idxshape.ndim(),
-                                          arrshape[actual_axis], actual_axis);
+          Kernel<TakeNonzeroAxis<true>, cpu>::Launch(s, oshape.Size(),
+                                                     outputs[take_::kOut].dptr<DType>(),
+                                                     inputs[take_::kArr].dptr<DType>(),
+                                                     inputs[take_::kIdx].dptr<IType>(),
+                                                     out_strides[actual_axis-1],
+                                                     in_strides[actual_axis-1],
+                                                     in_strides[actual_axis], arrshape.ndim(),
+                                                     oshape.ndim(), idxshape.ndim(),
+                                                     arrshape[actual_axis], actual_axis);
         } else {
-          Kernel<Take<false>, cpu>::Launch(s, oshape.Size(),
-                                           outputs[take_::kOut].dptr<DType>(),
-                                           inputs[take_::kArr].dptr<DType>(),
-                                           inputs[take_::kIdx].dptr<IType>(),
-                                           in_strides, out_strides, arrshape.ndim(),
-                                           oshape.ndim(), idxshape.ndim(),
-                                           arrshape[actual_axis], actual_axis);
+          Kernel<TakeNonzeroAxis<false>, cpu>::Launch(s, oshape.Size(),
+                                                      outputs[take_::kOut].dptr<DType>(),
+                                                      inputs[take_::kArr].dptr<DType>(),
+                                                      inputs[take_::kIdx].dptr<IType>(),
+                                                      out_strides[actual_axis-1],
+                                                      in_strides[actual_axis-1],
+                                                      in_strides[actual_axis], arrshape.ndim(),
+                                                      oshape.ndim(), idxshape.ndim(),
+                                                      arrshape[actual_axis], actual_axis);
         }
       }
     });
diff --git a/src/operator/tensor/indexing_op.cu b/src/operator/tensor/indexing_op.cu
index 0b4c20bf2bb5..3ccf1f39d4f7 100644
--- a/src/operator/tensor/indexing_op.cu
+++ b/src/operator/tensor/indexing_op.cu
@@ -116,11 +116,8 @@ struct AddTakeGradRspDeterministicKernel {
   }
 };
 
-/*! \brief name the struct Take instead of take
- * to avoid conflict with the take function in mshadow
- */
 template<bool clip = true>
-struct TakeGPU {
+struct TakeZeroAxisGPU {
   // assume that idx have been flattened to a 1-D tensor (N,)
   // assume that out_data and in_data have been flattened to 2-D tensors, (N, M) and (K, M)
   // M is the number of columns of in_data and out_data
@@ -180,8 +177,8 @@ void EmbeddingOpForwardDnsImpl<gpu>(mshadow::Stream<gpu>* s,
       Tensor<gpu, 2, DType> wmat = weight.get<gpu, 2, DType>(s);
       Tensor<gpu, 2, DType> out = output.get_with_shape<gpu, 2, DType>(
         Shape2(oshape.ProdShape(0, oshape.ndim()-1), oshape[oshape.ndim()-1]), s);
-      Kernel<TakeGPU<true>, gpu>::Launch(s, oshape.Size(), out.dptr_, wmat.dptr_,
-                                         idx.dptr_, wmat.shape_[1], wmat.shape_[0]);
+      Kernel<TakeZeroAxisGPU<true>, gpu>::Launch(s, oshape.Size(), out.dptr_, wmat.dptr_,
+                                                 idx.dptr_, wmat.shape_[1], wmat.shape_[0]);
     });
   });
 }
@@ -502,17 +499,17 @@ void TakeOpForward<gpu>(const nnvm::NodeAttrs& attrs,
       }
       if (actual_axis == 0) {
         if (param.mode == take_::kClip) {
-          Kernel<TakeGPU<true>, gpu>::Launch(s, oshape.Size(),
-                                             outputs[take_::kOut].dptr<DType>(),
-                                             inputs[take_::kArr].dptr<DType>(),
-                                             inputs[take_::kIdx].dptr<IType>(),
-                                             oshape.Size()/idxshape.Size(), arrshape[0]);
+          Kernel<TakeZeroAxisGPU<true>, gpu>::Launch(s, oshape.Size(),
+                                                     outputs[take_::kOut].dptr<DType>(),
+                                                     inputs[take_::kArr].dptr<DType>(),
+                                                     inputs[take_::kIdx].dptr<IType>(),
+                                                     oshape.Size()/idxshape.Size(), arrshape[0]);
         } else {
-          Kernel<TakeGPU<false>, gpu>::Launch(s, oshape.Size(),
-                                              outputs[take_::kOut].dptr<DType>(),
-                                              inputs[take_::kArr].dptr<DType>(),
-                                              inputs[take_::kIdx].dptr<IType>(),
-                                              oshape.Size()/idxshape.Size(), arrshape[0]);
+          Kernel<TakeZeroAxisGPU<false>, gpu>::Launch(s, oshape.Size(),
+                                                      outputs[take_::kOut].dptr<DType>(),
+                                                      inputs[take_::kArr].dptr<DType>(),
+                                                      inputs[take_::kIdx].dptr<IType>(),
+                                                      oshape.Size()/idxshape.Size(), arrshape[0]);
         }
       } else {
         mshadow::Shape<10> in_strides;
@@ -526,19 +523,27 @@ void TakeOpForward<gpu>(const nnvm::NodeAttrs& attrs,
           out_strides[i] = stride;
         }
         if (param.mode == take_::kClip) {
-          Kernel<Take<true>, gpu>::Launch(s, oshape.Size(),
-                                          outputs[take_::kOut].dptr<DType>(),
-                                          inputs[take_::kArr].dptr<DType>(),
-                                          inputs[take_::kIdx].dptr<IType>(),
-                                          in_strides, out_strides, arrshape.ndim(), oshape.ndim(),
-                                          idxshape.ndim(), arrshape[actual_axis], actual_axis);
+          Kernel<TakeNonzeroAxis<true>, gpu>::Launch(s, oshape.Size(),
+                                                     outputs[take_::kOut].dptr<DType>(),
+                                                     inputs[take_::kArr].dptr<DType>(),
+                                                     inputs[take_::kIdx].dptr<IType>(),
+                                                     out_strides[actual_axis-1],
+                                                     in_strides[actual_axis-1],
+                                                     in_strides[actual_axis],
+                                                     arrshape.ndim(), oshape.ndim(),
+                                                     idxshape.ndim(), arrshape[actual_axis],
+                                                     actual_axis);
         } else {
-          Kernel<Take<false>, gpu>::Launch(s, oshape.Size(),
-                                           outputs[take_::kOut].dptr<DType>(),
-                                           inputs[take_::kArr].dptr<DType>(),
-                                           inputs[take_::kIdx].dptr<IType>(),
-                                           in_strides, out_strides, arrshape.ndim(), oshape.ndim(),
-                                           idxshape.ndim(), arrshape[actual_axis], actual_axis);
+          Kernel<TakeNonzeroAxis<false>, gpu>::Launch(s, oshape.Size(),
+                                                      outputs[take_::kOut].dptr<DType>(),
+                                                      inputs[take_::kArr].dptr<DType>(),
+                                                      inputs[take_::kIdx].dptr<IType>(),
+                                                      out_strides[actual_axis-1],
+                                                      in_strides[actual_axis-1],
+                                                      in_strides[actual_axis],
+                                                      arrshape.ndim(), oshape.ndim(),
+                                                      idxshape.ndim(), arrshape[actual_axis],
+                                                      actual_axis);
         }
       }
     });
diff --git a/src/operator/tensor/indexing_op.h b/src/operator/tensor/indexing_op.h
index bb524dd0f5e9..828d761fefd4 100644
--- a/src/operator/tensor/indexing_op.h
+++ b/src/operator/tensor/indexing_op.h
@@ -296,11 +296,11 @@ inline bool SparseEmbeddingOpBackwardStorageType(const nnvm::NodeAttrs& attrs,
   return dispatched;
 }
 
-/*! \brief name the struct Take instead of take
- * to avoid conflict with the take function in mshadow
+/*! \brief TakeNonzeroAxis implements the general take for a nonzero axis;
+ * use TakeZeroAxisGPU or TakeZeroAxisCPU when the axis is zero
  */
 template<bool clip = true>
-struct Take {
+struct TakeNonzeroAxis {
   /*!
    * \brief Map function for take operator
    * \param i           global thread id
@@ -315,28 +315,28 @@ struct Take {
    */
   template<typename DType, typename IType>
   MSHADOW_XINLINE static void Map(index_t i, DType* out_data, const DType* in_data,
-                                  const IType* idx,
-                                  const mshadow::Shape<10> in_stride,
-                                  const mshadow::Shape<10> out_stride,
+                                  const IType* idx, const int out_prev_stride,
+                                  const int in_prev_stride, const int in_stride,
                                   const int in_ndims, const int out_ndims, const int idx_ndims,
                                   const int axis_dim, const int axis) {
     // i is the global flattened index in the output
-    const int64_t out_head_index = (axis == 0) ? 0 : (i / out_stride[axis - 1]);
-    const int64_t out_rest_index = (axis == 0) ? i : (i % out_stride[axis - 1]);
-    const int64_t out_mid_index = out_rest_index / in_stride[axis];
+    const int64_t out_head_index = i / out_prev_stride;
+    const int64_t out_rest_index = i % out_prev_stride;
+    const int64_t out_mid_index = out_rest_index / in_stride;
     const int64_t out_tail_index = (axis == in_ndims - 1) ?
-                                   0 : (out_rest_index % in_stride[axis]);
+                                   0 : (out_rest_index % in_stride);
     int64_t idx_index = static_cast<int64_t>(idx[out_mid_index]);
     if (clip) {
       idx_index = (idx_index < 0) ? 0 : idx_index;
       idx_index = (idx_index > axis_dim - 1) ? (axis_dim - 1) : idx_index;
+    } else {
+      idx_index %= axis_dim;
+      idx_index += (idx_index < 0) ? axis_dim : 0;
     }
-    idx_index %= axis_dim;
-    idx_index += (idx_index < 0) ? axis_dim : 0;
     const int64_t in_tail_index = out_tail_index;
     const int64_t in_head_index = out_head_index;
-    int64_t in_src_index = in_tail_index + idx_index * in_stride[axis];
-    in_src_index += (axis == 0) ? 0 : in_head_index * in_stride[axis - 1];
+    int64_t in_src_index = in_tail_index + idx_index * in_stride;
+    in_src_index += in_head_index * in_prev_stride;
     out_data[i] = in_data[in_src_index];
   }
 };
diff --git a/tests/nightly/JenkinsfileForBinaries b/tests/nightly/JenkinsfileForBinaries
index a66159d0075b..af87b2c35658 100755
--- a/tests/nightly/JenkinsfileForBinaries
+++ b/tests/nightly/JenkinsfileForBinaries
@@ -20,7 +20,7 @@
 
 mx_lib = 'lib/libmxnet.so, lib/libmxnet.a, lib/libtvm_runtime.so, lib/libtvmop.so, 3rdparty/dmlc-core/libdmlc.a, 3rdparty/tvm/nnvm/lib/libnnvm.a'
 mx_cmake_lib = 'build/libmxnet.so, build/libmxnet.a, build/3rdparty/tvm/libtvm_runtime.so, build/libtvmop.so, build/3rdparty/dmlc-core/libdmlc.a, build/tests/mxnet_unit_tests, build/3rdparty/openmp/runtime/src/libomp.so'
-mx_lib_cpp_example = 'lib/libmxnet.so, lib/libmxnet.a, lib/libtvm_runtime.so, lib/libtvmop.so, 3rdparty/dmlc-core/libdmlc.a, 3rdparty/tvm/nnvm/lib/libnnvm.a, build/cpp-package/example/imagenet_inference'
+mx_lib_cpp_example_mkl = 'lib/libmxnet.so, lib/libmxnet.a, lib/libtvm_runtime.so, lib/libtvmop.so, 3rdparty/dmlc-core/libdmlc.a, 3rdparty/tvm/nnvm/lib/libnnvm.a, build/cpp-package/example/imagenet_inference, lib/libmkldnn.so.0, lib/libmklml_intel.so'
 
 node('utility') {
   // Loading the utilities requires a node context unfortunately
@@ -34,10 +34,10 @@ core_logic: {
   stage('Build') {
     parallel 'GPU: CUDA10.1+cuDNN7': {
       node(NODE_LINUX_CPU) {
-        ws('workspace/build-gpu') {
+        ws('workspace/build-mkldnn-gpu') {
           utils.init_git()
-          utils.docker_run('ubuntu_build_cuda', 'build_ubuntu_gpu_cuda101_cudnn7', false)
-          utils.pack_lib('gpu', mx_lib_cpp_example)
+          utils.docker_run('ubuntu_build_cuda', 'build_ubuntu_gpu_mkldnn', false)
+          utils.pack_lib('gpu', mx_lib_cpp_example_mkl)
         }
       }
     }/*,
@@ -73,7 +73,7 @@ core_logic: {
     'ImageNet Inference: GPU': {
       node(NODE_LINUX_GPU) {
         ws('workspace/nt-ImageInferenceTest') {
-          utils.unpack_and_init('gpu', mx_lib_cpp_example)
+          utils.unpack_and_init('gpu', mx_lib_cpp_example_mkl)
           utils.docker_run('ubuntu_nightly_gpu', 'nightly_test_imagenet_inference', true)
         }
       }
diff --git a/tests/nightly/test_large_array.py b/tests/nightly/test_large_array.py
index c18a95400f22..74ac179a7e60 100644
--- a/tests/nightly/test_large_array.py
+++ b/tests/nightly/test_large_array.py
@@ -1351,17 +1351,17 @@ def check_trunc():
 
 
 def create_input_for_trigonometric_ops(vals):
-    # Creates large vector input of size(LARGE_X*10, SMALL_Y/10) from vals using tile operator
+    # Creates large vector input of size(LARGE_X*10, SMALL_Y/10) from vals using broadcast_to operator
     inp = nd.array(vals).reshape(1, 5)
     inp = nd.broadcast_to(inp, (LARGE_X*10, SMALL_Y//10))
     return inp
 
 
-def assert_correctness_of_trigonometric_ops(output, expected_vals):
+def assert_correctness_of_trigonometric_ops(output, expected_vals, atol=1e-3):
     # checks verifies 5 values at positions(0, 1, -3, -2, -1) of the input vector
     output_idx_to_inspect = [0, 1, -3, -2, -1]
     for i in range(len(output_idx_to_inspect)):
-        assert np.abs(output[1][output_idx_to_inspect[i]].asnumpy()-expected_vals[i]) <= 1e-3
+        assert np.abs(output[1][output_idx_to_inspect[i]].asnumpy()-expected_vals[i]) <= atol
 
 
 def test_trigonometric_ops():
diff --git a/tests/nightly/test_large_vector.py b/tests/nightly/test_large_vector.py
index b8edc83220bd..c6a99a5d0826 100644
--- a/tests/nightly/test_large_vector.py
+++ b/tests/nightly/test_large_vector.py
@@ -64,7 +64,7 @@ def test_ndarray_random_randint():
     high = 2**34
     a = nd.random.randint(low, high, dtype=np.int64, shape=LARGE_X).asnumpy()
     assert a.shape == (LARGE_X,)
-    assert (a >= low).all()  and (a < high).all()
+    assert (a >= low).all() and (a < high).all()
 
 
 def test_ndarray_empty():
@@ -710,6 +710,39 @@ def test_full():
     assert a[-1] == 3
 
 
+def test_regression():
+    shape = (LARGE_X, )
+
+    def check_regression(symbol, forward, shape):
+        # init executor
+        data_s = mx.symbol.Variable('data')
+        label_s = mx.symbol.Variable('label')
+        out_s = symbol(data=data_s, label=label_s)
+        exe = out_s.simple_bind(ctx=mx.cpu(0), data=shape, label=shape)
+
+        arg_map = dict(zip(out_s.list_arguments(), exe.arg_arrays))
+
+        # init data
+        data = mx.random.uniform(-1, 1, shape)
+        arg_map["data"][:] = data
+        atol = 1e-5
+        density = 0.5
+        stype = 'default'
+        label = arg_map["label"]
+        label[:] = rand_ndarray(shape, stype, density=density)
+        exe.forward(is_train=True)
+        exe.backward()
+        np_out = forward(data.asnumpy())
+        assert_almost_equal(exe.outputs[0].asnumpy(), np_out, atol=atol)
+
+    check_regression(mx.symbol.LogisticRegressionOutput,
+                     lambda x: 1.0 / (1.0 + np.exp(-x)),
+                     shape)
+    check_regression(mx.symbol.LinearRegressionOutput,
+                     lambda x: x,
+                     shape)
+
+
 def test_sign():
     a = mx.nd.random.normal(-1, 1, shape=LARGE_X)
     mx_res = mx.nd.sign(a)
@@ -978,11 +1011,11 @@ def test_add_n():
 def test_modulo():
     x = mx.nd.ones(LARGE_X)*6
     y = mx.nd.ones(LARGE_X)*4
-    z = (x%y)
+    z = (x % y)
     assert z[0] == 2
     assert z[-1] == 2
     x = mx.nd.ones(LARGE_X)*5
-    z = nd.modulo(x,y)
+    z = nd.modulo(x, y)
     assert z[0] == 1
     assert z[-1] == 1
 
@@ -1022,6 +1055,16 @@ def test_gather():
     assert np.sum(arr[idx] == 2) == 10
 
 
+def test_infer_shape():
+    data_1 = mx.symbol.Variable('data_1')
+    data_2 = mx.symbol.Variable('data_2')
+    add = data_1+data_2
+    # > add.infer_shape(data_1=(LARGE_X,), data_2=(LARGE_X,))
+    # OUTPUT - arg_shapes, out_shapes, aux_shapes
+    _, out_shapes, _ = add.infer_shape(data_1=(LARGE_X,), data_2=(LARGE_X,))
+    assert out_shapes == [(LARGE_X,)]
+
+
 if __name__ == '__main__':
     import nose
     nose.runmodule()
diff --git a/tests/python/gpu/test_operator_gpu.py b/tests/python/gpu/test_operator_gpu.py
index 06a16b1bb4f8..8b6928a2aa39 100644
--- a/tests/python/gpu/test_operator_gpu.py
+++ b/tests/python/gpu/test_operator_gpu.py
@@ -2395,6 +2395,7 @@ def _test_bulking_in_process(seed, time_per_iteration):
 
 
 @with_seed()
+@unittest.skip('skipping temporarily, tracked by https://github.com/apache/incubator-mxnet/issues/16517')
 def test_bulking_operator_gpu():
     _test_bulking(_test_bulking_in_process)
 
diff --git a/tests/python/unittest/test_gluon.py b/tests/python/unittest/test_gluon.py
index f1d0cc7ac274..f1413e2b99c2 100644
--- a/tests/python/unittest/test_gluon.py
+++ b/tests/python/unittest/test_gluon.py
@@ -1511,6 +1511,46 @@ def forward(self, x):
     net2 = Network()
     net2.load_parameters('tmp.params')
 
+@with_seed()
+def test_save_load_deduplicate_with_shared_params():
+    class B(mx.gluon.Block):
+        def __init__(self, params=None):
+            super(B, self).__init__(params=params)
+
+            with self.name_scope():
+                self.weight = self.params.get('weight', shape=(10, 10))
+
+    class C(mx.gluon.Block):
+        def __init__(self, b1, b2):
+            super(C, self).__init__()
+            self.b1 = b1
+            self.b2 = b2
+
+    b1 = B()
+    b2 = B(b1.collect_params())
+    c = C(b1, b2)
+    c.initialize()
+    c.save_parameters('tmp.params', deduplicate=True)
+
+    params = mx.nd.load('tmp.params')
+    assert len(params) == 1  # Only a single copy of the shared parameter is saved
+
+    b1 = B()
+    b2 = B(b1.collect_params())
+    c = C(b1, b2)
+    c.load_parameters('tmp.params')
+
+    # Test default behavior
+    c.save_parameters('tmp2.params', deduplicate=False)
+
+    params = mx.nd.load('tmp2.params')
+    assert len(params) == 2  # Both copies of the shared parameter are saved
+
+    b1 = B()
+    b2 = B(b1.collect_params())
+    c = C(b1, b2)
+    c.load_parameters('tmp2.params')
+
 @with_seed()
 def test_symbol_block_save_load():
     class Net(gluon.HybridBlock):
diff --git a/tests/python/unittest/test_numpy_gluon.py b/tests/python/unittest/test_numpy_gluon.py
index af5425336699..12e89a2d9b39 100644
--- a/tests/python/unittest/test_numpy_gluon.py
+++ b/tests/python/unittest/test_numpy_gluon.py
@@ -156,6 +156,29 @@ def test_np_loss_ndarray():
     assert_almost_equal(L, _np.array([1.06346405,  0.04858733]), use_broadcast=False)
 
 
+@with_seed()
+@use_np
+def test_np_get_constant():
+    const_arr = _np.random.uniform(0, 100, size=(10, 10)).astype(_np.float32)
+
+    class Foo(gluon.HybridBlock):
+        def __init__(self, prefix=None, params=None):
+            super(Foo, self).__init__(prefix=prefix, params=params)
+            self.weight = self.params.get_constant('const', const_arr)
+
+        def hybrid_forward(self, F, x, weight):
+            return x + weight.astype(np.float32)
+
+    x = np.random.uniform(size=const_arr.shape, dtype=const_arr.dtype)
+    for hybridize in [False, True]:
+        foo = Foo()
+        if hybridize:
+            foo.hybridize()
+        foo.initialize()
+        out = foo(x)
+        assert_almost_equal(out.asnumpy(), (x.asnumpy() + const_arr), atol=1e-5, rtol=1e-4, use_broadcast=False)
+
+
 if __name__ == '__main__':
     import nose
     nose.runmodule()
diff --git a/tests/python/unittest/test_numpy_interoperability.py b/tests/python/unittest/test_numpy_interoperability.py
index 9e8156f3239c..860fecc5cda0 100644
--- a/tests/python/unittest/test_numpy_interoperability.py
+++ b/tests/python/unittest/test_numpy_interoperability.py
@@ -258,6 +258,7 @@ def _add_workload_einsum():
     size_dict = dict(zip(chars, sizes))
 
     configs = [
+        # test_einsum_broadcast
         ('ij...,j...->ij...', [(2, 3, 4), (3,)]),
         ('ij...,...j->ij...', [(2, 3, 4), (3,)]),
         ('ij...,j->ij...', [(2, 3, 4), (3,)]),
@@ -310,6 +311,39 @@ def _add_workload_einsum():
         ('abjk,kl,jl,ab->ab', [(1, 1, 5, 4), (4, 6), (5, 6), (7, 7)]),
         ('obk,ijk->ioj', [(2, 4, 8), (2, 4, 8)]),
     ]
+    # check_einsum_sums
+    configs.extend([('i->', [(i,)]) for i in range(1, 17)])
+    configs.extend([('...i->...', [(2, 3, i,)]) for i in range(1, 17)])
+    configs.extend([('i...->...', [(2, i,)]) for i in range(1, 17)])
+    configs.extend([('i...->...', [(2, 3, i,)]) for i in range(1, 17)])
+    configs.extend([('ii', [(i, i,)]) for i in range(1, 17)])
+    configs.extend([('..., ...', [(3, i,), (2, 3, i,)]) for i in range(1, 17)])
+    configs.extend([('...i, ...i', [(2, 3, i,), (i,)]) for i in range(1, 17)])
+    configs.extend([('i..., i...', [(i, 3, 2,), (i,)]) for i in range(1, 11)])
+    configs.extend([('i, j', [(3,), (i,)]) for i in range(1, 17)])
+    configs.extend([('ij, j', [(4, i), (i,)]) for i in range(1, 17)])
+    configs.extend([('ji, j', [(i, 4), (i,)]) for i in range(1, 17)])
+    configs.extend([('ij, jk', [(4, i), (i, 6)]) for i in range(1, 8)])
+    configs.extend([
+        ('ij,jk,kl', [(3, 4), (4, 5), (5, 6)]),
+        ('ijk, jil -> kl', [(3, 4, 5), (4, 3, 2)]),
+        ('i, i, i -> i', [(8,), (8,), (8,)]),
+        (',i->', [(), (9,)]),
+        ('i,->', [(9,), ()]),
+    ])
+    configs.extend([('...,...', [(n,), (n,)]) for n in range(1, 25)])
+    configs.extend([('i,i', [(n,), (n,)]) for n in range(1, 25)])
+    configs.extend([('i,->i', [(n,), ()]) for n in range(1, 25)])
+    configs.extend([(',i->i', [(), (n,)]) for n in range(1, 25)])
+    configs.extend([('i,->', [(n,), ()]) for n in range(1, 25)])
+    configs.extend([(',i->', [(), (n,)]) for n in range(1, 25)])
+    configs.extend([('...,...', [(n - 1,), (n - 1,)]) for n in range(1, 25)])
+    configs.extend([('i,i', [(n - 1,), (n - 1,)]) for n in range(1, 25)])
+    configs.extend([('i,->i', [(n - 1,), ()]) for n in range(1, 25)])
+    configs.extend([(',i->i', [(), (n - 1,)]) for n in range(1, 25)])
+    configs.extend([('i,->', [(n - 1,), ()]) for n in range(1, 25)])
+    configs.extend([(',i->', [(), (n - 1,)]) for n in range(1, 25)])
+
     for optimize in [False, True]:
         for config in configs:
             subscripts, args = config
@@ -349,6 +383,22 @@ def _add_workload_argmax():
     OpArgMngr.add_workload('argmax', np.array([True, False, True, False, False]))
 
 
+def _add_workload_argmin():
+    OpArgMngr.add_workload('argmin', np.random.uniform(size=(4, 5, 6, 7, 8)), 0)
+    OpArgMngr.add_workload('argmin', np.random.uniform(size=(4, 5, 6, 7, 8)), 1)
+    OpArgMngr.add_workload('argmin', np.random.uniform(size=(4, 5, 6, 7, 8)), 2)
+    OpArgMngr.add_workload('argmin', np.random.uniform(size=(4, 5, 6, 7, 8)), 3)
+    OpArgMngr.add_workload('argmin', np.random.uniform(size=(4, 5, 6, 7, 8)), 4)
+    # OpArgMngr.add_workload('argmin', np.array([0, 1, 2, 3, np.nan]))
+    # OpArgMngr.add_workload('argmin', np.array([0, 1, 2, np.nan, 3]))
+    # OpArgMngr.add_workload('argmin', np.array([np.nan, 0, 1, 2, 3]))
+    # OpArgMngr.add_workload('argmin', np.array([np.nan, 0, np.nan, 2, 3]))
+    OpArgMngr.add_workload('argmin', np.array([False, False, False, False, True]))
+    OpArgMngr.add_workload('argmin', np.array([False, False, False, True, False]))
+    OpArgMngr.add_workload('argmin', np.array([True, False, False, False, False]))
+    OpArgMngr.add_workload('argmin', np.array([True, False, True, False, False]))
+
+
 def _add_workload_around():
     OpArgMngr.add_workload('around', np.array([1.56, 72.54, 6.35, 3.25]), decimals=1)
 
@@ -1025,6 +1075,16 @@ def _add_workload_less_equal(array_pool):
     # OpArgMngr.add_workload('less_equal', np.array([np.nan]), np.array([np.nan]))
 
 
+def _add_workload_nonzero():
+    OpArgMngr.add_workload('nonzero', np.random.randint(0, 2))
+    OpArgMngr.add_workload('nonzero', np.random.randint(0, 2, size=()))
+    OpArgMngr.add_workload('nonzero', np.random.randint(0, 2, size=(0, 1, 2)))
+    OpArgMngr.add_workload('nonzero', np.random.randint(0, 2, size=(0, 1, 0)))
+    OpArgMngr.add_workload('nonzero', np.random.randint(0, 2, size=(2, 3, 4)))
+    OpArgMngr.add_workload('nonzero', np.array([False, False, False], dtype=np.bool_))
+    OpArgMngr.add_workload('nonzero', np.array([True, False, False], dtype=np.bool_))
+
+
 @use_np
 def _prepare_workloads():
     array_pool = {
@@ -1033,6 +1093,7 @@ def _prepare_workloads():
         '1x1x0': np.array([[[]]])
     }
 
+    _add_workload_argmin()
     _add_workload_argmax()
     _add_workload_around()
     _add_workload_broadcast_arrays(array_pool)
@@ -1049,6 +1110,7 @@ def _prepare_workloads():
     _add_workload_max(array_pool)
     _add_workload_min(array_pool)
     _add_workload_mean(array_pool)
+    _add_workload_nonzero()
     _add_workload_ones_like(array_pool)
     _add_workload_prod(array_pool)
     _add_workload_repeat(array_pool)
diff --git a/tests/python/unittest/test_numpy_op.py b/tests/python/unittest/test_numpy_op.py
index ae8ad621df75..391a07411b15 100644
--- a/tests/python/unittest/test_numpy_op.py
+++ b/tests/python/unittest/test_numpy_op.py
@@ -2173,7 +2173,7 @@ def hybrid_forward(self, F, x):
 
 @with_seed()
 @use_np
-def test_np_argmax():
+def test_np_argmin_argmax():
     workloads = [
         ((), 0, False),
         ((), -1, False),
@@ -2188,49 +2188,52 @@ def test_np_argmax():
         ((5, 0, 3), 1, True),
     ]
     dtypes = ['float16', 'float32', 'float64']
+    ops = ['argmin', 'argmax']
 
-    class TestArgMax(HybridBlock):
-        def __init__(self, axis=None):
-            super(TestArgMax, self).__init__()
+    class TestArgExtreme(HybridBlock):
+        def __init__(self, op_name, axis=None):
+            super(TestArgExtreme, self).__init__()
+            self._op_name = op_name
             self._axis = axis
 
         def hybrid_forward(self, F, x):
-            return F.np.argmax(x, self._axis)
-
-    for shape, axis, throw_exception in workloads:
-        for dtype in dtypes:
-            a = np.random.uniform(size=shape, dtype=dtype)
-            if throw_exception:
-                # Cannot use assert_exception because sometimes the main thread
-                # proceeds to `assert False` before the exception is thrown
-                # in the worker thread. Have to use mx.nd.waitall() here
-                # to block the main thread.
-                try:
-                    np.argmax(a, axis)
-                    mx.nd.waitall()
-                    assert False
-                except mx.MXNetError:
-                    pass
-            else:
-                mx_ret = np.argmax(a, axis=axis)
-                np_ret = _np.argmax(a.asnumpy(), axis=axis)
-                assert same(mx_ret.asnumpy(), np_ret)
+            return getattr(x, self._op_name)(self._axis)
 
-            for hybridize in [False, True]:
-                net = TestArgMax(axis)
-                if hybridize:
-                    net.hybridize()
+    for op_name in ops:
+        for shape, axis, throw_exception in workloads:
+            for dtype in dtypes:
+                a = np.random.uniform(size=shape, dtype=dtype)
                 if throw_exception:
+                    # Cannot use assert_exception because sometimes the main thread
+                    # proceeds to `assert False` before the exception is thrown
+                    # in the worker thread. Have to use mx.nd.waitall() here
+                    # to block the main thread.
                     try:
-                        net(a)
+                        getattr(np, op_name)(a, axis)
                         mx.nd.waitall()
                         assert False
                     except mx.MXNetError:
                         pass
                 else:
-                    mx_ret = net(a)
+                    mx_ret = getattr(np, op_name)(a, axis=axis)
+                    np_ret = getattr(_np, op_name)(a.asnumpy(), axis=axis)
                     assert same(mx_ret.asnumpy(), np_ret)
 
+                for hybridize in [False, True]:
+                    net = TestArgExtreme(op_name, axis)
+                    if hybridize:
+                        net.hybridize()
+                    if throw_exception:
+                        try:
+                            net(a)
+                            mx.nd.waitall()
+                            assert False
+                        except mx.MXNetError:
+                            pass
+                    else:
+                        mx_ret = net(a)
+                        assert same(mx_ret.asnumpy(), np_ret)
+
 
 @with_seed()
 @use_np
@@ -2490,16 +2493,17 @@ def test_indexing_mode(sampler, set_size, samples_size, replace, weight=None):
     #     test_sample_without_replacement(np.random.choice, num_classes, shape, 10 ** 5, weight)
 
     # Test hypridize mode:
-    for hybridize in [True, False]:
-        for replace in [True, False]:
-            test_choice = TestUniformChoice(num_classes // 2, replace)
-            test_choice_weighted = TestWeightedChoice(num_classes // 2, replace)
-            if hybridize:
-                test_choice.hybridize()
-                test_choice_weighted.hybridize()
-            weight = np.array(_np.random.dirichlet([1.0] * num_classes))
-            test_indexing_mode(test_choice, num_classes, num_classes // 2, replace, None)
-            test_indexing_mode(test_choice_weighted, num_classes, num_classes // 2, replace, weight)
+    for wtype in ['float16', 'float32', 'float64']:
+        for hybridize in [True, False]:
+            for replace in [True, False]:
+                test_choice = TestUniformChoice(num_classes // 2, replace)
+                test_choice_weighted = TestWeightedChoice(num_classes // 2, replace)
+                if hybridize:
+                    test_choice.hybridize()
+                    test_choice_weighted.hybridize()
+                weight = np.array(_np.random.dirichlet([1.0] * num_classes)).astype(wtype)
+                test_indexing_mode(test_choice, num_classes, num_classes // 2, replace, None)
+                test_indexing_mode(test_choice_weighted, num_classes, num_classes // 2, replace, weight)
 
 
 @with_seed()
@@ -3496,16 +3500,22 @@ def dbg(name, data):
                                                                     _np.dot(args[0].T, _np.dot(_np.ones((2, 2)), args[2].T)),
                                                                     _np.dot(_np.dot(args[0], args[1]).T, _np.ones((2, 2))))),
         # broadcast bug
-        (('ij, ij -> i'), [(1, 4), (2, 4)], lambda *args: (_np.sum(args[1], axis=0)[None, :],
-                                                           _np.tile(args[0], [2, 1]))),
+        ('ij, ij -> i', [(1, 4), (2, 4)], lambda *args: (_np.sum(args[1], axis=0)[None, :],
+                                                         _np.tile(args[0], [2, 1]))),
+        # issue #16576
+        # commented due to long running time
+        # ('abiz,abjz->abij', [(64, 8, 128, 512), (64, 8, 128, 512)], lambda *args: (_np.matmul(_np.ones((64, 8, 128, 128)), args[1]),
+        #                                                                            _np.matmul(_np.ones((64, 8, 128, 128)), args[0]))),
     ]
-    dtypes = ['int32', 'float16', 'float32', 'float64']
+    dtypes = ['float32', 'float64', 'int32']
+    acc_type = {'float16': 'float32', 'float32': 'float64', 'float64': 'float64',
+                'int32': 'int64'}
     for hybridize in [False, True]:
         for dtype in dtypes:
             for config in configs:
                 for optimize in [False, True]:
-                    rtol = 1e-0 if dtype == 'float16' else 1e-3
-                    atol = 1e-1 if dtype == 'float16' else 1e-5
+                    rtol = 1e-2 if dtype == 'float16' else 1e-3
+                    atol = 1e-4 if dtype == 'float16' else 1e-5
                     (subscripts, operands, get_grad) = config
                     test_einsum = TestEinsum(subscripts, optimize)
                     if hybridize:
@@ -3513,11 +3523,11 @@ def dbg(name, data):
                     x = []
                     x_np = []
                     for shape in operands:
-                        x_np.append(_np.array(_np.random.uniform(-10.0, 10.0, shape),
-                                            dtype=dtype))
-                        x.append(np.array(x_np[-1], dtype=dtype))
+                        tmp = _np.array(_np.random.uniform(-1.0, 1.0, shape), dtype=dtype)
+                        x_np.append(tmp.astype(acc_type[dtype]))
+                        x.append(np.array(tmp, dtype=dtype))
                         x[-1].attach_grad()
-                    expected_np = _np.einsum(subscripts, *x_np, optimize=optimize)
+                    expected_np = _np.einsum(subscripts, *x_np, optimize=optimize).astype(dtype)
                     with mx.autograd.record():
                         out_mx = test_einsum(*x)
                     assert out_mx.shape == expected_np.shape
@@ -3535,7 +3545,7 @@ def dbg(name, data):
                     expected_np = _np.einsum(subscripts, *x_np, optimize=optimize)
                     assert_almost_equal(out_mx.asnumpy(), expected_np, rtol=rtol, atol=atol)
                     for (iop, op) in enumerate(x):
-                        assert_almost_equal(op.grad.asnumpy(), get_grad(*x_np)[iop], rtol=rtol, atol=atol)
+                        assert_almost_equal(op.grad.asnumpy(), get_grad(*x_np)[iop].astype(dtype), rtol=rtol, atol=atol)
     configs = [
         (('ij,jk,kl->il'), [(2, 2), (2, 5), (5, 2)]),
         (('ea,fb,abcd,gc,hd->efgh'), [(5, 5), (5, 5), (5, 5, 5, 5), (5, 5), (5, 5)]),
@@ -3545,8 +3555,8 @@ def dbg(name, data):
         for dtype in dtypes:
             for config in configs:
                 (subscripts, operands) = config
-                rtol = 1e-0 if dtype == 'float16' else 1e-2
-                atol = 1e-1 if dtype == 'float16' else 1e-2
+                rtol = 1e-2 if dtype == 'float16' else 1e-3
+                atol = 1e-4 if dtype == 'float16' else 1e-5
                 grad = []
                 x_np = []
                 for shape in operands:
@@ -3560,7 +3570,8 @@ def dbg(name, data):
                     test_einsum = TestEinsum(subscripts, optimize)
                     if hybridize:
                         test_einsum.hybridize()
-                    expected_np = _np.einsum(subscripts, *x_np, optimize=optimize)
+                    expected_np = _np.einsum(subscripts, *[op.astype(acc_type[dtype]) for op in x_np],
+                                             optimize=optimize).astype(dtype)
                     with mx.autograd.record():
                         out_mx = test_einsum(*x)
                     assert out_mx.shape == expected_np.shape
@@ -3667,6 +3678,69 @@ def test_np_true_divide():
         assert_almost_equal(out_mx.asnumpy(), out_np, rtol=1e-3, atol=1e-3, use_broadcast=False)
 
 
+@with_seed()
+@use_np
+def test_npx_reshape():
+    class TestNumpyXReshape(HybridBlock):
+        def __init__(self, newshape, reverse):
+            super(TestNumpyXReshape, self).__init__()
+            self._newshape = newshape
+            self._reverse = reverse
+
+        def hybrid_forward(self, F, a, *args, **kwargs):
+            return F.npx.reshape(a, self._newshape, reverse=self._reverse)
+
+    test_cases = [
+        [(2, 3, 5, 5),  (-2, -1),         False, (2, 75)],
+        [(2, 3, 5, 5),  (-2, -2, -1),     False, (2, 3, 25)],
+        [(5, 3, 4, 5),  (-2, -1, -2),     False, (5, 15, 4)],
+        [(2, 3, 5, 4),  (-1, -2, -2),     False, (8, 3, 5)],
+        [(2, 3, 5, 5),  (-2, -2, -2, -2), False, (2, 3, 5, 5)],
+        [(2, 1, 4, 5),  (-2, -3, -2, -2), False, (2, 4, 5)],
+        [(1, 1, 4, 1),  (-3, -3, -2, -2), False, (4, 1)],
+        [(1, 1, 1, 1),  (-3, -3, -3, -3), False, ()],
+        [(2, 4, 5, 3),  (-1, 2, 2, 1),    False, (30, 2, 2, 1)],
+        [(2, 3, 5, 6),  (-4,),            False, (2, 3, 5, 6)],
+        [(2, 3, 5, 6),  (6, 1, -4),       False, (6, 1, 5, 6)],
+        [(2, 3, 5, 6),  (-5, -5),         False, (6, 30)],
+        [(2, 3, 5, 6),  (-5, -1),         False, (6, 30)],
+        [(64,),         (-6, 16, 4),      False, (16, 4)],
+        [(64,),         (-6, 16, -1),     False, (16, 4)],
+        [(64, 1, 2, 3), (-6, 16, -1, -4), False, (16, 4, 1, 2, 3)],
+        [(8, 5, 4, 6),  (-4, -1, 3, -6),  True,  (8, 5, 4, 2, 3)]
+    ]
+    for hybridize in [True, False]:
+        for shape, newshape, reverse, expected_ret_shape in test_cases:
+            for grad_req in ['write', 'add']:
+                # test gluon
+                test_reshape = TestNumpyXReshape(newshape=newshape, reverse=reverse)
+                if hybridize:
+                    test_reshape.hybridize()
+
+                a = mx.np.random.uniform(-1, 1, shape).astype(np.float32)
+                init_a_grad = mx.np.random.uniform(-1, 1, shape).astype(np.float32)
+                a.attach_grad(grad_req=grad_req)
+                if grad_req == 'add':
+                    a.grad[:] = init_a_grad
+                with mx.autograd.record():
+                    y = test_reshape(a)
+                assert y.shape == expected_ret_shape,\
+                    'y.shape={}, expected_ret_shape={}'.format(y.shape, expected_ret_shape)
+                assert_almost_equal(y.asnumpy(), a.asnumpy().reshape(expected_ret_shape), rtol=1e-3, atol=1e-5)
+
+                # test backward
+                mx.autograd.backward(y)
+                expected_grad = _np.ones(shape)
+                if grad_req == 'add':
+                    expected_grad += init_a_grad.asnumpy()
+                assert_almost_equal(a.grad.asnumpy(), expected_grad, rtol=1e-3, atol=1e-5)
+
+                # test imperative
+                npx_out = npx.reshape(a, newshape, reverse=reverse)
+                expected_out = _np.reshape(a.asnumpy(), expected_ret_shape)
+                assert_almost_equal(npx_out.asnumpy(), expected_out, rtol=1e-3, atol=1e-5)
+
+
 if __name__ == '__main__':
     import nose
     nose.runmodule()