From d1897a693dfc717c8216f946371746d686855885 Mon Sep 17 00:00:00 2001 From: Aaron Markham Date: Tue, 15 Oct 2019 21:54:51 +0000 Subject: [PATCH 1/9] fixed broken link fix broken links fix more broken links --- .../python/tutorials/deploy/export/onnx.md | 2 +- .../tutorials/deploy/run-on-aws/index.rst | 2 +- .../gluon_from_experiment_to_deployment.md | 4 +- .../getting-started/to-mxnet/index.rst | 2 +- .../tutorials/packages/autograd/index.md | 10 +-- .../gluon/blocks/custom_layer_beginners.md | 30 ++++---- .../packages/gluon/blocks/hybridize.md | 12 ++-- .../tutorials/packages/gluon/blocks/init.md | 12 ++-- .../tutorials/packages/gluon/blocks/nn.md | 24 +++---- .../packages/gluon/blocks/parameters.md | 20 +++--- .../gluon/image/image-augmentation.md | 20 +++--- .../packages/gluon/image/pretrained_models.md | 14 ++-- .../python/tutorials/packages/gluon/index.rst | 12 ++-- .../packages/gluon/loss/custom-loss.md | 28 ++++---- .../tutorials/packages/gluon/loss/loss.md | 52 +++++++------- .../gluon/training/fit_api_tutorial.md | 68 +++++++++---------- .../learning_rates/learning_rate_finder.md | 18 ++--- .../learning_rate_schedules_advanced.md | 22 +++--- .../packages/gluon/training/trainer.md | 24 +++---- .../tutorials/packages/kvstore/index.rst | 7 +- .../tutorials/packages/kvstore/kvstore.md | 4 +- .../packages/ndarray/01-ndarray-intro.md | 14 ++-- .../packages/ndarray/02-ndarray-operations.md | 39 +++++------ .../packages/ndarray/03-ndarray-contexts.md | 4 +- .../ndarray/gotchas_numpy_in_mxnet.md | 38 +++++------ .../packages/ndarray/sparse/row_sparse.md | 28 ++++---- .../packages/onnx/fine_tuning_gluon.md | 12 ++-- .../packages/onnx/inference_on_onnx_model.md | 2 +- .../packages/onnx/super_resolution.md | 2 +- .../src/pages/api/faq/distributed_training.md | 4 +- .../tutorials/five_minutes_neural_network.md | 2 +- 31 files changed, 259 insertions(+), 273 deletions(-) diff --git a/docs/python_docs/python/tutorials/deploy/export/onnx.md b/docs/python_docs/python/tutorials/deploy/export/onnx.md index 0b325cb0ba9a..f3ba3b979fe3 100644 --- a/docs/python_docs/python/tutorials/deploy/export/onnx.md +++ b/docs/python_docs/python/tutorials/deploy/export/onnx.md @@ -28,7 +28,7 @@ In this tutorial, we will learn how to use MXNet to ONNX exporter on pre-trained ## Prerequisites To run the tutorial you will need to have installed the following python modules: -- [MXNet >= 1.3.0](http://mxnet.apache.org/install/index.html) +- [MXNet >= 1.3.0]() - [onnx]( https://github.com/onnx/onnx#installation) v1.2.1 (follow the install guide) *Note:* MXNet-ONNX importer and exporter follows version 7 of ONNX operator set which comes with ONNX v1.2.1. diff --git a/docs/python_docs/python/tutorials/deploy/run-on-aws/index.rst b/docs/python_docs/python/tutorials/deploy/run-on-aws/index.rst index 3627f03a830c..46ef737e596a 100644 --- a/docs/python_docs/python/tutorials/deploy/run-on-aws/index.rst +++ b/docs/python_docs/python/tutorials/deploy/run-on-aws/index.rst @@ -42,7 +42,7 @@ The following tutorials will help you learn how to deploy MXNet on various AWS p .. card:: :title: Training with Data from S3 - :link: https://mxnet.apache.org/versions/master/faq/s3_integration.html + :link: s3_integration.html How to train with data from Amazon S3 buckets. 
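As a quick sanity check for the prerequisites listed in the ONNX export tutorial above, something along these lines confirms that compatible versions are installed (a minimal sketch; the printed version strings will vary by environment):

```python
# Check that the tutorial prerequisites are importable and recent enough.
import mxnet as mx
import onnx

print('mxnet version:', mx.__version__)   # the exporter requires MXNet >= 1.3.0
print('onnx version:', onnx.__version__)  # the exporter targets ONNX v1.2.1 (opset 7)
```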
diff --git a/docs/python_docs/python/tutorials/getting-started/gluon_from_experiment_to_deployment.md b/docs/python_docs/python/tutorials/getting-started/gluon_from_experiment_to_deployment.md index 388f2440149c..8d2c4e100c76 100644 --- a/docs/python_docs/python/tutorials/getting-started/gluon_from_experiment_to_deployment.md +++ b/docs/python_docs/python/tutorials/getting-started/gluon_from_experiment_to_deployment.md @@ -77,7 +77,7 @@ from mxnet.gluon.data.vision import transforms from mxnet.gluon.model_zoo.vision import resnet50_v2 ``` -Next, we define the hyper-parameters that we will use for fine-tuning. We will use the [MXNet learning rate scheduler](https://mxnet.apache.org/tutorials/gluon/learning_rate_schedules.html) to adjust learning rates during training. +Next, we define the hyper-parameters that we will use for fine-tuning. We will use the [MXNet learning rate scheduler](../packages/gluon/training/learning_rates/learning_rate_schedules.html) to adjust learning rates during training. Here we set the `epochs` to 1 for quick demonstration, please change to 40 for actual training. ```python @@ -324,4 +324,4 @@ You can also find more ways to run inference and deploy your models here: 2. [Gluon book on fine-tuning](https://www.d2l.ai/chapter_computer-vision/fine-tuning.html) 3. [Gluon CV transfer learning tutorial](https://gluon-cv.mxnet.io/build/examples_classification/transfer_learning_minc.html) 4. [Gluon crash course](https://gluon-crash-course.mxnet.io/) -5. [Gluon CPP inference example](https://github.com/apache/incubator-mxnet/blob/master/cpp-package/example/inference/) \ No newline at end of file +5. [Gluon CPP inference example](https://github.com/apache/incubator-mxnet/blob/master/cpp-package/example/inference/) diff --git a/docs/python_docs/python/tutorials/getting-started/to-mxnet/index.rst b/docs/python_docs/python/tutorials/getting-started/to-mxnet/index.rst index cf590d8aceb4..622a2891aa88 100644 --- a/docs/python_docs/python/tutorials/getting-started/to-mxnet/index.rst +++ b/docs/python_docs/python/tutorials/getting-started/to-mxnet/index.rst @@ -31,7 +31,7 @@ Comparison Guides .. card:: :title: Caffe to MXNet - :link: https://mxnet.apache.org/versions/master/faq/caffe.html + :link: /api/faq/caffe.html How to convert Caffe models to MXNet and how to call Caffe operators from MXNet. diff --git a/docs/python_docs/python/tutorials/packages/autograd/index.md b/docs/python_docs/python/tutorials/packages/autograd/index.md index 84d1d1cece67..67229c2d11c4 100644 --- a/docs/python_docs/python/tutorials/packages/autograd/index.md +++ b/docs/python_docs/python/tutorials/packages/autograd/index.md @@ -29,15 +29,15 @@ Gradients are fundamental to the process of training neural networks, and tell u Under the hood, neural networks are composed of operators (e.g. sums, products, convolutions, etc) some of which use parameters (e.g. the weights in convolution kernels) for their computation, and it's our job to find the optimal values for these parameters. Gradients lead us to the solution! -Gradients tell us how much a given variable increases or decreases when we change a variable it depends on. What we're interested in is the effect of changing a each parameter on the performance of the network. We usually define performance using a loss metric that we try to minimize, i.e. a metric that tells us how bad the predictions of a network are given ground truth. 
As an example, for regression we might try to minimize the [L2 loss](http://beta.mxnet.io/api/gluon/_autogen/mxnet.gluon.loss.L2Loss.html?highlight=l2#mxnet.gluon.loss.L2Loss) (also known as the Euclidean distance) between our predictions and true values, and for classification we minimize the [cross entropy loss](http://beta.mxnet.io/api/gluon/_autogen/mxnet.gluon.loss.SoftmaxCrossEntropyLoss.html). +Gradients tell us how much a given variable increases or decreases when we change a variable it depends on. What we're interested in is the effect of changing a each parameter on the performance of the network. We usually define performance using a loss metric that we try to minimize, i.e. a metric that tells us how bad the predictions of a network are given ground truth. As an example, for regression we might try to minimize the [L2 loss](/api/python/docs/api/gluon/loss/index.html#mxnet.gluon.loss.L2Loss) (also known as the Euclidean distance) between our predictions and true values, and for classification we minimize the [cross entropy loss](/api/python/docs/api/gluon/loss/index.html#mxnet.gluon.loss.SoftmaxCrossEntropyLoss). -Assuming we've calculated the gradient of each parameter with respect to the loss (details in next section), we can then use an optimizer such as [stochastic gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent) to shift the parameters slightly in the *opposite direction* of the gradient. See [Optimizers](http://beta.mxnet.io/api/gluon-related/mxnet.optimizer.html) for more information on these methods. We repeat the process of calculating gradients and updating parameters over and over again, until the parameters of the network start to stabilize and converge to a good solution. +Assuming we've calculated the gradient of each parameter with respect to the loss (details in next section), we can then use an optimizer such as [stochastic gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent) to shift the parameters slightly in the *opposite direction* of the gradient. See [Optimizers](/api/python/docs/api/optimizer/index.html) for more information on these methods. We repeat the process of calculating gradients and updating parameters over and over again, until the parameters of the network start to stabilize and converge to a good solution. ## How do we calculate gradients? ### Short Answer: -We differentiate. [MXNet Gluon](http://beta.mxnet.io/api/gluon/index.html) uses Reverse Mode Automatic Differentiation (`autograd`) to backprogate gradients from the loss metric to the network parameters. +We differentiate. [MXNet Gluon](/api/python/docs/tutorials/packages/gluon/index.html) uses Reverse Mode Automatic Differentiation (`autograd`) to backprogate gradients from the loss metric to the network parameters. ![forward-backward](/_static/autograd/autograd_forward_backward.png) @@ -159,7 +159,7 @@ print('is_training:', is_training, output) We called `dropout` while `autograd` was recording this time, so our network was in training mode and we see dropout of the input this time. Since the probability of dropout was 50%, the output is automatically scaled by 1/0.5=2 to preserve the average activation. -We can force some operators to behave as they would during training, even in inference mode. One example is setting `mode='always'` on the [Dropout](https://mxnet.apache.org/api/python/ndarray/ndarray.html?highlight=dropout#mxnet.ndarray.Dropout) operator, but this usage is uncommon. 
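The gradient walkthrough above describes the full loop: record a forward pass, backpropagate from a loss such as `L2Loss`, and let an optimizer such as SGD nudge the parameters in the opposite direction of the gradient. A minimal sketch of that loop, assuming a single dense layer and synthetic data (all shapes and hyper-parameters here are illustrative):

```python
import mxnet as mx
from mxnet import autograd, gluon, nd

net = gluon.nn.Dense(1)                       # one linear layer is enough to show the loop
net.initialize()
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.1})
l2_loss = gluon.loss.L2Loss()

x = nd.random.uniform(shape=(10, 4))          # synthetic inputs
y = nd.random.uniform(shape=(10, 1))          # synthetic targets

with autograd.record():                       # record the forward pass
    loss = l2_loss(net(x), y)
loss.backward()                               # backpropagate to every parameter
trainer.step(batch_size=x.shape[0])           # move parameters against the gradient
print(loss.mean().asscalar())
```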
+We can force some operators to behave as they would during training, even in inference mode. One example is setting `mode='always'` on the [Dropout](/api/python/ndarray/ndarray.html?highlight=dropout#mxnet.ndarray.Dropout) operator, but this usage is uncommon. ## Advanced: Skipping the calculation of parameter gradients @@ -196,7 +196,7 @@ print(x.grad) ## Advanced: Using Python control flow -As mentioned before, one of the main advantages of `autograd` is the ability to automatically calculate gradients of dynamic graphs (i.e. graphs where the operators could be different on every forward pass). One example of this would be applying a tree structured recurrent network to parse a sentence using its parse tree. And we can use Python control flow operators to create a dynamic flow that depends on the data, rather than using [MXNet's control flow operators](https://mxnet.apache.org/versions/master/tutorials/control_flow/ControlFlowTutorial.html). +As mentioned before, one of the main advantages of `autograd` is the ability to automatically calculate gradients of dynamic graphs (i.e. graphs where the operators could be different on every forward pass). One example of this would be applying a tree structured recurrent network to parse a sentence using its parse tree. And we can use Python control flow operators to create a dynamic flow that depends on the data, rather than using [MXNet's control flow operators](/api/python/tutorials/extend/control_flow.html). We'll write a function as a toy example of a dynamic network. We'll add an `if` condition and a loop with a variable number of iterations, both of which will depend on the input data. Although these can now be used in static graphs (with conditional operators) it's still much more natural to use native control flow. diff --git a/docs/python_docs/python/tutorials/packages/gluon/blocks/custom_layer_beginners.md b/docs/python_docs/python/tutorials/packages/gluon/blocks/custom_layer_beginners.md index 01333a7f80a0..933a70bbdf2c 100644 --- a/docs/python_docs/python/tutorials/packages/gluon/blocks/custom_layer_beginners.md +++ b/docs/python_docs/python/tutorials/packages/gluon/blocks/custom_layer_beginners.md @@ -30,7 +30,7 @@ The only instance method needed to be implemented is [forward(self, x)](https:// In the example below, we define a new layer and implement `forward()` method to normalize input data by fitting it into a range of [0, 1]. ```python -# Do some initial imports used throughout this tutorial +# Do some initial imports used throughout this tutorial from __future__ import print_function import mxnet as mx from mxnet import nd, gluon, autograd @@ -53,7 +53,7 @@ The rest of methods of the `Block` class are already implemented, and majority o Looking into implementation of [existing layers](https://mxnet.apache.org/api/python/gluon/nn.html), one may find that more often a block inherits from a [HybridBlock](https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/gluon/block.py#L428), instead of directly inheriting from `Block`. -The reason for that is that `HybridBlock` allows to write custom layers that can be used in imperative programming as well as in symbolic programming. It is convenient to support both ways, because the imperative programming eases the debugging of the code and the symbolic one provides faster execution speed. You can learn more about the difference between symbolic vs. imperative programming from [this article](https://mxnet.apache.org/versions/master/architecture/program_model.html). 
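The custom-layer text above describes implementing `forward()` so that a layer rescales its input into the [0, 1] range. A minimal sketch of such a layer and of calling it on random data (the layer and variable names here are illustrative):

```python
import mxnet as mx
from mxnet import nd, gluon

class NormalizationLayer(gluon.Block):
    """Rescales its input into the [0, 1] range, as the text above describes."""
    def __init__(self, **kwargs):
        super(NormalizationLayer, self).__init__(**kwargs)

    def forward(self, x):
        return (x - nd.min(x)) / (nd.max(x) - nd.min(x))

layer = NormalizationLayer()
x = nd.random_uniform(low=5, high=10, shape=(2, 3))
print(layer(x))  # every value now lies between 0 and 1
```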
+The reason for that is that `HybridBlock` allows to write custom layers that can be used in imperative programming as well as in symbolic programming. It is convenient to support both ways, because the imperative programming eases the debugging of the code and the symbolic one provides faster execution speed. You can learn more about the difference between symbolic vs. imperative programming from this [deep learning programming paradigm](/api/architecture/program_model) article. Hybridization is a process that Apache MxNet uses to create a symbolic graph of a forward computation. This allows to increase computation performance by optimizing the computational symbolic graph. Once the symbolic graph is created, Apache MxNet caches and reuses it for subsequent computations. @@ -143,7 +143,7 @@ class NormalizationHybridLayer(gluon.HybridBlock): shape=scales.shape, init=mx.init.Constant(scales.asnumpy()), differentiable=False) - + def hybrid_forward(self, F, x, weights, scales): normalized_data = F.broadcast_div(F.broadcast_sub(x, F.min(x)), (F.broadcast_sub(F.max(x), F.min(x)))) weighted_data = F.FullyConnected(normalized_data, weights, num_hidden=self.weights.shape[0], no_bias=True) @@ -175,14 +175,14 @@ def print_params(title, net): """ print(title) hybridlayer_params = {k: v for k, v in net.collect_params().items() if 'normalizationhybridlayer' in k } - + for key, value in hybridlayer_params.items(): print('{} = {}\n'.format(key, value.data())) net = gluon.nn.HybridSequential() # Define a Neural Network as a sequence of hybrid blocks with net.name_scope(): # Used to disambiguate saving and loading net parameters net.add(Dense(5)) # Add Dense layer with 5 neurons - net.add(NormalizationHybridLayer(hidden_units=5, + net.add(NormalizationHybridLayer(hidden_units=5, scales = nd.array([2]))) # Add our custom layer net.add(Dense(1)) # Add Dense layer with 1 neurons @@ -195,15 +195,15 @@ label = nd.random_uniform(low=-1, high=1, shape=(5, 1)) mse_loss = gluon.loss.L2Loss() # Mean squared error between output and label trainer = gluon.Trainer(net.collect_params(), # Init trainer with Stochastic Gradient Descent (sgd) optimization method and parameters for it - 'sgd', + 'sgd', {'learning_rate': 0.1, 'momentum': 0.9 }) - -with autograd.record(): # Autograd records computations done on NDArrays inside "with" block + +with autograd.record(): # Autograd records computations done on NDArrays inside "with" block output = net(input) # Run forward propogation - - print_params("=========== Parameters after forward pass ===========\n", net) + + print_params("=========== Parameters after forward pass ===========\n", net) loss = mse_loss(output, label) # Calculate MSE - + loss.backward() # Backward computes gradients and stores them as a separate array within each NDArray in .grad field trainer.step(input.shape[0]) # Trainer updates parameters of every block, using .grad field using oprimization method (sgd in this example) # We provide batch size that is used as a divider in cost function formula @@ -213,7 +213,7 @@ print_params("=========== Parameters after backward pass ===========\n", net) ```python =========== Parameters after forward pass =========== -hybridsequential94_normalizationhybridlayer0_weights = +hybridsequential94_normalizationhybridlayer0_weights = [[-0.3983642 -0.505708 -0.02425683 -0.3133553 -0.35161012] [ 0.6467543 0.3918715 -0.6154656 -0.20702496 -0.4243446 ] [ 0.6077331 0.03922009 0.13425875 0.5729856 -0.14446527] @@ -221,13 +221,13 @@ hybridsequential94_normalizationhybridlayer0_weights 
= [-0.39846328 0.22245121 0.13075739 0.33387476 -0.10088372]] -hybridsequential94_normalizationhybridlayer0_scales = +hybridsequential94_normalizationhybridlayer0_scales = [2.] =========== Parameters after backward pass =========== -hybridsequential94_normalizationhybridlayer0_weights = +hybridsequential94_normalizationhybridlayer0_weights = [[-0.29839832 -0.47213346 0.08348035 -0.2324698 -0.27368504] [ 0.76268613 0.43080837 -0.49052125 -0.11322092 -0.3339738 ] [ 0.48665082 -0.00144657 0.00376363 0.47501418 -0.23885089] @@ -235,7 +235,7 @@ hybridsequential94_normalizationhybridlayer0_weights = [-0.44946212 0.20532274 0.07579394 0.29261002 -0.14063817]] -hybridsequential94_normalizationhybridlayer0_scales = +hybridsequential94_normalizationhybridlayer0_scales = [2.] ``` diff --git a/docs/python_docs/python/tutorials/packages/gluon/blocks/hybridize.md b/docs/python_docs/python/tutorials/packages/gluon/blocks/hybridize.md index 10d35ecde650..6ca58a92d032 100644 --- a/docs/python_docs/python/tutorials/packages/gluon/blocks/hybridize.md +++ b/docs/python_docs/python/tutorials/packages/gluon/blocks/hybridize.md @@ -219,7 +219,7 @@ We can see that the three lines of print statements defined in the `hybrid_forwa ## Key differences and limitations of hybridization -The difference between a purely imperative `Block` and hybridizable `HybridBlock` can superficially appear to be simply the injection of the `F` function space (resolving to [`mx.nd`](https://mxnet.apache.org/api/python/docs/api/ndarray/index.html) or [`mx.sym`](https://mxnet.apache.org/api/python/docs/api/symbol/index.html)) in the forward function that is renamed from `forward` to `hybrid_forward`. However there are some limitations that apply when using hybrid blocks. In the following section we will review the main differences, giving example of code snippets that generate errors when such blocks get hybridized. +The difference between a purely imperative `Block` and hybridizable `HybridBlock` can superficially appear to be simply the injection of the `F` function space (resolving to [`mx.nd`](/api/python/docs/api/ndarray/index.html) or [`mx.sym`](/api/python/docs/api/symbol/index.html)) in the forward function that is renamed from `forward` to `hybrid_forward`. However there are some limitations that apply when using hybrid blocks. In the following section we will review the main differences, giving example of code snippets that generate errors when such blocks get hybridized. ### Indexing @@ -234,7 +234,7 @@ Would generate the following error: `TypeError: Symbol only support integer index to fetch i-th output` -There are however several operators that can help you with array manipulations like: [`F.split`](https://mxnet.apache.org/api/python/docs/api/ndarray/_autogen/mxnet.ndarray.split.html#mxnet.ndarray.split), [`F.slice`](https://mxnet.apache.org/api/python/docs/api/ndarray/_autogen/mxnet.ndarray.slice.html#mxnet.ndarray.slice), [`F.take`](https://mxnet.apache.org/api/python/docs/api/ndarray/_autogen/mxnet.ndarray.take.html),[`F.pick`](https://mxnet.apache.org/api/python/docs/api/ndarray/_autogen/mxnet.ndarray.pick.html), [`F.where`](https://mxnet.apache.org/api/python/docs/api/ndarray/_autogen/mxnet.ndarray.where.html), [`F.reshape`](https://mxnet.apache.org/api/python/docs/api/ndarray/_autogen/mxnet.ndarray.reshape.html) or [`F.reshape_like`](https://mxnet.apache.org/api/python/docs/api/ndarray/_autogen/mxnet.ndarray.reshape_like.html). 
+There are however several operators that can help you with array manipulations like: [`F.split`](/api/python/docs/api/ndarray/ndarray.html#mxnet.ndarray.split), [`F.slice`](/api/python/docs/api/ndarray/ndarray.html#mxnet.ndarray.slice), [`F.take`](/api/python/docs/api/ndarray/ndarray.html#mxnet.ndarray.take),[`F.pick`](/api/python/docs/api/ndarray/ndarray.html#mxnet.ndarray.pick), [`F.where`](/api/python/docs/api/ndarray/ndarray.html#mxnet.ndarray.where), [`F.reshape`](/api/python/docs/api/ndarray/ndarray.html#mxnet.ndarray.reshape) or [`F.reshape_like`](/api/python/docs/api/ndarray/ndarray.html#mxnet.ndarray.reshape_like). ### Data Type @@ -277,10 +277,10 @@ def hybrid_forward(self, F, x): Trying to access the shape of a tensor in a hybridized block would result in this error: `AttributeError: 'Symbol' object has no attribute 'shape'`. -Again, you cannot use the shape of the symbol at runtime as symbols only describe operations and not the underlying data they operate on. -Note: This will change in the future as Apache MXNet will support [dynamic shape inference](https://cwiki.apache.org/confluence/display/MXNET/Dynamic+shape), and the shapes of symbols will be symbols themselves +Again, you cannot use the shape of the symbol at runtime as symbols only describe operations and not the underlying data they operate on. +Note: This will change in the future as Apache MXNet will support [dynamic shape inference](https://cwiki.apache.org/confluence/display/MXNET/Dynamic+shape), and the shapes of symbols will be symbols themselves -There are also a lot of operators that support special indices to help with most of the use-cases where you would want to access the shape information. For example, `F.reshape(x, (0,0,-1))` will keep the first two dimensions unchanged and collapse all further dimensions into the third dimension. See the documentation of the [`F.reshape`](https://mxnet.apache.org/api/python/docs/api/ndarray/_autogen/mxnet.ndarray.reshape.html) for more details. +There are also a lot of operators that support special indices to help with most of the use-cases where you would want to access the shape information. For example, `F.reshape(x, (0,0,-1))` will keep the first two dimensions unchanged and collapse all further dimensions into the third dimension. See the documentation of the [`F.reshape`](/api/python/docs/api/ndarray/ndarray.htmlmxnet.ndarray.reshape.html) for more details. ### Item assignment @@ -294,7 +294,7 @@ def hybrid_forward(self, F, x): Would get you this error `TypeError: 'Symbol' object does not support item assignment`. -Direct item assignment is not possible in symbolic graph since it needs to be part of a computational graph. One way is to use add more inputs to your graph and use masking or the [`F.where`](https://mxnet.apache.org/api/python/docs/api/ndarray/_autogen/mxnet.ndarray.where.html) operator. +Direct item assignment is not possible in symbolic graph since it needs to be part of a computational graph. One way is to use add more inputs to your graph and use masking or the [`F.where`](/api/python/docs/api/ndarray/ndarray.htmlmxnet.ndarray.where.html) operator. 
e.g to set the first element to 2 you can do: diff --git a/docs/python_docs/python/tutorials/packages/gluon/blocks/init.md b/docs/python_docs/python/tutorials/packages/gluon/blocks/init.md index 49df0a0795d6..ec9cd380acb0 100644 --- a/docs/python_docs/python/tutorials/packages/gluon/blocks/init.md +++ b/docs/python_docs/python/tutorials/packages/gluon/blocks/init.md @@ -27,7 +27,7 @@ work: dimensionality. * We added layers without regard to the output dimension of the previous layer. * We even 'initialized' these parameters without knowing how many parameters - were were to initialize. + we were going to initialize. All of those things sound impossible and indeed, they are. After all, there's no way MXNet (or any other framework for that matter) could predict what the @@ -36,7 +36,7 @@ convolutional networks and images this problem will become even more pertinent, since the input dimensionality (i.e. the resolution of an image) will affect the dimensionality of subsequent layers. The ability to determine parameter dimensionality during run-time rather than at coding time -greatly simplifies the process of doing deep learning. +greatly simplifies the process of doing deep learning. ## Instantiating a Network @@ -65,10 +65,10 @@ accessing the parameters, that's exactly what happens. print(net.collect_params()) ``` -You'll notice `None` here in each `Dense` layer. This absense of value is how +You'll notice `None` here in each `Dense` layer. This absence of value is how MXNet keeps track of unspecified dimensionality. In particular, trying to access `net[0].weight.data()` at this point would trigger a runtime error stating that -the network needs initializing before it can do anything. +the network needs initializing before it can do anything. Note that if we did want to specify dimensionality, we could have done so by using the kwarg `in_units`, e.g. `Dense(256, activiation='relu', in_units=20)`. @@ -318,5 +318,5 @@ cases you should now be aware of include custom initialization and tied paramete ## Recommended Next Steps -* Check out the [API Docs](https://mxnet.apache.org/versions/master/api/python/optimization/optimization.html) on initialization for a list of available initialization methods. -* See [this tutorial](https://mxnet.apache.org/versions/master/tutorials/gluon/naming.html) for more information on Gluon Parameters. +* Check out the [API Docs](/api/python/docs/api/optimizer/index.html) on initialization for a list of available initialization methods. +* See [this tutorial](/api/python/docs/tutorials/packages/gluon/blocks/naming.html) for more information on Gluon Parameters. diff --git a/docs/python_docs/python/tutorials/packages/gluon/blocks/nn.md b/docs/python_docs/python/tutorials/packages/gluon/blocks/nn.md index 74ad76e415b2..60aa366ad2bb 100644 --- a/docs/python_docs/python/tutorials/packages/gluon/blocks/nn.md +++ b/docs/python_docs/python/tutorials/packages/gluon/blocks/nn.md @@ -20,7 +20,7 @@ As network complexity increases, we move from designing single to entire layers -of neurons. +of neurons. Neural network designs like [ResNet-152](https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf) @@ -52,7 +52,7 @@ net(x) This generates a network with a hidden layer of $256$ units, followed by a ReLU activation and another $10$ units governing the output. 
In particular, we used -the [`nn.Sequential`](/api/gluon/_autogen/mxnet.gluon.nn.Sequential.html#mxnet.gluon.nn.Sequential) +the [`nn.Sequential`](/api/python/docs/api/gluon/nn/index.html#mxnet.gluon.nn.Sequential) constructor to generate an empty network into which we then inserted both layers. What exactly happens inside `nn.Sequential` has remained rather mysterious so far. In the following we will see that this @@ -81,11 +81,11 @@ layers to defining blocks (of one or more layers): ## A Sequential Block -The [`Block`](/api/gluon/nn.html#blocks) class is a generic component +The [`Block`](/api/python/docs/api/gluon/nn/index.html#mxnet.gluon.nn.Block) class is a generic component describing data flow. When the data flows through a sequence of blocks, each block applied to the output of the one before with the first block being applied on the input data itself, we have a special kind of block, namely the -`Sequential` block. +`Sequential` block. `Sequential` has helper methods to manage the sequence, with `add` being the main one of interest allowing you to append blocks in sequence. Once the @@ -172,8 +172,8 @@ initializes all of the Block-related parameters and then constructs the requisite layers. This attaches the coresponding layers and the required parameters to the class. Note that there is no need to define a backpropagation method in the class. The system automatically generates the `backward` method -needed for back propagation by automatically finding the gradient (see the guide on [autograd](guide/packages/autograd.html)). The same -applies to the [`initialize`](/api/gluon/_autogen/mxnet.gluon.nn.Block.initialize.html) method, which is generated automatically. Let's try +needed for back propagation by automatically finding the gradient (see the tutorial on [autograd](/api/python/docs/tutorials/packages/autograd/index.html)). The same +applies to the [`initialize`](/api/python/docs/api/gluon/nn/index.html#mxnet.gluon.nn.Block.initialize) method, which is generated automatically. Let's try this out: ```{.python .input n=2} @@ -193,12 +193,12 @@ great flexibility. ## Coding with `Blocks` ### Blocks -The [`Sequential`](/api/gluon/_autogen/mxnet.gluon.nn.Sequential.html) class +The [`Sequential`](/api/python/docs/api/gluon/nn/index.html#mxnet.gluon.nn.Sequential) class can make model construction easier and does not require you to define the `forward` method; however, directly inheriting from -its parent class, [`Block`](/api/gluon/mxnet.gluon.nn.Block.html), can greatly +its parent class, [`Block`](/api/python/docs/api/gluon/nn/index.html#mxnet.gluon.nn.Block), can greatly expand the flexibility of model construction. For example, implementing the -`forward` method means you can introduce control flow in the network. +`forward` method means you can introduce control flow in the network. ### Constant parameters Now we'd like to introduce the notation of a *constant* parameter. These are @@ -298,7 +298,7 @@ After all, we have lots of dictionary lookups, code execution, and lots of other Pythonic things going on in what is supposed to be a high performance deep learning library. The problems of Python's [Global Interpreter Lock](https://wiki.python.org/moin/GlobalInterpreterLock) are well -known. +known. In the context of deep learning, we often have highly performant GPUs that depend on CPUs running Python to tell them what to do. This mismatch can @@ -306,8 +306,8 @@ manifest in the form of GPU starvation when the CPUs can not provide instruction fast enough. 
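The `Blocks` discussion above notes that writing the `forward` method yourself lets a network use ordinary Python control flow. A small sketch of that idea, assuming imperative (non-hybridized) execution and arbitrary layer sizes:

```python
from mxnet import nd
from mxnet.gluon import nn

class ControlFlowMLP(nn.Block):
    def __init__(self, **kwargs):
        super(ControlFlowMLP, self).__init__(**kwargs)
        self.hidden = nn.Dense(20, activation='relu')
        self.output = nn.Dense(10)

    def forward(self, x):
        x = self.hidden(x)
        # Plain Python control flow: keep halving until the activations are small.
        while nd.norm(x).asscalar() > 1:
            x = x / 2
        return self.output(x)

net = ControlFlowMLP()
net.initialize()
print(net(nd.random.uniform(shape=(2, 5))).shape)
```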
We can improve this situation by deferring to a more performant language instead of Python when possible. -Gluon does this by allowing for [Hybridization](./hybridize.md). In it, the +Gluon does this by allowing for [Hybridization](hybridize.html). In it, the Python interpreter executes the block the first time it's invoked. The Gluon runtime records what is happening and the next time around it short circuits any calls to Python. This can accelerate things considerably in some cases but -care needs to be taken with [control flow](../../crash-course/3-autograd.md). +care needs to be taken with [control flow](/api/python/getting-started/crash-course/3-autograd.html). diff --git a/docs/python_docs/python/tutorials/packages/gluon/blocks/parameters.md b/docs/python_docs/python/tutorials/packages/gluon/blocks/parameters.md index 99f3b4a5f1e5..05461e9929a6 100644 --- a/docs/python_docs/python/tutorials/packages/gluon/blocks/parameters.md +++ b/docs/python_docs/python/tutorials/packages/gluon/blocks/parameters.md @@ -19,7 +19,7 @@ -The ultimate goal of training deep neural networks is finding good parameter values for a given architecture. The [`nn.Sequential`](https://mxnet.apache.org/api/python/docs/api/gluon/_autogen/mxnet.gluon.nn.Sequential.html#mxnet.gluon.nn.Sequential) class is a perfect tool to work with standard models. However, very few models are entirely standard, and most scientists want to build novel things, which requires working with model parameters. +The ultimate goal of training deep neural networks is finding good parameter values for a given architecture. The [nn.Sequential](/api/python/docs/api/gluon/nn/index.html#mxnet.gluon.nn.Sequential) class is a perfect tool to work with standard models. However, very few models are entirely standard, and most scientists want to build novel things, which requires working with model parameters. This section shows how to manipulate parameters. In particular we will cover the following aspects: @@ -74,7 +74,7 @@ print(net[0].params['dense0_weight'].data()) Note that the weights are nonzero as they were randomly initialized when we constructed the network. -[`data`](https://mxnet.apache.org/api/python/docs/api/gluon/_autogen/mxnet.gluon.Parameter.data.html) is not the only method that we can invoke. For instance, we can compute the gradient with respect to the parameters. It has the same shape as the weight. However, since we did not invoke backpropagation yet, the values are all 0. +[data](/api/python/docs/api/gluon/parameter.html#mxnet.gluon.Parameter.data) is not the only method that we can invoke. For instance, we can compute the gradient with respect to the parameters. It has the same shape as the weight. However, since we did not invoke backpropagation yet, the values are all 0. ```{.python .input n=5} net[0].weight.grad() @@ -82,7 +82,7 @@ net[0].weight.grad() ### All Parameters at Once -Accessing parameters as described above can be a bit tedious, in particular if we have more complex blocks, or blocks of blocks (or even blocks of blocks of blocks), since we need to walk through the entire tree in reverse order to learn how the blocks were constructed. To avoid this, blocks come with a method [`collect_params`](https://mxnet.apache.org/api/python/docs/api/gluon/_autogen/mxnet.gluon.nn.Block.collect_params.html#mxnet.gluon.nn.Block.collect_params) which grabs all parameters of a network in one dictionary such that we can traverse it with ease. 
It does so by iterating over all constituents of a block and calls `collect_params` on sub-blocks as needed. To see the difference, consider the following: +Accessing parameters as described above can be a bit tedious, in particular if we have more complex blocks, or blocks of blocks (or even blocks of blocks of blocks), since we need to walk through the entire tree in reverse order to learn how the blocks were constructed. To avoid this, blocks come with a method [collect_params](/api/python/docs/api/gluon/block.html#mxnet.gluon.Block.collect_params) which grabs all parameters of a network in one dictionary such that we can traverse it with ease. It does so by iterating over all constituents of a block and calls `collect_params` on sub-blocks as needed. To see the difference, consider the following: ```{.python .input n=6} # Parameters only for the first layer @@ -143,7 +143,7 @@ rgnet[0][1][0].bias.data() ### Saving and loading parameters -In order to save parameters, we can use [`save_parameters`](https://mxnet.apache.org/api/python/docs/api/gluon/_autogen/mxnet.gluon.nn.Block.save_parameters.html#mxnet.gluon.nn.Block.save_parameters) method on the whole network or a particular subblock. The only parameter that is needed is the `file_name`. In a similar way, we can load parameters back from the file. We use [`load_parameters`](https://mxnet.apache.org/api/python/docs/api/gluon/_autogen/mxnet.gluon.nn.Block.load_parameters.html#mxnet.gluon.nn.Block.load_parameters) method for that: +In order to save parameters, we can use [save_parameters](/api/python/docs/api/gluon/block.html#mxnet.gluon.Block.save_parameters) method on the whole network or a particular subblock. The only parameter that is needed is the `file_name`. In a similar way, we can load parameters back from the file. We use [load_parameters](/api/python/docs/api/gluon/block.html#mxnet.gluon.Block.load_parameters) method for that: ```{.python .input} rgnet.save_parameters('model.params') @@ -152,7 +152,7 @@ rgnet.load_parameters('model.params') ## Parameter Initialization -Now that we know how to access the parameters, let's look at how to initialize them properly. By default, MXNet initializes the weight matrices uniformly by drawing from $U[-0.07, 0.07]$ and the bias parameters are all set to $0$. However, we often need to use other methods to initialize the weights. MXNet's [`init`](https://mxnet.apache.org/api/python/docs/api/gluon-related/mxnet.initializer.html?#module-mxnet.initializer) module provides a variety of preset initialization methods, but if we want something unusual, we need to do a bit of extra work. +Now that we know how to access the parameters, let's look at how to initialize them properly. By default, MXNet initializes the weight matrices uniformly by drawing from $U[-0.07, 0.07]$ and the bias parameters are all set to $0$. However, we often need to use other methods to initialize the weights. MXNet's [init](/api/python/docs/api/initializer/index.html#mxnet.initializer) module provides a variety of preset initialization methods, but if we want something unusual, we need to do a bit of extra work. ### Built-in Initialization @@ -165,14 +165,14 @@ net.initialize(init=init.Normal(sigma=0.01), force_reinit=True) net[0].weight.data()[0] ``` -If we wanted to initialize all parameters to 1, we could do this simply by changing the initializer to [`Constant(1)`](https://mxnet.apache.org/api/python/docs/api/gluon-related/_autogen/mxnet.initializer.Constant.html#mxnet.initializer.Constant). 
+If we wanted to initialize all parameters to 1, we could do this simply by changing the initializer to [Constant(1)](/api/python/docs/api/initializer/index.html#mxnet.initializer.Constant). ```{.python .input n=10} net.initialize(init=init.Constant(1), force_reinit=True) net[0].weight.data()[0] ``` -If we want to initialize only a specific parameter in a different manner, we can simply set the initializer only for the appropriate subblock (or parameter) for that matter. For instance, below we initialize the second layer to a constant value of 42 and we use the [`Xavier`](https://mxnet.apache.org/api/python/docs/api/gluon-related/_autogen/mxnet.initializer.Xavier.html#mxnet.initializer.Xavier) initializer for the weights of the first layer. +If we want to initialize only a specific parameter in a different manner, we can simply set the initializer only for the appropriate subblock (or parameter) for that matter. For instance, below we initialize the second layer to a constant value of 42 and we use the [Xavier](/api/python/docs/api/initializer/index.html#mxnet.initializer.Xavier) initializer for the weights of the first layer. ```{.python .input n=11} net[1].initialize(init=init.Constant(42), force_reinit=True) @@ -183,7 +183,7 @@ print(net[0].weight.data()[0]) ### Custom Initialization -Sometimes, the initialization methods we need are not provided in the `init` module. If this is the case, we can implement a subclass of the [`Initializer`](https://mxnet.apache.org/api/python/docs/api/gluon-related/_autogen/mxnet.initializer.Initializer.html#mxnet.initializer.Initializer) class so that we can use it like any other initialization method. Usually, we only need to implement the `_init_weight` method and modify the incoming NDArray according to the initial result. In the example below, we pick a nontrivial distribution, just to prove the point. We draw the coefficients from the following distribution: +Sometimes, the initialization methods we need are not provided in the `init` module. If this is the case, we can implement a subclass of the [Initializer](/api/python/docs/api/initializer/index.html#mxnet.initializer.Initializer) class so that we can use it like any other initialization method. Usually, we only need to implement the `_init_weight` method and modify the incoming NDArray according to the initial result. In the example below, we pick a nontrivial distribution, just to prove the point. We draw the coefficients from the following distribution: $$ \begin{aligned} @@ -206,7 +206,7 @@ net.initialize(MyInit(), force_reinit=True) net[0].weight.data()[0] ``` -If even this functionality is insufficient, we can set parameters directly. Since `data()` returns an NDArray we can access it just like any other matrix. A note for advanced users - if you want to adjust parameters within an [`autograd`](https://mxnet.apache.org/api/python/docs/api/gluon-related/mxnet.autograd.html?#module-mxnet.autograd) scope you need to use [`set_data`](https://mxnet.apache.org/api/python/docs/api/gluon/_autogen/mxnet.gluon.Parameter.set_data.html#mxnet.gluon.Parameter.set_data) to avoid confusing the automatic differentiation mechanics. +If even this functionality is insufficient, we can set parameters directly. Since `data()` returns an NDArray we can access it just like any other matrix. 
A note for advanced users - if you want to adjust parameters within an [autograd](/api/python/docs/api/autograd/index.html) scope you need to use [set_data](/api/python/docs/api/gluon/parameter.html#mxnet.gluon.Parameter.set_data) to avoid confusing the automatic differentiation mechanics. ```{.python .input n=13} net[0].weight.data()[:] += 1 @@ -240,4 +240,4 @@ net[1].weight.data()[0,0] = 100 print(net[1].weight.data()[0] == net[2].weight.data()[0]) ``` -The above example shows that the parameters of the second and third layer are tied. They are identical rather than just being equal. That is, by changing one of the parameters the other one changes, too. What happens to the gradients is quite ingenious. Since the model parameters contain gradients, the gradients of the second hidden layer and the third hidden layer are accumulated in the [`shared.params.grad()`](https://mxnet.apache.org/api/python/docs/api/gluon/_autogen/mxnet.gluon.Parameter.grad.html) during backpropagation. +The above example shows that the parameters of the second and third layer are tied. They are identical rather than just being equal. That is, by changing one of the parameters the other one changes, too. What happens to the gradients is quite ingenious. Since the model parameters contain gradients, the gradients of the second hidden layer and the third hidden layer are accumulated in the [shared.params.grad()](/api/python/docs/api/gluon/parameter.html#mxnet.gluon.Parameter.grad) during backpropagation. diff --git a/docs/python_docs/python/tutorials/packages/gluon/image/image-augmentation.md b/docs/python_docs/python/tutorials/packages/gluon/image/image-augmentation.md index a2cdbb7cc97a..70be781be3d6 100644 --- a/docs/python_docs/python/tutorials/packages/gluon/image/image-augmentation.md +++ b/docs/python_docs/python/tutorials/packages/gluon/image/image-augmentation.md @@ -22,19 +22,19 @@ training data sets by making a series of random changes to the training images to produce similar, but different, training examples. Given its popularity in computer vision, the `mxnet.gluon.data.vision.transforms` model provides multiple pre-defined image augmentation methods. In this section we will briefly -go through this module. +go through this module. First, import the module required for this section. -```{.python .input n=1} +```python from matplotlib import pyplot as plt from mxnet import image -from mxnet.gluon import data as gdata, utils +from mxnet.gluon import data as gdata, utils ``` Then read the sample $400\times 500$ image. -```{.python .input n=2} +```python utils.download('https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/doc/cat.jpg') img = image.imread('cat.jpg') plt.imshow(img.asnumpy()) @@ -43,7 +43,7 @@ plt.show() In addition, we define a function to draw a list of images. -```{.python .input n=3} +```python def show_images(imgs, num_rows, num_cols, scale=2): figsize = (num_cols * scale, num_rows * scale) _, axes = plt.subplots(num_rows, num_cols, figsize=figsize) @@ -60,7 +60,7 @@ easier for us to observe the effect of image augmentation, we next define the auxiliary function `apply`. This function runs the image augmentation method `aug` multiple times on the input image `img` and shows all results. -```{.python .input n=4} +```python def apply(img, aug, num_rows=2, num_cols=4, scale=3): Y = [aug(img) for _ in range(num_rows * num_cols)] show_images(Y, num_rows, num_cols, scale) @@ -74,7 +74,7 @@ of image augmentation. 
Next, we use the `transforms` module to create the `RandomFlipLeftRight` instance, which introduces a 50% chance that the image is flipped left and right. -```{.python .input n=5} +```python apply(img, gdata.vision.transforms.RandomFlipLeftRight()) ``` @@ -83,14 +83,12 @@ However, at least for this example image, flipping up and down does not hinder recognition. Next, we create a `RandomFlipTopBottom` instance for a 50% chance of flipping the image up and down. -```{.python .input n=6} +```python apply(img, gdata.vision.transforms.RandomFlipTopBottom()) ``` In the example image we used, the cat is in the middle of the image, but this -may not be the case for all images. In the [“Pooling -Layer”](../chapter_convolutional-neural-networks/pooling.md) section, we -explained that the pooling layer can reduce the sensitivity of the convolutional +may not be the case for all images. In the [Pooling Layer](https://d2l.ai/chapter_convolutional-neural-networks/pooling.html) section of the d2l.ai book, we explain that the pooling layer can reduce the sensitivity of the convolutional layer to the target location. In addition, we can make objects appear at different positions in the image in different proportions by randomly cropping the image. This can also reduce the sensitivity of the model to the target diff --git a/docs/python_docs/python/tutorials/packages/gluon/image/pretrained_models.md b/docs/python_docs/python/tutorials/packages/gluon/image/pretrained_models.md index a928f949260b..fca73ad46aff 100644 --- a/docs/python_docs/python/tutorials/packages/gluon/image/pretrained_models.md +++ b/docs/python_docs/python/tutorials/packages/gluon/image/pretrained_models.md @@ -39,18 +39,18 @@ import numpy as np ## Loading the model -The [Gluon Model Zoo](https://mxnet.apache.org/api/python/gluon/model_zoo.html) provides a collection of off-the-shelf models. You can get the ImageNet pre-trained model by using `pretrained=True`. +The [Gluon Model Zoo](https://mxnet.apache.org/api/python/gluon/model_zoo.html) provides a collection of off-the-shelf models. You can get the ImageNet pre-trained model by using `pretrained=True`. If you want to train on your own classification problem from scratch, you can get an untrained network with a specific number of classes using the `classes` parameter: for example `net = vision.resnet18_v1(classes=10)`. However note that you cannot use the `pretrained` and `classes` parameter at the same time. If you want to use pre-trained weights as initialization of your network except for the last layer, have a look at the last section of this tutorial. We can specify the *context* where we want to run the model: the default behavior is to use a CPU context. There are two reasons for this: * First, this will allow you to test the notebook even if your machine is not equipped with a GPU :) * Second, we're going to predict a single image and we don't have any specific performance requirements. For production applications where you'd want to predict large batches of images with the best possible throughput, a GPU could definitely be the way to go. -* If you want to use a GPU, make sure you have pip installed the right version of mxnet, or you will get an error when using the `mx.gpu()` context. Refer to the [install instructions](https://mxnet.apache.org/install/index.html) +* If you want to use a GPU, make sure you have pip installed the right version of mxnet, or you will get an error when using the `mx.gpu()` context. 
Refer to the [install instructions](/get_started) ```python -# We set the context to CPU, you can switch to GPU if you have one and installed a compatible version of MXNet -ctx = mx.cpu() +# We set the context to CPU, you can switch to GPU if you have one and installed a compatible version of MXNet +ctx = mx.cpu() ``` @@ -141,7 +141,7 @@ def transform(image): cropped, crop_info = mx.image.center_crop(resized, (224, 224)) normalized = mx.image.color_normalize(cropped.astype(np.float32)/255, mean=mx.nd.array([0.485, 0.456, 0.406]), - std=mx.nd.array([0.229, 0.224, 0.225])) + std=mx.nd.array([0.229, 0.224, 0.225])) # the network expect batches of the form (N,3,224,224) transposed = normalized.transpose((2,0,1)) # Transposing from (224, 224, 3) to (3, 224, 224) batchified = transposed.expand_dims(axis=0) # change the shape from (3, 224, 224) to (1, 3, 224, 224) @@ -176,7 +176,7 @@ for index in top_pred: ``` -Let's turn this into a function. Our parameters are an image, a model, a list of categories and the number of top categories we'd like to print. +Let's turn this into a function. Our parameters are an image, a model, a list of categories and the number of top categories we'd like to print. ```python @@ -237,7 +237,7 @@ print(resnet18.output) Now you can train your model on your new data using the pre-trained weights as initialization. This is called transfer learning and it has proved to be very useful especially in the cases where you only have access to a small dataset. Your network will have already learned how to perform general pattern detection and feature extraction on the larger dataset. You can learn more about transfer learning and fine-tuning with MXNet in these tutorials: - [Transferring knowledge through fine-tuning](http://gluon.mxnet.io/chapter08_computer-vision/fine-tuning.html) -- [Fine Tuning an ONNX Model](https://mxnet.apache.org/tutorials/onnx/fine_tuning_gluon.html) +- [Fine Tuning an ONNX Model](/api/python/docs/tutorials/packages/onnx/fine_tuning_gluon.html) That's it! Explore the model zoo, have fun with pre-trained models! diff --git a/docs/python_docs/python/tutorials/packages/gluon/index.rst b/docs/python_docs/python/tutorials/packages/gluon/index.rst index 2212b93af578..c41bf3d9d116 100644 --- a/docs/python_docs/python/tutorials/packages/gluon/index.rst +++ b/docs/python_docs/python/tutorials/packages/gluon/index.rst @@ -178,31 +178,35 @@ Advanced Topics .. card:: :title: Naming - :link: naming.html + :link: blocks/naming.html Best practices for the naming of things. .. card:: :title: Custom Layers - :link: custom-layer.html + :link: blocks/custom-layer.html A guide to implementing custom layers. .. card:: :title: Custom Operators - :link: https://mxnet.apache.org/versions/master/tutorials/gluon/customop.html + :link: ../../extend/customop.html Building custom operators with numpy. +.. +<-- tutorial missing --> .. card:: :title: Custom Loss :link: custom-loss/custom-loss.html A guide to implementing custom losses. +.. + .. card:: :title: Gotchas using NumPy in Apache MXNet - :link: https://mxnet.apache.org/versions/master/tutorials/gluon/gotchas_numpy_in_mxnet.html + :link: ../ndarray/gotchas_numpy_in_mxnet.html Common misconceptions when using NumPy in Apache MXNet. 
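Pulling the pieces of the pretrained-models walkthrough above together, a compact prediction sketch using the `transforms` pipeline looks roughly like this ('cat.jpg' is a placeholder path, and the reported class indices depend on the downloaded ImageNet weights):

```python
import mxnet as mx
from mxnet.gluon.data.vision import transforms
from mxnet.gluon.model_zoo import vision

net = vision.resnet18_v1(pretrained=True)     # downloads ImageNet weights on first use

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

img = mx.image.imread('cat.jpg')              # HWC uint8 image, placeholder path
batch = preprocess(img).expand_dims(axis=0)   # -> (1, 3, 224, 224) float32
probs = mx.nd.softmax(net(batch))[0]
top5 = probs.topk(k=5).asnumpy().astype(int)
print('top-5 class indices:', top5)
print('top-5 probabilities:', probs.asnumpy()[top5])
```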
diff --git a/docs/python_docs/python/tutorials/packages/gluon/loss/custom-loss.md b/docs/python_docs/python/tutorials/packages/gluon/loss/custom-loss.md index 2357e80ab369..6c31ecbe9ee5 100644 --- a/docs/python_docs/python/tutorials/packages/gluon/loss/custom-loss.md +++ b/docs/python_docs/python/tutorials/packages/gluon/loss/custom-loss.md @@ -17,11 +17,11 @@ # Custom Loss Blocks -All neural networks need a loss function for training. A loss function is a quantative measure of how bad the predictions of the network are when compared to ground truth labels. Given this score, a network can improve by iteratively updating its weights to minimise this loss. Some tasks use a combination of multiple loss functions, but often you'll just use one. MXNet Gluon provides a number of the most commonly used loss functions, and you'll choose certain functions depending on your network and task. Some common task and loss function pairs include: +All neural networks need a loss function for training. A loss function is a quantitive measure of how bad the predictions of the network are when compared to ground truth labels. Given this score, a network can improve by iteratively updating its weights to minimise this loss. Some tasks use a combination of multiple loss functions, but often you'll just use one. MXNet Gluon provides a number of the most commonly used loss functions, and you'll choose certain functions depending on your network and task. Some common task and loss function pairs include: -- Regression: [L1Loss](/api/python/docs/api/gluon/_autogen/mxnet.gluon.loss.L1Loss.html), [L2Loss](/api/python/docs/api/gluon/_autogen/mxnet.gluon.loss.L2Loss.html) -- Classification: [SigmoidBinaryCrossEntropyLoss](/api/python/docs/api/gluon/_autogen/mxnet.gluon.loss.SigmoidBinaryCrossEntropyLoss.html), [SoftmaxBinaryCrossEntropyLoss](/api/python/docs/api/gluon/_autogen/mxnet.gluon.loss.SoftmaxBinaryCrossEntropyLoss.html) -- Embeddings: [HingeLoss](/api/python/docs/api/gluon/_autogen/mxnet.gluon.loss.HingeLoss.html) +- Regression: [L1Loss](/api/python/docs/api/gluon/loss/index.html#mxnet.gluon.loss.L1Loss), [L2Loss](/api/python/docs/api/gluon/loss/index.html#mxnet.gluon.loss.L2Loss) +- Classification: [SigmoidBinaryCrossEntropyLoss](/api/python/docs/api/gluon/loss/index.html#mxnet.gluon.loss.SigmoidBinaryCrossEntropyLoss), [SoftmaxBinaryCrossEntropyLoss](/api/python/docs/api/gluon/loss/index.html#mxnet.gluon.loss.SoftmaxBinaryCrossEntropyLoss) +- Embeddings: [HingeLoss](/api/python/docs/api/gluon/loss/index.html#mxnet.gluon.loss.HingeLoss) However, we may sometimes want to solve problems that require customized loss functions; this tutorial shows how we can do that in Gluon. We will implement contrastive loss which is typically used in Siamese networks. @@ -35,17 +35,17 @@ import random ### What is Contrastive Loss -[Contrastive loss](http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf) is a distance-based loss function. During training, pairs of images are fed into a model. If the images are similar, the loss function will return 0, otherwise 1. +[Contrastive loss](http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf) is a distance-based loss function. During training, pairs of images are fed into a model. If the images are similar, the loss function will return 0, otherwise 1. -*Y* is a binary label indicating similarity between training images. 
Contrastive loss uses the Euclidean distance *D* between images and is the sum of 2 terms: +*Y* is a binary label indicating similarity between training images. Contrastive loss uses the Euclidean distance *D* between images and is the sum of 2 terms: - the loss for a pair of similar points - the loss for a pair of dissimilar points -The loss function uses a margin *m* which is has the effect that dissimlar pairs only contribute if their loss is within a certain margin. +The loss function uses a margin *m* which is has the effect that dissimlar pairs only contribute if their loss is within a certain margin. -In order to implement such a customized loss function in Gluon, we only need to define a new class that is inheriting from the [`Loss`](https://mxnet.apache.org/api/python/gluon/loss.html#mxnet.gluon.loss.Loss) base class. We then define the contrastive loss logic in the [`hybrid_forward`](https://mxnet.apache.org/_modules/mxnet/gluon/block.html#HybridBlock.hybrid_forward) method. This method takes the images `image1`, `image2` and the label which defines whether `image1` and `image2` are similar (=0) or dissimilar (=1). The input F is an `mxnet.ndarry` or an `mxnet.symbol` if we hybridize the network. Gluon's `Loss` base class is in fact a [`HybridBlock`](https://mxnet.apache.org/api/python/gluon/gluon.html#mxnet.gluon.HybridBlock). This means we can either run imperatively or symbolically. When we hybridize our custom loss function, we can get performance speedups. +In order to implement such a customized loss function in Gluon, we only need to define a new class that is inheriting from the [`Loss`](/api/python/docs/api/gluon/loss/index.html#mxnet.gluon.loss.Loss) base class. We then define the contrastive loss logic in the [`hybrid_forward`](/api/python/docs/api/gluon/hybrid_block.html#mxnet.gluon.HybridBlock.hybrid_forward) method. This method takes the images `image1`, `image2` and the label which defines whether `image1` and `image2` are similar (=0) or dissimilar (=1). The input F is an `mxnet.ndarry` or an `mxnet.symbol` if we hybridize the network. Gluon's `Loss` base class is in fact a [`HybridBlock`](/api/python/docs/api/gluon/hybrid_block.html). This means we can either run imperatively or symbolically. When we hybridize our custom loss function, we can get performance speedups. ```python @@ -66,7 +66,7 @@ loss = ContrastiveLoss(margin=6.0) ``` ### Define the Siamese network -A [Siamese network](https://papers.nips.cc/paper/769-signature-verification-using-a-siamese-time-delay-neural-network.pdf) consists of 2 identical networks, that share the same weights. They are trained on pairs of images and each network processes one image. The label defines whether the pair of images is similar or not. The Siamese network learns to differentiate between two input images. +A [Siamese network](https://papers.nips.cc/paper/769-signature-verification-using-a-siamese-time-delay-neural-network.pdf) consists of 2 identical networks, that share the same weights. They are trained on pairs of images and each network processes one image. The label defines whether the pair of images is similar or not. The Siamese network learns to differentiate between two input images. Our network consists of 2 convolutional and max pooling layers that downsample the input image. The output is then fed through a fully connected layer with 256 hidden units and another fully connected layer with 2 hidden units. 
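Written out, the contrastive loss sketched above is commonly given as follows (one common formulation, following the Hadsell et al. paper cited earlier, with *Y* = 0 for a similar pair, *Y* = 1 for a dissimilar pair, *D* the Euclidean distance between the two embeddings and *m* the margin):

$$ L = (1 - Y)\,\frac{1}{2} D^2 + Y\,\frac{1}{2}\,\{\max(0,\, m - D)\}^2 $$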
@@ -84,7 +84,7 @@ class Siamese(gluon.HybridBlock): self.cnn.add(gluon.nn.MaxPool2D(2, 2)) self.cnn.add(gluon.nn.Dense(256, activation='relu')) self.cnn.add(gluon.nn.Dense(2, activation='softrelu')) - + def hybrid_forward(self, F, input0, input1): out0 = self.cnn(input0) out1 = self.cnn(input1) @@ -143,13 +143,13 @@ test_dataloader = gluon.data.DataLoader(test.transform(transform), shuffle=False, batch_size=1) ``` -Following code plots some examples from the test dataset. +Following code plots some examples from the test dataset. ```python img1, img2, label = test[0] print("Same: {}".format(int(label.asscalar()) == 0)) -fig, (ax0, ax1) = plt.subplots(ncols=2, figsize=(10, 5)) +fig, (ax0, ax1) = plt.subplots(ncols=2, figsize=(10, 5)) ax0.imshow(img1.asnumpy()[:,:,0], cmap='gray') ax0.axis('off') ax1.imshow(img2.asnumpy()[:,:,0], cmap='gray') @@ -191,7 +191,7 @@ for epoch in range(10): ``` ### Test the trained Siamese network -During inference we compute the Euclidean distance between the output vectors of the Siamese network. High distance indicates dissimilarity, low values indicate similarity. +During inference we compute the Euclidean distance between the output vectors of the Siamese network. High distance indicates dissimilarity, low values indicate similarity. ```python @@ -224,7 +224,7 @@ Verify whether the last network layer uses the correct activation function: for In our example, we computed the square root of squared distances between 2 images: `F.sqrt(distances_squared)`. If images are very similar we take the sqare root of a value close to 0, which can lead to *NaN* values. Adding a small epsilon to `distances_squared` avoids this problem. #### Shape of intermediate loss vectors -In most cases having the wrong tensor shape will lead to an error, as soon as we compare data with labels. But in some cases, we may be able to normally run the training, but it does not converge. For instance, if we don't set `keepdims=True` in our customized loss function, the shape of the tensor changes. The example still runs fine but does not converge. +In most cases having the wrong tensor shape will lead to an error, as soon as we compare data with labels. But in some cases, we may be able to normally run the training, but it does not converge. For instance, if we don't set `keepdims=True` in our customized loss function, the shape of the tensor changes. The example still runs fine but does not converge. If you encounter a similar problem, then it is useful to check the tensor shape after each computation step in the loss function. diff --git a/docs/python_docs/python/tutorials/packages/gluon/loss/loss.md b/docs/python_docs/python/tutorials/packages/gluon/loss/loss.md index 6ea44ef180f4..17aaef74d106 100644 --- a/docs/python_docs/python/tutorials/packages/gluon/loss/loss.md +++ b/docs/python_docs/python/tutorials/packages/gluon/loss/loss.md @@ -19,8 +19,8 @@ Loss functions are used to train neural networks and to compute the difference between output and target variable. A critical component of training neural networks is the loss function. A loss function is a quantative measure of how bad the predictions of the network are when compared to ground truth labels. Given this score, a network can improve by iteratively updating its weights to minimise this loss. Some tasks use a combination of multiple loss functions, but often you'll just use one. MXNet Gluon provides a number of the most commonly used loss functions, and you'll choose certain loss functions depending on your network and task. 
Some common task and loss function pairs include: -- regression: [L1Loss](/api/python/docs/api/gluon/_autogen/mxnet.gluon.loss.L1Loss.html), [L2Loss](/api/python/docs/api/gluon/_autogen/mxnet.gluon.loss.L2Loss.html) -- classification: [SigmoidBinaryCrossEntropyLoss](/api/python/docs/api/gluon/_autogen/mxnet.gluon.loss.SigmoidBinaryCrossEntropyLoss.html), [SoftmaxBinaryCrossEntropyLoss](/api/python/docs/api/gluon/_autogen/mxnet.gluon.loss.SoftmaxBinaryCrossEntropyLoss.html) +- regression: [L1Loss](/api/python/docs/api/gluon/_autogen/mxnet.gluon.loss.L1Loss.html), [L2Loss](/api/python/docs/api/gluon/loss/index.html#mxnet.gluon.loss.L2Loss) +- classification: [SigmoidBinaryCrossEntropyLoss](/api/python/docs/api/gluon/_autogen/mxnet.gluon.loss.SigmoidBinaryCrossEntropyLoss.html), [SoftmaxBinaryCrossEntropyLoss](/api/python/docs/api/gluon/_autogen/mxnet.gluon.loss.SoftmaxBinaryCrossEntropyLoss.html) - embeddings: [HingeLoss](/api/python/docs/api/gluon/_autogen/mxnet.gluon.loss.HingeLoss.html) We'll first import the modules, where the `mxnet.gluon.loss` module is imported as `gloss` to avoid the commonly used name `loss`. @@ -30,7 +30,7 @@ from IPython import display from matplotlib import pyplot as plt import mxnet as mx from mxnet import nd, autograd -from mxnet.gluon import nn, loss as gloss +from mxnet.gluon import nn, loss as gloss ``` ## Basic Usages @@ -58,7 +58,7 @@ These values should be equal to the math definition: $0.5\|x-y\|^2$. Next we show how to use a loss function to compute gradients. ```{.python .input} -X = nd.random.uniform(shape=(2, 4)) +X = nd.random.uniform(shape=(2, 4)) net = nn.Dense(1) net.initialize() with autograd.record(): @@ -86,11 +86,11 @@ def plot(x, y): plt.xlabel('x') plt.ylabel('loss') plt.show() - + def show_regression_loss(loss): x = nd.arange(-5, 5, .1) y = loss(x, nd.zeros_like(x)) - plot(x, y) + plot(x, y) ``` @@ -100,7 +100,7 @@ Then plot the classification losses with label values fixed to be 1. def show_classification_loss(loss): x = nd.arange(-5, 5, .1) y = loss(x, nd.ones_like(x)) - plot(x, y) + plot(x, y) ``` #### [L1 Loss](https://mxnet.apache.org/api/python/gluon/loss.html#mxnet.gluon.loss.L1Loss) @@ -121,7 +121,7 @@ L2Loss, also called Mean Squared Error, is a regression loss function that compu $$ L = \frac{1}{2} \sum_i \vert {label}_i - {pred}_i \vert^2. $$ -Compared to L1, L2 loss it is a smooth function and it creates larger gradients for large loss values. However due to the squaring it puts high weight on outliers. +Compared to L1, L2 loss it is a smooth function and it creates larger gradients for large loss values. However due to the squaring it puts high weight on outliers. ```{.python .input} show_regression_loss(gloss.L2Loss()) @@ -130,7 +130,7 @@ show_regression_loss(gloss.L2Loss()) #### [Huber Loss](https://mxnet.apache.org/api/python/gluon/loss.html#mxnet.gluon.loss.HuberLosss) HuberLoss combines advantages of L1 and L2 loss. It calculates a smoothed L1 loss that is equal to L1 if the absolute error exceeds a threshold $$\rho$$, otherwise it is equal to L2. 
It is defined as: -$$ +$$ \begin{split}L = \sum_i \begin{cases} \frac{1}{2 {rho}} ({label}_i - {pred}_i)^2 & \text{ if } |{label}_i - {pred}_i| < {rho} \\ |{label}_i - {pred}_i| - \frac{{rho}}{2} & @@ -158,7 +158,7 @@ show_classification_loss(gloss.SigmoidBinaryCrossEntropyLoss()) In classification, we often apply the softmax operator to the predicted outputs to obtain prediction probabilities, -and then apply the cross entropy loss against the true labels: +and then apply the cross entropy loss against the true labels: $$ \begin{align}\begin{aligned}p = \softmax({pred})\\L = -\sum_i \sum_j {label}_j \log p_{ij}\end{aligned}\end{align} $$ @@ -184,9 +184,9 @@ $$ show_classification_loss(gloss.HingeLoss()) ``` -#### [Logistic Loss](https://mxnet.apache.org/versions/master/api/python/gluon/loss.html#mxnet.gluon.loss.LogisticLoss) +#### [Logistic Loss](/api/python/docs/api/gluon/loss/index.html#mxnet.gluon.loss.LogisticLoss) -The Logistic Loss function computes the performance of binary classification models. +The Logistic Loss function computes the performance of binary classification models. $$ L = \sum_i \log(1 + \exp(- {pred}_i \cdot {label}_i)) $$ @@ -196,7 +196,7 @@ The log loss decreases the closer the prediction is to the actual label. It is s show_classification_loss(gloss.LogisticLoss()) ``` -#### [Kullback-Leibler Divergence Loss](https://mxnet.apache.org/versions/master/api/python/gluon/loss.html#mxnet.gluon.loss.KLDivLoss) +#### [Kullback-Leibler Divergence Loss](/api/python/docs/api/gluon/loss/index.html#mxnet.gluon.loss.KLDivLoss) The Kullback-Leibler divergence loss measures the divergence between two probability distributions by calculating the difference between cross entropy and entropy. It takes as input the probability of predicted label and the probability of true label. @@ -218,30 +218,30 @@ loss = loss_fn(output, target_dist) print('loss (kl divergence): {}'.format(loss.asnumpy().tolist())) ``` -#### [Triplet Loss](https://mxnet.apache.org/versions/master/api/python/gluon/loss.html#mxnet.gluon.loss.TripletLoss) +#### [Triplet Loss](/api/python/docs/api/gluon/loss/index.html#mxnet.gluon.loss.TripletLoss) -Triplet loss takes three input arrays and measures the relative similarity. It takes a positive and negative input and the anchor. +Triplet loss takes three input arrays and measures the relative similarity. It takes a positive and negative input and the anchor. $$ L = \sum_i \max(\Vert {pos_i}_i - {pred} \Vert_2^2 - \Vert {neg_i}_i - {pred} \Vert_2^2 + {margin}, 0) $$ -The loss function minimizes the distance between similar inputs and maximizes the distance between dissimilar ones. -In the case of learning embeddings for images of characters, the network may get as input the following 3 images: +The loss function minimizes the distance between similar inputs and maximizes the distance between dissimilar ones. +In the case of learning embeddings for images of characters, the network may get as input the following 3 images: ![triplet_loss](triplet_loss.png) The network would learn to minimize the distance between the two `A`'s and maximize the distance between `A` and `Z`. -#### [CTC Loss](https://mxnet.apache.org/versions/master/api/python/gluon/loss.html#mxnet.gluon.loss.CTCLoss) +#### [CTC Loss](/api/python/docs/api/gluon/loss/index.html#mxnet.gluon.loss.CTCLoss) CTC Loss is the [connectionist temporal classification loss](https://distill.pub/2017/ctc/) . It is used to train recurrent neural networks with variable time dimension. 
It learns the alignment and labelling of input sequences. It takes a sequence as input and gives probabilities for each timestep. For instance, in the following image the word is not well aligned with the 5 timesteps because of the different sizes of characters. CTC Loss finds for each timestep the highest probability, e.g. `t1` presents with high probability a `C`. It combines the highest probabilities and returns the best path decoding. For an in-depth tutorial on how to use CTC-Loss in MXNet, check out this [example](https://github.com/apache/incubator-mxnet/tree/master/example/ctc).

![ctc_loss](ctc_loss.png)

-#### [Cosine Embedding Loss](https://mxnet.apache.org/versions/master/api/python/gluon/loss.html#mxnet.gluon.loss.CosineEmbeddingLoss)
-The cosine embedding loss computes the cosine distance between two input vectors.
+#### [Cosine Embedding Loss](/api/python/docs/api/gluon/loss/index.html#mxnet.gluon.loss.CosineEmbeddingLoss)
+The cosine embedding loss computes the cosine distance between two input vectors.
 $$
 \begin{split}L = \sum_i \begin{cases} 1 - {cos\_sim({input1}_i, {input2}_i)} & \text{ if } {label}_i = 1\\
@@ -249,7 +249,7 @@ $$
 cos\_sim(input1, input2) = \frac{{input1}_i.{input2}_i}{||{input1}_i||.||{input2}_i||}\end{split}
 $$
-Cosine distance measures the similarity between two arrays given a label and is typically used for learning nonlinear embeddings.
+Cosine distance measures the similarity between two arrays given a label and is typically used for learning nonlinear embeddings.
 For instance, in the following code example we measure the similarity between the input vectors `x` and `y`. Since they are the same the label equals `1`. The loss function returns $$ \sum_i 1 - {cos\_sim({input1}_i, {input2}_i)} $$ which is equal `0`.

 ```{.python .input}
@@ -270,7 +270,7 @@ loss = gloss.CosineEmbeddingLoss()
 print(loss(x,y,label))
 ```

-#### [PoissonNLLLoss](https://mxnet.apache.org/versions/master/api/python/gluon/loss.html#mxnet.gluon.loss.PoissonNLLLoss)
+#### [PoissonNLLLoss](/api/python/docs/api/gluon/loss/index.html#mxnet.gluon.loss.PoissonNLLLoss)
 Poisson distribution is widely used for modelling count data. It is defined as:

 $$
@@ -278,13 +278,13 @@ f(x) = \frac{\mu ^ {\kern 0.08 em x} e ^ {-\mu}} {x!} \qquad \qquad x = 0,1,2 ,
 $$

-For instance, the count of cars in road traffic approximately follows a Poisson distribution. Using an ordinary least squares model for Poisson distributed data would not work well because of two reasons:
- - count data cannot be negative
+For instance, the count of cars in road traffic approximately follows a Poisson distribution. Using an ordinary least squares model for Poisson distributed data would not work well for two reasons:
+ - count data cannot be negative
 - variance may not be constant

-Instead we can use a Poisson regression model, also known as log-linear model. Thereby the Poisson incident rate $$\mu$$ is
+Instead we can use a Poisson regression model, also known as a log-linear model, in which the Poisson incidence rate $$\mu$$ is
 modelled by a linear combination of unknown parameters.
-We can then use the PoissonNLLLoss which calculates the negative log likelihood for a target that follows a Poisson distribution.
+We can then use the PoissonNLLLoss which calculates the negative log likelihood for a target that follows a Poisson distribution.
$$ L = \text{pred} - \text{target} * \log(\text{pred}) +\log(\text{target!}) $$

diff --git a/docs/python_docs/python/tutorials/packages/gluon/training/fit_api_tutorial.md b/docs/python_docs/python/tutorials/packages/gluon/training/fit_api_tutorial.md
index 16f0ceac3a08..9e4cbe2f5114 100644
--- a/docs/python_docs/python/tutorials/packages/gluon/training/fit_api_tutorial.md
+++ b/docs/python_docs/python/tutorials/packages/gluon/training/fit_api_tutorial.md
@@ -17,17 +17,17 @@
 # MXNet Gluon Fit API

-In this tutorial, you will learn how to use the [Gluon Fit API](https://cwiki.apache.org/confluence/display/MXNET/Gluon+Fit+API+-+Tech+Design) which is the easiest way to train deep learning models using the [Gluon API](https://mxnet.apache.org/versions/master/gluon/index.html) in Apache MXNet.
+In this tutorial, you will learn how to use the [Gluon Fit API](https://cwiki.apache.org/confluence/display/MXNET/Gluon+Fit+API+-+Tech+Design), which is the easiest way to train deep learning models using the [Gluon API](/api/python/docs/tutorials/packages/gluon/index.html) in Apache MXNet.

 With the Fit API, you can train a deep learning model with a minimal amount of code. Just specify the network, loss function and the data you want to train on. You don't need to worry about the boilerplate code to loop through the dataset in batches (often called the 'training loop'). Advanced users can train with bespoke training loops, and many of these use cases will be covered by the Fit API.

-To demonstrate the Fit API, you will train an image classification model using the [ResNet-18](https://arxiv.org/abs/1512.03385) neural network architecture. The model will be trained using the [Fashion-MNIST dataset](https://research.zalando.com/welcome/mission/research-projects/fashion-mnist/).
+To demonstrate the Fit API, you will train an image classification model using the [ResNet-18](https://arxiv.org/abs/1512.03385) neural network architecture. The model will be trained using the [Fashion-MNIST dataset](https://research.zalando.com/welcome/mission/research-projects/fashion-mnist/).

 ## Prerequisites

 To complete this tutorial, you will need:

-- [MXNet](https://mxnet.apache.org/install/#overview) (The version of MXNet will be >= 1.5.0, you can use `pip install mxnet` to get 1.5.0 release pip package or build from source with master, refer to [MXNet installation](https://mxnet.apache.org/versions/master/install/index.html?platform=Linux&language=Python&processor=CPU)
+- [MXNet](/get_started) (you will need MXNet >= 1.5.0; you can use `pip install mxnet` to get the 1.5.0 release pip package, or build from source with master; refer to [MXNet installation](/get_started?version=master&platform=linux&language=python&environ=pip&processor=cpu))
 - [Jupyter Notebook](https://jupyter.org/index.html) (For interactively running the provided .ipynb file)

@@ -44,16 +44,16 @@ ctx = [mx.gpu(i) for i in range(gpu_count)] if gpu_count > 0 else mx.cpu()

 ## Dataset

-[Fashion-MNIST](https://research.zalando.com/welcome/mission/research-projects/fashion-mnist/) dataset consists of fashion items divided into ten categories: t-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag and ankle boot.
+[Fashion-MNIST](https://research.zalando.com/welcome/mission/research-projects/fashion-mnist/) dataset consists of fashion items divided into ten categories: t-shirt/top, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag and ankle boot.

-- It has 60,000 grayscale images of size 28 * 28 for training.
-- It has 10,000 grayscale images of size 28 * 28 for testing/validation. +- It has 60,000 grayscale images of size 28 * 28 for training. +- It has 10,000 grayscale images of size 28 * 28 for testing/validation. We will use the ```gluon.data.vision``` package to directly import the Fashion-MNIST dataset and perform pre-processing on it. ```python -# Get the training data +# Get the training data fashion_mnist_train = gluon.data.vision.FashionMNIST(train=True) # Get the validation data @@ -81,9 +81,9 @@ fashion_mnist_val = fashion_mnist_val.transform_first(transforms) batch_size = 256 # Batch size of the images num_workers = 4 # The number of parallel workers for loading the data using Data Loaders. -train_data_loader = gluon.data.DataLoader(fashion_mnist_train, batch_size=batch_size, +train_data_loader = gluon.data.DataLoader(fashion_mnist_train, batch_size=batch_size, shuffle=True, num_workers=num_workers) -val_data_loader = gluon.data.DataLoader(fashion_mnist_val, batch_size=batch_size, +val_data_loader = gluon.data.DataLoader(fashion_mnist_val, batch_size=batch_size, shuffle=False, num_workers=num_workers) ``` @@ -97,8 +97,8 @@ resnet_18_v1 = vision.resnet18_v1(pretrained=False, classes = 10) resnet_18_v1.initialize(init = mx.init.Xavier(), ctx=ctx) ``` -We will be using `SoftmaxCrossEntropyLoss` as the loss function since this is a multi-class classification problem. We will be using `sgd` (Stochastic Gradient Descent) as the optimizer. -You can experiment with a [different loss](https://mxnet.apache.org/versions/master/api/python/gluon/loss.html) or [optimizer](https://mxnet.apache.org/versions/master/api/python/optimization/optimization.html) as well. +We will be using `SoftmaxCrossEntropyLoss` as the loss function since this is a multi-class classification problem. We will be using `sgd` (Stochastic Gradient Descent) as the optimizer. +You can experiment with a [different loss](/api/python/docs/api/gluon/loss/index.html) or [optimizer](/api/python/docs/api/optimizer/index.html) as well. ```python @@ -111,7 +111,7 @@ Let's define the trainer object for training the model. ```python learning_rate = 0.04 # You can experiment with your own learning rate here num_epochs = 2 # You can run training for more epochs -trainer = gluon.Trainer(resnet_18_v1.collect_params(), +trainer = gluon.Trainer(resnet_18_v1.collect_params(), 'sgd', {'learning_rate': learning_rate}) ``` @@ -128,10 +128,10 @@ In the basic usage example, with just 2 lines of code, we will set up our model train_acc = mx.metric.Accuracy() # Metric to monitor # Define the estimator, by passing to it the model, loss function, metrics, trainer object and context -est = estimator.Estimator(net=resnet_18_v1, - loss=loss_fn, - metrics=train_acc, - trainer=trainer, +est = estimator.Estimator(net=resnet_18_v1, + loss=loss_fn, + metrics=train_acc, + trainer=trainer, context=ctx) # ignore warnings for nightly test on CI only @@ -146,9 +146,9 @@ with warnings.catch_warnings(): ```text Training begin: using optimizer SGD with current learning rate 0.0400 Train for 2 epochs. - + [Epoch 0] finished in 25.110s: train_accuracy : 0.7877 train_softmaxcrossentropyloss0 : 0.5905 - + [Epoch 1] finished in 23.595s: train_accuracy : 0.8823 train_softmaxcrossentropyloss0 : 0.3197 Train finished using total 48s at epoch 1. 
train_accuracy : 0.8823 train_softmaxcrossentropyloss0 : 0.3197
 ```

@@ -157,19 +157,19 @@ with warnings.catch_warnings():

 The Fit API is also customizable with several `Event Handlers` which give fine-grained control over the steps in training and expose callback methods that provide control over the stages involved in training. Available callback methods are: `train_begin`, `train_end`, `batch_begin`, `batch_end`, `epoch_begin` and `epoch_end`.

-You can use built-in event handlers such as `LoggingHandler`, `CheckpointHandler` or `EarlyStoppingHandler` to log and save the model at certain time-steps during training. You can also stop the training when the model's performance plateaus.
-There are also some default utility handlers that will be added to your estimator by default. For example, `StoppingHandler` is used to control when the training ends, based on number of epochs or number of batches trained.
-`MetricHandler` is used to calculate training metrics at end of each batch and epoch.
+You can use built-in event handlers such as `LoggingHandler`, `CheckpointHandler` or `EarlyStoppingHandler` to log and save the model at certain time-steps during training. You can also stop the training when the model's performance plateaus.
+There are also some default utility handlers that will be added to your estimator by default. For example, `StoppingHandler` is used to control when the training ends, based on number of epochs or number of batches trained.
+`MetricHandler` is used to calculate training metrics at the end of each batch and epoch.
 `ValidationHandler` is used to validate your model on test data at each epoch's end and then calculate validation metrics. You can create these utility handlers with different configurations and pass them to the estimator. This will override the default handler configuration.
-You can create a custom handler by inheriting one or multiple
+You can create a custom handler by inheriting one or multiple
 [base event handlers](https://github.com/apache/incubator-mxnet/blob/master/python/mxnet/gluon/contrib/estimator/event_handler.py#L32) including: `TrainBegin`, `TrainEnd`, `EpochBegin`, `EpochEnd`, `BatchBegin`, `BatchEnd`.

 ### Custom Event Handler

-Here we will showcase an example custom event handler the inherits features from a few base handler classes.
+Here we will showcase an example custom event handler that inherits features from a few base handler classes.
 Our custom event handler is a simple one: record the loss values at the end of every epoch in our training phase.
 Note: For each of these methods, the `Estimator` object is passed along, so you can access training metrics.
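The handler's full body sits between the hunks below. As a sketch of what such a handler can look like: the `TrainBegin`/`TrainEnd`/`EpochEnd` base classes, the import path and the printed messages follow this tutorial and the file linked above, while the `estimator.train_metrics` attribute and the `metric.get()` call are assumptions about the Estimator's interface.

```python
from mxnet.gluon.contrib.estimator.event_handler import TrainBegin, TrainEnd, EpochEnd

class LossRecordHandler(TrainBegin, TrainEnd, EpochEnd):
    def __init__(self):
        super(LossRecordHandler, self).__init__()
        self.loss_history = {}

    def train_begin(self, estimator, *args, **kwargs):
        print("Training begin")

    def train_end(self, estimator, *args, **kwargs):
        # Print the loss recorded for every epoch once training has finished
        for loss_name, losses in self.loss_history.items():
            for epoch, loss in enumerate(losses, start=1):
                print("Epoch {}, loss {:.4f}".format(epoch, loss))

    def epoch_end(self, estimator, *args, **kwargs):
        # Record the training loss reported by the estimator at the end of each epoch
        for metric in estimator.train_metrics:   # assumption: a list of metric objects
            name, value = metric.get()
            if "loss" in name:
                self.loss_history.setdefault(name, []).append(value)
```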
@@ -205,7 +205,7 @@ class LossRecordHandler(TrainBegin, TrainEnd, EpochEnd): # Let's reset the model, trainer and accuracy objects from above resnet_18_v1.initialize(force_reinit=True, init = mx.init.Xavier(), ctx=ctx) -trainer = gluon.Trainer(resnet_18_v1.collect_params(), +trainer = gluon.Trainer(resnet_18_v1.collect_params(), 'sgd', {'learning_rate': learning_rate}) train_acc = mx.metric.Accuracy() ``` @@ -216,7 +216,7 @@ train_acc = mx.metric.Accuracy() est = estimator.Estimator(net=resnet_18_v1, loss=loss_fn, metrics=train_acc, - trainer=trainer, + trainer=trainer, context=ctx) # Define the handlers, let's say in built Checkpointhandler @@ -224,7 +224,7 @@ checkpoint_handler = CheckpointHandler(model_dir='./', model_prefix='my_model', monitor=train_acc, # Monitors a metric save_best=True) # Save the best model in terms of -# Let's instantiate another handler which we defined above +# Let's instantiate another handler which we defined above loss_record_handler = LossRecordHandler() # ignore warnings for nightly test on CI only import warnings @@ -240,11 +240,11 @@ with warnings.catch_warnings(): ```text Training begin: using optimizer SGD with current learning rate 0.0400 Train for 2 epochs. - + [Epoch 0] finished in 25.236s: train_accuracy : 0.7917 train_softmaxcrossentropyloss0 : 0.5741 val_accuracy : 0.6612 val_softmaxcrossentropyloss0 : 0.8627 - + [Epoch 1] finished in 24.892s: train_accuracy : 0.8826 train_softmaxcrossentropyloss0 : 0.3229 val_accuracy : 0.8474 val_softmaxcrossentropyloss0 : 0.4262 - + Train finished using total 50s at epoch 1. train_accuracy : 0.8826 train_softmaxcrossentropyloss0 : 0.3229 val_accuracy : 0.8474 val_softmaxcrossentropyloss0 : 0.4262 Training begin @@ -252,7 +252,7 @@ with warnings.catch_warnings(): Epoch 2, loss 0.3229 ``` -You can load the saved model, by using the `load_parameters` API in Gluon. For more details refer to the [Loading model parameters from file tutorial](save_load_params.html#saving-model-parameters-to-file) +You can load the saved model, by using the `load_parameters` API in Gluon. For more details refer to the [Loading model parameters from file tutorial](../blocks/save_load_params.html#saving-model-parameters-to-file) ```python @@ -260,10 +260,6 @@ resnet_18_v1 = vision.resnet18_v1(pretrained=False, classes=10) resnet_18_v1.load_parameters('./my_model-best.params', ctx=ctx) ``` -## Summary - -- To learn more about deep learning with MXNeT, see [Dive Into Deep Learning](https://gluon.io) - -## Next Steps +## Next Steps -- For more hands on learning about deep learning, check out [Dive into Deep Learning](https://d2l.ai) \ No newline at end of file +- For more hands on learning about deep learning, check out [Dive into Deep Learning](https://d2l.ai) diff --git a/docs/python_docs/python/tutorials/packages/gluon/training/learning_rates/learning_rate_finder.md b/docs/python_docs/python/tutorials/packages/gluon/training/learning_rates/learning_rate_finder.md index 4fc8b780a2ef..a32c8a1e92cd 100644 --- a/docs/python_docs/python/tutorials/packages/gluon/training/learning_rates/learning_rate_finder.md +++ b/docs/python_docs/python/tutorials/packages/gluon/training/learning_rates/learning_rate_finder.md @@ -20,7 +20,7 @@ Setting the learning rate for stochastic gradient descent (SGD) is crucially important when training neural network because it controls both the speed of convergence and the ultimate performance of the network. 
Set the learning too low and you could be twiddling your thumbs for quite some time as the parameters update very slowly. Set it too high and the updates will skip over optimal solutions, or worse the optimizer might not converge at all! -Leslie Smith from the U.S. Naval Research Laboratory presented a method for finding a good learning rate in a paper called ["Cyclical Learning Rates for Training Neural Networks"](https://arxiv.org/abs/1506.01186). We implement this method in MXNet (with the Gluon API) and create a 'Learning Rate Finder' which you can use while training your own networks. We take a look at the central idea of the paper, cyclical learning rate schedules, in the ['Advanced Learning Rate Schedules'](https://mxnet.apache.org/tutorials/gluon/learning_rate_schedules_advanced.html) tutorial. +Leslie Smith from the U.S. Naval Research Laboratory presented a method for finding a good learning rate in a paper called ["Cyclical Learning Rates for Training Neural Networks"](https://arxiv.org/abs/1506.01186). We implement this method in MXNet (with the Gluon API) and create a 'Learning Rate Finder' which you can use while training your own networks. We take a look at the central idea of the paper, cyclical learning rate schedules, in the ['Advanced Learning Rate Schedules'](/api/python/docs/tutorials/packages/gluon/training/learning_rates/learning_rate_schedules_advanced.html) tutorial. ## Simple Idea @@ -63,7 +63,7 @@ class Learner(): self.net.initialize(mx.init.Xavier(), ctx=self.ctx) self.loss_fn = mx.gluon.loss.SoftmaxCrossEntropyLoss() self.trainer = mx.gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': .001}) - + def iteration(self, lr=None, take_step=True): """ :param lr: learning rate to use for iteration (float) @@ -81,9 +81,9 @@ class Learner(): with mx.autograd.record(): output = self.net(data) loss = self.loss_fn(output, label) - loss.backward() + loss.backward() # Update parameters - if take_step: self.trainer.step(data.shape[0]) + if take_step: self.trainer.step(data.shape[0]) # Set and return loss. 
self.iteration_loss = mx.nd.mean(loss).asscalar() return self.iteration_loss @@ -102,7 +102,7 @@ from mxnet.gluon.data.vision import transforms transform = transforms.Compose([ # Switches HWC to CHW, and converts to `float32` transforms.ToTensor(), - # Channel-wise, using pre-computed means and stds + # Channel-wise, using pre-computed means and stds transforms.Normalize(mean=[0.4914, 0.4822, 0.4465], std=[0.2023, 0.1994, 0.2010]) ]) @@ -143,7 +143,7 @@ class LRFinder(): and save and load parameters of the network (Learner) """ self.learner = learner - + def find(self, lr_start=1e-6, lr_multiplier=1.1, smoothing=0.3): """ :param lr_start: learning rate to start search (float) @@ -175,7 +175,7 @@ class LRFinder(): self.learner.net.load_parameters("lr_finder.params", ctx=self.learner.ctx) self.learner.trainer.load_states("lr_finder.state") return self.results - + def plot(self): lrs = [e[0] for e in self.results] losses = [e[1] for e in self.results] @@ -209,7 +209,7 @@ class LRFinderStoppingCriteria(): self.first_loss = None self.running_mean = None self.counter = 0 - + def __call__(self, loss): """ :param loss: from single iteration (float) @@ -327,6 +327,6 @@ Although we get quite similar results to when we set the learning rate at 0.05 ( ## Wrap Up -Give Learning Rate Finder a try on your current projects, and experiment with the different learning rate schedules found in the tutorials [here](https://mxnet.apache.org/tutorials/gluon/learning_rate_schedules.html) and [here](https://mxnet.apache.org/tutorials/gluon/learning_rate_schedules_advanced.html). +Give Learning Rate Finder a try on your current projects, and experiment with the different learning rate schedules found in the [basic learning rate tutorial](/api/python/docs/tutorials/packages/gluon/training/learning_rates/learning_rate_schedules.html) and the [advanced learning rate tutorial](/api/python/docs/tutorials/packages/gluon/training/learning_rates/learning_rate_schedules_advanced.html). diff --git a/docs/python_docs/python/tutorials/packages/gluon/training/learning_rates/learning_rate_schedules_advanced.md b/docs/python_docs/python/tutorials/packages/gluon/training/learning_rates/learning_rate_schedules_advanced.md index 287f70d535e8..c59c9515f02e 100644 --- a/docs/python_docs/python/tutorials/packages/gluon/training/learning_rates/learning_rate_schedules_advanced.md +++ b/docs/python_docs/python/tutorials/packages/gluon/training/learning_rates/learning_rate_schedules_advanced.md @@ -16,7 +16,7 @@ - # Advanced Learning Rate Schedules +# Advanced Learning Rate Schedules Given the importance of learning rate and the learning rate schedule for training neural networks, there have been a number of research papers published recently on the subject. Although many practitioners are using simple learning rate schedules such as stepwise decay, research has shown that there are other strategies that work better in most situations. We implement a number of different schedule shapes in this tutorial and introduce cyclical schedules. 
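All of the schedule classes in this tutorial are plain Python callables that map the optimizer's update count to a learning rate, so they can be wired into training in the same way as the built-in schedules covered in the basic tutorial referenced at the end of this page. A minimal sketch of that wiring; the stand-in schedule and the single `Dense` layer below are placeholders, not part of this tutorial.

```python
import mxnet as mx
from mxnet import gluon

class HalveAfter:
    # Placeholder schedule: any callable mapping an update count to a learning rate works here
    def __init__(self, base_lr=0.1, step=500):
        self.base_lr = base_lr
        self.step = step

    def __call__(self, iteration):
        # Return the learning rate to use for this update
        return self.base_lr if iteration <= self.step else self.base_lr / 2

# Placeholder network, just so there are parameters to train
net = gluon.nn.Dense(10)
net.initialize()

# The optimizer queries the schedule with its running update count on every step
optimizer = mx.optimizer.SGD(lr_scheduler=HalveAfter())
trainer = gluon.Trainer(params=net.collect_params(), optimizer=optimizer)
```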
@@ -56,7 +56,7 @@ One adjustment proposed by [Jeremy Howard, Sebastian Ruder (2018)](https://arxiv ```python class TriangularSchedule(): - def __init__(self, min_lr, max_lr, cycle_length, inc_fraction=0.5): + def __init__(self, min_lr, max_lr, cycle_length, inc_fraction=0.5): """ min_lr: lower bound for learning rate (float) max_lr: upper bound for learning rate (float) @@ -67,7 +67,7 @@ class TriangularSchedule(): self.max_lr = max_lr self.cycle_length = cycle_length self.inc_fraction = inc_fraction - + def __call__(self, iteration): if iteration <= self.cycle_length*self.inc_fraction: unit_cycle = iteration * 1 / (self.cycle_length * self.inc_fraction) @@ -107,7 +107,7 @@ class CosineAnnealingSchedule(): self.min_lr = min_lr self.max_lr = max_lr self.cycle_length = cycle_length - + def __call__(self, iteration): if iteration <= self.cycle_length: unit_cycle = (1 + math.cos(iteration * math.pi / self.cycle_length)) / 2 @@ -153,7 +153,7 @@ class LinearWarmUp(): # calling mx.lr_scheduler.LRScheduler effects state, so calling a copy self.finish_lr = copy.copy(schedule)(0) self.length = length - + def __call__(self, iteration): if iteration <= self.length: return iteration * (self.finish_lr - self.start_lr)/(self.length) + self.start_lr @@ -196,7 +196,7 @@ class LinearCoolDown(): self.start_idx = start_idx self.finish_idx = start_idx + length self.length = length - + def __call__(self, iteration): if iteration <= self.start_idx: return self.schedule(iteration) @@ -239,11 +239,11 @@ class OneCycleSchedule(): raise ValueError("Must specify finish_lr when using cooldown_length > 0.") if (cooldown_length == 0) and (finish_lr is not None): raise ValueError("Must specify cooldown_length > 0 when using finish_lr.") - + finish_lr = finish_lr if (cooldown_length > 0) else start_lr schedule = TriangularSchedule(min_lr=start_lr, max_lr=max_lr, cycle_length=cycle_length) self.schedule = LinearCoolDown(schedule, finish_lr=finish_lr, start_idx=cycle_length, length=cooldown_length) - + def __call__(self, iteration): return self.schedule(iteration) ``` @@ -280,7 +280,7 @@ class CyclicalSchedule(): self.length_decay = cycle_length_decay self.magnitude_decay = cycle_magnitude_decay self.kwargs = kwargs - + def __call__(self, iteration): cycle_idx = 0 cycle_length = self.length @@ -290,7 +290,7 @@ class CyclicalSchedule(): cycle_idx += 1 idx += cycle_length cycle_offset = iteration - idx + cycle_length - + schedule = self.schedule_class(cycle_length=cycle_length, **self.kwargs) return schedule(cycle_offset) * self.magnitude_decay**cycle_idx ``` @@ -322,4 +322,4 @@ plot_schedule(schedule) **_Want to learn more?_** Checkout the "Learning Rate Schedules" tutorial for a more basic overview of learning rates found in `mx.lr_scheduler`, and an example of how to use them while training your own models. - \ No newline at end of file + diff --git a/docs/python_docs/python/tutorials/packages/gluon/training/trainer.md b/docs/python_docs/python/tutorials/packages/gluon/training/trainer.md index e475fda8bfc5..8f140894b2f3 100644 --- a/docs/python_docs/python/tutorials/packages/gluon/training/trainer.md +++ b/docs/python_docs/python/tutorials/packages/gluon/training/trainer.md @@ -21,9 +21,9 @@ Training a neural network model consists of iteratively performing three simple The first step is the forward step which computes the loss. 
In MXNet Gluon, this first step is achieved by doing a forward pass by calling `net.forward(X)` or simply `net(X)` and then calling the loss function with the result of the forward pass and the labels. For example `l = loss_fn(net(X), y)`. -The second step is the backward step which computes the gradient of the loss with respect to the parameters. In Gluon, this step is achieved by doing the first step in an [`autograd.record()`](https:///beta.mxnet.io/guide/packages/autograd/autograd.html) scope to record the computations needed to calculate the loss, and then calling `l.backward()` to compute the gradient of the loss with respect to the parameters. +The second step is the backward step which computes the gradient of the loss with respect to the parameters. In Gluon, this step is achieved by doing the first step in an [autograd.record()](/api/python/docs/api/autograd/index.html) scope to record the computations needed to calculate the loss, and then calling `l.backward()` to compute the gradient of the loss with respect to the parameters. -The final step is to update the neural network model parameters using an optimization algorithm. In Gluon, this step is performed by the [`gluon.Trainer`](https:///beta.mxnet.io/api/gluon/mxnet.gluon.Trainer.html) and is the subject of this guide. When creating a Gluon `Trainer` you must provide a collection of parameters that need to be learnt. You also provide an `Optimizer` that will be used to update the parameters every training iteration when `trainer.step` is called. +The final step is to update the neural network model parameters using an optimization algorithm. In Gluon, this step is performed by the [gluon.Trainer](/api/python/docs/api/gluon/trainer.html) and is the subject of this guide. When creating a Gluon `Trainer` you must provide a collection of parameters that need to be learnt. You also provide an `Optimizer` that will be used to update the parameters every training iteration when `trainer.step` is called. ## Basic Usage @@ -97,7 +97,7 @@ print(curr_weight - net.weight.grad() * 1 / batch_size) In the previous example, we use the string argument `sgd` to select the optimization method, and `optimizer_params` to specify the optimization method arguments. -All pre-defined optimization methods can be passed in this way and the complete list of implemented optimizers is provided in the [`mxnet.optimizer`](https:///beta.mxnet.io/api/gluon-related/mxnet.optimizer.html) module. +All pre-defined optimization methods can be passed in this way and the complete list of implemented optimizers is provided in the [`mxnet.optimizer`](/api/python/docs/api/optimizer/index.html) module. However we can also pass an optimizer instance directly to the `Trainer` constructor. @@ -114,14 +114,14 @@ trainer.step(batch_size) net.weight.data() ``` -For reference and implementation details about each optimizer, please refer to the [guide](https:///beta.mxnet.io/guide/packages/optimizer/optimizer.html) for the `optimizer` module. +For reference and implementation details about each optimizer, please refer to the [guide](/api/python/docs/api/optimizer/index.html) for the `optimizer` module. ### KVStore Options The `Trainer` constructor also accepts the following keyword arguments for : -- `kvstore` – how key value store should be created for multi-gpu and distributed training. Check out [`mxnet.kvstore.KVStore`](https:///beta.mxnet.io/api/gluon-related/mxnet.kvstore.KVStore.html#mxnet.kvstore.KVStore) for more information. 
String options are any of the following ['local', 'device', 'dist_device_sync', 'dist_device_async']. -- `compression_params` – Specifies type of gradient compression and additional arguments depending on the type of compression being used. See [`mxnet.KVStore.set_gradient_compression_method`](https:///beta.mxnet.io/api/gluon-related/_autogen/mxnet.kvstore.KVStore.set_gradient_compression.html#mxnet.kvstore.KVStore.set_gradient_compression) for more details on gradient compression. +- `kvstore` – how key value store should be created for multi-gpu and distributed training. Check out [`mxnet.kvstore.KVStore`](/api/python/docs/api/kvstore/index.html) for more information. String options are any of the following ['local', 'device', 'dist_device_sync', 'dist_device_async']. +- `compression_params` – Specifies type of gradient compression and additional arguments depending on the type of compression being used. See [`mxnet.KVStore.set_gradient_compression_method`](/api/python/docs/api/kvstore/index.html#mxnet.kvstore.KVStore.set_gradient_compression) for more details on gradient compression. - `update_on_kvstore` – Whether to perform parameter updates on KVStore. If None, then the `Trainer` instance will choose the more suitable option depending on the type of KVStore. ### Changing the Learning Rate @@ -143,7 +143,7 @@ trainer.learning_rate -In addition, there are multiple pre-defined learning rate scheduling methods that are already implemented in the [`mxnet.lr_scheduler`](https://mxnet.apache.org/api/python/docs/api/gluon-related/mxnet.lr_scheduler.html) module. The learning rate schedulers can be incorporated into your trainer by passing them in as an `optimizer_param` entry. Please refer to the [LR scheduler guide](https://mxnet.apache.org/versions/master/tutorials/gluon/learning_rate_schedules.html) to learn more. +In addition, there are multiple pre-defined learning rate scheduling methods that are already implemented in the [mxnet.lr_scheduler](/api/python/docs/api/lr_scheduler/index.html) module. The learning rate schedulers can be incorporated into your trainer by passing them in as an `optimizer_param` entry. Please refer to the [LR scheduler guide](/api/python/docs/tutorials/packages/gluon/training/learning_rates/learning_rate_schedules.html) to learn more. @@ -160,9 +160,9 @@ In addition, there are multiple pre-defined learning rate scheduling methods tha While optimization and optimizers play a significant role in deep learning model training, there are still other important components to model training. Here are a few suggestions about where to look next. -* The [Optimizer API](https://mxnet.apache.org/api/python/docs/api/gluon-related/mxnet.optimizer.html) and [guide](https://mxnet.apache.org/api/python/docs/tutorials/packages/optimizer/optimizer.html) have information about all the different optimizers implemented in MXNet and their update steps. The [Dive into Deep Learning](https://en.diveintodeeplearning.org/chapter_optimization/index.html) book also has a chapter dedicated to optimization methods and explains various key optimizers in great detail. +* The [Optimizer API](/api/python/docs/api/optimizer/index.html) and [optimizer guide](/api/python/docs/tutorials/packages/optimizer/index.html) have information about all the different optimizers implemented in MXNet and their update steps. 
The [Dive into Deep Learning](https://en.diveintodeeplearning.org/chapter_optimization/index.html) book also has a chapter dedicated to optimization methods and explains various key optimizers in great detail. -- Take a look at the [guide to parameter initialization](https://mxnet.apache.org/api/python/docs/tutorials/packages/gluon/init.html) in MXNet to learn about what initialization schemes are already implemented, and how to implement your custom initialization schemes. -- Also check out this [guide on parameter management](https://mxnet.apache.org/api/python/docs/tutorials/packages/gluon/parameters.html) to learn about how to manage model parameters in gluon. -- Make sure to take a look at the [guide to scheduling learning rates](https://mxnet.apache.org/versions/master/tutorials/gluon/learning_rate_schedules.html) to learn how to create learning rate schedules to make your training converge faster. -- Finally take a look at the [KVStore API](https://mxnet.apache.org/api/python/docs/api/gluon-related/mxnet.kvstore.KVStore.html#mxnet.kvstore.KVStore) to learn how parameter values are synchronized over multiple devices. +- Take a look at the [guide to parameter initialization](/api/python/docs/tutorials/packages/gluon/blocks/init.html) in MXNet to learn about what initialization schemes are already implemented, and how to implement your custom initialization schemes. +- Also check out this [guide on parameter management](/api/python/docs/tutorials/packages/gluon/blocks/parameters.html) to learn about how to manage model parameters in gluon. +- Make sure to take a look at the [guide to scheduling learning rates](/api/python/docs/tutorials/packages/gluon/training/learning_rates/learning_rate_schedules.html) to learn how to create learning rate schedules to make your training converge faster. +- Finally take a look at the [KVStore API](/api/python/docs/api/kvstore/index.html) to learn how parameter values are synchronized over multiple devices. diff --git a/docs/python_docs/python/tutorials/packages/kvstore/index.rst b/docs/python_docs/python/tutorials/packages/kvstore/index.rst index d3d762d91ea1..f03c308e6a37 100644 --- a/docs/python_docs/python/tutorials/packages/kvstore/index.rst +++ b/docs/python_docs/python/tutorials/packages/kvstore/index.rst @@ -26,16 +26,11 @@ KVStore How to use the KVStore API to use multiple GPUs when training a model. - .. card:: - :title: Distributed Key-Value Store - :link: https://mxnet.apache.org/versions/master/tutorials/python/kvstore.html - - How to use the KVStore API to share data across different devices. References ----------------- -- `KVStore API. <../api/gluon-related/mxnet.kvstore.html>`_ +- `KVStore API. `_ .. toctree:: :hidden: diff --git a/docs/python_docs/python/tutorials/packages/kvstore/kvstore.md b/docs/python_docs/python/tutorials/packages/kvstore/kvstore.md index 5b5dc94a7c2f..c03a03d1080e 100644 --- a/docs/python_docs/python/tutorials/packages/kvstore/kvstore.md +++ b/docs/python_docs/python/tutorials/packages/kvstore/kvstore.md @@ -108,7 +108,7 @@ print(b[1].asnumpy()) ## Handle a List of Key-Value Pairs All operations introduced so far involve a single key. KVStore also provides -an interface for a list of key-value pairs. +an interface for a list of key-value pairs. For a single device: @@ -166,6 +166,6 @@ When the distributed version is ready, we will update this section. 
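The code for the list interface itself is outside the hunks above. A sketch of what the single-device usage looks like; the key list, the value shape and the values pushed below are illustrative assumptions, not taken from the tutorial.

```python
import mxnet as mx

kv = mx.kv.create('local')   # single-device KVStore
shape = (2, 3)               # assumed shape for the values
keys = [5, 7, 9]             # assumed list of keys

# init, push and pull accept a list of keys with a matching list of values
kv.init(keys, [mx.nd.ones(shape) for _ in keys])
kv.push(keys, [mx.nd.ones(shape) * 2 for _ in keys])

out = [mx.nd.zeros(shape) for _ in keys]
kv.pull(keys, out=out)
print(out[1].asnumpy())
```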
## Next Steps -* [MXNet tutorials index](https://mxnet.io/api) +* [MXNet tutorials index](/api/python/docs/tutorials/) diff --git a/docs/python_docs/python/tutorials/packages/ndarray/01-ndarray-intro.md b/docs/python_docs/python/tutorials/packages/ndarray/01-ndarray-intro.md index a8c8056614ab..6bc373e356e8 100644 --- a/docs/python_docs/python/tutorials/packages/ndarray/01-ndarray-intro.md +++ b/docs/python_docs/python/tutorials/packages/ndarray/01-ndarray-intro.md @@ -23,10 +23,10 @@ will introduce you to how data is handled with MXNet. You will learn the basics about MXNet's multi-dimensional array format, `ndarray`. This content was extracted and simplified from the gluon tutorials in -[Dive Into Deep Learning](http://gluon.io/). +[Dive Into Deep Learning](https://d2l.ai/). ## Prerequisites -* [MXNet installed in a Python environment](../../../install/index.html?language=Python). +* [MXNet installed in a Python environment](/get_started?version=master&platform=linux&language=python&environ=pip&processor=cpu). * Python 2.7.x or Python 3.x @@ -78,9 +78,7 @@ print(x) values of any of its entries. This means that the entries can have any form of values, including very big ones! Typically, we'll want our matrices initialized and very often we want a matrix of all zeros, so we can use the `.zeros` -function. If you're feeling experimental, try one of the several [array creation -functions](https://mxnet.apache.org/api/{.python -.input}/ndarray/ndarray.html#array-creation-routines). +function. -```{.python .input} +```python import mxnet as mx from mxnet import nd ``` -```{.python .input} +```python x = nd.ones((3, 4)) y = nd.random_normal(0, 1, shape=(3, 4)) print('x=', x) @@ -51,7 +51,7 @@ print('x = x + y, x=', x) Multiplication: -```{.python .input} +```python x = nd.array([1, 2, 3]) y = nd.array([2, 2, 2]) x * y @@ -61,7 +61,7 @@ And exponentiation: -```{.python .input} +```python nd.exp(x) ``` @@ -69,13 +69,10 @@ We can also grab a matrix's transpose to compute a proper matrix-matrix product. -```{.python .input} +```python nd.dot(x, y.T) ``` -We'll explain these operations and present even more operators in the [linear -algebra](P01-C03-linear-algebra.ipynb) chapter. But for now, we'll stick with -the mechanics of working with NDArrays. ## In-place operations @@ -96,7 +93,7 @@ detail, and quite possibily in its own notebook since I think it will help to show some gotchas like you mentioned verbally. I am still leaning toward delaying the introduction of this topic....--> -```{.python .input} +```python print('y=', y) print('id(y):', id(y)) y = y + x @@ -107,7 +104,7 @@ print('id(y):', id(y)) We can assign the result to a previously allocated array with slice notation, e.g., `result[:] = ...`. -```{.python .input} +```python print('x=', x) z = nd.zeros_like(x) print('z is zeros_like x, z=', z) @@ -123,7 +120,7 @@ before copying it to z. To make better use of memory, we can perform operations in place, avoiding temporary buffers. To do this we specify the `out` keyword argument every operator supports: -```{.python .input} +```python print('x=', x, 'is in id(x):', id(x)) print('y=', y, 'is in id(y):', id(y)) print('z=', z, 'is in id(z):', id(z)) @@ -139,7 +136,7 @@ itself. There are two ways to do this in MXNet. = x op y 2. By using the op-equals operators like `+=` -```{.python .input} +```python print('x=', x, 'is in id(x):', id(x)) x += y print('x=', x, 'is in id(x):', id(x)) @@ -158,7 +155,7 @@ the whole array: a[:] Here's an example of reading the second and third rows from `x`. 
-```{.python .input} +```python x = nd.array([1, 2, 3]) print('1D complete array, x=', x) s = x[1:3] @@ -171,7 +168,7 @@ print('slicing the 2nd and 3rd elements, s=', s) Now let's try writing to a specific element. -```{.python .input} +```python print('original x, x=', x) x[2] = 9.0 print('replaced entire row with x[2] = 9.0, x=', x) @@ -183,7 +180,7 @@ print('replaced range of elements with x[1:2,1:3] = 5.0, x=', x) Multi-dimensional slicing is also supported. -```{.python .input} +```python x = nd.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]) print('original x, x=', x) s = x[1:2,1:3] @@ -217,7 +214,7 @@ a shape like (3,3) you lose some of the impact and miss some errors if people play with the values. Better to have a distinct shape so that it is more obvious what is happening and what can break.--> -```{.python .input} +```python x = nd.ones(shape=(3,6)) print('x = ', x) y = nd.arange(6) @@ -233,7 +230,7 @@ That's because broadcasting prefers to duplicate along the left most axis. We can alter this behavior by explicitly giving `y` a $2$D shape using `.reshape`. You can also chain `.arange` and `.reshape` to do this in one step. -```{.python .input} +```python y = y.reshape((3,1)) print('y = ', y) print('x + y = ', x+y) @@ -245,12 +242,12 @@ print('y = ', y) Converting MXNet NDArrays to and from NumPy is easy. The converted arrays do not share memory. -```{.python .input} +```python a = x.asnumpy() type(a) ``` -```{.python .input} +```python y = nd.array(a) print('id(a)=', id(a), 'id(x)=', id(x), 'id(y)=', id(y)) ``` diff --git a/docs/python_docs/python/tutorials/packages/ndarray/03-ndarray-contexts.md b/docs/python_docs/python/tutorials/packages/ndarray/03-ndarray-contexts.md index f6e365974d64..006d77c03263 100644 --- a/docs/python_docs/python/tutorials/packages/ndarray/03-ndarray-contexts.md +++ b/docs/python_docs/python/tutorials/packages/ndarray/03-ndarray-contexts.md @@ -21,10 +21,10 @@ This guide will introduce you to managing CPU versus GPU contexts for handling data. This content was extracted and simplified from the gluon tutorials in -[Dive Into Deep Learning](http://gluon.io/). +[Dive Into Deep Learning](https://d2l.ai/). ## Prerequisites -* [MXNet installed (with GPU support) in a Python environment](../../../install/index.html?language=Python). +* [MXNet installed (with GPU support) in a Python environment](/get_started). * Python 2.7.x or Python 3.x * **One or more GPUs** diff --git a/docs/python_docs/python/tutorials/packages/ndarray/gotchas_numpy_in_mxnet.md b/docs/python_docs/python/tutorials/packages/ndarray/gotchas_numpy_in_mxnet.md index 69ec2bf0afbf..c1e49cd75bbb 100644 --- a/docs/python_docs/python/tutorials/packages/ndarray/gotchas_numpy_in_mxnet.md +++ b/docs/python_docs/python/tutorials/packages/ndarray/gotchas_numpy_in_mxnet.md @@ -22,29 +22,29 @@ The goal of this tutorial is to explain some common misconceptions about using [ ## Asynchronous and non-blocking nature of Apache MXNet -Instead of using NumPy arrays Apache MXNet offers its own array implementation named [NDArray](https://mxnet.apache.org/api/python/ndarray/ndarray.html). `NDArray API` was intentionally designed to be similar to `NumPy`, but there are differences. +Instead of using NumPy arrays Apache MXNet offers its own array implementation named [NDArray](/api/python/docs/api/ndarray/index.html). `NDArray API` was intentionally designed to be similar to `NumPy`, but there are differences. -One key difference is in the way calculations are executed. 
Every `NDArray` manipulation in Apache MXNet is done in asynchronous, non-blocking way. That means, that when we write code like `c = a * b`, where both `a` and `b` are `NDArrays`, the function is pushed to the [Execution Engine](https://mxnet.apache.org/architecture/overview.html#execution-engine), which starts the calculation. The function immediately returns back, and the user thread can continue execution, despite the fact that the calculation may not have been completed yet. +One key difference is in the way calculations are executed. Every `NDArray` manipulation in Apache MXNet is done in asynchronous, non-blocking way. That means, that when we write code like `c = a * b`, where both `a` and `b` are `NDArrays`, the function is pushed to the [Execution Engine](/api/architecture/overview.html#execution-engine), which starts the calculation. The function immediately returns back, and the user thread can continue execution, despite the fact that the calculation may not have been completed yet. -`Execution Engine` builds the computation graph which may reorder or combine some calculations, but it honors dependency order: if there are other manipulation with `c` done later in the code, the `Execution Engine` will start doing them once the result of `c` is available. We don't need to write callbacks to start execution of subsequent code - the `Execution Engine` is going to do it for us. +`Execution Engine` builds the computation graph which may reorder or combine some calculations, but it honors dependency order: if there are other manipulation with `c` done later in the code, the `Execution Engine` will start doing them once the result of `c` is available. We don't need to write callbacks to start execution of subsequent code - the `Execution Engine` is going to do it for us. -To get the result of the computation we only need to access the resulting variable, and the flow of the code will be blocked until the computation results are assigned to the resulting variable. This behavior allows to increase code performance while still supporting imperative programming mode. +To get the result of the computation we only need to access the resulting variable, and the flow of the code will be blocked until the computation results are assigned to the resulting variable. This behavior allows to increase code performance while still supporting imperative programming mode. -Refer to the [intro tutorial to NDArray](https://mxnet.apache.org/tutorials/basic/ndarray.html), if you are new to Apache MXNet and would like to learn more how to manipulate NDArrays. +Refer to the [intro tutorial to NDArray](/api/python/docs/tutorials/packages/ndarray/index.html), if you are new to Apache MXNet and would like to learn more how to manipulate NDArrays. ## Converting NDArray to NumPy Array blocks calculation -Many people are familiar with NumPy and flexible doing tensor manipulations using it. `NDArray API` offers a convinient [.asnumpy() method](https://mxnet.apache.org/api/python/ndarray/ndarray.html#mxnet.ndarray.NDArray.asnumpy) to cast `nd.array` to `np.array`. However, by doing this cast and using `np.array` for calculation, we cannot use all the goodness of `Execution Engine`. All manipulations done on `np.array` are blocking. 
Moreover, the cast to `np.array` itself is a blocking operation (same as [.asscalar()](https://mxnet.apache.org/api/python/ndarray/ndarray.html#mxnet.ndarray.NDArray.asscalar), [.wait_to_read()](https://mxnet.apache.org/api/python/ndarray/ndarray.html#mxnet.ndarray.NDArray.wait_to_read) and [.waitall()](https://mxnet.apache.org/api/python/ndarray/ndarray.html#mxnet.ndarray.waitall)).
+Many people are familiar with NumPy and comfortable doing tensor manipulations with it. `NDArray API` offers a convenient [.asnumpy() method](/api/python/docs/api/ndarray/ndarray.html#mxnet.ndarray.NDArray.asnumpy) to cast `nd.array` to `np.array`. However, by doing this cast and using `np.array` for calculation, we cannot use all the goodness of `Execution Engine`. All manipulations done on `np.array` are blocking. Moreover, the cast to `np.array` itself is a blocking operation (same as [.asscalar()](/api/python/docs/api/ndarray/ndarray.html#mxnet.ndarray.NDArray.asscalar), [.wait_to_read()](/api/python/docs/api/ndarray/ndarray.html#mxnet.ndarray.NDArray.wait_to_read) and [.waitall()](/api/python/docs/api/ndarray/ndarray.html#mxnet.ndarray.waitall)).

 That means that if we have a long computation graph and, at some point, we want to cast the result to `np.array`, it may feel like the casting takes a lot of time. But what really takes this time is `Execution Engine`, which finishes all the async calculations we have pushed into it to get the final result, which then will be converted to `np.array`.

-Because of the blocking nature of [.asnumpy() method](https://mxnet.apache.org/api/python/ndarray/ndarray.html#mxnet.ndarray.NDArray.asnumpy), using it reduces the execution performance, especially if the calculations are done on GPU: Apache MXNet has to copy data from GPU to CPU to return `np.array`.
+Because of the blocking nature of [.asnumpy() method](/api/python/docs/api/ndarray/ndarray.html#mxnet.ndarray.NDArray.asnumpy), using it reduces the execution performance, especially if the calculations are done on GPU: Apache MXNet has to copy data from GPU to CPU to return `np.array`.

 The best solution is to **make manipulations directly on NDArrays by methods provided in [NDArray API](https://mxnet.apache.org/api/python/ndarray/ndarray.html)**.

 ## NumPy operators vs. NDArray operators

-Despite the fact that [NDArray API](https://mxnet.apache.org/api/python/ndarray/ndarray.html) was specifically designed to be similar to `NumPy`, sometimes it is not easy to replace existing `NumPy` computations. The main reason is that not all operators, that are available in `NumPy`, are available in `NDArray API`. The list of currently available operators is available on [NDArray class page](http://mxnet.apache.org/api/python/ndarray/ndarray.html#the-ndarray-class).
+Despite the fact that [NDArray API](/api/python/docs/api/ndarray/index.html) was specifically designed to be similar to `NumPy`, sometimes it is not easy to replace existing `NumPy` computations. The main reason is that not all operators that are available in `NumPy` are available in the `NDArray API`. The list of currently available operators is available on [NDArray class page](/api/python/docs/api/ndarray/ndarray.html).

 If a required operator is missing from `NDArray API`, there are a few things you can do.
@@ -57,7 +57,7 @@ There are a situation, when you can assemble a higher level operator using exist from mxnet import nd import numpy as np -# NumPy has full_like() operator +# NumPy has full_like() operator np_y = np.full_like(a=np.arange(6, dtype=int), fill_value=10) # NDArray doesn't have it, but we can replace it with @@ -73,29 +73,29 @@ np.array_equal(np_y, nd_y.asnumpy()) ### Find similar operator with different name and/or signature -Some operators may have slightly different name, but are similar in terms of functionality. For example [nd.ravel_multi_index()](https://mxnet.apache.org/api/python/ndarray/ndarray.html#mxnet.ndarray.ravel_multi_index) is similar to [np.ravel()](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.ma.ravel.html#numpy.ma.ravel). In other cases some operators may have similar names, but different signatures. For example [np.split()](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.split.html#numpy.split) and [nd.split()](https://mxnet.apache.org/api/python/ndarray/ndarray.html#mxnet.ndarray.split) are similar, but the former works with indices and the latter requires the number of splits to be provided. +Some operators may have slightly different name, but are similar in terms of functionality. For example [nd.ravel_multi_index()](/api/python/docs/api/ndarray/ndarray.html#mxnet.ndarray.ravel_multi_index) is similar to [np.ravel()](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.ma.ravel.html#numpy.ma.ravel). In other cases some operators may have similar names, but different signatures. For example [np.split()](https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.split.html#numpy.split) and [nd.split()](/api/python/docs/api/ndarray/ndarray.html#mxnet.ndarray.split) are similar, but the former works with indices and the latter requires the number of splits to be provided. -One particular example of different input requirements is [nd.pad()](https://mxnet.apache.org/api/python/ndarray/ndarray.html#mxnet.ndarray.pad). The trick is that it can only work with 4-dimensional tensors. If your input has less dimensions, then you need to expand its number before using `nd.pad()` as it is shown in the code block below: +One particular example of different input requirements is [nd.pad()](/api/python/docs/api/ndarray/ndarray.html#mxnet.ndarray.pad). The trick is that it can only work with 4-dimensional tensors. If your input has less dimensions, then you need to expand its number before using `nd.pad()` as it is shown in the code block below: ```python def pad_array(data, max_length): # expand dimensions to 4, because nd.pad can work only with 4 dims data_expanded = data.reshape(1, 1, 1, data.shape[0]) - + # pad all 4 dimensions with constant value of 0 data_padded = nd.pad(data_expanded, mode='constant', pad_width=[0, 0, 0, 0, 0, 0, 0, max_length - data.shape[0]], constant_value=0) - - # remove temporary dimensions + + # remove temporary dimensions data_reshaped_back = data_padded.reshape(max_length) return data_reshaped_back pad_array(nd.array([1, 2, 3]), max_length=10) ``` - + `[ 1. 2. 3. 0. 0. 0. 0. 0. 0. 0.]` @@ -104,7 +104,7 @@ pad_array(nd.array([1, 2, 3]), max_length=10) ### Search for an operator on [Github](https://github.com/apache/incubator-mxnet/labels/Operator) -Apache MXNet community is responsive to requests, and everyone is welcomed to contribute new operators. Have in mind, that there is always a lag between new operators being merged into the codebase and release of a next stable version. 
For example, [nd.diag()](https://github.com/apache/incubator-mxnet/pull/11643) operator was recently introduced to Apache MXNet, but on the moment of writing this tutorial, it is not in any stable release. You can always get all latest implementations by installing the [master version](https://mxnet.apache.org/install/index.html?version=master#) of Apache MXNet. +Apache MXNet community is responsive to requests, and everyone is welcomed to contribute new operators. Have in mind, that there is always a lag between new operators being merged into the codebase and release of a next stable version. For example, [nd.diag()](https://github.com/apache/incubator-mxnet/pull/11643) operator was recently introduced to Apache MXNet, but on the moment of writing this tutorial, it is not in any stable release. You can always get all latest implementations by installing the [master version](/get_started?version=master&platform=linux&language=python&environ=pip&processor=cpu#) of Apache MXNet. ## How to minimize the impact of blocking calls @@ -156,10 +156,10 @@ for data, label in train_data: out = net(data) # This call saves new loss and returns previous loss prev_loss = loss_buffer.new_loss(ce(out, label)) - + loss_buffer.loss.backward() trainer.step(data.shape[0]) - + if prev_loss is not None: print("Loss: {}".format(np.mean(prev_loss.asnumpy()))) ``` @@ -184,4 +184,4 @@ for data, label in train_data: For performance reasons, it is better to use native `NDArray API` methods and avoid using NumPy altogether. In case when you must use NumPy, you can use convenient method `.asnumpy()` on `NDArray` to get NumPy representation. By doing so, you block the whole computational process, and force data to be synced between CPU and GPU. If it is a necessary evil to do that, try to minimize the blocking time by calling `.asnumpy()` in time, when you expect the value to be already computed. - \ No newline at end of file + diff --git a/docs/python_docs/python/tutorials/packages/ndarray/sparse/row_sparse.md b/docs/python_docs/python/tutorials/packages/ndarray/sparse/row_sparse.md index 7218ef9beae1..1241182af85b 100644 --- a/docs/python_docs/python/tutorials/packages/ndarray/sparse/row_sparse.md +++ b/docs/python_docs/python/tutorials/packages/ndarray/sparse/row_sparse.md @@ -35,12 +35,12 @@ Y = mx.nd.dot(X, W) ``` ``` -{'W': +{'W': [[ 3. 4. 5.] [ 6. 7. 8.]] - , 'X': + , 'X': [[ 1. 0.]] - , 'Y': + , 'Y': [[ 3. 4. 5.]] } ``` @@ -87,7 +87,7 @@ As you can see, row 0 of ``grad_W`` contains non-zero values while row 1 of ``gr If you look at how ``grad_W`` is calculated, notice that since column 1 of ``X`` is filled with zeros, row 1 of ``grad_W`` is filled with zeros too. In the real world, gradients for parameters that interact with sparse inputs ususally have gradients where many row slices are completely zeros. Storing and manipulating such sparse matrices with many row slices of all zeros in the default dense structure results in wasted memory and processing on the zeros. More importantly, many gradient based optimization methods such as SGD, [AdaGrad](https://stanford.edu/~jduchi/projects/DuchiHaSi10_colt.pdf) and [Adam](https://arxiv.org/pdf/1412.6980.pdf) -take advantage of sparse gradients and prove to be efficient and effective. +take advantage of sparse gradients and prove to be efficient and effective. 
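Before diving into the format itself, here is a minimal sketch (with made-up toy values) of how a gradient whose rows are mostly zeros can be stored compactly. The `tostype`, `data` and `indices` members used below are covered in more detail later in this tutorial.

```python
import mxnet as mx

# toy gradient in which only row 0 carries non-zero values
grad_dense = mx.nd.array([[1., 1., 1.],
                          [0., 0., 0.],
                          [0., 0., 0.]])

grad_rsp = grad_dense.tostype('row_sparse')  # keep only the non-zero row slices
print(grad_rsp.stype)               # 'row_sparse'
print(grad_rsp.indices)             # [0] - the index of the retained row
print(grad_rsp.data)                # [[1. 1. 1.]] - the retained row slice
print(grad_rsp.tostype('default'))  # convert back to the dense layout
```

Only the single non-zero row slice and its row index are kept, which is exactly the saving that sparse-aware optimizers exploit.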
**In MXNet, the ``RowSparseNDArray`` stores the matrix in ``row sparse`` format, which is designed for arrays of which most row slices are all zeros.** In this tutorial, we will describe what the row sparse format is and how to use RowSparseNDArray for sparse gradient updates in MXNet. @@ -95,7 +95,7 @@ In this tutorial, we will describe what the row sparse format is and how to use To complete this tutorial, we need: -- MXNet. See the instructions for your operating system in [Setup and Installation](https://mxnet.apache.org/install/index.html) +- MXNet. See the instructions for your operating system in [Setup and Installation](/get_started) - [Jupyter](http://jupyter.org/) ``` pip install jupyter @@ -267,10 +267,10 @@ indices = a.indices ``` -{'a.stype': 'row_sparse', 'data': +{'a.stype': 'row_sparse', 'data': [[ 1. 2.] [ 3. 4.]] - , 'indices': + , 'indices': [1 4] } ``` @@ -294,10 +294,10 @@ dense = rsp.tostype('default') ``` -{'dense': +{'dense': [[ 1. 1.] [ 1. 1.]] - , 'rsp': + , 'rsp': } ``` @@ -318,10 +318,10 @@ dense = mx.nd.sparse.cast_storage(rsp, 'default') ``` -{'dense': +{'dense': [[ 1. 1.] [ 1. 1.]] - , 'rsp': + , 'rsp': } ``` @@ -393,7 +393,7 @@ rsp_retained = mx.nd.sparse.retain(rsp, mx.nd.array([0, 1])) [ 0., 0.], [ 3., 4.], [ 5., 6.], - [ 0., 0.]], dtype=float32), 'rsp_retained': + [ 0., 0.]], dtype=float32), 'rsp_retained': , 'rsp_retained.asnumpy()': array([[ 1., 2.], [ 0., 0.], [ 0., 0.], @@ -424,7 +424,7 @@ transpose_dot = mx.nd.sparse.dot(lhs, rhs, transpose_a=True) ``` -{'transpose_dot': +{'transpose_dot': , 'transpose_dot.asnumpy()': array([[ 7., 7.], [ 9., 9.], [ 8., 8.], @@ -576,7 +576,7 @@ except mx.MXNetError as err: sys.stderr.write(str(err)) ``` -## Next +## Next [Train a Linear Regression Model with Sparse Symbols](http://mxnet.apache.org/tutorials/sparse/train.html) diff --git a/docs/python_docs/python/tutorials/packages/onnx/fine_tuning_gluon.md b/docs/python_docs/python/tutorials/packages/onnx/fine_tuning_gluon.md index e8b0761dbc4d..f1e710555e61 100644 --- a/docs/python_docs/python/tutorials/packages/onnx/fine_tuning_gluon.md +++ b/docs/python_docs/python/tutorials/packages/onnx/fine_tuning_gluon.md @@ -31,7 +31,7 @@ In this tutorial we will: ## Pre-requisite To run the tutorial you will need to have installed the following python modules: -- [MXNet > 1.1.0](http://mxnet.apache.org/install/index.html) +- [MXNet > 1.1.0]() - [onnx](https://github.com/onnx/onnx) - matplotlib @@ -333,7 +333,7 @@ The trainer will retrain and fine-tune the entire network. 
If we use `dense_laye ```python -trainer = gluon.Trainer(net.collect_params(), 'sgd', +trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': LEARNING_RATE, 'wd':WDECAY, 'momentum':MOMENTUM}) @@ -388,15 +388,15 @@ for epoch in range(5): loss.backward() trainer.step(data.shape[0]) - nd.waitall() # wait at the end of the epoch - new_val_accuracy = evaluate_accuracy_gluon(dataloader_test, net) + nd.waitall() # wait at the end of the epoch + new_val_accuracy = evaluate_accuracy_gluon(dataloader_test, net) print("Epoch [{0}] Test Accuracy {1:.4f} ".format(epoch, new_val_accuracy)) # We perform early-stopping regularization, to prevent the model from overfitting if val_accuracy > new_val_accuracy: print('Validation accuracy is decreasing, stopping training') break - val_accuracy = new_val_accuracy + val_accuracy = new_val_accuracy ``` `Epoch 4, Test Accuracy 0.8942307829856873` @@ -453,4 +453,4 @@ plot_predictions(caltech101_images_test, result, categories, TOP_P) ![png](https://github.com/dmlc/web-data/blob/master/mxnet/doc/tutorials/onnx/caltech101_correct.png?raw=true) -**Great!** The network classified these images correctly after being fine-tuned on a dataset that contains images of `wrench`, `dolphin` and `lotus` \ No newline at end of file +**Great!** The network classified these images correctly after being fine-tuned on a dataset that contains images of `wrench`, `dolphin` and `lotus` diff --git a/docs/python_docs/python/tutorials/packages/onnx/inference_on_onnx_model.md b/docs/python_docs/python/tutorials/packages/onnx/inference_on_onnx_model.md index 9b1e45cb40c7..481200059559 100644 --- a/docs/python_docs/python/tutorials/packages/onnx/inference_on_onnx_model.md +++ b/docs/python_docs/python/tutorials/packages/onnx/inference_on_onnx_model.md @@ -29,7 +29,7 @@ In this tutorial we will: ## Pre-requisite To run the tutorial you will need to have installed the following python modules: -- [MXNet > 1.1.0](http://mxnet.apache.org/install/index.html) +- [MXNet > 1.1.0]() - [onnx](https://github.com/onnx/onnx) (follow the install guide) - matplotlib diff --git a/docs/python_docs/python/tutorials/packages/onnx/super_resolution.md b/docs/python_docs/python/tutorials/packages/onnx/super_resolution.md index 5d3ad021e25c..a3aa8f807384 100644 --- a/docs/python_docs/python/tutorials/packages/onnx/super_resolution.md +++ b/docs/python_docs/python/tutorials/packages/onnx/super_resolution.md @@ -24,7 +24,7 @@ In this tutorial we will: ## Prerequisites This example assumes that the following python packages are installed: -- [mxnet](http://mxnet.apache.org/install/index.html) +- [mxnet]() - [onnx](https://github.com/onnx/onnx) (follow the install guide) - Pillow - A Python Image Processing package and is required for input pre-processing. It can be installed with ```pip install Pillow```. - matplotlib diff --git a/docs/static_site/src/pages/api/faq/distributed_training.md b/docs/static_site/src/pages/api/faq/distributed_training.md index 169b52183d75..caf0123b7aea 100644 --- a/docs/static_site/src/pages/api/faq/distributed_training.md +++ b/docs/static_site/src/pages/api/faq/distributed_training.md @@ -69,7 +69,7 @@ The distributed mode of KVStore is enabled by calling `mxnet.kvstore.create` fun with a string argument which contains the word `dist` as follows: > kv = mxnet.kvstore.create('dist_sync') -Refer [KVStore API]({{'/api/python/docs/api/gluon-related/mxnet.kvstore.KVStore.html#mxnet.kvstore.KVStore'|relative_url}}) for more information about KVStore. 
+Refer [KVStore API]({{'/api/python/docs/api/kvstore/index.html#mxnet.kvstore.KVStore'|relative_url}}) for more information about KVStore. ### Distribution of Keys Each server doesn't necessarily store all the keys or parameter arrays. @@ -91,7 +91,7 @@ In the case of distributed training though, we would need to divide the dataset Typically, this split of data for each worker happens through the data iterator, on passing the number of parts and the index of parts to iterate over. -Some iterators in MXNet that support this feature are [mxnet.io.MNISTIterator]({{'/api/python/docs/api/gluon-related/_autogen/mxnet.io.MNISTIter.html#mxnet.io.MNISTIter'|relative_url}}) and [mxnet.io.ImageRecordIter]({{'/api/python/docs/api/gluon-related/_autogen/mxnet.io.ImageRecordIter.html#mxnet.io.ImageRecordIter'|relative_url}}). +Some iterators in MXNet that support this feature are [mxnet.io.MNISTIterator]({{'//api/mxnet/io/index.html#mxnet.io.MNISTIter'|relative_url}}) and [mxnet.io.ImageRecordIter]({{'/api/mxnet/io/index.html#mxnet.io.ImageRecordIter'|relative_url}}). If you are using a different iterator, you can look at how the above iterators implement this. We can use the kvstore object to get the number of workers (`kv.num_workers`) and rank of the current worker (`kv.rank`). These can be passed as arguments to the iterator. diff --git a/docs/static_site/src/pages/api/r/docs/tutorials/five_minutes_neural_network.md b/docs/static_site/src/pages/api/r/docs/tutorials/five_minutes_neural_network.md index a5d57347e01e..f7121407892e 100644 --- a/docs/static_site/src/pages/api/r/docs/tutorials/five_minutes_neural_network.md +++ b/docs/static_site/src/pages/api/r/docs/tutorials/five_minutes_neural_network.md @@ -25,7 +25,7 @@ permalink: /api/r/docs/tutorials/five_minutes_neural_network Develop a Neural Network with MXNet in Five Minutes ============================================= -This tutorial is designed for new users of the `mxnet` package for R. It shows how to construct a neural network to do regression in 5 minutes. It shows how to perform classification and regression tasks, respectively. The data we use is in the `mlbench` package. Instructions to install R and MXNet's R package in different environments can be found [here](https://mxnet.apache.org/install/index.html?platform=Linux&language=R&processor=CPU). +This tutorial is designed for new users of the `mxnet` package for R. It shows how to construct a neural network to do regression in 5 minutes. It shows how to perform classification and regression tasks, respectively. The data we use is in the `mlbench` package. Instructions to install R and MXNet's R package in different environments can be found [here](/get_started?version=master&platform=linux&language=r&environ=pip&processor=cpu). 
## Classification From 508ebb0e1aafdbd8d1ee0aef6f5379276f94807f Mon Sep 17 00:00:00 2001 From: Aaron Markham Date: Wed, 16 Oct 2019 06:51:29 -0700 Subject: [PATCH 2/9] Update docs/python_docs/python/tutorials/deploy/export/onnx.md --- docs/python_docs/python/tutorials/deploy/export/onnx.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/python_docs/python/tutorials/deploy/export/onnx.md b/docs/python_docs/python/tutorials/deploy/export/onnx.md index f3ba3b979fe3..b292b359bd59 100644 --- a/docs/python_docs/python/tutorials/deploy/export/onnx.md +++ b/docs/python_docs/python/tutorials/deploy/export/onnx.md @@ -28,7 +28,7 @@ In this tutorial, we will learn how to use MXNet to ONNX exporter on pre-trained ## Prerequisites To run the tutorial you will need to have installed the following python modules: -- [MXNet >= 1.3.0]() +- [MXNet >= 1.3.0](/get_started) - [onnx]( https://github.com/onnx/onnx#installation) v1.2.1 (follow the install guide) *Note:* MXNet-ONNX importer and exporter follows version 7 of ONNX operator set which comes with ONNX v1.2.1. @@ -147,4 +147,4 @@ checker.check_graph(model_proto.graph) If the converted protobuf format doesn't qualify to ONNX proto specifications, the checker will throw errors, but in this case it successfully passes. -This method confirms exported model protobuf is valid. Now, the model is ready to be imported in other frameworks for inference! \ No newline at end of file +This method confirms exported model protobuf is valid. Now, the model is ready to be imported in other frameworks for inference! From 7567d43f6518e2f81b496ae871abd0a921fa810b Mon Sep 17 00:00:00 2001 From: Aaron Markham Date: Wed, 16 Oct 2019 06:53:08 -0700 Subject: [PATCH 3/9] Update docs/python_docs/python/tutorials/packages/onnx/super_resolution.md --- .../python/tutorials/packages/onnx/super_resolution.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/python_docs/python/tutorials/packages/onnx/super_resolution.md b/docs/python_docs/python/tutorials/packages/onnx/super_resolution.md index a3aa8f807384..eec904d80a64 100644 --- a/docs/python_docs/python/tutorials/packages/onnx/super_resolution.md +++ b/docs/python_docs/python/tutorials/packages/onnx/super_resolution.md @@ -24,7 +24,7 @@ In this tutorial we will: ## Prerequisites This example assumes that the following python packages are installed: -- [mxnet]() +- [mxnet](/get_started) - [onnx](https://github.com/onnx/onnx) (follow the install guide) - Pillow - A Python Image Processing package and is required for input pre-processing. It can be installed with ```pip install Pillow```. - matplotlib @@ -137,4 +137,4 @@ You can now compare the input image and the resulting output image. 
As you will | ----------- | ------------ | | ![input](https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/doc/tutorials/onnx/images/super_res_input.jpg?raw=true) | ![output](https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/doc/tutorials/onnx/images/super_res_output.jpg?raw=true) | - \ No newline at end of file + From eb0c73d76af31aa5e385e467a05a58b820a3782a Mon Sep 17 00:00:00 2001 From: Aaron Markham Date: Wed, 16 Oct 2019 06:54:17 -0700 Subject: [PATCH 4/9] Apply suggestions from code review --- .../python/tutorials/packages/onnx/fine_tuning_gluon.md | 2 +- .../python/tutorials/packages/onnx/inference_on_onnx_model.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/python_docs/python/tutorials/packages/onnx/fine_tuning_gluon.md b/docs/python_docs/python/tutorials/packages/onnx/fine_tuning_gluon.md index f1e710555e61..f77731494215 100644 --- a/docs/python_docs/python/tutorials/packages/onnx/fine_tuning_gluon.md +++ b/docs/python_docs/python/tutorials/packages/onnx/fine_tuning_gluon.md @@ -31,7 +31,7 @@ In this tutorial we will: ## Pre-requisite To run the tutorial you will need to have installed the following python modules: -- [MXNet > 1.1.0]() +- [MXNet > 1.1.0](/get_started) - [onnx](https://github.com/onnx/onnx) - matplotlib diff --git a/docs/python_docs/python/tutorials/packages/onnx/inference_on_onnx_model.md b/docs/python_docs/python/tutorials/packages/onnx/inference_on_onnx_model.md index 481200059559..faad53bc6793 100644 --- a/docs/python_docs/python/tutorials/packages/onnx/inference_on_onnx_model.md +++ b/docs/python_docs/python/tutorials/packages/onnx/inference_on_onnx_model.md @@ -29,7 +29,7 @@ In this tutorial we will: ## Pre-requisite To run the tutorial you will need to have installed the following python modules: -- [MXNet > 1.1.0]() +- [MXNet > 1.1.0](/get_started) - [onnx](https://github.com/onnx/onnx) (follow the install guide) - matplotlib From af2522f5f2968c1fe2f70d1bf84dbe0d957e3132 Mon Sep 17 00:00:00 2001 From: Aaron Markham Date: Wed, 16 Oct 2019 09:37:25 -0700 Subject: [PATCH 5/9] update cloud guide --- .../tutorials/deploy/run-on-aws/cloud.rst | 82 ++----------------- 1 file changed, 5 insertions(+), 77 deletions(-) diff --git a/docs/python_docs/python/tutorials/deploy/run-on-aws/cloud.rst b/docs/python_docs/python/tutorials/deploy/run-on-aws/cloud.rst index 20f69a80e118..8549ebbd92ba 100644 --- a/docs/python_docs/python/tutorials/deploy/run-on-aws/cloud.rst +++ b/docs/python_docs/python/tutorials/deploy/run-on-aws/cloud.rst @@ -26,80 +26,8 @@ learning models. Using AWS, we can rapidly fire up multiple machines with multiple GPUs each at will and maintain the resources for precisely the amount of time needed. -Set Up an AWS GPU Cluster from Scratch --------------------------------------- - -In this document, we provide a step-by-step guide that will teach you -how to set up an AWS cluster with *MXNet*. We show how to: - -- ``Use Amazon S3 to host data``\ \_ -- ``Set up an EC2 GPU instance with all dependencies installed``\ \_ -- ``Build and run MXNet on a single computer``\ \_ -- ``Set up an EC2 GPU cluster for distributed training``\ \_ - -Use Amazon S3 to Host Data -:sub:`:sub:`:sub:`:sub:`:sub:`:sub:`:sub:`:sub:`:sub:`:sub:`~`````````\ ~`\ ~~ - -Amazon S3 provides distributed data storage which proves especially -convenient for hosting large datasets. To use S3, you need -``AWS credentials``\ \_, including an ``ACCESS_KEY_ID`` and a -``SECRET_ACCESS_KEY``. 
- -To use *MXNet* with S3, set the environment variables -``AWS_ACCESS_KEY_ID`` and ``AWS_SECRET_ACCESS_KEY`` by adding the -following two lines in ``~/.bashrc`` (replacing the strings with the -correct ones): - -.. code:: bash - -export AWS\_ACCESS\_KEY\_ID=AKIAIOSFODNN7EXAMPLE export -AWS\_SECRET\_ACCESS\_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY - -There are several ways to upload data to S3. One simple way is to use -``s3cmd``\ \_. For example: - -.. code:: bash - -wget http://data.mxnet.io/mxnet/data/mnist.zip unzip mnist.zip && s3cmd -put t\*-ubyte s3://dmlc/mnist/ - -Use Pre-installed EC2 GPU Instance -:sub:`:sub:`~`\ :sub:`:sub:`:sub:`:sub:`:sub:`:sub:`:sub:`:sub:`:sub:`:sub:`:sub:`:sub:`~`````````````\ ~~ - -The ``Deep Learning AMI``\ \_ is an Amazon Linux image supported and -maintained by Amazon Web Services for use on Amazon Elastic Compute -Cloud (Amazon EC2). It contains ``MXNet-v0.9.3 tag``\ \_ and the -necessary components to get going with deep learning, including Nvidia -drivers, CUDA, cuDNN, Anaconda, Python2 and Python3. The AMI IDs are the -following: - -- us-east-1: ami-e7c96af1 -- us-west-2: ami-dfb13ebf -- eu-west-1: ami-6e5d6808 - -Now you can launch *MXNet* directly on an EC2 GPU instance. You can also -use ``Jupyter``\ \_ notebook on EC2 machine. Here is a -``good tutorial``\ \_ on how to connect to a Jupyter notebook running on -an EC2 instance. - -Set Up an EC2 GPU Instance from Scratch -:sub:`:sub:`:sub:`:sub:`:sub:`:sub:`:sub:`~``````\ :sub:`:sub:`:sub:`:sub:`:sub:`:sub:`:sub:`~```````\ :sub:`:sub:`~``` - -*MXNet* requires the following libraries: - -- C++ compiler with C++11 support, such as ``gcc >= 4.8`` -- ``CUDA`` (``CUDNN`` in optional) for GPU linear algebra -- ``BLAS`` (cblas, open-blas, atblas, mkl, or others) - -.. \_Use Amazon S3 to host data: #use-amazon-s3-to-host-data .. \_Set up -an EC2 GPU instance with all dependencies installed: -#set-up-an-ec2-gpu-instance .. \_Build and run MXNet on a single -computer: #build-and-run-mxnet-on-a-gpu-instance .. \_Set up an EC2 GPU -cluster for distributed training: -#set-up-an-ec2-gpu-cluster-for-distributed-training .. \_AWS -credentials: -http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSGettingStartedGuide/AWSCredentials.html -.. \_s3cmd: http://s3tools.org/s3cmd .. *Deep Learning AMI: -https://aws.amazon.com/marketplace/pp/B01M0AXXQB?qid=1475211685369&sr=0-1&ref*\ =srh\_res\_product\_title -.. \_MXNet-v0.9.3 tag: https://github.com/apache/incubator-mxnet .. \_Jupyter: -http://jupyter.org +Here are some ways you can use MXNet on AWS: +1. Use [Amazon SageMaker](https://aws.amazon.com/sagemaker/developer-resources/) +1. Use the [AWS Deep Learning AMI with Conda](https://docs.aws.amazon.com/dlami/latest/devguide/overview-conda.html) (comes preinstalled!) +1. Use an [AWS Deep Learning Container](https://docs.aws.amazon.com/dlami/latest/devguide/deep-learning-containers.html) +1. 
Install MXNet on a [AWS Deep Learning Base AMI](https://docs.aws.amazon.com/dlami/latest/devguide/overview-base.html) From d7684bc590655e91ad85a687a423d1a931717d71 Mon Sep 17 00:00:00 2001 From: Aaron Markham Date: Wed, 16 Oct 2019 12:36:52 -0700 Subject: [PATCH 6/9] Apply suggestions from code review Co-Authored-By: Talia <31782251+TEChopra1000@users.noreply.github.com> --- docs/python_docs/python/tutorials/deploy/run-on-aws/index.rst | 2 +- docs/python_docs/python/tutorials/packages/autograd/index.md | 4 ++-- .../python/tutorials/packages/gluon/blocks/hybridize.md | 4 ++-- docs/python_docs/python/tutorials/packages/gluon/blocks/nn.md | 2 +- docs/python_docs/python/tutorials/packages/gluon/loss/loss.md | 4 ++-- .../tutorials/packages/ndarray/gotchas_numpy_in_mxnet.md | 2 +- 6 files changed, 9 insertions(+), 9 deletions(-) diff --git a/docs/python_docs/python/tutorials/deploy/run-on-aws/index.rst b/docs/python_docs/python/tutorials/deploy/run-on-aws/index.rst index 46ef737e596a..522c44ee4039 100644 --- a/docs/python_docs/python/tutorials/deploy/run-on-aws/index.rst +++ b/docs/python_docs/python/tutorials/deploy/run-on-aws/index.rst @@ -42,7 +42,7 @@ The following tutorials will help you learn how to deploy MXNet on various AWS p .. card:: :title: Training with Data from S3 - :link: s3_integration.html + :link: /api/faq/s3_integration How to train with data from Amazon S3 buckets. diff --git a/docs/python_docs/python/tutorials/packages/autograd/index.md b/docs/python_docs/python/tutorials/packages/autograd/index.md index 67229c2d11c4..586794a696ba 100644 --- a/docs/python_docs/python/tutorials/packages/autograd/index.md +++ b/docs/python_docs/python/tutorials/packages/autograd/index.md @@ -159,7 +159,7 @@ print('is_training:', is_training, output) We called `dropout` while `autograd` was recording this time, so our network was in training mode and we see dropout of the input this time. Since the probability of dropout was 50%, the output is automatically scaled by 1/0.5=2 to preserve the average activation. -We can force some operators to behave as they would during training, even in inference mode. One example is setting `mode='always'` on the [Dropout](/api/python/ndarray/ndarray.html?highlight=dropout#mxnet.ndarray.Dropout) operator, but this usage is uncommon. +We can force some operators to behave as they would during training, even in inference mode. One example is setting `mode='always'` on the [Dropout](/api/python/ndarray/ndarray.html#mxnet.ndarray.Dropout) operator, but this usage is uncommon. ## Advanced: Skipping the calculation of parameter gradients @@ -196,7 +196,7 @@ print(x.grad) ## Advanced: Using Python control flow -As mentioned before, one of the main advantages of `autograd` is the ability to automatically calculate gradients of dynamic graphs (i.e. graphs where the operators could be different on every forward pass). One example of this would be applying a tree structured recurrent network to parse a sentence using its parse tree. And we can use Python control flow operators to create a dynamic flow that depends on the data, rather than using [MXNet's control flow operators](/api/python/tutorials/extend/control_flow.html). +As mentioned before, one of the main advantages of `autograd` is the ability to automatically calculate gradients of dynamic graphs (i.e. graphs where the operators could be different on every forward pass). One example of this would be applying a tree structured recurrent network to parse a sentence using its parse tree. 
And we can use Python control flow operators to create a dynamic flow that depends on the data, rather than using [MXNet's control flow operators](/api/python/docs/tutorials/packages/autograd/index.html#Advanced:-Using-Python-control-flow). We'll write a function as a toy example of a dynamic network. We'll add an `if` condition and a loop with a variable number of iterations, both of which will depend on the input data. Although these can now be used in static graphs (with conditional operators) it's still much more natural to use native control flow. diff --git a/docs/python_docs/python/tutorials/packages/gluon/blocks/hybridize.md b/docs/python_docs/python/tutorials/packages/gluon/blocks/hybridize.md index 6ca58a92d032..1dd778d74abd 100644 --- a/docs/python_docs/python/tutorials/packages/gluon/blocks/hybridize.md +++ b/docs/python_docs/python/tutorials/packages/gluon/blocks/hybridize.md @@ -280,7 +280,7 @@ Trying to access the shape of a tensor in a hybridized block would result in thi Again, you cannot use the shape of the symbol at runtime as symbols only describe operations and not the underlying data they operate on. Note: This will change in the future as Apache MXNet will support [dynamic shape inference](https://cwiki.apache.org/confluence/display/MXNET/Dynamic+shape), and the shapes of symbols will be symbols themselves -There are also a lot of operators that support special indices to help with most of the use-cases where you would want to access the shape information. For example, `F.reshape(x, (0,0,-1))` will keep the first two dimensions unchanged and collapse all further dimensions into the third dimension. See the documentation of the [`F.reshape`](/api/python/docs/api/ndarray/ndarray.htmlmxnet.ndarray.reshape.html) for more details. +There are also a lot of operators that support special indices to help with most of the use-cases where you would want to access the shape information. For example, `F.reshape(x, (0,0,-1))` will keep the first two dimensions unchanged and collapse all further dimensions into the third dimension. See the documentation of the [F.reshape](/api/python/docs/api/ndarray/ndarray.html#mxnet.ndarray.reshape) for more details. ### Item assignment @@ -294,7 +294,7 @@ def hybrid_forward(self, F, x): Would get you this error `TypeError: 'Symbol' object does not support item assignment`. -Direct item assignment is not possible in symbolic graph since it needs to be part of a computational graph. One way is to use add more inputs to your graph and use masking or the [`F.where`](/api/python/docs/api/ndarray/ndarray.htmlmxnet.ndarray.where.html) operator. +Direct item assignment is not possible in symbolic graph since it needs to be part of a computational graph. One way is to use add more inputs to your graph and use masking or the [F.where](/api/python/docs/api/ndarray/ndarray.html#mxnet.ndarray.where) operator. e.g to set the first element to 2 you can do: diff --git a/docs/python_docs/python/tutorials/packages/gluon/blocks/nn.md b/docs/python_docs/python/tutorials/packages/gluon/blocks/nn.md index 60aa366ad2bb..a3324fea7318 100644 --- a/docs/python_docs/python/tutorials/packages/gluon/blocks/nn.md +++ b/docs/python_docs/python/tutorials/packages/gluon/blocks/nn.md @@ -310,4 +310,4 @@ Gluon does this by allowing for [Hybridization](hybridize.html). In it, the Python interpreter executes the block the first time it's invoked. The Gluon runtime records what is happening and the next time around it short circuits any calls to Python. 
This can accelerate things considerably in some cases but -care needs to be taken with [control flow](/api/python/getting-started/crash-course/3-autograd.html). +care needs to be taken with [control flow](/api/python/docs/tutorials/packages/autograd/index.html#Advanced:-Using-Python-control-flow). diff --git a/docs/python_docs/python/tutorials/packages/gluon/loss/loss.md b/docs/python_docs/python/tutorials/packages/gluon/loss/loss.md index 17aaef74d106..af5fe04ab2cb 100644 --- a/docs/python_docs/python/tutorials/packages/gluon/loss/loss.md +++ b/docs/python_docs/python/tutorials/packages/gluon/loss/loss.md @@ -19,8 +19,8 @@ Loss functions are used to train neural networks and to compute the difference between output and target variable. A critical component of training neural networks is the loss function. A loss function is a quantative measure of how bad the predictions of the network are when compared to ground truth labels. Given this score, a network can improve by iteratively updating its weights to minimise this loss. Some tasks use a combination of multiple loss functions, but often you'll just use one. MXNet Gluon provides a number of the most commonly used loss functions, and you'll choose certain loss functions depending on your network and task. Some common task and loss function pairs include: -- regression: [L1Loss](/api/python/docs/api/gluon/_autogen/mxnet.gluon.loss.L1Loss.html), [L2Loss](/api/python/docs/api/gluon/loss/index.html#mxnet.gluon.loss.L2Loss) -- classification: [SigmoidBinaryCrossEntropyLoss](/api/python/docs/api/gluon/_autogen/mxnet.gluon.loss.SigmoidBinaryCrossEntropyLoss.html), [SoftmaxBinaryCrossEntropyLoss](/api/python/docs/api/gluon/_autogen/mxnet.gluon.loss.SoftmaxBinaryCrossEntropyLoss.html) +- regression: [L1Loss](/api/python/docs/api/gluon/loss/index.html#mxnet.gluon.loss.L1Loss), [L2Loss](/api/python/docs/api/gluon/loss/index.html#mxnet.gluon.loss.L2Loss) +- classification: [SigmoidBinaryCrossEntropyLoss](/api/python/docs/api/gluon/loss/index.html#mxnet.gluon.loss.SigmoidBinaryCrossEntropyLoss), [SoftmaxCrossEntropyLoss](/api/python/docs/api/gluon/loss/index.html#mxnet.gluon.loss.SoftmaxCrossEntropyLoss) - embeddings: [HingeLoss](/api/python/docs/api/gluon/_autogen/mxnet.gluon.loss.HingeLoss.html) We'll first import the modules, where the `mxnet.gluon.loss` module is imported as `gloss` to avoid the commonly used name `loss`. diff --git a/docs/python_docs/python/tutorials/packages/ndarray/gotchas_numpy_in_mxnet.md b/docs/python_docs/python/tutorials/packages/ndarray/gotchas_numpy_in_mxnet.md index c1e49cd75bbb..1fe40bc1671f 100644 --- a/docs/python_docs/python/tutorials/packages/ndarray/gotchas_numpy_in_mxnet.md +++ b/docs/python_docs/python/tutorials/packages/ndarray/gotchas_numpy_in_mxnet.md @@ -38,7 +38,7 @@ Many people are familiar with NumPy and flexible doing tensor manipulations usin That means that if we have a long computation graph and, at some point, we want to cast the result to `np.array`, it may feel like the casting takes a lot of time. But what really takes this time is `Execution Engine`, which finishes all the async calculations we have pushed into it to get the final result, which then will be converted to `np.array`. -Because of the blocking nature of [.asnumpy() method](api/python/docs/api/ndarray/ndarray.html#mxnet.ndarray.NDArray.asnumpy), using it reduces the execution performance, especially if the calculations are done on GPU: Apache MXNet has to copy data from GPU to CPU to return `np.array`. 
+Because of the blocking nature of [.asnumpy() method](/api/python/docs/api/ndarray/ndarray.html#mxnet.ndarray.NDArray.asnumpy), using it reduces the execution performance, especially if the calculations are done on GPU: Apache MXNet has to copy data from GPU to CPU to return `np.array`. The best solution is to **make manipulations directly on NDArrays by methods provided in [NDArray API](https://mxnet.apache.org/api/python/ndarray/ndarray.html)**. From e14b6cb124ba5b9264b542a2373b06e85a3d8326 Mon Sep 17 00:00:00 2001 From: Aaron Markham Date: Wed, 16 Oct 2019 15:30:38 -0700 Subject: [PATCH 7/9] change file format --- .../python/tutorials/deploy/run-on-aws/{cloud.rst => cloud.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename docs/python_docs/python/tutorials/deploy/run-on-aws/{cloud.rst => cloud.md} (100%) diff --git a/docs/python_docs/python/tutorials/deploy/run-on-aws/cloud.rst b/docs/python_docs/python/tutorials/deploy/run-on-aws/cloud.md similarity index 100% rename from docs/python_docs/python/tutorials/deploy/run-on-aws/cloud.rst rename to docs/python_docs/python/tutorials/deploy/run-on-aws/cloud.md From 5442a6511ef5c22a60e43e9af0de9f982e3ad3ae Mon Sep 17 00:00:00 2001 From: Aaron Markham Date: Wed, 16 Oct 2019 15:32:20 -0700 Subject: [PATCH 8/9] change file format --- .../tutorials/deploy/run-on-aws/cloud.md | 28 +++++++++---------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/docs/python_docs/python/tutorials/deploy/run-on-aws/cloud.md b/docs/python_docs/python/tutorials/deploy/run-on-aws/cloud.md index 8549ebbd92ba..e0b8a6a02f15 100644 --- a/docs/python_docs/python/tutorials/deploy/run-on-aws/cloud.md +++ b/docs/python_docs/python/tutorials/deploy/run-on-aws/cloud.md @@ -1,19 +1,19 @@ -.. Licensed to the Apache Software Foundation (ASF) under one - or more contributor license agreements. See the NOTICE file - distributed with this work for additional information - regarding copyright ownership. The ASF licenses this file - to you under the Apache License, Version 2.0 (the - "License"); you may not use this file except in compliance - with the License. You may obtain a copy of the License at + + + + + + + - http://www.apache.org/licenses/LICENSE-2.0 + - Unless required by applicable law or agreed to in writing, - software distributed under the License is distributed on an - "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY - KIND, either express or implied. See the License for the - specific language governing permissions and limitations - under the License. + + + + + + MXNet on the Cloud ================== From 24c6a708bcd7b13941761eb1d84bcb3c1f79899f Mon Sep 17 00:00:00 2001 From: Aaron Markham Date: Wed, 16 Oct 2019 15:54:00 -0700 Subject: [PATCH 9/9] fix formatting --- .../python/tutorials/deploy/run-on-aws/cloud.md | 10 +++------- 1 file changed, 3 insertions(+), 7 deletions(-) diff --git a/docs/python_docs/python/tutorials/deploy/run-on-aws/cloud.md b/docs/python_docs/python/tutorials/deploy/run-on-aws/cloud.md index e0b8a6a02f15..1c2ff5448d2a 100644 --- a/docs/python_docs/python/tutorials/deploy/run-on-aws/cloud.md +++ b/docs/python_docs/python/tutorials/deploy/run-on-aws/cloud.md @@ -19,15 +19,11 @@ MXNet on the Cloud ================== Deep learning can require extremely powerful hardware, often for -unpredictable durations of time. Moreover, *MXNet* can benefit from both -multiple GPUs and multiple machines. Accordingly, cloud computing, as -offered by AWS and others, is especially well suited to training deep -learning models. 
Using AWS, we can rapidly fire up multiple machines -with multiple GPUs each at will and maintain the resources for precisely -the amount of time needed. +unpredictable durations of time. Moreover, *MXNet* can benefit from both multiple GPUs and multiple machines. Accordingly, cloud computing, as offered by AWS and others, is especially well suited to training deep learning models. Using AWS, we can rapidly fire up multiple machines with multiple GPUs each at will and maintain the resources for precisely the amount of time needed. Here are some ways you can use MXNet on AWS: + 1. Use [Amazon SageMaker](https://aws.amazon.com/sagemaker/developer-resources/) -1. Use the [AWS Deep Learning AMI with Conda](https://docs.aws.amazon.com/dlami/latest/devguide/overview-conda.html) (comes preinstalled!) +1. Use the [AWS Deep Learning AMI with Conda](https://docs.aws.amazon.com/dlami/latest/devguide/overview-conda.html) 1. Use an [AWS Deep Learning Container](https://docs.aws.amazon.com/dlami/latest/devguide/deep-learning-containers.html) 1. Install MXNet on a [AWS Deep Learning Base AMI](https://docs.aws.amazon.com/dlami/latest/devguide/overview-base.html)