diff --git a/docs/api/python/autograd/autograd.md b/docs/api/python/autograd/autograd.md
index 862718136fec..c2ad67420940 100644
--- a/docs/api/python/autograd/autograd.md
+++ b/docs/api/python/autograd/autograd.md
@@ -98,23 +98,66 @@ backward nodes, not the full initial graph that includes the forward nodes.
 
 The idiom to calculate higher order gradients is the following:
 
 ```python
-import mxnet autograd as ag
+from mxnet import ndarray as nd
+from mxnet import autograd as ag
+
+x = nd.array([1, 2, 3])
+x.attach_grad()  # mark x so gradients with respect to it are computed and stored in x.grad
+
+def f(x):
+    # A function which supports higher order gradients
+    return x * x
+```
+
+If the operators used in `f` don't support higher order gradients, you will get an error like
+`operator ... is non-differentiable because it didn't register FGradient attribute.`. This means
+that the operator doesn't support computing the gradient of the gradient, that is, running
+backward on the backward graph.
+
+Using `mxnet.autograd.grad` multiple times:
+
+```python
 with ag.record():
     y = f(x)
     y_grad = ag.grad(y, x, create_graph=True, retain_graph=True)[0]
     y_grad_grad = ag.grad(y_grad, x, create_graph=False, retain_graph=True)[0]
 ```
 
-or
+Running backward on the backward graph:
 
 ```python
-import mxnet autograd as ag
 with ag.record():
     y = f(x)
     y_grad = ag.grad(y, x, create_graph=True, retain_graph=True)[0]
-    y_grad_grad = y_grad.backward()
+y_grad.backward()  # backward() returns None; the second gradient is written to x.grad
+y_grad_grad = x.grad
 ```
 
+Both methods are equivalent, except that in the second case `retain_graph` defaults to `False`
+when running backward. In both cases a backward pass runs over the backward graph as usual,
+computing the gradient of the first gradient `y_grad` with respect to `x`, evaluated at the
+current value of `x`.
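+
+As a quick sanity check: for `f(x) = x*x` the first gradient is `2*x` and the second gradient is
+the constant `2`, so with `x = [1, 2, 3]` both idioms should yield:
+
+```python
+print(y_grad)       # [2. 4. 6.], i.e. 2*x
+print(y_grad_grad)  # [2. 2. 2.], the second derivative of x*x
+```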
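+
+Note that `create_graph=True` on the first gradient call is what records the backward pass so
+that it can itself be differentiated. With the default `create_graph=False`, the second gradient
+should fail, because `y_grad` is then not part of any recorded graph (the exact error message
+depends on the MXNet version):
+
+```python
+with ag.record():
+    y = f(x)
+    y_grad = ag.grad(y, x, create_graph=False, retain_graph=True)[0]
+# ag.grad(y_grad, x) or y_grad.backward() would now raise an error, because
+# the backward pass that produced y_grad was not itself recorded.
+```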