diff --git a/docs/source/advanced_theano.rst b/docs/source/advanced_theano.rst
new file mode 100644
index 0000000000..36f6b97600
--- /dev/null
+++ b/docs/source/advanced_theano.rst
@@ -0,0 +1,101 @@
+=================================
+Advanced usage of Theano in PyMC3
+=================================
+
+Using shared variables
+======================
+
+Shared variables allow us to use values in theano functions that are
+not considered an input to the function, but can still be changed
+later. They are very similar to global variables in many ways::
+
+    a = tt.scalar('a')
+    # Create a new shared variable with initial value of 0.1
+    b = theano.shared(0.1)
+    func = theano.function([a], a * b)
+    assert func(2.) == 0.2
+
+    b.set_value(10.)
+    assert func(2.) == 20.
+
+Shared variables can also contain arrays, and are allowed to change
+their shape as long as the number of dimensions stays the same.
+
+We can use shared variables in PyMC3 to fit the same model to several
+datasets without the need to recreate the model each time (which can
+be time consuming if the number of datasets is large)::
+
+    # We generate 10 datasets
+    true_mu = [np.random.randn() for _ in range(10)]
+    observed_data = [mu + np.random.randn(20) for mu in true_mu]
+
+    data = theano.shared(observed_data[0])
+    with pm.Model() as model:
+        mu = pm.Normal('mu', 0, 10)
+        pm.Normal('y', mu=mu, sd=1, observed=data)
+
+    # Generate one trace for each dataset
+    traces = []
+    for data_vals in observed_data:
+        # Switch out the observed dataset
+        data.set_value(data_vals)
+        with model:
+            traces.append(pm.sample())
+
+We can also sometimes use shared variables to work around limitations
+in the current PyMC3 API. 
A common task in machine learning is to predict
+values for unseen data, and one way to achieve this is to use a shared
+variable for our observations::
+
+    x = np.random.randn(100)
+    y = x > 0
+
+    x_shared = theano.shared(x)
+
+    with pm.Model() as model:
+        coeff = pm.Normal('coeff', mu=0, sd=1)
+        logistic = pm.math.sigmoid(coeff * x_shared)
+        pm.Bernoulli('obs', p=logistic, observed=y)
+
+        # fit the model
+        trace = pm.sample()
+
+        # Switch out the observations and use `sample_ppc` to predict
+        x_shared.set_value(np.array([-1., 0., 1.]))
+        post_pred = pm.sample_ppc(trace, samples=500)
+
+However, due to the way we handle shapes at the moment, it is
+not possible to change the shape of a shared variable if that would
+also change the shape of one of the random variables in the model.
+
+
+Writing custom Theano Ops
+=========================
+
+While Theano includes a wide range of operations, there are cases where
+it makes sense to write your own. But before doing this it is a good
+idea to think hard about whether it is actually necessary. Especially
+if you want to use algorithms that need gradient information — this
+includes NUTS and all variational methods, and you probably *should*
+want to use those — this is often quite a bit of work and also requires
+some math and debugging skills for the gradients.
+
+Good reasons for defining a custom Op might be the following:
+
+- You require an operation that is not available in Theano and can't
+  be built up out of existing Theano operations. This could for example
+  include models where you need to solve differential equations or
+  integrals, or find a root or minimum of a function that depends
+  on your parameters.
+- You want to connect your PyMC3 model to some existing external code.
+- Even after carefully considering different parametrizations and a lot
+  of profiling, your model is still too slow, but you know of a faster
+  way to compute the gradient than what theano is doing. This faster
+  way might be anything from clever maths to using more hardware. 
There is nothing stopping anyone from using a cluster via MPI in
+  a custom Op, if a part of the gradient computation is slow enough
+  and sufficiently parallelizable to make the cost worth it.
+  We would definitely like to hear about any such examples.
+
+Theano has extensive `documentation `_
+about how to write new Ops.
diff --git a/docs/source/getting_started.rst b/docs/source/getting_started.rst
index 04d0836bb8..28bfc29169 100644
--- a/docs/source/getting_started.rst
+++ b/docs/source/getting_started.rst
@@ -8,3 +8,5 @@ Getting started
    notebooks/getting_started.ipynb
    notebooks/api_quickstart.ipynb
    notebooks/variational_api_quickstart.ipynb
+   theano.rst
+   advanced_theano.rst
diff --git a/docs/source/theano.rst b/docs/source/theano.rst
new file mode 100644
index 0000000000..096e2d229e
--- /dev/null
+++ b/docs/source/theano.rst
@@ -0,0 +1,216 @@
+================
+PyMC3 and Theano
+================
+
+What is Theano
+==============
+
+Theano is a package that allows us to define functions involving array
+operations and linear algebra. When we define a PyMC3 model, we implicitly
+build up a Theano function from the space of our parameters to
+their posterior probability density up to a constant factor. We then use
+symbolic manipulations of this function to also get access to its gradient.
+
+For a thorough introduction to Theano see the
+`theano docs `_,
+but for the most part you don't need detailed knowledge about it as long
+as you are not trying to define new distributions or other extensions
+of PyMC3. But let's look at a simple example to get a rough
+idea about how it works. Say we'd like to define the (completely
+arbitrarily chosen) function
+
+.. math::
+
+   f\colon \mathbb{R} \times \mathbb{R}^n \times \mathbb{N}^n \to \mathbb{R}\\
+   (a, x, y) \mapsto \sum_{i=0}^{n-1} \exp(ax_i^3 + y_i^2). 
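As a point of reference, a plain NumPy version of this function could look like the following sketch (the function name `f` and the example values are our own, not taken from Theano or PyMC3):

```python
import numpy as np

def f(a, x, y):
    # Elementwise exp(a * x_i**3 + y_i**2), summed over all n elements.
    return np.exp(a * x**3 + y**2).sum()

# f(2, [1, 0], [0, 1]) = exp(2*1 + 0) + exp(0 + 1) = e**2 + e
value = f(2.0, np.array([1.0, 0.0]), np.array([0, 1]))
```

The Theano version built below records the same computation symbolically instead of executing it immediately.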
+
+
+First, we need to define symbolic variables for our inputs (this
+is similar to e.g. SymPy's `Symbol`)::
+
+    import theano
+    import theano.tensor as tt
+    # We don't specify the dtype of our input variables, so it
+    # defaults to using float64 without any special config.
+    a = tt.scalar('a')
+    x = tt.vector('x')
+    # `tt.ivector` creates a symbolic vector of integers.
+    y = tt.ivector('y')
+
+Next, we use those variables to build up a symbolic representation
+of the output of our function. Note that no computation is actually
+being done at this point. We only record what operations we need to
+do to compute the output::
+
+    inner = a * x**3 + y**2
+    out = tt.exp(inner).sum()
+
+.. note::
+
+    In this example we use `tt.exp` to create a symbolic representation
+    of the exponential of `inner`. Somewhat surprisingly, it
+    would also have worked if we used `np.exp`. This is because numpy
+    gives objects it operates on a chance to define the results of
+    operations themselves. Theano variables do this for a large number
+    of operations. We usually still prefer the theano
+    functions instead of the numpy versions, as that makes it clear that
+    we are working with symbolic input instead of plain arrays.
+
+Now we can tell Theano to build a function that does this computation.
+With a typical configuration Theano generates C code, compiles it,
+and creates a python function which wraps the C function::
+
+    func = theano.function([a, x, y], [out])
+
+We can call this function with actual arrays as many times as we want::
+
+    a_val = 1.2
+    x_vals = np.random.randn(10)
+    # `y` is an `ivector`, so we have to pass in integers.
+    y_vals = np.random.randint(0, 10, size=10).astype('int32')
+
+    out_val, = func(a_val, x_vals, y_vals)
+
+For the most part the symbolic Theano variables can be operated on
+like NumPy arrays. Most NumPy functions are available in `theano.tensor`
+(which is typically imported as `tt`). A lot of linear algebra operations
+can be found in `tt.nlinalg` and `tt.slinalg` (the NumPy and SciPy
+operations respectively). 
Some support for sparse matrices is available
+in `theano.sparse`. For a detailed overview of available operations,
+see `the theano api docs `_.
+
+A notable exception where theano variables do *not* behave like
+NumPy arrays is code involving conditional execution.
+
+Code like this won't work as expected::
+
+    a = tt.vector('a')
+    if (a > 0).all():
+        b = tt.sqrt(a)
+    else:
+        b = -a
+
+`(a > 0).all()` isn't actually a boolean as it would be in NumPy, but
+still a symbolic variable. Python will convert this object to a boolean
+and according to the rules for this conversion, things that aren't empty
+containers or zero are converted to `True`. So the code is equivalent
+to this::
+
+    a = tt.vector('a')
+    b = tt.sqrt(a)
+
+To get the desired behaviour, we can use `tt.switch`::
+
+    a = tt.vector('a')
+    b = tt.switch((a > 0).all(), tt.sqrt(a), -a)
+
+Indexing also works similarly to NumPy::
+
+    a = tt.vector('a')
+    # Access the element at index 10. This will fail when a function
+    # built from this expression is executed with an array that is
+    # too short.
+    b = a[10]
+
+    # Extract a subvector
+    b = a[[1, 2, 10]]
+
+Changing elements of an array is possible using `tt.set_subtensor`::
+
+    a = tt.vector('a')
+    b = tt.set_subtensor(a[:10], 1)
+
+    # This is roughly equivalent to the following NumPy code
+    # (although theano avoids the copy if `a` isn't used anymore)
+    a = np.random.randn(10)
+    b = a.copy()
+    b[:10] = 1
+
+How PyMC3 uses Theano
+=====================
+
+Now that we have a basic understanding of Theano we can look at what
+happens if we define a PyMC3 model. Let's look at a simple example::
+
+    true_mu = 0.1
+    data = true_mu + np.random.randn(50)
+
+    with pm.Model() as model:
+        mu = pm.Normal('mu', mu=0, sd=1)
+        y = pm.Normal('y', mu=mu, sd=1, observed=data)
+
+In this model we define two variables: `mu` and `y`. The first is
+a free variable that we want to infer, the second is an observed
+variable. To sample from the posterior we need to build the function
+
+.. 
math::
+
+   \log P(\mu|y) + C = \log P(y|\mu) + \log P(\mu) =: \text{logp}(\mu)
+
+where, writing :math:`N(x|\mu, \sigma^2)` for the normal density, the
+logp function for our model is
+
+.. math::
+
+   \text{logp}\colon \mathbb{R} \to \mathbb{R}\\
+   \mu \mapsto \log N(\mu|0, 1) + \sum_i \log N(y_i|\mu, 1).
+
+To build that function we need to keep track of two things: the parameter
+space (the *free variables*) and the logp function. For each free variable
+we generate a Theano variable. And for each variable (observed or otherwise)
+we add a term to the global logp. In the background something similar to
+this is happening::
+
+    # For illustration only, those functions don't actually exist
+    # in exactly this way!
+    model = pm.Model()
+
+    mu = tt.scalar('mu')
+    model.add_free_variable(mu)
+    model.add_logp_term(pm.Normal.dist(0, 1).logp(mu))
+
+    model.add_logp_term(pm.Normal.dist(mu, 1).logp(data))
+
+So calling `pm.Normal()` modifies the model: it changes the logp function
+of the model. If the `observed` keyword isn't set, it also creates a new
+free variable. In contrast, `pm.Normal.dist()` doesn't care about the model;
+it just creates an object that represents the normal distribution. Calling
+`logp` on this object creates a theano variable for the log probability
+or log probability density of the distribution, but again without changing
+the model in any way.
+
+Continuous variables with support only on a subset of the real numbers
+are treated a bit differently. We create a transformed variable with
+support on all of the reals, and recover the original variable as a
+deterministic transformation of it. For example::
+
+    with pm.Model() as model:
+        mu = pm.Normal('mu', 0, 1)
+        sd = pm.HalfNormal('sd', 1)
+        y = pm.Normal('y', mu=mu, sd=sd, observed=data)
+
+is roughly equivalent to this::
+
+    # For illustration only, not real code! 
+    model = pm.Model()
+    mu = tt.scalar('mu')
+    model.add_free_variable(mu)
+    model.add_logp_term(pm.Normal.dist(0, 1).logp(mu))
+
+    sd_log__ = tt.scalar('sd_log__')
+    model.add_free_variable(sd_log__)
+    model.add_logp_term(corrected_logp_half_normal(sd_log__))
+
+    sd = tt.exp(sd_log__)
+    model.add_deterministic_variable(sd)
+
+    model.add_logp_term(pm.Normal.dist(mu, sd).logp(data))
+
+The return values of the variable constructors are subclasses
+of theano variables, so when we define a variable we can use any
+theano operation on them::
+
+    design_matrix = np.array([[...]])
+    with pm.Model() as model:
+        # beta is a tt.dvector with one coefficient per
+        # column of the design matrix
+        beta = pm.Normal('beta', 0, 1, shape=design_matrix.shape[1])
+        predict = tt.dot(design_matrix, beta)
+        sd = pm.HalfCauchy('sd', beta=2.5)
+        pm.Normal('y', mu=predict, sd=sd, observed=data)
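To make the logp bookkeeping described above concrete, here is a plain-NumPy sketch of the logp function that PyMC3 assembles for the first model in this section (`mu` with a `N(0, 1)` prior and `y ~ N(mu, 1)` observed). The helper names `norm_logpdf` and `model_logp` are our own and not part of the PyMC3 API:

```python
import numpy as np

def norm_logpdf(x, mu, sd):
    # Log density of the normal distribution N(x | mu, sd**2).
    return -0.5 * np.log(2 * np.pi) - np.log(sd) - 0.5 * ((x - mu) / sd) ** 2

def model_logp(mu, data):
    # One logp term for the free variable `mu` (its N(0, 1) prior) and
    # one term per observed data point, summed into a single scalar;
    # this mirrors the `add_logp_term` calls in the sketch above.
    return norm_logpdf(mu, 0.0, 1.0) + norm_logpdf(data, mu, 1.0).sum()

data = np.array([0.5, -0.3, 0.1])
lp = model_logp(0.1, data)
```

The real model builds exactly this kind of expression symbolically in Theano, which is what makes the gradient of the logp available to NUTS and the variational methods.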