101 changes: 101 additions & 0 deletions docs/source/advanced_theano.rst
=================================
Advanced usage of Theano in PyMC3
=================================

Using shared variables
======================

Shared variables allow us to use values in theano functions that are
not considered an input to the function, but can still be changed
later. They are very similar to global variables in many ways::

a = tt.scalar('a')
# Create a new shared variable with initial value of 0.1
b = theano.shared(0.1)
func = theano.function([a], a * b)
assert func(2.) == 0.2

b.set_value(10.)
assert func(2.) == 20.

Shared variables can also contain arrays, and are allowed to change
their shape as long as the number of dimensions stays the same.

We can use shared variables in PyMC3 to fit the same model to several
datasets without the need to recreate the model each time (which can
be time consuming if the number of datasets is large)::

# We generate 10 datasets
true_mu = [np.random.randn() for _ in range(10)]
observed_data = [mu + np.random.randn(20) for mu in true_mu]

data = theano.shared(observed_data[0])
with pm.Model() as model:
    mu = pm.Normal('mu', 0, 10)
    pm.Normal('y', mu=mu, sd=1, observed=data)

# Generate one trace for each dataset
traces = []
for data_vals in observed_data:
    # Switch out the observed dataset
    data.set_value(data_vals)
    with model:
        traces.append(pm.sample())

We can also sometimes use shared variables to work around limitations
in the current PyMC3 API. A common task in machine learning is to predict
values for unseen data, and one way to achieve this is to use a shared
variable for our observations::

x = np.random.randn(100)
y = x > 0

x_shared = theano.shared(x)

with pm.Model() as model:
    coeff = pm.Normal('x', mu=0, sd=1)
    logistic = pm.math.sigmoid(coeff * x_shared)
    pm.Bernoulli('obs', p=logistic, observed=y)

    # fit the model
    trace = pm.sample()

    # Switch out the observations and use `sample_ppc` to predict
    x_shared.set_value(np.array([-1., 0., 1.]))
    post_pred = pm.sample_ppc(trace, samples=500)

However, due to the way we handle shapes at the moment, it is
not possible to change the shape of a shared variable if that would
also change the shape of one of the model's random variables.


Writing custom Theano Ops
=========================

While Theano includes a wide range of operations, there are cases where
it makes sense to write your own. But before doing this, it is a good
idea to think hard about whether it is actually necessary. Especially if you want
to use algorithms that need gradient information — this includes NUTS and
all variational methods, and you probably *should* want to use those —
writing a custom Op is often quite a bit of work, and it also requires some
math and debugging skills to get the gradients right.

Good reasons for defining a custom Op might be the following:

- You require an operation that is not available in Theano and can't
  be built up from existing Theano operations. This could for example
  include models where you need to solve differential equations or
  integrals, or find the root or minimum of a function that depends
  on your parameters.
- You want to connect your PyMC3 model to some existing external code.
- After carefully considering different parametrizations and doing a lot
  of profiling, your model is still too slow, but you know of a faster
  way to compute the gradient than what Theano is doing. This faster
  way might be anything from clever maths to using more hardware.
  There is nothing stopping anyone from using a cluster via MPI in
  a custom node, if part of the gradient computation is slow enough
  and sufficiently parallelizable to make the cost worth it.
  We would definitely like to hear about any such examples.

Theano has extensive `documentation <http://deeplearning.net/software/theano/extending/index.html>`_
about how to write new Ops.
2 changes: 2 additions & 0 deletions docs/source/getting_started.rst
Getting started
notebooks/getting_started.ipynb
notebooks/api_quickstart.ipynb
notebooks/variational_api_quickstart.ipynb
theano.rst
advanced_theano.rst
216 changes: 216 additions & 0 deletions docs/source/theano.rst
================
PyMC3 and Theano
================

What is Theano
==============

Theano is a package that allows us to define functions involving array
operations and linear algebra. When we define a PyMC3 model, we implicitly
build up a Theano function from the space of our parameters to
their posterior probability density up to a constant factor. We then use
symbolic manipulations of this function to also get access to its gradient.

For a thorough introduction to Theano see the
`theano docs <http://deeplearning.net/software/theano/introduction.html>`_,
but for the most part you don't need detailed knowledge about it as long
as you are not trying to define new distributions or other extensions
of PyMC3. But let's look at a simple example to get a rough
idea about how it works. Say we'd like to define the (completely
arbitrarily chosen) function

.. math::

   f\colon \mathbb{R} \times \mathbb{R}^n \times \mathbb{N}^n \to \mathbb{R}\\
   (a, x, y) \mapsto \sum_{i=1}^{n} \exp(ax_i^3 + y_i^2).
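
Before making anything symbolic, it may help to see the same function written
eagerly in plain NumPy (a non-symbolic sketch of the formula above):

```python
import numpy as np

def f(a, x, y):
    # sum_i exp(a * x_i**3 + y_i**2), evaluated directly on arrays
    return np.exp(a * x**3 + y**2).sum()

f(1.2, np.zeros(3), np.zeros(3))  # each term is exp(0) = 1, so this is 3.0
```

The symbolic version we build below computes exactly this, but defers the
evaluation until the function is compiled and called.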


First, we need to define symbolic variables for our inputs (this
is similar to e.g. SymPy's `Symbol`)::

import theano
import theano.tensor as tt
# We don't specify the dtype of our input variables, so it
# defaults to using float64 without any special config.
a = tt.scalar('a')
x = tt.vector('x')
# `tt.ivector` creates a symbolic vector of integers.
y = tt.ivector('y')

Next, we use those variables to build up a symbolic representation
of the output of our function. Note, that no computation is actually
being done at this point. We only record what operations we need to
do to compute the output::

inner = a * x**3 + y**2
out = tt.exp(inner).sum()

.. note::

In this example we use `tt.exp` to create a symbolic representation
of the exponential of `inner`. Somewhat surprisingly, it
would also have worked if we used `np.exp`. This is because numpy
gives objects it operates on a chance to define the results of
operations themselves. Theano variables do this for a large number
of operations. We usually still prefer the theano
functions instead of the numpy versions, as that makes it clear that
we are working with symbolic input instead of plain arrays.
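
The deferral mechanism can be illustrated without Theano at all. This toy
class (entirely hypothetical, not part of Theano or NumPy) uses NumPy's
`__array_ufunc__` protocol to intercept operations like `np.exp`:

```python
import numpy as np

class Symbol:
    # Toy stand-in for a symbolic variable: NumPy lets foreign objects
    # define the result of ufunc calls such as np.exp themselves.
    def __init__(self, name):
        self.name = name

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        # Instead of computing a value, record the operation symbolically.
        return Symbol('%s(%s)' % (ufunc.__name__, self.name))

x = Symbol('x')
result = np.exp(x)   # NumPy hands control to Symbol.__array_ufunc__
print(result.name)   # -> exp(x)
```

Theano variables hook into NumPy in a similar spirit, which is why `np.exp`
on a Theano variable builds a graph node instead of raising an error.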

Now we can tell Theano to build a function that does this computation.
With a typical configuration Theano generates C code, compiles it,
and creates a python function which wraps the C function::

func = theano.function([a, x, y], out)

We can call this function with actual arrays as many times as we want::

a_val = 1.2
x_vals = np.random.randn(10)
# `y` was declared as an integer vector (tt.ivector), so we
# need to pass integer values of the right dtype
y_vals = np.random.randint(0, 10, size=10).astype(np.int32)

out = func(a_val, x_vals, y_vals)

For the most part the symbolic Theano variables can be operated on
like NumPy arrays. Most NumPy functions are available in `theano.tensor`
(which is typically imported as `tt`). A lot of linear algebra operations
can be found in `tt.nlinalg` and `tt.slinalg` (the NumPy and SciPy
operations respectively). Some support for sparse matrices is available
in `theano.sparse`. For a detailed overview of available operations,
see `the theano api docs <http://deeplearning.net/software/theano/library/tensor/index.html>`_.

A notable exception where Theano variables do *not* behave like
NumPy arrays is conditional execution.

Code like this won't work as expected::

a = tt.vector('a')
if (a > 0).all():
    b = tt.sqrt(a)
else:
    b = -a

`(a > 0).all()` isn't actually a boolean as it would be in NumPy, but
still a symbolic variable. Python will convert this object to a boolean
and according to the rules for this conversion, things that aren't empty
containers or zero are converted to `True`. So the code is equivalent
to this::

a = tt.vector('a')
b = tt.sqrt(a)
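
Python's conversion rules can be checked with ordinary objects. This
plain-Python sketch involves no Theano, but shows why the `if` above always
takes the same branch:

```python
class Opaque:
    # A plain object with no __bool__ or __len__: Python's default
    # conversion rules make it truthy, just like the symbolic comparison.
    pass

assert bool(Opaque())                  # arbitrary objects convert to True
assert not bool([]) and not bool(0)    # empty containers and zero are False

# The `if` therefore always takes the first branch:
branch = 'sqrt' if Opaque() else 'negate'
print(branch)   # -> sqrt
```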

To get the desired behaviour, we can use `tt.switch`::

a = tt.vector('a')
b = tt.switch((a > 0).all(), tt.sqrt(a), -a)
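
For comparison, an eager NumPy analogue of the `tt.switch` expression can be
written with `np.where` (a sketch, not Theano code). Note that, like
`tt.switch`, `np.where` evaluates *both* branches before selecting:

```python
import numpy as np

def sqrt_or_negate(a):
    # Both branches are computed, then one is chosen by the scalar
    # condition; we silence the warning for sqrt of negative numbers.
    with np.errstate(invalid='ignore'):
        return np.where((a > 0).all(), np.sqrt(a), -a)

print(sqrt_or_negate(np.array([1., 4.])))    # [1. 2.]
print(sqrt_or_negate(np.array([-1., 4.])))   # [ 1. -4.]
```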

Indexing also works similarly to NumPy::

a = tt.vector('a')
# Access the element at index 10. This will fail when a function built
# from this expression is executed with an array that is too short.
b = a[10]

# Extract a subvector
b = a[[1, 2, 10]]

Changing elements of an array is possible using `tt.set_subtensor`::

a = tt.vector('a')
b = tt.set_subtensor(a[:10], 1)

# is roughly equivalent to this (although theano avoids
# the copy if `a` isn't used anymore)
a = np.random.randn(10)
b = a.copy()
b[:10] = 1

How PyMC3 uses Theano
=====================

Now that we have a basic understanding of Theano we can look at what
happens if we define a PyMC3 model. Let's look at a simple example::

true_mu = 0.1
data = true_mu + np.random.randn(50)

with pm.Model() as model:
    mu = pm.Normal('mu', mu=0, sd=1)
    y = pm.Normal('y', mu=mu, sd=1, observed=data)

In this model we define two variables: `mu` and `y`. The first is
a free variable that we want to infer, the second is an observed
variable. To sample from the posterior we need to build the function

.. math::

   \log P(μ|y) + C = \log P(y|μ) + \log P(μ) =: \text{logp}(μ)

where, writing :math:`N(x|μ,σ^2)` for the normal density,

.. math::

   \text{logp}\colon \mathbb{R} \to \mathbb{R}\\
   μ \mapsto \log N(μ|0, 1) + \log N(y|μ, 1).
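
To make the target concrete, here is an eager NumPy sketch of this logp
function (for illustration only; PyMC3 builds a symbolic Theano version of
it rather than evaluating it like this):

```python
import numpy as np

def norm_logpdf(x, mu, sd):
    # log density of a normal distribution N(x | mu, sd**2)
    return -0.5 * np.log(2 * np.pi * sd**2) - (x - mu)**2 / (2 * sd**2)

def logp(mu, data):
    # log N(mu | 0, 1) + sum_i log N(y_i | mu, 1)
    return norm_logpdf(mu, 0.0, 1.0) + norm_logpdf(data, mu, 1.0).sum()

data = np.array([0.1, -0.3, 0.5])
print(logp(0.1, data))
```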


To build that function we need to keep track of two things: The parameter
space (the *free variables*) and the logp function. For each free variable
we generate a Theano variable. And for each variable (observed or otherwise)
we add a term to the global logp. In the background something similar to
this is happening::

# For illustration only, those functions don't actually exist
# in exactly this way!
model = pm.Model()

mu = tt.scalar('mu')
model.add_free_variable(mu)
model.add_logp_term(pm.Normal.dist(0, 1).logp(mu))

model.add_logp_term(pm.Normal.dist(mu, 1).logp(data))

So calling `pm.Normal()` modifies the model: it changes the logp function
of the model. If the `observed` keyword isn't set, it also creates a new
free variable. In contrast, `pm.Normal.dist()` doesn't care about the model;
it just creates an object that represents the normal distribution. Calling
`logp` on this object creates a Theano variable for the log probability
or log probability density of the distribution, but again without changing
the model in any way.
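
To make the bookkeeping concrete, here is a toy model class in plain Python
(entirely hypothetical, not PyMC3's actual implementation) that accumulates
logp terms as functions of the free variables:

```python
class Model:
    # Toy illustration of a model that collects free variables
    # and logp terms, then sums the terms to evaluate logp.
    def __init__(self):
        self.free_vars = []
        self.logp_terms = []

    def add_free_variable(self, name):
        self.free_vars.append(name)

    def add_logp_term(self, term):
        # `term` is a callable of the free variables
        self.logp_terms.append(term)

    def logp(self, **values):
        return sum(term(**values) for term in self.logp_terms)

model = Model()
model.add_free_variable('mu')
model.add_logp_term(lambda mu: -0.5 * mu**2)           # prior piece, up to a constant
model.add_logp_term(lambda mu: -0.5 * (1.0 - mu)**2)   # likelihood piece for one datum
print(model.logp(mu=0.0))   # -0.5
```

In PyMC3 the terms are symbolic Theano expressions instead of Python
callables, so the whole sum can be compiled and differentiated at once.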

Continuous variables with support only on a subset of the real numbers
are treated a bit differently. We create a transformed variable that
has support on all reals, and recover the original variable as a
deterministic transformation of it. For example::

with pm.Model() as model:
    mu = pm.Normal('mu', 0, 1)
    sd = pm.HalfNormal('sd', 1)
    y = pm.Normal('y', mu=mu, sd=sd, observed=data)

is roughly equivalent to this::

# For illustration only, not real code!
model = pm.Model()
mu = tt.scalar('mu')
model.add_free_variable(mu)
model.add_logp_term(pm.Normal.dist(0, 1).logp(mu))

sd_log__ = tt.scalar('sd_log__')
model.add_free_variable(sd_log__)
model.add_logp_term(corrected_logp_half_normal(sd_log__))

sd = tt.exp(sd_log__)
model.add_deterministic_variable(sd)

model.add_logp_term(pm.Normal.dist(mu, sd).logp(data))
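
The `corrected_logp_half_normal` in the sketch above hides a
change-of-variables step. A possible NumPy version (an assumption about what
such a helper does, not PyMC3's code) adds the log-Jacobian of the transform:

```python
import numpy as np

def halfnormal_logp(sd):
    # log density of a HalfNormal(1) variable, valid for sd > 0
    return 0.5 * np.log(2.0 / np.pi) - 0.5 * sd**2

def corrected_logp_half_normal(sd_log):
    # Change of variables sd = exp(sd_log): add the log-Jacobian
    # log |d sd / d sd_log| = log(exp(sd_log)) = sd_log.
    return halfnormal_logp(np.exp(sd_log)) + sd_log

# The corrected density integrates to one over the whole real line:
grid = np.linspace(-10, 10, 4001)
dens = np.exp(corrected_logp_half_normal(grid))
print(dens.sum() * (grid[1] - grid[0]))  # close to 1.0
```

Without the Jacobian term, the sampler would explore the wrong distribution
in the unconstrained space.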

The return values of the variable constructors are subclasses
of theano variables, so when we define a variable we can use any
theano operation on them::

design_matrix = np.array([[...]])
with pm.Model() as model:
    # beta is a tt.dvector; it needs one entry per column
    # of the design matrix
    beta = pm.Normal('beta', 0, 1, shape=design_matrix.shape[1])
    predict = tt.dot(design_matrix, beta)
    sd = pm.HalfCauchy('sd', beta=2.5)
    pm.Normal('y', mu=predict, sd=sd, observed=data)
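
Evaluated eagerly in NumPy, the linear predictor in that model is just a
matrix-vector product (with a made-up design matrix for illustration):

```python
import numpy as np

# Hypothetical 3x2 design matrix: an intercept column plus one predictor
design_matrix = np.array([[1., 0.],
                          [1., 1.],
                          [1., 2.]])
beta = np.array([0.5, 2.0])

# Same expression as tt.dot(design_matrix, beta), computed directly
predict = np.dot(design_matrix, beta)
print(predict)   # [0.5 2.5 4.5]
```

In the model above the same expression is symbolic, so PyMC3 can propagate
gradients through it during sampling.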