diff --git a/docs/source/advanced_theano.rst b/docs/source/advanced_theano.rst
new file mode 100644
index 0000000000..36f6b97600
--- /dev/null
+++ b/docs/source/advanced_theano.rst
@@ -0,0 +1,101 @@
+=================================
+Advanced usage of Theano in PyMC3
+=================================
+
+Using shared variables
+======================
+
+Shared variables allow us to use values in theano functions that are
+not considered an input to the function, but can still be changed
+later. They are very similar to global variables in many ways::
+
+    a = tt.scalar('a')
+    # Create a new shared variable with initial value of 0.1
+    b = theano.shared(0.1)
+    func = theano.function([a], a * b)
+    assert func(2.) == 0.2
+
+    b.set_value(10.)
+    assert func(2.) == 20.
+
+Shared variables can also contain arrays, and are allowed to change
+their shape as long as the number of dimensions stays the same.
+
+We can use shared variables in PyMC3 to fit the same model to several
+datasets without the need to recreate the model each time (which can
+be time consuming if the number of datasets is large)::
+
+    # We generate 10 datasets
+    true_mu = [np.random.randn() for _ in range(10)]
+    observed_data = [mu + np.random.randn(20) for mu in true_mu]
+
+    data = theano.shared(observed_data[0])
+    with pm.Model() as model:
+        mu = pm.Normal('mu', 0, 10)
+        pm.Normal('y', mu=mu, sd=1, observed=data)
+
+    # Generate one trace for each dataset
+    traces = []
+    for data_vals in observed_data:
+        # Switch out the observed dataset
+        data.set_value(data_vals)
+        with model:
+            traces.append(pm.sample())
+
+We can also sometimes use shared variables to work around limitations
+in the current PyMC3 API. 
A common task in machine learning is to predict
+values for unseen data, and one way to achieve this is to use a shared
+variable for our observations::
+
+    x = np.random.randn(100)
+    y = x > 0
+
+    x_shared = theano.shared(x)
+
+    with pm.Model() as model:
+        coeff = pm.Normal('coeff', mu=0, sd=1)
+        logistic = pm.math.sigmoid(coeff * x_shared)
+        pm.Bernoulli('obs', p=logistic, observed=y)
+
+        # fit the model
+        trace = pm.sample()
+
+        # Switch out the observations and use `sample_ppc` to predict
+        x_shared.set_value(np.array([-1., 0., 1.]))
+        post_pred = pm.sample_ppc(trace, samples=500)
+
+However, due to the way we handle shapes at the moment, it is
+not possible to change the shape of a shared variable if that would
+also change the shape of one of the random variables in the model.
+
+
+Writing custom Theano Ops
+=========================
+
+While Theano includes a wide range of operations, there are cases where
+it makes sense to write your own. But before doing this it is a good
+idea to think hard about whether it is actually necessary. Especially
+if you want to use algorithms that need gradient information — this
+includes NUTS and all variational methods, and you probably *should*
+want to use those — this is often quite a bit of work and also requires
+some math and debugging skills for the gradients.
+
+Good reasons for defining a custom Op might be the following:
+
+- You require an operation that is not available in Theano and can't
+  be built up out of existing Theano operations. This could for example
+  include models where you need to solve differential equations or
+  integrals, or find a root or minimum of a function that depends
+  on your parameters.
+- You want to connect your PyMC3 model to some existing external code.
+- Even after carefully considering different parametrizations and a lot
+  of profiling, your model is still too slow, but you know of a faster
+  way to compute the gradient than what theano is doing. This faster
+  way might be anything from clever maths to using more hardware. 
There is nothing stopping anyone from using a cluster via MPI in
+  a custom Op, if a part of the gradient computation is slow enough
+  and sufficiently parallelizable to make the cost worth it.
+  We would definitely like to hear about any such examples.
+
+Theano has extensive `documentation `_
+about how to write new Ops.
diff --git a/docs/source/getting_started.rst b/docs/source/getting_started.rst
index 04d0836bb8..28bfc29169 100644
--- a/docs/source/getting_started.rst
+++ b/docs/source/getting_started.rst
@@ -8,3 +8,5 @@ Getting started
    notebooks/getting_started.ipynb
    notebooks/api_quickstart.ipynb
    notebooks/variational_api_quickstart.ipynb
+   theano.rst
+   advanced_theano.rst
diff --git a/docs/source/theano.rst b/docs/source/theano.rst
new file mode 100644
index 0000000000..096e2d229e
--- /dev/null
+++ b/docs/source/theano.rst
@@ -0,0 +1,216 @@
+================
+PyMC3 and Theano
+================
+
+What is Theano
+==============
+
+Theano is a package that allows us to define functions involving array
+operations and linear algebra. When we define a PyMC3 model, we implicitly
+build up a Theano function from the space of our parameters to
+their posterior probability density up to a constant factor. We then use
+symbolic manipulations of this function to also get access to its gradient.
+
+For a thorough introduction to Theano see the
+`theano docs `_,
+but for the most part you don't need detailed knowledge about it as long
+as you are not trying to define new distributions or other extensions
+of PyMC3. But let's look at a simple example to get a rough
+idea about how it works. Say we'd like to define the (completely
+arbitrarily chosen) function
+
+.. math::
+
+   f\colon \mathbb{R} \times \mathbb{R}^n \times \mathbb{N}^n \to \mathbb{R}\\
+   (a, x, y) \mapsto \sum_{i=0}^{n-1} \exp(ax_i^3 + y_i^2). 
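As a point of reference, a plain NumPy version of this function could look like the following sketch (the function name `f` and the example values are our own, not taken from Theano or PyMC3):

```python
import numpy as np

def f(a, x, y):
    # Elementwise exp(a * x_i**3 + y_i**2), summed over all n elements.
    return np.exp(a * x**3 + y**2).sum()

# f(2, [1, 0], [0, 1]) = exp(2*1 + 0) + exp(0 + 1) = e**2 + e
value = f(2.0, np.array([1.0, 0.0]), np.array([0, 1]))
```

The Theano version built below records the same computation symbolically instead of executing it immediately.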
+
+
+First, we need to define symbolic variables for our inputs (this
+is similar to e.g. SymPy's `Symbol`)::
+
+    import theano
+    import theano.tensor as tt
+    # We don't specify the dtype of our input variables, so it
+    # defaults to using float64 without any special config.
+    a = tt.scalar('a')
+    x = tt.vector('x')
+    # `tt.ivector` creates a symbolic vector of integers.
+    y = tt.ivector('y')
+
+Next, we use those variables to build up a symbolic representation
+of the output of our function. Note that no computation is actually
+being done at this point. We only record what operations we need to
+do to compute the output::
+
+    inner = a * x**3 + y**2
+    out = tt.exp(inner).sum()
+
+.. note::
+
+    In this example we use `tt.exp` to create a symbolic representation
+    of the exponential of `inner`. Somewhat surprisingly, it
+    would also have worked if we used `np.exp`. This is because numpy
+    gives objects it operates on a chance to define the results of
+    operations themselves. Theano variables do this for a large number
+    of operations. We usually still prefer the theano
+    functions instead of the numpy versions, as that makes it clear that
+    we are working with symbolic input instead of plain arrays.
+
+Now we can tell Theano to build a function that does this computation.
+With a typical configuration Theano generates C code, compiles it,
+and creates a python function which wraps the C function::
+
+    func = theano.function([a, x, y], [out])
+
+We can call this function with actual arrays as many times as we want::
+
+    a_val = 1.2
+    x_vals = np.random.randn(10)
+    # `y` is an `ivector`, so we have to pass in integers.
+    y_vals = np.random.randint(0, 10, size=10).astype('int32')
+
+    out_val, = func(a_val, x_vals, y_vals)
+
+For the most part the symbolic Theano variables can be operated on
+like NumPy arrays. Most NumPy functions are available in `theano.tensor`
+(which is typically imported as `tt`). A lot of linear algebra operations
+can be found in `tt.nlinalg` and `tt.slinalg` (the NumPy and SciPy
+operations respectively). 
Some support for sparse matrices is available
+in `theano.sparse`. For a detailed overview of available operations,
+see `the theano api docs `_.
+
+A notable exception where theano variables do *not* behave like
+NumPy arrays is code involving conditional execution.
+
+Code like this won't work as expected::
+
+    a = tt.vector('a')
+    if (a > 0).all():
+        b = tt.sqrt(a)
+    else:
+        b = -a
+
+`(a > 0).all()` isn't actually a boolean as it would be in NumPy, but
+still a symbolic variable. Python will convert this object to a boolean
+and according to the rules for this conversion, things that aren't empty
+containers or zero are converted to `True`. So the code is equivalent
+to this::
+
+    a = tt.vector('a')
+    b = tt.sqrt(a)
+
+To get the desired behaviour, we can use `tt.switch`::
+
+    a = tt.vector('a')
+    b = tt.switch((a > 0).all(), tt.sqrt(a), -a)
+
+Indexing also works similarly to NumPy::
+
+    a = tt.vector('a')
+    # Access the element at index 10. This will fail when a function
+    # built from this expression is executed with an array that is
+    # too short.
+    b = a[10]
+
+    # Extract a subvector
+    b = a[[1, 2, 10]]
+
+Changing elements of an array is possible using `tt.set_subtensor`::
+
+    a = tt.vector('a')
+    b = tt.set_subtensor(a[:10], 1)
+
+    # This is roughly equivalent to the following NumPy code
+    # (although theano avoids the copy if `a` isn't used anymore)
+    a = np.random.randn(10)
+    b = a.copy()
+    b[:10] = 1
+
+How PyMC3 uses Theano
+=====================
+
+Now that we have a basic understanding of Theano we can look at what
+happens if we define a PyMC3 model. Let's look at a simple example::
+
+    true_mu = 0.1
+    data = true_mu + np.random.randn(50)
+
+    with pm.Model() as model:
+        mu = pm.Normal('mu', mu=0, sd=1)
+        y = pm.Normal('y', mu=mu, sd=1, observed=data)
+
+In this model we define two variables: `mu` and `y`. The first is
+a free variable that we want to infer, the second is an observed
+variable. To sample from the posterior we need to build the function
+
+.. 
math::
+
+   \log P(\mu|y) + C = \log P(y|\mu) + \log P(\mu) =: \text{logp}(\mu)
+
+where, writing :math:`N(x|\mu, \sigma^2)` for the normal density, the
+logp function for our model is
+
+.. math::
+
+   \text{logp}\colon \mathbb{R} \to \mathbb{R}\\
+   \mu \mapsto \log N(\mu|0, 1) + \sum_i \log N(y_i|\mu, 1).
+
+To build that function we need to keep track of two things: the parameter
+space (the *free variables*) and the logp function. For each free variable
+we generate a Theano variable. And for each variable (observed or otherwise)
+we add a term to the global logp. In the background something similar to
+this is happening::
+
+    # For illustration only, those functions don't actually exist
+    # in exactly this way!
+    model = pm.Model()
+
+    mu = tt.scalar('mu')
+    model.add_free_variable(mu)
+    model.add_logp_term(pm.Normal.dist(0, 1).logp(mu))
+
+    model.add_logp_term(pm.Normal.dist(mu, 1).logp(data))
+
+So calling `pm.Normal()` modifies the model: it changes the logp function
+of the model. If the `observed` keyword isn't set, it also creates a new
+free variable. In contrast, `pm.Normal.dist()` doesn't care about the model;
+it just creates an object that represents the normal distribution. Calling
+`logp` on this object creates a theano variable for the log probability
+or log probability density of the distribution, but again without changing
+the model in any way.
+
+Continuous variables with support only on a subset of the real numbers
+are treated a bit differently. We create a transformed variable with
+support on all of the reals, and recover the original variable as a
+deterministic transformation of it. For example::
+
+    with pm.Model() as model:
+        mu = pm.Normal('mu', 0, 1)
+        sd = pm.HalfNormal('sd', 1)
+        y = pm.Normal('y', mu=mu, sd=sd, observed=data)
+
+is roughly equivalent to this::
+
+    # For illustration only, not real code! 
+    model = pm.Model()
+    mu = tt.scalar('mu')
+    model.add_free_variable(mu)
+    model.add_logp_term(pm.Normal.dist(0, 1).logp(mu))
+
+    sd_log__ = tt.scalar('sd_log__')
+    model.add_free_variable(sd_log__)
+    model.add_logp_term(corrected_logp_half_normal(sd_log__))
+
+    sd = tt.exp(sd_log__)
+    model.add_deterministic_variable(sd)
+
+    model.add_logp_term(pm.Normal.dist(mu, sd).logp(data))
+
+The return values of the variable constructors are subclasses
+of theano variables, so when we define a variable we can use any
+theano operation on them::
+
+    design_matrix = np.array([[...]])
+    with pm.Model() as model:
+        # beta is a tt.dvector with one coefficient per
+        # column of the design matrix
+        beta = pm.Normal('beta', 0, 1, shape=design_matrix.shape[1])
+        predict = tt.dot(design_matrix, beta)
+        sd = pm.HalfCauchy('sd', beta=2.5)
+        pm.Normal('y', mu=predict, sd=sd, observed=data)
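To make the logp bookkeeping described above concrete, here is a plain-NumPy sketch of the logp function that PyMC3 assembles for the first model in this section (`mu` with a `N(0, 1)` prior and `y ~ N(mu, 1)` observed). The helper names `norm_logpdf` and `model_logp` are our own and not part of the PyMC3 API:

```python
import numpy as np

def norm_logpdf(x, mu, sd):
    # Log density of the normal distribution N(x | mu, sd**2).
    return -0.5 * np.log(2 * np.pi) - np.log(sd) - 0.5 * ((x - mu) / sd) ** 2

def model_logp(mu, data):
    # One logp term for the free variable `mu` (its N(0, 1) prior) and
    # one term per observed data point, summed into a single scalar;
    # this mirrors the `add_logp_term` calls in the sketch above.
    return norm_logpdf(mu, 0.0, 1.0) + norm_logpdf(data, mu, 1.0).sum()

data = np.array([0.5, -0.3, 0.1])
lp = model_logp(0.1, data)
```

The real model builds exactly this kind of expression symbolically in Theano, which is what makes the gradient of the logp available to NUTS and the variational methods.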