From 8f458d1c0db49005caf1ef1eeabde529db679801 Mon Sep 17 00:00:00 2001 From: Adrian Seyboldt Date: Thu, 6 Jul 2017 15:36:20 +0200 Subject: [PATCH 1/3] Add draft of theano introduction --- docs/source/theano.rst | 157 +++++++++++++++++++++++++++++++++++++++++ 1 file changed, 157 insertions(+) create mode 100644 docs/source/theano.rst diff --git a/docs/source/theano.rst b/docs/source/theano.rst new file mode 100644 index 0000000000..a05e9e8bb5 --- /dev/null +++ b/docs/source/theano.rst @@ -0,0 +1,157 @@ +# pymc3 and theano + +## What is theano + +Theano is a package that allows us to define functions involving array +operations and linear algebra. When we define a pymc3 model, we implicitly +build up a theano function from the space of our parameters to +their posterior probability density up to a constant factor. We then use +symbolic manipulations of this function to also get access to its gradient. + +For a thorough introduction into theano see the thenao docs (link), +but for the most part you don't need many details about it as long +as you are not trying to define new distributions or other extensions +of pymc3. But let's look at a simple example to give you a rough +idea about how it works. Say, we'd like to define the (completely +arbitrarily chosen) function +$$ + f\colon \mathbb{R} \times \mathbb{R}^n \mathbb{N}^n \to \mathbb{R}\\ + (a, x, y) \mapsto \sum_{i=0}{n} \exp(ax_i^3 + y_i^2). +$$ + +First, we need to define symbolic variables for our inputs (this +is similar to eg sympy's `Symbol`):: + + import theano + import theano.tensor as tt + # We don't specify the dtype of our input variables, so it + # defaults to using float64 without any special config. + a = tt.scalar('a') + x = tt.vector('x') + y = tt.vector('y') + +Next, we use those variables to build up a symbolic representation +of the output of our function. Note, that no computation is actually +being done at this point. 
We only record what operations we need to +do to compute the output:: + + inner = a * x**3 + y**2 + out = tt.exp(inner).sum() + +.. note:: + + In this example we use `tt.exp` to create a symbolic representation + of the exponential of `inner`. Somewhat surprisingly, it + would also have worked if we used `np.exp`. This is because numpy + gives objects it operates on a chance to define the results of + operations themselves. Theano variables do this for a large number + of operations. We usually still prefer it if people use the theano + function instead of the numpy versions, as that makes it clear that + we are working with symbolic input instead of plain arrays. + +Now we can tell theano to build a function that does this computation. +With a typical configuration theano generates C code, compiles it, +and creates a python function which wrapps the C function:: + + func = theano.function([a, x, y], [out]) + +We can call this function with actual arrays as many times as we want:: + + a_val = 1.2 + x_vals = np.random.randn(10) + y_vals = np.random.randn(10) + + out = func(a_val, x_vals, y_vals) + +For the most part the symbolic theano variables can be operated on +like numpy arrays. Most numpy functions are available in `theano.tensor` +(which is typically imported as `tt`). A lot of linear algebra operations +can be found in `tt.nlinalg` and `tt.slinalg` (the numpy and scipy +operations respectively). Some support for sparse matrices is available +in `theano.sparse`. + +A notable exception where theano variables *don't* behave like +numpy arrays are operations involving conditional execution: + +Code like this won't work as expected if used on theano variables:: + + a = tt.vector('a') + if (a > 0).all(): + b = tt.sqrt(a) + else: + b = -a + +`(a > 0).all()` isn't actually a boolean as it would be in numpy, but +still a symbolic variable. 
Python will convert this object to a boolean +and according to the rules for this conversions, things that aren't empty +containers or zero are converted to True. So the code is equivalent +to this:: + + a = tt.vector('a') + b = tt.sqrt(a) + +To get the desired behaviour, we can use `tt.switch`:: + + a = tt.vector('a') + b = tt.switch((a > 0).all(), tt.sqrt(a), -a) + +Indexing also works similar to numpy:: + + a = tt.vector('a') + # Access the 10th element. This will fail when a function build + # from this expression is executed with an array that is too short. + b = a[10] + + # Extract a subvector + b = a[[1, 2, 10]] + +Changing element of an array is possible using `tt.set_subtensor`:: + + a = tt.vector('a') + b = tt.set_subtensor(a[:10], 1) + + # is roughly equivalent this (although theano is usually able + # to avoid the copy) + a = np.random.randn(10) + b = a.copy() + b[:10] = 1 + +## How pymc3 uses theano + +Now that we have a basic understanding of theano we can look at what +happens if we define a pymc3 model. Let's look at a simple example:: + + true_mu = 0.1 + data = true_mu + np.random.randn(50) + + with pm.Model() as model: + mu = pm.Normal('mu', mu=0, sd=1) + y = pm.Normal('y', mu=mu, sd=1, observed=data) + +In this model we define two variables: `mu` and `y`. The first is +a free variable that we want to infer, the second is an observed +variable. To sample from the posterior we need to build the function +$$ + \log P(μ|y) + C = \log P(y|μ) + \log P(μ) =: f(μ)\\ +$$ +where with the normal likelihood $N(x|μ,σ^2)$ +$$ + f\colon \mathbb{R} \to \mathbb{R}\\ + f(μ) = \log N(μ|0, 1) + \log N(y|0, 1), +$$ + +To build that function we need to keep track of two things: The parameter +space (the *free variables*) and the logp function. For each free variable +we generate a theano variable. And for each variable (observed or otherwise) +we add a term to the global logp. 
In the background something similar to +this is happening:: + + # For illustration only, those functions don't actually exist + # in exactly this way! + model = pm.Model() + + mu = tt.scalar('mu') + model.add_free_variable(mu) + model.add_logp_term(normal_logp(μ| 0, 1)) + + model.add_logp_term(normal_logp(data| mu, 1)) From b85e348d4c385342e13ac89455d1811a5312a169 Mon Sep 17 00:00:00 2001 From: Adrian Seyboldt Date: Thu, 6 Jul 2017 16:17:08 +0200 Subject: [PATCH 2/3] Fix formatting in theano intro --- docs/source/getting_started.rst | 1 + docs/source/theano.rst | 84 +++++++++++++++++++-------------- 2 files changed, 49 insertions(+), 36 deletions(-) diff --git a/docs/source/getting_started.rst b/docs/source/getting_started.rst index 04d0836bb8..096a80f898 100644 --- a/docs/source/getting_started.rst +++ b/docs/source/getting_started.rst @@ -8,3 +8,4 @@ Getting started notebooks/getting_started.ipynb notebooks/api_quickstart.ipynb notebooks/variational_api_quickstart.ipynb + theano diff --git a/docs/source/theano.rst b/docs/source/theano.rst index a05e9e8bb5..07e6e49a53 100644 --- a/docs/source/theano.rst +++ b/docs/source/theano.rst @@ -1,26 +1,32 @@ -# pymc3 and theano +================ +PyMC3 and Theano +================ -## What is theano +What is Theano +============== Theano is a package that allows us to define functions involving array -operations and linear algebra. When we define a pymc3 model, we implicitly -build up a theano function from the space of our parameters to +operations and linear algebra. When we define a PyMC3 model, we implicitly +build up a Theano function from the space of our parameters to their posterior probability density up to a constant factor. We then use symbolic manipulations of this function to also get access to its gradient. 
-For a thorough introduction into theano see the thenao docs (link), +For a thorough introduction to Theano see the +`theano docs `_, but for the most part you don't need many details about it as long as you are not trying to define new distributions or other extensions -of pymc3. But let's look at a simple example to give you a rough +of PyMC3. But let's look at a simple example to give you a rough idea about how it works. Say, we'd like to define the (completely arbitrarily chosen) function -$$ - f\colon \mathbb{R} \times \mathbb{R}^n \mathbb{N}^n \to \mathbb{R}\\ - (a, x, y) \mapsto \sum_{i=0}{n} \exp(ax_i^3 + y_i^2). -$$ + +.. math:: + + f\colon \mathbb{R} \times \mathbb{R}^n \times \mathbb{N}^n \to \mathbb{R}\\ + (a, x, y) \mapsto \sum_{i=0}^{n} \exp(ax_i^3 + y_i^2). + First, we need to define symbolic variables for our inputs (this -is similar to eg sympy's `Symbol`):: +is similar to eg SymPy's `Symbol`):: import theano import theano.tensor as tt @@ -28,7 +34,8 @@ is similar to eg sympy's `Symbol`):: # defaults to using float64 without any special config. a = tt.scalar('a') x = tt.vector('x') - y = tt.vector('y') + # `tt.ivector` creates a symbolic vector of integers. + y = tt.ivector('y') Next, we use those variables to build up a symbolic representation of the output of our function. Note, that no computation is actually @@ -49,9 +56,9 @@ do to compute the output:: function instead of the numpy versions, as that makes it clear that we are working with symbolic input instead of plain arrays. -Now we can tell theano to build a function that does this computation. -With a typical configuration theano generates C code, compiles it, -and creates a python function which wrapps the C function:: +Now we can tell Theano to build a function that does this computation. 
+With a typical configuration Theano generates C code, compiles it, +and creates a python function which wraps the C function:: func = theano.function([a, x, y], [out]) @@ -63,17 +70,18 @@ We can call this function with actual arrays as many times as we want:: out = func(a_val, x_vals, y_vals) -For the most part the symbolic theano variables can be operated on -like numpy arrays. Most numpy functions are available in `theano.tensor` +For the most part the symbolic Theano variables can be operated on +like NumPy arrays. Most NumPy functions are available in `theano.tensor` (which is typically imported as `tt`). A lot of linear algebra operations -can be found in `tt.nlinalg` and `tt.slinalg` (the numpy and scipy +can be found in `tt.nlinalg` and `tt.slinalg` (the NumPy and SciPy operations respectively). Some support for sparse matrices is available -in `theano.sparse`. +in `theano.sparse`. For a detailed overview of available operations, +see `the theano api docs `_. A notable exception where theano variables *don't* behave like -numpy arrays are operations involving conditional execution: +NumPy arrays are operations involving conditional execution: -Code like this won't work as expected if used on theano variables:: +Code like this won't work as expected if used on Theano variables:: a = tt.vector('a') if (a > 0).all(): @@ -81,10 +89,10 @@ Code like this won't work as expected if used on theano variables:: else: b = -a -`(a > 0).all()` isn't actually a boolean as it would be in numpy, but +`(a > 0).all()` isn't actually a boolean as it would be in NumPy, but still a symbolic variable. Python will convert this object to a boolean and according to the rules for this conversions, things that aren't empty -containers or zero are converted to True. So the code is equivalent +containers or zero are converted to `True`. 
So the code is equivalent to this:: a = tt.vector('a') @@ -95,7 +103,7 @@ To get the desired behaviour, we can use `tt.switch`:: a = tt.vector('a') b = tt.switch((a > 0).all(), tt.sqrt(a), -a) -Indexing also works similar to numpy:: +Indexing also works similar to NumPy:: a = tt.vector('a') # Access the 10th element. This will fail when a function build @@ -110,16 +118,17 @@ Changing element of an array is possible using `tt.set_subtensor`:: a = tt.vector('a') b = tt.set_subtensor(a[:10], 1) - # is roughly equivalent this (although theano is usually able - # to avoid the copy) + # is roughly equivalent to this (although theano is usually + # able to avoid the copy) a = np.random.randn(10) b = a.copy() b[:10] = 1 -## How pymc3 uses theano +How PyMC3 uses Theano +===================== -Now that we have a basic understanding of theano we can look at what -happens if we define a pymc3 model. Let's look at a simple example:: +Now that we have a basic understanding of Theano we can look at what +happens if we define a PyMC3 model. Let's look at a simple example:: true_mu = 0.1 data = true_mu + np.random.randn(50) @@ -131,18 +140,21 @@ happens if we define a pymc3 model. Let's look at a simple example:: In this model we define two variables: `mu` and `y`. The first is a free variable that we want to infer, the second is an observed variable. To sample from the posterior we need to build the function -$$ - \log P(μ|y) + C = \log P(y|μ) + \log P(μ) =: f(μ)\\ -$$ -where with the normal likelihood $N(x|μ,σ^2)$ -$$ + +.. math:: + + \log P(μ|y) + C = \log P(y|μ) + \log P(μ) =: f(μ)\\ + +where with the normal likelihood :math:`N(x|μ,σ^2)` + +.. math:: + f\colon \mathbb{R} \to \mathbb{R}\\ f(μ) = \log N(μ|0, 1) + \log N(y|0, 1), -$$ To build that function we need to keep track of two things: The parameter space (the *free variables*) and the logp function. For each free variable -we generate a theano variable. 
And for each variable (observed or otherwise) +we generate a Theano variable. And for each variable (observed or otherwise) we add a term to the global logp. In the background something similar to this is happening:: From 70433417aafa7617d513bc1878ce12dc85fde6c4 Mon Sep 17 00:00:00 2001 From: Adrian Seyboldt Date: Thu, 6 Jul 2017 20:08:46 +0200 Subject: [PATCH 3/3] Update doc on Theano --- docs/source/advanced_theano.rst | 101 ++++++++++++++++++++++++++++++++ docs/source/getting_started.rst | 3 +- docs/source/theano.rst | 79 ++++++++++++++++++++----- 3 files changed, 166 insertions(+), 17 deletions(-) create mode 100644 docs/source/advanced_theano.rst diff --git a/docs/source/advanced_theano.rst b/docs/source/advanced_theano.rst new file mode 100644 index 0000000000..36f6b97600 --- /dev/null +++ b/docs/source/advanced_theano.rst @@ -0,0 +1,101 @@ +================================= +Advanced usage of Theano in PyMC3 +================================= + +Using shared variables +====================== + +Shared variables allow us to use values in theano functions that are +not considered an input to the function, but can still be changed +later. They are very similar to global variables in may ways.:: + + a = tt.scalar('a') + # Create a new shared variable with initial value of 0.1 + b = theano.shared(0.1) + func = theano.function([a], a * b) + assert func(2.) == 0.2 + + b.set_value(10.) + assert func(2.) == 20. + +Shared variables can also contain arrays, and are allowed to change +their shape as long as the number of dimensions stays the same. 
+
+We can use shared variables in PyMC3 to fit the same model to several
+datasets without the need to recreate the model each time (which can
+be time consuming if the number of datasets is large)::
+
+    # We generate 10 datasets
+    true_mu = [np.random.randn() for _ in range(10)]
+    observed_data = [mu + np.random.randn(20) for mu in true_mu]
+
+    data = theano.shared(observed_data[0])
+    with pm.Model() as model:
+        mu = pm.Normal('mu', 0, 10)
+        pm.Normal('y', mu=mu, sd=1, observed=data)
+
+    # Generate one trace for each dataset
+    traces = []
+    for data_vals in observed_data:
+        # Switch out the observed dataset
+        data.set_value(data_vals)
+        with model:
+            traces.append(pm.sample())
+
+We can also sometimes use shared variables to work around limitations
+in the current PyMC3 API. A common task in machine learning is to predict
+values for unseen data, and one way to achieve this is to use a shared
+variable for our observations::
+
+    x = np.random.randn(100)
+    y = x > 0
+
+    x_shared = theano.shared(x)
+
+    with pm.Model() as model:
+        coeff = pm.Normal('x', mu=0, sd=1)
+        logistic = pm.math.sigmoid(coeff * x_shared)
+        pm.Bernoulli('obs', p=logistic, observed=y)
+
+        # fit the model
+        trace = pm.sample()
+
+    # Switch out the observations and use `sample_ppc` to predict
+    x_shared.set_value([-1, 0, 1.])
+    post_pred = pm.sample_ppc(trace, samples=500)
+
+However, due to the way we handle shapes at the moment, it is
+not possible to change the shape of a shared variable if that would
+also change the shape of one of the random variables in the model.
+
+
+Writing custom Theano Ops
+=========================
+
+While Theano includes a wide range of operations, there are cases where
+it makes sense to write your own. But before doing this it is a good
+idea to think hard about whether it is actually necessary. 
Especially if you want
+to use algorithms that need gradient information (this includes NUTS and
+all variational methods, and you probably *should* want to use those),
+this is often quite a bit of work and also requires some math and
+debugging skills for the gradients.
+
+Good reasons for defining a custom Op might be the following:
+
+- You require an operation that is not available in Theano and can't
+  be built up out of existing Theano operations. This could for example
+  include models where you need to solve differential equations or
+  integrals, or find a root or minimum of a function that depends
+  on your parameters.
+- You want to connect your PyMC3 model to some existing external code.
+- After carefully considering different parametrizations and a lot
+  of profiling, your model is still too slow, but you know of a faster
+  way to compute the gradient than what theano is doing. This faster
+  way might be anything from clever maths to using more hardware.
+  There is nothing stopping anyone from using a cluster via MPI in
+  a custom Op, if a part of the gradient computation is slow enough
+  and sufficiently parallelizable to make the cost worth it.
+  We would definitely like to hear about any such examples.
+
+Theano has extensive `documentation `_
+about how to write new Ops.
diff --git a/docs/source/getting_started.rst b/docs/source/getting_started.rst
index 096a80f898..28bfc29169 100644
--- a/docs/source/getting_started.rst
+++ b/docs/source/getting_started.rst
@@ -8,4 +8,5 @@ Getting started
    notebooks/getting_started.ipynb
    notebooks/api_quickstart.ipynb
    notebooks/variational_api_quickstart.ipynb
-   theano
+   theano.rst
+   advanced_theano.rst
diff --git a/docs/source/theano.rst b/docs/source/theano.rst
index 07e6e49a53..096e2d229e 100644
--- a/docs/source/theano.rst
+++ b/docs/source/theano.rst
@@ -13,9 +13,9 @@
 symbolic manipulations of this function to also get access to its gradient. 
For a thorough introduction to Theano see the `theano docs `_, -but for the most part you don't need many details about it as long +but for the most part you don't need detailed knowledge about it as long as you are not trying to define new distributions or other extensions -of PyMC3. But let's look at a simple example to give you a rough +of PyMC3. But let's look at a simple example to get a rough idea about how it works. Say, we'd like to define the (completely arbitrarily chosen) function @@ -52,8 +52,8 @@ do to compute the output:: would also have worked if we used `np.exp`. This is because numpy gives objects it operates on a chance to define the results of operations themselves. Theano variables do this for a large number - of operations. We usually still prefer it if people use the theano - function instead of the numpy versions, as that makes it clear that + of operations. We usually still prefer the theano + functions instead of the numpy versions, as that makes it clear that we are working with symbolic input instead of plain arrays. Now we can tell Theano to build a function that does this computation. @@ -78,10 +78,10 @@ operations respectively). Some support for sparse matrices is available in `theano.sparse`. For a detailed overview of available operations, see `the theano api docs `_. -A notable exception where theano variables *don't* behave like +A notable exception where theano variables do *not* behave like NumPy arrays are operations involving conditional execution: -Code like this won't work as expected if used on Theano variables:: +Code like this won't work as expected:: a = tt.vector('a') if (a > 0).all(): @@ -91,7 +91,7 @@ Code like this won't work as expected if used on Theano variables:: `(a > 0).all()` isn't actually a boolean as it would be in NumPy, but still a symbolic variable. 
Python will convert this object to a boolean
-and according to the rules for this conversions, things that aren't empty
+and according to the rules for this conversion, things that aren't empty
 containers or zero are converted to `True`. So the code is equivalent
 to this::
 
     a = tt.vector('a')
@@ -103,7 +103,7 @@ To get the desired behaviour, we can use `tt.switch`::
     a = tt.vector('a')
     b = tt.switch((a > 0).all(), tt.sqrt(a), -a)
 
-Indexing also works similar to NumPy::
+Indexing also works similarly to NumPy::
 
     a = tt.vector('a')
     # Access the 10th element. This will fail when a function build
@@ -113,13 +113,13 @@ Indexing also works similar to NumPy::
     # Extract a subvector
     b = a[[1, 2, 10]]
 
-Changing element of an array is possible using `tt.set_subtensor`::
+Changing elements of an array is possible using `tt.set_subtensor`::
 
     a = tt.vector('a')
     b = tt.set_subtensor(a[:10], 1)
 
-    # is roughly equivalent to this (although theano is usually
-    # able to avoid the copy)
+    # is roughly equivalent to this (although theano avoids
+    # the copy if `a` isn't used anymore)
     a = np.random.randn(10)
     b = a.copy()
     b[:10] = 1
@@ -143,14 +143,14 @@ variable. To sample from the posterior we need to build the function
 
 .. math::
 
-    \log P(μ|y) + C = \log P(y|μ) + \log P(μ) =: \text{logp}(μ)\\
+    \log P(μ|y) + C = \log P(y|μ) + \log P(μ) =: \text{logp}(μ)\\
 
 where with the normal likelihood :math:`N(x|μ,σ^2)`
 
 .. math::
 
-    f\colon \mathbb{R} \to \mathbb{R}\\
-    f(μ) = \log N(μ|0, 1) + \log N(y|0, 1),
+    \text{logp}\colon \mathbb{R} \to \mathbb{R}\\
+    μ \mapsto \log N(μ|0, 1) + \log N(y|μ, 1),
 
To build that function we need to keep track of two things: The parameter
space (the *free variables*) and the logp function. 
For each free variable
-we generate a theano variable. And for each variable (observed or otherwise)
+we generate a Theano variable. And for each variable (observed or otherwise)
 we add a term to the global logp. In the background something similar to
 this is happening::
 
@@ -164,6 +164,53 @@ this is happening::
 
     mu = tt.scalar('mu')
     model.add_free_variable(mu)
-    model.add_logp_term(normal_logp(μ| 0, 1))
+    model.add_logp_term(pm.Normal.dist(0, 1).logp(mu))
 
-    model.add_logp_term(normal_logp(data| mu, 1))
+    model.add_logp_term(pm.Normal.dist(mu, 1).logp(data))
+
+So calling `pm.Normal()` modifies the model: It changes the logp function
+of the model. If the `observed` keyword isn't set it also creates a new
+free variable. In contrast, `pm.Normal.dist()` doesn't care about the model,
+it just creates an object that represents the normal distribution. Calling
+`logp` on this object creates a theano variable for the log probability
+or log probability density of the distribution, but again without changing
+the model in any way.
+
+Continuous variables with support only on a subset of the real numbers
+are treated a bit differently. We create a transformed variable
+that has support on all the reals, and recover the original variable
+by applying the inverse of the transformation. For
+example::
+
+    with pm.Model() as model:
+        mu = pm.Normal('mu', 0, 1)
+        sd = pm.HalfNormal('sd', 1)
+        y = pm.Normal('y', mu=mu, sd=sd, observed=data)
+
+is roughly equivalent to this::
+
+    # For illustration only, not real code! 
+    model = pm.Model()
+    mu = tt.scalar('mu')
+    model.add_free_variable(mu)
+    model.add_logp_term(pm.Normal.dist(0, 1).logp(mu))
+
+    sd_log__ = tt.scalar('sd_log__')
+    model.add_free_variable(sd_log__)
+    model.add_logp_term(corrected_logp_half_normal(sd_log__))
+
+    sd = tt.exp(sd_log__)
+    model.add_deterministic_variable(sd)
+
+    model.add_logp_term(pm.Normal.dist(mu, sd).logp(data))
+
+The return values of the variable constructors are instances of
+subclasses of theano variables, so when we define a variable we can
+use any theano operation on them::
+
+    design_matrix = np.array([[...]])
+    with pm.Model() as model:
+        # beta is a tt.dvector
+        beta = pm.Normal('beta', 0, 1, shape=design_matrix.shape[1])
+        predict = tt.dot(design_matrix, beta)
+        sd = pm.HalfCauchy('sd', beta=2.5)
+        pm.Normal('y', mu=predict, sd=sd, observed=data)
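Numerically, the graph built for `predict` above is just a matrix-vector product. A plain NumPy sketch with made-up values (illustration only; the design matrix and coefficient draw are hypothetical):

```python
import numpy as np

# A hypothetical 3x2 design matrix: an intercept column and one predictor.
design_matrix = np.array([[1.0, 0.0],
                          [1.0, 1.0],
                          [1.0, 2.0]])

# One hypothetical draw of the coefficient vector `beta`.
beta = np.array([0.5, -0.2])

# What `tt.dot(design_matrix, beta)` computes: one predicted mean per row.
predict = design_matrix.dot(beta)

assert predict.shape == (3,)
assert np.allclose(predict, [0.5, 0.3, 0.1])
```

During sampling, `beta` is a symbolic vector rather than fixed numbers, so the same product is evaluated once per logp (and gradient) evaluation.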