This document is a tutorial on RBMs and their use in the RBMAdvanced.jl package. If you're already familiar with RBMs, feel free to skip ahead to the Implementation sections.
RBM (Restricted Boltzmann Machine) is a type of neural network that learns the probability distribution of given data. It was initially proposed by P. Smolensky in 1986. Unlike vanilla Boltzmann machines, which allow arbitrary connections, RBMs restrict the connections to only those between two layers: the visible and hidden layers. Connections between neurons within the same layer are not allowed. If you represent an RBM as a graph, it forms a bipartite structure, similar to a dense (or linear) layer in a neural network.
But how does an RBM represent a probability distribution? For each state $(v, h)$, where each neuron is assigned either 0 or 1, we define the energy of the state

$$E(v, h) = -a^\top v - b^\top h - v^\top W h,$$

where $v \in \{0, 1\}^{|V|}$ is the visible state and $h \in \{0, 1\}^{|H|}$ is the hidden state ($|V|$ and $|H|$ denote the numbers of visible and hidden neurons), $W$ is the $|V| \times |H|$ weight matrix, and $a$, $b$ are the visible and hidden biases.

Now, we define the probability of each state as

$$P(v, h) = \frac{e^{-E(v, h)}}{Z},$$

where

$$Z = \sum_{v, h} e^{-E(v, h)}$$

is the normalization factor (the partition function). Since we are usually more interested in the distribution of the visible state, we compute the marginalized probability by summing over all hidden states:

$$P(v) = \frac{1}{Z} \sum_{h} e^{-E(v, h)}.$$
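To make these definitions concrete, here is a minimal standalone sketch of the brute-force computation. It is not the package's implementation: the explicit W, a, b parameters, the random initialization, and the helper names are illustrative choices only. It enumerates every visible/hidden configuration of a tiny RBM and computes $P(v)$ exactly.

```julia
# Brute-force computation of an RBM's visible distribution.
# W is the weight matrix, a and b the visible and hidden biases.
energy(v, h, W, a, b) = -(a' * v) - (b' * h) - (v' * W * h)

function exact_visible_distribution(W, a, b)
    nv, nh = length(a), length(b)
    # All binary vectors of length n, read off the bits of 0:2^n-1.
    states(n) = ([(bits >> i) & 1 for i in 0:n-1] for bits in 0:(2^n - 1))
    unnorm = Dict{Vector{Int}, Float64}()
    for v in states(nv)
        # Marginalize over the hidden states: sum_h exp(-E(v, h)).
        unnorm[v] = sum(exp(-energy(v, h, W, a, b)) for h in states(nh))
    end
    Z = sum(values(unnorm))                        # normalization factor Z
    return Dict(v => p / Z for (v, p) in unnorm)   # P(v) for every visible state
end

W = randn(3, 5); a = zeros(3); b = zeros(5)
println(exact_visible_distribution(W, a, b))
```

This exhaustive enumeration is what makes the naive approach exact but expensive.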
Now that we know how to compute the probability distribution of an RBM, at least in theory, we can use the "naive inference" method implemented in RBMAdvanced.jl to test this.
```julia
using RBMAdvanced

visibleSize = 3
hiddenSize = 5
weight = randInitRBMWeight(visibleSize, hiddenSize)
println(inferRBM_naive(weight))   # exact distribution over all 2^3 visible states
```
Example output:
```
Dict{Any, Any}([0, 1, 0] => 0.0028824897391779834, [1, 0, 1] => 0.027611608972332023, [1, 0, 0] => 0.002147013339664838, [0, 0, 1] => 0.0014878417722425205, [0, 0, 0] => 0.00016172793688411086, [1, 1, 1] => 0.8744774773571675, [1, 1, 0] => 0.05579950571941121, [0, 1, 1] => 0.035432335163119846)
```
This prints the probability distribution over the three visible binary variables (as defined by visibleSize
), based on a randomly generated RBM weight matrix. You can modify the weight matrix as desired and run inferRBM_naive
to observe how changes in the weights affect the probability distribution represented by the RBM.
However, this method quickly becomes computationally intractable. For example, try setting visibleSize = 30
in the above code. Even on powerful computers, the program will take a very long time to run, as it must iterate over all $2^{|V| + |H|}$ possible states of the visibleSize + hiddenSize neurons (with visibleSize = 30 and hiddenSize = 5, that is already $2^{35} \approx 3.4 \times 10^{10}$ states). Clearly, naive inference is not practical for large problems.
MCMC (Markov chain Monte Carlo) is a powerful family of methods for sampling from (possibly intractable) distributions. It is widely used in machine learning, and RBMs are no exception. To sample from an RBM, we construct the Markov chain as follows:

Starting from a random visible state $v^{(0)}$, we alternately sample a hidden state $h^{(t)} \sim P(h \mid v^{(t)})$ and a new visible state $v^{(t+1)} \sim P(v \mid h^{(t)})$ (block Gibbs sampling).

To sample from $P(h \mid v)$ and $P(v \mid h)$, we use the fact that, given one layer, the units of the other layer are conditionally independent:

$$P(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_i W_{ij} v_i\Big), \qquad P(v_i = 1 \mid h) = \sigma\Big(a_i + \sum_j W_{ij} h_j\Big),$$

where $\sigma(x) = 1 / (1 + e^{-x})$ is the logistic sigmoid. Running this chain long enough yields samples that are approximately distributed according to $P(v, h)$.
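As a standalone illustration of one step of this chain (again not the package's internal code; the W, a, b parameters and the function names are assumptions of this sketch), a single block Gibbs sweep can be written as:

```julia
# One block Gibbs sweep for a binary RBM with weight matrix W,
# visible bias a, and hidden bias b.
σ(x) = 1 / (1 + exp(-x))

# h ~ P(h | v): each hidden unit is Bernoulli(σ(b_j + Σ_i W_ij v_i)).
sample_hidden(v, W, b) = Int.(rand(length(b)) .< σ.(W' * v .+ b))

# v ~ P(v | h): each visible unit is Bernoulli(σ(a_i + Σ_j W_ij h_j)).
sample_visible(h, W, a) = Int.(rand(length(a)) .< σ.(W * h .+ a))

function gibbs_sweep(v, W, a, b)
    h = sample_hidden(v, W, b)
    return sample_visible(h, W, a), h
end

# Run the chain for a number of sweeps, starting from a given visible state.
function run_chain(v0, W, a, b; steps=100)
    v = copy(v0)
    for _ in 1:steps
        v, _ = gibbs_sweep(v, W, a, b)
    end
    return v
end

W = randn(3, 5); a = zeros(3); b = zeros(5)
println(run_chain(rand(0:1, 3), W, a, b))
```

In practice only a finite number of sweeps are run, which is what the sampler used below does.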
The example below draws 100 samples, using the CD (contrastive divergence) algorithm, from an RBM with randomly generated weights. We discard the first 100 warm-up samples before collecting, to reduce dependence on the initial state.
```julia
using RBMAdvanced

visibleSize = 3
hiddenSize = 5
weight = randInitRBMWeight(visibleSize, hiddenSize)
println(inferRBM(weight, 1, 100, 100))   # 100 samples after 100 warm-up samples
```
Example output:
```
Any[[1 1 0], [1 1 1], [1 1 1], [1 1 0], [1 1 1], [1 1 0], [1 1 0], [1 1 1], [1 1 1], [0 1 1], [1 1 1], [1 1 1], [1 1 1], [1 1 1], [1 1 1], [1 1 1], [1 1 1], [1 1 1], [0 1 1], [0 1 1], [0 1 0], [1 1 1], [1 1 1], [1 1 1], [1 0 0], [1 1 0], [1 0 1], [1 1 1], [0 1 1], [0 1 1], [1 0 1], [1 1 1], [1 1 1], [1 0 1], [1 1 1], [1 1 0], [1 1 1], [1 1 1], [1 1 1], [1 1 1], [1 1 1], [1 1 1], [1 1 1], [1 1 1], [1 1 1], [1 1 1], [1 1 0], [1 1 1], [0 1 1], [1 1 1], [1 1 1], [1 1 1], [1 1 1], [1 1 1], [1 1 1], [1 1 1], [1 1 1], [1 1 1], [1 1 1], [1 1 1], [1 1 1], [1 1 1], [1 1 1], [1 1 1], [1 1 1], [1 1 1], [1 1 1], [1 1 1], [0 1 1], [1 1 1], [1 1 1], [1 1 1], [1 1 1], [1 1 1], [1 1 1], [1 0 1], [1 1 1], [0 1 1], [1 1 1], [1 1 1], [1 1 1], [1 1 1], [1 1 1], [1 1 1], [1 0 1], [1 0 0], [1 1 1], [1 1 1], [0 1 1], [1 0 1], [1 1 1], [1 0 1], [1 1 0], [1 1 1], [1 1 0], [1 1 1], [0 1 1], [0 1 0], [1 0 1], [1 1 1]]
```
You may try scaling visibleSize
to a large number and verify that CD can still return samples quickly where the naive method struggled.
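For example, you might time the CD-based sampler on a larger RBM. This is only a hypothetical check: the weight is random and the exact timing depends on your machine.

```julia
using RBMAdvanced

weight = randInitRBMWeight(30, 5)              # 30 visible units: far too many for naive inference
@time samples = inferRBM(weight, 1, 100, 100)  # should still finish quickly
println(length(samples))                       # 100 samples
```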
So far, we have explored methods to infer the probability distribution represented by an RBM given its weights. Now we demonstrate how to train an RBM to learn the probability distribution given a dataset.
The RBM's training objective is to minimize the KL divergence between its visible probability distribution and the true data distribution, which is equivalent to maximizing the likelihood of the data under the model. So, in principle, we would need the visible distribution $P(v)$ in order to optimize the KL divergence. Writing $q(v)$ for the empirical data distribution, the objective is

$$\min_W \, \mathrm{KL}\big(q \,\|\, P\big) = \min_W \sum_v q(v) \log \frac{q(v)}{P(v)}.$$

Here we run into the same issue: computing the exact probability distribution of the RBM is too expensive. So we use contrastive divergence (Hinton, 2002) for training the RBM as well.

Here is the procedure for training the RBM with contrastive divergence. Suppose we are given a dataset $\mathcal{D}$ of visible vectors. For each data vector $v$, we sample a hidden vector $h \sim P(h \mid v)$ (the "positive" phase), then run $k$ steps of Gibbs sampling starting from $v$ to obtain a reconstruction $(\tilde{v}, \tilde{h})$ (the "negative" phase).

Then, the gradients of the log-likelihood with respect to the weights are approximated as

$$\Delta W_{ij} \propto \langle v_i h_j \rangle_{\text{data}} - \langle \tilde{v}_i \tilde{h}_j \rangle_{k},$$

where the first average is taken over the data and the second over the $k$-step Gibbs reconstructions (analogous expressions hold for the biases).
Now we run an optimization step using the gradients obtained above. In RBMAdvanced.jl, gradient descent over the full dataset batch is used.
Finally, we repeat the above steps for the desired number of epochs.
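For concreteness, here is a rough sketch of one full-batch CD-1 update under this standard formulation. It is only an illustration, not RBMAdvanced.jl's actual implementation: the split of the parameters into W, a, b, the helper names, and the learning-rate handling are all assumptions of the sketch.

```julia
σ(x) = 1 / (1 + exp(-x))

# Conditional probabilities for a whole batch: rows of V are visible vectors,
# rows of H are hidden vectors.
hidden_probs(V, W, b)  = σ.(V * W .+ b')     # P(h_j = 1 | v) for each row of V
visible_probs(H, W, a) = σ.(H * W' .+ a')    # P(v_i = 1 | h) for each row of H
sample(P) = Int.(rand(size(P)...) .< P)      # Bernoulli sample for each entry

# One full-batch CD-1 update; V is a |D| x |V| matrix of binary data rows.
function cd1_update!(W, a, b, V; lr=0.1)
    Ph_data = hidden_probs(V, W, b)              # positive phase (data-driven)
    H       = sample(Ph_data)
    Vneg    = sample(visible_probs(H, W, a))     # one Gibbs step: reconstruction
    Ph_neg  = hidden_probs(Vneg, W, b)           # negative phase
    n = size(V, 1)
    W .+= lr .* (V' * Ph_data .- Vneg' * Ph_neg) ./ n
    a .+= lr .* vec(sum(V .- Vneg, dims=1)) ./ n
    b .+= lr .* vec(sum(Ph_data .- Ph_neg, dims=1)) ./ n
    return W, a, b
end

# Toy usage mirroring the dataset used later in this tutorial.
V = [0 0 0; 0 0 1]
W = 0.01 .* randn(3, 5); a = zeros(3); b = zeros(5)
for _ in 1:100
    cd1_update!(W, a, b, V)
end
```

Setting k > 1 would simply run more Gibbs steps before the negative statistics are collected (CD-k), which is presumably what the k field of the training config below corresponds to.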
The trainRBM! method in RBMAdvanced.jl is designed to handle RBM training using CD and its variants. It accepts an initial RBM weight, a dataset, and a training config. Note that the RBM weight is modified in place.
Here's an example of using the trainRBM! method. We first define a simple toy dataset. Note that dataset has to be a $|\mathcal{D}| \times |V|$ matrix of batched data. Then we prepare a randomly initialized RBM weight. We also initialize the config and call validateConfig to make sure the config is valid; this method also prints out the selected training method for debugging purposes. Finally, we call trainRBM! to train the RBM and then use inferRBM to draw samples from the trained weights using CD.
```julia
using RBMAdvanced

# Toy dataset: a |D| x |V| matrix, one data vector per row.
dataset = [0 0 0; 0 0 1]
visibleSize = 3
hiddenSize = 5
weight = randInitRBMWeight(visibleSize, hiddenSize)

cdConfig = TrainingConfig(
    usePersistentChain=false,
    betas=[1.0],
    k=1,
    epochs=100,
    learningRate=0.1
)
validateConfig(cdConfig)
trainRBM!(weight, dataset, cdConfig)
println(inferRBM(weight, 1, 100, 100))
```
Example output:
```
Using CD
Any[[1 0 0], [0 0 1], [0 0 0], [0 1 0], [0 0 1], [0 0 0], [0 0 0], [0 0 0], [0 0 0], [0 0 1], [0 0 0], [0 0 1], [0 0 1], [0 0 1], [0 0 0], [0 0 1], [0 0 0], [1 0 0], [1 0 0], [0 0 0], [0 0 0], [0 0 1], [0 0 0], [0 1 1], [0 0 0], [0 0 0], [0 0 0], [0 0 0], [0 0 0], [0 0 0], [0 0 1], [0 1 0], [0 0 1], [0 0 1], [0 0 1], [0 0 0], [0 0 1], [0 0 1], [0 0 1], [0 0 0], [0 0 1], [0 0 0], [0 0 1], [0 0 1], [0 0 1], [0 0 0], [0 0 1], [0 0 1], [0 0 1], [0 0 0], [0 0 0], [0 0 1], [0 0 1], [0 0 0], [0 0 1], [0 0 0], [0 0 1], [0 0 1], [0 0 1], [0 0 1], [0 0 1], [0 0 1], [0 0 1], [0 0 0], [0 0 0], [0 0 1], [0 0 0], [0 0 1], [0 0 1], [0 0 1], [0 0 0], [0 0 1], [0 0 1], [0 0 1], [0 0 0], [0 1 1], [0 1 0], [0 0 0], [0 0 1], [0 0 0], [0 0 1], [0 0 1], [1 1 0], [1 0 0], [0 0 1], [0 0 1], [0 0 0], [0 0 0], [0 0 0], [0 0 1], [0 0 1], [0 0 1], [0 0 0], [0 0 0], [1 0 0], [0 0 0], [0 0 1], [0 0 0], [0 0 0], [0 0 1]]
```
Counting the example output, we have:
```
'[0 0 1]': 49, '[0 0 0]': 40, '[1 0 0]': 5, '[0 1 0]': 3, '[0 1 1]': 2, '[1 1 0]': 1
```
The output is slightly noisy, but it reflects the 50:50 mixture of [0 0 1]
and [0 0 0]
in the toy dataset.
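If you want to reproduce such a count for your own run, a plain-Julia tally over the returned samples is enough. This small snippet is only an illustration and assumes weight is the trained weight from the example above.

```julia
samples = inferRBM(weight, 1, 100, 100)   # draw a fresh batch of samples
counts = Dict{Any, Int}()
for s in samples
    counts[s] = get(counts, s, 0) + 1     # arrays hash by value, so equal samples share a key
end
println(counts)
```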
Although CD provides a fairly good approximation of the RBM's probability distribution, it inherits the drawbacks of MCMC. MCMC can approximate the true distribution, but only in an asymptotic sense. Within finite runs, and for glassy problems in particular, CD may produce a heavily biased distribution that is concentrated around the initial point. Parallel tempering (PT) is an advanced training method for dealing with this issue.
Before we explore PT, we should first look at PCD (persistent CD), which uses a persistent Markov chain for sampling. In contrast, vanilla CD starts a new chain for each training step, initializing it at the data vectors; PCD keeps the state of the negative chain across training steps, so the chain continues to mix as the weights are updated (Tieleman, 2008).
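The conceptual change from CD to PCD is small: the negative chain's state is stored and carried over from one update to the next instead of being re-initialized at the data. Here is a hypothetical sketch, reusing the hidden_probs, visible_probs, and sample helpers from the CD sketch earlier; again, this is not the package's implementation.

```julia
# Persistent CD: the negative particles Vchain survive across parameter updates.
function pcd_update!(W, a, b, V, Vchain; lr=0.1, k=1)
    Ph_data = hidden_probs(V, W, b)              # positive phase, from the data
    for _ in 1:k                                 # advance the persistent chain k steps
        H      = sample(hidden_probs(Vchain, W, b))
        Vchain = sample(visible_probs(H, W, a))
    end
    Ph_neg = hidden_probs(Vchain, W, b)          # negative phase, from the chain
    n = size(V, 1)
    W .+= lr .* (V' * Ph_data .- Vchain' * Ph_neg) ./ n
    a .+= lr .* vec(sum(V .- Vchain, dims=1)) ./ n
    b .+= lr .* vec(sum(Ph_data .- Ph_neg, dims=1)) ./ n
    return Vchain                                # the caller keeps this for the next update
end

V = [0 0 0; 0 0 1]
W = 0.01 .* randn(3, 5); a = zeros(3); b = zeros(5)
Vchain = rand(0:1, size(V)...)                   # persistent particles, initialized once
for _ in 1:100
    global Vchain = pcd_update!(W, a, b, V, Vchain)
end
```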
Parallel tempering extends PCD to multiple Markov chains running in parallel (Desjardins et al., 2010). The chains are assigned different temperatures (equivalently, different inverse temperatures $\beta$): the chain with inverse temperature $\beta$ samples from the tempered distribution $P_\beta(v, h) \propto e^{-\beta E(v, h)}$. Chains with small $\beta$ (high temperature) mix quickly, while the $\beta = 1$ chain targets the original distribution.
PT is the same as PCD except for an additional "exchange" step. After sampling is done for each chain, we pick two neighboring chains $r$ and $r+1$, with inverse temperatures $\beta_r$ and $\beta_{r+1}$ and current states $x_r$ and $x_{r+1}$, and compute the swap probability

$$p_{\text{swap}} = \min\!\Big(1, \, \exp\big((\beta_r - \beta_{r+1})\,(E(x_r) - E(x_{r+1}))\big)\Big).$$

Following this probability, we swap the states of chains $r$ and $r+1$. This lets configurations found by the fast-mixing, high-temperature chains propagate to the $\beta = 1$ chain, which provides the negative samples for the gradient.
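A small sketch of this exchange step follows; the energy function, state representation, and function names here are illustrative assumptions, not RBMAdvanced.jl internals.

```julia
# Energy of a joint state (v, h), in the same form as the definition above.
energy(v, h, W, a, b) = -(a' * v) - (b' * h) - (v' * W * h)

# Metropolis-style swap attempt between neighboring chains r and r+1.
# states[r] is the (v, h) pair of the chain with inverse temperature betas[r].
function maybe_swap!(states, betas, r, W, a, b)
    E_r  = energy(states[r]...,   W, a, b)
    E_r1 = energy(states[r+1]..., W, a, b)
    p_swap = min(1.0, exp((betas[r] - betas[r+1]) * (E_r - E_r1)))
    if rand() < p_swap
        states[r], states[r+1] = states[r+1], states[r]   # exchange the two chain states
    end
    return states
end

# Three chains at betas = [1.0, 0.1, 0.01], each holding a random (v, h) state.
W = randn(3, 5); a = zeros(3); b = zeros(5)
betas  = [1.0, 0.1, 0.01]
states = [(rand(0:1, 3), rand(0:1, 5)) for _ in betas]
maybe_swap!(states, betas, 1, W, a, b)   # attempt a swap between chains 1 and 2
```

With betas = [1.0, 0.1, 0.01], as in the training config below, the $\beta = 1$ chain is the one whose samples feed the gradient estimate, while the high-temperature chains help it escape local modes.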
Run the code below to train an RBM on the same toy dataset as in the CD example, but this time using the PT algorithm for training.
```julia
using RBMAdvanced

# Toy dataset: a |D| x |V| matrix, one data vector per row.
dataset = [0 0 0; 0 0 1]
visibleSize = 3
hiddenSize = 5
weight = randInitRBMWeight(visibleSize, hiddenSize)

cdConfig = TrainingConfig(
    usePersistentChain=true,
    betas=[1.0, 0.1, 0.01],   # use betas=[1.0] for PCD
    k=1,
    epochs=100,
    learningRate=0.1
)
validateConfig(cdConfig)
trainRBM!(weight, dataset, cdConfig)
println(inferRBM(weight, 1, 100, 100))
```
Note that this example code differs from the CD training example by only two lines (in the config), while using a more complex training method under the hood; this demonstrates the convenience of RBMAdvanced.jl's config system, which unifies the RBM training methods.
See mnist_example.md for how to train RBMs on handwritten digit images using the methods explained above.
Desjardins, G., Courville, A., Bengio, Y., Vincent, P., & Delalleau, O. (2010, May). Parallel tempering for training of restricted Boltzmann machines. In Proceedings of the thirteenth international conference on artificial intelligence and statistics (pp. 145-152). Cambridge, MA: MIT Press.
Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural computation, 14(8), 1771-1800.
Smolensky, P. (1986). Information processing in dynamical systems: Foundations of harmony theory.
Tieleman, T. (2008, July). Training restricted Boltzmann machines using approximations to the likelihood gradient. In Proceedings of the 25th international conference on Machine learning (pp. 1064-1071).