# Boltzmann Machine

The Boltzmann Machine (BM) is a type of generative stochastic neural network and Markov random field invented by Geoffrey Hinton and Terry Sejnowski in 1985. Boltzmann machines can be seen as the stochastic, generative counterpart of Hopfield networks. They were among the first neural networks capable of learning internal representations, and they are able to represent and solve difficult combinatoric problems. They are theoretically intriguing because of the locality and Hebbian nature of their training algorithm, and because of their parallelism and the resemblance of their dynamics to simple physical processes. Due to a number of issues discussed below, Boltzmann machines with unconstrained connectivity have not proven useful for practical problems in machine learning or inference; however, if the connectivity is properly constrained, learning can be made efficient enough to be useful. In practice, it is usually the Restricted Boltzmann Machine (RBM) that is applied.

# A Brief Introduction to Restricted and Fully Connected Boltzmann Machines

## Structure

A Boltzmann machine, like a Hopfield network, is a network of units with an “energy” defined for the network. It also has binary units, but unlike Hopfield nets, Boltzmann machine units are stochastic. The global energy $E$ in a Boltzmann machine is identical in form to that of a Hopfield network:

$E = -(\sum_{i,j}\omega_{ij} s_i s_j +\sum_i\theta_i s_i)$

Where:

• $\omega_{ij}$ is the connection strength between unit $j$ and unit $i$
• $s_i$ is the state, $s_i\in \{0,1\}$, of unit $i$
• $\theta_i$ is the bias of unit $i$ in the global energy function. ($-\theta_i$ is the activation threshold for the unit)

Often the weights are represented in matrix form with a symmetric matrix $W$, with zeros along the diagonal.
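The energy function above can be evaluated directly from the weight matrix, biases, and state vector. The following is a minimal NumPy sketch on a hypothetical 3-unit network (the specific weights, biases, and state are made up for illustration); since the symmetric double sum counts each pair twice, the quadratic term is halved:

```python
import numpy as np

# Hypothetical 3-unit network: symmetric weight matrix W with a zero
# diagonal, bias vector theta, and a binary state vector s.
W = np.array([[0.0, 0.5, -0.3],
              [0.5, 0.0, 0.8],
              [-0.3, 0.8, 0.0]])
theta = np.array([0.1, -0.2, 0.4])
s = np.array([1, 0, 1])

def global_energy(W, theta, s):
    """E = -(sum over pairs of w_ij * s_i * s_j + sum_i theta_i * s_i).

    s @ W @ s sums over all ordered pairs (i, j), counting each
    symmetric pair twice, so the quadratic term is halved.
    """
    return -(0.5 * s @ W @ s + theta @ s)
```

Interpreting the sum in the energy formula as running over unordered pairs $i < j$ is an assumption here; with that reading, the halved full double sum is equivalent for a symmetric $W$ with zero diagonal.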

## Probability of a unit’s state

The difference in the global energy that results from a single unit $i$ being 0 (off) versus 1 (on), written $\Delta E_i$, assuming a symmetric matrix of weights, is given by:

$\Delta E_i = \sum_{j} \omega_{ij}s_j + \theta_i$

This can be expressed as the difference of energies of two states:

$\Delta E_i = E_{i=off} - E_{i=on}$

We then substitute the energy of each state with its relative probability, using the Boltzmann-distribution relation $E = -T\ln p$ (up to an additive constant that cancels in the difference) and the fact that $p_{i=off} = 1 - p_{i=on}$:

$$\begin{aligned} \Delta E_i &= -T\ln(p_{i=off})+T\ln(p_{i=on})\\\\ \frac{\Delta E_i}{T} &= \ln(p_{i=on}) - \ln(1-p_{i=on})\\\\ p_{i=on} &= \frac{1}{1+\exp(-\frac{\Delta E_i}{T})} \end{aligned}$$

Where $T$ is the temperature of the system.
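The two formulas above translate directly into code: the energy gap $\Delta E_i$ is a dot product of the unit's weight row with the state, and the activation probability is the logistic function of $\Delta E_i / T$. A minimal NumPy sketch, reusing a hypothetical 3-unit network (the numbers are made up for illustration):

```python
import numpy as np

# Hypothetical example network (same shape as in the Structure section).
W = np.array([[0.0, 0.5, -0.3],
              [0.5, 0.0, 0.8],
              [-0.3, 0.8, 0.0]])
theta = np.array([0.1, -0.2, 0.4])
s = np.array([1, 0, 1])

def delta_energy(W, theta, s, i):
    """Energy gap Delta E_i = sum_j w_ij s_j + theta_i for unit i.
    With a zero diagonal, the j = i term contributes nothing."""
    return W[i] @ s + theta[i]

def p_on(W, theta, s, i, T=1.0):
    """Probability that unit i turns on: logistic in Delta E_i / T."""
    return 1.0 / (1.0 + np.exp(-delta_energy(W, theta, s, i) / T))
```

A stochastic update then sets `s[i] = rng.random() < p_on(W, theta, s, i, T)`, which is exactly the sampling rule used when running the network.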

## Equilibrium state

The network is run by repeatedly choosing a unit and setting its state according to the above formula. After running for long enough at a certain temperature, the probability of a global state of the network will depend only upon that global state’s energy, according to a Boltzmann distribution, and not on the initial state from which the process was started. This means that log-probabilities of global states become linear in their energies. This relationship is true when the machine is “at thermal equilibrium”, meaning that the probability distribution of global states has converged. If we start running the network from a high temperature, and gradually decrease it until we reach a thermal equilibrium at a low temperature, we may converge to a distribution where the energy level fluctuates around the global minimum. This process is called simulated annealing.
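The procedure just described can be sketched as follows: repeatedly pick a unit at random, resample its state from the logistic probability, and lower the temperature a little each step. This is a minimal, illustrative implementation under assumed hyperparameters (step count, start and end temperatures, and the geometric cooling schedule are all choices made here, not prescribed by the text):

```python
import numpy as np

def anneal(W, theta, steps=5000, T_start=10.0, T_end=0.1, seed=0):
    """Simulated annealing on a Boltzmann machine: run the stochastic
    unit updates while lowering T geometrically from T_start to T_end.
    Returns the final binary state and its global energy."""
    rng = np.random.default_rng(seed)
    n = len(theta)
    s = rng.integers(0, 2, size=n)               # random initial state
    decay = (T_end / T_start) ** (1.0 / steps)   # geometric cooling
    T = T_start
    for _ in range(steps):
        i = rng.integers(n)                      # choose a unit at random
        dE = W[i] @ s + theta[i]                 # energy gap Delta E_i
        s[i] = rng.random() < 1.0 / (1.0 + np.exp(-dE / T))
        T *= decay
    energy = -(0.5 * s @ W @ s + theta @ s)
    return s, energy
```

At the final low temperature the chain concentrates on low-energy states, so the returned state tends to sit near the global minimum, as the text describes.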

## Training

The units in the Boltzmann Machine are divided into ‘visible’ units, V, and ‘hidden’ units, H. The visible units are those which receive information from the ‘environment’, i.e. our training set is a set of binary vectors over the set V. The distribution over the training set is denoted $P^{+}(V)$.

As discussed above, the distribution over global states converges as the Boltzmann machine reaches thermal equilibrium. We denote this distribution, after marginalizing out the hidden units, as $P^{-}(V)$.

Our goal is to approximate the ‘real’ distribution $P^{+}(V)$ using the $P^{-}(V)$ which will (eventually) be produced by the machine. To measure how similar the two distributions are, we use the Kullback–Leibler divergence (usually called the KL divergence) $KL(+|-)$:

$KL(+|-) = \sum_{v} P^{+}(v) \ln\left(\frac{P^{+}(v)}{P^{-}(v)}\right)$

where the sum is over all possible states of $V$. $KL(+|-)$ is a function of the weights, since they determine the energy of a state, and the energy determines $P^{-}(v)$. Gradient descent can therefore be applied directly to $KL(+|-)$.
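The divergence itself is straightforward to compute when both distributions are available as vectors over the $2^{|V|}$ visible states. A minimal NumPy sketch (the two toy distributions in the test are hypothetical, not outputs of an actual machine); terms with $P^{+}(v) = 0$ are taken to contribute zero, by the usual convention:

```python
import numpy as np

def kl_divergence(p_plus, p_minus):
    """KL(+|-) = sum_v P+(v) * ln(P+(v) / P-(v)).

    p_plus, p_minus: probability vectors over the visible states.
    States with P+(v) = 0 contribute zero and are skipped, so P-(v)
    may be anything there; elsewhere P-(v) must be positive.
    """
    p_plus = np.asarray(p_plus, dtype=float)
    p_minus = np.asarray(p_minus, dtype=float)
    mask = p_plus > 0
    return np.sum(p_plus[mask] * np.log(p_plus[mask] / p_minus[mask]))
```

The divergence is zero exactly when the two distributions agree, and positive otherwise, which is what makes it a usable training objective.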

## Learning Goal (Maximizing likelihood)

Consider a training set of binary vectors, which for the purpose of explanation we will assume are binary images. The training set can be modeled using a two-layer network called a “Restricted Boltzmann Machine”. Define the RBM’s Hamiltonian (Hopfield, 1982):

$E(\mathbf{v},\mathbf{h};\theta) = -\sum_{ij} \omega_{ij}v_i h_j - \sum_{i}b_i v_i - \sum_{j} a_j h_j$

where $v_i,h_j$ are the binary states of visible unit $i$ and hidden unit $j$, $b_i,a_j$ are their respective biases, and $\omega_{ij}$ is the weight between them.
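The RBM Hamiltonian is a bilinear form plus two bias terms, so it reduces to three dot products. A minimal NumPy sketch on a hypothetical tiny RBM with 3 visible and 2 hidden units (all numbers made up for illustration):

```python
import numpy as np

def rbm_energy(v, h, W, b, a):
    """E(v, h) = -v^T W h - b^T v - a^T h, matching the Hamiltonian
    above: W couples visible to hidden, b biases v, a biases h."""
    return -(v @ W @ h + b @ v + a @ h)

# Hypothetical tiny RBM: 3 visible units, 2 hidden units.
W = np.array([[0.2, -0.1],
              [0.4, 0.3],
              [-0.5, 0.6]])
b = np.array([0.1, 0.0, -0.2])   # visible biases
a = np.array([0.05, -0.3])       # hidden biases
v = np.array([1, 0, 1])
h = np.array([0, 1])
```

Because there are no visible-visible or hidden-hidden connections, $W$ is a rectangular visible-by-hidden matrix rather than the square symmetric matrix of the fully connected machine, which is what makes RBM inference tractable.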