2 Neural Network Theory

This section briefly explains the theory of neural networks (hereafter NN) and artificial neural networks (hereafter ANN). For a more in-depth explanation of these concepts, please consult the literature; [Hassoun, 1995] has good coverage of most ANN concepts and [Hertz et al., 1991] describes the mathematics of ANN very thoroughly, while [Anderson, 1995] takes a more psychological and physiological approach to NN and ANN. For the pragmatically inclined, I recommend [Tettamanzi and Tomassini, 2001], which has a short and easily understandable introduction to NN and ANN.

2.1 Neural Networks

The human brain is a highly complicated machine capable of solving very complex problems. Although we have a good understanding of some of the basic operations that drive the brain, we are still far from understanding everything there is to know about the brain.

In order to understand ANN, you will need to have a basic knowledge of how the internals of the brain work. The brain is part of the central nervous system and consists of a very large NN. The NN is actually quite complicated, but I will only include the details needed to understand ANN, in order to simplify the explanation.

Figure 1: Simplified neuron.

The NN is a network consisting of connected neurons. The center of the neuron is called the nucleus. The nucleus is connected to other nuclei by means of the dendrites and the axon. This connection is called a synaptic connection.

The neuron can fire electric pulses through its synaptic connections, which are received at the dendrites of other neurons. Figure 1 shows what a simplified neuron looks like.

When a neuron receives enough electric pulses through its dendrites, it activates and fires a pulse through its axon, which is then received by other neurons. In this way information can propagate through the NN. The synaptic connections change throughout the lifetime of a neuron, and the amount of incoming pulses needed to activate a neuron (the threshold) also changes. This behavior allows the NN to learn.

The human brain consists of around $ 10^{11}$ neurons, which are highly interconnected with around $ 10^{15}$ connections [Tettamanzi and Tomassini, 2001]. These neurons activate in parallel in response to internal and external stimuli. The brain is connected to the rest of the nervous system, which allows it to receive information by means of the five senses and also allows it to control the muscles.

2.2 Artificial Neural Networks

It is not possible (at the moment) to make an artificial brain, but it is possible to make simplified artificial neurons and artificial neural networks. These ANNs can be constructed in many different ways and can mimic the brain to varying degrees.

ANNs are not intelligent, but they are good at recognizing patterns and making simple rules for complex problems. They also have excellent training capabilities, which is why they are often used in artificial intelligence research.

ANNs are good at generalizing from a set of training data. This means, for example, that an ANN given data about a set of animals, together with a fact telling whether each is a mammal or not, is able to predict whether an animal outside the original set is a mammal from its data. This is a very desirable feature of ANNs, because you do not need to know the characteristics defining a mammal; the ANN will find out by itself.

2.2.1 The Artificial Neuron

A single artificial neuron can be implemented in many different ways. The general mathematical definition is shown in equation 2.1.

$\displaystyle y(x) = g \left( \sum_{i=0}^{n}w_i x_i \right)$ (2.1)

Here $ x$ is a neuron with $ n$ input dendrites $ (x_0  \ldots  x_n)$ and one output axon $ y(x)$, and $ (w_0  \ldots  w_n)$ are weights determining how much the inputs should be weighted.

$ g$ is an activation function that determines how powerful the output (if any) from the neuron should be, based on the sum of the inputs. If the artificial neuron were to mimic a real neuron, the activation function $ g$ would be a simple threshold function returning 0 or 1. This is, however, not the way artificial neurons are usually implemented; for many different reasons it is smarter to use a smooth (preferably differentiable) activation function. The output of the activation function is then either between 0 and 1, or between -1 and 1, depending on which activation function is used. This is not entirely true, since e.g. the identity function, which is also sometimes used as an activation function, does not have these limitations, but most other activation functions do. The inputs and the weights are not restricted in the same way and can in principle take any value between $ -\infty$ and $ +\infty$, but they are very often small values centered around zero. The artificial neuron is also illustrated in figure 2.
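Equation 2.1 can be sketched in a few lines of Python (an illustrative sketch only; the function name is my own, not part of any library):

```python
def neuron_output(weights, inputs, g):
    """Compute y(x) = g(sum_i w_i * x_i) as in equation 2.1."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return g(weighted_sum)

# With the identity function as activation, the neuron is just a
# weighted sum of its inputs:
print(neuron_output([0.5, -1.0, 2.0], [1.0, 1.0, 1.0], lambda s: s))  # 1.5
```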

Figure 2: An artificial neuron.

In the figure of the real neuron (figure 1), the weights are not illustrated, but they are implicitly given by the number of pulses a neuron sends out, the strength of the pulses and how closely connected the neurons are.

As mentioned earlier, there are many different activation functions; some of the most commonly used are the threshold function (2.2), the sigmoid function (2.3) and the hyperbolic tangent function (2.4).

$\displaystyle g(x) = \left\{ \begin{array}{ll} 1 & \textrm{if } x+t > 0 \\ 0 & \textrm{if } x+t \le 0 \end{array} \right.$ (2.2)

$\displaystyle g(x) = \frac{1}{1 + e^{- 2 s (x+t)}}$ (2.3)

$\displaystyle g(x) = \tanh(s (x+t)) = \frac{\sinh(s (x+t))}{\cosh(s (x+t))} = \frac{e^{2 s (x+t)} - 1}{e^{2 s (x+t)} + 1}$ (2.4)

Where $ t$ is a value that pushes the center of the activation function away from zero and $ s$ is a steepness parameter. The sigmoid and hyperbolic tangent are both smooth, differentiable functions with very similar graphs; the only major difference is that the hyperbolic tangent has output ranging from -1 to 1, while the sigmoid has output ranging from 0 to 1. A graph of a sigmoid function is given in figure 3, to illustrate what the activation function looks like.
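The three activation functions can be written directly from equations 2.2 through 2.4 (a sketch; `s` and `t` play the same roles as in the equations):

```python
import math

def threshold(x, t=0.0):
    # Equation 2.2: fires 1 when x + t > 0, otherwise 0.
    return 1 if x + t > 0 else 0

def sigmoid(x, s=1.0, t=0.0):
    # Equation 2.3: smooth, output in (0, 1).
    return 1.0 / (1.0 + math.exp(-2.0 * s * (x + t)))

def tanh_activation(x, s=1.0, t=0.0):
    # Equation 2.4: smooth, output in (-1, 1).
    return math.tanh(s * (x + t))
```

At $ x = 0$ (with $ t = 0$) the sigmoid returns exactly 0.5 and the hyperbolic tangent returns 0, the centers of their respective output ranges.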

Figure 3: A graph of a sigmoid function with $ s$ = 0.5 and $ t$ = 0.

The $ t$ parameter in an artificial neuron can be seen as the amount of incoming pulses needed to activate a real neuron. This parameter and the weights are the parameters adjusted when the neuron learns.

2.2.2 The Artificial Neural Network

The ANN library I have chosen to implement is a multilayer feedforward ANN, which is the most common kind of ANN. In a multilayer feedforward ANN, the neurons are ordered in layers, starting with an input layer and ending with an output layer. Between these two layers are a number of hidden layers. Connections in these kinds of networks only go forward from one layer to the next. Many other kinds of ANNs exist, but I will not explain them further here. [Hassoun, 1995] describes several of these other kinds of ANNs.

Multilayer feedforward ANNs have two different phases: a training phase (sometimes also referred to as the learning phase) and an execution phase. In the training phase the ANN is trained to return a specific output when given a specific input; this is done by continuous training on a set of training data. In the execution phase the ANN returns outputs on the basis of inputs.

The execution of a feedforward ANN functions as follows: an input is presented to the input layer and propagated through all the layers (using equation 2.1) until it reaches the output layer, where the output is returned. In a feedforward ANN an input can easily be propagated through the network and evaluated to an output. It is more difficult to compute a clear output from a network where connections are allowed in all directions (as in the brain), since this creates loops. There are ways of dealing with these loops in recurrent networks; [Hassoun, 1995] (p. 271) describes how recurrent networks can be used to code time dependencies, but feedforward networks are usually a better choice for problems that are not time dependent.
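The execution phase can be sketched as a layer-by-layer application of equation 2.1 (a minimal sketch assuming full connectivity and omitting bias handling; the nested-list weight layout is chosen for illustration only):

```python
def feedforward(layer_weights, inputs, g):
    """Propagate `inputs` through the network, layer by layer.

    layer_weights[l][j] holds the input weights of neuron j in layer l,
    one weight per neuron in the previous layer.
    """
    activations = inputs
    for layer in layer_weights:
        # Each neuron applies equation 2.1 to the previous layer's outputs.
        activations = [g(sum(w * a for w, a in zip(neuron_weights, activations)))
                       for neuron_weights in layer]
    return activations
```

For example, a single layer with weights `[[1.0, 1.0], [1.0, -1.0]]` and the identity activation maps the input `[2.0, 3.0]` to the sum and the difference of the two inputs.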

Figure 4: A fully connected multilayer feedforward network with one hidden layer.

Figure 4 shows a multilayer feedforward ANN where all the neurons in each layer are connected to all the neurons in the next layer. This is called a fully connected network and although ANNs do not need to be fully connected, they often are.

Two different kinds of parameters can be adjusted during the training of an ANN: the weights and the $ t$ values in the activation functions. This is impractical, and it would be easier if only one kind of parameter had to be adjusted. To cope with this problem, a bias neuron is introduced. The bias neuron lies in one layer, is connected to all the neurons in the next layer but to none in the previous layer, and always emits 1. Since the bias neuron emits 1, the weights connected to the bias neuron are added directly to the combined sum of the other weights (equation 2.1), just like the $ t$ value in the activation functions. A modified equation for the neuron, where the weight for the bias neuron is represented as $ w_{n+1}$, is shown in equation 2.5.

$\displaystyle y(x) = g \left( w_{n+1} + \sum_{i=0}^{n}w_i x_i \right)$ (2.5)

Adding the bias neuron allows us to remove the $ t$ value from the activation function, leaving only the weights to be adjusted when the ANN is being trained. A modified version of the sigmoid function is shown in equation 2.6.

$\displaystyle g(x) = \frac{1}{1 + e^{- 2 s x}}$ (2.6)

We cannot remove the $ t$ value without adding a bias neuron, since that would make the sum function return zero whenever all inputs were zero, regardless of the values of the weights. Some ANN libraries do, however, remove the $ t$ value without adding bias neurons, counting on the subsequent layers to get the right results. An ANN with added bias neurons is shown in figure 5.
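The bias handling of equation 2.5 can be sketched as follows; here the bias weight $ w_{n+1}$ is simply the last entry of the weight list, added to the sum as if it came from a neuron with a constant output of 1 (an illustrative sketch, not the implementation discussed later):

```python
def neuron_with_bias(weights, inputs, g):
    """Equation 2.5: the last weight is the bias weight w_{n+1},
    multiplied by the bias neuron's constant output of 1 and added
    directly to the weighted sum of the real inputs."""
    *input_weights, bias_weight = weights
    return g(bias_weight + sum(w * x for w, x in zip(input_weights, inputs)))
```

With the identity activation, `neuron_with_bias([2.0, 3.0, -1.0], [1.0, 1.0], lambda s: s)` gives $ 2 + 3 - 1 = 4$; the bias weight shifts the sum even when all real inputs are zero, which is exactly what removing $ t$ without a bias neuron would lose.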

Figure 5: A fully connected multilayer feedforward network with one hidden layer and bias neurons.

2.2.3 Running Time of Executing an ANN

When executing an ANN, equation 2.5 needs to be calculated for each neuron that is not an input or bias neuron. This means that we have to do one multiplication and one addition for each connection (including the connections from the bias neurons); in addition, we need to make one call to the activation function for each neuron that is not an input or bias neuron. This gives the following running time:

$\displaystyle T = c A + (n - n_{i})G$ (2.7)

Where $ c$ is the number of connections, $ n$ is the total number of neurons, $ n_{i}$ is the number of input and bias neurons, $ A$ is the cost of multiplying a weight with an input and adding it to the sum, $ G$ is the cost of the activation function, and $ T$ is the total cost.

If the ANN is fully connected, $ l$ is the number of layers, and $ n_{l}$ is the number of neurons in each layer (not counting the bias neuron), this equation can be rewritten as:

$\displaystyle T = (l - 1)(n_{l}^{2} + n_{l}) A + (l - 1)n_{l} G$ (2.8)

This equation shows that the total cost is dominated by $ A$ in a fully connected ANN. This means that if we want to optimize the execution of a fully connected ANN, we need to optimize $ A$ and retrieval of the information needed to compute $ A$.
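As a quick sanity check of equation 2.8, the operation counts can be verified by direct enumeration (a small sketch; the network sizes used in the test are arbitrary examples):

```python
def counts_from_formula(l, n_l):
    # Equation 2.8: (l-1)(n_l^2 + n_l) applications of A and
    # (l-1) n_l applications of G.
    return (l - 1) * (n_l ** 2 + n_l), (l - 1) * n_l

def counts_by_enumeration(l, n_l):
    # Between each pair of consecutive layers there are n_l * n_l
    # ordinary connections plus n_l bias connections, and each of the
    # n_l receiving neurons makes one activation function call.
    connections = sum(n_l * n_l + n_l for _ in range(l - 1))
    activations = sum(n_l for _ in range(l - 1))
    return connections, activations
```

For instance, a fully connected network with $ l = 3$ layers of $ n_l = 4$ neurons has $ 2 \cdot (16 + 4) = 40$ connections but only $ 2 \cdot 4 = 8$ activation calls, illustrating why $ A$ dominates.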

2.3 Training an ANN

When training an ANN with a set of input and output data, we wish to adjust the weights in the ANN, to make the ANN give the same outputs as seen in the training data. On the other hand, we do not want to make the ANN too specific, making it give precise results for the training data, but incorrect results for all other data. When this happens, we say that the ANN has been over-fitted.

The training process can be seen as an optimization problem, where we wish to minimize the mean square error over the entire set of training data (for further information on the mean square error see section 5.3.3). This problem can be solved in many different ways, ranging from standard optimization heuristics like simulated annealing, through more specialized optimization techniques like genetic algorithms, to dedicated gradient descent algorithms like backpropagation.
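The quantity being minimized can be sketched as follows (a sketch of the usual definition; section 5.3.3 covers the exact definition used in the implementation):

```python
def mean_square_error(desired, actual):
    """Mean of the squared differences between desired and actual outputs."""
    squared_errors = [(d - y) ** 2 for d, y in zip(desired, actual)]
    return sum(squared_errors) / len(squared_errors)
```

Training drives this value toward zero on the training data, while over-fitting shows up as a low error here combined with a high error on data outside the training set.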

The most used algorithm is the backpropagation algorithm (see section 2.3.1), but this algorithm has some limitations concerning the extent to which the weights are adjusted in each iteration. This problem has been solved in more advanced algorithms like RPROP [Riedmiller and Braun, 1993] and quickprop [Fahlman, 1988], but I will not elaborate further on these algorithms.

2.3.1 The Backpropagation Algorithm

The backpropagation algorithm works in much the same way as the name suggests: after propagating an input through the network, the error is calculated, and the error is propagated back through the network while the weights are adjusted in order to make the error smaller. I will only explain the algorithm for fully connected ANNs, but the theory is the same for sparsely connected ANNs.

Although we want to minimize the mean square error over all the training data, the most efficient way of doing this with the backpropagation algorithm is to train on the data sequentially, one input at a time, instead of training on the combined data. This means that the order in which the data is given matters, but it also provides a very efficient way of avoiding getting stuck in a local minimum.

I will now explain the backpropagation algorithm in sufficient detail to allow an implementation from this explanation:

First the input is propagated through the ANN to the output. After this the error $ e_k$ on a single output neuron $ k$ can be calculated as:

$\displaystyle e_k = d_k - y_k$ (2.9)

Where $ y_k$ is the calculated output and $ d_k$ is the desired output of neuron $ k$. This error value is used to calculate a $ \delta_k$ value, which is in turn used for adjusting the weights. The $ \delta_k$ value is calculated by:

$\displaystyle \delta_k = e_k  g'(y_k)$ (2.10)

Where $ g'$ is the derivative of the activation function. The need for calculating this derivative is why I expressed the need for a differentiable activation function in section 2.2.1.
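For the sigmoid of equation 2.6, this derivative has a convenient closed form in terms of the neuron's output itself, $ g'(x) = 2 s \, g(x)(1 - g(x))$, which makes it cheap to evaluate during backpropagation since the output is already known from the forward pass. A sketch:

```python
import math

def sigmoid(x, s=1.0):
    # Equation 2.6: the sigmoid without the t value.
    return 1.0 / (1.0 + math.exp(-2.0 * s * x))

def sigmoid_derivative_from_output(y, s=1.0):
    # g'(x) = 2 s g(x) (1 - g(x)); only the output y = g(x) is needed,
    # not the original input x.
    return 2.0 * s * y * (1.0 - y)
```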

When the $ \delta_k$ values have been calculated, we can calculate the $ \delta_j$ values for the preceding layers. The $ \delta_j$ values of the previous layer are calculated from the $ \delta_k$ values of this layer, by the following equation:

$\displaystyle \delta_j = \eta  g'(y_j) \sum_{k=0}^{K} \delta_k w_{jk}$ (2.11)

Where $ K$ is the number of neurons in this layer and $ \eta$ is the learning rate parameter, which determines how much a weight should be adjusted. The more advanced gradient descent algorithms do not use a learning rate, but a set of more advanced parameters that make a more qualified guess as to how much each weight should be adjusted.

Using these $ \delta$ values, the $ \Delta w$ values that the weights should be adjusted by can be calculated as:

$\displaystyle \Delta w_{jk} = \delta_j  y_k$ (2.12)

The $ \Delta w_{jk}$ value is used to adjust the weight $ w_{jk}$ by $ w_{jk} = w_{jk} + \Delta w_{jk}$, and the backpropagation algorithm moves on to the next input, adjusting the weights according to its output. This process goes on until a certain stopping criterion is reached. The stopping criterion is typically determined by measuring the mean square error of the training data while training with the data; when this error reaches a certain limit, the training is stopped. More advanced stopping criteria involving both training and testing data are also used.
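The steps given by equations 2.9 through 2.12 can be sketched as small helper functions, one per equation (an illustrative sketch, not the implementation discussed later; here $ g'$ is taken as a function of the neuron's output, as is customary for the sigmoid):

```python
def output_delta(desired, actual, g_prime):
    # Equations 2.9 and 2.10: e_k = d_k - y_k, delta_k = e_k * g'(y_k).
    return (desired - actual) * g_prime(actual)

def hidden_delta(y_j, deltas_k, weights_jk, g_prime, eta):
    # Equation 2.11: delta_j = eta * g'(y_j) * sum_k delta_k * w_jk,
    # where the sum runs over the neurons k in the following layer.
    return eta * g_prime(y_j) * sum(d * w for d, w in zip(deltas_k, weights_jk))

def weight_adjustment(delta_j, y_k):
    # Equation 2.12: the amount Delta w_jk added to the weight w_jk.
    return delta_j * y_k
```

Each training pattern thus triggers one forward pass, one backward sweep computing the $ \delta$ values layer by layer, and one additive update per weight.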

In this section I have briefly discussed the mathematics of the backpropagation algorithm, but since this report is mainly concerned with the implementation of ANN algorithms, I have left out details unnecessary for implementing the algorithm. I refer to [Hassoun, 1995] and [Hertz et al., 1991] for a more detailed explanation of the theory behind this algorithm and of its mathematics.

2.3.2 Running Time of Backpropagation

The backpropagation algorithm starts by executing the network, which involves the amount of work described in section 2.2.3 in addition to the actual backpropagation.

If the ANN is fully connected, the running time of algorithms on the ANN is dominated by the operations executed for each connection (as with execution of an ANN in section 2.2.3).

The backpropagation is dominated by the calculation of the $ \delta_j$ values and the adjustment of the $ w_{jk}$ values, since these are the only calculations executed for each connection. Calculating $ \delta_j$ costs one multiplication and one addition per connection, and adjusting $ w_{jk}$ likewise costs one multiplication and one addition. This means that the total running time is dominated by two multiplications and two additions (three if you also count the multiplication and addition used in the forward propagation) per connection. This is only a small amount of work per connection, which hints at how important it is for the data needed in these operations to be easily accessible.

Steffen Nissen 2003-11-07