I have already mentioned that the kind of network I will implement is a feedforward network and that it should be fast. What I have not mentioned is which functionality I would like to implement in the library and which I would not.
The library should be easy to use and easy to train. For this reason, I would like to be able to train a network directly from training data stored in a file (with one function call). I would also like to be able to save and load all information about an ANN to and from a file. The most basic ANN operations should be very easy to use. The basic operations are: creating a network with reasonable defaults, training a network and executing a network.
I would like to make the library easy to alter in the future, which is why it should be possible to make sparsely connected ANNs. I would also like to implement at least one of the possible activation functions, while still allowing more to be added later.
It should be possible to alter all the parameters mentioned in the theory (section 2) at runtime. The parameters are: learning rate, activation function, the steepness value of the activation function and the values of the initial weights.
It should not be possible to hand design a network topology, but it should be possible to create a network and decide how many layers there should be, how many neurons there should be in each of these layers and how dense the network should be. I will get back to the density in section 4.3, but generally I want to be able to create a network and specify that, e.g., only half of the possible connections should be connected.
Some ANN packages have GUIs for viewing information about an ANN, but I do not consider this a primary goal of an ANN library, and for this reason I do not want to implement it. Due to the flexible nature of the library that I will implement (a network can be saved to and loaded from a file), it would be possible to make a stand-alone program that implements these kinds of features.
The first big question is how much of the functionality implemented in the floating point library should also be implemented in the fixed point library. The obvious answer would be all of it; this, however, raises a new question: what should be done when overflow occurs?
The hardware integer overflow exception is usually masked by the operating system or the compiler. This implies that the only real alternatives are to check for overflow on each calculation or not to check for overflow at all. Checking for overflow on each calculation would be too costly and would defeat the whole idea of using fixed point arithmetic for greater speed. Not checking at all, on the other hand, would allow overflow and, in consequence, unpredictable results.
This is an annoying problem, especially because you have no real control over the values of the weights. Usually in fixed point arithmetic you either guarantee that integer overflow can never occur, or you use a simple check that detects whether an overflow has occurred during a series of calculations. I cannot find any simple check that can guarantee that no overflow has occurred during a series of operations, but what I can do is guarantee that an overflow will never occur.
In order to make a guarantee that an overflow will never occur, I will have to reevaluate the amount of functionality which should be implemented in the fixed point library. Since fixed point arithmetic is mostly geared towards portable computers, it is safe to assume that there will be a standard PC available for training the ANN. This means that the training part of the ANN does not need to be implemented in fixed point. Another observation is that after the ANN is fully trained, the ANN never changes and it is therefore possible to make one check, after the training has finished, that will guarantee that an overflow will never occur.
These observations about the problem of fixed point arithmetic give rise to several different implementation strategies. In section 4.4 I will choose an appropriate strategy and prove that there can be no overflow using this strategy.
The cache optimizations are the most efficient, as can be seen in the benchmarks (section 6).
The backpropagation algorithm will have to visit all connections; this cannot be changed, and it is therefore not possible to improve the asymptotic running time of the backpropagation algorithm. However, as described in section 2.3, other more advanced algorithms exist which could get better results than the backpropagation algorithm. These algorithms do not execute faster than the backpropagation algorithm, but they adjust the weights more precisely, making them reach a result faster.
I have chosen to implement the backpropagation algorithm, because it is simple and effective enough in most cases. This decision means that I have knowingly not implemented an important optimization for the training algorithm, which implies that there is not much use in spending too much time on the other optimization strategies, because a highly tuned backpropagation algorithm will still be slower than an untuned RPROP algorithm. In spite of that, a basic level of optimization is still a desirable feature in the implementation of the backpropagation algorithm.
In conclusion: not much is done about the algorithms (although something could be done about the training), which means that the running time is still O(n), where n is the number of connections. However, there is still room for optimization of the overhead involved in executing the actual calculations.
These are obviously problems that could be fixed, while still using the object oriented approach, but the object oriented approach makes it difficult to do so.
A good architecture for a neural network should not take up too much space and should not include too deep a level of objects. On the other hand some level of object abstraction is highly desired. Perhaps a three level hierarchy would be acceptable, with the outer level consisting of the entire ANN, the next level consisting of the individual layers and the last level consisting of the single neurons and connections.
A good architecture will also allow for easy access to information like total number of neurons etc.
The architecture should ensure that data can be accessed sequentially for good cache performance. A good example of this is the weights, which should be accessed sequentially when executing the network. For this reason, the weights should be laid out in memory in one long array that can be accessed sequentially.
The algorithms themselves should obviously use this optimized architecture and access the data sequentially. The algorithms should also ensure that all the code they execute is located in the same place, to make optimal use of the code cache.
This is especially a good idea for the stop criterion of a loop, because this calculation is made in every iteration. If part of this calculation can be made only once, it yields a good performance increase. Similarly, ANN variables used in central loops can be copied to local variables beforehand, avoiding the overhead of fetching the variable from memory each time.
The central algorithms should be hand optimized to evaluate all common subexpressions at an early state, while the not so central algorithms should let the compiler take care of this optimization.
The in-lining can be done either by writing the code directly in the algorithm, by using inline functions, or by using macros.
Such an optimization benefits the performance in two ways: First, we can completely eliminate the memory which stores information about connections. Second, we can access the weights in one step less (one pointer dereference instead of two).
A short example of this is:
a[0] = b[0];
a[0] += b[1];
a[0] += b[2];
A programmer could rewrite this to the following (if the programmer were sure that a and b did not share data):
a[0] = b[0] + b[1] + b[2];
The compiler cannot do this, because it cannot be sure that b[1] and b[2] do not share the same memory as a[0] and are therefore not altered by a[0] = b[0];.
Steffen Nissen 2003-11-07