I have already mentioned that the kind of network I will implement is a feedforward network and that it should be fast. What I have not mentioned is which functionality I would like to implement in the library and which I would not.
The library should be easy to use and easy to train. For this reason, I would like to be able to train a network directly from training data stored in a file (with one function call). I would also like to be able to save and load all information about an ANN to and from a file. The most basic ANN operations should be very easy to use. The basic operations are: creating a network with reasonable defaults, training a network and executing a network.
I would like to make the library easy to alter in the future, which is why it should be possible to make sparsely connected ANNs. I would also like to implement at least one of the possible activation functions, while still allowing more to be added later.
It should be possible to alter all the parameters mentioned in the theory (section 2) at runtime. The parameters are: learning rate, activation function, the steepness value of the activation function and the values of the initial weights.
It should not be possible to hand design a network topology, but it should be possible to create a network and decide how many layers there should be, how many neurons there should be in each of these layers and how dense the network should be. I will get back to the density in section 4.3, but generally I want to be able to create a network and specify that, e.g., only half of the possible connections should be connected.
Some ANN packages have GUIs for viewing information about an ANN, but I do not consider this a primary goal of an ANN library, and for this reason I do not want to implement it. Due to the flexible nature of the library that I will implement (a network can be saved to and loaded from a file), it would be possible to make a stand-alone program that implements these kinds of features.
The first big question is how much of the functionality implemented in the floating point library should also be implemented in the fixed point library. The obvious answer would be all of it; this, however, raises a new question: what should be done when overflow occurs?
The hardware integer overflow exception is usually masked by the operating system or the compiler. This implies that the only real alternatives are to check for overflow on each calculation or not to check for overflow at all. Checking for overflow on each calculation would be too costly and would defeat the whole idea of using fixed point arithmetic for greater speed. Not checking at all, on the other hand, would allow overflow and, in consequence, unpredictable results.
This is an annoying problem, especially because you have no real control over the values of the weights. Usually in fixed point arithmetic you either guarantee that integer overflow can never occur, or you use a simple check that detects whether an overflow has occurred during a series of calculations. I cannot find any simple check that can guarantee that no overflow has occurred during a series of operations, but what I can do is guarantee that an overflow will never occur.
In order to make a guarantee that an overflow will never occur, I will have to reevaluate the amount of functionality which should be implemented in the fixed point library. Since fixed point arithmetic is mostly geared towards portable computers, it is safe to assume that there will be a standard PC available for training the ANN. This means that the training part of the ANN does not need to be implemented in fixed point. Another observation is that after the ANN is fully trained, the ANN never changes and it is therefore possible to make one check, after the training has finished, that will guarantee that an overflow will never occur.
These observations about the problem of fixed point arithmetic give rise to several different implementation strategies. In section 4.4 I will choose an appropriate strategy and prove that there can be no overflow using this strategy.
The cache optimizations are the most efficient, as can be seen in the benchmarks (section 6).
The backpropagation algorithm will have to visit all connections; this cannot be changed, and it is therefore not possible to improve the asymptotic running time of the backpropagation algorithm. However, as described in section 2.3, other more advanced algorithms exist which could get better results than the backpropagation algorithm. These algorithms do not execute faster than the backpropagation algorithm, but they adjust the weights more precisely, making them reach a result faster.
I have chosen to implement the backpropagation algorithm, because it is simple and effective enough in most cases. This decision means that I have knowingly not implemented an important optimization for the training algorithm, which implies that there is not much use in spending too much time on the other optimization strategies, because a highly tuned backpropagation algorithm will still be slower than an untuned RPROP algorithm. In spite of that, a basic level of optimization is still a desirable feature in the implementation of the backpropagation algorithm.
In conclusion: not much is done about the algorithms (although something could be done about the training), which means that the running time is still O(n), where n is the number of connections. However, there is still room for optimization of the overhead involved in executing the actual calculations.
These are obviously problems that could be fixed, while still using the object oriented approach, but the object oriented approach makes it difficult to do so.
A good architecture for a neural network should not take up too much space and should not include too deep a level of objects. On the other hand some level of object abstraction is highly desired. Perhaps a three level hierarchy would be acceptable, with the outer level consisting of the entire ANN, the next level consisting of the individual layers and the last level consisting of the single neurons and connections.
A good architecture will also allow for easy access to information like total number of neurons etc.
The architecture should ensure that data can be accessed sequentially for good cache performance. A good example of this is the weights, which should be accessed sequentially when executing the network. For this reason, the weights should be laid out in memory in one long array that can be accessed sequentially.
The algorithms themselves should obviously use this optimized architecture and access the data sequentially. The algorithms should also ensure that all the code they execute is located in the same place, to make optimal use of the code cache.
This is especially a good idea for the stop criterion of a loop, because this calculation is made in every iteration. If part of this calculation can be made only once, it yields a good performance increase. Similarly, ANN variables used in central loops can be copied to local variables beforehand, avoiding the overhead of fetching the variable from memory each time.
The central algorithms should be hand optimized to evaluate all common subexpressions at an early state, while the not so central algorithms should let the compiler take care of this optimization.
The in-lining can be done either by writing the code directly in the algorithm, by using inline functions, or by using macros.
Such an optimization benefits the performance in two ways: First, we can completely eliminate the memory which stores information about connections. Second, we can access the weights in one step less (one pointer dereference instead of two).
A short example of this is:
a[0] = b[0];
a[0] += b[1];
a[0] += b[2];
A programmer could rewrite this to the following (if the programmer were sure that a and b did not share data):
a[0] = b[0] + b[1] + b[2];
The compiler cannot do this, because it cannot be sure that b[1] and b[2] do not share the same memory as a[0] and are therefore not altered by a[0] = b[0];.
Steffen Nissen 2003-11-07