Neural Networks


Neural networks are a popular tool for data mining and image processing. They originated from attempts to model the human thought process as an algorithm that can be run efficiently on a computer.

Brain Neurons

The human brain consists of neurons that send activation signals to each other (figure on the left) thereby creating intelligent thoughts. The algorithmic version of a neural network (called an artificial neural network) also consists of neurons which send activation signals to one another (figure below). The end result is that the artificial neural network can approximate a function of multiple inputs and outputs. As a consequence, neural networks can be used for a variety of data mining tasks, among which are classification, descriptive modeling, clustering, function approximation, and time series prediction. Neural networks are also commonly used for image processing.

Artificial Neural Network

Because the field of neural networks is very broad, this project focuses on feed-forward neural networks, perhaps the most common type. Other popular types include radial basis function networks, self-organizing maps, and recurrent neural networks. This page includes a description of feed-forward neural networks and the back-propagation learning algorithm, an overview of an implementation, several experiments with neural networks as classifiers on real and synthetic datasets, and a comparison between neural networks and some other classification algorithms.

Algorithm Description

Labeled Artificial Neural Network

Artificial neural networks (the ones that run on a computer as opposed to a brain) can be thought of as a model that approximates a function of multiple continuous inputs and outputs. The network consists of a topology graph of neurons, each of which computes a function (called an activation function) of the inputs carried on its in-edges and sends the result on its out-edges. The inputs are weighted by weights w_ij and shifted by a bias factor specific to each neuron. It has been shown that, for certain neural network topologies, any continuous function can be accurately approximated by some set of weights and biases. Therefore, we would like an algorithm which, when given a function f, learns a set of weights and biases that accurately approximates the function. For feed-forward neural networks (artificial neural networks whose topology graph contains no directed cycles), the back-propagation algorithm described later does exactly that.

Computing Output

We will first see how to compute the output of an already trained neural network, which means that the entire network (topology, activation functions, weights, and biases) is known at this point. In a later section, we will see how to learn a neural network that approximates a given function, but for now just assume that the network is given and has already been trained. It might be helpful to refer to the figure above while reading this section. Recall that the purpose of a feed-forward neural network is to approximate a function of multiple inputs and outputs. The inputs and outputs are continuous (real) values. Each input to the neural network is fed into a different input neuron. The input neurons are connected to hidden neurons, which perform transformations on the input. At the end of the graph are the output neurons, whose outputs are defined to be the output of the entire network. It is also possible to connect an input neuron directly to an output neuron (without any hidden neurons in between), but that is usually a bad idea: without hidden neurons to transform the data, the network cannot accurately approximate most functions.

Computing the Output of a Single Neuron

We will now see how to compute the output of a single neuron N_i. A neuron has in-edges (edges pointing into the neuron) and out-edges (edges pointing out of the neuron). Each edge pointing into a neuron either comes from another neuron or directly from an input to the entire network. For an in-edge coming from neuron N_k, the input carried across that edge is the output of N_k. For an edge coming directly from an input x to the entire network, the input carried across that edge is simply x. In the figure at the left, the bias of the neuron is drawn as an in-edge with input 1 and weight equal to the bias.

Activation Functions

Each neuron has an activation function associated with it (often all neurons share the same activation function). A list of common activation functions and their names can be found in Table 1.

The steps for computing the output of a single neuron are as follows: (1) compute the weighted sum of inputs to the neuron (2) add the bias to the sum (3) feed the sum as an input to the activation function of the neuron. The output of the activation function is defined to be the output of the neuron. These steps can be summarized in the following formula: Output=A(∑w_k*I_k+bias) where A is the activation function of the neuron, w_k is the weight of the kth in-edge, I_k is the input carried across the kth in-edge, and bias is the bias of the neuron.
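The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual implementation; the sigmoid is just one common choice of activation function A:

```python
import math

def sigmoid(x):
    """Logistic function, one common choice of activation function A."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(weights, inputs, bias, activation=sigmoid):
    """(1) weighted sum of inputs, (2) add the bias, (3) apply A."""
    total = sum(w * i for w, i in zip(weights, inputs)) + bias
    return activation(total)

# A neuron with two in-edges (weights 0.5 and -0.25) and bias 0.1:
neuron_output([0.5, -0.25], [1.0, 2.0], bias=0.1)  # sigmoid(0.1) ≈ 0.525
```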

Now that we know how to compute the output of a single neuron, we can easily compute the output of the entire network. The only tricky part is that the outputs of the neurons have to be computed in a topologically sorted order so that all of the inputs of each neuron have been computed before its output is computed.
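This forward pass can be sketched as follows. The representation is a hypothetical one chosen for brevity: each neuron is stored as a dictionary listing its in-edge sources (either a network input or an earlier neuron), matching weights, and a bias, and the neuron list is assumed to be given already topologically sorted:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(inputs, neurons):
    """Compute every neuron's output in order.

    `neurons` must be topologically sorted; each neuron is a dict with
    'in' (list of ('input', i) or ('neuron', j) sources), 'w' (matching
    weights), and 'bias'. Returns a dict mapping neuron index to output.
    """
    out = {}
    for j, n in enumerate(neurons):
        total = n['bias']
        for (kind, k), w in zip(n['in'], n['w']):
            total += w * (inputs[k] if kind == 'input' else out[k])
        out[j] = sigmoid(total)
    return out

# Tiny network: hidden neuron 0 reads both inputs, output neuron 1 reads it.
net = [
    {'in': [('input', 0), ('input', 1)], 'w': [0.5, -0.5], 'bias': 0.0},
    {'in': [('neuron', 0)], 'w': [1.0], 'bias': 0.0},
]
outs = forward([1.0, 1.0], net)  # outs[1] is the network's output
```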

Learning the Weights and Biases

We have seen how to compute the output of an artificial neural network when all weights and biases are specified. However, in practice, we rarely know the weights without learning them from the training data.

We first formally state what it means to learn the weights and biases from the data:

  • Given:
    • The topology of a neural network.
    • The activation function for each neuron in the network A_i.
    • Training data: pairs of inputs and desired outputs, (X_i, Y_i). Each X_i denotes the inputs fed to the input neurons. Each Y_i is the desired output of the neural network for the inputs in X_i. The entire training dataset is D = { (X_i, Y_i) }.
    • A learning rate, r: a continuous value between 0 and 1 (described later).
  • Find:
    • The weights of edges and biases of neurons in the network which minimize the sum of squares error (SSE) on the training data.

The sum of squares error (SSE) of the network with respect to the training data D is computed using the formula,

SSE(D)=Sum((Y_i - Z_i)^2)

where Z_i is the set of outputs of the neural network for the set of inputs X_i.
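The SSE formula translates directly into code. In this sketch, `desired` and `actual` are lists of output sets, one per training instance, and the squared differences are summed over every output neuron of every instance:

```python
def sse(desired, actual):
    """Sum of squared errors over all training pairs and all outputs."""
    return sum((y - z) ** 2
               for Y, Z in zip(desired, actual)
               for y, z in zip(Y, Z))

# One training instance with two outputs: (1.0 - 0.8)^2 + (0.0 - 0.1)^2
sse([[1.0, 0.0]], [[0.8, 0.1]])  # ≈ 0.05
```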

Several neural network learning algorithms exist, but this section describes what is perhaps the most commonly used one, back-propagation learning. The weights and biases are initially assigned random continuous (real) values between -1 and 1. The algorithm trains the neural network incrementally: one training instance is considered at a time, and the weights and biases are updated after each instance before the next one is considered. The update is done in a way that performs a gradient descent minimization of the error with respect to the weights and biases.

Back Propagation

The algorithm for learning an instance (X_i, Y_i) = ((x_1, x_2, …, x_u), (y_1, y_2, …, y_v)) can be split into three phases:

  1. Calculate Output - Calculate Z_i, the output of the network for the training instance’s input set X_i.
  2. Calculate Blame - Y_i is the desired value for Z_i, so when they are different, there is some error. From this error, we wish to compute a blame value for each neuron. Blames will later be used to decide by how much to adjust weights and biases. The figure above illustrates this step.
    • Calculate Blame for Output Neurons - The blame of each output neuron is calculated as y_i - z_i. Note that even though our goal is to minimize the sum squared error of the entire network, the blames are just the linear error. The blame is linear because it must be proportional to the partial derivatives needed for the gradient descent.
    • Calculate Blame for Input and Hidden Neurons - The blame for an input or hidden neuron is calculated in terms of the blames of its out-neurons (the ones that it is pointing to). Because of this dependency, the blames need to be calculated in a topologically sorted order of the reverse of the network topology graph. This is similar to the order in which the output of neurons are calculated except that the topology graph must first be reversed before topologically sorting. The formula for the blame of an input or hidden neuron is ∑w_k*e_k where e_k is the blame of an out-neuron and w_k is the weight of the edge that connects to the out-neuron. This step of propagating the error backwards is why the learning algorithm is called back-propagation.
  3. Adjust Weights and Biases – This last step performs the actual gradient descent by adjusting the weights with the formula: w_ij = w_ij + r * e_j * A_j'(I_j) * O_i where r is the learning rate, e_j is the blame of neuron j, A_j' is the derivative of neuron j's activation function, I_j is the input fed to neuron j during calculation of the output in the first step, and O_i is the output of neuron i during the first step. The biases are similarly adjusted using the formula: bias_i = bias_i + r*e_i.
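The three phases can be sketched for the special case of a single hidden layer with sigmoid activations. The class and its names are hypothetical, and the updates follow the formulas above exactly as stated (including the bias rule bias += r * e, which omits the derivative factor); a minimal sketch, not a tuned implementation:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dsigmoid(x):
    """Derivative A'(x) of the sigmoid activation function."""
    s = sigmoid(x)
    return s * (1.0 - s)

class TinyNet:
    """One hidden layer; weights and biases start random in [-1, 1]."""
    def __init__(self, n_in, n_hidden, n_out, rate=0.1):
        rnd = lambda: random.uniform(-1.0, 1.0)
        self.w_ih = [[rnd() for _ in range(n_hidden)] for _ in range(n_in)]
        self.w_ho = [[rnd() for _ in range(n_out)] for _ in range(n_hidden)]
        self.b_h = [rnd() for _ in range(n_hidden)]
        self.b_o = [rnd() for _ in range(n_out)]
        self.rate = rate

    def forward(self, x):
        # Phase 1: each neuron's input sum I_j and output O_j.
        self.in_h = [sum(x[i] * self.w_ih[i][j] for i in range(len(x)))
                     + self.b_h[j] for j in range(len(self.b_h))]
        self.out_h = [sigmoid(v) for v in self.in_h]
        self.in_o = [sum(self.out_h[j] * self.w_ho[j][k]
                         for j in range(len(self.out_h)))
                     + self.b_o[k] for k in range(len(self.b_o))]
        return [sigmoid(v) for v in self.in_o]

    def train_instance(self, x, y):
        z = self.forward(x)                        # phase 1: calculate output
        e_o = [yk - zk for yk, zk in zip(y, z)]    # phase 2: output blame y - z
        e_h = [sum(self.w_ho[j][k] * e_o[k] for k in range(len(e_o)))
               for j in range(len(self.out_h))]    # hidden blame: sum w_k * e_k
        # Phase 3: w_ij += r * e_j * A'(I_j) * O_i, and bias_j += r * e_j.
        for j in range(len(self.out_h)):
            for k in range(len(e_o)):
                self.w_ho[j][k] += (self.rate * e_o[k]
                                    * dsigmoid(self.in_o[k]) * self.out_h[j])
        for i in range(len(x)):
            for j in range(len(e_h)):
                self.w_ih[i][j] += (self.rate * e_h[j]
                                    * dsigmoid(self.in_h[j]) * x[i])
        for k in range(len(e_o)):
            self.b_o[k] += self.rate * e_o[k]
        for j in range(len(e_h)):
            self.b_h[j] += self.rate * e_h[j]
        return z
```

With a small learning rate, a single call to train_instance nudges the network's output toward the desired value for that instance.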

After all training instances have been run, learning is complete. However, we may sometimes wish to run the training data through the learning algorithm multiple times to achieve better accuracy on the training data. When using this technique, it is important to make sure that overfitting does not occur.


Experiments

Unlike many other classifiers, neural networks require many parameters to be specified, and many of these parameters can significantly affect their accuracy and runtime. This section contains experiments that show the effect of varying the number of neurons in the hidden layer and of varying the number of times the training data is duplicated. Each experiment and its results are described below.

Number of Hidden Neurons

What topology should we pick before training a neural network? The answer to this question is quite non-trivial. We have an infinite set of directed acyclic graphs to choose from. The only restriction is that the number of input nodes must match the number of inputs to the function and the number of output nodes must match the number of outputs of the function. But we are free to choose any topology for the hidden neurons as long as there is a path from each input neuron to some output neuron and vice versa.

Some sources claim that one hidden layer of neurons between the input nodes and output nodes is enough to accurately approximate any continuous function. This experiment tests the effect of varying the number of nodes in a single hidden layer between the input and output neurons. The charts show the error on training and testing sets. Surprisingly, on the UCI Mushroom database, using more than 5 neurons in the hidden layer did not affect the classification accuracy on the training or testing data. This is an extremely important finding because the computational complexity of the learning algorithm increases quadratically with the number of neurons in the network, and this experiment shows that only a few neurons are necessary for this dataset.

Duplicating Training Data

The learning algorithm used in this project (back-propagation learning) is a form of gradient descent. Gradient descent may sometimes take a long time to reach a local minimum. The more data that is fed into the learning algorithm, the more “time” the gradient descent algorithm gets to find a local minimum. For this reason, it makes sense to duplicate the training data several times and feed it into the neural network.

This experiment shows the effect of duplicating the training data. Every time another copy of the training data is fed into the algorithm, it is fed in a new, randomly shuffled order, which reduces dependency on the order of the training data. The charts below show the effect of duplicating training data for the purpose of classifying mushrooms in the UCI Mushroom database. The results show that there is in fact a decrease in both the testing and training errors when duplication is used. In fact, it is necessary to duplicate the data at least 10 times before the testing error levels off. To my surprise, I did not find this technique mentioned anywhere in the neural network literature I read.
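Duplicating the training data amounts to making repeated shuffled passes over the training set. A minimal sketch, assuming the model exposes a train_instance(x, y) method (a hypothetical name):

```python
import random

def train_with_duplication(model, data, n_copies=10):
    """Feed the training data n_copies times, reshuffling each pass."""
    data = list(data)
    for _ in range(n_copies):
        random.shuffle(data)           # new random order every pass
        for x, y in data:
            model.train_instance(x, y)
```

Shuffling between passes matters because back-propagation is incremental: without it, each pass would apply the same sequence of updates.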

Comparison to Other Classifiers

Although tweaking the parameters of a neural network is interesting, one may wonder whether neural networks are even a good alternative to other classification algorithms such as decision trees and naïve Bayes. The chart below shows the classification accuracy of neural networks compared to that of decision trees and naïve Bayes. It seems that neural networks perform about the same as those two classifiers and often produce slightly better results than naïve Bayes.

Source Code

Coming soon! I need to separate the neural network code from a larger project first.

Related Work

© 2014 Emil Stefanov