Neural Networks
Introduction
Neural networks are a very popular data mining and image processing tool.
Their origin stems from the attempt to model the human thought process as a an algorithm which can be efficiently run on a
computer.
The human brain consists of neurons that send activation signals to each
other (figure on the left) thereby creating intelligent
thoughts. The algorithmic version of a neural network (called an artificial neural network)
also consists of neurons which send activation signals to one another (figure below). The end result is
that the artificial neural network can approximate a function of multiple inputs and outputs. As a consequence, neural
networks can be used for a variety of data mining tasks, among which are classification, descriptive modeling, clustering,
function approximation, and time series prediction. Neural networks are also commonly used for image processing.
Because the breadth of neural networks is very large, this project focuses on
feed-forward neural networks, perhaps the most common type. Other popular types
include radial basis function networks, self organizing maps, and recurrent
neural networks. This page includes a description of feed-forward neural
networks and the back-propagation learning algorithm, an overview of an
implementation, several experiments of neural networks as classifiers on real
and synthetic datasets, and a comparison between neural networks and some other
classification algorithms.
Algorithm Description
Artificial neural networks (the ones that run on a computer as opposed to
a brain) can be thought of as a model which approximates a function of multiple
continuous inputs and outputs. The network consists of a topology graph
of neurons, each of which computes a function (called an activation
function) of the inputs carried on the in-edges and sends the output on its
out-edges. The inputs and outputs are weighed by weights w_ij and shifted
by bias factor specific to each neuron. It has been shown that for
certain neural network topologies, any continuous function can be accurately
approximated by some set of weights and biases. Therefore, we would like to have
an algorithm which when given a function f, learns a set of weights and biases
which accurately approximate the function. For feed-forward neural networks
(artificial neural networks where the topology graph does not contain any
directed cycles), the back-propagation algorithm described later does exactly
that.
Computing Output
We will first see how to compute the output of an already learned neural network
which means that the entire network (topology, activation functions, weights,
and biases) is known at this point. In a later section, we will see how to
construct a neural network to approximate any given function, but for now just assume
that the neural network is given and that it has been trained. It might be
helpful to refer to the figure above while reading this section. Recall that the purpose
of a feed-forward neural network is to approximate a function of multiple inputs
and outputs. The inputs and outputs are continuous (real values). Each input to
the neural network is fed as input into a different input neuron. The
input neurons are connected to hidden neurons which perform transformations on
the input. At the end of the graph are the output neurons. The output of
these neurons is defined to be the output of the entire network. It is also
possible to connect an input neuron directly to an output neuron (without any
hidden neurons in between) but that is usually not a good idea because there are
no hidden neurons to perform transformations on the data and then it would not
be possible to approximate a function accurately.
We will now see how to compute the output of a single neuron N_i. A neuron has
in-edges (edges pointing into the neuron) and out-edges (edges pointing out of
the neuron). Each edge pointing into a neuron either comes from another neuron
or directly from an input to the entire network. For an in-edge coming from
neuron N_k, the input carried across that edge is the output of N_k. For
an edge coming directly from an input x to the entire network, the input carried
across that edge is simply x. In the figure at the left, the bias of the neuron is drawn as an
in-edge with input 1 and weight equal to the bias.
Each neuron has an activation function associated with it (often times all
neurons have the same activation function). A list of common activation
functions and their names can be found in Table 1.
The steps for computing the output of a single neuron are as follows: (1) compute
the weighted sum of inputs to the neuron (2) add the bias to the sum (3) feed
the sum as an input to the activation function of the neuron. The output of the
activation function is defined to be the output of the neuron. These steps can
be summarized in the following formula: Output=A(∑w_k*I_k+bias) where A is the
activation function of the neuron, w_k is the weight of the kth in-edge, I_k is
the input carried across the kth in-edge, and bias is the bias of the neuron.
Now that we know how to compute the output of a single neuron, we can easily
compute the output of the entire network. The only tricky part is that the
output of the neurons have to be computed in a topologically sorted order so
that all of the inputs of each neuron have been computed before trying to
compute its output.
Learning the Weights and Biases
We have seen how to compute the output of an artificial neural network when all
weights and biases are specified. However, in practice, we rarely know the
weights without learning them from the training data.
We first formally state what it means to learn the weights and biases from
the data:
- Given:
- The topology of a neural network.
- The activation function for each neuron in the network A_i.
- Training data: pairs of sets of inputs and outputs, (X_i,Y_i). Each X_i
denotes the input to all input neurons. Each Y_i is the desired output of the
neural network after running the inputs in X_i. The entire training dataset is
D=Union( (X_i, Y_i) )
- A learning rate, r: a continuous value between 0 and 1 (described later).
- Find:
- o The weights of edges and biases of neurons in the network which minimize the
sum of squares error (SSE) of the training data.
The sum of squares error (SSE) of the network with respect to the training
data D is computed using the formula,
SSE(D)=Sum((Y_i - Z_i)^2)
where Z_i is the set of outputs of the neural network for the set of inputs X_i.
Several neural network learning algorithms exist, but this section will describe
what is perhaps the most commonly used algorithm, back-propagation learning.
The weights and biases are initially assigned to a random continuous (real)
value between -1 and 1. The algorithm trains the neural network incrementally,
meaning that each instance is considered independently, and after considering a
training instance, the weights and biases are updated before considering another
one. The update is done in a way that performs a gradient descent minimization
of the error with respect to the weights and biases.
The algorithm for learning an instance (X_i,Y_i )=((x_1,x_2,…,x_u
),(y_1,y_2,…,y_v ) ) can be split into three phases:
- Calculate Output - Calculate
Z_i, the output of the network for the training instance’s input set X_i.
- Calculate Blame - Y_i is the
desired value for Z_i, so when they are different, there is some error. From
this error, we wish to compute a blame value for each neuron. Blames will
later be used to decide by how much to adjust weights and biases. The figure above
illustrates this step.
- Calculate Blame for Output Neurons
- The blame of each output neuron is calculated as y_i-z_i. Note that even
though our goal is to minimize the sum squared error of the entire
network, the blames are just the linear error. This linearity is due to
the fact that we need the blame to be proportional to partial derivatives which
are needed for the gradient descent.
- Calculate Blame for Input and Hidden
Neurons - The blame for an input or hidden neuron is calculated in terms
of the blames of its out-neurons (the ones that it is pointing to). Because of
this dependency, the blames need to be calculated in a topologically sorted
order of the reverse of the network topology graph. This is similar to the order
in which the output of neurons are calculated except that the topology graph
must first be reversed before topologically sorting. The formula for the blame
of an input or hidden neuron is ∑w_k*e_k where e_k is the blame of an out-neuron
and w_k is the weight of the edge that connects to the out-neuron. This step of
propagating the error backwards is why the learning algorithm is called
back-propagation.
- Adjust Weights and Biases – This
last step performs the actual gradient descent by adjusting the weights with the
formula: w_ij = w_ij + r * e_j * A_j'(I_j) * O_i where r is the learning rate, e_j is the blame
of neuron j, A_j' is the derivative of neuron j's activation function, I_j is the input fed to neuron j during calculation of the output
in the first step, and O_i is the output of neuron i during the first step. The
biases are similarly adjusted using the formula: bias_i = bias_i + r*e_i.
After all training instances are run, learning is complete. However, sometimes
we may wish to run the training data multiple times through the learning
algorithm to achieve better accuracy on the training data. When using this
technique, it is important to make sure that over fitting does not occur.
Experiments
Unlike many other classifiers, neural networks need a lot of parameters
specified and many of these parameters can significantly affect the accuracy and
runtime of neural networks. This section contains experiments that show the
effect of the varying the number of neurons in the hidden layer and varying the
number of times the training data is duplicated. Each experiment and its results
are described.
Number of Hidden Neurons
What topology should we pick before training a neural network? The answer to
this question is quite non-trivial. We have an infinite set of directed acyclic
graphs to choose from. The only restriction is that the number of input nodes
must match the number of inputs to the function and the number of output nodes
must match the number of outputs of the function. But we are free to choose any
topology for the hidden neurons as long as there is a path from each input
neuron to some output neuron and vice versa.
Some sources claim that one hidden layer of neurons between the input nodes and
output nodes is enough to accurately approximate any continuous function. This
experiment tests the effect of varying the number of nodes in a single hidden
layer between the input and output neurons. The charts show the error on
training and testing sets. Surprisingly, on the UIC mushroom database, more than
5 neurons in the hidden layer did not affect the classification accuracy on the
training or testing data. This is an extremely important finding because the
computational complexity of the learning algorithm increases quadratically with
the number of neurons in the network and this experiment shows that only very
few neurons are necessary for this dataset.
Duplicating Training Data
The learning algorithm used in this project (back-propagation learning) is a
form of gradient descent. Gradient descent may sometimes take a long time to
reach a local minimum. The more data that is fed into the learning algorithm,
the more “time” the gradient descent algorithm gets to find a local minimum. For
this reason, it makes sense to duplicate the training data several times and
feed it into the neural network.
This experiment shows the effect of duplicating the training data. Every time
another copy of the training data is fed into the algorithm, it is fed in a new
and randomly shuffled order which reduces dependency on the order of the
training data. The charts below show the effect of duplicating training data for
the purpose of classifying mushrooms in the
UIC Mushroom database. The results
show that in fact there is a decrease in both the testing and training errors
when duplication is used. In fact, it is necessary to duplicate the data at
least 10 times before the testing error levels off. To my surprise, I did not
find this technique mentioned anywhere in the neural network literature I read.
Comparison to Other Classifiers
Although tweaking the parameters of a neural network is interesting, one may
wonder if neural networks are even a good alternative to other classification
algorithms such as decision trees and naïve Bayes. The chart below shows the
classification accuracy of neural networks compared to that of decision trees
and naïve Bayes. It seems that neural networks perform about the same as those
two classifiers and often times produce a little better results than naïve
Bayes.
Source Code
Coming soon! I need to separate the neural network code from a larger project first.
Related Work
|