# What Makes ConvNets Different

Author: Oluwole Oyetoke (29th September, 2017)

## Background

When I first heard about Convolutional Neural Networks (ConvNets/CNNs), one of the questions that puzzled me most was what distinguishes them from ordinary Multi-Layer Perceptrons (MLPs). Digging a bit deeper, I came to see what makes them peculiar. There is nothing magical about them; rather, the concept is once again rooted in a biologically inspired pipeline. This time, the inspiration comes from the visual cortex, whose neurons get excited by the presence of specific features in a scene, so that each layer of the cortex can assert the presence of a particular feature (e.g. edges, lines etc.). Cascading these activities helps the higher-level layers detect more complex features and arrive at a recognition.

Similarly, ConvNets are built around feature extractors, which first extract the important features from the input data, primarily by convolving multiple filters (of different shapes and sizes) with it. ConvNets are best known for image classification tasks, although they are not restricted to these alone.

## More on ConvNets

In 1989, LeCun et al. introduced Convolutional Neural Networks (ConvNets, CNNs) for application in computer vision. CNNs take images directly as input and, instead of relying on handcrafted features, automatically learn a hierarchy of features which can then be used for classification. This is accomplished by successively convolving the input image with learned filters to build up a hierarchy of feature maps. The hierarchical approach allows the network to learn more complex, translation- and distortion-invariant features in its higher layers. Deep CNNs are analogous to traditional ANNs and can be trained with the same traditional methods, namely a combination of parameter optimization and error backpropagation. This property is due to the constrained architecture of CNNs, which is specific to inputs for which discrete convolution is defined. The original CNN is based on weight sharing, and the most notable difference between CNNs and traditional ANNs is that CNNs are primarily used for pattern recognition within images. This allows developers to encode image-specific properties into the architecture, making the network better suited to image-focused tasks while also reducing the number of parameters required to set up the model.

As we know, Neural Networks are biologically inspired. Just as with the Multi-Layer Perceptron (MLP), the connectivity pattern between the neurons of a CNN is biologically inspired, but by a different kind of biological organization: that of the animal visual cortex, which has small regions of cells that are sensitive to specific regions of the visual field. It was discovered that these neuronal cells are organized in a columnar architecture and that together they produce visual perception. In summary, the observation showed that some neurons in the brain respond only in the presence of edges of a particular orientation, and this idea of a system made up of components which respond only to specific features later became the founding idea behind the CNN.

Overall, the CNN takes in a collection of inputs and processes them through a series of convolutional stages, whose different classes of filters make up its different layers, in order to extract specific features from the input. The extracted features are then passed on to the Fully Connected Layer (FCL) of the architecture, which is primarily a classifier. CNNs are particularly used for image classification, especially where huge data sets are involved. Image classification is the task of taking an input image and outputting a class (a table, chair, stool) or a probability distribution over classes that best describes the image. The CNN takes an input array representing an image (coloured or grey scale) and produces output numbers that describe the probability of the image belonging to each class. The Convolution and Pooling layers act as Feature Extractors for the network, while the Fully Connected layer acts as a Classifier.

Diagram 1: Architectural View of A Typical ConvNet Model

The CNN is one of several deep learning architectures, alongside others such as the Deep Neural Network (DNN), Deep Belief Network (DBN) and Recurrent Neural Network (RNN), which are applied to fields such as computer vision, speech recognition and natural language processing. Deep learning architectures, as the name implies, are deep structured learning architectures which use a cascade of many layers of nonlinear processing units for feature extraction and transformation. Deep learning is part of the broader field of machine learning, and each layer of a deep learning system passes a more refined version of the data it receives on to the next layer.

## Essential CNN Terms

For proper understanding and to prevent obscurity, some selected CNN terms are explained below.

1. Channel: This is a conventional term used to refer to a certain component of an image. Coloured images have 3 channels, the Red, Green and Blue components, which make up three 2-dimensional matrices. A grey scale image, on the other hand, is made up of 1 channel (i.e. one 2-dimensional matrix) with pixel values ranging from 0 to 255, where zero (0) indicates black and two hundred and fifty-five (255) indicates white.
2. Depth: Depth corresponds to the number of filters used for a convolution operation in CNNs.
3. Stride: Stride is the number of pixels by which the filter matrix is slid over the input matrix at each step.
4. Parameter Sharing: Unlike the Multi-Layer Perceptron (MLP), in which every neuron is connected to every neuron in the next layer, each neuron of a CNN filter is connected to only a small region of the input to that layer, and the same filter weights are reused at every position of the input. This is based on the assumption that if a feature is useful to compute at some spatial position (x1, y1), then it should also be useful to compute at a different position (x2, y2).
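To give a feel for what parameter sharing saves, here is a minimal sketch that counts weights for a fully connected layer and for a convolutional layer. The 32x32x3 input, the 16 neurons/filters and the 3x3 kernel are illustrative assumptions, not values taken from this article, and biases are left out.

```python
def dense_weight_count(in_h, in_w, in_c, n_neurons):
    """Weights when every neuron is connected to every input value (MLP style)."""
    return in_h * in_w * in_c * n_neurons


def conv_weight_count(k_h, k_w, in_c, n_filters):
    """Weights when each filter is reused at every spatial position."""
    return k_h * k_w * in_c * n_filters


# Hypothetical 32x32 RGB input, 16 neurons versus 16 filters of size 3x3x3.
dense = dense_weight_count(32, 32, 3, 16)
conv = conv_weight_count(3, 3, 3, 16)
print(dense, conv)  # 49152 432
```

The convolutional count does not depend on the input width or height at all, which is the practical consequence of sharing the filter weights across positions.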

## Typical CNN Layers

LeNet was one of the very first convolutional neural networks and helped propel the field of Deep Learning. This pioneering work by Yann LeCun went through several iterations from the late 1980s before the version known as LeNet-5 was published in 1998. Several new architectures have been proposed in recent years which improve on LeNet, but they all use its main concepts. There are four main layers of operations in the ConvNet:

1. Convolution
2. Non-Linearity (ReLU)
3. Pooling or Sub Sampling
4. Classification (Fully Connected Layer)

## Convolution

At the convolutional layer, the input collection (array, matrix) is convolved with a filter (also called a neuron, feature detector or kernel), which is itself a collection of numbers called weights. The filter and the input array must have equal depths for the convolution to be possible. The array resulting from this convolution is termed the feature map or activation map. Each position of the feature map is generated by multiplying the kernel element-wise with the region of the input matrix at that location and summing up the products.

Diagram 2: Convolution Operation
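The operation can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: a single-channel input, no padding, and nested lists instead of an array library. The function name `conv2d` and the sample values are made up for the example. As is usual in CNNs, the kernel is not flipped, so strictly speaking this is a cross-correlation.

```python
def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (both lists of lists), without padding.

    Each output value is the sum of the element-wise products of the
    kernel and the image patch currently underneath it.
    """
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (len(image) - k_h) // stride + 1
    out_w = (len(image[0]) - k_w) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for a in range(k_h):
                for b in range(k_w):
                    total += image[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(total)
        feature_map.append(row)
    return feature_map


image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]
print(conv2d(image, kernel))            # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
print(conv2d(image, kernel, stride=2))  # [[4, 4], [2, 4]]
```

A 3 by 3 kernel over a 5 by 5 input yields a 3 by 3 feature map at stride 1 and a 2 by 2 feature map at stride 2.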

CNNs aim to use the spatial information between the pixels of an image, and convolution preserves the spatial relationship between pixels by learning image features using small squares of input data. They are therefore based on discrete convolution. Each of the filters at the convolution layer of the CNN can be thought of as a feature identifier, just like the biological neural cells in the cortex which respond only to certain kinds of patterns. Since each filter extracts its own information from the input image, the more filters are applied to the input, the more feature maps (i.e. the greater the depth) will be generated, e.g. a 4 by 3 by 72 feature map in the case where 72 different filters are used. Other layers exist in between the convolutional layers of the CNN; their primary aim is to provide nonlinearities and preservation of dimension, which help to improve the robustness of the network and control overfitting.

The kernel does not have to move over the input matrix one pixel at a time during the convolution. Different stride values can be selected; however, the bigger the stride (the jump between steps during convolution), the smaller the feature maps that will be generated. It is also sometimes necessary to zero-pad the input to fit a desired dimension, which gives a way of regulating the size of the resulting feature map. In other words, even if the input matrix is too small to accommodate the feature extractor during the convolution, it can be zero-padded to an arbitrary width and height to suit the feature extractor in use. Zero padding also allows the kernel to act on the border values of the input more effectively.
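The effect of stride and zero padding on the feature map size can be checked with a small sketch. The helper names `zero_pad` and `positions` and the sample sizes are assumptions made for illustration only.

```python
def zero_pad(matrix, pad):
    """Surround a 2-D list with `pad` rows and columns of zeros."""
    width = len(matrix[0]) + 2 * pad
    padded = [[0] * width for _ in range(pad)]
    for row in matrix:
        padded.append([0] * pad + list(row) + [0] * pad)
    padded.extend([0] * width for _ in range(pad))
    return padded


def positions(size, kernel, stride):
    """Count the places a kernel of width `kernel` can sit along one axis."""
    return len(range(0, size - kernel + 1, stride))


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
padded = zero_pad(image, 1)
for row in padded:
    print(row)

print(positions(7, 3, 1))            # 5: stride 1 over a width of 7
print(positions(7, 3, 2))            # 3: a bigger stride gives a smaller map
print(positions(len(image), 3, 1))   # 1: a 3x3 kernel fits a 3x3 input once
print(positions(len(padded), 3, 1))  # 3: after padding, the output is 3 wide
```

With one ring of zeros, the 3 by 3 kernel can be centred on every original pixel, including those on the border.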

## Non Linearity (ReLU)

In most CNN architectures, every convolution operation is followed by a non-linearity operation. The Rectified Linear Unit (ReLU) performs an element-wise operation which replaces all negative values in the feature map with zero. The purpose of ReLU is to introduce non-linearity into the CNN, since real-world data are mostly non-linear while convolution is a linear operation. The output feature map of this operation is called the ‘Rectified Feature Map’. Other non-linearity functions exist and can be used; however, ReLU has generally been noted to perform better than the alternatives.
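As a minimal sketch, ReLU on a feature map stored as a list of lists is a one-liner. The function name `relu` and the sample values are illustrative.

```python
def relu(feature_map):
    """Replace every negative value in a 2-D feature map with zero."""
    return [[max(0, value) for value in row] for row in feature_map]


feature_map = [
    [-3, 2, 0],
    [5, -1, 4],
]
print(relu(feature_map))  # [[0, 2, 0], [5, 0, 4]]
```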

## Pooling Layer Operation

This operation helps to reduce the dimensionality of each feature map while still retaining its key features. Pooling can be achieved in different ways: Max Pooling, Average Pooling or Sum Pooling. In each case, a spatial neighbourhood and a stride length are defined (e.g. a 3 by 3 window, stride 1) and the window is then moved across the rectified feature map using that stride. At every position, one value is derived from the elements of the rectified feature map inside the window. With Max Pooling, the maximum element in the window is selected. With Average Pooling, the average of the elements in the window is taken, while with Sum Pooling, the sum of the elements is taken.

In practice, max pooling is known to work better than the rest. The Pooling operation outputs the same number of maps as were fed into it, but reduced in dimension. Its function is to progressively reduce the spatial size of the input representation, thereby reducing the number of parameters and computations in the network. It also helps the network retain its performance in the event of minor changes in the input image, by providing a representation that is approximately invariant to small translations of the input.

Diagram 3: Pooling Operation
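The three pooling variants differ only in how each window is reduced to a single number, which the sketch below makes explicit. The function name `pool2d`, the 2 by 2 window with stride 2 and the sample values are assumptions for illustration.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Reduce each `size` x `size` window of a 2-D list to a single value."""
    reducers = {
        "max": max,
        "sum": sum,
        "average": lambda window: sum(window) / len(window),
    }
    reduce_window = reducers[mode]
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            window = [
                feature_map[i * stride + a][j * stride + b]
                for a in range(size)
                for b in range(size)
            ]
            row.append(reduce_window(window))
        pooled.append(row)
    return pooled


rectified = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]
print(pool2d(rectified, mode="max"))      # [[6, 8], [3, 4]]
print(pool2d(rectified, mode="sum"))      # [[13, 21], [8, 8]]
print(pool2d(rectified, mode="average"))  # [[3.25, 5.25], [2.0, 2.0]]
```

In every mode the 4 by 4 map shrinks to 2 by 2, while the number of maps passed through the layer stays the same.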

## Fully Connected Layer (FCL)

The Fully Connected layer is a traditional Multi-Layer Perceptron that uses a SoftMax activation function in its output layer. The term “Fully Connected” implies that every neuron in the previous layer is connected to every neuron in the next layer. The outputs of the convolutional and pooling layers represent high-level features of the input image, and the purpose of the Fully Connected layer is to use these features to classify the input image into the various classes defined by the training dataset. Most of the features from the convolutional and pooling layers may be good for the classification task on their own, but combinations of those features might be even better, which explains the benefit of adding the fully connected layer. The output probabilities from the Fully Connected Layer sum to 1. This is ensured by using SoftMax as the activation function in its output layer. The SoftMax function takes a vector of arbitrary real-valued scores and squashes it to a vector of values between zero and one that sum to one, giving a probability distribution over the outputs. A SoftMax classifier is therefore trained by minimizing the cross-entropy between the estimated class probabilities and the true class labels.

The SoftMax function is a generalization of the logistic function that "squashes" a K-dimensional vector z of arbitrary real values to a K-dimensional vector σ(z) of real values in the range (0, 1) that add up to 1. The function is given by:

$$\sigma(z)_j = {e^{z_j} \over \sum_{k=1}^{K} e^{z_k}} \quad \text{for } j = 1, \dots, K$$
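The formula translates directly into code. The sketch below is one possible implementation, not the only one; subtracting the largest score before exponentiating is a common numerical-stability step that leaves the result unchanged, and the example scores are made up.

```python
import math


def softmax(scores):
    """Map a list of real-valued scores to probabilities that sum to 1."""
    shift = max(scores)  # guards against overflow in exp(); result is unchanged
    exps = [math.exp(score - shift) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


probabilities = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probabilities])  # [0.659, 0.242, 0.099]
print(sum(probabilities))                    # 1.0, up to floating-point rounding
```

The largest score receives the largest probability, and the ordering of the scores is preserved.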

Variations of the CNN are achieved today by reconfiguring the arrangement of the layers it is composed of. A lot of intuition goes into designing the CNN architecture for a given task. However, some guidelines exist to help with choices such as the number of filters to use.

## Training ConvNets (CNNs)

The CNN employs a training technique similar to that used by the traditional MLP (ANN). In summary, this technique relies on a Backpropagation mechanism, which works out how much each parameter/weight in the network contributes to the error in the final output so that this contribution can be reduced. Repeating this process iteratively fine-tunes the network weights so that the network generates the right classification probabilities for the inputs passed through it. At the start of training, random weight values are assigned to the kernels and to the neurons of the Fully Connected Layers.

## Training Flow

Steps 1 to 7 below explain the iterative process undergone by a typical CNN before it converges into a fully learned system capable of making near accurate predictions.

• Step 1: Initialize all filters and Fully Connected Layer (FCL) neuron weights with random values.
• Step 2: Input a training image. The image goes through the forward propagation steps (convolution, ReLU, pooling etc., depending on the specific CNN architecture).
• Step 3: The output probabilities of the final Fully Connected Layer are computed.
• Step 4: The output probabilities are compared with the target probabilities.
• Step 5: The total error of the system is calculated.
• Step 6: Backpropagation is used to calculate the gradients of the error with respect to all weights in the network, and gradient descent is used to update all filter and neuron values so as to minimize the output error.
• Step 7: Steps 2 to 6 are repeated until there is very little error between the output probabilities of the network and the target output probabilities.
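The loop structure of Steps 1 to 7 can be sketched with a deliberately tiny stand-in. To keep the example short and dependency-free, a single SoftMax layer takes the place of the whole network, so there are no convolution or pooling layers here; in a real CNN the filters would be updated by the same gradient descent rule. The function names, the toy dataset, the learning rate and the epoch count are all assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train(samples, n_classes, epochs=200, learning_rate=0.5, seed=0):
    rng = random.Random(seed)
    n_features = len(samples[0][0])
    # Step 1: small random initial weights (one row per class); biases start at zero.
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_features)]
               for _ in range(n_classes)]
    biases = [0.0] * n_classes
    history = []
    for _ in range(epochs):  # Step 7: repeat Steps 2 to 6
        total_error = 0.0
        for features, label in samples:
            # Steps 2 and 3: forward pass, then output probabilities.
            scores = [sum(w * x for w, x in zip(weights[c], features)) + biases[c]
                      for c in range(n_classes)]
            probs = softmax(scores)
            # Steps 4 and 5: compare with the target via the cross-entropy error.
            total_error += -math.log(probs[label])
            # Step 6: the gradient of the error w.r.t. each score is
            # (probability - target); update weights by gradient descent.
            for c in range(n_classes):
                grad = probs[c] - (1.0 if c == label else 0.0)
                for k in range(n_features):
                    weights[c][k] -= learning_rate * grad * features[k]
                biases[c] -= learning_rate * grad
        history.append(total_error / len(samples))
    return weights, biases, history


# Toy data: two features per sample, two classes.
samples = [([1.0, 0.0], 0), ([0.9, 0.1], 0), ([0.0, 1.0], 1), ([0.1, 0.9], 1)]
weights, biases, history = train(samples, n_classes=2)
print(history[0], history[-1])  # the average error falls as training proceeds
```

Here training simply stops after a fixed number of epochs; a practical setup would also monitor the error on held-out data to decide when to stop.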

Training the ANN requires careful analysis to ensure that the network does not fall victim to over-fitting (over-training). Once the system has been correctly trained, the weights can be "frozen". At times, the system is designed not to lock itself in but rather to continue learning while in production use. This depends entirely on the decision of the system architect. In some industry applications, the finalized network is not only frozen but also turned into hardware so that it can operate fast.

## Delicate Area 1: Maintaining Spatial Accuracy

To make sure the strides and dimension of the filters are set appropriately, the equation below is used, as it helps calculate the kernel size and stride which will fit into an input volume.

$$\text{Spatial Arrangement Validator} = \frac{(W-K)+2P}{S}+1$$

$$W = \text{Input Size}$$

$$K = \text{Size of Convolutional Layer Kernels}$$

$$P = \text{Amount of Zero Padding Applied}$$

$$S = \text{Stride Applied}$$

If the result of the above is not an integer, the stride is set incorrectly and the kernels cannot be tiled across the input volume in a symmetric way. Zero-padding the input matrix can often rectify this stride selection error. With a stride of 1 and a suitable amount of zero padding (P = (K-1)/2 for an odd kernel size K), the input and output volumes of the convolutional layer have the same spatial dimension; however, a small stride increases the number of computations the network needs to carry out.

For example, if the result of the above equation is 55 in a layer with a depth of 96 and a kernel size of 11x11x3, the layer’s output volume is 55x55x96. Also, due to weight sharing, i.e. all the neurons in a depth slice using the same set of weights, the total number of unique weights for that layer is 96x11x11x3 (+96 biases if each kernel has a bias).
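The check described here is easy to automate. The sketch below is one way to write it; the function name `conv_output_size` and the sample sizes are assumptions for illustration. It returns `None` when the arrangement does not tile.

```python
def conv_output_size(input_size, kernel_size, stride, padding=0):
    """Evaluate ((W - K) + 2P) / S + 1 along one spatial axis.

    Returns the output size, or None when the result is not a whole
    number, i.e. the kernel cannot be tiled evenly across the input.
    """
    span = input_size - kernel_size + 2 * padding
    if span < 0 or span % stride != 0:
        return None
    return span // stride + 1


print(conv_output_size(10, 3, 3))             # None: (10 - 3) / 3 is not whole
print(conv_output_size(10, 3, 3, padding=1))  # 4: padding fixes the arrangement
print(conv_output_size(7, 3, 1, padding=1))   # 7: stride 1 with P = 1 keeps the size
```

Note that symmetric padding always changes the numerator by an even amount, so it cannot repair every combination; in such cases the stride or kernel size has to change instead.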
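The arithmetic behind this example can be written out as follows, reading 11x11x3 as the kernel dimensions (which is what the weight count 96x11x11x3 implies). The values W = 227, P = 0 and S = 4 are not given in the text; they are assumed here simply because they make the formula yield 55.

$$\frac{(227-11)+2 \times 0}{4}+1 = \frac{216}{4}+1 = 55$$

$$96 \times 11 \times 11 \times 3 = 34{,}848 \text{ weights}, \qquad 34{,}848 + 96 = 34{,}944 \text{ parameters including biases}$$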

## Delicate Area 2: Channel and Depth Translations

For the convolution of a multiple-channel input, the filter glides over the same receptive field across all the channels at once, computes the dot product in each channel, and then sums the per-channel results into one, producing just one channel of convolved output. For example, consider a convolutional layer of depth 256 (i.e. 256 filters), each filter of size 2x2x3, applied with a stride of 1 to an input of size 11x11x3. Each filter is placed over the first 2 by 2 position of input channels 1, 2 and 3 (each 11x11), the convolution is computed per channel with the filter’s corresponding 2x2 slices 1, 2 and 3, and the results from the three channels are summed to give the convolution result for the first 2 by 2 location of the input data. The outcome of this convolution is therefore just 256 feature maps and not 256 by 3 feature maps. In other words, a three-channel input convolved with a single three-channel filter produces a one-channel output, because the convolution is performed across the 3 channels simultaneously and summed to produce just one output for that activation position.
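A small sketch makes the channel bookkeeping concrete. It uses a 4x4x3 input and only two filters rather than 256 so that the output is easy to check by hand; the function name `conv_multichannel`, the channel-first layout and the constant values are assumptions for illustration.

```python
def conv_multichannel(image, filters, stride=1):
    """Convolve a multi-channel image with a bank of filters, without padding.

    `image` is indexed as image[channel][row][col] and `filters` as
    filters[filter][channel][row][col]. Every filter yields ONE feature
    map, because the per-channel products are summed together.
    """
    n_channels = len(image)
    k_h, k_w = len(filters[0][0]), len(filters[0][0][0])
    out_h = (len(image[0]) - k_h) // stride + 1
    out_w = (len(image[0][0]) - k_w) // stride + 1
    feature_maps = []
    for kernel in filters:
        feature_map = []
        for i in range(out_h):
            row = []
            for j in range(out_w):
                total = 0
                for c in range(n_channels):  # sum across all channels
                    for a in range(k_h):
                        for b in range(k_w):
                            total += (image[c][i * stride + a][j * stride + b]
                                      * kernel[c][a][b])
                row.append(total)
            feature_map.append(row)
        feature_maps.append(feature_map)
    return feature_maps


# 3-channel 4x4 input in which every value in channel c equals c + 1.
image = [[[c + 1] * 4 for _ in range(4)] for c in range(3)]
# Two filters of size 2x2x3: one of all ones, one of all twos.
filters = [[[[w] * 2 for _ in range(2)] for _ in range(3)] for w in (1, 2)]

maps = conv_multichannel(image, filters)
print(len(maps))      # 2 feature maps, one per filter (not 2 x 3)
print(maps[0][0][0])  # 24 = 4 * (1 + 2 + 3), summed over the three channels
print(maps[1][0][0])  # 48
```

The depth of the output equals the number of filters, regardless of how many channels the input has.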