My Deep Learning Journey: From Onlooker to MSc Thesis
Author: Oluwole Oyetoke (4th December, 2017)
How The Journey Started
On the afternoon of Wednesday the 16th of November, 2016, I sat for a couple of hours in the renowned Edward Boyle Library of the University of Leeds, trying to craft a supervisory request email: a mail in which I had to convince my prospective supervisor to oversee my MSc project on the application of Deep Learning for Computer Vision in driverless cars. I did send the mail to him (see excerpt below) after spending about 8 hours putting together facts as to why I wanted to undertake the project and why I felt he would be the best person to supervise it.
Dear Dr. ---,
I believe you are doing very well. My name is Oluwole Oyetoke, currently on the Embedded Systems programme. I am writing this email to make inquiries about your availability to supervise my MSc project.
Reading through your technical interest profile on the Main Project's page, I believe you will be a suitable supervisor for me……………..
A few days later, I had the opportunity to meet with him, and after some deliberation he asked me to narrow down my proposed project scope and present how I wanted to go about it. After a bit of research, I told him of my decision to apply Convolutional Neural Networks to the classification and detection of road traffic signs. My resolve was to see how much we could accelerate the neural operations and the computer vision system through various means of optimization, including legacy coding, parallel function dispatch and heterogeneous computing.
Prior to this decision, I had never done anything with Neural Networks, but I was fascinated by everything I had read about their performance. Being very enthusiastic about this, I was able to put together a structure and approach for my mission (see project structure below).
Overall, I structured my project to explore the science behind Neural Networks (NN), their various flavours and application areas, and then finally to narrow down by applying them in the design and development of a computer vision system which can be used for traffic sign recognition and detection in autonomous vehicles. The project started off by designing, developing, implementing and testing a model of the proposed vision system on a CPU using MATLAB. Afterwards, the performance of the implemented vision system was further optimized through vectorization, parallelism, legacy coding and heterogeneous computing. The project then concluded with a detailed analysis and evaluation of the various optimization schemes used, as well as an evaluation of the Neural Network's classification accuracy.
My Initial Findings on NN
With all eagerness to fully explore this field of interest, I went on a comprehensive theoretical study of Neural Nets, researching as deep as their historical trends, evolution, operation and training mechanisms, different flavours, mathematical basis and application areas. All of these helped me gain a very broad knowledge of the Neural Nets world. However, trust me, it is way broader than you think; it is one area so broad that you can hardly boast of knowing it all. Even with years of research and hands-on experience, the field evolves so fast that new findings become obsolete very quickly.
As I went on with my studies, it became evident to me that computers, as we have them today, are mere GIGO (Garbage-In-Garbage-Out) devices which are only capable of producing results based on what is fed into them and how they have been programmed to respond to such inputs. As such, challenges exist with problem categories that cannot be formulated as algorithms, especially problems which depend on many subtle factors such as knowledge and understanding of previous scenes and the corresponding reactions to them. As an example, when asked to recognize the Queen of England's image among a cluster of 100 other images, the human brain may be able to provide an informed guess, probably based on past knowledge and various other experiences combined. A computer, however, cannot do this accurately without a pre-written algorithm. In light of this, there has been growing interest in research geared toward developing Artificial Intelligence (AI) models which are capable of learning and carrying out classification tasks without reference to any pre-written algorithm. One such research area is the field of Neural Networks (NN), which are a biologically inspired family of computation architectures built as extremely simplified models of the human brain.
A Bit More On My Findings
Despite the ever-increasing popularity of transferring (human) tasks to computers for simplification purposes, there are still a lot of human tasks that are poorly done by computers, such as those in the areas of visual perception and intelligence. This is because the largest part of the human brain works continuously on data analysis and interpretation, while the largest part of the computer is only available for passive data storage. The brain therefore performs much closer to its theoretical maximum. Although the computer is fast, reliable, unbiased, never tired, consistent and can sometimes even carry out much more complex computational combinations than the human brain can muster, it is still unable to synthesize new rules and, it is safe to say, has no common sense. Computers instead have a group of arithmetic processing units and storage carefully interconnected to perform complex numerical calculations in a very short time, but they are not adaptive.
On the other hand, the human brain possesses what we know as common sense, a bigger knowledge base, and the ability to synthesize new rules and spontaneously detect trends in data without being pre-taught. This is so even though, based on raw capacity, the computer should be more powerful than the human brain: it comprises on average over 10^9 transistors with a switching time of 10^-9 seconds, while the brain consists of over 10^11 neurons but with a switching time of only about 10^-3 seconds. On closer analysis, we note that although the human brain is easily tired, bored, biased, inconsistent and cannot be fully trusted, it still outperforms the computer in some application areas due to its perceptive nature of operation (interpretation of sensory information in order to understand the environment). This explains why there is still major reliance on the human brain for classification tasks.
Juxtaposing the computer's strengths and weaknesses against the human brain's makes us realize that, as much as the human brain is better at perceptive tasks, it has endurance, bias and inconsistency issues. Researchers are therefore making an effort to develop systems which fuse the advantages of both the brain and the computer into one near-perfect outfit: a system which can take on the perceptive learning, out-of-the-box synthesis, self-organizing and self-learning characteristics of the human brain, while maintaining the massive computational capability, speed and endurance of the computer. This motive has led to increased research on neural networks.
Neural networks, both in humans and in their artificial replicas, are made up of interconnected neurons which pass data between each other and act on that data. The Artificial Neural Network (ANN) is itself a reasonably efficient, though extremely simplified, abstraction of the neural network of the human body.
Training The Vision System’s ConvNet Model
At this point it was time to delve in hands-on. It was time for me to make use of a tool for the first time: time to put my training dataset together into a well-labelled image database, create my ConvNet, train it using supervised learning and evaluate its performance. At every Google search I carried out, there were multiple Neural Network libraries and tools flashing at me on my computer screen: the TensorFlows, Torches, Theanos, Caffes and MatConvNets of this world. Making a final decision as to which tool I was going to use took me quite a while and involved a lot of considerations. I finally decided to use MatConvNet at first, and then proceed afterwards with TensorFlow for other extensions of the project.
MatConvNet is a MATLAB toolbox used for implementing Convolutional Neural Networks (CNN) for computer vision applications. It is open source, released under a Berkeley Software Distribution (BSD)-like license (a family of permissive free software licenses which impose minimal restrictions on redistribution). MatConvNet can be used to replicate the architecture of many of the well-known CNNs, and its development team has also made pre-trained models of these architectures available. Nevertheless, this project develops its own AlexNet model from scratch, then trains it and tests its performance using a brand new IMDB built from the 39,209 traffic sign images made available on the German Traffic Sign Recognition Benchmark (GTSRB) website.
MatConvNet includes a variety of layer models in its MATLAB library directory, such as convolution, deconvolution, max and average pooling, ReLU activation, sigmoid activation and many other pre-written functions. There are enough elements available to help implement many interesting state-of-the-art networks out of the box, or even import them from other toolboxes such as Caffe.
In a nutshell, at this stage of the project, I was able to:
- Create an image database (IMDB) which was used to train and test the designed CNN architecture.
- Design the selected CNN model (AlexNet) for the traffic sign classification in the proposed vision system using MATLAB.
- Train, test and evaluate the performance of the implemented CNN.
Creating My Image Database
A dataset of traffic sign images from the German Traffic Sign Recognition Benchmark (GTSRB) website was compiled and used to create the image database (IMDB) used to train and test the deep neural network. The GTSRB is a multi-class, single-image classification challenge that was held at the International Joint Conference on Neural Networks (IJCNN) 2011. The dataset consists of over 39,000 images in total, grouped into 43 different road traffic sign classes.
Through a purpose-written IMDB creation script, the dataset is split into 70% training, 20% validation and 10% test images, which are used in the training, validation and testing of the created AlexNet model. Validation images are used to check the performance of the network during the training process, while the test images are reserved in this project for my own testing of the network after training. The validation set can be regarded as a part of the training set, but it is usually used for parameter selection and to avoid overfitting. If a model is tuned against its training set only, it is very likely to get close to 100% accuracy there and overfit, and thus perform very poorly on test images which the network has never seen before. The test set is only used to measure the performance of a trained model and is the best means of detecting overfitting in the network.
The diagrams below show the results of the performance analysis and testing carried out: the descent in the error rate of the classifier/CNN over the course of the 58 epochs (rounds) of training, a classification example, and a bar graph showing the performance improvement as training progressed. Training took about 47 hours using over 39,000 training images on a quad-core machine with 16 GB of RAM.
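To make the 70/20/10 split concrete, here is a minimal Python sketch. It is not the MATLAB IMDB creation script used in the project; the file names, the `split_dataset` helper and the fixed random seed are all assumptions made purely for illustration, and the sketch shuffles the whole list rather than splitting class by class.

```python
import random

def split_dataset(samples, train_pct=70, val_pct=20, seed=0):
    """Shuffle samples and split them into train/validation/test lists.

    Whatever is left after the train and validation shares becomes the
    test set (10% with the default percentages).
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # seeded so the split is repeatable
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * val_pct // 100
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# Hypothetical stand-in for (image_path, class_id) pairs across 43 classes.
samples = [(f"img_{i:05d}.ppm", i % 43) for i in range(1000)]
train_set, val_set, test_set = split_dataset(samples)
print(len(train_set), len(val_set), len(test_set))  # 700 200 100
```

The important property is that the three sets never share an image, so the test accuracy reflects images the network has genuinely never seen.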
Crafting My Sign Detection Mechanisms
Here, I needed to figure out a separate way of detecting my region of interest before zooming in to perform the actual image classification using my already trained Neural Network. At this point again, I had the option of either using OpenCV or writing my own object detection functions. And again, considering the fact that I really wanted to know the nitty-gritty of everything that happened within my computer vision system, I opted to write my functions from scratch by myself. This meant I had to read up on publications about Canny and Sobel edge detection, Harris corner detection, the Hough transform, the circular Hough transform and many other detection fundamentals. I loved it, and was very excited to take this on.
Detecting and classifying objects of various sizes in a scene is an important sub-task in Computer Vision, as most objects of interest occupy only a small fraction of the entire image under analysis. Quite a number of previous CNN image processing solutions target objects that occupy a large proportion of the image, and such solutions do not work well for target objects occupying only a small fraction of the image area. A typical example is found in traffic scenes, where the traffic light of interest resides in only a small fraction of the actual image. It is therefore important to devise solid scene classification and object detection methodologies which perform well even when the Region of Interest (ROI) is only a small but significant part of the entire image.
Traffic signs are categorized according to function and mostly come in the common shapes of circles, triangles, squares and octagons, using a combination of white, yellow, red and blue colours. This makes it intuitively deducible that traffic scene classification can be carried out well by a combination of colour and shape analysis. The detection system combines the information gleaned from these separate analyses to predict a Region of Interest (ROI). The sign classification operation follows this step to determine which specific kind of sign is present (if any). It is important to keep in mind that while detection means finding the Region of Interest, classification is the step which analyses the detected region so as to assign a label to it. The images below show the integration plan for the system (detection/classification) as well as the different schemes used for the traffic sign detection (Harris corner detection, Sobel edge filtering, Hough line/circular transforms, connected component analysis etc.).
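As a rough illustration of the colour half of that idea, the Python sketch below thresholds a toy RGB image for strongly red pixels and returns the bounding box of the hits as a candidate ROI. It is only a sketch under stated assumptions: the `red_mask` and `bounding_box` helpers and the threshold values are hypothetical, and the project's actual MATLAB detection functions combine colour with shape analysis rather than relying on colour alone.

```python
def red_mask(image, min_red=150, margin=60):
    """Mark pixels whose red channel clearly dominates green and blue.

    The thresholds are illustrative guesses, not tuned values.
    """
    return [[r >= min_red and r - max(g, b) >= margin for (r, g, b) in row]
            for row in image]

def bounding_box(mask):
    """Return (top, left, bottom, right) of the True pixels, or None."""
    coords = [(y, x) for y, row in enumerate(mask)
              for x, hit in enumerate(row) if hit]
    if not coords:
        return None
    ys = [y for y, _ in coords]
    xs = [x for _, x in coords]
    return (min(ys), min(xs), max(ys), max(xs))

# Toy 6x8 grey scene with a 3x3 red patch standing in for a sign.
GREY, RED = (120, 120, 120), (200, 30, 30)
scene = [[GREY] * 8 for _ in range(6)]
for y in range(1, 4):
    for x in range(4, 7):
        scene[y][x] = RED

roi = bounding_box(red_mask(scene))
print(roi)  # (1, 4, 3, 6)
```

In a full pipeline, the cropped ROI would then be handed to the shape checks and finally to the classifier.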
Evaluating and Optimizing My Vision System’s Performance
At this point in the project, I had both classification and detection working. But there still existed a bottleneck: speed! My detection and classification were happening at a very, very slow number of frames per second. I figured it was because I had written my own detection functions and had not initially paid attention to optimization methodologies. As a consequence, I had to lay a foundation for improving performance through various methods such as code optimization as well as parallel and heterogeneous computing. With heterogeneous computing, the vision system simultaneously makes use of different kinds of compute devices (CPU, GPU and FPGA) for the different operations that need to be carried out.
The key optimization schemes used in the improvement of the performance of this system include:
- MATLAB Vectorization
- Use of C/C++/FORTRAN for some subroutines
- MATLAB Parallel Computing
- Heterogeneous Computing
Optimization through vectorization (MATLAB)
Most modern CPUs are designed to carry out single instruction, multiple data (SIMD) operations, which lets them perform the same operation on multiple data points simultaneously. This form of operation is very important when dealing with digital media content on a CPU. With it, the same instruction that needs to act on different pieces of data can be computed for all of them at once, thereby leading to an increase in speed.
Rather than using two nested for loops to scan through and select the pixels of a 2D image (img) in MATLAB, a vectorized approach (e.g. img(:)) can be used to do this. Going through the entire project code and making an effort to vectorize as many of the image interaction instructions as possible led to a great deal of improvement in the speed of the detection system.
Optimization through the use of C/C++ for some subroutines
Although MATLAB is known to be very efficient for tasks which can be vectorized, it also has its own deficiencies, especially with for loops, as MATLAB was originally built to operate on matrices. Considering that some parts of the detection operation used in this project are of complexity order O(n^3), some of the detection subroutines were written in C/C++ and then called from MATLAB using the MEX gateway function (see example in Appendix B.5). Several important functions were highly optimized using this method: the Connected Component Analysis (CCA) function became 508 times faster, the 'Shape Analyser' function over 25 times faster and the edge detection function over 17 times faster. For the other functions, the vectorized MATLAB versions had better performance (see figures below). The C++ code for the CCA function was written using Visual Studio and linked using the MATLAB MEX gateway. For the other functions, the MATLAB code generator was used to generate their C/C++/FORTRAN equivalents.
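For readers unfamiliar with Connected Component Analysis, the Python sketch below shows what such a function computes: it labels each group of touching foreground pixels in a binary mask. This is an illustrative breadth-first version using 4-connectivity, written for this post; it is not the project's C++ MEX routine, and the `label_components` name and the choice of connectivity are assumptions.

```python
from collections import deque

def label_components(mask):
    """Label 4-connected regions of truthy cells; 0 means background."""
    height, width = len(mask), len(mask[0])
    labels = [[0] * width for _ in range(height)]
    count = 0
    for sy in range(height):
        for sx in range(width):
            if not mask[sy][sx] or labels[sy][sx]:
                continue  # background, or already part of a labelled region
            count += 1
            labels[sy][sx] = count
            queue = deque([(sy, sx)])
            while queue:  # flood-fill outwards from the seed pixel
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not labels[ny][nx]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count

mask = [
    [1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
]
labels, count = label_components(mask)
print(count)  # 3
```

Loop-heavy, pixel-by-pixel work like this is exactly the kind of code that tends to run slowly in an interpreted language and benefits from being moved to compiled C/C++.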
Optimization through parallel execution
Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. This is most feasible when different threads/tasks within a program can be executed independently, without any dependencies between each other. In cases like this, computers with many cores can pass each of these tasks to a different core for simultaneous execution. In this project's computer vision system, there exist some independent tasks, such as circle detection and shape analysis, which have no shared dependencies. MATLAB's Parallel Computing Toolbox is used to fire off the execution of these two functions in parallel while the other, serially interdependent parts of the programme keep running. In other words, the group of serially executed functions runs concurrently with the independent functions. By the time the group of serially dependent functions has finished executing, the independent functions being run in parallel will also have completed, thereby reducing the overall detection time for a single image frame.
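The dispatch pattern described above can be sketched in Python with the standard library's `concurrent.futures`. This is only an analogy to the MATLAB Parallel Computing Toolbox code used in the project: the three stage functions are hypothetical placeholders operating on a list of numbers instead of an image, and a thread pool is used for simplicity (in CPython, truly CPU-bound work would usually need a process pool to run on separate cores).

```python
from concurrent.futures import ThreadPoolExecutor

def detect_circles(frame):
    """Placeholder for an independent circle detection task."""
    return [p for p in frame if p % 2 == 0]

def analyse_shapes(frame):
    """Placeholder for an independent shape analysis task."""
    return [p for p in frame if p % 3 == 0]

def serial_pipeline(frame):
    """Placeholder for the stages that depend on one another."""
    shifted = [p + 1 for p in frame]
    return sum(shifted)

def process_frame(frame):
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Fire off the two independent tasks first...
        circles_future = pool.submit(detect_circles, frame)
        shapes_future = pool.submit(analyse_shapes, frame)
        # ...then carry on with the serial work while they run.
        serial_result = serial_pipeline(frame)
        # Collect the parallel results once the serial part is done.
        return serial_result, circles_future.result(), shapes_future.result()

print(process_frame(list(range(10))))  # (55, [0, 2, 4, 6, 8], [0, 3, 6, 9])
```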
Optimization through heterogeneous computing (High Level Synthesis)
Heterogeneous computing has become an established way of executing computationally intensive Computer Vision tasks. CPUs are best suited to general-purpose arithmetic and control tasks, while GPUs do best with graphics-related and other highly parallel computations. Field Programmable Gate Arrays (FPGAs) are even more unique, as their reconfigurable circuitry can be configured to model software operations in hardware. With heterogeneous computing, the best characteristics of these kinds of compute devices are leveraged together in one system to provide very fast and effective performance. In essence, to optimize the performance (speed) of the computer vision system in this project, the host PC's GPU (if present) is assigned the task of performing the computationally intensive circle identification operation of the traffic sign detection.
To test the efficacy of this method, a custom-built circle detection function was written in MATLAB and also replicated as an Open Computing Language (OpenCL) based application, so that its performance could be tested on an FPGA or GPU. OpenCL allows software and hardware programmers to write code targeted at specific hardware devices such as GPUs and FPGAs. Typically, OpenCL applications run on a host CPU but are able to assign tasks to the various other computing devices attached to it. The performance of the OpenCL based application was measured on an FPGA as well as a GPU and benchmarked against normal CPU performance with MATLAB.
A custom-written circle detection function is used because of the complexity involved in trying to rewrite MATLAB's 'imfindcircles.m' function, which is known to call several other sub-functions. Therefore, for an accurate comparison, a new 'myDetectCircles.m' function was written in MATLAB based on Circular Hough Transform (CHT) theory, and also in C/C++, before being turned into an OpenCL application. It is this customized circle detection function whose performance is evaluated (MATLAB vs C++ vs OpenCL). Overall, the performance improvement observed here can be projected to be similar to what would be obtained if MATLAB's native circle detection function were also rewritten as an OpenCL application. Being able to easily dedicate this task to the connected GPU/FPGA will greatly improve the performance (speed) of this project's traffic sign detection and recognition system.
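To give a feel for why circle detection is so computationally heavy, here is a minimal Python sketch of the Circular Hough Transform voting idea. It is not 'myDetectCircles.m' or its C++/OpenCL counterparts: the `hough_circle_centre` function is hypothetical, it assumes the radius is already known, and it works on a hand-made set of edge pixels rather than a real edge map. A fuller implementation would also search over a range of radii, which multiplies the work further.

```python
import math

def hough_circle_centre(edge_points, radius, width, height, steps=360):
    """Vote for circle centres of a known radius; return the best (x, y)."""
    accumulator = [[0] * width for _ in range(height)]
    for x, y in edge_points:
        voted = set()  # each edge pixel votes at most once per cell
        for k in range(steps):
            theta = 2 * math.pi * k / steps
            cx = round(x - radius * math.cos(theta))
            cy = round(y - radius * math.sin(theta))
            if (cx, cy) in voted or not (0 <= cx < width and 0 <= cy < height):
                continue
            voted.add((cx, cy))
            accumulator[cy][cx] += 1
    # The cell with the most votes is the most likely circle centre.
    votes, cx, cy = max((accumulator[y][x], x, y)
                        for y in range(height) for x in range(width))
    return cx, cy

# Synthetic edge pixels on a circle of radius 8 centred at (20, 15).
edges = {(round(20 + 8 * math.cos(math.radians(a))),
          round(15 + 8 * math.sin(math.radians(a))))
         for a in range(0, 360, 5)}
centre = hough_circle_centre(edges, radius=8, width=40, height=30)
print(centre)
```

Every edge pixel casts votes independently of every other, which is why this kind of workload maps so naturally onto the many parallel units of a GPU or FPGA.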
Where I Finally Landed
After playing around for months with the dynamics of applying Deep Learning to Computer Vision, I could clearly see that Neural Networks have become successful architectures for modelling artificially intelligent systems in multiple scientific and engineering areas, including vision, speech recognition, natural language processing and the like. However, ConvNets, a variant of deep learning architecture which uses cascaded layers of nonlinear processing units for feature extraction, were the ones mostly used for image classification tasks.
In designing and implementing my computer vision system for traffic sign recognition and detection, I used a particular model of the ConvNet family called AlexNet, a model known to record very low error rates in image classification tasks involving millions of images to be classified into thousands of categories. This project did not just stop at image/scene classification and detection on a normal CPU, but moved ahead to optimize the developed system so as to achieve higher frame processing rates. I explored optimization mechanisms such as vectorization, parallel thread dispatch, a legacy programming language (C/C++) and even heterogeneous computing to ensure optimal performance.
I fed real-life video feeds to my vision system for testing and evaluation purposes and could clearly see that by engaging different methods of optimization (software/hardware) we can improve ConvNet operation speed. In this project, I was able to push my computer vision system's performance from about 1.5 seconds per frame (i.e. roughly half a frame per second) to between 25 and 30 frames per second using my C/C++ solution running across multiple cores. The pure MATLAB solution, optimized through vectorization and multi-core dispatch, gives a maximum of just about 8 frames per second, which is also way ahead of the un-optimized MATLAB version of the vision system. In a nutshell, by the end of the project I had:
- Implemented a deep learning model (AlexNet) for traffic sign classification which attained a classification accuracy of over 98%.
- Designed and implemented multiple layers of traffic sign detection mechanisms which attained a detection accuracy of over 99%.
- Optimized the detection algorithm's performance from about half a frame per second to around 25 to 30 frames per second (classification operation included). This sums up to an over 50 times improvement in speed (C/C++ version).
- Practically proved that further improvement in speed can be achieved through heterogeneous computing, i.e. by dedicating some parts of the computationally expensive detection functions to suitable devices such as GPUs and FPGAs.
Don't forget that you can also contribute to this project on GitHub.