Heterogeneous Computing: The Helpful Framework Behind Many Deep Learning Applications
Author: Oluwole Oyetoke (20th November, 2017)
Many years ago, computers relied purely on CPUs to process the calculations they needed at speed. Along the line, new types of hardware cores emerged: Graphics Processing Units (GPUs), which are optimized to carry out graphics-related tasks such as tiling, rasterization and clipping quickly and efficiently; DSPs; and FPGAs, which are even more unique, as an FPGA's reconfigurable circuitry can be configured to model software operations in hardware.
Since each of these devices is known to be best suited to certain kinds of application over others, it became imperative to come up with a methodology through which one can interface with all of them on a single platform using a unified language. The consequence of this development has been very exciting, as fields relating to the application of Deep Learning have benefited greatly from it. As we know, Deep Learning application areas such as Computer Vision can sometimes require extremely computationally intensive operations for which the architecture of the CPU is not well optimized; some are best done on a GPU or even an FPGA. This post focuses mainly on the beneficial application of heterogeneous computing to Computer Vision (an application area of Deep Learning). It starts off with a brief discussion of OpenCL (a heterogeneous computing framework) and then examines the possible benefits of heterogeneous computing for vision systems (especially object detection algorithms) and ConvNets in general.
Open Computing Language (The Unified Framework)
Interestingly, engineers have now developed a framework (Open Computing Language - OpenCL) for writing programs that execute across heterogeneous platforms such as FPGAs, CPUs, GPUs and DSP chips simultaneously. OpenCL provides a standard interface for parallel computing using task- and data-based parallelism. Its kernel language is based on C99 (with later versions adding a C++-based kernel language) and, most importantly, it specifies Application Programming Interfaces (APIs) to control the hardware platforms and execute instructions on them. OpenCL functions are normally referred to as 'kernels', as the framework is designed to see a computing system as one consisting of a number of compute devices, e.g. CPUs, or GPUs attached to a CPU. Each of these compute devices is further broken down into multiple Processing Elements (PEs). OpenCL therefore uses a language closely related to C to write kernels which can be run on all, or as many as desired, of the PEs in parallel. The subdivision of individual compute devices into compute units and PEs is determined by the compute device designer. The OpenCL API allows programs running on the host to launch kernels on the compute devices and manage device memory.
OpenCL Operation Architecture
In the OpenCL architecture, there exists one CPU-based 'Host' which controls multiple 'Compute Devices' such as CPUs, DSP chips, GPUs etc. Each of these compute devices consists of multiple 'Compute Units', such as Arithmetic and Logic Units (ALUs) or processing-unit groups on multicore CPUs. Finally, each of these Compute Units contains multiple 'Processing Elements' upon which the OpenCL kernels are executed.
An OpenCL application is split into two parts, the host program and the kernel(s). The host program is executed on the CPU of the system, and can perform any functions or computations as if it were a regular C program. In addition, an OpenCL host program is able to launch one or more kernels in order to speed up computation. A kernel is a special function written in OpenCL C that performs some user-defined computation.
OpenCL Execution Model
The OpenCL host uses the OpenCL API to query and select desired compute devices, submit work to these devices and manage the workload across compute contexts and work-queues. Looking deeper into the execution of this work, OpenCL kernels run in parallel on the various PEs of the compute device over a predefined N-dimensional computation domain. Each execution is termed a work-item, and these work-items are grouped together into 'work-groups' (thread blocks). In summary, the execution model is a mix of 'fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism'. The host primarily creates a 'Context' to manage OpenCL resources and shares common memory buffers, known as memory objects, with the compute device. The command queue manages the execution of kernels, in order or out of order, although commands are always queued in order. The command queue accepts kernel execution commands, memory management commands and synchronization commands.
OpenCL Actual Execution Pipeline
The simplified model of the OpenCL execution pipeline is listed below:
- CPU host defines an N-dimensional computation domain over some region of DRAM memory.
- Every point in this domain is a work-item; each work-item is slated to execute the same kernel concurrently within the compute unit
- The host groups the work-items into work-groups. Work-items within a work-group share the same local memory.
- These work-groups are placed onto a work-queue.
- Host DRAM is then loaded into the device's global memory and each work-group is executed on the compute device.
Note that each PE executes purely sequential code.
OpenCL Memory Architecture
OpenCL defines a four-level memory hierarchy for the compute device:
- Global Memory
- Read-only (Constant) Memory
- Local Memory
- Per-element Private Memory (registers)
The global memory is shared by all processing elements but has high access latency, while the local memories are shared within a group of processing elements. The read-only (constant) memory can be written to by the host CPU but not by the compute device itself. It is important to note that a device may or may not share memory with the host CPU. The lowest-level execution unit has a small private memory space for program registers.
The memory architecture is structured in such a way that each work-group has independent memory access but can communicate with other work-groups through shared memory (global memory).
This model allows work-items within a work-group to synchronize with each other, as they share the same local memory, while work-groups can communicate with each other because they share the same global memory. It is important to check the return codes from the host API when statically allocating local data (per work-group); this helps detect instances of allocating more local memory than the device provides.
OpenCL Programming Language Structure
The OpenCL programming language is based on C. Address spaces are indicated using qualifiers such as '__global', '__constant', '__local' and '__private', and kernel functions are marked with the '__kernel' qualifier. Recursion, variable-length arrays, function pointers and bit fields are not allowed. The language provides fixed-length vector types such as float4 (a 4-vector of single-precision floats), which are designed to map onto SIMD instruction sets.
Limitations in OpenCL
In OpenCL, the global work size must be a multiple of the work-group size, and the work-group size must be less than or equal to the maximum allowable kernel work-group size on the device. The maximum work-group size indicates the maximum number of concurrent threads within a work-group on the device.
Types of Parallel Programming
There are two major types of parallel programming:
- Data-parallel programming: data is distributed across different nodes, which operate on it in parallel. Execution on these nodes is deemed independent.
- Task-parallel programming: distributed tasks are performed concurrently by processes or threads across different processors.
The OpenCL API gives room for both forms of parallelism; however, it is important to consider how well the kind of parallelism we want to implement complements the OpenCL memory architecture. For example, access to global memory is slow, so one can write code in which work-groups are separated along task-parallel lines (since threads within a work-group can share local memory) and work-items along data-parallel lines. At times, one instruction executes identical code in parallel across different threads, with each thread operating on different data.
Case Examination: The Heterogeneous Computing Effect
In this section, we will look at cases where specific hardware is used to perform a dedicated vision task and examine, both theoretically and practically, the possible improvements we could gain. We will first look at using an FPGA dedicated to handling circle detection; afterwards, we will examine GPU vs CPU performance on the same task, and finally discuss ConvNets and why heterogeneous computing proves to be a good thing for them.
Case 1: Accelerating Circle Detection Using OpenCL on FPGA
The steps listed below show the procedure taken in the development of the OpenCL application for circle detection on the FPGA. Diagram 5 below gives a diagrammatic explanation of the implementation process:
- Write the OpenCL application code, which is made up of a C/C++ host program and OpenCL kernels
- Compile the host program using a C compiler, e.g. GCC. This creates the executable binary
- Compile the kernels using the Altera Offline Compiler (AOC). This creates the .aocx file
- Use the AOCL utility to load the .aocx into the FPGA
- Run the host program, which will launch the OpenCL kernels on the FPGA (accelerator) whenever it needs to do so.
The steps highlighted below show, at a granular level, how the OpenCL kernel is executed on the FPGA at runtime from the host platform:
- Host platform initializes OpenCL by querying available platforms
- Desired platform is selected
- Select desired device (CPU, GPU, ACCELERATOR etc.)
- Create context to manage the selected device and kernel execution
- Create command queue attached to the device context
- Create memory objects attached to the context
- In preparation for the kernel about to be launched, write all needed data from host memory into the memory objects (created in the step above), which will actually reside in the selected device's memory
- Load kernel file (.cl) containing one or more kernels to be launched
- Create kernel program object (a collection of ready to run kernels) using the loaded kernel file
- Build the created program object into binaries
- Extract only the required kernel(s) from the built program
- Pass the arguments the kernel needs, one after the other
- Enqueue the task by passing the command queue and the kernel to the enqueue function
- Read device memory object content back to the host memory
- Use a synchronization mechanism (e.g. clFinish(commandqueue)) to ensure that the kernel has finished execution before the next host function is run (if need be)
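The steps above can be sketched as a minimal host program using the standard OpenCL C API. This is an illustrative skeleton, not the circle-detection host code: error checking is elided, it assumes a vector-add kernel, and it builds from source with clCreateProgramWithSource (on the FPGA flow, that call is replaced by clCreateProgramWithBinary loading the precompiled .aocx image). Building it requires an OpenCL SDK (link with -lOpenCL) and running it requires an installed device driver.

```c
#include <stdio.h>
#include <CL/cl.h>

static const char *src =
    "__kernel void vec_add(__global const float *a,\n"
    "                      __global const float *b,\n"
    "                      __global float *c)\n"
    "{ int i = get_global_id(0); c[i] = a[i] + b[i]; }\n";

int main(void)
{
    enum { N = 1024 };
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    cl_platform_id platform;                       /* query platforms      */
    clGetPlatformIDs(1, &platform, NULL);
    cl_device_id device;                           /* select a device      */
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

    cl_int err;                                    /* context + queue      */
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

    /* Memory objects in device memory, filled from host memory. */
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY, sizeof a, NULL, &err);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY, sizeof b, NULL, &err);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof c, NULL, &err);
    clEnqueueWriteBuffer(q, da, CL_TRUE, 0, sizeof a, a, 0, NULL, NULL);
    clEnqueueWriteBuffer(q, db, CL_TRUE, 0, sizeof b, b, 0, NULL, NULL);

    /* Build the program, extract the kernel, set its arguments. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vec_add", &err);
    clSetKernelArg(k, 0, sizeof(cl_mem), &da);
    clSetKernelArg(k, 1, sizeof(cl_mem), &db);
    clSetKernelArg(k, 2, sizeof(cl_mem), &dc);

    /* Enqueue the NDRange, read results back, synchronize. */
    size_t global = N, local = 64;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, &local, 0, NULL, NULL);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof c, c, 0, NULL, NULL);
    clFinish(q);

    printf("c[10] = %f\n", c[10]);   /* a[10] + b[10] = 10 + 20 */
    return 0;
}
```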
Experiments have shown that if we direct all circle detection tasks to dedicated hardware like an FPGA in a vision system, we can gain a massive improvement in object detection speed.
Case 2: Acceleration of Circle Detection on GPU (OpenCL)
To test and understand how large a performance improvement can be obtained in Computer Vision applications through heterogeneous computing, a computationally expensive vision task, circle detection, is experimented with. Here we will see how to achieve this improvement using OpenCL and test how large it actually is. I used OpenCL with MATLAB to fire kernels on the GPU for circle detection, via an OpenCL toolbox built by Radford Juang. This toolbox creates a gateway through which OpenCL kernels can be called from MATLAB, as MATLAB currently only supports CUDA natively. Prior to this, the circle detection function was written in C/C++/OpenCL and tested through Visual Studio on an AMD Radeon R5 graphics card to ensure functionality. The code excerpt below shows the OpenCL kernel. Note that the circle detection algorithm implemented is just a functional one and not a highly optimized version; the aim here is to see how much improvement we can get by executing circle detection on a GPU rather than a CPU.
Case 3: ConvNets in General
As we know, ConvNets are a modified type of Multi-Layer Perceptron which makes use of convolution layers for feature extraction. CPUs have been shown not to be the best fit for convolution; dedicated FPGA accelerators as well as GPUs do better with these kinds of operations. On the other hand, there exists a layer in most CNNs called the Softmax layer, which produces a probability distribution over the outputs of a ConvNet model. This kind of lightweight, largely sequential operation is typically best left to the CPU.
Similarly, where a bit of image processing needs to be done, such as in computer vision applications, we know that the GPU architecture is best optimized for this kind of operation. In the earlier days when only CPUs were used, performance was therefore limited in many Deep Learning application areas. However, with the capability brought forward by heterogeneous computing, each of these different tasks can now be properly distributed to the hardware best suited to it. This benefit is far-reaching and has helped make Deep Learning a more appealing field.