TensorFlow TFRecord; Loading In and Popping Out Data (Part 1)
Author: Oluwole Oyetoke (10th December, 2017)
Having already trained neural nets with MatConvNet, a neural network toolbox for MATLAB, I felt confident that training a CNN with TensorFlow would not be much of a hassle. To get my feet wet, I started with some basic TensorFlow tutorials, including the TensorFlow CNN guide. I followed the tutorial through and watched my convolutional neural network learn on the MNIST dataset, and I was confident I now knew how to work with ConvNets in TensorFlow. Unfortunately, that confidence was reduced to rubble a few weeks later, when I wanted to use a raw dataset from the German Traffic Sign Recognition Benchmark (GTSRB). At that point I discovered that the MNIST dataset used in the TensorFlow CNN guide came ready-made. As a matter of fact, it was imported with just a single line of code:
mnist = tf.contrib.learn.datasets.load_dataset("mnist")
With a bit of Googling, it wasn't long before I discovered that what I needed is called a 'TFRecord' in TensorFlow. It is basically TensorFlow's default data format: a record file containing serialized tf.train.Example messages. To avoid confusion, a TensorFlow 'Example' is a normalized data format for storing data for training and inference purposes. It contains a key-value store in which each key string maps to a feature message, which can be a packed byte list, a float list, or an int64 list. Many 'Examples' come together to form a TFRecord.
This post shows, step by step, how to create a TFRecord file housing images. Note that:
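To make the key-value idea concrete, here is a minimal sketch of packing one image into a tf.train.Example and writing a few of them to a TFRecord file. It assumes a TensorFlow version that exposes tf.io.TFRecordWriter (older 1.x releases name it tf.python_io.TFRecordWriter). The feature keys and the helper names make_example and write_tfrecord are my own choices for illustration, not something prescribed by TensorFlow.

```python
import tensorflow as tf


def make_example(image_bytes, label, height, width):
    """Pack one encoded image and its metadata into a tf.train.Example.

    The feature keys below are illustrative; any string keys work as long
    as the reader uses the same ones.
    """
    feature = {
        # Raw encoded image, stored as a bytes list with a single entry.
        "image/encoded": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[image_bytes])),
        # Integer class label and image shape, stored as int64 lists.
        "image/class/label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[label])),
        "image/height": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[height])),
        "image/width": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[width])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))


def write_tfrecord(path, examples):
    """Serialize each Example and append it to one TFRecord file."""
    with tf.io.TFRecordWriter(path) as writer:
        for example in examples:
            writer.write(example.SerializeToString())
```

Each Example is serialized to a byte string before being written, which is why a TFRecord file is best thought of as a sequence of serialized Examples rather than a table.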
- The image dataset should be in .jpeg format
- It helps to have only folders in the base directory containing your training images, i.e. 'Base folder -> Subfolders -> Each subfolder containing a specific class of image'. E.g. 'Training Folder -> stop_sign_folder -> 1.jpg, 2.jpg, 3.jpg, ...'
- The name of each sub-directory should be the unique label associated with the images in that folder.
- The labels file should contain a list of valid labels relating to the class folder names. Each line corresponds to a label.
[Figures: Sample Label File; Sample Directory Structure]
Like any programming challenge, there are multiple ways to create our own TFRecord file in TensorFlow. Here, we will use an adapted version of a TFRecord creation script that I dug out of the TensorFlow code repository on GitHub. The script contains 5 major functions which work hand in hand to deliver the TFRecord as fast as possible. These are:
_find_image_files()
- Connects to the location where the label list and image files are stored.
- Builds a list of all the dataset image file paths and their labels, as strings and integers.
_process_image_files()
- Creates multiple threads.
- Fires many simultaneous runs of the function _process_image_files_batch().
_process_image_files_batch()
- Creates a TFRecord writer.
- Streams the details of every image passed to it to _process_image(), which formats them and extracts only the needed data.
- Uses the details returned by _process_image() to call _convert_to_example(), which returns the actual proto Example object.
- Writes the proto Example into the TFRecord file.
_process_image()
- Ensures image files are in the proper format, e.g. '.JPEG'.
- Extracts the needed details (returns the image bytes, shape, etc.).
_convert_to_example()
- Uses the arguments passed to it to create proto Examples.
Code Excerpt: _find_image_files()
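A minimal sketch of the idea behind this function, not the script's actual code: read the labels file, collect the JPEG files in each label's sub-folder, and return parallel lists of file paths, text labels and integer labels. The name find_image_files, its parameters, the choice to start integer labels at 1, and the fixed shuffle seed are all assumptions made for illustration.

```python
import glob
import os
import random


def find_image_files(data_dir, labels_file, seed=12345):
    """Return (filenames, texts, labels) for every JPEG under data_dir.

    data_dir is expected to contain one sub-folder per label, and
    labels_file one label (sub-folder name) per line.
    """
    with open(labels_file) as f:
        unique_labels = [line.strip() for line in f if line.strip()]

    filenames, texts, labels = [], [], []
    # Assumption: integer labels start at 1, leaving 0 free (e.g. for a
    # background class).
    for label_index, text in enumerate(unique_labels, start=1):
        pattern = os.path.join(data_dir, text, "*")
        matches = sorted(p for p in glob.glob(pattern)
                         if p.lower().endswith((".jpg", ".jpeg")))
        filenames.extend(matches)
        texts.extend([text] * len(matches))
        labels.extend([label_index] * len(matches))

    # Shuffle all three lists with the same permutation so that the
    # classes end up mixed while paths and labels stay aligned.
    order = list(range(len(filenames)))
    random.Random(seed).shuffle(order)
    return ([filenames[i] for i in order],
            [texts[i] for i in order],
            [labels[i] for i in order])
```

Shuffling here, once, means each output shard later receives a mix of classes instead of one class per shard.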
Code Excerpt: _process_image_files()
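As a rough illustration of the threading step, the sketch below splits the file list into contiguous ranges and starts one thread per range. It uses only the standard library; process_batch is a hypothetical callback standing in for _process_image_files_batch(), and split_ranges is a helper name I made up. The real script may divide the work differently.

```python
import threading


def split_ranges(num_items, num_parts):
    """Split range(num_items) into num_parts contiguous [start, end) ranges."""
    bounds = [num_items * i // num_parts for i in range(num_parts + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(num_parts)]


def process_image_files(filenames, labels, num_threads, process_batch):
    """Run process_batch on num_threads slices of the data in parallel.

    process_batch(thread_index, batch_filenames, batch_labels) is called
    once per thread with that thread's share of the files.
    """
    threads = []
    for thread_index, (start, end) in enumerate(
            split_ranges(len(filenames), num_threads)):
        thread = threading.Thread(
            target=process_batch,
            args=(thread_index, filenames[start:end], labels[start:end]))
        thread.start()
        threads.append(thread)

    # Wait until every thread has finished its batch.
    for thread in threads:
        thread.join()
```

Because each thread works on its own slice and would write to its own output file, the threads never need to share a writer.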
Code Excerpt: _process_image_files_batch()
Code Excerpt: _process_image()
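The sketch below shows one way the two jobs of this function could be done: check that the bytes really are a JPEG, then return the image bytes together with its shape. To stay dependency-free it reads the dimensions straight from the JPEG header instead of decoding the image with TensorFlow or another imaging library, so treat it as a stand-in for the script's approach rather than a copy of it. The names is_jpeg, jpeg_shape and process_image are hypothetical.

```python
import struct

# Start-of-frame markers, whose segment carries the image dimensions.
_SOF_MARKERS = {0xC0, 0xC1, 0xC2, 0xC3, 0xC5, 0xC6, 0xC7,
                0xC9, 0xCA, 0xCB, 0xCD, 0xCE, 0xCF}


def is_jpeg(data):
    """True if the bytes start with the JPEG start-of-image signature."""
    return data[:3] == b"\xff\xd8\xff"


def jpeg_shape(data):
    """Return (height, width, channels) by scanning JPEG segment headers."""
    if not is_jpeg(data):
        raise ValueError("not a JPEG stream")
    pos = 2  # skip the 2-byte start-of-image marker
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("malformed JPEG segment")
        marker = data[pos + 1]
        if marker == 0xFF:  # padding byte before a marker
            pos += 1
            continue
        if marker == 0x01 or 0xD0 <= marker <= 0xD9:  # markers without payload
            pos += 2
            continue
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if marker in _SOF_MARKERS:
            # Payload: precision (1 byte), height (2), width (2), channels (1).
            _, height, width, channels = struct.unpack(
                ">BHHB", data[pos + 4:pos + 10])
            return height, width, channels
        pos += 2 + length  # jump to the next segment
    raise ValueError("no start-of-frame segment found")


def process_image(filename):
    """Read one image file and return (image_bytes, height, width, channels)."""
    with open(filename, "rb") as f:
        image_bytes = f.read()
    height, width, channels = jpeg_shape(image_bytes)
    return image_bytes, height, width, channels
```

Checking the file signature rather than the file extension catches images that were renamed to .jpg without actually being converted.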
Code Excerpt: _convert_to_example()
For a better understanding, the diagram below shows the code flow.