MNIST Database
The MNIST database contains a dataset of handwritten digits that is often used with machine learning algorithms and pattern recognition methods. This article explains step by step how to download and work with the MNIST dataset and how to view the digits as images. Please refer to our article MNIST dataset for a more general description of the data itself.
Step 1 – MNIST Dataset Download
The MNIST database consists of two datasets, one for training and one for testing. On a Windows system it can be downloaded here. On Linux, the following commands download the four files of the dataset directly:
wget http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Then unpack the files using the following command:
gunzip train-images-idx3-ubyte.gz train-labels-idx1-ubyte.gz t10k-images-idx3-ubyte.gz t10k-labels-idx1-ubyte.gz
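On systems where gunzip is not available, the same unpacking step can be sketched with Python's standard library. This is a minimal sketch, not part of the original workflow: the helper name unpack_gz and the MNIST_ARCHIVES list are illustrative assumptions, and unlike gunzip it leaves the .gz archives in place.

```python
import gzip
import shutil
from pathlib import Path

# The four archive names from Step 1 (assumed to be in the current directory).
MNIST_ARCHIVES = [
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
]


def unpack_gz(archive_path):
    """Decompress one .gz file next to itself and return the new path."""
    archive_path = Path(archive_path)
    target_path = archive_path.with_suffix("")  # drop the trailing ".gz"
    with gzip.open(archive_path, "rb") as src, open(target_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target_path


if __name__ == "__main__":
    for name in MNIST_ARCHIVES:
        if Path(name).exists():
            print("unpacked", unpack_gz(name))
        else:
            print("missing", name)
```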
The dataset is now available for training and testing machine learning models. Neither dataset uses a standard image file format; to understand the data structure we recommend having a look here. If you only want to train machine learning models with the dataset, there is no need to convert it to images as described in the next step.
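To illustrate that the unpacked files can be used without any image conversion, the following minimal sketch parses the binary layout directly: a big-endian header followed by raw unsigned bytes. The function names parse_idx_images and parse_idx_labels are hypothetical; the header layout and the magic numbers 2051 (images) and 2049 (labels) follow the IDX format used by the MNIST files.

```python
import os
import struct
from array import array


def parse_idx_images(raw):
    """Parse the bytes of an image file into a list of flat pixel arrays."""
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != 2051:
        raise ValueError("not an image file (magic number %d)" % magic)
    pixels = array("B", raw[16:])
    if len(pixels) != count * rows * cols:
        raise ValueError("pixel data does not match the header")
    size = rows * cols
    images = [pixels[i * size:(i + 1) * size] for i in range(count)]
    return images, rows, cols


def parse_idx_labels(raw):
    """Parse the bytes of a label file into an array of labels (0-9)."""
    magic, count = struct.unpack(">II", raw[:8])
    if magic != 2049:
        raise ValueError("not a label file (magic number %d)" % magic)
    labels = array("B", raw[8:])
    if len(labels) != count:
        raise ValueError("label data does not match the header")
    return labels


if __name__ == "__main__":
    # Assumes the unpacked training files are in the current directory.
    if os.path.exists("train-images-idx3-ubyte"):
        with open("train-images-idx3-ubyte", "rb") as f:
            images, rows, cols = parse_idx_images(f.read())
        with open("train-labels-idx1-ubyte", "rb") as f:
            labels = parse_idx_labels(f.read())
        print(len(images), "images of", rows, "x", cols, "pixels")
        print("first label:", labels[0])
```

Each entry of images is a flat sequence of rows * cols grey values, which is the form many machine learning libraries expect after conversion to their own array type.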
Step 2 – Convert MNIST Digits into PNG Images
To convert the data structure of the MNIST database into PNG images we use the small Python script below, which in turn uses the PyPNG module that is available here. Instead of installing this module, we can alternatively download its single source file with the following command:
curl -LO https://raw.github.com/drj11/pypng/master/code/png.py
The Python script that performs the conversion should be in the same directory as the png.py file downloaded above. The original version of the Python script is available here, but we list it below in order to explain it.
#!/usr/bin/env python

import os
import struct
import sys

from array import array
from os import path

import png

# source: http://abel.ee.ucla.edu/cvxopt/_downloads/mnist.py
def read(dataset = "training", path = "."):
    # select the image and label file of the requested dataset
    if dataset == "training":
        fname_img = os.path.join(path, 'train-images-idx3-ubyte')
        fname_lbl = os.path.join(path, 'train-labels-idx1-ubyte')
    elif dataset == "testing":
        fname_img = os.path.join(path, 't10k-images-idx3-ubyte')
        fname_lbl = os.path.join(path, 't10k-labels-idx1-ubyte')
    else:
        raise ValueError("dataset must be 'testing' or 'training'")

    # label file: 8-byte big-endian header, then one byte per label
    flbl = open(fname_lbl, 'rb')
    magic_nr, size = struct.unpack(">II", flbl.read(8))
    lbl = array("b", flbl.read())
    flbl.close()

    # image file: 16-byte big-endian header, then one byte per pixel
    fimg = open(fname_img, 'rb')
    magic_nr, size, rows, cols = struct.unpack(">IIII", fimg.read(16))
    img = array("B", fimg.read())
    fimg.close()

    return lbl, img, size, rows, cols

def write_dataset(labels, data, size, rows, cols, output_dir):
    # create one output directory per label (0-9)
    output_dirs = [
        path.join(output_dir, str(i))
        for i in range(10)
    ]
    for output_subdir in output_dirs:
        if not path.exists(output_subdir):
            os.makedirs(output_subdir)

    # write each image into the directory of its label
    for (i, label) in enumerate(labels):
        output_filename = path.join(output_dirs[label], str(i) + ".png")
        print("writing " + output_filename)
        with open(output_filename, "wb") as h:
            w = png.Writer(cols, rows, greyscale=True)
            # cut image i out of the flat pixel buffer, one slice per row
            data_i = [
                data[ (i*rows*cols + j*cols) : (i*rows*cols + (j+1)*cols) ]
                for j in range(rows)
            ]
            w.write(h, data_i)

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("usage: {0} <input_path> <output_path>".format(sys.argv[0]))
        sys.exit()

    input_path = sys.argv[1]
    output_path = sys.argv[2]

    for dataset in ["training", "testing"]:
        labels, data, size, rows, cols = read(dataset, input_path)
        write_dataset(labels, data, size, rows, cols,
                      path.join(output_path, dataset))
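The row slicing inside write_dataset is the least obvious part of the script, so here is a small stand-alone illustration. It uses a toy buffer with two 2x3 "images" instead of real MNIST data, and the helper name image_rows is an assumption introduced only for this example.

```python
def image_rows(data, i, rows, cols):
    """Return image i as a list of rows, each a slice of the flat pixel buffer."""
    start = i * rows * cols  # offset of the first pixel of image i
    return [data[start + j * cols:start + (j + 1) * cols] for j in range(rows)]


# Toy buffer: two 2x3 images stored back to back, as in the MNIST image files.
flat = list(range(12))
print(image_rows(flat, 0, 2, 3))  # [[0, 1, 2], [3, 4, 5]]
print(image_rows(flat, 1, 2, 3))  # [[6, 7, 8], [9, 10, 11]]
```

For the real dataset, rows and cols are read from the image file header, and each list of rows is what gets passed to the PNG writer.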
The script processes the training and the testing dataset in turn and converts the dataset files into PNG images using the PyPNG module. In order to store the images according to their label (0-9), the script also uses the data in the label files. The script is simple to use and its execution does not take very long. It does not need parallel computing power, although certain modeling activities on the datasets might benefit from parallel processing.
First we need to create two directories called training and testing in the same directory as the Python script. We can use the following commands:
mkdir training
mkdir testing
Next we copy the four dataset files from Step 1 into these directories. For the training and the testing dataset, including their label files, we perform the following commands:
cp train-images-idx3-ubyte training
cp train-labels-idx1-ubyte training
cp t10k-images-idx3-ubyte testing
cp t10k-labels-idx1-ubyte testing
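Assuming that the content of the Python script above is stored in the file convert.py, we perform the following command to convert the data into PNG images. The first parameter is ‘.’, meaning the current directory of the script. The second parameter ‘./PNG’ is the directory that should contain the PNG images after the conversion is finished. Note that the read function of the script looks for the four unpacked dataset files directly in the directory given as the first parameter, so they must also be present in the current directory.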
python convert.py . ./PNG
After this command has been executed, the directory ‘./PNG’ contains the PNG images in the following structure: training or testing / image label / id.png. To have a look at one of the images on Linux, for example 689.png, use the display command, which is part of ImageMagick:
display 689.png
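To check that the conversion produced the directory structure described above, the PNG files per label can be counted. The following is a minimal sketch that assumes the output directory is called PNG as in the command above; the function name count_pngs_per_label is hypothetical. For the complete MNIST dataset the totals should be 60,000 for training and 10,000 for testing.

```python
import os


def count_pngs_per_label(split_dir):
    """Return {label: number of .png files} for the label subdirectories 0-9."""
    counts = {}
    for label in range(10):
        label_dir = os.path.join(split_dir, str(label))
        if os.path.isdir(label_dir):
            counts[label] = sum(
                1 for name in os.listdir(label_dir) if name.endswith(".png")
            )
        else:
            counts[label] = 0  # directory missing: nothing was written
    return counts


if __name__ == "__main__":
    for split in ("training", "testing"):
        counts = count_pngs_per_label(os.path.join("PNG", split))
        print(split, counts, "total:", sum(counts.values()))
```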