MNIST Database

by www.big-data.tips · Published May 27, 2017 · Updated May 27, 2017

The MNIST database contains handwritten digits that are often used to train and test machine learning algorithms and pattern recognition methods. This article explains step by step how to download and work with the MNIST dataset and how to view the digits as images. Please refer to our article MNIST dataset for a more general description of the data itself.

Step 1 – MNIST Dataset Download
The MNIST database, with its two datasets for training and testing, can be downloaded here when you use a Windows system. On Linux you can use the following commands to download the four files of the dataset directly:

wget http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
wget http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
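
On systems without wget, the same four files could be fetched from Python. The following is a minimal sketch using only the standard library. The helper names mnist_urls and download_all are hypothetical, and the URLs are simply those from the wget commands above, so the sketch assumes that host is still reachable.

```python
import os
import urllib.request

# Base URL and file names are copied from the wget commands above.
BASE_URL = "http://yann.lecun.com/exdb/mnist/"
FILES = [
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
]


def mnist_urls(base_url=BASE_URL):
    """Return the full download URL for each of the four archives."""
    return [base_url + name for name in FILES]


def download_all(target_dir=".", base_url=BASE_URL):
    """Download every archive that is not already present in target_dir."""
    os.makedirs(target_dir, exist_ok=True)
    for url in mnist_urls(base_url):
        destination = os.path.join(target_dir, os.path.basename(url))
        if not os.path.exists(destination):
            urllib.request.urlretrieve(url, destination)
```

Calling download_all() would place the four .gz archives in the current directory, after which the unpacking step applies unchanged.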

Then unpack the files using the following command:

gunzip train-images-idx3-ubyte.gz train-labels-idx1-ubyte.gz t10k-images-idx3-ubyte.gz t10k-labels-idx1-ubyte.gz
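
Where gunzip is not available, the archives could also be unpacked from Python. This is a small sketch using the standard gzip module. The function name gunzip_file and its keep_archive parameter are hypothetical and only mimic the default behaviour of the command above.

```python
import gzip
import os
import shutil


def gunzip_file(gz_path, keep_archive=False):
    """Decompress gz_path next to itself and return the output path."""
    if not gz_path.endswith(".gz"):
        raise ValueError("expected a file name ending in .gz")
    out_path = gz_path[:-3]  # strip the ".gz" suffix
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    if not keep_archive:
        os.remove(gz_path)  # gunzip also removes the archive by default
    return out_path
```

Applied to each of the four downloaded files, for example gunzip_file("train-images-idx3-ubyte.gz"), this should leave the same unpacked files as the gunzip command.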

The dataset is now available for training and testing machine learning models. Neither dataset uses a standard image format; to understand the data structure we recommend having a look here. If you only want to train machine learning models, there is no need to convert the dataset to images as described in the next step.
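
To make the data structure more concrete, here is a hedged sketch of how the unpacked files could be parsed directly. It assumes the IDX layout that the conversion script in Step 2 also relies on: a big-endian header (magic number and item count, plus rows and columns for images) followed by one unsigned byte per label or pixel. The function names parse_labels and parse_images are hypothetical.

```python
import struct

LABEL_MAGIC = 2049  # 0x00000801: unsigned bytes, one dimension
IMAGE_MAGIC = 2051  # 0x00000803: unsigned bytes, three dimensions


def parse_labels(raw):
    """Return the labels of an IDX label file as a list of ints."""
    magic, count = struct.unpack(">II", raw[:8])
    if magic != LABEL_MAGIC:
        raise ValueError("not an IDX label file")
    return list(raw[8:8 + count])


def parse_images(raw):
    """Return (images, rows, cols); each image is a flat bytes object."""
    magic, count, rows, cols = struct.unpack(">IIII", raw[:16])
    if magic != IMAGE_MAGIC:
        raise ValueError("not an IDX image file")
    size = rows * cols
    pixels = raw[16:]
    images = [pixels[i * size:(i + 1) * size] for i in range(count)]
    return images, rows, cols
```

With the real files, the input would be the full file content, for example open("train-labels-idx1-ubyte", "rb").read(). The resulting lists can be handed to a training routine without ever writing image files.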

Step 2 – Convert MNIST Digits into PNG Images
To convert the data structure of the MNIST database into PNG images we use the small Python script below, which in turn uses the PyPNG module that is available here. Instead of installing this module we can alternatively run the following command, which downloads its single source file png.py:

curl -LO https://raw.github.com/drj11/pypng/master/code/png.py

The Python script that performs the conversion should be in the same directory as the png.py file downloaded above. The original version of the Python script is available here, but we list it below to explain it better.

#!/usr/bin/env python

import os
import struct
import sys

from array import array
from os import path

import png

# source: http://abel.ee.ucla.edu/cvxopt/_downloads/mnist.py
def read(dataset = "training", path = "."):
    # compare strings with ==; "is" tests object identity, not equality
    if dataset == "training":
        fname_img = os.path.join(path, 'train-images-idx3-ubyte')
        fname_lbl = os.path.join(path, 'train-labels-idx1-ubyte')
    elif dataset == "testing":
        fname_img = os.path.join(path, 't10k-images-idx3-ubyte')
        fname_lbl = os.path.join(path, 't10k-labels-idx1-ubyte')
    else:
        raise ValueError("dataset must be 'testing' or 'training'")

    flbl = open(fname_lbl, 'rb')
    magic_nr, size = struct.unpack(">II", flbl.read(8))
    lbl = array("b", flbl.read())
    flbl.close()

    fimg = open(fname_img, 'rb')
    magic_nr, size, rows, cols = struct.unpack(">IIII", fimg.read(16))
    img = array("B", fimg.read())
    fimg.close()

    return lbl, img, size, rows, cols

def write_dataset(labels, data, size, rows, cols, output_dir):
    # create output directories
    output_dirs = [
        path.join(output_dir, str(i))
        for i in range(10)
    ]
    for dir in output_dirs:
        if not path.exists(dir):
            os.makedirs(dir)

    # write data
    for (i, label) in enumerate(labels):
        output_filename = path.join(output_dirs[label], str(i) + ".png")
        print("writing " + output_filename)
        with open(output_filename, "wb") as h:
            w = png.Writer(cols, rows, greyscale=True)
            data_i = [
                data[ (i*rows*cols + j*cols) : (i*rows*cols + (j+1)*cols) ]
                for j in range(rows)
            ]
            w.write(h, data_i)

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("usage: {0} <input_path> <output_path>".format(sys.argv[0]))
        sys.exit()

    input_path = sys.argv[1]
    output_path = sys.argv[2]

    for dataset in ["training", "testing"]:
        labels, data, size, rows, cols = read(dataset, input_path)
        write_dataset(labels, data, size, rows, cols,
                      path.join(output_path, dataset))
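
The slicing inside write_dataset is the least obvious part of the script: all images sit back to back in one flat buffer, and each image is cut into rows before it is handed to the PNG writer. The sketch below isolates that step on a tiny, made-up buffer. The helper name image_rows and the 2x3 image size are purely illustrative.

```python
def image_rows(data, i, rows, cols):
    """Return image i from the flat pixel buffer as a list of rows."""
    start = i * rows * cols  # offset of the first pixel of image i
    return [
        data[start + j * cols : start + (j + 1) * cols]
        for j in range(rows)
    ]


# Two hypothetical 2x3 images stored back to back in one flat list.
flat = [0, 1, 2, 3, 4, 5, 10, 11, 12, 13, 14, 15]
second = image_rows(flat, 1, rows=2, cols=3)
print(second)  # [[10, 11, 12], [13, 14, 15]]
```

For the actual MNIST data the header supplies rows and cols, and the same indexing selects the pixels of image i.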

The script processes the training and the testing dataset in turn and converts the dataset files into PNG images using the PyPNG module. To store the images by label (0-9), the script also reads the label files. Using the script is simple and its execution does not take long: in the realm of big data, this conversion does not need parallel computing power, although certain modeling activities on the datasets might benefit from parallel processing. First we need to create two directories called training and testing in the same directory as the Python script. We can use the following commands:

mkdir training
mkdir testing

Next we copy the four dataset files from Step 1 into these directories. For the training and the testing dataset, including their label files, we run the following commands:

cp train-images-idx3-ubyte training
cp train-labels-idx1-ubyte training
cp t10k-images-idx3-ubyte testing
cp t10k-labels-idx1-ubyte testing

Assuming that the Python script above is saved in the file convert.py, we run the following command to convert the data into PNG images. The first parameter is ‘.’, the current directory of the script, which is also where the script as listed looks for the four unpacked dataset files. The second parameter ‘./PNG’ is the directory that will contain the PNG images once the conversion is finished.

python convert.py . ./PNG

After this command has run, the directory ‘./PNG’ contains the PNG images in the following structure: training or testing / image label / id.png. If you would like to look at some of the images on Linux, such as the example image 689.png, you can use the display command (part of ImageMagick) as follows:

display 689.png
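
Besides looking at single images, it can be useful to check that the conversion produced the expected directory structure. The following sketch counts the PNG files per label directory. The function name count_pngs is hypothetical, and the sketch assumes the layout described above (one sub-directory per label).

```python
import os


def count_pngs(dataset_dir):
    """Map each label sub-directory name to its number of .png files."""
    counts = {}
    for label in sorted(os.listdir(dataset_dir)):
        label_dir = os.path.join(dataset_dir, label)
        if os.path.isdir(label_dir):
            counts[label] = sum(
                1 for name in os.listdir(label_dir) if name.endswith(".png")
            )
    return counts
```

For example, count_pngs("./PNG/training") returns a dictionary with one entry per digit. For the complete dataset the counts should sum to 60,000 for training and 10,000 for testing.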

