MNIST Dataset
The MNIST dataset consists of handwritten digits. It is not big data in the strict sense, but it is widely used as a benchmark dataset for machine learning algorithms. It contains 60000 training samples (~47 MB) and 10000 test samples (~7.8 MB) of handwritten digits. The images and their corresponding labels (values 0 to 9) are stored in separate files for training and test. All digits have been size-normalized and centered in a fixed-size image of 28 * 28 pixels, so they can be processed directly. The dataset is a subset of a larger dataset from NIST. It is thus real-world data that has already been preprocessed and formatted, so one can start directly with the analysis. Although MNIST is not a very challenging dataset, it is an excellent dataset for trying out machine learning algorithms or pattern recognition methods. Because it has already been used with a wide variety of machine learning models, there are many existing results one can compare against. In addition, it is well suited for experimenting with rather new learning models, for example deep learning with a convolutional neural network.
When working with the dataset, however, it is important to understand that it is not stored in any standard image format like jpg, bmp, or gif. Instead, the data samples are stored in a simple file format designed for storing vectors and multidimensional matrices. Graphics viewers do not know this format, so one typically needs to write a small program to read and work with the files. The pixels of the handwritten digit images are organized row-wise, with pixel values ranging from 0 (white background) to 255 (black foreground). The images contain grey levels in between as a result of the anti-aliasing technique used by the normalization algorithm that generated this dataset. The dataset and the detailed description of the dataset file formats are freely available for download from here. Our article on the practical work with the MNIST database explains how to download the dataset and convert it to digit images.
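To illustrate what such a small reader program might look like, the following is a minimal Python sketch. It assumes the header layout given in the published format description (commonly called IDX): two zero bytes, one type code byte (0x08 for unsigned bytes), one byte with the number of dimensions, then one 4-byte big-endian integer per dimension, followed by the raw values. The function names parse_idx and image_rows are hypothetical and chosen only for this example, and the demo data is synthetic rather than taken from MNIST.

```python
import struct


def parse_idx(raw: bytes):
    """Parse a byte string in the IDX layout holding unsigned-byte data.

    Returns (dims, values): the dimension sizes and the flat payload.
    """
    # Header: two zero bytes, a type code, and the number of dimensions.
    zero, type_code, ndim = struct.unpack_from(">HBB", raw, 0)
    if zero != 0 or type_code != 0x08:
        raise ValueError("not an unsigned-byte IDX file")
    # Each dimension size is a 4-byte big-endian unsigned integer.
    dims = struct.unpack_from(f">{ndim}I", raw, 4)
    offset = 4 + 4 * ndim
    count = 1
    for d in dims:
        count *= d
    values = raw[offset:offset + count]
    if len(values) != count:
        raise ValueError("file is truncated")
    return dims, values


def image_rows(dims, values, index):
    """Return image number `index` as a list of rows of pixel values (0..255)."""
    _, rows, cols = dims
    start = index * rows * cols
    # Pixels are stored row-wise, so each row is a contiguous slice.
    return [
        list(values[start + r * cols:start + (r + 1) * cols])
        for r in range(rows)
    ]


# Synthetic demo: two tiny 2 * 3 "images" instead of real 28 * 28 digits.
demo = struct.pack(">HBB3I", 0, 0x08, 3, 2, 2, 3) + bytes(range(12))
dims, values = parse_idx(demo)
second_image = image_rows(dims, values, 1)
```

For a real MNIST file one would pass the decompressed file contents to parse_idx, for instance the bytes returned by gzip.open(path, "rb").read() if the download is gzip-compressed. The image files should then yield three dimensions (number of images, 28, 28) and the label files a single dimension.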
Details on MNIST Dataset
The following video is interesting in this context: