Free Data Publicly Available
Free datasets are often hard to obtain, either because the data includes sensitive information or because it was very costly to create. This page provides an overview of available datasets for practicing big data analysis. They do not necessarily represent big data in terms of volume; rather, they are application-specific examples for testing particular analysis and analytics techniques.
Free Data of Movie Ratings
Free movie rating datasets are available from the MovieLens web site and can be found here. Several datasets were collected over various periods of time and vary in size. Please refer to our movie ratings article for a detailed list of the freely available datasets.
Free Data of PANGAEA online collection
PANGAEA is an open access library aimed at archiving, publishing and distributing georeferenced data from earth system research. The system guarantees long-term availability of its content. Most of the PANGAEA datasets are freely available and can be used under the terms of the license stated in the description of each dataset. The description of each dataset is always visible and names the principal investigator (PI), who may be asked for access or referenced. It is important to understand that each dataset can be identified, shared, published and cited using a Digital Object Identifier (DOI).
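Because every dataset is addressed by a DOI, a citation string can be turned into a resolvable link with very little code. The following is a minimal sketch, not part of PANGAEA's own tooling: the helper name `doi_to_url` is hypothetical, the `10.1594/PANGAEA.` form of the identifier is an assumption about how PANGAEA DOIs are usually written, and the number in the example is a placeholder rather than a real dataset.

```python
def doi_to_url(doi: str) -> str:
    """Turn a DOI string into a link on the public doi.org resolver."""
    cleaned = doi.strip()
    # Accept common written forms such as "doi:10.x/..." or a full resolver URL.
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    # Every DOI starts with the directory indicator "10.".
    if not cleaned.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    return "https://doi.org/" + cleaned


# Placeholder identifier (assumption), used only to show the shape of the input.
example = "doi:10.1594/PANGAEA.000000"
print(doi_to_url(example))
```

The resulting link can be pasted into a browser or a reference manager; the resolver then redirects to the dataset description page.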
Free Data of UCI Machine Learning Repository
The UCI Machine Learning Repository provides a large variety of data through roughly 350 free datasets. It is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms. Each dataset is very well described. Information about the dataset includes the number of instances, area of interest, attribute characteristics, number of attributes, date donated, and missing values. Most notably, associated tasks such as classification and clustering are given. This repository is therefore an excellent way to start trying big data mining and machine learning algorithms.
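The descriptive fields listed above (number of instances, number of attributes, missing values) can be checked against a downloaded file before any modelling starts. The sketch below assumes a comma-separated file without a header row, with the class label in the last column and `?` as the missing-value marker; these are common conventions but not guaranteed for every dataset. The sample rows are made up for illustration and the function name `summarize` is hypothetical.

```python
import csv
import io

# Made-up sample rows standing in for a downloaded data file (assumption).
SAMPLE = """\
5.1,3.5,?,a
4.9,?,1.4,a
6.2,2.9,4.3,b
5.9,3.0,5.1,b
"""


def summarize(text: str, missing: str = "?") -> dict:
    """Count instances, attributes, missing cells and class labels."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if not rows:
        raise ValueError("no data rows found")
    return {
        "instances": len(rows),
        # The last column is treated as the class label, not as an attribute.
        "attributes": len(rows[0]) - 1,
        "missing_values": sum(
            cell.strip() == missing for row in rows for cell in row
        ),
        "classes": sorted({row[-1] for row in rows}),
    }


print(summarize(SAMPLE))
```

Comparing these counts with the figures on the dataset's description page is a quick way to notice a truncated download or an unexpected file layout.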
Free Data of R Tool
The ‘Statistical Computing with R’ tool offers a wide variety of free datasets as part of its distribution within libraries and packages. One package is called ‘datasets’ and another library with free data is called ‘MASS’. Please refer to our R Datasets article for a detailed list of the freely available datasets.
Free Data of MNIST
There is a free dataset of handwritten digit images (0-9) with associated labels that serves well as a benchmark for classification techniques in handwriting recognition. Our article about the MNIST dataset provides more information, while our article about the MNIST database explains how to download and work with the dataset.
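To give an idea of what working with the images and labels involves, here is a minimal sketch of reading the IDX layout in which the original MNIST files are distributed: two zero bytes, a data-type code (`0x08` for unsigned bytes), the number of dimensions, then one big-endian 32-bit size per dimension, followed by the raw values. The function name `parse_idx` is hypothetical, and the buffer is built by hand rather than downloaded, so the labels in it are illustrative only.

```python
import struct


def parse_idx(data: bytes):
    """Return (shape, payload) for an IDX buffer holding unsigned bytes."""
    zero, dtype_code, ndim = struct.unpack(">HBB", data[:4])
    if zero != 0 or dtype_code != 0x08:
        raise ValueError("not an unsigned-byte IDX buffer")
    # One big-endian 32-bit integer per dimension follows the magic number.
    shape = struct.unpack(">" + "I" * ndim, data[4:4 + 4 * ndim])
    payload = data[4 + 4 * ndim:]
    expected = 1
    for dim in shape:
        expected *= dim
    if len(payload) != expected:
        raise ValueError("payload length does not match header")
    return shape, payload


# Synthetic label buffer with three labels (7, 2, 1), built by hand.
labels = struct.pack(">HBBI", 0, 0x08, 1, 3) + bytes([7, 2, 1])
shape, payload = parse_idx(labels)
print(shape, list(payload))
```

An image file uses the same layout with three dimensions (number of images, rows, columns), so the same function applies once the file has been decompressed and read into memory.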
Free Data of ImageNet
A very interesting dataset is ImageNet, which consists of ~14,000,000 images, roughly 1,000,000 of which come with bounding-box annotations. In order to download and work with the data you need to log in. The data is available here.
Free Data of UCML
The UC Merced Land Use (UCML) dataset is a publicly available aerial image dataset that is often used in the domain of remote sensing for benchmarking different machine learning models. It is a set of approximately 30-cm spatial resolution images that have been acquired by the U.S. Geological Survey. The dataset consists of 2100 image patches of size 256 × 256 (RGB bands) from various U.S. cities and is separated into 21 classes with 100 instances per class. The dataset (~317 MB zipped) is available here.
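The figures above can be cross-checked with a few lines of arithmetic, which also gives a rough idea of how much memory the uncompressed pixels need. This is a sketch under one assumption not stated in the text: each band is stored with 8 bits per pixel. All variable names are illustrative.

```python
CLASSES = 21
PER_CLASS = 100
WIDTH = HEIGHT = 256      # pixels per patch side
BANDS = 3                 # RGB
BYTES_PER_SAMPLE = 1      # assumption: 8 bits per band
RESOLUTION_CM = 30        # approximate ground sampling distance

n_images = CLASSES * PER_CLASS
raw_bytes = n_images * WIDTH * HEIGHT * BANDS * BYTES_PER_SAMPLE
raw_mib = raw_bytes / 2**20
patch_side_m = WIDTH * RESOLUTION_CM / 100

print(f"images: {n_images}")
print(f"raw pixel data: {raw_bytes} bytes ({raw_mib:.2f} MiB)")
print(f"ground coverage per patch side: {patch_side_m:.1f} m")
```

Under that assumption, the 21 × 100 patches add up to the 2100 images quoted above, the raw pixels occupy 412,876,800 bytes (about 394 MiB), and each patch covers roughly 77 m on a side. The zipped download is smaller than the raw pixel size because the archive is compressed.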
More on free data
The following video provides an example of how to obtain remote sensing datasets: