Sampling Methods
Sampling methods are techniques that pick a chosen number L of samples out of the N data items in a dataset for data analysis. More formally, sampling methods select a subset of observations or individual data items from a statistical population. In machine learning, it is assumed that the whole dataset was generated by some probability distribution that is represented in the statistical population. The goal is then to learn from a sample of the statistical population in order to estimate characteristics of the whole population. This area of machine learning overlaps heavily with statistics in general and with statistical learning theory in particular. The statistical assumptions guide the machine learning process and make ‘learning from data’ feasible. A wide variety of sampling methods exists, and the choice depends on the task at hand. Commonly used methods are ‘simple random sampling’, ‘systematic sampling’, ‘stratified sampling’, and, often used in practice, ‘diversity-based sampling’.
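To make the first three methods more concrete, here is a minimal Python sketch that picks L samples out of N data items. The function names, the toy data, and the made-up labels used as strata are assumptions for illustration only. The stratified version allocates samples in proportion to group size with rounding, so for uneven groups the total can differ slightly from L.

```python
import random
from collections import defaultdict


def simple_random_sample(items, l, seed=0):
    """Draw l items uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(items, l)


def systematic_sample(items, l, seed=0):
    """Take every k-th item after a random start, with k = N // l."""
    rng = random.Random(seed)
    k = len(items) // l          # sampling interval (assumes l <= N)
    start = rng.randrange(k)     # random offset within the first interval
    return [items[start + i * k] for i in range(l)]


def stratified_sample(items, labels, l, seed=0):
    """Sample from each label group in proportion to its size (rounded)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item, label in zip(items, labels):
        strata[label].append(item)
    sample = []
    for group in strata.values():
        n = max(1, round(l * len(group) / len(items)))
        sample.extend(rng.sample(group, min(n, len(group))))
    return sample


data = list(range(100))           # N = 100 toy data items
labels = [x % 4 for x in data]    # four equally sized, made-up strata

srs = simple_random_sample(data, 10)
sys_sample = systematic_sample(data, 10)
strat = stratified_sample(data, labels, 20)
```

With this toy data, the systematic sample contains items that are exactly 10 positions apart, and the stratified sample contains 5 items from each of the four strata.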
Diversity-based Sampling Methods
In machine learning, datasets sometimes contain many very similar data items (e.g. mobile phone data), and in such cases diversity-based sampling makes sense. Removing very similar data items from the dataset decreases the computational load and thus makes training more efficient. The shorter training time contributes to an overall lower time to solution, especially when applying cross-validation, where the model is trained several times. In addition, the training set after applying ‘diversity-based sampling’ is less redundant and can represent the features of the data better.
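One simple way to remove very similar data items can be sketched as follows. This is only an illustration under stated assumptions, not a standard algorithm from a specific library: similarity is measured with the Euclidean distance, the threshold min_dist is a hypothetical parameter, and the function name diversity_sample is made up. An item is kept only if it is at least min_dist away from every item kept so far.

```python
import math


def diversity_sample(points, min_dist):
    """Greedily keep points that are not too close to any kept point."""
    kept = []
    for p in points:
        # Keep p only if it is at least min_dist away from all kept points.
        if all(math.dist(p, q) >= min_dist for q in kept):
            kept.append(p)
    return kept


# Toy data with two pairs of near-duplicates (made-up values).
points = [(0.0, 0.0), (0.01, 0.0), (1.0, 1.0), (1.0, 1.02), (5.0, 5.0)]
reduced = diversity_sample(points, min_dist=0.1)
print(reduced)  # [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
```

The greedy pass compares each item with all kept items, so its cost grows with the size of the reduced set. The result also depends on the order of the data items and on the chosen threshold.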
Short video about sampling methods
A nice summary about sampling is provided in the following video: