Generalization in Machine Learning
Generalization is a central concern when applying machine learning algorithms to big data. For example, the key goal of a machine learning classification algorithm is to build a learning model that accurately predicts the class labels of previously unseen data items. In other words, the learned model should classify future data elements correctly. Experts refer to this as generalizing well on ‘out of sample data’ while the learning algorithm learns from the available ‘in sample data’. The term ‘in sample data’ refers to the data available to us for learning; ‘out of sample data’ is unseen, unknown data, and in many cases the future data a learning model will face.
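The in-sample/out-of-sample distinction can be made concrete by holding back part of the available data and measuring accuracy on both portions. The sketch below is a hypothetical illustration using synthetic two-class Gaussian data and a simple nearest-centroid classifier (both are assumptions for the example, not a method from this article); the held-out split stands in for ‘out of sample data’.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D data: two Gaussian classes centered at (0, 0) and (2, 2).
n = 200
X = np.vstack([rng.normal(0, 1, (n, 2)), rng.normal(2, 1, (n, 2))])
y = np.array([0] * n + [1] * n)

# Shuffle and split: 70% 'in sample' (training), 30% held out as a
# stand-in for 'out of sample' data.
idx = rng.permutation(len(y))
split = int(0.7 * len(y))
train, test = idx[:split], idx[split:]

# "Learn": compute one centroid per class from the in-sample data only.
centroids = np.array([X[train][y[train] == c].mean(axis=0) for c in (0, 1)])

def predict(points):
    # Assign each point to the class of the nearest centroid.
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

in_acc = (predict(X[train]) == y[train]).mean()
out_acc = (predict(X[test]) == y[test]).mean()
print(f"in-sample accuracy:     {in_acc:.2f}")
print(f"out-of-sample accuracy: {out_acc:.2f}")
```

If the model generalizes well, the two accuracies should be close; a large gap between them is the warning sign discussed in the next paragraph.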
These terms come from learning theory and the theory of generalization, which includes the expectation that performance on ‘out of sample data’ tracks performance on ‘in sample data’. This is the first building block of the theory of generalization: if we reduce the error on ‘in sample data’, it is likely that the error on ‘out of sample data’ will also be reduced and will be approximately the same. The second building block is that the learning algorithm should, in practice, reduce the error on ‘in sample data’ and bring it as close to zero as possible. The latter can lead to a problem called overfitting, whereby the model memorizes the data instead of learning from it. A learning model that overfits the ‘in sample data’ is less likely to generalize well on ‘out of sample data’.
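Overfitting can be demonstrated with a minimal sketch: fitting polynomials of increasing degree to a small noisy sample. The sine target, the noise level, and the two degrees below are assumptions chosen for illustration; the point is that the very flexible model drives the in-sample error toward zero while its out-of-sample error stays much higher.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical regression task on [-1, 1]: a sine curve plus noise.
def target(x):
    return np.sin(3 * x)

x_train = rng.uniform(-1, 1, 15)
y_train = target(x_train) + rng.normal(0, 0.2, x_train.size)
x_test = rng.uniform(-1, 1, 200)
y_test = target(x_test) + rng.normal(0, 0.2, x_test.size)

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# A modest model (degree 3) versus one flexible enough to nearly
# interpolate, i.e. memorize, the 15 training points (degree 14).
errors = {}
for degree in (3, 14):
    coeffs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (mse(coeffs, x_train, y_train),
                      mse(coeffs, x_test, y_test))
    print(f"degree {degree:2d}: in-sample MSE {errors[degree][0]:.4f}, "
          f"out-of-sample MSE {errors[degree][1]:.4f}")
```

The degree-14 fit achieves a lower in-sample error than the degree-3 fit, yet its out-of-sample error is far larger than its in-sample error: it has memorized the noise in the training points rather than learned the underlying curve.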
Generalization in Machine Learning Details
For further detail, we refer to the following video on this subject: