Spark Machine Learning
Spark machine learning algorithms are implemented in MLlib, the machine learning library of Apache Spark, which is designed to handle Big Data. It is a scalable, parallel machine learning library with a number of implemented algorithms for classification, clustering, and regression. MLlib contains high-quality algorithms that leverage iteration, one of the key benefits of Apache Spark. The library interoperates with NumPy in Python and with R libraries. To use a machine learning algorithm, the data needs to reside in a Hadoop data source such as the Hadoop Distributed File System (HDFS), HBase, or local files. More information is available in our article Using the Apache Spark Tool.
A wide variety of algorithms is implemented in MLlib. Classification algorithms include logistic regression and naïve Bayes. Regression algorithms include generalized linear regression and survival regression. There are also tree-based approaches such as standard decision trees, random forests, and gradient-boosted trees. MLlib contains alternating least squares (ALS), which can be used to create recommendation engines. For clustering, MLlib offers K-means and Gaussian mixture models (GMM). For pattern mining, MLlib includes algorithms for frequent itemset mining, association rule mining, and sequential pattern mining. All MLlib implementations of these algorithms take advantage of the parallel and scalable Apache Spark architecture. More information about MLlib can be found on the official MLlib page.
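To make the pattern mining terms above concrete, the following is a minimal single-machine sketch in plain Python. It is not MLlib code: MLlib's distributed implementation uses different algorithms (for example FP-growth), whereas this sketch simply counts every item combination up to a fixed size. The toy transactions and the helper names `frequent_itemsets` and `confidence` are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy shopping baskets (illustrative data, not from the article).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets with support >= min_support.

    Support is the fraction of transactions that contain the itemset.
    Brute-force counting; only suitable for small examples.
    """
    counts = Counter()
    for basket in transactions:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(basket), size):
                counts[frozenset(combo)] += 1
    total = len(transactions)
    return {
        items: count / total
        for items, count in counts.items()
        if count / total >= min_support
    }


def confidence(supports, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent.

    Defined as support(antecedent and consequent together) / support(antecedent).
    """
    antecedent = frozenset(antecedent)
    return supports[antecedent | frozenset(consequent)] / supports[antecedent]


supports = frequent_itemsets(transactions, min_support=0.5)
rule_conf = confidence(supports, {"butter"}, {"bread"})
print(sorted((sorted(items), s) for items, s in supports.items()))
print("confidence(butter -> bread) =", rule_conf)
```

In this toy data every basket that contains butter also contains bread, so the rule butter -> bread has the highest possible confidence. A distributed library has to obtain the same counts without enumerating all combinations on one machine, which is where the parallel Spark architecture comes in.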
MLlib also provides selected tools that are often useful for machine learning, such as feature transformations including standardization, normalization, and hashing. Also important are the tools for model evaluation and hyper-parameter tuning. Distributed linear algebra is provided as well, including Singular Value Decomposition (SVD) and Principal Component Analysis (PCA). Some of the supported machine learning techniques also take advantage of statistical techniques such as summary statistics or hypothesis testing.
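The three feature transformations named above can be illustrated with a short plain-Python sketch. This is not the MLlib API: the function names `standardize`, `l2_normalize`, and `hash_features` are hypothetical, the choice of population variance and of CRC32 as the hash function are assumptions of this sketch, and the distributed MLlib transformers may use different conventions.

```python
import math
import zlib


def standardize(column):
    """Rescale a numeric column to zero mean and unit variance.

    Assumption: population variance (divide by n). A constant column
    is mapped to all zeros to avoid division by zero.
    """
    mean = sum(column) / len(column)
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
    if std == 0.0:
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]


def l2_normalize(vector):
    """Scale a single feature vector to unit Euclidean (L2) length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def hash_features(tokens, num_buckets=8):
    """Hashing trick: map tokens to a fixed-length vector of counts.

    Assumption: CRC32 as the hash function, chosen because it is
    deterministic across runs (Python's built-in hash() is salted).
    """
    counts = [0] * num_buckets
    for token in tokens:
        index = zlib.crc32(token.encode("utf-8")) % num_buckets
        counts[index] += 1
    return counts


scaled = standardize([2.0, 4.0, 6.0])
unit = l2_normalize([3.0, 4.0])
hashed = hash_features(["spark", "mllib", "spark"])
print(scaled, unit, hashed)
```

Standardization works per column across all rows, normalization works per row on a single vector, and hashing turns an open-ended vocabulary into a fixed number of features at the cost of possible collisions between tokens.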
Spark machine learning details
We refer to the following video for this topic: