FP Growth Frequent Patterns
FP Growth stands for Frequent Pattern Growth and is a very popular mining algorithm for big data initially published around 2000. It enables users to find frequent itemsets in transaction data. The algorithm starts to calculate item frequencies and identify the important frequent items in the data. Based on this the algorithm uses a suffix tree structure also known as the frequent pattern (FP) in order to encode transactions. This tree is particularly a difference to the well-known Apriori algorithm since it avoids to genereate candidate sets that are expensive to generate. instead the frequent itemsets can be simply extracted from the tree structure enabling great performance. A more general introduction of mining frequent itemsets can be found in our article Association rules.
FP Growth Implementations
There are a wide variety of serial implementations available for the algorithm. There are also parallel and scalable implementations of the algorithm available. The Apache Spark MLlib provides a parallel implementation that is open source and available here. This implementation is based on the distributed PFP approach outlined in one of the research articles below.
Research Articles
One key research article is named “Mining frequent patterns without candidate generation”. It can be found here and was published in 2000 in Proceedings of the 2000 ACM SIGMOD international conference on Management of data. It covers the basics of the frequent pattern mining algorithm and describes the novel idea around the frequent pattern tree structure. Performance studies show that the FP Growth algorithm is efficient and scalable for mining both long and short frequent patterns and is about an order of magnitude faster than the often used Apriori algorithm. It is usable with transactional databases or more generally transaction datasets. It can be referenced as follows:
Jiawei Han, Jian Pei, and Yiwen Yin. 2000. Mining frequent patterns without candidate generation. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data (SIGMOD '00). ACM, New York, NY, USA, 1-12. DOI=http://dx.doi.org/10.1145/342009.335372
Another important research article is named “Pfp: parallel fp-growth for query recommendation”. It can be found here and was published in 2008 in Proceedings of the 2008 ACM conference on recommender systems. It covers the idea of a parallel version of FP Growth named PFP. The article shows that PFP distributes the work of growing frequent pattern trees based on the suffices of Transactions. It is further shown that this parallel approach is much more scalable than a serial implementation for one machine. It is usable with transactional databases or more generally transaction datasets. It can be referenced as follows:
Haoyuan Li, Yi Wang, Dong Zhang, Ming Zhang, and Edward Y. Chang. 2008. Pfp: parallel fp-growth for query recommendation. In Proceedings of the 2008 ACM conference on Recommender systems (RecSys '08). ACM, New York, NY, USA, 107-114. DOI=http://dx.doi.org/10.1145/1454008.1454027
More about FP Growth
The following video provides more details: