Association Rules
Association rule mining is a methodology that is used to discover unknown relationships hidden in big data. Rules refer to a set of identified frequent itemsets that represent the uncovered relationships in the dataset. The underlying idea is to identify rules that will predict the occurence of one or more items based on the occurrences of other items in the dataset. It is an unsupervised machine learning method. This means that no direct guiding output data is given to find the patterns.
In many commercial environments large quantities of data is accumulated in databases from day-to-day operations. This lays the foundation for mining association rules. In retail, for example, customer purchase data are collected on a daily basis at the checkout counters of city stores or when shopping at online stores. The accumulated data items are often market basket transactions. Managers of stores are interested in analyzing the collected data in order to learn the purchasing behaviour of customers. This enables a large variety of business-related applications (see below) based on the identified rules in the data.
Below is one of the famous examples illustrating a rule based on a strong relationship between the sale of Diapers and the sale of Beer. In other words many customers who buy Diapers also buy Beer. Investigating the transactions below in order to find those frequent itemsets seems to be easy. But consider in real datasets millions or billions of transactions are searched across 100000 of different items that may identify 1000 of rules. This motivates the automation of the process using association rule mining algorithms. In retail these rules help to identify new opportunities and ways for cross-selling products to customers.
The example above illustrated the core idea of association rule mining based on frequent itemsets. It is a simple example since it essentially ignores other relevant attributes such as quantity of items sold, the Price, or even a specific brand of items. The mining process to discover unknown patterns in large transaction datasets is computationally expensive. A particular challenge is also the size of the millions or billions of the transaction dataset when performing mining within memory. In other words smart out-of-cpu strategies are needed to work on the big datasets as a whole in order to identify frequent itemsets. This is the case since sub-sampling of dataset items may increase the risk to overlook frequent itemset patterns.
Another challenge when performing association rule mining is that some uncovered patterns are simply not true since they have happen by chance. The latter therefore needs often manual post-processing or even application domain knowledge in order to find true rules out of the undiscovered patterns. Important is to find actionable insights from rules that can be used to change a product portfolio, store setup, or customer relations. Several configuration options are available for association rules (e.g. support, confidence) in order to provide a more clear set of rules.
Association Rules Applications
Understanding the customer purchasing behaviour by using association rule mining enables different applications. As shown above rules help to identify new opportunities and ways for cross-selling products to customers. It is used for personalised marketing promotions, smarter inventory management, product placement strategies in stores, and a better customer relationship management.
Algorithms
There are different algorithms used to identify frequent itemsets in order to perform association rule mining. The most known algorithm is the Apriori algorithm, but also the FP Growth algorithm is often used. Another related algorithm called Maximal Frequent Itemset Algorithm (MAFIA Algorithm) is also available. All algorithms have distinct advantage and disadvantages and need to be choosen given a specific data analysis problem.
Toolset
Apart from several serial implementations, parallel implementations to work with big data are not broadly available. One example of a serial implementation well known is the statistical computing with R package called ‘arules’. A parallel implementation of the FP-Groth implementation is available in the open source machine learning library (MLlib) of Apache Spark and Apache Mahout.
More information about Association Rules
Please have a look on the following short video in order to understand it better:
Follow us on Facebook: