Clickstream Data Mining With Markov Chain and cSPADE

In this post, we’ll show you how to mine clickstream data using two key algorithms: Markov Chain for determining state transition probabilities and cSPADE for discovering sequential patterns. Using these techniques you can leverage your clickstream data to generate insights that enable you to deliver a better customer experience.

Grouping your clickstream data into sessions

Before you can start mining your clickstream data, you need to split it into chunks of user actions, or sessions. For your applications, a session is the time between two consecutive application start events. For your websites, a session is the time from entry until logout or timeout.

With these definitions in mind, you’ll transform your raw event log data into clickstream data by:

Identifying the actions or events performed by an individual user and grouping them together.
Splitting these events into subgroups based on events performed during the same session.
Arranging the events of each session in an appropriate time order.

At this point, you’ll have a dataset that looks something like this:

In the above dataset, each row corresponds to a session. The first column contains the session names, while the rest of the columns denote the actions (represented by A8, A14, A9, etc.) performed by the user during that particular session.

Determining state transition probability with Markov Chains

The Markov process is a stochastic process satisfying the Markov property of memorylessness. It is a random process in which the future is independent of the past, given the present.

A Markov Chain is a Markov process that describes the sequence of possible events, in which the probability of each event is dependent on the state attained in the previous event. It can be graphically represented as a transition diagram along with the corresponding probabilities, as below:

For clickstream analysis - a process X(n) takes the state m(n) from a finite set m at a given time n.

The order of the Markov chain is derived from the number of states on which the current state of the event depends. For example, a zero-order chain implies that the probability of being in a state in the next step is completely independent of any and all previous states.

Note: Balance here is crucial. The concept of a higher-order Markov Chain leads to more realistic models, but the number of parameters required for its representation increases exponentially.

Fitting a Markov Chain

Recall the example dataset:

We will use a third-order Markov Chain for this dataset because:

The number of parameters required to represent the Markov Chain are manageable. A third order chain will deliver a valuable model without an unnecessarily high number of parameters.
At least half of the clickstreams should consist of as many clicks as the order of the Markov Chain you’re fitting. It makes no sense to select a 3rd order Markov Chain if the majority of the clickstream consists of only 2 states (with no state three steps behind to consider).

By fitting the Markov Chain model you get transition probabilities matrices, lambda parameters of the chain for each of the three lags, and start and end probabilities. In the clickstream context, start and end probabilities equal the probability that a clickstream will start or end with a specific event

A heat map is a useful tool for representing the transition probability matrix. The Y-axis represents the current state (From) and the X-axis represents the next state (To). You can see the utility in the heat map below, where the darker red represents a higher probability of transition.

For example, the transition from Action23 to Action1 is more probable than the transition from Action21 to Action1.

Predicting Clicks

Using Markov Chains for clickstream analysis allows you to develop a data driven understanding of user behavior. You can predict what a user's next action will be and know the likelihood of them taking the action at any point - all based upon their previous actions.

Further, by starting from the state with the maximum start probability and following the probable transitions, you can learn how your customers commonly use your application or navigate your site.

Discovering sequential patterns with cSPADE

Another approach to modeling clickstream data is to represent them as sequential patterns which you can mine to determine trends that meet a particular minimum support threshold.

The SPADE algorithm first calculates the frequency of sequences with only one item, then it calculates the frequency of sequences with two items, then three items, etc. Using the cSPADE function, you can extract all the pattern sequences with a given minimum support, creating a dataset you can use to understand and predict user navigation patterns.

In the example dataset, a minimum support of 40% generates the following 22 pattern sequences:

In the above table, for a given sequence pattern X, it is possible to predict the next click by searching for the pattern sequence with the highest support starting with X.

For example, after performing the action A14, the most probable next action is A11, according to the pattern sequence 8 – with a probability of 0.8045.

Note: lowering the minimum support will include less frequent patterns in your clickstream dataset - this can be useful in cases where you want to extract pattern sequences that lead to errors or failures in using our product or app.

Easily collecting clean clickstream datasets with RudderStack

Capturing clickstream data for analysis is hard, especially when you need to collect and combine data from multiple sites and applications. You can use a tool like RudderStack Event Stream to simplify the process. It makes real-time event data streaming easy and can pipe directly into the tools you're already using, like S3, Kinesis, Kafka, Redis and all of the major data warehouses.