Predicting Customer Purchase Intent in E-commerce using Clickstream Data

Data Mining Course Final Project @ University of Michigan

Introduction

Background
Traditionally, e-commerce companies model users' likelihood of converting using user demographic and transactional data. However, as demographic data are often only obtained from users with accounts or purchase history, and as new privacy rules restrict how companies collect data, access to user-related data may be limited. In these cases, behavioral data such as clickstream data can be helpful, as a user's browsing experience on a website can also provide insights into whether a customer is likely to make a purchase or not. This also allows companies to target campaigns and promotions in real-time before acquiring any user-related info.
Project Goal
In this project, we aim to use clickstream data to accurately identify customers' purchase intent early in a user session, so that campaigns can be targeted in real time.
A few key research questions that were explored throughout the project include:
1. At which point in a user's click path should we target them with promotions? In other words, what is the minimum clickstream sequence length required to make acceptable predictions?
(We want to make accurate predictions as early as possible in a user session so that we don't miss the opportunity to target the right promotions.)
2. Are sequential models or feature-based classification models better at predicting customer purchase intent?
3. What level of interaction (page-level, event-level, category-level) should we use for making predictions?
About the Dataset
The dataset used for this project is the Google Online Merchandise Store Sample Dataset provided by Google BigQuery. We queried data spanning from November 1, 2020 to December 31, 2020 by connecting Python to the BigQuery API and writing SQL. In total, we had 2.9 million rows of click log data with about 87,000 unique user sessions.
Project Scope & Workflow

Data Analysis

Exploratory Data Analysis
Data Insights

1. Imbalanced Dataset

The label distribution of sessions with and without a purchase reveals that the dataset is highly imbalanced: about 93.5% of sessions are non-purchase sessions and only 6.4% are purchase sessions. Therefore, it is necessary to balance the dataset before proceeding with model training.

2. Clickstream Lengths for Training

The clickstream length distribution is highly skewed. Therefore, it is necessary to remove outliers and to use the median rather than the mean when determining an appropriate clickstream length range for model training.
The distribution also showed that the clickstream lengths of non-purchase sessions are much shorter than sessions that resulted in purchases, with the median length falling around 25 clicks. This suggests that our truncated clickstreams used for prediction should not exceed 25 events.

3. Pages and Events

On average, non-purchasing user sessions tend to have a smaller number of unique pages visited compared to user sessions that result in successful purchases, regardless of the clickstream cutoff length. Therefore, the variety of unique pages visited may be correlated with purchase decisions.
When looking at the first 25 clicks of each user session, sessions without purchases typically consist of 2-3 unique pages, while sessions with purchases consist of up to 4-5 unique pages on average.

4. Highly Correlated Events

We plotted the transition probabilities between click interactions to get a sense of user navigation patterns across the website. The resulting heatmap helped us understand that click events such as "purchase" and "exit" may cause data leakage and should be removed from clickstreams, as their transition probabilities in both classes are noticeably different.

Data Preprocessing

1. Undersampling

Given the highly imbalanced dataset, we downsampled the Non-Purchase sessions to achieve a 5:4 ratio of Non-Purchase labels to Purchase labels, so that the models would not simply converge to predicting the majority class during training.
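The downsampling step can be sketched as below. This is a minimal illustration using only the standard library, not the project's actual code: the function name `undersample`, the 0/1 label encoding, and the fixed random seed are all assumptions made for the example.

```python
import random

def undersample(sessions, labels, ratio=(5, 4), seed=0):
    """Randomly drop non-purchase sessions (label 0) until the ratio of
    non-purchase to purchase (label 1) sessions is at most ratio[0]:ratio[1]."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Number of majority-class sessions to keep for the target ratio.
    n_neg = min(len(neg), len(pos) * ratio[0] // ratio[1])
    keep = sorted(pos + rng.sample(neg, n_neg))  # preserve original order
    return [sessions[i] for i in keep], [labels[i] for i in keep]
```

All purchase sessions are kept; only the majority class is sampled, which is what distinguishes undersampling from reweighting or oversampling.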

2. Removing Outliers

Sessions whose clickstreams are either too short or too long are removed as outliers. In particular, clickstreams shorter than 2 clicks are excluded because they do not provide enough information for training.

3. Feature Engineering

Sessionizing: Click logs at different timestamps are organized into event sequences and grouped by sessions to prepare for modeling sequential data. To accomplish this sessionizing process, a preprocessing function from the Python library Markov Click was used.
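To make the sessionizing idea concrete, here is a plain-Python stand-in. It is not the Markov Click function used in the project; the `(session_id, timestamp, event)` row layout and the name `sessionize` are assumptions for illustration only.

```python
from collections import defaultdict

def sessionize(click_logs):
    """Group (session_id, timestamp, event) rows into one event sequence
    per session, ordered by timestamp."""
    by_session = defaultdict(list)
    for session_id, timestamp, event in click_logs:
        by_session[session_id].append((timestamp, event))
    return {
        session_id: [event for _, event in sorted(rows, key=lambda r: r[0])]
        for session_id, rows in by_session.items()
    }
```

Sorting on the timestamp alone keeps the original row order for clicks that share a timestamp, since Python's sort is stable.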

Truncating Clickstreams at Different Cutoff Lengths: We experimented with models trained on different cutoff lengths ranging from 2 to 25 clicks. Therefore, clickstreams are truncated at each cutoff length prior to training the corresponding model.
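A rough sketch of the truncation step follows. The helper names are hypothetical, and the choice to keep sessions shorter than the cutoff unchanged (rather than dropping them) is an assumption that the text above does not specify.

```python
def truncate_sessions(sequences, cutoff):
    """Keep only the first `cutoff` clicks of every sequence.
    Sequences shorter than the cutoff are returned unchanged (assumption)."""
    return [seq[:cutoff] for seq in sequences]

def build_cutoff_datasets(sequences, cutoffs=range(2, 26)):
    """Build one truncated copy of the data per cutoff length (2 to 25 clicks)."""
    return {cutoff: truncate_sessions(sequences, cutoff) for cutoff in cutoffs}
```

Each entry of the returned dictionary can then be used to train and evaluate a separate model for that cutoff length.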

4. Preventing Data Leakage

Some events/pages within click paths such as “Checkout Confirmation” or “Purchase” are strongly related to purchasing outcomes. Therefore, in order to prevent these events from “leaking” our labels, we excluded them to ensure that our models predict purchase intent based only on normal click paths.
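The exclusion can be expressed as a simple filter. In this sketch the set `LEAKY_EVENTS` contains only the two examples named above; the full list used in the project is not given here, so treat the set as illustrative.

```python
# Only the two events named in the text; the project's full list may differ.
LEAKY_EVENTS = {"Checkout Confirmation", "Purchase"}

def drop_leaky_events(sequence, leaky=LEAKY_EVENTS):
    """Remove events that would reveal the purchase label."""
    return [event for event in sequence if event not in leaky]
```

Applying this filter before truncating (an ordering assumed here) means a cutoff of N clicks always counts N retained events.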

Model Development

Model Selection
This project explored three different methods for predicting purchase intent. Two of them are sequence-based methods trained on click sequences, while the third is a feature-based method trained on click frequencies.

1. Sequence Discrimination with Markov Chains
In this project, we implemented a sequence discrimination measure proposed by Durbin, R. and Miguéis, V. L. that can be used in the context of predicting customer actions. The approach assumes that the sequences in each class (e.g. purchase and non-purchase) are generated by a class-specific Markov process, so that we can calculate the likelihood of an observed sequence stemming from either class.
To calculate the likelihood of a sequence, we first create Markov transition probability matrices for each class:

The log odds of transitioning from one state to the other can thus be calculated based on the transition matrices using:

We then calculate the odds of an observed sequence originating from either the purchase class or the non-purchase class based on the transition probabilities of the two classes:

In this case, a positive log odds ratio means that the provided sequence is more likely to originate from the buyers class, while a negative value means the opposite.
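The scoring described above can be sketched in a few lines of Python. The score of a sequence is the sum, over consecutive click pairs, of the log of the buyer-class transition probability divided by the non-buyer-class transition probability. The function names are hypothetical, and the add-alpha smoothing is an assumption added here so that unseen transitions do not produce a division by zero or log of zero; the project may have handled this differently.

```python
import math

def transition_probs(sequences, states, alpha=1.0):
    """Estimate a first-order Markov transition matrix from sequences.
    Every count starts at `alpha` (add-alpha smoothing, an assumption)."""
    counts = {s: {t: alpha for t in states} for s in states}
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    probs = {}
    for s, row in counts.items():
        total = sum(row.values())
        probs[s] = {t: c / total for t, c in row.items()}  # each row sums to 1
    return probs

def log_odds(sequence, p_buy, p_non):
    """Sum of per-transition log odds; positive favours the buyer class."""
    return sum(
        math.log(p_buy[a][b] / p_non[a][b])
        for a, b in zip(sequence, sequence[1:])
    )
```

A session would be labelled as a likely purchase when `log_odds` exceeds a threshold, with 0 as the natural default.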

2. Sequence Classification with LSTM
A common method used in the field of NLP to classify text sequences is the LSTM network, a type of RNN that can learn longer-range patterns by selectively remembering and forgetting information. In this project, we treated each clickstream sequence as a sentence in order to feed our data into the LSTM. Since the click paths in our dataset have different lengths, each sequence is transformed into a “padded sequence” before being passed in as a training input.
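As an illustration of the padding step, the sketch below maps each event name to an integer id and pads every sequence to a fixed length. The helper name, the use of 0 as the padding id, and padding on the right are all assumptions; deep learning libraries offer their own padding utilities, and the project may have padded on the left instead.

```python
def encode_and_pad(sequences, max_len, pad_id=0):
    """Convert event sequences to fixed-length lists of integer ids.
    Ids start at 1 so that `pad_id` 0 is reserved for padding."""
    vocab = {}
    padded = []
    for seq in sequences:
        # setdefault assigns the next free id the first time an event is seen.
        ids = [vocab.setdefault(event, len(vocab) + 1) for event in seq[:max_len]]
        padded.append(ids + [pad_id] * (max_len - len(ids)))
    return padded, vocab
```

The resulting rectangular array of ids is the kind of input an embedding layer followed by an LSTM layer typically expects.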

3. Feature-based Standard Classification
The non-sequential features used to train the standard classifiers are the frequencies of the clicks that occur in a sequence. Similar to storing term frequencies using Bag of Words in NLP, we transform our data into feature vectors in which each event is a single feature. Standard classifiers such as Random Forest and SVM were then trained on these features.
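The click-frequency features can be built as in the following sketch, which mirrors a Bag of Words count vector. The function name is hypothetical and the vocabulary is simply the sorted set of events seen in the input, which is an assumption about column ordering.

```python
from collections import Counter

def click_frequency_vectors(sequences):
    """Turn each event sequence into a vector of per-event click counts."""
    vocabulary = sorted({event for seq in sequences for event in seq})
    vectors = []
    for seq in sequences:
        counts = Counter(seq)  # missing events count as 0
        vectors.append([counts[event] for event in vocabulary])
    return vectors, vocabulary
```

The returned vectors discard click order entirely, which is why these models are described as non-sequential; they can be passed directly to a classifier such as a Random Forest or SVM.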

Model Interpretation

1. Experimentation & Evaluation
Aside from comparing the three methods proposed above, we also experimented with different training data combinations for each algorithm:
• Clickstream input data with different session length cutoffs, ranging from 2 to 25 clicks
• Different types of interaction levels – click paths on the page level, event level, and category level.

Overall, sequence classification using LSTM yielded the best performance at an early stage (5 to 15 clicks) of a user session. After around 15 clicks, other algorithms such as Markov Chain and Random Forest start to outperform LSTM. However, since our goal is to predict purchase intent as early in a user's session as possible, the LSTM model is the best choice given its accuracy and stability at an early prediction stage.

2. Feature Importance
While the sequential models overall performed better in terms of early stage prediction, they have limited interpretability compared to traditional feature-based classification models. The Random Forest classifier enables us to examine feature importances and identify the pages/events that have a strong impact on classification results. According to these importances, the "Checkout Your Information," "Home," and "Men's/Unisex | Apparel" pages were the top 3 pages that influenced our feature-based models, while "Begin Checkout", "View Promotion", and "View Item" were the top 3 events with the most impact.

Conclusion

1. At which point in a user's click path should we target them with promotions? In other words, what is the minimum clickstream sequence length required to make acceptable predictions?

Based on our best models (LSTM), we found that both the F1 score and AUC start to increase less sharply at a sequence length of around 6 and stabilize at a sequence length of around 10, with an F1 score of up to 0.73 and an AUC of up to 0.80. This suggests that a clickstream sequence of at least 6 click logs is required for accurate predictions, with 10 or more providing even greater stability.

2. Are sequential models or feature-based classification models better at predicting customer purchase intent?

Sequential Models
Pros: Yielded better performances than the feature-based models at earlier stages of user sessions, which best meets our project goal.
Cons: Have limited interpretability compared to the traditional feature-based classification models.
Feature-based Classification Models
Pros: More interpretable compared to the sequential models. This may be crucial if a company is looking to optimize their website and enhance user experience. Understanding the importance of specific pages can also help marketing teams to develop targeted campaigns that are more likely to resonate with potential customers.
Cons: Could only outperform the sequential models when clickstreams were long enough (>15 clicks).

Despite the interpretability advantage of feature-based models, the sequential models' superior performance in detecting early stage user intent makes them a more suitable choice for our project goal. However, we acknowledge the importance of interpretability in the context of e-commerce clickstreams.

3. How does the level of interaction (page-level, event-level, category-level) affect prediction performances?

We used ANOVA to test whether the performance of models trained on different levels of interaction differed significantly from one another. The test showed that performance differed significantly across the three types of click interaction. Looking at individual models, the LSTM models trained on page-level interactions yielded the best performance out of all the models. However, on average across models, those trained on event-level interactions resulted in the highest scores.
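For readers unfamiliar with the test, the one-way ANOVA F statistic compares the variance between groups to the variance within groups. The sketch below computes it from scratch; the function name is hypothetical, and each group would hold the scores of the models trained on one interaction level. It does not compute a p-value, for which a statistics library such as SciPy (`scipy.stats.f_oneway`) can be used.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of score groups."""
    k = len(groups)                         # number of groups
    n = sum(len(g) for g in groups)         # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A large F value relative to the F distribution with (k - 1, n - k) degrees of freedom indicates that at least one group mean differs from the others.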