Identifying Incentivized Sephora Product Reviews
with Text Classification

NLP Course Final Project @ University of Michigan


As consumers increasingly rely on online product reviews to make purchasing decisions, companies' use of incentivized reviews to boost brand exposure has become a growing concern. The practice is particularly prevalent in the beauty industry, where platforms such as Influenster and PINCHme offer free beauty product samples in exchange for positive reviews on websites such as Sephora. To address this issue, this project uses text classification to identify incentivized product reviews on the Sephora website. We compare several machine learning models, including traditional linear models such as Logistic Regression and SVM, and deep learning models such as LSTM and BERT.

Data Collection

Web Scraping
The Sephora website has recently introduced new features that mark reviews as “Incentivized” or as “Verified Purchase” based on certain purchasing criteria. This lets us use the tags as labels for classifying reviews. For this project, we used BeautifulSoup to scrape the Sephora website and collected 184,638 reviews from over 1,000 skincare products.

To ensure accuracy, only reviews tagged as both "Incentivized" and "Non-Verified Purchase" were labelled as incentivized reviews. Reviews tagged as "Verified Purchase" without the "Incentivized" tag were labelled as non-incentivized reviews. The final training dataset therefore consisted of 66,032 (79.5 %) incentivized reviews and 16,928 (20.5 %) non-incentivized reviews.
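The labelling rule can be written as a small function. This is a minimal sketch, not the project's actual code: the function name, the boolean inputs, and the 1/0 label encoding are hypothetical, and it assumes that "Non-Verified Purchase" means the review does not carry the "Verified Purchase" tag.

```python
from typing import Optional

INCENTIVIZED = 1        # assumed label encoding
NON_INCENTIVIZED = 0


def label_review(has_incentivized_tag: bool, has_verified_tag: bool) -> Optional[int]:
    """Map the two Sephora tags to a training label, or None if excluded."""
    if has_incentivized_tag and not has_verified_tag:
        return INCENTIVIZED
    if has_verified_tag and not has_incentivized_tag:
        return NON_INCENTIVIZED
    # Untagged reviews (and reviews carrying both tags) are kept out of training.
    return None
```

Reviews that map to None are the ones left for prediction once a model has been trained.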

As the majority of reviews on Sephora are left untagged, we also make predictions on the untagged data towards the end of the project in order to identify reviews that might not have been properly tagged by Sephora.

Text Preprocessing

To prepare the data for feature extraction, we first tokenized every review. We then cleaned the tokenized documents by removing all stopwords and punctuation. Finally, the remaining tokens were lemmatized to reduce each word to its base (dictionary) form.
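The three steps can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the pipeline actually used: the stopword list is a tiny made-up sample, the regular-expression tokenizer is a stand-in for a proper one, and the lemmatizer is passed in as a function (for example, a wrapper around a library lemmatizer) because the source does not name the tools used.

```python
import re
from typing import Callable, FrozenSet, List

# Tiny illustrative stopword list; a real run would use a full list.
STOPWORDS: FrozenSet[str] = frozenset(
    {"a", "an", "and", "are", "i", "is", "it", "the", "these", "this", "to"}
)


def preprocess(
    text: str,
    stopwords: FrozenSet[str] = STOPWORDS,
    lemmatize: Callable[[str], str] = lambda token: token,  # identity placeholder
) -> List[str]:
    # 1. Tokenize: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    # 2. Clean: drop punctuation tokens and stopwords.
    kept = [t for t in tokens if t.isalnum() and t not in stopwords]
    # 3. Lemmatize whatever remains.
    return [lemmatize(t) for t in kept]


# Hypothetical lookup-table lemmatizer, for demonstration only.
toy_lemmas = {"creams": "cream"}
print(preprocess("These creams are amazing!", lemmatize=lambda t: toy_lemmas.get(t, t)))
```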

NLP Models

For this project, we experimented with both statistical linear methods and deep learning methods. Each method follows the same general framework: feature extraction, model building, and classification.

Statistical Linear Models
This section explores Logistic Regression and SVM, two fundamental linear classifiers that are commonly used for text classification. To improve their performance, we experimented with two feature sets: TF-IDF over unigrams only, and TF-IDF over both unigrams and bigrams.
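The four resulting configurations (two feature sets times two classifiers) could be set up as below. This sketch assumes scikit-learn, which the write-up does not name, and uses default hyperparameters apart from a higher iteration cap; the toy reviews are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_pipelines():
    """Return one TF-IDF + classifier pipeline per configuration."""
    ngram_ranges = {"unigram": (1, 1), "unigram+bigram": (1, 2)}
    classifiers = {
        "logistic_regression": lambda: LogisticRegression(max_iter=1000),
        "svm": lambda: LinearSVC(),
    }
    return {
        (feat_name, clf_name): make_pipeline(
            TfidfVectorizer(ngram_range=ngram_range), make_clf()
        )
        for feat_name, ngram_range in ngram_ranges.items()
        for clf_name, make_clf in classifiers.items()
    }


# Toy, made-up reviews (1 = incentivized, 0 = non-incentivized).
texts = [
    "received this product free in exchange for my honest review",
    "got a complimentary sample and love it",
    "bought this twice and it broke me out",
    "purchased with my own money and it is fine",
]
labels = [1, 1, 0, 0]

pipelines = build_pipelines()
for pipeline in pipelines.values():
    pipeline.fit(texts, labels)
```

Keeping the vectorizer inside the pipeline means the TF-IDF vocabulary is learned only from the data passed to fit, which avoids leaking test-set vocabulary into training.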

Bidirectional LSTM (Bi-LSTM)
Unlike the statistical linear models, an LSTM can capture word order and context within a document. We built a bidirectional LSTM because it reads each sequence in both directions, which also doubles the size of the representation passed to the classifier. Our model, implemented in PyTorch, consisted of four layers. The first is a word embedding layer whose weight matrix has shape vocabulary size by hidden dimension (100). The embedded, padded sequences are then packed and passed to the second layer, the LSTM layer. Because the LSTM is bidirectional, the last hidden states of the two directions are concatenated at this layer. The concatenated hidden state is then passed to a linear layer to produce dense outputs, which in turn go through a sigmoid activation layer to output the final probabilities.
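A minimal PyTorch sketch of this four-layer architecture is shown below. It is a reconstruction from the description, not the project's code: the class name, the single LSTM layer, the padding index of 0, and the use of 100 for the LSTM hidden size as well as the embedding size are all assumptions.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence


class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size: int, hidden_dim: int = 100, pad_idx: int = 0):
        super().__init__()
        # Layer 1: embedding matrix of shape (vocab_size, hidden_dim).
        self.embedding = nn.Embedding(vocab_size, hidden_dim, padding_idx=pad_idx)
        # Layer 2: bidirectional LSTM over the packed sequences.
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Layer 3: linear layer over the concatenated forward/backward states.
        self.fc = nn.Linear(2 * hidden_dim, 1)

    def forward(self, token_ids: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(token_ids)
        packed = pack_padded_sequence(
            embedded, lengths.cpu(), batch_first=True, enforce_sorted=False
        )
        _, (h_n, _) = self.lstm(packed)
        # h_n has shape (2, batch, hidden_dim): index 0 is forward, 1 is backward.
        final_state = torch.cat([h_n[0], h_n[1]], dim=1)
        # Layer 4: sigmoid turns the single logit into a probability.
        return torch.sigmoid(self.fc(final_state)).squeeze(1)


model = BiLSTMClassifier(vocab_size=50)
batch = torch.tensor([[5, 8, 2, 0], [7, 3, 0, 0]])  # 0 is the padding id
lengths = torch.tensor([3, 2])                       # true lengths before padding
probs = model(batch, lengths)                        # one probability per review
```

Packing the padded batch lets the LSTM stop at each review's true length, so the "last hidden state" is not polluted by padding tokens.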

Pre-trained BERT
Although LSTMs are good at dealing with the problem of vanishing gradients, BERT is well known for capturing meaning within a sentence more effectively, thanks to its Masked Language Model approach. To build the BERT model, we used the pre-trained MiniLM model published by Microsoft on Hugging Face. We extended the Hugging Face Trainer class to create a WeightedLossTrainer, which compensates for our imbalanced dataset by computing the loss with class weights. The transformed datasets are then passed into the trainer for training and evaluation.
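One common way to write such a trainer is to override compute_loss so that cross-entropy is weighted per class. The sketch below follows that pattern; it is not the project's implementation. The inverse-frequency weighting formula, the helper name, and the label order (0 = non-incentivized, 1 = incentivized) are assumptions, since the write-up does not say how the weights were chosen.

```python
import torch
import torch.nn.functional as F
from transformers import Trainer


def inverse_frequency_weights(class_counts):
    """Weight each class by total / (num_classes * count); an assumed heuristic."""
    total = sum(class_counts)
    n_classes = len(class_counts)
    return torch.tensor([total / (n_classes * c) for c in class_counts], dtype=torch.float)


class WeightedLossTrainer(Trainer):
    def __init__(self, *args, class_weights=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights = class_weights

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        inputs = dict(inputs)            # avoid mutating the caller's batch
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
        # Errors on the rarer class cost more than errors on the common one.
        loss = F.cross_entropy(logits, labels, weight=self.class_weights.to(logits.device))
        return (loss, outputs) if return_outputs else loss


# Class counts from the dataset description: [non-incentivized, incentivized].
weights = inverse_frequency_weights([16928, 66032])
```

With these counts the non-incentivized class receives a weight of roughly 2.45 and the incentivized class roughly 0.63, so the minority class is not drowned out during training.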


Baseline Models
During the exploratory data analysis stage, we found that incentivized reviews give higher product ratings on average. One baseline model for this project is therefore a classifier based on a rating threshold: it classifies a review as incentivized whenever its rating is higher than the threshold (4 stars), and as non-incentivized otherwise. The other baseline is a Naive Bayes classifier, one of the simplest probabilistic classifiers in machine learning.
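The rating-threshold baseline amounts to a one-line rule. The sketch below reads "higher than the threshold" as a strict comparison, so only 5-star reviews are flagged; that reading, the function name, and the 1/0 label encoding are assumptions.

```python
from typing import Iterable, List


def rating_baseline(ratings: Iterable[int], threshold: int = 4) -> List[int]:
    """Predict 1 (incentivized) when the star rating exceeds the threshold."""
    return [1 if rating > threshold else 0 for rating in ratings]


print(rating_baseline([5, 4, 3, 1]))  # [1, 0, 0, 0]
```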


Overall, the BERT MiniLM model obtained the highest F1 score and accuracy, 0.94 and 0.90 respectively, after training for 10 epochs (43 minutes). The LSTM model came second, with an F1 score and accuracy both around 0.86 after training for 12 epochs (50 minutes). The scores for the linear models, on the other hand, mostly fell between 0.71 and 0.86, below the neural network models. The ROC curves also show that BERT MiniLM yielded the highest Area Under the Curve (AUC), while the Naive Bayes model obtained the lowest.
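For reference, both metrics can be computed from the confusion counts as sketched below. The function is a generic illustration that treats the incentivized class as positive (an assumption); the toy labels are made up. It also shows why F1 can exceed accuracy when the positive class dominates, as it does in this dataset.

```python
from typing import Sequence, Tuple


def accuracy_and_f1(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> Tuple[float, float]:
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# Toy example: 4 positives, 1 negative, and a model that always predicts positive.
acc, f1 = accuracy_and_f1([1, 1, 1, 1, 0], [1, 1, 1, 1, 1])
print(round(acc, 3), round(f1, 3))  # 0.8 0.889
```

Because a classifier that labels everything as incentivized already scores well on both metrics under this class balance, the reported numbers are best read against that trivial reference point.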


1. The evaluation results indicate that the deep neural network models consistently outperformed the linear models in terms of both F1 and accuracy scores. Notably, the BERT MiniLM model demonstrated the highest performance among all the models tested, suggesting that it is a highly effective approach for identifying incentivized reviews based solely on review texts.

2. Surprisingly, all bigram-based linear models scored much lower than their unigram-based counterparts when using the same classifiers. Upon reviewing our text preprocessing steps, we identified a possible explanation: we did not remove rare words. This step is particularly important when working with bigrams, which generate many more infrequent word combinations, and its omission may have contributed to the lower performance of the bigram-based models.

3. In addition to evaluating our models on the validation and test datasets, we used our best-performing model (BERT MiniLM) to classify reviews that were neither tagged “Incentivized” nor marked as “Verified Purchase” by Sephora. The model predicted that 43.9% of the untagged reviews were incentivized. This suggests that incentivized reviews are still prevalent and underscores the importance of accurately identifying and tagging them. Doing so could help restore users' trust in the authenticity of online beauty product reviews.
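One way to test the explanation offered in point 2 would be to drop rare n-grams at feature-extraction time. The sketch below assumes scikit-learn's TfidfVectorizer and uses its min_df parameter with an arbitrary cutoff of 2 documents; the toy documents are invented, and whether this would actually recover the bigram models' scores is untested here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "gentle cleanser works well",
    "gentle cleanser smells nice",
    "received free sample",
]

# Keep every unigram and bigram, however rare.
all_terms = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)
# Keep only terms that appear in at least 2 documents.
frequent_terms = TfidfVectorizer(ngram_range=(1, 2), min_df=2).fit(docs)

print(len(all_terms.vocabulary_), "terms without filtering")
print(sorted(frequent_terms.vocabulary_))  # ['cleanser', 'gentle', 'gentle cleanser']
```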