Customer Segmentation with Starbucks Dataset

Andrew Kalil Ortiz
15 min read · Jun 26, 2021

Project’s domain background

Recent advancements in technology are forcing the world to adapt to change extremely fast. It is thanks to technology that I am able to write this proposal and email it to a mentor for review. With that said, it is no secret that large enterprises and multi-million-dollar companies use this amazing invention called “the internet” to reach a larger audience and grow their business.

Starbucks is one of the many businesses that take advantage of the internet to win new customers every day. The following are some of the ways that Starbucks uses technology to reach new customers.

  1. Free Wi-Fi: What better way to attract buyers and tempt them to buy your product than giving them free access to Wi-Fi at almost any time of day?
  2. Mobile or online purchase: Studies show that 11% of Starbucks sales come from mobile purchases.
  3. Discounts and offers: Like any other store, Starbucks sends its customers offers that can be both tempting and rewarding, which helps create loyal customers and increase sales.

Starbucks was thus one of the first companies to take advantage of technology as a marketing strategy. “[…] In many technology-related industries, competition is intense and can often be the reason why a startup is not able to be profitable. However, competition cuts across the board and can contribute to potential business failure, regardless of the sector.”[1] In that environment, only about half of small businesses survive beyond five years, and only a third of them make it to 10 years.

With that being said, marketing is crucial for understanding how to meet customers’ needs, keeping in mind that customers vary in characteristics like flavor preference, behavior, traditions, and economic situation. This is where artificial intelligence and machine learning come in. With recent developments in the field, machine learning can use marketing data to detect behavioral patterns in customers and to segment groups of people based on demographic information.

For this reason, this project will study the behaviors and reactions of different Starbucks customers. We will determine which customers are actually influenced by offers and which are not, which also tells us which customers can be considered “loyal”. Finally, the end goal is to predict how a demographic group will react to an offer.

Problem statement:

This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks.

Not all users receive the same offer, and that is the challenge to solve with this data set.

Keep in mind as well that someone using the app might make a purchase through the app without having received an offer or seen an offer.

The input for the experiment will be the Starbucks datasets, which are thoroughly described in the next section. They contain important information such as each user’s gender, income, age, the offers they received, the offers they completed, and so on.

The expected output of the experiment is a determination of whether or not a customer will respond to an offer, based on their tendencies and behaviors. For this, a thorough study of the data is required.

The machine learning technique at the core of this project is K-means clustering and its variations. The idea is to try different models and algorithms and see which one performs the task best.

Datasets and inputs:

The datasets used for this project are the following:

portfolio.json

  • id (string) — offer id
  • offer_type (string) — type of offer, i.e. BOGO, discount, informational
  • difficulty (int) — minimum required spend to complete an offer
  • reward (int) — reward given for completing an offer
  • duration (int) — time for the offer to be open, in days
  • channels (list of strings) — channels the offer is sent through (e.g. web, email, mobile, social)

profile.json

  • age (int) — age of the customer
  • became_member_on (int) — date when the customer created an app account
  • gender (str) — gender of the customer (note some entries contain ‘O’ for other rather than M or F)
  • id (str) — customer id
  • income (float) — customer’s income

transcript.json

  • event (str) — record description (i.e. transaction, offer received, offer viewed, etc.)
  • person (str) — customer id
  • time (int) — time in hours since the start of the test; the data begins at time t=0
  • value (dict of strings) — either an offer id or a transaction amount, depending on the record

Proposed Solution

The proposed solution to the problem is to determine the most appropriate unsupervised machine learning model by testing several of them and seeing which one produces the best score. The intention is to try K-means, MiniBatch K-means, Hierarchical Clustering, DBSCAN, and other related models.

Once this is done, the idea is to write a report or blog post explaining, step by step, the procedures that led to the results. The intention is to complete all of the tasks proposed in this document, and to make the post understandable even to someone with little knowledge of the topic.

Benchmark model

The benchmark for this project is a very similar project with the same task but a slightly different approach to mine. I will be comparing my project against this one from a previous Udacity student:

https://github.com/dkhundley/starbucks-ml-capstone/blob/master/Hundley-Starbucks-Project.ipynb

However, as previously mentioned, I will use a slightly different approach and compare my results against the benchmark. The candidate algorithms that will be compared against the benchmark model are:

  1. K-means
  2. MiniBatch K-Means
  3. Hierarchical Clustering
  4. Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
  5. Gaussian Mixture Modelling (GMM)
  6. MeanShift

Evaluation metrics

The planned metrics for this project are:

  1. Elbow method: Determines the optimal number of k-means clusters by plotting the value of the cost function produced by different values of k.
  2. Silhouette value: Measures similarity between a point and its own cluster (cohesion) compared to other clusters (separation).
  3. Davies Bouldin metric: Average similarity measure of each cluster to its most similar cluster. In this case similarity is the ratio of distances within the cluster to distances between clusters. The minimum score is zero, and lower values indicate better clustering.

Project design

The project plan for the problem is as follows:

  1. Establish a workspace in a Jupyter / Google Colab environment
  2. Download the provided data into my Jupyter / Google Colab notebook
  3. Perform initial data cleaning
  4. Perform exploratory analysis on the data
  5. Clean up the data as needed for modeling
  6. Experiment to determine the most appropriate unsupervised learning model for the data, whether that be k-means clustering, DBSCAN, or some other model
  7. Leverage our benchmark model and evaluation metric(s) as a sanity check
  8. Summarize our findings in a blog post

Into the fun part

So, first things first: we start exploring each of the given datasets and quickly making calculations to explain some of the numbers in the data. The first step is to find out what percentage of the offers received were viewed and what percentage were completed.

We can see that 75.68% of the offers received were viewed and 44.02% were completed.
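As a rough sketch, both ratios can be computed straight from transcript’s event column; the read call below assumes the line-delimited JSON format of the raw Udacity files, so the path and format are illustrative.

```python
import pandas as pd

# Load the transcript data (path and format are illustrative)
transcript = pd.read_json("transcript.json", orient="records", lines=True)

# Count each event type
event_counts = transcript["event"].value_counts()
received = event_counts["offer received"]
viewed = event_counts["offer viewed"]
completed = event_counts["offer completed"]

print(f"Viewed / received:    {viewed / received:.2%}")
print(f"Completed / received: {completed / received:.2%}")
```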

Prepare the data for merging

We prepare the data for merging by renaming some columns and extracting data from the object-like values in the data frames. This looks something like this:

For the profile dataset, we change the format of the membership start date and add new columns to better describe what we have (“days_as_member”, “age_range”).
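A sketch of that step on profile; the derived column names (days_as_member, age_range) are just the ones used in this write-up, and the exact notebook code may differ.

```python
import pandas as pd

profile = pd.read_json("profile.json", orient="records", lines=True)

# Parse the membership date and derive how long each customer has been a member
profile["became_member_on"] = pd.to_datetime(
    profile["became_member_on"].astype(str), format="%Y%m%d"
)
profile["days_as_member"] = (pd.Timestamp.today() - profile["became_member_on"]).dt.days

# Bucket ages into decades to make visualization easier
profile["age_range"] = (profile["age"] // 10) * 10
```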

We then extract the data from transcript by separating the records into four different groups: offer_completed (the customer completed the offer), offer_received (the customer received the offer), offer_viewed (the customer viewed the offer), and transaction (the customer made a purchase). And that looks like this:

We also split the value column to indicate which entries are transaction amounts and which are offer ids. This also shows that the same offers and users can appear multiple times in this dataset.
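A sketch of that extraction; note that the raw data stores the offer id under slightly different keys (“offer id” vs “offer_id”) depending on the event type, so the code checks both.

```python
import pandas as pd

# Continues from the transcript frame loaded earlier.
# Pull offer ids and transaction amounts out of the dict-valued "value" column
transcript["offer_id"] = transcript["value"].apply(
    lambda v: v.get("offer id", v.get("offer_id"))
)
transcript["amount"] = transcript["value"].apply(lambda v: v.get("amount"))

# One-hot encode the event type so each of the four groups gets its own flag column
events = pd.get_dummies(transcript["event"].str.replace(" ", "_"))
transcript = pd.concat([transcript, events], axis=1)
```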

We then separate this dataset into two: one with offers and the other with transactions. I will show later why this was done. This is the dataset for offers:

And this is the one for transactions:

Extracting values from portfolio

We also need to extract values from the portfolio dataset, which has columns like channels that hold lists of values. The result looks like this:
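One way to do this is with a multi-label binarizer, which turns each channels list into one binary column per channel. A sketch, again assuming the raw line-delimited file:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

portfolio = pd.read_json("portfolio.json", orient="records", lines=True)

# Expand the list-valued "channels" column into one 0/1 column per channel
mlb = MultiLabelBinarizer()
channel_cols = pd.DataFrame(
    mlb.fit_transform(portfolio["channels"]),
    columns=mlb.classes_,
    index=portfolio.index,
)
portfolio = pd.concat([portfolio.drop(columns="channels"), channel_cols], axis=1)
```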

Merging

Now we go ahead and merge all of the data into one frame by iterating through profile and adding useful information about each customer to their entry. If you want to see the whole process behind this merge, you can check out the code on my GitHub, linked at the end of this article. The result looks like this:
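The notebook builds this table customer by customer; a simplified sketch of the same idea using pandas group-bys and joins (the aggregate column names such as total_spent and avg_spent are mine) might look like this:

```python
# Split from the previous step: offer events vs. pure purchase transactions
offers_df = transcript[transcript["event"] != "transaction"]
transactions_df = transcript[transcript["event"] == "transaction"]

# Per-customer offer counts and spending statistics
offer_stats = (
    offers_df.groupby("person")[["offer_received", "offer_viewed", "offer_completed"]]
    .sum()
    .reset_index()
)
spend_stats = (
    transactions_df.groupby("person")["amount"]
    .agg(total_spent="sum", avg_spent="mean", transactions="count")
    .reset_index()
)

# Attach the aggregates to each customer profile
merged = (
    profile.merge(offer_stats, left_on="id", right_on="person", how="left")
           .merge(spend_stats, on="person", how="left")
)
```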

Data exploration

Now we will explore some characteristics of the previously generated data, determine which columns are useful, and decide how we can categorize the columns to begin the process of customer segmentation. Categorizing the data makes the clustering algorithms’ job easier.

Getting rid of null values

Null values are really of no use to us. All they do is distort the data and cause problems further down the line. So we go ahead and eliminate any row or entry that has missing or null values.

333 null values were found in the “avg_spent” category. So we dropped those entries.
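In pandas this is only a couple of lines; avg_spent refers to the aggregated spending column from the merge sketch above.

```python
# Check for missing values, then drop the affected rows
print(merged["avg_spent"].isna().sum())  # the notebook found 333 missing values here
merged = merged.dropna().reset_index(drop=True)
```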

Visualizing some categories

We will go ahead and visualize some of the key features of this dataset and draw conclusions based on obtained results.

Visualize Gender:

This shows that there are more men than women in the customer base: 57.2% are men, 41.4% are women, and 1.4% fall into the other category.

Visualize incomes:

Here we can notice that women in this dataset have higher incomes than men do.

Visualize Total spent by gender:

Here we can see that women spend more at Starbucks than any other gender, while men spend the least in this dataset.

Visualize age groups by decades:

Visualize total offers completed:

This graphic shows the percentage of people that completed a number of offers out of the ones that were sent to them.

Visualize total offers viewed:

Visualize groups of people that completed a percentage range of discount offers:

Visualize groups of people that completed a percentage range of bogo offers:

Visualize ranges of transactions done by customers:

Data Scaling

The next step is to preprocess and scale the data: categorical values are mapped to numeric keys, and the other values are scaled to lie below 1. This is the result:
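A minimal sketch of this preprocessing with scikit-learn, assuming the merged customer frame from earlier; the exact columns dropped or encoded in the notebook may differ.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Drop identifier and raw-date columns that should not be clustered on
features = merged.drop(columns=["id", "person", "became_member_on"], errors="ignore")

# Map categorical columns (e.g. gender) to integer keys
for col in features.select_dtypes(include="object").columns:
    features[col] = LabelEncoder().fit_transform(features[col].astype(str))

# Scale every feature into the [0, 1] range
scaled_df = pd.DataFrame(
    MinMaxScaler().fit_transform(features), columns=features.columns
)
```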

PCA on the data

What is PCA, you may ask? PCA stands for Principal Component Analysis, and it reduces the dimensionality of the data so that it can be processed with as few dimensions as possible. When we apply PCA to our scaled data, we reduce the number of feature dimensions and keep only the components that matter most for the clustering algorithm. The resulting PCA data is the following:
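A sketch of the reduction with scikit-learn; the 85% explained-variance threshold is an assumption for illustration, not necessarily the value used in the notebook.

```python
from sklearn.decomposition import PCA

# Keep enough principal components to explain ~85% of the variance
pca = PCA(n_components=0.85, random_state=42)
pca_data = pca.fit_transform(scaled_df)

print(pca.n_components_, pca.explained_variance_ratio_.cumsum())
```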

K-means:

K-means is a popular data science clustering algorithm that segments data points with similar features into spherical clusters.

[…] “It ends up basically assuming that clusters are going to be gaussian balls rather than anything else”[2].

However, I do have to add a warning about this algorithm: it is not as precise as we may have been taught. This is because it assumes that clusters are simply spheres, and it is sensitive to noise in the data, which can lead to imprecise predictions.

Metrics:

Elbow method:

Inertia measures how well a data set is clustered by K-means. It is calculated by measuring the distance between each data point and its centroid, squaring this distance, and summing these squares over each cluster.

A good model is one with low inertia AND a low number of clusters (K). However, this is a trade-off because as K increases, the inertia decreases.

To find the optimal K for a data set, use the elbow method: find the point where the decrease in inertia begins to slow down.
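Assuming the PCA-reduced data from the previous step is in pca_data, a sketch of the elbow search looks like this:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Fit K-means for a range of k values and record the inertia of each fit
k_values = range(2, 11)
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=42).fit(pca_data).inertia_
    for k in k_values
]

plt.plot(k_values, inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.title("Elbow method")
plt.show()
```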

Silhouette value:

The silhouette value measures how similar a point is to its own cluster (cohesion) compared to other clusters (separation).

Davies Bouldin:

Average similarity measure of each cluster to its most similar cluster, where similarity is the ratio of distances within the cluster to distances between clusters. The minimum score is zero, and lower values indicate better clustering.
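Both scores are available in scikit-learn; a minimal sketch for a candidate K-means fit (k=3 anticipates the choice discussed next):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(pca_data)

print("Silhouette score:    ", silhouette_score(pca_data, labels))
print("Davies-Bouldin score:", davies_bouldin_score(pca_data, labels))
```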

Visualize the distribution of data over clusters:

Based on the results obtained from the previous methods, it seems that using three clusters is the best choice. So, we get the cluster label for each of our data points (customers) and visualize the distribution of points over each cluster.

Visualizing Centroids in Component Space:

HDBSCAN

[…] “HDBSCAN is a clustering algorithm developed by Campello, Moulavi, and Sander. It extends DBSCAN by converting it into a hierarchical clustering algorithm, and then using a technique to extract a flat clustering based in the stability of clusters.” [3]. Basically, HDBSCAN uses the density-based features of DBSCAN to find data clusters without knowing the probability density function.

This is great because we do not know the PDF of our data, nor do we know how many clusters would suit it best. So we go ahead and try different parameters and the previously saved datasets to achieve the best results, and finally label each data point with its corresponding cluster.

Cluster with Scaled dataset:

So first we try our luck with the scaled dataset that was obtained earlier and the results were the following:

It is worth mentioning that the black dots in the graph represent the noise detected by the HDBSCAN algorithm, and they are labeled -1. Meanwhile, it can be observed that the green cluster represents a large percentage of the data, as shown in the following image.

NOTE: After several trials, the best parameters were found to be min_cluster_size=35 and min_samples=10, as they group the data better and generate less noise.
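A sketch of this step with the hdbscan library, using the parameters from the note above; here it runs on the scaled dataset, and the PCA variant simply swaps in pca_data.

```python
import hdbscan
import pandas as pd

# Parameters found after several trials (see the note above)
clusterer = hdbscan.HDBSCAN(min_cluster_size=35, min_samples=10)
hdbscan_labels = clusterer.fit_predict(scaled_df)

# Label -1 marks the points the algorithm treats as noise
print(pd.Series(hdbscan_labels).value_counts())
```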

We can also see that the algorithm identified 4 main clusters excluding the noise data points:

And the frequency of data points for each label are as follows:

As we can see, cluster 3 covers 82.6% of the data and noise covers 15.1%. We can assume that the algorithm did not perform as well here because PCA was not applied to the input data.

Next, we tried the same procedure but with PCA applied to the input data.

Cluster with scaled dataset with PCA:

The results obtained were much better than the previous ones:

There were a total of 3 data clusters generated excluding the noise data.

And the frequency for each cluster is as follows:

Why HDBSCAN?

The answer is that it is simply better at clustering this data, and it does most of the hard work for you. It performs far better than K-means and its derivatives and can handle up to a million data points.

Here are some comparisons done by Leland McInnes [2] that show how HDBSCAN performs against other algorithms. He ran all the algorithms on the same dataset, varying the type of data on each trial. Here is what the results showed:

What we can observe from the previous images is that HDBSCAN outperforms most algorithms on all three occasions, even as more data is added each time.

Final Segmentation

So, now that we have found the cluster label for each point, why not go ahead and place the labels in their corresponding rows of the scaled dataset that was created earlier?

We add the new column to the data and we get something that looks like this:

As previously mentioned, the points labeled -1 are noise. If you want to, you can go ahead and remove those points, but I will keep them to give the results more weight and credibility.
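Attaching the labels is a one-liner; the column name below is just illustrative.

```python
# Add the HDBSCAN labels to the scaled customer data
scaled_df["cluster"] = hdbscan_labels

# Optional: drop the noise points (label -1); here they are kept
# scaled_df = scaled_df[scaled_df["cluster"] != -1]
```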

Results

So, after some graphical representations and calculations, we were able to draw a few conclusions about the clusters and predict which cluster group is more likely to accept an offer.

Age and income:

First, we will take a look at how the clusters are distributed across the relationship between age and income.

Here the graph indicates that older people are the ones with the highest incomes. Cluster one points to customers who are 70+ years old and have the highest incomes, while cluster two may indicate middle-aged customers who also have high incomes. However, this does not tell us much about which type of offer would best suit each cluster.

Gender:

Gender did not have much of an effect on the assignment of clusters to customers. However, it is worth mentioning that each cluster has more females than males, which may be useful information further down the line.

Offer types

BOGOS

According to the graphics, the cluster with the highest average number of BOGO offers redeemed is cluster 1. However, as we observed earlier, cluster 1 contains only 1.1% of the total data, so we instead look at whether the highest value comes from cluster 0 or cluster 2. Cluster -1 is not considered because it represents noise. Visually, we see that between these two clusters, cluster 0 has the highest average value. So we can confidently predict that cluster 0 will respond best to BOGO offers.

Discount Offers

In the case of the discount offers, however, cluster 2 completed the most offers on average, by a small margin over cluster 0. Cluster 1 has zero completed offers on average, and again we do not consider cluster -1 because it is noise. So, again with confidence, we can say that cluster 2 responds best to discount offers.

Project Link

The GitHub repository containing all of the project material is here. Keep in mind that I ran the notebooks in Google Colab and in order to read external data I had to save the data in my external drive.

Also, if you intend to run the training notebook, keep in mind that its cells were run on Amazon SageMaker, so you may need an account in order to run them.

Contact me

If you wish to contact me or view some of my work, you can do so through the following channels:

Github

Twitter

LinkedIn

Personal Site
