Revisiting Customer Segmentation With Machine-Learning: We Now Make It More Interesting
Bridging K-Means Clustering with Hypothesis Testing to provide more relevant business value to customer segmentation.
This article describes a step-by-step approach to complementing the application of a K-means clustering machine-learning algorithm with a bit of additional data science, using the Kaggle Mall Customer Segmentation dataset. We will base our work on DataFlair’s Mall Customer Segmentation analysis in R, since R will be our main analysis platform.
The dataset: Mall Customer Database
The dataset comprises information on 200 customers of a mall. It is open for download in .csv format here and consists of 5 variables, described below:
- Customer ID: an ID number for each of the customers.
- Gender: whether the customer’s gender is Male or Female.
- Age: years of age of each customer.
- Annual Income (in thousands): annual income of each customer.
- Spending Score (from 1 to 100): an arbitrary score that denotes the degree of spending of each customer within the mall. The higher the number, the higher the spending (and vice-versa).
Our objectives
Our objectives are three-fold:
- Initially, understand the dataset’s variables to identify the relevant elements before applying any kind of analysis.
- Identify meaningful customer groups by employing business-relevant customer information (this is, in fact, the customer segmentation part). But it does not end here.
- Once we have defined our customer groups, let’s ask questions about them regarding our relevant variables: do these groups differ by gender, age, annual income, or spending score? This is actually the most important part, because these questions will bring value to our prior segmentation, as well as provide fertile ground for data-informed business decisions.
Onwards to the analysis
As previously stated, we’ll be working with R to perform all the analyses. We’ll move on a step-by-step basis to firstly identify meaningful customer groups and then analyze differences between these groupings regarding our variables of interest. Finally, we’ll finish with conclusions from the analyses and how we can bring business value to all the data-related work we’ve undergone.
Loading packages, data, and exploratory data analysis
We load the required R packages for this session.
library(dplyr)
library(psych)
library(rstatix)
library(ggplot2)
library(NbClust)
library(sjPlot)
(The usual package start-up messages about masked functions and package versions are omitted here.)
We load the csv file as a dataframe.
customer_data=read.csv("customer-segmentation-dataset/Mall_Customers.csv")
We employ a bit of exploratory data analysis to get a general and initial look at the data.
str(customer_data)
## 'data.frame': 200 obs. of 5 variables:
##  $ CustomerID            : int 1 2 3 4 5 6 7 8 9 10 ...
##  $ Gender                : chr "Male" "Male" "Female" "Female" ...
##  $ Age                   : int 19 21 20 23 31 22 35 23 64 30 ...
##  $ Annual.Income..k..    : int 15 15 16 16 17 17 18 18 19 19 ...
##  $ Spending.Score..1.100.: int 39 81 6 77 40 76 6 94 3 72 ...

names(customer_data)
## [1] "CustomerID"             "Gender"                 "Age"
## [4] "Annual.Income..k.."     "Spending.Score..1.100."

head(customer_data)
##   CustomerID Gender Age Annual.Income..k.. Spending.Score..1.100.
## 1          1   Male  19                 15                     39
## 2          2   Male  21                 15                     81
## 3          3 Female  20                 16                      6
## 4          4 Female  23                 16                     77
## 5          5 Female  31                 17                     40
## 6          6 Female  22                 17                     76
We’ll rename the annual income and spending score variables, since their auto-generated names are visually cluttered and unclear.
customer_data <- rename(customer_data,
annualIncome.k = Annual.Income..k..,
spendingScore.1_100 = Spending.Score..1.100.)
Descriptive statistics of the continuous variables: unlike DataFlair, we use the describe function from the psych package, since it provides a handy one-line summary of the descriptive statistics we need for each numerical variable.
continuousVariables <- customer_data[, c("Age",
"annualIncome.k",
"spendingScore.1_100")]
describe(continuousVariables, fast = TRUE)
##                     vars   n  mean    sd min max range   se
## Age                    1 200 38.85 13.97  18  70    52 0.99
## annualIncome.k         2 200 60.56 26.26  15 137   122 1.86
## spendingScore.1_100    3 200 50.20 25.82   1  99    98 1.83
We calculate the gender frequencies and proportions. Deviating from other projects with the same dataset, I don’t think we need a plot to visualize this.
freq_table(customer_data$Gender)
## # A tibble: 2 x 3
##   group      n  prop
##   <chr>  <int> <dbl>
## 1 Female   112    56
## 2 Male      88    44
So, we see that, out of a total of 200 individuals, 112 (56%) are female and 88 (44%) are male.
Now, back to our continuous variables: Age, Annual Income, and Spending Score. These we do want to check in a more “visual” manner, and for that we can use the industry-standard plotting package ggplot2. We’ll generate separate histograms to visualize the distributions of Age, Annual Income, and Spending Score across our customers. Additionally, I like to pair histograms with density plots, which are a bit more aesthetically pleasing and make the shape of each distribution easier to take in. Lastly, I love color palettes, so I’ll use some custom colors I liked from coolors.co (note the hex codes in the “fill” option).
Age histogram:
ggplot(data = customer_data,
mapping = aes(x = Age)) +
geom_histogram(color = "#DCCFEC",
fill = "#4F517D",
binwidth = 4) +
labs(title = "Histogram of Customer Age",
y = "Count") +
theme_bw()

Age density plot:
ggplot(data = customer_data,
mapping = aes(x = Age)) +
geom_density(color = "#DCCFEC",
fill = "#4F517D") +
labs(title = "Density Plot of Customer Age",
y = "Proportion") +
theme_bw()

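By the way, if you’d rather see both views in a single panel, the density curve can be overlaid on the histogram by rescaling the histogram to the density scale with after_stat() (available in ggplot2 3.3 and later). A minimal sketch for Age:
ggplot(data = customer_data,
       mapping = aes(x = Age)) +
  geom_histogram(aes(y = after_stat(density)), # rescale bars to densities
                 color = "#DCCFEC",
                 fill = "#4F517D",
                 binwidth = 4) +
  geom_density(color = "black") + # overlay the density curve
  labs(title = "Histogram and Density of Customer Age",
       y = "Density") +
  theme_bw()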
Annual Income histogram:
ggplot(data = customer_data,
mapping = aes(x = annualIncome.k)) +
geom_histogram(color = "#DCCFEC",
fill = "#4F517D",
binwidth = 10) +
scale_x_continuous(limits = c(0, 150)) +
labs(title = "Histogram of Customer Annual Income",
x = "Annual Income (thousands)",
y = "Count") +
theme_bw()
## Warning: Removed 2 rows containing missing values (`geom_bar()`).

Annual Income density plot:
ggplot(data = customer_data,
mapping = aes(x = annualIncome.k)) +
geom_density(color = "#DCCFEC",
fill = "#4F517D") +
scale_x_continuous(limits = c(0, 150)) +
labs(title = "Density Plot of Customer Annual Income",
x = "Annual Income (thousands)",
y = "Proportion") +
theme_bw()

Spending Score histogram:
ggplot(data = customer_data,
mapping = aes(x = spendingScore.1_100)) +
geom_histogram(color = "#DCCFEC",
fill = "#4F517D",
binwidth = 10) +
scale_x_continuous(limits = c(1, 100)) +
labs(title = "Histogram of Customer Spending Score",
x = "Spending Score",
y = "Count") +
theme_bw()
## Warning: Removed 2 rows containing missing values (`geom_bar()`).

Spending Score density plot:
ggplot(data = customer_data,
mapping = aes(x = spendingScore.1_100)) +
geom_density(color = "#DCCFEC",
fill = "#4F517D") +
scale_x_continuous(limits = c(1, 100)) +
labs(title = "Density Plot of Customer Spending Score",
x = "Spending Score",
y = "Proportion") +
theme_bw()

Implementing the Machine-Learning algorithm
Ok! Now to our ML algorithm. Before applying the K-means clustering algorithm, we’ll determine the optimal number of clusters from the data. A robust way to select this number is to analyze multiple cluster-suitability indices at once, which the NbClust function from the NbClust package does for us: it computes more than 30 indices (Charrad et al., 2014), and we pick the cluster number proposed most frequently. This procedure is preferable to other commonly implemented ones (e.g., the “Elbow” method or the Average Silhouette method) because it does not rely on a single suitability criterion but on many, overcoming the respective limitations of each individual index (Kodinariya & Makwana, 2013).
First, we’ll recode gender as a dummy variable, so we can also input it in the K-means clustering algorithm alongside our continuous variables. Being able to incorporate gender within the algorithm is highly relevant, since it is an individual characteristic that is closely related to economic behavior. We are not going to leave out valuable data!
customer_data <- customer_data %>%
mutate(female = ifelse(Gender == "Female", 1, 0))
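As a quick (optional) sanity check, we can cross-tabulate the new dummy against the original Gender variable; every Female case should fall in the “1” column:
# Cross-tabulation of the original variable against the new dummy
table(customer_data$Gender, customer_data$female)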
We then use the NbClust function to look for the optimal number of clusters. It’s important to note we are leaving both the ID and the non-numerical Gender variable out of the analysis.
nb <- NbClust(subset(customer_data, select = -c(CustomerID, Gender)),
distance = "euclidean",
min.nc = 2,
max.nc = 15,
method = "kmeans",
index = "all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##

## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 4 proposed 2 as the best number of clusters
## * 5 proposed 3 as the best number of clusters
## * 2 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 5 proposed 6 as the best number of clusters
## * 1 proposed 9 as the best number of clusters
## * 1 proposed 11 as the best number of clusters
## * 2 proposed 12 as the best number of clusters
## * 1 proposed 13 as the best number of clusters
## * 1 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
Here, we see that both 3 and 6 clusters are tenable solutions, since an equal number of indices (5 each) proposed them. We will choose 3 clusters, as it is the more parsimonious solution: the simplest group configuration the data supports.
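If you want to inspect the vote tally behind that majority rule yourself, the NbClust object stores each index’s proposal in its Best.nc element. A minimal sketch, assuming the structure returned when index = "all" (the purely graphical indices, such as Hubert and Dindex, contribute a 0):
# Count how many indices proposed each candidate number of clusters
clusterVotes <- table(nb$Best.nc["Number_clusters", ])
clusterVotes
barplot(clusterVotes,
        xlab = "Proposed number of clusters",
        ylab = "Number of indices")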
Finally, we now get to run our precious unsupervised Machine-Learning algorithm! Personally, this is the most fun part :). We input gender, age, annual income and spending scores, and ask for a 3-cluster grouping, as previously advised by our optimal cluster-number indexes.
kMeans3 <- kmeans(subset(customer_data, select = -c(CustomerID, Gender)),
centers = 3,
iter.max = 100,
nstart = 50,
algorithm = "Lloyd")
kMeans3$cluster
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132
## 1 1 1 1 1 1 1 1 1 1 1 1 1 3 2 3 2 3 2 3 2 3
## 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154
## 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3
## 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176
## 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3
## 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198
## 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3
## 199 200
##   2   3

kMeans3$size
## [1] 123  38  39

prop.table(kMeans3$size)
## [1] 0.615 0.190 0.195
Once the K-means clustering has been performed, two elements of the output are particularly relevant: the “cluster” vector, which gives the group membership of each data point, and “size”, which tells us how large each cluster is (and how balanced the grouping solution is). In this case, the first cluster contains the majority of cases (n = 123, 61.5%), while the other two clusters are of roughly equal size (n = 38 and n = 39, about 19% each).
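As a complementary descriptive look, the centers element of the kmeans() output holds each cluster’s centroid on the original variable scales, which already hints at how the groups differ; a minimal sketch:
# Cluster centroids for age, annual income, spending score, and the female dummy
round(kMeans3$centers, 1)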
We now append the variable denoting cluster membership of each case to the dataframe.
clusterMembership <- as.factor(kMeans3$cluster)
customer_data <- cbind(customer_data, clusterMembership)
names(customer_data)
## [1] "CustomerID"          "Gender"              "Age"                 "annualIncome.k"
## [5] "spendingScore.1_100" "female"              "clusterMembership"

head(customer_data)
##   CustomerID Gender Age annualIncome.k spendingScore.1_100 female clusterMembership
## 1 1 Male 19 15 39 0 1
## 2 2 Male 21 15 81 0 1
## 3 3 Female 20 16 6 1 1
## 4 4 Female 23 16 77 1 1
## 5 5 Female 31 17 40 1 1
## 6 6 Female 22 17 76 1 1
Inferential hypothesis testing regarding our customer groups
Now that we have our customer groups, it is important to ask ourselves some questions regarding the relationships between the customer groups’ socioeconomic characteristics. The answers to these questions will provide the business value to the clustering analysis that we’ve performed so far. Since this is data science after all, we’ll ask and answer these questions via inferential statistics and hypothesis testing. Firstly, we’ll examine correlations for the input variables to analyze whether we’ll need to control for variables when we proceed with the inferential hypothesis testing phase.
variablesForCorrelations <- cbind(continuousVariables, customer_data$female)
cor(variablesForCorrelations)
##                              Age annualIncome.k spendingScore.1_100 customer_data$female
## Age 1.00000000 -0.012398043 -0.327226846 -0.06086739
## annualIncome.k -0.01239804 1.000000000 0.009902848 -0.05640981
## spendingScore.1_100 -0.32722685 0.009902848 1.000000000 0.05810874
## customer_data$female -0.06086739 -0.056409810 0.058108739 1.00000000
It appears that the only substantial association is between age and spending score (as age increases, the spending score tends to decrease). So, if we ask whether there are age differences or spending-score differences between our customer groups, we’ll need to control for the potential confounding effect of spending score or age, respectively.
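If we wanted to formally test that age and spending score association rather than just eyeballing the correlation matrix, base R’s cor.test() gives a quick check:
# Pearson correlation test between age and spending score
cor.test(customer_data$Age, customer_data$spendingScore.1_100)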
So, now: to our inferential hypothesis testing. Let’s characterize our customer groups according to gender proportions, age, annual income, and spending scores. First, let’s reproduce the original post’s cluster plot to examine associations between our three clusters and their annual income and spending score.
ggplot(customer_data,
aes(x = annualIncome.k,
y = spendingScore.1_100)) +
geom_point(aes(color = clusterMembership)) +
scale_color_discrete(labels = c("Cluster 1",
"Cluster 2",
"Cluster 3")) +
ggtitle("Segments of Mall Customers",
subtitle = "K-means clustering algorithm")

The plot shows that the main driver of the differences between cluster 1 and clusters 2 and 3 is annual income, while the difference between clusters 2 and 3 lies in spending score. However, these are merely descriptive observations that can be gauged more precisely with hypothesis testing. Moreover, we fed age and gender into the clustering algorithm as relevant variables, so we need to bring them into our conclusions as well; they are potentially valuable information for business decision-making (i.e., they are individual characteristics of our customer population!). Thus, presenting the previous plot as a final result is not imprecise, but it is a bit incomplete.
The important questions:
Are there gender differences between our clusters?
So, onward. Let’s start with examining differences in gender proportions between clusters. We will perform a Chi-Squared test to test the hypothesis that there are differences in gender proportions (i.e., male and female) between our identified clusters.
chisquaredResult <- chisq.test(x = customer_data$female,
y = customer_data$clusterMembership)
chisquaredResult
##
## Pearson's Chi-squared test
##
## data: customer_data$female and customer_data$clusterMembership
## X-squared = 1.7825, df = 2, p-value = 0.4101
The Chi-Squared test of association tells us that there are no statistically significant differences in gender proportions between clusters, χ²(2) = 1.78, p = .41. Thus, we could say that, as far as the data tells us, the groups are somewhat even in terms of gender proportions.
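For completeness, the gender split within each cluster can also be inspected descriptively with a simple cross-tabulation (each column sums to 1); a minimal sketch:
# Proportion of each gender within every cluster (margin = 2 normalizes by column)
prop.table(table(customer_data$Gender, customer_data$clusterMembership),
           margin = 2)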
Are there any age differences between our customer groups?
Bear in mind that we previously detected that age and spending score were correlated, so we should control for spending score when analyzing age differences between clusters (and vice-versa, although that comes later). We could go with a Multivariate Analysis of Variance (MANOVA), but it would appear that’s not such a wise decision (Huang, 2020). Instead, we’ll opt for a multiple linear regression to assess the explanatory value of cluster membership on age, with spending score as a covariate (spoiler alert: in the next analysis we’ll do the same with age and spending score swapped).
We’ll run the identical model twice, changing only the reference level of our cluster-membership variable so we can compare estimated age differences between groups (1 vs. 2 and 3, then 2 vs. 1 and 3). This will tell us the estimated differences in age between groups while also controlling for the relationship between spending score and age.
AgeOnCluster_ref1 <- lm(Age ~ spendingScore.1_100 + relevel(as.factor(clusterMembership), ref = "1"),
data = customer_data)
AgeOnCluster_ref2 <- lm(Age ~ spendingScore.1_100 + relevel(as.factor(clusterMembership), ref = "2"),
data = customer_data)
tab_model(AgeOnCluster_ref1,
AgeOnCluster_ref2)

Our predictors of interest here are the cluster-membership estimates. From our multiple regression, we can see:
- Customer groups 1 and 2 do not show a statistically significant difference in age (B(2 vs 1) = 0.52, p = .86).
- Customer group 1 is estimated to be, on average, almost 8 years older than group 3 (B(3 vs 1) = -7.81, p = .01).
- In turn, customer group 2 is also estimated to be around 8 years older than group 3 on average, although this difference is only marginally statistically significant (B(3 vs 2) = -8.33, p = .073).
Attention! This result is particularly relevant, as it speaks to the benefit of controlling for the association between spending score and age when analyzing age differences between clusters. If we simply calculated the mean age of each group, group 2 would appear to be the youngest, with groups 1 and 3 being similar in age. That contradicts the regression results and suggests that the differences observed in the raw group mean ages may be attributable to differences in spending scores rather than to cluster membership itself.
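To see this contrast for yourself, here is a quick sketch of the raw (unadjusted) group means of age and spending score; compare them with the covariate-adjusted regression estimates above:
# Raw group means, with no adjustment for the age and spending score correlation
customer_data %>%
  group_by(clusterMembership) %>%
  summarise(meanAge = mean(Age),
            meanSpendingScore = mean(spendingScore.1_100))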
Are there differences in customer spending scores between clusters?
Let’s move on to asking whether there are differences in customer spending scores between our clusters. We conduct a second multiple linear regression, where our predictors of interest will again be the cluster-membership estimates. Note that it is a very similar analysis to the previous one, but with age and spending score swapping places.
spendingScoreOnCluster_ref1 <- lm(spendingScore.1_100 ~ Age + relevel(as.factor(clusterMembership), ref = "1"),
data = customer_data)
spendingScoreOnCluster_ref2 <- lm(spendingScore.1_100 ~ Age + relevel(as.factor(clusterMembership), ref = "2"),
data = customer_data)
tab_model(spendingScoreOnCluster_ref1,
spendingScoreOnCluster_ref2)

From our multiple regression, looking at the cluster-membership predictors, we can see:
- Customer group 1 is estimated to score, on average, approximately 30 points lower on the spending-score scale than group 2 (B(2 vs 1) = 29.43, p < .001).
- Customer group 1 is estimated to score, on average, approximately 30 points higher than group 3 (B(3 vs 1) = -31.17, p < .001).
- In turn, customer group 2 is estimated to score, on average, approximately 60 points higher than group 3 (B(3 vs 2) = -60.60, p < .001).
This means that there are significant differences between our mall customer groups regarding their spending scores, with group 2 exhibiting the highest spending score, followed by group 1, and then group 3.
Are there differences in customer annual income between our clusters?
We will perform a one-factor Analysis of Variance (ANOVA) to analyze differences in annual income between clusters in a hypothesis testing manner. If we find statistically significant differences between our customer clusters, we will follow up with a Tukey Post-Hoc test to assess the magnitude of the difference in annual income, and between which individual groups.
annualIncomeANOVA <- aov(annualIncome.k ~ clusterMembership, data = customer_data)
summary(annualIncomeANOVA)
##                    Df Sum Sq Mean Sq F value Pr(>F)
## clusterMembership 2 85990 42995 165.1 <2e-16 ***
## Residuals 197 51288 260
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Our analysis tells us that there are statistically significant differences between the annual incomes of our customer groups, F(2, 197) = 165.10, p < .001. We’ll follow up with Tukey Post-Hoc tests to determine between which groups we observe these differences, and what the magnitudes of these differences are in terms of income.
TukeyHSD(annualIncomeANOVA)
##   Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = annualIncome.k ~ clusterMembership, data = customer_data)
##
## $clusterMembership
## diff lwr upr p adj
## 2-1 42.8455285 35.773528 49.917528 0.0000000
## 3-1 42.3839900 35.381600 49.386380 0.0000000
## 3-2 -0.4615385 -9.147041 8.223964 0.9913559
After conducting the Tukey Post-Hoc tests, we find:
- Customer group 1 earns approximately 43k less than group 2 (Estimate(2-1) = 42.85, p < .001).
- Customer group 1 earns approximately 42k less than group 3 (Estimate(3-1) = 42.38, p < .001).
- There are no statistically significant differences in income between groups 2 and 3 (Estimate(3-2) = -0.46, p = .99).
This means that we could characterize group 1 as “low income”, whereas both customer groups 2 and 3 could be labeled as “high income”.
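Before moving on to the overall conclusions, a single descriptive profile of each cluster (a quick dplyr sketch) pulls these characterizations together in one table:
# Descriptive profile per cluster: size, gender mix, age, income, and spending
customer_data %>%
  group_by(clusterMembership) %>%
  summarise(n = n(),
            propFemale = mean(female),
            meanAge = mean(Age),
            meanAnnualIncome.k = mean(annualIncome.k),
            meanSpendingScore = mean(spendingScore.1_100))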
Overall results and conclusions
Finally! Our long-awaited overall results and conclusions:
From the available data on mall customers, we detected three groups using an unsupervised machine-learning model (the K-means clustering algorithm). We then analyzed the associations between membership in these three groups and gender proportions, age, spending scores, and annual income. We employed hypothesis testing to analyze these associations in a rigorous manner, using graphs only for an initial understanding rather than as our sole approach.
Results
- Group 1: Older age, low income, medium spending. This comprises the majority of mall customers (61.5%).
- Group 2: Older age, high income, high spending. This segment comprises 19.0% of customers.
- Group 3: Younger age, high income, low spending. This group includes 19.5% of the mall customers.
Conclusions:
- Business decisions that target products for older individuals are likely to be fruitful, since younger customers exhibit low spending even when their income is high. Targeting older customers addresses the needs of the large majority of the customer base (over 80%), a notable subset of whom have high income.
- Since there are no substantial differences in proportions of genders across customer groups, business decisions should not be driven by gendered products.
- If we incorporate further variables (such as product reviews, or sales details such as products purchased), we can adapt the previous analysis pipeline relatively easily to further refine the assessment conducted above. Thus, we are using data not only to extract meaningful and valuable conclusions from existing information: we are also laying the groundwork for steady growth and strengthening our decision-making by applying science to business data.
Contacts
Are you interested in Data Science? Let’s connect.
If you want to keep up to date with my latest articles and projects, follow me on Medium.
References:
- Charrad, M., Ghazzali, N., Boiteau, V., & Niknafs, A. (2014). NbClust: An R package for determining the relevant number of clusters in a data set. Journal of Statistical Software, 61, 1–36.
- Data Science Project — Customer Segmentation using Machine Learning in R. https://data-flair.training/blogs/r-data-science-project-customer-segmentation/
- Huang, F. L. (2020). MANOVA: A procedure whose time has passed?. Gifted Child Quarterly, 64(1), 56–60.
- Kodinariya, T. M., & Makwana, P. R. (2013). Review on determining number of Cluster in K-Means Clustering. International Journal, 1(6), 90–95.