Identify Customer Segments
In this project, I applied unsupervised learning techniques to identify segments of the population that form the core customer base for a mail-order sales company in Germany. These segments can then be used to direct marketing campaigns towards the audiences with the highest expected rate of return. The data was provided by Bertelsmann Arvato Analytics and represents a real-life data science task.
First, I converted every value that matches a ‘missing’ or ‘unknown’ code into a NumPy NaN value. Categorical features then need to have their levels encoded as dummy variables. Because the data contains multi-level categoricals (three or more values), these can be encoded using multiple dummy variables (e.g. via OneHotEncoder).
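The two cleaning steps above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the column names, the toy values, and the `missing_codes` mapping are all hypothetical, and pandas' `get_dummies` is used here as a stand-in for OneHotEncoder.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and codes are illustrative only.
df = pd.DataFrame({
    "age_group": [1, 2, -1, 3],
    "household_type": [1, 0, 3, 2],
})

# Assumed mapping of column name -> codes that mean 'missing' or 'unknown'.
missing_codes = {"age_group": [-1], "household_type": [0]}

# Replace each column's missing/unknown codes with NaN.
for column, codes in missing_codes.items():
    df[column] = df[column].replace(codes, np.nan)

# One-hot encode the multi-level categorical column.
# Rows where the value is NaN get zeros in every dummy column.
encoded = pd.get_dummies(df, columns=["household_type"], prefix="household_type")
print(encoded)
```

With `get_dummies`, a missing categorical value simply produces an all-zero row across the dummy columns, so the missing values still have to be handled (dropped or imputed) in the remaining numeric columns.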
Before applying dimensionality reduction techniques to the data, the features need to be scaled so that the principal component vectors are not influenced by natural differences in scale between features. For the numeric values, a StandardScaler instance is the suggested choice, since it scales each feature to mean 0 and standard deviation 1. Normalizer is not suitable here because it rescales each sample (row) to unit norm rather than each feature.
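As a small sketch of the scaling step, the snippet below applies StandardScaler to a toy array whose two columns have very different scales. The array `X` is made up for illustration and assumes missing values have already been handled.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: the second column is on a much larger scale.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# After scaling, every column has mean 0 and standard deviation 1.
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```

Keeping the fitted `scaler` object around matters when a second dataset (for example the customer data) must be transformed with the same parameters via `scaler.transform`.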
Then I used sklearn’s PCA class to apply principal component analysis to the data, finding the directions of maximal variance. As a starting point, either no parameters are set (so all components are computed) or the number of components is set to at least half the number of features (so there are enough components to see the general trend in variability).
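One way to inspect the trend in variability is to fit PCA with all components and look at the cumulative explained variance. The sketch below uses random data in place of the real scaled features, and the 0.88 threshold and the name `n_keep` are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for the scaled feature matrix (200 samples, 10 features).
rng = np.random.default_rng(0)
X_scaled = rng.normal(size=(200, 10))

# No n_components given, so all components are computed.
pca = PCA()
pca.fit(X_scaled)

# Cumulative share of variance explained by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the threshold.
threshold = 0.88
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
print(n_keep, cumulative[n_keep - 1])
```

Once a number of components has been chosen, PCA can be refit with `PCA(n_components=n_keep)` and the data transformed with `fit_transform`.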
I decided to retain the first 35 principal components, reducing the data to 35 dimensions, since together they explain approximately 88% of the variance. I then used sklearn’s KMeans class to perform k-means clustering on the PCA-transformed data. Using the elbow method, I selected 13 as the number of clusters.
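The elbow method can be sketched by fitting KMeans for a range of cluster counts and recording the within-cluster sum of squares (`inertia_`) for each. The data below is synthetic, with three well-separated groups, and the range of `k` values is an arbitrary choice for illustration, not the range used in the project.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the PCA-transformed data: three separated blobs.
rng = np.random.default_rng(0)
X_pca = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(50, 2))
    for center in (0.0, 5.0, 10.0)
])

# Fit k-means for several values of k and record the inertia of each fit.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_pca)
    inertias[k] = model.inertia_

# The 'elbow' is the k after which inertia stops dropping sharply.
for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

Plotting `inertias` against `k` makes the elbow easier to see. On real data the bend is often less clear-cut than in this toy example, so the chosen value involves some judgment.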
In the general population of Germany, all clusters are represented roughly equally, with the exception of cluster #0. In the customer data provided by Bertelsmann Arvato Analytics, however, clusters #4 and #12 account for a larger share. People in these two clusters are therefore the company's most likely potential customers, and the company should target clusters #4 and #12 within the general population to increase its revenue, for example through digital marketing.
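The comparison between the two populations can be sketched by computing each cluster's share in both datasets and taking the ratio. The label arrays below are invented toy values with three clusters, not the project's results, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Invented cluster labels for illustration only.
general_labels = np.array([0, 1, 1, 2, 2, 2, 0, 1])
customer_labels = np.array([2, 2, 2, 1, 2, 2])
n_clusters = 3

def cluster_share(labels, n_clusters):
    """Return the proportion of samples assigned to each cluster."""
    counts = pd.Series(labels).value_counts(normalize=True)
    return counts.reindex(range(n_clusters), fill_value=0.0)

general_share = cluster_share(general_labels, n_clusters)
customer_share = cluster_share(customer_labels, n_clusters)

# A ratio above 1 means the cluster is overrepresented among customers.
ratio = customer_share / general_share
overrepresented = ratio[ratio > 1].index.tolist()
print(overrepresented)
```

Clusters with a ratio well above 1 are the natural candidates for targeted campaigns, while clusters with a ratio well below 1 describe people who are unlikely to be customers.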