In this project, I applied unsupervised learning techniques to identify segments of the population that form the core customer base for a mail-order sales company in Germany. These segments can then be used to direct marketing campaigns towards the audiences with the highest expected rate of return. The data was provided by Bertelsmann Arvato Analytics and represents a real-life data science task.
First, I converted every value that matches a ‘missing’ or ‘unknown’ code into a numpy NaN value. Categorical data ordinarily needs to be encoded as dummy variables. Because the data contains multi-level categoricals (three or more values), I can choose to encode each of them using multiple dummy variables (e.g. via OneHotEncoder).
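As a minimal sketch of this cleaning step, the snippet below replaces missing-value codes with NaN and then one-hot encodes a multi-level categorical. The column names (`age_group`, `shopper_type`), the missing codes, and the helper `replace_missing_codes` are all hypothetical and chosen for illustration; they are not taken from the actual dataset. For brevity it uses `pandas.get_dummies` rather than OneHotEncoder.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column names and missing codes are illustrative only.
df = pd.DataFrame({
    "age_group": [1, 2, -1, 3],
    "shopper_type": [0, 1, 2, 9],
})
missing_codes = {"age_group": [-1], "shopper_type": [9]}


def replace_missing_codes(frame, codes):
    """Return a float copy of frame with the listed codes set to NaN."""
    out = frame.astype(float)
    for col, vals in codes.items():
        # Keep values that are not missing codes; everything else becomes NaN.
        out[col] = out[col].where(~out[col].isin(vals), np.nan)
    return out


cleaned = replace_missing_codes(df, missing_codes)

# One dummy column per level of the multi-level categorical.
# Rows where the value is NaN get zeros in every dummy column.
encoded = pd.get_dummies(cleaned, columns=["shopper_type"], dtype=float)
```

With `get_dummies`, a missing categorical value ends up as an all-zero row across the dummy columns, so how to treat those rows (drop or impute) remains a separate decision.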
Before applying dimensionality reduction to the data, I need to perform feature scaling so that the principal component vectors are not influenced by natural differences in scale between features. For the scaling itself, a StandardScaler instance is suggested rather than Normalizer for numeric values; it scales each feature to mean 0 and standard deviation 1.
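The scaling step might look roughly like the sketch below. The array `X` is made-up toy data with two features on very different scales, and it assumes missing values have already been dropped or imputed.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (illustrative only): two features with very different scales.
X = np.array([
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 500.0],
])

scaler = StandardScaler()
# fit_transform learns each column's mean and standard deviation,
# then rescales every column to mean 0 and standard deviation 1.
X_scaled = scaler.fit_transform(X)
```

Keeping the fitted `scaler` object around allows the same transformation to be applied to other data later via `scaler.transform`.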
Then, I used sklearn’s PCA class to apply principal component analysis to the data, finding the vectors of maximal variance. To start, it is best either to set no parameters (so all components are computed) or to set a number of components that is at least half the number of features (so there are enough components to see the general trend in variability).
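A rough sketch of this first PCA pass follows. The data is randomly generated for illustration, and the 85% cumulative-variance threshold used to pick `n_keep` is an assumption of this example, not a value from the project.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only): 200 samples, 6 features,
# with feature 1 made strongly correlated with feature 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X[:, 1] = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

X_scaled = StandardScaler().fit_transform(X)

# No parameters set, so all components are computed.
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Cumulative share of variance explained by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components reaching the (assumed) 85% threshold.
n_keep = int(np.searchsorted(cumulative, 0.85) + 1)
```

Inspecting `cumulative` (for example in a scree plot) shows how quickly the variance accumulates, which can inform how many components to retain when refitting PCA.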