Between fending off bad food and crisscrossing time zones, the wheels have come off this series schedule. At any rate… in the last post, the fifth in this blog series, we looked at several techniques for ranking feature importance and then at one method for subsequently doing feature selection: recursive feature elimination (RFE). We used these feature selection methods to address highly dimensional datasets, simplifying the data in the hope of improving our machine learning model. In this post, we will wrap up data preparation by looking at another way to handle highly dimensional data, through a process known as feature extraction or, more accurately, dimensionality reduction.
Feature Selection vs Dimensionality Reduction
Before diving in, it’s worth spending a few moments defining some context. We spent the prior post discussing feature selection, and for many this probably seems relatively intuitive. If we think of a dataset with many columns and rows, feature selection is all about reducing the number of columns by simply removing them from the dataset. The removed data is lost, in effect, since it is not included when we go through our model training process. We attempt to minimize this loss by selecting the most meaningful features, as we saw previously.
Dimensionality reduction, on the other hand, is more of a compression technique, so to speak: it seeks to compress the data while losing as little information as possible. Continuing with our analogy, we reduce the number of features, or columns, in our data by combining or projecting one or more columns into a single new column. This is, in essence, a transformation applied to your data, and as we will see shortly there are two common methods used to accomplish it.
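To make “projecting columns into a single new column” concrete, here is a minimal sketch using a hypothetical toy matrix (not one of the datasets used later in this post). The projection direction is hand-picked for illustration; PCA’s job is to learn such directions from the data.

```python
import numpy as np

# Toy data: 5 samples, 2 correlated features (hypothetical values)
X = np.array([[1.0, 2.1],
              [2.0, 3.9],
              [3.0, 6.2],
              [4.0, 7.8],
              [5.0, 10.1]])

# A projection direction (unit vector); PCA would learn this from the data
w = np.array([1.0, 2.0])
w = w / np.linalg.norm(w)

# Project both columns onto the direction -> one new column
X_new = X @ w
print(X_new.shape)
```

The two original columns are combined into one, yet the new column still preserves the ordering and spread of the samples.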
Principal Component Analysis (PCA)
PCA is an unsupervised reduction technique that finds new axes that maximize the variance within a dataset. Don’t worry if your head is spinning and the math nerds around you are giggling. Let’s look at an example to better illustrate what PCA does and how it works. We will look at both a linear and a non-linear dataset, illustrated below.
Since we are interested in doing classification on these data sets to predict class and most methods look to linearly separate data for this purpose the ideal outcome of any technique applied to the data would be to enhance the divisions of the data to make it more readily separable. Now obviously these are neither big nor particularly complex in nature but let’s see if we can use PCA to reshape our datasets.
|Linear Data Set|
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

X, y = make_classification(n_features=6, n_redundant=0, n_informative=6,
                           random_state=1, n_clusters_per_class=1, n_classes=3)

plt.scatter(X[y==0, 0], X[y==0, 1], color='red', alpha=0.5)
plt.scatter(X[y==1, 0], X[y==1, 1], color='blue', alpha=0.5)
plt.scatter(X[y==2, 0], X[y==2, 1], color='green', alpha=0.5)
plt.title('Linear')
plt.show()
|Non-Linear Data Set|
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles

X, y = make_circles(n_samples=1000, random_state=123, noise=0.1, factor=0.2)

plt.figure(figsize=(8, 6))
plt.scatter(X[y==0, 0], X[y==0, 1], color='red', alpha=0.5)
plt.scatter(X[y==1, 0], X[y==1, 1], color='blue', alpha=0.5)
plt.title('Concentric circles')
plt.show()
Beginning with the linear example, we created a “highly” dimensional set with six meaningful features. The initial plot is a little bit of a mess, so we will apply PCA to the dataset and reduce the dimensionality to just two features. The code sample below uses scikit-learn and then plots the results. As you can see, we now have a much simpler dataset that is clearly easier to separate.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

plt.scatter(X_pca[y==0, 0], X_pca[y==0, 1], color='red', alpha=0.5)
plt.scatter(X_pca[y==1, 0], X_pca[y==1, 1], color='blue', alpha=0.5)
plt.scatter(X_pca[y==2, 0], X_pca[y==2, 1], color='green', alpha=0.5)
plt.title('Linear PCA')
plt.show()
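If you are curious what fit_transform is doing under the hood, here is a hedged sketch of “classic” PCA done by hand: center the data, compute the covariance matrix, and project onto the eigenvectors with the largest eigenvalues. This is one standard derivation (scikit-learn actually uses an SVD internally, which gives the same components up to a sign flip per axis).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, y = make_classification(n_features=6, n_redundant=0, n_informative=6,
                           random_state=1, n_clusters_per_class=1, n_classes=3)

# Manual PCA: center, covariance, top-2 eigenvectors
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
top2 = eigvecs[:, np.argsort(eigvals)[::-1][:2]]
X_manual = X_centered @ top2

# scikit-learn's result matches up to the sign of each component
X_sklearn = PCA(n_components=2).fit_transform(X)
print(np.allclose(np.abs(X_manual), np.abs(X_sklearn), atol=1e-6))
```

The sign ambiguity is harmless: flipping an axis does not change distances between points, so downstream models are unaffected.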
The next example is more challenging and is representative of a non-linear dataset. The concentric circles are not linearly separable, and while we could apply a non-linear technique, we can also use PCA. When dealing with non-linear data, if we just apply standard PCA, the results are interesting but still not linearly separable. Instead, we will use a special form of PCA that leverages the same kernel trick we saw applied to support vector machines way back in the second post. This implementation is called KernelPCA in scikit-learn, and the code and resulting plot are provided below.
|Linear Techniques against Non-Linear Data|
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Plot only the first principal component, offsetting the classes vertically
plt.scatter(X_pca[y==0, 0], np.zeros((500, 1)) + 0.1, color='red', alpha=0.5)
plt.scatter(X_pca[y==1, 0], np.zeros((500, 1)) - 0.1, color='blue', alpha=0.5)
plt.ylim([-15, 15])
plt.title('Linear PCA on Non-Linear Data')
plt.show()
|Kernel PCA on Non-Linear Data|
import matplotlib.pyplot as plt
from sklearn.decomposition import KernelPCA

kpca = KernelPCA(n_components=2, kernel='rbf', gamma=15)
X_kpca = kpca.fit_transform(X)

plt.scatter(X_kpca[y==0, 0], X_kpca[y==0, 1], color='red', alpha=0.5)
plt.scatter(X_kpca[y==1, 0], X_kpca[y==1, 1], color='blue', alpha=0.5)
plt.title('Kernel PCA')
plt.show()
Note that we’ve transformed a non-linear dataset into a linearly separable one and can now use linear machine learning techniques like logistic regression, support vector machines, or naïve Bayes.
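As a quick sanity check of that claim, here is a sketch comparing logistic regression on the raw circles against the same model after the kernel PCA transform. The cross-validation setup and thresholds here are my own illustration, not from the post; for simplicity the transform is fit on the full dataset, which is fine for a demo but would leak information in a real evaluation.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_circles(n_samples=1000, random_state=123, noise=0.1, factor=0.2)

# Logistic regression on the raw circles: a line can't split concentric rings
raw_score = cross_val_score(LogisticRegression(), X, y, cv=5).mean()

# The same linear model after the kernel PCA transform
X_kpca = KernelPCA(n_components=2, kernel='rbf', gamma=15).fit_transform(X)
kpca_score = cross_val_score(LogisticRegression(), X_kpca, y, cv=5).mean()

print(f"raw: {raw_score:.2f}, kernel PCA: {kpca_score:.2f}")
```

The raw accuracy hovers near chance, while the transformed features let the linear model separate the classes almost perfectly.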
The one point we haven’t discussed is the number of components we should target during the reduction process. As you’ve probably caught on, the samples somewhat arbitrarily use two, even though in the case of the linear dataset we have six meaningful features. We can visually plot the amount of explained variance using PCA, as seen in the code below. The resulting plot allows us to select a number of components that explains most of the variance within our dataset.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

pca = PCA()
X_pca = pca.fit_transform(X)

plt.figure(1, figsize=(4, 3))
plt.clf()
plt.plot(pca.explained_variance_, linewidth=2)
plt.axis('tight')
plt.xlabel('n_components')
plt.ylabel('explained variance')
plt.show()
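If you’d rather pick the number programmatically than eyeball a plot, scikit-learn exposes explained_variance_ratio_, and PCA also accepts a variance fraction directly as n_components. A small sketch on the linear dataset (the 90% threshold is my own illustrative choice):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

X, y = make_classification(n_features=6, n_redundant=0, n_informative=6,
                           random_state=1, n_clusters_per_class=1, n_classes=3)

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components covering at least 90% of the variance
n_components = int(np.searchsorted(cumulative, 0.90) + 1)
print(n_components)

# PCA can also be given the variance fraction directly
X_reduced = PCA(n_components=0.90).fit_transform(X)
print(X_reduced.shape)
```

Both approaches land on the same component count, so you can skip the manual cumulative sum once you’ve settled on a threshold.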
Linear Discriminant Analysis (LDA)
LDA, unlike the PCA discussed in the previous section, is a supervised method that seeks to optimize class separability. Because it is supervised, it uses the class labels during the training step to do its optimization. Note that LDA is a linear method, so it cannot (or more accurately… should not) be applied to non-linear datasets, and it can produce at most one fewer component than the number of classes. That being said, let’s look at a code example applied to our linear dataset.
import matplotlib.pyplot as plt
# sklearn.lda.LDA in older scikit-learn releases is now LinearDiscriminantAnalysis
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

plt.scatter(X_lda[y==0, 0], X_lda[y==0, 1], color='red', alpha=0.5)
plt.scatter(X_lda[y==1, 0], X_lda[y==1, 1], color='blue', alpha=0.5)
plt.scatter(X_lda[y==2, 0], X_lda[y==2, 1], color='green', alpha=0.5)
plt.title('Linear LDA')
plt.show()
In the preceding code, we used our sample data to train the model and then transformed, or reduced, our feature set from six down to two; both steps were accomplished in the single fit_transform call. When we plot the result, we can observe at first glance that LDA has produced better linear separability among the three classes of data.
In this post, we looked at using Principal Component Analysis and Linear Discriminant Analysis for reducing the dimensionality of both linear and non-linear data. The one question we left unanswered, however, is: given the two methods, which is better? Unfortunately, as you might have guessed, “better” is relative and there is no clear answer. As with most things in this space, each is useful in different situations, and sometimes they are even used together. I’d invite you to explore both in more depth if you are so inclined.
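To give the “sometimes used together” remark a concrete shape, here is a hedged sketch of one common pattern: PCA first as an unsupervised compression step, then LDA to sharpen class separation, feeding a linear classifier. The pipeline stages and component counts here are illustrative choices, not a recommendation from this post.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_features=6, n_redundant=0, n_informative=6,
                           random_state=1, n_clusters_per_class=1, n_classes=3)

# PCA compresses unsupervised; LDA then uses labels (max n_classes - 1 = 2)
pipe = Pipeline([
    ('pca', PCA(n_components=4)),
    ('lda', LinearDiscriminantAnalysis(n_components=2)),
    ('clf', LogisticRegression()),
])
score = cross_val_score(pipe, X, y, cv=5).mean()
print(f"{score:.2f}")
```

Wrapping the stages in a Pipeline keeps both transforms fitted only on each training fold, avoiding the leakage a standalone fit_transform would cause during cross-validation.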
In our next post on this random walk through machine learning, we are going to leave behind data preparation and move on to modeling or more accurately model selection. We will look at how models are evaluated and some of the common measures you can use to determine how well your selected model performs.
Till next time!