Data Science Day 5 – Feature Selection

In my prior post, we introduced the CRISP-DM methodology as a broad process to guide us on our adventures in machine learning. We then continued this somewhat random walk by looking at several common techniques for handling or wrangling data in the data preparation step. In this post, we will continue this little adventure by discussing another facet of the data preparation step: feature selection.

The Curse of Dimensionality

In your various travels through the world of data, you’ve probably been given the advice that the more data you can throw at a problem, the better off you will be. Well, in most cases this is true, except when it’s not…let me explain.

The curse of dimensionality, within the context of machine learning, refers to the phenomenon in which highly dimensional data (a lot of features/attributes/columns) results in data sparsity that many machine learning models are ill equipped to handle. This ill fit can show up either as a poorly performing model (i.e. poor accuracy or poor generalization/overfitting) or as a model that is computationally too expensive in terms of either resources or time.
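To make the sparsity idea a little more concrete, here is a minimal sketch of my own (not part of the original walkthrough). It scatters random points in a unit hypercube and measures how much farther the farthest point is from a query point than the nearest one. The function name distance_contrast, the point count and the list of dimensions are all assumptions picked for the demo.

```python
import numpy as np

def distance_contrast(n_dims, n_points=200, seed=0):
    """Relative gap between the farthest and nearest neighbor of a random query point."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))   # uniform points in the unit hypercube
    query = rng.random(n_dims)                # one random query point
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# Compare the contrast for a handful of dimensionalities
contrasts = {d: distance_contrast(d) for d in (2, 10, 100, 1000)}
for n_dims, contrast in contrasts.items():
    print("%4d dimensions: relative contrast = %.3f" % (n_dims, contrast))
```

With the same number of points, the contrast tends to shrink as dimensions are added: every point starts to look roughly equally far away, which is one reason distance-based models struggle with very wide data.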

When we encounter this issue, we have two options:

  • Find a different model that is more suited to our highly dimensional data
  • Reduce the dimensionality of our data

As you might have guessed, this post is about the latter: we will explore feature selection as a means to reduce dimensionality by dialing in on the important features within our dataset. Note that a second option called feature extraction, in which we reduce dimensionality by reshaping our existing data, can also be used and will be the subject of the next post.

Be more selective

While there may be value in all the data within our dataset, typically some features or dimensions are going to be more predictive than others. In this regard, it may be possible to train our model using only those most predictive features thereby reducing the dimensionality and complexity of our data. So how do we go about identifying or “selecting” these features?


If you are like me and have been away from school for more than just a few years, it may be useful to start our discussion by doing a small review. Recalling from that Statistics class you took, correlation is a measure of the relationship between two variables and is represented as a value or coefficient between -1 and 1.

Correlation measures how two variables move together. A correlation value of zero indicates no linear relationship between the variables. A correlation value of 1 indicates a perfectly positive relationship, meaning the two values stay in sync and either grow or shrink together. A correlation value of -1 indicates a perfectly negative relationship, meaning an inverse relationship exists: as one variable grows, the other shrinks and vice versa.
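As a quick refresher in code, the sketch below computes the correlation coefficient for three made-up relationships using NumPy. The helper name pearson and the sample values are my own assumptions for illustration.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]

x = np.arange(-5, 6, dtype=float)

r_positive = pearson(x, 2 * x + 1)   # y grows in lockstep with x
r_negative = pearson(x, -3 * x)      # y shrinks as x grows
r_curved = pearson(x, x ** 2)        # y depends on x, but not linearly

print("positive: %.2f, negative: %.2f, curved: %.2f"
      % (r_positive, r_negative, r_curved))
```

The last case is worth a second look: the coefficient comes out at essentially zero even though y is completely determined by x, because correlation only picks up linear relationships.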

We can leverage correlation to help identify relationships for our predictive model and calculating a correlation matrix is trivial in most statistical tools. Let’s look at a quick example.

In the code below, we are again leaning on the Iris dataset and using Python and the Pandas library to solve for the correlation coefficient between each variable or feature in the dataset. Note that I have combined the feature and target data into a single data frame.

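import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Feature matrix and target vector
X = iris.data
y = iris.target

feature_names = iris['feature_names']

# Combine features and target into a single data frame
df = pd.DataFrame(X, columns=feature_names)
df['target'] = y

# Pairwise correlation coefficients between all columns
df.corr()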

The corr() function call produces a matrix of coefficients as seen below and as expected you can see that each variable is perfectly correlated to itself.


We can disregard most of the chart since the primary relationships we are interested in are those between our target and features. What do you notice? Straight away, it is clear that petal length and width have the strongest correlation with our target, while sepal length is somewhat weaker.
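If you would rather not eyeball the matrix, a small sketch like the one below pulls out just the target column and sorts the features by the strength of their correlation. Taking the absolute value is my own choice here (a strong negative correlation is just as useful as a strong positive one), and the variable names are assumptions.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["target"] = iris.target

# Correlation of every feature with the target (drop the target's self-correlation)
target_corr = df.corr()["target"].drop("target")

# Rank by strength of the relationship, ignoring its sign
ranking = target_corr.abs().sort_values(ascending=False)
print(ranking)
```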

As you can see this was fairly easy and while informative it’s rather unsophisticated in how we arrive at understanding feature importance. If you’ve made it this far in the series you know by now that there are more sophisticated methods at our disposal.

Swing in the trees

A second method that we can use to understand feature importance relative to our target leverages ensembles of decision trees. Decision tree ensembles such as random forests, when trained, produce a weight or score for each feature’s importance within the model. We can use this value to rank features much like we did with correlation.

In the code below, I used the ExtraTreesClassifier within scikit-learn to train a model against the iris dataset. Since this is an ensemble, referred to as a forest in this case, the classifier has multiple trees. The forest reports each feature’s importance averaged across its trees; we also take the standard deviation of the per-tree importances to show how much they vary, and sort the features from most to least important. We use the remaining code to print out the sorted results in both textual and graphical format.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesClassifier

forest = ExtraTreesClassifier(n_estimators=250,
                              random_state=0)
forest.fit(X, y)

# Mean importance per feature, plus the spread across individual trees
importances = forest.feature_importances_
std = np.std([tree.feature_importances_ for tree in forest.estimators_],
             axis=0)
# Feature indices sorted from most to least important
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(X.shape[1]):
    print("%d. %s (%f)" % (f + 1,
    feature_names[indices[f]], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xlim([-1, X.shape[1]])

# Label each bar with its feature name, in the same sorted order as the bars
plt.xticks(range(X.shape[1]), [feature_names[i] for i in indices],
           rotation=90, fontsize=10)
plt.show()


When we compare the results of this method with our correlation findings, we confirm that the ranking in terms of importance is identical.

Feature Elimination

So far, we have only discussed methods for identifying and ranking features by their importance. If our goal, however, is to reduce the dimensionality of our dataset by reducing the number of features, then at some point we need to draw the proverbial line in the sand. So we ask ourselves, where should that line be? What is the right number of features?

We could start by arbitrarily picking some “n” number of features and begin iteratively evaluating our model, either manually or by using a technique such as Recursive Feature Elimination (RFE), which I’ve done in the Jupyter notebook provided. While these may get us going, we can lean again on the scikit-learn library and use a method known as Recursive Feature Elimination with Cross Validation (RFECV) to get a more intelligent answer.
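For a feel of the plain RFE approach, here is a minimal sketch using scikit-learn’s RFE class on the iris data. This is not the notebook code; keeping two features is an arbitrary “n” that I picked for illustration, and the variable names are assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

iris = load_iris()

# Repeatedly fit the model and drop the weakest feature until two remain
selector = RFE(estimator=SVC(kernel="linear"), n_features_to_select=2, step=1)
selector.fit(iris.data, iris.target)

# support_ flags the kept features; ranking_ is 1 for kept features,
# with higher numbers for features that were eliminated earlier
selected = [name for name, keep in zip(iris.feature_names, selector.support_) if keep]
print("Selected features:", selected)
print("Ranking:", selector.ranking_)
```

The catch is that we had to guess the number of features up front, which is exactly the question we are trying to answer.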

Using this method, we will, as the name implies, recursively evaluate a defined model against our dataset using stratified cross-validation folds. The scoring criterion is specified by us and is evaluated for each progressively smaller feature subset as the weakest features are eliminated. From the results, we identify the optimal number of meaningful features.
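The “stratified” part simply means that each fold keeps roughly the same class proportions as the full dataset. The small sketch below shows this with a deliberately imbalanced set of made-up labels; X_demo and y_demo are hypothetical names and values used only for this illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels: six samples of class 0, four of class 1, two of class 2
y_demo = np.array([0] * 6 + [1] * 4 + [2] * 2)
X_demo = np.zeros((len(y_demo), 1))  # placeholder features, only the labels matter here

skf = StratifiedKFold(n_splits=2)

# Count how many samples of each class land in every test fold
fold_class_counts = [
    np.bincount(y_demo[test_idx], minlength=3).tolist()
    for _, test_idx in skf.split(X_demo, y_demo)
]
print(fold_class_counts)
```

Each fold ends up with the same 3:2:1 class mix as the full label set, so the score for a fold is not skewed by a class going missing.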

In the code below, we are using a linear support vector machine model to recursively evaluate our feature set, optimizing for accuracy.

from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV

# Create the RFE object and compute a cross-validated score.
svc = SVC(kernel="linear")

# Two stratified folds, scored on accuracy
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(n_splits=2),
              scoring='accuracy')
rfecv.fit(X, y)

print("Optimal number of features : %d" % rfecv.n_features_)


Unsurprisingly, since we are using a toy dataset for our demo, the results indicate that all four of our features should be included in our model. To show the overall improvement in accuracy (our stated scoring criteria), we could plot the results.

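import matplotlib.pyplot as plt

# Plot number of features vs. cross-validation scores
# Mean cross-validated score for each number of features tried
mean_scores = rfecv.cv_results_["mean_test_score"]
plt.xlabel("# of features")
plt.ylabel("Cross validation score")
plt.plot(range(1, len(mean_scores) + 1), mean_scores)
plt.show()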


We can get a better sense of this function’s value if we use a generated dataset. The next snippet of code does just that, and although the generated data is nonsensical, I hope that it gives you a better feel for what this function is accomplishing.

from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn.datasets import make_classification

# Use 5 informative features
X, y = make_classification(n_samples=1000, n_features=25, n_informative=5,
                           n_redundant=2, n_repeated=0, n_classes=8,
                           n_clusters_per_class=1, random_state=0)

svm = SVC(kernel="linear")
rfecv = RFECV(estimator=svm, step=1, cv=StratifiedKFold(n_splits=2),
              scoring='accuracy')
rfecv.fit(X, y)

print("Optimal number of features : %d" % rfecv.n_features_)



In this post we discussed feature selection. We began by highlighting a couple of different techniques for ranking features by importance including correlation and using decision trees. We also looked at using recursive feature elimination as a way to intelligently trim our feature set. Keep in mind that this post is not an exhaustive discussion on all things feature selection and is meant only as an introduction.

In the next post, we will dive into a different technique that we can also use to reduce the dimensionality of our data, feature extraction.

Till next time,


