Data Science Day #8 – Model Selection

In our last post, we discussed the various metrics we have at our disposal for evaluating a classification model. While that evaluation process leads us down the road toward model selection, we first need to consider optimizing, or tuning, our model for the task at hand.

Model Parameters

Over the course of this series we introduced a number of different algorithms (though by no means an exhaustive list) that can be used for both binary and multiclass classification tasks. All of these models rely on parameters to do their jobs. Some parameters are learned during the training process and do not require any special handling, while others can be tuned to change the behavior of the model. The latter are often referred to as hyperparameters and will be the focus of this post.
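To make that distinction concrete, here is a minimal sketch using logistic regression and the bundled iris dataset purely for illustration: the regularization strength C is a hyperparameter we choose up front, while the coefficients stored in coef_ are model parameters learned during training.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# C is a hyperparameter -- we choose it before training
model = LogisticRegression(C=1.0, max_iter=1000)

# the coefficients are model parameters -- learned during fit()
model.fit(X, y)
print(model.coef_)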

Hyper-what?!?

My highly uneducated and unofficial guess is that many of you, like me, haven't the slightest clue what a hyperparameter is or what impact it will have on a given algorithm. Couple that with the fact that different algorithms have different hyperparameters, and it's enough to make your head spin. Fear not, there is a light at the end of the tunnel and we will arrive there soon. Before we do, however, let's consider some of the hyperparameters you are likely to encounter (the snippet after the list shows where each of these lives in scikit-learn):

  • Logistic Regression – Regularization (often referred to as ‘C’)
  • SVM – Kernel, Gamma and Regularization
  • Decision Trees – Depth
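For reference, these hyperparameters map to constructor arguments on the corresponding scikit-learn estimators; the values below are purely illustrative, not recommendations.

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# regularization strength for logistic regression
logreg = LogisticRegression(C=1.0)

# kernel, gamma and regularization for an SVM
svm_clf = SVC(kernel='rbf', gamma=0.1, C=1.0)

# maximum depth for a decision tree
tree = DecisionTreeClassifier(max_depth=3)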

Optimal Configuration and Grid Search

To date, we’ve largely ignored algorithm hyperparameters by accepting their default values, and without understanding the inner workings of an algorithm it’s difficult to find the optimal configuration through anything better than iterative guessing. Enter hyperparameter tuning in the form of a technique called grid search.

The grid search technique is simply a brute-force method in which we exhaustively iterate over a collection of possible parameter values to find the optimal configuration for a model. Once all of the possible combinations have been considered, the “best” model is returned. As you can imagine, this is an extremely computationally expensive process, and fortunately many machine learning libraries have built-in implementations of the technique. Conveniently enough, the scikit-learn implementation is called GridSearchCV.
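To see why this counts as brute force, here is a rough sketch of what grid search does behind the scenes, assuming X_train and y_train already exist from an earlier train/test split. GridSearchCV does essentially this for us, but scores each candidate with cross validation instead of the training data.

from itertools import product

from sklearn.svm import SVC

best_score, best_params = 0.0, None

# try every combination of kernel and C and keep the best performer
for kernel, C in product(['linear', 'rbf'], [0.01, 0.1, 1.0, 10.0]):
    model = SVC(kernel=kernel, C=C)
    model.fit(X_train, y_train)
    score = model.score(X_train, y_train)  # GridSearchCV would use cross validation here
    if score > best_score:
        best_score, best_params = score, {'kernel': kernel, 'C': C}

print(best_params, best_score)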

To use this technique we need to think through a couple of things up front. First, and most importantly, is a scoring metric. Going back to our prior post, the choice of scoring metric is largely driven by the task at hand; if we choose something like accuracy, then GridSearchCV evaluates all the possible combinations of parameter values and returns the “best” model, meaning the one with the highest accuracy score.
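If accuracy isn’t the right yardstick for your problem (say, with imbalanced classes on a binary task), the same search can optimize a different metric. A quick sketch with an illustrative estimator and grid:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'kernel': ['linear', 'rbf'], 'C': [0.1, 1.0, 10.0]}

# optimize for F1 instead of accuracy
clf_f1 = GridSearchCV(SVC(), param_grid, scoring='f1', cv=5)

# or rank candidate models by area under the ROC curve
clf_auc = GridSearchCV(SVC(), param_grid, scoring='roc_auc', cv=5)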

Next we need to specify a list or range of possible parameter values that we want to evaluate. For obvious reasons these vary between algorithms, and if you are not sure where to start, your friendly search engine or the documentation for the ML library of your choice is going to be your friend.
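As an example of how the grids differ by algorithm, here are a few hypothetical starting points; the exact values are just guesses to get you going and should be tuned to your data.

# possible regularization strengths for logistic regression
logreg_grid = {'C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]}

# kernel, gamma and regularization candidates for an SVM
svm_grid = {'kernel': ['linear', 'rbf'],
            'gamma': [0.001, 0.01, 0.1, 1.0],
            'C': [0.1, 1.0, 10.0, 100.0]}

# tree depths to consider for a decision tree
tree_grid = {'max_depth': [2, 3, 5, 10, None]}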

With these two pieces in place we can use the code provided below to run our grid search to find the optimal model configuration.

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC

# candidate hyperparameter values to search over
parameters = {'kernel': ('linear', 'rbf'),
              'C': [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]}
svc = SVC()

# exhaustive search over the grid using 5-fold cross validation and accuracy scoring
clf = GridSearchCV(estimator=svc,
                   param_grid=parameters,
                   scoring='accuracy',
                   cv=5,
                   n_jobs=-1)

clf.fit(X_train, y_train)

print('Best Config: %s' % clf.best_params_)
print('Best Score: %.3f' % clf.best_score_)
Best Config: {'kernel': 'linear', 'C': 1.0}
Best Score: 0.971

The preceding code outputs the optimal hyperparameter configuration as well as the highest cross-validation score for the best-performing estimator. To see how our model performs on unseen data, we can then train this optimal model and score it against our test data, as seen below. The results show that in this particular example our model generalizes pretty well and actually achieves a slightly higher accuracy score on our test dataset.

# retrieve the best-performing estimator, retrain it, then score it on the held-out test set
best = clf.best_estimator_
best.fit(X_train, y_train)

print('Score (Accuracy): %.3f' % best.score(X_test, y_test))
Score (Accuracy): 0.978
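Since we imported classification_report above, we could also take a closer look at per-class precision and recall on the test set; a quick sketch assuming the same X_test and y_test:

from sklearn.metrics import classification_report

# per-class precision, recall and F1 on the held-out test data
y_pred = best.predict(X_test)
print(classification_report(y_test, y_pred))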

Nested Cross Validation

We originally discussed cross validation, and more specifically k-fold cross validation, in the prior post on model evaluation. We will extend that concept here in the context of model selection, where we are interested in comparing one model to another while minimizing bias. Nested cross validation uses k-fold cross validation both in an inner loop for tuning our hyperparameters and in an outer loop for evaluating the resulting model, producing a score (in our case an accuracy score) for each outer fold. Let’s look at an example.

import numpy as np

from sklearn.model_selection import cross_val_score

# outer loop: 5-fold cross validation wrapped around the grid search (inner loop) defined above
cv_scores = cross_val_score(clf, X, y, scoring='accuracy', cv=5)

print(cv_scores)
print('Accuracy: %.3f +/- %.3f' % (np.mean(cv_scores), np.std(cv_scores)))
[ 0.96666667  1.          0.9         0.96666667  1.        ]
Accuracy: 0.967 +/- 0.037

A quick look at the results using 5 folds shows that our average model accuracy is 96.7% with a standard deviation of 3.7%. Looking at the individual scores, we see that our best estimator was perfect on two folds and had a low score of 90% on another. If we were evaluating multiple models, for example a logistic regression or a decision tree, we could compare these scores to determine which model is likely to provide the best results, as sketched below.
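To make that comparison concrete, here is a sketch of what evaluating a second model with the same nested setup might look like; the logistic regression grid is just an illustrative guess, and X and y are the same data used above.

import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# inner loop: grid search over logistic regression hyperparameters
lr_grid = {'C': [0.01, 0.1, 1.0, 10.0, 100.0]}
lr_clf = GridSearchCV(LogisticRegression(max_iter=1000), lr_grid, scoring='accuracy', cv=5)

# outer loop: 5-fold cross validation of the tuned model
lr_scores = cross_val_score(lr_clf, X, y, scoring='accuracy', cv=5)

print('Logistic Regression Accuracy: %.3f +/- %.3f' % (np.mean(lr_scores), np.std(lr_scores)))

Comparing this mean and standard deviation against the SVM results above gives a lower-bias basis for choosing between the two models.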

Wrap-Up

In this post we discussed model selection through brute-force hyperparameter tuning using grid search, along with nested cross validation as a means to minimize bias during evaluation. With that, we’ve covered the basics within the context of a classification problem. In our next post we will discuss a means of tying all of these concepts together in a streamlined and repeatable manner using pipelines.


Till next time!

Chris
