Data Science Day #8 – Model Selection

In our last post, we discussed multiple ways, or more accurately metrics, that we have at our disposal for evaluating a classification model. While that evaluation process leads us down the road toward model selection, we first need to consider optimizing, or tuning, our model for the task at hand.

Model Parameters

Over the course of this series we introduced a number (though not an exhaustive list) of different model algorithms that can be used for both binary and multiclass classification tasks. All of these models require parameters to do their jobs. Some parameters are learned during the training process and do not require any special handling, while others can be tuned to affect the behavior of the model. These tunable parameters are often referred to as hyperparameters and will be the focus of this post.

Hyper-what?!?

My highly uneducated and unofficial guess is that many of you, like me, haven’t the slightest clue what a hyperparameter is or what impact it will have on a given algorithm. Couple that with the fact that different algorithms have different hyperparameters and it’s enough to make your head spin. Fear not, there is a light at the end of the tunnel and we will arrive there soon. Before we do, however, let’s consider some of the hyperparameters you are likely to encounter (with a short sketch after the list showing where each one appears in code).

  • Logistic Regression – Regularization (often referred to as ‘C’)
  • SVM – Kernel, Gamma and Regularization
  • Decision Trees – Depth
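
To make this a bit more concrete, here is a minimal sketch (not from the original post) of where these hyperparameters show up when the corresponding scikit-learn estimators are created; the values themselves are arbitrary and only meant to show which knobs exist.

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Regularization strength for logistic regression (smaller C = stronger regularization)
lr = LogisticRegression(C=1.0)

# Kernel, gamma and regularization for a support vector machine
svc = SVC(kernel='rbf', gamma=0.1, C=10.0)

# Maximum depth for a decision tree
tree = DecisionTreeClassifier(max_depth=3)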

Optimal Configuration and Grid Search

To date, we’ve largely ignored algorithm hyperparameters by accepting their default values, and without understanding the inner workings of an algorithm it’s difficult to guess the optimal configuration without iterative, somewhat random guessing. Enter hyperparameter tuning in the form of a technique called grid search.

The grid search technique is simply a brute-force method where we exhaustively iterate over a collection of possible parameter values to find the optimal configuration for a model. Once all the possible combinations have been considered, the “best” model is returned. As you can imagine this is an extremely computationally expensive process, and fortunately many machine learning libraries have built-in implementations of this technique. Conveniently enough, the scikit-learn implementation is called GridSearchCV.

To use this technique we need to think through a couple of different things up front. First, and most importantly, is a scoring metric. Going back to our prior post, the scoring metric is largely determined by the task at hand; if we choose something like accuracy, GridSearchCV will evaluate all the possible combinations of parameter values and return the “best” model with the highest accuracy score.

Next we need to specify a list or range of possible parameter values that we want to evaluate. For obvious reasons these may vary between algorithms, and if you are not sure where to start, your friendly inter-webs search engine or the documentation for the ML library of your choice is going to be your friend.

With these two pieces in place we can use the code provided below to run our grid search to find the optimal model configuration.

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC

# Candidate hyperparameter values to search over
parameters = {'kernel': ('linear', 'rbf'),
              'C': [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]}
svc = SVC()

# Exhaustively search the grid using 5-fold cross validation and accuracy scoring
clf = GridSearchCV(estimator=svc,
                   param_grid=parameters,
                   scoring='accuracy',
                   cv=5,
                   n_jobs=-1)

clf.fit(X_train, y_train)

print('Best Config: %s' % clf.best_params_)
print('Best Score: %.3f' % clf.best_score_)
Best Config: {'kernel': 'linear', 'C': 1.0}
Best Score: 0.971

The preceding code outputs the optimal hyperparameter configuration as well as the highest score for the best performing estimator. To see how our model performs on unseen data, we can then train this optimal model and score it against our test data as seen below. The results show that in this particular example our model generalizes pretty well and actually has a slightly higher accuracy score against our test dataset.

best = clf.best_estimator_
best.fit(X_train, y_train)

print('Score (Accuracy): %.3f' % best.score( X_test, y_test))
Score (Accuracy): 0.978

Nested Cross Validation

We originally discussed cross validation, and more specifically k-fold cross validation, in the prior post on model evaluation. We will extend that concept here in the context of model selection, when we are interested in comparing one model to another while minimizing bias. This technique allows us to use k-fold cross validation both for tuning our hyperparameters (the inner loop) and for the subsequent evaluation of the tuned model (the outer loop). The result is a score (in our case an accuracy score) for each outer fold. Let’s look at an example.

from sklearn.model_selection import cross_val_score
import numpy as np

# Outer 5-fold cross validation around the grid-searched estimator
cv_scores = cross_val_score(clf, X, y, scoring='accuracy', cv=5)

print(cv_scores)
print('Accuracy: %.3f +/- %.3f' % (np.mean(cv_scores), np.std(cv_scores)))
[ 0.96666667  1.          0.9         0.96666667  1.        ]
Accuracy: 0.967 +/- 0.037

A quick look at the results using 5 folds shows that our average model accuracy is 96.7% with a standard deviation of 3.7%. Looking at the detailed scores, we see that our best estimator was perfect on two folds and had a low score of 90% on another. If we were evaluating multiple models, for example a logistic regression or a decision tree, we could compare these scores to determine which model is likely to provide the best results.
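
As a rough sketch of what that comparison might look like (the candidate models and parameter grids below are purely illustrative, not from the original example), each candidate gets its own inner grid search and is then scored with the same outer cross validation:

from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
import numpy as np

# Each candidate is wrapped in its own (illustrative) inner grid search
candidates = {
    'Logistic Regression': GridSearchCV(LogisticRegression(),
                                        {'C': [0.01, 0.1, 1.0, 10.0]},
                                        scoring='accuracy', cv=2),
    'Decision Tree': GridSearchCV(DecisionTreeClassifier(),
                                  {'max_depth': [2, 3, 4, 5]},
                                  scoring='accuracy', cv=2)
}

# Outer loop: 5-fold cross validation around each tuned candidate
for name, tuned in candidates.items():
    scores = cross_val_score(tuned, X, y, scoring='accuracy', cv=5)
    print('%s: %.3f +/- %.3f' % (name, np.mean(scores), np.std(scores)))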

Wrap-Up

In this post we discussed model selection through brute-force hyperparameter tuning using grid search, and nested cross validation as a means to minimize bias during evaluation. With that, we’ve covered the basics within the context of a classification problem. In our next post we will discuss a means for tying all these concepts together in a streamlined and repeatable manner using pipelines.


Till next time!

Chris

Data Science Day #7 – Model Evaluation

Outside of the introduction to some of the common algorithms or models used for a classification task, this series has focused on all the small, yet important, foundational things that need to be handled or considered prior to modeling and model selection. In this post we are going to drive the bus back in from left field and discuss model evaluation within the context of a supervised classification activity. Without further ado, let’s get started.

Getting Data

We’ve spent a number of posts talking about preparing data for optimal use in a machine learning activity, and there’s one more step that needs to be considered. Since we are going to be working with supervised machine learning techniques, our models are going to require sets of data for training. Likewise, since we will talk about model evaluation shortly, we will need data for that too. Further complicating matters is the need for validation data once we discuss model selection. So how do we go about dividing our data appropriately?

Believe it or not this is a relatively easy task. Accomplishing it is as simple as splitting the data randomly by some percentage, say using 80% for training and leaving 20% untouched for testing and evaluation. We could then take our training data and further carve out a set of validation data to help prevent us from overfitting our model during model selection. This is classically known as the holdout method; while the percentages vary, with 70/30 and 80/20 being common splits, it is widely used because of its simplicity.
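
A minimal sketch of the holdout approach using scikit-learn might look like the following (it assumes a feature matrix X and label vector y are already loaded; the split percentages are just the common choices mentioned above):

from sklearn.model_selection import train_test_split

# Hold out 20% of the data for final testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Optionally carve a validation set out of the remaining training data
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=0)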

An alternative to the holdout technique is known as cross validation. Cross validation has a few different implementations, with the most common known as k-fold cross validation. In k-fold cross validation the dataset is divided into k folds (for example 5), where each observation exists in only a single fold and each fold takes a turn both as part of the training set and as the validation set. To better illustrate the differences between the holdout method and k-fold cross validation, a diagram has been included below.

[Figure: holdout split vs. k-fold cross validation]
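
As a rough sketch of k-fold cross validation in scikit-learn (assuming X and y are NumPy arrays and some model object exists), each fold takes its turn as the validation set:

from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Each iteration yields a different train/validation split
for train_idx, val_idx in kf.split(X):
    X_tr, X_val = X[train_idx], X[val_idx]
    y_tr, y_val = y[train_idx], y[val_idx]
    # fit the model on X_tr/y_tr and evaluate it on X_val/y_val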

Before moving on, it’s important to note that the way you divide up your data can have a significant impact on model performance. Don’t shortcut yourself into potential issues like overfitting a model by reusing your training data as validation data during model selection, or overlooking small details like randomization when splitting up your data.

Model Evaluation

So we have our data and we’ve selected a model, or potentially models, that we suspect will work nicely with it. How do we evaluate the model’s actual performance? At first glance it’s simple, right? This is a classification activity, and regardless of whether it’s a binary or multiclass problem, the model either picked the right class or it didn’t. This task is often referred to as scoring and involves testing our chosen model on a set of unseen data to see if its predicted class matches our known label, or the ground truth.

Unfortunately, once we’ve trained, predicted, and scored our model, there are multiple ways or metrics we can use to measure how it performed. Let’s briefly look at some of the most common ways to evaluate model performance.

Confusion Matrix

The simplest method for evaluating a model’s performance at a classification task is known as the confusion matrix (pictured below). The confusion matrix for a binary classification task is made up of four quadrants, each containing a straight count of how our model’s predicted class aligns with our ground truth or known label. The four quadrants in this example matrix are:

  • True Positive (TP) – predict true, known true
  • False Negative (FN) – predict false, known true, sometimes called Type-2 error
  • False Positive (FP) – predict true, known false, sometimes called Type-1 error
  • True Negative (TN) – predict false, known false

To further illustrate how this works, let’s consider a simple binary classification task where we are trying to predict whether a credit card transaction is fraudulent or not. In this case, we are simply predicting either true or false, with true indicating a fraudulent transaction. In the example pictured below, our model correctly classified 92 transactions as fraudulent and another 104 as non-fraudulent. It misclassified 28 transactions as okay when they were in fact fraudulent, and 26 as fraudulent when they were in fact okay. Taken together this gives us a straightforward summary of how our selected model performed and will be the basis of many of the performance metrics we will discuss next.

[Figure: example confusion matrix for the credit card fraud model]


Performance Metrics

When I started this series my goal was to do the whole thing without invoking a single math formula. While my intentions were probably good, like many of the politicians filling the airwaves I’m going to backtrack and present some math to highlight the various derivations that can be made from our confusion matrix. Don’t worry, you won’t be tested on this stuff; it’s presented only for conceptual understanding, and all of these metrics are typically provided for you by ML libraries such as scikit-learn.

Error

Summarizes the percentage of observations that were misclassified by the model.

Error = (FP + FN) / (TP + TN + FP + FN)

Accuracy

Summarizes the percentage of observations that were correctly classified by the model.

Accuracy = (TP + TN) / (TP + TN + FP + FN) = 1 – Error

Precision

Can be calculated for either positive or negative results and is the proportion of true positives to all positive predictions.

Precision = TP / (TP + FP)

Recall

Also referred to as sensitivity, recall is the proportion of true positive classifications to all cases that are actually positive. Higher recall means that we are correctly classifying most positive cases; note that this metric ignores false positives. The equivalent metric calculated for negative results is called specificity.

Recall = TP / (TP + FN)
Specificity = TN / (TN + FP)

F1-Score

Measures overall accuracy by considering both precision and recall, weighted equally using the harmonic mean. The highest possible F1-score is one and the lowest is zero. This measure is typically used to compare one model to another.

F1 = 2 × (Precision × Recall) / (Precision + Recall)
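
To tie these formulas back to the credit card example above (TP = 92, FN = 28, FP = 26, TN = 104), here is a quick sketch of the arithmetic; the counts come from the example confusion matrix, not from a real model.

# Counts from the example confusion matrix (floats to avoid integer division)
TP, FN, FP, TN = 92.0, 28.0, 26.0, 104.0
total = TP + FN + FP + TN

error = (FP + FN) / total                              # share of misclassified observations
accuracy = (TP + TN) / total                           # share of correctly classified observations
precision = TP / (TP + FP)                             # of predicted positives, how many were right
recall = TP / (TP + FN)                                # of actual positives, how many we caught
specificity = TN / (TN + FP)                           # of actual negatives, how many we caught
f1 = 2 * (precision * recall) / (precision + recall)   # harmonic mean of precision and recall

print('Error: %.3f  Accuracy: %.3f' % (error, accuracy))
print('Precision: %.3f  Recall: %.3f  Specificity: %.3f  F1: %.3f'
      % (precision, recall, specificity, f1))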

Receiver Operating Characteristic (ROC) Curve

The ROC curve is a visual plot that compares the true-positive rate to the false-positive rate. In the plot below, note that a perfectly predictive model hugs the axis borders, while a model with no predictive value (i.e. random guessing) forms the diagonal. We can plot and compare multiple models to one another, and from these plots we can derive another measure, the area under the curve (AUC), to quantify the graphically presented data.

[Figure: ROC curves – a perfect model hugs the upper-left corner, random guessing forms the diagonal]
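
As a hedged sketch of how such a plot can be produced with scikit-learn (it assumes a fitted binary classifier clf that supports predict_proba, plus held-out X_test and y_test):

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Predicted probability of the positive class for each test observation
y_scores = clf.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, y_scores)
auc = roc_auc_score(y_test, y_scores)

plt.plot(fpr, tpr, label='Model (AUC = %.3f)' % auc)
plt.plot([0, 1], [0, 1], linestyle='--', label='Random guessing')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()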

Model Evaluation & Selection

Now that we’ve explored some of the tools we have at our disposal to evaluate the performance of our model, let’s look at a practical application of how these metrics can be leveraged in evaluating whether our model is useful or not. Continuing with our credit card fraud model, let’s review and interpret the example results for this binary classification task.

  • Accuracy – 97.7% – Very general, telling us that our model predicted the correct label for 97.7% of our samples. It does not inform us of whether the misses were primarily in one class or another. In our example this can be problematic if, for instance, we were extremely sensitive to fraud yet our model primarily missed fraudulent charges. Likewise, accuracy can be misleading in cases where we have significant class imbalance.
  • Consider a situation where 99.5% of transactions were okay and only 0.5% were fraudulent. If our model simply guessed that every transaction was okay it would be 99.5% accurate but would have missed every single fraudulent charge. For this reason, we typically want to consider some of the following metrics.
  • Precision – 74.4% – Of the transactions our model flagged as fraudulent, 74.4% actually were fraudulent. This would be useful if we were worried about having a model that was overly aggressive and favored a more conservative model to limit false alarms.
  • Sensitivity (Recall) – 95.5% – Our model correctly classified 95.5% of fraudulent transactions, meaning that 4.5% of fraudulent transactions were missed. Whereas our precision metric takes false positives into consideration, this metric is only concerned with truly fraudulent transactions. If our tolerance for fraud were very low, we would be interested in optimizing our model for this metric.
  • Specificity – 97% – Indicates that 97% of good transactions were correctly classified and that 3% of good transactions were misclassified as fraudulent. Very similar to precision in that optimizing this metric would lead to a more conservative model if we were overly concerned with sounding false alarms.

As you have probably surmised by this point, it is nearly if not totally impossible to build a model that performs well across all of these metrics. In the real world we typically focus on one area, and part of the model selection process involves tuning or optimizing our model for the task at hand. This tuning process is called hyperparameter tuning and will be the focus of our next post.

Metrics Example

Since this post has been code-free so far (and I’m itching for some code), I’ve included a brief sample that you can run to highlight how several of these metrics can be calculated using Python and scikit-learn. The example uses a generated dataset, with the emphasis placed on how the various metrics discussed to this point can be calculated.

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import classification_report

# Generate a synthetic binary classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, n_informative=5)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

classifier = LogisticRegression()
y_pred = classifier.fit(X_train, y_train).predict(X_test)

#Confusion Matrix
print(confusion_matrix(y_test, y_pred))

#Accuracy Score
print(accuracy_score(y_test, y_pred))

#Precision Score
print(precision_score(y_test, y_pred))

#Classification Report
print(classification_report(y_test, y_pred, target_names=['No', 'Yes']))


Wrap-Up

In this post we discussed the various facets of evaluating a classification model. From a straightforward strategy for dividing our experiment data to an introduction to some of the basic metrics that can be used to measure model performance, we’ve set the stage for model evaluation. In the next post, we will look at a strategy to tune our model using a hyperparameter tuning technique called grid search, optimizing our model for a given target metric.


Till next time!

Chris

Data Science Day #3 – Classification Models

In the prior post, we discussed model selection at a meta level. In this post, we will explore various modeling techniques with examples. So let’s go…

Getting your own Demo Environment

This entire series will be presented in Python using Jupyter Notebooks. I will also lean heavily on the scikit-learn library (http://scikit-learn.org/stable/). My intention is that once I get to a good break, I will revisit the series and provide parallel samples in R.

To follow along you will simply need access to a Jupyter Notebook. The good news is that this is easy and doesn’t require you to install, set-up or configure anything on your local machine if you use Microsoft Azure Machine Learning (Azure ML) studio. The Azure ML service is free to sign-up for and the workspace allows you to create both Jupyter Notebooks and of course machine learning experiments (we will talk about this later). To sign-up for a free account visit: https://studio.azureml.net/.

The Demo Data

We will be using the Iris dataset from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Iris) since it’s widely available and well known. The Iris dataset contains various features about flowers and is used to predict the class of flower based on those features. Best of all, using it is simple through the scikit-learn library.

For our demo, we will limit our examples to only the features that describe petal length and width as well as the label. The label is multiclass since there are three classes (setosa, versicolor, virginica) of flowers represented.

Using Python, scikit-learn provides easy access to the dataset, and the code to load the data and plot it on a graph is provided below:

from sklearn import datasets
import matplotlib.pyplot as plt

iris = datasets.load_iris()

X = iris.data[:, [2, 3]]  # only use petal length and width
y = iris.target

# Plot each of the three classes with its own color and marker
plt.scatter(X[y == 0, 0], X[y == 0, 1],
            color='red', marker='^', alpha=0.5)
plt.scatter(X[y == 1, 0], X[y == 1, 1],
            color='blue', marker='o', alpha=0.5)
plt.scatter(X[y == 2, 0], X[y == 2, 1],
            color='green', marker='x', alpha=0.5)

plt.show()

The resulting code generates a plot of our data with the petal length on the X-axis and petal width on the Y-axis.

[Figure: scatter plot of the Iris data by class]

The features of our data (petal length and width) are both numeric, and you can tell by the shape of the data that it is linearly separable, so we are at a pretty good place to get started. Before we jump in, though, we need to divide our dataset into training and testing datasets. Recall that this is necessary for supervised learning models since they must be trained.

To get a randomized split, we use the train_test_split function from scikit-learn as seen below using 70% of the data for training and 30% of the data for testing.

from sklearn.model_selection import train_test_split
import numpy as np

#Split into training (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
     test_size=0.3, random_state=0)

Next we will walk through five different models that can be used for classification. The demos are presented at a fairly high level and I will not get into parameter (or hyperparameter) tuning; we will discuss tuning techniques such as grid search in a future post.

It should also be noted that the plots presented were generated by a function Sebastian Raschka presented in his book Python Machine Learning. Out of respect, I will not reproduce his code here and will invite you to buy his book if you are interested in the snippet.

Without further ado, let’s get started.

NOTE: You can download a copy of the Python Notebook used for these demos HERE.

Classification Using Logistic Regression

The linear logistic regression algorithm that is built into the scikit-learn library is a probabilistic model that is highly flexible and can be implemented with different solvers. These solvers make it capable of handling everything from small binary classification problems to multiclass classification on large datasets.

One of the most important parameters in logistic regression is the regularization strength. This is represented by the parameter ‘C’ in the code sample below, and higher values of C indicate weaker regularization.

The default value of this parameter, when not specified, is 1.0. Running the experiment with the default value (no C specified) results in a model that is 68.9% accurate in classifying our data, which translates into 14 of the 45 test samples being classified incorrectly.

When we tweak the regularization parameter, say to 1000 as was done below, at the risk of overfitting the model to our training data (more on this later), we come up with much better results. Running the code below results in a model that is 97.8% accurate, with only 1 of the 45 test samples being misclassified.

from sklearn.linear_model import LogisticRegression

#Create a model and train it
lr = LogisticRegression(C=1000)
lr.fit(X_train, y_train)

pred = lr.predict(X_test)
print("Number of mislabeled points out of a total %d points : %d"
      % (X_test.shape[0], (y_test != pred).sum()))

#Score the model...should result in 97.8% accuracy
score = lr.score(X_test, y_test)
print('Model Accuracy: %.3f' % (score))

To further illustrate how the models are working, I plotted the decision regions for each trained model using Sebastian Raschka’s technique. This visual is helpful for pulling back the covers and understanding how each algorithm works by plotting many points between the min/max values of our X and Y axes, which correspond to the petal length and width. The shape of the decision regions, as you will see in subsequent examples, may be linear or non-linear.

[Figure: logistic regression decision regions]

Classification Using Naïve Bayes

The second model we will explore is also a probabilistic model, one noted for being extremely fast and providing good results even though it is relatively unsophisticated. With scikit-learn there are multiple Naïve Bayes algorithms available that you can experiment with; for our demo we will only look at one, Gaussian Naïve Bayes.

from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(X_train, y_train)

pred = nb.predict(X_test)
print("Number of mislabeled points out of a total %d points : %d"
      % (X_test.shape[0],(y_test != pred).sum()))

score = nb.score(X_test, y_test)
print('Model Accuracy: %.3f' % (score)) 

Running the sample code, in which the default configuration is used across the board, we wind up with a model that performs really well on our data. Out of the box, the Gaussian Naïve Bayes model matches our prior example with 97.8% accuracy. When we plot the decision regions, note the difference in the shapes that are produced.

[Figure: Gaussian Naïve Bayes decision regions]

Classification Using Support Vector Machines

In our third example, we will look at an implementation of a support vector machine (SVM). SVM models perform very similarly to logistic regression models, except that they can handle both linear and non-linear data using what’s known as the kernel trick. I have demonstrated this in the code sample below.

from sklearn.svm import SVC

svm = SVC(kernel='linear')
#svm = SVC(kernel='rbf')

svm.fit(X_train, y_train)

pred = svm.predict(X_test)
print("Number of mislabeled points out of a total %d points : %d"
      % (X_test.shape[0],(y_test != pred).sum()))

score = svm.score(X_test, y_test)
print('Model Accuracy: %.3f' % (score)) 

[Figure: SVM decision regions]

Classification Using Decision Trees

In our fourth example we implement a decision tree classifier. Recall from the prior blog post that a decision tree algorithm seeks to split the dataset using rules that maximize information gain and minimize entropy. One of the main parameters or arguments for your tree is the maximum depth (max_depth). This parameter sets the maximum number of levels the algorithm will consider before it stops. Setting max_depth to a higher value should make the model perform better on the training data, but you are likely to overfit the model and it will generally perform more poorly in the real world.

The implementation is straightforward and, like all the models we’ve looked at so far, it performs really well with our data (97.8% accuracy).

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(criterion='entropy', max_depth=3)
tree.fit(X_train, y_train)

pred = tree.predict(X_test)
print("Number of mislabeled points out of a total %d points : %d"
      % (X_test.shape[0],(y_test != pred).sum()))

score = tree.score(X_test, y_test)
print('Model Accuracy: %.3f' % (score)) 

When we plot the decision regions we see a pretty big difference, however: since the decision tree splits our data rather than attempting to fit a line, we wind up with boxes, as seen below.

[Figure: decision tree decision regions]

Classification Using K-Means Clustering

The final classification algorithm we will look at is significantly different from the prior examples. K-means clustering is a form of unsupervised learning. In the prior examples, we had to fit or train our supervised models on our labelled training data. We could then use the trained model to make predictions, which we did in the form of scoring the model against our held-out test data.

Clustering, on the other hand, does not use labelled data. Instead it seeks to form clusters of points based on distance. To do typical clustering we must provide, at a minimum, the number of clusters to form. We can cheat a little here since we know there are three classes we are trying to predict.

For each cluster a center point, or centroid, is placed. The algorithm then iterates (up to our max_iter parameter) and adjusts the centroids to maximize the fit of the points to the clusters.

from sklearn.cluster import KMeans
from sklearn import metrics

km = KMeans(n_clusters=3, init='k-means++', max_iter=100, n_init=1)
km.fit(X_train)

We can now plot our data, including the centroids, to better visualize this. Note that the clusters the algorithm found are not necessarily tied to our true labels or ground truth. When we use this model to predict or classify a point, the result will be a cluster number; it is up to you to associate the clusters back to your labels.

[Figure: K-Means clusters and centroids]
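
The plotting code was not included in the original post, but here is a rough sketch of how such a plot might be produced (it assumes the km model and X_train from above; the colors, markers and labels are arbitrary):

import matplotlib.pyplot as plt

# Cluster assignment for each training point
labels = km.predict(X_train)

# Plot each discovered cluster separately
for cluster in range(3):
    plt.scatter(X_train[labels == cluster, 0],
                X_train[labels == cluster, 1],
                alpha=0.5, label='Cluster %d' % cluster)

# Overlay the cluster centers (centroids)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            color='black', marker='*', s=150, label='Centroids')

plt.xlabel('Petal length')
plt.ylabel('Petal width')
plt.legend()
plt.show()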

Wrap-Up

In this post we used Python and scikit-learn to experiment with multiple classification models. We looked at five implementations, spanning both supervised and unsupervised methods, that are capable of handling various types of data. We still have barely scratched the surface.

In the next post, we are actually going to take a step backwards as we start a discussion on data wrangling and preparation.

Till next time!

Chris