Data Science Day #7 – Model Evaluation

Outside of introducing some of the common algorithms and models used for classification tasks, this series has focused on the small yet important foundational things that need to be handled or considered prior to modeling and model selection. In this post we are going to drive the bus back in from left field and discuss model evaluation within the context of a supervised classification activity. Without further ado, let's get started.

Getting Data

We've spent a number of posts talking about preparing data for optimal use in a machine learning activity, and there's one more step that needs to be considered. Since we are going to be working with supervised machine learning techniques, our models are going to require sets of data for training. Likewise, as we'll talk about model evaluation shortly, we will need data for that too. Further complicating matters is the need for validation data once we discuss model selection. So how do we go about dividing our data appropriately?

Believe it or not, this is a relatively easy task. Accomplishing it is as simple as splitting the data randomly by some percentage, say using 80% for training and leaving 20% untouched for testing and evaluation. We could then take our training data and further carve out a set of validation data to help prevent us from overfitting our model during model selection. This is classically known as the holdout method; while the percentages vary, with 70/30 and 80/20 being common splits, it is widely used because of its simplicity.
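As a quick sketch of the holdout method described above, here is how the two splits might be made with scikit-learn's train_test_split on a generated data set (the exact percentages and the generated data are purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Generate a toy data set purely for illustration
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# First carve out 20% as an untouched test set (80/20 holdout)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Then carve a validation set out of the remaining training data
# (25% of the 800 training rows, i.e. 200 observations)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

Fixing random_state keeps the splits reproducible between runs, which matters when comparing models against the same validation set.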

A second alternative to the holdout technique is known as cross validation. Cross validation has a few different implementations, with the most common known as k-fold cross validation. In cross validation the data set is divided into k folds (for example 5) where each observation exists in only a single fold, and each fold takes a turn both as part of the training set and as the validation set. To better illustrate the differences between the holdout and k-fold cross validation approaches, a diagram has been included below.
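A minimal sketch of the k-fold idea, using scikit-learn's KFold on a toy set of 20 observations (the data is fabricated just to show the fold mechanics):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(20, 1)  # twenty toy observations

kf = KFold(n_splits=5, shuffle=True, random_state=0)
val_indices = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    print(f"fold {fold}: {len(train_idx)} training rows, {len(val_idx)} validation rows")
    val_indices.extend(val_idx)

# Every observation lands in exactly one validation fold
assert sorted(val_indices) == list(range(20))
```

Each fold serves once as the validation set and four times as part of the training set, so every observation contributes to both roles across the five iterations.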


Before moving on, it's important to note that the way you divide up your data can have a significant impact on model performance. You shouldn't shortcut yourself into potential issues like overfitting a model by reusing your training data later as validation data during model selection, or by not paying attention to small details like randomization when splitting up your data.

Model Evaluation

So we have our data and we've selected a model, or potentially models, that we suspect will work nicely with our data. How do we evaluate the model's actual performance? At first glance it's simple, right? This is a classification activity, and regardless of whether it's a binary or multiclass problem, the model either picked the right class or it didn't. This task is often referred to as scoring and involves testing our chosen model on a set of unseen data to see if its predicted class matches our known label, or the ground truth.

Unfortunately, once we've trained, predicted, and scored our model, there are multiple ways, or metrics, we can use to measure how our model performed. Let's briefly look at some of the most common ways to evaluate model performance.

Confusion Matrix

The simplest method for evaluating a model's performance at a classification task is known as the confusion matrix (pictured below). The confusion matrix in a binary classification task is made up of four quadrants, each containing a straight count of how our model's predicted class aligns with our ground truth, or known label. The four quadrants in this example matrix are:

  • True Positive (TP) – predict true, known true
  • False Negative (FN) – predict false, known true, sometimes called Type-2 error
  • False Positive (FP) – predict true, known false, sometimes called Type-1 error
  • True Negative (TN) – predict false, known false

To further illustrate how this works, let's consider a simple binary classification task where we are trying to predict whether a credit card transaction is fraudulent or not. In this case, we are simply predicting either true or false, with true indicating a fraudulent transaction. In the example pictured below, our model correctly classified 92 fraudulent transactions and another 104 as non-fraudulent. Further, it misclassified 28 transactions as okay when they were in fact fraudulent, and 26 as fraudulent when they were in fact okay transactions. Taken together, this gives us a straightforward summary of how our selected model performed, and it will be the basis of many of the performance metrics we discuss next.
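To make the quadrants concrete, here is a small sketch that rebuilds the example matrix with scikit-learn; the label arrays are fabricated purely to reproduce the counts described above (1 = fraudulent, 0 = okay):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Synthetic labels engineered to match the example counts:
# 92 TP, 28 FN, 26 FP, 104 TN
y_true = np.concatenate([np.ones(92), np.ones(28), np.zeros(26), np.zeros(104)])
y_pred = np.concatenate([np.ones(92), np.zeros(28), np.ones(26), np.zeros(104)])

# scikit-learn orders rows/columns by label value: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[104  26]
#  [ 28  92]]
```

Note that scikit-learn places true negatives in the top-left corner, so the layout may be transposed relative to diagrams you see elsewhere.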



Performance Metrics

When I started this series my goal was to do the whole thing without invoking a single math formula. While my intentions were probably good, like many of the politicians filling the airwaves I'm going to backtrack and present some math to highlight the various derivations that can be made from our confusion matrix. Don't worry, you won't be tested on this stuff; it's presented only for conceptual understanding, and all of these metrics are often provided for you by ML libraries such as scikit-learn.


Error Rate = (FP + FN) / (TP + TN + FP + FN)

Summarizes the percentage of observations that were misclassified by the model.



Accuracy = (TP + TN) / (TP + TN + FP + FN)

Summarizes the percentage of observations that were correctly classified by the model.



Precision = TP / (TP + FP)

Can be calculated for either positive or negative results, and is the proportion of true positives to all positive predictions.



Recall = TP / (TP + FN)

Also referred to as sensitivity, this is the proportion of true positive classifications to all cases which should have been positive. Higher recall means that we are correctly classifying most positive cases; the metric ignores false positives. It can also be calculated for negative results, in which case it is called specificity.



F1 = 2 × (Precision × Recall) / (Precision + Recall)

Summarizes overall performance by considering both precision and recall, weighted equally using the harmonic mean. The highest possible F1 score is one, with the lowest being zero. This measure is typically used to compare one model to another.
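For conceptual grounding, all of these metrics can be computed by hand from the four quadrant counts in the fraud example above. This is just an illustrative sketch; in practice you would lean on the library implementations:

```python
# Quadrant counts from the fraud example (1 = fraudulent, 0 = okay)
TP, FN, FP, TN = 92, 28, 26, 104
total = TP + FN + FP + TN

accuracy = (TP + TN) / total
error_rate = (FP + FN) / total
precision = TP / (TP + FP)
recall = TP / (TP + FN)            # sensitivity
specificity = TN / (TN + FP)       # recall of the negative class
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} error_rate={error_rate:.3f} "
      f"precision={precision:.3f} recall={recall:.3f} "
      f"specificity={specificity:.3f} f1={f1:.3f}")
# accuracy=0.784 error_rate=0.216 precision=0.780 recall=0.767
# specificity=0.800 f1=0.773
```

Notice that accuracy and error rate always sum to one, while precision and recall trade off against one another; F1 balances that trade-off in a single number.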


Receiver Operating Characteristic (ROC) Curve

The ROC curve is a visual plot that compares the true-positive rate to the false-positive rate. In the plot below, note that a perfectly predictive model follows the axis borders, while a model with no predictive value (i.e. random guessing) forms the diagonal. We can also plot and compare multiple models against one another. Using these plots we can generate another measure, called area under the curve (AUC), to quantify the graphically presented data.



Model Evaluation & Selection

Now that we’ve explored some of the tools we have at our disposal to evaluate the performance of our model, let’s look at a practical application for how these metrics can be leveraged in evaluating whether our model is useful or not. Continuing with our credit card fraud model, let’s review and interpret the example results for this binary classification task.

  • Accuracy – 97.7% – Very general, telling us that our model predicted the correct label for 97.7% of our samples. It does not inform us whether those misses were primarily one class or another. In our example this can be potentially problematic if, for example, we were extremely sensitive to fraud yet our model primarily missed on fraudulent charges. Likewise, this can be misleading in cases where we have significant class imbalances.
  • Consider a situation where 99.5% of transactions were okay and only 0.5% were fraudulent. If our model simply guessed that every transaction was okay, it would be 99.5% accurate but would have missed every single fraudulent charge. For this reason, we typically want to consider some of the following metrics.
  • Precision – 74.4% – Of the transactions our model flagged as fraudulent, 74.4% were in fact fraudulent. This would be useful if we were worried about having a model that was overly aggressive and favored a more conservative model to limit false alarms.
  • Sensitivity (Recall) – 95.5% – Our model correctly classified 95.5% of fraudulent transactions, meaning that 4.5% of fraudulent transactions were missed. Whereas our precision metric takes false positives into consideration, this metric is only concerned with truly fraudulent transactions. If our tolerance for fraud were very low, we would be interested in optimizing our model for this metric.
  • Specificity – 97% – Indicates that 97% of good transactions were correctly classified and that 3% were misclassified as fraudulent. Similar to precision in that optimizing this metric would lead to a more conservative model if we were overly concerned with sounding false alarms.

As you have probably surmised by this point, it is nearly if not totally impossible to build a model that performs well across all the metrics. In the real world we typically focus on one area, and part of the model selection process involves tuning or optimizing our model for the task at hand. This tuning process is called hyperparameter tuning and will be the focus of our next post.

Metrics Example

Since this post has been code-free so far (and I'm itching for some code), I've included a brief sample that you can run to highlight how several of these metrics can be calculated using Python and scikit-learn. The example uses a generated data set, with the emphasis placed on how the various metrics discussed to this point can be calculated.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import classification_report

# Generate a random binary classification data set
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, n_informative=5)

# Holdout split: 75% training, 25% testing by default
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit the model and predict against the unseen test set
classifier = LogisticRegression()
y_pred = classifier.fit(X_train, y_train).predict(X_test)

#Confusion Matrix
print(confusion_matrix(y_test, y_pred))

#Accuracy Score
print(accuracy_score(y_test, y_pred))

#Precision Score
print(precision_score(y_test, y_pred))

#Classification Report
print(classification_report(y_test, y_pred, target_names=['No', 'Yes']))



In this post we discussed the various facets of evaluating a classification model. From a straightforward strategy for dividing our experiment data to an introduction to some of the basic metrics that can be used to measure model performance, we've set the stage for model evaluation. In the next post, we will look at a strategy to tune our model using a hyperparameter tuning technique called grid search to optimize it for a given target metric.


Till next time!