Data Science Day #4 – Data Wrangling

In this preceding post we spent a fairly considerable about of time to understand both supervised and unsupervised machine learning methods for doing predictions or classifications. With that context, we are going to loop back and begin working our way through a “typical” machine learning project.

Project Framework

Like most other technology projects there is a common framework for tackling machine projects that we can use to guide us from start to finish. Known as the CRISP-DM or Cross-Industry Standard Process for Data Mining, this iterative framework sets up guide posts that we will use.


While it is beyond the scope of this post to discuss each step in detail, we can summarize the broad steps.

  • Business Understanding – understand the project objectives and define business requirements. This is the step where we outline the problem definition.
  • Data Understanding – initial data collection, data familiarization, profiling and identification of data quality problems. Typically this is the most time consuming step with some estimates saying 90% of project time is spent here.
  • Data Preparation – this is where we do our data wrangling or data preparation and transformation. We want to identify not only the data we will use but also the relevant features for the task at hand.
  • Modeling – model selection and calibration including model tuning/calibration
  • Evaluation – evaluate or measure model results against the stated business objectives
  • Deployment – deploy and operationalize model including maintenance plans for the required care and feeding

Note that this process is highly iterative. Feedback is important from step to step and the process is evolutionary in that feedback is often reincorporated in previous steps meaning that where you start is likely to not be where you end up.

Business & Data Understanding

We are not going to spend any real time discussing these points beyond saying there are a plethora of tools and techniques that you can leverage on the technical side. If you spent time doing traditional ETL or business intelligence types of task then you have inevitably encounter data quality issues which must be corrected before moving forward.

Data Wrangling

The remainder of this post will focus on data preparation or wrangling the data into a state that it can be used for machine learning. Specifically we will focus in on five of the most common task encountered: handling missing data, dealing with outliers, handling categorical or non-numeric data, binning or bucketing numeric data and finally data standardization or normalization.

Handling Missing Data

Data is not only often dirty, it is often incomplete. Missing data elements in your dataset can appear as NULL values, blanks or other placeholders and not only happen regularly in the real-world they are often beyond your control. Compounding the problem these missing values are typical incompatible with machine learning models or can lead to suboptimal results. So what can we do when we encounter this scenario?

The first and most obvious answer is that the data can be removed. When we talk data removal, this can mean eliminating the offending row or record of data or eliminating the entire feature (column) for the data set. But removing this data comes at the cost of information loss and can be particular problematic in smaller datasets where every observation or record is valuable.

So we need an alternative and those comes in the form of what’s known as imputed values. As the name implies, this technique for handling missing values allows us to substitute the missing value with a meaningful value. This meaningful value is generated using a strategy that is typically based on one of the measures of central tendency (mean, median, mode-most common value).

We can demonstrate using this technique using the sample provided straight-out of the scikit-learn documentation. Note that there are equivalent features in nearly every toolset (R, AzureML, etc) if you are not using Python and scikit-learn.

In the sample code, below note that we create out imputer using the mean strategy and identify what constitutes a missing value (NULLs are represented as NaN in Python). Next we go through a fit process where the imputer learns the mean or whatever strategy you chose for the data set. The transform call on the imputer substitutes out the missing values in the dataset. The sample code and output is provided below.

import numpy as np
from sklearn.preprocessing import Imputer

imp = Imputer(missing_values='NaN', strategy='mean', axis=0)[[1, 2], [np.nan, 3], [7, 6]])

X = [[np.nan, 2], [6, np.nan], [7, 6]]


The results of running the code above are displayed below.

Detecting and Handling Outliers

Outliers in your dataset are observations or points that are extreme. Outliers can be valid (think comparing Bill Gates net worth to the net worth of people in your surround Postal Code) or invalid points (occurring as a result of measurement error) and have the potential to skew your machine learn model leading to poor or suboptimal results.

The first step in dealing with outliers is being able to identify them and the first place to start is through looking for values which fall outside the possible range of values. Examples of these can include negative values where they are not expected (i.e. negative values for height/weight of a person), unrealistic numbers (such as infant weights great than 100 lbs.) or other values which shouldn’t or can’t occur. This is what I call the smell test and can be identified using basic summary statistics (Minimum, Maximum, etc.).

Next, if your data follows a normal distribution, you can potentially use the median and standard deviations following 68-95-99.7 rule to identify outliers. Note that this is not always considering statistically sound since it is highly dependent on the distribution of data (meaning you should proceed with the utmost care and caution).

Finally, since we are discussing machine learning, you could guess that there machine learning techniques which allow us to identify both inliers and outliers. Out of the box scikit-learn has a couple of options for support for outlier detection ( You could also resort to using a model such as linear regression to fit a line to your data and then remove some percentage of the data based on the error (predicted value vs actual).

The task at hand determines how we handle outliers. Some models are sensitive to outliers and leaving them in place can lead to skew and ultimately bad predictions (this is common in linear regression). Other models are more robust and can handle outliers meaning it is not necessary to remove them.

Handling Categorical Data

Many machine learning models have specific requirements when it comes to feature data. Most machine learning models are not capable of handling categorical data and instead require all data to be numeric. These cases require that we convert our data..

Categorical data consist of two types of data and we first must determine what variety of categorical data we are dealing with. Ordered categorical data or ordinal data has an implicit order (i.e. clothing sizes, customer ratings) and in many case simply substituting the order sequence number for the category label will suffice. In non-ordered data this won’t work since some models may inappropriate derive a relationship based on the numeric order.

In this situation we use a process to encode the categories as a series of indicators or dummy variables. These dummy variables are simply binary fields and function as a pivoted set of categories. To illustrate this consider a feature that has three possible categories: red, blue and orange.

When encoded as dummy variables, our dataset would have three new features, is_red, is_blue and is_orange, where each features values is either 0 or 1 based on the sample categorical value. To create these dummy features in Python we can use the Pandas library. In the sample code below the get_dummies function handles the pivot and transforms our dataset in a single step.

import pandas as pd

data = ['red', 'blue', 'orange']



Binning and Bucketing Numerical Data

Another case that in which we may want to transform our data involves situations where our data is numeric and/or continuous in nature. In many case we may choose to bin or bucket it either because it is required or is necessary to optimize our model.

This process is referred to by many names including discretizing, binning or bucketing and involves converting our data into a discrete or fixed number of possible values. While there are multiple was to accomplish this ranging from fixed bucket to using quartiles we will look at one example using a fictitious customer annual income feature.

import pandas as pd

income = [10000, 35000, 65000, 95000, 1500000, 5000000]
bins = [0, 30000, 75000, 250000, 1000000000]
names = ['Low', 'Middle', 'High', 'Baller']

pd.cut(income, bins, labels=names)

The above code contains an array of incomes ranges from $10,000 to $5,000,000. We’ve defined bins and labels and then used the Pandas cut function to slot each value into the appropriate bucket. The results are displayed below.


Data Standardization and Normalization

The final data wrangling technique we will look at involves bringing features within out dataset to common scale. But why is this necessary? Often times the features in our dataset includes data with a diverse set of scales, for example sales quantity in unit and sales revenue in dollars. If these features do not share a common scale many machine learning models may weigh the largest feature more heavily skewing the model and given potentially inaccurate results.

Two common techniques for these include standardization and normalization and the difference between the two is subtle. Both techniques will bring features to a common scale, using standardization we rescale all values to new scale between 0 and 1. This technique is useful when we are trying to measure distance between points such as when doing k-means clustering.

The second technique, normalization scales data by shifting it to have a mean of 0 and a standard deviation of 1. This technique will preserves the distributed of data when it is important.

To demonstrate these techniques we can use the iris dataset from the prior post and the preprocessing functions in scikit-learn and compare the minimum and maximum values before and after the transformations are applied.

from sklearn.datasets import load_iris
from sklearn import preprocessing

iris = load_iris()
x =

normalized_x = preprocessing.normalize(x)
standardized_x = preprocessing.scale(x)

print("Original Min %.3f - Max %.3f" % (x.min(), x.max()))
print("Normalize Min %.3f - Max %.3f" % (
	normalized_x.min(), normalized_x.max()))
print ("Standardized Min %.3f - Max %.3f" % (
	standardized_x.min(), standardized_x.max()))

Note that in the resulting screen snip the normalized values all now all between 0 and 1 and the standardized values are now center on 0.



In this post, we explored data wrangling and more specifically five of the most common tasks you will encounter. We discussed handling missing data, identifying outliers, converting categorical data to numeric, converting numeric data into discrete categories through binning and finally bringing data to common scale through either standardization or normalization.

In our next post, we will continue our journey as we discuss techniques for identifying the most important features through a process called feature selection.

Till next time!



Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s