The Data Science PipelineWrangling
Data is said to be in tidy format if each row corresponds to an observation and each column corresponds a different observation variable. For example, in the
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|
1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
Note that four of the columns are quantitative (that is, they contain numerical data), while one is categorical (that is, they contain strings which represent categories).
Many types of data do not fit naturally into the tidy data framework, like image data and passages of text. However, tidy data occupies a sweet spot of generality which covers a wide variety of use cases but is nevertheless specific enough to support tools which are both concise and powerful. In this mini-course, we will focus on tidy data.
Pandas
A data frame is an object for storing tidy data, and the package which provides data frames in the Python ecosystem is Pandas. Pandas is built on NumPy, which is the Python library for multi-dimensional
import pydataset iris = pydataset.data("iris") print(type(iris)) iris.head()
A data frame's index
attribute stores the labels of the rows, and the columns
attribute stores the column names.
iris.columns, iris.index
Columns of the data frame can be accessed using by indexing the data frame with the name of the column. For column names that are valid Python variable names, dot syntax access is also supported:
import pandas as pd data = [[179_335,47.7],[81_274,73.4],[24_590,19.9],[22_131,25.4]] columns = ['population', 'area'] index = ['Providence', 'Cranston', 'Newport', 'Bristol'] ri = pd.DataFrame(data, columns = columns, index = index) ri['population'] # or ri.population, same thing
Pandas DataFrame
objects provide two attributes, called loc
and iloc
, for accessing entries using names or integers, respectively.
ri.loc['Providence',:'area'], ri.iloc[0,:1]
Note that slices built with row or column names are inclusive, while integer slices follow the usual Python convention of being exclusive of the upper bound.
Exercise
Use the pydataset
package to load the 'Boston'
data set. Select all of the columns between indus
and rad
and assign the resulting data frame to a new variable. Then select the first 25 rows from that new data frame.
Solution. We index columns using loc
and the rows using iloc
boston = pydataset.data('Boston') boston_trimmed = boston.loc[:,'indus':'rad'] boston_trimmed.iloc[:25,:]
The six verbs of data manipulation
Although data frames support many transformations, we will follow the prominent data scientist
- Filter. Pick rows based on their values.
- Sort. Re-order the rows.
- Select. Choose specific columns.
- Transform. Create new columns from existing ones.
- Aggregate. Reduce the data frame to a single row by applying a function (like
sum
,min
,max
, etc.) which maps each column of values to a single value.
- Group. Collect the rows of the data frame into groups.
The grouping operation is different from the others because it outputs a grouped data frame object, which can be visualized as a stack of data frames. Here's how the popular data table product Airtable presents grouped data frames (using some of the rows from the iris data set above, grouped on species):
The other operations can be applied group-by-group.
Exercise
Experiment with the interactive table below to practice some of the operations above. You might want to click the "View larger version" icon in the bottom right corner to open the table in a new browser tab.
- Use the Filter button to select only those flowers whose sepal length is less than 5.0.
There are
The smallest sepal length value is
There are
The species with the largest average petal length is
Filter
There are two main ways to filter rows in Pandas. The first is to obtain a column of boolean values and use it to index the rows of the data frame. For example:
ri[ri.population > 50_000]
In the last line above, ri.population > 50_000
returns a column with the values [True, True, False, False]
, and indexing the data frame with a boolean array selects only those rows corresponding to the True
values.
We can combine predicates using the usual Python operations for
ri[(ri.population > 50_000) & (ri.area > 50)]
This approach has some drawbacks: First, we have to repeat the name of the data frame multiple times. This violates the programming maxim "Don't repeat yourself". Second, computing an expression like (ri.population > 50_000) & (ri.area > 50)
requires three array ri.population > 50_000
, one for ri.area > 50
, and one for the (ri.population > 50_000) & (ri.area > 50)
. These allocations are unnecessary, since it is possible to just loop over the rows and directly check the whole condition for each row (although you don't want to program that yourself in Python, because loops in Python are slow).
Pandas does provide a solution to this problem: the query
method. You supply your condition as a string, and Pandas makes things efficient for you under the hood:
ri.query('population > 50_000 & area > 50')
Python variables can be interpolated in query strings using the @
prefix. For example, the query above could also be written as
min_pop = 50_000 ri.query('population > @min_pop & area > 50')
Exercise
Use the query
method to identify the records in the 'Boston'
data set which have the property that indus
is at least 10 and either medv
is less than 8 or chas
is 1.
Solution. We use parentheses to group the given logical conditions:
import pydataset boston = pydataset.data('Boston') boston.query('indus >= 10 & (medv < 8 | chas == 1)')
Sort
The DataFrame
method that sorts values is called sort_values
. It takes an argument for the column labels (or list of columns labels) to use for sorting, and you can use the ascending
argument to specify whether the values in that column should be in increasing or decreasing order.
ri.sort_values('population',ascending=False)
If a list of columns is supplied, then each column after the first is used to break ties in the preceding columns:
d = pd.DataFrame([[3,2],[3,1],[2,4]],columns=['a','b']) d.sort_values(['a','b'])
Note that sort_values
returns a new data frame. It does not modify the original one.
Exercise
Sort the 'Boston'
data set in decreasing order of the value in the first column.
Solution. We can inspect columns
or look at the data frame's head to see that the first column is 'crim'
. Then we sort:
boston = pydataset.data('Boston') boston.sort_values(['crim'], ascending = False)
Select
To select columns in Pandas, you can just index the data frame with a list of column names:
ri[['population','area']]
If you want to keep all columns except specific ones, you can use the drop
method:
ri.drop('population', axis=1)
We have to specify that 'population'
refers to columns (axis=1
), because the drop
method's default is to look for rows to drop. Like sort_values
, drop
doesn't modify the original data frame.
Exercise
Select the columns in the Boston data frame which contain floating point numbers (as opposed to integers, which do not print with a decimal point).
Solution. We inspect the data frame to find that the columns which are not floats are 'chas'
, rad
, and 'tax'
. So we use drop instead of selecting:
boston_float_only = boston.drop(['chas', 'rad', 'tax'],axis=1) boston_float_only.head()
As a follow-up, we note that this is a sufficiently common operation that Pandas supplies a convenience method for it:
import numpy as np boston_float_only = boston.select_dtypes(exclude=['int']) boston_float_only.head()
You can inspect the types of the columns of a data frame using its dtypes
attribute.
Transform
We can create new columns in a data frame using the assign
method. For example:
ri.assign(density = ri.population / ri.area)
If the name of the data frame is quite long, you can avoid having to type it repeatedly by supplying an anonymous function to be applied to the data frame:
ri.assign(density = lambda d: d.population / d.area) \ .assign(**{'inverse density': lambda d: 1/d.density})
We are using two assign
calls to create a column called density
and then a second new column called inverse density
. (Note how we used
Exercise
For each observation in the 'toothpaste'
data set, find the ratio of the difference between means to the square root of the sum of the squares of the standard deviations for conditions "A" and "B".
Solution. Since the formula is quite involved, we use an anonymous function:
import numpy as np toothpaste = pydataset.data('toothpaste') toothpaste.assign(score = lambda d: (d.meanA - d.meanB)/np.sqrt(d.sdA**2 + d.sdB**2))
Aggregate
The agg
method applies a specified function (called the aggregation function) to each column in a data frame. Several aggregation functions are built-in and can be specified using a string:
Name | Description |
---|---|
count | number of entries |
sum | sum of entries |
prod | product of entries |
mean | average |
median | middle entry when sorted |
var | variance |
std | standard deviation |
min | minimum entry |
max | maximum entry |
For example, to find the total population of the four cities in our data frame, we run:
ri.population.agg('sum')
To find the average population and area:
ri.agg('mean')
You can also supply a custom aggregation function instead of a string.
Exercise
Find the range (the difference between max and min) for each of the four quantitative columns in the iris
dataset. Try using an anonymous function rather than using the built-in 'min'
and 'max'
aggregation functions.
Solution. We have to drop the categorical column first since we can't compute a range for that.
import numpy as np import pydataset iris = pydataset.data('iris') iris.drop(['Species'],axis=1).agg(lambda r: np.max(r) - np.min(r))
Group
The aggregation function is a little bland by itself. We often want to apply the aggregation function only to specific rows. For example consider the following transaction log:
Customer name | Transaction amount | |
Acme, Inc. | 45.03 | |
Acme, Inc. | 118.22 | |
XYZ Widgets | 35.55 | |
Acme, Inc. | 22.04 | |
XYZ Widgets | 34.78 |
What is the total transaction amount for each company? We could filter for each company and aggregate the two resulting data frames, but that would get unwieldy quickly as the number of companies grows. Instead, we group the data frame by customer name and then apply the aggregation function:
columns = ['Customer name', 'Transaction amount'] transactions = pd.DataFrame([['Acme, Inc.',45.03], ['Acme, Inc.',118.22], ['XYZ Widgets',35.55], ['Acme, Inc.',22.04], ['XYZ Widgets',34.78]], columns=columns) transactions.groupby('Customer name').agg('sum')
When a data frame is grouped along a particular column, the resulting grouped data frame is a dictionary-like object whose keys are the unique values of that column and whose values are the data frames obtained by filtering for the given key in that column. For example, transactions.groupby('Customer name')
maps the key 'Acme, Inc.'
to the data frame
Customer name | Transaction amount | |
---|---|---|
0 | Acme, Inc. | 45.03 |
1 | Acme, Inc. | 118.22 |
3 | Acme, Inc. | 22.04 |
and 'XYZ Widgets'
to
Customer name | Transaction amount | |
---|---|---|
2 | XYZ Widgets | 35.55 |
4 | XYZ Widgets | 34.78 |
The agg
method of DataFrameGroupBy
object operates on each each of these data frames to produce a single row, and these rows are collected into a new data frame. The row index for this output data frame comes from the keys of the dictionary-like DataFrameGroupBy
object.
Exercise
Use the interactive table below to perform the same aggregation operation described above (in other words, find the sum of the transaction amounts for each company).
Exercise
Group the flowers in the Iris data set in unit intervals of sepal length and find the average sepal width for each such group. (In other words, one group contains all flowers with sepal length in , another group has all flowers with sepal length in , and so on.)
Hint: make a new column to group by.
import pydataset iris = pydataset.data('iris')
Solution. We create a new column using the np.floor
function. Then we perform the grouping, select the column we want, and aggregate using the mean
function:
import pydataset import numpy as np iris = pydataset.data('iris') iris.assign(sepal_length_floor = np.floor(iris["Sepal.Length"])) \ .groupby('sepal_length_floor') \ [['Sepal.Width']] \ .agg('mean')
We conclude this section by noting that the other four operations (filter, sort, select, and transform) can be applied to grouped data frames as well. However, only selection works directly on grouped data frames. For the others, we use the apply
method of the grouped data frame object to operate group-by-group and collect the results into a single data frame.
For example, we can sort by petal length within species as follows:
iris.groupby('Species').apply(lambda d: d.sort_values('Petal.Length'))
You'll notice in data frame returned above that the row indices are retained from the original data frame. Actually, the group values are also incorporated into the indexing scheme—this is advanced feature of Pandas called a multi-indexing. The reset_index
method is useful for dropping this extra structure and simply re-indexing the rows from 0. We need to supply the value True
to the keyword argument drop
, because otherwise the reset_index
method will try to keep the old indices around as new columns, and that will fail since we already have a Species
column:
(iris.groupby('Species') .apply(lambda d: d.sort_values('Petal.Length')) .reset_index(drop = True))