6.12. Pandas: Missing Data and Hierarchical Indexing¶
6.12.1. Missing Data¶
The difference between data found in many tutorials and data in the real world is that real-world data is rarely clean and homogeneous. In particular, many interesting datasets will have some amount of data missing.
# Example 1 of missing data
import numpy as np
import pandas as pd
data = np.array([1, None, 6, 8])
data
If we use functions like sum() or min() on an array if none is present then it will return an error.
try:
data.sum()
except TypeError as e:
print(type(e))
print(e)
When a “None” value is present in an array, any arithmetic operation done on the array will always result in an error.
# Example 2 of missing data
data1 = np.array([1, np.nan, 2, 6])
3 + np.nan
5 * np.nan
Regardless of the operation, the result of arithmetic with NaN will be another NaN.
# Detecting null values
data = pd.Series([1, np.nan, 'example', None])
data.isnull()
data[data.notnull()]
# Dropping null values
data.dropna()
data2 = pd.DataFrame([[2, np.nan, 5],[4, 1, 7],[np.nan, 0, 3]])
data2
data2.dropna()
We cannot drop single values from a DataFrame; we can only drop full rows or full columns.
data2.dropna(axis='columns')
data2[3] = np.nan
data2
data2.dropna(axis='columns',how='all')
data2.dropna(axis='rows',thresh=3)
# Filling null values
data = pd.Series([3, np.nan, 4, None], index=list('abcd'))
data
data.fillna(0)
#Forward fill
data.fillna(method='ffill')
# Back-fill
data.fillna(method='bfill')
data2.fillna(method='ffill', axis=1)
6.12.2. Hierarchical Indexing¶
It is useful to store higher-dimensional data indexed by more than one or two keys. It can be done using hierarchical indexing which incorporate multiple index levels within a single index. In this way, higher-dimensional data can be compactly represented within the familiar one-dimensional Series and two-dimensional DataFrame objects.
# A Multiple Indexed Series
index = [('California', 2000), ('California', 2015),('New York', 2000), ('New York', 2015),('Illinois', 2000), ('Illinois', 2015)]
populations = [24343547, 54343326,18976457, 23557766,12736448, 76453456]
pop = pd.Series(populations, index=index)
pop
pop[('California', 2015):('Illinois', 2000)]
# Selecting just data for year 2015
pop[[i for i in pop.index if i[1] == 2015]]
As similar operation can be done in a much more efficient way. Our tuple-based indexing is essentially a rudimentary multi-index, and the Pandas MultiIndex type gives us the type of operations we wish to have.
index = pd.MultiIndex.from_tuples(index)
pop = pop.reindex(index)
pop
Here the first two columns of the Series representation show the multiple index values, while the third column shows the data. Notice that some entries are missing in the first column: in this multi-index representation, any blank entry indicates the same value as the line above it.
pop[:, 2015]
This syntax is much more convenient than the previous method.
# Unstack
pop1 = pop.unstack()
pop1
The unstack() method will quickly convert a multiple indexed Series into a conventionally indexed DataFrame.
# Stack
pop1.stack()
Why hierarchical indexing?
The reason is simple: just as we were able to use multi-indexing to represent two-dimensional data within a one-dimensional Series, we can also use it to represent data of three or more dimensions in a Series or DataFrame. Each extra level in a multi-index represents an extra dimension of data; taking advantage of this property gives us much more flexibility in the types of data we can represent. Concretely, we might want to add another column of demographic data for each state at each year with a MultiIndex this is as easy as adding another column to the DataFrame.
pop1 = pd.DataFrame({'total': pop,'under18': [3123445, 1234567,2346257, 2461346,6434785, 9876544]})
pop1
func = pop1['under18'] / pop1['total']
func.unstack()
All ufuncs and other functions used in pandas can be used in hierarchical indices too.
# Constructing Multiple index series
data3 = pd.DataFrame(np.random.rand(4, 2),index=[['a', 'b', 'b', 'b'], [1, 1, 2, 3]],columns=['data1', 'data2'])
data3
Similarly, if you pass a dictionary with appropriate tuples as keys, Pandas will automatically recognize this and use a MultiIndex by default.
data = {('California', 2000): 12345678,
('California', 2015): 23456789,
('Illinois', 2000): 34567890,
('Illinois', 2015): 45678901,
('New York', 2000): 56789012,
('New York', 2015): 67890123}
pd.Series(data)
# MultiIndex level names
pop.index.names = ['State', 'Year']
pop
# MultiIndex for Rows and columns
index = pd.MultiIndex.from_product([[2010, 2011], [1, 2]],names=['Year', 'Visit'])
columns = pd.MultiIndex.from_product([['A', 'B', 'C'], ['HR', 'Temp']],names=['Subject', 'Type'])
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 30
health = pd.DataFrame(data, index=index, columns=columns)
health
health['B']
6.12.3. Indexing and Slicing a MultiIndex¶
# Multiple indexed Series
pop['California', 2000]
pop['California']
pop[pop > 22000000]
pop[['California', 'Illinois']]
# Multiple indexed dataframe
health['A', 'HR']
health.loc[:, ('A', 'HR')]
health.iloc[:2, :2]
idx = pd.IndexSlice
health.loc[idx[:, 1], idx[:, 'HR']]
6.12.4. Rearranging Multi indices¶
# Sort
index = pd.MultiIndex.from_product([['x', 'z', 'y'], [1, 2]])
data = pd.Series(np.random.rand(6), index=index)
data.index.names = ['char', 'int']
data
try:
data['x':'y']
except KeyError as e:
print(type(e))
print(e)
Any function performed on an unsorted multiindex data will result in an error.
data = data.sort_index()
data
data['x':'y']
# Unstack
pop.unstack(level=0)
pop.unstack(level=1)
# Stack
pop.unstack().stack()
# Reset Index
pop2 = pop.reset_index(name='population')
pop2
Often when you are working with data in the real world, the raw input data looks like this and it’s useful to build a MultiIndex from the column values.
# Set Index
pop2.set_index(['State', 'Year'])
6.12.5. Data Aggregation¶
mean = health.groupby(level='Year').mean()
mean
mean.groupby(axis=1, level='Type').mean()