6. Pandas

6.1. History of Pandas

Developer Wes McKinney started working on pandas in 2008 while at AQR Capital Management, out of the need for a high-performance, flexible tool to perform quantitative analysis on financial data. Before leaving AQR, he convinced management to allow him to open-source the library.

## Installing and Checking version of Pandas:
# !pip3 install pandas

## Loading pandas
import pandas as pd

## Loading numpy as well
import numpy as np

## Pandas version
print(pd.__version__)

Note that an easy way to browse the functions in the Pandas library is to type pd. in a Jupyter notebook and press the Tab key. More detailed documentation, along with tutorials and other resources, can be found at http://pandas.pydata.org/.

6.2. Pandas Objects

At a basic level, Pandas objects extend NumPy arrays by identifying rows and columns with labels rather than with simple integer indices such as 0, 1. Those who have used R may be reminded of its data frames.

Let us start by introducing three fundamental objects:

  • Series

  • DataFrame

  • Index

6.2.1. Series

A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array as follows:

data = pd.Series([0.25, 0.5, 0.75, 1.0])

As can be seen above, a Series wraps a NumPy array of values together with an index, which here is the default integer sequence 0 to 3.
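As a small illustration (a sketch added for clarity, with the variable names `values` and `index` chosen only for this example), the two parts of a Series can be inspected through its `values` and `index` attributes:

```python
import numpy as np
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])

values = data.values   # the underlying NumPy array of values
index = data.index     # the default integer index 0, 1, 2, 3

print(type(values))
print(list(index))
```

With no index given, Pandas supplies the integer sequence, so `data[1]` looks up the label 1.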

print(f"data value at index 1 is {data[1]}")    ## f-string in Python
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])

data['b']

Series as Dictionary

Note that a Series also resembles a dictionary, since its index plays the role of the dictionary's keys. This becomes clearer when a Series is created from a dictionary:

population_dict = {'California': 38332521,
 'Texas': 26448193,
 'New York': 19651127,
 'Florida': 19552860,
 'Illinois': 12882135}
population = pd.Series(population_dict)

print(f"Dictionary way: {population_dict['Texas']}")
print(f"PD Series way: {population['Texas']}")

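The difference, however, is that a Series index is ordered, so it supports slicing to select a range of values, whereas a dictionary does not:

print(population['California':'New York'])   ## slicing works on a Series
try:
    population_dict['California':'New York']   ## a dictionary cannot be sliced
except (TypeError, KeyError):
    print("Failure to do slicing")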
A Series can be constructed from several kinds of data:

  • For a list or a NumPy array, the index defaults to an integer sequence.

  • The data can also be a dictionary, in which case the keys become the index.

  • In all cases, the index can be specified explicitly.

  • With a dictionary, an explicit index also selects which keys to keep, for example a chosen set of stocks.

These points are illustrated below:

## From a list
pd.Series([2, 4, 6])
## A scalar is repeated to fill the specified index
pd.Series(5, index=[1, 2, 3])
## From a dictionary, keeping only the keys listed in index
pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2])

6.2.2. DataFrames

If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names. Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects. Here, by “aligned” we mean that they share the same index.

area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
 'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)

## A dataframe is a dictionary with keys as column names and values as pd.Series
states = pd.DataFrame({'population': population,
 'area': area})


Pandas DataFrames can be constructed in various ways:

  • From a single Series.

    pd.DataFrame(population, columns=['population'])
  • From a list of dicts.

    data = [{'a': i, 'b': 2 * i} for i in range(3)]
    pd.DataFrame(data)
  • From a dictionary of Series objects. This was shown above.

  • From a two-dimensional NumPy array.

    pd.DataFrame(np.random.rand(3, 2),   ## any 3x2 array works here
                 columns=['foo', 'bar'], index=['a', 'b', 'c'])

6.3. Pandas Index

  • A Pandas Index is an immutable array: it behaves like a NumPy array in many ways but, unlike a NumPy array, it cannot be modified in place.

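ind = pd.Index([2, 3, 5, 7, 11])
## Accessing index
print(ind[::2])  ## [::2] takes every second element, while [:2] takes the first two
try:
    ind[1] = 0   ## an Index is immutable, so assignment raises TypeError
except TypeError:
    print("Error found")
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

indC = indA.intersection(indB)  ## intersection
## the operator forms (indA & indB, indA | indB) were deprecated for set operations; use the methods
indD = indA.union(indB)  ## union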

6.4. Data Indexing and Selection

data = pd.Series([0.25, 0.5, 0.75, 1.0],
index=['a', 'b', 'c', 'd'])
data['b']     ## accessing the b index location
data.keys()     ## also gives the index object
list(data.items())   ## gives both keys and values

A Series can also be modified with dictionary-like syntax: assigning to a new index label extends it. This mutability is a convenient feature: under the hood, Pandas makes the decisions about memory layout and data copying that might be needed, so the user generally does not need to worry about these issues.

data['e'] = 1.25
print("slicing by explicit index")
print(data['a':'c'])

print("slicing by implicit integer index")
print(data[0:2])

print("masking (parentheses are important)")
print(data[(data > 0.3) & (data < 0.8)])

print("fancy indexing")
print(data[['a', 'e']])

## remember: when indexing with a list of labels like this, a ":" slice won't work inside the list

6.5. Indexers: loc and iloc

Note: These slicing and indexing conventions can be a source of confusion. For example, if your Series has an explicit integer index, an indexing operation such as data[1] will use the explicit indices, while a slicing operation like data[1:3] will use the implicit Python-style index.
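The mismatch described in the note can be seen in a minimal sketch. The variable names `single` and `sliced` are introduced here only for illustration:

```python
import pandas as pd

# A Series whose explicit index happens to be made of integers
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

single = data[1]     # explicit index: the value labelled 1
sliced = data[1:3]   # implicit index: positions 1 and 2

print(single)
print(sliced)
```

Plain indexing looks up the label 1, while the slice counts positions, so it returns the entries labelled 3 and 5.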

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
## 'loc' always indexes and slices using the explicit index
print(data.loc[1:3])   # does show the value at index label 3

## 'iloc' always uses the implicit Python-style index
print(data.iloc[1:3])  # positions 1 and 2 only; doesn't show a value at position 3

6.6. Data Selection in Dataframe

area = pd.Series({'California': 423967, 'Texas': 695662,
'New York': 141297, 'Florida': 170312,
'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
'New York': 19651127, 'Florida': 19552860,
'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})

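print(data.area is data['area'])   ## same column either way (True when pandas hands back its cached column object)
## attribute access won't work if the column name clashes with a DataFrame method, such as pop
print(data.pop is data['pop'])     ## False: data.pop is the DataFrame.pop method
data['density'] = data['pop'] / data['area']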

6.7. DataFrame as two-dimensional array

As mentioned previously, we can also view the DataFrame as an enhanced two-dimensional array. We can examine the raw underlying data array using the values attribute:

data.values    ## the underlying two-dimensional NumPy array
data.T    ## for Transpose
data.values[0]  ## a single index on the array accesses a row
data.iloc[:3, :2]
data.loc[:'Florida', :'pop']
data.loc[data.density > 100, ['pop', 'density']]

6.8. Operating on Data in Pandas

Pandas inherits much of its element-wise functionality from NumPy, and the ufuncs introduced in “Computation on NumPy Arrays: Universal Functions” are key to this.

6.9. Ufuncs: Index Preservation

rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))

df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
columns=['A', 'B', 'C', 'D'])

If we apply a NumPy ufunc on either of these objects, the result will be another Pandas object with the indices preserved:
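As a minimal sketch of this behaviour, the block below rebuilds the same two objects and applies `np.exp` and `np.sin` to them. The result names `exp_ser` and `sin_df` are chosen here for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D'])

exp_ser = np.exp(ser)            # still a Series, same index as ser
sin_df = np.sin(df * np.pi / 4)  # still a DataFrame, same index and columns

print(exp_ser)
print(sin_df)
```

In both cases only the values change; the row and column labels are carried over unchanged.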


6.10. Ufuncs: Index Alignment

area = pd.Series({'Alaska': 1723337, 'Texas': 695662,
'California': 423967}, name='area')
population = pd.Series({'California': 38332521, 'Texas': 26448193,
'New York': 19651127}, name='population')

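## division aligns on the union of the two indices; a label missing from either Series gives NaN
print(population / area)

## fill_value=0 substitutes 0 for a missing entry before dividing
print(population.divide(area, fill_value=0))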
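The same alignment rule can be checked on a smaller case where the numbers are easy to follow. This is only a sketch: the Series `A` and `B` and the result names are hypothetical and are not part of the data above.

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# The result index is the union of both indices; labels 0 and 3 each
# appear in only one Series, so their sums are NaN.
summed = A + B

# With fill_value=0, a missing entry is treated as 0 before adding.
filled = A.add(B, fill_value=0)

print(summed)
print(filled)
```

Operators such as `+` and `/` always align on labels first; the method forms (`add`, `divide`, and so on) accept `fill_value` for the cases where NaN is not wanted.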

6.11. Exercises

  1. Posters submitted to the 2021 UConn Sports Analytics Symposium were collected at a website. After the deadline, two files were generated: ucsas2021_poster.csv contains the poster submitter’s information including their names, emails, titles, and abstracts; ucsas2021_pdf.csv contains the file name of the pdf posters. To facilitate the virtual poster session, a group of UConn student volunteers signed up to set up virtual webex meetings. The Google spreadsheet ucsas2021_volunteers.csv contains a webex link for each poster presenter. Write a script to process the three input files to generate an output markdown file that gives a virtual directory of the poster session like https://statds.org/events/ucsas2021/poster_directory.html.