Pandas by Example


I find most Pandas examples to be either too abstract (e.g., DataFrames built from single letters, random numbers, etc.) or too heavyweight (they load humongous data sets from the likes of Kaggle). Furthermore, the data sets tend to change depending on the method or property at hand. I, instead, use the same small data set for nearly all of my examples.

This is not a ‘my tutorial is better than your tutorial’ affair. It is, rather, an alternative set of examples which may work out for some people.

Essentials

1# Module Import
2
3import numpy as np     
4import pandas as pd
3import re              # only used in a few examples

Series Introduction

Series are similar to NumPy arrays except that the index may be user-defined.

Declaration

1# Declaration without labels
2pd.Series(['A+','A','C'])
0    A+
1     A
2     C
dtype: object
1# Declaration with labels by passing an index list
2pd.Series(['A+','A','C'], index = ['Daisy','Taffy','Nina'])
Daisy    A+
Taffy     A
Nina      C
dtype: object
1# Declaration with labels using a dictionary
2pd.Series({ 'Daisy' : 'A+',
3            'Taffy' : 'A',
4            'Nina'  : 'C'})
Daisy    A+
Taffy     A
Nina      C
dtype: object

Querying (Keys and Values)

1# Keys and values
2s = pd.Series({ 'Daisy' : 'A+',
3                'Taffy' : 'A',
4                'Nina'  : 'C'})
5print("  keys: {}".format(s.keys()))
6print("values: {}".format(s.to_list()))
  keys: Index(['Daisy', 'Taffy', 'Nina'], dtype='object')
values: ['A+', 'A', 'C']

Querying (Select Elements by Index and Key)

1# Elements by Index and Key
2s = pd.Series({ 'Daisy' : 'A+',
3                'Taffy' : 'A',
4                'Nina'  : 'C'})
5
6last_element = len(s)-1 # 2
7print("By index: {}".format(s.iloc[last_element]))
8print("By key:   {}".format(s.loc['Nina']))
By index: C
By key:   C

Querying (Head and Tail)

1# Head and Tail 
2s = pd.Series(np.random.randint(1,91,500)) # 500 random numbers 1 <= n < 91
3
4# Bingo numbers
5# We use to_list() since the result is another Series object
6print("head: {}".format(s.head().to_list()))
7print("tail: {}".format(s.tail().to_list()))
head: [81, 16, 13, 78, 39]
tail: [66, 4, 78, 57, 6]

Updating Series (Modifying Elements)

1# Changing Elements
2s = pd.Series({ 'Daisy' : 'A+',
3                'Taffy' : 'A',
4                'Nina'  : 'C'})
5
6s['Nina'] = 'B'
7s
Daisy    A+
Taffy     A
Nina      B
dtype: object

Updating Series (Appending Series)

1# Appending series
2s1 = pd.Series([1,2,3], index=['One','Two','Three'])
3s2 = pd.Series([4,'III'], index=['Four','Three'])
4s3 = s1.append(s2) # s1 is not modified in place, unlike Python's lists!
5s3
One        1
Two        2
Three      3
Four       4
Three    III
dtype: object
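Series.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat is the forward-compatible equivalent. A minimal sketch reproducing the example above:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['One', 'Two', 'Three'])
s2 = pd.Series([4, 'III'], index=['Four', 'Three'])

# pd.concat returns a new Series; s1 and s2 are left untouched
s3 = pd.concat([s1, s2])
print(s3.to_list())  # [1, 2, 3, 4, 'III']
```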
Updating Series (Broadcasting)

 1# Broadcasting (direct arithmetic)
 2s = pd.Series([1,2,3])
 3
 4# New Series
 5print("s * 2   = {}".format((s * 2).to_list()))
 6print("s - 1   = {}".format((s - 1).to_list()))
 7
 8# Modify in Place
 9s+=10
10print("After in place modification: {}".format(s.to_list()))
s * 2   = [2, 4, 6]
s - 1   = [0, 1, 2]
After in place modification: [11, 12, 13]

Series Aggregations and Computations

Unique values

1pd.Series([10,20,20,30]).unique()
array([10, 20, 30])

Sum

1pd.Series([1,2,3]).sum()
6
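Besides unique() and sum(), Series expose many other built-in reductions; a few common ones, sketched on the same tiny data:

```python
import pandas as pd

s = pd.Series([1, 2, 3])
print(s.mean())   # 2.0
print(s.min())    # 1
print(s.max())    # 3
print(s.count())  # 3 (counts non-null elements only)
```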

Series Gotchas

Indexing Using Square Brackets

1# Indexing without iloc using square brackets
2s = pd.Series({ 1 : 'First',
3                2 : 'Second'})
4try:
5    # This fails: with an integer index, 0 is looked up as a label (key), not a position
6    print(s[0])
7except KeyError:
8    # This works: iloc always indexes by position
9    print("This works: " + s.iloc[0])
This works: First

None vs NaN

1# None vs Not a Number (NaN)
2grades  = pd.Series(['A+','A',None])
3numbers = pd.Series([10,9,None])
4print ("In grades, the third element is:  {}".format(grades[2]))
5print ("In numbers, the third element is: {}".format(numbers[2]))
6
7# Testing for None doesn't work with equals; a dedicated function is required
8print ("numbers[2] == None:   {}".format(numbers[2] == None))
9print ("np.isnan(numbers[2]): {}".format(np.isnan(numbers[2])))
In grades, the third element is:  None
In numbers, the third element is: nan
numbers[2] == None:   False
np.isnan(numbers[2]): True
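pd.isna() (and its complement pd.notna()) tests for both None and NaN uniformly, avoiding the == None pitfall above; a small sketch:

```python
import pandas as pd

grades  = pd.Series(['A+', 'A', None])  # None stays None (object dtype)
numbers = pd.Series([10, 9, None])      # None becomes NaN (float dtype)

# pd.isna() handles both missing-value representations
print(pd.isna(grades[2]))   # True
print(pd.isna(numbers[2]))  # True
```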

Repeated Keys Are Possible

1# Repeated Keys
2# Series are not dictionaries nor arrays
3#    - Types may be mixed
4#    - Element labels/keys may be repeated
5pd.Series(["Austria","Germany","Italy"], index=["German","German","Italian"])
German     Austria
German     Germany
Italian      Italy
dtype: object
 1# Repeated Keys #2
 2# A location may return a single value or a Series!
 3s = pd.Series(["Austria","Germany","Italy"], index=["German","German","Italian"])
 4
 5# For Italian
 6italian = s.loc["Italian"]
 7print("* ", end='') 
 8print(italian, end='') 
 9print(type(italian))
10
11# For German
12print("* ", end='') 
13german = s.loc["German"]
14print(german.to_list(), end='')
15print(type(german))
* Italy<class 'str'>
* ['Austria', 'Germany']<class 'pandas.core.series.Series'>

DataFrames (Declaration)

DataFrames aggregate Series objects by treating Series as labeled rows.

1# Declaration using Series
2c1 = pd.Series(["French", "EUR",65],index=["Language","Currency","Population"])
3c2 = pd.Series(["English","GBP",65],index=["Language","Currency","Population"])
4c3 = pd.Series(["Italy",  "EUR",60],index=["Language","Currency","Population"])
5
6# The columns may be specified by adding 'name=Column' to each Series instead
7pd.DataFrame([c1,c2,c3],index=["France","UK","Italy"])

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# Declaration using Tuples
2countries = [("French","EUR",65),
3             ("English","GBP",65),
4             ("Italy","EUR",60)]
5               
6pd.DataFrame(countries,
7             columns=["Language","Currency","Population"],
8             index=["France","UK","Italy"])

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
 1# Declaration using a list of dictionaries
 2countries = [{"Language"   : "French",
 3              "Currency"   : "EUR",
 4              "Population" : 65},
 5             {"Language"   : "English",
 6              "Currency"   : "GBP",
 7              "Population" : 65},
 8             {"Language"   : "Italy",
 9              "Currency"   : "EUR",
10              "Population" : 60}]
11              
12df = pd.DataFrame(countries,index=["France","UK","Italy"])
13df

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60

Querying (Simple)

Rows and Columns

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# Rows and columns
2print("   rows: {}".format(df.index))
3print("columns: {}".format(df.columns))
   rows: Index(['France', 'UK', 'Italy'], dtype='object')
columns: Index(['Language', 'Currency', 'Population'], dtype='object')

Select Row By Index and Label

1# Sample DataFrame
2df

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# Select row using index
2df.iloc[0]
Language      French
Currency         EUR
Population        65
Name: France, dtype: object
1# Select row using label
2df.loc['France'] # Produces a Series object
Language      French
Currency         EUR
Population        65
Name: France, dtype: object
1# Repeated labels will produce a DataFrame rather than a Series object!
2# (DataFrame.append was removed in pandas 2.0; use pd.concat there)
3df2 = df.append(pd.Series(["Sicilian","EUR",60],
4                index=["Language","Currency","Population"],name="Italy"))
5df2.loc['Italy']

Language Currency Population
Italy Italy EUR 60
Italy Sicilian EUR 60

Select Column By Label

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# Select column using name
2df['Language']
France     French
UK        English
Italy       Italy
Name: Language, dtype: object
1# Select specific columns
2df[["Language","Currency"]]

Language Currency
France French EUR
UK English GBP
Italy Italy EUR

Select Element By Row and Column

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1df.loc["France","Currency"]
'EUR'
1# This can also be accomplished by first selecting the row and then the column
2df.loc["France"]["Currency"]
'EUR'
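For retrieving a single scalar, .at (label-based) and .iat (position-based) are faster alternatives to chained lookups; a minimal sketch rebuilding the sample DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"Language":   ["French", "English", "Italy"],
                   "Currency":   ["EUR", "GBP", "EUR"],
                   "Population": [65, 65, 60]},
                  index=["France", "UK", "Italy"])

print(df.at["France", "Currency"])  # 'EUR' (by row and column labels)
print(df.iat[0, 1])                 # 'EUR' (by row and column positions)
```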

Querying (Advanced)

Selecting Rows Using a Boolean Mask

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# France we keep, UK we leave out, Italy we keep
2mask = pd.Series([True, False, True], index=["France","UK","Italy"])
3df[mask]

Language Currency Population
France French EUR 65
Italy Italy EUR 60
1# Let's invert the mask so it becomes False, True, False
2df[mask.map(lambda x: not x)]

Language Currency Population
UK English GBP 65
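The inversion above can also be written with the ~ operator, which negates a boolean Series element-wise; a sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({"Language":   ["French", "English", "Italy"],
                   "Currency":   ["EUR", "GBP", "EUR"],
                   "Population": [65, 65, 60]},
                  index=["France", "UK", "Italy"])
mask = pd.Series([True, False, True], index=["France", "UK", "Italy"])

# ~mask is equivalent to mask.map(lambda x: not x)
print(df[~mask].index.to_list())  # ['UK']
```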

Creating Boolean Masks using Operators

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1print(df['Population'] >= 65)
France     True
UK         True
Italy     False
Name: Population, dtype: bool
1print(df['Currency'] == "EUR")
France     True
UK        False
Italy      True
Name: Currency, dtype: bool

Complex Queries: Boolean Logic plus Operators

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1(df['Population'] >= 65) & (df['Currency'] == "EUR")
France     True
UK        False
Italy     False
dtype: bool

Applying Boolean Masks to DataFrames

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# Prepare mask separately
2mask = (df['Population'] >= 65) & (df['Currency'] == "EUR")
3df[mask]

Language Currency Population
France French EUR 65
1# Nested
2df[(df['Population'] >= 65) & (df['Currency'] == "EUR")]

Language Currency Population
France French EUR 65

Boolean Masks and Nulls/NaNs

1# At the Series level
2pd.Series([1,2,3,None]).isnull()
0    False
1    False
2    False
3     True
dtype: bool
1# At the DataFrame level
2df2 = df.copy(deep=True)
3df2.loc["UK","Currency"] = None
4df2.isnull()

Language Currency Population
France False False False
UK False True False
Italy False False False

Preserving Non-Matches using Where()

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# Passing the Boolean Mask to where() does not drop the rows,
2# instead, they appear with their values as 'NaN'
3df.where((df['Population'] >= 65) & (df['Currency'] == "EUR"))

Language Currency Population
France French EUR 65.0
UK NaN NaN NaN
Italy NaN NaN NaN
1# The NaN rows can be dropped using .dropna()
2# This is the behaviour implemented by the direct df[mask] method
3df.where((df['Population'] >= 65) & (df['Currency'] == "EUR")).dropna()

Language Currency Population
France French EUR 65.0

Replacing Values including NaNs

Replacing NaNs/Nulls with Custom Value

1# Starting DataFrame
2df2 = df.copy(deep=True)
3df2.loc["France","Currency"] = None
4df2.loc["Italy","Currency"] = None
1df2.fillna("Bitcoin")

Language Currency Population
France French Bitcoin 65
UK English GBP 65
Italy Italy Bitcoin 60
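fillna() also accepts a dictionary mapping column names to replacement values, which limits the fill to specific columns; a sketch assuming a similarly modified DataFrame:

```python
import pandas as pd

df2 = pd.DataFrame({"Language":   ["French", "English", "Italy"],
                    "Currency":   [None, "GBP", None],
                    "Population": [65, 65, 60]},
                   index=["France", "UK", "Italy"])

# Only the 'Currency' column is filled; other columns are left untouched
filled = df2.fillna({"Currency": "Bitcoin"})
print(filled["Currency"].to_list())  # ['Bitcoin', 'GBP', 'Bitcoin']
```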

Back Fill and Forward Fill

1# Starting DataFrame
2df2 = df.copy(deep=True)
3df2.loc["UK","Population"] = None
4df2

Language Currency Population
France French EUR 65.0
UK English GBP NaN
Italy Italy EUR 60.0
1#       Value
2#         |
3#         |    Forward Fill     
4#         v
5#        NaN
6
7df2.fillna(method='ffill') # deprecated since pandas 2.1: prefer df2.ffill()

Language Currency Population
France French EUR 65.0
UK English GBP 65.0
Italy Italy EUR 60.0
1#        NaN
2#         ^
3#         |    Backward Fill     
4#         |
5#       Value
6
7df2.fillna(method='bfill') # deprecated since pandas 2.1: prefer df2.bfill()

Language Currency Population
France French EUR 65.0
UK English GBP 60.0
Italy Italy EUR 60.0

Value Replacement

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# Single parameter, e.g., replace("EUR","Euro") also available
2df.replace(["EUR" ,"GBP"],
3           ["Euro","British Pound Sterling"])

Language Currency Population
France French Euro 65
UK English British Pound Sterling 65
Italy Italy Euro 60

Value Replacement with Regexp

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1df.replace(to_replace="(ly$)", value="lia", regex=True)

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italia EUR 60

Extracting and Splitting Columns

Using Apply() to Modify and/or Create Extra Columns

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# Create extra column with the population 
2def pop_to_million(row):
3    row['Pop_total']=row['Population']*1000000
4    return row
5df.apply(pop_to_million, axis=1)

Language Currency Population Pop_total
France French EUR 65 65000000
UK English GBP 65 65000000
Italy Italy EUR 60 60000000

Using Extract() to Generate New Columns with a Regexp

1df2 = df.copy(deep=True)
2df2 

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# More groups will result in more columns!
2iso = df2["Language"].str.extract(r"(?P<ISO>^\w{2})").applymap(str.upper)
3iso

ISO
France FR
UK EN
Italy IT
1# Adding new column
2df2["LANG_ISO"]=iso
3df2

Language Currency Population LANG_ISO
France French EUR 65 FR
UK English GBP 65 EN
Italy Italy EUR 60 IT

Sorting

Sort by Index

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# ascending=True
2df.sort_index()

Language Currency Population
France French EUR 65
Italy Italy EUR 60
UK English GBP 65

Sort by Column

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# ascending=True
2df.sort_values("Language")

Language Currency Population
UK English GBP 65
France French EUR 65
Italy Italy EUR 60
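sort_values() also takes a list of columns, with a matching list for ascending to control each key's direction; a sketch on the sample data:

```python
import pandas as pd

df = pd.DataFrame({"Language":   ["French", "English", "Italy"],
                   "Currency":   ["EUR", "GBP", "EUR"],
                   "Population": [65, 65, 60]},
                  index=["France", "UK", "Italy"])

# Population descending; ties broken by Language ascending
result = df.sort_values(["Population", "Language"], ascending=[False, True])
print(result.index.to_list())  # ['UK', 'France', 'Italy']
```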

Transposing

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1df.T # After transposing

France UK Italy
Language French English Italy
Currency EUR GBP EUR
Population 65 65 60

Updating Data

Deleting Rows

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# Produces a new DataFrame, not in-place, unless inplace=True is added
2df.drop(["UK","France"])

Language Currency Population
Italy Italy EUR 60

Deleting Columns

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# Produces a new DataFrame, not in-place, unless inplace=True is added
2df.drop(["Currency","Population"],axis=1)

Language
France French
UK English
Italy Italy
1# Imperative alternative
2df2 = df.copy(deep=True)
3del df2["Population"]
4df2

Language Currency
France French EUR
UK English GBP
Italy Italy EUR

Updating Axes

Renaming Axes By Label

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# Both columns and rows may be renamed at once
2df.rename(columns={"Language" : "LAN", "Currency" : "CUR", "Population" : "POP"},
3          index={"France" : "FR", "Italy" : "IT"})

LAN CUR POP
FR French EUR 65
UK English GBP 65
IT Italy EUR 60

Replacing Axes Entirely

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1df2 = df.copy(deep=True)
2df2.columns = ["LAN","CUR","POP"]
3df2.index = ["FR","UK","IT"]
4df2

LAN CUR POP
FR French EUR 65
UK English GBP 65
IT Italy EUR 60

Applying a Transformation Function

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# Using mapper 
2from functools import partial
3
4f = lambda limit, label: label.upper()[:limit]
5df2=df.rename(mapper=partial(f,3), axis=1) # columns
6df3=df2.rename(mapper=partial(f,2), axis=0) # rows/index
7df3

LAN CUR POP
FR French EUR 65
UK English GBP 65
IT Italy EUR 60
1# Replacing columns and index outright
2df2 = df.copy(deep=True)
3df2.columns = map(partial(f,3),df.columns)
4df2.index = map(partial(f,3),df.index)
5df2

LAN CUR POP
FRA French EUR 65
UK English GBP 65
ITA Italy EUR 60

Advanced Indexing

Setting an Existing Column as the Index

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# Here we set 'Language' as the index but we lose the country name!
2df.set_index('Language')

Currency Population
Language
French EUR 65
English GBP 65
Italy EUR 60
1# A way to solve this is to save the index as a Series and then add it as a column
2countries = df.index
3df2 = df.set_index('Language')
4df2['Country'] = countries
5df2

Currency Population Country
Language
French EUR 65 France
English GBP 65 UK
Italy EUR 60 Italy

Promoting the Index to a Column

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# An automatic numerical index is created as a result
2df.reset_index()

index Language Currency Population
0 France French EUR 65
1 UK English GBP 65
2 Italy Italy EUR 60

Compound Index

1# We'll first start with a DataFrame as follows
2df2 = df.copy(deep=True)
3df2['Group'] = ["EU","Common Wealth","EU"]
4df2 = df2.reset_index().rename(columns={"index" : "Country"})
5df2

Country Language Currency Population Group
0 France French EUR 65 EU
1 UK English GBP 65 Common Wealth
2 Italy Italy EUR 60 EU
1# Here we select two columns as the index
2df3 = df2.set_index(['Group','Country'])
3df3

Language Currency Population
Group Country
EU France French EUR 65
Common Wealth UK English GBP 65
EU Italy Italy EUR 60
1# Equivalent to df3.loc[("EU")]
2df3.loc["EU"]

Language Currency Population
Country
France French EUR 65
Italy Italy EUR 60
1# Equivalent to df3.loc[("EU","France")]
2df3.loc["EU","France"]
Language      French
Currency         EUR
Population        65
Name: (EU, France), dtype: object
1# Selecting multiple rows
2df3.loc[[("EU","France"),("Common Wealth","UK")]]

Language Currency Population
Group Country
EU France French EUR 65
Common Wealth UK English GBP 65
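Beyond .loc with tuples, DataFrame.xs() selects a cross-section by the value of any index level, not just the outermost one; a sketch on the same compound index:

```python
import pandas as pd

df2 = pd.DataFrame({"Country":    ["France", "UK", "Italy"],
                    "Language":   ["French", "English", "Italy"],
                    "Currency":   ["EUR", "GBP", "EUR"],
                    "Population": [65, 65, 60],
                    "Group":      ["EU", "Common Wealth", "EU"]})
df3 = df2.set_index(["Group", "Country"])

# Select by the inner 'Country' level, regardless of the 'Group' value
print(df3.xs("Italy", level="Country"))
```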

Loading Data from CSV File

Sample file used for example

resources/countries_small.csv

1!cat resources/countries_small.csv
"country","population","gdp_in_trillions"
"China",1439323776,12.238
"India",1380004385,2.651
"USA",331002651,19.485
1# No arguments
2pd.read_csv("resources/countries_small.csv")

country population gdp_in_trillions
0 China 1439323776 12.238
1 India 1380004385 2.651
2 USA 331002651 19.485
1# Specifying index column
2pd.read_csv("resources/countries_small.csv", index_col=0)

population gdp_in_trillions
country
China 1439323776 12.238
India 1380004385 2.651
USA 331002651 19.485
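read_csv() accepts many more useful arguments, e.g., usecols and nrows. The sketch below recreates the sample file in memory via io.StringIO so it runs without the file on disk:

```python
import io
import pandas as pd

csv_data = ('"country","population","gdp_in_trillions"\n'
            '"China",1439323776,12.238\n'
            '"India",1380004385,2.651\n'
            '"USA",331002651,19.485\n')

# Read only two columns and the first two data rows
df_csv = pd.read_csv(io.StringIO(csv_data),
                     index_col="country",
                     usecols=["country", "population"],
                     nrows=2)
print(df_csv)
```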

Joining DataFrames

The following two DataFrames will be used throughout this section

 1bands_df = pd.DataFrame([{'name' : 'Iron Maiden', 'country' : 'UK'},
 2                         {'name' : 'Metallica'  , 'country' : 'USA'},
 3                         {'name' : 'AC/DC'      , 'country' : 'Australia'}])
 4albums_df = pd.DataFrame([{'name' : 'Iron Maiden', 'best_album' : 'Powerslave'},
 5                          {'name' : 'Metallica'  , 'best_album' : 'Ride the Lightning'},
 6                          {'name' : 'Helloween'  , 'best_album' : 'Walls of Jericho'}])
 7# merge() joins on the shared 'name' column by default;
 8# the join column may also be given explicitly via merge(on='name')

Outer Join

The outer join is a union; non-matching items are preserved

1# Reference Left and Right Tables
2display(bands_df)
3display(albums_df)
4# Outer Join
5pd.merge(bands_df, albums_df, how='outer')

name country
0 Iron Maiden UK
1 Metallica USA
2 AC/DC Australia

name best_album
0 Iron Maiden Powerslave
1 Metallica Ride the Lightning
2 Helloween Walls of Jericho

name country best_album
0 Iron Maiden UK Powerslave
1 Metallica USA Ride the Lightning
2 AC/DC Australia NaN
3 Helloween NaN Walls of Jericho

Inner Join

The inner join is an intersection; only matching items are preserved

1# Reference Left and Right Tables
2display(bands_df)
3display(albums_df)
4# Inner Join
5pd.merge(bands_df, albums_df, how='inner')

name country
0 Iron Maiden UK
1 Metallica USA
2 AC/DC Australia

name best_album
0 Iron Maiden Powerslave
1 Metallica Ride the Lightning
2 Helloween Walls of Jericho

name country best_album
0 Iron Maiden UK Powerslave
1 Metallica USA Ride the Lightning

Left Join

Include all items on the left table regardless of whether they have a match on the right

1# Reference Left and Right Tables
2display(bands_df)
3display(albums_df)
4# Left Join
5pd.merge(bands_df, albums_df, how='left')

name country
0 Iron Maiden UK
1 Metallica USA
2 AC/DC Australia

name best_album
0 Iron Maiden Powerslave
1 Metallica Ride the Lightning
2 Helloween Walls of Jericho

name country best_album
0 Iron Maiden UK Powerslave
1 Metallica USA Ride the Lightning
2 AC/DC Australia NaN

Right Join

Include all items on the right table regardless of whether they have a match on the left

1# Reference Left and Right Tables
2display(bands_df)
3display(albums_df)
4# Right Join
5pd.merge(bands_df, albums_df, how='right')

name country
0 Iron Maiden UK
1 Metallica USA
2 AC/DC Australia

name best_album
0 Iron Maiden Powerslave
1 Metallica Ride the Lightning
2 Helloween Walls of Jericho

name country best_album
0 Iron Maiden UK Powerslave
1 Metallica USA Ride the Lightning
2 Helloween NaN Walls of Jericho
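Passing indicator=True to merge() adds a _merge column recording each row's provenance (left_only, right_only, or both), which is handy for auditing joins; a sketch on the same tables:

```python
import pandas as pd

bands_df  = pd.DataFrame([{'name': 'Iron Maiden', 'country': 'UK'},
                          {'name': 'Metallica',   'country': 'USA'},
                          {'name': 'AC/DC',       'country': 'Australia'}])
albums_df = pd.DataFrame([{'name': 'Iron Maiden', 'best_album': 'Powerslave'},
                          {'name': 'Metallica',   'best_album': 'Ride the Lightning'},
                          {'name': 'Helloween',   'best_album': 'Walls of Jericho'}])

# Each row now reports which side(s) it came from
merged = pd.merge(bands_df, albums_df, how='outer', indicator=True)
print(merged[['name', '_merge']])
```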

DataFrame Concatenation

The following two DataFrames are used to illustrate concatenation:

1ernie_bands_df = pd.DataFrame([{'name' : 'Iron Maiden'   , 'country' : 'UK'       , 'genre' : 'metal'},
2                               {'name' : 'Metallica'     , 'country' : 'USA'      , 'genre' : 'metal'},
3                               {'name' : 'AC/DC'         , 'country' : 'Australia', 'genre' : 'metal'}])
4adri_bands_df  = pd.DataFrame([{'name' : 'Modern Talking', 'country' : 'Germany'},
5                               {'name' : 'Fancy'         , 'country' : 'Germany'},
6                               {'name' : 'Ramses B'      , 'country' : 'UK'}])

Ignoring Index

Setting ignore_index=True discards the original indices and generates a fresh numeric sequence for the combined result.

1pd.concat([ernie_bands_df,adri_bands_df],ignore_index=True)

name country genre
0 Iron Maiden UK metal
1 Metallica USA metal
2 AC/DC Australia metal
3 Modern Talking Germany NaN
4 Fancy Germany NaN
5 Ramses B UK NaN

Differentiating Between Items

It is possible to associate each item with its source DataFrame.

1pd.concat([ernie_bands_df,adri_bands_df],keys=['Ernie','Adri'],names=['List','Row Id'])

name country genre
List Row Id
Ernie 0 Iron Maiden UK metal
1 Metallica USA metal
2 AC/DC Australia metal
Adri 0 Modern Talking Germany NaN
1 Fancy Germany NaN
2 Ramses B UK NaN

DataFrame GroupBy

Simple Iteration

In its simplest form, DataFrame.groupby(COLUMN) provides an iterable of pairs in which the first component is the group name, and the second is the DataFrame corresponding to the respective group. COLUMN may also be passed as by=COLUMN.

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1for group, frame in df.groupby("Currency"):
2    print("--- Group {} ----------".format(group))
3    print(frame)
--- Group EUR ----------
       Language Currency  Population
France   French      EUR          65
Italy     Italy      EUR          60
--- Group GBP ----------
   Language Currency  Population
UK  English      GBP          65

Grouping Using a Custom Function

A function taking each index value, in the form DataFrame.groupby(lambda index_value: index_value), may be provided as an alternative to column name(s). This helps create custom grouping criteria rather than relying on existing columns. The function may also be passed using the by argument.

1# Let's use the numeric index. Here the function is neutral
2for group, frame in df.reset_index().groupby(lambda item: item):
3    print("--- Group {} ----------".format(group))
4    print(frame)
--- Group 0 ----------
    index Language Currency  Population
0  France   French      EUR          65
--- Group 1 ----------
  index Language Currency  Population
1    UK  English      GBP          65
--- Group 2 ----------
   index Language Currency  Population
2  Italy    Italy      EUR          60
1# Let's group items by even and odd index numbers
2for group, frame in df.reset_index().groupby(lambda item: 'Even' if item % 2 == 0 else 'Odd'):
3    print("--- Group {} ----------".format(group))
4    print(frame)
--- Group Even ----------
    index Language Currency  Population
0  France   French      EUR          65
2   Italy    Italy      EUR          60
--- Group Odd ----------
  index Language Currency  Population
1    UK  English      GBP          65

Multi-level Index

If the index is multi-level, we may want to specify the level we want to group by.

1# Sample DataFrame
2df2 = df.copy(deep=True).reset_index().rename(columns={"index":"Country"}).set_index(["Country","Currency"]) 
3df2

Language Population
Country Currency
France EUR French 65
UK GBP English 65
Italy EUR Italy 60
1# Use first index component
2for group, _ in df2.groupby(level=0):
3    print(group)
France
Italy
UK
1# Use second index component
2for group, _ in df2.groupby(level=1):
3    print(group)
EUR
GBP
1# Use both components
2for group, _ in df2.groupby(level=[0,1]):
3    print(group)
('France', 'EUR')
('Italy', 'EUR')
('UK', 'GBP')
1# Note that when passing functions, item becomes a tuple
2for group, _ in df2.groupby(lambda item: item[0] + "-" + item[1]):
3    print(group)
France-EUR
Italy-EUR
UK-GBP

Direct Aggregation

Rather than iterating through a Group object, aggregations may be performed directly upon grouped column name(s) using built-in aggregation functions such as min, max, mean, etc.

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1df.groupby("Currency").agg({"Population" : ["min","max","mean","median","sum"]})

Population
min max mean median sum
Currency
EUR 60 65 62.5 62.5 125
GBP 65 65 65.0 65.0 65
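The dictionary-of-lists form above yields a two-level column header; named aggregation produces flat, self-describing column names instead. A sketch on the sample data:

```python
import pandas as pd

df = pd.DataFrame({"Currency":   ["EUR", "GBP", "EUR"],
                   "Population": [65, 65, 60]},
                  index=["France", "UK", "Italy"])

# Each keyword becomes an output column: (source column, aggregation)
result = df.groupby("Currency").agg(pop_min=("Population", "min"),
                                    pop_max=("Population", "max"),
                                    pop_sum=("Population", "sum"))
print(result)
```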

Transform

The transform function applies an aggregation (or a user-supplied function) to the non-grouping columns of each group, broadcasting the result back so that the output aligns with the original rows.

1# Sample DataFrame
2df2 = df[["Currency","Population"]]
3df2

Currency Population
France EUR 65
UK GBP 65
Italy EUR 60
1# Using built-in functions
2df2.groupby("Currency").transform("sum")

Population
France 125
UK 65
Italy 125
1# Using user-defined functions
2df2.groupby("Currency").transform(lambda series: sum(series))

Population
France 125
UK 65
Italy 125

Filtering

Applying a filtering function to a Group object removes entire groups that fail the predicate. Please note that in this case the function receives a DataFrame (one per group) rather than a Series object.

1# Sample DataFrame
2df2 = df[["Currency","Population"]]
3df2

Currency Population
France EUR 65
UK GBP 65
Italy EUR 60
1# Remove countries whose combined population grouped by Currency is greater than 100
2df2.groupby("Currency").filter(lambda dataFrame: dataFrame['Population'].sum() > 100)

Currency Population
France EUR 65
Italy EUR 60

Creating New Columns With Apply

With DataFrame.groupby(COLUMN).apply(lambda dataFrame: dataFrame) it is possible to include the aggregated results in a new column.

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1def population_for_fx(dataFrame):
2    dataFrame["Pop_total_for_currency"] = dataFrame['Population'].sum()
3    return dataFrame
4
5df2 = df[["Currency","Population"]] # same subset as in the previous sections
6df2.groupby("Currency").apply(population_for_fx)

Currency Population Pop_total_for_currency
France EUR 65 125
UK GBP 65 65
Italy EUR 60 125

Scales

Categories

Categories allow defining custom scales as full data types so that they can be, for example, sorted and compared properly.

1# Sample DataFrame
2tshirts_df = pd.DataFrame([{ "type" : "unisex", "size" : "S"},
3                           { "type" : "unisex", "size" : "M"},
4                           { "type" : "unisex", "size" : "L"},
5                           { "type" : "kids",   "size" : "XS"},
6                           { "type" : "big",    "size" : "XL"},
7                           { "type" : "big",    "size" : "XXL"}])
8tshirts_df = tshirts_df.set_index("type")
9tshirts_df

size
type
unisex S
unisex M
unisex L
kids XS
big XL
big XXL
1# Sorting is merely lexicographic: Pandas doesn't know about the ordinal nature of t-shirt sizes
2tshirts_df.sort_values("size")

size
type
unisex L
unisex M
unisex S
big XL
kids XS
big XXL
1# The solution is to create a custom category
2tshirt_cat=pd.CategoricalDtype(categories=['XS','S','M','L','XL','XXL'],ordered=True)
3# We now apply the category
4sizes = tshirts_df["size"].astype(tshirt_cat)
5# Let us also create a new DataFrame which includes size using the correct column data type
6tshirts_df2 = tshirts_df.copy(deep=True)
7tshirts_df2["size"] = sizes
8tshirts_df2

size
type
unisex S
unisex M
unisex L
kids XS
big XL
big XXL
1# Now this works!
2tshirts_df2.sort_values("size")

size
type
kids XS
unisex S
unisex M
unisex L
big XL
big XXL
1# Comparisons also work
2tshirts_df2[tshirts_df2["size"] >= 'L']

size
type
unisex L
big XL
big XXL

Data Binning

Data Binning is about ‘bucketing’ data. In Pandas, this is equivalent to converting data typically in a ratio/interval scale to a category.

The pd.cut(series, number_of_bins) function is used for this purpose.

1# Let's split a number of temperature readings into three buckets
2temperatures = pd.Series([0,5,10,15,20,25,30,35,40])
3pd.cut(temperatures,3)
0     (-0.04, 13.333]
1     (-0.04, 13.333]
2     (-0.04, 13.333]
3    (13.333, 26.667]
4    (13.333, 26.667]
5    (13.333, 26.667]
6      (26.667, 40.0]
7      (26.667, 40.0]
8      (26.667, 40.0]
dtype: category
Categories (3, interval[float64, right]): [(-0.04, 13.333] < (13.333, 26.667] < (26.667, 40.0]]
1# Let's use labels
2pd.cut(temperatures,3,labels=["cold","medium","hot"])
0      cold
1      cold
2      cold
3    medium
4    medium
5    medium
6       hot
7       hot
8       hot
dtype: category
Categories (3, object): ['cold' < 'medium' < 'hot']
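Instead of a bin count, pd.cut() also accepts explicit bin edges, which keeps the bucket boundaries meaningful; a sketch using the same readings:

```python
import pandas as pd

temperatures = pd.Series([0, 5, 10, 15, 20, 25, 30, 35, 40])

# Bins are right-closed by default: (-1, 10], (10, 25], (25, 40]
binned = pd.cut(temperatures, bins=[-1, 10, 25, 40], labels=["cold", "medium", "hot"])
print(binned.to_list())
```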

Pivot Tables

Simple Pivot Table

A pivot table is similar to the chaining of df.groupby().agg() in that it allows running an aggregate function against a grouping criterion. The difference is that the grouped values appear as columns.

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# Equivalent groupby()/agg() version
2df.groupby("Currency").agg({"Population" : ["mean","sum"]})

Population
mean sum
Currency
EUR 62.5 125
GBP 65.0 65
1# Pivot table version
2df.pivot_table(values='Population', columns='Currency', aggfunc=['mean','sum'])

mean sum
Currency EUR GBP EUR GBP
Population 62.5 65.0 125 65
1# Subtotals can be enabled by setting margins to True
2df.pivot_table(values='Population', columns='Currency', aggfunc=['mean','sum'],margins=True)

mean sum
Currency EUR All GBP All EUR All GBP All
Population 62.5 62.5 65.0 65.0 125 125 65 65

Querying Labels with Idxmax

While the likes of min() and max() provide the extreme numerical values in an array or Series, sometimes we need to find out the label associated with a minimum or maximum value. We use Series.idxmin() and Series.idxmax(), respectively, for this purpose.

1# Reference table
2pivot_df = df.pivot_table(values='Population', columns='Currency', aggfunc=['mean','sum'])
3pivot_df

mean sum
Currency EUR GBP EUR GBP
Population 62.5 65.0 125 65
1# Obtain the currency with the largest population
2pivot_df["sum"].loc["Population"].idxmax()
'EUR'
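idxmin() works the same way in the other direction. A small self-contained sketch (the totals mirror the pivoted sums above):

```python
import pandas as pd

# Label of the smallest value
totals = pd.Series({'EUR': 125, 'GBP': 65})
print(totals.idxmin())  # GBP
```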

Stacking and Unstacking

Stacking is the process of pivoting columns into (inner) index rows; unstacking reverses it, pivoting index levels back into columns.

1# Reference table
2pivot_df = df.pivot_table(values='Population', columns='Currency', aggfunc=['mean','sum'])
3pivot_df

            mean        sum
Currency     EUR   GBP  EUR  GBP
Population  62.5  65.0  125   65
1# Here EUR and GBP become labels
2pivot_df.stack()

                     mean  sum
           Currency
Population EUR       62.5  125
           GBP       65.0   65
1# And here we get things back to normal again
2pivot_df.stack().unstack()

            mean        sum
Currency     EUR   GBP  EUR  GBP
Population  62.5  65.0  125   65
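The round trip also works on a plain DataFrame. A minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["a", "b"])
stacked = df.stack()                 # columns x/y become the inner index level
print(stacked["a"]["x"])             # 1
print(stacked.unstack().equals(df))  # True
```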

Date/Time

Timestamp Declaration

1# Using ISO-ish text string
2pd.Timestamp('2021-12-22 22:05')
Timestamp('2021-12-22 22:05:00')
1# Using date/time components
2pd.Timestamp(2021, 12, 22, 22, 5)
Timestamp('2021-12-22 22:05:00')

Common Timestamp Methods and Properties

1# Individual components 
2ts = pd.Timestamp(2021, 12, 22, 22, 5)
3"{}-{:02d}-{:02d} {:02d}:{:02d}:{:02d}".format(ts.year,ts.month,ts.day,ts.hour,ts.minute,ts.second)
'2021-12-22 22:05:00'
1# Weekday (1 - Monday ... 7 - Sunday)
2ts.isoweekday()
3
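Related convenience methods spell components out; for instance, day_name():

```python
import pandas as pd

# Same timestamp as above; isoweekday() == 3 means Wednesday
ts = pd.Timestamp(2021, 12, 22, 22, 5)
print(ts.day_name())  # Wednesday
```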

Timestamp Arithmetic, Deltas and Offsets

1# Subtraction between two timestamps
2pd.Timestamp('2021-12-22 22:05')-pd.Timestamp('2021-11-15 22:05')
Timedelta('37 days 00:00:00')
1# Subtracting TimeDelta values
2pd.Timestamp('2021-12-22 22:05')-pd.Timedelta('37 days')
Timestamp('2021-11-15 22:05:00')
1# Subtracting an Offset value (see pd.offsets for more options)
2pd.Timestamp('2021-12-22 22:05')-pd.offsets.Week()
Timestamp('2021-12-15 22:05:00')
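Offsets can also roll a timestamp to a calendar boundary. A sketch with pd.offsets.MonthEnd():

```python
import pandas as pd

# Adding MonthEnd snaps forward to the last day of the month (time is kept)
print(pd.Timestamp('2021-12-22 22:05') + pd.offsets.MonthEnd())
```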

Period Arithmetic

The Period object overloads the arithmetic operators + and - to perform date/time-wise calculations.

1# Year wise 
2p = pd.Period("2021")
3print(p-1)
4print(p)
5print(p+1)
2020
2021
2022
1# Month wise 
2p = pd.Period("2021-12")
3print(p-1)
4print(p)
5print(p+1)
2021-11
2021-12
2022-01
1# Day wise 
2p = pd.Period("2021-12-31")
3print(p-1)
4print(p)
5print(p+1)
2021-12-30
2021-12-31
2022-01-01
1# Hour wise 
2p = pd.Period("2021-12-31 23")
3print(p-1)
4print(p)
5print(p+1)
2021-12-31 22:00
2021-12-31 23:00
2022-01-01 00:00
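Two Periods of the same frequency can also be subtracted from one another; in recent pandas versions this yields an offset. A short sketch:

```python
import pandas as pd

# Subtracting monthly Periods returns a month-based DateOffset; .n is the count
delta = pd.Period("2022-01") - pd.Period("2021-10")
print(delta.n)  # 3
```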

DatetimeIndex and PeriodIndex

A Date/Time value can be used as an index or value in Series and DataFrames.

1# DateTimeIndex
2s = pd.Series(['A','B','C'],index=[pd.Timestamp("2021-12-22"), 
3                                   pd.Timestamp("2021-12-23"), 
4                                   pd.Timestamp("2021-12-24")])
5print(s)
6print(type(s.index))
2021-12-22    A
2021-12-23    B
2021-12-24    C
dtype: object

<class 'pandas.core.indexes.datetimes.DatetimeIndex'>

1# PeriodIndex
2s = pd.Series(['A','B','C'],index=[pd.Period("2021-10"), 
3                                   pd.Period("2021-11"), 
4                                   pd.Period("2021-12")])
5print(s)
6print(type(s.index))
2021-10    A
2021-11    B
2021-12    C
Freq: M, dtype: object

<class 'pandas.core.indexes.period.PeriodIndex'>

Automatic Date Conversion

Pandas can figure out dates written in a variety of common formats.

 1s = pd.Series(["A","B","C","D","E"], 
 2              index=["5th of January, 2021",
 3                     "Feb 27, 2021",
 4                     "2021-01-02",
 5                     "02/01/21", # 2nd of January!
 6                     "03 Jan 21"])   
 7
 8# dayfirst=True is a must for non-US (i.e., European) date formats
 9s.index = pd.to_datetime(s.index,dayfirst=True)
10s
2021-01-05    A
2021-02-27    B
2021-01-02    C
2021-01-02    D
2021-01-03    E
dtype: object
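When the layout is known up front, an explicit format string removes the guesswork entirely. A sketch:

```python
import pandas as pd

# %d/%m/%y pins the day-first interpretation explicitly
print(pd.to_datetime("02/01/21", format="%d/%m/%y"))  # 2021-01-02 00:00:00
```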

Generating Date Ranges

Date ranges can be generated based on a starting date, a given number of periods, and a frequency (weekly, quarterly, etc.).

1# 12 months
2pd.date_range('2021-01-31', periods=12, freq='M')
DatetimeIndex(['2021-01-31', '2021-02-28', '2021-03-31', '2021-04-30',
               '2021-05-31', '2021-06-30', '2021-07-31', '2021-08-31',
               '2021-09-30', '2021-10-31', '2021-11-30', '2021-12-31'],
              dtype='datetime64[ns]', freq='M')
1# First 10 weeks, with the week starting on Monday rather than Sunday
2pd.date_range('2021-12-06', periods=10, freq='W-MON')
DatetimeIndex(['2021-12-06', '2021-12-13', '2021-12-20', '2021-12-27',
               '2022-01-03', '2022-01-10', '2022-01-17', '2022-01-24',
               '2022-01-31', '2022-02-07'],
              dtype='datetime64[ns]', freq='W-MON')
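A sibling helper, pd.period_range(), generates a PeriodIndex in the same fashion. A sketch:

```python
import pandas as pd

idx = pd.period_range('2021-01', periods=3, freq='M')
print([str(p) for p in idx])  # ['2021-01', '2021-02', '2021-03']
```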

Data Ranges in DataFrames

1# Sample DataFrame
2dates_index = pd.date_range('2021-01-01', periods=12, freq='MS') # Month Start
3dates_df    = pd.DataFrame({"number":list(range(1, 13))}, 
4                            index=dates_index)
5dates_df

number
2021-01-01 1
2021-02-01 2
2021-03-01 3
2021-04-01 4
2021-05-01 5
2021-06-01 6
2021-07-01 7
2021-08-01 8
2021-09-01 9
2021-10-01 10
2021-11-01 11
2021-12-01 12
1# Here we add an extra column with the weekday
2dates_df["weekday"] = pd.Series(list(map(lambda x: 
3                                {0 : "MON", 1 : "TUE", 2 : "WED", 3 : "THR", 4 : "FRI", 5 : "SAT", 6 : "SUN"}[x], 
4                                dates_df.index.weekday)),index=dates_index)
5dates_df

number weekday
2021-01-01 1 FRI
2021-02-01 2 MON
2021-03-01 3 MON
2021-04-01 4 THR
2021-05-01 5 SAT
2021-06-01 6 TUE
2021-07-01 7 THR
2021-08-01 8 SUN
2021-09-01 9 WED
2021-10-01 10 FRI
2021-11-01 11 MON
2021-12-01 12 WED

Downsampling

This is the process of reducing the frequency of a time series (for example, from daily to weekly, or from monthly to quarterly), aggregating the values for the intermediate dates (which are lost) in the process.

1# Sample DataFrame
2dates_df

number weekday
2021-01-01 1 FRI
2021-02-01 2 MON
2021-03-01 3 MON
2021-04-01 4 THR
2021-05-01 5 SAT
2021-06-01 6 TUE
2021-07-01 7 THR
2021-08-01 8 SUN
2021-09-01 9 WED
2021-10-01 10 FRI
2021-11-01 11 MON
2021-12-01 12 WED
1# Here we downsample the index from monthly to quarterly (every three months), and sum the values
2# for all the months within a given quarter:
3#
4# 1  +  2 +  3 = 6
5# 4  +  5 +  6 = 15
6# 7  +  8 +  9 = 24
7# 10 + 11 + 12 = 33
8#
9dates_df.resample("Q").sum()

number
2021-03-31 6
2021-06-30 15
2021-09-30 24
2021-12-31 33
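Any aggregate works in place of sum(). A self-contained sketch taking the quarterly means instead:

```python
import pandas as pd

# Rebuild the monthly sample and take quarterly means
idx = pd.date_range('2021-01-01', periods=12, freq='MS')
df = pd.DataFrame({"number": list(range(1, 13))}, index=idx)
print(df.resample("Q").mean()["number"].tolist())  # [2.0, 5.0, 8.0, 11.0]
```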

Indexing and Slicing

1# Sample DataFrame
2dates_df

number weekday
2021-01-01 1 FRI
2021-02-01 2 MON
2021-03-01 3 MON
2021-04-01 4 THR
2021-05-01 5 SAT
2021-06-01 6 TUE
2021-07-01 7 THR
2021-08-01 8 SUN
2021-09-01 9 WED
2021-10-01 10 FRI
2021-11-01 11 MON
2021-12-01 12 WED
1# All dates 2021-02-*
2dates_df.loc["2021-02"]

number weekday
2021-02-01 2 MON
1# All dates between 2021-02-* and 2021-06-*
2dates_df.loc["2021-02":"2021-06"]

number weekday
2021-02-01 2 MON
2021-03-01 3 MON
2021-04-01 4 THR
2021-05-01 5 SAT
2021-06-01 6 TUE
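Partial-string indexing works on any object backed by a DatetimeIndex, not just DataFrames. A small sketch with a Series:

```python
import pandas as pd

s = pd.Series([1, 2, 3],
              index=pd.to_datetime(["2021-01-15", "2021-02-15", "2021-03-15"]))
print(s.loc["2021-02":].tolist())  # [2, 3]
```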

Before You Leave

🤘 Subscribe to my 100% spam-free newsletter!
