Pandas by Example


I find most Pandas examples to be either too abstract (e.g., DataFrames built from single letters, random numbers, etc.) or too heavyweight (they load humongous data sets from the likes of Kaggle). Furthermore, the data sets tend to change depending on the method or property at hand. I, instead, use the same small data set for nearly all of my examples.

This is not a ‘my tutorial is better than your tutorial’ affair. It is, rather, an alternative set of examples which may work out for some people.

Essentials

1# Module Import
2
3import numpy as np     
4import pandas as pd
3import re              # only used in a few examples

Series Introduction

Series are similar to NumPy arrays except that the index may be user-defined.

Declaration

1# Declaration without labels
2pd.Series(['A+','A','C'])
0    A+
1     A
2     C
dtype: object
1# Declaration with labels by passing an index list
2pd.Series(['A+','A','C'], index = ['Daisy','Taffy','Nina'])
Daisy    A+
Taffy     A
Nina      C
dtype: object
1# Declaration with labels using a dictionary
2pd.Series({ 'Daisy' : 'A+',
3            'Taffy' : 'A',
4            'Nina'  : 'C'})
Daisy    A+
Taffy     A
Nina      C
dtype: object

Querying (Keys and Values)

1# Keys and values
2s = pd.Series({ 'Daisy' : 'A+',
3                'Taffy' : 'A',
4                'Nina'  : 'C'})
5print("  keys: {}".format(s.keys()))
6print("values: {}".format(s.to_list()))
  keys: Index(['Daisy', 'Taffy', 'Nina'], dtype='object')
values: ['A+', 'A', 'C']

Querying (Select Elements by Index and Key)

1# Elements by Index and Key
2s = pd.Series({ 'Daisy' : 'A+',
3                'Taffy' : 'A',
4                'Nina'  : 'C'})
5
6last_element = len(s)-1 # 2
7print("By index: {}".format(s.iloc[last_element]))
8print("By key:   {}".format(s.loc['Nina']))
By index: C
By key:   C

Querying (Head and Tail)

1# Head and Tail 
2s = pd.Series(np.random.randint(1,91,500)) # 500 random numbers 1 <= n < 91
3
4# Bingo numbers
5# We use to_list() since the result is another Series object
6print("head: {}".format(s.head().to_list()))
7print("tail: {}".format(s.tail().to_list()))
head: [81, 16, 13, 78, 39]
tail: [66, 4, 78, 57, 6]

Updating Series (Modifying Elements)

1# Changing Elements
2s = pd.Series({ 'Daisy' : 'A+',
3                'Taffy' : 'A',
4                'Nina'  : 'C'})
5
6s['Nina'] = 'B'
7s
Daisy    A+
Taffy     A
Nina      B
dtype: object

Updating Series (Appending Series)

1# Appending series
2s1 = pd.Series([1,2,3], index=['One','Two','Three'])
3s2 = pd.Series([4,'III'], index=['Four','Three'])
4s3 = s1.append(s2) # s1 is not modified in place, unlike Python's lists!
5s3
One        1
Two        2
Three      3
Four       4
Three    III
dtype: object
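Series.append was deprecated in pandas 1.4 and removed in 2.0; pd.concat is the forward-compatible equivalent. A minimal sketch reproducing the example above:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['One', 'Two', 'Three'])
s2 = pd.Series([4, 'III'], index=['Four', 'Three'])

# pd.concat returns a new Series; s1 and s2 are left untouched
s3 = pd.concat([s1, s2])
print(s3.to_list())  # [1, 2, 3, 4, 'III']
```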
Updating Series (Broadcasting)

 1# Broadcasting (direct arithmetic)
 2s = pd.Series([1,2,3])
 3
 4# New Series
 5print("s * 2   = {}".format((s * 2).to_list()))
 6print("s - 1   = {}".format((s - 1).to_list()))
 7
 8# Modify in Place
 9s+=10
10print("After in place modification: {}".format(s.to_list()))
s * 2   = [2, 4, 6]
s - 1   = [0, 1, 2]
After in place modification: [11, 12, 13]

Series Aggregations and Computations

Unique values

1pd.Series([10,20,20,30]).unique()
array([10, 20, 30])

Sum

1pd.Series([1,2,3]).sum()
6
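Besides unique() and sum(), Series expose many other built-in reductions; a few common ones, sketched on the same tiny data:

```python
import pandas as pd

s = pd.Series([1, 2, 3])
print(s.mean())   # 2.0
print(s.min())    # 1
print(s.max())    # 3
print(s.count())  # 3 (counts non-null elements only)
```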

Series Gotchas

Indexing Using Square Brackets

1# Indexing without iloc using square brackets
2s = pd.Series({ 1 : 'First',
3                2 : 'Second'})
4try:
5    # This fails: with an integer index, 0 is looked up as a label (key), not a position
6    print(s[0])
7except KeyError:
8    # This works: iloc always indexes by position
9    print("This works: " + s.iloc[0])
This works: First

None vs NaN

1# None vs Not a Number (NaN)
2grades  = pd.Series(['A+','A',None])
3numbers = pd.Series([10,9,None])
4print ("In grades, the third element is:  {}".format(grades[2]))
5print ("In numbers, the third element is: {}".format(numbers[2]))
6
7# Testing for None doesn't work with equals; a dedicated function is required
8print ("numbers[2] == None:   {}".format(numbers[2] == None))
9print ("np.isnan(numbers[2]): {}".format(np.isnan(numbers[2])))
In grades, the third element is:  None
In numbers, the third element is: nan
numbers[2] == None:   False
np.isnan(numbers[2]): True
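pd.isna() (and its complement pd.notna()) tests for both None and NaN uniformly, avoiding the == None pitfall above; a small sketch:

```python
import pandas as pd

grades  = pd.Series(['A+', 'A', None])  # None stays None (object dtype)
numbers = pd.Series([10, 9, None])      # None becomes NaN (float dtype)

# pd.isna() handles both missing-value representations
print(pd.isna(grades[2]))   # True
print(pd.isna(numbers[2]))  # True
```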

Repeated Keys Are Possible

1# Repeated Keys
2# Series are not dictionaries nor arrays
3#    - Types may be mixed
4#    - Element labels/keys may be repeated
5pd.Series(["Austria","Germany","Italy"], index=["German","German","Italian"])
German     Austria
German     Germany
Italian      Italy
dtype: object
 1# Repeated Keys #2
 2# A location may return a single value or a Series!
 3s = pd.Series(["Austria","Germany","Italy"], index=["German","German","Italian"])
 4
 5# For Italian
 6italian = s.loc["Italian"]
 7print("* ", end='') 
 8print(italian, end='') 
 9print(type(italian))
10
11# For German
12print("* ", end='') 
13german = s.loc["German"]
14print(german.to_list(), end='')
15print(type(german))
* Italy<class 'str'>
* ['Austria', 'Germany']<class 'pandas.core.series.Series'>

DataFrames (Declaration)

DataFrames aggregate Series objects by treating Series as labeled rows.

1# Declaration using Series
2c1 = pd.Series(["French", "EUR",65],index=["Language","Currency","Population"])
3c2 = pd.Series(["English","GBP",65],index=["Language","Currency","Population"])
4c3 = pd.Series(["Italy",  "EUR",60],index=["Language","Currency","Population"])
5
6# The columns may be specified by adding 'name=Column' to each Series instead
7pd.DataFrame([c1,c2,c3],index=["France","UK","Italy"])

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# Declaration using Tuples
2countries = [("French","EUR",65),
3             ("English","GBP",65),
4             ("Italy","EUR",60)]
5               
6pd.DataFrame(countries,
7             columns=["Language","Currency","Population"],
8             index=["France","UK","Italy"])

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
 1# Declaration using a list of dictionaries
 2countries = [{"Language"   : "French",
 3              "Currency"   : "EUR",
 4              "Population" : 65},
 5             {"Language"   : "English",
 6              "Currency"   : "GBP",
 7              "Population" : 65},
 8             {"Language"   : "Italy",
 9              "Currency"   : "EUR",
10              "Population" : 60}]
11              
12df = pd.DataFrame(countries,index=["France","UK","Italy"])
13df

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60

Querying (Simple)

Rows and Columns

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# Rows and columns
2print("   rows: {}".format(df.index))
3print("columns: {}".format(df.columns))
   rows: Index(['France', 'UK', 'Italy'], dtype='object')
columns: Index(['Language', 'Currency', 'Population'], dtype='object')

Select Row By Index and Label

1# Sample DataFrame
2df

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# Select row using index
2df.iloc[0]
Language      French
Currency         EUR
Population        65
Name: France, dtype: object
1# Select row using label
2df.loc['France'] # Produces a Series object
Language      French
Currency         EUR
Population        65
Name: France, dtype: object
1# Repeated labels will produce a DataFrame rather than a Series object!
2# (DataFrame.append was removed in pandas 2.0; use pd.concat there)
3df2 = df.append(pd.Series(["Sicilian","EUR",60],
4                index=["Language","Currency","Population"],name="Italy"))
5df2.loc['Italy']

Language Currency Population
Italy Italy EUR 60
Italy Sicilian EUR 60

Select Column By Label

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# Select column using name
2df['Language']
France     French
UK        English
Italy       Italy
Name: Language, dtype: object
1# Select specific columns
2df[["Language","Currency"]]

Language Currency
France French EUR
UK English GBP
Italy Italy EUR

Select Element By Row and Column

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1df.loc["France","Currency"]
'EUR'
1# This can also be accomplished by first selecting the row and then the column
2df.loc["France"]["Currency"]
'EUR'
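For retrieving a single scalar, .at (label-based) and .iat (position-based) are faster alternatives to chained lookups; a minimal sketch rebuilding the sample DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"Language":   ["French", "English", "Italy"],
                   "Currency":   ["EUR", "GBP", "EUR"],
                   "Population": [65, 65, 60]},
                  index=["France", "UK", "Italy"])

print(df.at["France", "Currency"])  # 'EUR' (by row and column labels)
print(df.iat[0, 1])                 # 'EUR' (by row and column positions)
```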

Querying (Advanced)

Selecting Rows Using a Boolean Mask

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# France we keep, UK we leave out, Italy we keep
2mask = pd.Series([True, False, True], index=["France","UK","Italy"])
3df[mask]

Language Currency Population
France French EUR 65
Italy Italy EUR 60
1# Let's invert the mask so it becomes False, True, False
2df[mask.map(lambda x: not x)]

Language Currency Population
UK English GBP 65
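The inversion above can also be written with the ~ operator, which negates a boolean Series element-wise; a sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({"Language":   ["French", "English", "Italy"],
                   "Currency":   ["EUR", "GBP", "EUR"],
                   "Population": [65, 65, 60]},
                  index=["France", "UK", "Italy"])
mask = pd.Series([True, False, True], index=["France", "UK", "Italy"])

# ~mask is equivalent to mask.map(lambda x: not x)
print(df[~mask].index.to_list())  # ['UK']
```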

Creating Boolean Masks using Operators

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1print(df['Population'] >= 65)
France     True
UK         True
Italy     False
Name: Population, dtype: bool
1print(df['Currency'] == "EUR")
France     True
UK        False
Italy      True
Name: Currency, dtype: bool

Complex Queries: Boolean Logic plus Operators

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1(df['Population'] >= 65) & (df['Currency'] == "EUR")
France     True
UK        False
Italy     False
dtype: bool

Applying Boolean Masks to DataFrames

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# Prepare mask separately
2mask = (df['Population'] >= 65) & (df['Currency'] == "EUR")
3df[mask]

Language Currency Population
France French EUR 65
1# Nested
2df[(df['Population'] >= 65) & (df['Currency'] == "EUR")]

Language Currency Population
France French EUR 65

Boolean Masks and Nulls/NaNs

1# At the Series level
2pd.Series([1,2,3,None]).isnull()
0    False
1    False
2    False
3     True
dtype: bool
1# At the DataFrame level
2df2 = df.copy(deep=True)
3df2.loc["UK","Currency"] = None
4df2.isnull()

Language Currency Population
France False False False
UK False True False
Italy False False False

Preserving Non-Matches using Where()

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# Passing the Boolean Mask to where() does not drop the rows,
2# instead, they appear with their values as 'NaN'
3df.where((df['Population'] >= 65) & (df['Currency'] == "EUR"))

Language Currency Population
France French EUR 65.0
UK NaN NaN NaN
Italy NaN NaN NaN
1# The NaN rows can be dropped using .dropna()
2# This is the behaviour implemented by the direct df[mask] method
3df.where((df['Population'] >= 65) & (df['Currency'] == "EUR")).dropna()

Language Currency Population
France French EUR 65.0

Replacing Values including NaNs

Replacing NaNs/Nulls with Custom Value

1# Starting DataFrame
2df2 = df.copy(deep=True)
3df2.loc["France","Currency"] = None
4df2.loc["Italy","Currency"] = None
1df2.fillna("Bitcoin")

Language Currency Population
France French Bitcoin 65
UK English GBP 65
Italy Italy Bitcoin 60
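fillna() also accepts a dictionary mapping column names to replacement values, which limits the fill to specific columns; a sketch assuming a similarly modified DataFrame:

```python
import pandas as pd

df2 = pd.DataFrame({"Language":   ["French", "English", "Italy"],
                    "Currency":   [None, "GBP", None],
                    "Population": [65, 65, 60]},
                   index=["France", "UK", "Italy"])

# Only the 'Currency' column is filled; other columns are left untouched
filled = df2.fillna({"Currency": "Bitcoin"})
print(filled["Currency"].to_list())  # ['Bitcoin', 'GBP', 'Bitcoin']
```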

Back Fill and Forward Fill

1# Starting DataFrame
2df2 = df.copy(deep=True)
3df2.loc["UK","Population"] = None
4df2

Language Currency Population
France French EUR 65.0
UK English GBP NaN
Italy Italy EUR 60.0
1#       Value
2#         |
3#         |    Forward Fill     
4#         v
5#        NaN
6
7df2.fillna(method='ffill') # deprecated since pandas 2.1: prefer df2.ffill()

Language Currency Population
France French EUR 65.0
UK English GBP 65.0
Italy Italy EUR 60.0
1#        NaN
2#         ^
3#         |    Backward Fill     
4#         |
5#       Value
6
7df2.fillna(method='bfill') # deprecated since pandas 2.1: prefer df2.bfill()

Language Currency Population
France French EUR 65.0
UK English GBP 60.0
Italy Italy EUR 60.0

Value Replacement

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# Single parameter, e.g., replace("EUR","Euro") also available
2df.replace(["EUR" ,"GBP"],
3           ["Euro","British Pound Sterling"])

Language Currency Population
France French Euro 65
UK English British Pound Sterling 65
Italy Italy Euro 60

Value Replacement with Regexp

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1df.replace(to_replace="(ly$)", value="lia", regex=True)

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italia EUR 60

Extracting and Splitting Columns

Using Apply() to Modify and/or Create Extra Columns

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# Create extra column with the population 
2def pop_to_million(row):
3    row['Pop_total']=row['Population']*1000000
4    return row
5df.apply(pop_to_million, axis=1)

Language Currency Population Pop_total
France French EUR 65 65000000
UK English GBP 65 65000000
Italy Italy EUR 60 60000000

Using Extract() to Generate New Columns with a Regexp

1df2 = df.copy(deep=True)
2df2 

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# More groups will result in more columns!
2iso = df2["Language"].str.extract(r"(?P<ISO>^\w{2})").applymap(str.upper)
3iso

ISO
France FR
UK EN
Italy IT
1# Adding new column
2df2["LANG_ISO"]=iso
3df2

Language Currency Population LANG_ISO
France French EUR 65 FR
UK English GBP 65 EN
Italy Italy EUR 60 IT

Sorting

Sort by Index

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# ascending=True
2df.sort_index()

Language Currency Population
France French EUR 65
Italy Italy EUR 60
UK English GBP 65

Sort by Column

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# ascending=True
2df.sort_values("Language")

Language Currency Population
UK English GBP 65
France French EUR 65
Italy Italy EUR 60
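sort_values() also takes a list of columns, with a matching list for ascending to control each key's direction; a sketch on the sample data:

```python
import pandas as pd

df = pd.DataFrame({"Language":   ["French", "English", "Italy"],
                   "Currency":   ["EUR", "GBP", "EUR"],
                   "Population": [65, 65, 60]},
                  index=["France", "UK", "Italy"])

# Population descending; ties broken by Language ascending
result = df.sort_values(["Population", "Language"], ascending=[False, True])
print(result.index.to_list())  # ['UK', 'France', 'Italy']
```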

Transposing

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1df.T # After transposing

France UK Italy
Language French English Italy
Currency EUR GBP EUR
Population 65 65 60

Updating Data

Deleting Rows

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# Produces a new DataFrame, not in-place, unless inplace=True is added
2df.drop(["UK","France"])

Language Currency Population
Italy Italy EUR 60

Deleting Columns

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# Produces a new DataFrame, not in-place, unless inplace=True is added
2df.drop(["Currency","Population"],axis=1)

Language
France French
UK English
Italy Italy
1# Imperative alternative
2df2 = df.copy(deep=True)
3del df2["Population"]
4df2

Language Currency
France French EUR
UK English GBP
Italy Italy EUR

Updating Axes

Renaming Axes By Label

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# Both columns and rows may be renamed at once
2df.rename(columns={"Language" : "LAN", "Currency" : "CUR", "Population" : "POP"},
3          index={"France" : "FR", "Italy" : "IT"})

LAN CUR POP
FR French EUR 65
UK English GBP 65
IT Italy EUR 60

Replacing Axes Entirely

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1df2 = df.copy(deep=True)
2df2.columns = ["LAN","CUR","POP"]
3df2.index = ["FR","UK","IT"]
4df2

LAN CUR POP
FR French EUR 65
UK English GBP 65
IT Italy EUR 60

Applying a Transformation Function

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# Using mapper 
2from functools import partial
3
4f = lambda limit, label: label.upper()[:limit]
5df2=df.rename(mapper=partial(f,3), axis=1) # columns
6df3=df2.rename(mapper=partial(f,2), axis=0) # rows/index
7df3

LAN CUR POP
FR French EUR 65
UK English GBP 65
IT Italy EUR 60
1# Replacing columns and index outright
2df2 = df.copy(deep=True)
3df2.columns = map(partial(f,3),df.columns)
4df2.index = map(partial(f,3),df.index)
5df2

LAN CUR POP
FRA French EUR 65
UK English GBP 65
ITA Italy EUR 60

Advanced Indexing

Setting an Existing Column as the Index

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# Here we set 'Language' as the index but we lose the country name!
2df.set_index('Language')

Currency Population
Language
French EUR 65
English GBP 65
Italy EUR 60
1# A way to solve this is to save the index as a Series and then add it as a column
2countries = df.index
3df2 = df.set_index('Language')
4df2['Country'] = countries
5df2

Currency Population Country
Language
French EUR 65 France
English GBP 65 UK
Italy EUR 60 Italy

Promoting the Index to a Column

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# An automatic numerical index is created as a result
2df.reset_index()

index Language Currency Population
0 France French EUR 65
1 UK English GBP 65
2 Italy Italy EUR 60

Compound Index

1# We'll first start with a DataFrame as follows
2df2 = df.copy(deep=True)
3df2['Group'] = ["EU","Common Wealth","EU"]
4df2 = df2.reset_index().rename(columns={"index" : "Country"})
5df2

Country Language Currency Population Group
0 France French EUR 65 EU
1 UK English GBP 65 Common Wealth
2 Italy Italy EUR 60 EU
1# Here we select two columns as the index
2df3 = df2.set_index(['Group','Country'])
3df3

Language Currency Population
Group Country
EU France French EUR 65
Common Wealth UK English GBP 65
EU Italy Italy EUR 60
1# Equivalent to df3.loc[("EU")]
2df3.loc["EU"]

Language Currency Population
Country
France French EUR 65
Italy Italy EUR 60
1# Equivalent to df3.loc[("EU","France")]
2df3.loc["EU","France"]
Language      French
Currency         EUR
Population        65
Name: (EU, France), dtype: object
1# Selecting multiple rows
2df3.loc[[("EU","France"),("Common Wealth","UK")]]

Language Currency Population
Group Country
EU France French EUR 65
Common Wealth UK English GBP 65
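Beyond .loc with tuples, DataFrame.xs() selects a cross-section by the value of any index level, not just the outermost one; a sketch on the same compound index:

```python
import pandas as pd

df2 = pd.DataFrame({"Country":    ["France", "UK", "Italy"],
                    "Language":   ["French", "English", "Italy"],
                    "Currency":   ["EUR", "GBP", "EUR"],
                    "Population": [65, 65, 60],
                    "Group":      ["EU", "Common Wealth", "EU"]})
df3 = df2.set_index(["Group", "Country"])

# Select by the inner 'Country' level, regardless of the 'Group' value
print(df3.xs("Italy", level="Country"))
```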

Loading Data from CSV File

Sample file used for example

resources/countries_small.csv

1!cat resources/countries_small.csv
"country","population","gdp_in_trillions"
"China",1439323776,12.238
"India",1380004385,2.651
"USA",331002651,19.485
1# No arguments
2pd.read_csv("resources/countries_small.csv")

country population gdp_in_trillions
0 China 1439323776 12.238
1 India 1380004385 2.651
2 USA 331002651 19.485
1# Specifying index column
2pd.read_csv("resources/countries_small.csv", index_col=0)

population gdp_in_trillions
country
China 1439323776 12.238
India 1380004385 2.651
USA 331002651 19.485
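read_csv() accepts many more useful arguments, e.g., usecols and nrows. The sketch below recreates the sample file in memory via io.StringIO so it runs without the file on disk:

```python
import io
import pandas as pd

csv_data = ('"country","population","gdp_in_trillions"\n'
            '"China",1439323776,12.238\n'
            '"India",1380004385,2.651\n'
            '"USA",331002651,19.485\n')

# Read only two columns and the first two data rows
df_csv = pd.read_csv(io.StringIO(csv_data),
                     index_col="country",
                     usecols=["country", "population"],
                     nrows=2)
print(df_csv)
```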

Joining DataFrames

The following two DataFrames will be used throughout this section

 1bands_df = pd.DataFrame([{'name' : 'Iron Maiden', 'country' : 'UK'},
 2                         {'name' : 'Metallica'  , 'country' : 'USA'},
 3                         {'name' : 'AC/DC'      , 'country' : 'Australia'}])
 4albums_df = pd.DataFrame([{'name' : 'Iron Maiden', 'best_album' : 'Powerslave'},
 5                          {'name' : 'Metallica'  , 'best_album' : 'Ride the Lightning'},
 6                          {'name' : 'Helloween'  , 'best_album' : 'Walls of Jericho'}])
 7# merge() joins on the shared 'name' column by default;
 8# the join column may also be given explicitly via merge(on='name')

Outer Join

The outer join is a union; non-matching items are preserved

1# Reference Left and Right Tables
2display(bands_df)
3display(albums_df)
4# Outer Join
5pd.merge(bands_df, albums_df, how='outer')

name country
0 Iron Maiden UK
1 Metallica USA
2 AC/DC Australia

name best_album
0 Iron Maiden Powerslave
1 Metallica Ride the Lightning
2 Helloween Walls of Jericho

name country best_album
0 Iron Maiden UK Powerslave
1 Metallica USA Ride the Lightning
2 AC/DC Australia NaN
3 Helloween NaN Walls of Jericho

Inner Join

The inner join is an intersection; only matching items are preserved

1# Reference Left and Right Tables
2display(bands_df)
3display(albums_df)
4# Inner Join
5pd.merge(bands_df, albums_df, how='inner')

name country
0 Iron Maiden UK
1 Metallica USA
2 AC/DC Australia

name best_album
0 Iron Maiden Powerslave
1 Metallica Ride the Lightning
2 Helloween Walls of Jericho

name country best_album
0 Iron Maiden UK Powerslave
1 Metallica USA Ride the Lightning

Left Join

Include all items on the left table regardless of whether they have a match on the right

1# Reference Left and Right Tables
2display(bands_df)
3display(albums_df)
4# Left Join
5pd.merge(bands_df, albums_df, how='left')

name country
0 Iron Maiden UK
1 Metallica USA
2 AC/DC Australia

name best_album
0 Iron Maiden Powerslave
1 Metallica Ride the Lightning
2 Helloween Walls of Jericho

name country best_album
0 Iron Maiden UK Powerslave
1 Metallica USA Ride the Lightning
2 AC/DC Australia NaN

Right Join

Include all items on the right table regardless of whether they have a match on the left

1# Reference Left and Right Tables
2display(bands_df)
3display(albums_df)
4# Right Join
5pd.merge(bands_df, albums_df, how='right')

name country
0 Iron Maiden UK
1 Metallica USA
2 AC/DC Australia

name best_album
0 Iron Maiden Powerslave
1 Metallica Ride the Lightning
2 Helloween Walls of Jericho

name country best_album
0 Iron Maiden UK Powerslave
1 Metallica USA Ride the Lightning
2 Helloween NaN Walls of Jericho
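Passing indicator=True to merge() adds a _merge column recording each row's provenance (left_only, right_only, or both), which is handy for auditing joins; a sketch on the same tables:

```python
import pandas as pd

bands_df  = pd.DataFrame([{'name': 'Iron Maiden', 'country': 'UK'},
                          {'name': 'Metallica',   'country': 'USA'},
                          {'name': 'AC/DC',       'country': 'Australia'}])
albums_df = pd.DataFrame([{'name': 'Iron Maiden', 'best_album': 'Powerslave'},
                          {'name': 'Metallica',   'best_album': 'Ride the Lightning'},
                          {'name': 'Helloween',   'best_album': 'Walls of Jericho'}])

# Each row now reports which side(s) it came from
merged = pd.merge(bands_df, albums_df, how='outer', indicator=True)
print(merged[['name', '_merge']])
```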

DataFrame Concatenation

The following two DataFrames are used to illustrate concatenation:

1ernie_bands_df = pd.DataFrame([{'name' : 'Iron Maiden'   , 'country' : 'UK'       , 'genre' : 'metal'},
2                               {'name' : 'Metallica'     , 'country' : 'USA'      , 'genre' : 'metal'},
3                               {'name' : 'AC/DC'         , 'country' : 'Australia', 'genre' : 'metal'}])
4adri_bands_df  = pd.DataFrame([{'name' : 'Modern Talking', 'country' : 'Germany'},
5                               {'name' : 'Fancy'         , 'country' : 'Germany'},
6                               {'name' : 'Ramses B'      , 'country' : 'UK'}])

Ignoring Index

Setting ignore_index=True discards the original indices and generates a fresh numeric sequence for the combined result.

1pd.concat([ernie_bands_df,adri_bands_df],ignore_index=True)

name country genre
0 Iron Maiden UK metal
1 Metallica USA metal
2 AC/DC Australia metal
3 Modern Talking Germany NaN
4 Fancy Germany NaN
5 Ramses B UK NaN

Differentiating Between Items

It is possible to associate each item with its source DataFrame.

1pd.concat([ernie_bands_df,adri_bands_df],keys=['Ernie','Adri'],names=['List','Row Id'])

name country genre
List Row Id
Ernie 0 Iron Maiden UK metal
1 Metallica USA metal
2 AC/DC Australia metal
Adri 0 Modern Talking Germany NaN
1 Fancy Germany NaN
2 Ramses B UK NaN

DataFrame GroupBy

Simple Iteration

In its simplest form, DataFrame.groupby(COLUMN) provides an iterable of pairs in which the first component is the group name, and the second is the DataFrame corresponding to the respective group. COLUMN may also be passed as by=COLUMN.

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1for group, frame in df.groupby("Currency"):
2    print("--- Group {} ----------".format(group))
3    print(frame)
--- Group EUR ----------
       Language Currency  Population
France   French      EUR          65
Italy     Italy      EUR          60
--- Group GBP ----------
   Language Currency  Population
UK  English      GBP          65

Grouping Using a Custom Function

A function taking each index value, in the form DataFrame.groupby(lambda index_value: index_value), may be provided as an alternative to column name(s). This helps create custom grouping criteria rather than relying on existing columns. The function may also be passed using the by argument.

1# Let's use the numeric index. Here the function is neutral
2for group, frame in df.reset_index().groupby(lambda item: item):
3    print("--- Group {} ----------".format(group))
4    print(frame)
--- Group 0 ----------
    index Language Currency  Population
0  France   French      EUR          65
--- Group 1 ----------
  index Language Currency  Population
1    UK  English      GBP          65
--- Group 2 ----------
   index Language Currency  Population
2  Italy    Italy      EUR          60
1# Let's group items by even and odd index numbers
2for group, frame in df.reset_index().groupby(lambda item: 'Even' if item % 2 == 0 else 'Odd'):
3    print("--- Group {} ----------".format(group))
4    print(frame)
--- Group Even ----------
    index Language Currency  Population
0  France   French      EUR          65
2   Italy    Italy      EUR          60
--- Group Odd ----------
  index Language Currency  Population
1    UK  English      GBP          65

Multi-level Index

If the index is multi-level, we may want to specify the level we want to group by.

1# Sample DataFrame
2df2 = df.copy(deep=True).reset_index().rename(columns={"index":"Country"}).set_index(["Country","Currency"]) 
3df2

Language Population
Country Currency
France EUR French 65
UK GBP English 65
Italy EUR Italy 60
1# Use first index component
2for group, _ in df2.groupby(level=0):
3    print(group)
France
Italy
UK
1# Use second index component
2for group, _ in df2.groupby(level=1):
3    print(group)
EUR
GBP
1# Use both components
2for group, _ in df2.groupby(level=[0,1]):
3    print(group)
('France', 'EUR')
('Italy', 'EUR')
('UK', 'GBP')
1# Note that when passing functions, item becomes a tuple
2for group, _ in df2.groupby(lambda item: item[0] + "-" + item[1]):
3    print(group)
France-EUR
Italy-EUR
UK-GBP

Direct Aggregation

Rather than iterating through a Group object, aggregations may be performed directly upon grouped column name(s) using built-in aggregation functions such as min, max, mean, etc.

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1df.groupby("Currency").agg({"Population" : ["min","max","mean","median","sum"]})

Population
min max mean median sum
Currency
EUR 60 65 62.5 62.5 125
GBP 65 65 65.0 65.0 65
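The dictionary-of-lists form above yields a two-level column header; named aggregation produces flat, self-describing column names instead. A sketch on the sample data:

```python
import pandas as pd

df = pd.DataFrame({"Currency":   ["EUR", "GBP", "EUR"],
                   "Population": [65, 65, 60]},
                  index=["France", "UK", "Italy"])

# Each keyword becomes an output column: (source column, aggregation)
result = df.groupby("Currency").agg(pop_min=("Population", "min"),
                                    pop_max=("Population", "max"),
                                    pop_sum=("Population", "sum"))
print(result)
```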

Transform

The transform function applies an aggregation (or a user-supplied function) to the non-grouping columns of each group, broadcasting the result back so that the output aligns with the original rows.

1# Sample DataFrame
2df2 = df[["Currency","Population"]]
3df2

Currency Population
France EUR 65
UK GBP 65
Italy EUR 60
1# Using built-in functions
2df2.groupby("Currency").transform("sum")

Population
France 125
UK 65
Italy 125
1# Using user-defined functions
2df2.groupby("Currency").transform(lambda series: sum(series))

Population
France 125
UK 65
Italy 125

Filtering

Applying a filtering function to a Group object removes entire groups that fail the predicate. Please note that in this case the function receives a DataFrame (one per group) rather than a Series object.

1# Sample DataFrame
2df2 = df[["Currency","Population"]]
3df2

Currency Population
France EUR 65
UK GBP 65
Italy EUR 60
1# Remove countries whose combined population grouped by Currency is greater than 100
2df2.groupby("Currency").filter(lambda dataFrame: dataFrame['Population'].sum() > 100)

Currency Population
France EUR 65
Italy EUR 60

Creating New Columns With Apply

With DataFrame.groupby(COLUMN).apply(lambda dataFrame: dataFrame) it is possible to include the aggregated results in a new column.

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1def population_for_fx(dataFrame):
2    dataFrame["Pop_total_for_currency"] = dataFrame['Population'].sum()
3    return dataFrame
4
5df2 = df[["Currency","Population"]] # same subset as in the previous sections
6df2.groupby("Currency").apply(population_for_fx)

Currency Population Pop_total_for_currency
France EUR 65 125
UK GBP 65 65
Italy EUR 60 125

Scales

Categories

Categories allow defining custom scales as full data types so that they can be, for example, sorted and compared properly.

1# Sample DataFrame
2tshirts_df = pd.DataFrame([{ "type" : "unisex", "size" : "S"},
3                           { "type" : "unisex", "size" : "M"},
4                           { "type" : "unisex", "size" : "L"},
5                           { "type" : "kids",   "size" : "XS"},
6                           { "type" : "big",    "size" : "XL"},
7                           { "type" : "big",    "size" : "XXL"}])
8tshirts_df = tshirts_df.set_index("type")
9tshirts_df

size
type
unisex S
unisex M
unisex L
kids XS
big XL
big XXL
1# Sorting is merely lexicographic: Pandas doesn't know about the ordinal nature of t-shirt sizes
2tshirts_df.sort_values("size")

size
type
unisex L
unisex M
unisex S
big XL
kids XS
big XXL
1# The solution is to create a custom category
2tshirt_cat=pd.CategoricalDtype(categories=['XS','S','M','L','XL','XXL'],ordered=True)
3# We now apply the category
4sizes = tshirts_df["size"].astype(tshirt_cat)
5# Let us also create a new DataFrame which includes size using the correct column data type
6tshirts_df2 = tshirts_df.copy(deep=True)
7tshirts_df2["size"] = sizes
8tshirts_df2

size
type
unisex S
unisex M
unisex L
kids XS
big XL
big XXL
1# Now this works!
2tshirts_df2.sort_values("size")

size
type
kids XS
unisex S
unisex M
unisex L
big XL
big XXL
1# Comparisons also work
2tshirts_df2[tshirts_df2["size"] >= 'L']

size
type
unisex L
big XL
big XXL

Data Binning

Data Binning is about ‘bucketing’ data. In Pandas, this is equivalent to converting data typically in a ratio/interval scale to a category.

The pd.cut(series, number_of_bins) function is used for this purpose.

1# Let's split a number of temperature readings into three buckets
2temperatures = pd.Series([0,5,10,15,20,25,30,35,40])
3pd.cut(temperatures,3)
0     (-0.04, 13.333]
1     (-0.04, 13.333]
2     (-0.04, 13.333]
3    (13.333, 26.667]
4    (13.333, 26.667]
5    (13.333, 26.667]
6      (26.667, 40.0]
7      (26.667, 40.0]
8      (26.667, 40.0]
dtype: category
Categories (3, interval[float64, right]): [(-0.04, 13.333] < (13.333, 26.667] < (26.667, 40.0]]
1# Let's use labels
2pd.cut(temperatures,3,labels=["cold","medium","hot"])
0      cold
1      cold
2      cold
3    medium
4    medium
5    medium
6       hot
7       hot
8       hot
dtype: category
Categories (3, object): ['cold' < 'medium' < 'hot']
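Instead of a bin count, pd.cut() also accepts explicit bin edges, which keeps the bucket boundaries meaningful; a sketch using the same readings:

```python
import pandas as pd

temperatures = pd.Series([0, 5, 10, 15, 20, 25, 30, 35, 40])

# Bins are right-closed by default: (-1, 10], (10, 25], (25, 40]
binned = pd.cut(temperatures, bins=[-1, 10, 25, 40], labels=["cold", "medium", "hot"])
print(binned.to_list())
```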

Pivot Tables

Simple Pivot Table

A pivot table is similar to the chaining of df.groupby().agg() in that it allows running an aggregate function against a grouping criterion. The difference is that the grouped values appear as columns.

1df # Sample DataFrame

Language Currency Population
France French EUR 65
UK English GBP 65
Italy Italy EUR 60
1# Equivalent groupby()/agg() version
2df.groupby("Currency").agg({"Population" : ["mean","sum"]})

Population
mean sum
Currency
EUR 62.5 125
GBP 65.0 65
1# Pivot table version
2df.pivot_table(values='Population', columns='Currency', aggfunc=['mean','sum'])

mean sum
Currency EUR GBP EUR GBP
Population 62.5 65.0 125 65
1# Subtotals can be enabled by setting margins to True
2df.pivot_table(values='Population', columns='Currency', aggfunc=['mean','sum'],margins=True)

mean sum
Currency EUR All GBP All EUR All GBP All
Population 62.5 62.5 65.0 65.0 125 125 65 65

Querying Labels with Idxmax

While the likes of min() and max() provide the extreme numerical values in an array or Series, sometimes we need to find out the label associated with a minimum or maximum value. We use Series.idxmin() and Series.idxmax(), respectively, for this purpose.

1# Reference table
2pivot_df = df.pivot_table(values='Population', columns='Currency', aggfunc=['mean','sum'])
3pivot_df

mean sum
Currency EUR GBP EUR GBP
Population 62.5 65.0 125 65
1# Obtain the currency with the largest population
2pivot_df["sum"].loc["Population"].idxmax()
'EUR'
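idxmin() works the same way in the other direction. A small self-contained sketch (the totals mirror the pivoted sums above):

```python
import pandas as pd

# Label of the smallest value
totals = pd.Series({'EUR': 125, 'GBP': 65})
print(totals.idxmin())  # GBP
```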

Stacking and Unstacking

Stacking is the process of pivoting columns into (inner) index rows; unstacking reverses it, pivoting index levels back into columns.

1# Reference table
2pivot_df = df.pivot_table(values='Population', columns='Currency', aggfunc=['mean','sum'])
3pivot_df

            mean        sum
Currency     EUR   GBP  EUR  GBP
Population  62.5  65.0  125   65
1# Here EUR and GBP become labels
2pivot_df.stack()

                     mean  sum
           Currency
Population EUR       62.5  125
           GBP       65.0   65
1# And here we get things back to normal again
2pivot_df.stack().unstack()

            mean        sum
Currency     EUR   GBP  EUR  GBP
Population  62.5  65.0  125   65
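The round trip also works on a plain DataFrame. A minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["a", "b"])
stacked = df.stack()                 # columns x/y become the inner index level
print(stacked["a"]["x"])             # 1
print(stacked.unstack().equals(df))  # True
```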

Date/Time

Timestamp Declaration

1# Using ISO-ish text string
2pd.Timestamp('2021-12-22 22:05')
Timestamp('2021-12-22 22:05:00')
1# Using date/time components
2pd.Timestamp(2021, 12, 22, 22, 5)
Timestamp('2021-12-22 22:05:00')

Common Timestamp Methods and Properties

1# Individual components 
2ts = pd.Timestamp(2021, 12, 22, 22, 5)
3"{}-{:02d}-{:02d} {:02d}:{:02d}:{:02d}".format(ts.year,ts.month,ts.day,ts.hour,ts.minute,ts.second)
'2021-12-22 22:05:00'
1# Weekday (1 - Monday ... 7 - Sunday)
2ts.isoweekday()
3
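Related convenience methods spell components out; for instance, day_name():

```python
import pandas as pd

# Same timestamp as above; isoweekday() == 3 means Wednesday
ts = pd.Timestamp(2021, 12, 22, 22, 5)
print(ts.day_name())  # Wednesday
```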

Timestamp Arithmetic, Deltas and Offsets

1# Subtraction between two timestamps
2pd.Timestamp('2021-12-22 22:05')-pd.Timestamp('2021-11-15 22:05')
Timedelta('37 days 00:00:00')
1# Subtracting TimeDelta values
2pd.Timestamp('2021-12-22 22:05')-pd.Timedelta('37 days')
Timestamp('2021-11-15 22:05:00')
1# Subtracting an Offset value (see pd.offsets for more options)
2pd.Timestamp('2021-12-22 22:05')-pd.offsets.Week()
Timestamp('2021-12-15 22:05:00')
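Offsets can also roll a timestamp to a calendar boundary. A sketch with pd.offsets.MonthEnd():

```python
import pandas as pd

# Adding MonthEnd snaps forward to the last day of the month (time is kept)
print(pd.Timestamp('2021-12-22 22:05') + pd.offsets.MonthEnd())
```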

Period Arithmetic

The Period object overloads the arithmetic operators + and - to perform date/time-wise calculations.

1# Year wise 
2p = pd.Period("2021")
3print(p-1)
4print(p)
5print(p+1)
2020
2021
2022
1# Month wise 
2p = pd.Period("2021-12")
3print(p-1)
4print(p)
5print(p+1)
2021-11
2021-12
2022-01
1# Day wise 
2p = pd.Period("2021-12-31")
3print(p-1)
4print(p)
5print(p+1)
2021-12-30
2021-12-31
2022-01-01
1# Hour wise 
2p = pd.Period("2021-12-31 23")
3print(p-1)
4print(p)
5print(p+1)
2021-12-31 22:00
2021-12-31 23:00
2022-01-01 00:00
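Two Periods of the same frequency can also be subtracted from one another; in recent pandas versions this yields an offset. A short sketch:

```python
import pandas as pd

# Subtracting monthly Periods returns a month-based DateOffset; .n is the count
delta = pd.Period("2022-01") - pd.Period("2021-10")
print(delta.n)  # 3
```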

DatetimeIndex and PeriodIndex

A Date/Time value can be used as an index or value in Series and DataFrames.

1# DateTimeIndex
2s = pd.Series(['A','B','C'],index=[pd.Timestamp("2021-12-22"), 
3                                   pd.Timestamp("2021-12-23"), 
4                                   pd.Timestamp("2021-12-24")])
5print(s)
6print(type(s.index))
2021-12-22    A
2021-12-23    B
2021-12-24    C
dtype: object

<class 'pandas.core.indexes.datetimes.DatetimeIndex'>

1# PeriodIndex
2s = pd.Series(['A','B','C'],index=[pd.Period("2021-10"), 
3                                   pd.Period("2021-11"), 
4                                   pd.Period("2021-12")])
5print(s)
6print(type(s.index))
2021-10    A
2021-11    B
2021-12    C
Freq: M, dtype: object

<class 'pandas.core.indexes.period.PeriodIndex'>

Automatic Date Conversion

Pandas can figure out dates written in a variety of common formats.

 1s = pd.Series(["A","B","C","D","E"], 
 2              index=["5th of January, 2021",
 3                     "Feb 27, 2021",
 4                     "2021-01-02",
 5                     "02/01/21", # 2nd of January!
 6                     "03 Jan 21"])   
 7
 8# dayfirst=True is a must for non-US (i.e., European) date formats
 9s.index = pd.to_datetime(s.index,dayfirst=True)
10s
2021-01-05    A
2021-02-27    B
2021-01-02    C
2021-01-02    D
2021-01-03    E
dtype: object
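When the layout is known up front, an explicit format string removes the guesswork entirely. A sketch:

```python
import pandas as pd

# %d/%m/%y pins the day-first interpretation explicitly
print(pd.to_datetime("02/01/21", format="%d/%m/%y"))  # 2021-01-02 00:00:00
```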

Generating Date Ranges

Date ranges can be generated based on a starting date, a given number of periods, and a frequency (weekly, quarterly, etc.).

1# 12 months
2pd.date_range('2021-01-31', periods=12, freq='M')
DatetimeIndex(['2021-01-31', '2021-02-28', '2021-03-31', '2021-04-30',
               '2021-05-31', '2021-06-30', '2021-07-31', '2021-08-31',
               '2021-09-30', '2021-10-31', '2021-11-30', '2021-12-31'],
              dtype='datetime64[ns]', freq='M')
1# First 10 weeks, with the week starting on Monday rather than Sunday
2pd.date_range('2021-12-06', periods=10, freq='W-MON')
DatetimeIndex(['2021-12-06', '2021-12-13', '2021-12-20', '2021-12-27',
               '2022-01-03', '2022-01-10', '2022-01-17', '2022-01-24',
               '2022-01-31', '2022-02-07'],
              dtype='datetime64[ns]', freq='W-MON')
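A sibling helper, pd.period_range(), generates a PeriodIndex in the same fashion. A sketch:

```python
import pandas as pd

idx = pd.period_range('2021-01', periods=3, freq='M')
print([str(p) for p in idx])  # ['2021-01', '2021-02', '2021-03']
```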

Data Ranges in DataFrames

1# Sample DataFrame
2dates_index = pd.date_range('2021-01-01', periods=12, freq='MS') # Month Start
3dates_df    = pd.DataFrame({"number":list(range(1, 13))}, 
4                            index=dates_index)
5dates_df

number
2021-01-01 1
2021-02-01 2
2021-03-01 3
2021-04-01 4
2021-05-01 5
2021-06-01 6
2021-07-01 7
2021-08-01 8
2021-09-01 9
2021-10-01 10
2021-11-01 11
2021-12-01 12
1# Here we add an extra column with the weekday
2dates_df["weekday"] = pd.Series(list(map(lambda x: 
3                                {0 : "MON", 1 : "TUE", 2 : "WED", 3 : "THR", 4 : "FRI", 5 : "SAT", 6 : "SUN"}[x], 
4                                dates_df.index.weekday)),index=dates_index)
5dates_df

number weekday
2021-01-01 1 FRI
2021-02-01 2 MON
2021-03-01 3 MON
2021-04-01 4 THR
2021-05-01 5 SAT
2021-06-01 6 TUE
2021-07-01 7 THR
2021-08-01 8 SUN
2021-09-01 9 WED
2021-10-01 10 FRI
2021-11-01 11 MON
2021-12-01 12 WED

Downsampling

This is the process of reducing the frequency of a time series (for example, from daily to weekly, or from monthly to quarterly), aggregating the values for the intermediate dates (which are lost) in the process.

1# Sample DataFrame
2dates_df

number weekday
2021-01-01 1 FRI
2021-02-01 2 MON
2021-03-01 3 MON
2021-04-01 4 THR
2021-05-01 5 SAT
2021-06-01 6 TUE
2021-07-01 7 THR
2021-08-01 8 SUN
2021-09-01 9 WED
2021-10-01 10 FRI
2021-11-01 11 MON
2021-12-01 12 WED
1# Here we downsample the index from monthly to quarterly (every three months), and sum the values
2# for all the months within a given quarter:
3#
4# 1  +  2 +  3 = 6
5# 4  +  5 +  6 = 15
6# 7  +  8 +  9 = 24
7# 10 + 11 + 12 = 33
8#
9dates_df.resample("Q").sum()

number
2021-03-31 6
2021-06-30 15
2021-09-30 24
2021-12-31 33
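Any aggregate works in place of sum(). A self-contained sketch taking the quarterly means instead:

```python
import pandas as pd

# Rebuild the monthly sample and take quarterly means
idx = pd.date_range('2021-01-01', periods=12, freq='MS')
df = pd.DataFrame({"number": list(range(1, 13))}, index=idx)
print(df.resample("Q").mean()["number"].tolist())  # [2.0, 5.0, 8.0, 11.0]
```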

Indexing and Slicing

1# Sample DataFrame
2dates_df

number weekday
2021-01-01 1 FRI
2021-02-01 2 MON
2021-03-01 3 MON
2021-04-01 4 THR
2021-05-01 5 SAT
2021-06-01 6 TUE
2021-07-01 7 THR
2021-08-01 8 SUN
2021-09-01 9 WED
2021-10-01 10 FRI
2021-11-01 11 MON
2021-12-01 12 WED
1# All dates 2021-02-*
2dates_df.loc["2021-02"]

number weekday
2021-02-01 2 MON
1# All dates between 2021-02-* and 2021-06-*
2dates_df.loc["2021-02":"2021-06"]

number weekday
2021-02-01 2 MON
2021-03-01 3 MON
2021-04-01 4 THR
2021-05-01 5 SAT
2021-06-01 6 TUE
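Partial-string indexing works on any object backed by a DatetimeIndex, not just DataFrames. A small sketch with a Series:

```python
import pandas as pd

s = pd.Series([1, 2, 3],
              index=pd.to_datetime(["2021-01-15", "2021-02-15", "2021-03-15"]))
print(s.loc["2021-02":].tolist())  # [2, 3]
```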

Before You Leave

🤘 Subscribe to my 100% spam-free newsletter!
