I find most Pandas examples to be either too abstract (e.g., DataFrames are devised using single letters, random numbers, etc.) or the other way around (they load humongous data sets from the likes of Kaggle). Furthermore, the data sets tend to change depending on the method or property at hand. I, instead, use the same small data set for nearly all of my examples.
This is not a ‘my tutorial is better than your tutorial’ affair. It is, rather, an alternative set of examples which may work out for some people.
Essentials
1# Module Import23importnumpyasnp4importpandasaspd5importreasre# only used in a few examples
Series Introduction
Series are similar to Pandas arrays except that the index may be user-defined.
Declaration
1# Declaration without labels2pd.Series(['A+','A','C'])
0 A+
1 A
2 C
dtype: object
1# Declaration with labels by passing an index list2pd.Series(['A+','A','C'],index=['Daisy','Taffy','Nina'])
Daisy A+
Taffy A
Nina C
dtype: object
1# Declaration with labels using a dictionary2pd.Series({'Daisy':'A+',3'Taffy':'A',4'Nina':'C'})
Daisy A+
Taffy A
Nina C
dtype: object
Querying (Keys and Values)
1# Keys and values2s=pd.Series({'Daisy':'A+',3'Taffy':'A',4'Nina':'C'})5print(" keys: {}".format(s.keys()))6print("values: {}".format(s.to_list()))
1# Elements by Index and Key2s=pd.Series({'Daisy':'A+',3'Taffy':'A',4'Nina':'C'})56last_element=len(s)-1# 27print("By index: {}".format(s.iloc[last_element]))8print("By key: {}".format(s.loc['Nina']))
By index: C
By key: C
Querying (Head and Tail)
1# Head and Tail 2s=pd.Series(np.random.randint(1,91,500))# 500 random numbers 1 =< n < 9134# Bingo numbers5# We use to_list() since the result is another Series object6print("head: {}".format(s.head().to_list()))7print("tail: {}".format(s.tail().to_list()))
1# Appending series2s1=pd.Series([1,2,3],index=['One','Two','Three'])3s2=pd.Series([4,'III'],index=['Four','Three'])4s3=s1.append(s2)# s1 is not modified in place unlike Python's arrays!5s3
One 1
Two 2
Three 3
Four 4
Three III
dtype: object
1# Broadcasting (direct arithmetic) 2s=pd.Series([1,2,3]) 3 4# New Series 5print("s * 2 = {}".format((s*2).to_list())) 6print("s - 1 = {}".format((s-1).to_list())) 7 8# Modify in Place 9s+=1010print("After in place modification: {}".format(s.to_list()))
s * 2 = [2, 4, 6]
s - 1 = [0, 1, 2]
After in place modification: [11, 12, 13]
Series Aggregations and Computations
Unique values
1pd.Series([10,20,20,30]).unique()
array([10, 20, 30])
Sum
1pd.Series([1,2,3]).sum()
6
Series Gotchas
Indexing Using Square Brackets
1# Indexing without iloc using square brackets2s=pd.Series({1:'First',32:'Second'})4try:5# This fails because numbers are treated as keys6print(s[0])7except:8# This works because iloc will not cast the number incorrectly9print("This works: "+s.iloc[0])
This works: First
None vs NaN
1# None vs Not a Number (NaN)2grades=pd.Series(['A+','A',None])3numbers=pd.Series([10,9,None])4print("In grades, the third element is: {}".format(grades[2]))5print("In numbers, the third element is: {}".format(numbers[2]))67# Testing for None doesn't work with equals, but requires a specific function8print("numbers[3] == None: {}".format(numbers[2]==None))9print("np.isnan(numbers[3]): {}".format(np.isnan(numbers[2])))
In grades, the third element is: None
In numbers, the third element is: nan
numbers[3] == None: False
np.isnan(numbers[3]): True
Repeated Keys Are Possible
1# Repeated Keys2# Series are not dictionaries nor arrays3# - Types may be mixed4# - Element labels/keys may be repeated5pd.Series(["Austria","Germany","Italy"],index=["German","German","Italian"])
German Austria
German Germany
Italian Italy
dtype: object
1# Repeated Keys #2 2# A location may return a single value or a Series! 3s=pd.Series(["Austria","Germany","Italy"],index=["German","German","Italian"]) 4 5# For Italian 6italian=s.loc["Italian"] 7print("* ",end='') 8print(italian,end='') 9print(type(italian))1011# For German12print("* ",end='')13german=s.loc["German"]14print(german.to_list(),end='')15print(type(german))
DataFrames aggregate Series objects by treating Series as labeled rows.
1# Declaration using Series2c1=pd.Series(["French","EUR",65],index=["Language","Currency","Population"])3c2=pd.Series(["English","GBP",65],index=["Language","Currency","Population"])4c3=pd.Series(["Italy","EUR",60],index=["Language","Currency","Population"])56# The columns may be specified by adding 'name=Column' to each Series instead7pd.DataFrame([c1,c2,c3],index=["France","UK","Italy"])
Language
Currency
Population
France
French
EUR
65
UK
English
GBP
65
Italy
Italy
EUR
60
1# Declaration using Tuples2countries=[("French","EUR",65),3("English","GBP",65),4("Italy","EUR",60)]56pd.DataFrame(countries,7columns=["Language","Currency","Population"],8index=["France","UK","Italy"])
Language
Currency
Population
France
French
EUR
65
UK
English
GBP
65
Italy
Italy
EUR
60
1# Declaration using a list Dictionary values 2countries=[{"Language":"French", 3"Currency":"EUR", 4"Population":65}, 5{"Language":"English", 6"Currency":"GBP", 7"Population":65}, 8{"Language":"Italy", 9"Currency":"EUR",10"Population":60}]1112df=pd.DataFrame(countries,index=["France","UK","Italy"])13df
Language
Currency
Population
France
French
EUR
65
UK
English
GBP
65
Italy
Italy
EUR
60
Querying (Simple)
Rows and Columns
1df# Sample DataFrame
Language
Currency
Population
France
French
EUR
65
UK
English
GBP
65
Italy
Italy
EUR
60
1# Rows and columns2print(" rows: {}".format(df.index))3print("columns: {}".format(df.columns))
Language French
Currency EUR
Population 65
Name: France, dtype: object
1# Select row using label2df.loc['France']# Produces a Series object
Language French
Currency EUR
Population 65
Name: France, dtype: object
1# Repeated labels will produce a DataFrame rather than a Series object!2df2=df.append(pd.Series(["Sicilian","EUR",60],3index=["Language","Currency","Population"],name="Italy"))4df2.loc['Italy']
Language
Currency
Population
Italy
Italy
EUR
60
Italy
Sicilian
EUR
60
Select Column By Label
1df# Sample DataFrame
Language
Currency
Population
France
French
EUR
65
UK
English
GBP
65
Italy
Italy
EUR
60
1# Select column using name2df['Language']
France French
UK English
Italy Italy
Name: Language, dtype: object
1# Select specific columns2df[["Language","Currency"]]
Language
Currency
France
French
EUR
UK
English
GBP
Italy
Italy
EUR
Select Element By Row and Column
1df# Sample DataFrame
Language
Currency
Population
France
French
EUR
65
UK
English
GBP
65
Italy
Italy
EUR
60
1df.loc["France","Currency"]
'EUR'
1# This can also be accomplished by first selecting the row and then the column2df.loc["France"]["Currency"]
'EUR'
Querying (Advanced)
Selecting Rows Using a Boolean Mask
1df# Sample DataFrame
Language
Currency
Population
France
French
EUR
65
UK
English
GBP
65
Italy
Italy
EUR
60
1# France we keep, UK we leave out, Italy we keep2mask=pd.Series([True,False,True],index=["France","UK","Italy"])3df[mask]
Language
Currency
Population
France
French
EUR
65
Italy
Italy
EUR
60
1# Let's invert the mask so it becomes False, True, False2df[mask.map(lambdax:notx)]
Language
Currency
Population
UK
English
GBP
65
Creating Boolean Masks using Operators
1df# Sample DataFrame
Language
Currency
Population
France
French
EUR
65
UK
English
GBP
65
Italy
Italy
EUR
60
1print(df['Population']>=65)
France True
UK True
Italy False
Name: Population, dtype: bool
1print(df['Currency']=="EUR")
France True
UK False
Italy True
Name: Currency, dtype: bool
1# At the Series level2pd.Series([1,2,3,None]).isnull()
0 False
1 False
2 False
3 True
dtype: bool
1# At the DataFrame level2df2=df.copy(deep=True)3df2.loc["UK","Currency"]=None4df2.isnull()
Language
Currency
Population
France
False
False
False
UK
False
True
False
Italy
False
False
False
Preserving Non-Matches using Where()
1df# Sample DataFrame
Language
Currency
Population
France
French
EUR
65
UK
English
GBP
65
Italy
Italy
EUR
60
1# Passing the Boolean Mask to where() does not drop the rows,2# instead, they appear with their values as 'NaN'3df.where((df['Population']>=65)&(df['Currency']=="EUR"))
Language
Currency
Population
France
French
EUR
65.0
UK
NaN
NaN
NaN
Italy
NaN
NaN
NaN
1# The NaN rows can be dropped using .dropna()2# This is the behaviour implemented by the direct df[mask] method3df.where((df['Population']>=65)&(df['Currency']=="EUR")).dropna()
Using Apply() to Modify and/or Create Extra Columns
1df# Sample DataFrame
Language
Currency
Population
France
French
EUR
65
UK
English
GBP
65
Italy
Italy
EUR
60
1# Create extra column with the population 2defpop_to_million(row):3row['Pop_total']=row['Population']*10000004returnrow5df.apply(pop_to_million,axis=1)
Language
Currency
Population
Pop_total
France
French
EUR
65
65000000
UK
English
GBP
65
65000000
Italy
Italy
EUR
60
60000000
Using Extract() to Generate New Columns with a Regexp
1df2=df.copy(deep=True)2df2
Language
Currency
Population
France
French
EUR
65
UK
English
GBP
65
Italy
Italy
EUR
60
1# More groups will result in more columns!2iso=df2["Language"].str.extract("(?P<ISO>^\w{2})").applymap(str.upper)3iso
ISO
France
FR
UK
EN
Italy
IT
1# Adding new column
1df2["LANG_ISO"]=iso2df2
Language
Currency
Population
LANG_ISO
France
French
EUR
65
FR
UK
English
GBP
65
EN
Italy
Italy
EUR
60
IT
Sorting
Sort by Index
1df# Sample DataFrame
Language
Currency
Population
France
French
EUR
65
UK
English
GBP
65
Italy
Italy
EUR
60
1# ascending=True2df.sort_index()
Language
Currency
Population
France
French
EUR
65
Italy
Italy
EUR
60
UK
English
GBP
65
Sort by Column
1df# Sample DataFrame
Language
Currency
Population
France
French
EUR
65
UK
English
GBP
65
Italy
Italy
EUR
60
1# ascending=True2df.sort_values("Language")
Language
Currency
Population
UK
English
GBP
65
France
French
EUR
65
Italy
Italy
EUR
60
Transposing
1df# Sample DataFrame
Language
Currency
Population
France
French
EUR
65
UK
English
GBP
65
Italy
Italy
EUR
60
1df.T# After transposing
France
UK
Italy
Language
French
English
Italy
Currency
EUR
GBP
EUR
Population
65
65
60
Updating Data
Deleting Rows
1df# Sample DataFrame
Language
Currency
Population
France
French
EUR
65
UK
English
GBP
65
Italy
Italy
EUR
60
1# Produces a new DataFrame, not in-place, unless inplace=True is added2df.drop(["UK","France"])
Language
Currency
Population
Italy
Italy
EUR
60
Deleting Columns
1df# Sample DataFrame
Language
Currency
Population
France
French
EUR
65
UK
English
GBP
65
Italy
Italy
EUR
60
1# Produces a new DataFrame, not in-place, unless inplace=True is added2df.drop(["Currency","Population"],axis=1)
Language
France
French
UK
English
Italy
Italy
1# Imperative alternative
1df2=df.copy(deep=True)2deldf2["Population"]3df2
Language
Currency
France
French
EUR
UK
English
GBP
Italy
Italy
EUR
Updating Axes
Renaming Axes By Label
1df# Sample DataFrame
Language
Currency
Population
France
French
EUR
65
UK
English
GBP
65
Italy
Italy
EUR
60
1# Both columns and rows may be renamed at once2df.rename(columns={"Language":"LAN","Currency":"CUR","Population":"POP"},3index={"France":"FR","Italy":"IT"})
1# Using mapper 2fromfunctoolsimportpartial34f=lambdalimit,label:label.upper()[:limit]5df2=df.rename(mapper=partial(f,3),axis=1)# columns6df3=df2.rename(mapper=partial(f,2),axis=0)# rows/index7df3
LAN
CUR
POP
FR
French
EUR
65
UK
English
GBP
65
IT
Italy
EUR
60
1# Replacing columns and index outright2df2=df.copy(deep=True)3df2.columns=map(partial(f,3),df.columns)4df2.index=map(partial(f,3),df.index)5df2
LAN
CUR
POP
FRA
French
EUR
65
UK
English
GBP
65
ITA
Italy
EUR
60
Advanced Indexing
Setting an Existing Column as the Index
1df# Sample DataFrame
Language
Currency
Population
France
French
EUR
65
UK
English
GBP
65
Italy
Italy
EUR
60
1# Here we set 'Language' as the index but we lose the country name!2df.set_index('Language')
Currency
Population
Language
French
EUR
65
English
GBP
65
Italy
EUR
60
1# A way to solve this is to save the index as a Series and then add it as a column2countries=df.index3df2=df.set_index('Language')4df2['Country']=countries5df2
Currency
Population
Country
Language
French
EUR
65
France
English
GBP
65
UK
Italy
EUR
60
Italy
Promoting the Index to a Column
1df# Sample DataFrame
Language
Currency
Population
France
French
EUR
65
UK
English
GBP
65
Italy
Italy
EUR
60
1# An automatic numerical index is created as a result2df.reset_index()
index
Language
Currency
Population
0
France
French
EUR
65
1
UK
English
GBP
65
2
Italy
Italy
EUR
60
Compound Index
1# We'll first start with a DataFrame as follows2df2=df.copy(deep=True)3df2['Group']=["EU","Common Wealth","EU"]4df2=df2.reset_index().rename(columns={"index":"Country"})5df2
Country
Language
Currency
Population
Group
0
France
French
EUR
65
EU
1
UK
English
GBP
65
Common Wealth
2
Italy
Italy
EUR
60
EU
1# Here we select two columns as the index2df3=df2.set_index(['Group','Country'])3df3
Language
Currency
Population
Group
Country
EU
France
French
EUR
65
Common Wealth
UK
English
GBP
65
EU
Italy
Italy
EUR
60
1# Equivalent to df3.loc[("EU")]2df3.loc["EU"]
Language
Currency
Population
Country
France
French
EUR
65
Italy
Italy
EUR
60
1# Equivalent to df3.loc[("EU","France")]2df3.loc["EU","France"]
Language French
Currency EUR
Population 65
Name: (EU, France), dtype: object
1# No arguments2pd.read_csv("resources/countries_small.csv")
country
population
gdp_in_trillions
0
China
1439323776
12.238
1
India
1380004385
2.651
2
USA
331002651
19.485
1# Specifying index column2pd.read_csv("resources/countries_small.csv",index_col=0)
population
gdp_in_trillions
country
China
1439323776
12.238
India
1380004385
2.651
USA
331002651
19.485
1df2=df.copy(deep=True)2df23df2.drop('UK',axis=0)
Language
Currency
Population
France
French
EUR
65
Italy
Italy
EUR
60
1df
Language
Currency
Population
France
French
EUR
65
UK
English
GBP
65
Italy
Italy
EUR
60
12df2=df.reset_index()3df24df2['index']
0 France
1 UK
2 Italy
Name: index, dtype: object
Joining DataFrames
The following two DataFrames will be used throughout this section
1bands_df=pd.DataFrame([{'name':'Iron Maiden','country':'UK'}, 2{'name':'Metallica','country':'USA'}, 3{'name':'AC/DC','country':'Australia'}]) 4albums_df=pd.DataFrame([{'name':'Iron Maiden','best_album':'Powerslave'}, 5{'name':'Metallica','best_album':'Ride the Lightning'}, 6{'name':'Helloween','best_album':'Walls of Jericho'}]) 7# The index may be specified on demand using the merge(on='') argument. 8bands_df.set_index('name') 9albums_df.set_index('name')10pass
Outer Join
The outer join is a union; non-matching items are preserved
1# Reference Left and Right Tables2display(bands_df)3display(albums_df)4# Outer Join5pd.merge(bands_df,albums_df,how='outer')
name
country
0
Iron Maiden
UK
1
Metallica
USA
2
AC/DC
Australia
name
best_album
0
Iron Maiden
Powerslave
1
Metallica
Ride the Lightning
2
Helloween
Walls of Jericho
name
country
best_album
0
Iron Maiden
UK
Powerslave
1
Metallica
USA
Ride the Lightning
2
AC/DC
Australia
NaN
3
Helloween
NaN
Walls of Jericho
Inner Join
The outer join is an intersection; only matching attributes are preserved
1# Reference Left and Right Tables2display(bands_df)3display(albums_df)4# Inner Join5pd.merge(bands_df,albums_df,how='inner')
name
country
0
Iron Maiden
UK
1
Metallica
USA
2
AC/DC
Australia
name
best_album
0
Iron Maiden
Powerslave
1
Metallica
Ride the Lightning
2
Helloween
Walls of Jericho
name
country
best_album
0
Iron Maiden
UK
Powerslave
1
Metallica
USA
Ride the Lightning
Left Join
Include all items on the left table regardless of whether they have a match on the right
1# Reference Left and Right Tables2display(bands_df)3display(albums_df)4# Left Join5pd.merge(bands_df,albums_df,how='left')
name
country
0
Iron Maiden
UK
1
Metallica
USA
2
AC/DC
Australia
name
best_album
0
Iron Maiden
Powerslave
1
Metallica
Ride the Lightning
2
Helloween
Walls of Jericho
name
country
best_album
0
Iron Maiden
UK
Powerslave
1
Metallica
USA
Ride the Lightning
2
AC/DC
Australia
NaN
Right Join
Include all items on the right table regardless of whether they have a match on the left
1# Reference Left and Right Tables2display(bands_df)3display(albums_df)4# Right Join5pd.merge(bands_df,albums_df,how='right')
name
country
0
Iron Maiden
UK
1
Metallica
USA
2
AC/DC
Australia
name
best_album
0
Iron Maiden
Powerslave
1
Metallica
Ride the Lightning
2
Helloween
Walls of Jericho
name
country
best_album
0
Iron Maiden
UK
Powerslave
1
Metallica
USA
Ride the Lightning
2
Helloween
NaN
Walls of Jericho
DataFrame Concatenation
The following two DataFrames are used to illustrate concatenation:
In its simplest form, DataFrame.groupby(COLUMN) provides as an iterable object in which the first component is the group name, and the second is the DataFrame which corresponds to the respective group. COLUMN may also be passed as by=COLUMN.
1df# Sample DataFrame
Language
Currency
Population
France
French
EUR
65
UK
English
GBP
65
Italy
Italy
EUR
60
1forgroup,frameindf.groupby("Currency"):2print("--- Group {} ----------".format(group))3print(frame)
--- Group EUR ----------
Language Currency Population
France French EUR 65
Italy Italy EUR 60
--- Group GBP ----------
Language Currency Population
UK English GBP 65
Grouping Using a Custom Function
A function in the form DataFrame.groupby(lambda index_value: index_value) may be provided as an argument, as an alternative to column name(s). This helps creating a custom grouping criteria rather than using existing column name(s). The function may also be passed using the by argument.
1# Let's use the numeric index. Here the function is neutral2forgroup,frameindf.reset_index().groupby(lambdaitem:item):3print("--- Group {} ----------".format(group))4print(frame)
--- Group 0 ----------
index Language Currency Population
0 France French EUR 65
--- Group 1 ----------
index Language Currency Population
1 UK English GBP 65
--- Group 2 ----------
index Language Currency Population
2 Italy Italy EUR 60
1# Let's group items by even and odd index numers2forgroup,frameindf.reset_index().groupby(lambdaitem:'Even'ifitem%2==0else'Odd'):3print("--- Group {} ----------".format(group))4print(frame)
--- Group Even ----------
index Language Currency Population
0 France French EUR 65
2 Italy Italy EUR 60
--- Group Odd ----------
index Language Currency Population
1 UK English GBP 65
Multi-level Index
If the index is multi-level, we may want to specify the level we want to group by.
1# Use first index component2forgroup,_indf2.groupby(level=0):3print(group)
France
Italy
UK
1# Use second index component2forgroup,_indf2.groupby(level=1):3print(group)
EUR
GBP
1# Use both components2forgroup,_indf2.groupby(level=[0,1]):3print(group)
('France', 'EUR')
('Italy', 'EUR')
('UK', 'GBP')
1# Note that when passing functions, item becomes a tuple2forgroup,_indf2.groupby(lambdaitem:item[0]+"-"+item[1]):3print(group)
France-EUR
Italy-EUR
UK-GBP
Direct Aggregation
Rather than iterating through a Group object, aggregations may be performed directly upon grouped column name(s) using built-in aggregation functions such as min, max, mean, etc.
1# Using built-in functions2df2.groupby("Currency").transform("sum")
Population
France
125
UK
65
Italy
125
1# Using user-defined functions2df2.groupby("Currency").transform(lambdaseries:sum(series))
Population
France
125
UK
65
Italy
125
Filtering
Applying a filtering function to a Group object helps remove undesired items. Please note that in this case the function receives a DataFrame rather than a Series object.
1# Remove countries whose combined population grouped by Currency is greater than 1002df2.groupby("Currency").filter(lambdadataFrame:dataFrame['Population'].sum()>100)
Currency
Population
France
EUR
65
Italy
EUR
60
Creating New Columns With Apply
With DataFrame.groupby(COLUMN).apply(lambda dataFrame: dataFrame) it is possible to include the aggregated results in a new column.
1# Sort doesn't work because Pandas doesn't know about the ordinal nature of t-shirt sizes2tshirts_df.sort_values("size")
size
type
unisex
L
unisex
M
unisex
S
big
XL
kids
XS
big
XXL
1# The solution is to create a custom category2tshirt_cat=pd.CategoricalDtype(categories=['XS','S','M','L','XL','XXL'],ordered=True)3# We now apply the category4sizes=tshirts_df["size"].astype(tshirt_cat)5# Let us also create a new DataFrame which includes size using the correct column data type6tshirts_df2=tshirts_df.copy(deep=True)7tshirts_df2["size"]=sizes8tshirts_df2
size
type
unisex
S
unisex
M
unisex
L
kids
XS
big
XL
big
XXL
1# Now this works!2tshirts_df2.sort_values("size")
size
type
kids
XS
unisex
S
unisex
M
unisex
L
big
XL
big
XXL
1# Comparisons also work2tshirts_df2[tshirts_df2["size"]>='L']
size
type
unisex
L
big
XL
big
XXL
Data Binning
Data Binning is about ‘bucketing’ data. In Pandas, this is equivalent to converting data typically in a ratio/interval scale to a category.
The pd.cut(DataFrame,number_of_bins) function is used for this purpose.
1# Let's split a number of temperature readings into three buckets2temperatures=pd.Series([0,5,10,15,20,25,30,35,40])3pd.cut(temperatures,3)
1# Let's use labels2pd.cut(temperatures,3,labels=["cold","medium","hot"])
0 cold
1 cold
2 cold
3 medium
4 medium
5 medium
6 hot
7 hot
8 hot
dtype: category
Categories (3, object): ['cold' < 'medium' < 'hot']
Pivot Tables
Simple Pivot Table
A pivot table is similar to the chaining of df.groupby().agg() in that it allows running an aggregate function against a groupby criteria. The difference is that the grouped values appear as columns.
1# Subtotals can be enabled by setting margins to True2df.pivot_table(values='Population',columns='Currency',aggfunc=['mean','sum'],margins=True)
mean
sum
Currency
EUR
All
GBP
All
EUR
All
GBP
All
Population
62.5
62.5
65.0
65.0
125
125
65
65
Querying Labels with Idmax
While the likes of min() and max() provide the maximum numerical value in an array or Series, sometimes we need to find out the label associated with a maximum value. We use Series.idxmin() and Series.idxmax(), respectively, for this purpose.
2021-10 A
2021-11 B
2021-12 C
Freq: M, dtype: object
<class ‘pandas.core.indexes.period.PeriodIndex’>
Automatic Date Conversion
Pandas can figure out dates based on various common formats
1d1=['2 June 2013','Aug 29, 2014','2015-06-26','7/12/16'] 2 3s=pd.Series(["A","B","C","D","E"], 4index=["5th of January, 2021", 5"Feb 27, 2021", 6"2021-01-02", 7"02/01/21",# 2nd of January! 8"03 Jan 21"]) 910# dayfirst=true is a must for non-US (i.e., European) date formats11s.index=pd.to_datetime(s.index,dayfirst=True)12s
2021-01-05 A
2021-02-27 B
2021-01-02 C
2021-01-02 D
2021-01-03 E
dtype: object
Generating Date Ranges
Date ranges can be generated based on a starting date, a given number of periods, and a frequency (weekly, quarterly, etc.).
1# Here we add an extra column with the weekday2dates_df["weekday"]=pd.Series(list(map(lambdax:3{0:"MON",1:"TUE",2:"WED",3:"THR",4:"FRI",5:"SAT",6:"SUN"}[x],4dates_df.index.weekday)),index=dates_index)5dates_df
number
weekday
2021-01-01
1
FRI
2021-02-01
2
MON
2021-03-01
3
MON
2021-04-01
4
THR
2021-05-01
5
SAT
2021-06-01
6
TUE
2021-07-01
7
THR
2021-08-01
8
SUN
2021-09-01
9
WED
2021-10-01
10
FRI
2021-11-01
11
MON
2021-12-01
12
WED
Downsampling
This is the process of reducing the number of dates (for example, from daily to weekly), aggregating the values for the intermediate dates (which will be lost) in the process.
1# Sample DataFrame2dates_df
number
weekday
2021-01-01
1
FRI
2021-02-01
2
MON
2021-03-01
3
MON
2021-04-01
4
THR
2021-05-01
5
SAT
2021-06-01
6
TUE
2021-07-01
7
THR
2021-08-01
8
SUN
2021-09-01
9
WED
2021-10-01
10
FRI
2021-11-01
11
MON
2021-12-01
12
WED
1# Here we downsample the index from montlhy to quarterly (every three months), and sum the values2# for all the months within a given quarter:3#4# 1 + 2 + 3 = 65# 4 + 5 + 6 = 156# 7 + 8 + 9 = 247# 10 + 11 + 12 = 338#9dates_df.resample("Q").sum()
number
2021-03-31
6
2021-06-30
15
2021-09-30
24
2021-12-31
33
Indexing and Slicing
1# Sample DataFrame2dates_df
number
weekday
2021-01-01
1
FRI
2021-02-01
2
MON
2021-03-01
3
MON
2021-04-01
4
THR
2021-05-01
5
SAT
2021-06-01
6
TUE
2021-07-01
7
THR
2021-08-01
8
SUN
2021-09-01
9
WED
2021-10-01
10
FRI
2021-11-01
11
MON
2021-12-01
12
WED
1# All dates 2021-02-*2dates_df.loc["2021-02"]
number
weekday
2021-02-01
2
MON
1## All dates between 2021-02-* and 2021-06-*2dates_df.loc["2021-02":"2021-06"]