Introduction
Pandas is an extremely popular data manipulation and analysis library. It's the go-to tool for loading in and analyzing datasets for many.
Correctly sorting data is a crucial element of many tasks regarding data analysis. In this tutorial, we'll take a look at how to sort a Pandas DataFrame by date.
Let's start off with making a simple DataFrame
with a few dates:
import pandas as pd
data = {'Name':["John", "Paul", "Dhilan", "Bob", "Henry"],
'Date of Birth': ["01/06/86", "05/10/77", "11/12/88", "25/12/82", "01/06/86"]}
df = pd.DataFrame(data)
print(df)
By default our output is sorted by the DataFrame
s index:
Name Date of Birth
0 John 01/06/86
1 Paul 05/10/77
2 Dhilan 11/12/88
3 Bob 25/12/82
4 Henry 01/06/86
The eagle-eyed may notice that John and Paul have the same date of birth - this is on-purpose as we'll see in a moment.
Convert Strings to Datetime in Pandas DataFrame
We have input Date of Birth
in date format and it appears to be formatted as such. However, the first thing we need to do is ensure Pandas recognizes and understands that this date is in fact a date.
The way Pandas stores and manipulates data in a DataFrame
is determined by its data type.
The data type of each value is assigned automatically, based on what it looks like. 60
will be assigned an integer type, while John
will be assigned a string type. Let's check the current data type of each column:
print(df.dtypes)
This gives us our list of data types:
Name object
Date of Birth object
dtype: object
We can see our Date of Birth
column has been assigned a basic string object type by default. However, in order to correctly sort, analyze or manipulate our dates correctly we need Pandas to recognize this column contains dates.
Let's explicitly change the data type in our Date of Birth
column from an object
type to a datetime
type.
The easiest way to do this is to use the to_datetime()
function:
df["Date of Birth"] = pd.to_datetime(df["Date of Birth"])
print(df.dtypes)
Now, if we check our output:
Name object
Date of Birth datetime64[ns]
dtype: object
So, we can see we have successfully changed our data type to datetime
.
Alternatively, we can manually specify the data type of our column, provided of course we know what data type we want it to be:
df["Date of Birth"] = df["Date of Birth"].astype('datetime64[ns]')
Output:
Name object
Date of Birth datetime64[ns]
dtype: object
Whilst both of these methods produce the same result, the to_datetime()
method is preferred as it was explicitly designed for this purpose.
Sorting a DataFrame by Date in Pandas
Now that Pandas correctly recognizes our data types, let's sort the DataFrame
.
Note: All of the methods we'll use don't sort in-place, so you'll have to either reassign the changed DataFrame
to a new (or the same) reference variable to persist the change - or you can use the inplace
argument to change the default behavior.
Sort by Single Date Column in Ascending Order
The sort_values()
method will, by default, sort data in ascending order. For dates this would mean the first or earliest in order will appear at the top of the list:
df.sort_values(by='Date of Birth', inplace=True)
print(df)
Check out our hands-on, practical guide to learning Git, with best-practices, industry-accepted standards, and included cheat sheet. Stop Googling Git commands and actually learn it!
Running this code results in:
Name Date of Birth
1 Paul 1977-05-10
3 Bob 1982-12-25
0 John 1986-01-06
4 Henry 1986-01-06
2 Dhilan 1988-11-12
Alternatively, if you don't want to use the inplace
argument, you can simply re-assign the returned DataFrame
from the sort_values()
method to df
(or any other reference variable:
df = df.sort_values(by='Date of Birth')
As we gave John and Henry the same birthday, the order is based on their corresponding index number.
Sort by Single Date Column in Descending Order
Changing our order of sort to descending can be done by setting the ascending
argument to False
when calling the sort_values()
function:
df.sort_values(by='Date of Birth', ascending = False, inplace=True)
This time we get our data sorted in descending order, meaning the last or most recent will appear at the top of our list. Again as John and Henry have the same birthday their order is based on their index number:
Name Date of Birth
2 Dhilan 1988-11-12
0 John 1986-01-06
4 Henry 1986-01-06
3 Bob 1982-12-25
1 Paul 1977-05-10
Sort by Multiple Date Columns
So, what happens if we have multiple date columns that we want to sort by?
Let’s add another date-related column to our DataFrame
and make sure both our data types are correctly assigned:
# Values for the new column
employment_start = ["22/05/16", "17/08/10", "22/05/16", "11/06/19", "16/06/05"]
# Adding columns to DataFrame
df['Employment Start'] = employment_start
# Applying to_datetime() function to multiple columns at once
df[['Date of Birth', 'Employment Start']] = df[['Date of Birth', 'Employment Start']].apply(pd.to_datetime)
print(df.dtypes)
print(df)
Now, let's check if things look good:
Name object
Date of Birth datetime64[ns]
Employment Start datetime64[ns]
dtype: object
Name Date of Birth Employment Start
0 John 1986-01-06 2016-05-22
1 Paul 1977-05-10 2010-08-17
2 Dhilan 1988-11-12 2016-05-22
3 Bob 1982-12-25 2019-11-06
4 Henry 1986-01-06 2005-06-16
Sort by Multiple Date Columns in Ascending Order
To sort the DataFrame
by both Date of Birth
and Employment Start
in ascending order, we simply need to add both column names to our sort_values()
method. Just bear in mind the priority of the sort is determined by which column is entered first:
df.sort_values(by=['Date of Birth', 'Employment Start'], inplace=True)
As this method defaults to ascending order, our output will be:
Name Date of Birth Employment Start
1 Paul 1977-05-10 2010-08-17
3 Bob 1982-12-25 2019-11-06
4 Henry 1986-01-06 2005-06-16
0 John 1986-01-06 2016-05-22
2 Dhilan 1988-11-12 2016-05-22
As Date of Birth
is the first column entered in our method, Pandas is prioritizing it. Since John and Henry have the same Date of Birth
, they're sorted by the Employment Start
column instead.
Sort by Multiple Date Columns in Descending Order
As with the single column sort, we can change the order to descending order by changing the ascending
parameter to False
:
df.sort_values(by=['Date of Birth', 'Employment Start'], ascending = False, inplace=True)
Now, our output in descending order is:
Name Date of Birth Employment Start
2 Dhilan 1988-11-12 2016-05-22
0 John 1986-01-06 2016-05-22
4 Henry 1986-01-06 2005-06-16
3 Bob 1982-12-25 2019-11-06
1 Paul 1977-05-10 2010-08-17
As we can see John and Henry both appear higher in the list as the birthdays are displayed in descending order. This time though, John takes priority over Henry due to his more recent Employment Start
date.
Sort by Multiple Date Columns and Variable Order Sorts
Now, what if we not only want to sort using multiple columns but also have these columns sorted using different ascending criteria? With Pandas, this can be implemented within the same sort_values()
method we've used so far. We just have to pass the correct and corresponding list of values in the ascending
parameter.
In this example let’s assume we want to sort our Employment Start
in ascending order, i.e. longest serving first, but then their Date of Birth
in descending order i.e. youngest first:
df.sort_values(by=['Employment Start', 'Date of Birth'], ascending = [True, False], inplace=True)
The data is first sorted by Employment Start
in ascending order, this takes priority as this was the first column passed in our method. We then sort Date of Birth
in descending order. As Dhilan and John share the same Employment Start
date, Dhilan now takes priority as he is younger than John:
Name Date of Birth Employment Start
4 Henry 1986-01-06 2005-06-16
1 Paul 1977-05-10 2010-08-17
2 Dhilan 1988-11-12 2016-05-22
0 John 1986-01-06 2016-05-22
3 Bob 1982-12-25 2019-11-06
Conclusion
Given the popularity of the Pandas library, it is hardly surprising that sorting data based on columns is a straightforward process. We took a look at the flexibility of using the sort_values()
method across single and multiple columns, in ascending, descending and even a variable order. Whilst we have focused on sorting by date, this method can be used across multiple data types.
When looking to sort by date in particular, the first, and arguably most important step, is making sure we have correctly assigned the datetime
type to our data. Without correctly defining our data type we risk Pandas not recognizing our dates at all.