Introduction
Pandas provides a huge range of methods and functions to manipulate data, including merging DataFrames. Merging DataFrames allows you to both create a new DataFrame without modifying the original data source or alter the original data source.
If you are familiar with the SQL or a similar type of tabular data, you probably are familiar with the term join
, which means combining DataFrames to form a new DataFrame. If you are a beginner it can be hard to fully grasp the join types (inner, outer, left, right). In this tutorial we'll go over by join types with examples.
Our main focus would be on using the merge()
and concat()
functions. However, we will discuss other merging methods to give you as many practical alternatives as possible.
For this tutorial, we are using Pandas version 1.1.4 and NumPy version 1.19.4.
Merge DataFrames Using merge()
Let's start by setting up our DataFrames, which we'll use for the rest of the tutorial.
df1
will include our imaginary user list with names, emails, and IDs.
import pandas as pd
df1 = pd.DataFrame({'user_id': ['id001', 'id002', 'id003', 'id004', 'id005', 'id006', 'id007'],
'first_name': ['Rivi', 'Wynnie', 'Kristos', 'Madalyn', 'Tobe', 'Regan', 'Kristin'],
'last_name': ['Valti', 'McMurty', 'Ivanets', 'Max', 'Riddich', 'Huyghe', 'Illis'],
'email': ['[email protected]', '[email protected]', '[email protected]',
'[email protected]', '[email protected]', '[email protected]', '[email protected]']
})
When designing databases, it's considered good practice to keep profile settings (like background color, avatar image link, font size etc.) in a separate table from the user data (email, date added, etc). These tables can then have a one-to-one relationship.
To simulate this scenario we will do the same by creating df2
with image URLs and user IDs:
df2 = pd.DataFrame({'user_id': ['id001', 'id002', 'id003', 'id004', 'id005'],
'image_url': ['http://example.com/img/id001.png', 'http://example.com/img/id002.jpg',
'http://example.com/img/id003.bmp', 'http://example.com/img/id004.jpg',
'http://example.com/img/id005.png']
})
Here's how our DataFrames look like:
# df1
user_id first_name last_name email
0 id001 Rivi Valti [email protected]
1 id002 Wynnie McMurty [email protected]
2 id003 Kristos Ivanets [email protected]
3 id004 Madalyn Max [email protected]
4 id005 Tobe Riddich [email protected]
5 id006 Regan Huyghe [email protected]
6 id007 Kristin Illis [email protected]
#df2
user_id image_url
0 id001 http://example.com/img/id001.png
1 id002 http://example.com/img/id002.jpg
2 id003 http://example.com/img/id003.bmp
3 id004 http://example.com/img/id004.jpg
4 id005 http://example.com/img/id005.png
Let's combine these DataFrames with the merge()
function. First, have a look at all the of options this function can accept at a glance:
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
left_index=False, right_index=False, sort=True,
suffixes=('_x', '_y'), copy=True, indicator=False,
validate=None)
Most of these options have a default value except for the left and right. These two parameters are the names of the DataFrames that we will merge. The function itself will return a new DataFrame, which we will store in df3_merged
variable.
Enter the following code in your Python shell:
df3_merged = pd.merge(df1, df2)
Since both of our DataFrames have the column user_id
with the same name, the merge()
function automatically joins two tables matching on that key. If we had two columns with different names, we could use left_on='left_column_name'
and right_on='right_column_name'
to specify keys on both DataFrames explicitly.
Let's print the df3_merged
variable to see its contents:
user_id first_name last_name email image_url
0 id001 Rivi Valti [email protected] http://example.com/img/id001.png
1 id002 Wynnie McMurty [email protected] http://example.com/img/id002.jpg
2 id003 Kristos Ivanets [email protected] http://example.com/img/id003.bmp
3 id004 Madalyn Max [email protected] http://example.com/img/id004.jpg
4 id005 Tobe Riddich [email protected] http://example.com/img/id005.png
You'll notice that df3_merged
has only 5 rows while the original df1
had 7. Why is that?
When the default value of the how
parameter is set to inner
, a new DataFrame is generated from the intersection of the left and right DataFrames. Therefore, if a user_id
is missing in one of the tables, it would not be in the merged DataFrame.
This would stay true even if swapped places of the left and right rows:
df3_merged = pd.merge(df2, df1)
The results are still:
user_id image_url first_name last_name email
0 id001 http://example.com/img/id001.png Rivi Valti [email protected]
1 id002 http://example.com/img/id002.jpg Wynnie McMurty [email protected]
2 id003 http://example.com/img/id003.bmp Kristos Ivanets [email protected]
3 id004 http://example.com/img/id004.jpg Madalyn Max [email protected]
4 id005 http://example.com/img/id005.png Tobe Riddich [email protected]
Users with IDs 'id006'
and 'id007'
are not part of the merged DataFrames since they do not intersect on both tables.
However, there are times we want to use one of the DataFrames as the main DataFrame and include all the rows from it even if they don't all intersect with each other. That is to say, to have all of our users, while the image_url
is optional.
How? By using merge()
, we can pass the 'left'
argument to the how
parameter:
df_left_merge = pd.merge(df1, df2, how='left')
print(df_left_merge)
With a left join, we've included all elements of the left DataFrame (df1
) and every element of the right DataFrame (df2
). Running the above code would display this:
user_id first_name last_name email image_url
0 id001 Rivi Valti [email protected] http://example.com/img/id001.png
1 id002 Wynnie McMurty [email protected] http://example.com/img/id002.jpg
2 id003 Kristos Ivanets [email protected] http://example.com/img/id003.bmp
3 id004 Madalyn Max [email protected] http://example.com/img/id004.jpg
4 id005 Tobe Riddich [email protected] http://example.com/img/id005.png
5 id006 Regan Huyghe [email protected] NaN
6 id007 Kristin Illis [email protected] NaN
Cells that don't have any matching values with the left DataFrame are filled with NaN
.
Why don't we try a right join? Create the following merged DataFrame:
df_right_merge = pd.merge(df1, df2, how='right')
print(df_right_merge)
As you may have expected, the right join would return every value from the left DataFrame that matches the right DataFrame:
user_id first_name last_name email image_url
0 id001 Rivi Valti [email protected] http://example.com/img/id001.png
1 id002 Wynnie McMurty [email protected] http://example.com/img/id002.jpg
2 id003 Kristos Ivanets [email protected] http://example.com/img/id003.bmp
3 id004 Madalyn Max [email protected] http://example.com/img/id004.jpg
4 id005 Tobe Riddich [email protected] http://example.com/img/id005.png
As every row in df2
has a value in df1
, this right
join is similar to the inner
join, in this case.
Let's have a look at outer
joins. To best illustrate how they work, let's swap places of our DataFrames and create 2 new variables for both left and outer joins:
df_left = pd.merge(df2, df1, how='left', indicator=True)
df_outer = pd.merge(df2, df1, how='outer', indicator=True)
print(df_left)
print(df_outer)
Keep in mind that our left DataFrame is df2
and the right DataFrame is df1
. Using how='outer'
merges DataFrames matching on the key but also includes the values that are missing or don't match.
We also added the indicator
flag and set it to True
so that Pandas adds an additional column _merge
to the end of our DataFrame. This column tells us if a row was found in the left, right or both DataFrames.
The df_left
variable looks like this:
user_id image_url first_name last_name email _merge
0 id001 http://example.com/img/id001.png Rivi Valti [email protected] both
1 id002 http://example.com/img/id002.jpg Wynnie McMurty [email protected] both
2 id003 http://example.com/img/id003.bmp Kristos Ivanets [email protected] both
3 id004 http://example.com/img/id004.jpg Madalyn Max [email protected] both
4 id005 http://example.com/img/id005.png Tobe Riddich [email protected] both
However, df_outer
has this data:
user_id image_url first_name last_name email _merge
0 id001 http://example.com/img/id001.png Rivi Valti [email protected] both
1 id002 http://example.com/img/id002.jpg Wynnie McMurty [email protected] both
2 id003 http://example.com/img/id003.bmp Kristos Ivanets [email protected] both
3 id004 http://example.com/img/id004.jpg Madalyn Max [email protected] both
4 id005 http://example.com/img/id005.png Tobe Riddich [email protected] both
5 id006 NaN Regan Huyghe [email protected] right_only
6 id007 NaN Kristin Illis [email protected] right_only
Notice that in the df_outer
DataFrame id006
and id007
only exists in right DataFrame (in this case it's df1
). If we would try to compare the left and outer joins without swapping the places, we would end up with the same results for both of them.
Merge DataFrames Using join()
Unlike merge()
which is a method of the Pandas instance, join()
is a method of the DataFrame itself. This means that we can use it like a static method on the DataFrame: DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
.
The DataFrame we call join()
from will be our left DataFrame. The DataFrame in the other
argument would be our right DataFrame.
The on
parameter can take one or more (['key1', 'key2' ...]
) arguments to define the matching key, while how
parameter takes one of the handle arguments (left, right, outer, inner), and it's set to left
by default.
Let's try to join df2
to df1
:
df_join = df1.join(df2, rsuffix='_right')
print(df_join)
Like the merge()
function, the join()
function automatically tries to match the keys (columns) with the same name. In our case, it's the user_id
key.
The above code prints out this:
user_id first_name last_name email user_id_right image_url
0 id001 Rivi Valti [email protected] id001 http://example.com/img/id001.png
1 id002 Wynnie McMurty [email protected] id002 http://example.com/img/id002.jpg
2 id003 Kristos Ivanets [email protected] id003 http://example.com/img/id003.bmp
3 id004 Madalyn Max [email protected] id004 http://example.com/img/id004.jpg
4 id005 Tobe Riddich [email protected] id005 http://example.com/img/id005.png
5 id006 Regan Huyghe [email protected] NaN NaN
6 id007 Kristin Illis [email protected] NaN NaN
Check out our hands-on, practical guide to learning Git, with best-practices, industry-accepted standards, and included cheat sheet. Stop Googling Git commands and actually learn it!
You probably noticed a "duplicate column" called user_id_right
. If you don't want to display that column, you can set the user_id
columns as an index on both columns so it would join without a suffix:
df_join_no_duplicates = df1.set_index('user_id').join(df2.set_index('user_id'))
print(df_join_no_duplicates)
By doing so, we are getting rid of the user_id
column and setting it as the index column instead. This provides us with a cleaner resulting DataFrame:
first_name last_name email image_url
user_id
id001 Rivi Valti [email protected] http://example.com/img/id001.png
id002 Wynnie McMurty [email protected] http://example.com/img/id002.jpg
id003 Kristos Ivanets [email protected] http://example.com/img/id003.bmp
id004 Madalyn Max [email protected] http://example.com/img/id004.jpg
id005 Tobe Riddich [email protected] http://example.com/img/id005.png
id006 Regan Huyghe [email protected] NaN
id007 Kristin Illis [email protected] NaN
Merge DataFrames Using append()
As the official Pandas documentation points, since concat()
and append()
methods return new copies of DataFrames, overusing these methods can affect the performance of your program.
Append is very useful when you want to merge two DataFrames in row axis only. This means that instead of matching data on their columns, we want a new DataFrame that contains all the rows of 2 DataFrames.
Let's append df2
to df1
and print the results:
df_append = df1.append(df2, ignore_index=True)
print(df_append)
Using append()
will not match DataFrames on any keys. It will just add the other DataFrame to the first and return a copy of it. If the shapes of DataFrames do not match, Pandas will replace any unmatched cells with a NaN.
The output for appending the two DataFrames looks like this:
user_id first_name last_name email image_url
0 id001 Rivi Valti [email protected] NaN
1 id002 Wynnie McMurty [email protected] NaN
2 id003 Kristos Ivanets [email protected] NaN
3 id004 Madalyn Max [email protected] NaN
4 id005 Tobe Riddich [email protected] NaN
5 id006 Regan Huyghe [email protected] NaN
6 id007 Kristin Illis [email protected] NaN
7 id001 NaN NaN NaN http://example.com/img/id001.png
8 id002 NaN NaN NaN http://example.com/img/id002.jpg
9 id003 NaN NaN NaN http://example.com/img/id003.bmp
10 id004 NaN NaN NaN http://example.com/img/id004.jpg
11 id005 NaN NaN NaN http://example.com/img/id005.png
Most users choose concat()
over the append()
since it also provides the key matching and axis option.
Merge DataFrames Using concat()
Concatenation is a bit more flexible when compared to merge()
and join()
as it allows us to combine DataFrames either vertically (row-wise) or horizontally (column-wise).
The trade-off is that any data that doesn't match will be discarded. Here's the full function with the parameters:
pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None,
levels=None, names=None, verify_integrity=False, sort=False, copy=True)
Here are the most commonly used parameters for the concat()
function:
objs
is the list of DataFrame objects ([df1, df2, ...]) to be concatenatedaxis
defines the direction of the concatenation,0
for row-wise and1
for column-wisejoin
can either beinner
(intersection) orouter
(union)ignore_index
by default set toFalse
which allows the index values to remain as they were in the original DataFrames, can cause duplicate index values. If set toTrue
, it will ignore the original values and re-assign index values in sequential orderkeys
allows us to construct a hierarchical index. Think of it as another level of the index that appended on the outer left of the DataFrame that helps us to distinguish indices when values are not unique
Let's create a new DataFrame with the same column types with the df2
, but this one includes the image_url
for id006
and id007
:
df2_addition = pd.DataFrame({'user_id': ['id006', 'id007'],
'image_url': ['http://example.com/img/id006.png',
'http://example.com/img/id007.jpg']
})
In order to join df2
and df2_addition
row-wise, we can pass them in a list as the objs
parameter and assign the resulting DataFrame to a new variable:
df_row_concat = pd.concat([df2, df2_addition])
print(df_row_concat)
We successfully filled in the missing values:
user_id image_url
0 id001 http://example.com/img/id001.png
1 id002 http://example.com/img/id002.jpg
2 id003 http://example.com/img/id003.bmp
3 id004 http://example.com/img/id004.jpg
4 id005 http://example.com/img/id005.png
0 id006 http://example.com/img/id006.png
1 id007 http://example.com/img/id007.jpg
However, have a look at the indices in the left-most column. The indices 0
and 1
are repeating. To get entirely new and unique index values, we pass True
to the ignore_index
parameter:
df_row_concat = pd.concat([df2, df2_addition], ignore_index=True)
Now our df_row_concat
has unique index values:
user_id image_url
0 id001 http://example.com/img/id001.png
1 id002 http://example.com/img/id002.jpg
2 id003 http://example.com/img/id003.bmp
3 id004 http://example.com/img/id004.jpg
4 id005 http://example.com/img/id005.png
5 id006 http://example.com/img/id006.png
6 id007 http://example.com/img/id007.jpg
As we mentioned earlier, concatenation can work both horizontally and vertically. To join two DataFrames together column-wise, we will need to change the axis
value from the default 0
to 1
:
df_column_concat = pd.concat([df1, df_row_concat], axis=1)
print(df_column_concat)
You will notice that it doesn't work like merge, matching two tables on a key:
user_id first_name last_name email user_id image_url
0 id001 Rivi Valti [email protected] id001 http://example.com/img/id001.png
1 id002 Wynnie McMurty [email protected] id002 http://example.com/img/id002.jpg
2 id003 Kristos Ivanets [email protected] id003 http://example.com/img/id003.bmp
3 id004 Madalyn Max [email protected] id004 http://example.com/img/id004.jpg
4 id005 Tobe Riddich [email protected] id005 http://example.com/img/id005.png
5 id006 Regan Huyghe [email protected] id006 http://example.com/img/id006.png
6 id007 Kristin Illis [email protected] id007 http://example.com/img/id007.jpg
If our right DataFrame didn't even have a user_id
column, this concatenation still would return the same result. The concat()
function glues two DataFrames together, taking the DataFrames indices values and table shape into consideration
It doesn't do key matching like merge()
or join()
. Try different concatenation combinations by changing the join
parameter to see the differences!
Merge DataFrames Using combine_first() and update()
In some cases, you might want to fill the missing data in your DataFrame by merging it with another DataFrame. By doing so, you will keep all the non-missing values in the first DataFrame while replacing all NaN
values with available non-missing values from the second DataFrame (if there are any).
For this example, we will import NumPy to use NaN
values. If you installed Pandas with pip
, NumPy should already be installed.
Type the following code in your Python shell or script file:
import numpy as np
df_first = pd.DataFrame({'COL 1': ['X', 'X', np.nan],
'COL 2': ['X', np.nan, 'X'],
'COL 3': [np.nan, 'X', 'X']},
index=range(0, 3))
df_second = pd.DataFrame({'COL 1': [np.nan, 'O', 'O'],
'COL 2': ['O', 'O', 'O']},
index=range(0, 3))
print(df_first)
print(df_second)
The df_first
DataFrame has 3 columns and 1 missing value in each of them:
COL 1 COL 2 COL 3
0 X X NaN
1 X NaN X
2 NaN X X
While df_second
has only 2 columns and one missing value in the first column:
COL 1 COL 2
0 NaN O
1 O O
2 O O
We can use df_second
to patch missing values in df_first
with all corresponding values:
df_tictactoe = df_first.combine_first(df_second)
print(df_tictactoe)
As mentioned earlier, using the combine_first()
method will only replace NaN
values in index wise order, and it will leave every non-missing values in the first DataFrame as they are:
COL 1 COL 2 COL 3
0 X X NaN
1 X O X
2 O X X
On the other hand, if we wanted to overwrite the values in df_first
with the corresponding values from df_second
(regardless they are NaN or not), we would use the update()
method.
Let's first add a another DataFrame to our code:
df_third = pd.DataFrame({'COL 1': ['O'], 'COL 2': ['O'], 'COL 3': ['O']})
print(df_third)
The shape is (1, 3) - 1 row and three columns, excluding the index:
COL 1 COL 2 COL 3
0 O O O
Now let's update the df_first
with the values from df_third
:
df_first.update(df_third)
print(df_first)
Keep in mind that unlike combine_first()
, update()
does not return a new DataFrame. It modifies the df_first
in-place, altering the corresponding values:
COL 1 COL 2 COL 3
0 O O O
1 X NaN X
2 NaN X X
The overwrite
parameter of the update()
function is set to True
by default. This is why it changes all corresponding values, instead of only NaN
values. We can change it to False
to replace only NaN
values:
df_tictactoe.update(df_first, overwrite=False)
print(df_tictactoe)
Here's the final state of our df_tictactoe
DataFrame:
COL 1 COL 2 COL 3
0 X X O
1 X O X
2 O X X
Not only did we successfully update the values, but we also won the Tic-Tac-Toe game!
Conclusion
Pandas provides powerful tools for merging DataFrames. But it can be hard to decide when to use what. While most of the times merge()
function is sufficient, for some cases you might want to use concat()
to merge row-wise, or use join()
with suffixes, or get rid of missing values with combine_first()
and update()
. You can even add rows of data with append()
.
Use the function that you're most comfortable with, and is best for the task at hand. How would these functions help you manipulate data in Pandas?