How to Merge DataFrames in Pandas - merge(), join(), append(), concat() and update()

Introduction

Pandas provides a huge range of methods and functions to manipulate data, including merging DataFrames. Merging DataFrames allows you to both create a new DataFrame without modifying the original data source or alter the original data source.

If you are familiar with the SQL or a similar type of tabular data, you probably are familiar with the term join, which means combining DataFrames to form a new DataFrame. If you are a beginner it can be hard to fully grasp the join types (inner, outer, left, right). In this tutorial we'll go over by join types with examples.

Our main focus would be on using the merge() and concat() functions. However, we will discuss other merging methods to give you as many practical alternatives as possible.

For this tutorial, we are using Pandas version 1.1.4 and NumPy version 1.19.4.

For your convenience, here's the table of contents:

Merge DataFrames Using merge()

Let's start by setting up our DataFrames, which we'll use for the rest of the tutorial.

df1 will include our imaginary user list with names, emails, and IDs.

import pandas as pd

df1 = pd.DataFrame({'user_id': ['id001', 'id002', 'id003', 'id004', 'id005', 'id006', 'id007'],
                    'first_name': ['Rivi', 'Wynnie', 'Kristos', 'Madalyn', 'Tobe', 'Regan', 'Kristin'],
                    'last_name': ['Valti', 'McMurty', 'Ivanets', 'Max', 'Riddich', 'Huyghe', 'Illis'],
                    'email': ['[email protected]', '[email protected]', '[email protected]',
                              '[email protected]', '[email protected]', '[email protected]', '[email protected]']
                    })

When designing databases, it's considered good practice to keep profile settings (like background color, avatar image link, font size etc.) in a separate table from the user data (email, date added, etc). These tables can then have a one-to-one relationship.

To simulate this scenario we will do the same by creating df2 with image URLs and user IDs:


df2 = pd.DataFrame({'user_id': ['id001', 'id002', 'id003', 'id004', 'id005'],
                    'image_url': ['http://example.com/img/id001.png', 'http://example.com/img/id002.jpg',
                                  'http://example.com/img/id003.bmp', 'http://example.com/img/id004.jpg',
                                  'http://example.com/img/id005.png']
                    })

Here's how our DataFrames look like:

# df1
  user_id first_name last_name                  email
0   id001       Rivi     Valti    [email protected]
1   id002     Wynnie   McMurty  [email protected]
2   id003    Kristos   Ivanets  [email protected]
3   id004    Madalyn       Max      [email protected]
4   id005       Tobe   Riddich  [email protected]
5   id006      Regan    Huyghe    [email protected]
6   id007    Kristin     Illis    [email protected]

#df2
  user_id                         image_url
0   id001  http://example.com/img/id001.png
1   id002  http://example.com/img/id002.jpg
2   id003  http://example.com/img/id003.bmp
3   id004  http://example.com/img/id004.jpg
4   id005  http://example.com/img/id005.png

Let's combine these DataFrames with the merge() function. First, have a look at all the of options this function can accept at a glance:

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=True,
         suffixes=('_x', '_y'), copy=True, indicator=False,
         validate=None)

Most of these options have a default value except for the left and right. These two parameters are the names of the DataFrames that we will merge. The function itself will return a new DataFrame, which we will store in df3_merged variable.

Enter the following code in your Python shell:

df3_merged = pd.merge(df1, df2)

Since both of our DataFrames have the column user_id with the same name, the merge() function automatically joins two tables matching on that key. If we had two columns with different names, we could use left_on='left_column_name' and right_on='right_column_name' to specify keys on both DataFrames explicitly.

Let's print the df3_merged variable to see its contents:

  user_id first_name last_name                  email                         image_url
0   id001       Rivi     Valti    [email protected]  http://example.com/img/id001.png
1   id002     Wynnie   McMurty  [email protected]  http://example.com/img/id002.jpg
2   id003    Kristos   Ivanets  [email protected]  http://example.com/img/id003.bmp
3   id004    Madalyn       Max      [email protected]  http://example.com/img/id004.jpg
4   id005       Tobe   Riddich  [email protected]  http://example.com/img/id005.png

You'll notice that df3_merged has only 5 rows while the original df1 had 7. Why is that?

When the default value of the how parameter is set to inner, a new DataFrame is generated from the intersection of the left and right DataFrames. Therefore, if a user_id is missing in one of the tables, it would not be in the merged DataFrame.

This would stay true even if swapped places of the left and right rows:

df3_merged = pd.merge(df2, df1)

The results are still:

  user_id                         image_url first_name last_name                  email
0   id001  http://example.com/img/id001.png       Rivi     Valti    [email protected]
1   id002  http://example.com/img/id002.jpg     Wynnie   McMurty  [email protected]
2   id003  http://example.com/img/id003.bmp    Kristos   Ivanets  [email protected]
3   id004  http://example.com/img/id004.jpg    Madalyn       Max      [email protected]
4   id005  http://example.com/img/id005.png       Tobe   Riddich  [email protected]

Users with IDs 'id006' and 'id007' are not part of the merged DataFrames since they do not intersect on both tables.

However, there are times we want to use one of the DataFrame as the main DataFrame and include all the rows from even if they don't all intersect with each other. That is to say, to have all of our users, while the image_url is optional.

How? By using merge(), we can pass the 'left' argument to the how parameter:

df_left_merge = pd.merge(df1, df2, how='left')

print(df_left_merge)

With a left join, we've included all elements of the left DataFrame (df1) and every element of the right DataFrame (df2). Running the above code would display this:

  user_id first_name last_name                  email                         image_url
0   id001       Rivi     Valti    [email protected]  http://example.com/img/id001.png
1   id002     Wynnie   McMurty  [email protected]  http://example.com/img/id002.jpg
2   id003    Kristos   Ivanets  [email protected]  http://example.com/img/id003.bmp
3   id004    Madalyn       Max      [email protected]  http://example.com/img/id004.jpg
4   id005       Tobe   Riddich  [email protected]  http://example.com/img/id005.png
5   id006      Regan    Huyghe    [email protected]                               NaN
6   id007    Kristin     Illis    [email protected]                               NaN

Cells that don't have any matching values with the left DataFrame are filled with NaN.

Why don't we try a right join? Create the following merged DataFrame:

df_right_merge = pd.merge(df1, df2, how='right')

print(df_right_merge)

As you may have expected, the right join would return every value from the left DataFrame that matches the right DataFrame:

  user_id first_name last_name                  email                         image_url
0   id001       Rivi     Valti    [email protected]  http://example.com/img/id001.png
1   id002     Wynnie   McMurty  [email protected]  http://example.com/img/id002.jpg
2   id003    Kristos   Ivanets  [email protected]  http://example.com/img/id003.bmp
3   id004    Madalyn       Max      [email protected]  http://example.com/img/id004.jpg
4   id005       Tobe   Riddich  [email protected]  http://example.com/img/id005.png

As every row in df2 has a value in df1, this right join is similar to the inner join, in this case.

Let's have a look at outer joins. To best illustrate how they work, let's swap places of our DataFrames and create 2 new variables for both left and outer joins:

df_left = pd.merge(df2, df1, how='left', indicator=True)
df_outer = pd.merge(df2, df1, how='outer', indicator=True)

print(df_left)
print(df_outer)

Keep in mind that our left DataFrame is df2 and the right DataFrame is df1. Using how='outer' merges DataFrames matching on the key but also includes the values that are missing or don't match.

We also added the indicator flag and set it to True so that Pandas adds an additional column _merge to the end of our DataFrame. This column tells us if a row was found in the left, right or both DataFrames.

The df_left variable looks like this:

  user_id                         image_url first_name last_name                  email _merge
0   id001  http://example.com/img/id001.png       Rivi     Valti    [email protected]   both
1   id002  http://example.com/img/id002.jpg     Wynnie   McMurty  [email protected]   both
2   id003  http://example.com/img/id003.bmp    Kristos   Ivanets  [email protected]   both
3   id004  http://example.com/img/id004.jpg    Madalyn       Max      [email protected]   both
4   id005  http://example.com/img/id005.png       Tobe   Riddich  [email protected]   both

However, df_outer has this data:

  user_id                         image_url first_name last_name                  email      _merge
0   id001  http://example.com/img/id001.png       Rivi     Valti    [email protected]        both
1   id002  http://example.com/img/id002.jpg     Wynnie   McMurty  [email protected]        both
2   id003  http://example.com/img/id003.bmp    Kristos   Ivanets  [email protected]        both
3   id004  http://example.com/img/id004.jpg    Madalyn       Max      [email protected]        both
4   id005  http://example.com/img/id005.png       Tobe   Riddich  [email protected]        both
5   id006                               NaN      Regan    Huyghe    [email protected]  right_only
6   id007                               NaN    Kristin     Illis    [email protected]  right_only

Notice that in the df_outer DataFrame id006 and id007 only exists in right DataFrame (in this case it's df1). If we would try to compare the left and outer joins without swapping the places, we would end up with the same results for both of them.

Merge DataFrames Using join()

Unlike merge() which is a method of the Pandas instance, join() is a method of the DataFrame itself. This means that we can use it like a static method on the DataFrame: DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False).

The DataFrame we call join() from will be our left DataFrame. The DataFrame in the other argument would be our right DataFrame.

The on parameter can take one or more (['key1', 'key2' ...]) arguments to define the matching key, while how parameter takes one of the handle arguments (left, right, outer, inner), and it's set to left by default.

Let's try to join df2 to df1:

df_join = df1.join(df2, rsuffix='_right')

print(df_join)

Like the merge() function, the join() function automatically tries to match the keys (columns) with the same name. In our case, it's the user_id key.

The above code prints out this:

  user_id first_name last_name                  email user_id_right                         image_url
0   id001       Rivi     Valti    [email protected]         id001  http://example.com/img/id001.png
1   id002     Wynnie   McMurty  [email protected]         id002  http://example.com/img/id002.jpg
2   id003    Kristos   Ivanets  [email protected]         id003  http://example.com/img/id003.bmp
3   id004    Madalyn       Max      [email protected]         id004  http://example.com/img/id004.jpg
4   id005       Tobe   Riddich  [email protected]         id005  http://example.com/img/id005.png
5   id006      Regan    Huyghe    [email protected]           NaN                               NaN
6   id007    Kristin     Illis    [email protected]           NaN                               NaN

You probably noticed a "duplicate column" called user_id_right. If you don't want to display that column, you can set the user_id columns as an index on both columns so it would join without a suffix:

df_join_no_duplicates = df1.set_index('user_id').join(df2.set_index('user_id'))

print(df_join_no_duplicates)

By doing so, we are getting rid of the user_id column and setting it as the index column instead. This provides us with a cleaner resulting DataFrame:

        first_name last_name                  email                         image_url
user_id                                                                              
id001         Rivi     Valti    [email protected]  http://example.com/img/id001.png
id002       Wynnie   McMurty  [email protected]  http://example.com/img/id002.jpg
id003      Kristos   Ivanets  [email protected]  http://example.com/img/id003.bmp
id004      Madalyn       Max      [email protected]  http://example.com/img/id004.jpg
id005         Tobe   Riddich  [email protected]  http://example.com/img/id005.png
id006        Regan    Huyghe    [email protected]                               NaN
id007      Kristin     Illis    [email protected]                               NaN

Merge DataFrames Using append()

As the official Pandas documentation points, since concat() and append() methods return new copies of DataFrames, overusing these methods can affect the performance of your program.

Append is very useful when you want to merge two DataFrames in row axis only. This means that instead of matching data on their columns, we want a new DataFrame that contains all the rows of 2 DataFrames.

Let's append df2 to df1 and print the results:

df_append = df1.append(df2, ignore_index=True)

print(df_append)

Using append() will not match DataFrames on any keys. It will just add the other DataFrame to the first and return a copy of it. If the shapes of DataFrames do not match, Pandas will replace any unmatched cells with a NaN.

The output for appending the two DataFrames looks like this:

   user_id first_name last_name                  email                         image_url
0    id001       Rivi     Valti    [email protected]                               NaN
1    id002     Wynnie   McMurty  [email protected]                               NaN
2    id003    Kristos   Ivanets  [email protected]                               NaN
3    id004    Madalyn       Max      [email protected]                               NaN
4    id005       Tobe   Riddich  [email protected]                               NaN
5    id006      Regan    Huyghe    [email protected]                               NaN
6    id007    Kristin     Illis    [email protected]                               NaN
7    id001        NaN       NaN                    NaN  http://example.com/img/id001.png
8    id002        NaN       NaN                    NaN  http://example.com/img/id002.jpg
9    id003        NaN       NaN                    NaN  http://example.com/img/id003.bmp
10   id004        NaN       NaN                    NaN  http://example.com/img/id004.jpg
11   id005        NaN       NaN                    NaN  http://example.com/img/id005.png

Most users choose concat() over the append() since it also provides the key matching and axis option.

Merge DataFrames Using concat()

Concatenation is a bit more flexible when compared to merge() and join() as it allows us to combine DataFrames either vertically (row-wise) or horizontally (column-wise).

The trade-off is that any data that doesn't match will be discarded. Here's the full function with the parameters:

pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None,
              levels=None, names=None, verify_integrity=False, sort=False, copy=True)

Here are the most commonly used parameters for the concat() function:

  • objs is the list of DataFrame objects ([df1, df2, ...]) to be concatenated
  • axis defines the direction of the concatenation, 0 for row-wise and 1 for column-wise
  • join can either be inner (intersection) or outer (union)
  • ignore_index by default set to False which allows the index values to remain as they were in the original DataFrames, can cause duplicate index values. If set to True, it will ignore the original values and re-assign index values in sequential order
  • keys allows us to construct a hierarchical index. Think of it as another level of the index that appended on the outer left of the DataFrame that helps us to distinguish indices when values are not unique

Let's create a new DataFrame with the same column types with the df2, but this one includes the image_url for id006 and id007:

df2_addition = pd.DataFrame({'user_id': ['id006', 'id007'],
                             'image_url': ['http://example.com/img/id006.png',
                                           'http://example.com/img/id007.jpg']
                             })

In order to join df2 and df2_addition row-wise, we can pass them in a list as the objs parameter and assign the resulting DataFrame to a new variable:

df_row_concat = pd.concat([df2, df2_addition])

print(df_row_concat)

We successfully filled in the missing values:

  user_id                         image_url
0   id001  http://example.com/img/id001.png
1   id002  http://example.com/img/id002.jpg
2   id003  http://example.com/img/id003.bmp
3   id004  http://example.com/img/id004.jpg
4   id005  http://example.com/img/id005.png
0   id006  http://example.com/img/id006.png
1   id007  http://example.com/img/id007.jpg

However, have a look at the indices in the left-most column. The indices 0 and 1 are repeating. To get entirely new and unique index values, we pass True to the ignore_index parameter:

df_row_concat = pd.concat([df2, df2_addition], ignore_index=True)

Now our df_row_concat has unique index values:

  user_id                         image_url
0   id001  http://example.com/img/id001.png
1   id002  http://example.com/img/id002.jpg
2   id003  http://example.com/img/id003.bmp
3   id004  http://example.com/img/id004.jpg
4   id005  http://example.com/img/id005.png
5   id006  http://example.com/img/id006.png
6   id007  http://example.com/img/id007.jpg

As we mentioned earlier, concatenation can work both horizontally and vertically. To join two DataFrames together column-wise, we will need to change the axis value from the default 0 to 1:

df_column_concat = pd.concat([df1, df_row_concat], axis=1)

print(df_column_concat)

You will notice that it doesn't work like merge, matching two tables on a key:

  user_id first_name last_name                  email user_id                         image_url
0   id001       Rivi     Valti    [email protected]   id001  http://example.com/img/id001.png
1   id002     Wynnie   McMurty  [email protected]   id002  http://example.com/img/id002.jpg
2   id003    Kristos   Ivanets  [email protected]   id003  http://example.com/img/id003.bmp
3   id004    Madalyn       Max      [email protected]   id004  http://example.com/img/id004.jpg
4   id005       Tobe   Riddich  [email protected]   id005  http://example.com/img/id005.png
5   id006      Regan    Huyghe    [email protected]   id006  http://example.com/img/id006.png
6   id007    Kristin     Illis    [email protected]   id007  http://example.com/img/id007.jpg

If our right DataFrame didn't even have a user_id column, this concatenation still would return the same result. The concat() function glues two DataFrames together, taking the DataFrames indices values and table shape into consideration

It doesn't do key matching like merge() of join(). Try different concatenation combinations by changing the join parameter to see the differences!

Merge DataFrames Using combine_first() and update()

In some cases, you might want to fill the missing data in your DataFrame by merging it with another DataFrame. By doing so, you will keep all the non-missing values in the first DataFrame while replacing all NaN values with available non-missing values from the second DataFrame (if there are any).

For this example, we will import NumPy to use NaN values. If you installed Pandas with pip, NumPy should already be installed.

Type the following code in your Python shell or script file:

import numpy as np

df_first = pd.DataFrame({'COL 1': ['X', 'X', np.nan],
                         'COL 2': ['X', np.nan, 'X'],
                         'COL 3': [np.nan, 'X', 'X']},
                        index=range(0, 3))

df_second = pd.DataFrame({'COL 1': [np.nan, 'O', 'O'],
                          'COL 2': ['O', 'O', 'O']},
                         index=range(0, 3))

print(df_first)
print(df_second)

The df_first DataFrame has 3 columns and 3 missing values in each of them:

  COL 1 COL 2 COL 3
0     X     X   NaN
1     X   NaN     X
2   NaN     X     X

While df_second has only 2 columns and one missing value in the first column:

  COL 1 COL 2
0   NaN     O
1     O     O
2     O     O

We can use df_second to patch missing values in df_first with all corresponding values:

df_tictactoe = df_first.combine_first(df_second)

print(df_tictactoe)

As mentioned earlier, using the combine_first() method will only replace NaN values in index wise order, and it will leave every non-missing values in the first DataFrame as they are:

  COL 1 COL 2 COL 3
0     X     X   NaN
1     X     O     X
2     O     X     X

On the other hand, if we wanted to overwrite the values in df_first with the corresponding values from df_second (regardless they are NaN or not), we would use the update() method.

Let's first add a another DataFrame to our code:

df_third = pd.DataFrame({'COL 1': ['O'], 'COL 2': ['O'], 'COL 3': ['O']})

print(df_third)

The shape is (1, 3) - 1 row and three columns, excluding the index:

  COL 1 COL 2 COL 3
0     O     O     O

Now let's update the df_first with the values from df_third:

df_first.update(df_third)

print(df_first)

Keep in mind that unlike combine_first(), update() does not return a new DataFrame. It modifies the df_first in-place, altering the corresponding values:

  COL 1 COL 2 COL 3
0     O     O     O
1     X   NaN     X
2   NaN     X     X

The overwrite parameter of the update() function is set to True by default. This is why it changes all corresponding values, instead of only NaN values. We can change it to False to replace only NaN values:

df_tictactoe.update(df_first, overwrite=False)

print(df_tictactoe)

Here's the final state of our df_tictactoe DataFrame:

  COL 1 COL 2 COL 3
0     X     X     O
1     X     O     X
2     O     X     X

Not only we successfully updated the values, but we also won the Tic-Tac-Toe game!

Conclusion

Pandas provides powerful tools for merging DataFrames. But it can be hard to decide when to use what. While most of the times merge() function is sufficient, for some cases you might want to use concat() to merge row-wise, or use join() with suffixes, or get rid of missing values with combine_first() and update(). You can even add rows of data with append().

Use the function that you're most comfortable with, and is best for the task at hand. How would these functions help you manipulate data in Pandas?

Author image
Full-stack software developer. Python & C#. Linux user. Excel Ninja