Handling Duplicate Values in a Pandas DataFrame

Introduction

As data analysts, it is our responsibility to ensure data integrity so that we can obtain accurate and trustworthy insights. Data cleansing plays a vital role in this process, and duplicate values are among the most common issues analysts encounter. Duplicate values can misrepresent insights, so it is crucial to have efficient methods for dealing with them. In this article, we will learn how to identify and handle duplicate values in Pandas, as well as best practices for managing duplicates.

Identifying Duplicate Values

The first step in handling duplicate values is to identify them. Pandas offers multiple methods for identifying duplicate values within a DataFrame. In this section, we will discuss the duplicated() function and the value_counts() function.

Using duplicated()

The duplicated() function is a Pandas library function that checks for duplicate rows in a DataFrame. The output of the duplicated() function is a boolean series with the same length as the input DataFrame, where each element indicates whether or not the corresponding row is a duplicate.

Let's consider a simple example of the duplicated() function:

import pandas as pd

# Create a DataFrame
data = {
      'StudentName': ['Mark', 'Ali', 'Bob', 'John', 'Johny', 'Mark'],
      'Score': [45, 65, 76, 44, 39, 45]
}
df = pd.DataFrame(data)

# Check for duplicate rows
df_duplicates = df.duplicated()
print(df_duplicates)

Output:

0    False
1    False
2    False
3    False
4    False
5    True
dtype: bool

In the example above, we created a DataFrame containing the names of students and their total scores. We invoked duplicated() on the DataFrame, which generated a boolean series with False representing unique values and True representing duplicate values.

In this example, the first occurrence of each value is considered unique. But what if we want the last occurrence to be the unique one, or we don't want to consider every column when identifying duplicates? We can change this behavior through the parameters of the duplicated() function.

Parameters: Subset and Keep

The duplicated() function offers customization through two optional parameters, described below:

  • subset: This parameter lets us specify the subset of columns to consider during duplicate detection. It is None by default, meaning every column in the DataFrame is considered. To restrict the check to particular columns, pass subset a list of column names.

    Here is an example of using the subset parameter:

    # Check for duplicate values in the StudentName column
    df_duplicates = df.duplicated(subset=['StudentName'])
    print(df_duplicates)

    Output:

    0    False
    1    False
    2    False
    3    False
    4    False
    5    True
    dtype: bool
    
  • keep: This parameter lets us choose which occurrence of a duplicated row is treated as unique rather than flagged as a duplicate. The possible values for keep are:

    • "first": This is the default value. It marks every occurrence as a duplicate except the first, which is treated as unique.
    • "last": This option treats the last occurrence as unique; all earlier occurrences are marked as duplicates.
    • False: This option marks every occurrence of a duplicated row as a duplicate (a short sketch of this option follows the next example).

Here is an example of using the keep parameter:

# Check for duplicate rows
df_duplicates = df.duplicated(keep='last')
print(df_duplicates)

Output:

0    True
1    False
2    False
3    False
4    False
5    False
dtype: bool
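
For completeness, here is a short sketch of keep=False, which flags every occurrence of a duplicated row (both 'Mark' rows in our DataFrame):

# Mark every occurrence of a duplicated row
df_duplicates = df.duplicated(keep=False)
print(df_duplicates)

This prints True for rows 0 and 5 (the two identical 'Mark' rows) and False for the rest.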

Visualize Duplicate Values

The value_counts() function is the second approach for identifying duplicates. It counts the number of times each unique value appears in a column. By applying value_counts() to a specific column, we can inspect the frequency of each value; any value with a count greater than one is a duplicate.

Here is an example of using the value_counts() function:

import matplotlib.pyplot as plt
import pandas as pd

# Create a DataFrame
data = {
      'StudentName': ['Mark', 'Ali', 'Bob', 'John', 'Johny', 'Mark'],
      'Score': [45, 65, 76, 44, 39, 45]
}
df = pd.DataFrame(data)

# Count occurrences of each unique value in the 'StudentName' column
name_counts = df['StudentName'].value_counts()
print(name_counts)

Output:

Mark     2
Ali      1
Bob      1
John     1
Johny    1
Name: StudentName, dtype: int64

Let's now visualize these counts with a bar chart, which makes duplicate values easy to spot.

# Create a bar plot of Student name frequencies
name_counts.plot(kind='bar')
plt.xlabel('Student Name')
plt.ylabel('Frequency')
plt.title('Duplicate Name Frequencies')
plt.show()

Handling Duplicate Values

After identifying duplicate values, it's time to address them. In this section, we will explore various strategies for removing and updating duplicate values using the pandas drop_duplicates() and replace() functions. Additionally, we will discuss aggregating data with duplicate values using the groupby() function.

Removing Duplicate Values

The most common approach for handling duplicates is to remove them from the DataFrame. To eliminate duplicate records, we use the drop_duplicates() function. By default, this function keeps the first instance of each duplicate row and removes subsequent occurrences. It identifies duplicates based on all column values; however, we can restrict the comparison to specific columns using the subset parameter.

The syntax of drop_duplicates() with its default parameter values is as follows:

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)

The subset and keep parameters have the same meaning as in duplicated(). If we set the third parameter, inplace, to True, all modifications are performed directly on the original DataFrame and the method returns None. By default, inplace is False.

Here is an example of the drop_duplicates() function:

# Remove duplicate rows
df.drop_duplicates(keep='last', inplace=True)
print(df)

Output:

      StudentName  Score
1            Ali     65
2            Bob     76
3           John     44
4          Johny     39
5           Mark     45

In the example above, the first row (the first occurrence of 'Mark') was removed, because keep='last' retains the last occurrence of each duplicate.

Replace or Update Duplicate Values

The second method for handling duplicates involves replacing the value using the Pandas replace() function. The replace() function allows us to replace specific values or patterns in a DataFrame with new values. By default, it replaces every instance of the value in the column, so to update only a particular occurrence we need to target its row explicitly, for example with loc.

Here's an example that replaces only the first occurrence of the duplicate value:

# Locate the first occurrence of 'Mark' and replace it with 'Max'
# (df['StudentName'].replace('Mark', 'Max') would replace every occurrence)
first_occurrence = df['StudentName'].eq('Mark').idxmax()
df.loc[first_occurrence, 'StudentName'] = 'Max'
print(df)

Output:

   StudentName  Score
0         Max     45
1         Ali     65
2         Bob     76
3        John     44
4       Johny     39
5        Mark     45

Here, only the first occurrence was replaced. What if we want to replace the last occurrence instead? In this case, we can combine the duplicated() and replace() functions: duplicated() flags the last instance of each duplicate value, loc selects those rows, and replace() substitutes the new value. Here's an example of using duplicated() and replace() together, starting again from the original DataFrame:

# Flag the last occurrence of each duplicate value (keep='first' marks every later occurrence)
last_occurrences = df.duplicated(subset='StudentName', keep='first')

# Replace the last occurrence of 'Mark' in the 'StudentName' column with 'Max'
df.loc[last_occurrences, 'StudentName'] = df.loc[last_occurrences, 'StudentName'].replace('Mark', 'Max')

print(df)

Output:

  StudentName  Score
0        Mark     45
1         Ali     65
2         Bob     76
3        John     44
4       Johny     39
5         Max     45

Custom Functions for Complex Replacements

In some cases, handling duplicate values requires more intricate replacements than simply removing or updating them. Custom functions enable us to create specific replacement rules tailored to our needs. By using the pandas apply() function, we can apply the custom function to our data.

For example, let's assume the "StudentName" column of our original DataFrame contains duplicate names. Our goal is to replace duplicates with a custom function that appends a number to the end of each repeated value, making it unique.

# Custom function to add a number at the end of the duplicate value
def add_number(name, counts):
    if name in counts:
        counts[name] += 1
        return f'{name}_{counts[name]}'
    else:
        counts[name] = 0
        return name

# Initialize an empty dictionary to store counts for each name
name_counts = {}

# Identify duplicates and apply the custom function
df['is_duplicate'] = df.duplicated('StudentName', keep=False)
df['StudentName'] = df.apply(lambda x: add_number(x['StudentName'], name_counts) if x['is_duplicate'] else x['StudentName'], axis=1)
df.drop('is_duplicate', axis=1, inplace=True)
print(df)

Output:

    StudentName  Score
0          Mark     45
1           Ali     65
2           Bob     76
3          John     44
4         Johny     39
5        Mark_1     45

Aggregate Data with Duplicate Values

Data containing duplicate values can be aggregated to summarize it and gain insights. The Pandas groupby() function allows us to aggregate data with duplicate values: we can group rows by one or more columns and calculate the mean, median, or sum of another column for each group.

Here's an example of using the groupby() method on the original DataFrame:

# Use the groupby() function to group the DataFrame by the StudentName
grouped = df.groupby(['StudentName'])

# Apply an aggregation function to handle the duplicates and get the total sum
df_aggregated = grouped.sum()
print(df_aggregated)

Output:

              Score
StudentName       
Ali             65
Bob             76
John            44
Johny           39
Mark            90
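
Beyond a single sum, groupby() can also apply several aggregations at once. Here is a short sketch using the agg() method on the same grouping (column names as in the example above):

# Aggregate duplicate student records with several statistics at once
df_summary = df.groupby('StudentName')['Score'].agg(['sum', 'mean', 'count'])
print(df_summary)

Each row of df_summary describes one student, with any duplicate entries folded into the totals.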

Advanced Techniques

To handle more complex scenarios and ensure accurate analysis, there are some advanced techniques that we can use. This section will discuss dealing with fuzzy duplicates, duplication in time series data, and duplicate index values.

Fuzzy Duplicates

Fuzzy duplicates are records that are not exact matches but are similar. They may occur for various reasons, including data entry mistakes, misspellings, and variations in formatting. We will use the fuzzywuzzy Python library (installable with pip install fuzzywuzzy) to identify duplicates using string similarity matching.

Here is an example of handling fuzzy values:

import pandas as pd
from fuzzywuzzy import fuzz

# Custom function for finding fuzzy duplicates
def find_fuzzy_duplicates(dataframe, column, threshold):
    duplicates = []
    # Iterate over each row in the DataFrame
    for i in range(len(dataframe)):
        # Compare the current row with subsequent rows
        for j in range(i + 1, len(dataframe)):
            # Calculate the similarity score using fuzz.ratio
            similarity = fuzz.ratio(dataframe[column].iloc[i], dataframe[column].iloc[j])
            # Check if the similarity score meets the threshold
            if similarity >= threshold:
                # Add the pair of similar rows to the duplicates list
                duplicates.append(dataframe.iloc[[i, j]])
    # Concatenate the list of DataFrames into a single DataFrame
    if duplicates:
        return pd.concat(duplicates)
    else:
        return pd.DataFrame()

# Create a DataFrame
data = {
      'StudentName': ['Mark', 'Ali', 'Bob', 'John', 'Johny', 'Mark'],
      'Score': [45, 65, 76, 44, 39, 45]
}
df = pd.DataFrame(data)

# Define the threshold value
threshold = 70

# Find fuzzy duplicates in the 'StudentName' column
fuzzy_duplicates = find_fuzzy_duplicates(df, 'StudentName', threshold)
print("Fuzzy duplicates:")
print(fuzzy_duplicates.to_string(index=False))

In this example, we create a custom function find_fuzzy_duplicates that takes a DataFrame, a column name, and a similarity threshold as input. The function iterates through each row in the DataFrame and compares it with subsequent rows using the fuzz.ratio method from the fuzzywuzzy library. If the similarity score is greater than or equal to the threshold, the duplicate rows are added to a list. Finally, the function returns a DataFrame containing the fuzzy duplicates.

Output:

Fuzzy duplicates:
StudentName  Score
       Mark     45
       Mark     45
       John     44
      Johny     39

In the above example, fuzzy duplicates are identified in the "StudentName" column. The find_fuzzy_duplicates function compares each pair of strings using the fuzzywuzzy library's fuzz.ratio function, which calculates a similarity score based on the Levenshtein distance. We've set the threshold at 70, meaning that any pair of names with a match ratio of 70 or higher is treated as a fuzzy duplicate. After identifying fuzzy values, we can manage them using the methods outlined in the "Handling Duplicate Values" section.
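
For instance, one common follow-up is to map near-duplicate spellings onto a canonical value and then drop the rows that become duplicates. Below is a minimal sketch; the canonical_names mapping is a hypothetical choice the analyst would make after reviewing the fuzzy matches:

# Hypothetical mapping from near-duplicate spellings to a canonical name
canonical_names = {'Johny': 'John'}

# Normalize the spellings, then remove rows that are now duplicates by name
df['StudentName'] = df['StudentName'].replace(canonical_names)
df = df.drop_duplicates(subset='StudentName', keep='first')
print(df)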

Handling Time Series Data Duplicates

Duplicates can occur when multiple observations are recorded at the same timestamp. These values can lead to biased results if not properly handled. Here are a few ways to handle duplicate values in time series data.

  • Dropping Exact Duplicates: In this method, we remove identical rows using the drop_duplicates() function in Pandas.
  • Duplicate Timestamps with Different Values: If we have the same timestamp but different values, we can aggregate the data with groupby() to gain more insight, or we can keep only the most recent value and remove the others using drop_duplicates() with the keep parameter set to 'last' (see the sketch after this list).
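
To make these options concrete, here is a minimal sketch; the 'timestamp' and 'value' column names and the sample readings are assumptions for illustration:

import pandas as pd

# Sample time series with duplicate timestamps (hypothetical data)
ts = pd.DataFrame({
    'timestamp': pd.to_datetime(['2023-01-01 00:00', '2023-01-01 00:00',
                                 '2023-01-01 01:00', '2023-01-01 01:00']),
    'value': [10, 10, 12, 15]
})

# Drop rows that are exact duplicates across all columns
ts_exact = ts.drop_duplicates()

# Same timestamp, different values: keep only the most recent reading
ts_last = ts.drop_duplicates(subset='timestamp', keep='last')

# Or aggregate the readings recorded at each timestamp
ts_mean = ts.groupby('timestamp', as_index=False)['value'].mean()
print(ts_last)
print(ts_mean)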

Handling Duplicate Index Values

Before addressing duplicate index values, let's first define what an index is in Pandas. An index is a unique identifier assigned to each row of the DataFrame. Pandas assigns a numeric index starting at zero by default, but any column or combination of columns can be set as the index. To identify and remove duplicates among index values, we can use the duplicated() and drop_duplicates() functions. In this section, we will also explore how to handle duplicates in the index using reset_index().

As its name implies, the reset_index() function in Pandas resets a DataFrame's index. By default (drop=False), reset_index() moves the current index into a regular column, so the original index values are preserved in the DataFrame. Setting the drop parameter to True discards the current index entirely instead.

Here is an example of using reset_index():

import pandas as pd

# Create a dataframe with an index
data = {
    'Score': [45, 65, 76, 44, 39, 45]
}
df = pd.DataFrame(data, index=['Mark', 'Ali', 'Bob', 'John', 'Johny', 'Mark'])

# Reset the index
df.reset_index(inplace=True)
print(df)

Output:

   index  Score
0   Mark     45
1    Ali     65
2    Bob     76
3   John     44
4  Johny     39
5   Mark     45
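
If we instead want to keep the original labels as the index and simply drop rows whose index label has already appeared, we can use the index's own duplicated() method as a boolean mask. A minimal sketch, reusing the DataFrame defined above:

# Recreate the DataFrame with duplicate index labels
df = pd.DataFrame(data, index=['Mark', 'Ali', 'Bob', 'John', 'Johny', 'Mark'])

# Keep only the first row for each index label
df_unique_index = df[~df.index.duplicated(keep='first')]
print(df_unique_index)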

Best Practices

  • Understand Duplicate Data's Nature: Before taking any action, it is crucial to comprehend why duplicate values exist and what they represent. Identify the root cause and then determine the appropriate steps to handle them.

  • Select an Appropriate Method for Handling Duplicates: As discussed in previous sections, there are multiple ways to handle duplicates. The method you choose depends on the nature of the data and the analysis you aim to perform.

  • Document the Approach: It is vital to document the process for detecting duplicate values and addressing them, allowing others to understand the thought process.

  • Exercise Caution: Whenever we remove or modify data, we must ensure that eliminating duplicates does not introduce errors or bias into the analysis. Conduct sanity tests and validate the results of each action.

  • Preserve the Original Data: Before performing any operation on the data, create a backup copy of the original data (see the sketch after this list).

  • Prevent Future Duplicates: Implement measures to prevent duplicates from occurring in the future. This can include data validation during data entry, data cleansing routines, or database constraints to enforce uniqueness.
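
As a lightweight illustration of the last two practices, here is a minimal sketch that backs up the data before cleaning and adds a simple validation check (the column name follows the earlier examples):

# Keep an untouched backup before cleaning
df_original = df.copy()

# Clean the working copy
df = df.drop_duplicates(subset='StudentName', keep='first')

# Guard against duplicates slipping back in
assert not df['StudentName'].duplicated().any(), 'Duplicate student names found'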

Final Thoughts

In data analysis, addressing duplicate values is a crucial step, since duplicates can lead to inaccurate results. By identifying and managing them efficiently, data analysts can derive accurate and meaningful insights. Implementing the techniques discussed here and following best practices will enable analysts to preserve the integrity of their data and extract valuable insights from it.
