Deleting DataFrame Rows Based on Column Value in Pandas

Introduction

Data processing is a common task in any data analysis codebase. And in Python, the Pandas library is one of the most popular tools for data analysis, which also provides high-performance, easy-to-use data structures and data analysis tools, one of which is the DataFrame. In this Byte, we're going to explore how to delete rows in a DataFrame based on the value in a specific column.

DataFrames in Pandas

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dictionary of Series objects. It is generally the most commonly used Pandas object. Let's create a simple DataFrame:

import pandas as pd

data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, 32],
    'Country': ['USA', 'USA', 'Canada', 'USA']
}

df = pd.DataFrame(data)

print(df)

Output:

    Name  Age Country
0   John   28     USA
1   Anna   24     USA
2  Peter   35  Canada
3  Linda   32     USA

Why Delete Rows Based on Column Value?

Data comes in all shapes and sizes, and not all of it is useful or relevant. Sometimes, you might want to remove rows based on a specific column value to clean your data or focus on a subset of it. For instance, you might want to remove all rows related to a particular country in a dataset of global statistics.

How to Delete Rows Based on Column Value

There are several ways to delete rows in a DataFrame based on column value, which we'll explore here.

Method 1: Using Boolean Indexing

Boolean indexing is a powerful feature in Pandas that allows us to select data based on the actual values in the DataFrame. It's a method that should be easy for most to understand, making it a great starting point for our discussion.

Let's say we have a DataFrame of student grades and we want to delete all rows where the grade is below 60. Here's how we can do it with Boolean indexing:

import pandas as pd

# Create a simple dataframe
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'grade': [85, 55, 65, 70, 50]
})

# Create a boolean mask for grades >= 60
mask = df['grade'] >= 60

# Apply the mask to the dataframe
df = df[mask]

print(df)

Output:

      name  grade
0    Alice     85
2  Charlie     65
3    David     70

As you can see, the rows for Bob and Eve, who scored below 60, have been removed.

Method 2: Using the drop() Function

Pandas' drop() function has another way to remove rows from a DataFrame. This method requires a bit more setup than Boolean indexing, but it can be more intuitive for some users.

First, we need to identify the index values of the rows we want to drop. In our case, we want to drop rows where the grade is below 60. Here's how we can do it:

Get free courses, guided projects, and more

No spam ever. Unsubscribe anytime. Read our Privacy Policy.

import pandas as pd

# Create a simple dataframe
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'grade': [85, 55, 65, 70, 50]
})

# Identify indices of rows with grade < 60
drop_indices = df[df['grade'] < 60].index

# Drop these indices from the dataframe
df = df.drop(drop_indices)

print(df)

Output:

      name  grade
0    Alice     85
2  Charlie     65
3    David     70

Again, Bob and Eve's rows have been removed, just like in our first method.

Method 3: Using the query() Function

The query() function in Pandas helps us to filter data using a query string syntax, much like SQL. This can be a popular choice, especially for those with experience in SQL.

Let's use query() to delete rows where the grade is below 60:

import pandas as pd

# Create a simple dataframe
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'grade': [85, 55, 65, 70, 50]
})

# Filter dataframe using query() function
df = df.query('grade >= 60')

print(df)

Output:

      name  grade
0    Alice     85
2  Charlie     65
3    David     70

Once again, we have successfully removed Bob and Eve's rows from our DataFrame.

Conclusion

And there you have it - three different methods for deleting rows in a Pandas DataFrame based on column values. Each method has its own strengths and use cases, and which one you choose to use may depend on your specific needs and personal preference.

Last Updated: September 4th, 2023
Was this helpful?
Project

Building Your First Convolutional Neural Network With Keras

# python# artificial intelligence# machine learning# tensorflow

Most resources start with pristine datasets, start at importing and finish at validation. There's much more to know. Why was a class predicted? Where was...

David Landup
David Landup
Details

Ā© 2013-2024 Stack Abuse. All rights reserved.

AboutDisclosurePrivacyTerms