Deleting DataFrame Rows Based on Column Value in Pandas
Introduction
Data processing is a common task in any data analysis codebase. And in Python, the Pandas library is one of the most popular tools for data analysis, which also provides high-performance, easy-to-use data structures and data analysis tools, one of which is the DataFrame. In this Byte, we're going to explore how to delete rows in a DataFrame based on the value in a specific column.
DataFrames in Pandas
A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dictionary of Series objects. It is generally the most commonly used Pandas object. Let's create a simple DataFrame:
import pandas as pd
data = {
'Name': ['John', 'Anna', 'Peter', 'Linda'],
'Age': [28, 24, 35, 32],
'Country': ['USA', 'USA', 'Canada', 'USA']
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age Country
0 John 28 USA
1 Anna 24 USA
2 Peter 35 Canada
3 Linda 32 USA
Why Delete Rows Based on Column Value?
Data comes in all shapes and sizes, and not all of it is useful or relevant. Sometimes, you might want to remove rows based on a specific column value to clean your data or focus on a subset of it. For instance, you might want to remove all rows related to a particular country in a dataset of global statistics.
How to Delete Rows Based on Column Value
There are several ways to delete rows in a DataFrame based on column value, which we'll explore here.
Method 1: Using Boolean Indexing
Boolean indexing is a powerful feature in Pandas that allows us to select data based on the actual values in the DataFrame. It's a method that should be easy for most to understand, making it a great starting point for our discussion.
Let's say we have a DataFrame of student grades and we want to delete all rows where the grade is below 60. Here's how we can do it with Boolean indexing:
import pandas as pd
# Create a simple dataframe
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'grade': [85, 55, 65, 70, 50]
})
# Create a boolean mask for grades >= 60
mask = df['grade'] >= 60
# Apply the mask to the dataframe
df = df[mask]
print(df)
Output:
name grade
0 Alice 85
2 Charlie 65
3 David 70
As you can see, the rows for Bob and Eve, who scored below 60, have been removed.
Method 2: Using the drop() Function
Pandas' drop()
function has another way to remove rows from a DataFrame. This method requires a bit more setup than Boolean indexing, but it can be more intuitive for some users.
First, we need to identify the index values of the rows we want to drop. In our case, we want to drop rows where the grade is below 60. Here's how we can do it:
import pandas as pd
# Create a simple dataframe
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'grade': [85, 55, 65, 70, 50]
})
# Identify indices of rows with grade < 60
drop_indices = df[df['grade'] < 60].index
# Drop these indices from the dataframe
df = df.drop(drop_indices)
print(df)
Output:
name grade
0 Alice 85
2 Charlie 65
3 David 70
Again, Bob and Eve's rows have been removed, just like in our first method.
Method 3: Using the query() Function
The query()
function in Pandas helps us to filter data using a query string syntax, much like SQL. This can be a popular choice, especially for those with experience in SQL.
Let's use query()
to delete rows where the grade is below 60:
import pandas as pd
# Create a simple dataframe
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'grade': [85, 55, 65, 70, 50]
})
# Filter dataframe using query() function
df = df.query('grade >= 60')
print(df)
Output:
name grade
0 Alice 85
2 Charlie 65
3 David 70
Once again, we have successfully removed Bob and Eve's rows from our DataFrame.
Conclusion
And there you have it - three different methods for deleting rows in a Pandas DataFrame based on column values. Each method has its own strengths and use cases, and which one you choose to use may depend on your specific needs and personal preference.