Dropping NaN Values in Pandas DataFrame

Introduction

When working with data in Python, it's not uncommon to encounter missing or null values, often represented as NaN. In this Byte, we'll see how to handle these NaN values within the context of a Pandas DataFrame, particularly focusing on how to identify and drop rows with NaN values in a specific column.

NaN Values in Python

In Python, NaN stands for "Not a Number". It is a special floating-point value defined by the IEEE 754 standard, so it always has the type float. Plain Python exposes it as float('nan') and math.nan, and NumPy exposes it as np.nan. Pandas uses it to represent missing or undefined data.

It's important to note that NaN is not equivalent to zero or any other number. In fact, NaN is not even equal to itself. For instance, if you compare NaN with NaN, the result will be False.

import numpy as np

# Comparing NaN with NaN
print(np.nan == np.nan)  # Output: False
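
Because an equality check can never match NaN, detecting it needs a dedicated function. The following is a minimal sketch of three common options; the variable names are chosen here purely for illustration.

```python
import math

import numpy as np
import pandas as pd

value = np.nan

# Equality never matches NaN, so use a dedicated check instead
is_nan_math = math.isnan(value)        # works on a single float
is_nan_numpy = bool(np.isnan(value))   # also works element-wise on numeric arrays
is_nan_pandas = bool(pd.isna(value))   # also treats None as missing

print(is_nan_math, is_nan_numpy, is_nan_pandas)  # True True True
```

When working with Pandas objects, pd.isna() is usually the most convenient choice since it handles both NaN and None.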

What is a DataFrame?

A DataFrame is a two-dimensional labeled data structure whose columns can hold different types, much like a spreadsheet, a SQL table, or a dictionary of Series objects. It's one of the primary data structures in Pandas and is widely used for data manipulation and analysis in Python. You can create a DataFrame from various inputs, such as a dict, a list of lists or dicts, a NumPy array, or Series objects.

import numpy as np
import pandas as pd

# Creating a DataFrame
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
        'Age': [28, 24, 35, np.nan]}
df = pd.DataFrame(data)

print(df)

This will output:

    Name   Age
0   John  28.0
1   Anna  24.0
2  Peter  35.0
3  Linda   NaN
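
Notice that the ages are printed as 28.0 rather than 28. A short sketch of why, rebuilding the same data under the hypothetical name people: since NaN is a float, a numeric column that contains it is stored as float64.

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, np.nan],
})

# NaN is a float, so the whole Age column is stored as float64
print(people['Age'].dtype)  # float64
```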

Why Drop NaN Values from a DataFrame?

NaN values can be a problem when doing data analysis or building machine learning models since they can lead to skewed or incorrect results. While there are methods to fill in NaN values with a specific value or an interpolated value, sometimes the simplest and most effective way to handle them is to drop the rows or columns that contain them. This is particularly true when the proportion of NaN values is small, and their absence won't significantly impact your analysis.
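
To judge whether the proportion of NaN values is small enough to drop, you can measure it first. Here is one possible way to do that, assuming the small example DataFrame from earlier (rebuilt here as sample so the snippet runs on its own):

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, np.nan],
})

# isnull() yields True/False; the mean of booleans is the fraction of True values
missing_share = sample.isnull().mean()
print(missing_share)
```

Here one of the four Age values is missing, so the share for that column is 0.25, while Name has no missing values at all.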

How to Identify NaN Values in a DataFrame

Before we start dropping NaN values, let's first see how to find them in a DataFrame. To do this, you can use the isnull() method (isna() is an equivalent alias), which returns a DataFrame of True/False values with the same shape as the original. True indicates a NaN value at that position.

# Identifying NaN values
print(df.isnull())

This will output:

    Name    Age
0  False  False
1  False  False
2  False  False
3  False   True

Note: Chaining isnull() with sum() gives the total count of NaN values in each column.

# Count of NaN values in each column
print(df.isnull().sum())

This will output:

Name    0
Age     1
dtype: int64
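
The same check can be applied to a single column to find the rows affected. A minimal sketch, using the Name/Age data from above and illustrative variable names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 24, 35, np.nan],
})

# Boolean mask that is True only where Age is missing
missing_age = df['Age'].isnull()

# Use the mask to select the affected rows
rows_missing_age = df[missing_age]
print(rows_missing_age)
```

Only the row for Linda (index 3) is selected, since it is the only one without an Age.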

Dropping Rows with NaN Values

Now that we have an understanding of the core components of this problem, let's see how we can actually remove the NaN values. Pandas provides the dropna() function to do just that.

Let's say we have a DataFrame like this:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8],
    'C': [9, 10, 11, 12]
})

print(df)

Output:

     A    B   C
0  1.0  5.0   9
1  2.0  NaN  10
2  NaN  7.0  11
3  4.0  8.0  12

To drop rows with NaN values, we can use:

df = df.dropna()
print(df)

Output:

     A    B   C
0  1.0  5.0   9
3  4.0  8.0  12

Called without arguments, dropna() removes every row that contains at least one NaN. It returns a new DataFrame rather than modifying the original, which is why we assign the result back to df. But what if we'd rather get rid of the columns that contain NaN values instead of the rows? We'll show that in the next section.
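
Before moving on, note that dropna() can also be limited to a specific column through its subset parameter, so that only NaN values in that column cause a row to be dropped. A small sketch using the same data (the name cleaned is just for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8],
    'C': [9, 10, 11, 12],
})

# Only look at column A when deciding which rows to drop
cleaned = df.dropna(subset=['A'])
print(cleaned)
```

Row 2 is removed because A is NaN there, while row 1 is kept even though its B value is missing.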

Dropping Columns with NaN Values

Similarly, you might want to drop columns with NaN values instead of rows. Again, the dropna() function can be used for this purpose, but with a different parameter. By default, dropna() drops rows. To drop columns, you need to provide axis=1.

Let's use the same DataFrame as above:

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8],
    'C': [9, 10, 11, 12]
})

To drop columns with NaN values, we can use:

df = df.dropna(axis=1)
print(df)

Output:

    C
0   9
1  10
2  11
3  12

As you can see, this drops the columns A and B since they both contained at least one NaN value.
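
Dropping every column with a single NaN can be too aggressive. As a hedged sketch, the how and thresh parameters of dropna() let you loosen the rule; the DataFrame below is a modified version of the example, with column B made mostly empty to show the difference.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [np.nan, np.nan, np.nan, 8],
    'C': [9, 10, 11, 12],
})

# Keep only columns that have at least 3 non-NaN values
mostly_complete = df.dropna(axis=1, thresh=3)
print(list(mostly_complete.columns))  # ['A', 'C']

# Drop a column only if every value in it is NaN (none are here)
not_empty = df.dropna(axis=1, how='all')
print(list(not_empty.columns))  # ['A', 'B', 'C']
```

Both parameters work the same way for rows when axis is left at its default.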

Replacing NaN Values Instead of Dropping

Sometimes, dropping NaN values might not be the best solution, especially when you don't want to lose data. In such cases, you can replace NaN values with a specific value using the fillna() function.

For instance, let's replace NaN values in our DataFrame with 0:

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8],
    'C': [9, 10, 11, 12]
})

df = df.fillna(0)
print(df)

Output:

     A    B   C
0  1.0  5.0   9
1  2.0  0.0  10
2  0.0  7.0  11
3  4.0  8.0  12

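Note: The fillna() function has historically accepted a method argument, set to 'ffill' or 'bfill', to forward fill or backward fill the NaN values in the DataFrame. That argument is deprecated in recent Pandas versions (2.1 and later), where the dedicated ffill() and bfill() methods are preferred.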
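
Forward and backward filling can be sketched with the DataFrame.ffill() and DataFrame.bfill() methods, which perform the same fills directly. The variable names below are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8],
    'C': [9, 10, 11, 12],
})

# Forward fill: each NaN takes the last valid value above it
forward = df.ffill()
print(forward)

# Backward fill: each NaN takes the next valid value below it
backward = df.bfill()
print(backward)
```

With forward fill the missing A value becomes 2.0 and the missing B value becomes 5.0. With backward fill they become 4.0 and 7.0 respectively.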

For some datasets, replacing a missing value with something like 0 is more useful than dropping the entire row, but it all depends on your use case.
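
A single replacement value doesn't have to apply to every column. As one possible approach, fillna() also accepts a dict that maps column names to replacement values; the choice of the mean for A and 0 for B below is only an example, not a recommendation.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8],
    'C': [9, 10, 11, 12],
})

# Map each column to its own replacement: the mean for A, a fixed 0 for B
filled = df.fillna({'A': df['A'].mean(), 'B': 0})
print(filled)
```

The mean of A is computed from its non-missing values only, because mean() skips NaN by default.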

Conclusion

Dealing with NaN values is a common task when working with data in Python. In this Byte, we've covered how to identify and drop rows or columns with NaN values in a DataFrame using the dropna() function. We've also seen how to replace NaN values with a specific value using the fillna() function. Remember, the choice between dropping and replacing NaN values depends on the specific requirements of your data analysis task.

Last Updated: August 24th, 2023

© 2013-2024 Stack Abuse. All rights reserved.
