Counting Non-NaN Values in DataFrame Columns

Introduction

Data cleaning is an important step in any data science project. In Python, Pandas DataFrame is a commonly used data structure for data manipulation and analysis.

In this Byte, we will focus on handling non-NaN (Not a Number) values in DataFrame columns. We will learn how to count and calculate total non-NaN values, and also treat empty strings as NA values.

Counting Non-NaN Values in DataFrame Columns

Pandas provides the count() function to count the non-NaN values in DataFrame columns. Let's start by importing the pandas library and creating a simple DataFrame.

import pandas as pd
import numpy as np

data = {'Name': ['Tom', 'Nick', 'John', np.nan],
        'Age': [20, 21, 19, np.nan]}

df = pd.DataFrame(data)

print(df)

Output:

   Name   Age
0   Tom  20.0
1  Nick  21.0
2  John  19.0
3   NaN   NaN

Now, we can count the non-NaN values in each column using the count() method:

print(df.count())

Output:

Name    3
Age     3
dtype: int64

Calculating Total Non-NaN Values in DataFrame

If you want to get the total number of non-NaN values in the DataFrame, you can use the count() function combined with sum().

print(df.count().sum())

Output:

6

This indicates that there are a total of 6 non-NaN values in the DataFrame.

Treating Empty Strings as NA Values

In some cases, you might want to treat empty strings as NA values. You can use the replace() function to replace empty strings with np.nan.

data = {'Name': ['Tom', 'Nick', '', 'John'],
        'Age': [20, 21, '', 19]}

df = pd.DataFrame(data)

print(df)

Output:

   Name Age
0   Tom  20
1  Nick  21
2        
3  John  19
Get free courses, guided projects, and more

No spam ever. Unsubscribe anytime. Read our Privacy Policy.

Now, replace the empty strings with np.nan:

df.replace('', np.nan, inplace=True)

print(df)

Output:

   Name  Age
0   Tom  20.0
1  Nick  21.0
2   NaN   NaN
3  John  19.0

Note: This operation changes the DataFrame in-place. If you want to keep the original DataFrame intact, don't use the inplace=True argument.

Using notna() to Count Non-Missing Values

A slightly more direct way to filter and count non-NaN values is with the notna() method.

Let's start with a simple DataFrame:

import pandas as pd

data = {'Name': ['John', 'Anna', None, 'Mike', 'Sarah'],
        'Age': [28, None, None, 32, 29],
        'City': ['New York', 'Los Angeles', None, 'Chicago', 'Boston']}

df = pd.DataFrame(data)

print(df)

This will output:

   Name   Age         City
0  John  28.0     New York
1  Anna   NaN  Los Angeles
2  None   NaN         None
3  Mike  32.0      Chicago
4 Sarah  29.0       Boston

You can see that our DataFrame has some missing values (NaN or None).

Now, if you want to count the non-missing values in the 'Name' column, you can use notna():

print(df['Name'].notna().sum())

This will output:

4

The notna() function returns a Boolean Series where True represents a non-missing value and False represents a missing value. The sum() function is then used to count the number of True values, which represent the non-missing values.

Conclusion

In this Byte, we've learned how to count non-NaN values in DataFrame columns. Handling missing data is an important step in data preprocessing. The notna() function, among other functions in Pandas, provides a straightforward way to count non-missing values in DataFrame columns.

Last Updated: August 15th, 2023
Was this helpful?

Ā© 2013-2024 Stack Abuse. All rights reserved.

AboutDisclosurePrivacyTerms