Counting Non-NaN Values in DataFrame Columns
Introduction
Data cleaning is an important step in any data science project. In Python, Pandas DataFrame is a commonly used data structure for data manipulation and analysis.
In this Byte, we will focus on handling non-NaN
(Not a Number) values in DataFrame columns. We will learn how to count and calculate total non-NaN
values, and also treat empty strings as NA values.
Counting Non-NaN Values in DataFrame Columns
Pandas provides the count()
function to count the non-NaN
values in DataFrame columns. Let's start by importing the pandas library and creating a simple DataFrame.
import pandas as pd
import numpy as np
data = {'Name': ['Tom', 'Nick', 'John', np.nan],
'Age': [20, 21, 19, np.nan]}
df = pd.DataFrame(data)
print(df)
Output:
Name Age
0 Tom 20.0
1 Nick 21.0
2 John 19.0
3 NaN NaN
Now, we can count the non-NaN
values in each column using the count()
method:
print(df.count())
Output:
Name 3
Age 3
dtype: int64
Calculating Total Non-NaN Values in DataFrame
If you want to get the total number of non-NaN
values in the DataFrame, you can use the count()
function combined with sum()
.
print(df.count().sum())
Output:
6
This indicates that there are a total of 6 non-NaN
values in the DataFrame.
Treating Empty Strings as NA Values
In some cases, you might want to treat empty strings as NA values. You can use the replace()
function to replace empty strings with np.nan
.
data = {'Name': ['Tom', 'Nick', '', 'John'],
'Age': [20, 21, '', 19]}
df = pd.DataFrame(data)
print(df)
Output:
Name Age
0 Tom 20
1 Nick 21
2
3 John 19
Now, replace the empty strings with np.nan
:
df.replace('', np.nan, inplace=True)
print(df)
Output:
Name Age
0 Tom 20.0
1 Nick 21.0
2 NaN NaN
3 John 19.0
Note: This operation changes the DataFrame in-place. If you want to keep the original DataFrame intact, don't use the inplace=True
argument.
Using notna() to Count Non-Missing Values
A slightly more direct way to filter and count non-NaN
values is with the notna()
method.
Let's start with a simple DataFrame:
import pandas as pd
data = {'Name': ['John', 'Anna', None, 'Mike', 'Sarah'],
'Age': [28, None, None, 32, 29],
'City': ['New York', 'Los Angeles', None, 'Chicago', 'Boston']}
df = pd.DataFrame(data)
print(df)
This will output:
Name Age City
0 John 28.0 New York
1 Anna NaN Los Angeles
2 None NaN None
3 Mike 32.0 Chicago
4 Sarah 29.0 Boston
You can see that our DataFrame has some missing values (NaN
or None
).
Now, if you want to count the non-missing values in the 'Name' column, you can use notna()
:
print(df['Name'].notna().sum())
This will output:
4
The notna()
function returns a Boolean Series where True
represents a non-missing value and False
represents a missing value. The sum()
function is then used to count the number of True
values, which represent the non-missing values.
Conclusion
In this Byte, we've learned how to count non-NaN
values in DataFrame columns. Handling missing data is an important step in data preprocessing. The notna()
function, among other functions in Pandas, provides a straightforward way to count non-missing values in DataFrame columns.