Introduction
Pandas is a Python library for data analysis and manipulation. Almost all operations in Pandas revolve around `DataFrame`s, an abstract data structure tailor-made for handling a metric ton of data.

In all of that data, some values are bound to be missing for various reasons, resulting in missing (`null`/`None`/`NaN`) values in our `DataFrame`.

That's why, in this article, we'll be discussing how to handle missing data in a Pandas `DataFrame`.
Data Inspection
Real-world datasets are rarely perfect. They may contain missing values, wrong data types, unreadable characters, erroneous lines, etc.
The first step in any proper data analysis is cleaning and organizing the data we'll later be using. Below, we'll discuss a few common data-related problems that might occur in a dataset.
We will be working with a small employee dataset for this. The `.csv` file looks like this:

```
First Name,Gender,Salary,Bonus %,Senior Management,Team
Douglas,Male,97308,6.945,TRUE,Marketing
Thomas,Male,61933,NaN,TRUE
Jerry,Male,NA,9.34,TRUE,Finance
Dennis,n.a.,115163,10.125,FALSE,Legal
,Female,0,11.598,,Finance
Angela,,,18.523,TRUE,Engineering
Shawn,Male,111737,6.414,FALSE,na
Rachel,Female,142032,12.599,FALSE,Business Development
Linda,Female,57427,9.557,TRUE,Client Services
Stephanie,Female,36844,5.574,TRUE,Business Development
,,,,,
```
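If you'd like to follow along without creating the file by hand, one option (not part of the original article) is to write the sample data to `out.csv` directly from Python:

```python
# Write the sample employee data to out.csv so the examples below can be run as-is
csv_data = """First Name,Gender,Salary,Bonus %,Senior Management,Team
Douglas,Male,97308,6.945,TRUE,Marketing
Thomas,Male,61933,NaN,TRUE
Jerry,Male,NA,9.34,TRUE,Finance
Dennis,n.a.,115163,10.125,FALSE,Legal
,Female,0,11.598,,Finance
Angela,,,18.523,TRUE,Engineering
Shawn,Male,111737,6.414,FALSE,na
Rachel,Female,142032,12.599,FALSE,Business Development
Linda,Female,57427,9.557,TRUE,Client Services
Stephanie,Female,36844,5.574,TRUE,Business Development
,,,,,
"""

with open('out.csv', 'w') as f:
    f.write(csv_data)
```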
Let's import it into a `DataFrame`:

```python
import pandas as pd

df = pd.read_csv('out.csv')
df
```
This results in:
Taking a closer look at the dataset, we note that Pandas automatically assigns `NaN` if the value for a particular cell is an empty string `''`, `NA` or `NaN`. However, there are cases where missing values are represented by a custom value, for example, the string `'na'` or `0` for a numeric column.

For example, the row at index 6 has a value of `na` for the `Team` column, while the row at index 4 has a value of `0` for the `Salary` column.
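Before deciding how to handle the gaps, it helps to quantify them. A common way to do that is `isnull()`, which marks missing cells, chained with `sum()` to count them per column. Here's a minimal sketch on a cut-down version of the dataset (the exact rows are just for illustration):

```python
import io

import pandas as pd

# A minimal sample mirroring the dataset above
csv_data = "First Name,Gender,Salary\nDouglas,Male,97308\nThomas,Male,NA\n,Female,0\n"
df = pd.read_csv(io.StringIO(csv_data))

# isnull() yields a True/False mask of missing cells; sum() counts them per column
print(df.isnull().sum())
```

Note that the `0` in `Salary` is not counted as missing - by default, only values like `''`, `NA` and `NaN` are.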
Customizing Missing Data Values
In our dataset, we want to consider these as missing values:

- A `0` value in the `Salary` column
- An `na` value in the `Team` column
This can be achieved by using the `na_values` argument to set custom missing values. This argument accepts a dictionary where each key is a column name and the value is a list of data values that are to be considered missing:
```python
# This means that in the Salary column, 0 is also considered a missing value,
# and 'na' is considered a missing value in the Team column
df = pd.read_csv('out.csv', na_values={"Salary": [0], "Team": ['na']})
df
```
This results in:
On the other hand, if we want a list of values to be treated as missing in all columns, we can pass those candidates to the `na_values` parameter as a list:
```python
missing_values = ["n.a.", "NA", "n/a", "na", 0]
df = pd.read_csv('out.csv', na_values=missing_values)
df
```
This results in:
Note that here, the value of `Gender` at index 3 is also `NaN` now, since we defined `n.a.` as a missing value above.

Which implementation to choose depends on the nature of the dataset.
For example, some of the numeric columns in the dataset might need to treat `0` as a missing value while other columns may not. Therefore, you can use the first approach, where you customize missing values per column.

Likewise, if we want to treat `0` as a missing value globally, we can utilize the second method and just pass a list of such values to the `na_values` argument.
Once we have identified all the missing values in the `DataFrame` and annotated them correctly, there are several ways we can handle missing data.
Removing Rows With Missing Values
One approach would be removing all the rows which contain missing values. This can easily be done with the `dropna()` method, which is specifically dedicated to this:

```python
# Drops all rows with NaN values
df.dropna(axis=0, inplace=True)
```
This results in:
`inplace=True` makes all the changes in the existing `DataFrame` without returning a new one. Without it, you'd have to re-assign the `DataFrame` to itself.

The `axis` argument specifies whether you're working with rows or columns - `0` being rows, and `1` being columns.
You can control whether you want to remove the rows containing at least one `NaN` value, or only the rows where all values are `NaN`, by setting the `how` parameter of the `dropna()` method:

- `'any'`: if any NA values are present, drop that row or column
- `'all'`: if all values are NA, drop that row or column

```python
df.dropna(axis=0, inplace=True, how='all')
```
This would only remove the last row from the dataset, since `how='all'` only drops a row if all of its values are missing.

Similarly, to drop columns containing missing values, just set `axis=1` in the `dropna()` method.
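To see how `axis` and `how` interact, here's a quick sketch on a toy `DataFrame` (not the employee data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, np.nan, 3],
    'B': [4, np.nan, np.nan],
})

# how='any' (the default): drop rows with at least one NaN - only row 0 survives
print(df.dropna(axis=0))

# how='all': drop rows only when every value is NaN - only row 1 is dropped
print(df.dropna(axis=0, how='all'))

# axis=1: apply the same logic to columns - both columns contain a NaN, so both go
print(df.dropna(axis=1))
```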
Filling out Missing Values
It might not be the best approach to remove the rows containing missing values if such rows are abundant. They might contain valuable data in other columns and we don't want to skew the data towards an inaccurate state.
In this case, we have several options to assign appropriate values. The most common ones are listed below:
- Fill NA with Mean, Median or Mode of the data
- Fill NA with a constant value
- Forward Fill or Backward Fill NA
- Interpolate Data and Fill NA
Let's go through these one by one.
Fill Missing DataFrame Values with Column Mean, Median and Mode
Let's start out with the `fillna()` method. It fills the NA-marked values with a value you supply it with.

For example, you can use the `.median()`, `.mode()` and `.mean()` functions on a column, and supply the result as the fill value:
```python
# Using the median
df['Salary'].fillna(df['Salary'].median(), inplace=True)

# Using the mean
df['Salary'].fillna(int(df['Salary'].mean()), inplace=True)

# Using the mode - mode() returns a Series, so take its first element
df['Salary'].fillna(int(df['Salary'].mode()[0]), inplace=True)
```
Now, if a salary is missing from a person's row, the mean, mode or median is used to fill that value. This way, you're not dropping these people from the dataset, and you're also not skewing the salary values.

While not perfect, this method allows you to introduce values that don't distort the overall dataset - no matter how many averages you add, the average stays the same.
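A minimal illustration of that property, on a made-up salary column:

```python
import numpy as np
import pandas as pd

salaries = pd.Series([50000, 60000, np.nan, 70000])

# mean() ignores NaN values: (50000 + 60000 + 70000) / 3 = 60000.0
mean_before = salaries.mean()

filled = salaries.fillna(mean_before)

# The mean after filling is unchanged: 60000.0
print(filled.mean())
```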
Fill Missing DataFrame Values with a Constant
You could also decide to fill the NA-marked values with a constant value. For example, you can put in a special string or numerical value:
```python
df['Salary'].fillna(0, inplace=True)
```
At the very least, these values are now actual values, instead of `na` or `NaN`.
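The same idea works for non-numeric columns - for instance, filling missing `Team` values with a placeholder string (the label `'Unknown'` here is just an illustrative choice):

```python
import numpy as np
import pandas as pd

teams = pd.Series(['Finance', np.nan, 'Legal', np.nan])

# Replace every NaN with a constant placeholder label
teams = teams.fillna('Unknown')
print(teams.tolist())  # ['Finance', 'Unknown', 'Legal', 'Unknown']
```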
Forward Fill Missing DataFrame Values
This method fills each missing value with the first non-missing value that occurs before it:

```python
df['Salary'].fillna(method='ffill', inplace=True)
```
Backward Fill Missing DataFrame Values
This method fills each missing value with the first non-missing value that occurs after it:

```python
df['Salary'].fillna(method='bfill', inplace=True)
```
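To see the difference between the two directions on a small example (using the `ffill()`/`bfill()` shorthands, which newer Pandas versions prefer over `fillna(method=...)`):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Forward fill: propagate the last known value forward
print(s.ffill().tolist())  # [1.0, 1.0, 1.0, 4.0]

# Backward fill: pull the next known value backward
print(s.bfill().tolist())  # [1.0, 4.0, 4.0, 4.0]
```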
Fill Missing DataFrame Values with Interpolation
Finally, this method uses mathematical interpolation to determine what value would have been in the place of a missing value:

```python
# Polynomial interpolation requires an explicit order, and SciPy installed
df['Salary'].interpolate(method='polynomial', order=2)
```
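For a quick sense of what interpolation does, the default linear method fills a gap with values evenly spaced between its neighbors:

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, np.nan, 40.0])

# Linear interpolation fills the gap with evenly spaced values
print(s.interpolate().tolist())  # [10.0, 20.0, 30.0, 40.0]
```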
Conclusion
Data cleaning and preprocessing is a very important part of every data analysis and data science project.

In this article, we went over several techniques to handle missing data, including customizing the values that count as missing and imputing them using different methods: the mean, median, mode, a constant value, forward fill, backward fill and interpolation.