Pandas Library for Data Visualization in Python

In my previous article, I explained how the Seaborn library can be used for advanced data visualization in Python. Seaborn is an excellent library and I always prefer to work with it. However, it is a fairly advanced library and takes some time and practice to get used to.

In this article, we will see how Pandas, another very useful Python library, can be used for data visualization in Python. Pandas is primarily used for importing and managing datasets in a variety of formats, as explained in the article Beginner's Tutorial on the Pandas Python Library. The data visualization capabilities of Pandas are less well known, and they are the focus of this article.

It is worth mentioning that, like Seaborn, the Pandas data visualization capabilities are built on the Matplotlib library. With Pandas, however, you can plot different types of visualizations directly from a dataframe, as we will see in this article.

Basic Plots

In this section, we will see how Pandas dataframes can be used to draw simple plots such as histograms, line plots, scatter plots, etc.

The Dataset

The dataset that we are going to use to plot these graphs is the famous Titanic dataset. The dataset can be downloaded from Kaggle. In this article, we will be using the train.csv file.

Before we import the dataset into our application, we need to import the required libraries. Execute the following script:

import numpy as np  
import matplotlib.pyplot as plt  
import pandas as pd  

The following script imports the dataset:

titanic_data = pd.read_csv(r"E:\Datasets\train.csv")  

Let's see what our dataset actually looks like. Run the following script:

titanic_data.head()  

The output looks like this:

You can see that the dataset contains information about the passengers of the Titanic, the ship that sank in the North Atlantic Ocean in 1912. The dataset includes information such as the name, the age, the passenger class, and whether or not the passenger survived.

Let's plot some basic graphs using this information.

Histogram

To draw a histogram for any column, select the column by name and call the hist() method on it, as shown below:

titanic_data['Age'].hist()  

You can see how easy it is to plot a histogram for the Age column using a Pandas dataframe. The output of the script above looks like this:

You can pass Matplotlib parameters to the hist() method, since Pandas uses the Matplotlib library behind the scenes. For instance, you can increase the number of bins for your histogram using the bins parameter, as follows:

titanic_data['Age'].hist(bins=20)  

In the above script, we set the number of bins for our histogram to 20. The output looks like this:

You can improve the styling of the plots by importing the Seaborn library and calling its set_style() function. For instance, let's switch to the darkgrid style. Execute the following script:

import seaborn as sns  
sns.set_style('darkgrid')  

Now again plot the histogram using the following script:

titanic_data['Age'].hist(bins=20)  

In the output, you will see a gray background with grid lines behind our plot:

There are two ways to plot graphs from a dataframe. The first is to pass the plot type to the kind parameter of the plot function, as shown below:

titanic_data['Age'].plot(kind='hist', bins=20)  

The output looks like this:

The other way is to call the method for the plot type directly on the plot function, for example plot.hist(), without passing anything to the kind parameter. We will use this second approach from here on.
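
As a quick check that the two approaches are interchangeable, here is a minimal sketch that draws the same histogram both ways. The ages Series is a small made-up sample standing in for titanic_data['Age'], and the variable names are only illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical ages standing in for titanic_data['Age']
ages = pd.Series([22, 38, 26, 35, 35, 54, 2, 27, 14, 4], name="Age")

fig, (ax_kind, ax_method) = plt.subplots(1, 2, figsize=(8, 3))

# Way 1: name the plot type through the kind parameter
ages.plot(kind="hist", bins=5, ax=ax_kind, title="plot(kind='hist')")

# Way 2: call the plot-specific method on the plot function
ages.plot.hist(bins=5, ax=ax_method, title="plot.hist()")

# Count the bars drawn on each axes
bars_kind = len(ax_kind.patches)
bars_method = len(ax_method.patches)
```

Both axes end up with the same number of bars, so the choice between the two forms is a matter of style.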

Line Plots

To draw a line plot with a Pandas dataframe, call the line() method on the plot function and pass the column names for the x-axis and the y-axis, as shown below:

titanic_data.plot.line(x='Age', y='Fare', figsize=(8,6))  

The script above draws a line plot where the x-axis contains the passengers' ages and the y-axis contains the fares paid by the passengers. Notice that the figsize parameter can be used to change the size of the plot. The output looks like this:
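
One thing to keep in mind is that a line plot connects the points in the order in which the rows appear, so unsorted x values produce a zigzag. The following is a minimal sketch, assuming a small made-up dataframe in place of the Titanic data, that sorts by the x column before plotting.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd

# Hypothetical rows standing in for the Age and Fare columns of titanic_data
passengers = pd.DataFrame({
    "Age": [38, 22, 54, 26, 35, 2],
    "Fare": [71.28, 7.25, 51.86, 7.93, 53.10, 21.08],
})

# Sort by the x column first so the line moves from left to right
ordered = passengers.sort_values("Age")
ax = ordered.plot.line(x="Age", y="Fare", figsize=(8, 6))
ax.set_ylabel("Fare")

# The x values of the drawn line, in plotting order
x_plotted = list(ax.get_lines()[0].get_xdata())
```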

Scatter Plots

To draw a scatter plot with a Pandas dataframe, call the scatter() method on the plot function and pass the column names for the x-axis and the y-axis, as shown below:

titanic_data.plot.scatter(x='Age', y='Fare', figsize=(8,6))  

The output of the script above looks like this:
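
A scatter plot can also encode a third column as color. The sketch below is one possible way to do this, using the c and colormap parameters of plot.scatter(). The dataframe is a small made-up sample, and the Survived values are assumptions for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd

# Hypothetical rows standing in for three columns of titanic_data
passengers = pd.DataFrame({
    "Age": [22, 38, 26, 35, 54, 2],
    "Fare": [7.25, 71.28, 7.93, 53.10, 51.86, 21.08],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# c names the column that drives the color of each point
ax = passengers.plot.scatter(
    x="Age", y="Fare", c="Survived", colormap="coolwarm", figsize=(8, 6)
)

# Number of points drawn on the axes
n_points = len(ax.collections[0].get_offsets())
```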

Box Plot

Pandas dataframes can also be used to draw box plots. All you have to do is call the box() method on the plot function of the dataframe:

titanic_data.plot.box(figsize=(10,8))  

In the output, you will see box plots for all the numeric columns in the Titanic dataset:
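
When the numeric columns have very different scales, the boxes for the smaller-valued columns can be hard to read. A simple workaround, sketched here on a made-up dataframe, is to select only the columns of interest before calling box().

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd

# Hypothetical rows standing in for three numeric columns of titanic_data
passengers = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5, 6],
    "Age": [22, 38, 26, 35, 54, 2],
    "Fare": [7.25, 71.28, 7.93, 53.10, 51.86, 21.08],
})

# Select the columns to compare, then draw one box per column
ax = passengers[["Age", "Fare"]].plot.box(figsize=(8, 6))

# The boxes, whiskers and medians are drawn as line artists
n_line_artists = len(ax.lines)
```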

Hexagonal Plots

Hexagonal bin plots divide the x-y plane into hexagons and shade each hexagon by the number of data points that fall inside it. The more points a hexagon contains, the darker it is. To draw a hexagonal plot with a Pandas dataframe, call the hexbin() method on the plot function and pass the column names for the x-axis and the y-axis, as shown below:

titanic_data.plot.hexbin(x='Age', y='Fare', gridsize=30, figsize=(8,6))  

In the output, you will see the hexagonal plot with age on the x-axis and fare on the y-axis.
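
The gridsize parameter controls how many hexagons are laid out along the x-axis, so a smaller value gives larger, coarser hexagons. The following sketch uses randomly generated values in place of the Titanic columns (the data is an assumption made only so that the example is self-contained).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import numpy as np
import pandas as pd

# Randomly generated stand-ins for the Age and Fare columns
rng = np.random.default_rng(0)
passengers = pd.DataFrame({
    "Age": rng.uniform(1, 80, size=200),
    "Fare": rng.uniform(5, 500, size=200),
})

# A coarse grid: 10 hexagons across the x-axis
ax = passengers.plot.hexbin(
    x="Age", y="Fare", gridsize=10, cmap="Blues", figsize=(8, 6)
)

# The hexagons are stored as a single collection on the axes
n_collections = len(ax.collections)
```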

Kernel Density Plots

As with Seaborn, we can also draw kernel density plots with the Pandas library. To draw a kernel density plot with a Pandas dataframe, call the kde() method on the plot function (this requires the SciPy library to be installed):

titanic_data['Age'].plot.kde()  

The output of the script above looks like this:

In this section, we saw how the Pandas library can be used to draw some of the most basic plots. However, the use of Pandas for data visualization is not limited to such basic plots. Pandas can also be used to visualize time series data, which we will see in the next section.

Pandas for Visualizing Time Series

Time series data is data in which the attributes or features depend on a time index, which is itself part of the dataset. Some of the most common examples of time series data include the number of items sold per hour, the daily temperature, and daily stock prices. In all these examples, the data depends on some time unit and varies with it. The time unit can be an hour, a day, a week, a year, and so on.

The Pandas library can be used to visualize time series data. It comes with built-in functions for a variety of tasks on time series data, such as time shifting and time sampling. In this section, we will see, with the help of examples, how the Pandas library is used for time series visualization. But first, we need time series data.

The Dataset

As mentioned earlier, one example of time series data is stock prices, which vary with time. In this section, we will use five years of AAPL stock prices (from 12-11-2013 to 12-11-2018) to visualize time series data. The dataset can be downloaded from this Yahoo Finance link. For other company ticker data, just go to their website and enter the company name and the time period for which you want to download data. The dataset will be downloaded in CSV format.

Let's import the libraries that we are going to use for time series data visualization in Pandas. Execute the following script:

import numpy as np  
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt  

Next, to import the dataset, we will use the read_csv() function of the Pandas library as follows:

apple_data = pd.read_csv(r'F:/AAPL.csv')  

To see how our dataset looks, we can use the head() function. Execute the following script:

apple_data.head()  

In the output, you will see the first five rows of the dataset.

You can see that the dataset contains the Date, the opening and closing price of the stock for the day, the highest and lowest price of the stock for the day, the adjusted close price, and the volume of the stock. All the columns depend on the Date: a change in the Date column corresponds to a change in all the other columns. Therefore, Date is the natural index column in this case. However, in our dataset, the date is treated as a string by default. First, we need to change the type of the Date column from string to datetime, and then we need to set the Date column as the index column.

Execute the following script to change the type of the Date column from string to datetime.

apple_data['Date'] = apple_data['Date'].apply(pd.to_datetime)  

In the script above we applied the to_datetime method to the Date column of our dataset in order to change its type.

Next, we need to set the Date column as the index column. The following script does that:

apple_data.set_index('Date', inplace=True)  

In the script above, we use the set_index method of the Pandas dataframe and pass it the 'Date' column as a parameter. The argument inplace=True means that the change is made in place, so you do not need to store the result in another variable.
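
The conversion and the indexing can also be done while reading the file. The sketch below compares the two routes on a tiny inline CSV. The two rows are made-up values in the same column layout as described above, and io.StringIO is used only so that the example runs without the actual file.

```python
import io
import pandas as pd

# Hypothetical two-row extract in the same column layout as the AAPL file
csv_text = """Date,Open,High,Low,Close,Adj Close,Volume
2013-12-10,80.5,81.1,80.2,80.8,70.0,1000
2013-12-11,80.9,81.5,80.0,80.2,69.5,1200
"""

# Route 1: convert after loading, then set the index
converted = pd.read_csv(io.StringIO(csv_text))
converted["Date"] = pd.to_datetime(converted["Date"])
converted = converted.set_index("Date")

# Route 2: parse the dates and set the index while reading
direct = pd.read_csv(
    io.StringIO(csv_text), parse_dates=["Date"], index_col="Date"
)
```

Both routes produce a dataframe indexed by datetime values, so either can be used before plotting.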

Now, let's print the first five rows of our dataset again using the head() function:

apple_data.head()  

The output of the script above looks like this:

From the output, you can see that the values in the Date column are now bold, which indicates that the Date column is being used as the index column.

Before we move on to the time shifting section, let's just plot the closing price of the Apple stock. Execute the following script:

plt.rcParams['figure.figsize'] = (8,6)   # Change the plot size  
apple_data["Close"].plot(grid=True)  

Notice that in the above script we simply call the plot method on the 'Close' column. We did not specify anything about the date, but since the Date column is the index, the x-axis contains the values from the Date column while the y-axis shows the closing stock price. The output of the script above looks like this:
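
To make the role of the index concrete, here is a minimal sketch on a five-row made-up price series (the values are assumptions, not real AAPL prices). It also passes figsize directly to plot(), which changes the size of this one figure instead of the global rcParams setting.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd

# Hypothetical closing prices indexed by date
dates = pd.date_range("2018-01-01", periods=5, freq="D")
prices = pd.DataFrame(
    {"Close": [170.0, 171.5, 169.8, 172.3, 173.1]}, index=dates
)
prices.index.name = "Date"

# No x argument is given: the datetime index supplies the x-axis
ax = prices["Close"].plot(grid=True, figsize=(8, 6))
ax.set_ylabel("Closing price")

# The y values of the drawn line
y_plotted = list(ax.get_lines()[0].get_ydata())
```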

Pandas can perform a variety of tasks on time series data, such as time shifting, time sampling, rolling and expanding windows, and time series predictions. In this article, we will see two applications of Pandas time series visualization: time shifting and time sampling.

Time Shifting

Time shifting refers to moving the data a certain number of steps forward or backward. Time series shifting is one of the most important tasks in time series analysis.

We printed the head of the dataset earlier; now we will print the tail. Later we will use the head and the tail to see the effects of time shifting.

To print the tail of the dataset, we can use the tail() function as follows:

apple_data.tail()  

In the output, you will see the last five rows of the dataset as shown below:

Let's first move the data forward to see how time-shifting works in a positive direction. To move data a specific number of time steps forward, you simply need to call the shift() method on the dataset and pass it a positive integer. For instance, the following script shifts the data two steps forward and then prints the head of the data:

apple_data.shift(2).head()  

In the output, you will see that no data is displayed for the first two rows, since the data for these rows has moved two steps forward. The data that previously belonged to the first index, i.e. 2013-12-10, now belongs to the third index, i.e. 2013-12-12, as shown below:

On the other hand, to shift the data backward, you can again use the shift() function but you have to specify a negative value. For instance, to shift the data 2 steps backward you can use the following script:

apple_data.shift(-2).tail()  

In the above script, the data is shifted 2 steps backward and then the tail of the data is displayed. In the output, you will see that the last two rows have no records since the data is moved two steps back as shown below:
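
The effect of shift() is easy to verify by hand on a small series. The following sketch uses five made-up closing prices (an assumption for illustration) and also shows a typical use of shifting: subtracting the previous row to obtain the change from one step to the next.

```python
import pandas as pd

# Hypothetical closing prices on five consecutive days
dates = pd.date_range("2013-12-10", periods=5, freq="D")
close = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0], index=dates, name="Close")

# Positive shift: values move forward, the first two rows become NaN
forward = close.shift(2)

# Negative shift: values move backward, the last two rows become NaN
backward = close.shift(-2)

# Difference between each row and the row before it
daily_change = close - close.shift(1)
```

The index itself does not move; only the values slide along it, which is why the vacated rows are filled with NaN.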

Time Sampling

Time sampling (resampling) refers to grouping the rows by a time interval of the index column and aggregating the values within each group. For instance, if you want to see the maximum opening stock price per year for all the years in the dataset, you can use time sampling.

Implementing time sampling with Pandas is pretty straightforward. You need to call the resample() method on the Pandas dataframe and pass a value for the rule parameter. The value is the time offset, which specifies the time frame by which we want to group our data.

Finally, you need to call an aggregation function such as mean, max, min, etc. The following script displays the maximum value of every attribute for each month in the dataset:

apple_data.resample(rule='M').max()  

The output of the script above looks like this:
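
To see what resampling computes, here is a minimal sketch on a made-up daily series whose values simply count up from zero, so the monthly maximum can be checked by hand. It uses the "MS" (month start) rule, which groups by calendar month just like the month-end rule but labels each group with the first day of the month.

```python
import pandas as pd

# Hypothetical daily series covering January to March 2018
dates = pd.date_range("2018-01-01", "2018-03-31", freq="D")
close = pd.Series(range(len(dates)), index=dates, name="Close")

# Group by calendar month and aggregate each group
monthly_max = close.resample("MS").max()
monthly_mean = close.resample("MS").mean()

# Each group is labeled with the first day of its month
first_label = monthly_max.index[0]
```

January has 31 days, so its values run from 0 to 30, and the maximum for each month is the value on its last day.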

The detailed list of the offset values for the rule attribute is as follows:

B       business day frequency  
C       custom business day frequency (experimental)  
D       calendar day frequency  
W       weekly frequency  
M       month end frequency  
SM      semi-month end frequency (15th and end of month)  
BM      business month end frequency  
CBM     custom business month end frequency  
MS      month start frequency  
SMS     semi-month start frequency (1st and 15th)  
BMS     business month start frequency  
CBMS    custom business month start frequency  
Q       quarter end frequency  
BQ      business quarter end frequency  
QS      quarter start frequency  
BQS     business quarter start frequency  
A       year end frequency  
BA      business year end frequency  
AS      year start frequency  
BAS     business year start frequency  
BH      business hour frequency  
H       hourly frequency  
T       minutely frequency  
S       secondly frequency  
L       milliseconds  
U       microseconds  
N       nanoseconds  

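The above list has been taken from the official Pandas documentation. Note that pandas 2.2 and later deprecate several of these aliases: for example, M, Q, and A are replaced by ME, QE, and YE, and H, T, and S by h, min, and s.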

Let's now try to print the average quarterly (every three months) values for the dataset. You can see from the offset list that Q is used for quarterly frequency. Execute the following script:

apple_data.resample(rule='Q').mean()  

The output of the script above looks like this:

In addition to finding the aggregated values for all the columns in the dataset, you can also resample the data for a particular column. Let's draw a bar plot that displays the yearly mean value of the 'Close' attribute of our dataset. Execute the following script:

plt.rcParams['figure.figsize'] = (7, 5)  
apple_data['Close'].resample('A').mean().plot(kind='bar')  

You can see that to draw the bar plot, you simply call the plot function after the aggregate function and pass it the type of plot you want. The output of the script above looks like this:
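
Bar plots label each bar with the full timestamp of its group, which can make the x-axis crowded. One possible remedy, sketched below on a made-up two-year daily series, is to replace the index with plain years before plotting. The "YS" (year start) rule is used here to group by calendar year.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import pandas as pd

# Hypothetical daily series covering 2017 and 2018
dates = pd.date_range("2017-01-01", "2018-12-31", freq="D")
close = pd.Series(range(len(dates)), index=dates, name="Close", dtype="float64")

# Group by calendar year and take the mean of each group
yearly_mean = close.resample("YS").mean()

# Replace the timestamp labels with plain years to keep the bar labels short
yearly_mean.index = yearly_mean.index.year
ax = yearly_mean.plot(kind="bar", figsize=(7, 5))
ax.set_xlabel("Year")

# One bar per year
n_bars = len(ax.patches)
```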

Similarly, to draw a line plot that displays the monthly maximum stock price value for the 'Close' attribute, you can use the following script:

plt.rcParams['figure.figsize'] = (7, 5)  
apple_data['Close'].resample('M').max().plot(kind='line')  

The output of the script above looks like this:

Conclusion

Pandas is one of the most useful Python libraries for data science. Usually, Pandas is used for importing, manipulating, and cleaning the dataset. However, Pandas can also be used for data visualization, as we showed in this article.

In this article, we saw, with the help of different examples, how Pandas can be used to draw basic plots. We also studied how Pandas functionality can be used for time series data visualization. As a rule of thumb, if you only need a simple bar, line, or count plot, you should use Pandas.

About Usman Malik
Paris (France)
Programmer | Blogger | Data Science Enthusiast | PhD To Be | Arsenal FC for Life