## Introduction

*Matplotlib* is one of the most widely used data visualization libraries in Python. From simple to complex visualizations, it's the go-to library for most.

In this tutorial, we'll take a look at how to *plot a histogram in Matplotlib*. Histograms are a great way to visualize distributions of data. In a histogram, each bar groups numbers into a range (a *bin*), and taller bars show that more data falls in that range.

A histogram displays the shape and spread of continuous sample data.

## Import Data

We'll be using the Netflix Shows dataset and visualizing the distributions from there.

Let's import Pandas and load in the dataset:

```
import pandas as pd
df = pd.read_csv('netflix_titles.csv')
```
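Before plotting, it can help to check that the column we're about to visualize is numeric and has no missing values. The sketch below does this on a tiny, made-up CSV held in memory, so it runs without the real file. The `title` column and all of the rows are assumptions for illustration only:

```python
import io

import pandas as pd

# Tiny made-up stand-in for netflix_titles.csv; only release_year matters here
csv_text = """title,release_year
Show A,2016
Show B,2019
Show C,2019
Show D,2020
"""
sample = pd.read_csv(io.StringIO(csv_text))

years = sample['release_year']

# Data type, range and number of missing values in the column
print(years.dtype, years.min(), years.max(), years.isna().sum())
```

With the real dataset, the same checks can be run on `df['release_year']` directly.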

## Plot a Histogram Plot in Matplotlib

Now, with the dataset loaded in, let's import Matplotlib's PyPlot module and visualize the distribution of the `release_year` values of the shows that are live on Netflix:

```
import matplotlib.pyplot as plt
import pandas as pd
df = pd.read_csv('netflix_titles.csv')
plt.hist(df['release_year'])
plt.show()
```

Here, we've got a minimum-setup scenario. We load the data into a DataFrame (`df`), then call PyPlot's `hist()` function to plot a histogram for the `release_year` feature. By default, this splits the range of years into equal-width bins, counts how many entries fall into each bin, and draws one bar per bin.

Running this code results in:

Here, the *bins* (ranges) each span roughly 10 years. `hist()` uses 10 equal-width bins by default, and the dataset covers roughly a century of releases, so each bar includes all shows/movies released within about a decade. For example, we can see that ~750 shows were released between 2000 and 2010, while ~5000 were released between 2010 and 2020.

These are pretty big ranges for the movie industry, so it makes more sense to visualize this with ranges smaller than 10 years.
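To make the default binning concrete, here's a minimal sketch that inspects what `hist()` returns: the counts per bin, the bin edges, and the drawn bars. The `years` array is made-up sample data standing in for `df['release_year']`, and the `Agg` backend is selected only so the snippet runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption for headless runs
import matplotlib.pyplot as plt
import numpy as np

# Made-up release years standing in for df['release_year']
years = np.array([1925, 1950, 1975, 1990, 2001, 2005,
                  2012, 2015, 2018, 2019, 2020, 2020])

fig, ax = plt.subplots()
# With no bins argument, Matplotlib falls back to 10 equal-width bins
counts, edges, patches = ax.hist(years)

# Width of each bin: (max - min) / number of bins
bin_width = edges[1] - edges[0]
print(len(counts), len(edges), bin_width)
```

There is always one more edge than there are bins, and the bin width follows from the range of the data rather than from round decades.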

## Change Histogram Bin Size in Matplotlib

Let's visualize the distribution in batches of 1 year, since this is a much more realistic time frame for movie and show releases.

We'll import `numpy`, as it'll help us calculate the bins:

```
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.read_csv('netflix_titles.csv')
data = df['release_year']
plt.hist(data, bins=np.arange(min(data), max(data) + 1, 1))
plt.show()
```

This time around, we've extracted the DataFrame column into a `data` variable, just to make it a bit easier to work with.

We've passed `data` to the `hist()` function and set the `bins` argument. It accepts a list of bin edges, which you can set manually if you'd like, especially if you want non-uniform bins.

Since we'd like to pool these entries into equal time spans (1 year), we'll create a NumPy array that starts with the lowest value (`min(data)`), ends at the highest value (`max(data)`) and goes in increments of `1`.
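One detail worth knowing about explicit edges: every bin is half-open except the last one, which includes both of its edges. When the edges stop at `max(data)`, the final bin therefore pools the last two years together. The sketch below illustrates this with NumPy's `np.histogram()`, which `hist()` uses for its binning. The `years` array is made-up sample data, and extending the edges to `max + 2` is one possible adjustment, not something the original script does:

```python
import numpy as np

# Made-up release years standing in for df['release_year']
years = np.array([2017, 2018, 2018, 2019, 2019, 2019, 2020, 2020])

# Edges stop at max(years): the final bin is [2019, 2020], closed on both ends
edges_to_max = np.arange(years.min(), years.max() + 1, 1)
counts_to_max, _ = np.histogram(years, bins=edges_to_max)

# One extra edge gives 2020 its own bin: [2020, 2021]
edges_per_year = np.arange(years.min(), years.max() + 2, 1)
counts_per_year, _ = np.histogram(years, bins=edges_per_year)

print(counts_to_max)    # 2019 and 2020 are counted together in the last bin
print(counts_per_year)  # one count per year
```

If every year should get its own bar, passing `edges_per_year` as `bins` is one way to do it.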

Running the Netflix script with 1-year bins results in:

Instead of a list of edges, you can give a single `bins` value. This will be the total number of bins in the plot. Using `1` will result in 1 bar for the entire plot.

Say we want to have 20 bins. We'd use:

```
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.read_csv('netflix_titles.csv')
data = df['release_year']
plt.hist(data, bins=20)
plt.show()
```

This results in 20 equal bins, with data within those bins pooled and visualized in their respective bars:

This results in roughly 5-year intervals, considering we've got ~100 years' worth of data. Splitting it up into 20 bins means that each will cover about 5 years.
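The arithmetic behind those intervals can be checked with `np.histogram_bin_edges()`, which computes equal-width edges from the minimum and maximum of the data. This is a small sketch with a hypothetical range of years chosen to span exactly 100 years, so the real dataset will give slightly different numbers:

```python
import numpy as np

# Hypothetical release years; the real min/max depend on the CSV
years = np.array([1920, 1955, 1987, 2003, 2014, 2019, 2020])

# 20 equal-width bins between the smallest and largest value
edges = np.histogram_bin_edges(years, bins=20)
widths = np.diff(edges)

# Each width is (max - min) / 20
print(len(edges) - 1, widths[0])
```

The bin width is always the data range divided by the number of bins, so it only lands on a whole number of years when the range happens to divide evenly.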

## Plot Histogram with Density

Sometimes, instead of the count, we'd want to check the density of each bar/bin, that is, how common it is to see values in that range within the dataset. Since we're working with 1-year intervals, this gives the fraction of movies/shows released in each year, which is the probability that a randomly picked title was released in that year.

To do this, we can simply set the `density` argument to `True`, which normalizes the bar heights so that the total area of the bars equals 1:

```
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.read_csv('netflix_titles.csv')
data = df['release_year']
bins = np.arange(min(data), max(data) + 1, 1)
plt.hist(data, bins=bins, density=True)
plt.ylabel('Density')
plt.xlabel('Year')
plt.show()
```

Now, instead of the count we've seen before, we'll be presented with the density of entries:

We can see that ~18% of the entries were released in 2018, followed by ~14% in 2019.
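Reading bar heights as percentages only works because the bins are 1 year wide. Here's a minimal sketch of the normalization, again using `np.histogram()` on made-up years rather than the Netflix data:

```python
import numpy as np

# Made-up release years standing in for df['release_year']
years = np.array([2017, 2018, 2018, 2019, 2019,
                  2019, 2020, 2020, 2020, 2020])

# One bin per year (the extra edge gives the last year its own bin)
edges = np.arange(years.min(), years.max() + 2, 1)
counts, _ = np.histogram(years, bins=edges)
density, _ = np.histogram(years, bins=edges, density=True)

# Density is count / (total * bin width), so the bar areas sum to 1
area = np.sum(density * np.diff(edges))
fractions = counts / counts.sum()

# With 2-year bins, heights are fractions divided by the bin width of 2
edges_2yr = np.array([2017, 2019, 2021])
density_2yr, _ = np.histogram(years, bins=edges_2yr, density=True)

print(area, density, fractions, density_2yr)
```

With 1-year bins, `density` and `fractions` match. With wider bins, the heights shrink by the bin width and no longer read directly as percentages.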

## Customizing Histogram Plots in Matplotlib

Other than these settings, there are plenty of other arguments you can set to customize the way your plot looks. Let's change a few of the common options people like to fiddle around with to change plots to their tastes:

```
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df = pd.read_csv('netflix_titles.csv')
data = df['release_year']
bins = np.arange(min(data), max(data) + 1, 1)
plt.hist(data, bins=bins, density=True, histtype='step', alpha=0.5, align='right', orientation='horizontal', log=True)
plt.show()
```

Here, we've set various arguments:

- `bins`: The number of bins in the plot, or a list of bin edges
- `density`: Whether PyPlot uses count or density to populate the plot
- `histtype`: The type of histogram plot (default is `bar`, though other values such as `step` or `stepfilled` are available)
- `alpha`: The alpha/transparency of the bars or lines
- `align`: To which side of the bins the bars are aligned (default is `mid`)
- `orientation`: Horizontal/vertical orientation (default is `vertical`)
- `log`: Whether the plot should be put on a logarithmic scale or not

This now results in:

Since we've set `align` to `right`, each bar is centered on the right edge of its bin, so we can see that the outline is offset a bit, just past the *2020* bin.
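The effect of `align` is easier to see by comparing where the bars end up. The following sketch draws the same made-up data twice with the default `bar` type, once with `align='mid'` and once with `align='right'`, and reads the bar centers back from the returned patches. The data and the `Agg` backend are assumptions for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption for headless runs
import matplotlib.pyplot as plt
import numpy as np

# Made-up release years standing in for df['release_year']
years = np.array([2017, 2018, 2018, 2019, 2019, 2019, 2020, 2020])
edges = np.arange(years.min(), years.max() + 2, 1)

fig, (ax_mid, ax_right) = plt.subplots(1, 2)
_, _, bars_mid = ax_mid.hist(years, bins=edges, align='mid')
_, _, bars_right = ax_right.hist(years, bins=edges, align='right')

# Horizontal center of every drawn bar
centers_mid = np.array([b.get_x() + b.get_width() / 2 for b in bars_mid])
centers_right = np.array([b.get_x() + b.get_width() / 2 for b in bars_right])

print(centers_mid)    # centered between the bin edges
print(centers_right)  # centered on the right bin edges
```

The counts are identical in both plots. Only the horizontal placement of the bars changes, which is why a right-aligned histogram appears shifted by half a bin.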

## Conclusion

In this tutorial, we've gone over several ways to plot a histogram using Matplotlib and Python.
