Data can be represented in many different ways. We quantify observable phenomena to generate data which can then be represented through mathematical formulas, music, text, visualizations, etc.
Python has become one of the preferred languages in the world of Data Science over the years, given its simplicity and ease of use, which lowered the barrier to entry from other professions and opened up collaboration options between many research groups. Even without deep knowledge of Python, many scientists can now visualize and share their work with their colleagues and peers, and the new wave of users necessitated a new wave of tools. The easier it is to get started - the larger the impact on the community.
Many libraries have been created to make working with data easier, and as of writing this book, there's no shortage of powerful libraries that allow even the unpracticed folk to step in and try their hand at extracting knowledge from data. Data Visualization is one of the techniques used in all aspects of research and science. It's an interdisciplinary field that represents data through various graphical elements, such as lines and markers. Though - Data Visualization is much more than graphs and charts you've learned in elementary school. You can plot and explore relationships between data, their distributions, summaries, and put things into perspective. Data Visualization has become much more - it's become a storytelling substrate. Each plot you make can tell a story, and you're the artist shaping it.
An artist can also choose to put things out of context - and it's easy to lie with data. As a matter of fact, many people do - however, the statement is not as cynical as it may sound. Many people lie to themselves, not others. Without proper knowledge of how to approach problems, you can &quot;hallucinate&quot; relationships where they aren't present, and infer causation from correlation, which more often than not doesn't hold water.
Data Visualization has a long list of applications: it's a technique used within Data Science, Data Analysis and Descriptive Statistics, and it sits at the core of Exploratory Data Analysis, which is at the heart of practically all research. Whether you're a biologist, molecular physicist, machine learning engineer, software engineer, psychologist or philosopher - your findings are backed by data, and data in the form of numbers is hard for humans to interpret. We use the interpretation of our own results as input for our Bayesian best guess of the world and reality around us, so it's crucial that we produce clear, concise, impactful and interpretable results, lest we end up fooling ourselves.
Welcome to Data Visualization in Python with Matplotlib and Pandas. This course is designed to take absolute beginners to Pandas and Matplotlib, with basic Python knowledge, and allow them to build a strong foundation for advanced work with these libraries - from simple plots to 3D plots and interactive buttons.
We'll start out with the installation and setup of the environment you need to start working on Data Visualization projects, followed by a crash-course on Pandas - from the very foundations and upwards, covering everything you'll need to know to follow this course, and to start working on your own visualization projects.
Then, we'll dive into the basics of Matplotlib, the anatomy of plots, the APIs it offers, and basic plot customization, including common tasks like changing font size, setting axis ranges, changing tick frequency, adding legends and plotting basic plots. We'll also cover some fundamental classes like the <code>Text</code> class and how annotations work.
Once we've gotten the hang of Matplotlib's APIs, plot anatomy and how to customize them - we'll jump into the meat of the course - Data Visualization with Matplotlib. This is where we'll cover the quintessential plots that should be in the arsenal of any Data Scientist, like Scatter Plots, Box Plots, and Histograms. This is also the lesson in which we'll dive into custom plot types, such as Joint Plots and Ridge Plots, that aren't built into Matplotlib, followed by 3D plots and an exploration section, where we'll explore an EEG dataset.
Note: For each plot type, we'll aim to use a different dataset, which necessitates different pre-processing. In some cases, the data will already be a good fit for the plot type we're using - since we'll be choosing the right plot type for the job. In other cases, though, we'll have to perform pre-processing with Pandas. All of the datasets are publicly available and downloadable for your convenience, and the links to each will be provided in the footnotes. In many cases, we'll be able to pose hypotheses on the correlations between certain features, as well as test these hypotheses through Exploratory Data Analysis.
At this point, we'll jump into more advanced customization, overcoming potential issues with more advanced and custom plot types. Before wrapping up with interactive buttons, we'll take a brief look at Pandas' plotting abilities as well.
Finally, Matplotlib isn't only for static plots. While GUIs are typically created with GUI libraries and frameworks such as <a href="https://riverbankcomputing.com/software/pyqt/">PyQt</a>, <a href="https://docs.python.org/3/library/tkinter.html">Tkinter</a>, <a href="https://kivy.org/#home">Kivy</a> and <a href="https://www.wxpython.org/">wxPython</a>, and while Python does have excellent integration with PyQt, Tkinter and wxPython - there's no need to use any of these for some basic GUI functionality, which is available through Matplotlib Widgets.
The course consists of the following lessons:
<ul>
<li>Lesson 1. - Introduction</li>
<li>Lesson 2. - Installation and Setup</li>
<li>Lesson 3. - Getting Started with Pandas</li>
<li>Lesson 4. - Getting Started with Matplotlib</li>
<li>Lesson 5. - Basic Matplotlib Customization</li>
<li>Lesson 6. - Data Visualization with Matplotlib</li>
<li>Lesson 7. - Advanced Matplotlib Customization</li>
<li>Lesson 8. - Data Visualization with Pandas</li>
<li>Lesson 9. - Matplotlib Widgets</li>
</ul>

David Landup

Introduction

We'll be working with several tools throughout the course. If you've dabbled in Python, you've almost certainly used some of these before, if not most, since these are fairly popular libraries present in a large number of projects. Namely, we'll be using:
<ul>
<li><a href="https://www.python.org/">Python</a></li>
<li><a href="https://matplotlib.org/">Matplotlib</a></li>
<li><a href="https://pandas.pydata.org/">Pandas</a></li>
<li><a href="https://numpy.org/">NumPy</a></li>
</ul>
We'll rely on NumPy sparingly, so if you haven't worked with it before - there's no need to worry. Even though Matplotlib uses NumPy arrays under the hood, you'll be able to follow the course without prior experience, given the intuitive API and how rarely we'll need to use NumPy manually.
Assuming no prior knowledge of Pandas, we'll first be jumping into Lesson 3 - Getting Started with Pandas. It's extensively used with Matplotlib, and we'll be using it throughout the course to pre-process and wrangle data into the formats most fit for our visualization needs. We'll go from the foundations and building blocks of Pandas to the common tasks and operations you'll be performing, as well as data reshaping, giving you a solid introduction to the library and how we'll be using it in the course.
In the case you don't already have these tools installed on your local machine, let's quickly set them up.

Installation and Setup

Pandas is an open-source Python package that provides numerous tools for data analysis. The package comes with several data structures that can be used for many different data manipulation tasks. It also has a variety of methods that can be invoked for data analysis, which come in handy when working on Data Science and Machine Learning problems.
It can present data in a way that is very intuitive and suitable for data analysis, via its <code>Series</code> and <code>DataFrame</code> data structures. The <code>DataFrame</code> is the key data structure in the library, and you'll spend a lot of time working with it.
Additionally, Pandas handles different types of I/O operations seamlessly. It can read data from a variety of formats, such as CSV, XLSX, JSON, etc.
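To make the I/O point a bit more concrete, here's a minimal sketch of reading CSV data into a <code>DataFrame</code> and exporting it as JSON. The city data is made up for illustration, and an in-memory <code>io.StringIO</code> buffer stands in for a real file so the snippet runs on its own:

```python
import io

import pandas as pd

# Hypothetical CSV content; io.StringIO stands in for a file on disk
csv_data = io.StringIO("city,population\nOslo,700000\nBergen,285000\n")

# Parse the CSV text into a DataFrame (one column per CSV field)
df = pd.read_csv(csv_data)
print(df)

# Export the same data as JSON, one object per row
json_text = df.to_json(orient="records")
print(json_text)
```

With a real file, you'd pass its path to <code>pd.read_csv()</code> instead of the buffer.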
<h3 id="pandasdatastructures">Pandas Data Structures</h3>
Pandas has two main data structures for data storage:
<ol>
<li>Series</li>
<li>DataFrame</li>
</ol>
Let's go over those two first.
<h2 id="series">Series</h2>
A <code>Series</code> is similar to a one-dimensional array. It can store data of any type. The values of a Pandas <code>Series</code> are mutable, but its size is not - the length of a <code>Series</code> cannot be changed in place.
By default, the first element in the series is assigned the index of <code>0</code>, while the last element is at index <code>N-1</code>, where <code>N</code> is the total number of elements in the series.
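As a quick sketch of the above, assuming a small made-up list of temperature readings, we can create a <code>Series</code>, read its first and last elements through the default index, and change one of its values:

```python
import pandas as pd

# Build a Series from a plain list; Pandas assigns the default 0..N-1 index
temperatures = pd.Series([21.5, 23.0, 19.8, 22.1])

first = temperatures[0]                     # element at index 0
last = temperatures[len(temperatures) - 1]  # element at index N-1

# Values are mutable: assign a new value to an existing index
temperatures[1] = 24.0

print(first, last)
print(temperatures)
```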

Getting Started with Pandas

Now that we've covered everything you need to know about Pandas, its data structures, how to create, manipulate and export them from and to various data types, as well as gotten a good view of Pandas' own data visualization capabilities, it's finally time to jump into the famed Matplotlib library.
<h2 id="whatismatplotlib">What is Matplotlib?</h2>
Matplotlib is the de-facto most popular visualization engine. Note the usage of &quot;visualization engine&quot; here.
Matplotlib isn't just a standalone library - it carries much more on its shoulders. Other libraries, such as Pandas and Seaborn, rely on Matplotlib to perform the actual visualizations. Seaborn can construct and create beautiful plots, but ultimately relies on Matplotlib to actually visualize them.
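One way to see this reliance for yourself is to check what Pandas hands back after plotting. The following is a small sketch with made-up data, assuming Pandas' default plotting backend hasn't been changed:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import pandas as pd
from matplotlib.axes import Axes

# Made-up data for illustration
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [1, 4, 9, 16]})

# With the default backend, DataFrame.plot() delegates the drawing to Matplotlib
ax = df.plot(x="x", y="y")

# The returned object is a regular Matplotlib Axes, so Matplotlib calls work on it
is_matplotlib_axes = isinstance(ax, Axes)
ax.set_title("Requested through Pandas, drawn by Matplotlib")
print(is_matplotlib_axes)
```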
GeoPandas is another library, specialized for creating, manipulating and visualizing geospatial data, based on Pandas, and thus, Matplotlib.
Matplotlib was originally released back in 2003, and has seen worldwide adoption since, with regular updates to this day. Year over year, it's cemented itself as one of the core libraries for visualization, and isn't likely to be dethroned soon, given how deeply ingrained it is in other extremely popular libraries, alongside its own popularity.
During this time, the team behind Matplotlib, including the community, has expanded it to include a plethora of visualization tools and plot types - from simple static 2D plots to more advanced, animated 3D plots, widgets and event-handling. It even offers support for integration with the popular PyQt and Tkinter frameworks, used to create GUI applications in Python, allowing developers to integrate powerful visualizations in their applications.

Getting Started with Matplotlib

A good portion of Matplotlib's popularity comes from its customizability. Naturally, everyone working with Matplotlib will benefit significantly from being aware of these options, even if they're not really being used all the time.
In this lesson, we'll use the knowledge we've gained so far, alternating between the MATLAB/PyPlot-style approach to plotting and the Object-Oriented-style, and customize plots in a panoply of ways.
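As a brief reminder of the two approaches, here's a minimal sketch that draws the same made-up data once in each style:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.pyplot as plt

x = [0, 1, 2, 3]
y = [0, 1, 4, 9]

# PyPlot style: functions act on the implicit "current" figure and axes
plt.figure()
plt.plot(x, y)
plt.title("PyPlot style")
pyplot_title = plt.gca().get_title()

# Object-Oriented style: keep explicit Figure and Axes references
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_title("Object-Oriented style")
oo_title = ax.get_title()
```

Both produce equivalent plots. The Object-Oriented style tends to be easier to follow once a <code>Figure</code> holds more than one <code>Axes</code>.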
We'll explore common operations, such as changing the figure and font size, rotating text to make it fit better, saving images, setting axis ranges, adding multiple different-sized subplots, changing tick frequency, changing plot backgrounds, changing scales, but also dive into understanding Matplotlib Text.
There's an abundance of options to tweak and change with Matplotlib. In this lesson, we'll focus on the ones you'll frequently be using. In Lesson 7. - Advanced Matplotlib Customization, we'll focus on some of the less frequently used options, which are very useful and important nonetheless, such as understanding Matplotlib Stylesheets, Matplotlib Colors and Colormaps, and the GridSpec.
Since we've had an issue of fitting the four <code>Axes</code> objects in the last lesson, let's start off with changing the figure size.
<h2 id="changingthefiguresize">Changing the Figure Size</h2>
Let's re-create the plot from the previous lesson, that couldn't fit very well into our <code>Figure</code>:
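The plot from the previous lesson isn't reproduced in this excerpt, so the following is only a stand-in sketch: four <code>Axes</code> filled with placeholder data, created on a larger <code>Figure</code> through the <code>figsize</code> argument, and then resized again afterwards:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.pyplot as plt

# Placeholder data; the course's actual data is not shown in this excerpt
x = [0, 1, 2, 3, 4]

# figsize is (width, height) in inches; Matplotlib's default is 6.4 x 4.8
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 8))
for i, ax in enumerate(axes.flat):
    ax.plot(x, [value * (i + 1) for value in x])
    ax.set_title(f"Axes {i + 1}")

# The size of an existing Figure can also be changed after creation
fig.set_size_inches(10, 6)
width, height = fig.get_size_inches()
print(width, height)
```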

Basic Matplotlib Customization

In Lesson 4. - Getting Started with Matplotlib we got acquainted with the anatomy of Matplotlib plots, how to utilize the generic <code>plot()</code> function with simple data and got familiar with the APIs we can use to work with Matplotlib.
In Lesson 5. - Basic Matplotlib Customization, we explored some customization operations which are commonly used, such as changing the figure and font size, setting axis ranges, changing tick frequency, adding legends, and plotting vertical lines. We've also jumped into Matplotlib text, and how it can be used, including how to add and style annotations to point out certain parts of your plots.
These operations are a primer on what we'll be generally using, but aren't the only customization options you'll ever be using. While we'll explore those in the next lesson - these should be more than enough to get you through most of your plotting needs.
In this lesson, armed with knowledge of how Matplotlib works and how we can tweak plots - let's jump into Data Visualization with Matplotlib.
We'll be covering some of the most commonly used plot types, such as Scatter Plots, Bar Plots and Box Plots, but we'll also be utilizing some more rarely used plot types, such as Ridge Plots, and telling the story of how they were conceived. Additionally, we'll be creating a custom plot, the Joint Plot, which isn't built into the Matplotlib library, but is a plot type popularized by another Data Visualization library - Seaborn.

Data Visualization with Matplotlib

In Lesson 5, we took a look at some of the basic customization we'll be doing fairly commonly when doing Data Visualization. These included the basic operations of changing figure sizes, tick frequency and scales. These features have gotten us fairly far, though even in the last lesson they weren't enough for some of the plots we were producing.
When recreating the Joy Division album cover, using a Ridge Plot in 3D - we couldn't just set the background to black. When creating a Joint Plot, simply adding multiple <code>Axes</code> instances looked bad so we opted to use a GridSpec instead. We've also used Colormaps in several Surface Plot examples.
We haven't fully delved into these or explained how they work - they were covered on a need-to-know basis. Now, this is the lesson in which we'll be exploring these functionalities in detail.
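As a reminder of the GridSpec idea mentioned above, here's a rough sketch of a Joint Plot layout. It isn't the code from the previous lesson - the data is randomly generated and the 4x4 grid is just one reasonable assumption for the layout:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.pyplot as plt
import numpy as np
from matplotlib.gridspec import GridSpec

# Randomly generated, loosely correlated data for illustration
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = x + rng.normal(scale=0.5, size=200)

fig = plt.figure(figsize=(6, 6))
gs = GridSpec(4, 4, figure=fig)

# The scatter plot takes the large lower-left block of the grid,
# with the marginal histograms along the top and the right side
ax_scatter = fig.add_subplot(gs[1:4, 0:3])
ax_hist_x = fig.add_subplot(gs[0, 0:3], sharex=ax_scatter)
ax_hist_y = fig.add_subplot(gs[1:4, 3], sharey=ax_scatter)

ax_scatter.scatter(x, y, s=10)
ax_hist_x.hist(x, bins=20)
ax_hist_y.hist(y, bins=20, orientation="horizontal")
```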
<h2 id="understandingmatplotlibstylesheets">Understanding Matplotlib Stylesheets</h2>
First off, a relatively easy, but still really nice feature - Matplotlib Stylesheets. A stylesheet contains a set of parameters that change the look of Matplotlib's elements. These were hand-crafted by the Matplotlib team and are designed to have colors and palettes that work together. The default stylesheet doesn't look bad - but it could most certainly look better, slicker.
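As a first taste, here's a minimal sketch that lists the stylesheets bundled with your Matplotlib installation and applies one of them, <code>ggplot</code>, to a single plot. The plotted data is made up:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.pyplot as plt

# Names of the stylesheets available in the installed Matplotlib version
available_styles = plt.style.available
print(available_styles[:5])

# Apply a stylesheet only inside the with-block, leaving global settings untouched
with plt.style.context("ggplot"):
    fig, ax = plt.subplots()
    ax.plot([0, 1, 2, 3], [0, 1, 4, 9])
    styled_facecolor = ax.get_facecolor()

# plt.style.use("ggplot") would instead apply the style to all following plots
```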

Advanced Matplotlib Customization

Pandas has been aiding us so far in the phase of Data Preprocessing. Though, in one instance, while creating Histograms, we've also utilized another module from Pandas - <code>plotting</code>.
We've purposefully avoided it so far, because introducing it earlier would raise more questions than it answered. Namely, Pandas and Matplotlib were such a common and ubiquitous duo that Pandas started integrating Matplotlib's functionality. It heavily relies on Matplotlib to do any actual plotting, and you'll find many Matplotlib functions wrapped in the source code. Alternatively, you can use other backends for plotting, such as Plotly and Bokeh.
However, Pandas also introduces us to a couple of plots that aren't a part of Matplotlib's standard plot types, such as KDEs, Andrews Curves, Bootstrap Plots and Scatter Matrices.
The <code>plot()</code> function of a Pandas <code>DataFrame</code> uses the backend specified by <code>plotting.backend</code>, and depending on the <code>kind</code> argument - generates a plot using the given library. Since a lot of these overlap - there's no point in covering plot types such as <code>line</code>, <code>bar</code>, <code>hist</code> and <code>scatter</code>. They'll produce much the same plots with the same code as we've been doing so far with Matplotlib.
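For reference, here's a short sketch of the <code>kind</code> argument in action, assuming the default Matplotlib backend and a small made-up sales table:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({"month": ["Jan", "Feb", "Mar"], "sales": [120, 95, 140]})

# The active plotting backend; "matplotlib" unless it has been changed
backend_name = pd.options.plotting.backend
print(backend_name)

# kind selects the plot type; here, a Bar Plot with one bar per row
ax = df.plot(kind="bar", x="month", y="sales")
bar_count = len(ax.patches)
print(bar_count)
```

Swapping <code>kind="bar"</code> for <code>"line"</code>, <code>"hist"</code> or <code>"scatter"</code> selects the other overlapping plot types.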
We'll only briefly take a look at the <code>plot()</code> function since the underlying mechanism has been explored so far. Instead, let's focus on some of the plots that we can't already readily do with Matplotlib.
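As a small preview, here's a sketch of two of these plots, a Scatter Matrix and Andrews Curves, both imported from <code>pandas.plotting</code>. The data is randomly generated, and the <code>group</code> column is a made-up class label added only because Andrews Curves need one:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import andrews_curves, scatter_matrix

# Randomly generated features plus a made-up class label column
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "a": rng.normal(size=60),
    "b": rng.normal(size=60),
    "c": rng.normal(size=60),
    "group": ["first", "second", "third"] * 20,
})

# One subplot per pair of numeric columns, with histograms on the diagonal
axes_grid = scatter_matrix(df[["a", "b", "c"]], diagonal="hist")

# Andrews Curves draw one curve per row, colored by the class label column
plt.figure()
curves_ax = andrews_curves(df, "group")
```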

Data Visualization with Pandas

Matplotlib isn't only for static plots. While GUIs are typically created with GUI libraries and frameworks such as <a href="https://riverbankcomputing.com/software/pyqt/">PyQt</a>, <a href="https://docs.python.org/3/library/tkinter.html">Tkinter</a>, <a href="https://kivy.org/#home">Kivy</a> and <a href="https://www.wxpython.org/">wxPython</a>, and while Python does have excellent integration with PyQt, Tkinter and wxPython - there's no need to use any of these for some basic GUI functionality, through Matplotlib Widgets.
The <code>matplotlib.widgets</code> module has several classes, including the <code>AxesWidget</code>, out of which <code>Button</code>s, <code>CheckButton</code>s, <code>Slider</code>s, <code>TextBox</code>es, etc. are derived. These all accept the <code>Axes</code> they're being added to as their first constructor argument (most also need a little more, such as a label), and their positioning has to be manually set. A thing to note is that the widget is the axes, so you'll create an <code>Axes</code> instance for each widget.
Another thing to note is that you have to keep references to the widgets. Otherwise, they might get garbage collected.
Each of them can also be disabled by setting <code>active</code> to <code>False</code>, in which case, they won't respond to any events, such as being clicked on. That being said, we can introduce a new type of interactivity to our plots, through various GUI elements and components.
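To illustrate these points, here's a minimal sketch of a single <code>Button</code>. The plotted data, the button's position and the <code>clear_line()</code> callback are all made up for the example:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to get an interactive window

import matplotlib.pyplot as plt
from matplotlib.widgets import Button

fig, ax = plt.subplots()
fig.subplots_adjust(bottom=0.25)  # leave room below the plot for the button
line, = ax.plot([0, 1, 2, 3], [0, 1, 4, 9])

# Each widget lives in its own Axes, positioned manually as
# [left, bottom, width, height] in figure-relative coordinates
button_ax = fig.add_axes([0.4, 0.05, 0.2, 0.1])

# Keep a reference to the widget so it isn't garbage collected
clear_button = Button(button_ax, "Clear")


def clear_line(event):
    """Hypothetical callback: remove the plotted data when the button is clicked."""
    line.set_data([], [])
    fig.canvas.draw_idle()


clear_button.on_clicked(clear_line)

# Setting active to False makes the widget ignore events such as clicks
clear_button.active = False
```

To try it interactively, remove the <code>matplotlib.use("Agg")</code> line, leave <code>active</code> set to <code>True</code> and call <code>plt.show()</code> at the end.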
Note: Matplotlib isn't meant to be used for high-quality GUI creation, nor user-friendly systems. These widgets are rudimentary, don't really look great and have limited functionality. They're meant as a way to prototype and test things out, rather than actually ship them.

Matplotlib Widgets

That concludes this course - &quot;Data Visualization in Python with Matplotlib and Pandas&quot;. Thank you for taking a ride with us!
<blockquote>
Online education is spreading through the world, and is becoming an increasingly important part of many lives. We believe that accessible, high-quality resources can help empower people that build tomorrow, and remain guided by that goal.
</blockquote>
At StackAbuse, we believe that learning is not a one-stop time investment. It's life-long, especially in the volatile and rapidly changing world of Computer Science and Software Engineering. So, we've pledged to update our courses, guides, and other upcoming material to keep pace with progress in the field. Software is updating - it's only fitting that learning resources are updating as well.
Thank you for purchasing &quot;Data Visualization in Python with Matplotlib and Pandas&quot;! We hope that it has brought a ton of value to you so far, and know that it will continue to do so as you dive further into this topic.
<blockquote>
Now, we'd like to ask you to get involved in improving the next version of the book and our courses.
</blockquote>
We believe that high-quality resources and education are community-driven and that minor (or major) contributions from each member result in a wonderful learning oasis. For this, feedback is crucial.

Thank You for Supporting Online Education

Data Visualization in Python with Matplotlib and Pandas is a course designed to take absolute beginners to Pandas and Matplotlib, with basic Python knowledge, and allow them to build a strong foundation for advanced work with these libraries - from simple plots to 3D plots and interactive buttons.
Through practical, hands-on and straightforward examples, the course guides you through Data Visualization and Exploration using Python, Pandas and Matplotlib. You'll learn how to use the constituent elements of Pandas to load and manipulate datasets, as well as visualize them, the different styles of Matplotlib plotting and the anatomy of Matplotlib plots, before learning how to customize the elements to your liking. Furthermore, the book covers a great deal of different plot types, from simple Pie Charts and Bar Plots to 3D Surface Plots and Joint Plots. Each different plot type features a new dataset, containing different types of data you might want to visualize, guiding you through many unique aspects of Data Visualization.
This book aims to be your one-stop-shop for Matplotlib-related questions, including common tasks, frequently searched questions regarding customization and a wide variety of plot types - including the ones not built into the library itself.
<h3 id="whatourreadersthink">What Our Readers Think</h3>
<blockquote>
&quot;This is the best all-in-one collection of data plotting techniques for Python, and a good Pandas reference as well. Great desk-side reference. I highly recommend.&quot; 
— Amazon Customer
</blockquote>
<blockquote>
&quot;Easy to follow and well written. Helped me build a solid foundation of Matplotlib and start doing my own data visualization projects. It was nice that the book had several datasets to learn from since most projects require their own unique preprocessing and prep for data visualization. Now I just need to find a resource for animating plots.&quot; 
— Matt
</blockquote>
<h3 id="whothiscourseisfor">Who This Course is For:</h3>
<ul>
<li>Beginner to Advanced Python enthusiasts</li>
<li>Aspiring Data Scientists, who are ready to jump into the world of Data Visualization</li>
<li>Analysts, Product Managers and Marketing Consultants who'd like to explore trends and adapt their strategies with empirical evidence</li>
<li>The average Joe who's interested in Data Visualization 😉</li>
</ul>