An Introduction To Data Visualization In Python
Data can be represented in many different ways. We quantify observable phenomena to generate data which can then be represented through mathematical formulas, music, text, visualizations, etc.
Python has become one of the preferred languages in the world of Data Science over the years, given its simplicity and ease of use, which lowered the barrier to entry from other professions and opened up collaboration options between many research groups. Even without deep knowledge of Python - many scientists have become able to visualize and share their work with their colleagues and peers, and the new wave of users necessitated a new wave of tools. The easier it is to get started - the larger the impact on the community.
Many libraries have been created to make working with data easier, and as of writing this course, there's no shortage of powerful libraries that allow even the unpracticed folk to step in and try their hand at extracting knowledge from data. Data Visualization is one of the techniques used in all aspects of research and science. It's an interdisciplinary field that represents data through various graphical elements, such as lines and markers. Though - Data Visualization is much more than graphs and charts you've learned in elementary school. You can plot and explore relationships between data, their distributions, summaries, and put things into perspective. Data Visualization has become much more - it's become a storytelling substrate. Each plot you make can tell a story, and you're the artist shaping it.
An artist can also choose to put things out of context - and it's easy to lie with data. As a matter of fact, many people do - however, the statement is not as cynical as it may sound. Many people lie to themselves, not others. Without proper knowledge of how to approach problems, you can "hallucinate" relationships where they aren't present, and infer causation from correlation, which more often than not doesn't hold water.
Data Visualization has a long list of applications. It's a technique used within Data Science, Data Analysis and Descriptive Statistics, and it sits at the core of Exploratory Data Analysis, which is present at the heart of practically all research. Whether you're a biologist, molecular physicist, machine learning engineer, software engineer, psychologist or philosopher - your findings are backed by data, and data in the form of numbers is hard for humans to interpret. We use this crucial step of interpreting our own results as input for our Bayesian best guess of the world and reality around us, so it's crucial that we produce clear, concise, impactful and interpretable results, lest we end up fooling ourselves.
How the Course Is Formatted
This course will cover 9 different libraries used in the Python ecosystem for Data Visualization, going over the most relevant and unique attributes and features of each. This course will also cover the different types of data you can visualize in Python, in addition to common visualization techniques, tools, and plot types.
Note: The course assumes prior knowledge of Python's syntax and basic handling of the language. We won't assume prior knowledge of the additional libraries used in the course, such as Pandas and NumPy. Chapter 3 is dedicated to a brief introduction to Pandas and how it can be used to load, manipulate and visualize data. NumPy will be used extensively throughout the course for its helper methods, to generate ranges and dummy values, as well as to calculate aggregate statistics. In most cases, these methods are fairly self-explanatory and a dedicated section on NumPy isn't required to get the hang of it. Whenever a new method is used - a short description in an additional paragraph will be added to explain its usage.
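To give a rough idea of the kind of NumPy helpers meant here, the following is a minimal sketch rather than code taken from a later lesson. The variable names, the seed and the noise level are assumptions chosen purely for illustration.

```python
import numpy as np

# An evenly spaced integer range: 0, 1, ..., 9
steps = np.arange(10)

# 50 evenly spaced floats between 0 and 2*pi (both endpoints included)
x = np.linspace(0, 2 * np.pi, 50)

# Dummy values: a sine wave plus seeded Gaussian noise (seed chosen arbitrarily)
rng = np.random.default_rng(seed=42)
y = np.sin(x) + rng.normal(loc=0.0, scale=0.1, size=x.shape)

# Aggregate statistics over the dummy values
summary = {
    "mean": float(np.mean(y)),
    "std": float(np.std(y)),
    "min": float(np.min(y)),
    "max": float(np.max(y)),
}
print(summary)
```

Seeding the generator keeps the dummy values reproducible between runs, which is handy when comparing plots.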
Each lesson in the course will start with an introduction to the library covered in the lesson, followed by its internal representation and terminology. The strengths and weaknesses of each library as well as some of the baseline (common) and unique plots will be covered, but we won't be diving into every plot of every library, as that would be extremely repetitive and tiresome. Instead, we'll be building a holistic intuition for each library.
For most libraries in the course, we'll end it off with a hands-on, end-to-end project, featuring a new dataset or domain, and this is the focus of the course. Diversity is one of the most important aspects of practice in the field, and each dataset needs to be preprocessed in a different manner. Additionally, someone who does Data Visualization has to be able to understand a wide variety of topics, at least on a surface level. You can't infer from data if you don't understand what it means, and you're much more liable to misinterpret something and arrive at false conclusions if you're not familiar with the domain you're dealing with.
Throughout the course - we'll be using a baseline, simple dataset or a couple of simple datasets to start out with the libraries, followed by a new dataset or multiple datasets in the hands-on section. We'll be exploring mathematical structures and generator functions, EEG (electroencephalogram) brainwave data, spatial data, etc.
Some of these domains use completely different data formats - for instance, GeoJSON is oftentimes used for spatial data, EEG uses several data formats (we'll be working with CSV), genomic data can be represented through various formats such as FASTA, bedGraph, bowtie, etc.
Many formats can be boiled down to the good old CSV format most people are acquainted with, and when possible - we'll be falling back to it for simplicity's sake, as this is the most common one you'll be using in your day-to-day work.
Many of these datasets belong to completely different families of data, and you need some exposition and context to understand them properly. The course will have an introductory section on each domain right before we dive into the project. This should be enough for you to grasp and understand the visualizations and analysis done in the practical sections. Don't forget, doing your own research into the fields you're visualizing data from is part of the Data Visualization process.
The course is written by two authors - David Landup, who authored the hands-on, end-to-end sections and edited the rest of the course, and Daniel Nelson, who authored the introductions to the libraries.
Landscape of Python's Libraries
Before delving too deeply into the libraries themselves, it would be helpful to gain an intuition of how the landscape of Python’s visualization libraries breaks down. To put that another way, it’s helpful to understand how the different Python libraries are designed and related to one another. Understanding how the different libraries operate will help you choose the best library for your visualization project.
There are a number of different data visualization libraries and modules compatible with Python. Most of the Python data visualization libraries can be placed into one of four groups, separated based on their origins and focus.
The groups are:
- Matplotlib-based libraries
- JavaScript libraries
- JSON libraries
- WebGL libraries
Matplotlib-based Libraries
The first major group of libraries is those based on Matplotlib. Matplotlib is one of the oldest Python data visualization libraries, and thanks to its wealth of features and ease of use it is still one of the most widely used ones. Matplotlib was first released back in 2003 and has been continuously updated since.
Matplotlib contains a large number of visualization tools, plot types, and output types. It produces mainly static visualizations. While the library does have some 3D visualization options, these options are far more limited than those possessed by other libraries like Plotly and VisPy. It is also limited in the field of interactive plots, unlike Bokeh, which we'll cover in a later lesson.
Because of Matplotlib’s success as a visualization library, various other libraries have expanded on its core features over the years. These libraries are Matplotlib-based, using Matplotlib as an engine for their own visualization functions.
The libraries based upon Matplotlib add new functionality to the library by specializing in the rendering of certain data types or domains, adding new types of plots, or creating new high-level APIs for Matplotlib’s functions.
They're used alongside Matplotlib, not instead, to expand its styling and plotting capabilities.
JavaScript-based Libraries
There are a number of JavaScript-based libraries for Python that specialize in data visualization. The adoption of HTML5 by web browsers enabled interactivity for graphs and visualizations, instead of only static 2D plots. Styling HTML pages with CSS can net beautiful visualizations.
These libraries wrap JavaScript/HTML5 functions and tools in Python, allowing the user to create new interactive plots. The libraries provide high-level APIs for the JavaScript functions, and the JavaScript primitives can often be edited to create new types of plots, all from within Python.
JSON-based Libraries
JavaScript Object Notation (JSON) is a data interchange format, containing data in a simple structured format that can be interpreted not only by JavaScript libraries but by almost any language. It's also human-readable.
There are various Python libraries designed to interpret and display JSON data. With JSON-based libraries, the data is fully contained in a JSON data file. This makes it possible to integrate plots with various visualization tools and techniques.
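As a rough illustration of a plot description living entirely in JSON, here is a minimal sketch using only Python's standard json module. The chart_spec dictionary is a hypothetical, simplified structure with made-up values. It is loosely shaped like the specifications used by JSON-based tools, but it isn't a complete or validated spec for any particular library.

```python
import json

# A hypothetical, simplified chart description (not a complete, validated spec).
# The values are made up for illustration.
chart_spec = {
    "title": "Monthly temperature",
    "mark": "line",
    "data": {
        "values": [
            {"month": "Jan", "temp_c": 2.5},
            {"month": "Feb", "temp_c": 4.0},
            {"month": "Mar", "temp_c": 8.5},
        ]
    },
    "encoding": {
        "x": {"field": "month", "type": "ordinal"},
        "y": {"field": "temp_c", "type": "quantitative"},
    },
}

# Serialize to a human-readable JSON string that any JSON-aware tool could read
as_text = json.dumps(chart_spec, indent=2)

# Parsing the text back yields an equal structure, so nothing was lost
restored = json.loads(as_text)
print(restored["encoding"]["y"]["field"])
```

Because the data and the description of how to draw it travel together as plain text, the same file can be handed to different tools or languages.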
WebGL-based Libraries
The WebGL standard is a graphics standard that enables interactivity for 3D plots. Much like how HTML5 made interactivity for 2D plots possible (and plotting libraries were developed as a result), the WebGL standard gave rise to 3D interactive plotting libraries.
Python has several plotting libraries that are focused on the development of WebGL plots. Most of these 3D plotting libraries allow for easy integration and sharing via Jupyter notebooks and remote manipulation through the web.
Other Libraries
There are also a variety of other Python plotting libraries, many of which create Python wrappers for other languages and visualization platforms.
Popular Python Data Visualization Libraries
This course will cover the most popular data visualization libraries for Python, which fall into the categories defined above. The libraries covered in this course are: Matplotlib, Pandas, Seaborn, Bokeh, Plotly, Altair, GGplot, GeoPandas, and VisPy.
You’ll need to know what these different libraries are capable of, in order to choose the proper library for your project’s needs. Let’s take a quick look at these different libraries, some of their unique distinctive features, and what they're used for.
Matplotlib-based Python Libraries
Matplotlib
As already stated above, Matplotlib is one of the most common and widely used visualization libraries, used mainly to create static 2D plots, although it does have some support for 3D visualizations. Matplotlib is structured in a fashion that allows the user to create and customize multiple plots within a single figure, achieved through the creation of subplots. It's intended to make producing both simple and advanced plots straightforward and intuitive, and it supports both static and interactive visualization modes, though it's relatively limited when it comes to interactive visualization.
Matplotlib is able to generate numerous different plot types and styles, and it can work along with general-purpose Python GUI libraries like Qt and Tkinter.
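The subplot structure described above can be sketched as follows. This is a minimal example under a few assumptions: the non-interactive "Agg" backend is selected so that no GUI toolkit is needed, and the data, titles and colors are placeholders.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no GUI toolkit is required
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# One figure holding two subplots (Axes objects) side by side
fig, (ax_left, ax_right) = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))

# Each subplot is customized independently
ax_left.plot(x, np.sin(x), color="tab:blue")
ax_left.set_title("sin(x)")

ax_right.plot(x, np.cos(x), color="tab:orange", linestyle="--")
ax_right.set_title("cos(x)")

fig.tight_layout()

# Render to an in-memory PNG; passing a file path to savefig writes to disk instead
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
```

In an interactive session you would typically skip the backend selection and call plt.show() to open a window instead of saving.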
Pandas
Pandas is a data analysis and manipulation library. While Pandas does come with some visualization and plotting functions, the main reason Pandas is so popular and widely used is that the library makes manipulating data simple and straightforward. Pandas can read data in many different formats, and it creates a Python data object filled with rows and columns, called a DataFrame.
These rows and columns are easy to manipulate through built-in functions that let the user merge, split, view, filter, sort, and otherwise alter the data within them, all done with relatively simple commands.
For these reasons, Pandas is frequently used alongside the other data visualization libraries - to prepare the data in question for analysis.
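A small sketch of that kind of preparation work is shown below. The station readings are made-up values, and the CSV is read from an in-memory string rather than a real file, purely to keep the example self-contained.

```python
import io

import pandas as pd

# Made-up CSV content, read from memory instead of a file on disk
csv_text = "station,temp_c\nA,12.5\nB,19.0\nC,7.5\nD,15.0\n"
readings = pd.read_csv(io.StringIO(csv_text))

# A second, equally made-up table sharing the "station" column
locations = pd.DataFrame({
    "station": ["A", "B", "C", "D"],
    "region": ["north", "south", "north", "south"],
})

# Merge the two tables on their shared column
merged = readings.merge(locations, on="station")

# Filter rows with a boolean mask, then sort the result
warm = merged[merged["temp_c"] > 10]
ranked = warm.sort_values("temp_c", ascending=False)

print(ranked)
```

The resulting DataFrame can be passed on to a plotting library, or plotted directly through Pandas' own plot() method, which relies on Matplotlib.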
Seaborn
Seaborn is a visualization library that adds onto Matplotlib’s basic functions. Seaborn is intended to enable the easy creation of informative and attractive visualizations. Seaborn gives the user a higher-level interface to their plots, letting them do things that would be cumbersome with plain Matplotlib.
This includes the ability to easily produce less common types of visualizations such as heatmaps, violin plots, and joint plots, amongst other plots. Seaborn's goal is to abstract away many of Matplotlib's low-level functions and methods, letting the user create visually impressive plots with less code compared to Matplotlib.
Seaborn gives you more customization options for your plots as well, allowing you to use preset themes or customize the plots to your liking. It also enables efficient handling of dataframes and time-series data.
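As a hedged illustration of a preset theme combined with one of the less common plot types, consider the sketch below. It assumes a reasonably recent Seaborn release (set_theme() was added in version 0.11), and the dataframe is filled with seeded random dummy values.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no GUI toolkit is required
import numpy as np
import pandas as pd
import seaborn as sns

# Apply one of Seaborn's preset themes to all subsequent plots
sns.set_theme(style="whitegrid")

# Dummy data: three groups with different spreads (seed chosen arbitrarily)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], 50),
    "value": np.concatenate([
        rng.normal(0.0, 1.0, 50),
        rng.normal(1.0, 0.5, 50),
        rng.normal(-0.5, 1.5, 50),
    ]),
})

# A violin plot straight from dataframe columns, in a single call
ax = sns.violinplot(data=df, x="group", y="value")
ax.set_title("Distribution of value per group")
```

The call returns a regular Matplotlib Axes object, so further Matplotlib customization can still be applied to it.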
GeoPandas
GeoPandas is an extension to the Pandas library designed to make it easier to work with geospatial/geographical data. GeoPandas enables the types of data manipulation possible in Pandas on geometric data, letting you easily carry out visualization tasks that would typically require a spatial database.
GeoPandas allows you to specify the shape of graph regions using special shapefiles, and to clip points and lines to the boundary mask.
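Clipping to a boundary mask can be sketched roughly as follows. To keep the example self-contained, the mask is a hand-built rectangle rather than a shape loaded from a shapefile, and the point coordinates and names are arbitrary placeholders. With real data, the boundary would typically come from a file read with geopandas.read_file().

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Three placeholder points; the last one lies far outside the mask
points = gpd.GeoDataFrame(
    {"name": ["inside_1", "inside_2", "outside"]},
    geometry=[Point(1, 1), Point(2, 3), Point(10, 10)],
    crs="EPSG:4326",
)

# A hand-built rectangular boundary mask (stand-in for a region from a shapefile)
mask = gpd.GeoDataFrame(geometry=[box(0, 0, 5, 5)], crs="EPSG:4326")

# Keep only the geometries that fall within the mask
clipped = gpd.clip(points, mask)

print(clipped)
```

Both layers share the same coordinate reference system here, which is what you'd want to ensure before clipping real datasets.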
JavaScript-based Libraries
Bokeh
Bokeh is a visualization library that allows the user to create interactive visualizations that can be displayed in Jupyter notebooks and web browsers. Bokeh is focused on the production of highly interactive visualizations, unlike Matplotlib which has just a handful of interactive options. Visualizations in Bokeh are based around objects called "glyphs", which you can render in numerous different shapes and styles.
Bokeh lets you choose different tools to include alongside your visualization. These tools let you select groups of data points, hover over points to see more information about them, zoom in on multiple graphs at once, and more.
It also allows you to construct numerous different plots with various styles, all the while maintaining high performance across large datasets. Bokeh supports HTML formatting and exporting and has native Pandas integration, allowing you to edit dataframes and the resulting visualizations easily.
With Bokeh, it's easy to create a well-styled interactive HTML file which you can then embed into a page or presentation.
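A minimal sketch of glyphs, tools and HTML export might look like this. The data points and the title are placeholders, and the selection of tools is just one possible choice.

```python
from bokeh.embed import file_html
from bokeh.plotting import figure
from bokeh.resources import CDN

# Placeholder data
x = [1, 2, 3, 4, 5]
y = [6, 7, 2, 4, 5]

# The tools string picks which interactive tools accompany the plot
plot = figure(
    title="Glyphs and tools",
    tools="pan,wheel_zoom,box_select,hover,reset",
)

# Two glyph renderers on the same figure: a line and scatter markers
plot.line(x, y, line_width=2)
plot.scatter(x, y, size=10, marker="circle")

# Produce a standalone HTML document as a string, loading BokehJS from a CDN
html = file_html(plot, CDN, "Glyphs and tools")
```

The html string can be written to a file and opened in a browser, or embedded into an existing page.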
Plotly
Like Bokeh, Plotly is designed specifically with the purpose of creating interactive plots. Plotly supports numerous use cases like statistical, geographic, scientific, and even 3D datasets. Similar to Bokeh's use of glyphs, the fundamental unit of a Plotly plot is the "trace". You can combine multiple traces and display them all on a single figure.
Plotly for Python is based on JavaScript's Plotly library and it can be used to create more than 40 different types of plots and charts, each of which can be displayed in a Jupyter notebook or saved in an HTML file. Plotly allows the user to save their plots in the cloud or as a file on their device.
Plotly plots are interactive by default, and they can be created with JSON charts as well as easily embedded in web pages. You can also export Plotly graphs in a variety of different formats, such as PNG, SVG, PDF, and HTML to your local machine.
JSON-based Libraries
Altair
Altair is a Python library designed explicitly for the visualization of statistical data. Altair is based on the Vega and Vega-Lite standards, meaning that you use visualization grammar (specific phrases) that allow you to specify the level of interactivity and style you want your graph to have. Vega specifications are used to define how interactive visualizations are created in JavaScript Object Notation (JSON). Altair is a declarative library, and all you need to do is declare which kind of graph you'd like to create along with some desired features for it.
With Altair, you can produce effective visualizations with minimal code. You can often create complex plots with just a single line of code. However, Altair does lack some of the more advanced customization features of the other libraries.
Altair is designed to quickly create interactive statistical visualizations that can be integrated with IPython notebooks. Altair also lets you create compound charts comprised of different layers.
WebGL-based Libraries
VisPy
VisPy is a 2D and 3D visualization library, created primarily to assist in the visualization of big data. Unlike most of the other libraries mentioned here, VisPy is built around the use of Graphics Processing Units (GPUs) to display visualizations of large datasets.
VisPy supports visualizations of scientific and statistical plots featuring millions of data points. It's intended to be scalable, easy to use, and fast. With having both low-level and high-level interfaces, VisPy makes it possible to create visualizations with relatively few lines of code and then edit those visualizations to your needed specifications.
It has OpenGL support, on which it currently bases some of its functionality, though using it does require knowledge of the OpenGL Shading Language (GLSL).
Other Libraries
GGplot
GGplot is intended to make producing plots simple and efficient, rendering them with minimal code. It uses the “Grammar of Graphics” standard, borrowed from R. GGplot graphs contain consistent basic elements, which makes graphs uniform and easy to read.
GGplot lets you perform aesthetics mapping, meaning that you can control how variables within your dataset are mapped onto visual properties, defining mappings for different variables and layers of your graph.