Introduction
Data can be represented in many different ways. We quantify observable phenomena to generate data which can then be represented through mathematical formulas, music, text, visualizations, etc.
Python has become one of the preferred languages in the world of Data Science over the years, thanks to its simplicity and ease of use, which lowered the barrier to entry for people from other professions and opened up collaboration between many research groups. Even without deep knowledge of Python, many scientists are now able to visualize and share their work with their colleagues and peers, and this new wave of users necessitated a new wave of tools. The easier it is to get started, the larger the impact on the community.
Many libraries have been created to make working with data easier, and as of writing this book, there's no shortage of powerful libraries that allow even unpracticed folk to step in and try their hand at extracting knowledge from data. Data Visualization is one of the techniques used in all aspects of research and science. It's an interdisciplinary field that represents data through various graphical elements, such as lines and markers. That said, Data Visualization is much more than the graphs and charts you learned about in elementary school. You can plot and explore relationships between data, their distributions and summaries, and put things into perspective. Data Visualization has become a storytelling substrate. Each plot you make can tell a story, and you're the artist shaping it.
An artist can also choose to put things out of context - and it's easy to lie with data. As a matter of fact, many people do - however, the statement is not as cynical as it may sound. Many people lie to themselves, not others. Without proper knowledge of how to approach problems, you can "hallucinate" relationships where they aren't present, and infer causation from correlation, which more often than not doesn't hold water.
The list of applications of Data Visualization is long. It's a technique used within Data Science, Data Analysis and Descriptive Statistics, and it sits at the core of Exploratory Data Analysis, which is at the heart of practically all research. Whether you're a biologist, molecular physicist, machine learning engineer, software engineer, psychologist or philosopher - your findings are backed by data, and data in the form of numbers is hard for humans to interpret. We use our interpretation of our own results as input for our Bayesian best guess of the world and reality around us, so it's crucial that we produce clear, concise, impactful and interpretable results, lest we end up fooling ourselves.
Welcome to Data Visualization in Python with Matplotlib and Pandas. This course is designed to take absolute beginners to Pandas and Matplotlib, with basic Python knowledge, and allow them to build a strong foundation for advanced work with these libraries - from simple plots to 3D plots and interactive buttons.
We'll start out with the installation and setup of the environment you need to start working on Data Visualization projects, followed by a crash-course on Pandas - from the very foundations and upwards, covering everything you'll need to know to follow this course, and to start working on your own visualization projects.
Then, we'll dive into the basics of Matplotlib, the anatomy of plots, the APIs it offers, and basic plot customization, including common tasks like changing font size, setting axis ranges, changing tick frequency, adding legends and plotting basic plots. We'll also cover some fundamental classes like the Text class and how annotations work.
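To give a rough feel for the customization tasks listed above, here is a minimal sketch using Matplotlib's object-oriented API. The data, labels, axis limits and annotation position are made up for illustration, and the Agg backend is selected only so that the snippet runs without a display; it is not the course's own example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
from matplotlib.text import Text

# Made-up data, purely for illustration
x = [0, 1, 2, 3, 4, 5]
y = [value ** 2 for value in x]

fig, ax = plt.subplots()
ax.plot(x, y, label="y = x^2")                    # a basic line plot

ax.set_title("Basic customization", fontsize=14)  # change the font size
ax.set_xlabel("x", fontsize=12)
ax.set_xlim(0, 5)                                 # set the axis ranges
ax.set_ylim(0, 30)
ax.set_xticks(range(0, 6))                        # one tick per integer
ax.legend()                                       # legend built from the label above

# annotate() returns an Annotation, which is a subclass of Text
note = ax.annotate("last point", xy=(5, 25), xytext=(2.5, 27),
                   arrowprops={"arrowstyle": "->"})
print(isinstance(note, Text))
```

With an interactive backend, calling plt.show() would display the figure, and fig.savefig() writes it to a file instead.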
Once we've gotten the hang of Matplotlib's APIs, plot anatomy and how to customize plots - we'll jump into the meat of the course - Data Visualization with Matplotlib. This is where we'll cover the quintessential plots that should be in the arsenal of any Data Scientist, like Scatter Plots, Box Plots, and Histograms. This is also the lesson in which we'll dive into custom plot types that Matplotlib doesn't provide out of the box, such as Joint Plots and Ridge Plots, followed by 3D plots and an exploration section, where we'll explore an EEG dataset.
Note: For each plot type, we'll aim to use a different dataset, which necessitates different pre-processing. In some cases, the data will already be well suited to the plot type we're using, since we'll be choosing the right plot type for the job. In other cases, though, we'll have to perform pre-processing with Pandas. All of the datasets are publicly available and downloadable for your convenience, and the links to each will be provided in the footnotes. In many cases, we'll be able to pose hypotheses on the correlations between certain features, as well as test these hypotheses through Exploratory Data Analysis.
At this point, we'll jump into more advanced customization, overcoming potential issues with more advanced and custom plot types. Before wrapping up with interactive buttons, we'll take a brief look at Pandas' plotting abilities as well.
Finally, Matplotlib isn't only for static plots. GUIs are typically created with GUI libraries and frameworks such as PyQt, Tkinter, Kivy and wxPython, and Python does have excellent integration with PyQt, Tkinter and wxPython. Still, there's no need to use any of these for some basic GUI functionality, which is available through Matplotlib Widgets.
The course is split into the following lessons:
- Lesson 1. - Introduction
- Lesson 2. - Installation and Setup
- Lesson 3. - Getting Started with Pandas
- Lesson 4. - Getting Started with Matplotlib
- Lesson 5. - Basic Matplotlib Customization
- Lesson 6. - Data Visualization with Matplotlib
- Lesson 7. - Advanced Matplotlib Customization
- Lesson 8. - Data Visualization with Pandas
- Lesson 9. - Matplotlib Widgets