Before we get started it would be helpful to know what data science and machine learning actually are. So in case you don't know, here are some basic definitions:
Data science is an interdisciplinary field of scientific methods, processes, algorithms and systems to extract knowledge or insights from data in various forms, either structured or unstructured.
Machine learning is a field of computer science that often uses statistical techniques to give computers the ability to "learn" with data, without being explicitly programmed.
Glassdoor has ranked data scientist as the number one job in America with an average salary of $120,000 and over 4,500 job openings (as of the time of this writing). With these kinds of numbers, there are definitely a good number of people who want to try out careers in data science, which creates demand for courses to help them level up their skills.
With demand comes supply, which is why there are so many data science and machine learning courses available online and at different institutions. That presents another challenge: choosing the right course to help you start your journey in data science and machine learning.
For the past few weeks I have been taking one of those courses, Python for Data Science and Machine Learning Bootcamp, which is available only on Udemy. In this article I present my take on this online course.
This course is the work of Jose Portilla, an experienced data scientist with several years in the field and the founder of Pierian Data. Jose Portilla is among the top instructors on Udemy, with over half a million students and 15 courses. Most of his courses focus on Python, Deep Learning, Data Science and Machine Learning, covering the latter two topics in both Python and R.
Jose Portilla holds a BS and an MS in Mechanical Engineering, with several publications and patents to his name. For more information, you can check out his profile on Udemy.
The first question you probably have about any course is whether it's a fit for you.
Machine learning and data science are advanced topics in math and programming. There is therefore a fairly steep learning curve to understanding these concepts, which is why it is even more important to have a good resource to learn from.
You can't jump from Novice to Expert. You have to go through the different stages of learning: Novice, Intermediate, Advanced, then Expert.
For this course you need some programming experience. A basic grasp of core programming concepts, such as data structures and conditional statements, is important to have, whatever language you learned them in. It would be preferable to have this experience in Python, which is the programming language used throughout this course. However, knowledge of Python is not a necessity, as the course starts out with a Python Crash Course that will help you understand Python and follow along.
This is one of the most immersive courses I have come across. With almost 150 videos, clocking in at just over 21 hours in total, this course takes the learner through in-depth training on a number of topics, including a Python crash course, an overview of data analysis libraries, an overview of data visualization libraries, and machine learning algorithms, amongst many others.
This course also uses Jupyter Notebooks, which help with sharing the code and provide a playground for all the code written and executed.
Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.
In the following sections we'll take a closer look at the actual content in this course.
Python Crash Course
From the name of the course you probably figured that the material would be using Python to explore data science and machine learning, so no surprise there.
The Python Crash Course section takes you from the basics and through a few beginner concepts in the Python programming language. The mini crash course takes you through a few Python concepts including data types, conditional operators and statements, loops, lambdas, and many more.
Most of the Python knowledge you will need is contained in this section, so you don't need to worry about being a Python expert before taking this course. However, the importance of taking time to get a better grasp of the language before proceeding to other stages can't be over-emphasized, as you'll then be able to focus on the machine learning concepts and not the small details of the programming language.
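To give a feel for the level of this section, here is a short sketch that touches the concepts listed above: data types, a conditional statement inside a loop, and a lambda. It is my own illustration rather than material from the course, and the variable names and values are made up.

```python
# Data types: float, string, list, and dictionary (all values are made up)
pi_approx = 3.14
title = "Python Crash Course"
scores = [72, 88, 95, 61]
student = {"name": "Ada", "passed": True}

# A loop with a conditional statement that labels each score
labels = []
for score in scores:
    if score >= 90:
        labels.append("excellent")
    elif score >= 70:
        labels.append("good")
    else:
        labels.append("needs work")

# A lambda is a small anonymous function; here it is applied with map()
double = lambda n: n * 2
doubled = list(map(double, scores))

print(labels)   # ['good', 'good', 'excellent', 'needs work']
print(doubled)  # [144, 176, 190, 122]
```

If you can read this snippet comfortably, you should be able to move through the crash course quickly.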
A very simple way to describe data science is that it involves extracting knowledge and insights from a data set. To be able to process the data and extract insights and information from it you have to be able to analyse it.
This raises the question: What exactly is data analysis?
Data Analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making.
Seeing as how critical data analysis is, this course takes time to guide you through several data analysis libraries in Python, which I'll touch on below.
- NumPy: A Python library, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
- Pandas: A Python library for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.
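As a rough illustration of what these two libraries look like in use, the sketch below runs a few NumPy array operations and then walks a tiny pandas table through the inspect, cleanse, and transform steps from the definition of data analysis above. The data is invented for the example and is not taken from the course.

```python
import numpy as np
import pandas as pd

# NumPy: math on a whole 2-D array at once, with no explicit loops
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])
column_means = matrix.mean(axis=0)  # mean of each column
scaled = matrix * 10                # the scalar is broadcast to every element

# pandas: a small, made-up table with one missing sales value
df = pd.DataFrame({
    "city": ["Nairobi", "Lagos", "Cairo", "Lagos"],
    "sales": [250.0, np.nan, 400.0, 150.0],
})

# Inspect: count the missing values in each column
missing = df.isna().sum()

# Cleanse: replace the missing sales figure with 0
clean = df.fillna({"sales": 0.0})

# Transform: total sales per city
totals = clean.groupby("city")["sales"].sum()

print(column_means)
print(missing)
print(totals)
```

Filling missing values with 0 is only one possible choice; depending on the data set, dropping the row or filling with a mean may make more sense.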
Data Visualization is critical because it helps with communicating information clearly and efficiently to users by use of statistical graphics, plots, information graphics and other tools.
Data visualization refers to the techniques used to communicate data or information by encoding it as visual objects.
This course takes the learner through several data visualization libraries in Python, demonstrating how to create a variety of visualizations for a wide range of data sets using the different libraries. Some of the visualization libraries taught in this course include:
- Matplotlib: A Python 2D plotting library which produces publication quality figures in a variety of hardcopy formats and interactive environments across platforms.
- Seaborn: A Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.
- Pandas: A data library that has both analysis and visualization capabilities.
- Plotly: An interactive visualization library.
- Cufflinks: A library that helps connect Plotly with Pandas.
- Geographical Plotting: Creating choropleth maps for geographic data visualization.
This is the second part of the course, which takes the learner through several machine learning algorithms. The course takes several steps to help students understand each algorithm by offering instruction on the theory, supplemental reading, a Python implementation of the algorithm, exercises on the algorithm, and solutions to the exercises.
The course extensively covers the different types of machine learning algorithms, namely supervised learning, unsupervised learning, and reinforcement learning.
Some of the machine learning algorithms covered in this course include:
- Linear Regression: It is used to estimate real values based on continuous variables.
- Logistic Regression: It is used to estimate discrete values based on a given set of independent variables.
- K Nearest Neighbour: kNN is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure.
- Natural Language Processing: The application of computational techniques to the analysis and synthesis of natural language and speech.
- Neural Nets and Deep Learning: Neural networks are computer systems modelled on the human brain and nervous system. Deep learning is a powerful set of techniques for learning in neural networks.
- Support Vector Machines: SVM is a supervised machine learning algorithm which can be used for both classification and regression challenges.
- K-Means Clustering: K-Means Clustering aims to partition observations into clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.
Other topics covered in the course include big data and Spark with Python, principal component analysis, and recommender systems.
The course also takes the learner through the Scikit-Learn library, which is a Python library with implementation of quite a few machine learning algorithms. This is basically Python's "Swiss Army Knife" for machine learning.
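To make one of the algorithm descriptions above concrete, here is a minimal from-scratch sketch of K Nearest Neighbour using only the Python standard library. It is my own illustration, not the course's implementation: the function name `knn_predict`, the choice of Euclidean distance as the similarity measure, and the data points are all assumptions made for the example. In practice you would normally reach for a library such as Scikit-Learn rather than writing this yourself.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored case
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest cases
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the labels of those cases
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up 2-D data forming two loose groups
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (1.8, 1.6)))  # a
print(knn_predict(points, labels, (6.2, 6.4)))  # b
```

Note that there is no training step here: the "model" is just the stored cases, which is why the algorithm is described as one that stores all available cases and classifies new ones by similarity.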
Hands down, this is an amazing course. It has a very large amount of content, so it took me a while to review it, and it takes its time to go into detail on the many concepts it covers.
Python Crash Course
One of the major drawbacks for most courses is assuming the students can level up on the required stack on their own. This course doesn't take that chance, taking the student through a Python Crash Course so the user is able to comfortably go through the course and not get bogged down on details unrelated to the core material.
Going into Detail
This course doesn't shy away from diving deep into concepts. The course takes time to dive deep on the important concepts to ensure that the student gets a complete grasp of the topic. Sometimes one concept is even split into multiple different sections just to ensure that all of the concept is delivered fully.
Also, learners are provided with (optional) extra reading material to expand their knowledge of the algorithms covered. For example, the course uses An Introduction to Statistical Learning by Gareth James et al. as a companion book.
This course has meticulously written notes, both on screen as the instructor goes through the content and before or after videos to explain a few concepts. These notes are critical in helping learners follow along, especially on the more complex concepts.
Sharing Code (Jupyter Notebook)
Because of its hands-on approach, a lot of code is written throughout this course. The instructor uses Jupyter Notebooks to share all the code that is covered. The course has a "Resources" folder which contains well-arranged Jupyter Notebooks for each section.
These notebooks help learners to have access to the code so that they can follow the lectures more easily and also have access to the code to do more practice later.
Exercises and Solutions
The best way to learn and understand something is to actually do it. This course understands that important step in learning new concepts and has a custom exercise for almost every section in the course. It goes on further to provide solutions for the exercises in each section.
These exercises are meant to help the student internalise the concepts taught in the section. For the different machine learning algorithms a real world data set is provided to the student with questions requiring them to use the concepts they've learned to solve it. The student is also provided with means to get more data sets to sharpen their skills via resources like Kaggle.
One of the hardest parts of going through an online course is running into blockers. Without any help you are left stuck at some point in the course or, even worse, you move on without understanding some concepts.
Jose has worked on creating a community around his course to help learners help each other out with problems they face along the way. Most of the problems a student might come across in the course are actually already in the FAQ for the course, making it even easier for learners to find solutions.
Too Much Information
This is just my opinion, but by the time you are learning complex topics like data science and machine learning, you probably already understand the basic concepts of programming, so a course at this level should not spend so much time explaining them.
However, due to its hands-on approach, this course ends up explaining a lot of basic programming concepts, which takes a lot of time and makes the course even longer.
Python for Data Science and Machine Learning Bootcamp truly is an amazing course. It is very detailed, with a lot of support to ensure you come out of it well-equipped to start working on machine learning and data science problems.
But as you all know, practice makes perfect, so going through just this course won't make you the kick-ass data scientist or machine learning engineer the industry needs. You will have to put in the work to go through the exercises in the course and get more practice with the different libraries and algorithms to get to the top.