data science

Articles: 64

Recently published

Article

K-Means Elbow Method and Silhouette Analysis with Yellowbrick and Scikit Learn

K-Means is one of the most popular clustering algorithms. It assigns a central point to each cluster and groups the other points based on their distance to that central point. A downside of K-Means is having to choose the number of clusters, K, before running the algorithm that groups points. If...

Cássia Sampaio

Jul 04, 2022·7 min read

Article

Definitive Guide to Hierarchical Clustering with Python and Scikit-Learn

In this guide, we will focus on implementing the Hierarchical Clustering Algorithm with Scikit-Learn to solve a marketing problem. After reading the guide, you will understand: when to apply Hierarchical Clustering; how to visualize the dataset to understand if it is fit for clustering; how to pre-process features and engineer...

Cássia Sampaio

Jul 01, 2022·52 min read

Article

Split Train, Test and Validation Sets with TensorFlow Datasets - tfds

TensorFlow Datasets, also known as tfds, is a library that serves as a wrapper to a wide selection of datasets, with proprietary functions to load, split and prepare datasets for Machine and Deep Learning, primarily with TensorFlow. Note: While the TensorFlow Datasets library is used to get data, it's...

David Landup

Jan 28, 2022·12 min read

Article

Keras Callbacks: Save and Visualize Prediction on Each Training Epoch

Keras is a high-level API, typically used with the TensorFlow library, and has lowered the barrier to entry for many and democratized the creation of Deep Learning models and systems. When just starting out, a high-level API that abstracts most of the inner workings helps people get the hang of the...

David Landup

Nov 17, 2021·16 min read

Article

Scikit-Learn's train_test_split() - Training, Testing and Validation Sets

Scikit-Learn is one of the most widely used Machine Learning libraries in Python. It's optimized and efficient, and its high-level API is simple and easy to use. Scikit-Learn has a plethora of convenience tools and methods that make preprocessing, evaluating and other painstaking processes as easy as calling a single...

David Landup

Oct 12, 2021·17 min read

Article

Random Projection: Theory and Implementation in Python with Scikit-Learn

This guide is an in-depth introduction to an unsupervised dimensionality reduction technique called Random Projections. A Random Projection can be used to reduce the complexity and size of data, making the data easier to process and visualize. It is also a preprocessing technique for input preparation to a classifier or...

Mehreen Saeed

Aug 31, 2021·28 min read

Article

Self-Organizing Maps: Theory and Implementation in Python with NumPy

In this guide, we'll be taking a look at an unsupervised learning model, known as a Self-Organizing Map (SOM), as well as its implementation in Python. We'll be using an RGB Color example to train the SOM and demonstrate its performance and typical usage. Self-Organizing Maps: A General Introduction A...

Mehreen Saeed

Aug 25, 2021·20 min read

Article

Guide to Multidimensional Scaling in Python with Scikit-Learn

In this guide, we'll dive into a dimensionality reduction, data embedding and data visualization technique known as Multidimensional Scaling (MDS). We'll be utilizing Scikit-Learn to perform Multidimensional Scaling, as it has a wonderfully simple and powerful API. Throughout the guide, we'll be using the Olivetti faces dataset from AT&...

Mehreen Saeed

Aug 24, 2021·16 min read

Article

Calculating Spearman's Rank Correlation Coefficient in Python with Pandas

This guide is an introduction to Spearman's rank correlation coefficient, its mathematical calculation, and its computation via Python's pandas library. We'll construct various examples to gain a basic understanding of this coefficient and demonstrate how to visualize the correlation matrix via heatmaps. What Is the Spearman Rank Correlation Coefficient? Spearman...

Mehreen Saeed

Aug 23, 2021·20 min read