Breast Cancer Classification Notebook

<h3 id="introduction">Introduction</h3>
Welcome to our guided project on Breast Cancer Classification with Keras and TensorFlow. We'll be diving into a hands-on project, from start to finish, considering what the challenge is and what the reward would be for solving it. Specifically, we'll be classifying benign and malignant Invasive Ductal Carcinoma from histopathology images. If you're unfamiliar with this terminology - no need to worry, it's covered in the guided project.
We'll start out by performing Domain Research, and getting familiar with the domain we're trying to solve a problem in. We'll then proceed with Exploratory Data Analysis, and begin the standard Machine Learning Workflow. For this guide, we'll both build a CNN from scratch and use pre-defined architectures (such as the EfficientNet family, or ResNet family). Once we benchmark the most promising baseline model - we'll perform hyperparameter tuning, and evaluate the model.
As Data Scientists and Machine Learning Engineers - we're exploring prospects of applying Machine Learning algorithms to various domains, and extracting knowledge from data.
<h3 id="machinelearninginmedicine">Machine Learning in Medicine</h3>
Machine Learning has been increasingly employed in medicine, and is helping save lives from a wide variety of medical conditions. The application of Machine Learning in Medicine is vast, and an extremely complex topic in and of itself, but some of the major areas include:
<ul>
<li>Precision Medicine (Tailoring medicine to individuals)</li>
<li>Medical Imaging Diagnosis (Diagnosing conditions based on images, etc.)</li>
<li>Drug Discovery (Generating structures such as proteins or drug-like molecules, bioactivity prediction, etc.)</li>
</ul>
<blockquote>
Precision Medicine is a movement that focuses on personalized, precise medicine, which naturally builds upon robust datasets, unique to the individuals receiving treatment.
</blockquote>
Instead of a one-size-fits-all approach, which has been employed so far, Precision Medicine aims to tailor treatment to an individual based on their lifestyle, environment and genetics. Naturally, Precision Medicine is built on top of robust datasets, generated from the ever-increasing list of gadgets and devices we can use to monitor health.
<blockquote>
Medical Imaging Diagnosis is a field which is helping automate and even improve the accuracy of diagnosis based on medical imagery.
</blockquote>
Figuring out whether someone's afflicted with a given ailment is difficult. It takes years of practice, intuition and experience to diagnose with a relative level of certainty whether someone's afflicted with a condition or not based on medical imagery. Automating this process has significant implications for the speed of diagnosis - and the faster someone can get diagnosed, the faster they can receive treatment. In some cases, this time can be of the essence.
<blockquote>
Drug Discovery is a field in which we utilize Machine Learning methods, as well as computational aids to search the landscape of chemical compounds, and predict their properties in an environment as complex as the human body.
</blockquote>
Drug Discovery with Machine Learning is a new, up-and-coming field, with major financial and temporal implications. Designing a drug can take years if not decades, and in-vitro/in-vivo studies take place in real time, under varying conditions. Delegating any of these tasks to computational methods can create an environment in which we can perform rapid drug design, and provide remedies to new conditions faster than ever.
<blockquote>
As a Machine Learning practitioner, you can help make a difference.
</blockquote>
In this guided project, we'll be working within the field of Medical Imaging Diagnosis, tackling the classification of one of the major groups of cancer - breast cancer.
<h3 id="challengeproblemstatement">Challenge/Problem Statement</h3>
Let's take a moment to define the problem we're trying to solve:
<blockquote>
Cancer is oftentimes physically noticeable in tissue and can be more easily treatable when detected early. Histology studies tissues, and Pathology studies diseases. Histopathology studies diseases in tissues! Pathologists examine images of tissue (histology images), and come up with a verdict. Cancer kills up to <a rel="nofollow noopener noreferrer" target="_blank" href="https://ourworldindata.org/cancer#deaths-from-cancer">10M people every year</a> and is one of the leading causes of death globally. Alongside lung, colon and stomach cancer - breast cancer kills 700,000 people every year. Certain areas might not have the equipment or schooling necessary to make diagnosis a swift procedure, so patients might have to travel to get diagnosed, prolonging the period in which they can't receive treatment.
</blockquote>
In a sense, pathologists are performing classification (positive, or negative) based on patterns and occurrences in the images (visual features). This is a long and laborious process, and requires experience.
<img src="https://s3.stackabuse.com/media/guided+projects/breast-cancer-classification-with-keras-and-tensorflow-custom-cnns-transfer-learning-2.png" alt="">
<h3 id="rewardforsolvingtheproblem">Reward for Solving the Problem</h3>
<blockquote>
Fast, accurate and early diagnosis improves the probability of survival. Machine Learning models can be deployed globally or locally, and can process large volumes of data in a fraction of the time it takes humans. On various occasions - it's been shown that Machine Learning models, when trained right, can distinguish features better than humans, and can perform image classification to a higher degree of accuracy, even without much context and/or with low image resolution.
</blockquote>
According to <a rel="nofollow noopener noreferrer" target="_blank" href="https://my.clevelandclinic.org/">Cleveland Clinic</a>:
<blockquote>
Invasive ductal carcinoma is quite curable, especially when detected and treated early. The five-year survival rate for localized invasive ductal carcinoma is high — nearly 100% when treated early on. If the cancer has spread to other tissues in the region, the five-year survival rate is 86%. If the cancer has metastasized to distant areas of your body, the five-year survival rate is 28%.
</blockquote>
A quick calculation shows that early detection can save up to 400,000 lives annually. There's a large incentive to provide global, accessible, accurate and swift diagnosis tools, especially in regions where expertise is hard to come by. With automation-prone tasks out of the way, doctors can focus on what they're the best at - administering medicine, observing the effects and steering the procedure to help patients.
<img src="https://s3.stackabuse.com/media/guided+projects/breast-cancer-classification-with-keras-and-tensorflow-custom-cnns-transfer-learning-1.png" alt="">
<h3 id="domainresearchandexposition">Domain Research and Exposition</h3>
Let's take a moment to get familiar with the domain. When trying to solve a problem in any domain, you have to have at least rudimentary knowledge of what you're trying to solve, why you're trying to solve it, and what the data means in the context of the domain. Without knowing anything about the domain - it's hard to tell whether a model is really working or not. As a rule of thumb - it's best to consult with someone in the field, and get their input, especially in the later stages of model development.
Though - to get started, you'll typically be on your own, so being able to quickly get a grasp of some of the basic concepts is crucial!
Invasive Ductal Carcinoma (IDC) is by far the most common breast cancer subtype, accounting for 80% of the cases. By tackling this one subtype alone, we can address 80% of the cases.
Tumors are bundles of cells that aren't supposed to bundle, and grow into solid lumps. Tumors can be benign (non-cancerous), confined to a particular region, and might not cause any problems. They could grow and cause problems through sheer size, though. If a tumor starts growing outside of the confines of the group of cells - it becomes malignant (cancerous). Cancer can invade local tissue or metastasize and attack further tissue. There's much more to be said about tumors and cancer, including subtypes and their degrees, but the dataset we're working with simply classifies images as non-cancerous (benign) and cancerous (malignant).
For this specific dataset, there's surprisingly little medical knowledge you need to have in order to build a capable classifier. This is in large part due to the fact that it took hundreds of years of cumulative scientific knowledge on behalf of doctors of medicine to label and prepare datasets from which we can infer knowledge. It is on their experience and expertise that we're able to build models for feature extraction and classification, with high degrees of accuracy and integrity.
To a Machine Learning engineer - this task almost boils down to regular image classification! However, there are certain implications that come with this dataset, that are scarcely present in other datasets you might've worked with before. We'll specifically focus on making educated guesses in a later section, while contemplating class imbalance, augmentation, cost-sensitive learning, etc.
That being said - let's jump into the data!

David Landup

Jovana Ninkovic

Machine Learning in Medicine

<h3 id="loadingthedata">Loading the Data</h3>
We'll start out by downloading the dataset and loading it in. We'll be working with the <a target="_blank" rel="nofollow noopener noreferrer" href="https://www.kaggle.com/paultimothymooney/breast-histopathology-images">Breast Histopathology Images</a> dataset. It contains 198,738 IDC(-) image patches and 78,786 IDC(+) image patches.
<ul>
<li>IDC(-) refers to benign cases</li>
<li>IDC(+) refers to malignant cases</li>
</ul>

 <div class="alert alert-note">
 <div class="flex">
 
 <div class="flex-shrink-0 mr-3">
 <img src="/assets/images/icon-information-circle-solid.svg" class="icon" aria-hidden="true" />
 </div>
 
 Note: IDC(-) in this dataset implies that the patient doesn't have Invasive Ductal Carcinoma. It implies that they have a benign case or normal tissue, rather than a malignant case. Besides IDC, another condition exists - Non-Invasive Ductal Carcinoma also known as Ductal carcinoma in situ (DCIS).

 </div>
 </div>
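The two patch counts quoted above already tell us the classes are imbalanced. As a rough sketch (not necessarily the approach taken later in this project), we can compute each class's share of the data and the weights produced by the common "balanced" heuristic, <code>total / (n_classes * class_count)</code>. The variable names below are illustrative assumptions:

```python
# Patch counts quoted above for the Breast Histopathology Images dataset
counts = {0: 198738, 1: 78786}  # 0 = IDC(-), 1 = IDC(+)

total = sum(counts.values())

# Share of the dataset taken up by each class
proportions = {label: n / total for label, n in counts.items()}

# "Balanced" heuristic: rarer classes get proportionally larger weights
class_weights = {label: total / (len(counts) * n) for label, n in counts.items()}

for label in counts:
    print(f"class {label}: {proportions[label]:.1%} of patches, "
          f"weight {class_weights[label]:.2f}")
```

Roughly 28% of the patches are IDC(+), so a weighting like this would make each positive patch count about two and a half times as much as a negative one. Whether that's desirable depends on the cost of each kind of error.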
The dataset comes from a 2016 study - <a target="_blank" rel="nofollow noopener noreferrer" href="https://pubmed.ncbi.nlm.nih.gov/27563488/">&quot;Deep learning for digital pathology image analysis: A comprehensive tutorial with selected use cases&quot;</a> by Andrew Janowczyk and Anant Madabhushi. Their study focused on several tasks, one of which was IDC classification, for which they reported an F-score of 0.7648 on 50k testing patches.
The dataset we're working with is derived from 279 patients, each of which has a unique ID. Each patient has a dedicated folder, named by their ID, with two subfolders - <code>0</code> and <code>1</code>. The folder named <code>0</code> consists of images of benign tissue samples (those without IDC markers). The folder named <code>1</code> consists of images of malignant tissue samples (those containing IDC markers).
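The folder layout described above can be walked with the standard library alone. The following is a minimal sketch, assuming the images are PNG files and that the dataset root contains one folder per patient ID. The function name <code>collect_patches</code> and the root path in the usage note are hypothetical:

```python
from pathlib import Path


def collect_patches(root):
    """Return (path, label) pairs for every PNG under root/PATIENT_ID/0 and root/PATIENT_ID/1."""
    samples = []
    for patient_dir in sorted(Path(root).iterdir()):
        if not patient_dir.is_dir():
            continue  # Skip stray files next to the patient folders
        for label in (0, 1):
            class_dir = patient_dir / str(label)
            if not class_dir.is_dir():
                continue  # A patient may lack one of the two subfolders
            for image_path in sorted(class_dir.glob("*.png")):
                samples.append((str(image_path), label))
    return samples
```

Calling <code>collect_patches("path/to/dataset")</code> would return a list of path and label pairs, which is convenient for building a <code>DataFrame</code> or for loading images lazily instead of all at once.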

Exploratory Data Analysis

We've done Exploratory Data Analysis and got familiar with the dataset we're working with. Now - it's time to hop into the standard Machine Learning Workflow, starting with preprocessing data.
<h4 id="datapreprocessing">Data Preprocessing</h4>
We've worked with <code>DataFrame</code>s so far, but without the images themselves - we only stored their paths in case we wanted to retrieve and plot them. One way to load images is to simply iterate through the data and load them in:
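<pre><code class="hljs">import cv2

x = []
y = []

# `data` is assumed to be the sequence of image paths built during EDA
# Loading in 1000 images
for i in data[:1000]:
    if i.endswith(&#x27;.png&#x27;):
        # The label is the character right before the &#x27;.png&#x27; extension
        label = i[-5]
        img = cv2.imread(i)
        if img is None:
            # cv2.imread returns None for unreadable files - skip them
            continue
        # Transformation steps, such as resizing
        img = cv2.resize(img, (200, 200))
        x.append(img)
        y.append(label)
</code></pre>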
<code>x</code> and <code>y</code> are Python lists - which are very efficient at appending data at the cost of higher memory usage. Let's convert them to NumPy arrays, split them into a training and testing set, and call the garbage collection module to clear <code>x</code> and <code>y</code> from memory since we won't be using them anymore:
<pre><code class="hljs">import gc

import numpy as np
from sklearn.model_selection import train_test_split

# Reduce from float32 for memory footprint
x = np.array(x, dtype=&#x27;float16&#x27;)
y = np.array(y, dtype=&#x27;float16&#x27;)

x_train, x_test, y_train, y_test = train_test_split(x, y, shuffle=True, test_size=0.3)

# Drop the references to the full arrays and trigger garbage collection
x = None
y = None
gc.collect()
</code></pre>
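To see why the choice of <code>dtype</code> matters, here's a back-of-the-envelope sketch of the memory needed for the image array above, assuming 1000 images of 200x200 pixels with 3 color channels (the default for <code>cv2.imread</code>). The helper name <code>array_megabytes</code> is an illustrative assumption:

```python
# Bytes needed to store a single value of each dtype
BYTES_PER_VALUE = {"uint8": 1, "float16": 2, "float32": 4}


def array_megabytes(n_images, height, width, channels, dtype):
    """Approximate size in MB (10^6 bytes) of a dense image array."""
    n_values = n_images * height * width * channels
    return n_values * BYTES_PER_VALUE[dtype] / 1e6


for dtype in BYTES_PER_VALUE:
    size = array_megabytes(1000, 200, 200, 3, dtype)
    print(f"{dtype}: {size:.0f} MB")
```

That's 120 million values, so <code>float16</code> takes about 240 MB where <code>float32</code> would take about 480 MB. Loading the full dataset this way would scale those numbers up accordingly.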

Machine Learning Workflow

<h3 id="conclusion">Conclusion</h3>
That would wrap up this Guided Project. Thank you for coming along on the ride, and I hope that you've learned how you can put your Machine Learning skills to very good use in the field of Medical Diagnosis.
In this Guided Project - we've started out with exploring some of the applications of Machine Learning in medicine, followed by an introduction to the topic of the project. With an engineering mindset - we've considered the problem at hand, and why performing the task of classifying breast cancer is expensive and difficult, as well as how we could remedy this, and what the rewards and benefits of solving the problem are.
Then, we've loaded in the dataset and performed Exploratory Data Analysis, and got familiar with the data in the domain. Only then, we've delved into the standard Machine Learning Workflow, starting with Data Preprocessing. We've explored what Class Imbalance is, and whether it poses an issue for us, as well as cases in which it would. We've explored the possibilities of removing class imbalance for projects that would be negatively affected by it, and considered the implications of what the imbalance would bring us, noting them down for making educated guesses down the line.

Ending Note

<h4 id="wanttowritearesearchgradedeeplearningclassifier">Want to write a research-grade Deep Learning classifier?</h4>
As Data Scientists and Machine Learning Engineers - we're exploring prospects of applying Machine Learning algorithms to various domains and extracting knowledge from data. Fast, accurate and early diagnosis of cancer improves the probability of survival, and early Breast Cancer diagnosis can save up to 400,000 lives every year. Machine Learning models can be deployed globally or locally, and can process large volumes of data in a fraction of the time it takes humans.
Invasive Ductal Carcinoma (IDC) is the most common subtype of all breast cancers. Breast cancer is the most common form of cancer in women. Accurately identifying and categorizing breast cancer subtypes is an important clinical task, and automated methods can be used to save time and reduce error.
In our Guided Project, Breast Cancer Classification with Keras and TensorFlow, we'll be diving into a hands-on project, from start to finish, considering what the challenge is and what the reward would be for solving it. Specifically, we'll be classifying benign and malignant Invasive Ductal Carcinoma from histopathology images. If you're unfamiliar with this terminology - no need to worry, it's covered in the guided project.

 <div class="alert alert-note">
 <div class="flex">
 
 <div class="flex-shrink-0 mr-3">
 <img src="/assets/images/icon-information-circle-solid.svg" class="icon" aria-hidden="true" />
 </div>
 
 Note: This Guided Project is part of our in-depth course on <a target="_blank" href="https://stackabuse.com/courses/practical-deep-learning-for-computer-vision-with-python/">Practical Deep Learning for Computer Vision</a> and assumes that you've read the previous lessons or have that prerequisite knowledge from before.

 </div>
 </div>