Machine Learning in Medicine
Welcome to our guided project on Breast Cancer Classification with Keras and TensorFlow. We'll be diving into a hands-on project, from start to finish, considering what the challenge is and what the reward for solving it would be. Specifically, we'll be classifying histopathology images as benign or malignant for Invasive Ductal Carcinoma. If you're unfamiliar with this terminology - no need to worry, it's covered in the guided project.
We'll start out by performing Domain Research, getting familiar with the domain we're trying to solve a problem in. We'll then proceed with Exploratory Data Analysis, and begin the standard Machine Learning Workflow. For this guide, we'll both build a CNN from scratch and use pre-defined architectures (such as the EfficientNet or ResNet families). Once we've benchmarked the models and picked the most promising baseline - we'll perform hyperparameter tuning, and evaluate the model.
As Data Scientists and Machine Learning Engineers - we're exploring prospects of applying Machine Learning algorithms to various domains, and extracting knowledge from data.
Machine Learning is increasingly employed in medicine, and is helping save lives across a wide variety of medical conditions. The range of applications of Machine Learning in Medicine is vast, and an extremely complex topic in and of itself, but some of the major areas include:
- Precision Medicine (Tailoring medicine to individuals)
- Medical Imaging Diagnosis (Diagnosing conditions based on images, etc.)
- Drug Discovery (Generating structures such as proteins or drug-like molecules, bioactivity prediction, etc.)
Precision Medicine is a movement that focuses on personalized, precise medicine, which naturally builds upon robust datasets, unique to the individuals receiving treatment.
Instead of the one-size-fits-all approach that has been employed so far, Precision Medicine aims to tailor treatment to an individual based on their lifestyle, environment and genetics. The datasets it relies on are generated from the ever-increasing list of gadgets and devices we can use to monitor health.
Medical Imaging Diagnosis is a field which is helping automate and even improve the accuracy of diagnosis based on medical imagery.
Figuring out whether someone's afflicted with a given ailment is difficult. It takes years of practice, intuition and experience to diagnose, with a reasonable level of certainty, whether someone's afflicted with a condition based on medical imagery. Automating this process has significant implications for the speed of diagnosis - and the faster someone can get diagnosed, the faster they can receive treatment. In some cases, time is of the essence.
Drug Discovery is a field in which we utilize Machine Learning methods, as well as computational aids, to search the landscape of chemical compounds, and predict their properties in an environment as complex as the human body.
Drug Discovery with Machine Learning is a new, up-and-coming field, with major financial and temporal implications. Designing a drug can take years if not decades, and in-vitro/in-vivo studies take place in real time, under varying conditions. Delegating any of these tasks to computational methods can create an environment in which we can perform rapid drug design, and provide remedies to new conditions faster than ever.
As a Machine Learning practitioner, you can help make a difference.
In this guided project, we'll be working within the field of Medical Imaging Diagnosis, tackling the classification of one of the major groups of cancer - breast cancer.
Let's take a moment to define the problem we're trying to solve:
Cancer is oftentimes physically noticeable in tissue and can be more easily treated when detected early. Histology studies tissues, and Pathology studies diseases. Histopathology studies diseases in tissues! Pathologists examine images of tissue (histology images), and come up with a verdict. Cancer kills up to 10M people every year and is one of the leading causes of death globally. Breast cancer alone kills around 700,000 people every year, placing it alongside lung, colon and stomach cancer among the deadliest. Certain areas might not have the equipment or schooling necessary to make diagnosis a swift procedure, so patients might have to travel to get diagnosed, prolonging the period during which they go without treatment.
In a sense, pathologists are performing classification (positive, or negative) based on patterns and occurrences in the images (visual features). This is a long and laborious process, and requires experience.
Reward for Solving the Problem
Fast, accurate and early diagnosis improves the probability of survival. Machine Learning models can be deployed globally or locally, and can process large amounts of data in a fraction of the time it takes humans. On various occasions - it's been shown that Machine Learning models, when trained right, can distinguish features better than humans, and can perform image classification to a higher degree of accuracy, even without much context or with low image resolution.
According to Cleveland Clinic:
Invasive ductal carcinoma is quite curable, especially when detected and treated early. The five-year survival rate for localized invasive ductal carcinoma is high — nearly 100% when treated early on. If the cancer has spread to other tissues in the region, the five-year survival rate is 86%. If the cancer has metastasized to distant areas of your body, the five-year survival rate is 28%.
A quick calculation shows that early detection can save up to 400,000 lives annually. There's a large incentive to provide global, accessible, accurate and swift diagnosis tools, especially in regions where expertise is hard to come by. With automation-prone tasks out of the way, doctors can focus on what they do best - administering medicine, observing the effects and steering the procedure to help patients.
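The text doesn't spell out the calculation, so the following is only one plausible reconstruction, not necessarily the one used here. It assumes the roughly 700,000 annual breast cancer deaths cited earlier, the 80% share of IDC mentioned in the next section, and the gap between the near-100% (localized) and 28% (distant) five-year survival rates quoted above. Treating that gap as the share of deaths that early detection could avert is a crude simplification, so the result is best read as a loose upper bound:

```python
# Back-of-the-envelope estimate; all variable names are illustrative.
# Integer percentages are used to keep the arithmetic exact.
annual_breast_cancer_deaths = 700_000   # figure cited earlier in the text
idc_share_pct = 80                      # IDC share of breast cancer cases
survival_localized_pct = 100            # "nearly 100%" when detected early
survival_distant_pct = 28               # once metastasized to distant areas

# Deaths attributable to IDC, assuming deaths follow the same 80% split as cases
idc_deaths = annual_breast_cancer_deaths * idc_share_pct // 100

# Assumption: the survival gap approximates the share of deaths
# that earlier detection could prevent
survival_gap_pct = survival_localized_pct - survival_distant_pct
potentially_saved = idc_deaths * survival_gap_pct // 100

print(f"IDC deaths per year: {idc_deaths:,}")               # 560,000
print(f"Potentially saved per year: {potentially_saved:,}")  # 403,200
```

Under these assumptions, the estimate lands at roughly 400,000, which is consistent with the "up to" figure above.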
Domain Research and Exposition
Let's take a moment to get familiar with the domain. When trying to solve a problem in any domain, you have to have at least rudimentary knowledge of what you're trying to solve, why you're trying to solve it, and what the data means in the context of the domain. Without knowing anything about the domain - it's hard to tell whether a model is really working or not. As a rule of thumb - it's best to consult with someone in the field, and get their input, especially in the later stages of model development.
Though - to get started, you'll typically be on your own, so being able to quickly get a grasp of some of the basic concepts is crucial!
Invasive Ductal Carcinoma (IDC) is by far the most common breast cancer subtype, accounting for 80% of cases. By tackling this one subtype alone, we can address the large majority of breast cancer diagnoses.
Tumors are bundles of cells that aren't supposed to bundle, and grow into solid lumps. Tumors can be benign (non-cancerous), confined to a particular region, and might not cause any problems, though they could grow and cause problems through sheer size. If a tumor starts growing outside of the confines of the group of cells - it becomes malignant (cancerous). Cancer can invade local tissue or metastasize and attack more distant tissue. There's much more to be said about tumors and cancer, including their subtypes and degrees, but the dataset we're working with simply classifies images as non-cancerous (benign) and cancerous (malignant).
For this specific dataset, there's surprisingly little medical knowledge you need to have in order to build a capable classifier. This is in large part due to the fact that it took hundreds of years of cumulative scientific knowledge from doctors of medicine to label and prepare the datasets from which we can infer knowledge. It is thanks to their experience and expertise that we're able to build models for feature extraction and classification, with high degrees of accuracy and integrity.
To a Machine Learning engineer - this task almost boils down to regular image classification! However, this dataset comes with certain implications that are rarely present in other datasets you might've worked with before. We'll focus on these in a later section, making educated guesses while contemplating class imbalance, augmentation, cost-sensitive learning, etc.
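As a small preview of what cost-sensitive learning can look like, here's a minimal sketch that weights each class inversely to its frequency, using the common "balanced" heuristic `n_samples / (n_classes * class_count)`. The label distribution below is made up for illustration and doesn't reflect the actual dataset, and the helper name `balanced_class_weights` is our own:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency:
    n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical, imbalanced labels: 0 = benign, 1 = malignant
labels = [0] * 75 + [1] * 25

class_weights = balanced_class_weights(labels)
print(class_weights)  # the rarer malignant class receives the larger weight
```

A dictionary like this can be passed to Keras through the `class_weight` argument of `model.fit()`, so that mistakes on the rarer class contribute more to the loss. Whether this particular weighting suits our data is something we'd only decide after exploring it.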
That being said - let's jump into the data!