Exploratory Data Analysis
Loading the Data
We'll start by downloading the dataset and loading it. We'll be working with the Breast Histopathology Images dataset, which contains 198,738 IDC(-) image patches and 78,786 IDC(+) image patches.
- IDC(-) refers to benign cases
- IDC(+) refers to malignant cases
Note: IDC(-) in this dataset means that the sampled tissue doesn't show Invasive Ductal Carcinoma. It is benign or normal tissue, rather than malignant. Besides IDC, there is also a non-invasive form of ductal carcinoma, known as ductal carcinoma in situ (DCIS).
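The two patch counts quoted above already tell us that the classes are imbalanced, which is worth keeping in mind when choosing metrics later. As a quick sanity check, the sketch below works out the total and the share of each class from those two numbers alone. The variable names are ours and purely illustrative.

```python
# Patch counts as quoted in the text (not read from disk).
n_negative = 198_738  # IDC(-) patches
n_positive = 78_786   # IDC(+) patches

total = n_negative + n_positive
positive_share = n_positive / total
negative_share = n_negative / total

print(f"Total patches: {total}")
print(f"IDC(+) share: {positive_share:.1%}")
print(f"IDC(-) share: {negative_share:.1%}")
print(f"IDC(-) to IDC(+) ratio: {n_negative / n_positive:.2f}")
```

A little over a quarter of the patches are IDC(+), so a model that always predicts IDC(-) would already look deceptively accurate.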
The dataset comes from a 2016 study - "Deep learning for digital pathology image analysis: A comprehensive tutorial with selected use cases" by Andrew Janowczyk and Anant Madabhushi. Their study covered several tasks, one of which was IDC classification, for which they reported an F-score of 0.7648 on 50k testing patches.
The dataset we're working with is derived from 279 patients, each of whom has a unique ID. Each patient has a dedicated folder, named by their ID, with two subfolders - 0 and 1. The folder named 0 consists of images of benign tissue samples (those without IDC markers). The folder named 1 consists of images of malignant tissue samples (those containing IDC markers).
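Given this folder layout, one way to check the class counts is to walk the patient folders and tally the files in each 0 and 1 subfolder. The following is a minimal sketch, not the loading code used later. The function name `count_patches`, the example root path, and the assumption that patches are stored as `.png` files are ours, so adjust them to match your download.

```python
from collections import Counter
from pathlib import Path


def count_patches(root):
    """Count image patches per class label ("0" or "1") across all patient folders."""
    counts = Counter()
    for patient_dir in Path(root).iterdir():
        if not patient_dir.is_dir():
            continue  # skip stray files next to the patient folders
        for label in ("0", "1"):
            class_dir = patient_dir / label
            if not class_dir.is_dir():
                continue  # tolerate a patient folder missing one class
            # Assumption: patches are PNG files; change the suffix if yours differ.
            counts[label] += sum(
                1 for p in class_dir.iterdir() if p.suffix.lower() == ".png"
            )
    return counts


# Example usage (hypothetical path, replace with wherever you extracted the data):
# counts = count_patches("data/breast-histopathology-images")
# print(counts["0"], counts["1"])
```

If the totals don't match the figures quoted earlier, the usual cause is pointing the root at the wrong directory level of the extracted archive.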