Importing and Data Exploration
The Intel Image Classification Dataset - Importing and Exploration
Let's work with the Intel Image Classification dataset. It's a great dataset to build on, since high accuracy doesn't come easily from the get-go. Additionally, some of its features are easy for a network to mix up, which makes it a great introduction to model evaluation and to learning what makes a model trip up and misclassify an image.
The dataset consists of 14k images for training and 3k for testing, sized at 150x150, with 6 classes: "buildings", "forest", "glacier", "mountain", "sea" and "street". As you might guess from the classes, it mainly consists of images of natural scenes (as much as buildings can be natural).
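To get a first feel for the class balance, you could count the images in each class folder. This is a minimal sketch using only the standard library. It assumes the extracted dataset has one subfolder per class; the `seg_train/seg_train` path, the `count_images_per_class()` helper, and the list of accepted file extensions are illustrative assumptions, so adjust them to match your local copy.

```python
from collections import Counter
from pathlib import Path

# The six class names listed in the dataset description.
CLASSES = ["buildings", "forest", "glacier", "mountain", "sea", "street"]

# Assumption: which file extensions count as images.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images_per_class(root):
    """Count image files in each class subfolder of `root`.

    Hypothetical helper for illustration. Classes whose folder
    is missing are reported with a count of 0.
    """
    counts = Counter()
    for class_name in CLASSES:
        class_dir = Path(root) / class_name
        if not class_dir.is_dir():
            counts[class_name] = 0
            continue
        counts[class_name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts


if __name__ == "__main__":
    # Assumed location of the extracted training split.
    train_root = Path("seg_train/seg_train")
    if train_root.is_dir():
        for name, n in count_images_per_class(train_root).items():
            print(f"{name:>10}: {n}")
    else:
        print(f"{train_root} not found, point train_root at your copy")
```

If the counts come out roughly even across the six classes, plain accuracy is a reasonable first metric. If they don't, per-class metrics become more important.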
It may sound like this should be a walk in the park, like Dogs vs. Cats classification. However, consider the classes again. A mountain often has a forest on it. A glacier often sits in the sea, and streets have buildings all around them! These classes, although seemingly totally different, are fairly intertwined.
Note: This phenomenon is known as co-occurrence. Some things are generally present alongside other things in images. A network might learn a co-occurring feature as part of a class even though it isn't one. Learning a co-occurring feature and assuming a class from it is similar to mixing up correlation with causation, and it's a major fallacy that results in a bias that can easily go unnoticed because of its very nature. Correctly identifying and evaluating wrong predictions, as well as visualizing what networks learn, is a great way to spot and remove this bias.
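One simple way to start evaluating wrong predictions is to tally which (true class, predicted class) pairs show up most often among the mistakes. The sketch below is only an illustration: `most_confused_pairs()` is a hypothetical helper, and the labels are made up for the example rather than taken from a real model.

```python
from collections import Counter


def most_confused_pairs(y_true, y_pred, top_k=3):
    """Return the most frequent (true, predicted) pairs among wrong predictions.

    Hypothetical helper for illustration. Correct predictions are ignored.
    """
    errors = Counter(
        (t, p) for t, p in zip(y_true, y_pred) if t != p
    )
    return errors.most_common(top_k)


# Made-up labels purely for illustration; not results from any real model.
y_true = ["glacier", "glacier", "mountain", "street", "sea", "glacier", "buildings"]
y_pred = ["mountain", "mountain", "glacier", "buildings", "sea", "glacier", "buildings"]

for (true_label, predicted_label), n in most_confused_pairs(y_true, y_pred):
    print(f"{true_label} predicted as {predicted_label}: {n}")
```

Pairs that dominate this tally, such as glacier and mountain in the toy data, are good candidates for a closer look at the misclassified images themselves, since co-occurring features are one possible cause.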