Machine Learning Workflow
We've done Exploratory Data Analysis and gotten familiar with the dataset we're working with. Now it's time to move into the standard Machine Learning Workflow, starting with data preprocessing.
Data Preprocessing
We've worked with DataFrames so far, though this was all without images: we only stored their paths in case we wanted to retrieve and plot them. One way to load images is to simply iterate through the data and load them in:
import cv2

x = []
y = []

# Loading in 1000 images
for i in data[:1000]:
    if i.endswith('.png'):
        label = i[-5]
        img = cv2.imread(i)
        # Transformation steps, such as resizing
        img = cv2.resize(img, (200, 200))
        x.append(img)
        y.append(label)
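The label in the loop above is read with i[-5], which is the single character just before the .png extension. The same idea can be sketched with os.path, which makes the intent a little more explicit. This is only an illustration: the helper name label_from_path and the example paths are hypothetical, and it assumes file names whose last character before the extension is the class label. Unlike the endswith('.png') check above, this version also accepts an upper-case .PNG extension.

```python
import os

def label_from_path(path):
    """Return the last character of the file name stem, or None for non-PNG files."""
    stem, ext = os.path.splitext(os.path.basename(path))
    if ext.lower() != '.png' or not stem:
        return None
    return stem[-1]

# Hypothetical paths, for illustration only
paths = ['images/sample_1.png', 'images/sample_0.png', 'images/notes.txt']
labels = [label_from_path(p) for p in paths]
print(labels)
```

Files that are not PNGs yield None here, so they can be filtered out before any image is read.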
x and y are Python lists, which are very efficient at appending data at the cost of higher memory usage. Let's convert them to NumPy arrays, split them into a training and a testing set, and call the garbage collection module to clear x and y from memory, since we won't be using them anymore:
import gc

import numpy as np
from sklearn.model_selection import train_test_split

# Use float16 instead of float32 to reduce the memory footprint
x = np.array(x, dtype='float16')
y = np.array(y, dtype='float16')

x_train, x_test, y_train, y_test = train_test_split(x, y, shuffle=True, test_size=0.3)

# Drop the references to the full arrays and trigger garbage collection
x = None
y = None
gc.collect()
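To get a feel for why the dtype matters, here is a rough back-of-the-envelope sketch of the array size for different dtypes. It assumes all 1000 images load successfully and have 3 channels (which is what cv2.imread returns by default); the helper array_megabytes is hypothetical and only does arithmetic, without allocating any arrays.

```python
# Bytes per element for the dtypes discussed here
ITEMSIZE = {'uint8': 1, 'float16': 2, 'float32': 4}

def array_megabytes(n_images, height, width, channels, dtype):
    """Approximate size in MB of a dense image array of the given dtype."""
    n_elements = n_images * height * width * channels
    return n_elements * ITEMSIZE[dtype] / 1e6

# 1000 images resized to 200x200 with 3 channels
for dtype in ('uint8', 'float16', 'float32'):
    print(dtype, array_megabytes(1000, 200, 200, 3, dtype), 'MB')
```

Under these assumptions, float16 halves the footprint compared to float32, though it is still twice the size of the uint8 arrays that cv2.imread produces.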