The Role of GPUs in Deep Learning
GPUs, or Graphics Processing Units, are important pieces of hardware originally designed for rendering computer graphics, primarily for games and movies. However, in recent years, GPUs have gained recognition for significantly enhancing the speed of computational processes involving neural networks.
GPUs now play a pivotal role in the artificial intelligence revolution, predominantly driving rapid advancements in deep learning, computer vision, and large language models, among others.
In this article, we will delve into the utilization of GPUs to expedite neural network training using PyTorch, one of the most widely used deep learning libraries.
Note: An NVIDIA GPU-equipped machine is required to follow the instructions in this article.
Inroduction to GPUs with PyTorch
PyTorch is an open-source, simple, and powerful machine-learning framework based on Python. It is used to develop and train neural networks by performing tensor computations like automatic differentiation using the Graphics Processing Units.
PyTorch employs the CUDA library to configure and leverage NVIDIA GPUs.
CUDA is a GPU computing toolkit developed by Nvidia, designed to expedite compute-intensive operations by parallelizing them across multiple GPUs. PyTorch offers support for CUDA through the
Utilising GPUs in Torch via the CUDA Package
The CUDA library in PyTorch is instrumental in detecting, activating, and harnessing the power of GPUs. Let's delve into some functionalities using PyTorch.
Verifying GPU Availability
Before using the GPUs, we can check if they are configured and ready to use. The following code returns a boolean indicating whether GPU is configured and available for use on the machine.
import torch print(torch.cuda.is_available())
The number of GPUs present on the machine and the device in use can be identified as follows:
This output indicates that there is a single GPU available, and it is identified by the device number 0.
Initialize the Device
The active device can be initialized and stored in a variable for future use, such as loading models and tensors into it. This step is necessary if GPUs are available because CPUs are automatically detected and configured by PyTorch.
torch.device function can be used to select the device.
"cuda" if torch.cuda.is_available() else "cpu") devicedevice = torch.device(
With the device variable, we can now create and move tensors into it.
Creating and Moving tensors to the GPU
The models and datasets are represented as PyTorch tensors, which must be initialized on, or transferred to, the GPU prior to training the model. This can be accomplished in several ways, as outlined below:
- Creating Tensors Directly on the GPU
Tensors can be directly created on the desired device, such as the GPU, by specifying the
device parameter. By default, tensors are created on the CPU. You can determine the device where the tensor is stored by accessing the
device parameter of the tensor.
x = torch.tensor([1, 2, 3]) print(x) print("Device: ", x.device)
tensor([1, 2, 3]) Device: cpu
Now, let's generate the tensors directly on the device.
y = torch.tensor([4, 5, 6], device=device) print(y) print("Device: ", y.device)
tensor([4, 5, 6], device='cuda:0') Device: cuda:0
Lastly, the device number where the tensors are stored can be retrieved using the
In the output above,
-1 represents the CPU, while
0 represents GPU number 0.
- Transferring Tensors Using the
Tensors can be transferred from the CPU to the device using the
to() method, which is supported by PyTorch tensors.
Free eBook: Git Essentials
Check out our hands-on, practical guide to learning Git, with best-practices, industry-accepted standards, and included cheat sheet. Stop Googling Git commands and actually learn it!
x = torch.tensor([1, 2, 3]) x = x.to(device) print("Device: ", x.device) print(x.get_device())
Device: cuda:0 0
When multiple GPUs are available, tensors can be transferred to specific GPUs by passing the device number as a parameter.
cuda:0 is for the first GPU,
cuda:1 for the second GPU, and so on.
# Transfer to the first GPU x = torch.tensor([8, 9, 10]) x = x.to("cuda:0") print(x.device)
Attempting to transfer to a GPU that is not available or to an incorrect GPU number will result in a
- Transferring Tensors Using the
Below is an example of creating a sample tensor and transferring it to the GPU using the
cuda() method, which is supported by PyTorch tensors.
# Create a random tensor of shape (100, 30) tensor = torch.rand((100, 30)) tensor = tensor.cuda() print(tensor.device)
Now let's explore techniques to load the tensors into multiple GPUs through parallelisation i.e. one of the most important features responsible for high computational speeds in GPUs.
Multi-GPU Distributed Training
Distributed training involves deploying both the model and the dataset across multiple GPUs, thereby dramatically accelerating the training process via the capability of parallelization. We will cover some of the distributed training classes offered by PyTorch in the following sections.
DataParallel is an effective way for conducting multi-GPU training of models on a single machine. It achieves data parallelization at the module level by dividing the input across the designated devices via chunking, and then propagating it through the model by replicating the inputs on all devices.
Let's create and initialise a basic
LinearRegression model class prior to wrapping it within the
import torch.nn as nn class LinearRegression(nn.Module): def __init__(self, input_size, output_size): super(LinearRegression, self).__init__() self.linear = nn.Linear(input_size, output_size) def forward(self, x): return self.linear(x) # Initialize the model model = LinearRegression(2, 5) print(model)
LinearRegression( (linear): Linear(in_features=2, out_features=5, bias=True) )
Now, let's wrap the model to execute data parallelization across multiple GPUs. This can be achieved by utilizing the
nn.DataParallel class and passing the model along with the device list as parameters.
model = nn.DataParallel(model, device_ids=) print(model)
DataParallel( (module): LinearRegression( (linear): Linear(in_features=2, out_features=5, bias=True) ) )
In the above code, we have passed the model along with the list of device ids as parameters. Now we can proceed by directly loading the model on to device and perform model training as required.
# Move the model and inputs to GPU model = model.to(device) input_data = input_data.to(device) # Continue with Training Loop # ...
DistributedDataParallel class from PyTorch supports training across multiple GPU training on multiple machines. The
DistributedDataParallel class is recommended over the
DataParallel class, as it manages single machine scenarios by default and exhibits superior speed compared to the
DistributedDataParallel module operates on the principle of data parallelism. Here, the model and data are duplicated across multiple processes, and each process conducts training on a data subset.
Setting up the
DistributedDataParallel class entails initializing the distributed environment and subsequently wrapping the model with the DDP object.
# Imports import torch import torch.nn as nn
# Define the Linear Regression model class LinearRegression(nn.Module): def __init__(self, input_size, output_size): super(LinearRegression, self).__init__() self.linear = nn.Linear(input_size, output_size) def forward(self, x): return self.linear(x) # Initialize the model model = LinearRegression(2, 5)
# Initialize the distributed environment. torch.distributed.init_process_group(backend='nccl') # Wrap the model with DDP model = nn.parallel.DistributedDataParallelCPU(model) # Proceed to load and train the model # ...
This provides a basic wrapper to load the model for multi-GPU training across multiple nodes.
In this article, we've explored various methods to leverage NVIDIA GPUs using the
CUDA library in the PyTorch ML library. These strategies help us harness the power of robust GPUs, accelerating the model training process by a factor of ten compared to traditional CPUs in deep learning applications. This significant reduction in training time expedites a broad array of compute-intensive tasks.