Guide to Fine-Tuning Open Source LLM Models on Custom Data


Most of you have probably heard of ChatGPT and tried it out to answer your questions. Ever wondered what happens under the hood? It's powered by a Large Language Model from the GPT series developed by OpenAI. These large language models, often referred to as LLMs, have unlocked many possibilities in Natural Language Processing.

What are Large Language Models?

LLMs are trained on massive amounts of text data, enabling them to capture the meaning and context of human language. Previously, most models were trained using the supervised approach, where we feed input features and corresponding labels. In contrast, LLMs are pre-trained through self-supervised learning (often loosely called unsupervised learning): they are fed enormous amounts of unlabeled text, without instructions, and learn by predicting missing or upcoming words. As a result, LLMs learn the meaning of words and the relationships between them efficiently. They can be used for a wide variety of tasks like text generation, question answering, translation from one language to another, and much more.

As a cherry on top, these large language models can be fine-tuned on your custom dataset for domain-specific tasks. In this article, I'll talk about the need for fine-tuning, the different LLMs available, and also show an example.

Understanding LLM Fine-Tuning

Let's say you run a diabetes support community and want to set up an online helpline to answer questions. A pre-trained LLM is trained on general text, so it may not give the best answers to domain-specific questions or understand medical terms and acronyms. This can be solved by fine-tuning.

What do we mean by fine-tuning? In brief: transfer learning! Large language models are trained on huge datasets using heavy resources and have millions or even billions of parameters. The representations and language patterns learned by the LLM during pre-training are transferred to your current task at hand. In technical terms, we initialize a model with the pre-trained weights, and then train it on our task-specific data to reach weights that are better optimized for the task. You can also make changes to the architecture of the model and modify the layers as per your needs.

Why Should You Fine-Tune Models?

  • Save time and resources: Fine-tuning needs far less training time and far fewer resources than training a model from scratch.
  • Reduced Data Requirements: If you want to train a model from scratch, you would need huge amounts of labeled data, which is often unavailable for individuals and small businesses. Fine-tuning can help you achieve good performance even with a smaller amount of data.
  • Customize to your needs: The pre-trained LLM may not catch your domain-specific terminology and abbreviations. For example, a general-purpose LLM may not recognize that "Type 1" and "Type 2" signify the types of diabetes, whereas a fine-tuned one can.
  • Enable continual learning: Let's say we fine-tuned our model on diabetes information data and deployed it. What if there's a new diet plan or treatment available that you want to include? You can use the weights of your previously fine-tuned model and adjust it to include your new data. This can help organizations keep their models up-to-date in an efficient manner.

Choosing an Open-Source LLM Model

The next step is to choose a large language model for your task. What are your options? Well-known large language models currently available include GPT-3, Bloom, BERT, T5, and XLNet. Among these, GPT-3 (Generative Pre-trained Transformer 3) has shown very strong performance, as it has 175 billion parameters and can handle diverse NLU tasks. However, GPT-3 fine-tuning can be accessed only through OpenAI's paid API and is relatively more expensive than the other options.

On the other hand, BERT is an open-source large language model and can be fine-tuned for free. BERT stands for Bidirectional Encoder Representations from Transformers. BERT does an excellent job of learning contextual word representations.

How do you choose?

If your task is more oriented towards text generation, GPT-3 (paid) or GPT-2 (open source) models would be a better choice. If your task falls under text classification, question answering, or named entity recognition, you can go with BERT. For my case of question answering on diabetes, I will proceed with the BERT model.

Preparing and Pre-processing your Dataset

This is the most crucial step of fine-tuning, as the format of the data varies based on the model and task. For this case, I have created a sample text document with information on diabetes that I obtained from the National Institutes of Health website. You can use your own data.

To fine-tune BERT for the task of question answering, converting your data into SQuAD format is recommended. SQuAD is the Stanford Question Answering Dataset, and its format is widely adopted for training NLP models on question-answering tasks. The data needs to be in JSON format, where each entry consists of:

  • context: The sentence or paragraph with text based on which the model will search for the answer to the question
  • question: The query we want the BERT to answer. You would need to frame these questions based on how the end user would interact with the QA model.
  • answers: You need to provide the desired answer under this field. It has two sub-components, text and answer_start. The text holds the answer string, whereas answer_start denotes the character index at which the answer begins in the context paragraph.
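The three fields above can be sketched in Python as follows. This is a minimal illustration, not the output of any particular tool: the helper name make_squad_entry, the sample sentence, and the "id" and "title" values are made up for the example. The nesting (data, paragraphs, qas, answers) is the layout that the loading code later in this article expects.

```python
import json

def make_squad_entry(context, question, answer_text, qa_id):
    """Build one SQuAD-style paragraph entry (hypothetical helper)."""
    # answer_start is the character offset of the answer inside the context
    answer_start = context.find(answer_text)
    if answer_start == -1:
        raise ValueError("answer text not found in context")
    return {
        "context": context,
        "qas": [{
            "id": qa_id,
            "question": question,
            "answers": [{"text": answer_text, "answer_start": answer_start}],
        }],
    }

# Made-up sample sentence, used only to show the structure
context = "Type 1 diabetes occurs when the pancreas produces little or no insulin."
entry = make_squad_entry(
    context,
    "What causes type 1 diabetes?",
    "the pancreas produces little or no insulin",
    "q1",
)
squad = {"data": [{"title": "diabetes", "paragraphs": [entry]}]}
print(json.dumps(squad, indent=2))
```

Computing answer_start with str.find only works when the answer string appears verbatim in the context; if it appears more than once, the first occurrence is used.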

As you can imagine, it would take a lot of time to create this data for your document if you were to do it manually. Don't worry, I'll show you how to do it easily with the Haystack annotation tool.

How to Create Data in SQuAD Format with Haystack?

Using the Haystack annotation tool, you can quickly create a labeled dataset for question-answering tasks. You can access the tool by creating an account on their site. Create a new project and upload your document. You can view it under the "Documents" tab; go to "Actions" and you will see an option to create your questions. Write your question and highlight the answer in the document, and Haystack will automatically find its starting index. I have shown how I did it on my document in the image below.

Fig. 1: Creating labeled dataset for Question-Answering with Haystack

When you are done creating enough question-answer pairs for fine-tuning, you should be able to see a summary of them. Under the "Export labels" tab, you can find multiple options for the format you want to export in. We choose the SQuAD format for our case. If you need more help in using the tool, you can check their documentation. We now have our JSON file containing the QA pairs for fine-tuning.
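Before training, it can be worth checking that every answer_start in the exported file really points at its answer text. The sketch below assumes the SQuAD-style nesting described earlier; the function name find_misaligned_answers and the in-memory sample are hypothetical, and in practice you would load your exported file with json.load instead.

```python
def find_misaligned_answers(squad):
    """Return (question, answer_text, answer_start) for every answer whose
    offset does not match the context (hypothetical helper)."""
    problems = []
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for ans in qa["answers"]:
                    start, text = ans["answer_start"], ans["text"]
                    # The slice at answer_start must reproduce the answer exactly
                    if context[start:start + len(text)] != text:
                        problems.append((qa["question"], text, start))
    return problems

# Made-up sample: the first answer is aligned, the second has a wrong offset
context = "Insulin is a hormone made by the pancreas."
sample = {"data": [{"paragraphs": [{
    "context": context,
    "qas": [
        {"question": "Which organ makes insulin?",
         "answers": [{"text": "the pancreas",
                      "answer_start": context.find("the pancreas")}]},
        {"question": "What is insulin?",
         "answers": [{"text": "a hormone", "answer_start": 0}]},
    ],
}]}]}

problems = find_misaligned_answers(sample)
print(problems)
```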

How to Fine-Tune?

Python offers many open-source packages you can use for fine-tuning. I used the PyTorch and Transformers packages for my case. Start by installing the packages with pip, the package manager, and then import the modules. The transformers library provides a BertTokenizer, which is specifically for tokenizing inputs to the BERT model.

# Install the packages (notebook syntax; drop the "!" in a regular shell)
!pip install torch
!pip install transformers

# Import the modules
import json
import torch
from transformers import BertTokenizer, BertForQuestionAnswering
from torch.utils.data import DataLoader, Dataset

Defining Custom Dataset for Loading and Pre-processing

The next step is to load and pre-process the data. You can use the Dataset class from PyTorch's torch.utils.data module to define a custom class for your dataset. I have created a custom dataset class, diabetes, as you can see in the code snippet below. The __init__ method is responsible for initializing the variables. The file_path argument takes the path of your JSON training file and is used to initialize data. We also initialize the BertTokenizer here.

Next, we define a load_data() function. This function will read the JSON file into a JSON data object and extract the context, question, answers, and their index from it. It appends the extracted fields into a list and returns it.

The __getitem__ method uses the BERT tokenizer to encode the question and context into input tensors, which are input_ids and attention_mask. The encode_plus method tokenizes the text and adds special tokens (such as [CLS] and [SEP]). Note that we use the squeeze() method to remove any singleton dimensions before passing the tensors to BERT. Finally, it returns the processed input tensors along with the start and end positions of the answer.

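class diabetes(Dataset):
    def __init__(self, file_path):
        self.data = self.load_data(file_path)
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

    def load_data(self, file_path):
        # Read the SQuAD-format JSON file and flatten it into a list of examples
        with open(file_path, 'r') as f:
            data = json.load(f)
        paragraphs = data['data'][0]['paragraphs']
        extracted_data = []
        for paragraph in paragraphs:
            context = paragraph['context']
            for qa in paragraph['qas']:
                question = qa['question']
                answer = qa['answers'][0]['text']
                start_pos = qa['answers'][0]['answer_start']
                extracted_data.append({
                    'context': context,
                    'question': question,
                    'answer': answer,
                    'start_pos': start_pos,
                })
        return extracted_data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        example = self.data[index]
        question = example['question']
        context = example['context']
        answer = example['answer']
        inputs = self.tokenizer.encode_plus(question, context, add_special_tokens=True, padding='max_length', max_length=512, truncation=True, return_tensors='pt')
        input_ids = inputs['input_ids'].squeeze()
        attention_mask = inputs['attention_mask'].squeeze()
        # answer_start is a character offset, but the model expects token positions.
        # Assumption: approximate the conversion by counting the tokens that precede
        # the answer in "[CLS] question [SEP] context [SEP]". This is exact only when
        # the answer starts on a token boundary and the question is not truncated.
        start_token = len(self.tokenizer.tokenize(question)) + 2
        start_token += len(self.tokenizer.tokenize(context[:example['start_pos']]))
        end_token = start_token + len(self.tokenizer.tokenize(answer)) - 1
        # Clamp in case truncation cut the answer out of the 512-token window
        last = input_ids.size(0) - 1
        start_pos = torch.tensor(min(start_token, last))
        end_pos = torch.tensor(min(end_token, last))
        return input_ids, attention_mask, start_pos, end_pos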

Once you define it, you can go ahead and create an instance of this class by passing the file_path argument to it.

# Create an instance of the custom dataset
file_path = 'diabetes.json'
dataset = diabetes(file_path)
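One detail in the dataset class deserves a closer look: SQuAD stores answer_start as a character offset, while a question-answering model is trained on token positions within the sequence [CLS] question [SEP] context [SEP]. The sketch below illustrates that conversion with a toy whitespace tokenizer standing in for the BERT tokenizer, so the numbers are easy to follow. The function names and the sample text are made up for the example, and a real WordPiece tokenizer would split the text differently.

```python
def toy_tokenize(text):
    """Stand-in for a real tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def answer_token_span(question, context, answer, answer_start, tokenize=toy_tokenize):
    """Return (start, end) token indices of the answer inside
    [CLS] question [SEP] context [SEP]."""
    offset = len(tokenize(question)) + 2  # [CLS] + question tokens + [SEP]
    # Tokens in the context that come before the answer
    start = offset + len(tokenize(context[:answer_start]))
    end = start + len(tokenize(answer)) - 1
    return start, end

question = "Which organ makes insulin?"
context = "Insulin is a hormone made by the pancreas"
answer = "the pancreas"
answer_start = context.find(answer)

tokens = ["[CLS]"] + toy_tokenize(question) + ["[SEP]"] + toy_tokenize(context) + ["[SEP]"]
start, end = answer_token_span(question, context, answer, answer_start)
print(start, end, tokens[start:end + 1])
```

Counting tokens in the text before the answer is only exact when the answer begins on a token boundary. Fast tokenizers in the transformers library can report character offsets for each token, which is a more robust way to do the same mapping.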

Training the Model

I'll be using the BertForQuestionAnswering model, as it is designed for QA tasks. You can load the pre-trained weights of the bert-base-uncased model by calling the from_pretrained function on the model class. You should also choose the loss function and optimizer you will use for training.

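I am using the AdamW optimizer (a variant of Adam with decoupled weight decay) and a cross-entropy loss function. BertForQuestionAnswering computes this cross-entropy loss itself when it is given the start and end positions of the answers. You can use the PyTorch DataLoader class to load the data in batches and shuffle it to avoid any ordering bias.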

# Set device (CPU or GPU)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Initialize the BERT model for question answering and move it to the device
model = BertForQuestionAnswering.from_pretrained('bert-base-uncased')
model.to(device)

# loss_fn is listed for reference only: the training loop reads the
# cross-entropy loss that the model computes itself (outputs.loss)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = torch.nn.CrossEntropyLoss()
batch_size = 8
num_epochs = 50

# Create data loader
data_loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

Once the data loader is defined, you can go ahead and write the final training loop. In each iteration, the data_loader yields a batch of batch_size examples, on which forward and backward propagation are performed. The loop searches for the set of weights at which the loss is minimal.

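model.train()  # put the model in training mode (enables dropout)

for epoch in range(num_epochs):
    total_loss = 0

    for batch in data_loader:
        # Move batch tensors to the device
        input_ids = batch[0].to(device)
        attention_mask = batch[1].to(device)
        start_positions = batch[2].to(device)
        end_positions = batch[3].to(device)

        # Zero the gradients
        optimizer.zero_grad()

        # Forward pass; the model returns a loss only when both
        # start_positions and end_positions are provided
        outputs = model(input_ids, attention_mask=attention_mask, start_positions=start_positions, end_positions=end_positions)
        loss = outputs.loss

        # Backward pass and optimization
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    avg_loss = total_loss / len(data_loader)
    print(f"Epoch {epoch+1}/{num_epochs} - Average Loss: {avg_loss:.4f}")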

This completes your fine-tuning! You can test the model after switching it to evaluation mode with model.eval(). You can also tune the learning rate and the number of epochs to obtain the best results on your data.
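At test time, a question-answering model such as BertForQuestionAnswering outputs two score vectors, start_logits and end_logits, with one score per input token. Turning them into an answer means picking a start and end index where the end does not come before the start. The sketch below shows one simple way to do that on plain Python lists; the function name best_span, the max_answer_len limit, and the sample scores are assumptions made for illustration.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return the (start, end) pair with the highest combined score,
    subject to start <= end and a maximum answer length."""
    best, best_score = (0, 0), float("-inf")
    for s, s_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

# Made-up scores: picking each argmax independently would give start=1, end=0,
# which is not a valid span
start_logits = [0.1, 5.0, 0.2]
end_logits = [4.0, 0.3, 1.0]
span = best_span(start_logits, end_logits)
print(span)
```

With a real model you would convert the logits of one example to a list (for instance with .tolist()), call the function, and decode the selected slice of input_ids back to text with the tokenizer.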

Best Tips and Practices

Here are some points to note while fine-tuning any large language model on custom data:

  • Your dataset needs to represent the target domain or task you want the language model to excel at. Clean and well-structured data is essential.
  • Ensure that you have enough training examples in your data for the model to learn patterns. Otherwise, the model might memorize the examples and overfit, without the capacity to generalize to unseen examples.
  • Choose a pre-trained model that has been trained on a corpus relevant to your task at hand. For question answering, for example, you can start from a checkpoint that has already been fine-tuned on the Stanford Question Answering Dataset. Similarly, there are different models available for tasks like sentiment analysis, text generation, summarization, text classification, and more.
  • Try gradient accumulation if you have limited GPU memory. In this method, rather than updating the model's weights after each batch, gradients are accumulated over multiple mini-batches before performing an update.
  • If you face the problem of overfitting while fine-tuning, use regularization techniques. Some commonly used methods include adding dropout layers to the model architecture, applying weight decay, and using layer normalization.
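To make the gradient accumulation tip concrete, here is a framework-free sketch on a toy one-parameter model y = w * x. Everything in it (the function names, the data, the learning rate) is made up for illustration. It shows that accumulating the scaled gradients of four mini-batches of two examples gives the same update as one batch of eight examples.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y = w * x over one batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def train(data, micro_batch_size, accumulation_steps, lr=0.01, epochs=200):
    # Assumes the number of mini-batches is a multiple of accumulation_steps
    w = 0.0
    for _ in range(epochs):
        accumulated, seen = 0.0, 0
        for i in range(0, len(data), micro_batch_size):
            micro_batch = data[i:i + micro_batch_size]
            # Scale each gradient so the accumulated sum equals the
            # mean gradient of the larger effective batch
            accumulated += grad_mse(w, micro_batch) / accumulation_steps
            seen += 1
            if seen % accumulation_steps == 0:
                w -= lr * accumulated  # one weight update per N mini-batches
                accumulated = 0.0
    return w

data = [(x, 3.0 * x) for x in range(1, 9)]  # true weight is 3.0
w_accum = train(data, micro_batch_size=2, accumulation_steps=4)
w_full = train(data, micro_batch_size=8, accumulation_steps=1)
print(w_accum, w_full)
```

In a PyTorch training loop the same pattern is usually written by dividing the loss by the number of accumulation steps, calling loss.backward() on every mini-batch, and calling optimizer.step() followed by optimizer.zero_grad() only every N mini-batches.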


Large language models can help you automate many tasks in a quick and efficient manner. Fine-tuning LLMs helps you leverage the power of transfer learning and customize a model to your particular domain. Fine-tuning can be essential if your data comes from a specialized domain such as medicine, finance, or a technical niche.

In this article we used BERT, as it is open source and works well for personal use. If you are working on a large-scale project, you can opt for more powerful LLMs, like GPT-3, or other open-source alternatives. Remember, fine-tuning large language models can be computationally expensive and time-consuming. Ensure you have sufficient computational resources, including GPUs or TPUs, based on the scale.

Last Updated: July 6th, 2023

© 2013-2024 Stack Abuse. All rights reserved.