Every large language model is defined by a specific set of parameters, or weights, that determine how the model behaves.
Cutting-edge LLMs have billions of parameters that work together to form a base model. While these base models are great at understanding human language in a variety of contexts, their performance often breaks down when they encounter novel scenarios not covered in their training data.
Fine-tuning is an approach where you take this pre-trained model and adjust the model weights slightly using a smaller but more domain-specific dataset. For example, exposing the model to a dataset of a specific writer's work will allow the model to learn how to generate text in that writer's voice.
In this guide, we’ll look into four best practices to follow when fine-tuning LLMs:
Choosing the right base model to fine-tune
Curating high-quality datasets
Adjusting fine-tuning hyper-parameters
Continuous evaluation and iteration
4 Best Practices When Fine-Tuning LLMs
Let's look into some best practices to keep in mind once you're set on fine-tuning your LLM.
Choosing the Right Base Model
Before investing in fine-tuning, it's essential that you decide on an LLM that is capable of performing your task effectively.
Here are some factors to consider:
Performance on relevant tasks
Response time
Cost and scalability
Privacy
Choosing a large language model can be broken down into two main decisions: size of the model (smaller vs. larger) and type of model (open-source vs. proprietary). Each of these dimensions has distinct implications for performance, cost, and flexibility.
Smaller vs Larger Models
Not every scenario requires the most cutting-edge LLMs to achieve great results.
Sometimes, simpler, smaller models can perform just as well, if not better, depending on the task's complexity and requirements.
For example, if your use case requires quicker response times or only needs relatively simple output, consider using a smaller model.
Open-source vs Proprietary
Proprietary models still dominate the top of the LLM leaderboards across a variety of metrics. However, these models provide limited control over fine-tuning and always come with external risks for your business (such as server downtime, price increases, and so on).
Open-source models, as the name suggests, are LLMs built from publicly accessible code and architecture. These models are preferred for use cases that require more fine-tuning control or scenarios where privacy is a concern. Open-source LLMs can also be cheaper to run while achieving results similar to their proprietary counterparts.
To help with this step, you can try out our free Model Selection tool, which lets you compare up to five different SOTA LLMs at once against your own custom prompts.
High-Quality Datasets
After deciding on the right base model, your next focus should be on curating high-quality datasets to use for fine-tuning.
A fine-tuning dataset typically consists of input-output pairs that demonstrate how you want your model to behave. For example, consider the following input-output pair we can use to train our customer support chatbot:
{"prompt": "Why is my computer monitor defective?","completion": "I'm sorry to hear about this product issue! Could you describe the issue you’re experiencing with your monitor in more detail? This will help me suggest the right steps to resolve it."}
To create high-quality training data, we recommend focusing on three main areas: quality, quantity, and diversity.
Curating Quality Data
When fine-tuning, the quality of your examples is often more important than their quantity. Data preparation is an important step in ensuring your model is optimized.
High-quality examples should have consistent formatting and accurate inputs and outputs, and should always be relevant to the task being optimized for. We recommend pre-processing the dataset to handle missing values, misspelled words, and any other inconsistencies.
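As a concrete illustration, here's a minimal pre-processing sketch in Python. It assumes the prompt/completion JSONL format shown above; the file name and validation rules are placeholders you'd adapt to your own data.

```python
import json

# Load a JSONL fine-tuning dataset and drop rows with missing or
# malformed values. The file name and required keys are assumptions
# based on the prompt/completion example above.
def load_clean_examples(path="fine_tune_data.jsonl"):
    cleaned = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                example = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed rows rather than training on them
            prompt = str(example.get("prompt", "")).strip()
            completion = str(example.get("completion", "")).strip()
            if not prompt or not completion:
                continue  # drop rows with missing values
            cleaned.append({"prompt": prompt, "completion": completion})
    return cleaned
```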
At Aligned, our domain experts help clients curate high-quality datasets for fine-tuning. Leveraging expert knowledge ensures your model is well-prepared to handle industry-specific challenges and perform effectively in real-world applications.
Quantity and Diversity
Your dataset must be large enough to provide a wide array of examples for the LLM to learn from. Consider setting up a dataset that is representative of the variety of possible scenarios your model might need to face.
Some tasks may also benefit from a larger dataset due to their complexity. For instance, training an LLM to rewrite text in a certain voice may require more examples than training the same model to classify text.
Your data should also be diverse enough to ensure that the fine-tuned model can respond appropriately to a variety of inputs. Research has also shown that reducing the number of duplicated examples can improve model performance after fine-tuning.
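Building on the cleaning sketch above, a simple deduplication pass might look like the following. Matching on normalized prompt text is just one heuristic; fuzzy or embedding-based matching would also catch near-duplicates.

```python
# Remove exact duplicates by normalizing each prompt (lowercase,
# collapsed whitespace) and keeping only the first occurrence.
def deduplicate(examples):
    seen, unique = set(), []
    for ex in examples:
        key = " ".join(ex["prompt"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique
```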
Hyperparameter Tuning
When fine-tuning your model, there are a few hyperparameters you can adjust to influence how your model is trained. It's considered best practice to start with the model's default hyperparameters and adjust based on performance.
Suppose you want to fine-tune an LLM using a dataset containing 10,000 examples. By adjusting the different hyperparameters, we can control how our model learns from the training data.
Let’s take a look at OpenAI's API, which allows you to adjust three hyperparameters for fine-tuning:
batch_size
learning_rate_multiplier
n_epochs
The n_epochs value controls how many times fine-tuning passes through the entire dataset (4 times by default). Increasing this value can help if the fine-tuned model fails to follow the training data closely enough. However, too many epochs put the model in danger of overfitting.
The batch_size hyperparameter controls the number of training examples used in a single step of training. Using our example earlier, training our 10,000-example dataset with a batch size of 100 means we'll finish a single epoch after exactly 100 steps (10,000 ÷ 100).
Ideally, you'd want to select a batch size that leads to faster model convergence. However, the batch size is also limited by the amount of memory available to you.
Finally, the learning_rate_multiplier allows you to influence how quickly the model converges. OpenAI recommends increasing the learning rate multiplier if the model fails to converge.
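Putting these together, here's a sketch of launching a fine-tuning job with OpenAI's Python SDK. The base model and hyperparameter values are illustrative placeholders, not recommendations (note also that OpenAI's chat models expect a messages-based training format rather than the prompt/completion pairs shown earlier); start from the defaults and adjust based on observed performance.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the (cleaned, deduplicated) training file first.
training_file = client.files.create(
    file=open("fine_tune_data.jsonl", "rb"),
    purpose="fine-tune",
)

# Start the fine-tuning job with explicit hyperparameters.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="davinci-002",  # a base model that accepts prompt/completion pairs
    hyperparameters={
        "n_epochs": 4,                  # passes over the full dataset
        "batch_size": 100,              # 10,000 examples -> 100 steps per epoch
        "learning_rate_multiplier": 2,  # raise if the model fails to converge
    },
)
print(job.id, job.status)
```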
Continuous Evaluation & Iteration
Fine-tuning your model is rarely a one-time process. To achieve optimal results, you’ll need to perform continuous evaluation and iteration of the fine-tuning data, the hyperparameters used for fine-tuning, and even the evaluation process itself.
By focusing on areas where your model underperforms, you’ll be guided on how to iteratively adjust the fine-tuning dataset. For example, if the model struggles with understanding certain jargon in the input, consider using more examples with these terms in the training data.
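As a starting point, a bare-bones evaluation loop might run held-out prompts through the fine-tuned model and flag weak responses for review. The keyword-overlap score below is a crude stand-in for whatever task-specific metric (or expert rating) you actually rely on, and the model ID is a placeholder returned by your fine-tuning job.

```python
# Score a response by how much of the reference answer's vocabulary
# it covers: a deliberately simple proxy metric.
def keyword_overlap(response, reference):
    ref_words = set(reference.lower().split())
    resp_words = set(response.lower().split())
    return len(ref_words & resp_words) / max(len(ref_words), 1)

# Run held-out examples through the fine-tuned model and collect
# the low-scoring ones to guide the next dataset iteration.
def evaluate(client, model_id, held_out, threshold=0.3):
    failures = []
    for ex in held_out:
        result = client.completions.create(
            model=model_id,  # the "ft:..." ID returned by your fine-tuning job
            prompt=ex["prompt"],
            max_tokens=200,
        )
        answer = result.choices[0].text
        if keyword_overlap(answer, ex["completion"]) < threshold:
            failures.append({"prompt": ex["prompt"], "answer": answer})
    return failures
```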
Aligned’s growing network of domain experts supports iterative model improvement through expert ratings and side-by-side comparisons. Their feedback ensures that fine-tuned models meet client expectations and guidelines.
Generative AI is still advancing at a rapid pace, with new research and models introduced every week. While LLMs (and some clever prompt engineering) are already powerful at general tasks, they can only take you so far. To truly excel in domain-specific applications, fine-tuning is essential.
By refining models with curated data and expert feedback, you can achieve more accurate and relevant performance out of your model.
Are you exploring LLMs or looking to improve an existing model? Our team of experts can help you create and curate high-quality data to train the perfect AI models for your specific needs.