A Comprehensive Guide to Fine-Tuning Techniques for LLMs
Large Language Models (LLMs) have revolutionized the field of natural language processing. However, to achieve optimal performance for your specific task, fine-tuning is often necessary. Fine-tuning adapts a pre-trained LLM to a particular dataset or behavior, improving its accuracy, style, and overall suitability.
The landscape of fine-tuning techniques has expanded significantly, with various methods offering different strengths and weaknesses. This guide clarifies several prominent fine-tuning techniques available within the Hugging Face TRL (Transformer Reinforcement Learning) library, assisting you in selecting the appropriate approach for your requirements.
Table of Contents
- Understanding the Landscape: Key Concepts
- Supervised Fine-Tuning (SFT)
- Reinforcement Learning from Human Feedback (RLHF)
- Preference Data vs. Prompt-Only Data
- Online vs. Offline Methods
- Reward Models vs. Judges
- Differentiable vs. Non-Differentiable Reward Functions
- Mixture of Experts Models (MOEs)
- Fine-Tuning Techniques and When to Use Them
- SFTTrainer: The Foundation
- DPOTrainer: Direct Preference Optimization
- OnlineDPOTrainer: Dynamic DPO
- CPOTrainer: Contrastive Preference Optimization
- SimPO: Simple Preference Optimization
- CPO-SimPO: Combining CPO and SimPO
- BCOTrainer: Binary Classifier Optimization
- AlignProp Trainer: Reward Backpropagation
- NashMDTrainer: Nash Learning from Human Feedback
- RLOOTrainer: REINFORCE Leave-One-Out
- XPOTrainer: Exploratory Preference Optimization
- PRMTrainer: Process-supervised Reward Models
- IterativeSFTTrainer: For Custom, Iterative Fine-Tuning
- GRPOTrainer: Group Relative Policy Optimization
- Choosing the Right Technique: A Decision Tree
- Quick Reference Table
1. Understanding the Landscape: Key Concepts
Before exploring specific techniques, let's define some fundamental concepts:
- Supervised Fine-Tuning (SFT): The most basic fine-tuning method. You provide a dataset of input-output pairs (e.g., prompts and desired responses), and the LLM is trained to minimize the discrepancy between its output and the target output. This is standard supervised learning applied to LLMs.
- Reinforcement Learning from Human Feedback (RLHF): A more advanced approach where the LLM learns from human preferences. Instead of being given the correct answer directly, the model receives feedback from humans indicating which of several model outputs is better. This is especially useful for tasks where "correctness" is subjective (e.g., style, helpfulness).
- Preference Data vs. Prompt-Only Data:
- Preference Data: Datasets containing comparisons between different model outputs. Typically, this includes a prompt, a "chosen" (preferred) response, and a "rejected" (less preferred) response. Some methods, like DPO, require this type of data.
- Prompt-Only Data: Datasets containing only prompts. These are used by techniques such as Online DPO, where "chosen" and "rejected" responses are generated dynamically during training.
- Unpaired Preference Data: Datasets that include prompts, a list of desirable ('chosen') responses, and a separate list of undesirable ('rejected') responses, without pairing them one-to-one.
- Online vs. Offline Methods:
- Offline Methods: Utilize a static dataset of preferences or prompts collected before the training process begins. DPO is a classic example.
- Online Methods: Generate data dynamically during training. The model itself might generate responses, which are then evaluated (either by a reward model, a judge, or human feedback) to generate training signals. Online DPO and RLOO are examples.
- Reward Models vs. Judges:
- Reward Model: A separate, trained model that accepts a prompt and a response as input and outputs a scalar reward indicating the response's quality. This is a common approach in RLHF.
- Judge: A system (often another LLM) that directly compares two responses to a prompt and indicates which is preferred. Online DPO can use a judge. A judge provides relative preference rather than a scalar reward.
- Differentiable vs. Non-Differentiable Reward Functions:
- Differentiable Reward Function: A reward function, such as a reward model, whose gradients can be backpropagated directly into the model being trained (e.g., into the diffusion model in AlignProp).
- Non-Differentiable Reward Function: A reward source, such as a judge, that does not allow backpropagation; a policy-gradient algorithm must be used instead.
- Mixture of Experts Models (MOEs):
- To ensure MoE models are trained effectively during preference-tuning, it is beneficial to add the auxiliary loss from the load balancer to the final loss. This option is enabled by setting output_router_logits=True in the model config.
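As a concrete illustration, here is a minimal sketch of enabling this flag, assuming a Mixtral-style checkpoint whose config exposes output_router_logits and router_aux_loss_coef (names and defaults may differ for other MoE architectures):

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Hypothetical Mixtral-style MoE checkpoint; other MoE architectures may use different config names.
model_name = "mistralai/Mixtral-8x7B-v0.1"

config = AutoConfig.from_pretrained(model_name)
config.output_router_logits = True   # adds the router's load-balancing auxiliary loss to the total loss
config.router_aux_loss_coef = 0.001  # weight of that auxiliary loss

model = AutoModelForCausalLM.from_pretrained(model_name, config=config)
```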
2. Fine-Tuning Techniques and When to Use Them
Now, let's explore the specific fine-tuning methods provided by the TRL library:
SFTTrainer: The Foundation
- What it is: Supervised Fine-Tuning. The simplest approach, where you train the LLM on a dataset of input-output pairs.
- When to use it:
- You have a dataset of high-quality input-output examples.
- Your task has a clear definition of "correctness" (e.g., code generation, translation, question answering with factual answers).
- You want a simple, straightforward fine-tuning approach.
- You want to adapt an LLM to a specific style or format.
- Data Requirements: Input-output pairs. Can be in standard (text) or conversational format.
- Key Features:
- Simple and efficient.
- Can be combined with techniques like LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning.
- Supports packing (combining multiple short examples into a single sequence) for increased training efficiency.
- A formatting_func parameter allows for flexible input formatting.
- Reference: SFT Trainer
- Limitations: The model's capability is capped by the quality of the training data; it cannot exceed what the examples demonstrate.
Example Dataset (Standard Format):
[
{"prompt": "Translate to English: Je t'aime.", "completion": "I love you."},
{"prompt": "Summarize: The quick brown fox jumps over the lazy dog.", "completion": "A quick fox jumps over a lazy dog."}
]
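To make this concrete, here is a minimal training sketch with SFTTrainer. It assumes a recent TRL version whose SFTTrainer/SFTConfig accept the arguments shown; the small Qwen checkpoint is just a hypothetical placeholder.

```python
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# Tiny prompt/completion dataset in the standard format shown above.
train_dataset = Dataset.from_list([
    {"prompt": "Translate to English: Je t'aime.", "completion": "I love you."},
    {"prompt": "Summarize: The quick brown fox jumps over the lazy dog.",
     "completion": "A quick fox jumps over a lazy dog."},
])

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # hypothetical small base model; any causal LM id works
    args=SFTConfig(output_dir="sft-demo"),
    train_dataset=train_dataset,
)
trainer.train()
```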
DPOTrainer: Direct Preference Optimization
- What it is: An offline RLHF method that directly optimizes the LLM's policy based on preference data, without needing to explicitly train a separate reward model. It cleverly reframes the RLHF problem as a classification task.
- When to use it:
- You have a dataset of preference pairs (prompt, chosen response, rejected response).
- You want a simpler and more stable alternative to traditional RLHF methods like PPO.
- You want to avoid the complexities of training and managing a separate reward model.
- Data Requirements: Preference data (prompt, chosen, rejected).
- Key Features:
- Stable and computationally efficient.
- Directly optimizes the policy, avoiding the two-stage process of RLHF (reward modeling + RL).
- The beta parameter controls the deviation from the reference (original) model.
- Supports different loss types (sigmoid, hinge, ipo, kto).
- Reference: DPO Trainer
- Limitations: Since it is an offline method, performance is capped by the quality of the preference data; the model cannot improve beyond what the dataset demonstrates.
Example Dataset (Standard Format):
[
{
"prompt": "Write a short poem about the ocean.",
"chosen": "Vast blue expanse,\nWaves crash, a rhythmic dance,\nMysteries untold.",
"rejected": "The ocean is big and blue. It has water."
},
{
"prompt": "Explain the theory of relativity.",
"chosen": "Einstein's theory of relativity, encompassing both special and general relativity, revolutionized our understanding of gravity, space, and time...",
"rejected": "Relativity is a thing about space. Einstein made it."
}
]
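A minimal DPO training sketch might look like the following. It assumes a recent TRL version with this DPOTrainer/DPOConfig signature and a preferences.json file containing records in the format above; the checkpoint name and file path are placeholders.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # hypothetical SFT'd starting checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Hypothetical file containing prompt/chosen/rejected records like the example above.
train_dataset = load_dataset("json", data_files="preferences.json", split="train")

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # None lets TRL create the frozen reference copy internally
    args=DPOConfig(output_dir="dpo-demo", beta=0.1, loss_type="sigmoid"),
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```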
OnlineDPOTrainer: Dynamic DPO
- What it is: An online version of DPO. Instead of using a static preference dataset, it generates new responses during training and uses a "judge" (either an LLM or a reward model) to provide preferences on-the-fly.
- When to use it:
- Your data distribution changes over time.
- You don't have a large, pre-existing preference dataset.
- You want the model to actively explore and potentially generate better responses than those seen in any initial data.
- The model being trained is evolving, so a static preference dataset would quickly become stale.
- You have access to a good judge (another LLM or a pre-trained reward model).
- Data Requirements: Prompt-only dataset. The trainer handles generating completions and obtaining preferences.
- Key Features:
- Dynamic data generation – the model learns from its own evolving outputs.
- Can potentially achieve higher performance than offline DPO by exploring beyond the initial dataset.
- Controllable feedback via instruction prompts to the judge (if using an LLM judge).
- Can use either a reward model or a pairwise judge.
- Optional penalty for not generating an EOS token (encourages shorter completions).
- Reference: Online DPO Trainer
- Limitations: Requires a well-designed judge; without one, training progress will stall.
Example Dataset (Prompt-Only):
[
{"prompt": "Write a short poem about the ocean."},
{"prompt": "Explain the theory of relativity."}
]
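A rough sketch of online DPO with a pairwise judge follows. It assumes a recent TRL version that exposes OnlineDPOTrainer, OnlineDPOConfig, and the PairRMJudge helper; the model name is a placeholder, and argument names may differ between versions.

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import OnlineDPOConfig, OnlineDPOTrainer, PairRMJudge

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # hypothetical starting checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Prompt-only dataset: completions and preferences are produced during training.
train_dataset = Dataset.from_list([
    {"prompt": "Write a short poem about the ocean."},
    {"prompt": "Explain the theory of relativity."},
])

trainer = OnlineDPOTrainer(
    model=model,
    judge=PairRMJudge(),  # a pairwise judge; a reward model can be passed instead
    args=OnlineDPOConfig(output_dir="online-dpo-demo"),
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```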
CPOTrainer: Contrastive Preference Optimization
- What it is: A method designed to mitigate issues with SFT, such as the model's performance being capped by the training data quality and the lack of a mechanism to reject mistakes. CPO is derived from the DPO objective.
- When to use it:
- You have preference data.
- You want to avoid the issues of SFT (performance cap, inability to reject mistakes).
- You want a method that can be applied to domains beyond machine translation (e.g., chat).
- Data Requirements: Preference dataset (prompt, chosen, rejected). Supports conversational and standard formats.
- Example Dataset (Standard format):
[
{
"prompt": "Translate 'good morning' to French.",
"chosen": "Bonjour.",
"rejected": "Au revoir."
},
{
"prompt": "Write a haiku about winter.",
"chosen": "White blanket of snow,\nSilent trees stand tall and still,\nWinter's cold embrace.",
"rejected": "Winter cold, snow fall, very pretty."
}
]
- Key Features:
- Aims to make the model avoid generating adequate but imperfect outputs.
- Supports various loss functions (sigmoid, hinge, ipo, simpo).
- Can incorporate an auxiliary loss for Mixture of Experts (MOE) models.
- Limitations: Requires a high quality preference dataset.
- Reference: CPO Trainer
- SimPO: Simple Preference Optimization. SimPO is implemented within the CPOTrainer and can be used when behavior-cloning (BC) regularization is not needed. It is an alternative loss that adds a reward margin and allows for length normalization.
- CPO-SimPO: A combination of CPO and SimPO, allowing for more stable training.
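As a sketch of how SimPO can be selected, the following assumes a TRL version where CPOConfig accepts loss_type="simpo" and a cpo_alpha weight (setting cpo_alpha to 0.0 recovers pure SimPO, while a non-zero value gives CPO-SimPO); names and defaults may differ between versions.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import CPOConfig, CPOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # hypothetical checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Hypothetical file with prompt/chosen/rejected records like the CPO example above.
train_dataset = load_dataset("json", data_files="preferences.json", split="train")

trainer = CPOTrainer(
    model=model,
    args=CPOConfig(output_dir="cpo-simpo-demo", loss_type="simpo", cpo_alpha=0.0),
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```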
BCOTrainer: Binary Classifier Optimization
- What it is: Uses an unpaired preference dataset. The authors train a binary classifier whose logit serves as a reward so that the classifier maps {prompt, chosen completion} pairs to 1 and {prompt, rejected completion} pairs to 0.
- When to use it:
- You have an unpaired preference dataset: it must contain a prompt, a list of chosen responses, and a list of rejected responses, which need not be paired with each other.
- Key Features:
- Supports Underlying Distribution Matching (UDM) for cases where the prompts in your desired and undesired datasets differ substantially.
- Supports Mixture of Experts (MoE) models.
- Example Dataset (Standard format):
[
{
"prompt": "Write a short story about a cat.",
"chosen": [
"Whiskers twitched, emerald eyes gleaming, as Mittens stalked the sunbeam...",
"Once upon a time in a cozy little house, lived a cat named Luna..."
],
"rejected":[
"The cat sat on the mat, it was fat.",
"Cats are animals, they meow"
]
},
{
"prompt": "Explain the concept of recursion in programming.",
"chosen": [
"Recursion is a technique where a function calls itself within its own definition...",
"Think of recursion like a set of Russian dolls nested within each other..."
],
"rejected":[
"Recursion is when a function is complicated.",
"Recursion is looping but different."
]
}
]
- Reference: BCO Trainer
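A rough usage sketch follows. It assumes a recent TRL version whose BCOTrainer expects an unpaired preference dataset flattened into prompt/completion/label rows (one row per completion), which is a slightly different layout from the illustrative JSON above; argument names may vary between versions.

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import BCOConfig, BCOTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # hypothetical checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
ref_model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Unpaired preference data: each completion carries its own boolean label.
train_dataset = Dataset.from_list([
    {"prompt": "Write a short story about a cat.",
     "completion": "Whiskers twitched, emerald eyes gleaming, as Mittens stalked the sunbeam...",
     "label": True},
    {"prompt": "Write a short story about a cat.",
     "completion": "The cat sat on the mat, it was fat.",
     "label": False},
])

trainer = BCOTrainer(
    model=model,
    ref_model=ref_model,
    args=BCOConfig(output_dir="bco-demo"),
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```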
AlignProp Trainer: Reward Backpropagation
- What it is: A trainer for diffusion models that, when the reward function is differentiable, directly backpropagates gradients from the reward model into the diffusion model.
- When to use it:
- When the reward function is differentiable, such as when using a reward model instead of a judge.
- Data Requirements:
- prompt: The text prompt used to generate the image.
- prompt_metadata: Metadata associated with the prompt.
- rewards: The numerical reward/score associated with the generated image.
- image: The image generated by the Stable Diffusion model.
- Key Features:
- Significantly more sample- and compute-efficient (about 25x) than policy-gradient algorithms such as DDPO.
- Does full backpropagation through time, which allows updating the earlier steps of denoising via reward backpropagation.
- Reference: AlignProp Trainer
Example Dataset:
[
{
"prompt": "A majestic lion sitting on a rock.",
"prompt_metadata": {"style": "photorealistic"},
"rewards": [8.5],
"image": "lion_image_1.png"
},
{
"prompt": "A fluffy cat curled up in a basket.",
"prompt_metadata": {"style": "cartoon", "lighting": "soft"},
"rewards": [7.2],
"image": "cat_image_1.png"
}
]
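The sketch below shows the general shape of reward backpropagation with AlignPropTrainer. Everything here is an assumption about the TRL API (a config, a differentiable reward function, a prompt function, and a Stable Diffusion pipeline wrapper), and the brightness reward is a toy stand-in for a real differentiable reward model such as an aesthetic scorer.

```python
from trl import AlignPropConfig, AlignPropTrainer, DefaultDDPOStableDiffusionPipeline

def prompt_fn():
    # Hypothetical prompt function: returns one prompt plus its metadata.
    return "A majestic lion sitting on a rock.", {"style": "photorealistic"}

def reward_fn(images, prompts, prompt_metadata):
    # Toy differentiable reward: prefer brighter images (mean pixel value per image).
    # A real setup would call a differentiable reward model instead.
    return images.mean(dim=(1, 2, 3))

# Hypothetical base pipeline; LoRA keeps the fine-tune lightweight.
pipeline = DefaultDDPOStableDiffusionPipeline("runwayml/stable-diffusion-v1-5", use_lora=True)

trainer = AlignPropTrainer(AlignPropConfig(), reward_fn, prompt_fn, pipeline)
trainer.train()
```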
NashMDTrainer: Nash Learning from Human Feedback
- What it is: A method that learns a preference model and then aims to find a policy (the LLM's behavior) that is a Nash equilibrium – meaning no single policy can generate responses that are consistently preferred over it.
- When to use it:
- You want a method grounded in game theory (Nash equilibrium).
- You have a preference model or a way to obtain pairwise comparisons (e.g., a judge).
- Data Requirements: Prompt-only dataset. Completions are generated during training.
- Key Features:
- Based on the Nash-MD algorithm (Mirror Descent).
- Can use a reward model or a judge for comparisons.
- Similar to Online DPO, it can include a penalty for missing EOS tokens.
- Reference: Nash-MD Trainer
- Limitations: Newer technique, less widely adopted than DPO/PPO.
Example Dataset (Prompt-Only):
[
{"prompt": "Compose a short email inviting colleagues to a team meeting."},
{"prompt": "Write a Python function to reverse a string."}
]
RLOOTrainer: REINFORCE Leave-One-Out
- What it is: An online RL algorithm that uses a clever trick to estimate the advantage function. Instead of a separate value function (like in PPO), it generates multiple completions for each prompt and uses the average reward of the other completions as a baseline.
- When to use it:
- You want an online RL method (dynamic data generation).
- You don't want to train a separate value function.
- You are comfortable with the computational cost of generating multiple completions per prompt.
- Data Requirements: Prompt-only dataset.
- Key Features:
- Online learning – adapts as it generates data.
- Avoids the need for a separate value function, simplifying the training process.
- Can use a reward model or a judge.
- Reference: RLOO Trainer
- Limitations: Can be less stable than methods like PPO. The performance relies heavily on the quality of the reward signal.
- Example Dataset (Prompt-only):
[
{"prompt": "Explain the concept of 'opportunity cost' in economics."},
{"prompt": "Write a short fictional story about a time-traveling detective."}
]
XPOTrainer: Exploratory Preference Optimization
- What it is: An online preference-tuning method that combines the loss function from Direct Preference Optimization (DPO) with an exploration bonus.
- When to use it:
- You have a prompt-only dataset.
- You don't have a dataset of preference pairs.
- You have access to a good judge (another LLM or a pre-trained reward model).
- Key Features:
- Can be used with both a reward model or a judge.
- Combines the DPO loss with an exploration bonus that helps the model move out of the initial distribution.
- Reference: XPO Trainer
- Limitations: Newer method, less widely adopted than DPO/PPO.
- Example Dataset (Prompt-only):
[
{"prompt": "Write a short poem about autumn."},
{"prompt": "Explain the concept of blockchain technology."}
]
PRMTrainer: Process-supervised Reward Models
- What it is: Trains a reward model based on stepwise feedback, rather than just overall feedback on a completed response. This is especially useful for tasks that involve reasoning or multi-step processes.
- When to use it:
- You have data where you can provide feedback on individual steps in a reasoning process, not just the final answer.
- You want to train a reward model that can guide the LLM at a more granular level.
- Data Requirements: Stepwise supervision data (prompt, a series of reasoning steps, and labels/rewards for each step).
- Example Dataset:
[
{
"prompt": "Solve the math problem: (2 + 3) * 4",
"completions": [
"Step 1: Calculate 2 + 3 = 5",
"Step 2: Multiply 5 by 4 = 20"
],
"labels": [1.0, 1.0] // Both steps are correct (reward = 1.0)
},
{
"prompt": "Solve the math problem: (2 + 3) * 4",
"completions": [
"Step 1: Calculate 2 * 4 = 8",
"Step 2: Add 3 to 8 = 11"
],
"labels": [0.0, 0.0] // Both steps are incorrect (reward = 0.0)
},
{
"prompt": "Solve the math problem: (2 + 3) * 4",
"completions": [
"Step 1: Calculate 2 + 3 = 5",
"Step 2: Multiply 4 by 4 = 16"
],
"labels": [1.0, 0.0] // Only first step is correct (reward = 1.0,0.0)
}
]
- Reference: PRM Trainer
- Limitations: Requires detailed, step-by-step feedback, which can be more expensive to collect than simple preference data.
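A minimal training sketch is shown below. It assumes a recent TRL version that provides PRMTrainer and PRMConfig and expects the stepwise format above (a prompt, a list of step completions, and per-step boolean labels); the backbone checkpoint is a placeholder and argument names may differ between versions.

```python
from datasets import Dataset
from transformers import AutoModelForTokenClassification, AutoTokenizer
from trl import PRMConfig, PRMTrainer

model_name = "Qwen/Qwen2.5-0.5B"  # hypothetical small backbone
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Stepwise supervision data: one boolean label per reasoning step.
train_dataset = Dataset.from_list([
    {"prompt": "Solve the math problem: (2 + 3) * 4",
     "completions": ["Step 1: Calculate 2 + 3 = 5", "Step 2: Multiply 5 by 4 = 20"],
     "labels": [True, True]},
    {"prompt": "Solve the math problem: (2 + 3) * 4",
     "completions": ["Step 1: Calculate 2 + 3 = 5", "Step 2: Multiply 4 by 4 = 16"],
     "labels": [True, False]},
])

trainer = PRMTrainer(
    model=model,
    args=PRMConfig(output_dir="prm-demo"),
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```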
IterativeSFTTrainer: For Custom, Iterative Fine-Tuning
- What it is: A flexible trainer that allows for custom actions (like generation and filtering) between optimization steps.
- When to use it:
- You want fine-grained control over the training loop.
- You have a custom training procedure that doesn't fit neatly into standard SFT or RLHF.
- You might want to iteratively generate data, filter it, and then train on the filtered data.
- Data Requirements: Flexible, depends on your custom logic.
- Key Features:
- Provides a step() function that you can call repeatedly, allowing you to interleave data generation, filtering, and training.
- Gives you more control than the standard SFTTrainer.
- Reference: Iterative Trainer
Example Dataset:
Since this is iterative, you could start with no data, generate some, filter it, and then start your loop. Or you could start with a small seed dataset:
```json
[
{"text": "Initial example 1."},
{"text": "Initial example 2."}
]
```
In each `step()`, you'd generate more data, potentially filter it, and then add it to your training set.
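The following sketch shows the step() call pattern, assuming a TRL version whose IterativeSFTTrainer accepts a texts argument; argument names (e.g., processing_class vs. tokenizer) vary between versions, and the generation/filtering logic is deliberately left as comments.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import IterativeSFTTrainer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # hypothetical checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

trainer = IterativeSFTTrainer(model=model, processing_class=tokenizer)

seed_texts = ["Initial example 1.", "Initial example 2."]
for iteration in range(3):
    # In a real loop you would generate candidate texts with model.generate(), filter them
    # with your own quality checks, and append the survivors; this sketch reuses the seeds
    # purely to show the step() call pattern.
    trainer.step(texts=seed_texts)
```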
GRPOTrainer: Group Relative Policy Optimization
- What it is: GRPO is an online learning algorithm, meaning it improves iteratively by using data generated by the model itself during training.
- When to use it:
- You want an online RL method (dynamic data generation).
- You only have a reward model.
- You don't want to train a separate value function.
- Key Features:
- Calculates advantage using all samples within a group.
- No value model or reference model needed.
Example Dataset (Prompt-Only):
[
{"prompt": "Describe the process of photosynthesis."},
{"prompt": "Write a short story about a time-traveling historian."}
]
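A minimal sketch with a custom reward function is shown below. It assumes a recent TRL version where GRPOTrainer accepts a model name, one or more reward functions with the (completions, **kwargs) signature, and a GRPOConfig; the brevity reward is just a toy example.

```python
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Prompt-only dataset; the trainer samples a group of completions per prompt.
train_dataset = Dataset.from_list([
    {"prompt": "Describe the process of photosynthesis."},
    {"prompt": "Write a short story about a time-traveling historian."},
])

def reward_brevity(completions, **kwargs):
    # Toy reward: prefer completions close to 100 characters.
    return [-abs(100 - len(c)) / 100 for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # hypothetical small policy model
    reward_funcs=reward_brevity,          # could also be a reward-model id or a list of functions
    args=GRPOConfig(output_dir="grpo-demo", num_generations=4),  # 4 samples per prompt form the group
    train_dataset=train_dataset,
)
trainer.train()
```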
3. Choosing the Right Technique: A Decision Tree
This decision tree helps you navigate the options:
Do you have preference data (chosen/rejected pairs)?
│
├── YES:
│ ├── Do you want an online method (dynamic data generation)?
│ │ ├── YES: Use OnlineDPOTrainer (with a judge or reward model)
│ │ └── NO: Use DPOTrainer
│ └── Do you want a method with a focus on avoiding generating mistakes?
│ └── YES: Use CPOTrainer
│
└── NO:
├── Do you have prompt-only data?
│ ├── YES:
│ │ ├── Do you have access to a good judge or reward model?
│ │ │ ├── YES: Use OnlineDPOTrainer, NashMDTrainer, or RLOOTrainer
│ │ │ └── NO: Consider collecting preference data or use SFTTrainer if you have a way to generate "good" outputs.
│ │ └── Do you want a method that leverages the power of exploration?
│ │ └── YES: Consider using XPOTrainer
│ └── NO:
│ └── You need to collect data, either prompt-only or preference data.
└── Do you have a reward function that is differentiable?
├── YES: Consider using AlignProp Trainer
    └── NO: Consider using NashMDTrainer
Important Considerations:
- Data Quality: All fine-tuning methods are sensitive to the quality of your data. Garbage in, garbage out!
- Computational Resources: RLHF methods (PPO, DPO, online variants) are generally more computationally expensive than SFT.
- Experimentation: The best method will depend on your specific task and dataset. Don't be afraid to experiment!
4. Quick Reference Table
| Trainer | Method Type | Data Type | Requires Reward Model/Judge | Key Features |
|---|---|---|---|---|
| SFTTrainer | Supervised | Input-Output Pairs | No | Simple, efficient, good for tasks with clear correctness. |
| DPOTrainer | Offline RLHF | Preference Data | No (implicit reward) | Simpler and more stable than PPO, directly optimizes policy. |
| OnlineDPOTrainer | Online RLHF | Prompt-Only | Yes (or Judge) | Dynamic data generation, potentially higher performance than offline DPO. |
| CPOTrainer | Offline RLHF | Preference Data | No (implicit reward) | Mitigates SFT issues by rejecting mistakes; can combine with SimPO. |
| BCOTrainer | Offline | Unpaired Preference Data | No | Trains a binary classifier on {prompt, completion} pairs and uses its logit as the reward. |
| AlignProp Trainer | Offline | Prompt, reward | Yes (must be differentiable) | More sample- and compute-efficient than policy-gradient algorithms. |
| NashMDTrainer | Online RLHF | Prompt-Only | Yes (or Judge) | Based on Nash equilibrium, theoretically sound, but newer. |
| RLOOTrainer | Online RLHF | Prompt-Only | Yes | Avoids a separate value function; uses REINFORCE with a leave-one-out baseline. |
| XPOTrainer | Online RLHF | Prompt-Only | Yes (or Judge) | Online variant of DPO with an exploration bonus that allows going out of distribution. |
| PRMTrainer | Reward Modeling | Stepwise Supervision | No | Trains a reward model based on stepwise supervision. |
| IterativeSFTTrainer | Supervised | Flexible (customizable) | No | Allows custom steps (e.g., generation, filtering) between optimization steps. |
| GRPOTrainer | Online RL | Prompt-Only | Yes | No value model needed; all samples within a group contribute to the update. |
This guide provides a starting point. The best fine-tuning technique will depend on your specific task, dataset, resources, and desired model behavior. Don't hesitate to experiment and iterate! The Hugging Face TRL library provides the tools; you provide the creativity and data.