GRPO Trainer
The GRPO Trainer plugin allows you to create and manage GRPO training jobs using Transformer Lab. Installation follows the same steps as other training plugins.
There are two variants available:
- GRPO trainer (Multi GPU): Designed for single- or multi-GPU setups without any PEFT models (full-parameter training).
- Unsloth GRPO Trainer: This variant trains with a LoRA adapter, which is produced at the end of the run.
Note: These plugins work exclusively with CUDA environments.
Step 1: Installing the Plugin
- Open the Plugins tab.
- Filter by trainer plugins.
- Install the GRPO trainer (Multi GPU) plugin. If you wish to use the LoRA adapter feature, install the Unsloth GRPO Trainer instead.

Step 2: Creating a Training Task
- Navigate to the Train tab.
- Click on the New button.
- In the pop-up, complete the following sections:
  - Template/Task Name: Set a unique name for your training template/task.
  - Dataset Tab: Select the dataset to use for training. A commonly used dataset is openai/gsm8k.
  - Data Template Tab: There are three fields to configure (see the rendering sketch after this list):
    - Instruction Field: Provide the instruction prompt. For example:

      Respond in the following format:
      <reasoning>
      ...
      </reasoning>
      <answer>
      ...
      </answer>

    - Input Field: Enter the question field from your dataset (for openai/gsm8k, use {{question}}).
    - Output Field: Enter the answer field from your dataset (for openai/gsm8k, use {{answer}}).
  - Plugin Config Tab: Configure the training parameters. The fields vary based on the selected plugin (a rough mapping to a standard GRPO configuration is sketched after this list).

    For GRPO trainer (Multi GPU):
    - Training Device: Set to either cuda, cpu, or tpu.
    - GPU IDs to train: Default is auto.
    - Start thinking string: <reasoning> (the tag that opens the reasoning section; see the format-reward sketch after this list).
    - End thinking string: </reasoning> (the tag that closes the reasoning section).
    - Start answer string: <answer> (the tag that opens the answer section).
    - End answer string: </answer> (the tag that closes the answer section).
    - Maximum Sequence Length: Maximum number of tokens allowed per input sequence.
    - Maximum Completion Length: Maximum number of tokens for the model's output.
    - Batch Size: Number of samples processed together.
    - Learning Rate Schedule: Options include constant, linear, cosine, or constant with warmup.
    - Learning Rate: The initial learning rate.
    - Number of Training Epochs: Number of full passes through the dataset.
    - Max Steps: Total number of training steps (use -1 for no limit).
    - Max Grad Norm: Maximum gradient norm for clipping.
    - Weight Decay: Regularization parameter.
    - Adam Beta 1: The beta1 hyperparameter for the Adam optimizer.
    - Adam Beta 2: The beta2 hyperparameter for the Adam optimizer.
    - Adam Epsilon: A small constant for numerical stability in Adam.
    - Adaptor Name: Unique identifier for the training adaptor.

    For Unsloth GRPO Trainer:
    - All of the fields listed above except Training Device and GPU IDs to train.
    - Additional fields:
      - LoRA R: Rank of the LoRA adapter.
      - LoRA Alpha: Scaling factor for the LoRA weights.
      - LoRA Dropout: Dropout rate used in the LoRA layers.
- Save the training template by clicking Save Training Template.
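
To make the Data Template fields concrete, here is a minimal sketch of `{{field}}` substitution against rows of openai/gsm8k. The plugin's actual templating code is not shown in this guide, so the `render` helper below is hypothetical.

```python
# Sketch only: illustrates how {{question}} / {{answer}} placeholders
# can be filled from openai/gsm8k rows. The plugin's real logic may differ.
import re
from datasets import load_dataset

instruction = (
    "Respond in the following format:\n"
    "<reasoning>\n...\n</reasoning>\n<answer>\n...\n</answer>"
)
input_template = "{{question}}"
output_template = "{{answer}}"

def render(template: str, row: dict) -> str:
    # Replace every {{field}} placeholder with the matching dataset column.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(row[m.group(1)]), template)

row = load_dataset("openai/gsm8k", "main", split="train")[0]
prompt = instruction + "\n\n" + render(input_template, row)
reference = render(output_template, row)
print(prompt)
print(reference)
```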
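The start/end thinking and answer strings let the trainer check whether a completion follows the requested format. The plugin's built-in reward functions are not documented here, so the following is only a sketch of the kind of tag-based format reward commonly used with GRPO, wired to the default tag strings above:

```python
# Sketch of a tag-based format reward, assuming the default tag strings.
# GRPO scores each sampled completion; here a completion earns 1.0 only if
# it wraps its reasoning and answer in the configured tags, else 0.0.
import re

START_THINK, END_THINK = "<reasoning>", "</reasoning>"
START_ANSWER, END_ANSWER = "<answer>", "</answer>"

FORMAT_PATTERN = re.compile(
    re.escape(START_THINK) + r".*?" + re.escape(END_THINK)
    + r"\s*" + re.escape(START_ANSWER) + r".*?" + re.escape(END_ANSWER),
    re.DOTALL,
)

def format_reward(completions: list[str]) -> list[float]:
    """Return 1.0 for completions that match the required tag layout."""
    return [1.0 if FORMAT_PATTERN.search(c) else 0.0 for c in completions]

print(format_reward([
    "<reasoning>2 + 2 = 4</reasoning>\n<answer>4</answer>",  # -> 1.0
    "The answer is 4.",                                       # -> 0.0
]))
```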
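For readers who want to relate these fields to a standard GRPO setup, the sketch below builds a TRL `GRPOConfig` with roughly corresponding parameters. Whether this plugin uses TRL internally, and how its fields map exactly, is an assumption; all values are illustrative, not recommendations.

```python
# Illustrative mapping of the plugin's fields onto TRL's GRPOConfig.
# Assumes a recent TRL release with GRPO support; the plugin's internals may differ.
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="grpo-gsm8k",         # hypothetical checkpoint directory
    max_prompt_length=512,           # roughly "Maximum Sequence Length" (prompt side)
    max_completion_length=256,       # "Maximum Completion Length"
    per_device_train_batch_size=8,   # "Batch Size"
    lr_scheduler_type="cosine",      # "Learning Rate Schedule"
    learning_rate=5e-6,              # "Learning Rate"
    num_train_epochs=1,              # "Number of Training Epochs"
    max_steps=-1,                    # "Max Steps" (-1 = no limit)
    max_grad_norm=0.3,               # "Max Grad Norm"
    weight_decay=0.01,               # "Weight Decay"
    adam_beta1=0.9,                  # "Adam Beta 1"
    adam_beta2=0.99,                 # "Adam Beta 2"
    adam_epsilon=1e-8,               # "Adam Epsilon"
)
```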

Step 3: Queueing the Training Job
After saving the training template, click Queue to start the training job.
While training is running, you can review the output logs and TensorBoard outputs to monitor progress.

Step 4: Viewing Training Logs on WANDB (Optional)
You can monitor training progress and metrics on Weights and Biases (WANDB) if you have provided a Weights and Biases API key in the settings.
