Pre-Training
The Nanotron Pre-training Framework plugin allows you to pre-train models on a single or multi-GPU setup using Transformer Lab. After training, the model will be available in the Foundation tab for further preference training or chatting. It uses Nanotron for pre-training.
Step 1: Installing the Plugin
- Open the Plugins tab.
- Filter by trainer plugins.
- Install the Nanotron Pre-training Framework plugin.
Note: This plugin supports both single and multi-GPU setups.
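If you are not sure how many GPUs your machine exposes, a quick check like the following can help you decide between a single- and multi-GPU setup (a minimal sketch assuming PyTorch is installed in your environment; it is not required by the plugin itself):

```python
# Minimal sketch: report how many CUDA GPUs are visible.
# Assumes PyTorch is installed; not required by the plugin itself.
import torch

if torch.cuda.is_available():
    print(f"{torch.cuda.device_count()} CUDA GPU(s) detected")
else:
    print("No CUDA GPUs detected; the plugin currently requires cuda")
```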

Step 2: Creating a Pre-training Task
- Navigate to the Train tab.
- Click on the New button.
- In the pop-up, complete the following sections:
  - Name: Set a unique name for your pre-training task. This name, followed by the job ID, becomes the name of your pre-trained model.
  - Dataset Tab: Select the dataset to use for training. A simple and small dataset for pre-training tests is stas/openwebtext-10k (contains 10M tokens).
  - Data Template Tab: Specify the column representing the text data. For example, if the dataset has a text column, set the Formatting Template to {{text}} (see the sketch after this list for a quick way to check the column name).
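If you are unsure which column holds the raw text, you can inspect the dataset outside Transformer Lab. The sketch below uses the Hugging Face datasets library and the example dataset above (an optional check, not something the plugin requires you to run):

```python
# Sketch: confirm the name of the text column before filling in the
# Formatting Template. Assumes the `datasets` library is installed.
from datasets import load_dataset

ds = load_dataset("stas/openwebtext-10k", split="train")
print(ds.column_names)       # e.g. ['text'] -> Formatting Template {{text}}
print(ds[0]["text"][:200])   # preview the first record
```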
Step 3: Configuring Plugin Parameters
In the Plugin Config Tab, configure the following parameters:
- Training Device: Set the device for training. Example: "cuda" (only cuda is supported currently).
- Random Seed: Set the seed for reproducibility. Default: 42
- Checkpoint Interval (steps): Determines how often a checkpoint is saved. Default: 1000
- Dataset Split: Specify which part of the dataset to use. Default: "train"
- Text Column Name (in Dataset): Name of the column with text data. Default: "text"
- Tokenizer Name or Path: Set the tokenizer. Default: "robot-test/dummy-tokenizer-wordlevel"
- Maximum Sequence Length: Maximum tokens per sequence. Default: 256 (range: 128 - 8192)
- Model Hidden Size: Dimensionality of the model's hidden layers. Default: 16 (range: 16 - 8192)
- Number of Hidden Layers: Total hidden layers in the model. Default: 2 (minimum: 2)
- Number of Attention Heads: Total attention heads. Default: 4 (minimum: 2)
- Number of KV Heads (for GQA): KV heads for Grouped Query Attention. Default: 4 (minimum: 2)
- Intermediate Size: Size of the feed-forward network. Default: 64 (minimum: 16)
- Micro Batch Size: Number of samples per micro batch. Default: 2
- Total Training Steps: Total number of steps for training. Default: 9500
- Learning Rate: Initial learning rate. Default: 5e-4
- Warmup Steps: Steps for the warmup phase. Default: 2
- Annealing Phase Start Step: Step at which the annealing phase starts. Default: 10
- Weight Decay: Regularization parameter. Default: 0.01
- Data Parallel Size: Number of GPUs for data parallelism. Default: 2
- Tensor Parallel Size: Number of GPUs for tensor parallelism. Default: 1
- Pipeline Parallel Size: Number of GPUs for pipeline parallelism. Default: 1
- Mixed Precision Type: Floating point precision mode. Options: bfloat16, float32, float64. Default: bfloat16
Note: The product of Data Parallel Size, Tensor Parallel Size, and Pipeline Parallel Size must equal the total number of GPUs available.
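Several of these values are interrelated: the hidden size is typically divided evenly across the attention heads, the attention heads are grouped evenly over the KV heads for GQA, and the three parallelism sizes must multiply out to your GPU count. A quick back-of-the-envelope check, with illustrative values mirroring the defaults above (only a sketch, not part of the plugin):

```python
# Sketch: sanity-check related plugin parameters before queuing a job.
# Values mirror the defaults above; the GPU count is illustrative.
hidden_size = 16
num_attention_heads = 4
num_kv_heads = 4
data_parallel = 2
tensor_parallel = 1
pipeline_parallel = 1
available_gpus = 2  # e.g. from nvidia-smi or torch.cuda.device_count()

# Each attention head gets an equal slice of the hidden size.
assert hidden_size % num_attention_heads == 0
# GQA shares each KV head across a whole number of query heads.
assert num_attention_heads % num_kv_heads == 0
# Data x tensor x pipeline parallel sizes must equal the available GPUs.
assert data_parallel * tensor_parallel * pipeline_parallel == available_gpus
```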

Step 4: Queue and Run the Pre-training Task
After configuring your task:
- Save the pre-training template by clicking on Save Training Template.
- Click on Queue to start the pre-training job.

Step 5: Post-training
Once the training finishes, the pre-trained model is available in the Foundation tab. You can then use this model for further preference training or for interactive chatting.
