
Transformer Lab Can Talk Now: Introducing Text-to-Speech, Training & One-Shot Voice Cloning

· 5 min read

🎉 Transformer Lab just got a voice! We’re thrilled to announce audio modality support so you can generate, clone, and train voices directly in Transformer Lab.

What’s included in this release

  • 🎙️ Turn text into speech (TTS) with CUDA, AMD and MLX
  • 🛠️ Train your own TTS models on CUDA and AMD
  • 🧬 Clone a voice in one shot for lightning-fast replication on CUDA and AMD

🚀 Text-to-Speech on MLX

We’ve added TTS support to Transformer Lab’s MLX generation plugin, making it easier than ever to generate natural-sounding audio.

Here’s how you can try it today:

  1. Install the Apple Audio MLX Server plugin
  2. Pick a supported audio model in the Foundation tab
  3. Switch to the Audio tab
  4. Adjust your generation settings and start creating speech instantly!

🎧 Supported Model Families

We currently support several powerful TTS model categories. Here are a few examples you can try right now:

👀 Watch It in Action

Here’s a quick demo showing how simple it is to generate speech in Transformer Lab using Kokoro-82M-4bit:

[Demo GIF: mlx-tts-generation.gif]

In just a few clicks, we went from plain text to lifelike audio. For this example, we used the sentence:

“Hello! Welcome to Transformer Lab, where we turn text into natural-sounding speech.”

🎛️ MLX Generation Parameters

When you generate audio with the MLX plugin, you’ll see a set of parameters you can adjust to customize the output. Here’s what each one does (see the short code sketch after this list for how they fit together):

  • text → The input string you want to convert to speech.
  • Sample Rate → Number of audio samples per second; higher rates mean clearer, more detailed audio.
  • Temperature → Controls randomness in speech; lower = consistent, higher = more expressive and varied.
  • Speech Speed → Adjusts how quickly the text is spoken: slower for clarity, faster for natural pacing.
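
If you prefer to think in code, here’s a minimal sketch of how these knobs relate to the raw audio a TTS backend hands back. This is not the plugin’s API: the waveform below is a synthetic stand-in so the example runs anywhere with numpy and soundfile installed.

```python
# Stand-in sketch (not the MLX plugin's API): shows how sample rate and
# speech speed translate into the raw audio you get back from a TTS backend.
import numpy as np
import soundfile as sf

sample_rate = 24_000   # samples per second; higher = finer detail
speech_speed = 1.0     # >1.0 speaks faster (less audio for the same text)

# Stand-in for model output: a real TTS backend returns a float waveform.
duration_s = 2.0 / speech_speed
t = np.linspace(0.0, duration_s, int(sample_rate * duration_s), endpoint=False)
waveform = 0.1 * np.sin(2 * np.pi * 220.0 * t)

sf.write("tts_output.wav", waveform, sample_rate)  # playable WAV file
```

The key point: the sample rate determines how many values per second end up in the file, while the speed setting changes how much audio is produced for the same text.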

⚡ Text-to-Speech & One-Shot Cloning on CUDA and AMD

On CUDA and AMD, you can perform one-shot audio cloning: replicate a voice instantly from just one reference sample.

Here’s how you can try it today:

  1. Install the Unsloth Text-to-Speech Server plugin
  2. Pick a supported audio model in the Foundation tab
  3. Switch to the Audio tab
  4. Adjust your generation settings and start creating speech instantly!

🎧 Supported Model Families

👀 Watch It in Action

Here’s a quick demo showing how simple it is to generate speech in Transformer Lab using unsloth/orpheus-3b-0.1-ft:

[Demo GIF: cuda_tts_generation_one_shot_audio_cloning.gif]

First, here’s the model generating speech directly from text:

Next, we provided a single sample of the target voice we wanted to clone:

Finally, here’s the result — the model speaking the same sentence, but now in the cloned voice:
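
Under the hood, one-shot cloning needs exactly two inputs: the text you want spoken and a short reference clip of the target voice. Here’s a rough, hypothetical sketch of that flow; `synthesize` and `clone_and_speak` are placeholders for illustration, not the Unsloth plugin’s actual API.

```python
# Hypothetical sketch of the one-shot cloning flow. `synthesize` and
# `clone_and_speak` are placeholders, NOT the Unsloth plugin's actual API.
import numpy as np
import soundfile as sf

def synthesize(text, voice_prompt, sample_rate=24_000):
    # A real cloning model would generate speech in the reference voice here.
    # We return one second of silence so the sketch runs end to end.
    return np.zeros(sample_rate, dtype=np.float32), sample_rate

def clone_and_speak(text, reference_wav, out_path):
    reference_audio, ref_sr = sf.read(reference_wav)   # the single reference clip
    waveform, sr = synthesize(text, (reference_audio, ref_sr))
    sf.write(out_path, waveform, sr)

clone_and_speak(
    "Hello! Welcome to Transformer Lab.",
    "target_voice_sample.wav",                         # one sample is enough
    "cloned_voice.wav",
)
```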

🏗️ Training Your Own TTS Model on CUDA and AMD

While one-shot cloning is powerful, you can take it even further by training a model directly on the target voice. This gives the model more examples to learn from, resulting in more consistent and natural-sounding speech.

For this demo, we used the bosonai/EmergentTTS-Eval dataset and trained a custom TTS model inside Transformer Lab.

[Demo GIF: training_tts.gif]

🎛️ Training Parameters

Here are some of the key parameters you’ll see in the training configuration tab (the snippet below shows how to check the column names against your dataset):

  • Sampling Rate → Audio sampling frequency
  • Audio Column Name → Dataset column containing audio files
  • Text Column Name → Dataset column containing transcriptions

For a complete list of training parameters and detailed explanations, see the Text-to-Speech Training Documentation.
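
A quick way to figure out what to put in the column-name fields is to peek at the dataset before training. The snippet below assumes the Hugging Face `datasets` library; the split and column names vary by dataset, so treat them as placeholders and check the dataset card.

```python
# Inspect a dataset to find the audio/text columns for the training config.
# Split and column names vary; check the dataset card on the Hub.
from datasets import load_dataset

ds = load_dataset("bosonai/EmergentTTS-Eval")   # may also require a config name
print(ds)              # lists splits, column names, and row counts
split = next(iter(ds.values()))
print(split.features)  # an Audio feature also reports its sampling rate
```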

👀 Watch It in Action

To compare, here are three samples.

Before training — the model’s default voice generating our sentence:

Sample from dataset — a real voice clip the model trained on:

After training — the model reproducing the same sentence in the target voice:

We’re just getting started with audio support in Transformer Lab, and we want to make sure we’re adding the models that matter most to you. 🎙️

👉 Which text-to-speech or voice cloning models would you like to see supported next?

Drop your suggestions in our Discord community — we’re always listening and excited to hear your ideas.

Transformer Lab Now Works with AMD GPUs

· 17 min read

We're excited to announce that Transformer Lab now supports AMD GPUs! Whether you're on Linux or Windows, you can now harness the power of your AMD hardware to run and train models with Transformer Lab.
👉 Read the full installation guide here

TL;DR

If you have an AMD GPU and want to do ML work, just follow our guide above and skip a lot of stress.

Figuring out how to build a reliable PyTorch workspace on AMD was... messy. We've documented everything below.
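
If you just want to confirm your setup works after following the guide, note that a ROCm build of PyTorch still reports through the familiar `torch.cuda` API. A quick sanity check (assuming you installed a ROCm wheel) looks like this:

```python
# Sanity-check that a ROCm build of PyTorch can see the AMD GPU.
# ROCm builds expose the GPU through the regular torch.cuda API.
import torch

print(torch.__version__)            # ROCm wheels look like "2.x.x+rocmX.Y"
print(torch.version.hip)            # HIP version string; None on CUDA-only builds
print(torch.cuda.is_available())    # True if the AMD GPU is visible
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```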

Generating Datasets and Training Models with Transformer Lab

· 3 min read

Introduction

In this tutorial, we'll explore how to bridge a knowledge gap in our model by generating custom dataset content and then fine-tuning the model using a LoRA adapter. The process begins with generating data from raw text using the Generate Data from Raw Text Plugin and concludes with fine-tuning via the MLX LoRA Plugin within Transformer Lab.

Fine Tuning a Python Code Completion Model

· 7 min read

This post details our journey to fine-tune SmolLM 135M, a compact language model, for Python code completion.

We chose SmolLM 135M for its size, which allows for rapid iteration. Instead of full fine-tuning, we employed LoRA (Low-Rank Adaptation), a technique that introduces trainable "adapter" matrices into the transformer layers. This provides a good balance between parameter efficiency and achieving solid results on the downstream task (code completion).

Transformer Lab handled the training, evaluation, and inference, abstracting away much of the underlying complexity. We used the flytech/python-codes-25k dataset, consisting of 25,000 Python code snippets, without any specific pre-processing. Our training setup involved a constant learning rate, a batch size of 4, and an NVIDIA RTX 4060 GPU.
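
For readers who want to reproduce the general setup outside the app, here is an illustrative LoRA configuration using the `peft` library. The model repo id, rank, alpha, and target modules below are assumptions for the sketch, not the exact values from our runs.

```python
# Illustrative LoRA setup with peft; hyperparameter values are assumptions,
# not the exact configuration used in the runs described above.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M")

lora_config = LoraConfig(
    r=8,                                   # rank of the adapter matrices
    lora_alpha=16,                         # scaling factor for adapter updates
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections get adapters
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # only adapter weights are trainable
```

Because only the small adapter matrices are trainable, a model this size fits comfortably on a single consumer GPU, which is what made the rapid iteration described below practical.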

The Iterative Fine-tuning Process: Nine Runs to Success

The core of this project was an iterative refinement of LoRA hyperparameters and training duration. We tracked both the training loss and conducted qualitative assessments of the generated code (our "vibe check") to judge its syntactic correctness and logical coherence. This combination of quantitative and qualitative feedback proved crucial in guiding our parameter adjustments.