One post tagged with "audio"

Transformer Lab Can Talk Now: Introducing Text-to-Speech, Training & One-Shot Voice Cloning

September 9, 2025 · 5 min read

Person

🎉 Transformer Lab just got a voice! We’re thrilled to announce audio modality support so you can generate, clone, and train voices directly in Transformer Lab.

What’s included in this release

🎙️ Turn text into speech (TTS) with CUDA, AMD and MLX
🛠️ Train your own TTS models on CUDA and AMD
🧬 Clone a voice in one shot for lightning-fast replication on CUDA and AMD

🚀 Text-to-Speech on MLX

We’ve added TTS support to Transformer Lab’s MLX generation plugin, making it easier than ever to generate natural-sounding audio.

Here’s how you can try it today:

Install the Apple Audio MLX Server plugin
Pick a supported audio model in the Foundation tab
Switch to the Audio tab
Adjust your generation settings and start creating speech instantly!

🎧 Supported Model Families

We currently support several powerful TTS model categories. Here are a few examples you can try right now:

Kokoro → mlx-community/Kokoro-82M-4bit
Dia → mlx-community/Dia-1.6B
Spark → mlx-community/Spark-TTS-0.5B-bf16
Bark → mlx-community/bark-small
CSM → mlx-community/csm-1b

👀 Watch It in Action

Here’s a quick demo showing how simple it is to generate speech in Transformer Lab using Kokoro-82M-4bit:

In just a few clicks, we went from plain text to lifelike audio. For this example, we used the sentence:

“Hello! Welcome to Transformer Lab, where we turn text into natural-sounding speech.”

🎛️ MLX Generation Parameters

When you generate audio with the MLX plugin, you’ll see a set of parameters you can adjust to customize the output. Here’s what each one does:

text → The input string you want to convert to speech.
Sample Rate → Number of audio samples per second; higher rates mean clearer, more detailed audio.
Temperature → Controls randomness in speech; lower = consistent, higher = more expressive and varied.
Speech Speed → Adjusts how quickly the text is spoken: slower for clarity, faster for natural pacing.

⚡ Text-to-Speech & One-Shot Cloning on CUDA and AMD

On CUDA and AMD, you can perform one-shot audio cloning replicating a voice instantly from just one reference sample

Here’s how you can try it today:

Install the Unsloth Text-to-Speech Server plugin
Pick a supported audio model in the Foundation tab
Switch to the Audio tab
Adjust your generation settings and start creating speech instantly!

🎧 Supported Model Families

Orpheus → unsloth/orpheus-3b-0.1-ft
CSM → unsloth/csm-1b

👀 Watch It in Action

Here’s a quick demo showing how simple it is to generate speech in Transformer Lab using unsloth/orpheus-3b-0.1-ft:

First, here’s the model generating speech directly from text:

Next, we provided a single sample of the target voice we wanted to clone:

Finally, here’s the result — the model speaking the same sentence, but now in the cloned voice:

🏗️Training Your Own TTS Model on CUDA and AMD

While one-shot cloning is powerful, you can take it even further by training a model directly on the target voice. This gives the model more examples to learn from, resulting in more consistent and natural-sounding speech.

For this demo, we used the bosonai/EmergentTTS-Eval dataset and trained a custom TTS model inside Transformer Lab.

🎛️ Training Parameters

Here are some of the key parameters you’ll see in the training configuration tab:

Sampling Rate → Audio sampling frequency
Audio Column Name → Dataset column containing audio files
Text Column Name → Dataset column containing transcriptions

For a complete list of training parameters and detailed explanations, see the Text-to-Speech Training Documentation.

👀 Watch It in Action

To compare, here are three samples: Before training — the model’s default voice generating our sentence:

Sample from dataset — a real voice clip the model trained on:

After training — the model reproducing the same sentence in the target voice:

We’re just getting started with audio support in Transformer Lab, and we want to make sure we’re adding the models that matter most to you. 🎙️

👉 Which text-to-speech or voice cloning models would you like to see supported next?

Drop your suggestions in our Discord community — we’re always listening and excited to hear your ideas.

What’s included in this release​

🚀 Text-to-Speech on MLX​

🎧 Supported Model Families​

👀 Watch It in Action​

🎛️ MLX Generation Parameters​

⚡ Text-to-Speech & One-Shot Cloning on CUDA and AMD​

🎧 Supported Model Families​

👀 Watch It in Action​

🏗️Training Your Own TTS Model on CUDA and AMD​

🎛️ Training Parameters​

👀 Watch It in Action​

What’s included in this release

🚀 Text-to-Speech on MLX

🎧 Supported Model Families

👀 Watch It in Action

🎛️ MLX Generation Parameters

⚡ Text-to-Speech & One-Shot Cloning on CUDA and AMD

🎧 Supported Model Families

👀 Watch It in Action

🏗️Training Your Own TTS Model on CUDA and AMD

🎛️ Training Parameters

👀 Watch It in Action