

Transformer Lab Can Talk Now: Introducing Text-to-Speech, Training & One-Shot Voice Cloning


🎉 Transformer Lab just got a voice! We’re thrilled to announce audio modality support so you can generate, clone, and train voices directly in Transformer Lab.

What’s included in this release

  • 🎙️ Turn text into speech (TTS) with CUDA, AMD and MLX
  • 🛠️ Train your own TTS models on CUDA and AMD
  • 🧬 Clone a voice in one shot for lightning-fast replication on CUDA and AMD

🚀 Text-to-Speech on MLX

We’ve added TTS support to Transformer Lab’s MLX generation plugin, making it easier than ever to generate natural-sounding audio.

Here’s how you can try it today:

  1. Install the Apple Audio MLX Server plugin
  2. Pick a supported audio model in the Foundation tab
  3. Switch to the Audio tab
  4. Adjust your generation settings and start creating speech instantly!
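
Prefer scripting to clicking? The same flow can be driven from Python with the open-source mlx-audio package. Here's a minimal sketch; treat the exact function name and arguments as assumptions and double-check them against that package's docs:

```python
# Sketch: scripted TTS with the open-source mlx-audio package.
# The function name and arguments are assumptions based on that package's
# docs; verify them there before relying on this.
from mlx_audio.tts.generate import generate_audio

generate_audio(
    text="Hello! Welcome to Transformer Lab, where we turn text into natural-sounding speech.",
    model_path="mlx-community/Kokoro-82M-4bit",  # repo id is an assumption
    speed=1.0,              # normal speaking speed
    sample_rate=24000,      # samples per second
    file_prefix="welcome",  # prefix for the output audio file
)
```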

🎧 Supported Model Families

We currently support several powerful TTS model categories; the Kokoro family shown in the demo below is one you can try right now.

👀 Watch It in Action

Here’s a quick demo showing how simple it is to generate speech in Transformer Lab using Kokoro-82M-4bit:

[Demo: mlx-tts-generation.gif]

In just a few clicks, we went from plain text to lifelike audio. For this example, we used the sentence:

“Hello! Welcome to Transformer Lab, where we turn text into natural-sounding speech.”

🎛️ MLX Generation Parameters

When you generate audio with the MLX plugin, you’ll see a set of parameters you can adjust to customize the output. Here’s what each one does (the short sketch after this list makes Temperature concrete):

  • text → The input string you want to convert to speech.
  • Sample Rate → Number of audio samples per second; higher rates mean clearer, more detailed audio.
  • Temperature → Controls randomness in speech; lower = consistent, higher = more expressive and varied.
  • Speech Speed → Adjusts how quickly the text is spoken: slower for clarity, faster for natural pacing.
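
To make Temperature concrete, here's a tiny self-contained sketch of temperature-scaled sampling, the same mechanism autoregressive TTS models use when picking each next audio token:

```python
import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float) -> int:
    """Sample one token id from raw logits, with temperature scaling."""
    # Lower temperature sharpens the distribution (more consistent speech);
    # higher temperature flattens it (more varied, expressive speech).
    scaled = logits / max(temperature, 1e-6)
    probs = np.exp(scaled - scaled.max())  # subtract max for stability
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

logits = np.array([2.0, 1.0, 0.5, 0.1])
print(sample_with_temperature(logits, temperature=0.2))  # almost always token 0
print(sample_with_temperature(logits, temperature=1.5))  # more varied picks
```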

⚡ Text-to-Speech & One-Shot Cloning on CUDA and AMD

On CUDA and AMD, you can perform one-shot audio cloning, replicating a voice instantly from just one reference sample; the sketch after the setup steps below shows the inputs involved.

Here’s how you can try it today:

  1. Install the Unsloth Text-to-Speech Server plugin
  2. Pick a supported audio model in the Foundation tab
  3. Switch to the Audio tab
  4. Adjust your generation settings and start creating speech instantly!
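
To give a feel for what one-shot cloning actually consumes, here's a hypothetical sketch of the inputs involved. CloneRequest and synthesize are invented names for illustration only, not Transformer Lab's real API:

```python
# Hypothetical illustration of the inputs one-shot voice cloning needs.
# `CloneRequest` and `synthesize` are invented names, not a real API.
from dataclasses import dataclass

@dataclass
class CloneRequest:
    text: str             # the sentence to speak in the cloned voice
    reference_audio: str  # path to the single reference sample
    reference_text: str   # transcript of that reference sample

def synthesize(request: CloneRequest) -> bytes:
    """Placeholder: a real backend conditions the TTS model on the
    reference clip and returns the generated waveform."""
    raise NotImplementedError

request = CloneRequest(
    text="Hello! Welcome to Transformer Lab.",
    reference_audio="target_voice.wav",
    reference_text="This is a short sample of my voice.",
)
```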

🎧 Supported Model Families

For example, the demo below uses unsloth/orpheus-3b-0.1-ft.

👀 Watch It in Action

Here’s a quick demo showing how simple it is to generate speech in Transformer Lab using unsloth/orpheus-3b-0.1-ft:

[Demo: cuda_tts_generation_one_shot_audio_cloning.gif]

First, here’s the model generating speech directly from text:

[audio sample]

Next, we provided a single sample of the target voice we wanted to clone:

[audio sample]

Finally, here’s the result: the model speaking the same sentence, but now in the cloned voice:

[audio sample]

🏗️ Training Your Own TTS Model on CUDA and AMD

While one-shot cloning is powerful, you can take it even further by training a model directly on the target voice. This gives the model more examples to learn from, resulting in more consistent and natural-sounding speech.

For this demo, we used the bosonai/EmergentTTS-Eval dataset and trained a custom TTS model inside Transformer Lab.
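
If you'd like to inspect that dataset before kicking off a run, it lives on the Hugging Face Hub; here's a quick sketch (the split name is an assumption; check the dataset card):

```python
# Sketch: pull the dataset locally to inspect it before training.
# The split name is an assumption; check the dataset card on the Hub.
from datasets import load_dataset

ds = load_dataset("bosonai/EmergentTTS-Eval", split="train")
print(ds)     # row count and column names
print(ds[0])  # one example, to find the audio and text columns
```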

[Demo: training_tts.gif]

🎛️ Training Parameters

Here are some of the key parameters you’ll see in the training configuration tab (the sketch below shows how they map onto dataset preprocessing):

  • Sampling Rate → Audio sampling frequency, in Hz
  • Audio Column Name → The dataset column containing audio files
  • Text Column Name → The dataset column containing transcriptions

For a complete list of training parameters and detailed explanations, see the Text-to-Speech Training Documentation.
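
As a concrete example of how those three parameters fit together, here's a preprocessing sketch using the Hugging Face datasets library; the column names and sampling rate shown are assumptions, so match them to your dataset and model:

```python
# Sketch: how the three training parameters map onto dataset preprocessing
# with the Hugging Face `datasets` library. The column names ("audio",
# "text") and the 24 kHz rate are assumptions; use your dataset's actual
# column names and your model's expected rate.
from datasets import load_dataset, Audio

ds = load_dataset("bosonai/EmergentTTS-Eval", split="train")

# Sampling Rate: decode and resample every clip to one target rate.
ds = ds.cast_column("audio", Audio(sampling_rate=24000))

# Audio Column Name / Text Column Name: the pair the trainer reads.
example = ds[0]
waveform = example["audio"]["array"]  # resampled samples as a float array
transcript = example["text"]          # the paired transcription
```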

👀 Watch It in Action

To compare, here are three samples:

  • Before training: the model’s default voice generating our sentence [audio sample]
  • Sample from dataset: a real voice clip the model trained on [audio sample]
  • After training: the model reproducing the same sentence in the target voice [audio sample]

We’re just getting started with audio support in Transformer Lab, and we want to make sure we’re adding the models that matter most to you. 🎙️

👉 Which text-to-speech or voice cloning models would you like to see supported next?

Drop your suggestions in our Discord community — we’re always listening and excited to hear your ideas.