
Text-to-Speech (TTS) Generation

Transformer Lab supports Text-to-Speech (TTS) on MLX (Apple Silicon), CUDA (NVIDIA GPUs), and ROCm (AMD GPUs). This feature lets you convert plain text into natural-sounding speech directly inside Transformer Lab.

TTS Screenshot

How It Works

  1. Install the appropriate plugin:

    • Apple Audio MLX Server (for MLX)
    • Unsloth Text-to-Speech Server (for CUDA and ROCm)
  2. Select a TTS model in the Foundation tab

  3. Switch to the Audio tab

  4. Enter text, adjust generation parameters, and generate audio
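As a rough aid for step 1, the sketch below maps a machine description to one of the two plugin names listed above. The plugin names come from this page. The detection heuristic is an assumption and not part of Transformer Lab: it checks for macOS on arm64, or for the `nvidia-smi` or `rocm-smi` command-line tools on the PATH. The function names are hypothetical.

```python
import platform
import shutil

# Plugin names as listed on this page.
MLX_PLUGIN = "Apple Audio MLX Server"
UNSLOTH_PLUGIN = "Unsloth Text-to-Speech Server"


def suggest_tts_plugin(system, machine, has_nvidia_tool, has_rocm_tool):
    """Return the plugin name matching the described hardware, or None.

    Assumption: Apple Silicon reports system "Darwin" and machine "arm64",
    and a GPU driver tool on the PATH implies a usable CUDA or ROCm setup.
    """
    if system == "Darwin" and machine == "arm64":
        return MLX_PLUGIN
    if has_nvidia_tool or has_rocm_tool:
        return UNSLOTH_PLUGIN
    return None


def suggest_for_this_machine():
    """Apply the heuristic to the machine running this script."""
    return suggest_tts_plugin(
        platform.system(),
        platform.machine(),
        shutil.which("nvidia-smi") is not None,
        shutil.which("rocm-smi") is not None,
    )
```

Calling `suggest_for_this_machine()` returns a plugin name or `None`. A `None` result only means the heuristic found nothing, not that the hardware is unsupported.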

Supported Model Families

You can start generating audio today with the following models:

MLX (Apple Silicon)

CUDA (NVIDIA GPUs) and ROCm (AMD GPUs)

TTS Generation Process

Here's a visual guide to the TTS generation process in Transformer Lab:

MLX TTS Generation

This demonstrates the complete workflow from model selection to audio output generation.

Generation Parameters

When generating speech, you’ll see the following parameters:

  • Text → The input string to convert into speech
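  • Sample Rate → Number of audio samples per second; higher rates can preserve more detail at the cost of larger files
  • Temperature → Controls sampling randomness; lower values give more consistent output, higher values give more varied, expressive output
  • Speech Speed → Adjusts the pacing of speech; slower speech is easier to follow, faster speech can sound more conversational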
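To make the numeric parameters above more concrete, here is a minimal Python sketch of the arithmetic behind them. It is illustrative only and does not show how Transformer Lab or any specific model implements these settings. It makes three assumptions: that output is 16-bit PCM, that temperature divides the model's raw scores before a softmax (a common sampling scheme), and that a speed factor simply divides the clip duration. All function names are hypothetical.

```python
import math


def num_samples(duration_seconds, sample_rate):
    """Number of samples in a mono clip of the given length."""
    return round(duration_seconds * sample_rate)


def pcm16_bytes(duration_seconds, sample_rate, channels=1):
    """Raw size of 16-bit PCM audio: 2 bytes per sample per channel."""
    return num_samples(duration_seconds, sample_rate) * channels * 2


def duration_at_speed(base_seconds, speed):
    """Assumed model of speech speed: a factor of 2.0 halves the duration."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return base_seconds / speed


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Under these assumptions, doubling the sample rate doubles the raw audio size for the same clip, and lowering the temperature concentrates probability on the highest-scoring option, which is why low temperatures tend to give more repeatable output.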

Some models expose extra controls for more flexibility:

  • Audio Cloning → Provide a short reference sample to make the output mimic that voice
  • Language → Choose the language for generation (if multilingual support is available)
  • Voice → Select a specific voice style or speaker profile offered by the model
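One way to think about the split between the common parameters and the model-dependent extras is as a settings object whose optional fields stay unset when a model does not offer them. The sketch below is purely hypothetical: the class name, field names, and default values are placeholders chosen for illustration and do not reflect Transformer Lab's API or any model's actual defaults.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class TTSSettings:
    """Hypothetical container for the parameters described above."""

    text: str
    sample_rate: int = 24000   # placeholder default, not a real model value
    temperature: float = 0.7   # placeholder default
    speed: float = 1.0         # placeholder default
    # Optional controls: leave as None when the model does not expose them.
    reference_audio_path: Optional[str] = None  # sample for audio cloning
    language: Optional[str] = None
    voice: Optional[str] = None

    def validate(self):
        """Raise ValueError for values that cannot be meaningful."""
        if not self.text.strip():
            raise ValueError("text must not be empty")
        if self.sample_rate <= 0:
            raise ValueError("sample_rate must be positive")
        if self.temperature <= 0:
            raise ValueError("temperature must be positive")
        if self.speed <= 0:
            raise ValueError("speed must be positive")

    def active_options(self):
        """Return only the settings that have been set."""
        return {k: v for k, v in asdict(self).items() if v is not None}
```

For example, `TTSSettings(text="Hello", voice="narrator").active_options()` includes `voice` but omits `language` and `reference_audio_path`, mirroring how the extra controls appear only for models that support them.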

Next Steps

Learn how to train your own TTS models: