Guide 2: Training Environment Setup & Configuration

Navigation: Main README | Previous Step: Data Preparation | Next Step: Model Training

With your dataset prepared, the next stage involves setting up the necessary software environment and configuring the parameters for your specific training run.


3. Training Environment Setup

This section covers installing the required software and organizing your project files.

3.1. Choose and Clone a TTS Framework

TTS Architecture Comparison

When selecting a TTS architecture, consider these popular options and their characteristics:

| Architecture | Pros | Cons | Best For | Hardware Requirements |
|---|---|---|---|---|
| VITS | • End-to-end (no separate vocoder)<br>• High-quality audio<br>• Fast inference<br>• Good for fine-tuning | • Complex to understand<br>• Can be unstable during training<br>• Requires careful hyperparameter tuning | • Single-speaker voice cloning<br>• Projects needing high-quality output<br>• When you have 5+ hours of data | • Training: 8GB+ VRAM<br>• Inference: 4GB+ VRAM |
| StyleTTS2 | • Excellent voice and style control<br>• State-of-the-art quality<br>• Good for emotion/prosody | • Newer, potentially less stable implementations<br>• More complex architecture<br>• Fewer community resources | • Projects requiring style control<br>• Expressive speech synthesis<br>• Multi-speaker with style transfer | • Training: 12GB+ VRAM<br>• Inference: 6GB+ VRAM |
| Tacotron2 + HiFi-GAN | • Well-established, stable<br>• Easier to understand<br>• More tutorials available<br>• Separate components for easier debugging | • Two-stage pipeline (slower)<br>• Generally lower quality than newer models<br>• More prone to attention failures on long text | • Educational projects<br>• When stability is prioritized over quality<br>• Lower resource environments | • Training: 6GB+ VRAM<br>• Inference: 2GB+ VRAM |
| FastSpeech2 | • Non-autoregressive (faster inference)<br>• More stable than Tacotron2<br>• Good documentation | • Requires phoneme alignments<br>• More complex preprocessing<br>• Quality not as high as VITS/StyleTTS2 | • Real-time applications<br>• When inference speed is critical<br>• More controlled output | • Training: 8GB+ VRAM<br>• Inference: 2GB+ VRAM |
| YourTTS (VITS variant) | • Multilingual support<br>• Zero-shot voice cloning<br>• Good for language transfer | • Complex training setup<br>• Requires careful data preparation<br>• May need larger datasets | • Multilingual projects<br>• Cross-lingual voice cloning<br>• When language flexibility is needed | • Training: 10GB+ VRAM<br>• Inference: 4GB+ VRAM |
| Diffusion-based TTS | • Highest quality potential<br>• More natural prosody<br>• Better handling of rare words | • Very slow inference<br>• Extremely compute-intensive training<br>• Newer, less established | • Offline generation<br>• When quality trumps speed<br>• Research projects | • Training: 16GB+ VRAM<br>• Inference: 8GB+ VRAM |

Note on Hardware Requirements: the VRAM figures in the table above are approximate minimums for each architecture; see the detailed breakdown below before committing to a setup.

Detailed Hardware Requirements

Choosing the right hardware is critical for successful TTS model training. Here’s a detailed breakdown of requirements for different scenarios:

GPU Requirements by Model Type and Dataset Size

| Model Type | Small Dataset (<10h) | Medium Dataset (10-50h) | Large Dataset (>50h) | Recommended GPU Models |
|---|---|---|---|---|
| Tacotron2 + HiFi-GAN | 8GB VRAM | 12GB VRAM | 16GB+ VRAM | RTX 3060, RTX 2080, T4 |
| FastSpeech2 | 8GB VRAM | 12GB VRAM | 16GB+ VRAM | RTX 3060, RTX 2080, T4 |
| VITS | 12GB VRAM | 16GB VRAM | 24GB+ VRAM | RTX 3080, RTX 3090, A5000 |
| StyleTTS2 | 16GB VRAM | 24GB VRAM | 32GB+ VRAM | RTX 3090, RTX 4090, A100 |
| XTTS-v2 | 24GB VRAM | 32GB VRAM | 40GB+ VRAM | RTX 4090, A100, A6000 |
| Diffusion-based TTS | 16GB VRAM | 24GB VRAM | 32GB+ VRAM | RTX 3090, RTX 4090, A100 |
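
To see where your own GPU falls in this table, you can query its name and total VRAM from PyTorch. This is a minimal sketch; it assumes a PyTorch build with CUDA support is already installed:

```python
import torch

# Report the name and total VRAM of each visible CUDA device.
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        vram_gb = props.total_memory / 1024**3
        print(f"GPU {i}: {props.name}, {vram_gb:.1f} GB VRAM")
else:
    print("No CUDA-capable GPU detected; training on CPU is not practical for TTS.")
```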

CPU and System Memory

| Training Scale | CPU Requirements | System RAM | Storage |
|---|---|---|---|
| Hobby/Personal | 4+ cores, 2.5GHz+ | 16GB | 50GB SSD |
| Research | 8+ cores, 3.0GHz+ | 32GB | 100GB+ SSD |
| Production | 16+ cores, 3.5GHz+ | 64GB+ | 500GB+ NVMe SSD |

Cloud GPU Options and Approximate Costs

| Cloud Provider | GPU Option | VRAM | Approx. Cost/Hour | Best For |
|---|---|---|---|---|
| Google Colab | T4/P100 (Free)<br>V100/A100 (Pro) | 16GB<br>16-40GB | Free<br>$10-$15/month (Pro subscription) | Experimentation, small datasets |
| Kaggle | P100/T4 | 16GB | Free (limited hours) | Small-medium datasets |
| AWS | g4dn.xlarge (T4)<br>p3.2xlarge (V100)<br>p4d.24xlarge (A100) | 16GB<br>16GB<br>40GB | $0.50-$0.75<br>$3.00-$3.50<br>$20.00-$32.00 | Any scale, production |
| GCP | n1-standard-8 + T4<br>a2-highgpu-1g (A100) | 16GB<br>40GB | $0.35-$0.50<br>$3.80-$4.50 | Any scale, production |
| Azure | NC6s_v3 (V100)<br>NC24ads_A100_v4 | 16GB<br>80GB | $3.00-$3.50<br>$16.00-$24.00 | Any scale, production |
| Lambda Labs | 1x RTX 3090<br>1x A100 | 24GB<br>40GB | $1.10<br>$1.99 | Research, medium datasets |
| Vast.ai | Various consumer GPUs | 8-24GB | $0.20-$1.00 | Budget-conscious training |

Training Time Estimates

| Model | Dataset Size | GPU | Approximate Training Time | Steps to Convergence |
|---|---|---|---|---|
| Tacotron2 + HiFi-GAN | 10 hours | RTX 3080 | 2-3 days | 50-100K steps |
| FastSpeech2 | 10 hours | RTX 3080 | 2-3 days | 150-200K steps |
| VITS | 10 hours | RTX 3090 | 3-5 days | 300-500K steps |
| StyleTTS2 | 10 hours | RTX 3090 | 4-7 days | 500-800K steps |
| XTTS-v2 | 10 hours | RTX 4090 | 5-10 days | 1M+ steps |
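
If you measure your own average step time, you can turn a step budget from this table into a rough wall-clock and cloud-cost estimate. The step time and hourly rate below are placeholder assumptions for illustration, not measurements from this guide:

```python
# Rough wall-clock and cost estimate from a step budget.
target_steps = 400_000        # e.g. mid-range of the VITS row above
seconds_per_step = 0.9        # assumed; measure this on your own hardware
cost_per_hour = 1.10          # assumed; e.g. a rented RTX 3090

hours = target_steps * seconds_per_step / 3600
print(f"~{hours:.0f} GPU-hours (~{hours / 24:.1f} days)")
print(f"~${hours * cost_per_hour:.0f} at ${cost_per_hour:.2f}/hour")
```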

Optimization Tips to Reduce Hardware Requirements

  1. Gradient Accumulation: Simulate larger batch sizes by accumulating gradients over multiple forward/backward passes
  2. Mixed Precision Training: Use FP16 instead of FP32 to reduce VRAM usage by up to 50%
  3. Gradient Checkpointing: Trade computation for memory by recomputing activations during backward pass
  4. Model Parallelism: Split large models across multiple GPUs
  5. Progressive Training: Start with smaller models/configurations and gradually increase complexity

These figures should help you plan your hardware around your specific project goals and budget constraints.
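
Tips 1 and 2 above usually amount to only a few extra lines in a PyTorch training loop. The sketch below shows gradient accumulation combined with mixed precision; the tiny model, optimizer, and random batches are toy stand-ins for whatever your chosen framework provides:

```python
import torch
from torch import nn

# Toy stand-ins for the model, optimizer, and data your framework supplies.
model = nn.Linear(80, 80).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
dataloader = [(torch.randn(8, 80).cuda(), torch.randn(8, 80).cuda()) for _ in range(16)]

accum_steps = 4                       # effective batch = per-step batch * accum_steps
scaler = torch.cuda.amp.GradScaler()  # handles FP16 loss scaling

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(dataloader):
    with torch.cuda.amp.autocast():   # tip 2: mixed precision forward pass
        loss = nn.functional.mse_loss(model(inputs), targets) / accum_steps

    scaler.scale(loss).backward()     # tip 1: accumulate gradients over several batches

    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)        # optimizer step only every accum_steps batches
        scaler.update()
        optimizer.zero_grad()
```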

3.2. Set Up Python Environment & Install Dependencies
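
Once the environment is created and the framework's dependencies are installed, a short sanity check confirms that the interpreter, PyTorch, and the GPU are all visible from inside it. This minimal sketch assumes only that PyTorch is installed, not any particular TTS framework:

```python
import sys
import torch

# Verify the Python version, PyTorch build, and GPU visibility from inside the environment.
print("Python:", sys.version.split()[0])
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```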

3.3. Organize Your Project Folder
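
The exact layout depends on your framework, but keeping audio, metadata, configs, and checkpoints in separate folders makes later steps easier. The directory names in this sketch are illustrative assumptions, not mandated by any particular framework:

```python
from pathlib import Path

# Illustrative project layout; adjust names to match your framework's expectations.
project = Path("my_tts_project")
for sub in ["data/wavs", "data/metadata", "configs", "checkpoints", "logs"]:
    (project / sub).mkdir(parents=True, exist_ok=True)

print(*sorted(p for p in project.rglob("*") if p.is_dir()), sep="\n")
```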


4. Configuring the Training Run

Before launching the training, you need to create a configuration file that tells the framework how to train the model, using your specific data.

4.1. Find and Copy a Base Configuration
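
As a minimal sketch (the source path here is a placeholder for wherever your chosen framework keeps its example configs), copying a base config into your own configs folder can be done like this:

```python
import shutil
from pathlib import Path

# Hypothetical paths: copy the framework's example config into your own configs folder.
base_config = Path("TTS_framework/configs/example_config.json")  # placeholder path
my_config = Path("configs/my_voice_config.json")

my_config.parent.mkdir(parents=True, exist_ok=True)
shutil.copy(base_config, my_config)
print(f"Copied {base_config} -> {my_config}")
```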

4.2. Edit Your Custom Configuration File
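
Frameworks differ in whether they use JSON or YAML and in their exact key names, so treat the fields below as illustrative placeholders. The pattern, however, is the same: load your copied config, point it at your prepared dataset, adjust batch size and learning rate, and save it back:

```python
import json

# Hypothetical edits to the copied config; real key names depend on your framework.
with open("configs/my_voice_config.json") as f:
    config = json.load(f)

config["data"]["training_files"] = "data/metadata/train.csv"
config["data"]["validation_files"] = "data/metadata/val.csv"
config["data"]["sampling_rate"] = 22050      # must match your prepared audio
config["train"]["batch_size"] = 16           # lower this if you hit out-of-memory errors
config["train"]["learning_rate"] = 2e-4

with open("configs/my_voice_config.json", "w") as f:
    json.dump(config, f, indent=2)
```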

4.3. Hardware and Dataset Considerations

4.4. Monitoring Tools (TensorBoard)
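
Most frameworks already write TensorBoard event files to their log directory, so you usually only need to point TensorBoard at it. If you are writing your own loop, logging scalars looks like this minimal sketch using PyTorch's built-in SummaryWriter (the loss values here are placeholders):

```python
from torch.utils.tensorboard import SummaryWriter

# Write scalars (loss, learning rate, etc.) that TensorBoard can plot while training runs.
writer = SummaryWriter(log_dir="logs/my_voice_run")
for step in range(100):
    fake_loss = 1.0 / (step + 1)          # placeholder; log your real training loss here
    writer.add_scalar("train/loss", fake_loss, step)
writer.close()

# View the curves in a browser with:  tensorboard --logdir logs
```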


With your environment set up and configuration file tailored to your data and goals, you are now ready to start the actual model training process.

Next Step: Model Training | Back to Top