Universal TTS Guide

A comprehensive guide to TTS dataset prep and training

Guide 3: Model Training & Fine-tuning

Navigation: Main README | Previous Step: Training Setup | Next Step: Inference

You’ve prepared your data and configured your training environment. Now it’s time to actually train (or fine-tune) your Text-to-Speech model. This phase involves running the training script, monitoring its progress, and understanding how to manage the process.


5. Running the Training

This section details how to launch, monitor, and manage the training process.

5.1. Launching the Training Script
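
Most frameworks are launched by pointing a training script at a configuration file. The script name, flags, and paths below are placeholders for illustration; substitute your framework's equivalents:

# Hypothetical launch command; script name, config path, and output directory are placeholders
python train.py --config_path configs/my_voice.json --output_path runs/my_voice

# For long runs, launch inside tmux/screen or with nohup so an SSH disconnect does not kill training
nohup python train.py --config_path configs/my_voice.json --output_path runs/my_voice > train.log 2>&1 &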

5.2. Monitoring Training Progress
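
Most PyTorch-based TTS frameworks write TensorBoard event files to the run directory, where you can watch loss curves, attention plots, and synthesized audio samples. Assuming logs land in the output directory used above:

# Point TensorBoard at the directory containing the event files, then open http://localhost:6006
tensorboard --logdir runs/my_voice --port 6006

# Low or fluctuating GPU utilization usually indicates a data-loading bottleneck
watch -n 2 nvidia-smi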

5.3. Understanding Checkpoints
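
Checkpoints are periodic snapshots of model weights (and usually optimizer state and step counters) saved during training. For PyTorch-based frameworks you can inspect one directly; the file name and keys below are illustrative and vary by framework:

import torch

# Hypothetical checkpoint path; most frameworks save a dict of weights plus training metadata
ckpt = torch.load("runs/my_voice/checkpoint_50000.pth", map_location="cpu")
print(list(ckpt.keys()))                    # e.g. model, optimizer, step, epoch
print(ckpt.get("step"), ckpt.get("epoch"))  # works if the checkpoint is a plain dict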

5.4. Resuming Interrupted Training
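
Most frameworks accept a flag that points at the last saved checkpoint so training continues with model weights, optimizer state, and step counters restored. The flag name varies (for example --continue_path, --resume, or --restore_path); the command below is only a sketch:

# Hypothetical resume command; check your framework's documentation for the exact flag
python train.py --config_path configs/my_voice.json --continue_path runs/my_voice/checkpoint_50000.pth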

5.5. When to Stop Training


6. Fine-tuning vs. Training from Scratch

6.1. Choosing Your Approach

When starting a TTS project, one of the most important decisions is whether to fine-tune an existing model or train a new one from scratch. The comparison below helps you decide which approach fits your situation:

Dataset Size
• Fine-tuning: works well with smaller datasets (5-20 hours) and can produce good results with as little as 1-2 hours for some voices.
• Training from scratch: typically requires larger datasets (30+ hours); less than 20 hours often leads to poor quality.

Voice Similarity
• Fine-tuning: best when your target voice is similar to voices in the pre-trained model's training data.
• Training from scratch: necessary when your target voice is highly distinctive or significantly different from available pre-trained models.

Language
• Fine-tuning: works well within the same language and can work cross-lingually with careful preparation.
• Training from scratch: required for languages with no available pre-trained models; better for capturing language-specific phonetics.

Training Time
• Fine-tuning: much faster (days instead of weeks) and requires fewer epochs to converge.
• Training from scratch: significantly longer training time; may require 2-5x more epochs.

Hardware Requirements
• Fine-tuning: similar GPU requirements but for less time; can often use smaller batch sizes.
• Training from scratch: needs sustained GPU access for longer periods; may benefit more from multi-GPU setups.

Quality Potential
• Fine-tuning: can achieve excellent quality quickly but may inherit limitations of the base model.
• Training from scratch: maximum flexibility and potential quality, with no constraints from previous training.

Stability
• Fine-tuning: generally more stable training process; less prone to collapse or non-convergence.
• Training from scratch: more sensitive to hyperparameters; higher risk of training instability.

When to Choose Fine-tuning

Fine-tuning is generally recommended when:

• Your dataset is small (roughly 1-20 hours, or even less if the voice is close to the base model).
• Your target voice and language are similar to those in the pre-trained model's training data.
• You want good quality quickly, with limited GPU time.
• You prefer a more stable training process with a lower risk of non-convergence.

When to Choose Training from Scratch

Training from scratch is better when:

• No suitable pre-trained model exists for your target language.
• Your target voice or style differs substantially from anything the available pre-trained models have seen.
• You have a large dataset (30+ hours) and sustained access to training hardware.
• You need maximum flexibility, free of limitations inherited from a base model.

6.2. Fine-tuning Specifics

Fine-tuning leverages a powerful pre-trained model and adapts it to your specific dataset (speaker, language, style). It’s usually faster and requires less data than training from scratch.
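
In practice this usually means downloading a released checkpoint and passing it to the training script as the starting weights. The command below is a sketch rather than any specific framework's CLI; the flag for loading pre-trained weights is often called something like --restore_path or --init_from:

# Hypothetical fine-tuning launch: start from pre-trained weights, train on your own dataset/config
python train.py --config_path configs/my_voice_finetune.json --restore_path pretrained/base_model.pth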

The Goal

6.3. Key Configuration Differences (Recap from Setup)
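
Compared with a from-scratch run, a fine-tuning configuration typically lowers the learning rate, trains for fewer epochs, points at the pre-trained checkpoint, and keeps audio parameters (sampling rate, mel settings) identical to the base model. The keys below are illustrative only, not any particular framework's schema:

# Illustrative fine-tuning overrides; key names vary by framework
finetune_overrides = {
    "restore_path": "pretrained/base_model.pth",  # start from pre-trained weights
    "lr": 1e-4,            # typically several times lower than the from-scratch learning rate
    "epochs": 200,         # usually far fewer than a from-scratch run needs
    "batch_size": 16,      # often smaller, which also eases VRAM pressure
    "sample_rate": 22050,  # must match the base model's audio settings
}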

6.4. Potential Strategies (Framework Dependent)

6.5. Monitoring Fine-tuning


7. Comprehensive Troubleshooting Guide

Training TTS models can be challenging, with many potential issues. This section provides solutions for common problems you might encounter.

7.1. Common Error Messages and Solutions

CUDA out of memory
• Possible causes: batch size too large; model too large for GPU; memory leak.
• Solutions: reduce batch size; enable gradient checkpointing; use mixed precision training; reduce sequence length.

RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long
• Possible causes: incorrect data type in dataset; incompatible tensor types.
• Solutions: check data preprocessing; ensure all tensors have the correct dtype; add explicit type conversion.

ValueError: too many values to unpack
• Possible causes: mismatch between model outputs and loss function expectations; incorrect data format.
• Solutions: check model output structure; verify loss function implementation; debug data loader outputs.

FileNotFoundError: [Errno 2] No such file or directory
• Possible causes: incorrect paths in config; missing data files.
• Solutions: verify all file paths; check manifest file integrity; ensure data is downloaded/extracted.

KeyError: 'speaker_id'
• Possible causes: missing speaker information; incorrect dataset format.
• Solutions: check dataset format; verify speaker mapping file; add speaker information to the manifest.

Loss is NaN
• Possible causes: learning rate too high; unstable initialization; gradient explosion.
• Solutions: reduce learning rate; add gradient clipping; check for division by zero; normalize input data.

ModuleNotFoundError: No module named 'X'
• Possible causes: missing dependency; environment issue.
• Solutions: install the missing package; check the virtual environment; verify package versions.

RuntimeError: expected scalar type Float but found Double
• Possible causes: inconsistent tensor types.
• Solutions: add .float() to tensors; check data preprocessing; standardize dtype across the model.
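
Two of the most common fixes above, gradient clipping (for NaN losses) and mixed precision (for out-of-memory errors), are straightforward to add to a PyTorch training loop. A minimal sketch, assuming model, optimizer, dataloader, and compute_loss already exist in your script:

import torch

scaler = torch.cuda.amp.GradScaler()               # mixed precision reduces VRAM use

for batch in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():                # forward pass runs in float16 where safe
        loss = compute_loss(model, batch)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                     # unscale before clipping so the threshold is meaningful
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # guards against exploding gradients
    scaler.step(optimizer)
    scaler.update()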

7.2. Training Quality Issues

Robotic/Buzzy Audio
• Possible causes: vocoder issues; insufficient training; poor audio preprocessing.
• Solutions: train the vocoder longer; check audio normalization; verify sampling rate consistency.

Word Skipping/Repetition
• Possible causes: attention problems; unstable training; insufficient data.
• Solutions: use guided attention loss; add more data variety; reduce learning rate; check for long silences in the data.

Incorrect Pronunciation
• Possible causes: text normalization issues; phoneme errors; language mismatch.
• Solutions: improve text preprocessing; use phoneme-based input; add a pronunciation dictionary.

Speaker Identity Loss
• Possible causes: overfitting to a dominant speaker; weak speaker embeddings; insufficient speaker data.
• Solutions: balance speaker data; increase speaker embedding dimension; use a speaker adversarial loss.

Slow Convergence
• Possible causes: learning rate issues; poor initialization; complex dataset.
• Solutions: try different LR schedules; use transfer learning; simplify the dataset initially.

Unstable Training
• Possible causes: batch variance; outliers in the dataset; optimizer issues.
• Solutions: use gradient accumulation; clean outlier samples; try different optimizers.
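
For the speaker identity issue, a common way to balance speaker data in PyTorch is to sample each utterance with a weight inversely proportional to its speaker's frequency, so minority speakers are seen as often as dominant ones. A sketch, assuming your dataset exposes a per-sample speaker_id (the metadata attribute here is hypothetical):

from collections import Counter
from torch.utils.data import DataLoader, WeightedRandomSampler

speaker_ids = [item["speaker_id"] for item in dataset.metadata]  # one id per sample; attribute name is illustrative
counts = Counter(speaker_ids)
weights = [1.0 / counts[sid] for sid in speaker_ids]             # rare speakers get larger sampling weights

sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
loader = DataLoader(dataset, batch_size=16, sampler=sampler)     # do not combine shuffle=True with a sampler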

7.3. Framework-Specific Issues

Coqui TTS

# Error: "RuntimeError: Error in applying gradient to param_name"
# Solution: Check for NaN values in your dataset or reduce learning rate
python -c "import torch; torch.autograd.set_detect_anomaly(True)"  # Run before training to debug

# Error: "ValueError: Tacotron training requires `r` > 1"
# Solution: Set reduction factor correctly in config
# Example fix in config.json:
"r": 2  # Try values between 2-5

ESPnet

# Error: "TypeError: forward() missing 1 required positional argument: 'feats'"
# Solution: Check data formatting and ensure feats are provided
# Debug data loading:
python -c "from espnet2.train.dataset import ESPnetDataset; dataset = ESPnetDataset(...); print(dataset[0])"

VITS/StyleTTS

# Error: "RuntimeError: expected scalar type Half but found Float"
# Solution: Ensure consistent precision throughout model
# Add to your training script:
model = model.half()  # If using mixed precision
# OR
model = model.float()  # If not using mixed precision

7.4. Hardware and Environment Issues

  1. GPU Memory Fragmentation
    • Symptom: OOM errors after training for several hours despite sufficient VRAM
    • Solution: Periodically restart training from checkpoint, use smaller batches
  2. CPU Bottlenecks
    • Symptom: GPU utilization fluctuates or stays low
    • Solution: Increase num_workers in the DataLoader, use faster storage, pre-cache datasets (see the DataLoader sketch after this list)
  3. Disk I/O Bottlenecks
    • Symptom: Training stalls periodically during data loading
    • Solution: Use SSD storage, increase prefetch factor, cache dataset in RAM
  4. Environment Conflicts
    • Symptom: Mysterious crashes or import errors
    • Solution: Use isolated environments (conda/venv), check CUDA/PyTorch compatibility
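
The CPU and disk bottlenecks above usually come down to DataLoader settings. A sketch of the knobs that matter, with values to tune for your own machine (the dataset object is assumed to come from your framework):

from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,
    batch_size=16,
    num_workers=8,            # raise until GPU utilization stays high; bounded by CPU cores
    pin_memory=True,          # faster host-to-GPU transfers
    prefetch_factor=4,        # batches pre-loaded per worker (requires num_workers > 0)
    persistent_workers=True,  # avoid respawning workers every epoch
)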

7.5. Debugging Strategies

  1. Isolate the Problem
    # Test data loading separately
    python -c "from your_framework import DataLoader; loader = DataLoader(...); next(iter(loader))"
       
    # Test forward pass with dummy data
    python -c "import torch; from your_model import Model; model = Model(); x = torch.randn(1, 100); model(x)"
    
  2. Simplify to Identify Issues
    • Train on a tiny subset (10-20 samples)
    • Disable data augmentation temporarily
    • Try with a single speaker first
  3. Visualize Intermediate Outputs
    • Plot attention alignments
    • Visualize mel spectrograms at different stages (see the plotting sketch after this list)
    • Monitor gradient norms
  4. Enable Verbose Logging
    # Add to your training script
    import logging
    logging.basicConfig(level=logging.DEBUG)
    
  5. Use TensorBoard Profiling
    # Add to your training code
    from torch.profiler import profile, record_function, ProfilerActivity
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        with record_function("model_inference"):
            output = model(batch)  # your forward pass goes here
    print(prof.key_averages().table(sort_by="cuda_time_total"))
    # To inspect results in TensorBoard instead, pass
    # on_trace_ready=torch.profiler.tensorboard_trace_handler("./log") to profile()
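
For visualizing intermediate outputs (item 3 above), here is a matplotlib sketch that saves a mel spectrogram and an attention alignment as images; mel and attention are assumed to be 2-D numpy arrays taken from your model's outputs (convert tensors with .cpu().numpy() first):

import matplotlib.pyplot as plt

def save_plot(array, path, title):
    # array: 2-D matrix, e.g. mel bins x frames or decoder steps x encoder steps
    plt.figure(figsize=(10, 4))
    plt.imshow(array, aspect="auto", origin="lower", interpolation="none")
    plt.colorbar()
    plt.title(title)
    plt.tight_layout()
    plt.savefig(path)
    plt.close()

save_plot(mel, "mel_step_50000.png", "Predicted mel spectrogram")     # healthy output shows smooth harmonic bands
save_plot(attention, "align_step_50000.png", "Attention alignment")   # healthy alignment shows a clear diagonal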
    

With training launched and monitored, the next step is to select a good checkpoint and use the model to generate speech from new text.

Next Step: Inference | Back to Top