Universal TTS Guide

A comprehensive guide to TTS dataset prep and training


Guide 1: Data Preparation for TTS Training


This guide covers the critical first phase of any TTS project: preparing high-quality, correctly formatted audio and text data. The quality of your dataset directly impacts the quality of your final TTS model.


1. Dataset Preparation Steps

Follow these steps systematically to transform raw audio into a training-ready dataset.

1.1. Audio Acquisition & Initial Processing

1.3. Audio Chunking (Splitting into Segments)

1.4. Volume Normalization

1.5. Transcription: Creating Text Pairs

1.6. Data Structuring & Manifest File Creation


2. Data Quality Checklist

Before moving to training setup, rigorously review your prepared dataset using this checklist. Fixing issues now saves significant time later.

| Aspect | Check | Why Important? | Action if Failed |
| --- | --- | --- | --- |
| Audio Completeness | Do all listed .wav files in manifests actually exist? | Training will crash if files are missing. | Re-run manifest generation; check file paths; ensure no files were accidentally deleted. |
| Transcript Match | Does each .wav have a corresponding, accurate .txt/transcript? | Mismatched pairs teach the model incorrect associations. | Verify filenames; review ASR output; manually correct transcripts. |
| Audio Length | Are most segments within the desired range (e.g., 2-15s), with few outliers? | Very short or very long segments can destabilize training. | Re-run chunking with adjusted parameters; manually filter outliers from manifests. |
| Audio Quality | Listen to random samples: low background noise? No music/reverb/echo? | Garbage in, garbage out: the model learns the noise. | Improve source audio; apply noise reduction (carefully!); filter out bad segments. |
| Speaker Consistency | For single-speaker datasets: is it always the target voice, with no other speakers? | Prevents voice dilution or instability. | Manually review/filter segments; check chunking boundaries. |
| Format & Specs | All WAV? Identical sampling rate? Mono? 16-bit PCM? | Inconsistencies cause errors or poor performance. | Re-run conversion/resampling (Section 1.1). Batch-verify specs with command-line tools like ffprobe or soxi (part of SoX), e.g. `soxi -r *.wav` to check rates. |
| Volume Levels | Listen to random samples: are volumes relatively consistent? | Drastic volume shifts can hinder learning. | Re-run normalization (Section 1.4); check normalization parameters. |
| Transcription Cleanliness | No timestamps or speaker labels? Fillers handled consistently? Standard punctuation? Numbers/symbols expanded? | Ensures text maps cleanly to speech sounds/prosody. | Re-run text cleaning scripts; perform manual review and correction. |
| Manifest Format | Correct `path\|text\|speaker_id` structure? Paths valid? No extra lines? | Parser errors will prevent data loading. | Check the delimiter (`\|`); validate paths relative to the training script location; check encoding (UTF-8 preferred). |
| Train/Val Split | Are validation files truly unseen during training, with no overlap? | Overlapping data gives misleading validation scores. | Ensure a random shuffle before splitting; check the splitting logic. |
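The train/val split check is easy to get wrong. Here is a minimal split sketch, assuming one manifest entry per line; the 5% validation fraction and fixed seed are illustrative choices:

```python
import random

def split_manifest(lines, val_fraction=0.05, seed=42):
    """Shuffle manifest entries, then slice, so no entry lands in both sets."""
    lines = list(lines)
    random.Random(seed).shuffle(lines)  # shuffle BEFORE splitting
    n_val = max(1, int(len(lines) * val_fraction))
    return lines[n_val:], lines[:n_val]  # (train, val)

# sanity check: the two sets must never overlap
train, val = split_manifest([f"clip_{i:04d}.wav|text|spk0" for i in range(200)])
assert not set(train) & set(val)
```

A fixed seed makes the split reproducible, so re-running the script never silently moves entries between the two sets.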

Tip: Use tools like soxi (from SoX) or ffprobe to batch-check audio properties (sampling rate, channels, duration). Write small scripts to verify file existence and basic manifest formatting.
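Transcript cleaning can be scripted the same way. The regex patterns in this sketch are illustrative assumptions; adjust them to whatever artifacts your ASR tool actually produces:

```python
import re

def clean_transcript(text):
    """Strip common ASR artifacts so text maps cleanly to speech sounds."""
    text = re.sub(r"\[\d{2}:\d{2}(?::\d{2})?\]", "", text)  # timestamps like [00:12]
    text = re.sub(r"^[A-Z ]+:\s*", "", text)                # leading SPEAKER: labels
    text = re.sub(r"\b(?:um+|uh+)\b[,.]?\s*", "", text, flags=re.IGNORECASE)  # fillers
    return re.sub(r"\s+", " ", text).strip()                # collapse whitespace

print(clean_transcript("[00:12] HOST: Um, welcome back  to the show."))
```

Run the cleaner over every transcript, then spot-check a random sample by hand; regexes alone will not catch every artifact.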

2.1. Practical Verification Scripts

Here are some practical scripts to help verify your dataset quality:

Check Audio Properties (Sampling Rate, Channels, Duration)

#!/bin/bash
# verify_audio.sh - Check audio properties across all WAV files
# Usage: ./verify_audio.sh /path/to/audio/directory

AUDIO_DIR="$1"
if [ -z "$AUDIO_DIR" ] || [ ! -d "$AUDIO_DIR" ]; then
    echo "Usage: $0 /path/to/audio/directory"
    exit 1
fi
echo "Checking audio files in $AUDIO_DIR..."

# Check if SoX is installed
if ! command -v soxi &> /dev/null; then
    echo "SoX not found. Please install it first (e.g., 'apt-get install sox' or 'brew install sox')."
    exit 1
fi

# Initialize counters and arrays
total_files=0
non_mono=0
wrong_rate=0
too_short=0
too_long=0
target_rate=22050  # Change this to your target sampling rate
min_duration=1.0   # Minimum duration in seconds
max_duration=15.0  # Maximum duration in seconds

# Process all WAV files recursively. Globstar is used instead of piping find
# into the loop: `find | while` runs the loop in a subshell, so the counters
# would all read zero in the summary below.
shopt -s globstar nullglob
for file in "$AUDIO_DIR"/**/*.wav; do
    total_files=$((total_files + 1))
    
    # Get audio properties
    channels=$(soxi -c "$file")
    rate=$(soxi -r "$file")
    duration=$(soxi -D "$file")  # -D prints the duration in plain seconds
    
    # Check properties
    if [ "$channels" -ne 1 ]; then
        echo "WARNING: Non-mono file: $file (channels: $channels)"
        non_mono=$((non_mono + 1))
    fi
    
    if [ "$rate" -ne "$target_rate" ]; then
        echo "WARNING: Wrong sampling rate: $file (rate: $rate Hz, expected: $target_rate Hz)"
        wrong_rate=$((wrong_rate + 1))
    fi
    
    if (( $(echo "$duration < $min_duration" | bc -l) )); then
        echo "WARNING: File too short: $file (duration: ${duration}s, minimum: ${min_duration}s)"
        too_short=$((too_short + 1))
    fi
    
    if (( $(echo "$duration > $max_duration" | bc -l) )); then
        echo "WARNING: File too long: $file (duration: ${duration}s, maximum: ${max_duration}s)"
        too_long=$((too_long + 1))
    fi
    
    # Print progress every 100 files
    if [ $((total_files % 100)) -eq 0 ]; then
        echo "Processed $total_files files..."
    fi
done

# Print summary
echo "===== SUMMARY ====="
echo "Total files checked: $total_files"
echo "Non-mono files: $non_mono"
echo "Files with wrong sampling rate: $wrong_rate"
echo "Files too short (<${min_duration}s): $too_short"
echo "Files too long (>${max_duration}s): $too_long"

if [ $((non_mono + wrong_rate + too_short + too_long)) -eq 0 ]; then
    echo "All files passed basic checks!"
else
    echo "Some issues were found. Please review the warnings above."
fi

Verify Manifest File Integrity

#!/usr/bin/env python3
# verify_manifest.py - Check that all files in manifest exist and have matching transcripts
# Usage: python verify_manifest.py path/to/manifest.txt

import os
import sys

def verify_manifest(manifest_path):
    """Verify that all audio files and transcripts in the manifest exist and are valid."""
    if not os.path.exists(manifest_path):
        print(f"Error: Manifest file '{manifest_path}' not found.")
        return False
    
    print(f"Verifying manifest: {manifest_path}")
    base_dir = os.path.dirname(os.path.abspath(manifest_path))
    
    # Statistics
    total_entries = 0
    missing_audio = 0
    empty_transcripts = 0
    bad_format = 0
    
    with open(manifest_path, 'r', encoding='utf-8') as f:
        for line_num, line in enumerate(f, 1):
            line = line.strip()
            if not line:
                continue
            
            total_entries += 1
            
            # Parse the line (assuming pipe-separated format: audio_path|transcript|speaker_id)
            parts = line.split('|')
            if len(parts) < 2:
                print(f"Line {line_num}: Invalid format. Expected at least 'audio_path|transcript'")
                bad_format += 1
                continue
            
            audio_path = parts[0]
            transcript = parts[1]
            
            # Resolve relative audio paths against the manifest's directory
            if not os.path.isabs(audio_path):
                audio_path = os.path.join(base_dir, audio_path)
            
            # Check if the audio file exists
            if not os.path.exists(audio_path):
                print(f"Line {line_num}: Audio file not found: {audio_path}")
                missing_audio += 1
            
            # Check if the transcript is empty
            if not transcript or transcript.isspace():
                print(f"Line {line_num}: Empty transcript for {audio_path}")
                empty_transcripts += 1
    
    # Print summary
    print("\n===== SUMMARY =====")
    print(f"Total entries: {total_entries}")
    print(f"Malformed lines: {bad_format}")
    print(f"Missing audio files: {missing_audio}")
    print(f"Empty transcripts: {empty_transcripts}")
    
    if bad_format == 0 and missing_audio == 0 and empty_transcripts == 0:
        print("All manifest entries are valid!")
        return True
    else:
        print("Issues found in manifest. Please fix them before proceeding.")
        return False

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python verify_manifest.py path/to/manifest.txt")
        sys.exit(1)
    
    success = verify_manifest(sys.argv[1])
    sys.exit(0 if success else 1)

Visualize Audio Spectrograms for Quality Assessment

This script helps you visually inspect the quality of your audio files by generating spectrograms:

#!/usr/bin/env python3
# generate_spectrograms.py - Create spectrograms for audio quality assessment
# Usage: python generate_spectrograms.py /path/to/audio/directory /path/to/output/directory [num_samples]

import os
import sys
import random
import numpy as np
import matplotlib.pyplot as plt
import librosa
import librosa.display
from pathlib import Path

def generate_spectrograms(audio_dir, output_dir, num_samples=10):
    """Generate spectrograms for a random sample of audio files."""
    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)
    
    # Get all WAV files
    wav_files = list(Path(audio_dir).glob('**/*.wav'))
    if not wav_files:
        print(f"No WAV files found in {audio_dir}")
        return False
    
    # Sample files if there are more than requested
    if len(wav_files) > num_samples:
        wav_files = random.sample(wav_files, num_samples)
    
    print(f"Generating spectrograms for {len(wav_files)} files...")
    
    for i, wav_path in enumerate(wav_files):
        try:
            # Load audio file
            y, sr = librosa.load(wav_path, sr=None)
            
            # Create figure with two subplots
            plt.figure(figsize=(12, 8))
            
            # Plot waveform
            plt.subplot(2, 1, 1)
            librosa.display.waveshow(y, sr=sr)
            plt.title(f'Waveform: {wav_path.name}')
            plt.xlabel('Time (s)')
            plt.ylabel('Amplitude')
            
            # Plot spectrogram
            plt.subplot(2, 1, 2)
            D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
            librosa.display.specshow(D, sr=sr, x_axis='time', y_axis='log')
            plt.colorbar(format='%+2.0f dB')
            plt.title('Log-frequency power spectrogram')
            
            # Save figure
            output_path = os.path.join(output_dir, f'spectrogram_{i+1}_{wav_path.stem}.png')
            plt.tight_layout()
            plt.savefig(output_path)
            plt.close()
            
            print(f"Generated: {output_path}")
            
        except Exception as e:
            print(f"Error processing {wav_path}: {e}")
    
    print(f"Spectrograms saved to {output_dir}")
    return True

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: python generate_spectrograms.py /path/to/audio/directory /path/to/output/directory [num_samples]")
        sys.exit(1)
    
    audio_dir = sys.argv[1]
    output_dir = sys.argv[2]
    num_samples = int(sys.argv[3]) if len(sys.argv) > 3 else 10
    
    success = generate_spectrograms(audio_dir, output_dir, num_samples)
    sys.exit(0 if success else 1)

These scripts provide practical tools to verify your dataset’s quality before training, helping you identify and fix issues early in the process.


Once your dataset passes this quality check, you are ready to proceed to setting up the training environment.

Next Step: Training Setup