Guide 1: Data Preparation for TTS Training
Navigation: Main README | Next Step: Training Setup |
This guide covers the critical first phase of any TTS project: preparing high-quality, correctly formatted audio and text data. The quality of your dataset directly impacts the quality of your final TTS model.
1. Dataset Preparation Steps
Follow these steps systematically to transform raw audio into a training-ready dataset.
1.1. Audio Acquisition & Initial Processing
- Gather Audio: Collect your raw audio files (common formats include WAV, MP3, FLAC, OGG, M4A). Ensure you have the rights to use this audio.
- Convert to WAV: Most TTS frameworks expect WAV format. Use tools like `ffmpeg` or audio libraries (`pydub`, `soundfile`) to convert your audio. Aim for a standard WAV encoding like PCM 16-bit.

```bash
# Example using ffmpeg to convert MP3 to WAV
ffmpeg -i input_audio.mp3 output_audio.wav
```
- Standardize Channels (Mono): TTS models typically train on single-channel (mono) audio. Convert stereo tracks to mono.
```bash
# Example using ffmpeg to convert stereo WAV to mono WAV
ffmpeg -i stereo_input.wav -ac 1 mono_output.wav
```
  - `-ac 1`: Sets the number of audio channels to 1.
- Resample Audio: Ensure all audio files have the exact same sampling rate. Choose your target rate based on your project goals and framework compatibility (e.g., 16000 Hz, 22050 Hz, 48000 Hz). 22050 Hz is common for many models.
```bash
# Example using ffmpeg to resample to 22050 Hz
ffmpeg -i input.wav -ar 22050 resampled_output.wav
```
  - `-ar 22050`: Sets the audio sampling rate (samples per second).

A batch-conversion sketch combining these steps (format, channels, sampling rate) follows this list.
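If you have many source files, the conversion, mono downmix, and resampling can be combined into a single ffmpeg pass per file. This is a minimal batch sketch, assuming `ffmpeg` is on your PATH; the `raw_audio/` and `standardized_audio/` directory names are placeholders to adapt:

```python
# Sketch: batch-convert mixed-format source audio to mono, 22050 Hz, 16-bit PCM WAV.
# Assumes ffmpeg is installed and on PATH; directory names are placeholders.
import subprocess
from pathlib import Path

input_dir = Path("raw_audio")
output_dir = Path("standardized_audio")
output_dir.mkdir(exist_ok=True)

for src in sorted(input_dir.iterdir()):
    if src.suffix.lower() not in {".wav", ".mp3", ".flac", ".ogg", ".m4a"}:
        continue
    dst = output_dir / (src.stem + ".wav")
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src),
         "-ac", "1",              # mono
         "-ar", "22050",          # target sampling rate
         "-acodec", "pcm_s16le",  # 16-bit PCM encoding
         str(dst)],
        check=True,
    )
    print(f"Converted {src.name} -> {dst.name}")
```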
1.2. Advanced Audio Cleaning (Noise/Music Removal) - Optional but Recommended
- Goal: To remove unwanted background sounds like noise (hum, hiss, fans), music, reverb, or other interfering voices from your source audio, isolating the target speaker’s voice as much as possible. This step is crucial if your source audio is not studio quality.
- Why? TTS models learn from the audio they are given. If the audio contains background noise or music, the resulting TTS voice will likely inherit these characteristics, sounding noisy or "muddy". Cleaner audio leads to a cleaner TTS voice.
- Tools & Techniques:
- AI Source Separation Tools (Recommended for Music/Voice): These tools use AI models to separate audio into different stems (vocals, music, drums, bass, other).
- Ultimate Vocal Remover (UVR): A popular, free, open-source GUI application that provides access to various state-of-the-art AI separation models. It’s excellent for removing background music or isolating dialogue.
- Models: UVR lets you choose between different AI models. `MDX-Inst-HQ3` is one such model, often good at separating vocals from instruments (hence "Inst"). Other MDX models, Demucs models (like `htdemucs`), and potentially models like Mel-Roformer (if integrated or available standalone) are designed for similar tasks, each with slightly different strengths and weaknesses. Experimentation is key; choose models focused on vocal isolation.
- Other Tools: Online services (e.g., Lalal.ai) or other standalone software might use similar underlying models (often Demucs or Spleeter variants).
- Traditional Noise Reduction Tools: Often found in Digital Audio Workstations (DAWs) or audio editors.
- Audacity: Contains built-in noise reduction effects (requires sampling a noise profile). Can be effective for constant background noise (like hiss or hum).
- Commercial Plugins (e.g., Izotope RX, Waves Clarity): Offer more sophisticated AI-powered noise, reverb, and voice isolation tools, but come at a cost.
- Spectral Editing: Manually removing unwanted sounds in a spectral editor (like Adobe Audition, Izotope RX, Acon Digital Acoustica). Powerful but very time-consuming.
- Workflow Considerations:
- When to Apply: It’s generally recommended to apply cleaning to your longer audio files before chunking (Step 1.3 below). This allows the AI models to work with more context and can be more efficient than processing thousands of small chunks. However, if cleaning introduces too many artifacts on long files, you might try cleaning individual problematic chunks later.
- Process:
- Load your standardized WAV file (from Step 1.1) into the chosen tool (e.g., UVR).
- Select an appropriate vocal isolation model (e.g., an MDX or Demucs vocal model).
- Process the audio to generate a “vocals only” track.
- Listen Carefully: Critically evaluate the separated vocal track. Check for:
- Artifacts: AI separation can sometimes introduce “watery” sounds, glitches, or parts of the voice being mistakenly removed.
- Remaining Noise/Music: How effectively was the unwanted sound removed?
- Iterate: You might need to try different models, adjust settings within the tool, or even apply a secondary noise reduction pass (e.g., using Audacity’s noise reduction on the AI-separated vocals) for best results.
- Save Output: Save the cleaned vocal track as a new WAV file (e.g., `original_file_cleaned.wav`). Use these cleaned files as the input for the next step (Chunking). For a scriptable alternative to the GUI workflow, see the Demucs sketch after the caveats below.
- Caveats:
- Artifacts are Possible: Aggressive cleaning can degrade the naturalness of the target voice. Aim for a balance between removing noise and preserving voice quality.
- Computational Cost: AI separation models can be computationally intensive and may take significant time, especially on long audio files and without a powerful GPU.
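For a scriptable alternative to UVR's GUI, the Demucs package can be driven from its command line. The sketch below is a starting point built on assumptions: it presumes `demucs` is installed (`pip install demucs`), that your version supports the `--two-stems` option, and that stems land in the default `separated/<model>/<track>/` layout; verify against your installed release.

```python
# Sketch: call the Demucs CLI from Python to produce a vocals-only stem.
# Assumes the 'demucs' package is installed and its flags/output layout
# match recent releases; adjust for your version.
import subprocess
from pathlib import Path

input_wav = "standardized_audio.wav"  # a file from Step 1.1
output_root = "separated"             # Demucs writes stems under this directory

subprocess.run(
    ["demucs", "--two-stems", "vocals", "-o", output_root, input_wav],
    check=True,
)

# Recent Demucs releases typically write separated/<model>/<track>/vocals.wav
vocals_stem = next(Path(output_root).rglob("vocals.wav"), None)
print(f"Vocals stem: {vocals_stem}")
```

As with UVR, listen critically to the separated stem before feeding it into chunking.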
1.3. Audio Chunking (Splitting into Segments)
- Goal: Break long audio files (like chapters of an audiobook or podcast episodes) into shorter, manageable segments. Ideal segment length is typically between 2 to 15 seconds.
- Why Chunk?
- Aligns audio duration with typical sentence lengths.
- Makes transcription feasible (transcribing hours-long files is difficult).
- Helps manage memory during training.
- Allows filtering out unsuitable segments (e.g., pure silence, noise, music).
- Method: Use tools that detect silence to split the audio. `pydub` is a popular Python library for this.

```python
# Example using pydub for silence-based splitting
from pydub import AudioSegment
from pydub.silence import split_on_silence
import os

input_file = "resampled_mono_audio.wav"  # Use the output from step 1.1
output_dir = "audio_chunks"              # Create this directory
os.makedirs(output_dir, exist_ok=True)

print(f"Loading audio file: {input_file}")
sound = AudioSegment.from_wav(input_file)
print("Audio loaded. Splitting based on silence...")

chunks = split_on_silence(
    sound,
    min_silence_len=500,  # Minimum duration of silence (ms) to trigger a split. Adjust as needed.
    silence_thresh=-40,   # Silence threshold in dBFS. Lower values (e.g., -50) detect quieter silences. Adjust based on your audio's noise floor.
    keep_silence=200      # Optional: amount of silence (ms) to leave at the beginning/end of each chunk. Helps avoid abrupt cuts.
)

print(f"Found {len(chunks)} potential chunks before duration filtering.")

# --- Filtering and Exporting ---
min_duration_sec = 2.0   # Minimum chunk length in seconds
max_duration_sec = 15.0  # Maximum chunk length in seconds
target_sr = 22050        # Ensure chunks retain the correct sample rate (pydub usually handles this)

exported_count = 0
for i, chunk in enumerate(chunks):
    duration_sec = len(chunk) / 1000.0
    if min_duration_sec <= duration_sec <= max_duration_sec:
        # Ensure the chunk uses the target sample rate if necessary (pydub tries to preserve it)
        # chunk = chunk.set_frame_rate(target_sr)  # Usually not needed if source was correctly sampled
        chunk_filename = f"segment_{exported_count:05d}.wav"  # Zero-padding keeps filenames sortable
        chunk_path = os.path.join(output_dir, chunk_filename)
        print(f"Exporting chunk {i} ({duration_sec:.2f}s) to {chunk_path}")
        chunk.export(chunk_path, format="wav")
        exported_count += 1
    else:
        print(f"Skipping chunk {i} due to duration: {duration_sec:.2f}s")

print(f"\nExported {exported_count} chunks meeting duration criteria ({min_duration_sec}-{max_duration_sec}s) to '{output_dir}'.")
```
- Review: Listen to a sample of the generated chunks. Are the splits logical? Is speech cut off? Adjust `min_silence_len` and `silence_thresh` and re-run if necessary. Manually splitting or refining splits in an audio editor (like Audacity) might be needed for tricky audio.
1.4. Volume Normalization
- Goal: Ensure all audio chunks have a consistent volume level. This prevents quiet or loud segments from disproportionately affecting training.
- Methods:
- Peak Normalization: Adjusts the audio so the loudest point reaches a specific level (e.g., -3.0 dBFS). Simple, but doesn’t guarantee consistent perceived loudness.
- Loudness Normalization (LUFS): Adjusts the audio to meet a target perceived loudness level (e.g., -23 LUFS is common for broadcast). Generally preferred as it better reflects human hearing. Requires libraries like `pyloudnorm`.
- Apply Consistently: Apply the chosen normalization method to all chunks created in the previous step. Save the normalized files to a new directory (e.g., `normalized_chunks`) to keep originals intact.

```python
# Example using pydub for PEAK normalization
from pydub import AudioSegment
import os
import glob

input_chunk_dir = "audio_chunks"
output_norm_dir = "normalized_chunks"
os.makedirs(output_norm_dir, exist_ok=True)

target_dBFS = -3.0  # Target peak amplitude

def match_target_peak(sound, target_dBFS):
    # Use max_dBFS (true peak) rather than dBFS (average level),
    # so a -3.0 dBFS target does not push the audio into clipping.
    change_in_dBFS = target_dBFS - sound.max_dBFS
    return sound.apply_gain(change_in_dBFS)

print(f"Normalizing chunks from '{input_chunk_dir}' to '{output_norm_dir}' with target peak {target_dBFS} dBFS.")

wav_files = glob.glob(os.path.join(input_chunk_dir, "*.wav"))

for i, wav_file in enumerate(wav_files):
    filename = os.path.basename(wav_file)
    output_path = os.path.join(output_norm_dir, filename)
    try:
        sound = AudioSegment.from_wav(wav_file)
        # Only apply gain if the sound is not silent (dBFS is not -inf)
        if sound.dBFS > -float('inf'):
            normalized_sound = match_target_peak(sound, target_dBFS)
            normalized_sound.export(output_path, format="wav")
        else:
            print(f"Skipping silent file: {filename}")
            # Optionally copy silent files or handle them as needed
            # shutil.copy(wav_file, output_path)
        if (i + 1) % 50 == 0:  # Print progress
            print(f"Processed {i + 1}/{len(wav_files)} files...")
    except Exception as e:
        print(f"Error processing {filename}: {e}")

print(f"\nNormalization complete. Normalized files saved in '{output_norm_dir}'.")
```
- Note: For LUFS normalization, you'd use a library like `pyloudnorm`, iterating through files similarly (see the sketch below).
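For reference, here is a minimal LUFS normalization sketch. It assumes the `pyloudnorm` and `soundfile` packages are installed and that chunks are long enough for the BS.1770 meter (very short clips need special handling); the -23 LUFS target is an assumption you can change.

```python
# Sketch: LUFS loudness normalization with pyloudnorm.
# Assumes 'pyloudnorm' and 'soundfile' are installed; target loudness is illustrative.
import glob
import os

import numpy as np
import pyloudnorm as pyln
import soundfile as sf

input_chunk_dir = "audio_chunks"
output_norm_dir = "normalized_chunks"
os.makedirs(output_norm_dir, exist_ok=True)

target_lufs = -23.0  # Common broadcast target; adjust to your framework's guidance

for wav_file in sorted(glob.glob(os.path.join(input_chunk_dir, "*.wav"))):
    data, rate = sf.read(wav_file)             # float samples in [-1, 1]
    meter = pyln.Meter(rate)                   # ITU-R BS.1770 loudness meter
    loudness = meter.integrated_loudness(data)
    if np.isfinite(loudness):                  # silent chunks report -inf loudness
        normalized = pyln.normalize.loudness(data, loudness, target_lufs)
        peak = np.max(np.abs(normalized))
        if peak > 1.0:                         # guard against clipping after the gain change
            normalized = normalized / peak
        sf.write(os.path.join(output_norm_dir, os.path.basename(wav_file)), normalized, rate)
    else:
        print(f"Skipping silent file: {os.path.basename(wav_file)}")
```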
1.5. Transcription: Creating Text Pairs
- Goal: Obtain an accurate text transcript for every single normalized audio chunk. The text should represent exactly what is spoken in the audio.
- Methods:
- Automatic Speech Recognition (ASR): Best for large datasets. Use high-quality ASR models.
- OpenAI Whisper: Excellent multilingual, open-source option. Runs locally (GPU recommended) or via API. Note: while powerful for word accuracy, Whisper's punctuation and capitalization may require careful review and correction during the cleaning step. Various community fine-tuned Whisper models (often found on Hugging Face) may offer improvements. A minimal usage sketch appears at the end of this section.
- Google Gemini Models (e.g., via API or AI Studio): Models like Gemini Pro or Flash can perform audio transcription. Often requires audio to be in specific formats and may perform best on shorter segments (aligning well with the pre-chunking step). Check current API offerings and potential free tiers.
- Cloud Services: Google Cloud Speech-to-Text, AWS Transcribe, Azure Speech Service offer robust APIs, often with pay-as-you-go pricing and potentially free tiers initially.
- Other Models: Explore Hugging Face Models for other open-source or fine-tuned ASR models specific to your language.
- Manual Transcription: Most accurate but very time-consuming. Suitable for small, high-value datasets or for correcting ASR outputs.
- Existing Transcripts: If your source audio comes with aligned transcripts (e.g., some audiobooks, broadcast archives), you may need scripts to parse and align them with your chunks.
- Output Format: Create one `.txt` file for each corresponding `.wav` file in your `normalized_chunks` directory. The filenames must match exactly (e.g., `normalized_chunks/segment_00001.wav` needs `transcripts/segment_00001.txt`).
- Text Cleaning and Normalization: This is crucial! (A small, rule-based cleaning sketch follows at the end of this section.)
- Remove Non-Speech: Delete timestamps (like `[00:01:05]`), speaker labels ("SPEAKER A:", "John Doe:"), sound event tags (`[laughter]`, `[music]`), and transcription comments.
- Handle Filler Words: Decide whether to keep or remove common fillers ("uh," "um," "ah"). Keeping them might make the TTS sound more natural but can also introduce unwanted hesitations. Removing them leads to cleaner, more direct speech. Consistency is key.
- Punctuation: Ensure consistent and appropriate punctuation. Commas, periods, question marks help the model learn prosody. Avoid excessive or non-standard punctuation.
- Numbers, Acronyms, Symbols: Expand them into words (e.g., “101” -> “one hundred one”, “USA” -> “U S A” or “United States of America”, “%” -> “percent”). How you expand depends on how you want the TTS to pronounce them. Create a normalization dictionary/ruleset if needed.
- Case: Usually convert text to a consistent case (e.g., lowercase) unless your TTS framework/tokenizer handles casing appropriately. Check framework docs.
- Special Characters: Remove or replace characters that might confuse the tokenizer (e.g., emojis, control characters).
```text
# Example structure:
my_tts_dataset/
├── normalized_chunks/
│   ├── segment_00001.wav
│   ├── segment_00002.wav
│   └── ...
└── transcripts/
    ├── segment_00001.txt   # Contains "Hello world."
    ├── segment_00002.txt   # Contains "This is a test sentence."
    └── ...
```
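If you go the Whisper route mentioned above, the loop below is a minimal sketch using the open-source `openai-whisper` package (`pip install openai-whisper`); the model size, language setting, and directory names are assumptions to adapt. Remember that the raw output still needs the cleaning pass described above.

```python
# Sketch: transcribe every normalized chunk with open-source Whisper.
# Assumes the 'openai-whisper' package is installed; model size and paths are illustrative.
import glob
import os

import whisper

model = whisper.load_model("medium")  # "small"/"large" trade speed for accuracy

audio_dir = "my_tts_dataset/normalized_chunks"
transcript_dir = "my_tts_dataset/transcripts"
os.makedirs(transcript_dir, exist_ok=True)

for wav_path in sorted(glob.glob(os.path.join(audio_dir, "*.wav"))):
    result = model.transcribe(wav_path, language="en")  # set your language, or omit for auto-detect
    text = result["text"].strip()
    base = os.path.splitext(os.path.basename(wav_path))[0]
    with open(os.path.join(transcript_dir, base + ".txt"), "w", encoding="utf-8") as f:
        f.write(text)
    print(f"{base}: {text[:60]}")
```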
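The text cleaning rules above can be partially automated. The following is only an illustrative sketch: the regex patterns and the tiny expansion table are assumptions to adapt to your own transcripts, and full number expansion would typically use a dedicated library (e.g., num2words) or a hand-built ruleset.

```python
# Sketch: rule-based transcript cleaning. Patterns and the expansion table are
# illustrative assumptions; adapt them to your data and language.
import re

EXPANSIONS = {
    "%": " percent",
    "&": " and ",
    "USA": "U S A",
}

def clean_transcript(text: str) -> str:
    text = re.sub(r"\[\d{1,2}:\d{2}(:\d{2})?\]", " ", text)       # timestamps like [00:01:05]
    text = re.sub(r"^[A-Z][A-Za-z .]*:\s*", "", text)             # leading speaker labels ("SPEAKER A:")
    text = re.sub(r"\[(laughter|music|applause|noise)\]", " ",
                  text, flags=re.IGNORECASE)                      # sound event tags
    for src, dst in EXPANSIONS.items():
        text = text.replace(src, dst)
    # Numerals ("101" -> "one hundred one") could be expanded here, e.g. with num2words.
    text = re.sub(r"\s+", " ", text).strip()                      # collapse whitespace
    return text.lower()                                           # only if your tokenizer expects lowercase

print(clean_transcript("SPEAKER A: [00:01:05] We reached 100% agreement [laughter]."))
# -> "we reached 100 percent agreement ."
```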
1.6. Data Structuring & Manifest File Creation
- Goal: Create index files (manifests) that tell the TTS training script where to find the audio files and their corresponding transcriptions.
- Manifest Format: The most common format is a plain text file where each line represents one audio-text pair, separated by a delimiter (usually a pipe `|`):

```text
path/to/audio_chunk.wav|The corresponding transcription text|speaker_id
```

  - `path/to/audio_chunk.wav`: Relative path to the normalized audio file from the directory where the training script will be run.
  - `The corresponding transcription text`: The cleaned, normalized text from the `.txt` file.
  - `speaker_id`: An identifier for the speaker (e.g., `speaker0`, `mary_smith`). For single-speaker datasets, use the same ID for all lines. For multi-speaker datasets, use unique IDs for each distinct speaker.
- Splitting Data (Train/Validation): Divide your data into a training set (used to update model weights) and a validation set (used to monitor performance on unseen data and prevent overfitting). A common split is 90-98% for training and 2-10% for validation. Crucially, ensure that segments from the same original long recording do not end up in both the train and validation sets where possible, to avoid data leakage. If splitting randomly, shuffle first (a group-aware alternative is sketched after the script below).
- Generate Manifests Script:

```python
import os
import random

# --- Configuration ---
dataset_name = "my_tts_dataset"
normalized_audio_dir = os.path.join(dataset_name, "normalized_chunks")
transcripts_dir = os.path.join(dataset_name, "transcripts")
output_dir = dataset_name  # Where manifest files will be saved
train_manifest_path = os.path.join(output_dir, "train_list.txt")
val_manifest_path = os.path.join(output_dir, "val_list.txt")
speaker_id = "main_speaker"  # Use a consistent ID for single-speaker datasets
                             # For multi-speaker, determine ID based on filename or source
val_split_ratio = 0.05       # 5% for validation set
random_seed = 42             # For reproducible splits
# ---------------------

manifest_entries = []

print("Reading audio and transcript files...")

# Iterate through normalized audio files
wav_files = sorted([f for f in os.listdir(normalized_audio_dir) if f.endswith(".wav")])

for wav_filename in wav_files:
    base_filename = os.path.splitext(wav_filename)[0]
    txt_filename = base_filename + ".txt"

    audio_path = os.path.join(normalized_audio_dir, wav_filename)
    # Use os.path.relpath if your training script runs from a different root
    # relative_audio_path = os.path.relpath(audio_path, start=training_script_dir)
    relative_audio_path = audio_path  # Assuming script runs from root containing 'my_tts_dataset'

    transcript_path = os.path.join(transcripts_dir, txt_filename)

    if os.path.exists(transcript_path):
        try:
            with open(transcript_path, "r", encoding="utf-8") as f:
                transcript = f.read().strip()
            # Basic cleaning: remove pipe chars, trim extra whitespace
            transcript = transcript.replace('|', ' ').strip()
            transcript = ' '.join(transcript.split())  # Normalize whitespace
            if transcript:  # Ensure transcript is not empty after cleaning
                manifest_entries.append(f"{relative_audio_path}|{transcript}|{speaker_id}")
            else:
                print(f"Warning: Empty transcript for {wav_filename}. Skipping.")
        except Exception as e:
            print(f"Error reading or processing transcript {txt_filename}: {e}. Skipping.")
    else:
        print(f"Warning: Missing transcript file {txt_filename} for {wav_filename}. Skipping.")

print(f"Found {len(manifest_entries)} valid audio-transcript pairs.")

# Shuffle and split
random.seed(random_seed)
random.shuffle(manifest_entries)
split_idx = int(len(manifest_entries) * (1 - val_split_ratio))
train_entries = manifest_entries[:split_idx]
val_entries = manifest_entries[split_idx:]

# Write manifest files
try:
    with open(train_manifest_path, "w", encoding="utf-8") as f:
        f.write("\n".join(train_entries))
    print(f"Successfully wrote {len(train_entries)} entries to {train_manifest_path}")

    with open(val_manifest_path, "w", encoding="utf-8") as f:
        f.write("\n".join(val_entries))
    print(f"Successfully wrote {len(val_entries)} entries to {val_manifest_path}")
except Exception as e:
    print(f"Error writing manifest files: {e}")
```
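The script above splits randomly. If your chunk filenames encode their source recording (the sketch below assumes a hypothetical naming scheme like `chapter03_segment_00042.wav`), a group-aware split keeps all chunks from one recording on the same side of the train/validation boundary, avoiding the leakage mentioned under "Splitting Data":

```python
# Sketch: group-aware train/validation split to avoid leakage between sets.
# Assumes filenames carry a source-recording prefix such as "chapter03_segment_00042.wav"
# (hypothetical naming); adapt the key extraction to your layout.
import os
import random
from collections import defaultdict

def group_split(manifest_entries, val_ratio=0.05, seed=42):
    """Split manifest lines so chunks from one source recording stay together."""
    groups = defaultdict(list)
    for entry in manifest_entries:
        audio_path = entry.split("|")[0]
        source_id = os.path.basename(audio_path).split("_segment_")[0]  # e.g. "chapter03"
        groups[source_id].append(entry)

    source_ids = sorted(groups)
    random.Random(seed).shuffle(source_ids)

    target_val = int(sum(len(v) for v in groups.values()) * val_ratio)
    val_entries, train_entries = [], []
    for source_id in source_ids:
        bucket = val_entries if len(val_entries) < target_val else train_entries
        bucket.extend(groups[source_id])
    return train_entries, val_entries

# Usage (replacing the random split above):
# train_entries, val_entries = group_split(manifest_entries, val_split_ratio, random_seed)
```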
2. Data Quality Checklist
Before moving to training setup, rigorously review your prepared dataset using this checklist. Fixing issues now saves significant time later.
| Aspect | Check | Why Important? | Action if Failed |
|---|---|---|---|
| Audio Completeness | Do all listed `.wav` files in the manifests actually exist? | Training will crash if files are missing. | Re-run manifest generation; check file paths; ensure no files were accidentally deleted. |
| Transcript Match | Does each `.wav` have a corresponding, accurate `.txt` transcript? | Mismatched pairs teach the model incorrect associations. | Verify filenames; review ASR output; manually correct transcripts. |
| Audio Length | Are most segments within the desired range (e.g., 2-15 s)? Few outliers? | Very short/long segments can destabilize training. | Re-run chunking with adjusted parameters; manually filter outliers from the manifests. |
| Audio Quality | Listen to random samples: low background noise? No music/reverb/echo? | Garbage in, garbage out: the model learns the noise. | Improve source audio; apply noise reduction (carefully!); filter out bad segments. |
| Speaker Consistency | For single-speaker datasets: is it always the target voice? No other speakers? | Prevents voice dilution or instability. | Manually review/filter segments; check chunking boundaries. |
| Format & Specs | All WAV? Identical sampling rate? Mono channels? PCM 16-bit? | Inconsistencies cause errors or poor performance. | Re-run conversion/resampling (Section 1.1). Batch-verify specs with command-line tools like `ffprobe` or `soxi` (part of the SoX package), e.g. `soxi -r *.wav` to check rates. |
| Volume Levels | Listen to random samples: are volumes relatively consistent? | Drastic volume shifts can hinder learning. | Re-run normalization (Section 1.4); check normalization parameters. |
| Transcription Cleanliness | No timestamps or speaker labels? Fillers handled consistently? Standard punctuation? Numbers/symbols expanded? | Ensures text maps cleanly to speech sounds/prosody. | Re-run text cleaning scripts; perform manual review and correction. |
| Manifest Format | Correct `path\|text\|speaker_id` structure? Paths valid? No extra lines? | Parser errors will prevent data loading. | Check the delimiter (`\|`); validate paths relative to the training script location; check encoding (UTF-8 preferred). |
| Train/Val Split | Are validation files truly unseen during training? No overlap? | Overlapping data gives misleading validation scores. | Ensure a random shuffle before splitting; check the splitting logic. |
Tip: Use tools like `soxi` (from SoX) or `ffprobe` to batch-check audio properties (sampling rate, channels, duration). Write small scripts to verify file existence and basic manifest formatting.
2.1. Practical Verification Scripts
Here are some practical scripts to help verify your dataset quality:
Check Audio Properties (Sampling Rate, Channels, Duration)
```bash
#!/bin/bash
# verify_audio.sh - Check audio properties across all WAV files
# Usage: ./verify_audio.sh /path/to/audio/directory
AUDIO_DIR="$1"
echo "Checking audio files in $AUDIO_DIR..."
# Check if SoX is installed
if ! command -v soxi &> /dev/null; then
echo "SoX not found. Please install it first (e.g., 'apt-get install sox' or 'brew install sox')."
exit 1
fi
# Initialize counters and arrays
total_files=0
non_mono=0
wrong_rate=0
too_short=0
too_long=0
target_rate=22050 # Change this to your target sampling rate
min_duration=1.0 # Minimum duration in seconds
max_duration=15.0 # Maximum duration in seconds
# Process all WAV files (the process substitution at 'done' keeps the loop in the
# current shell, so the counters updated inside it persist for the summary below)
while read -r file; do
total_files=$((total_files + 1))
# Get audio properties
channels=$(soxi -c "$file")
rate=$(soxi -r "$file")
duration=$(soxi -d "$file" | awk -F: '{ print ($1 * 3600) + ($2 * 60) + $3 }')
# Check properties
if [ "$channels" -ne 1 ]; then
echo "WARNING: Non-mono file: $file (channels: $channels)"
non_mono=$((non_mono + 1))
fi
if [ "$rate" -ne "$target_rate" ]; then
echo "WARNING: Wrong sampling rate: $file (rate: $rate Hz, expected: $target_rate Hz)"
wrong_rate=$((wrong_rate + 1))
fi
if (( $(echo "$duration < $min_duration" | bc -l) )); then
echo "WARNING: File too short: $file (duration: ${duration}s, minimum: ${min_duration}s)"
too_short=$((too_short + 1))
fi
if (( $(echo "$duration > $max_duration" | bc -l) )); then
echo "WARNING: File too long: $file (duration: ${duration}s, maximum: ${max_duration}s)"
too_long=$((too_long + 1))
fi
# Print progress every 100 files
if [ $((total_files % 100)) -eq 0 ]; then
echo "Processed $total_files files..."
fi
done < <(find "$AUDIO_DIR" -name "*.wav")
# Print summary
echo "===== SUMMARY ====="
echo "Total files checked: $total_files"
echo "Non-mono files: $non_mono"
echo "Files with wrong sampling rate: $wrong_rate"
echo "Files too short (<${min_duration}s): $too_short"
echo "Files too long (>${max_duration}s): $too_long"
if [ $((non_mono + wrong_rate + too_short + too_long)) -eq 0 ]; then
echo "All files passed basic checks!"
else
echo "Some issues were found. Please review the warnings above."
fi
```
Verify Manifest File Integrity
```python
#!/usr/bin/env python3
# verify_manifest.py - Check that all files in manifest exist and have matching transcripts
# Usage: python verify_manifest.py path/to/manifest.txt
import os
import sys
from pathlib import Path
def verify_manifest(manifest_path):
"""Verify that all audio files and transcripts in the manifest exist and are valid."""
if not os.path.exists(manifest_path):
print(f"Error: Manifest file '{manifest_path}' not found.")
return False
print(f"Verifying manifest: {manifest_path}")
base_dir = os.path.dirname(os.path.abspath(manifest_path))
# Statistics
total_entries = 0
missing_audio = 0
empty_transcripts = 0
with open(manifest_path, 'r', encoding='utf-8') as f:
for line_num, line in enumerate(f, 1):
line = line.strip()
if not line:
continue
total_entries += 1
# Parse the line (assuming pipe-separated format: audio_path|transcript|speaker_id)
parts = line.split('|')
if len(parts) < 2:
print(f"Line {line_num}: Invalid format. Expected at least 'audio_path|transcript'")
continue
audio_path = parts[0]
transcript = parts[1]
# Resolve relative paths against the manifest's directory; adjust this if your
# manifest paths are relative to the training script's working directory instead
if not os.path.isabs(audio_path):
audio_path = os.path.join(base_dir, audio_path)
# Check if audio file exists
if not os.path.exists(audio_path):
print(f"Line {line_num}: Audio file not found: {audio_path}")
missing_audio += 1
# Check if transcript is empty
if not transcript or transcript.isspace():
print(f"Line {line_num}: Empty transcript for {audio_path}")
empty_transcripts += 1
# Print summary
print("\n===== SUMMARY =====")
print(f"Total entries: {total_entries}")
print(f"Missing audio files: {missing_audio}")
print(f"Empty transcripts: {empty_transcripts}")
if missing_audio == 0 and empty_transcripts == 0:
print("All manifest entries are valid!")
return True
else:
print("Issues found in manifest. Please fix them before proceeding.")
return False
if __name__ == "__main__":
if len(sys.argv) != 2:
print("Usage: python verify_manifest.py path/to/manifest.txt")
sys.exit(1)
success = verify_manifest(sys.argv[1])
sys.exit(0 if success else 1)
```
Visualize Audio Spectrograms for Quality Assessment
This script helps you visually inspect the quality of your audio files by generating spectrograms:
```python
#!/usr/bin/env python3
# generate_spectrograms.py - Create spectrograms for audio quality assessment
# Usage: python generate_spectrograms.py /path/to/audio/directory /path/to/output/directory [num_samples]
import os
import sys
import random
import numpy as np
import matplotlib.pyplot as plt
import librosa
import librosa.display
from pathlib import Path
def generate_spectrograms(audio_dir, output_dir, num_samples=10):
"""Generate spectrograms for a random sample of audio files."""
# Create output directory if it doesn't exist
os.makedirs(output_dir, exist_ok=True)
# Get all WAV files
wav_files = list(Path(audio_dir).glob('**/*.wav'))
if not wav_files:
print(f"No WAV files found in {audio_dir}")
return False
# Sample files if there are more than requested
if len(wav_files) > num_samples:
wav_files = random.sample(wav_files, num_samples)
print(f"Generating spectrograms for {len(wav_files)} files...")
for i, wav_path in enumerate(wav_files):
try:
# Load audio file
y, sr = librosa.load(wav_path, sr=None)
# Create figure with two subplots
plt.figure(figsize=(12, 8))
# Plot waveform
plt.subplot(2, 1, 1)
librosa.display.waveshow(y, sr=sr)
plt.title(f'Waveform: {wav_path.name}')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
# Plot spectrogram
plt.subplot(2, 1, 2)
D = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
librosa.display.specshow(D, sr=sr, x_axis='time', y_axis='log')
plt.colorbar(format='%+2.0f dB')
plt.title('Log-frequency power spectrogram')
# Save figure
output_path = os.path.join(output_dir, f'spectrogram_{i+1}_{wav_path.stem}.png')
plt.tight_layout()
plt.savefig(output_path)
plt.close()
print(f"Generated: {output_path}")
except Exception as e:
print(f"Error processing {wav_path}: {e}")
print(f"Spectrograms saved to {output_dir}")
return True
if __name__ == "__main__":
if len(sys.argv) < 3:
print("Usage: python generate_spectrograms.py /path/to/audio/directory /path/to/output/directory [num_samples]")
sys.exit(1)
audio_dir = sys.argv[1]
output_dir = sys.argv[2]
num_samples = int(sys.argv[3]) if len(sys.argv) > 3 else 10
success = generate_spectrograms(audio_dir, output_dir, num_samples)
sys.exit(0 if success else 1)
```
These scripts provide practical tools to verify your dataset’s quality before training, helping you identify and fix issues early in the process.
Once your dataset passes this quality check, you are ready to proceed to setting up the training environment.
Next Step: Training Setup | Back to Top