Universal TTS Guide

A comprehensive guide to TTS dataset prep and training


Guide 5: Packaging and Sharing Your TTS Model

Navigation: Main README | Previous Step: Inference | Next Step: Troubleshooting and Resources

You’ve trained a model and can generate speech with it. Congratulations! To ensure your model is usable in the future (by yourself or others) and to facilitate reproducibility, proper packaging and documentation are essential.


9. Packaging Your Trained Model

Think of your trained model not just as a single .pth file, but as a complete package containing everything needed to understand and use it.

9.1. Organize Your Model Files

Create a clean, self-contained directory structure for each distinct trained model or significant version. This makes it easy to find everything later.

Example Structure:

```
my_tts_model_packages/
└── yoruba_male_v1.0/         # Descriptive name for this model package
    ├── checkpoints/          # Directory for model weights
    │   ├── best_model.pth    # Checkpoint with lowest validation loss (or best perceived quality)
    │   └── last_model.pth    # Checkpoint from the very end of training (optional, but sometimes useful)
    │
    ├── config.yaml           # The EXACT configuration file used for training THIS checkpoint
    │
    ├── training_info.md      # Optional: a file with detailed training logs/notes
    │
    ├── train_list.txt        # Copy of the training manifest file used
    ├── val_list.txt          # Copy of the validation manifest file used
    │
    ├── samples/              # Directory with example audio generated by this model
    │   ├── sample_short_sentence.wav
    │   ├── sample_question.wav
    │   └── sample_longer_paragraph.wav
    │
    └── README.md             # Essential: Human-readable documentation for this specific model package
```
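Assembling this layout by hand for every release is error-prone, so it is worth scripting. A minimal stdlib-only sketch; the `package_model` name and all paths are illustrative, so adjust them to your own training output layout:

```python
import shutil
from pathlib import Path

def package_model(package_dir, checkpoint, config, manifests=(), samples=()):
    """Assemble a self-contained model package directory (illustrative layout)."""
    root = Path(package_dir)
    (root / "checkpoints").mkdir(parents=True, exist_ok=True)
    (root / "samples").mkdir(exist_ok=True)

    # Copy the best checkpoint and the EXACT training config into the package
    shutil.copy2(checkpoint, root / "checkpoints" / "best_model.pth")
    shutil.copy2(config, root / "config.yaml")

    for manifest in manifests:          # e.g. train_list.txt, val_list.txt
        shutil.copy2(manifest, root / Path(manifest).name)
    for sample in samples:              # generated example audio clips
        shutil.copy2(sample, root / "samples" / Path(sample).name)

    # Leave a stub README to fill in by hand if one doesn't exist yet
    readme = root / "README.md"
    if not readme.exists():
        readme.write_text("# TTS Model Package\n\nTODO: describe this model.\n")
    return root
```

Using `shutil.copy2` (rather than moving files) keeps the original training run intact, so the package stays an independent snapshot.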

Key Components Explained:

- **Checkpoints:** Keep the best-performing checkpoint (`best_model.pth`); the final checkpoint (`last_model.pth`) is optional but can be useful for resuming training.
- **`config.yaml`:** The exact configuration used to train this checkpoint. Inference with a mismatched config usually fails or produces degraded audio.
- **Manifests:** Copies of `train_list.txt` and `val_list.txt` document exactly what data the model saw.
- **Samples:** A few representative generated clips let listeners judge quality without running the model.
- **`README.md`:** Human-readable documentation for this specific package (see the next section).

9.2. Writing a Good Model README.md

This README is specific to this model package, not the overall project guide. It should tell anyone (including your future self) everything they need to know to use the model.

Minimal Template:

```markdown
# TTS Model Package: Yoruba Male Voice v1.0

## Model Description
- **Voice:** Clear, adult male voice speaking Yoruba.
- **Source Data Quality:** Trained on ~25 hours of clean radio broadcast recordings.
- **Language(s):** Yoruba (primarily). May have limited handling of English loanwords, depending on the training data.
- **Speaking Style:** Formal, narrative/broadcast style.
- **Model Architecture:** [Specify framework/architecture, e.g., StyleTTS2, VITS]
- **Version:** 1.0

## Training Details
- **Based On:** Fine-tuned from [specify base model, e.g., a pre-trained LibriTTS model] OR trained from scratch.
- **Training Data:** See included `train_list.txt` and `val_list.txt`. Total hours: ~25h.
- **Key Training Config:** See included `config.yaml`.
- **Sampling Rate:** 22050 Hz (input audio must match this rate in some frameworks).
- **Training Time:** Approx. 48 hours on 1x NVIDIA RTX 3090.
- **Checkpoint Info:** `best_model.pth` selected based on lowest validation loss at step [XXXXX].

## How to Use for Inference
1.  **Prerequisites:** Ensure you have the [specify TTS framework name, e.g., StyleTTS2] framework installed, at a version compatible with this model.
2.  **Configuration:** Use the included `config.yaml`.
3.  **Checkpoint:** Load the `checkpoints/best_model.pth` file.
4.  **Input Text:** Provide plain text input. Text normalization matching the training data (e.g., number expansion) may improve results.
5.  **Speaker ID (if applicable):** This is a single-speaker model. If the framework requires a speaker ID, use `[specify ID used, e.g., main_speaker]`; otherwise none is needed.
6.  **Expected Output:** Audio is generated at a 22050 Hz sampling rate.

## Audio Samples
Listen to examples generated by this model:
- [Short Sentence](./samples/sample_short_sentence.wav)
- [Question](./samples/sample_question.wav)
- [Longer Paragraph](./samples/sample_longer_paragraph.wav)

## Known Limitations / Notes
- Performance may degrade on text significantly different from the radio broadcast domain.
- Does not explicitly model nuanced emotions.
- [Add any other relevant observations]

## Licensing
- **Model Weights:** [Specify license, e.g., CC BY-NC-SA 4.0, research/non-commercial use only, MIT. Be accurate!]
- **Source Data:** [Mention source data license restrictions if they impact model usage, e.g., "Trained on proprietary data; model for internal use only."] **Consult the license of your training data!**
```
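If you release many models, a README like this can be rendered from per-model metadata instead of edited by hand each time. A minimal sketch using a trimmed-down set of fields; `write_readme` and the field names are illustrative, so extend the template to match the full structure above:

```python
from pathlib import Path

# Trimmed-down README template; placeholders are filled per model
README_TEMPLATE = """\
# TTS Model Package: {name} v{version}

## Model Description
- **Voice:** {voice}
- **Language(s):** {language}
- **Model Architecture:** {architecture}

## Training Details
- **Sampling Rate:** {sampling_rate} Hz
- **Checkpoint Info:** `best_model.pth` selected at step {best_step}.

## Licensing
- **Model Weights:** {license}
"""

def write_readme(package_dir, **fields):
    """Render the README template with per-model metadata and write it."""
    text = README_TEMPLATE.format(**fields)
    Path(package_dir, "README.md").write_text(text)
    return text
```

Keeping the template in code guarantees every package documents the same fields, which matters once you have more than one or two models.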

9.3. Model Versioning Tips

Treat your trained models like software releases:

- Give each package a version number (e.g., `v1.0`, `v1.1`) and never overwrite a released package in place.
- Bump the version whenever you retrain with different data, hyperparameters, or a different base checkpoint, and note what changed in `training_info.md`.
- Keep the exact `config.yaml` and manifest copies with each version so results stay reproducible.
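One lightweight release habit is recording a checksum for each released checkpoint, so you (or anyone who downloads it) can later verify the file is the one you released. A stdlib-only sketch assuming the package layout from section 9.1; `record_release` and the log format are illustrative:

```python
import hashlib
from pathlib import Path

def checkpoint_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a checkpoint file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def record_release(package_dir, version):
    """Append the checkpoint's digest to a simple release log in the package."""
    digest = checkpoint_digest(Path(package_dir) / "checkpoints" / "best_model.pth")
    log = Path(package_dir) / "training_info.md"
    with open(log, "a") as f:
        f.write(f"- v{version}: best_model.pth sha256={digest}\n")
    return digest
```

Reading in chunks keeps memory flat even for multi-gigabyte checkpoints.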

9.4. Sharing and Distribution Considerations

If you plan to share your model:

- **Check licenses first.** You may only distribute the model if your training data's license permits it; state the model license clearly in the README.
- **Share the whole package**, not just the checkpoint: config, README, manifests (if shareable), and audio samples.
- **State the framework and version** so others can reproduce your inference environment.
- **Choose a distribution channel suited to large files** (e.g., the Hugging Face Hub or a release archive) rather than committing checkpoints to a git repository.

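Before uploading, it also helps to verify the package is complete. A minimal stdlib sketch; the `REQUIRED` list reflects the layout from section 9.1 and can be extended with manifests or samples:

```python
from pathlib import Path

# Files every shareable package should contain (relative to the package root)
REQUIRED = [
    "README.md",
    "config.yaml",
    "checkpoints/best_model.pth",
]

def missing_files(package_dir):
    """Return the list of required files missing from a model package."""
    root = Path(package_dir)
    return [rel for rel in REQUIRED if not (root / rel).exists()]
```

An empty return value means the package has at least the minimum needed for someone else to run it.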

Properly packaging and documenting your models makes them significantly more valuable and usable, whether for your own future projects or for collaboration and sharing within the community.

Next Step: Troubleshooting and Resources | Back to Top