Guide 5: Packaging and Sharing Your TTS Model
Navigation: Main README | Previous Step: Inference | Next Step: Troubleshooting and Resources |
You’ve trained a model and can generate speech with it. Congratulations! To ensure your model is usable in the future (by yourself or others) and to facilitate reproducibility, proper packaging and documentation are essential.
9. Packaging Your Trained Model
Think of your trained model not just as a single `.pth` file, but as a complete package containing everything needed to understand and use it.
9.1. Organize Your Model Files
Create a clean, self-contained directory structure for each distinct trained model or significant version. This makes it easy to find everything later.
Example Structure:
my_tts_model_packages/
└── yoruba_male_v1.0/ # Descriptive name for this model package
├── checkpoints/ # Directory for model weights
│ ├── best_model.pth # Checkpoint with lowest validation loss (or best perceived quality)
│ └── last_model.pth # Checkpoint from the very end of training (optional, but sometimes useful)
│
├── config.yaml # The EXACT configuration file used for training THIS checkpoint
│
├── training_info.md # Optional: A file with detailed training logs/notes
├── train_list.txt # Copy of the training manifest file used
├── val_list.txt # Copy of the validation manifest file used
│
├── samples/ # Directory with example audio generated by this model
│ ├── sample_short_sentence.wav
│ ├── sample_question.wav
│ └── sample_longer_paragraph.wav
│
└── README.md # Essential: Human-readable documentation for this specific model package
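To make this layout repeatable, a short script can assemble the package from your training outputs. Below is a minimal Python sketch; the source paths (`training_output/yoruba_male_run3`, an `eval_samples/` folder of generated audio) are hypothetical placeholders, so adapt the names to whatever your framework actually produces.

```python
import shutil
from pathlib import Path

# Hypothetical source locations: adapt these to wherever your training run
# actually wrote its outputs.
run_dir = Path("training_output/yoruba_male_run3")
package = Path("my_tts_model_packages/yoruba_male_v1.0")

# Create the package skeleton.
(package / "checkpoints").mkdir(parents=True, exist_ok=True)
(package / "samples").mkdir(parents=True, exist_ok=True)

# Copy the essentials; copy2 preserves file timestamps, which helps traceability.
shutil.copy2(run_dir / "best_model.pth", package / "checkpoints" / "best_model.pth")
shutil.copy2(run_dir / "config.yaml", package / "config.yaml")
shutil.copy2(run_dir / "train_list.txt", package / "train_list.txt")
shutil.copy2(run_dir / "val_list.txt", package / "val_list.txt")

# Copy a few representative generated samples (directory name is hypothetical).
for wav in sorted((run_dir / "eval_samples").glob("*.wav"))[:3]:
    shutil.copy2(wav, package / "samples" / wav.name)

print(f"Packaged model into {package}")
```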
Key Components Explained:
- `checkpoints/`: Contains the actual model weights. Always include the checkpoint deemed 'best' (whether by loss or by listening tests). Including the final checkpoint is also good practice.
- `config.yaml` (or `.json`): Absolutely critical. This file defines the model architecture and parameters required to load and use the checkpoint correctly; without it, the checkpoint is often unusable. Ensure it is the exact config used for the included checkpoints (a quick sanity-check sketch follows this list).
- `training_info.md` / Manifests (Optional but Recommended): Storing the manifests helps track exactly what data the model was trained on. A `training_info.md` can hold notes about the training run (duration, hardware used, final metrics, observations).
- `samples/`: Include a few diverse audio examples generated by `best_model.pth`. This quickly demonstrates the model's voice identity, quality, and characteristics.
- `README.md`: The user manual for this specific model package. See the next section.
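Before sharing a package, it is worth a quick check that the essential files are present and loadable. Here is a minimal sketch, assuming PyTorch-format checkpoints and a YAML config; the package path matches the example layout above.

```python
from pathlib import Path

import torch  # assumes a PyTorch-based framework (e.g. StyleTTS2, VITS)
import yaml

package = Path("my_tts_model_packages/yoruba_male_v1.0")

# The essentials: without the config, the checkpoint is usually unusable.
required = [
    package / "config.yaml",
    package / "checkpoints" / "best_model.pth",
    package / "README.md",
]
missing = [str(p) for p in required if not p.exists()]
if missing:
    raise SystemExit(f"Package incomplete, missing: {missing}")

# The config should parse and the checkpoint should load (CPU is enough for a check).
config = yaml.safe_load((package / "config.yaml").read_text())
checkpoint = torch.load(package / "checkpoints" / "best_model.pth", map_location="cpu")

print("Config sections:", sorted(config) if isinstance(config, dict) else type(config))
print("Checkpoint type:", type(checkpoint).__name__)
```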
9.2. Writing a Good Model README.md
This README is specific to this model package, not the overall project guide. It should tell anyone (including your future self) everything they need to know to use the model.
Minimal Template:
# TTS Model Package: Yoruba Male Voice v1.0
## Model Description
- **Voice:** Clear, adult male voice speaking Yoruba.
- **Source Data Quality:** Trained on ~25 hours of clean radio broadcast recordings.
- **Language(s):** Yoruba (primarily). May have limited handling of English loanwords based on training data.
- **Speaking Style:** Formal, narrative/broadcast style.
- **Model Architecture:** [Specify Framework/Architecture, e.g., StyleTTS2, VITS]
- **Version:** 1.0
## Training Details
- **Based On:** Fine-tuned from [Specify base model, e.g., pre-trained LibriTTS model] OR Trained from scratch.
- **Training Data:** See included `train_list.txt` and `val_list.txt`. Total hours: ~25h.
- **Key Training Config:** See included `config.yaml`.
- **Sampling Rate:** 22050 Hz (Input audio must match this rate for some frameworks).
- **Training Time:** Approx. 48 hours on 1x NVIDIA RTX 3090.
- **Checkpoint Info:** `best_model.pth` selected based on lowest validation loss at step [XXXXX].
## How to Use for Inference
1. **Prerequisites:** Ensure you have the [Specify TTS Framework Name, e.g., StyleTTS2] framework installed, compatible with this model version.
2. **Configuration:** Use the included `config.yaml`.
3. **Checkpoint:** Load the `checkpoints/best_model.pth` file.
4. **Input Text:** Provide plain text input. Text normalization matching the training data (e.g., number expansion) might improve results.
5. **Speaker ID (if applicable):** This is a single-speaker model. Use speaker ID `[Specify ID used, e.g., main_speaker]` if required by the framework, otherwise it might not be needed.
6. **Expected Output:** Audio will be generated at 22050 Hz sampling rate.
## Audio Samples
Listen to examples generated by this model:
- [Short Sentence](./samples/sample_short_sentence.wav)
- [Question](./samples/sample_question.wav)
- [Longer Paragraph](./samples/sample_longer_paragraph.wav)
## Known Limitations / Notes
- Performance may degrade on text significantly different from the radio broadcast domain.
- Does not explicitly model nuanced emotions.
- [Add any other relevant observations]
## Licensing
- **Model Weights:** [Specify License, e.g., CC BY-NC-SA 4.0, Research/Non-Commercial Use Only, MIT License - Be accurate!]
- **Source Data:** [Mention source data license restrictions if they impact model usage, e.g., "Trained on proprietary data, model for internal use only."] **Consult the license of your training data!**
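If you maintain several packages, one way to keep their READMEs consistent is to fill a template from a small metadata dictionary. A minimal sketch follows; the field names and values are illustrative, not a required schema.

```python
from pathlib import Path

# Illustrative metadata for one package; adapt the fields to your own needs.
meta = {
    "name": "Yoruba Male Voice",
    "version": "1.0",
    "language": "Yoruba",
    "hours": 25,
    "sample_rate": 22050,
    "framework": "StyleTTS2",
    "license": "CC BY-NC-SA 4.0",
}

readme = f"""# TTS Model Package: {meta['name']} v{meta['version']}

## Model Description
- **Language(s):** {meta['language']}
- **Model Architecture:** {meta['framework']}
- **Sampling Rate:** {meta['sample_rate']} Hz

## Training Details
- **Training Data:** ~{meta['hours']} hours; see `train_list.txt` / `val_list.txt`.
- **Key Training Config:** see the included `config.yaml`.

## Licensing
- **Model Weights:** {meta['license']}
"""

Path("my_tts_model_packages/yoruba_male_v1.0/README.md").write_text(readme)
```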
9.3. Model Versioning Tips
Treat your trained models like software releases.
- Use Semantic Versioning (Recommended): Use names like `model_v1.0`, `model_v1.1`, `model_v2.0` (a small naming helper sketch follows this list).
  - Increment the PATCH version (v1.0 -> v1.0.1) for minor fixes or retrains with the same data/config.
  - Increment the MINOR version (v1.0 -> v1.1) for improvements, retraining with more data, or significant config tweaks.
  - Increment the MAJOR version (v1.0 -> v2.0) for major architecture changes or a complete retraining with different core data/goals.
- Update READMEs: When creating a new version, update its README to reflect the changes from the previous version.
- Keep Old Versions: Don’t immediately discard older versions. Sometimes a previous model might perform better on specific types of text, or you might need to revert if a new version introduces regressions. Storage permitting, archive them.
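If you follow a `name_vMAJOR.MINOR[.PATCH]` naming scheme like the one above, a tiny helper can compute the next package name. The function below is an illustrative sketch under that assumption, not part of any framework.

```python
import re

def bump_version(name: str, part: str = "minor") -> str:
    """Return the next package name, e.g. 'yoruba_male_v1.0' -> 'yoruba_male_v1.1'."""
    m = re.fullmatch(r"(?P<base>.+)_v(?P<major>\d+)\.(?P<minor>\d+)(?:\.(?P<patch>\d+))?", name)
    if m is None:
        raise ValueError(f"Unrecognised package name: {name}")
    major, minor, patch = int(m["major"]), int(m["minor"]), int(m["patch"] or 0)
    if part == "major":
        major, minor, patch = major + 1, 0, 0
    elif part == "minor":
        minor, patch = minor + 1, 0
    else:  # "patch"
        patch += 1
    version = f"{major}.{minor}.{patch}" if patch else f"{major}.{minor}"
    return f"{m['base']}_v{version}"

print(bump_version("yoruba_male_v1.0"))           # yoruba_male_v1.1
print(bump_version("yoruba_male_v1.0", "patch"))  # yoruba_male_v1.0.1
print(bump_version("yoruba_male_v1.1", "major"))  # yoruba_male_v2.0
```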
9.4. Sharing and Distribution Considerations
If you plan to share your model:
- Packaging: Create a compressed archive (e.g., `.zip`, `.tar.gz`) of the entire model package directory (containing checkpoints, config, README, samples, etc.); see the sketch after this list.
- Hosting Platforms:
  - Hugging Face Hub (Models): Excellent platform for sharing models; it includes versioning, model cards (use your README content!), and potentially inference widgets. Easy for others to discover and use.
  - GitHub Releases: Suitable for smaller model files (release assets have size limits); attach the archive to a release tag in your project repository.
  - Cloud Storage (Google Drive, Dropbox, S3): Simple for direct sharing, but less discoverable and lacks versioning features. Ensure link permissions are set correctly.
- Licensing (CRITICAL):
  - Your Model: Choose a license for the model weights you are distributing (e.g., MIT, Apache 2.0 for permissive; CC BY-NC-SA for non-commercial sharing).
  - Data Dependency: Crucially, the license of your training data often dictates how you can license your trained model. If you trained on data with a non-commercial license, you generally cannot release your model under a permissive commercial license. If trained on copyrighted data without permission, you likely cannot share the model publicly at all. Always check your data sources’ licenses.
  - Framework License: The TTS framework code itself has its own license, which is separate from your model’s license.
- Clearly State Usage Terms: Use the `README.md` within your model package to clearly state the intended use (e.g., research only, non-commercial, free for any use) and the license terms.
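Tying the packaging and hosting points together, the sketch below zips a package directory with the standard library and (optionally) uploads the folder to the Hugging Face Hub via the `huggingface_hub` library. The repository ID and paths are placeholders, and the upload step assumes you are already authenticated (e.g. via `huggingface-cli login`).

```python
import shutil

from huggingface_hub import HfApi

package_dir = "my_tts_model_packages/yoruba_male_v1.0"  # placeholder path
repo_id = "your-username/yoruba-male-tts-v1.0"          # placeholder repo ID

# 1. Create a .zip archive alongside the package directory (for GitHub Releases,
#    cloud storage, or direct sharing).
archive_path = shutil.make_archive(
    base_name=package_dir,            # produces .../yoruba_male_v1.0.zip
    format="zip",
    root_dir="my_tts_model_packages",
    base_dir="yoruba_male_v1.0",
)
print("Wrote", archive_path)

# 2. Or upload the unpacked folder to the Hugging Face Hub, where README.md
#    doubles as the model card.
api = HfApi()
api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path=package_dir,
    repo_id=repo_id,
    repo_type="model",
    commit_message="Upload Yoruba male voice v1.0 model package",
)
```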
Properly packaging and documenting your models makes them significantly more valuable and usable, whether for your own future projects or for collaboration and sharing within the community.
Next Step: Troubleshooting and Resources | Back to Top