This commit completes the evaluation of Text-to-Speech (TTS) options as described in TICKET-013.
- Creates a detailed document comparing Piper, Mimic 3, and Coqui TTS.
- Recommends Piper for initial development due to its performance and low resource usage.
- Updates to reflect the decision and points to the new evaluation document.
- Moves TICKET-013 to the 'done' column.
TTS Evaluation
This document outlines the evaluation of Text-to-Speech (TTS) options for the project, as detailed in TICKET-013.
1. Options Considered
The following TTS engines were evaluated based on latency, quality, resource usage, and customization options.
| Feature | Piper | Mycroft Mimic 3 | Coqui TTS |
|---|---|---|---|
| License | MIT | AGPL-3.0 | Mozilla Public License 2.0 |
| Technology | VITS | VITS | Various (Tacotron, Glow-TTS, etc.) |
| Pre-trained Voices | Yes | Yes | Yes |
| Voice Cloning | No | No | Yes |
| Language Support | Multi-lingual | Multi-lingual | Multi-lingual |
| Resource Usage | Low (CPU) | Moderate (CPU) | High (GPU recommended) |
| Latency | Low | Low | Moderate to High |
| Quality | Good | Very Good | Excellent |
| Notes | Fast, lightweight, good for resource-constrained devices. | High-quality voices, but more restrictive license. | Very high quality, but requires more resources. Actively developed. |
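
The latency row above is qualitative. One way to put a number on it is the real-time factor (synthesis time divided by the duration of the audio produced). The sketch below is a minimal harness under stated assumptions: the `Synthesizer` signature and the `fake_engine` stand-in are hypothetical and are not part of Piper, Mimic 3, or Coqui TTS; a real engine would need a thin wrapper to fit this shape.

```python
import time
from typing import Callable, Tuple

# Hypothetical signature: a synthesizer takes text and returns
# (number_of_audio_samples, sample_rate_hz). Real engines expose
# different APIs and would need a small adapter.
Synthesizer = Callable[[str], Tuple[int, int]]


def real_time_factor(synthesize: Synthesizer, text: str, runs: int = 3) -> float:
    """Return the lowest synthesis_time / audio_duration seen over `runs` calls."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        num_samples, sample_rate = synthesize(text)
        elapsed = time.perf_counter() - start
        audio_seconds = num_samples / sample_rate
        best = min(best, elapsed / audio_seconds)
    return best


def fake_engine(text: str) -> Tuple[int, int]:
    """Stand-in engine: pretends each character yields 0.05 s of 22.05 kHz audio."""
    time.sleep(0.01)  # simulated compute time
    return int(len(text) * 0.05 * 22050), 22050


if __name__ == "__main__":
    rtf = real_time_factor(fake_engine, "Hello from the family agent.")
    print(f"RTF: {rtf:.3f} (below 1.0 means faster than real time)")
```

Running the same sentence set through each wrapped engine on the target hardware would give comparable figures. The harness does not guard against empty input, which would yield zero audio samples.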
2. Evaluation Summary
| Engine | Pros | Cons | Recommendation |
|---|---|---|---|
| Piper | - Very fast, low latency<br>- Lightweight, runs on CPU<br>- Good quality for its size<br>- Permissive license | - Quality not as high as larger models<br>- Fewer voice customization options | Recommended for prototyping and initial development. Its speed and low resource usage are ideal for quick iteration. |
| Mycroft Mimic 3 | - High-quality, natural-sounding voices<br>- Good performance on CPU | - AGPL-3.0 license may have implications for commercial use<br>- Less actively maintained than Coqui | A strong contender, but the license needs legal review. |
| Coqui TTS | - State-of-the-art, excellent voice quality<br>- Voice cloning and extensive customization<br>- Active community and development | - High resource requirements (GPU often necessary)<br>- Higher latency<br>- Coqui the company is now defunct, but the open source community continues work | Recommended for production if high quality is paramount and resources allow. Voice cloning is a powerful feature. |
3. Voice Selection
For the "family agent" persona, we need voices that are warm, friendly, and clear.
Initial Voice Candidates:
- From Piper: `en_US-lessac-medium` (a clear, standard American English voice)
- From Coqui TTS: requires further investigation into available pre-trained models that fit the desired persona
4. Resource Requirements
| Engine | CPU | RAM | Storage (Model Size) | GPU |
|---|---|---|---|---|
| Piper | ~1-2 cores | ~500MB | ~100-200MB per voice | Not required |
| Mimic 3 | ~2-4 cores | ~1GB | ~200-500MB per voice | Not required |
| Coqui TTS | 4+ cores | 2GB+ | 500MB - 2GB+ per model | Recommended for acceptable performance |
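
As a rough illustration of how the table above could drive an engine choice on a given host, the sketch below encodes the lower bound of each estimate and filters by what the machine offers. Treating those lower bounds as hard minimums, and treating "GPU recommended" as "GPU required", are simplifying assumptions made for this example; the names `Footprint` and `engines_that_fit` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Footprint:
    name: str
    min_cores: int
    min_ram_mb: int
    gpu_recommended: bool


# Lower bounds copied from the resource table; using them as hard
# minimums is an assumption of this sketch, not a measured result.
FOOTPRINTS = [
    Footprint("Piper", 1, 500, False),
    Footprint("Mimic 3", 2, 1000, False),
    Footprint("Coqui TTS", 4, 2000, True),
]


def engines_that_fit(cores: int, ram_mb: int, has_gpu: bool) -> List[str]:
    """List engines whose estimated minimums fit the given host."""
    return [
        f.name
        for f in FOOTPRINTS
        if cores >= f.min_cores
        and ram_mb >= f.min_ram_mb
        and (has_gpu or not f.gpu_recommended)
    ]


if __name__ == "__main__":
    # Example host: 4 cores, 2 GB RAM, no GPU.
    print(engines_that_fit(cores=4, ram_mb=2048, has_gpu=False))
    # → ['Piper', 'Mimic 3']
```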
5. Decision & Next Steps
Decision:
For the initial phase of development, Piper is the recommended TTS engine. Its ease of use, low resource footprint, and good-enough quality make it well suited to building and testing the core application.
We will proceed with the following steps:
- Integrate Piper as the default TTS engine.
- Use the `en_US-lessac-medium` voice for the family agent.
- Create a separate ticket to investigate integrating Coqui TTS as a "high-quality" option, pending resource availability and further voice evaluation.
- Update `ARCHITECTURE.md` to reflect this decision.
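
The integration steps above might look roughly like the following: a small engine interface with Piper as the default, leaving room for a later "high-quality" implementation. This is a sketch under assumptions, not the project's actual design. The `TTSEngine` protocol, the `PiperCLI` class, `default_engine`, and the `voices/` directory layout are all hypothetical. The `--model` and `--output_file` flags, and passing the text on stdin, follow the usage shown in Piper's README.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Protocol


class TTSEngine(Protocol):
    """Hypothetical interface; a later Coqui-backed engine could implement it too."""

    def synthesize_to_file(self, text: str, out_path: Path) -> None: ...


class PiperCLI:
    """Wraps the `piper` command-line tool, which reads text from stdin."""

    def __init__(self, model_path: Path, executable: str = "piper") -> None:
        self.model_path = model_path
        self.executable = executable

    def build_command(self, out_path: Path) -> List[str]:
        # Flags as shown in Piper's README example.
        return [
            self.executable,
            "--model", str(self.model_path),
            "--output_file", str(out_path),
        ]

    def synthesize_to_file(self, text: str, out_path: Path) -> None:
        if shutil.which(self.executable) is None:
            raise RuntimeError(f"{self.executable!r} not found on PATH")
        subprocess.run(
            self.build_command(out_path),
            input=text.encode("utf-8"),
            check=True,
        )


def default_engine(voices_dir: Path) -> TTSEngine:
    # Assumes the voice was downloaded to <voices_dir>/en_US-lessac-medium.onnx
    # with its .onnx.json config file alongside it.
    return PiperCLI(voices_dir / "en_US-lessac-medium.onnx")


# Usage (requires piper and the voice files to be installed):
#   engine = default_engine(Path("voices"))
#   engine.synthesize_to_file("Good morning, everyone.", Path("greeting.wav"))
```

Keeping the command construction in `build_command` makes the wrapper checkable without Piper installed, and a structural `Protocol` means an alternative engine only needs a matching `synthesize_to_file` method.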