This commit completes the evaluation of Text-to-Speech (TTS) options as described in TICKET-013.

- Creates a detailed document comparing Piper, Mimic 3, and Coqui TTS.
- Recommends Piper for initial development due to its performance and low resource usage.
- Updates to reflect the decision and points to the new evaluation document.
- Moves TICKET-013 to the 'done' column.

# TTS Evaluation

This document outlines the evaluation of Text-to-Speech (TTS) options for the project, as detailed in [TICKET-013](tickets/backlog/TICKET-013_tts-evaluation.md).

## 1. Options Considered
The following TTS engines were evaluated based on latency, quality, resource usage, and customization options.

| Feature | Piper | Mycroft Mimic 3 | Coqui TTS |
|---|---|---|---|
| **License** | MIT | AGPL-3.0 | Mozilla Public License 2.0 |
| **Technology** | VITS | VITS | Various (Tacotron, Glow-TTS, etc.) |
| **Pre-trained Voices** | Yes | Yes | Yes |
| **Voice Cloning** | No | No | Yes |
| **Language Support** | Multi-lingual | Multi-lingual | Multi-lingual |
| **Resource Usage** | Low (CPU) | Moderate (CPU) | High (GPU recommended) |
| **Latency** | Low | Low | Moderate to High |
| **Quality** | Good | Very Good | Excellent |
| **Notes** | Fast, lightweight, good for resource-constrained devices. | High-quality voices, but more restrictive license. | Very high quality, but requires more resources. Actively developed. |
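
The latency ratings in the table are qualitative. As a minimal sketch of how they could be backed by numbers, the harness below times any engine that is wrapped in a callable taking text and returning audio bytes. `measure_latency`, `LatencyStats`, and the `fake_synth` stand-in are hypothetical names introduced here for illustration; they are not part of Piper, Mimic 3, or Coqui TTS.

```python
import statistics
import time
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class LatencyStats:
    runs: int
    mean_s: float
    median_s: float
    max_s: float


def measure_latency(synthesize: Callable[[str], bytes],
                    sentences: List[str],
                    repeats: int = 3) -> LatencyStats:
    """Time `synthesize` on every sentence `repeats` times (wall-clock seconds)."""
    timings: List[float] = []
    for sentence in sentences:
        for _ in range(repeats):
            start = time.perf_counter()
            synthesize(sentence)  # audio is discarded; only elapsed time is recorded
            timings.append(time.perf_counter() - start)
    return LatencyStats(
        runs=len(timings),
        mean_s=statistics.mean(timings),
        median_s=statistics.median(timings),
        max_s=max(timings),
    )


def fake_synth(text: str) -> bytes:
    """Hypothetical stand-in engine: returns silence sized by text length."""
    return b"\x00" * (len(text) * 100)


if __name__ == "__main__":
    print(measure_latency(fake_synth, ["Hello there.", "Dinner is at six."]))
```

Replacing `fake_synth` with a thin wrapper around each real engine, run on the target hardware, would give comparable figures for the same set of sentences.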
## 2. Evaluation Summary
| **Engine** | **Pros** | **Cons** | **Recommendation** |
|---|---|---|---|
| **Piper** | - Very fast, low latency<br>- Lightweight, runs on CPU<br>- Good quality for its size<br>- Permissive license | - Quality not as high as larger models<br>- Fewer voice customization options | **Recommended for prototyping and initial development.** Its speed and low resource usage are ideal for quick iteration. |
| **Mycroft Mimic 3** | - High-quality, natural-sounding voices<br>- Good performance on CPU | - AGPL-3.0 license may have implications for commercial use<br>- Less actively maintained than Coqui | A strong contender, but the license needs legal review. |
| **Coqui TTS** | - State-of-the-art, excellent voice quality<br>- Voice cloning and extensive customization<br>- Active community and development | - High resource requirements (GPU often necessary)<br>- Higher latency<br>- Coqui the company is now defunct, but the open source community continues work. | **Recommended for production if high quality is paramount and resources allow.** Voice cloning is a powerful feature. |

## 3. Voice Selection
For the "family agent" persona, we need voices that are warm, friendly, and clear.

**Initial Voice Candidates:**

* **From Piper:** `en_US-lessac-medium` (A clear, standard American English voice)
* **From Coqui TTS:** (Requires further investigation into available pre-trained models that fit the desired persona)

## 4. Resource Requirements
| Engine | CPU | RAM | Storage (Model Size) | GPU |
|---|---|---|---|---|
| **Piper** | ~1-2 cores | ~500 MB | ~100-200 MB per voice | Not required |
| **Mimic 3** | ~2-4 cores | ~1 GB | ~200-500 MB per voice | Not required |
| **Coqui TTS** | 4+ cores | 2 GB+ | 500 MB - 2 GB+ per model | Recommended for acceptable performance |
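
The figures above can be encoded as data so that a deployment target can be checked against them. The following is a minimal sketch, not an established part of the project: it uses the lower bounds from the table (treating ~1 GB as 1024 MB), and `EngineRequirements`, `REQUIREMENTS`, and `engines_that_fit` are hypothetical names. Treating "GPU recommended" as a hard requirement is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EngineRequirements:
    name: str
    min_cores: int
    min_ram_mb: int
    gpu_recommended: bool


# Approximate lower bounds copied from the resource table.
REQUIREMENTS = [
    EngineRequirements("Piper", min_cores=1, min_ram_mb=500, gpu_recommended=False),
    EngineRequirements("Mimic 3", min_cores=2, min_ram_mb=1024, gpu_recommended=False),
    EngineRequirements("Coqui TTS", min_cores=4, min_ram_mb=2048, gpu_recommended=True),
]


def engines_that_fit(cores: int, ram_mb: int, has_gpu: bool) -> List[str]:
    """Return the engines whose lower-bound requirements the host meets.

    Assumption: an engine with a recommended GPU is excluded on GPU-less hosts.
    """
    return [
        req.name
        for req in REQUIREMENTS
        if cores >= req.min_cores
        and ram_mb >= req.min_ram_mb
        and (has_gpu or not req.gpu_recommended)
    ]


if __name__ == "__main__":
    # A small device with 4 cores, 1 GB RAM, and no GPU.
    print(engines_that_fit(cores=4, ram_mb=1024, has_gpu=False))
    # → ['Piper', 'Mimic 3']
```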
## 5. Decision & Next Steps
**Decision:**

For the initial phase of development, **Piper** is the recommended TTS engine. Its ease of use, low resource footprint, and good-enough quality make it well suited to building and testing the core application.

We will proceed with the following steps:

1. Integrate Piper as the default TTS engine.
2. Use the `en_US-lessac-medium` voice for the family agent.
3. Create a separate ticket to investigate integrating Coqui TTS as a "high-quality" option, pending resource availability and further voice evaluation.
4. Update `ARCHITECTURE.md` to reflect this decision.
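
Steps 1 and 2 could be sketched as a thin wrapper around the Piper command-line tool, which reads text from standard input and writes a WAV file. This is an illustrative sketch under stated assumptions rather than the project's integration: it assumes a `piper` executable is on `PATH`, and the model path, `build_piper_command`, and `speak_to_file` are hypothetical names chosen for this example.

```python
import subprocess
from pathlib import Path
from typing import List

# Assumption: the voice model was downloaded to this (hypothetical) location.
DEFAULT_MODEL = Path("voices/en_US-lessac-medium.onnx")


def build_piper_command(model: Path, output_wav: Path) -> List[str]:
    """Build the argument list for a one-shot Piper synthesis call."""
    return ["piper", "--model", str(model), "--output_file", str(output_wav)]


def speak_to_file(text: str, output_wav: Path, model: Path = DEFAULT_MODEL) -> Path:
    """Synthesize `text` into `output_wav`; Piper receives the text on stdin."""
    subprocess.run(
        build_piper_command(model, output_wav),
        input=text.encode("utf-8"),
        check=True,  # raise CalledProcessError if Piper exits with an error
    )
    return output_wav


if __name__ == "__main__":
    # Requires Piper and the voice model to be installed locally.
    speak_to_file("Good morning! Breakfast is ready.", Path("greeting.wav"))
```

Keeping the engine call behind a single function such as `speak_to_file` means a later "high-quality" engine (step 3) could be swapped in without touching the callers.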