Merge pull request 'Evaluate TTS Options' (#2) from vk/45ad-evaluate-tts-opt into master

Reviewed-on: #2
2026-01-05 21:30:15 -05:00 · 4a0bfa773f
commit 4a0bfa773f
parent f7dce46ac9 53771e13cf
4 changed files with 65 additions and 3 deletions
--- a/ARCHITECTURE.md
+++ b/ARCHITECTURE.md
@@ -78,12 +78,18 @@ The system consists of 5 parallel tracks:
 - **Languages**: Python (backend services), TypeScript/JavaScript (clients)
 - **LLM Servers**: Ollama, vLLM, or llama.cpp
 - **ASR**: faster-whisper or Whisper.cpp
- **TTS**: Piper, Mimic 3, or Coqui TTS
+- **TTS**: Piper (selected for initial development), Coqui TTS (possible future high-quality option)
 - **Wake-Word**: openWakeWord or Porcupine
 - **Protocols**: MCP (Model Context Protocol), WebSocket, HTTP/gRPC
 - **Storage**: SQLite (memory, sessions), Markdown files (tasks, notes)
 - **Infrastructure**: Docker, systemd, Linux

+### TTS Selection
+
+For initial development, **Piper** has been selected as the primary Text-to-Speech (TTS) engine. The decision rests on its low latency, low resource requirements (CPU only), and permissive MIT license, which suit prototyping and early-stage implementation. **Coqui TTS** is identified as a potential future upgrade for higher voice quality once more resources (ideally a GPU) can be allocated.
+
+For a detailed comparison of all evaluated options, see the [TTS Evaluation document](docs/TTS_EVALUATION.md).
+
 ## Design Patterns

 ### Core Patterns
--- a/docs/TTS_EVALUATION.md
+++ b/docs/TTS_EVALUATION.md
@@ -0,0 +1,56 @@
+# TTS Evaluation
+
+This document outlines the evaluation of Text-to-Speech (TTS) options for the project, as detailed in [TICKET-013](../tickets/backlog/TICKET-013_tts-evaluation.md).
+
+## 1. Options Considered
+
+The following TTS engines were evaluated based on latency, quality, resource usage, and customization options.
+
+| Feature | Piper | Mycroft Mimic 3 | Coqui TTS |
+|---|---|---|---|
+| **License** | MIT | AGPL-3.0 | Mozilla Public License 2.0 |
+| **Technology** | VITS | VITS | Various (Tacotron, Glow-TTS, etc.) |
+| **Pre-trained Voices** | Yes | Yes | Yes |
+| **Voice Cloning** | No | No | Yes |
+| **Language Support** | Multi-lingual | Multi-lingual | Multi-lingual |
+| **Resource Usage** | Low (CPU) | Moderate (CPU) | High (GPU recommended) |
+| **Latency** | Low | Low | Moderate to High |
+| **Quality** | Good | Very Good | Excellent |
+| **Notes** | Fast, lightweight, good for resource-constrained devices. | High-quality voices, but more restrictive license. | Very high quality, but requires more resources. Development continues in the open-source community. |
+
+## 2. Evaluation Summary
+
+| **Engine** | **Pros** | **Cons** | **Recommendation** |
+|---|---|---|---|
+| **Piper** | - Very fast, low latency<br>- Lightweight, runs on CPU<br>- Good quality for its size<br>- Permissive license | - Quality not as high as larger models<br>- Fewer voice customization options | **Recommended for prototyping and initial development.** Its speed and low resource usage are ideal for quick iteration. |
+| **Mycroft Mimic 3** | - High-quality, natural-sounding voices<br>- Good performance on CPU | - AGPL-3.0 license may have implications for commercial use<br>- Less actively maintained than Coqui | A strong contender, but the license needs legal review. |
+| **Coqui TTS** | - State-of-the-art, excellent voice quality<br>- Voice cloning and extensive customization<br>- Active community and development | - High resource requirements (GPU often necessary)<br>- Higher latency<br>- Coqui, the company, has shut down; the open-source community continues development | **Recommended for production if high quality is paramount and resources allow.** Voice cloning is a powerful feature. |
+
+## 3. Voice Selection
+
+For the "family agent" persona, we need voices that are warm, friendly, and clear.
+
+**Initial Voice Candidates:**
+
+*   **From Piper:** `en_US-lessac-medium` (A clear, standard American English voice)
+*   **From Coqui TTS:** (Requires further investigation into available pre-trained models that fit the desired persona)
+
+## 4. Resource Requirements
+
+| Engine | CPU | RAM | Storage (Model Size) | GPU |
+|---|---|---|---|---|
+| **Piper** | ~1-2 cores | ~500MB | ~100-200MB per voice | Not required |
+| **Mimic 3** | ~2-4 cores | ~1GB | ~200-500MB per voice | Not required |
+| **Coqui TTS** | 4+ cores | 2GB+ | 500MB - 2GB+ per model | Recommended for acceptable performance |
+
+## 5. Decision & Next Steps
+
+**Decision:**
+
+For the initial phase of development, **Piper** is the recommended TTS engine. Its ease of use, low resource footprint (CPU only, roughly 500MB of RAM), and good-enough quality make it well suited to building and testing the core application.
+
+We will proceed with the following steps:
+1.  Integrate Piper as the default TTS engine.
+2.  Use the `en_US-lessac-medium` voice for the family agent.
+3.  Create a separate ticket to investigate integrating Coqui TTS as a "high-quality" option, pending resource availability and further voice evaluation.
+4.  Update `ARCHITECTURE.md` to reflect this decision.
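
Steps 1 and 2 above could be prototyped with a thin wrapper around the Piper command-line tool. The following is a minimal sketch, not part of this change: it assumes a `piper` executable is on `PATH` and that the `en_US-lessac-medium` voice has been downloaded as `en_US-lessac-medium.onnx` (with its JSON config next to it). The function names `build_piper_command` and `synthesize` are hypothetical and only illustrate the shape of the call.

```python
import subprocess
from pathlib import Path


def build_piper_command(model_path: str, output_path: str) -> list[str]:
    """Build the argv list for one Piper CLI call (text is supplied on stdin)."""
    # Assumption: the Piper binary is installed and reachable as "piper".
    return ["piper", "--model", model_path, "--output_file", output_path]


def synthesize(text: str, model_path: str, output_path: str) -> Path:
    """Run Piper once and return the path of the WAV file it was asked to write."""
    cmd = build_piper_command(model_path, output_path)
    # Piper reads the text to speak from standard input.
    subprocess.run(cmd, input=text.encode("utf-8"), check=True)
    return Path(output_path)
```

A call such as `synthesize("Dinner is ready.", "en_US-lessac-medium.onnx", "dinner.wav")` would then produce one WAV file per utterance. Spawning a process per request is simple but adds startup cost, so the TTS service in TICKET-014 may prefer a long-running process or Piper's Python bindings.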
--- a/tickets/TICKETS_SUMMARY.md
+++ b/tickets/TICKETS_SUMMARY.md
@@ -28,7 +28,7 @@ Tickets are organized in the `tickets/` directory by status:
 - TICKET-010: Implement Streaming Audio Capture → ASR Service
 - TICKET-011: Define ASR API Contract
 - TICKET-012: Benchmark ASR Latency and Quality
- TICKET-013: Evaluate TTS Options
+- TICKET-013: Evaluate TTS Options (done)
 - TICKET-014: Build TTS Service
 - TICKET-015: Voice Consistency and Volume Leveling
 - TICKET-016: Integrate TTS with Clients
@@ -79,7 +79,7 @@ Tickets are organized in the `tickets/` directory by status:
 - TICKET-004: High-Level Architecture Document
 - TICKET-005: Evaluate and Select Wake-Word Engine
 - TICKET-009: Select ASR Engine and Target Hardware
- TICKET-013: Evaluate TTS Options
+- TICKET-013: Evaluate TTS Options (done)
 - TICKET-017: Survey Candidate Open-Weight Models
 - TICKET-018: LLM Capacity Assessment
 - TICKET-019: Select Work Agent Model (4080)
--- a/tickets/backlog/TICKET-013_tts-evaluation.md
+++ b/tickets/backlog/TICKET-013_tts-evaluation.md