From f8ff2d3a55d1288a96d7b1824885011e2678e360 Mon Sep 17 00:00:00 2001
From: ilia
Date: Mon, 5 Jan 2026 20:33:53 -0500
Subject: [PATCH 1/2] feat(tts): Evaluate TTS options and select Piper

This commit completes the evaluation of Text-to-Speech (TTS) options as described in TICKET-013.

- Creates a detailed document comparing Piper, Mimic 3, and Coqui TTS.
- Recommends Piper for initial development due to its performance and low resource usage.
- Updates the architecture document to reflect the decision and point to the new evaluation document.
- Moves TICKET-013 to the 'done' column.
---
 ARCHITECTURE.md                               |  8 ++-
 docs/TTS_EVALUATION.md                        | 56 +++++++++++++++++++
 .../TICKET-013_tts-evaluation.md              |  0
 3 files changed, 63 insertions(+), 1 deletion(-)
 create mode 100644 docs/TTS_EVALUATION.md
 rename tickets/{backlog => done}/TICKET-013_tts-evaluation.md (100%)

diff --git a/ARCHITECTURE.md b/ARCHITECTURE.md
index b33cf96..0c9ac51 100644
--- a/ARCHITECTURE.md
+++ b/ARCHITECTURE.md
@@ -78,12 +78,18 @@ The system consists of 5 parallel tracks:
 - **Languages**: Python (backend services), TypeScript/JavaScript (clients)
 - **LLM Servers**: Ollama, vLLM, or llama.cpp
 - **ASR**: faster-whisper or Whisper.cpp
-- **TTS**: Piper, Mimic 3, or Coqui TTS
+- **TTS**: Piper (selected for initial development), Coqui TTS (future high-quality option)
 - **Wake-Word**: openWakeWord or Porcupine
 - **Protocols**: MCP (Model Context Protocol), WebSocket, HTTP/gRPC
 - **Storage**: SQLite (memory, sessions), Markdown files (tasks, notes)
 - **Infrastructure**: Docker, systemd, Linux
 
+### TTS Selection
+
+For initial development, **Piper** has been selected as the primary Text-to-Speech (TTS) engine. This decision is based on its high performance, low resource requirements, and permissive license, which are ideal for prototyping and early-stage implementation. **Coqui TTS** is identified as a potential future upgrade for higher voice quality once more resources can be allocated.
+
+For a detailed comparison of all evaluated options, see the [TTS Evaluation document](docs/TTS_EVALUATION.md).
+
 ## Design Patterns
 
 ### Core Patterns
diff --git a/docs/TTS_EVALUATION.md b/docs/TTS_EVALUATION.md
new file mode 100644
index 0000000..1649e30
--- /dev/null
+++ b/docs/TTS_EVALUATION.md
@@ -0,0 +1,56 @@
+# TTS Evaluation
+
+This document outlines the evaluation of Text-to-Speech (TTS) options for the project, as detailed in [TICKET-013](../tickets/done/TICKET-013_tts-evaluation.md).
+
+## 1. Options Considered
+
+The following TTS engines were evaluated based on latency, quality, resource usage, and customization options.
+
+| Feature | Piper | Mycroft Mimic 3 | Coqui TTS |
+|---|---|---|---|
+| **License** | MIT | AGPL-3.0 | Mozilla Public License 2.0 |
+| **Technology** | VITS | VITS | Various (Tacotron, Glow-TTS, etc.) |
+| **Pre-trained Voices** | Yes | Yes | Yes |
+| **Voice Cloning** | No | No | Yes |
+| **Language Support** | Multi-lingual | Multi-lingual | Multi-lingual |
+| **Resource Usage** | Low (CPU) | Moderate (CPU) | High (GPU recommended) |
+| **Latency** | Low | Low | Moderate to High |
+| **Quality** | Good | Very Good | Excellent |
+| **Notes** | Fast, lightweight, good for resource-constrained devices. | High-quality voices, but more restrictive license. | Very high quality, but requires more resources. Community-maintained since Coqui (the company) shut down. |
+
+## 2. Evaluation Summary
+
+| **Engine** | **Pros** | **Cons** | **Recommendation** |
+|---|---|---|---|
+| **Piper** | - Very fast, low latency<br>- Lightweight, runs on CPU<br>- Good quality for its size<br>- Permissive license | - Quality not as high as larger models<br>- Fewer voice customization options | **Recommended for prototyping and initial development.** Its speed and low resource usage are ideal for quick iteration. |
+| **Mycroft Mimic 3** | - High-quality, natural-sounding voices<br>- Good performance on CPU | - AGPL-3.0 license may have implications for commercial use<br>- Less actively maintained than Coqui | A strong contender, but the license needs legal review. |
+| **Coqui TTS** | - State-of-the-art, excellent voice quality<br>- Voice cloning and extensive customization<br>- Active community and development | - High resource requirements (GPU often necessary)<br>- Higher latency<br>- Coqui (the company) is now defunct, though the open-source community continues development | **Recommended for production if high quality is paramount and resources allow.** Voice cloning is a powerful feature. |
+
+## 3. Voice Selection
+
+For the "family agent" persona, we need voices that are warm, friendly, and clear.
+
+**Initial Voice Candidates:**
+
+* **From Piper:** `en_US-lessac-medium` (A clear, standard American English voice)
+* **From Coqui TTS:** (Requires further investigation into available pre-trained models that fit the desired persona)
+
+## 4. Resource Requirements
+
+| Engine | CPU | RAM | Storage (Model Size) | GPU |
+|---|---|---|---|---|
+| **Piper** | ~1-2 cores | ~500MB | ~100-200MB per voice | Not required |
+| **Mimic 3** | ~2-4 cores | ~1GB | ~200-500MB per voice | Not required |
+| **Coqui TTS** | 4+ cores | 2GB+ | 500MB - 2GB+ per model | Recommended for acceptable performance |
+
+## 5. Decision & Next Steps
+
+**Decision:**
+
+For the initial phase of development, **Piper** is the recommended TTS engine. Its ease of use, low resource footprint, and good-enough quality make it well suited to building and testing the core application.
+
+We will proceed with the following steps:
+1. Integrate Piper as the default TTS engine.
+2. Use the `en_US-lessac-medium` voice for the family agent.
+3. Create a separate ticket to investigate integrating Coqui TTS as a "high-quality" option, pending resource availability and further voice evaluation.
+4. Update `ARCHITECTURE.md` to reflect this decision.
diff --git a/tickets/backlog/TICKET-013_tts-evaluation.md b/tickets/done/TICKET-013_tts-evaluation.md
similarity index 100%
rename from tickets/backlog/TICKET-013_tts-evaluation.md
rename to tickets/done/TICKET-013_tts-evaluation.md

From 53771e13cf211a3caed786fc0d010e99fb9dd299 Mon Sep 17 00:00:00 2001
From: ilia
Date: Mon, 5 Jan 2026 20:34:05 -0500
Subject: [PATCH 2/2] docs(tickets): Mark TICKET-013 as done in summary

---
 tickets/TICKETS_SUMMARY.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tickets/TICKETS_SUMMARY.md b/tickets/TICKETS_SUMMARY.md
index 7263267..2b271db 100644
--- a/tickets/TICKETS_SUMMARY.md
+++ b/tickets/TICKETS_SUMMARY.md
@@ -28,7 +28,7 @@ Tickets are organized in the `tickets/` directory by status:
 - TICKET-010: Implement Streaming Audio Capture → ASR Service
 - TICKET-011: Define ASR API Contract
 - TICKET-012: Benchmark ASR Latency and Quality
-- TICKET-013: Evaluate TTS Options
+- TICKET-013: Evaluate TTS Options (done)
 - TICKET-014: Build TTS Service
 - TICKET-015: Voice Consistency and Volume Leveling
 - TICKET-016: Integrate TTS with Clients
@@ -79,7 +79,7 @@ Tickets are organized in the `tickets/` directory by status:
 - TICKET-004: High-Level Architecture Document
 - TICKET-005: Evaluate and Select Wake-Word Engine
 - TICKET-009: Select ASR Engine and Target Hardware
-- TICKET-013: Evaluate TTS Options
+- TICKET-013: Evaluate TTS Options (done)
 - TICKET-017: Survey Candidate Open-Weight Models
 - TICKET-018: LLM Capacity Assessment
 - TICKET-019: Select Work Agent Model (4080)