atlas/docs/TTS_EVALUATION.md
ilia f8ff2d3a55 feat(tts): Evaluate TTS options and select Piper
This commit completes the evaluation of Text-to-Speech (TTS) options
as described in TICKET-013.

- Creates a detailed  document comparing Piper,
  Mimic 3, and Coqui TTS.
- Recommends Piper for initial development due to its performance and
  low resource usage.
- Updates  to reflect the decision and points to the
  new evaluation document.
- Moves TICKET-013 to the 'done' column.
2026-01-05 20:33:53 -05:00

TTS Evaluation

This document outlines the evaluation of Text-to-Speech (TTS) options for the project, as detailed in TICKET-013.

1. Options Considered

The following TTS engines were evaluated based on latency, quality, resource usage, and customization options.

| Feature | Piper | Mycroft Mimic 3 | Coqui TTS |
| --- | --- | --- | --- |
| License | MIT | AGPL-3.0 | Mozilla Public License 2.0 |
| Technology | VITS | VITS | Various (Tacotron, Glow-TTS, etc.) |
| Pre-trained Voices | Yes | Yes | Yes |
| Voice Cloning | No | No | Yes |
| Language Support | Multi-lingual | Multi-lingual | Multi-lingual |
| Resource Usage | Low (CPU) | Moderate (CPU) | High (GPU recommended) |
| Latency | Low | Low | Moderate to High |
| Quality | Good | Very Good | Excellent |
| Notes | Fast, lightweight, good for resource-constrained devices. | High-quality voices, but more restrictive license. | Very high quality, but requires more resources. Actively developed. |
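The latency ratings above are qualitative, and this document does not record how they were measured. As one possible way to put numbers behind them, the sketch below times a synthesis call and reports the real-time factor (synthesis time divided by audio duration). The `synthesize` adapter, the `fake_engine` stand-in, and the 22050 Hz / 16-bit mono defaults are all assumptions made for illustration; a real run would wrap each engine in its own adapter.

```python
import statistics
import time
from typing import Callable


def measure_latency(synthesize: Callable[[str], bytes], text: str,
                    sample_rate: int = 22050, sample_width: int = 2,
                    runs: int = 5) -> dict:
    """Time `synthesize(text)` and report median latency and real-time factor.

    `synthesize` is a hypothetical adapter that returns raw mono PCM bytes.
    """
    timings = []
    audio = b""
    for _ in range(runs):
        start = time.perf_counter()
        audio = synthesize(text)
        timings.append(time.perf_counter() - start)

    latency = statistics.median(timings)
    audio_seconds = len(audio) / (sample_rate * sample_width)
    # RTF below 1 means audio is produced faster than it plays back.
    rtf = latency / audio_seconds if audio_seconds else float("inf")
    return {"latency_s": latency, "audio_s": audio_seconds, "rtf": rtf}


def fake_engine(text: str) -> bytes:
    """Stand-in engine: 0.05 s of 16-bit silence per character, no real synthesis."""
    return b"\x00\x00" * int(22050 * 0.05 * len(text))


if __name__ == "__main__":
    print(measure_latency(fake_engine, "Dinner is ready."))
```

Using the median over several runs keeps a single slow first call (model loading, cold caches) from dominating the result.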

2. Evaluation Summary

| Engine | Pros | Cons | Recommendation |
| --- | --- | --- | --- |
| Piper | - Very fast, low latency<br>- Lightweight, runs on CPU<br>- Good quality for its size<br>- Permissive license | - Quality not as high as larger models<br>- Fewer voice customization options | Recommended for prototyping and initial development. Its speed and low resource usage are ideal for quick iteration. |
| Mycroft Mimic 3 | - High-quality, natural-sounding voices<br>- Good performance on CPU | - AGPL-3.0 license may have implications for commercial use<br>- Less actively maintained than Coqui | A strong contender, but the license needs legal review. |
| Coqui TTS | - State-of-the-art, excellent voice quality<br>- Voice cloning and extensive customization<br>- Active community and development | - High resource requirements (GPU often necessary)<br>- Higher latency<br>- Coqui the company is now defunct, but the open source community continues work. | Recommended for production if high quality is paramount and resources allow. Voice cloning is a powerful feature. |

3. Voice Selection

For the "family agent" persona, we need voices that are warm, friendly, and clear.

Initial Voice Candidates:

  • From Piper: en_US-lessac-medium (a clear, standard American English voice)
  • From Coqui TTS: none yet; the available pre-trained models still need to be reviewed for fit with the desired persona

4. Resource Requirements

| Engine | CPU | RAM | Storage (Model Size) | GPU |
| --- | --- | --- | --- | --- |
| Piper | ~1-2 cores | ~500MB | ~100-200MB per voice | Not required |
| Mimic 3 | ~2-4 cores | ~1GB | ~200-500MB per voice | Not required |
| Coqui TTS | 4+ cores | 2GB+ | 500MB - 2GB+ per model | Recommended for acceptable performance |
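To make the table easier to apply to a concrete host, the sketch below encodes its approximate figures as data and filters the engines that would fit. It is an illustration only: the `Requirements` type and `engines_that_fit` helper are hypothetical names, the values are the lower bounds for CPU and RAM and the upper bound for model size from the table, and "GPU recommended" is treated as a hard requirement for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    min_cores: int
    min_ram_mb: int
    max_model_mb: int
    gpu_recommended: bool


# Approximate figures read off the resource table; not measured values.
REQUIREMENTS = {
    "piper": Requirements(min_cores=1, min_ram_mb=500, max_model_mb=200, gpu_recommended=False),
    "mimic3": Requirements(min_cores=2, min_ram_mb=1024, max_model_mb=500, gpu_recommended=False),
    "coqui": Requirements(min_cores=4, min_ram_mb=2048, max_model_mb=2048, gpu_recommended=True),
}


def engines_that_fit(cores: int, ram_mb: int, disk_mb: int, has_gpu: bool) -> list[str]:
    """Return the engines whose approximate requirements the host satisfies."""
    fits = []
    for name, req in REQUIREMENTS.items():
        if cores < req.min_cores or ram_mb < req.min_ram_mb or disk_mb < req.max_model_mb:
            continue
        # Simplification: a recommended GPU is treated as required.
        if req.gpu_recommended and not has_gpu:
            continue
        fits.append(name)
    return fits


if __name__ == "__main__":
    # Hypothetical small host: 4 cores, 4 GB RAM, 8 GB free disk, no GPU.
    print(engines_that_fit(cores=4, ram_mb=4096, disk_mb=8000, has_gpu=False))
```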

5. Decision & Next Steps

Decision:

For the initial phase of development, Piper is the recommended TTS engine. Its ease of use, low resource footprint, and good-enough quality make it well suited to building and testing the core application.

We will proceed with the following steps:

  1. Integrate Piper as the default TTS engine.
  2. Use the en_US-lessac-medium voice for the family agent.
  3. Create a separate ticket to investigate integrating Coqui TTS as a "high-quality" option, pending resource availability and further voice evaluation.
  4. Update ARCHITECTURE.md to reflect this decision.
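Steps 1 and 2 could start from a thin wrapper like the one below. It is a minimal sketch, assuming the standalone `piper` command-line binary is on the PATH and that the voice's `.onnx` model (with its `.onnx.json` config alongside) has already been downloaded into a local directory. The `piper_command` and `speak_to_file` helpers and the directory layout are hypothetical, not part of the project.

```python
import shutil
import subprocess
from pathlib import Path

DEFAULT_VOICE = "en_US-lessac-medium"


def piper_command(voice_dir: Path, output_wav: Path,
                  voice: str = DEFAULT_VOICE, executable: str = "piper") -> list[str]:
    """Build the argument list for a Piper CLI call that writes a WAV file."""
    model = voice_dir / f"{voice}.onnx"
    return [executable, "--model", str(model), "--output_file", str(output_wav)]


def speak_to_file(text: str, voice_dir: Path, output_wav: Path) -> Path:
    """Synthesize `text` to `output_wav` with the default voice."""
    if shutil.which("piper") is None:
        raise RuntimeError("piper executable not found on PATH")
    cmd = piper_command(voice_dir, output_wav)
    # The Piper CLI reads the text to synthesize from standard input.
    subprocess.run(cmd, input=text.encode("utf-8"), check=True)
    return output_wav


if __name__ == "__main__":
    # Assumed layout: voices stored under ./voices (hypothetical path).
    speak_to_file("Dinner is ready.", Path("voices"), Path("out.wav"))
```

Keeping command construction separate from execution makes the voice and engine choice easy to test without audio hardware, and leaves room to add a second backend later if the Coqui TTS investigation in step 3 goes ahead.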