ilia/atlas

ilia f8ff2d3a55 feat(tts): Evaluate TTS options and select Piper

This commit completes the evaluation of Text-to-Speech (TTS) options
as described in TICKET-013.

- Creates a detailed  document comparing Piper,
  Mimic 3, and Coqui TTS.
- Recommends Piper for initial development due to its performance and
  low resource usage.
- Updates  to reflect the decision and points to the
  new evaluation document.
- Moves TICKET-013 to the 'done' column.

2026-01-05 20:33:53 -05:00

3.4 KiB


TTS Evaluation

This document outlines the evaluation of Text-to-Speech (TTS) options for the project, as detailed in TICKET-013.

1. Options Considered

The following TTS engines were evaluated based on latency, quality, resource usage, and customization options.

| Feature | Piper | Mycroft Mimic 3 | Coqui TTS |
| --- | --- | --- | --- |
| License | MIT | AGPL-3.0 | Mozilla Public License 2.0 |
| Technology | VITS | VITS | Various (Tacotron, Glow-TTS, etc.) |
| Pre-trained Voices | Yes | Yes | Yes |
| Voice Cloning | No | No | Yes |
| Language Support | Multi-lingual | Multi-lingual | Multi-lingual |
| Resource Usage | Low (CPU) | Moderate (CPU) | High (GPU recommended) |
| Latency | Low | Low | Moderate to High |
| Quality | Good | Very Good | Excellent |
| Notes | Fast, lightweight, good for resource-constrained devices. | High-quality voices, but more restrictive license. | Very high quality, but requires more resources. Developed by the open-source community since the company shut down. |
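The latency row above is qualitative. One way to put a number on it is the real-time factor (synthesis time divided by the duration of the audio produced). The sketch below is illustrative only: `synthesize` is a hypothetical adapter that takes text and returns WAV bytes, and `fake_engine` is a stand-in so the example runs without any TTS engine installed. It does not describe how the ratings in this document were obtained.

```python
import io
import time
import wave
from typing import Callable


def wav_duration_seconds(wav_bytes: bytes) -> float:
    """Return the playback duration of an in-memory WAV file."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def real_time_factor(synthesize: Callable[[str], bytes], text: str) -> float:
    """Time one synthesis call and divide by the duration of the audio.

    `synthesize` is a hypothetical adapter (text in, WAV bytes out).
    Values below 1.0 mean the engine is faster than real time.
    """
    start = time.perf_counter()
    wav_bytes = synthesize(text)
    elapsed = time.perf_counter() - start
    return elapsed / wav_duration_seconds(wav_bytes)


def fake_engine(text: str) -> bytes:
    """Stand-in engine: returns one second of 16 kHz mono silence."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(16000)
        wav.writeframes(b"\x00\x00" * 16000)
    return buffer.getvalue()


if __name__ == "__main__":
    print(f"RTF: {real_time_factor(fake_engine, 'Hello, family!'):.4f}")
```

Swapping `fake_engine` for a real adapter per engine, and averaging over several sentences, would give comparable figures across Piper, Mimic 3, and Coqui TTS.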

2. Evaluation Summary

| Engine | Pros | Cons | Recommendation |
| --- | --- | --- | --- |
| Piper | Very fast, low latency; lightweight, runs on CPU; good quality for its size; permissive license | Quality not as high as larger models; fewer voice customization options | Recommended for prototyping and initial development. Its speed and low resource usage are ideal for quick iteration. |
| Mycroft Mimic 3 | High-quality, natural-sounding voices; good performance on CPU | AGPL-3.0 license may have implications for commercial use; less actively maintained than Coqui | A strong contender, but the license needs legal review. |
| Coqui TTS | State-of-the-art, excellent voice quality; voice cloning and extensive customization; active community and development | High resource requirements (GPU often necessary); higher latency; Coqui the company has shut down, though the open-source community continues development | Recommended for production if high quality is paramount and resources allow. Voice cloning is a powerful feature. |

3. Voice Selection

For the "family agent" persona, we need voices that are warm, friendly, and clear.

Initial Voice Candidates:

- From Piper: en_US-lessac-medium (a clear, standard American English voice)
- From Coqui TTS: requires further investigation into available pre-trained models that fit the desired persona
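To audition the Piper candidate, a minimal sketch along the following lines could drive the Piper command-line tool from Python. It assumes a `piper` executable is on the PATH and that the voice model has already been downloaded as `en_US-lessac-medium.onnx` (with its `.onnx.json` config next to it); the helper names and file paths are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def build_piper_command(model_path: Path, output_path: Path) -> List[str]:
    """Build the argument list for a Piper CLI call that writes a WAV file."""
    return [
        "piper",  # assumption: the Piper executable is on the PATH
        "--model", str(model_path),
        "--output_file", str(output_path),
    ]


def synthesize_to_file(text: str, model_path: Path, output_path: Path) -> None:
    """Send `text` to Piper on stdin; Piper writes the audio to `output_path`."""
    subprocess.run(
        build_piper_command(model_path, output_path),
        input=text.encode("utf-8"),
        check=True,  # raise CalledProcessError if Piper exits with an error
    )


if __name__ == "__main__":
    # Hypothetical local paths; adjust to wherever the model was downloaded.
    synthesize_to_file(
        "Good morning! Breakfast is ready.",
        Path("en_US-lessac-medium.onnx"),
        Path("sample.wav"),
    )
```

Listening to a handful of such samples is a quick way to judge whether the voice sounds warm and friendly enough for the persona.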

4. Resource Requirements

| Engine | CPU | RAM | Storage (Model Size) | GPU |
| --- | --- | --- | --- | --- |
| Piper | ~1-2 cores | ~500 MB | ~100-200 MB per voice | Not required |
| Mimic 3 | ~2-4 cores | ~1 GB | ~200-500 MB per voice | Not required |
| Coqui TTS | 4+ cores | 2 GB+ | 500 MB - 2 GB+ per model | Recommended for acceptable performance |
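As a rough illustration, the lower bounds from the table above can be encoded and checked against a target device. This is a sketch, not a sizing tool: the names are hypothetical, the numbers are the approximate minimums listed above, and treating "GPU recommended" as a hard requirement is an assumption made here for simplicity.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EngineRequirements:
    name: str
    min_cores: int
    min_ram_mb: int
    gpu_recommended: bool


# Approximate lower bounds taken from the resource table.
REQUIREMENTS = [
    EngineRequirements("Piper", min_cores=1, min_ram_mb=500, gpu_recommended=False),
    EngineRequirements("Mimic 3", min_cores=2, min_ram_mb=1000, gpu_recommended=False),
    EngineRequirements("Coqui TTS", min_cores=4, min_ram_mb=2000, gpu_recommended=True),
]


def engines_that_fit(cores: int, ram_mb: int, has_gpu: bool = False) -> List[str]:
    """Return the engines whose minimums the device meets.

    Assumption: an engine marked "GPU recommended" is excluded when no GPU
    is present, even though it could technically run on CPU alone.
    """
    return [
        req.name
        for req in REQUIREMENTS
        if cores >= req.min_cores
        and ram_mb >= req.min_ram_mb
        and (has_gpu or not req.gpu_recommended)
    ]


if __name__ == "__main__":
    print(engines_that_fit(cores=2, ram_mb=1024))
```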

5. Decision & Next Steps

Decision:

For the initial phase of development, Piper is the recommended TTS engine. Its ease of use, low resource footprint, and adequate voice quality make it well suited to building and testing the core application.

We will proceed with the following steps:

1. Integrate Piper as the default TTS engine.
2. Use the en_US-lessac-medium voice for the family agent.
3. Create a separate ticket to investigate integrating Coqui TTS as a "high-quality" option, pending resource availability and further voice evaluation.
4. Update ARCHITECTURE.md to reflect this decision.
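One possible shape for the integration, keeping Piper as the default while leaving room for a later "high-quality" engine, is a small registry that falls back to the default when the requested engine is unavailable. Everything here (`TTSRegistry`, the engine names, the stub engines in the usage example) is hypothetical and not taken from the project's code.

```python
from typing import Callable, Dict, Optional

# An engine is modelled as a callable: text in, audio bytes out.
Synthesizer = Callable[[str], bytes]


class TTSRegistry:
    """Maps engine names to synthesizers and falls back to a default."""

    def __init__(self, default: str) -> None:
        self._default = default
        self._engines: Dict[str, Synthesizer] = {}

    def register(self, name: str, engine: Synthesizer) -> None:
        self._engines[name] = engine

    def get(self, name: Optional[str] = None) -> Synthesizer:
        """Return the named engine, or the default if it is not registered."""
        if name is not None and name in self._engines:
            return self._engines[name]
        return self._engines[self._default]


if __name__ == "__main__":
    registry = TTSRegistry(default="piper")
    # Stub engine for demonstration; a real adapter would call Piper.
    registry.register("piper", lambda text: b"piper:" + text.encode("utf-8"))
    # "coqui" is not registered yet, so the request falls back to Piper.
    print(registry.get("coqui")("hello"))
```

With this layout, adding Coqui TTS later would mean registering one more adapter, with no change to calling code.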
