ASR API Contract

API specification for the Automatic Speech Recognition (ASR) service.

Overview

The ASR service converts audio input to text. It supports streaming audio for real-time transcription.

Base URL

http://localhost:8001/api/asr

(Configurable port and host)

Endpoints

1. Health Check

GET /health

Response:

{
  "status": "healthy",
  "model": "faster-whisper",
  "model_size": "base",
  "language": "en"
}

2. Transcribe Audio (File Upload)

POST /transcribe
Content-Type: multipart/form-data

Request:

  • audio: Audio file (WAV, MP3, FLAC, etc.)
  • language (optional): Language code (default: "en")
  • format (optional): Response format ("text" or "json", default: "text")

Response (text format):

This is the transcribed text.

Response (json format):

{
  "text": "This is the transcribed text.",
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "text": "This is the transcribed text."
    }
  ],
  "language": "en",
  "duration": 2.5
}
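
The segment timings in the JSON response are enough to derive subtitles on the client side. As a sketch, `segments_to_srt` below is a hypothetical helper (not part of the service) that renders the `segments` array as SRT entries:

```python
def _fmt_ts(t: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    h, rem = divmod(t, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02d}:{int(m):02d}:{s:06.3f}".replace(".", ",")

def segments_to_srt(segments: list[dict]) -> str:
    """Render the `segments` array from a /transcribe JSON response as SRT."""
    entries = []
    for i, seg in enumerate(segments, start=1):
        entries.append(
            f"{i}\n{_fmt_ts(seg['start'])} --> {_fmt_ts(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(entries)
```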

3. Streaming Transcription (WebSocket)

WS /stream

Client → Server:

  • Send audio chunks (binary)
  • Send {"action": "end"} to finish

Server → Client:

{
  "type": "partial",
  "text": "Partial transcription..."
}
{
  "type": "final",
  "text": "Final transcription.",
  "segments": [...]
}
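
Binary chunks can be any size, but sending fixed-duration slices keeps partial results arriving at a steady rate. A minimal client-side chunker, assuming 16 kHz mono 16-bit PCM (the recommended format below):

```python
def chunk_audio(pcm: bytes, chunk_ms: int = 100,
                sample_rate: int = 16000, sample_width: int = 2) -> list[bytes]:
    """Split raw mono PCM into fixed-duration chunks suitable for ws.send()."""
    chunk_bytes = sample_rate * sample_width * chunk_ms // 1000
    return [pcm[i:i + chunk_bytes] for i in range(0, len(pcm), chunk_bytes)]
```

Each chunk is sent as a binary WebSocket frame, followed by the `{"action": "end"}` text frame once the audio is exhausted.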

4. Get Supported Languages

GET /languages

Response:

{
  "languages": [
    {"code": "en", "name": "English"},
    {"code": "es", "name": "Spanish"},
    ...
  ]
}

Error Responses

{
  "error": "Error message",
  "code": "ERROR_CODE"
}

Error Codes:

  • INVALID_AUDIO: Audio file is invalid or unsupported
  • TRANSCRIPTION_FAILED: Transcription process failed
  • LANGUAGE_NOT_SUPPORTED: Requested language not supported
  • SERVICE_UNAVAILABLE: ASR service is unavailable
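
A client can distinguish error payloads from successful ones by checking for the documented shape. `check_response` and `ASRError` below are hypothetical client-side helpers, not part of the service:

```python
class ASRError(Exception):
    """Raised when the service returns the documented error payload."""
    def __init__(self, code: str, message: str):
        super().__init__(f"{code}: {message}")
        self.code = code

def check_response(payload: dict) -> dict:
    """Pass a successful payload through; raise ASRError for an error payload."""
    if "error" in payload and "code" in payload:
        raise ASRError(payload["code"], payload["error"])
    return payload
```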

Rate Limiting

  • File upload: 10 requests/minute
  • Streaming: 1 concurrent stream per client

Audio Format Requirements

  • Format: WAV, MP3, FLAC, OGG
  • Sample Rate: 16kHz recommended (auto-resampled)
  • Channels: Mono or stereo (converted to mono)
  • Bit Depth: 16-bit recommended
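
Although the service converts stereo input to mono itself, downmixing on the client halves the upload size. A minimal sketch for interleaved 16-bit stereo PCM, averaging the two channels:

```python
import struct

def stereo_to_mono(pcm: bytes, sample_width: int = 2) -> bytes:
    """Downmix interleaved 16-bit stereo PCM to mono by averaging channels."""
    assert sample_width == 2, "sketch only handles 16-bit samples"
    n_frames = len(pcm) // 4  # 2 channels x 2 bytes per frame
    samples = struct.unpack(f"<{n_frames * 2}h", pcm[: n_frames * 4])
    mono = [(samples[2 * i] + samples[2 * i + 1]) // 2 for i in range(n_frames)]
    return struct.pack(f"<{n_frames}h", *mono)
```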

Performance

  • Latency: < 500ms for short utterances (< 5s)
  • Accuracy: < 5% word error rate (WER) for clear speech
  • Model: faster-whisper (base or small)

Integration

With Wake-Word Service

  1. Wake-word detects activation
  2. Sends "start" signal to ASR
  3. ASR begins streaming transcription
  4. Wake-word sends "stop" signal
  5. ASR returns final transcription
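
The hand-off above can be sketched as a small session object on the ASR side. The class and method names here are illustrative, not the actual implementation:

```python
class AsrSession:
    """Illustrative sketch of the wake-word -> ASR hand-off."""
    def __init__(self):
        self.active = False
        self.chunks: list[bytes] = []

    def on_wake(self):
        """Steps 1-2: wake-word fires and sends the "start" signal."""
        self.active = True
        self.chunks.clear()

    def on_audio(self, chunk: bytes):
        """Step 3: buffer streamed audio while the session is active."""
        if self.active:
            self.chunks.append(chunk)

    def on_stop(self) -> bytes:
        """Steps 4-5: "stop" signal received; hand audio to transcription."""
        self.active = False
        return b"".join(self.chunks)
```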

With LLM

  1. ASR returns transcribed text
  2. Text sent to LLM for processing
  3. LLM response sent to TTS
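
The three steps compose into a single pipeline function. A sketch, where `llm` and `tts` stand in for whatever callables wrap those services:

```python
def voice_pipeline(transcript: str, llm, tts):
    """Chain the hand-offs: ASR text -> LLM reply -> TTS audio."""
    reply = llm(transcript)
    return tts(reply)
```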

Example Usage

Python Client

import requests

# Transcribe file
with open("audio.wav", "rb") as f:
    response = requests.post(
        "http://localhost:8001/api/asr/transcribe",
        files={"audio": f},
        data={"language": "en", "format": "json"}
    )
    result = response.json()
    print(result["text"])

JavaScript Client

// Streaming transcription
const ws = new WebSocket("ws://localhost:8001/api/asr/stream");

ws.onopen = () => {
    // audioChunks: placeholder for your source of binary audio data
    // (e.g. ArrayBuffers captured from the microphone)
    for (const chunk of audioChunks) {
        ws.send(chunk);
    }
    // Signal the end of the stream to receive the final transcription
    ws.send(JSON.stringify({ action: "end" }));
};

ws.onmessage = (event) => {
    const data = JSON.parse(event.data);
    if (data.type === "partial") {
        console.log("Partial:", data.text);
    } else if (data.type === "final") {
        console.log("Transcription:", data.text);
    }
};

Future Enhancements

  • Speaker diarization
  • Punctuation and capitalization
  • Custom vocabulary
  • Confidence scores per word
  • Multiple language detection