ASR API Contract

API specification for the Automatic Speech Recognition (ASR) service.

Overview

The ASR service converts audio input to text. It supports streaming audio for real-time transcription.

Base URL

http://localhost:8001/api/asr

(Configurable port and host)

Endpoints

1. Health Check

GET /health

Response:

{
  "status": "healthy",
  "model": "faster-whisper",
  "model_size": "base",
  "language": "en"
}

2. Transcribe Audio (File Upload)

POST /transcribe
Content-Type: multipart/form-data

Request:

  • audio: Audio file (WAV, MP3, FLAC, etc.)
  • language (optional): Language code (default: "en")
  • format (optional): Response format ("text" or "json", default: "text")

Response (text format):

This is the transcribed text.

Response (json format):

{
  "text": "This is the transcribed text.",
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "text": "This is the transcribed text."
    }
  ],
  "language": "en",
  "duration": 2.5
}
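
The segment timings in the JSON response are enough to derive subtitles on the client side. As a sketch, `segments_to_srt` below is a hypothetical helper (not part of the service) that renders the `segments` array as SRT entries:

```python
def _fmt_ts(t: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    h, rem = divmod(t, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02d}:{int(m):02d}:{s:06.3f}".replace(".", ",")

def segments_to_srt(segments: list[dict]) -> str:
    """Render the `segments` array from a /transcribe JSON response as SRT."""
    entries = []
    for i, seg in enumerate(segments, start=1):
        entries.append(
            f"{i}\n{_fmt_ts(seg['start'])} --> {_fmt_ts(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(entries)
```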

3. Streaming Transcription (WebSocket)

WS /stream

Client → Server:

  • Send audio chunks (binary)
  • Send {"action": "end"} to finish

Server → Client:

{
  "type": "partial",
  "text": "Partial transcription..."
}
{
  "type": "final",
  "text": "Final transcription.",
  "segments": [...]
}
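
Binary chunks can be any size, but sending fixed-duration slices keeps partial results arriving at a steady rate. A minimal client-side chunker, assuming 16 kHz mono 16-bit PCM (the recommended format below):

```python
def chunk_audio(pcm: bytes, chunk_ms: int = 100,
                sample_rate: int = 16000, sample_width: int = 2) -> list[bytes]:
    """Split raw mono PCM into fixed-duration chunks suitable for ws.send()."""
    chunk_bytes = sample_rate * sample_width * chunk_ms // 1000
    return [pcm[i:i + chunk_bytes] for i in range(0, len(pcm), chunk_bytes)]
```

Each chunk is sent as a binary WebSocket frame, followed by the `{"action": "end"}` text frame once the audio is exhausted.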

4. Get Supported Languages

GET /languages

Response:

{
  "languages": [
    {"code": "en", "name": "English"},
    {"code": "es", "name": "Spanish"},
    ...
  ]
}

Error Responses

{
  "error": "Error message",
  "code": "ERROR_CODE"
}

Error Codes:

  • INVALID_AUDIO: Audio file is invalid or unsupported
  • TRANSCRIPTION_FAILED: Transcription process failed
  • LANGUAGE_NOT_SUPPORTED: Requested language not supported
  • SERVICE_UNAVAILABLE: ASR service is unavailable
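
A client can distinguish error payloads from successful ones by checking for the documented shape. `check_response` and `ASRError` below are hypothetical client-side helpers, not part of the service:

```python
class ASRError(Exception):
    """Raised when the service returns the documented error payload."""
    def __init__(self, code: str, message: str):
        super().__init__(f"{code}: {message}")
        self.code = code

def check_response(payload: dict) -> dict:
    """Pass a successful payload through; raise ASRError for an error payload."""
    if "error" in payload and "code" in payload:
        raise ASRError(payload["code"], payload["error"])
    return payload
```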

Rate Limiting

  • File upload: 10 requests/minute
  • Streaming: 1 concurrent stream per client

Audio Format Requirements

  • Format: WAV, MP3, FLAC, OGG
  • Sample Rate: 16kHz recommended (auto-resampled)
  • Channels: Mono or stereo (converted to mono)
  • Bit Depth: 16-bit recommended
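
Although the service converts stereo input to mono itself, downmixing on the client halves the upload size. A minimal sketch for interleaved 16-bit stereo PCM, averaging the two channels:

```python
import struct

def stereo_to_mono(pcm: bytes, sample_width: int = 2) -> bytes:
    """Downmix interleaved 16-bit stereo PCM to mono by averaging channels."""
    assert sample_width == 2, "sketch only handles 16-bit samples"
    n_frames = len(pcm) // 4  # 2 channels x 2 bytes per frame
    samples = struct.unpack(f"<{n_frames * 2}h", pcm[: n_frames * 4])
    mono = [(samples[2 * i] + samples[2 * i + 1]) // 2 for i in range(n_frames)]
    return struct.pack(f"<{n_frames}h", *mono)
```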

Performance

  • Latency: < 500ms for short utterances (< 5s)
  • Accuracy: < 5% word error rate (WER) for clear speech
  • Model: faster-whisper (base or small)

Integration

With Wake-Word Service

  1. Wake-word detects activation
  2. Sends "start" signal to ASR
  3. ASR begins streaming transcription
  4. Wake-word sends "stop" signal
  5. ASR returns final transcription
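
The hand-off above can be sketched as a small session object on the ASR side. The class and method names here are illustrative, not the actual implementation:

```python
class AsrSession:
    """Illustrative sketch of the wake-word -> ASR hand-off."""
    def __init__(self):
        self.active = False
        self.chunks: list[bytes] = []

    def on_wake(self):
        """Steps 1-2: wake-word fires and sends the "start" signal."""
        self.active = True
        self.chunks.clear()

    def on_audio(self, chunk: bytes):
        """Step 3: buffer streamed audio while the session is active."""
        if self.active:
            self.chunks.append(chunk)

    def on_stop(self) -> bytes:
        """Steps 4-5: "stop" signal received; hand audio to transcription."""
        self.active = False
        return b"".join(self.chunks)
```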

With LLM

  1. ASR returns transcribed text
  2. Text sent to LLM for processing
  3. LLM response sent to TTS
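
The three steps compose into a single pipeline function. A sketch, where `llm` and `tts` stand in for whatever callables wrap those services:

```python
def voice_pipeline(transcript: str, llm, tts):
    """Chain the hand-offs: ASR text -> LLM reply -> TTS audio."""
    reply = llm(transcript)
    return tts(reply)
```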

Example Usage

Python Client

import requests

# Transcribe file
with open("audio.wav", "rb") as f:
    response = requests.post(
        "http://localhost:8001/api/asr/transcribe",
        files={"audio": f},
        data={"language": "en", "format": "json"}
    )
    result = response.json()
    print(result["text"])

JavaScript Client

// Streaming transcription
const ws = new WebSocket("ws://localhost:8001/api/asr/stream");

ws.onopen = () => {
    // audioChunks: placeholder for your source of binary audio data
    // (e.g. ArrayBuffers captured from the microphone)
    for (const chunk of audioChunks) {
        ws.send(chunk);
    }
    // Signal the end of the stream to receive the final transcription
    ws.send(JSON.stringify({ action: "end" }));
};

ws.onmessage = (event) => {
    const data = JSON.parse(event.data);
    if (data.type === "partial") {
        console.log("Partial:", data.text);
    } else if (data.type === "final") {
        console.log("Transcription:", data.text);
    }
};

Future Enhancements

  • Speaker diarization
  • Punctuation and capitalization
  • Custom vocabulary
  • Confidence scores per word
  • Multiple language detection