
# ASR API Contract
API specification for the Automatic Speech Recognition (ASR) service.
## Overview
The ASR service converts audio input to text. It supports streaming audio for real-time transcription.
## Base URL
```
http://localhost:8001/api/asr
```
(Configurable port and host)
## Endpoints
### 1. Health Check
```
GET /health
```
**Response:**
```json
{
  "status": "healthy",
  "model": "faster-whisper",
  "model_size": "base",
  "language": "en"
}
```
### 2. Transcribe Audio (File Upload)
```
POST /transcribe
Content-Type: multipart/form-data
```
**Request:**
- `audio`: Audio file (WAV, MP3, FLAC, etc.)
- `language` (optional): Language code (default: "en")
- `format` (optional): Response format ("text" or "json", default: "text")
**Response (text format):**
```
This is the transcribed text.
```
**Response (json format):**
```json
{
  "text": "This is the transcribed text.",
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "text": "This is the transcribed text."
    }
  ],
  "language": "en",
  "duration": 2.5
}
```
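For illustration, a json-format response can be flattened back to plain text from its segments. This is a minimal client-side sketch; `join_segments` is a hypothetical helper, not part of the service:

```python
def join_segments(response: dict) -> str:
    """Concatenate segment texts, falling back to the top-level "text" field."""
    segments = response.get("segments") or []
    if segments:
        return " ".join(s["text"].strip() for s in segments)
    return response.get("text", "")

# Sample response matching the schema above
sample = {
    "text": "This is the transcribed text.",
    "segments": [
        {"start": 0.0, "end": 2.5, "text": "This is the transcribed text."}
    ],
    "language": "en",
    "duration": 2.5,
}
print(join_segments(sample))  # -> This is the transcribed text.
```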
### 3. Streaming Transcription (WebSocket)
```
WS /stream
```
**Client → Server:**
- Send audio chunks (binary)
- Send `{"action": "end"}` to finish
**Server → Client:**
```json
{
  "type": "partial",
  "text": "Partial transcription..."
}
```
```json
{
  "type": "final",
  "text": "Final transcription.",
  "segments": [...]
}
```
### 4. Get Supported Languages
```
GET /languages
```
**Response:**
```json
{
  "languages": [
    {"code": "en", "name": "English"},
    {"code": "es", "name": "Spanish"},
    ...
  ]
}
```
## Error Responses
```json
{
  "error": "Error message",
  "code": "ERROR_CODE"
}
```
**Error Codes:**
- `INVALID_AUDIO`: Audio file is invalid or unsupported
- `TRANSCRIPTION_FAILED`: Transcription process failed
- `LANGUAGE_NOT_SUPPORTED`: Requested language not supported
- `SERVICE_UNAVAILABLE`: ASR service is unavailable
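A client can use the code to decide whether a failed request is worth retrying. The sketch below is one possible policy, not part of this contract; which codes count as retryable is an assumption:

```python
# Transient failures worth retrying (assumed policy, not specified by the contract)
RETRYABLE = {"SERVICE_UNAVAILABLE", "TRANSCRIPTION_FAILED"}


def should_retry(error_response: dict) -> bool:
    """Return True if the error code suggests a retry could succeed."""
    return error_response.get("code") in RETRYABLE


print(should_retry({"error": "ASR service is unavailable", "code": "SERVICE_UNAVAILABLE"}))  # -> True
print(should_retry({"error": "Audio file is invalid", "code": "INVALID_AUDIO"}))  # -> False
```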
## Rate Limiting
- **File upload**: 10 requests/minute
- **Streaming**: 1 concurrent stream per client
## Audio Format Requirements
- **Format**: WAV, MP3, FLAC, OGG
- **Sample Rate**: 16kHz recommended (auto-resampled)
- **Channels**: Mono or stereo (converted to mono)
- **Bit Depth**: 16-bit recommended
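Since stereo input is converted to mono anyway, a client can downmix before uploading to halve the payload. A minimal stdlib sketch for interleaved 16-bit PCM, averaging each left/right pair:

```python
import array


def stereo_to_mono(pcm: bytes) -> bytes:
    """Average interleaved 16-bit stereo frames into 16-bit mono."""
    samples = array.array("h")
    samples.frombytes(pcm)
    mono = array.array(
        "h",
        ((samples[i] + samples[i + 1]) // 2 for i in range(0, len(samples), 2)),
    )
    return mono.tobytes()


# Two stereo frames: (100, 200) and (-50, 50)
stereo = array.array("h", [100, 200, -50, 50]).tobytes()
mono = array.array("h")
mono.frombytes(stereo_to_mono(stereo))
print(list(mono))  # -> [150, 0]
```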
## Performance
- **Latency**: < 500ms for short utterances (< 5s)
- **Accuracy**: < 5% word error rate (WER) for clear speech
- **Model**: faster-whisper (base or small)
## Integration
### With Wake-Word Service
1. Wake-word detects activation
2. Sends "start" signal to ASR
3. ASR begins streaming transcription
4. Wake-word sends "stop" signal
5. ASR returns final transcription
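The handshake above can be mirrored client-side with a tiny state machine. This is illustrative only; the actual start/stop signaling between the services is defined by their own APIs:

```python
class AsrSession:
    """Minimal state machine for the wake-word -> ASR handshake (illustrative)."""

    def __init__(self) -> None:
        self.state = "idle"
        self._chunks: list[bytes] = []

    def start(self) -> None:
        """Wake-word fired: begin accepting audio."""
        if self.state != "idle":
            raise RuntimeError("already streaming")
        self._chunks.clear()
        self.state = "streaming"

    def feed(self, chunk: bytes) -> None:
        """Buffer an audio chunk while streaming."""
        if self.state != "streaming":
            raise RuntimeError("not streaming")
        self._chunks.append(chunk)

    def stop(self) -> bytes:
        """Wake-word sent "stop": return the buffered audio for final transcription."""
        if self.state != "streaming":
            raise RuntimeError("not streaming")
        self.state = "idle"
        return b"".join(self._chunks)
```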
### With LLM
1. ASR returns transcribed text
2. Text sent to LLM for processing
3. LLM response sent to TTS
## Example Usage
### Python Client
```python
import requests

# Transcribe a file
with open("audio.wav", "rb") as f:
    response = requests.post(
        "http://localhost:8001/api/asr/transcribe",
        files={"audio": f},
        data={"language": "en", "format": "json"},
    )
result = response.json()
print(result["text"])
```
### JavaScript Client
```javascript
// Streaming transcription
const ws = new WebSocket("ws://localhost:8001/api/asr/stream");
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === "final") {
    console.log("Transcription:", data.text);
  }
};

// Send audio chunks
const audioChunk = ...; // Audio data
ws.send(audioChunk);
```
## Future Enhancements
- Speaker diarization
- Punctuation and capitalization
- Custom vocabulary
- Confidence scores per word
- Multiple language detection