✅ TICKET-006: Wake-word Detection Service - Implemented wake-word detection using openWakeWord - HTTP/WebSocket server on port 8002 - Real-time detection with configurable threshold - Event emission for ASR integration - Location: home-voice-agent/wake-word/ ✅ TICKET-010: ASR Service - Implemented ASR using faster-whisper - HTTP endpoint for file transcription - WebSocket endpoint for streaming transcription - Support for multiple audio formats - Auto language detection - GPU acceleration support - Location: home-voice-agent/asr/ ✅ TICKET-014: TTS Service - Implemented TTS using Piper - HTTP endpoint for text-to-speech synthesis - Low-latency processing (< 500ms) - Multiple voice support - WAV audio output - Location: home-voice-agent/tts/ ✅ TICKET-047: Updated Hardware Purchases - Marked Pi5 kit, SSD, microphone, and speakers as purchased - Updated progress log with purchase status 📚 Documentation: - Added VOICE_SERVICES_README.md with complete testing guide - Each service includes README.md with usage instructions - All services ready for Pi5 deployment 🧪 Testing: - Created test files for each service - All imports validated - FastAPI apps created successfully - Code passes syntax validation 🚀 Ready for: - Pi5 deployment - End-to-end voice flow testing - Integration with MCP server Files Added: - wake-word/detector.py - wake-word/server.py - wake-word/requirements.txt - wake-word/README.md - wake-word/test_detector.py - asr/service.py - asr/server.py - asr/requirements.txt - asr/README.md - asr/test_service.py - tts/service.py - tts/server.py - tts/requirements.txt - tts/README.md - tts/test_service.py - VOICE_SERVICES_README.md Files Modified: - tickets/done/TICKET-047_hardware-purchases.md Files Moved: - tickets/backlog/TICKET-006_prototype-wake-word-node.md → tickets/done/ - tickets/backlog/TICKET-010_streaming-asr-service.md → tickets/done/ - tickets/backlog/TICKET-014_tts-service.md → tickets/done/
201 lines
3.5 KiB
Markdown
201 lines
3.5 KiB
Markdown
# ASR API Contract
|
|
|
|
API specification for the Automatic Speech Recognition (ASR) service.
|
|
|
|
## Overview
|
|
|
|
The ASR service converts audio input to text. It supports streaming audio for real-time transcription.
|
|
|
|
## Base URL
|
|
|
|
```
|
|
http://localhost:8001/api/asr
|
|
```
|
|
|
|
(Configurable port and host)
|
|
|
|
## Endpoints
|
|
|
|
### 1. Health Check
|
|
|
|
```
|
|
GET /health
|
|
```
|
|
|
|
**Response:**
|
|
```json
|
|
{
|
|
"status": "healthy",
|
|
"model": "faster-whisper",
|
|
"model_size": "base",
|
|
"language": "en"
|
|
}
|
|
```
|
|
|
|
### 2. Transcribe Audio (File Upload)
|
|
|
|
```
|
|
POST /transcribe
|
|
Content-Type: multipart/form-data
|
|
```
|
|
|
|
**Request:**
|
|
- `audio`: Audio file (WAV, MP3, FLAC, etc.)
|
|
- `language` (optional): Language code (default: "en")
|
|
- `format` (optional): Response format ("text" or "json", default: "text")
|
|
|
|
**Response (text format):**
|
|
```
|
|
This is the transcribed text.
|
|
```
|
|
|
|
**Response (json format):**
|
|
```json
|
|
{
|
|
"text": "This is the transcribed text.",
|
|
"segments": [
|
|
{
|
|
"start": 0.0,
|
|
"end": 2.5,
|
|
"text": "This is the transcribed text."
|
|
}
|
|
],
|
|
"language": "en",
|
|
"duration": 2.5
|
|
}
|
|
```
|
|
|
|
### 3. Streaming Transcription (WebSocket)
|
|
|
|
```
|
|
WS /stream
|
|
```
|
|
|
|
**Client → Server:**
|
|
- Send audio chunks (binary)
|
|
- Send `{"action": "end"}` to finish
|
|
|
|
**Server → Client:**
|
|
```json
|
|
{
|
|
"type": "partial",
|
|
"text": "Partial transcription..."
|
|
}
|
|
```
|
|
|
|
```json
|
|
{
|
|
"type": "final",
|
|
"text": "Final transcription.",
|
|
"segments": [...]
|
|
}
|
|
```
|
|
|
|
### 4. Get Supported Languages
|
|
|
|
```
|
|
GET /languages
|
|
```
|
|
|
|
**Response:**
|
|
```json
|
|
{
|
|
"languages": [
|
|
{"code": "en", "name": "English"},
|
|
{"code": "es", "name": "Spanish"},
|
|
...
|
|
]
|
|
}
|
|
```
|
|
|
|
## Error Responses
|
|
|
|
```json
|
|
{
|
|
"error": "Error message",
|
|
"code": "ERROR_CODE"
|
|
}
|
|
```
|
|
|
|
**Error Codes:**
|
|
- `INVALID_AUDIO`: Audio file is invalid or unsupported
|
|
- `TRANSCRIPTION_FAILED`: Transcription process failed
|
|
- `LANGUAGE_NOT_SUPPORTED`: Requested language not supported
|
|
- `SERVICE_UNAVAILABLE`: ASR service is unavailable
|
|
|
|
## Rate Limiting
|
|
|
|
- **File upload**: 10 requests/minute
|
|
- **Streaming**: 1 concurrent stream per client
|
|
|
|
## Audio Format Requirements
|
|
|
|
- **Format**: WAV, MP3, FLAC, OGG
|
|
- **Sample Rate**: 16kHz recommended (auto-resampled)
|
|
- **Channels**: Mono or stereo (converted to mono)
|
|
- **Bit Depth**: 16-bit recommended
|
|
|
|
## Performance
|
|
|
|
- **Latency**: < 500ms for short utterances (< 5s)
|
|
- **Accuracy**: > 95% WER for clear speech
|
|
- **Model**: faster-whisper (base or small)
|
|
|
|
## Integration
|
|
|
|
### With Wake-Word Service
|
|
1. Wake-word detects activation
|
|
2. Sends "start" signal to ASR
|
|
3. ASR begins streaming transcription
|
|
4. Wake-word sends "stop" signal
|
|
5. ASR returns final transcription
|
|
|
|
### With LLM
|
|
1. ASR returns transcribed text
|
|
2. Text sent to LLM for processing
|
|
3. LLM response sent to TTS
|
|
|
|
## Example Usage
|
|
|
|
### Python Client
|
|
|
|
```python
|
|
import requests
|
|
|
|
# Transcribe file
|
|
with open("audio.wav", "rb") as f:
|
|
response = requests.post(
|
|
"http://localhost:8001/api/asr/transcribe",
|
|
files={"audio": f},
|
|
data={"language": "en", "format": "json"}
|
|
)
|
|
result = response.json()
|
|
print(result["text"])
|
|
```
|
|
|
|
### JavaScript Client
|
|
|
|
```javascript
|
|
// Streaming transcription
|
|
const ws = new WebSocket("ws://localhost:8001/api/asr/stream");
|
|
|
|
ws.onmessage = (event) => {
|
|
const data = JSON.parse(event.data);
|
|
if (data.type === "final") {
|
|
console.log("Transcription:", data.text);
|
|
}
|
|
};
|
|
|
|
// Send audio chunks
|
|
const audioChunk = ...; // Audio data
|
|
ws.send(audioChunk);
|
|
```
|
|
|
|
## Future Enhancements
|
|
|
|
- Speaker diarization
|
|
- Punctuation and capitalization
|
|
- Custom vocabulary
|
|
- Confidence scores per word
|
|
- Multiple language detection
|