# ASR API Contract

API specification for the Automatic Speech Recognition (ASR) service.

## Overview

The ASR service converts audio input to text. It supports streaming audio for real-time transcription.

## Base URL

```
http://localhost:8001/api/asr
```

(Configurable port and host)

## Endpoints

### 1. Health Check

```
GET /health
```

**Response:**

```json
{
  "status": "healthy",
  "model": "faster-whisper",
  "model_size": "base",
  "language": "en"
}
```

### 2. Transcribe Audio (File Upload)

```
POST /transcribe
Content-Type: multipart/form-data
```

**Request:**

- `audio`: Audio file (WAV, MP3, FLAC, etc.)
- `language` (optional): Language code (default: `"en"`)
- `format` (optional): Response format (`"text"` or `"json"`, default: `"text"`)

**Response (text format):**

```
This is the transcribed text.
```

**Response (json format):**

```json
{
  "text": "This is the transcribed text.",
  "segments": [
    {
      "start": 0.0,
      "end": 2.5,
      "text": "This is the transcribed text."
    }
  ],
  "language": "en",
  "duration": 2.5
}
```

### 3. Streaming Transcription (WebSocket)

```
WS /stream
```

**Client → Server:**

- Send audio chunks (binary)
- Send `{"action": "end"}` to finish

**Server → Client:**

```json
{
  "type": "partial",
  "text": "Partial transcription..."
}
```

```json
{
  "type": "final",
  "text": "Final transcription.",
  "segments": [...]
}
```

### 4. Get Supported Languages

```
GET /languages
```

**Response:**

```json
{
  "languages": [
    {"code": "en", "name": "English"},
    {"code": "es", "name": "Spanish"},
    ...
  ]
}
```

## Error Responses

```json
{
  "error": "Error message",
  "code": "ERROR_CODE"
}
```

**Error Codes:**

- `INVALID_AUDIO`: Audio file is invalid or unsupported
- `TRANSCRIPTION_FAILED`: Transcription process failed
- `LANGUAGE_NOT_SUPPORTED`: Requested language is not supported
- `SERVICE_UNAVAILABLE`: ASR service is unavailable

## Rate Limiting

- **File upload**: 10 requests/minute
- **Streaming**: 1 concurrent stream per client

## Audio Format Requirements

- **Format**: WAV, MP3, FLAC, OGG
- **Sample Rate**: 16 kHz recommended (other rates are auto-resampled)
- **Channels**: Mono or stereo (stereo is converted to mono)
- **Bit Depth**: 16-bit recommended

## Performance

- **Latency**: < 500 ms for short utterances (< 5 s)
- **Accuracy**: < 5% word error rate (WER) for clear speech
- **Model**: faster-whisper (base or small)

## Integration

### With Wake-Word Service

1. The wake-word service detects activation
2. It sends a "start" signal to the ASR service
3. ASR begins streaming transcription
4. The wake-word service sends a "stop" signal
5. ASR returns the final transcription

### With LLM

1. ASR returns the transcribed text
2. The text is sent to the LLM for processing
3. The LLM response is sent to TTS

## Example Usage

### Python Client

```python
import requests

# Transcribe a file
with open("audio.wav", "rb") as f:
    response = requests.post(
        "http://localhost:8001/api/asr/transcribe",
        files={"audio": f},
        data={"language": "en", "format": "json"},
    )

result = response.json()
print(result["text"])
```

### JavaScript Client

```javascript
// Streaming transcription
const ws = new WebSocket("ws://localhost:8001/api/asr/stream");

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === "final") {
    console.log("Transcription:", data.text);
  }
};

// Send audio chunks
const audioChunk = ...; // Audio data
ws.send(audioChunk);
```

## Future Enhancements

- Speaker diarization
- Punctuation and capitalization
- Custom vocabulary
- Per-word confidence scores
- Multiple language detection
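As a complement to the Python client example, here is a minimal sketch of how a caller might parse the documented JSON response and error payloads into typed objects. The `Segment` and `ASRError` names are illustrative assumptions, not part of this contract:

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One timed segment from a /transcribe JSON response."""
    start: float
    end: float
    text: str


class ASRError(Exception):
    """Raised when the service returns the documented error payload.

    (Hypothetical helper class; not defined by this API contract.)
    """
    def __init__(self, code: str, message: str):
        super().__init__(f"{code}: {message}")
        self.code = code


def parse_response(payload: dict) -> list[Segment]:
    """Convert a /transcribe JSON body into Segment objects.

    Raises ASRError if the body matches the documented error shape
    ({"error": ..., "code": ...}).
    """
    if "error" in payload:
        raise ASRError(payload.get("code", "UNKNOWN"), payload["error"])
    return [
        Segment(s["start"], s["end"], s["text"])
        for s in payload.get("segments", [])
    ]


# Example payload matching the json-format response shown above
ok = {
    "text": "This is the transcribed text.",
    "segments": [
        {"start": 0.0, "end": 2.5, "text": "This is the transcribed text."}
    ],
    "language": "en",
    "duration": 2.5,
}
segments = parse_response(ok)
```

A typed wrapper like this keeps the distinction between the success and error shapes in one place, so callers don't have to re-check for an `"error"` key at every call site.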