# ASR (Automatic Speech Recognition) Service

Speech-to-text service using faster-whisper for real-time transcription.

## Features

- HTTP endpoint for file transcription
- WebSocket endpoint for streaming transcription
- Support for multiple audio formats (WAV, MP3, FLAC, etc.)
- Automatic language detection
- Low-latency processing
- GPU acceleration support (CUDA)

## Installation

```bash
# Install Python dependencies
pip install -r requirements.txt

# For GPU support (optional):
# the CUDA toolkit must be installed;
# faster-whisper will use the GPU automatically if available
```

## Usage

### Standalone Service

```bash
# Run as an HTTP/WebSocket server
python3 -m asr.server

# Or use uvicorn directly
uvicorn asr.server:app --host 0.0.0.0 --port 8001
```

### Python API

```python
from asr.service import ASRService

service = ASRService(
    model_size="small",
    device="cpu",  # or "cuda" for GPU
    language="en",
)

# Transcribe a file
with open("audio.wav", "rb") as f:
    result = service.transcribe_file(f.read())
print(result["text"])
```

## API Endpoints

### HTTP

- `GET /health` - Health check
- `POST /transcribe` - Transcribe an audio file
  - `audio`: Audio file (multipart/form-data)
  - `language`: Language code (optional)
  - `format`: Response format (`"text"` or `"json"`)
- `GET /languages` - List supported languages

### WebSocket

- `WS /stream` - Streaming transcription
  - Send audio chunks (binary)
  - Send `{"action": "end"}` to finish
  - Receive partial and final results

## Configuration

- **Model Size**: `small` (default), `tiny`, `base`, `medium`, `large`
- **Device**: `cpu` (default), `cuda` (if a GPU is available)
- **Compute Type**: `int8` (default), `int8_float16`, `float16`, `float32`
- **Language**: `en` (default), or `None` for auto-detection

## Performance

- **CPU (small model)**: ~2-4 s latency
- **GPU (small model)**: ~0.5-1 s latency
- **GPU (medium model)**: ~1-2 s latency

## Integration

The ASR service is triggered by:

1. Wake-word detection events
2. Direct HTTP/WebSocket requests
3. Audio file uploads

Output is sent to:

1. The LLM for processing
2. The conversation manager
3. Response generation

## Testing

```bash
# Check service health
curl http://localhost:8001/health

# Test transcription
curl -X POST http://localhost:8001/transcribe \
  -F "audio=@test.wav" \
  -F "language=en" \
  -F "format=json"
```

## Notes

- The first run downloads the model (~500 MB for `small`)
- GPU acceleration requires CUDA
- Streaming transcription requires proper audio format handling
- Many languages are supported (see the `/languages` endpoint)
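The curl test above can also be reproduced from Python. Below is a minimal stdlib-only sketch that builds the `multipart/form-data` body by hand and posts it to `/transcribe`; the field names (`audio`, `language`, `format`) come from the API description above, while the `audio.wav` filename and the `"text"` key in the response are assumptions based on the Python API example.

```python
import io
import json
import urllib.request
import uuid


def build_multipart(fields: dict, file_field: str, filename: str,
                    file_bytes: bytes) -> tuple:
    """Encode form fields plus one file as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    for name, value in fields.items():
        buf.write(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    buf.write(
        f'--{boundary}\r\nContent-Disposition: form-data; '
        f'name="{file_field}"; filename="{filename}"\r\n'
        f'Content-Type: application/octet-stream\r\n\r\n'.encode()
    )
    buf.write(file_bytes)
    buf.write(f"\r\n--{boundary}--\r\n".encode())  # closing boundary
    return buf.getvalue(), f"multipart/form-data; boundary={boundary}"


def transcribe(url: str, audio_bytes: bytes, language: str = "en") -> dict:
    """POST an audio file to the /transcribe endpoint and parse the JSON reply."""
    body, content_type = build_multipart(
        {"language": language, "format": "json"},
        "audio", "audio.wav", audio_bytes,
    )
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


if __name__ == "__main__":
    with open("test.wav", "rb") as f:
        result = transcribe("http://localhost:8001/transcribe", f.read())
    print(result["text"])
```

In a real project the third-party `requests` library (`requests.post(url, files=..., data=...)`) would be more idiomatic; the manual encoder is shown only to keep the sketch dependency-free.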
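The WebSocket protocol described above (binary chunks, then `{"action": "end"}`, then partial/final results) can be sketched as a small client. This is a hedged illustration, not the service's reference client: it assumes 16 kHz 16-bit mono PCM input, the third-party `websockets` package, and a `"final"` flag in result messages, none of which are specified in this README.

```python
import asyncio
import json


def chunk_pcm(pcm: bytes, sample_rate: int = 16000, sample_width: int = 2,
              chunk_ms: int = 100) -> list:
    """Split raw PCM audio into fixed-duration chunks for streaming."""
    step = sample_rate * sample_width * chunk_ms // 1000
    return [pcm[i:i + step] for i in range(0, len(pcm), step)]


async def stream_audio(uri: str, pcm: bytes) -> None:
    """Send audio to the /stream endpoint and print transcription results."""
    # Requires the third-party `websockets` package (pip install websockets).
    import websockets
    async with websockets.connect(uri) as ws:
        for chunk in chunk_pcm(pcm):
            await ws.send(chunk)                     # binary audio frame
        await ws.send(json.dumps({"action": "end"}))  # signal end of audio
        async for message in ws:                     # partial and final results
            result = json.loads(message)
            print(result)
            if result.get("final"):                  # assumed field name
                break


if __name__ == "__main__":
    with open("test.raw", "rb") as f:
        asyncio.run(stream_audio("ws://localhost:8001/stream", f.read()))
```

Fixed-size chunks keep per-message latency predictable; 100 ms at 16 kHz mono 16-bit works out to 3200 bytes per frame.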