# ASR Engine Evaluation and Selection

## Overview

This document evaluates Automatic Speech Recognition (ASR) engines for the Atlas voice agent system, considering deployment options on RTX 4080, GTX 1050, or CPU-only hardware.

## Evaluation Criteria

### Requirements

- **Latency**: < 2s end-to-end (audio in → text out) for interactive use
- **Accuracy**: Low word error rate (WER) for conversational speech
- **Resource Usage**: Efficient GPU/CPU utilization
- **Streaming**: Support for real-time audio streaming
- **Model Size**: Balance between quality and resource usage
- **Integration**: Easy integration with wake-word events

## ASR Engine Options

### 1. faster-whisper (Recommended)

**Description**: Optimized Whisper implementation using CTranslate2

**Pros:**
- ⭐ **Best performance** - 4x faster than the original Whisper
- ✅ GPU acceleration (CUDA) support
- ✅ Streaming support available
- ✅ Multiple model sizes (tiny, small, medium, large)
- ✅ Good accuracy for conversational speech
- ✅ Active development and maintenance
- ✅ Python API, easy integration

**Cons:**
- Requires CUDA for GPU acceleration
- Model files are large (small: ~500MB, medium: ~1.5GB)

**Performance:**
- **GPU (4080)**: ~0.5-1s latency (medium model)
- **GPU (1050)**: ~1-2s latency (small model)
- **CPU**: ~2-4s latency (small model)

**Model Sizes:**
- **tiny**: ~75MB, fastest, lower accuracy
- **small**: ~500MB, good balance (recommended)
- **medium**: ~1.5GB, higher accuracy
- **large**: ~3GB, best accuracy, slower

**Recommendation**: ⭐ **Primary choice** - Best balance of speed and accuracy
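The model-size tradeoffs listed above can be encoded as a small selection helper. This is a sketch, not part of the Atlas design: the function name and VRAM thresholds are illustrative assumptions derived from the sizes and recommendations in this section.

```python
def pick_whisper_model(has_gpu: bool, vram_gb: float = 0.0) -> str:
    """Pick a faster-whisper model size from the tradeoffs above.

    Thresholds are illustrative assumptions: medium (~1.5GB of weights)
    wants comfortable VRAM headroom, small (~500MB) fits a 4GB card,
    and CPU-only deployments stay on small for the speed/accuracy balance.
    """
    if has_gpu:
        # Plenty of VRAM (e.g. an RTX 4080): medium gives the best accuracy.
        if vram_gb >= 8:
            return "medium"
        # Tight VRAM (e.g. a 4GB GTX 1050): only the small model fits.
        return "small"
    # CPU-only: small is the recommended balance of speed and accuracy.
    return "small"
```

A caller would typically probe the hardware once at startup (e.g. via `nvidia-smi` or a CUDA runtime query) and pass the result in, rather than hard-coding a model name.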
### 2. Whisper.cpp

**Description**: C++ port of Whisper, optimized for CPU

**Pros:**
- ✅ Very efficient CPU implementation
- ✅ Low memory footprint
- ✅ Cross-platform (Linux, macOS, Windows)
- ✅ Can run on small devices (Raspberry Pi)
- ✅ Streaming support

**Cons:**
- ⚠️ No GPU acceleration (CPU-only)
- ⚠️ Slower than faster-whisper on GPU
- ⚠️ Less Python-friendly (C++ API)

**Performance:**
- **CPU**: ~2-3s latency (small model)
- **Raspberry Pi**: ~5-8s latency (tiny model)

**Recommendation**: Good for CPU-only deployment or small devices

### 3. OpenAI Whisper (Original)

**Description**: Original PyTorch implementation

**Pros:**
- ✅ Reference implementation
- ✅ Well-documented
- ✅ Easy to use

**Cons:**
- ❌ Slowest option (4x slower than faster-whisper)
- ❌ Higher memory usage
- ❌ Not optimized for production

**Recommendation**: ❌ Not recommended - Use faster-whisper instead

### 4. Other Options

**Vosk**:
- Pros: Very fast, lightweight
- Cons: Lower accuracy, requires model training
- Recommendation: Not suitable for general speech

**DeepSpeech**:
- Pros: Open source, lightweight
- Cons: Lower accuracy, no longer actively maintained
- Recommendation: Not recommended

## Deployment Options

### Option A: faster-whisper on RTX 4080 (Recommended)

**Configuration:**
- **Engine**: faster-whisper
- **Model**: medium (best accuracy) or small (faster)
- **Hardware**: RTX 4080 (shared with work agent LLM)
- **Latency**: ~0.5-1s (medium), ~0.3-0.7s (small)

**Pros:**
- ✅ Lowest latency
- ✅ Best accuracy (with the medium model)
- ✅ No additional hardware needed
- ✅ Can share GPU with LLM (time-multiplexed)

**Cons:**
- ⚠️ GPU resource contention with the LLM
- ⚠️ May need to pause the LLM during ASR processing

**Recommendation**: ⭐ **Best for quality** - Use if the 4080 has headroom

### Option B: faster-whisper on GTX 1050

**Configuration:**
- **Engine**: faster-whisper
- **Model**: small (fits in 4GB VRAM)
- **Hardware**: GTX 1050 (shared with family agent LLM)
- **Latency**: ~1-2s

**Pros:**
- ✅ Good latency
- ✅ No additional hardware
- ✅ Can share with the family agent LLM

**Cons:**
- ⚠️ VRAM constraints (4GB is tight)
- ⚠️ May conflict with the family agent LLM
- ⚠️ Only the small model fits

**Recommendation**: ⚠️ **Possible but tight** - Consider the CPU option

### Option C: faster-whisper on CPU (Small Box)

**Configuration:**
- **Engine**: faster-whisper
- **Model**: small or tiny
- **Hardware**: Always-on node (Pi/NUC/SFF PC)
- **Latency**: ~2-4s (small), ~1-2s (tiny)

**Pros:**
- ✅ No GPU resource contention
- ✅ Dedicated hardware for ASR
- ✅ Can run 24/7 without affecting LLM servers
- ✅ Lower power consumption

**Cons:**
- ⚠️ Higher latency (2-4s)
- ⚠️ Requires additional hardware
- ⚠️ Lower accuracy with the tiny model

**Recommendation**: ✅ **Good for separation** - Best if you want dedicated ASR

### Option D: Whisper.cpp on CPU (Small Box)

**Configuration:**
- **Engine**: Whisper.cpp
- **Model**: small
- **Hardware**: Always-on node
- **Latency**: ~2-3s

**Pros:**
- ✅ Very efficient CPU usage
- ✅ Low memory footprint
- ✅ Good for resource-constrained devices

**Cons:**
- ⚠️ No GPU acceleration
- ⚠️ Slower than faster-whisper on GPU

**Recommendation**: Good alternative to faster-whisper on CPU

## Model Size Selection

### Small Model (Recommended for most cases)
- **Size**: ~500MB
- **Accuracy**: Good for conversational speech
- **Latency**: 0.5-2s (depending on hardware)
- **Use Case**: General voice agent interactions

### Medium Model (Best accuracy)
- **Size**: ~1.5GB
- **Accuracy**: Excellent for conversational speech
- **Latency**: 0.5-1s (on GPU)
- **Use Case**: When quality is critical and a GPU is available

### Tiny Model (Fastest, lower accuracy)
- **Size**: ~75MB
- **Accuracy**: Acceptable for simple commands
- **Latency**: 0.3-1s
- **Use Case**: Resource-constrained deployments or very low latency needs

## Final Recommendation

### Primary Choice: faster-whisper on RTX 4080

**Configuration:**
- **Engine**: faster-whisper
- **Model**: small (or medium if GPU headroom is available)
- **Hardware**: RTX 4080 (shared with work agent)
- **Deployment**: Time-multiplexed with the LLM (pause the LLM during ASR)

**Rationale:**
- Best balance of latency and accuracy
- No additional hardware needed
- Can share the GPU efficiently
- The small model provides good accuracy with low latency

### Alternative: faster-whisper on CPU (Always-on Node)

**Configuration:**
- **Engine**: faster-whisper
- **Model**: small
- **Hardware**: Dedicated always-on node (Pi 4+, NUC, or SFF PC)
- **Deployment**: Separate from LLM servers

**Rationale:**
- No GPU resource contention
- Dedicated hardware for ASR
- Acceptable latency (2-4s) for voice interactions
- Better separation of concerns

## Integration Considerations

### Wake-Word Integration
- ASR should start when the wake word is detected
- Stop ASR when silence is detected or the user stops speaking
- Stream audio chunks to the ASR service
- Return text segments in real time

### API Design
- **Endpoint**: WebSocket `/asr/stream`
- **Input**: Audio stream (PCM, 16kHz, mono)
- **Output**: JSON with text segments and timestamps
- **Format**:

```json
{
  "text": "transcribed text",
  "timestamp": 1234.56,
  "confidence": 0.95,
  "is_final": false
}
```

### Resource Management
- If on the 4080: Pause the LLM during ASR processing (or use a separate GPU)
- If on CPU: No conflicts, can run continuously
- Monitor GPU/CPU usage and adjust the model size if needed

## Performance Targets

| Hardware | Model | Target Latency | Status |
|----------|-------|----------------|--------|
| RTX 4080 | small | < 1s | ✅ Achievable |
| RTX 4080 | medium | < 1.5s | ✅ Achievable |
| GTX 1050 | small | < 2s | ✅ Achievable |
| CPU (modern) | small | < 4s | ✅ Achievable |
| CPU (Pi 4) | tiny | < 8s | ⚠️ Acceptable |

## Next Steps

1. ✅ ASR engine selected: **faster-whisper**
2. ✅ Deployment decided: **RTX 4080 (primary)** or **CPU node (alternative)**
3. ✅ Model size: **small** (or medium if GPU headroom)
4. Implement ASR service (TICKET-010)
5. Define ASR API contract (TICKET-011)
6. Benchmark actual performance (TICKET-012)
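The JSON segment format described under API Design can be modeled on the consumer side with a small standard-library parser. This is a sketch of one possible client-side representation: the `AsrSegment` and `parse_segment` names are illustrative assumptions, not part of a defined contract.

```python
import json
from dataclasses import dataclass


@dataclass
class AsrSegment:
    """One message in the segment format from the API Design section."""
    text: str
    timestamp: float
    confidence: float
    is_final: bool


def parse_segment(raw: str) -> AsrSegment:
    """Decode a single JSON segment message into a typed record."""
    payload = json.loads(raw)
    return AsrSegment(
        text=payload["text"],
        timestamp=float(payload["timestamp"]),
        confidence=float(payload["confidence"]),
        is_final=bool(payload["is_final"]),
    )


# Example message matching the documented format.
message = (
    '{"text": "transcribed text", "timestamp": 1234.56, '
    '"confidence": 0.95, "is_final": false}'
)
segment = parse_segment(message)
```

A streaming client would call `parse_segment` on each WebSocket frame, accumulating non-final segments for live display and committing the transcript when `is_final` arrives.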
## References

- [faster-whisper GitHub](https://github.com/guillaumekln/faster-whisper)
- [Whisper.cpp GitHub](https://github.com/ggerganov/whisper.cpp)
- [OpenAI Whisper](https://github.com/openai/whisper)
- [ASR Benchmarking](https://github.com/robflynnyh/whisper-benchmark)

---

**Last Updated**: 2024-01-XX
**Status**: Evaluation Complete - Ready for Implementation (TICKET-010)