# ASR Engine Evaluation and Selection

## Overview

This document evaluates Automatic Speech Recognition (ASR) engines for the Atlas voice agent system, considering deployment options on RTX 4080, GTX 1050, or CPU-only hardware.
## Evaluation Criteria

### Requirements

- Latency: < 2s end-to-end (audio in → text out) for interactive use
- Accuracy: low word error rate (WER) on conversational speech
- Resource Usage: efficient GPU/CPU utilization
- Streaming: support for real-time audio streaming
- Model Size: balance between quality and resource usage
- Integration: easy integration with wake-word events
## ASR Engine Options

### 1. faster-whisper (Recommended)

Description: Optimized Whisper implementation using CTranslate2
Pros:
- ⭐ Best performance - 4x faster than original Whisper
- ✅ GPU acceleration (CUDA) support
- ✅ Streaming support available
- ✅ Multiple model sizes (tiny, small, medium, large)
- ✅ Good accuracy for conversational speech
- ✅ Active development and maintenance
- ✅ Python API, easy integration
Cons:
- Requires CUDA for GPU acceleration
- Model files are large (small: 500MB, medium: 1.5GB)
Performance:
- GPU (4080): ~0.5-1s latency (medium model)
- GPU (GTX 1050): ~1-2s latency (small model)
- CPU: ~2-4s latency (small model)
Model Sizes:
- tiny: ~75MB, fastest, lower accuracy
- small: ~500MB, good balance (recommended)
- medium: ~1.5GB, higher accuracy
- large: ~3GB, best accuracy, slower
Recommendation: ⭐ Primary choice - Best balance of speed and accuracy
### 2. Whisper.cpp

Description: C++ port of Whisper, optimized for CPU
Pros:
- ✅ Very efficient CPU implementation
- ✅ Low memory footprint
- ✅ Cross-platform (Linux, macOS, Windows)
- ✅ Can run on small devices (Raspberry Pi)
- ✅ Streaming support
Cons:
- ⚠️ No GPU acceleration (CPU-only)
- ⚠️ Slower than faster-whisper on GPU
- ⚠️ Less Python-friendly (C++ API)
Performance:
- CPU: ~2-3s latency (small model)
- Raspberry Pi: ~5-8s latency (tiny model)
Recommendation: Good for CPU-only deployment or small devices
### 3. OpenAI Whisper (Original)

Description: Original PyTorch implementation
Pros:
- ✅ Reference implementation
- ✅ Well-documented
- ✅ Easy to use
Cons:
- ❌ Slowest option (4x slower than faster-whisper)
- ❌ Higher memory usage
- ❌ Not optimized for production
Recommendation: ❌ Not recommended - Use faster-whisper instead
### 4. Other Options
Vosk:
- Pros: Very fast, lightweight
- Cons: Lower accuracy, requires model training
- Recommendation: Not suitable for general speech
DeepSpeech:
- Pros: Open source, lightweight
- Cons: Lower accuracy, outdated
- Recommendation: Not recommended
## Deployment Options

### Option A: faster-whisper on RTX 4080 (Recommended)

Configuration:
- Engine: faster-whisper
- Model: medium (best accuracy) or small (faster)
- Hardware: RTX 4080 (shared with work agent LLM)
- Latency: ~0.5-1s (medium), ~0.3-0.7s (small)
Pros:
- ✅ Lowest latency
- ✅ Best accuracy (with medium model)
- ✅ No additional hardware needed
- ✅ Can share GPU with LLM (time-multiplexed)
Cons:
- ⚠️ GPU resource contention with LLM
- ⚠️ May need to pause LLM during ASR processing
Recommendation: ⭐ Best for quality - Use if 4080 has headroom
### Option B: faster-whisper on GTX 1050

Configuration:
- Engine: faster-whisper
- Model: small (fits in 4GB VRAM)
- Hardware: GTX 1050 (shared with family agent LLM)
- Latency: ~1-2s
Pros:
- ✅ Good latency
- ✅ No additional hardware
- ✅ Can share with family agent LLM
Cons:
- ⚠️ VRAM constraints (4GB is tight)
- ⚠️ May conflict with family agent LLM
- ⚠️ Only small model fits
Recommendation: ⚠️ Possible but tight - Consider CPU option
### Option C: faster-whisper on CPU (Small Box)
Configuration:
- Engine: faster-whisper
- Model: small or tiny
- Hardware: Always-on node (Pi/NUC/SFF PC)
- Latency: ~2-4s (small), ~1-2s (tiny)
Pros:
- ✅ No GPU resource contention
- ✅ Dedicated hardware for ASR
- ✅ Can run 24/7 without affecting LLM servers
- ✅ Lower power consumption
Cons:
- ⚠️ Higher latency (2-4s)
- ⚠️ Requires additional hardware
- ⚠️ Lower accuracy with tiny model
Recommendation: ✅ Good for separation - Best if you want dedicated ASR
### Option D: Whisper.cpp on CPU (Small Box)
Configuration:
- Engine: Whisper.cpp
- Model: small
- Hardware: Always-on node
- Latency: ~2-3s
Pros:
- ✅ Very efficient CPU usage
- ✅ Low memory footprint
- ✅ Good for resource-constrained devices
Cons:
- ⚠️ No GPU acceleration
- ⚠️ Slower than faster-whisper on GPU
Recommendation: Good alternative to faster-whisper on CPU
## Model Size Selection

### Small Model (Recommended for most cases)
- Size: ~500MB
- Accuracy: Good for conversational speech
- Latency: 0.5-2s (depending on hardware)
- Use Case: General voice agent interactions
### Medium Model (Best accuracy)
- Size: ~1.5GB
- Accuracy: Excellent for conversational speech
- Latency: 0.5-1s (on GPU)
- Use Case: If quality is critical and GPU available
### Tiny Model (Fastest, lower accuracy)
- Size: ~75MB
- Accuracy: Acceptable for simple commands
- Latency: 0.3-1s
- Use Case: Resource-constrained or very low latency needed
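The size guidance above can be condensed into a small selection helper. This is a sketch only: the function name and the VRAM thresholds are assumptions based on the options discussed, not benchmarked cutoffs.

```python
# Sketch: choose a Whisper model size from the hardware budget discussed
# above. The VRAM thresholds are assumptions, not benchmarked cutoffs.
def pick_model(vram_gb, low_latency=False):
    """Return a model size string for faster-whisper."""
    if vram_gb is None:                    # CPU-only node
        return "tiny" if low_latency else "small"
    if vram_gb >= 12 and not low_latency:
        return "medium"                    # e.g. RTX 4080: headroom for accuracy
    if vram_gb >= 4:
        return "small"                     # e.g. GTX 1050: 4 GB is tight but fits
    return "tiny"
```

For example, `pick_model(16)` selects `medium` for a 4080-class card, while `pick_model(None)` falls back to `small` on a CPU-only node.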
## Final Recommendation

### Primary Choice: faster-whisper on RTX 4080
Configuration:
- Engine: faster-whisper
- Model: small (or medium if GPU headroom available)
- Hardware: RTX 4080 (shared with work agent)
- Deployment: Time-multiplexed with LLM (pause LLM during ASR)
Rationale:
- Best balance of latency and accuracy
- No additional hardware needed
- Can share GPU efficiently
- Small model provides good accuracy with low latency
### Alternative: faster-whisper on CPU (Always-on Node)
Configuration:
- Engine: faster-whisper
- Model: small
- Hardware: Dedicated always-on node (Pi 4+, NUC, or SFF PC)
- Deployment: Separate from LLM servers
Rationale:
- No GPU resource contention
- Dedicated hardware for ASR
- Acceptable latency (2-4s) for voice interactions
- Better separation of concerns
## Integration Considerations

### Wake-Word Integration
- ASR should start when wake-word detected
- Stop ASR when silence detected or user stops speaking
- Stream audio chunks to ASR service
- Return text segments in real-time
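The wake-word handoff above can be sketched as a capture loop that collects audio after the wake word until sustained silence. All callables here (`read_chunk`, `is_silence`) are hypothetical placeholders for the real audio components:

```python
# Sketch of the wake-word → ASR handoff described above. read_chunk and
# is_silence are hypothetical placeholders for the real audio pipeline.
def capture_utterance(read_chunk, is_silence, max_silent_chunks=10):
    """Collect audio chunks after a wake word until sustained silence."""
    chunks, silent = [], 0
    while silent < max_silent_chunks:
        chunk = read_chunk()
        if chunk is None:                      # audio stream closed
            break
        chunks.append(chunk)
        # reset the silence counter whenever speech resumes
        silent = silent + 1 if is_silence(chunk) else 0
    return chunks
```

In a streaming deployment each chunk would also be forwarded to the ASR service as it arrives rather than only collected; the stop condition stays the same.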
### API Design

- Endpoint: WebSocket `/asr/stream`
- Input: audio stream (PCM, 16kHz, mono)
- Output: JSON with text segments and timestamps
- Format: `{ "text": "transcribed text", "timestamp": 1234.56, "confidence": 0.95, "is_final": false }`
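On the client side, the segment format above maps naturally onto a small dataclass. A minimal parsing sketch (the `AsrSegment` name is an assumption; the field names follow the format above):

```python
import json
from dataclasses import dataclass

# Sketch: parse one ASR segment message in the JSON format shown above.
@dataclass
class AsrSegment:
    text: str
    timestamp: float
    confidence: float
    is_final: bool

def parse_segment(raw: str) -> AsrSegment:
    msg = json.loads(raw)
    return AsrSegment(msg["text"], msg["timestamp"],
                      msg["confidence"], msg["is_final"])
```

The `is_final` flag lets the UI show provisional text immediately and replace it when the final segment arrives.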
### Resource Management
- If on 4080: Pause LLM during ASR processing (or use separate GPU)
- If on CPU: No conflicts, can run continuously
- Monitor GPU/CPU usage and adjust model size if needed
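The "pause LLM during ASR" policy on a shared 4080 can be sketched as a simple mutex guarding the GPU; whichever workload holds the lock owns the device, and the other blocks. The function names and callables are placeholders, and a production system would likely add queueing and priorities:

```python
import threading

# Sketch: time-multiplex one GPU between ASR and the LLM, as described
# above. Whoever holds the lock owns the GPU; the other workload waits.
gpu_lock = threading.Lock()

def run_asr(transcribe, audio):
    with gpu_lock:                 # LLM requests queue behind this
        return transcribe(audio)

def run_llm(generate, prompt):
    with gpu_lock:                 # ASR likewise waits for generation to finish
        return generate(prompt)
```

This gives ASR and the LLM mutual exclusion rather than true preemption; if mid-generation pausing is needed, the LLM server would have to expose an interrupt hook instead.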
## Performance Targets
| Hardware | Model | Target Latency | Status |
|---|---|---|---|
| RTX 4080 | small | < 1s | ✅ Achievable |
| RTX 4080 | medium | < 1.5s | ✅ Achievable |
| GTX 1050 | small | < 2s | ✅ Achievable |
| CPU (modern) | small | < 4s | ✅ Achievable |
| CPU (Pi 4) | tiny | < 8s | ⚠️ Acceptable |
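For the benchmarking step (TICKET-012), a tiny harness like this can check measured latency against the target budgets in the table. The `transcribe` callable is a placeholder for the real engine call:

```python
import time

# Sketch: measure per-utterance ASR latency against a target budget,
# matching the table above. `transcribe` is a placeholder for the engine.
def measure_latency(transcribe, audio, target_s):
    start = time.perf_counter()
    text = transcribe(audio)
    elapsed = time.perf_counter() - start
    return text, elapsed, elapsed <= target_s
```

Running this over a set of recorded utterances per hardware/model pair would fill in the "Status" column with measured numbers instead of estimates.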
## Next Steps
- ✅ ASR engine selected: faster-whisper
- ✅ Deployment decided: RTX 4080 (primary) or CPU node (alternative)
- ✅ Model size: small (or medium if GPU headroom)
- Implement ASR service (TICKET-010)
- Define ASR API contract (TICKET-011)
- Benchmark actual performance (TICKET-012)
Last Updated: 2024-01-XX
Status: Evaluation Complete - Ready for Implementation (TICKET-010)