# ASR Engine Evaluation and Selection
## Overview
This document evaluates Automatic Speech Recognition (ASR) engines for the Atlas voice agent system, considering deployment options on RTX 4080, RTX 1050, or CPU-only hardware.
## Evaluation Criteria
### Requirements
- **Latency**: < 2s end-to-end (audio in, text out) for interactive use
- **Accuracy**: Low word error rate (WER) on conversational speech
- **Resource Usage**: Efficient GPU/CPU utilization
- **Streaming**: Support for real-time audio streaming
- **Model Size**: Balance between quality and resource usage
- **Integration**: Easy integration with wake-word events
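WER counts word-level substitutions, deletions, and insertions relative to the reference length, and can be computed with a short edit-distance routine. A minimal sketch (the `wer` function is illustrative, not part of the Atlas codebase):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One deleted word ("the") + one substitution ("lights" -> "light") over 5 reference words
print(wer("turn on the kitchen lights", "turn on kitchen light"))  # 0.4
```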
## ASR Engine Options
### 1. faster-whisper (Recommended)
**Description**: Optimized Whisper implementation using CTranslate2
**Pros:**
- **Best performance** - 4x faster than original Whisper
- GPU acceleration (CUDA) support
- Streaming support available
- Multiple model sizes (tiny, small, medium, large)
- Good accuracy for conversational speech
- Active development and maintenance
- Python API, easy integration
**Cons:**
- Requires CUDA for GPU acceleration
- Model files are large (small: 500MB, medium: 1.5GB)
**Performance:**
- **GPU (4080)**: ~0.5-1s latency (medium model)
- **GPU (1050)**: ~1-2s latency (small model)
- **CPU**: ~2-4s latency (small model)
**Model Sizes:**
- **tiny**: ~75MB, fastest, lower accuracy
- **small**: ~500MB, good balance (recommended)
- **medium**: ~1.5GB, higher accuracy
- **large**: ~3GB, best accuracy, slower
**Recommendation**: **Primary choice** - Best balance of speed and accuracy
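A minimal transcription sketch using the faster-whisper Python API; model size, device, and compute type are deployment choices, and the `pick_compute_type` helper is illustrative (a 4 GB card like the 1050 may also prefer `int8`):

```python
from typing import Optional


def pick_compute_type(device: str) -> str:
    """A common default: float16 on GPU, int8 quantization on CPU."""
    return "float16" if device == "cuda" else "int8"


def transcribe(path: str, model_size: str = "small", device: str = "cuda") -> str:
    # Lazy import so the helper above works without faster-whisper installed
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device=device,
                         compute_type=pick_compute_type(device))
    segments, info = model.transcribe(path, beam_size=5)
    return " ".join(seg.text.strip() for seg in segments)
```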
### 2. Whisper.cpp
**Description**: C++ port of Whisper, optimized for CPU
**Pros:**
- Very efficient CPU implementation
- Low memory footprint
- Cross-platform (Linux, macOS, Windows)
- Can run on small devices (Raspberry Pi)
- Streaming support
**Cons:**
- No GPU acceleration (CPU-only)
- Slower than faster-whisper on GPU
- Less Python-friendly (C++ API)
**Performance:**
- **CPU**: ~2-3s latency (small model)
- **Raspberry Pi**: ~5-8s latency (tiny model)
**Recommendation**: Good for CPU-only deployment or small devices
### 3. OpenAI Whisper (Original)
**Description**: Original PyTorch implementation
**Pros:**
- Reference implementation
- Well-documented
- Easy to use
**Cons:**
- Slowest option (4x slower than faster-whisper)
- Higher memory usage
- Not optimized for production
**Recommendation**: Not recommended - Use faster-whisper instead
### 4. Other Options
**Vosk**:
- Pros: Very fast, lightweight, runs on embedded devices
- Cons: Noticeably lower accuracy than Whisper-family models on conversational speech
- Recommendation: Not suitable for general speech
**DeepSpeech**:
- Pros: Open source, lightweight
- Cons: Lower accuracy; discontinued by Mozilla, no longer maintained
- Recommendation: Not recommended
## Deployment Options
### Option A: faster-whisper on RTX 4080 (Recommended)
**Configuration:**
- **Engine**: faster-whisper
- **Model**: medium (best accuracy) or small (faster)
- **Hardware**: RTX 4080 (shared with work agent LLM)
- **Latency**: ~0.5-1s (medium), ~0.3-0.7s (small)
**Pros:**
- Lowest latency
- Best accuracy (with medium model)
- No additional hardware needed
- Can share GPU with LLM (time-multiplexed)
**Cons:**
- GPU resource contention with LLM
- May need to pause LLM during ASR processing
**Recommendation**: **Best for quality** - Use if 4080 has headroom
### Option B: faster-whisper on RTX 1050
**Configuration:**
- **Engine**: faster-whisper
- **Model**: small (fits in 4GB VRAM)
- **Hardware**: RTX 1050 (shared with family agent LLM)
- **Latency**: ~1-2s
**Pros:**
- Good latency
- No additional hardware
- Can share with family agent LLM
**Cons:**
- VRAM constraints (4GB is tight)
- May conflict with family agent LLM
- Only small model fits
**Recommendation**: **Possible but tight** - Consider CPU option
### Option C: faster-whisper on CPU (Small Box)
**Configuration:**
- **Engine**: faster-whisper
- **Model**: small or tiny
- **Hardware**: Always-on node (Pi/NUC/SFF PC)
- **Latency**: ~2-4s (small), ~1-2s (tiny)
**Pros:**
- No GPU resource contention
- Dedicated hardware for ASR
- Can run 24/7 without affecting LLM servers
- Lower power consumption
**Cons:**
- Higher latency (2-4s)
- Requires additional hardware
- Lower accuracy with tiny model
**Recommendation**: **Good for separation** - Best if you want dedicated ASR
### Option D: Whisper.cpp on CPU (Small Box)
**Configuration:**
- **Engine**: Whisper.cpp
- **Model**: small
- **Hardware**: Always-on node
- **Latency**: ~2-3s
**Pros:**
- Very efficient CPU usage
- Low memory footprint
- Good for resource-constrained devices
**Cons:**
- No GPU acceleration
- Slower than faster-whisper on GPU
**Recommendation**: Good alternative to faster-whisper on CPU
## Model Size Selection
### Small Model (Recommended for most cases)
- **Size**: ~500MB
- **Accuracy**: Good for conversational speech
- **Latency**: 0.5-2s (depending on hardware)
- **Use Case**: General voice agent interactions
### Medium Model (Best accuracy)
- **Size**: ~1.5GB
- **Accuracy**: Excellent for conversational speech
- **Latency**: 0.5-1s (on GPU)
- **Use Case**: If quality is critical and GPU available
### Tiny Model (Fastest, lower accuracy)
- **Size**: ~75MB
- **Accuracy**: Acceptable for simple commands
- **Latency**: 0.3-1s
- **Use Case**: Resource-constrained or very low latency needed
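The trade-off above can be captured in a small selection helper. Thresholds below are this document's rough figures, not measured values, and the function name is illustrative:

```python
from typing import Optional


def choose_model(gpu_vram_gb: Optional[float], latency_budget_s: float) -> str:
    """Map available hardware and a latency budget to a Whisper model size."""
    if gpu_vram_gb is None:                 # CPU-only node
        return "small" if latency_budget_s >= 4 else "tiny"
    if gpu_vram_gb >= 8:                    # e.g. RTX 4080
        return "medium" if latency_budget_s >= 1.5 else "small"
    return "small"                          # e.g. RTX 1050, 4 GB is tight
```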
## Final Recommendation
### Primary Choice: faster-whisper on RTX 4080
**Configuration:**
- **Engine**: faster-whisper
- **Model**: small (or medium if GPU headroom available)
- **Hardware**: RTX 4080 (shared with work agent)
- **Deployment**: Time-multiplexed with LLM (pause LLM during ASR)
**Rationale:**
- Best balance of latency and accuracy
- No additional hardware needed
- Can share GPU efficiently
- Small model provides good accuracy with low latency
### Alternative: faster-whisper on CPU (Always-on Node)
**Configuration:**
- **Engine**: faster-whisper
- **Model**: small
- **Hardware**: Dedicated always-on node (Pi 4+, NUC, or SFF PC)
- **Deployment**: Separate from LLM servers
**Rationale:**
- No GPU resource contention
- Dedicated hardware for ASR
- Acceptable latency (2-4s) for voice interactions
- Better separation of concerns
## Integration Considerations
### Wake-Word Integration
- ASR should start when wake-word detected
- Stop ASR when silence detected or user stops speaking
- Stream audio chunks to ASR service
- Return text segments in real-time
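The wake-word-to-ASR handoff can be modeled as a tiny session state machine. A sketch only: class and method names are illustrative, and real silence detection would use an energy/VAD threshold per chunk rather than a bare flag:

```python
class AsrSession:
    """Opens on wake-word, buffers audio chunks, closes after sustained silence."""

    def __init__(self, silence_chunks_to_stop: int = 3):
        self.active = False
        self.buffer: list[bytes] = []
        self._silent = 0
        self._limit = silence_chunks_to_stop

    def on_wake_word(self) -> None:
        self.active, self._silent = True, 0
        self.buffer.clear()

    def on_chunk(self, chunk: bytes, is_silence: bool) -> bool:
        """Feed one audio chunk; returns True while the session should keep streaming."""
        if not self.active:
            return False
        self.buffer.append(chunk)
        self._silent = self._silent + 1 if is_silence else 0
        if self._silent >= self._limit:
            self.active = False   # hand the buffered audio to the ASR engine here
        return self.active
```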
### API Design
- **Endpoint**: WebSocket `/asr/stream`
- **Input**: Audio stream (PCM, 16kHz, mono)
- **Output**: JSON with text segments and timestamps
- **Format**:
```json
{
  "text": "transcribed text",
  "timestamp": 1234.56,
  "confidence": 0.95,
  "is_final": false
}
```
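On the client side, each WebSocket frame can be decoded and checked against this shape; a minimal sketch using only the standard library (`parse_segment` is an illustrative name, not a defined API):

```python
import json

# Expected field names and types for one ASR segment message
REQUIRED = {"text": str, "timestamp": (int, float),
            "confidence": (int, float), "is_final": bool}


def parse_segment(frame: str) -> dict:
    """Decode one segment message and validate field names and types."""
    msg = json.loads(frame)
    for field, typ in REQUIRED.items():
        if not isinstance(msg.get(field), typ):
            raise ValueError(f"bad or missing field: {field}")
    return msg


seg = parse_segment(
    '{"text": "turn on", "timestamp": 1234.56, "confidence": 0.95, "is_final": false}'
)
```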
### Resource Management
- If on 4080: Pause LLM during ASR processing (or use separate GPU)
- If on CPU: No conflicts, can run continuously
- Monitor GPU/CPU usage and adjust model size if needed
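Time-multiplexing on the 4080 can be expressed as simple mutual exclusion shared by the ASR and LLM paths. A sketch with placeholder bodies; in a multi-process deployment this would be a cross-process lock or a queue in front of the GPU:

```python
import threading

# Single lock guarding the shared GPU: ASR and LLM never run concurrently
gpu_lock = threading.Lock()


def run_asr(audio: bytes) -> str:
    with gpu_lock:  # LLM generation waits while ASR holds the GPU
        return f"transcript of {len(audio)} bytes"   # placeholder for the real ASR call


def run_llm(prompt: str) -> str:
    with gpu_lock:  # ASR requests queue behind a long generation
        return f"reply to: {prompt}"                 # placeholder for the real LLM call
```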
## Performance Targets
| Hardware | Model | Target Latency | Status |
|----------|-------|---------------|--------|
| RTX 4080 | small | < 1s | Achievable |
| RTX 4080 | medium | < 1.5s | Achievable |
| RTX 1050 | small | < 2s | Achievable |
| CPU (modern) | small | < 4s | Achievable |
| CPU (Pi 4) | tiny | < 8s | Acceptable |
## Next Steps
1. ASR engine selected: **faster-whisper**
2. Deployment decided: **RTX 4080 (primary)** or **CPU node (alternative)**
3. Model size: **small** (or medium if GPU headroom)
4. Implement ASR service (TICKET-010)
5. Define ASR API contract (TICKET-011)
6. Benchmark actual performance (TICKET-012)
## References
- [faster-whisper GitHub](https://github.com/guillaumekln/faster-whisper)
- [Whisper.cpp GitHub](https://github.com/ggerganov/whisper.cpp)
- [OpenAI Whisper](https://github.com/openai/whisper)
- [ASR Benchmarking](https://github.com/robflynnyh/whisper-benchmark)
---
**Last Updated**: 2024-01-XX
**Status**: Evaluation Complete - Ready for Implementation (TICKET-010)