# ASR Engine Evaluation and Selection
## Overview
This document evaluates Automatic Speech Recognition (ASR) engines for the Atlas voice agent system, considering deployment options on RTX 4080, RTX 1050, or CPU-only hardware.
## Evaluation Criteria
### Requirements
- **Latency**: < 2s end-to-end (audio in, text out) for interactive use
- **Accuracy**: Low word error rate (WER) on conversational speech
- **Resource Usage**: Efficient GPU/CPU utilization
- **Streaming**: Support for real-time audio streaming
- **Model Size**: Balance between quality and resource usage
- **Integration**: Easy integration with wake-word events
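WER counts word-level substitutions, deletions, and insertions relative to the reference length, and can be computed with a short edit-distance routine. A minimal sketch (the `wer` function is illustrative, not part of the Atlas codebase):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One deleted word ("the") + one substitution ("lights" -> "light") over 5 reference words
print(wer("turn on the kitchen lights", "turn on kitchen light"))  # 0.4
```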
## ASR Engine Options
### 1. faster-whisper (Recommended)
**Description**: Optimized Whisper implementation using CTranslate2
**Pros:**
- **Best performance** - 4x faster than original Whisper
- GPU acceleration (CUDA) support
- Streaming support available
- Multiple model sizes (tiny, small, medium, large)
- Good accuracy for conversational speech
- Active development and maintenance
- Python API, easy integration
**Cons:**
- Requires CUDA for GPU acceleration
- Model files are large (small: 500MB, medium: 1.5GB)
**Performance:**
- **GPU (4080)**: ~0.5-1s latency (medium model)
- **GPU (1050)**: ~1-2s latency (small model)
- **CPU**: ~2-4s latency (small model)
**Model Sizes:**
- **tiny**: ~75MB, fastest, lower accuracy
- **small**: ~500MB, good balance (recommended)
- **medium**: ~1.5GB, higher accuracy
- **large**: ~3GB, best accuracy, slower
**Recommendation**: **Primary choice** - Best balance of speed and accuracy
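A minimal transcription sketch using the faster-whisper Python API; model size, device, and compute type are deployment choices, and the `pick_compute_type` helper is illustrative (a 4 GB card like the 1050 may also prefer `int8`):

```python
from typing import Optional


def pick_compute_type(device: str) -> str:
    """A common default: float16 on GPU, int8 quantization on CPU."""
    return "float16" if device == "cuda" else "int8"


def transcribe(path: str, model_size: str = "small", device: str = "cuda") -> str:
    # Lazy import so the helper above works without faster-whisper installed
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device=device,
                         compute_type=pick_compute_type(device))
    segments, info = model.transcribe(path, beam_size=5)
    return " ".join(seg.text.strip() for seg in segments)
```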
### 2. Whisper.cpp
**Description**: C++ port of Whisper, optimized for CPU
**Pros:**
- Very efficient CPU implementation
- Low memory footprint
- Cross-platform (Linux, macOS, Windows)
- Can run on small devices (Raspberry Pi)
- Streaming support
**Cons:**
- No GPU acceleration (CPU-only)
- Slower than faster-whisper on GPU
- Less Python-friendly (C++ API)
**Performance:**
- **CPU**: ~2-3s latency (small model)
- **Raspberry Pi**: ~5-8s latency (tiny model)
**Recommendation**: Good for CPU-only deployment or small devices
### 3. OpenAI Whisper (Original)
**Description**: Original PyTorch implementation
**Pros:**
- Reference implementation
- Well-documented
- Easy to use
**Cons:**
- Slowest option (4x slower than faster-whisper)
- Higher memory usage
- Not optimized for production
**Recommendation**: Not recommended - Use faster-whisper instead
### 4. Other Options
**Vosk**:
- Pros: Very fast, lightweight, runs on embedded devices
- Cons: Noticeably lower accuracy than Whisper-family models on conversational speech
- Recommendation: Not suitable for general speech
**DeepSpeech**:
- Pros: Open source, lightweight
- Cons: Lower accuracy; discontinued by Mozilla, no longer maintained
- Recommendation: Not recommended
## Deployment Options
### Option A: faster-whisper on RTX 4080 (Recommended)
**Configuration:**
- **Engine**: faster-whisper
- **Model**: medium (best accuracy) or small (faster)
- **Hardware**: RTX 4080 (shared with work agent LLM)
- **Latency**: ~0.5-1s (medium), ~0.3-0.7s (small)
**Pros:**
- Lowest latency
- Best accuracy (with medium model)
- No additional hardware needed
- Can share GPU with LLM (time-multiplexed)
**Cons:**
- GPU resource contention with LLM
- May need to pause LLM during ASR processing
**Recommendation**: **Best for quality** - Use if 4080 has headroom
### Option B: faster-whisper on RTX 1050
**Configuration:**
- **Engine**: faster-whisper
- **Model**: small (fits in 4GB VRAM)
- **Hardware**: RTX 1050 (shared with family agent LLM)
- **Latency**: ~1-2s
**Pros:**
- Good latency
- No additional hardware
- Can share with family agent LLM
**Cons:**
- VRAM constraints (4GB is tight)
- May conflict with family agent LLM
- Only small model fits
**Recommendation**: **Possible but tight** - Consider CPU option
### Option C: faster-whisper on CPU (Small Box)
**Configuration:**
- **Engine**: faster-whisper
- **Model**: small or tiny
- **Hardware**: Always-on node (Pi/NUC/SFF PC)
- **Latency**: ~2-4s (small), ~1-2s (tiny)
**Pros:**
- No GPU resource contention
- Dedicated hardware for ASR
- Can run 24/7 without affecting LLM servers
- Lower power consumption
**Cons:**
- Higher latency (2-4s)
- Requires additional hardware
- Lower accuracy with tiny model
**Recommendation**: **Good for separation** - Best if you want dedicated ASR
### Option D: Whisper.cpp on CPU (Small Box)
**Configuration:**
- **Engine**: Whisper.cpp
- **Model**: small
- **Hardware**: Always-on node
- **Latency**: ~2-3s
**Pros:**
- Very efficient CPU usage
- Low memory footprint
- Good for resource-constrained devices
**Cons:**
- No GPU acceleration
- Slower than faster-whisper on GPU
**Recommendation**: Good alternative to faster-whisper on CPU
## Model Size Selection
### Small Model (Recommended for most cases)
- **Size**: ~500MB
- **Accuracy**: Good for conversational speech
- **Latency**: 0.5-2s (depending on hardware)
- **Use Case**: General voice agent interactions
### Medium Model (Best accuracy)
- **Size**: ~1.5GB
- **Accuracy**: Excellent for conversational speech
- **Latency**: 0.5-1s (on GPU)
- **Use Case**: If quality is critical and GPU available
### Tiny Model (Fastest, lower accuracy)
- **Size**: ~75MB
- **Accuracy**: Acceptable for simple commands
- **Latency**: 0.3-1s
- **Use Case**: Resource-constrained or very low latency needed
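The trade-off above can be captured in a small selection helper. Thresholds below are this document's rough figures, not measured values, and the function name is illustrative:

```python
from typing import Optional


def choose_model(gpu_vram_gb: Optional[float], latency_budget_s: float) -> str:
    """Map available hardware and a latency budget to a Whisper model size."""
    if gpu_vram_gb is None:                 # CPU-only node
        return "small" if latency_budget_s >= 4 else "tiny"
    if gpu_vram_gb >= 8:                    # e.g. RTX 4080
        return "medium" if latency_budget_s >= 1.5 else "small"
    return "small"                          # e.g. RTX 1050, 4 GB is tight
```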
## Final Recommendation
### Primary Choice: faster-whisper on RTX 4080
**Configuration:**
- **Engine**: faster-whisper
- **Model**: small (or medium if GPU headroom available)
- **Hardware**: RTX 4080 (shared with work agent)
- **Deployment**: Time-multiplexed with LLM (pause LLM during ASR)
**Rationale:**
- Best balance of latency and accuracy
- No additional hardware needed
- Can share GPU efficiently
- Small model provides good accuracy with low latency
### Alternative: faster-whisper on CPU (Always-on Node)
**Configuration:**
- **Engine**: faster-whisper
- **Model**: small
- **Hardware**: Dedicated always-on node (Pi 4+, NUC, or SFF PC)
- **Deployment**: Separate from LLM servers
**Rationale:**
- No GPU resource contention
- Dedicated hardware for ASR
- Acceptable latency (2-4s) for voice interactions
- Better separation of concerns
## Integration Considerations
### Wake-Word Integration
- ASR should start when wake-word detected
- Stop ASR when silence detected or user stops speaking
- Stream audio chunks to ASR service
- Return text segments in real-time
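The wake-word-to-ASR handoff can be modeled as a tiny session state machine. A sketch only: class and method names are illustrative, and real silence detection would use an energy/VAD threshold per chunk rather than a bare flag:

```python
class AsrSession:
    """Opens on wake-word, buffers audio chunks, closes after sustained silence."""

    def __init__(self, silence_chunks_to_stop: int = 3):
        self.active = False
        self.buffer: list[bytes] = []
        self._silent = 0
        self._limit = silence_chunks_to_stop

    def on_wake_word(self) -> None:
        self.active, self._silent = True, 0
        self.buffer.clear()

    def on_chunk(self, chunk: bytes, is_silence: bool) -> bool:
        """Feed one audio chunk; returns True while the session should keep streaming."""
        if not self.active:
            return False
        self.buffer.append(chunk)
        self._silent = self._silent + 1 if is_silence else 0
        if self._silent >= self._limit:
            self.active = False   # hand the buffered audio to the ASR engine here
        return self.active
```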
### API Design
- **Endpoint**: WebSocket `/asr/stream`
- **Input**: Audio stream (PCM, 16kHz, mono)
- **Output**: JSON with text segments and timestamps
- **Format**:
```json
{
  "text": "transcribed text",
  "timestamp": 1234.56,
  "confidence": 0.95,
  "is_final": false
}
```
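On the client side, each WebSocket frame can be decoded and checked against this shape; a minimal sketch using only the standard library (`parse_segment` is an illustrative name, not a defined API):

```python
import json

# Expected field names and types for one ASR segment message
REQUIRED = {"text": str, "timestamp": (int, float),
            "confidence": (int, float), "is_final": bool}


def parse_segment(frame: str) -> dict:
    """Decode one segment message and validate field names and types."""
    msg = json.loads(frame)
    for field, typ in REQUIRED.items():
        if not isinstance(msg.get(field), typ):
            raise ValueError(f"bad or missing field: {field}")
    return msg


seg = parse_segment(
    '{"text": "turn on", "timestamp": 1234.56, "confidence": 0.95, "is_final": false}'
)
```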
### Resource Management
- If on 4080: Pause LLM during ASR processing (or use separate GPU)
- If on CPU: No conflicts, can run continuously
- Monitor GPU/CPU usage and adjust model size if needed
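Time-multiplexing on the 4080 can be expressed as simple mutual exclusion shared by the ASR and LLM paths. A sketch with placeholder bodies; in a multi-process deployment this would be a cross-process lock or a queue in front of the GPU:

```python
import threading

# Single lock guarding the shared GPU: ASR and LLM never run concurrently
gpu_lock = threading.Lock()


def run_asr(audio: bytes) -> str:
    with gpu_lock:  # LLM generation waits while ASR holds the GPU
        return f"transcript of {len(audio)} bytes"   # placeholder for the real ASR call


def run_llm(prompt: str) -> str:
    with gpu_lock:  # ASR requests queue behind a long generation
        return f"reply to: {prompt}"                 # placeholder for the real LLM call
```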
## Performance Targets
| Hardware | Model | Target Latency | Status |
|----------|-------|---------------|--------|
| RTX 4080 | small | < 1s | Achievable |
| RTX 4080 | medium | < 1.5s | Achievable |
| RTX 1050 | small | < 2s | Achievable |
| CPU (modern) | small | < 4s | Achievable |
| CPU (Pi 4) | tiny | < 8s | Acceptable |
## Next Steps
1. ASR engine selected: **faster-whisper**
2. Deployment decided: **RTX 4080 (primary)** or **CPU node (alternative)**
3. Model size: **small** (or medium if GPU headroom)
4. Implement ASR service (TICKET-010)
5. Define ASR API contract (TICKET-011)
6. Benchmark actual performance (TICKET-012)
## References
- [faster-whisper GitHub](https://github.com/guillaumekln/faster-whisper)
- [Whisper.cpp GitHub](https://github.com/ggerganov/whisper.cpp)
- [OpenAI Whisper](https://github.com/openai/whisper)
- [ASR Benchmarking](https://github.com/robflynnyh/whisper-benchmark)
---
**Last Updated**: 2024-01-XX
**Status**: Evaluation Complete - Ready for Implementation (TICKET-010)