
ASR Engine Evaluation and Selection

Overview

This document evaluates Automatic Speech Recognition (ASR) engines for the Atlas voice agent system, considering deployment options on RTX 4080, RTX 1050, or CPU-only hardware.

Evaluation Criteria

Requirements

  • Latency: < 2s end-to-end (audio in → text out) for interactive use
  • Accuracy: Low word error rate (WER) on conversational speech
  • Resource Usage: Efficient GPU/CPU utilization
  • Streaming: Support for real-time audio streaming
  • Model Size: Balance between quality and resource usage
  • Integration: Easy integration with wake-word events

ASR Engine Options

1. faster-whisper

Description: Optimized Whisper implementation using CTranslate2

Pros:

  • Best performance - 4x faster than original Whisper
  • GPU acceleration (CUDA) support
  • Streaming support available
  • Multiple model sizes (tiny, small, medium, large)
  • Good accuracy for conversational speech
  • Active development and maintenance
  • Python API, easy integration

Cons:

  • Requires CUDA for GPU acceleration
  • Model files are large (small: 500MB, medium: 1.5GB)

Performance:

  • GPU (4080): ~0.5-1s latency (medium model)
  • GPU (1050): ~1-2s latency (small model)
  • CPU: ~2-4s latency (small model)

Model Sizes:

  • tiny: ~75MB, fastest, lower accuracy
  • small: ~500MB, good balance (recommended)
  • medium: ~1.5GB, higher accuracy
  • large: ~3GB, best accuracy, slower

Recommendation: Primary choice - Best balance of speed and accuracy
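The Python API mentioned above takes only a few lines to use. A minimal sketch (the model size, device settings, and the `join_segments` helper are illustrative; the first run downloads the model, and `device`/`compute_type` should match your hardware, e.g. `"cpu"` with `"int8"` for CPU-only boxes):

```python
# Minimal faster-whisper transcription sketch.
# The faster_whisper import is deferred so this module loads even where the
# package is not installed; a real service would import it at module level.

def join_segments(segments) -> str:
    """Concatenate decoded segment texts into a single transcript string."""
    return " ".join(seg.text.strip() for seg in segments)

def transcribe(audio_path: str, model_size: str = "small") -> str:
    """Transcribe one audio file and return the full text."""
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    segments, info = model.transcribe(audio_path, beam_size=5)
    return join_segments(segments)

# Usage (requires faster-whisper installed and an audio file on disk):
#   text = transcribe("utterance.wav")
```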

2. Whisper.cpp

Description: C++ port of Whisper, optimized for CPU

Pros:

  • Very efficient CPU implementation
  • Low memory footprint
  • Cross-platform (Linux, macOS, Windows)
  • Can run on small devices (Raspberry Pi)
  • Streaming support

Cons:

  • ⚠️ No GPU acceleration (CPU-only)
  • ⚠️ Slower than faster-whisper on GPU
  • ⚠️ Less Python-friendly (C++ API)

Performance:

  • CPU: ~2-3s latency (small model)
  • Raspberry Pi: ~5-8s latency (tiny model)

Recommendation: Good for CPU-only deployment or small devices

3. OpenAI Whisper (Original)

Description: Original PyTorch implementation

Pros:

  • Reference implementation
  • Well-documented
  • Easy to use

Cons:

  • Slowest option (4x slower than faster-whisper)
  • Higher memory usage
  • Not optimized for production

Recommendation: Not recommended - Use faster-whisper instead

4. Other Options

Vosk:

  • Pros: Very fast, lightweight, streaming-first
  • Cons: Noticeably lower accuracy than Whisper-family models on open-domain speech
  • Recommendation: Not suitable for general conversational speech

DeepSpeech:

  • Pros: Open source, lightweight
  • Cons: Lower accuracy; no longer maintained (Mozilla discontinued the project)
  • Recommendation: Not recommended

Deployment Options

Option A: faster-whisper on RTX 4080

Configuration:

  • Engine: faster-whisper
  • Model: medium (best accuracy) or small (faster)
  • Hardware: RTX 4080 (shared with work agent LLM)
  • Latency: ~0.5-1s (medium), ~0.3-0.7s (small)

Pros:

  • Lowest latency
  • Best accuracy (with medium model)
  • No additional hardware needed
  • Can share GPU with LLM (time-multiplexed)

Cons:

  • ⚠️ GPU resource contention with LLM
  • ⚠️ May need to pause LLM during ASR processing

Recommendation: Best for quality - Use if 4080 has headroom

Option B: faster-whisper on RTX 1050

Configuration:

  • Engine: faster-whisper
  • Model: small (fits in 4GB VRAM)
  • Hardware: RTX 1050 (shared with family agent LLM)
  • Latency: ~1-2s

Pros:

  • Good latency
  • No additional hardware
  • Can share with family agent LLM

Cons:

  • ⚠️ VRAM constraints (4GB is tight)
  • ⚠️ May conflict with family agent LLM
  • ⚠️ Only small model fits

Recommendation: ⚠️ Possible but tight - Consider CPU option

Option C: faster-whisper on CPU (Small Box)

Configuration:

  • Engine: faster-whisper
  • Model: small or tiny
  • Hardware: Always-on node (Pi/NUC/SFF PC)
  • Latency: ~2-4s (small), ~1-2s (tiny)

Pros:

  • No GPU resource contention
  • Dedicated hardware for ASR
  • Can run 24/7 without affecting LLM servers
  • Lower power consumption

Cons:

  • ⚠️ Higher latency (2-4s)
  • ⚠️ Requires additional hardware
  • ⚠️ Lower accuracy with tiny model

Recommendation: Good for separation - Best if you want dedicated ASR

Option D: Whisper.cpp on CPU (Small Box)

Configuration:

  • Engine: Whisper.cpp
  • Model: small
  • Hardware: Always-on node
  • Latency: ~2-3s

Pros:

  • Very efficient CPU usage
  • Low memory footprint
  • Good for resource-constrained devices

Cons:

  • ⚠️ No GPU acceleration
  • ⚠️ Slower than faster-whisper on GPU

Recommendation: Good alternative to faster-whisper on CPU

Model Size Selection

Small Model (Recommended)

  • Size: ~500MB
  • Accuracy: Good for conversational speech
  • Latency: 0.5-2s (depending on hardware)
  • Use Case: General voice agent interactions

Medium Model (Best accuracy)

  • Size: ~1.5GB
  • Accuracy: Excellent for conversational speech
  • Latency: 0.5-1s (on GPU)
  • Use Case: If quality is critical and GPU available

Tiny Model (Fastest, lower accuracy)

  • Size: ~75MB
  • Accuracy: Acceptable for simple commands
  • Latency: 0.3-1s
  • Use Case: Resource-constrained or very low latency needed
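The size trade-offs above can be captured in a small selection helper. A sketch, assuming a hypothetical `pick_model` policy whose thresholds mirror the sizing notes in this document (medium needs GPU headroom, small fits ~4GB VRAM or a modern CPU, tiny is the constrained-CPU fallback):

```python
def pick_model(device: str, vram_gb: float = 0.0, constrained_cpu: bool = False) -> str:
    """Pick a Whisper model size from a coarse hardware profile.

    device          -- "cuda" for GPU deployment, anything else means CPU
    vram_gb         -- available VRAM; medium only when there is headroom
    constrained_cpu -- True for small devices like a Raspberry Pi
    """
    if device == "cuda":
        # Medium needs roughly 2-3 GB plus decode workspace; require headroom.
        return "medium" if vram_gb >= 8 else "small"
    return "tiny" if constrained_cpu else "small"
```

The thresholds here are illustrative; TICKET-012 benchmarking would determine the real cutoffs.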

Final Recommendation

Primary Choice: faster-whisper on RTX 4080

Configuration:

  • Engine: faster-whisper
  • Model: small (or medium if GPU headroom available)
  • Hardware: RTX 4080 (shared with work agent)
  • Deployment: Time-multiplexed with LLM (pause LLM during ASR)

Rationale:

  • Best balance of latency and accuracy
  • No additional hardware needed
  • Can share GPU efficiently
  • Small model provides good accuracy with low latency

Alternative: faster-whisper on CPU (Always-on Node)

Configuration:

  • Engine: faster-whisper
  • Model: small
  • Hardware: Dedicated always-on node (Pi 4+, NUC, or SFF PC)
  • Deployment: Separate from LLM servers

Rationale:

  • No GPU resource contention
  • Dedicated hardware for ASR
  • Acceptable latency (2-4s) for voice interactions
  • Better separation of concerns

Integration Considerations

Wake-Word Integration

  • ASR should start when wake-word detected
  • Stop ASR when silence detected or user stops speaking
  • Stream audio chunks to ASR service
  • Return text segments in real-time
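The start/stop logic above can be sketched as a small energy-based endpointer over fixed-size audio chunks. This is a simplified state machine with illustrative threshold and hangover values; a real deployment would more likely use a dedicated VAD (e.g. webrtcvad or Silero VAD):

```python
class Endpointer:
    """Tracks speech vs. silence after a wake-word event.

    Emits "start" on the first voiced chunk, "stop" after
    max_silence_chunks consecutive quiet chunks, otherwise
    "wait" (before speech) or "continue" (during speech).
    """

    def __init__(self, energy_threshold: float = 0.01, max_silence_chunks: int = 15):
        self.energy_threshold = energy_threshold      # RMS cutoff for "voiced"
        self.max_silence_chunks = max_silence_chunks  # e.g. 15 x 32 ms ~= 0.5 s
        self.silence_run = 0
        self.speaking = False

    def feed(self, chunk) -> str:
        """Classify one chunk of float samples and advance the state machine."""
        rms = (sum(x * x for x in chunk) / max(len(chunk), 1)) ** 0.5
        voiced = rms >= self.energy_threshold
        if not self.speaking:
            if voiced:
                self.speaking = True
                self.silence_run = 0
                return "start"   # begin streaming audio to the ASR service
            return "wait"
        if voiced:
            self.silence_run = 0
            return "continue"
        self.silence_run += 1
        if self.silence_run >= self.max_silence_chunks:
            self.speaking = False
            return "stop"        # end of utterance; finalize transcription
        return "continue"
```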

API Design

  • Endpoint: WebSocket /asr/stream
  • Input: Audio stream (PCM, 16kHz, mono)
  • Output: JSON with text segments and timestamps
  • Format:
    {
      "text": "transcribed text",
      "timestamp": 1234.56,
      "confidence": 0.95,
      "is_final": false
    }
    

Resource Management

  • If on 4080: Pause LLM during ASR processing (or use separate GPU)
  • If on CPU: No conflicts, can run continuously
  • Monitor GPU/CPU usage and adjust model size if needed
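Time-multiplexing on the 4080 can be enforced with a simple mutex around all GPU work. A minimal sketch (the `GpuArbiter` name and the idea of wrapping both LLM and ASR calls in it are assumptions, not an existing API):

```python
import threading
from contextlib import contextmanager

class GpuArbiter:
    """Serializes GPU work so ASR and LLM inference never run concurrently."""

    def __init__(self):
        self._lock = threading.Lock()
        self.holder = None  # name of the task currently holding the GPU

    @contextmanager
    def acquire(self, task: str):
        with self._lock:
            self.holder = task
            try:
                yield
            finally:
                self.holder = None

# Usage: the ASR service and the LLM server share one arbiter instance.
#   with arbiter.acquire("asr"):
#       ...run faster-whisper inference...
```

A lock is the simplest policy; a priority queue would let short ASR requests preempt long LLM generations if latency targets require it.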

Performance Targets

  Hardware       Model    Target Latency   Status
  RTX 4080       small    < 1s             Achievable
  RTX 4080       medium   < 1.5s           Achievable
  RTX 1050       small    < 2s             Achievable
  CPU (modern)   small    < 4s             Achievable
  CPU (Pi 4)     tiny     < 8s             ⚠️ Acceptable

Next Steps

  1. ASR engine selected: faster-whisper
  2. Deployment decided: RTX 4080 (primary) or CPU node (alternative)
  3. Model size: small (or medium if GPU headroom)
  4. Implement ASR service (TICKET-010)
  5. Define ASR API contract (TICKET-011)
  6. Benchmark actual performance (TICKET-012)

Last Updated: 2024-01-XX Status: Evaluation Complete - Ready for Implementation (TICKET-010)