
ASR Engine Evaluation and Selection

Overview

This document evaluates Automatic Speech Recognition (ASR) engines for the Atlas voice agent system, considering deployment options on RTX 4080, RTX 1050, or CPU-only hardware.

Evaluation Criteria

Requirements

  • Latency: < 2s end-to-end (audio in → text out) for interactive use
  • Accuracy: Low word error rate (WER) on conversational speech
  • Resource Usage: Efficient GPU/CPU utilization
  • Streaming: Support for real-time audio streaming
  • Model Size: Balance between quality and resource usage
  • Integration: Easy integration with wake-word events

ASR Engine Options

1. faster-whisper

Description: Optimized Whisper implementation using CTranslate2

Pros:

  • Best performance - 4x faster than original Whisper
  • GPU acceleration (CUDA) support
  • Streaming support available
  • Multiple model sizes (tiny, small, medium, large)
  • Good accuracy for conversational speech
  • Active development and maintenance
  • Python API, easy integration

Cons:

  • Requires CUDA for GPU acceleration
  • Model files are large (small: 500MB, medium: 1.5GB)

Performance:

  • GPU (4080): ~0.5-1s latency (medium model)
  • GPU (1050): ~1-2s latency (small model)
  • CPU: ~2-4s latency (small model)

Model Sizes:

  • tiny: ~75MB, fastest, lower accuracy
  • small: ~500MB, good balance (recommended)
  • medium: ~1.5GB, higher accuracy
  • large: ~3GB, best accuracy, slower

Recommendation: Primary choice - Best balance of speed and accuracy
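The Python API mentioned above takes only a few lines to use. A minimal sketch (the model size, device settings, and the `join_segments` helper are illustrative; the first run downloads the model, and `device`/`compute_type` should match your hardware, e.g. `"cpu"` with `"int8"` for CPU-only boxes):

```python
# Minimal faster-whisper transcription sketch.
# The faster_whisper import is deferred so this module loads even where the
# package is not installed; a real service would import it at module level.

def join_segments(segments) -> str:
    """Concatenate decoded segment texts into a single transcript string."""
    return " ".join(seg.text.strip() for seg in segments)

def transcribe(audio_path: str, model_size: str = "small") -> str:
    """Transcribe one audio file and return the full text."""
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    segments, info = model.transcribe(audio_path, beam_size=5)
    return join_segments(segments)

# Usage (requires faster-whisper installed and an audio file on disk):
#   text = transcribe("utterance.wav")
```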

2. Whisper.cpp

Description: C++ port of Whisper, optimized for CPU

Pros:

  • Very efficient CPU implementation
  • Low memory footprint
  • Cross-platform (Linux, macOS, Windows)
  • Can run on small devices (Raspberry Pi)
  • Streaming support

Cons:

  • ⚠️ No GPU acceleration (CPU-only)
  • ⚠️ Slower than faster-whisper on GPU
  • ⚠️ Less Python-friendly (C++ API)

Performance:

  • CPU: ~2-3s latency (small model)
  • Raspberry Pi: ~5-8s latency (tiny model)

Recommendation: Good for CPU-only deployment or small devices

3. OpenAI Whisper (Original)

Description: Original PyTorch implementation

Pros:

  • Reference implementation
  • Well-documented
  • Easy to use

Cons:

  • Slowest option (4x slower than faster-whisper)
  • Higher memory usage
  • Not optimized for production

Recommendation: Not recommended - Use faster-whisper instead

4. Other Options

Vosk:

  • Pros: Very fast, lightweight, streaming-first
  • Cons: Noticeably lower accuracy than Whisper-family models on open-domain speech
  • Recommendation: Not suitable for general conversational speech

DeepSpeech:

  • Pros: Open source, lightweight
  • Cons: Lower accuracy; no longer maintained (Mozilla discontinued the project)
  • Recommendation: Not recommended

Deployment Options

Option A: faster-whisper on RTX 4080

Configuration:

  • Engine: faster-whisper
  • Model: medium (best accuracy) or small (faster)
  • Hardware: RTX 4080 (shared with work agent LLM)
  • Latency: ~0.5-1s (medium), ~0.3-0.7s (small)

Pros:

  • Lowest latency
  • Best accuracy (with medium model)
  • No additional hardware needed
  • Can share GPU with LLM (time-multiplexed)

Cons:

  • ⚠️ GPU resource contention with LLM
  • ⚠️ May need to pause LLM during ASR processing

Recommendation: Best for quality - Use if 4080 has headroom

Option B: faster-whisper on RTX 1050

Configuration:

  • Engine: faster-whisper
  • Model: small (fits in 4GB VRAM)
  • Hardware: RTX 1050 (shared with family agent LLM)
  • Latency: ~1-2s

Pros:

  • Good latency
  • No additional hardware
  • Can share with family agent LLM

Cons:

  • ⚠️ VRAM constraints (4GB is tight)
  • ⚠️ May conflict with family agent LLM
  • ⚠️ Only small model fits

Recommendation: ⚠️ Possible but tight - Consider CPU option

Option C: faster-whisper on CPU (Small Box)

Configuration:

  • Engine: faster-whisper
  • Model: small or tiny
  • Hardware: Always-on node (Pi/NUC/SFF PC)
  • Latency: ~2-4s (small), ~1-2s (tiny)

Pros:

  • No GPU resource contention
  • Dedicated hardware for ASR
  • Can run 24/7 without affecting LLM servers
  • Lower power consumption

Cons:

  • ⚠️ Higher latency (2-4s)
  • ⚠️ Requires additional hardware
  • ⚠️ Lower accuracy with tiny model

Recommendation: Good for separation - Best if you want dedicated ASR

Option D: Whisper.cpp on CPU (Small Box)

Configuration:

  • Engine: Whisper.cpp
  • Model: small
  • Hardware: Always-on node
  • Latency: ~2-3s

Pros:

  • Very efficient CPU usage
  • Low memory footprint
  • Good for resource-constrained devices

Cons:

  • ⚠️ No GPU acceleration
  • ⚠️ Slower than faster-whisper on GPU

Recommendation: Good alternative to faster-whisper on CPU

Model Size Selection

Small Model (Recommended)

  • Size: ~500MB
  • Accuracy: Good for conversational speech
  • Latency: 0.5-2s (depending on hardware)
  • Use Case: General voice agent interactions

Medium Model (Best accuracy)

  • Size: ~1.5GB
  • Accuracy: Excellent for conversational speech
  • Latency: 0.5-1s (on GPU)
  • Use Case: If quality is critical and GPU available

Tiny Model (Fastest, lower accuracy)

  • Size: ~75MB
  • Accuracy: Acceptable for simple commands
  • Latency: 0.3-1s
  • Use Case: Resource-constrained or very low latency needed
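The size trade-offs above can be captured in a small selection helper. A sketch, assuming a hypothetical `pick_model` policy whose thresholds mirror the sizing notes in this document (medium needs GPU headroom, small fits ~4GB VRAM or a modern CPU, tiny is the constrained-CPU fallback):

```python
def pick_model(device: str, vram_gb: float = 0.0, constrained_cpu: bool = False) -> str:
    """Pick a Whisper model size from a coarse hardware profile.

    device          -- "cuda" for GPU deployment, anything else means CPU
    vram_gb         -- available VRAM; medium only when there is headroom
    constrained_cpu -- True for small devices like a Raspberry Pi
    """
    if device == "cuda":
        # Medium needs roughly 2-3 GB plus decode workspace; require headroom.
        return "medium" if vram_gb >= 8 else "small"
    return "tiny" if constrained_cpu else "small"
```

The thresholds here are illustrative; TICKET-012 benchmarking would determine the real cutoffs.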

Final Recommendation

Primary Choice: faster-whisper on RTX 4080

Configuration:

  • Engine: faster-whisper
  • Model: small (or medium if GPU headroom available)
  • Hardware: RTX 4080 (shared with work agent)
  • Deployment: Time-multiplexed with LLM (pause LLM during ASR)

Rationale:

  • Best balance of latency and accuracy
  • No additional hardware needed
  • Can share GPU efficiently
  • Small model provides good accuracy with low latency

Alternative: faster-whisper on CPU (Always-on Node)

Configuration:

  • Engine: faster-whisper
  • Model: small
  • Hardware: Dedicated always-on node (Pi 4+, NUC, or SFF PC)
  • Deployment: Separate from LLM servers

Rationale:

  • No GPU resource contention
  • Dedicated hardware for ASR
  • Acceptable latency (2-4s) for voice interactions
  • Better separation of concerns

Integration Considerations

Wake-Word Integration

  • ASR should start when wake-word detected
  • Stop ASR when silence detected or user stops speaking
  • Stream audio chunks to ASR service
  • Return text segments in real-time
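The start/stop logic above can be sketched as a small energy-based endpointer over fixed-size audio chunks. This is a simplified state machine with illustrative threshold and hangover values; a real deployment would more likely use a dedicated VAD (e.g. webrtcvad or Silero VAD):

```python
class Endpointer:
    """Tracks speech vs. silence after a wake-word event.

    Emits "start" on the first voiced chunk, "stop" after
    max_silence_chunks consecutive quiet chunks, otherwise
    "wait" (before speech) or "continue" (during speech).
    """

    def __init__(self, energy_threshold: float = 0.01, max_silence_chunks: int = 15):
        self.energy_threshold = energy_threshold      # RMS cutoff for "voiced"
        self.max_silence_chunks = max_silence_chunks  # e.g. 15 x 32 ms ~= 0.5 s
        self.silence_run = 0
        self.speaking = False

    def feed(self, chunk) -> str:
        """Classify one chunk of float samples and advance the state machine."""
        rms = (sum(x * x for x in chunk) / max(len(chunk), 1)) ** 0.5
        voiced = rms >= self.energy_threshold
        if not self.speaking:
            if voiced:
                self.speaking = True
                self.silence_run = 0
                return "start"   # begin streaming audio to the ASR service
            return "wait"
        if voiced:
            self.silence_run = 0
            return "continue"
        self.silence_run += 1
        if self.silence_run >= self.max_silence_chunks:
            self.speaking = False
            return "stop"        # end of utterance; finalize transcription
        return "continue"
```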

API Design

  • Endpoint: WebSocket /asr/stream
  • Input: Audio stream (PCM, 16kHz, mono)
  • Output: JSON with text segments and timestamps
  • Format:
    {
      "text": "transcribed text",
      "timestamp": 1234.56,
      "confidence": 0.95,
      "is_final": false
    }
    

Resource Management

  • If on 4080: Pause LLM during ASR processing (or use separate GPU)
  • If on CPU: No conflicts, can run continuously
  • Monitor GPU/CPU usage and adjust model size if needed
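Time-multiplexing on the 4080 can be enforced with a simple mutex around all GPU work. A minimal sketch (the `GpuArbiter` name and the idea of wrapping both LLM and ASR calls in it are assumptions, not an existing API):

```python
import threading
from contextlib import contextmanager

class GpuArbiter:
    """Serializes GPU work so ASR and LLM inference never run concurrently."""

    def __init__(self):
        self._lock = threading.Lock()
        self.holder = None  # name of the task currently holding the GPU

    @contextmanager
    def acquire(self, task: str):
        with self._lock:
            self.holder = task
            try:
                yield
            finally:
                self.holder = None

# Usage: the ASR service and the LLM server share one arbiter instance.
#   with arbiter.acquire("asr"):
#       ...run faster-whisper inference...
```

A lock is the simplest policy; a priority queue would let short ASR requests preempt long LLM generations if latency targets require it.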

Performance Targets

  Hardware       Model    Target Latency   Status
  RTX 4080       small    < 1s             Achievable
  RTX 4080       medium   < 1.5s           Achievable
  RTX 1050       small    < 2s             Achievable
  CPU (modern)   small    < 4s             Achievable
  CPU (Pi 4)     tiny     < 8s             ⚠️ Acceptable

Next Steps

  1. ASR engine selected: faster-whisper
  2. Deployment decided: RTX 4080 (primary) or CPU node (alternative)
  3. Model size: small (or medium if GPU headroom)
  4. Implement ASR service (TICKET-010)
  5. Define ASR API contract (TICKET-011)
  6. Benchmark actual performance (TICKET-012)

Last Updated: 2024-01-XX Status: Evaluation Complete - Ready for Implementation (TICKET-010)