atlas/docs/LLM_USAGE_AND_COSTS.md
ilia 4b9ffb5ddf docs: Update architecture and add new documentation for LLM and MCP
- Enhanced `ARCHITECTURE.md` with details on LLM models for work (Llama 3.1 70B Q4) and family agents (Phi-3 Mini 3.8B Q4).
- Introduced new documents:
  - `ASR_EVALUATION.md` for ASR engine evaluation and selection.
  - `HARDWARE.md` outlining hardware requirements and purchase plans.
  - `IMPLEMENTATION_GUIDE.md` for Milestone 2 implementation steps.
  - `LLM_CAPACITY.md` assessing VRAM and context window limits.
  - `LLM_MODEL_SURVEY.md` surveying open-weight LLM models.
  - `LLM_USAGE_AND_COSTS.md` detailing LLM usage and operational costs.
  - `MCP_ARCHITECTURE.md` describing the Model Context Protocol architecture.
  - `MCP_IMPLEMENTATION_SUMMARY.md` summarizing MCP implementation status.

These updates provide comprehensive guidance for the next phases of development and ensure clarity in project documentation.
2026-01-05 23:44:16 -05:00


LLM Usage and Cost Analysis

Overview

This document outlines which LLMs to use for different tasks in the Atlas voice agent system, and estimates operational costs.

Key Hardware:

  • RTX 4080 (16GB VRAM): Work agent, high-capability tasks
  • GTX 1050 (4GB VRAM): Family agent, always-on, low-latency

LLM Usage by Task

Primary Use Cases

1. Work Agent (RTX 4080)

Model Recommendations:

  • Primary: Llama 3.1 70B Q4/Q5 or DeepSeek Coder 33B Q4
  • Alternative: Qwen 2.5 72B Q4, Mixtral 8x7B Q4
  • Context: 8K-16K tokens
  • Quantization: Q4-Q5. Note: a 70B model at Q4 is roughly 40GB and the 33B is roughly 20GB, so both need partial CPU offload on a 16GB card; only models up to roughly the 14B class fit entirely in 16GB VRAM at Q4
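The list above recommends 70B-class models, and a quick rule-of-thumb check shows why the quantization note matters at 16GB. This sketch is an estimate, not a measurement; the `vram_gb` helper and its constants (~4.5 bits per parameter for Q4 formats, ~15% headroom for KV cache and activations) are illustrative assumptions:

```python
# Rough VRAM footprint of a quantized model (rule of thumb, not a
# measurement): Q4 formats store ~4.5 bits/parameter once scales and
# zero-points are counted; add ~15% for KV cache and activations.

def vram_gb(params_billion: float, bits_per_param: float = 4.5,
            overhead: float = 0.15) -> float:
    """Estimate GPU memory needed to serve a quantized model, in GB."""
    weight_gb = params_billion * bits_per_param / 8  # bits -> bytes, 1e9 params -> GB
    return round(weight_gb * (1 + overhead), 1)

print(vram_gb(70))   # Llama 3.1 70B at Q4: well over 16 GB, needs offload
print(vram_gb(33))   # DeepSeek Coder 33B at Q4: also above 16 GB
print(vram_gb(3.8))  # Phi-3 Mini at Q4: comfortably inside 4 GB
```

By this estimate the 70B pick exceeds 16GB and would run on the 4080 only with part of the model offloaded to system RAM, at reduced speed.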

Use Cases:

  • Coding assistance and code generation
  • Research and analysis
  • Complex reasoning tasks
  • Technical documentation
  • Code review and debugging

Cost per Request:

  • Electricity: ~0.15-0.25 kWh per hour of active use
  • At $0.12/kWh: ~$0.018-0.03/hour
  • Per request (avg 5s generation): ~$0.000025-0.00004 per request
  • Monthly (2 hours/day): ~$1.08-1.80/month
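The per-request figures above all follow from one formula: hourly draw times electricity rate, scaled by generation time. A small sketch of that arithmetic, where the kWh draw, rate, and generation times are the document's assumptions rather than measured values:

```python
# Electricity-cost arithmetic behind the estimates above. The kWh
# draw, $/kWh rate, and generation times are assumed, not measured.

RATE_USD_PER_KWH = 0.12  # assumed US-average electricity rate

def cost_per_hour(kwh_per_hour: float) -> float:
    return kwh_per_hour * RATE_USD_PER_KWH

def cost_per_request(kwh_per_hour: float, gen_seconds: float) -> float:
    return cost_per_hour(kwh_per_hour) * gen_seconds / 3600

def monthly_cost(kwh_per_hour: float, hours_per_day: float,
                 days: int = 30) -> float:
    return cost_per_hour(kwh_per_hour) * hours_per_day * days

# Work agent upper bound: 0.25 kWh/h, 5 s per request, 2 h/day
print(round(cost_per_hour(0.25), 3))        # $/hour
print(cost_per_request(0.25, 5))            # ~$0.00004 per request
print(round(monthly_cost(0.25, 2), 2))      # $/month
```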

2. Family Agent (GTX 1050)

Model Recommendations:

  • Primary: Phi-3 Mini 3.8B Q4 or TinyLlama 1.1B Q4
  • Alternative: Gemma 2B Q4, Qwen2.5 1.5B Q4
  • Context: 4K-8K tokens
  • Quantization: Q4 (fits in 4GB VRAM)

Use Cases:

  • Daily conversations
  • Task management (add task, update status)
  • Weather queries
  • Timers and reminders
  • Simple Q&A
  • Family-friendly interactions

Cost per Request:

  • Electricity: ~0.05-0.08 kWh per hour of active use
  • At $0.12/kWh: ~$0.006-0.01/hour
  • Per request (avg 2s generation): ~$0.000003-0.000006 per request
  • Monthly (always-on; ~8 hours of active use/day): ~$1.44-2.40/month

Secondary Use Cases

3. Conversation Summarization (TICKET-043)

Model Choice:

  • Option A: Use Family Agent (1050) - cheaper, sufficient for summaries
  • Option B: Use Work Agent (4080) - better quality, but more expensive
  • Recommendation: Use Family Agent for most summaries, Work Agent for complex/long conversations

Frequency: After N turns (e.g., every 20 messages) or a size threshold

Cost:

  • Family Agent: ~$0.00001 per summary
  • Work Agent: ~$0.00004 per summary
  • Monthly (10 summaries/day): ~$0.003-0.012/month
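The trigger described above (every N turns, or a size threshold) reduces to a simple predicate. A sketch where the `should_summarize` name and the default thresholds are illustrative, not part of the system:

```python
# Hypothetical summarization trigger: fire after a fixed number of
# turns since the last summary, or once the transcript passes a size
# threshold, whichever comes first.

def should_summarize(turns_since_summary: int, transcript_chars: int,
                     max_turns: int = 20, max_chars: int = 8000) -> bool:
    return turns_since_summary >= max_turns or transcript_chars >= max_chars

print(should_summarize(5, 2000))   # False: under both thresholds
print(should_summarize(20, 2000))  # True: turn threshold hit
print(should_summarize(3, 9000))   # True: size threshold hit
```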

4. Memory Retrieval Enhancement (TICKET-041, TICKET-042)

Model Choice:

  • Use Family Agent (1050) for memory queries
  • Lightweight embeddings can be done without LLM
  • Only use LLM for complex memory reasoning

Cost: Minimal - mostly embedding-based retrieval
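The point above — most memory lookups need no LLM call — comes down to nearest-neighbor search over stored embedding vectors. A toy sketch with hand-written 3-dimensional vectors; a real system would produce them with an embedding model:

```python
# Embedding-based memory retrieval without an LLM: rank stored
# memories by cosine similarity to a query vector. Vectors here are
# toy values for illustration.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

memories = {
    "grocery list": [0.9, 0.1, 0.0],
    "meeting notes": [0.1, 0.9, 0.2],
}

def retrieve(query_vec, k=1):
    ranked = sorted(memories, key=lambda m: cosine(query_vec, memories[m]),
                    reverse=True)
    return ranked[:k]

print(retrieve([0.8, 0.2, 0.1]))  # ['grocery list']
```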

Cost Breakdown by Ticket

Milestone 1 - Survey & Architecture

  • TICKET-017, TICKET-018, TICKET-019, TICKET-020: No LLM costs (research only)

Milestone 2 - Voice Chat MVP

TICKET-021: Stand Up 4080 LLM Service

  • Setup cost: $0 (one-time)
  • Ongoing: ~$1.08-1.80/month (work agent usage)

TICKET-022: Stand Up 1050 LLM Service

  • Setup cost: $0 (one-time)
  • Ongoing: ~$1.44-2.40/month (family agent, always-on)

TICKET-025: System Prompts

  • Cost: $0 (configuration only)

TICKET-027: Multi-Turn Conversation

  • Cost: $0 (infrastructure, no LLM calls)

TICKET-030: MCP-LLM Integration

  • Cost: $0 (adapter code, uses existing LLM servers)

Milestone 3 - Memory, Reminders, Safety

TICKET-041: Long-Term Memory Design

  • Cost: $0 (design only)

TICKET-042: Long-Term Memory Implementation

  • Cost: Minimal - mostly database operations
  • LLM usage: Only for complex memory queries (~$0.01/month)

TICKET-043: Conversation Summarization

  • Cost: ~$0.003-0.012/month (10 summaries/day)
  • Model: Family Agent (1050) recommended

TICKET-044: Boundary Enforcement

  • Cost: $0 (policy enforcement, no LLM)

TICKET-045: Confirmation Flows

  • Cost: $0 (UI/logic, uses existing LLM for explanations)

TICKET-046: Admin Tools

  • Cost: $0 (UI/logging, no LLM)

Total Monthly Operating Costs

Base Infrastructure (Always Running)

  • Family Agent (1050): ~$1.44-2.40/month
  • Work Agent (4080): ~$1.08-1.80/month (when active)
  • Total Base: ~$2.52-4.20/month

Variable Costs (Usage-Based)

  • Conversation Summarization: ~$0.003-0.012/month
  • Memory Queries: ~$0.01/month
  • Total Variable: ~$0.013-0.022/month

Total Monthly Cost: ~$2.53-4.22/month
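The total is just the sum of the per-component ranges from the sections above; restating it as arithmetic makes the roll-up auditable:

```python
# Monthly cost roll-up: (low, high) USD ranges from the sections above.

components = {
    "family_agent":   (1.44, 2.40),
    "work_agent":     (1.08, 1.80),
    "summarization":  (0.003, 0.012),
    "memory_queries": (0.01, 0.01),
}

low = sum(lo for lo, _ in components.values())
high = sum(hi for _, hi in components.values())
print(round(low, 2), round(high, 2))  # 2.53 4.22
```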

Cost Optimization Strategies

1. Model Selection

  • Use smallest model that meets quality requirements
  • Q4 quantization for both agents (good quality/performance)
  • Consider Q5 for work agent if quality is critical

2. Usage Patterns

  • Work Agent: Only run when needed (not always-on)
  • Family Agent: Always-on but low-power (1050 is efficient)
  • Summarization: Batch process, use cheaper model

3. Context Management

  • Keep context windows reasonable (8K for work, 4K for family)
  • Aggressive summarization to reduce context size
  • Prune old messages regularly
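The three bullets above amount to a bounded-context builder: keep a rolling summary plus only the newest few messages. A sketch in which the function name, message format, and `keep_last` default are illustrative:

```python
# Bounded context: a rolling summary plus the most recent messages.

def build_context(summary: str, messages: list[str],
                  keep_last: int = 6) -> list[str]:
    """Return a pruned prompt: the summary (if any), then the newest messages."""
    recent = messages[-keep_last:]
    prefix = [f"[summary] {summary}"] if summary else []
    return prefix + recent

history = [f"msg {i}" for i in range(1, 11)]  # 10 messages
ctx = build_context("user planned a trip", history, keep_last=3)
print(ctx)  # ['[summary] user planned a trip', 'msg 8', 'msg 9', 'msg 10']
```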

4. Hardware Optimization

  • Use efficient inference servers (llama.cpp, vLLM)
  • Reuse the KV cache across turns (prompt caching) for faster responses
  • Batch requests when possible (work agent)

Alternative: Cloud API Costs (For Comparison)

If using cloud APIs instead of local:

OpenAI GPT-4

  • Work Agent: ~$0.03-0.06 per request
  • Family Agent: ~$0.01-0.02 per request
  • Monthly (100 requests/day): ~$120-240/month

Anthropic Claude

  • Work Agent: ~$0.015-0.03 per request
  • Family Agent: ~$0.008-0.015 per request
  • Monthly (100 requests/day): ~$69-135/month

Local is roughly 15-100x cheaper!
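As a sanity check, the ratio follows directly from the monthly figures above (local ~$2.53-4.22, cloud ~$69-240):

```python
# Cloud-vs-local cost ratio computed from the monthly estimates above.

local_low, local_high = 2.53, 4.22
cloud_low, cloud_high = 69.0, 240.0

worst_case = cloud_low / local_high   # cheapest cloud vs priciest local
best_case = cloud_high / local_low    # priciest cloud vs cheapest local
print(round(worst_case), round(best_case))  # 16 95
```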

Recommendations by Ticket Priority

High Priority (Do First)

  1. TICKET-019: Select Work Agent Model - Choose efficient 70B Q4 model
  2. TICKET-020: Select Family Agent Model - Choose Phi-3 Mini or TinyLlama Q4
  3. TICKET-021: Stand Up 4080 Service - Use Ollama or vLLM
  4. TICKET-022: Stand Up 1050 Service - Use llama.cpp (lightweight)

Medium Priority

  1. TICKET-027: Multi-Turn Conversation - Implement context management
  2. TICKET-043: Summarization - Use Family Agent for cost efficiency

Low Priority (Optimize Later)

  1. TICKET-042: Memory Implementation - Add LLM queries only if needed
  2. TICKET-024: Logging & Metrics - Track costs and optimize

Model Selection Matrix

| Task | Model | Hardware | Quantization | Cost/Hour | Use Case |
| --- | --- | --- | --- | --- | --- |
| Work Agent | Llama 3.1 70B | RTX 4080 | Q4 | $0.018-0.03 | Coding, research |
| Family Agent | Phi-3 Mini 3.8B | GTX 1050 | Q4 | $0.006-0.01 | Daily conversations |
| Summarization | Phi-3 Mini 3.8B | GTX 1050 | Q4 | $0.006-0.01 | Conversation summaries |
| Memory Queries | Embeddings + Phi-3 | GTX 1050 | Q4 | Minimal | Memory retrieval |

Notes

  • All costs assume $0.12/kWh electricity rate (US average)
  • Costs scale with usage - adjust based on actual usage patterns
  • Hardware depreciation not included (one-time cost)
  • Local models are much cheaper than cloud APIs
  • Privacy benefit: No data leaves your network

Next Steps

  1. Complete TICKET-017 (Model Survey) to finalize model choices
  2. Complete TICKET-018 (Capacity Assessment) to confirm VRAM fits
  3. Select models based on this analysis
  4. Monitor actual costs after deployment and optimize