atlas/docs/LLM_USAGE_AND_COSTS.md
ilia 4b9ffb5ddf docs: Update architecture and add new documentation for LLM and MCP
- Enhanced `ARCHITECTURE.md` with details on LLM models for work (Llama 3.1 70B Q4) and family agents (Phi-3 Mini 3.8B Q4).
- Introduced new documents:
  - `ASR_EVALUATION.md` for ASR engine evaluation and selection.
  - `HARDWARE.md` outlining hardware requirements and purchase plans.
  - `IMPLEMENTATION_GUIDE.md` for Milestone 2 implementation steps.
  - `LLM_CAPACITY.md` assessing VRAM and context window limits.
  - `LLM_MODEL_SURVEY.md` surveying open-weight LLM models.
  - `LLM_USAGE_AND_COSTS.md` detailing LLM usage and operational costs.
  - `MCP_ARCHITECTURE.md` describing the Model Context Protocol architecture.
  - `MCP_IMPLEMENTATION_SUMMARY.md` summarizing MCP implementation status.

These updates provide comprehensive guidance for the next phases of development and ensure clarity in project documentation.
2026-01-05 23:44:16 -05:00


LLM Usage and Cost Analysis

Overview

This document outlines which LLMs to use for different tasks in the Atlas voice agent system, and estimates operational costs.

Key Hardware:

  • RTX 4080 (16GB VRAM): Work agent, high-capability tasks
  • GTX 1050 (4GB VRAM): Family agent, always-on, low-latency

LLM Usage by Task

Primary Use Cases

1. Work Agent (RTX 4080)

Model Recommendations:

  • Primary: Llama 3.1 70B Q4/Q5 or DeepSeek Coder 33B Q4
  • Alternative: Qwen 2.5 72B Q4, Mixtral 8x7B Q4
  • Context: 8K-16K tokens
  • Quantization: Q4-Q5. Note: a 70B model at Q4 is roughly 40GB and the 33B is roughly 20GB, so both need partial CPU offload on a 16GB card; only models up to roughly the 14B class fit entirely in 16GB VRAM at Q4
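The list above recommends 70B-class models, and a quick rule-of-thumb check shows why the quantization note matters at 16GB. This sketch is an estimate, not a measurement; the `vram_gb` helper and its constants (~4.5 bits per parameter for Q4 formats, ~15% headroom for KV cache and activations) are illustrative assumptions:

```python
# Rough VRAM footprint of a quantized model (rule of thumb, not a
# measurement): Q4 formats store ~4.5 bits/parameter once scales and
# zero-points are counted; add ~15% for KV cache and activations.

def vram_gb(params_billion: float, bits_per_param: float = 4.5,
            overhead: float = 0.15) -> float:
    """Estimate GPU memory needed to serve a quantized model, in GB."""
    weight_gb = params_billion * bits_per_param / 8  # bits -> bytes, 1e9 params -> GB
    return round(weight_gb * (1 + overhead), 1)

print(vram_gb(70))   # Llama 3.1 70B at Q4: well over 16 GB, needs offload
print(vram_gb(33))   # DeepSeek Coder 33B at Q4: also above 16 GB
print(vram_gb(3.8))  # Phi-3 Mini at Q4: comfortably inside 4 GB
```

By this estimate the 70B pick exceeds 16GB and would run on the 4080 only with part of the model offloaded to system RAM, at reduced speed.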

Use Cases:

  • Coding assistance and code generation
  • Research and analysis
  • Complex reasoning tasks
  • Technical documentation
  • Code review and debugging

Cost per Request:

  • Electricity: ~0.15-0.25 kWh per hour of active use
  • At $0.12/kWh: ~$0.018-0.03/hour
  • Per request (avg 5s generation): ~$0.000025-0.00004 per request
  • Monthly (2 hours/day): ~$1.08-1.80/month
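The per-request figures above all follow from one formula: hourly draw times electricity rate, scaled by generation time. A small sketch of that arithmetic, where the kWh draw, rate, and generation times are the document's assumptions rather than measured values:

```python
# Electricity-cost arithmetic behind the estimates above. The kWh
# draw, $/kWh rate, and generation times are assumed, not measured.

RATE_USD_PER_KWH = 0.12  # assumed US-average electricity rate

def cost_per_hour(kwh_per_hour: float) -> float:
    return kwh_per_hour * RATE_USD_PER_KWH

def cost_per_request(kwh_per_hour: float, gen_seconds: float) -> float:
    return cost_per_hour(kwh_per_hour) * gen_seconds / 3600

def monthly_cost(kwh_per_hour: float, hours_per_day: float,
                 days: int = 30) -> float:
    return cost_per_hour(kwh_per_hour) * hours_per_day * days

# Work agent upper bound: 0.25 kWh/h, 5 s per request, 2 h/day
print(round(cost_per_hour(0.25), 3))        # $/hour
print(cost_per_request(0.25, 5))            # ~$0.00004 per request
print(round(monthly_cost(0.25, 2), 2))      # $/month
```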

2. Family Agent (GTX 1050)

Model Recommendations:

  • Primary: Phi-3 Mini 3.8B Q4 or TinyLlama 1.1B Q4
  • Alternative: Gemma 2B Q4, Qwen2.5 1.5B Q4
  • Context: 4K-8K tokens
  • Quantization: Q4 (fits in 4GB VRAM)

Use Cases:

  • Daily conversations
  • Task management (add task, update status)
  • Weather queries
  • Timers and reminders
  • Simple Q&A
  • Family-friendly interactions

Cost per Request:

  • Electricity: ~0.05-0.08 kWh per hour of active use
  • At $0.12/kWh: ~$0.006-0.01/hour
  • Per request (avg 2s generation): ~$0.000003-0.000006 per request
  • Monthly (always-on; ~8 hours of active use/day): ~$1.44-2.40/month

Secondary Use Cases

3. Conversation Summarization (TICKET-043)

Model Choice:

  • Option A: Use Family Agent (1050) - cheaper, sufficient for summaries
  • Option B: Use Work Agent (4080) - better quality, but more expensive
  • Recommendation: Use Family Agent for most summaries, Work Agent for complex/long conversations

Frequency: After N turns (e.g., every 20 messages) or a size threshold

Cost:

  • Family Agent: ~$0.00001 per summary
  • Work Agent: ~$0.00004 per summary
  • Monthly (10 summaries/day): ~$0.003-0.012/month
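The trigger described above (every N turns, or a size threshold) reduces to a simple predicate. A sketch where the `should_summarize` name and the default thresholds are illustrative, not part of the system:

```python
# Hypothetical summarization trigger: fire after a fixed number of
# turns since the last summary, or once the transcript passes a size
# threshold, whichever comes first.

def should_summarize(turns_since_summary: int, transcript_chars: int,
                     max_turns: int = 20, max_chars: int = 8000) -> bool:
    return turns_since_summary >= max_turns or transcript_chars >= max_chars

print(should_summarize(5, 2000))   # False: under both thresholds
print(should_summarize(20, 2000))  # True: turn threshold hit
print(should_summarize(3, 9000))   # True: size threshold hit
```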

4. Memory Retrieval Enhancement (TICKET-041, TICKET-042)

Model Choice:

  • Use Family Agent (1050) for memory queries
  • Lightweight embeddings can be done without LLM
  • Only use LLM for complex memory reasoning

Cost: Minimal - mostly embedding-based retrieval
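The point above — most memory lookups need no LLM call — comes down to nearest-neighbor search over stored embedding vectors. A toy sketch with hand-written 3-dimensional vectors; a real system would produce them with an embedding model:

```python
# Embedding-based memory retrieval without an LLM: rank stored
# memories by cosine similarity to a query vector. Vectors here are
# toy values for illustration.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

memories = {
    "grocery list": [0.9, 0.1, 0.0],
    "meeting notes": [0.1, 0.9, 0.2],
}

def retrieve(query_vec, k=1):
    ranked = sorted(memories, key=lambda m: cosine(query_vec, memories[m]),
                    reverse=True)
    return ranked[:k]

print(retrieve([0.8, 0.2, 0.1]))  # ['grocery list']
```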

Cost Breakdown by Ticket

Milestone 1 - Survey & Architecture

  • TICKET-017, TICKET-018, TICKET-019, TICKET-020: No LLM costs (research only)

Milestone 2 - Voice Chat MVP

TICKET-021: Stand Up 4080 LLM Service

  • Setup cost: $0 (one-time)
  • Ongoing: ~$1.08-1.80/month (work agent usage)

TICKET-022: Stand Up 1050 LLM Service

  • Setup cost: $0 (one-time)
  • Ongoing: ~$1.44-2.40/month (family agent, always-on)

TICKET-025: System Prompts

  • Cost: $0 (configuration only)

TICKET-027: Multi-Turn Conversation

  • Cost: $0 (infrastructure, no LLM calls)

TICKET-030: MCP-LLM Integration

  • Cost: $0 (adapter code, uses existing LLM servers)

Milestone 3 - Memory, Reminders, Safety

TICKET-041: Long-Term Memory Design

  • Cost: $0 (design only)

TICKET-042: Long-Term Memory Implementation

  • Cost: Minimal - mostly database operations
  • LLM usage: Only for complex memory queries (~$0.01/month)

TICKET-043: Conversation Summarization

  • Cost: ~$0.003-0.012/month (10 summaries/day)
  • Model: Family Agent (1050) recommended

TICKET-044: Boundary Enforcement

  • Cost: $0 (policy enforcement, no LLM)

TICKET-045: Confirmation Flows

  • Cost: $0 (UI/logic, uses existing LLM for explanations)

TICKET-046: Admin Tools

  • Cost: $0 (UI/logging, no LLM)

Total Monthly Operating Costs

Base Infrastructure (Always Running)

  • Family Agent (1050): ~$1.44-2.40/month
  • Work Agent (4080): ~$1.08-1.80/month (when active)
  • Total Base: ~$2.52-4.20/month

Variable Costs (Usage-Based)

  • Conversation Summarization: ~$0.003-0.012/month
  • Memory Queries: ~$0.01/month
  • Total Variable: ~$0.013-0.022/month

Total Monthly Cost: ~$2.53-4.22/month
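The total is just the sum of the per-component ranges from the sections above; restating it as arithmetic makes the roll-up auditable:

```python
# Monthly cost roll-up: (low, high) USD ranges from the sections above.

components = {
    "family_agent":   (1.44, 2.40),
    "work_agent":     (1.08, 1.80),
    "summarization":  (0.003, 0.012),
    "memory_queries": (0.01, 0.01),
}

low = sum(lo for lo, _ in components.values())
high = sum(hi for _, hi in components.values())
print(round(low, 2), round(high, 2))  # 2.53 4.22
```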

Cost Optimization Strategies

1. Model Selection

  • Use smallest model that meets quality requirements
  • Q4 quantization for both agents (good quality/performance)
  • Consider Q5 for work agent if quality is critical

2. Usage Patterns

  • Work Agent: Only run when needed (not always-on)
  • Family Agent: Always-on but low-power (1050 is efficient)
  • Summarization: Batch process, use cheaper model

3. Context Management

  • Keep context windows reasonable (8K for work, 4K for family)
  • Aggressive summarization to reduce context size
  • Prune old messages regularly
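The three bullets above amount to a bounded-context builder: keep a rolling summary plus only the newest few messages. A sketch in which the function name, message format, and `keep_last` default are illustrative:

```python
# Bounded context: a rolling summary plus the most recent messages.

def build_context(summary: str, messages: list[str],
                  keep_last: int = 6) -> list[str]:
    """Return a pruned prompt: the summary (if any), then the newest messages."""
    recent = messages[-keep_last:]
    prefix = [f"[summary] {summary}"] if summary else []
    return prefix + recent

history = [f"msg {i}" for i in range(1, 11)]  # 10 messages
ctx = build_context("user planned a trip", history, keep_last=3)
print(ctx)  # ['[summary] user planned a trip', 'msg 8', 'msg 9', 'msg 10']
```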

4. Hardware Optimization

  • Use efficient inference servers (llama.cpp, vLLM)
  • Reuse the KV cache across turns (prompt caching) for faster responses
  • Batch requests when possible (work agent)

Alternative: Cloud API Costs (For Comparison)

If using cloud APIs instead of local:

OpenAI GPT-4

  • Work Agent: ~$0.03-0.06 per request
  • Family Agent: ~$0.01-0.02 per request
  • Monthly (100 requests/day): ~$120-240/month

Anthropic Claude

  • Work Agent: ~$0.015-0.03 per request
  • Family Agent: ~$0.008-0.015 per request
  • Monthly (100 requests/day): ~$69-135/month

Local is roughly 15-100x cheaper!
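As a sanity check, the ratio follows directly from the monthly figures above (local ~$2.53-4.22, cloud ~$69-240):

```python
# Cloud-vs-local cost ratio computed from the monthly estimates above.

local_low, local_high = 2.53, 4.22
cloud_low, cloud_high = 69.0, 240.0

worst_case = cloud_low / local_high   # cheapest cloud vs priciest local
best_case = cloud_high / local_low    # priciest cloud vs cheapest local
print(round(worst_case), round(best_case))  # 16 95
```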

Recommendations by Ticket Priority

High Priority (Do First)

  1. TICKET-019: Select Work Agent Model - Choose efficient 70B Q4 model
  2. TICKET-020: Select Family Agent Model - Choose Phi-3 Mini or TinyLlama Q4
  3. TICKET-021: Stand Up 4080 Service - Use Ollama or vLLM
  4. TICKET-022: Stand Up 1050 Service - Use llama.cpp (lightweight)

Medium Priority

  1. TICKET-027: Multi-Turn Conversation - Implement context management
  2. TICKET-043: Summarization - Use Family Agent for cost efficiency

Low Priority (Optimize Later)

  1. TICKET-042: Memory Implementation - Add LLM queries only if needed
  2. TICKET-024: Logging & Metrics - Track costs and optimize

Model Selection Matrix

| Task | Model | Hardware | Quantization | Cost/Hour | Use Case |
| --- | --- | --- | --- | --- | --- |
| Work Agent | Llama 3.1 70B | RTX 4080 | Q4 | $0.018-0.03 | Coding, research |
| Family Agent | Phi-3 Mini 3.8B | GTX 1050 | Q4 | $0.006-0.01 | Daily conversations |
| Summarization | Phi-3 Mini 3.8B | GTX 1050 | Q4 | $0.006-0.01 | Conversation summaries |
| Memory Queries | Embeddings + Phi-3 | GTX 1050 | Q4 | Minimal | Memory retrieval |

Notes

  • All costs assume $0.12/kWh electricity rate (US average)
  • Costs scale with usage - adjust based on actual usage patterns
  • Hardware depreciation not included (one-time cost)
  • Local models are much cheaper than cloud APIs
  • Privacy benefit: No data leaves your network

Next Steps

  1. Complete TICKET-017 (Model Survey) to finalize model choices
  2. Complete TICKET-018 (Capacity Assessment) to confirm VRAM fits
  3. Select models based on this analysis
  4. Monitor actual costs after deployment and optimize