# LLM Usage and Cost Analysis
## Overview
This document outlines which LLMs to use for different tasks in the Atlas voice agent system, and estimates operational costs.
**Key Hardware:**
- **RTX 4080** (16GB VRAM): Work agent, high-capability tasks
- **RTX 1050** (4GB VRAM): Family agent, always-on, low-latency
## LLM Usage by Task
### Primary Use Cases
#### 1. **Work Agent (RTX 4080)**
**Model Recommendations:**
- **Primary**: Llama 3.1 70B Q4/Q5 or DeepSeek Coder 33B Q4
- **Alternative**: Qwen 2.5 72B Q4, Mistral Large 2 (123B) Q4
- **Context**: 8K-16K tokens
- **Quantization**: Q4-Q5. Note: 70B-class weights at Q4 are roughly 40 GB, well beyond 16GB VRAM, so expect partial CPU offload; the 33B-class options come closest to running fully on-GPU
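A quick way to sanity-check the quantization bullets is to estimate weight size from parameter count and bits per weight. A rough sketch (the ~4.5 bits/weight figure for Q4_K-style quants and the 20% runtime overhead are approximations, not guarantees):

```python
def model_vram_gb(params_billion: float, bits_per_weight: float,
                  overhead: float = 1.2) -> float:
    """Rough VRAM estimate: quantized weight size plus ~20%
    for KV cache and activations (approximation)."""
    weight_gb = params_billion * bits_per_weight / 8  # 1B params @ 8 bpw ~ 1 GB
    return weight_gb * overhead

for name, params in [("Llama 3.1 70B", 70), ("DeepSeek Coder 33B", 33),
                     ("Phi-3 Mini", 3.8)]:
    print(f"{name}: ~{model_vram_gb(params, 4.5):.0f} GB at Q4")
```

By this estimate a 70B Q4 model spills well past the 4080's 16 GB and needs layer offload to system RAM, while Phi-3 Mini fits comfortably in 4 GB.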
**Use Cases:**
- Coding assistance and code generation
- Research and analysis
- Complex reasoning tasks
- Technical documentation
- Code review and debugging
**Cost per Request:**
- **Electricity**: ~0.15-0.25 kWh per hour of active use
- **At $0.12/kWh**: ~$0.018-0.03/hour
- **Per request** (avg 5s generation): ~$0.000025-0.00004 per request
- **Monthly** (2 hours/day): ~$1.08-1.80/month
#### 2. **Family Agent (RTX 1050)**
**Model Recommendations:**
- **Primary**: Phi-3 Mini 3.8B Q4 or TinyLlama 1.1B Q4
- **Alternative**: Gemma 2B Q4, Qwen2.5 1.5B Q4
- **Context**: 4K-8K tokens
- **Quantization**: Q4 (fits in 4GB VRAM)
**Use Cases:**
- Daily conversations
- Task management (add task, update status)
- Weather queries
- Timers and reminders
- Simple Q&A
- Family-friendly interactions
**Cost per Request:**
- **Electricity**: ~0.05-0.08 kWh per hour of active use
- **At $0.12/kWh**: ~$0.006-0.01/hour
- **Per request** (avg 2s generation): ~$0.000003-0.000006 per request
- **Monthly** (always-on; assumes ~8 hours/day of active draw, idle power not counted): ~$1.44-2.40/month
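The per-request and monthly figures for both agents follow directly from watts × time × rate. A small sketch (the wattage values are the same assumptions stated in the bullets above):

```python
def per_request_cost(watts: float, rate_per_kwh: float, seconds: float) -> float:
    """Electricity cost of one request drawing `watts` for `seconds`."""
    return watts / 1000 * seconds / 3600 * rate_per_kwh

def monthly_cost(watts: float, rate_per_kwh: float,
                 hours_per_day: float, days: int = 30) -> float:
    """Electricity cost of `hours_per_day` of active draw over a month."""
    return watts / 1000 * hours_per_day * days * rate_per_kwh

RATE = 0.12  # $/kWh, the US-average rate assumed throughout this doc

# Work agent: ~250 W peak, ~5 s generation, 2 h/day active
print(f"${per_request_cost(250, RATE, 5):.6f}/request")
print(f"${monthly_cost(250, RATE, 2):.2f}/month")
# Family agent: ~80 W peak, ~8 h/day of active draw
print(f"${monthly_cost(80, RATE, 8):.2f}/month")
```

These reproduce the upper bounds above: ~$0.00004/request and ~$1.80/month for the work agent, ~$2.30/month for the family agent.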
### Secondary Use Cases
#### 3. **Conversation Summarization** (TICKET-043)
**Model Choice:**
- **Option A**: Use Family Agent (1050) - cheaper, sufficient for summaries
- **Option B**: Use Work Agent (4080) - better quality, but more expensive
- **Recommendation**: Use Family Agent for most summaries, Work Agent for complex/long conversations
**Frequency**: After N turns (e.g., every 20 messages) or size threshold
**Cost**:
- Family Agent: ~$0.00001 per summary
- Work Agent: ~$0.00004 per summary
- **Monthly** (10 summaries/day): ~$0.003-0.012/month
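The turn-count / size-threshold trigger described above can be sketched as follows (the threshold values are illustrative, not project-mandated):

```python
def should_summarize(turn_count: int, context_chars: int,
                     turn_threshold: int = 20,
                     size_threshold: int = 12_000) -> bool:
    """Fire a summarization pass after N turns, or earlier if the
    transcript outgrows a size budget, whichever comes first."""
    return turn_count >= turn_threshold or context_chars >= size_threshold

assert should_summarize(20, 500)        # hit the every-20-messages rule
assert should_summarize(3, 15_000)      # long transcript triggers early
assert not should_summarize(5, 1_000)   # short conversation: keep going
```

Routing follows the recommendation above: send the summarization prompt to the Family Agent by default, escalating to the Work Agent only for long or complex transcripts.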
#### 4. **Memory Retrieval Enhancement** (TICKET-041, TICKET-042)
**Model Choice:**
- Use Family Agent (1050) for memory queries
- Lightweight embeddings can be done without LLM
- Only use LLM for complex memory reasoning
**Cost**: Minimal - mostly embedding-based retrieval
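The embedding-first approach can be illustrated with plain cosine similarity. The three-dimensional vectors below are toy stand-ins for real embedding-model output:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, memories, top_k=2):
    """Rank stored memories by similarity to the query embedding;
    no LLM call is needed for this step."""
    ranked = sorted(memories, key=lambda m: cosine(query_vec, m["vec"]),
                    reverse=True)
    return [m["text"] for m in ranked[:top_k]]

memories = [
    {"text": "Trash pickup is Tuesday",    "vec": [0.9, 0.1, 0.0]},
    {"text": "Dentist appointment in May", "vec": [0.1, 0.8, 0.2]},
    {"text": "Wifi password was rotated",  "vec": [0.0, 0.2, 0.9]},
]
print(retrieve([0.85, 0.15, 0.05], memories, top_k=1))
```

Only when the retrieved snippets themselves need reasoning (e.g. resolving conflicting memories) does the request fall through to the Family Agent LLM.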
## Cost Breakdown by Ticket
### Milestone 1 - Survey & Architecture
- **TICKET-017, TICKET-018, TICKET-019, TICKET-020**: No LLM costs (research only)
### Milestone 2 - Voice Chat MVP
#### TICKET-021: Stand Up 4080 LLM Service
- **Setup cost**: $0 (one-time)
- **Ongoing**: ~$1.08-1.80/month (work agent usage)
#### TICKET-022: Stand Up 1050 LLM Service
- **Setup cost**: $0 (one-time)
- **Ongoing**: ~$1.44-2.40/month (family agent, always-on)
#### TICKET-025: System Prompts
- **Cost**: $0 (configuration only)
#### TICKET-027: Multi-Turn Conversation
- **Cost**: $0 (infrastructure, no LLM calls)
#### TICKET-030: MCP-LLM Integration
- **Cost**: $0 (adapter code, uses existing LLM servers)
### Milestone 3 - Memory, Reminders, Safety
#### TICKET-041: Long-Term Memory Design
- **Cost**: $0 (design only)
#### TICKET-042: Long-Term Memory Implementation
- **Cost**: Minimal - mostly database operations
- **LLM usage**: Only for complex memory queries (~$0.01/month)
#### TICKET-043: Conversation Summarization
- **Cost**: ~$0.003-0.012/month (10 summaries/day)
- **Model**: Family Agent (1050) recommended
#### TICKET-044: Boundary Enforcement
- **Cost**: $0 (policy enforcement, no LLM)
#### TICKET-045: Confirmation Flows
- **Cost**: $0 (UI/logic, uses existing LLM for explanations)
#### TICKET-046: Admin Tools
- **Cost**: $0 (UI/logging, no LLM)
## Total Monthly Operating Costs
### Base Infrastructure (Always Running)
- **Family Agent (1050)**: ~$1.44-2.40/month
- **Work Agent (4080)**: ~$1.08-1.80/month (when active)
- **Total Base**: ~$2.52-4.20/month
### Variable Costs (Usage-Based)
- **Conversation Summarization**: ~$0.003-0.012/month
- **Memory Queries**: ~$0.01/month
- **Total Variable**: ~$0.013-0.022/month
### **Total Monthly Cost: ~$2.53-4.22/month**
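The total can be reproduced by summing the line items above:

```python
line_items = {  # (low, high) $/month, from the sections above
    "family_agent":   (1.44, 2.40),
    "work_agent":     (1.08, 1.80),
    "summarization":  (0.003, 0.012),
    "memory_queries": (0.01, 0.01),
}
low = sum(lo for lo, _ in line_items.values())
high = sum(hi for _, hi in line_items.values())
print(f"~${low:.2f}-{high:.2f}/month")  # matches the ~$2.53-4.22 total
```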
## Cost Optimization Strategies
### 1. **Model Selection**
- Use smallest model that meets quality requirements
- Q4 quantization for both agents (good quality/performance)
- Consider Q5 for work agent if quality is critical
### 2. **Usage Patterns**
- **Work Agent**: Only run when needed (not always-on)
- **Family Agent**: Always-on but low-power (1050 is efficient)
- **Summarization**: Batch process, use cheaper model
### 3. **Context Management**
- Keep context windows reasonable (8K for work, 4K for family)
- Aggressive summarization to reduce context size
- Prune old messages regularly
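One way to implement the pruning rule is to always keep the system prompt and then admit the most recent messages that still fit a budget. A sketch using a character budget as a crude stand-in for a token budget:

```python
def prune_context(messages, max_chars=16_000):
    """Keep the system prompt plus as many of the most recent
    messages as fit within a rough character budget."""
    system, rest = messages[0], messages[1:]
    kept, used = [], len(system["content"])
    for msg in reversed(rest):  # walk newest-first
        if used + len(msg["content"]) > max_chars:
            break
        kept.append(msg)
        used += len(msg["content"])
    return [system] + list(reversed(kept))

history = [{"role": "system", "content": "You are Atlas, the family agent."}]
history += [{"role": "user", "content": f"turn {i}: " + "x" * 990}
            for i in range(50)]
pruned = prune_context(history, max_chars=5_000)
assert pruned[0]["role"] == "system"  # system prompt always survives
assert pruned[-1] == history[-1]      # newest message always survives
assert len(pruned) < len(history)
```

A production version would count tokens with the model's tokenizer and fold the dropped middle of the conversation into a summary message rather than discarding it outright.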
### 4. **Hardware Optimization**
- Use efficient inference servers (llama.cpp, vLLM)
- Enable KV cache for faster responses
- Batch requests when possible (work agent)
## Alternative: Cloud API Costs (For Comparison)
If using cloud APIs instead of local:
### OpenAI GPT-4
- **Work Agent**: ~$0.03-0.06 per request
- **Family Agent**: ~$0.01-0.02 per request
- **Monthly** (100 requests/day at work-agent rates): ~$90-180/month
### Anthropic Claude
- **Work Agent**: ~$0.015-0.03 per request
- **Family Agent**: ~$0.008-0.015 per request
- **Monthly** (100 requests/day at work-agent rates): ~$45-90/month
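As a sanity check, monthly cloud spend is just requests/day × days × price/request. A sketch assuming all 100 daily requests are billed at the work-agent per-request rates quoted above:

```python
REQS = 100 * 30  # 100 requests/day over a 30-day month

gpt4_monthly   = (0.03 * REQS, 0.06 * REQS)    # -> $90-180
claude_monthly = (0.015 * REQS, 0.03 * REQS)   # -> $45-90
local_monthly  = (2.53, 4.22)                  # total computed earlier

# Even the cheapest cloud case is an order of magnitude above local:
print(claude_monthly[0] / local_monthly[1])    # worst case for local
print(gpt4_monthly[1] / local_monthly[0])      # best case for local
```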
### **Local is 1-2 orders of magnitude cheaper!**
## Recommendations by Ticket Priority
### High Priority (Do First)
1. **TICKET-019**: Select Work Agent Model - Choose efficient 70B Q4 model
2. **TICKET-020**: Select Family Agent Model - Choose Phi-3 Mini or TinyLlama Q4
3. **TICKET-021**: Stand Up 4080 Service - Use Ollama or vLLM
4. **TICKET-022**: Stand Up 1050 Service - Use llama.cpp (lightweight)
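Once both services are up, request routing is a thin dispatch layer. A sketch with hypothetical host names (both Ollama and llama.cpp's `llama-server` expose an OpenAI-compatible `/v1/chat/completions` endpoint):

```python
# Hypothetical LAN addresses -- substitute your actual hosts/ports.
SERVERS = {
    "work":   "http://4080-box:11434/v1/chat/completions",  # Ollama default port
    "family": "http://1050-box:8080/v1/chat/completions",   # llama.cpp server
}
WORK_TASKS = {"coding", "research", "code_review", "debugging", "documentation"}

def route(task_type: str) -> str:
    """Dispatch heavyweight tasks to the 4080; everything else
    stays on the always-on 1050."""
    return SERVERS["work" if task_type in WORK_TASKS else "family"]

assert route("coding") == SERVERS["work"]
assert route("weather") == SERVERS["family"]
```

Keeping the dispatch keyed on task type (rather than hardcoding endpoints in callers) makes it cheap to swap models or hosts later without touching agent code.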
### Medium Priority
5. **TICKET-027**: Multi-Turn Conversation - Implement context management
6. **TICKET-043**: Summarization - Use Family Agent for cost efficiency
### Low Priority (Optimize Later)
7. **TICKET-042**: Memory Implementation - Add LLM queries only if needed
8. **TICKET-024**: Logging & Metrics - Track costs and optimize
## Model Selection Matrix
| Task | Model | Hardware | Quantization | Cost/Hour | Use Case |
|------|-------|----------|--------------|-----------|----------|
| Work Agent | Llama 3.1 70B | RTX 4080 | Q4 | $0.018-0.03 | Coding, research |
| Family Agent | Phi-3 Mini 3.8B | RTX 1050 | Q4 | $0.006-0.01 | Daily conversations |
| Summarization | Phi-3 Mini 3.8B | RTX 1050 | Q4 | $0.006-0.01 | Conversation summaries |
| Memory Queries | Embeddings + Phi-3 | RTX 1050 | Q4 | Minimal | Memory retrieval |
## Notes
- All costs assume $0.12/kWh electricity rate (US average)
- Costs scale with usage - adjust based on actual usage patterns
- Hardware depreciation not included (one-time cost)
- Local models are **much cheaper** than cloud APIs
- Privacy benefit: No data leaves your network
## Next Steps
1. Complete TICKET-017 (Model Survey) to finalize model choices
2. Complete TICKET-018 (Capacity Assessment) to confirm VRAM fits
3. Select models based on this analysis
4. Monitor actual costs after deployment and optimize