# LLM Usage and Cost Analysis

## Overview

This document outlines which LLMs to use for different tasks in the Atlas voice agent system and estimates their operational costs.

**Key Hardware:**

- **RTX 4080** (16GB VRAM): Work agent, high-capability tasks
- **GTX 1050** (4GB VRAM): Family agent, always-on, low-latency

## LLM Usage by Task

### Primary Use Cases

#### 1. **Work Agent (RTX 4080)**

**Model Recommendations:**

- **Primary**: Llama 3.1 70B Q4/Q5 or DeepSeek Coder 33B Q4
- **Alternative**: Qwen 2.5 72B Q4, Mistral Large 2 (123B) Q4
- **Context**: 8K-16K tokens
- **Quantization**: Q4-Q5 (note: 70B-class weights at Q4 are ~40GB, so expect partial CPU offload on a 16GB card)

**Use Cases:**

- Coding assistance and code generation
- Research and analysis
- Complex reasoning tasks
- Technical documentation
- Code review and debugging

**Cost per Request:**

- **Electricity**: ~0.15-0.25 kWh per hour of active use
- **At $0.12/kWh**: ~$0.018-0.03/hour
- **Per request** (avg 5s generation): ~$0.000025-0.00004 per request
- **Monthly** (2 hours/day): ~$1.08-1.80/month

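
The figures above follow from a simple electricity model; a small helper reproduces them (the power-draw, timing, and rate numbers are this document's assumptions, not measurements):

```python
# Rough electricity-cost model for a local LLM agent.
# All inputs are assumptions from this document, not measured values.

KWH_RATE = 0.12  # $/kWh, the US-average rate assumed throughout

def cost_per_hour(kwh_per_hour: float, rate: float = KWH_RATE) -> float:
    """Electricity cost of one hour of active inference."""
    return kwh_per_hour * rate

def cost_per_request(kwh_per_hour: float, seconds_per_request: float) -> float:
    """Cost of a single request, given average generation time."""
    return cost_per_hour(kwh_per_hour) * seconds_per_request / 3600

def monthly_cost(kwh_per_hour: float, hours_per_day: float, days: int = 30) -> float:
    """Monthly cost for a given daily duty cycle."""
    return cost_per_hour(kwh_per_hour) * hours_per_day * days

# Work agent (RTX 4080) upper bound: 0.25 kWh/h, ~5s generation, 2h/day
print(cost_per_hour(0.25))        # $/hour
print(cost_per_request(0.25, 5))  # $/request
print(monthly_cost(0.25, 2))      # $/month
```

The same functions reproduce the family-agent figures with 0.05-0.08 kWh/h, 2s generation, and 8 hours/day.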
#### 2. **Family Agent (GTX 1050)**

**Model Recommendations:**

- **Primary**: Phi-3 Mini 3.8B Q4 or TinyLlama 1.1B Q4
- **Alternative**: Gemma 2B Q4, Qwen2.5 1.5B Q4
- **Context**: 4K-8K tokens
- **Quantization**: Q4 (fits in 4GB VRAM)

**Use Cases:**

- Daily conversations
- Task management (add task, update status)
- Weather queries
- Timers and reminders
- Simple Q&A
- Family-friendly interactions

**Cost per Request:**

- **Electricity**: ~0.05-0.08 kWh per hour of active use
- **At $0.12/kWh**: ~$0.006-0.01/hour
- **Per request** (avg 2s generation): ~$0.000003-0.000006 per request
- **Monthly** (always-on, 8 hours/day): ~$1.44-2.40/month

### Secondary Use Cases

#### 3. **Conversation Summarization** (TICKET-043)

**Model Choice:**

- **Option A**: Use the Family Agent (1050) - cheaper, sufficient for summaries
- **Option B**: Use the Work Agent (4080) - better quality, but more expensive
- **Recommendation**: Use the Family Agent for most summaries; reserve the Work Agent for complex or long conversations

**Frequency**: After N turns (e.g., every 20 messages) or when a size threshold is reached

**Cost:**

- Family Agent: ~$0.00001 per summary
- Work Agent: ~$0.00004 per summary
- **Monthly** (10 summaries/day): ~$0.003-0.012/month

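
The trigger policy can be sketched as follows. The turn count (20) and size threshold are this document's example values, and `summarize_fn` stands in for a call to the family-agent LLM:

```python
# Summarization trigger: fire after every N turns or once the buffered
# history exceeds a size threshold, whichever comes first.

SUMMARY_EVERY_N_TURNS = 20   # example value from this document
MAX_BUFFER_CHARS = 8_000     # assumed size threshold

def should_summarize(turn_count: int, buffer_chars: int) -> bool:
    return turn_count >= SUMMARY_EVERY_N_TURNS or buffer_chars >= MAX_BUFFER_CHARS

def maybe_summarize(history: list[str], summarize_fn) -> list[str]:
    """Collapse the history into one summary message when the trigger fires."""
    chars = sum(len(m) for m in history)
    if should_summarize(len(history), chars):
        return [summarize_fn("\n".join(history))]
    return history
```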
#### 4. **Memory Retrieval Enhancement** (TICKET-041, TICKET-042)

**Model Choice:**

- Use the Family Agent (1050) for memory queries
- Lightweight embeddings can be computed without an LLM
- Only use the LLM for complex memory reasoning

**Cost**: Minimal - mostly embedding-based retrieval

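
Embedding-based retrieval needs no LLM in the loop, as this sketch shows. A real deployment would use a proper embedding model to produce the vectors; the 3-dimensional vectors and memory entries here are illustrative stand-ins:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec: list[float], memory: list[dict], top_k: int = 2) -> list[str]:
    """Return the top_k memory entries most similar to the query vector."""
    scored = sorted(memory, key=lambda m: cosine(query_vec, m["vec"]), reverse=True)
    return [m["text"] for m in scored[:top_k]]

memory = [
    {"text": "Dentist appointment on Friday", "vec": [0.9, 0.1, 0.0]},
    {"text": "Favorite pizza topping: mushrooms", "vec": [0.0, 0.8, 0.2]},
    {"text": "School pickup at 3pm", "vec": [0.7, 0.2, 0.1]},
]
print(retrieve([1.0, 0.0, 0.0], memory, top_k=1))
# -> ['Dentist appointment on Friday']
```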
## Cost Breakdown by Ticket

### Milestone 1 - Survey & Architecture

- **TICKET-017, TICKET-018, TICKET-019, TICKET-020**: No LLM costs (research only)

### Milestone 2 - Voice Chat MVP

#### TICKET-021: Stand Up 4080 LLM Service

- **Setup cost**: $0 (one-time)
- **Ongoing**: ~$1.08-1.80/month (work agent usage)

#### TICKET-022: Stand Up 1050 LLM Service

- **Setup cost**: $0 (one-time)
- **Ongoing**: ~$1.44-2.40/month (family agent, always-on)

#### TICKET-025: System Prompts

- **Cost**: $0 (configuration only)

#### TICKET-027: Multi-Turn Conversation

- **Cost**: $0 (infrastructure, no LLM calls)

#### TICKET-030: MCP-LLM Integration

- **Cost**: $0 (adapter code, uses existing LLM servers)

### Milestone 3 - Memory, Reminders, Safety

#### TICKET-041: Long-Term Memory Design

- **Cost**: $0 (design only)

#### TICKET-042: Long-Term Memory Implementation

- **Cost**: Minimal - mostly database operations
- **LLM usage**: Only for complex memory queries (~$0.01/month)

#### TICKET-043: Conversation Summarization

- **Cost**: ~$0.003-0.012/month (10 summaries/day)
- **Model**: Family Agent (1050) recommended

#### TICKET-044: Boundary Enforcement

- **Cost**: $0 (policy enforcement, no LLM)

#### TICKET-045: Confirmation Flows

- **Cost**: $0 (UI/logic, uses existing LLM for explanations)

#### TICKET-046: Admin Tools

- **Cost**: $0 (UI/logging, no LLM)

## Total Monthly Operating Costs

### Base Infrastructure (Always Running)

- **Family Agent (1050)**: ~$1.44-2.40/month
- **Work Agent (4080)**: ~$1.08-1.80/month (when active)
- **Total Base**: ~$2.52-4.20/month

### Variable Costs (Usage-Based)

- **Conversation Summarization**: ~$0.003-0.012/month
- **Memory Queries**: ~$0.01/month
- **Total Variable**: ~$0.013-0.022/month

### **Total Monthly Cost: ~$2.53-4.22/month**

## Cost Optimization Strategies

### 1. **Model Selection**

- Use the smallest model that meets quality requirements
- Q4 quantization for both agents (good quality/performance)
- Consider Q5 for the work agent if quality is critical

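
As a sanity check on quantization choices, a common rule of thumb is that weights occupy roughly parameters x bits-per-weight / 8 bytes, plus overhead for the KV cache and activations. The bits-per-weight and overhead figures below are coarse assumptions, not vendor specifications:

```python
# Rule-of-thumb VRAM estimate for a quantized model.
# Assumptions: ~4.5 effective bits/weight for Q4 GGUF-style quants,
# ~20% overhead for KV cache and activations.

def est_vram_gb(params_billions: float, bits: float, overhead: float = 0.2) -> float:
    weight_gb = params_billions * bits / 8  # 1B params ~ 1 GB at 8 bits/weight
    return weight_gb * (1 + overhead)

# Family agent: Phi-3 Mini 3.8B at Q4 -> roughly 2.5-2.6 GB,
# comfortably under the 1050's 4GB VRAM.
print(est_vram_gb(3.8, 4.5))
```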
### 2. **Usage Patterns**

- **Work Agent**: Only run when needed (not always-on)
- **Family Agent**: Always-on but low-power (the 1050 is efficient)
- **Summarization**: Batch process, use the cheaper model

### 3. **Context Management**

- Keep context windows reasonable (8K for work, 4K for family)
- Summarize aggressively to reduce context size
- Prune old messages regularly

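
A minimal pruning sketch: keep the system prompt plus the newest messages that fit a token budget. The `len(text) // 4` token estimate is a rough heuristic for illustration; real inference servers report exact counts:

```python
# Token-budget pruning: retain the system prompt and as many of the most
# recent messages as fit the budget. Token counts are approximated.

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough chars-to-tokens heuristic

def prune(messages: list[dict], budget: int) -> list[dict]:
    """messages: [{'role': ..., 'content': ...}]; first entry is the system prompt."""
    system, rest = messages[0], messages[1:]
    kept, used = [], approx_tokens(system["content"])
    for msg in reversed(rest):  # walk newest-first
        cost = approx_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return [system] + list(reversed(kept))
```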
### 4. **Hardware Optimization**

- Use efficient inference servers (llama.cpp, vLLM)
- Enable KV cache for faster responses
- Batch requests when possible (work agent)

## Alternative: Cloud API Costs (For Comparison)

If using cloud APIs instead of local models:

### OpenAI GPT-4

- **Work Agent**: ~$0.03-0.06 per request
- **Family Agent**: ~$0.01-0.02 per request
- **Monthly** (100 requests/day): ~$120-240/month

### Anthropic Claude

- **Work Agent**: ~$0.015-0.03 per request
- **Family Agent**: ~$0.008-0.015 per request
- **Monthly** (100 requests/day): ~$69-135/month

### **Local is roughly 15-95x cheaper** (~$2.53-4.22/month vs. $69-240/month)

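
The multiple follows directly from the monthly ranges above (comparing the cheapest cloud month to the most expensive local month, and vice versa):

```python
# Back-of-the-envelope check of the local-vs-cloud comparison,
# using the monthly ranges from this document.

local = (2.53, 4.22)     # $/month, local total (low, high)
gpt4 = (120.0, 240.0)    # $/month, GPT-4 estimate
claude = (69.0, 135.0)   # $/month, Claude estimate

def advantage(cloud: tuple, local: tuple) -> tuple:
    """Smallest and largest cost multiple of cloud over local."""
    return cloud[0] / local[1], cloud[1] / local[0]

lo, hi = advantage(gpt4, local)
print(f"GPT-4 costs {lo:.0f}-{hi:.0f}x more than local")
lo, hi = advantage(claude, local)
print(f"Claude costs {lo:.0f}-{hi:.0f}x more than local")
```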
## Recommendations by Ticket Priority

### High Priority (Do First)

1. **TICKET-019**: Select Work Agent Model - Choose an efficient 70B Q4 model
2. **TICKET-020**: Select Family Agent Model - Choose Phi-3 Mini or TinyLlama Q4
3. **TICKET-021**: Stand Up 4080 Service - Use Ollama or vLLM
4. **TICKET-022**: Stand Up 1050 Service - Use llama.cpp (lightweight)

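
If Ollama is chosen, a first smoke test of the service can look like this. It uses Ollama's documented HTTP endpoint on its default port; the `phi3:mini` tag is an example and should match whatever model was pulled:

```python
import json
import urllib.request

# Minimal client for a local Ollama server (default port 11434).
# Assumes the model has already been fetched with `ollama pull phi3:mini`.

def build_payload(model: str, prompt: str) -> bytes:
    """JSON body for Ollama's /api/generate endpoint (non-streaming)."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def generate(prompt: str, model: str = "phi3:mini",
             url: str = "http://localhost:11434/api/generate") -> str:
    req = urllib.request.Request(url, data=build_payload(model, prompt),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# generate("Set a timer for ten minutes.")  # requires a running Ollama server
```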
### Medium Priority

5. **TICKET-027**: Multi-Turn Conversation - Implement context management
6. **TICKET-043**: Summarization - Use the Family Agent for cost efficiency

### Low Priority (Optimize Later)

7. **TICKET-042**: Memory Implementation - Add LLM queries only if needed
8. **TICKET-024**: Logging & Metrics - Track costs and optimize

## Model Selection Matrix

| Task | Model | Hardware | Quantization | Cost/Hour | Use Case |
|------|-------|----------|--------------|-----------|----------|
| Work Agent | Llama 3.1 70B | RTX 4080 | Q4 | $0.018-0.03 | Coding, research |
| Family Agent | Phi-3 Mini 3.8B | GTX 1050 | Q4 | $0.006-0.01 | Daily conversations |
| Summarization | Phi-3 Mini 3.8B | GTX 1050 | Q4 | $0.006-0.01 | Conversation summaries |
| Memory Queries | Embeddings + Phi-3 | GTX 1050 | Q4 | Minimal | Memory retrieval |

## Notes

- All costs assume a $0.12/kWh electricity rate (US average)
- Costs scale with usage - adjust based on actual usage patterns
- Hardware depreciation is not included (one-time cost)
- Local models are **much cheaper** than cloud APIs
- Privacy benefit: no data leaves your network

## Next Steps

1. Complete TICKET-017 (Model Survey) to finalize model choices
2. Complete TICKET-018 (Capacity Assessment) to confirm VRAM fits
3. Select models based on this analysis
4. Monitor actual costs after deployment and optimize