# LLM Usage and Cost Analysis

## Overview

This document outlines which LLMs to use for different tasks in the Atlas voice agent system, and estimates operational costs.

**Key Hardware:**

- **RTX 4080** (16GB VRAM): Work agent, high-capability tasks
- **GTX 1050** (4GB VRAM): Family agent, always-on, low-latency

## LLM Usage by Task

### Primary Use Cases

#### 1. **Work Agent (RTX 4080)**

**Model Recommendations:**

- **Primary**: Llama 3.1 70B Q4/Q5 or DeepSeek Coder 33B Q4
- **Alternative**: Qwen 2.5 72B Q4, Mistral Large 2 123B Q4
- **Context**: 8K-16K tokens
- **Quantization**: Q4-Q5; note that models above roughly 25B at Q4 exceed 16GB VRAM (a 70B Q4 needs ~40GB), so the larger candidates run with partial CPU offload at reduced speed

**Use Cases:**

- Coding assistance and code generation
- Research and analysis
- Complex reasoning tasks
- Technical documentation
- Code review and debugging

**Cost per Request:**

- **Electricity**: ~0.15-0.25 kWh per hour of active use
- **At $0.12/kWh**: ~$0.018-0.03/hour
- **Per request** (avg 5s generation): ~$0.000025-0.00004
- **Monthly** (2 hours/day): ~$1.08-1.80/month

#### 2. **Family Agent (GTX 1050)**

**Model Recommendations:**

- **Primary**: Phi-3 Mini 3.8B Q4 or TinyLlama 1.1B Q4
- **Alternative**: Gemma 2B Q4, Qwen2.5 1.5B Q4
- **Context**: 4K-8K tokens
- **Quantization**: Q4 (fits in 4GB VRAM)

**Use Cases:**

- Daily conversations
- Task management (add task, update status)
- Weather queries
- Timers and reminders
- Simple Q&A
- Family-friendly interactions

**Cost per Request:**

- **Electricity**: ~0.05-0.08 kWh per hour of active use
- **At $0.12/kWh**: ~$0.006-0.01/hour
- **Per request** (avg 2s generation): ~$0.000003-0.000006
- **Monthly** (always-on, 8 hours/day): ~$1.44-2.40/month

### Secondary Use Cases
#### 3. **Conversation Summarization** (TICKET-043)

**Model Choice:**

- **Option A**: Use Family Agent (1050) - cheaper, sufficient for summaries
- **Option B**: Use Work Agent (4080) - better quality, but more expensive
- **Recommendation**: Use Family Agent for most summaries, Work Agent for complex/long conversations

**Frequency**: After N turns (e.g., every 20 messages) or size threshold

**Cost**:

- Family Agent: ~$0.00001 per summary
- Work Agent: ~$0.00004 per summary
- **Monthly** (10 summaries/day): ~$0.003-0.012/month

#### 4. **Memory Retrieval Enhancement** (TICKET-041, TICKET-042)

**Model Choice:**

- Use Family Agent (1050) for memory queries
- Lightweight embeddings can be done without an LLM
- Only use the LLM for complex memory reasoning

**Cost**: Minimal - mostly embedding-based retrieval

## Cost Breakdown by Ticket

### Milestone 1 - Survey & Architecture

- **TICKET-017, TICKET-018, TICKET-019, TICKET-020**: No LLM costs (research only)

### Milestone 2 - Voice Chat MVP

#### TICKET-021: Stand Up 4080 LLM Service

- **Setup cost**: $0 (one-time)
- **Ongoing**: ~$1.08-1.80/month (work agent usage)

#### TICKET-022: Stand Up 1050 LLM Service

- **Setup cost**: $0 (one-time)
- **Ongoing**: ~$1.44-2.40/month (family agent, always-on)

#### TICKET-025: System Prompts

- **Cost**: $0 (configuration only)

#### TICKET-027: Multi-Turn Conversation

- **Cost**: $0 (infrastructure, no LLM calls)

#### TICKET-030: MCP-LLM Integration

- **Cost**: $0 (adapter code, uses existing LLM servers)

### Milestone 3 - Memory, Reminders, Safety

#### TICKET-041: Long-Term Memory Design

- **Cost**: $0 (design only)

#### TICKET-042: Long-Term Memory Implementation

- **Cost**: Minimal - mostly database operations
- **LLM usage**: Only for complex memory queries (~$0.01/month)

#### TICKET-043: Conversation Summarization

- **Cost**: ~$0.003-0.012/month (10 summaries/day)
- **Model**: Family Agent (1050) recommended

#### TICKET-044: Boundary Enforcement

- **Cost**: $0 (policy enforcement, no LLM)
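The per-hour, per-request, and monthly figures used throughout this breakdown reduce to one power-draw formula. A minimal sketch in Python; the kWh rates, generation times, and duty cycles are this document's assumptions, not measurements:

```python
# Back-of-envelope electricity cost model for local LLM serving.
# Power draw, rate, and timing figures mirror the assumptions in the
# usage sections above; replace them with measured values once deployed.

KWH_RATE = 0.12  # $/kWh, the US-average assumption used in this document

def cost_per_hour(kwh_per_hour: float, rate: float = KWH_RATE) -> float:
    """Electricity cost for one hour of active inference."""
    return kwh_per_hour * rate

def cost_per_request(kwh_per_hour: float, seconds: float,
                     rate: float = KWH_RATE) -> float:
    """Cost of a single request, given average generation time in seconds."""
    return cost_per_hour(kwh_per_hour, rate) * seconds / 3600

def monthly_cost(kwh_per_hour: float, hours_per_day: float, days: int = 30,
                 rate: float = KWH_RATE) -> float:
    """Monthly electricity cost for a given daily duty cycle."""
    return cost_per_hour(kwh_per_hour, rate) * hours_per_day * days

# Work agent (4080), low end: 0.15 kWh/h, 5 s/request, 2 h/day
# -> ~$0.018/hour, ~$0.000025/request, ~$1.08/month
# Family agent (1050), low end: 0.05 kWh/h, 2 s/request, 8 h/day
# -> ~$0.006/hour, ~$0.000003/request, ~$1.44/month
```

The same `monthly_cost` call reproduces the per-ticket ongoing figures above, so the spreadsheet can stay a one-liner per line item.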
#### TICKET-045: Confirmation Flows

- **Cost**: $0 (UI/logic, uses existing LLM for explanations)

#### TICKET-046: Admin Tools

- **Cost**: $0 (UI/logging, no LLM)

## Total Monthly Operating Costs

### Base Infrastructure (Always Running)

- **Family Agent (1050)**: ~$1.44-2.40/month
- **Work Agent (4080)**: ~$1.08-1.80/month (when active)
- **Total Base**: ~$2.52-4.20/month

### Variable Costs (Usage-Based)

- **Conversation Summarization**: ~$0.003-0.012/month
- **Memory Queries**: ~$0.01/month
- **Total Variable**: ~$0.013-0.022/month

### **Total Monthly Cost: ~$2.53-4.22/month**

## Cost Optimization Strategies

### 1. **Model Selection**

- Use the smallest model that meets quality requirements
- Q4 quantization for both agents (good quality/performance balance)
- Consider Q5 for the work agent if quality is critical

### 2. **Usage Patterns**

- **Work Agent**: Only run when needed (not always-on)
- **Family Agent**: Always-on but low-power (the 1050 is efficient)
- **Summarization**: Batch process, use the cheaper model

### 3. **Context Management**

- Keep context windows reasonable (8K for work, 4K for family)
- Summarize aggressively to reduce context size
- Prune old messages regularly

### 4. **Hardware Optimization**

- Use efficient inference servers (llama.cpp, vLLM)
- Enable KV caching for faster responses
- Batch requests when possible (work agent)

## Alternative: Cloud API Costs (For Comparison)

If using cloud APIs instead of local models:

### OpenAI GPT-4

- **Work Agent**: ~$0.03-0.06 per request
- **Family Agent**: ~$0.01-0.02 per request
- **Monthly** (100 requests/day per agent): ~$120-240/month

### Anthropic Claude

- **Work Agent**: ~$0.015-0.03 per request
- **Family Agent**: ~$0.008-0.015 per request
- **Monthly** (100 requests/day per agent): ~$69-135/month

### **Local is roughly 15-95x cheaper**

## Recommendations by Ticket Priority

### High Priority (Do First)

1. **TICKET-019**: Select Work Agent Model - Choose an efficient Q4 model (70B-class will need partial CPU offload in 16GB VRAM)
2. **TICKET-020**: Select Family Agent Model - Choose Phi-3 Mini or TinyLlama Q4
3. **TICKET-021**: Stand Up 4080 Service - Use Ollama or vLLM
4. **TICKET-022**: Stand Up 1050 Service - Use llama.cpp (lightweight)

### Medium Priority

5. **TICKET-027**: Multi-Turn Conversation - Implement context management
6. **TICKET-043**: Summarization - Use Family Agent for cost efficiency

### Low Priority (Optimize Later)

7. **TICKET-042**: Memory Implementation - Add LLM queries only if needed
8. **TICKET-024**: Logging & Metrics - Track costs and optimize

## Model Selection Matrix

| Task | Model | Hardware | Quantization | Cost/Hour | Use Case |
|------|-------|----------|--------------|-----------|----------|
| Work Agent | Llama 3.1 70B | RTX 4080 | Q4 | $0.018-0.03 | Coding, research |
| Family Agent | Phi-3 Mini 3.8B | GTX 1050 | Q4 | $0.006-0.01 | Daily conversations |
| Summarization | Phi-3 Mini 3.8B | GTX 1050 | Q4 | $0.006-0.01 | Conversation summaries |
| Memory Queries | Embeddings + Phi-3 | GTX 1050 | Q4 | Minimal | Memory retrieval |

## Notes

- All costs assume a $0.12/kWh electricity rate (roughly the US average)
- Costs scale with usage - adjust based on actual usage patterns
- Hardware depreciation is not included (one-time cost)
- Local models are **much cheaper** than cloud APIs
- Privacy benefit: no data leaves your network

## Next Steps

1. Complete TICKET-017 (Model Survey) to finalize model choices
2. Complete TICKET-018 (Capacity Assessment) to confirm the chosen models fit in VRAM
3. Select models based on this analysis
4. Monitor actual costs after deployment and optimize
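For the capacity-assessment step (TICKET-018), a back-of-envelope VRAM check helps sanity-check model choices before anything is deployed. The sketch below is a rough rule of thumb, not a measured figure: weights at 4 bits per parameter plus roughly 20% overhead for KV cache and runtime buffers; both factors are assumptions for planning only.

```python
# Rough VRAM estimate for a quantized model: weights at `bits` per
# parameter plus ~20% overhead for KV cache and runtime buffers.
# Both the bit width and the overhead factor are ballpark assumptions.

def estimated_vram_gb(params_billion: float, bits: int = 4,
                      overhead: float = 1.2) -> float:
    weights_gb = params_billion * bits / 8  # 1B params at 8 bits ~= 1 GB
    return weights_gb * overhead

def fits_in_vram(params_billion: float, vram_gb: float, bits: int = 4) -> bool:
    return estimated_vram_gb(params_billion, bits) <= vram_gb

# Phi-3 Mini (3.8B) at Q4 on the 4 GB card: ~2.3 GB -> fits
# Llama 3.1 70B at Q4 on the 16 GB card: ~42 GB -> needs CPU offload
```

By this estimate the 70B-class work-agent candidates exceed the 4080's 16GB and would run with partial CPU offload, while models up to roughly 25B at Q4 stay fully GPU-resident; the final call belongs to TICKET-017/018.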