# LLM Usage and Cost Analysis

## Overview

This document outlines which LLMs to use for different tasks in the Atlas voice agent system and estimates their operational costs.

**Key Hardware:**

- **RTX 4080** (16GB VRAM): Work agent, high-capability tasks
- **GTX 1050** (4GB VRAM): Family agent, always-on, low-latency

## LLM Usage by Task

### Primary Use Cases

#### 1. **Work Agent (RTX 4080)**

**Model Recommendations:**

- **Primary**: Llama 3.1 70B Q4/Q5 or DeepSeek Coder 33B Q4
- **Alternative**: Qwen 2.5 72B Q4, Mistral Large 2 (123B) Q4
- **Context**: 8K-16K tokens
- **Quantization**: Q4-Q5 (note: 70B-class weights at Q4 are ~40GB, so expect partial CPU offload on a 16GB card)

**Use Cases:**

- Coding assistance and code generation
- Research and analysis
- Complex reasoning tasks
- Technical documentation
- Code review and debugging

**Cost per Request:**

- **Electricity**: ~0.15-0.25 kWh per hour of active use
- **At $0.12/kWh**: ~$0.018-0.03/hour
- **Per request** (avg 5s generation): ~$0.000025-0.00004 per request
- **Monthly** (2 hours/day): ~$1.08-1.80/month

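
The figures above follow from a simple electricity model; a small helper reproduces them (the power-draw, timing, and rate numbers are this document's assumptions, not measurements):

```python
# Rough electricity-cost model for a local LLM agent.
# All inputs are assumptions from this document, not measured values.

KWH_RATE = 0.12  # $/kWh, the US-average rate assumed throughout

def cost_per_hour(kwh_per_hour: float, rate: float = KWH_RATE) -> float:
    """Electricity cost of one hour of active inference."""
    return kwh_per_hour * rate

def cost_per_request(kwh_per_hour: float, seconds_per_request: float) -> float:
    """Cost of a single request, given average generation time."""
    return cost_per_hour(kwh_per_hour) * seconds_per_request / 3600

def monthly_cost(kwh_per_hour: float, hours_per_day: float, days: int = 30) -> float:
    """Monthly cost for a given daily duty cycle."""
    return cost_per_hour(kwh_per_hour) * hours_per_day * days

# Work agent (RTX 4080) upper bound: 0.25 kWh/h, ~5s generation, 2h/day
print(cost_per_hour(0.25))        # $/hour
print(cost_per_request(0.25, 5))  # $/request
print(monthly_cost(0.25, 2))      # $/month
```

The same functions reproduce the family-agent figures with 0.05-0.08 kWh/h, 2s generation, and 8 hours/day.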
#### 2. **Family Agent (GTX 1050)**

**Model Recommendations:**

- **Primary**: Phi-3 Mini 3.8B Q4 or TinyLlama 1.1B Q4
- **Alternative**: Gemma 2B Q4, Qwen2.5 1.5B Q4
- **Context**: 4K-8K tokens
- **Quantization**: Q4 (fits in 4GB VRAM)

**Use Cases:**

- Daily conversations
- Task management (add task, update status)
- Weather queries
- Timers and reminders
- Simple Q&A
- Family-friendly interactions

**Cost per Request:**

- **Electricity**: ~0.05-0.08 kWh per hour of active use
- **At $0.12/kWh**: ~$0.006-0.01/hour
- **Per request** (avg 2s generation): ~$0.000003-0.000006 per request
- **Monthly** (always-on, 8 hours/day): ~$1.44-2.40/month

### Secondary Use Cases

#### 3. **Conversation Summarization** (TICKET-043)

**Model Choice:**

- **Option A**: Use the Family Agent (1050) - cheaper, sufficient for summaries
- **Option B**: Use the Work Agent (4080) - better quality, but more expensive
- **Recommendation**: Use the Family Agent for most summaries; reserve the Work Agent for complex or long conversations

**Frequency**: After N turns (e.g., every 20 messages) or when a size threshold is reached

**Cost:**

- Family Agent: ~$0.00001 per summary
- Work Agent: ~$0.00004 per summary
- **Monthly** (10 summaries/day): ~$0.003-0.012/month

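
The trigger policy can be sketched as follows. The turn count (20) and size threshold are this document's example values, and `summarize_fn` stands in for a call to the family-agent LLM:

```python
# Summarization trigger: fire after every N turns or once the buffered
# history exceeds a size threshold, whichever comes first.

SUMMARY_EVERY_N_TURNS = 20   # example value from this document
MAX_BUFFER_CHARS = 8_000     # assumed size threshold

def should_summarize(turn_count: int, buffer_chars: int) -> bool:
    return turn_count >= SUMMARY_EVERY_N_TURNS or buffer_chars >= MAX_BUFFER_CHARS

def maybe_summarize(history: list[str], summarize_fn) -> list[str]:
    """Collapse the history into one summary message when the trigger fires."""
    chars = sum(len(m) for m in history)
    if should_summarize(len(history), chars):
        return [summarize_fn("\n".join(history))]
    return history
```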
#### 4. **Memory Retrieval Enhancement** (TICKET-041, TICKET-042)

**Model Choice:**

- Use the Family Agent (1050) for memory queries
- Lightweight embeddings can be computed without an LLM
- Only use the LLM for complex memory reasoning

**Cost**: Minimal - mostly embedding-based retrieval

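
Embedding-based retrieval needs no LLM in the loop, as this sketch shows. A real deployment would use a proper embedding model to produce the vectors; the 3-dimensional vectors and memory entries here are illustrative stand-ins:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec: list[float], memory: list[dict], top_k: int = 2) -> list[str]:
    """Return the top_k memory entries most similar to the query vector."""
    scored = sorted(memory, key=lambda m: cosine(query_vec, m["vec"]), reverse=True)
    return [m["text"] for m in scored[:top_k]]

memory = [
    {"text": "Dentist appointment on Friday", "vec": [0.9, 0.1, 0.0]},
    {"text": "Favorite pizza topping: mushrooms", "vec": [0.0, 0.8, 0.2]},
    {"text": "School pickup at 3pm", "vec": [0.7, 0.2, 0.1]},
]
print(retrieve([1.0, 0.0, 0.0], memory, top_k=1))
# -> ['Dentist appointment on Friday']
```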
## Cost Breakdown by Ticket

### Milestone 1 - Survey & Architecture

- **TICKET-017, TICKET-018, TICKET-019, TICKET-020**: No LLM costs (research only)

### Milestone 2 - Voice Chat MVP

#### TICKET-021: Stand Up 4080 LLM Service

- **Setup cost**: $0 (one-time)
- **Ongoing**: ~$1.08-1.80/month (work agent usage)

#### TICKET-022: Stand Up 1050 LLM Service

- **Setup cost**: $0 (one-time)
- **Ongoing**: ~$1.44-2.40/month (family agent, always-on)

#### TICKET-025: System Prompts

- **Cost**: $0 (configuration only)

#### TICKET-027: Multi-Turn Conversation

- **Cost**: $0 (infrastructure, no LLM calls)

#### TICKET-030: MCP-LLM Integration

- **Cost**: $0 (adapter code, uses existing LLM servers)

### Milestone 3 - Memory, Reminders, Safety

#### TICKET-041: Long-Term Memory Design

- **Cost**: $0 (design only)

#### TICKET-042: Long-Term Memory Implementation

- **Cost**: Minimal - mostly database operations
- **LLM usage**: Only for complex memory queries (~$0.01/month)

#### TICKET-043: Conversation Summarization

- **Cost**: ~$0.003-0.012/month (10 summaries/day)
- **Model**: Family Agent (1050) recommended

#### TICKET-044: Boundary Enforcement

- **Cost**: $0 (policy enforcement, no LLM)

#### TICKET-045: Confirmation Flows

- **Cost**: $0 (UI/logic, uses existing LLM for explanations)

#### TICKET-046: Admin Tools

- **Cost**: $0 (UI/logging, no LLM)

## Total Monthly Operating Costs

### Base Infrastructure (Always Running)

- **Family Agent (1050)**: ~$1.44-2.40/month
- **Work Agent (4080)**: ~$1.08-1.80/month (when active)
- **Total Base**: ~$2.52-4.20/month

### Variable Costs (Usage-Based)

- **Conversation Summarization**: ~$0.003-0.012/month
- **Memory Queries**: ~$0.01/month
- **Total Variable**: ~$0.013-0.022/month

### **Total Monthly Cost: ~$2.53-4.22/month**

## Cost Optimization Strategies

### 1. **Model Selection**

- Use the smallest model that meets quality requirements
- Q4 quantization for both agents (good quality/performance)
- Consider Q5 for the work agent if quality is critical

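
As a sanity check on quantization choices, a common rule of thumb is that weights occupy roughly parameters x bits-per-weight / 8 bytes, plus overhead for the KV cache and activations. The bits-per-weight and overhead figures below are coarse assumptions, not vendor specifications:

```python
# Rule-of-thumb VRAM estimate for a quantized model.
# Assumptions: ~4.5 effective bits/weight for Q4 GGUF-style quants,
# ~20% overhead for KV cache and activations.

def est_vram_gb(params_billions: float, bits: float, overhead: float = 0.2) -> float:
    weight_gb = params_billions * bits / 8  # 1B params ~ 1 GB at 8 bits/weight
    return weight_gb * (1 + overhead)

# Family agent: Phi-3 Mini 3.8B at Q4 -> roughly 2.5-2.6 GB,
# comfortably under the 1050's 4GB VRAM.
print(est_vram_gb(3.8, 4.5))
```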
### 2. **Usage Patterns**

- **Work Agent**: Only run when needed (not always-on)
- **Family Agent**: Always-on but low-power (the 1050 is efficient)
- **Summarization**: Batch process, use the cheaper model

### 3. **Context Management**

- Keep context windows reasonable (8K for work, 4K for family)
- Summarize aggressively to reduce context size
- Prune old messages regularly

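
A minimal pruning sketch: keep the system prompt plus the newest messages that fit a token budget. The `len(text) // 4` token estimate is a rough heuristic for illustration; real inference servers report exact counts:

```python
# Token-budget pruning: retain the system prompt and as many of the most
# recent messages as fit the budget. Token counts are approximated.

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough chars-to-tokens heuristic

def prune(messages: list[dict], budget: int) -> list[dict]:
    """messages: [{'role': ..., 'content': ...}]; first entry is the system prompt."""
    system, rest = messages[0], messages[1:]
    kept, used = [], approx_tokens(system["content"])
    for msg in reversed(rest):  # walk newest-first
        cost = approx_tokens(msg["content"])
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return [system] + list(reversed(kept))
```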
### 4. **Hardware Optimization**

- Use efficient inference servers (llama.cpp, vLLM)
- Enable KV cache for faster responses
- Batch requests when possible (work agent)

## Alternative: Cloud API Costs (For Comparison)

If using cloud APIs instead of local models:

### OpenAI GPT-4

- **Work Agent**: ~$0.03-0.06 per request
- **Family Agent**: ~$0.01-0.02 per request
- **Monthly** (100 requests/day): ~$120-240/month

### Anthropic Claude

- **Work Agent**: ~$0.015-0.03 per request
- **Family Agent**: ~$0.008-0.015 per request
- **Monthly** (100 requests/day): ~$69-135/month

### **Local is roughly 15-95x cheaper** (~$2.53-4.22/month vs. $69-240/month)

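
The multiple follows directly from the monthly ranges above (comparing the cheapest cloud month to the most expensive local month, and vice versa):

```python
# Back-of-the-envelope check of the local-vs-cloud comparison,
# using the monthly ranges from this document.

local = (2.53, 4.22)     # $/month, local total (low, high)
gpt4 = (120.0, 240.0)    # $/month, GPT-4 estimate
claude = (69.0, 135.0)   # $/month, Claude estimate

def advantage(cloud: tuple, local: tuple) -> tuple:
    """Smallest and largest cost multiple of cloud over local."""
    return cloud[0] / local[1], cloud[1] / local[0]

lo, hi = advantage(gpt4, local)
print(f"GPT-4 costs {lo:.0f}-{hi:.0f}x more than local")
lo, hi = advantage(claude, local)
print(f"Claude costs {lo:.0f}-{hi:.0f}x more than local")
```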
## Recommendations by Ticket Priority

### High Priority (Do First)

1. **TICKET-019**: Select Work Agent Model - Choose an efficient 70B Q4 model
2. **TICKET-020**: Select Family Agent Model - Choose Phi-3 Mini or TinyLlama Q4
3. **TICKET-021**: Stand Up 4080 Service - Use Ollama or vLLM
4. **TICKET-022**: Stand Up 1050 Service - Use llama.cpp (lightweight)

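
If Ollama is chosen, a first smoke test of the service can look like this. It uses Ollama's documented HTTP endpoint on its default port; the `phi3:mini` tag is an example and should match whatever model was pulled:

```python
import json
import urllib.request

# Minimal client for a local Ollama server (default port 11434).
# Assumes the model has already been fetched with `ollama pull phi3:mini`.

def build_payload(model: str, prompt: str) -> bytes:
    """JSON body for Ollama's /api/generate endpoint (non-streaming)."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def generate(prompt: str, model: str = "phi3:mini",
             url: str = "http://localhost:11434/api/generate") -> str:
    req = urllib.request.Request(url, data=build_payload(model, prompt),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# generate("Set a timer for ten minutes.")  # requires a running Ollama server
```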
### Medium Priority

5. **TICKET-027**: Multi-Turn Conversation - Implement context management
6. **TICKET-043**: Summarization - Use the Family Agent for cost efficiency

### Low Priority (Optimize Later)

7. **TICKET-042**: Memory Implementation - Add LLM queries only if needed
8. **TICKET-024**: Logging & Metrics - Track costs and optimize

## Model Selection Matrix

| Task | Model | Hardware | Quantization | Cost/Hour | Use Case |
|------|-------|----------|--------------|-----------|----------|
| Work Agent | Llama 3.1 70B | RTX 4080 | Q4 | $0.018-0.03 | Coding, research |
| Family Agent | Phi-3 Mini 3.8B | GTX 1050 | Q4 | $0.006-0.01 | Daily conversations |
| Summarization | Phi-3 Mini 3.8B | GTX 1050 | Q4 | $0.006-0.01 | Conversation summaries |
| Memory Queries | Embeddings + Phi-3 | GTX 1050 | Q4 | Minimal | Memory retrieval |

## Notes

- All costs assume a $0.12/kWh electricity rate (US average)
- Costs scale with usage - adjust based on actual usage patterns
- Hardware depreciation is not included (one-time cost)
- Local models are **much cheaper** than cloud APIs
- Privacy benefit: no data leaves your network

## Next Steps

1. Complete TICKET-017 (Model Survey) to finalize model choices
2. Complete TICKET-018 (Capacity Assessment) to confirm VRAM fits
3. Select models based on this analysis
4. Monitor actual costs after deployment and optimize