# Final Model Selection

## Overview

This document finalizes the LLM model selections for the Atlas voice agent system, based on the model survey (TICKET-017) and capacity assessment (TICKET-018).

## Work Agent Model Selection (RTX 4080)

### Selected Model: **Llama 3.1 70B Q4**

**Rationale:**

- Best overall balance of coding and research capabilities
- Excellent function calling support (required for MCP integration)
- Fits comfortably in 16GB VRAM (~14GB usage)
- Large context window (128K tokens, practical limit 8K)
- Well-documented and widely supported
- Strong performance for both coding and general research tasks

**Specifications:**

- **Model**: meta-llama/Meta-Llama-3.1-70B-Instruct
- **Quantization**: Q4 (4-bit)
- **VRAM Usage**: ~14GB
- **Context Window**: 8K tokens (practical limit)
- **Expected Latency**: ~200-300ms first token, ~3-4s for 100 tokens
- **Concurrency**: 2 requests maximum

**Alternative Model:**

- **DeepSeek Coder 33B Q4** - if coding is the primary focus
  - Faster inference (~100-200ms first token)
  - Lower VRAM usage (~8GB)
  - Larger practical context (16K tokens)
  - Less capable for general research

**Model Source:**

- Hugging Face: `meta-llama/Meta-Llama-3.1-70B-Instruct`
- Quantized version: use llama.cpp or AutoGPTQ for Q4 quantization
- Or use Ollama: `ollama pull llama3.1:70b` (Q4 is the default quantization)

**Performance Characteristics:**

- Coding: ⭐⭐⭐⭐⭐ (Excellent)
- Research: ⭐⭐⭐⭐⭐ (Excellent)
- Function Calling: ✅ Native support
- Speed: Medium (acceptable for work tasks)

## Family Agent Model Selection (RTX 1050)

### Selected Model: **Phi-3 Mini 3.8B Q4**

**Rationale:**

- Excellent instruction following (critical for family agent)
- Very fast inference (<1s latency for interactive use)
- Low VRAM usage (~2.5GB, comfortable margin)
- Good function calling support
- Large context window (128K tokens in the 128k variant, practical limit 8K)
- Microsoft-backed, well-maintained

**Specifications:**

- **Model**: microsoft/Phi-3-mini-128k-instruct
- **Quantization**: Q4 (4-bit)
- **VRAM
Usage**: ~2.5GB
- **Context Window**: 8K tokens (practical limit)
- **Expected Latency**: ~50-100ms first token, ~1-1.5s for 100 tokens
- **Concurrency**: 1-2 requests maximum

**Alternative Model:**

- **Qwen2.5 1.5B Q4** - if more VRAM headroom is needed
  - Smaller VRAM footprint (~1.2GB)
  - Still fast inference
  - Slightly less capable than Phi-3 Mini

**Model Source:**

- Hugging Face: `microsoft/Phi-3-mini-128k-instruct`
- Quantized version: use llama.cpp for Q4 quantization
- Or use Ollama: `ollama pull phi3:mini-128k`

**Performance Characteristics:**

- Instruction Following: ⭐⭐⭐⭐⭐ (Excellent)
- Function Calling: ✅ Native support
- Speed: Very Fast (<1s latency)
- Efficiency: High (low power consumption)

## Selection Summary

| Agent | Model | Size | Quantization | VRAM | Context | Latency (100 tokens) |
|-------|-------|------|--------------|------|---------|----------------------|
| **Work** | Llama 3.1 70B | 70B | Q4 | ~14GB | 8K | ~3-4s |
| **Family** | Phi-3 Mini 3.8B | 3.8B | Q4 | ~2.5GB | 8K | ~1-1.5s |

## Implementation Plan

### Phase 1: Download and Test

1. Download the Llama 3.1 70B Q4 quantized model
2. Download the Phi-3 Mini 3.8B Q4 quantized model
3. Test on the actual hardware (4080 and 1050)
4. Benchmark actual VRAM usage and latency
5. Verify function calling support

### Phase 2: Set Up Inference Servers

1. Set up Ollama or vLLM for the 4080 (TICKET-021)
2. Set up llama.cpp or Ollama for the 1050 (TICKET-022)
3. Configure context windows (8K for both)
4. Test concurrent request handling

### Phase 3: Integration

1. Integrate with the MCP server (TICKET-030)
2. Test function calling end-to-end
3.
Optimize based on real-world performance

## Model Files Location

**Recommended Structure:**

```
models/
├── work-agent/
│   └── llama-3.1-70b-q4.gguf
├── family-agent/
│   └── phi-3-mini-3.8b-q4.gguf
└── backups/
```

## Cost Analysis

Based on `docs/LLM_USAGE_AND_COSTS.md`:

- **Work Agent (4080)**: ~$1.08-1.80/month (2 hours/day usage)
- **Family Agent (1050)**: ~$1.44-2.40/month (always-on, 8 hours/day)
- **Total**: ~$2.52-4.20/month

## Next Steps

1. ✅ Model selection complete (TICKET-019, TICKET-020)
2. Download the selected models
3. Set up inference servers (TICKET-021, TICKET-022)
4. Test and benchmark on actual hardware
5. Integrate with MCP (TICKET-030)

## References

- Model Survey: `docs/LLM_MODEL_SURVEY.md`
- Capacity Assessment: `docs/LLM_CAPACITY.md`
- Usage & Costs: `docs/LLM_USAGE_AND_COSTS.md`

---

**Last Updated**: 2024-01-XX
**Status**: Selection Finalized - Ready for Implementation (TICKET-021, TICKET-022)
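As a sketch for Phase 1's "benchmark actual VRAM usage and latency" step, the latency numbers above can be checked against Ollama's streaming API, which reports timing metrics (in nanoseconds) in its final streamed chunk. This is a minimal example, not the project's actual tooling: the endpoint is Ollama's default `localhost:11434`, and the `summarize_metrics`/`benchmark` helpers are hypothetical names introduced here.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint


def summarize_metrics(final: dict) -> dict:
    """Convert Ollama's final streamed metrics (nanosecond durations)
    into tokens/sec and an approximate time-to-first-token."""
    tokens_per_sec = final["eval_count"] / (final["eval_duration"] / 1e9)
    first_token_s = (final.get("load_duration", 0)
                     + final.get("prompt_eval_duration", 0)) / 1e9
    return {
        "tokens_per_sec": round(tokens_per_sec, 1),
        "approx_first_token_s": round(first_token_s, 2),
    }


def benchmark(model: str, prompt: str) -> dict:
    """Stream one completion and return latency/throughput numbers."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps({"model": model, "prompt": prompt,
                         "stream": True}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # NDJSON stream: one JSON chunk per line
            chunk = json.loads(line)
            if chunk.get("done"):  # final chunk carries the metrics
                return summarize_metrics(chunk)
    raise RuntimeError("stream ended without a final metrics chunk")


# Example (requires a running Ollama server with the model pulled):
#   benchmark("llama3.1:70b", "Write a haiku about GPUs.")
```

Running this once per model against the ~3-4s and ~1-1.5s targets in the summary table would confirm or revise the expected-latency rows before the inference servers are wired up.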
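Phase 3's "test function calling end-to-end" step needs a way to confirm the model emitted a structured tool call rather than prose. A small sketch, assuming the `tool_calls` message shape that Ollama's `/api/chat` endpoint returns for tool-enabled models; the `extract_tool_calls` helper is hypothetical, and `get_weather` below is a stand-in for whatever tools the MCP server (TICKET-030) actually exposes.

```python
def extract_tool_calls(message: dict) -> list[tuple[str, dict]]:
    """Pull (function name, arguments) pairs out of an assistant chat
    message, using the tool_calls shape Ollama's /api/chat returns.
    Returns an empty list when the model answered in plain text."""
    calls = []
    for call in message.get("tool_calls", []):
        fn = call.get("function", {})
        calls.append((fn.get("name", ""), fn.get("arguments", {})))
    return calls


# Example assistant message from a tool-enabled chat request:
sample = {
    "role": "assistant",
    "content": "",
    "tool_calls": [
        {"function": {"name": "get_weather",
                      "arguments": {"city": "Oslo"}}},
    ],
}
```

An end-to-end check would send a prompt that should trigger a tool, assert `extract_tool_calls` returns a non-empty list, and then verify the arguments parse against the MCP tool's schema.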