# Final Model Selection

## Overview
This document finalizes the LLM model selections for the Atlas voice agent system based on the model survey (TICKET-017) and capacity assessment (TICKET-018).
## Work Agent Model Selection (RTX 4080)

**Selected Model: Llama 3.1 70B Q4**

**Rationale:**
- Best overall balance of coding and research capabilities
- Excellent function calling support (required for MCP integration)
- Fits comfortably in 16GB VRAM (~14GB usage)
- Large context window (128K tokens, practical limit 8K)
- Well-documented and widely supported
- Strong performance for both coding and general research tasks
**Specifications:**
- Model: `meta-llama/Meta-Llama-3.1-70B-Instruct`
- Quantization: Q4 (4-bit)
- VRAM Usage: ~14GB
- Context Window: 8K tokens (practical limit)
- Expected Latency: ~200-300ms first token, ~3-4s for 100 tokens
- Concurrency: 2 requests maximum
**Alternative Model:**
- DeepSeek Coder 33B Q4: if coding is the primary focus
  - Faster inference (~100-200ms first token)
  - Lower VRAM usage (~8GB)
  - Larger practical context (16K tokens)
  - Less capable for general research
**Model Source:**
- Hugging Face: `meta-llama/Meta-Llama-3.1-70B-Instruct` (use llama.cpp or AutoGPTQ for Q4 quantization)
- Or via Ollama: `ollama pull llama3.1:70b-instruct-q4_0`
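As a quick sanity check after pulling, the model can be exercised through Ollama's local HTTP API (it listens on port 11434 by default). A minimal sketch, assuming the tag from the pull command above and the `requests` package:

```python
import requests

# Smoke-test the work-agent model through the local Ollama server.
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1:70b-instruct-q4_0",  # tag from the pull command above
        "messages": [{"role": "user", "content": "Reply with 'ready'."}],
        "stream": False,
        "options": {"num_ctx": 8192},  # pin the 8K practical context limit
    },
    timeout=300,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```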
**Performance Characteristics:**
- Coding: ⭐⭐⭐⭐⭐ (Excellent)
- Research: ⭐⭐⭐⭐⭐ (Excellent)
- Function Calling: ✅ Native support
- Speed: Medium (acceptable for work tasks)
## Family Agent Model Selection (GTX 1050)

**Selected Model: Phi-3 Mini 3.8B Q4**

**Rationale:**
- Excellent instruction following (critical for family agent)
- Very fast inference (<1s latency for interactive use)
- Low VRAM usage (~2.5GB, comfortable margin)
- Good function calling support
- Large context window (128K tokens, practical limit 8K)
- Microsoft-backed, well-maintained
**Specifications:**
- Model: `microsoft/Phi-3-mini-128k-instruct` (the 128K-context variant, needed for the 8K practical limit below)
- Quantization: Q4 (4-bit)
- VRAM Usage: ~2.5GB
- Context Window: 8K tokens (practical limit)
- Expected Latency: ~50-100ms first token, ~1-1.5s for 100 tokens
- Concurrency: 1-2 requests maximum
**Alternative Model:**
- Qwen2.5 1.5B Q4: if more VRAM headroom is needed
  - Smaller VRAM footprint (~1.2GB)
  - Still fast inference
  - Slightly less capable than Phi-3 Mini
**Model Source:**
- Hugging Face: `microsoft/Phi-3-mini-128k-instruct` (use llama.cpp for Q4 quantization)
- Or via Ollama: `ollama pull phi3:mini-128k`
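Because sub-second latency is the key requirement for this agent, it is worth measuring time-to-first-token directly. A minimal sketch that streams a short completion from the local Ollama server and timestamps the first chunk (model tag as pulled above):

```python
import json
import time

import requests

start = time.perf_counter()
first_token = total = None

# Stream a short completion and record when the first chunk arrives.
with requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "phi3:mini-128k",  # tag from the pull command above
        "messages": [{"role": "user", "content": "One-line fun fact, please."}],
        "stream": True,
        "options": {"num_ctx": 8192},
    },
    stream=True,
    timeout=120,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        if first_token is None:
            first_token = time.perf_counter() - start
        if json.loads(line).get("done"):
            total = time.perf_counter() - start

print(f"first token: {first_token * 1000:.0f} ms, full response: {total:.2f} s")
```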
**Performance Characteristics:**
- Instruction Following: ⭐⭐⭐⭐⭐ (Excellent)
- Function Calling: ✅ Native support
- Speed: Very Fast (<1s latency)
- Efficiency: High (low power consumption)
## Selection Summary

| Agent | Model | Size | Quantization | VRAM | Context | Latency (100 tokens) |
|---|---|---|---|---|---|---|
| Work | Llama 3.1 70B | 70B | Q4 | ~14GB | 8K | ~3-4s |
| Family | Phi-3 Mini 3.8B | 3.8B | Q4 | ~2.5GB | 8K | ~1-1.5s |
## Implementation Plan

### Phase 1: Download and Test
- Download Llama 3.1 70B Q4 quantized model
- Download Phi-3 Mini 3.8B Q4 quantized model
- Test on actual hardware (4080 and 1050)
- Benchmark actual VRAM usage and latency (see the sketch after this list)
- Verify function calling support
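For the VRAM benchmark, one approach is to poll `nvidia-smi` while a request is in flight and record the peak. A rough sketch (assumes the NVIDIA driver's `nvidia-smi` is on the PATH and Ollama is serving the model locally):

```python
import subprocess
import threading

import requests

def vram_used_mib() -> int:
    """Current memory usage of GPU 0, in MiB, as reported by nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]
    )
    return int(out.decode().splitlines()[0])

done = threading.Event()
baseline = peak = vram_used_mib()

def poll() -> None:
    global peak
    while not done.is_set():
        peak = max(peak, vram_used_mib())
        done.wait(0.5)  # sample twice per second

threading.Thread(target=poll, daemon=True).start()

# Drive one generation so the model is loaded and actively inferring.
requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1:70b-instruct-q4_0",
          "prompt": "Explain a mutex in two sentences.", "stream": False},
    timeout=600,
)
done.set()
print(f"baseline: {baseline} MiB, peak during inference: {peak} MiB")
```

The same loop works on the 1050 by swapping in the family-agent tag.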
### Phase 2: Set Up Inference Servers
- Set up Ollama or vLLM for 4080 (TICKET-021)
- Set up llama.cpp or Ollama for 1050 (TICKET-022)
- Configure context windows (8K for both)
- Test concurrent request handling (see the sketch below)
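Concurrent handling can be exercised by firing the documented maximum of two requests in parallel and comparing wall times, as in this sketch:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests

def ask(prompt: str) -> float:
    """Send one chat request to the local Ollama server; return wall time in seconds."""
    start = time.perf_counter()
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "llama3.1:70b-instruct-q4_0",
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,
            "options": {"num_ctx": 8192},
        },
        timeout=600,
    )
    resp.raise_for_status()
    return time.perf_counter() - start

# Two workers match the 2-request ceiling documented for the 4080.
with ThreadPoolExecutor(max_workers=2) as pool:
    times = list(pool.map(ask, ["Summarize TCP in one line.", "Summarize UDP in one line."]))
print([f"{t:.1f}s" for t in times])
```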
### Phase 3: Integration
- Integrate with MCP server (TICKET-030)
- Test function calling end-to-end (see the sketch below)
- Optimize based on real-world performance
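Before the real MCP server is wired in, function calling can be smoke-tested by passing an OpenAI-style tool schema to Ollama's chat endpoint (supported in recent Ollama releases). In the sketch below, `get_weather` is a hypothetical stand-in for a tool the MCP server would eventually expose:

```python
import requests

# Hypothetical stand-in for an MCP-exposed tool (TICKET-030).
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1:70b-instruct-q4_0",
        "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
        "tools": tools,
        "stream": False,
    },
    timeout=600,
)
resp.raise_for_status()
# A passing run yields a structured tool call rather than plain text.
print(resp.json()["message"].get("tool_calls"))
```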
## Model Files Location

**Recommended Structure:**

```
models/
├── work-agent/
│   └── llama-3.1-70b-q4.gguf
├── family-agent/
│   └── phi-3-mini-3.8b-q4.gguf
└── backups/
```
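With the weights stored under this layout, each agent can be registered with Ollama by generating a small Modelfile and running `ollama create`. A sketch of that step, assuming the paths above and the 8K context limit:

```python
import subprocess
from pathlib import Path

# Agent names mapped to the local GGUF files from the layout above.
MODELS = {
    "work-agent": Path("models/work-agent/llama-3.1-70b-q4.gguf"),
    "family-agent": Path("models/family-agent/phi-3-mini-3.8b-q4.gguf"),
}

for name, gguf in MODELS.items():
    if not gguf.exists():
        raise FileNotFoundError(f"missing weights: {gguf}")
    # Minimal Modelfile: point at the local weights and pin the 8K context.
    modelfile = gguf.parent / "Modelfile"
    modelfile.write_text(f"FROM {gguf.resolve()}\nPARAMETER num_ctx 8192\n")
    subprocess.run(["ollama", "create", name, "-f", str(modelfile)], check=True)
```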
## Cost Analysis

Based on `docs/LLM_USAGE_AND_COSTS.md`:
- Work Agent (4080): ~$1.08-1.80/month (2 hours/day usage)
- Family Agent (1050): ~$1.44-2.40/month (always-on, 8 hours/day)
- Total: ~$2.52-4.20/month
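These figures follow from simple power-draw arithmetic (watts × hours × rate). The sketch below reproduces the ballpark, with the wattage and electricity-rate values being illustrative assumptions rather than numbers taken from `docs/LLM_USAGE_AND_COSTS.md`:

```python
def monthly_cost(watts: float, hours_per_day: float, usd_per_kwh: float) -> float:
    """Electricity cost per 30-day month: kW * hours * rate."""
    return watts / 1000 * hours_per_day * 30 * usd_per_kwh

# Assumed draws and rate -- illustrative only.
print(f"work (4080):   ${monthly_cost(300, 2, 0.10):.2f}")  # $1.80, top of the documented range
print(f"family (1050): ${monthly_cost(60, 8, 0.10):.2f}")   # $1.44, bottom of the documented range
```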
## Next Steps
- ✅ Model selection complete (TICKET-019, TICKET-020)
- Download selected models
- Set up inference servers (TICKET-021, TICKET-022)
- Test and benchmark on actual hardware
- Integrate with MCP (TICKET-030)
## References

- Model Survey: `docs/LLM_MODEL_SURVEY.md`
- Capacity Assessment: `docs/LLM_CAPACITY.md`
- Usage & Costs: `docs/LLM_USAGE_AND_COSTS.md`
**Last Updated:** 2024-01-XX
**Status:** Selection Finalized - Ready for Implementation (TICKET-021, TICKET-022)