# Final Model Selection
## Overview
This document finalizes the LLM model selections for the Atlas voice agent system based on the model survey (TICKET-017) and capacity assessment (TICKET-018).
## Work Agent Model Selection (RTX 4080)
### Selected Model: **Llama 3.1 70B Q4**
**Rationale:**
- Best overall balance of coding and research capabilities
- Excellent function calling support (required for MCP integration)
- Fits comfortably in 16GB VRAM (~14GB usage)
- Large context window (128K tokens, practical limit 8K)
- Well-documented and widely supported
- Strong performance for both coding and general research tasks
**Specifications:**
- **Model**: meta-llama/Meta-Llama-3.1-70B-Instruct
- **Quantization**: Q4 (4-bit)
- **VRAM Usage**: ~14GB
- **Context Window**: 8K tokens (practical limit)
- **Expected Latency**: ~200-300ms first token, ~3-4s for 100 tokens
- **Concurrency**: 2 requests maximum
**Alternative Model:**
- **DeepSeek Coder 33B Q4** - if coding is the primary focus:
  - Faster inference (~100-200ms first token)
  - Lower VRAM usage (~8GB)
  - Larger practical context (16K tokens)
  - Less capable for general research
**Model Source:**
- Hugging Face: `meta-llama/Meta-Llama-3.1-70B-Instruct`
- Quantized version: Use llama.cpp or AutoGPTQ for Q4 quantization
- Or use Ollama: `ollama pull llama3.1:70b-q4_0`
**Performance Characteristics:**
- Coding: ⭐⭐⭐⭐⭐ (Excellent)
- Research: ⭐⭐⭐⭐⭐ (Excellent)
- Function Calling: ✅ Native support
- Speed: Medium (acceptable for work tasks)
## Family Agent Model Selection (RTX 1050)
### Selected Model: **Phi-3 Mini 3.8B Q4**
**Rationale:**
- Excellent instruction following (critical for family agent)
- Very fast inference (<1s latency for interactive use)
- Low VRAM usage (~2.5GB, comfortable margin)
- Good function calling support
- Large context window (128K tokens, practical limit 8K)
- Microsoft-backed, well-maintained
**Specifications:**
- **Model**: microsoft/Phi-3-mini-128k-instruct (the 4k-instruct variant's 4K window is below the 8K practical limit targeted here)
- **Quantization**: Q4 (4-bit)
- **VRAM Usage**: ~2.5GB
- **Context Window**: 8K tokens (practical limit)
- **Expected Latency**: ~50-100ms first token, ~1-1.5s for 100 tokens
- **Concurrency**: 1-2 requests maximum
**Alternative Model:**
- **Qwen2.5 1.5B Q4** - if more VRAM headroom is needed:
  - Smaller VRAM footprint (~1.2GB)
  - Still fast inference
  - Slightly less capable than Phi-3 Mini
**Model Source:**
- Hugging Face: `microsoft/Phi-3-mini-128k-instruct`
- Quantized version: Use llama.cpp for Q4 quantization
- Or use Ollama: `ollama pull phi3:mini-q4_0`
**Performance Characteristics:**
- Instruction Following: ⭐⭐⭐⭐⭐ (Excellent)
- Function Calling: ✅ Native support
- Speed: Very Fast (<1s latency)
- Efficiency: High (low power consumption)
## Selection Summary
| Agent | Model | Size | Quantization | VRAM | Context | Latency |
|-------|-------|------|--------------|------|---------|---------|
| **Work** | Llama 3.1 70B | 70B | Q4 | ~14GB | 8K | ~3-4s |
| **Family** | Phi-3 Mini 3.8B | 3.8B | Q4 | ~2.5GB | 8K | ~1-1.5s |
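In code, this summary maps naturally to a small routing table for dispatching requests to the right agent. A sketch with placeholder hostnames and the Ollama tags suggested earlier (all hypothetical until the servers are up):

```python
# Hypothetical routing table derived from the selection summary.
# Hostnames and model tags are placeholders for the local inference servers.
MODEL_ROUTES = {
    "work": {
        "model": "llama3.1:70b-q4_0",
        "endpoint": "http://4080-host:11434",
        "max_ctx": 8192,
    },
    "family": {
        "model": "phi3:mini-q4_0",
        "endpoint": "http://1050-host:11434",
        "max_ctx": 8192,
    },
}

def route(agent: str) -> dict:
    """Return the model/endpoint configuration for a given agent."""
    return MODEL_ROUTES[agent]
```

Keeping this mapping in one place means a model swap (e.g. to the DeepSeek alternative) touches a single entry.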
## Implementation Plan
### Phase 1: Download and Test
1. Download Llama 3.1 70B Q4 quantized model
2. Download Phi-3 Mini 3.8B Q4 quantized model
3. Test on actual hardware (4080 and 1050)
4. Benchmark actual VRAM usage and latency
5. Verify function calling support
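Step 4 above can be scripted. A minimal sketch for measuring first-token and total latency, assuming a streaming client callable that yields tokens (swap in a real Ollama or vLLM client when benchmarking on hardware):

```python
import time
from typing import Callable, Iterator

def measure_latency(stream: Callable[[str], Iterator[str]], prompt: str) -> dict:
    """Time first token and total generation for a streaming LLM client.

    `stream` is any callable that takes a prompt and yields tokens; this
    decouples the benchmark from a specific inference server.
    """
    start = time.perf_counter()
    first_token_s = None
    n_tokens = 0
    for _ in stream(prompt):
        if first_token_s is None:
            first_token_s = time.perf_counter() - start  # time to first token
        n_tokens += 1
    return {
        "first_token_s": first_token_s,
        "total_s": time.perf_counter() - start,
        "tokens": n_tokens,
    }
```

Comparing `first_token_s` against the ~200-300ms (work) and ~50-100ms (family) targets gives a quick pass/fail on each box.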
### Phase 2: Setup Inference Servers
1. Set up Ollama or vLLM for 4080 (TICKET-021)
2. Set up llama.cpp or Ollama for 1050 (TICKET-022)
3. Configure context windows (8K for both)
4. Test concurrent request handling
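One way to pin the 8K window in step 3, if Ollama is the chosen server, is a Modelfile per agent (`num_ctx` sets the context length Ollama allocates; the base tag is the one pulled earlier and may need adjusting):

```
# Modelfile for the work agent (tag is an assumption; match the pulled model)
FROM llama3.1:70b-q4_0
PARAMETER num_ctx 8192
```

Then register it with `ollama create work-agent -f Modelfile` and repeat with the Phi-3 tag for the family agent.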
### Phase 3: Integration
1. Integrate with MCP server (TICKET-030)
2. Test function calling end-to-end
3. Optimize based on real-world performance
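The request shape for step 2 can be exercised before the MCP server is wired up. A sketch of an Ollama-style `/api/chat` payload carrying one tool definition (the schema follows the OpenAI function format Ollama accepts; `get_weather` is purely illustrative):

```python
import json

def build_tool_call_request(model: str, user_msg: str) -> str:
    """Build a chat request JSON body with one illustrative tool."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "stream": False,
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool, for testing only
                "description": "Return current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
    }
    return json.dumps(payload)
```

A model with native function calling should respond with a `tool_calls` entry naming `get_weather` rather than free text; that is the end-to-end behavior to verify.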
## Model Files Location
**Recommended Structure:**
```
models/
├── work-agent/
│   └── llama-3.1-70b-q4.gguf
├── family-agent/
│   └── phi-3-mini-3.8b-q4.gguf
└── backups/
```
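The layout above can be created in one step. A small helper (directory names match the tree; the root path is up to the deployment):

```python
from pathlib import Path

def create_model_layout(root: str) -> None:
    """Create the recommended models/ directory layout under `root`."""
    base = Path(root)
    for sub in ("work-agent", "family-agent", "backups"):
        # exist_ok makes this safe to re-run on an existing tree
        (base / sub).mkdir(parents=True, exist_ok=True)
```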
## Cost Analysis
Based on `docs/LLM_USAGE_AND_COSTS.md`:
- **Work Agent (4080)**: ~$1.08-1.80/month (2 hours/day usage)
- **Family Agent (1050)**: ~$1.44-2.40/month (always-on, 8 hours/day)
- **Total**: ~$2.52-4.20/month
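These figures follow from simple watt-hour arithmetic. A sketch assuming a $0.15/kWh electricity rate and average draws of roughly 120-200W for the 4080 and ~40W for the 1050 (assumptions chosen to reproduce the quoted range, not measurements):

```python
def monthly_cost(avg_watts: float, hours_per_day: float,
                 rate_per_kwh: float = 0.15, days: int = 30) -> float:
    """Monthly electricity cost: kWh consumed times the per-kWh rate."""
    kwh = avg_watts * hours_per_day * days / 1000
    return kwh * rate_per_kwh

# monthly_cost(200, 2)  -> $1.80  (work agent, upper bound)
# monthly_cost(120, 2)  -> $1.08  (work agent, lower bound)
# monthly_cost(40, 8)   -> $1.44  (family agent, lower bound)
```

Re-running with measured draws from `nvidia-smi` during benchmarking will tighten these estimates.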
## Next Steps
1. Model selection complete (TICKET-019, TICKET-020)
2. Download selected models
3. Set up inference servers (TICKET-021, TICKET-022)
4. Test and benchmark on actual hardware
5. Integrate with MCP (TICKET-030)
## References
- Model Survey: `docs/LLM_MODEL_SURVEY.md`
- Capacity Assessment: `docs/LLM_CAPACITY.md`
- Usage & Costs: `docs/LLM_USAGE_AND_COSTS.md`
---
**Last Updated**: 2024-01-XX
**Status**: Selection Finalized - Ready for Implementation (TICKET-021, TICKET-022)