# Final Model Selection
## Overview

This document finalizes the LLM model selections for the Atlas voice agent system based on the model survey (TICKET-017) and capacity assessment (TICKET-018).

## Work Agent Model Selection (RTX 4080)

### Selected Model: **Llama 3.1 70B Q4**

**Rationale:**

- Best overall balance of coding and research capabilities
- Excellent function calling support (required for MCP integration)
- Fits comfortably in 16GB VRAM (~14GB usage)
- Large context window (128K tokens, 8K practical limit)
- Well-documented and widely supported
- Strong performance for both coding and general research tasks
**Specifications:**

- **Model**: meta-llama/Meta-Llama-3.1-70B-Instruct
- **Quantization**: Q4 (4-bit)
- **VRAM Usage**: ~14GB
- **Context Window**: 8K tokens (practical limit)
- **Expected Latency**: ~200-300ms to first token, ~3-4s for 100 tokens
- **Concurrency**: 2 requests maximum
**Alternative Model:**

- **DeepSeek Coder 33B Q4** - if coding is the primary focus
  - Faster inference (~100-200ms to first token)
  - Lower VRAM usage (~8GB)
  - Larger practical context (16K tokens)
  - Less capable for general research
**Model Source:**

- Hugging Face: `meta-llama/Meta-Llama-3.1-70B-Instruct`
- Quantized version: use llama.cpp or AutoGPTQ for Q4 quantization
- Or use Ollama: `ollama pull llama3.1:70b-q4_0`
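Once pulled, the model is served over Ollama's local HTTP API. A minimal sketch of building a request for the `/api/generate` endpoint (the payload is only constructed here, not sent; the default port 11434 is assumed):

```python
import json

def build_generate_request(model: str, prompt: str, num_ctx: int = 8192) -> dict:
    """Build a JSON payload for Ollama's /api/generate endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # Cap the context at the 8K practical limit from the specifications above.
        "options": {"num_ctx": num_ctx},
    }

payload = build_generate_request("llama3.1:70b-q4_0", "Summarize TICKET-017 in one line.")
print(json.dumps(payload, indent=2))
# Send with: POST http://localhost:11434/api/generate
```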
**Performance Characteristics:**

- Coding: ⭐⭐⭐⭐⭐ (Excellent)
- Research: ⭐⭐⭐⭐⭐ (Excellent)
- Function Calling: ✅ Native support
- Speed: Medium (acceptable for work tasks)
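Native function calling means MCP tool schemas can be passed straight through to the model. A hedged sketch of a chat request advertising one tool, using Ollama's `/api/chat` tools format (the `search_notes` tool name and schema are illustrative, not from this document):

```python
def build_chat_with_tools(model: str, user_msg: str) -> dict:
    """Build an Ollama /api/chat payload advertising one MCP-backed tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "search_notes",  # hypothetical MCP tool
                "description": "Search the user's notes",
                "parameters": {
                    "type": "object",
                    "properties": {"query": {"type": "string"}},
                    "required": ["query"],
                },
            },
        }],
        "stream": False,
    }

req = build_chat_with_tools("llama3.1:70b-q4_0", "Find my notes on TICKET-018")
print(req["tools"][0]["function"]["name"])
```

If the model decides to call the tool, the response carries a `tool_calls` entry that the MCP server (TICKET-030) would execute and feed back as a follow-up message.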
## Family Agent Model Selection (GTX 1050)

### Selected Model: **Phi-3 Mini 3.8B Q4**

**Rationale:**

- Excellent instruction following (critical for the family agent)
- Very fast inference (<1s latency for interactive use)
- Low VRAM usage (~2.5GB, comfortable margin)
- Good function calling support
- Large context window (128K tokens in the 128k-instruct variant, 8K practical limit)
- Microsoft-backed and well-maintained
**Specifications:**

- **Model**: microsoft/Phi-3-mini-128k-instruct
- **Quantization**: Q4 (4-bit)
- **VRAM Usage**: ~2.5GB
- **Context Window**: 8K tokens (practical limit)
- **Expected Latency**: ~50-100ms to first token, ~1-1.5s for 100 tokens
- **Concurrency**: 1-2 requests maximum
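The ~2.5GB figure is consistent with a back-of-the-envelope Q4 estimate: roughly 0.5 bytes per parameter for the weights plus runtime overhead for the KV cache and buffers (the 0.5GB overhead figure is an assumption, not from this document):

```python
def estimate_q4_vram_gb(params_billion: float, overhead_gb: float = 0.5) -> float:
    """Rough VRAM estimate for a 4-bit quantized model: ~0.5 bytes/param + overhead."""
    weights_gb = params_billion * 1e9 * 0.5 / 1e9  # 4 bits = 0.5 bytes per parameter
    return weights_gb + overhead_gb

print(round(estimate_q4_vram_gb(3.8), 1))  # prints 2.4, consistent with the ~2.5GB above
```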
**Alternative Model:**

- **Qwen2.5 1.5B Q4** - if more VRAM headroom is needed
  - Smaller VRAM footprint (~1.2GB)
  - Still fast inference
  - Slightly less capable than Phi-3 Mini
**Model Source:**

- Hugging Face: `microsoft/Phi-3-mini-128k-instruct`
- Quantized version: use llama.cpp for Q4 quantization
- Or use Ollama: `ollama pull phi3:mini-q4_0`
**Performance Characteristics:**

- Instruction Following: ⭐⭐⭐⭐⭐ (Excellent)
- Function Calling: ✅ Native support
- Speed: Very fast (<1s latency)
- Efficiency: High (low power consumption)
## Selection Summary

| Agent | Model | Size | Quantization | VRAM | Context | Latency (100 tokens) |
|-------|-------|------|--------------|------|---------|----------------------|
| **Work** | Llama 3.1 70B | 70B | Q4 | ~14GB | 8K | ~3-4s |
| **Family** | Phi-3 Mini 3.8B | 3.8B | Q4 | ~2.5GB | 8K | ~1-1.5s |
## Implementation Plan

### Phase 1: Download and Test

1. Download the Llama 3.1 70B Q4 quantized model
2. Download the Phi-3 Mini 3.8B Q4 quantized model
3. Test on the actual hardware (4080 and 1050)
4. Benchmark actual VRAM usage and latency
5. Verify function calling support
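The latency benchmark in step 4 can be a thin timing wrapper around whichever client call each inference server ends up exposing. A sketch (the lambda below is a stand-in for a real Ollama/vLLM client call, not an actual client):

```python
import time
from typing import Callable

def benchmark_latency(generate: Callable[[str], str], prompt: str, runs: int = 5) -> dict:
    """Time repeated calls to a text-generation callable; report min/mean seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt)
        timings.append(time.perf_counter() - start)
    return {"min_s": min(timings), "mean_s": sum(timings) / len(timings)}

# Stand-in generator for illustration; swap in a real client call.
stats = benchmark_latency(lambda p: p.upper(), "warmup prompt")
print(stats)
```

Note this measures whole-call time; measuring time-to-first-token separately requires a streaming client.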
### Phase 2: Set Up Inference Servers

1. Set up Ollama or vLLM on the 4080 (TICKET-021)
2. Set up llama.cpp or Ollama on the 1050 (TICKET-022)
3. Configure context windows (8K for both)
4. Test concurrent request handling
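With Ollama, the 8K context cap from step 3 can be baked into a derived model via a Modelfile (`num_ctx` is Ollama's context-length parameter; the base tag is the one pulled earlier):

```
# Modelfile: derive an 8K-context variant of the work-agent model
FROM llama3.1:70b-q4_0
PARAMETER num_ctx 8192
```

Build it with `ollama create work-agent -f Modelfile`; the same pattern applies to the family-agent model.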
### Phase 3: Integration

1. Integrate with the MCP server (TICKET-030)
2. Test function calling end-to-end
3. Optimize based on real-world performance
## Model Files Location

**Recommended Structure:**

```
models/
├── work-agent/
│   └── llama-3.1-70b-q4.gguf
├── family-agent/
│   └── phi-3-mini-3.8b-q4.gguf
└── backups/
```
## Cost Analysis

Based on `docs/LLM_USAGE_AND_COSTS.md`:

- **Work Agent (4080)**: ~$1.08-1.80/month (2 hours/day of usage)
- **Family Agent (1050)**: ~$1.44-2.40/month (always-on, 8 hours/day of active use)
- **Total**: ~$2.52-4.20/month
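These ranges follow from a simple electricity model. The wattages (~300W under load for the 4080 box, ~100W for the 1050 box) and the $0.06-0.10/kWh rate below are assumptions chosen to reproduce the ranges, not figures stated in this document:

```python
def monthly_cost_usd(watts: float, hours_per_day: float, rate_per_kwh: float,
                     days: int = 30) -> float:
    """Electricity cost per month: kW x hours/day x days x $/kWh."""
    return watts / 1000 * hours_per_day * days * rate_per_kwh

work = [round(monthly_cost_usd(300, 2, r), 2) for r in (0.06, 0.10)]
family = [round(monthly_cost_usd(100, 8, r), 2) for r in (0.06, 0.10)]
print(work, family)  # prints [1.08, 1.8] [1.44, 2.4]
```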
## Next Steps

1. ✅ Model selection complete (TICKET-019, TICKET-020)
2. Download the selected models
3. Set up inference servers (TICKET-021, TICKET-022)
4. Test and benchmark on actual hardware
5. Integrate with MCP (TICKET-030)
## References

- Model Survey: `docs/LLM_MODEL_SURVEY.md`
- Capacity Assessment: `docs/LLM_CAPACITY.md`
- Usage & Costs: `docs/LLM_USAGE_AND_COSTS.md`
---

**Last Updated**: 2024-01-XX

**Status**: Selection Finalized - Ready for Implementation (TICKET-021, TICKET-022)