# Final Model Selection
## Overview

This document finalizes the LLM model selections for the Atlas voice agent system based on the model survey (TICKET-017) and capacity assessment (TICKET-018).

## Work Agent Model Selection (RTX 4080)

### Selected Model: **Llama 3.1 70B Q4**

**Rationale:**

- Best overall balance of coding and research capabilities
- Excellent function calling support (required for MCP integration)
- Fits comfortably in 16GB VRAM (~14GB usage)
- Large context window (128K tokens, 8K practical limit)
- Well-documented and widely supported
- Strong performance for both coding and general research tasks
**Specifications:**

- **Model**: meta-llama/Meta-Llama-3.1-70B-Instruct
- **Quantization**: Q4 (4-bit)
- **VRAM Usage**: ~14GB
- **Context Window**: 8K tokens (practical limit)
- **Expected Latency**: ~200-300ms to first token, ~3-4s for 100 tokens
- **Concurrency**: 2 requests maximum
**Alternative Model:**

- **DeepSeek Coder 33B Q4** - if coding is the primary focus
  - Faster inference (~100-200ms to first token)
  - Lower VRAM usage (~8GB)
  - Larger practical context (16K tokens)
  - Less capable for general research
**Model Source:**

- Hugging Face: `meta-llama/Meta-Llama-3.1-70B-Instruct`
- Quantized version: use llama.cpp or AutoGPTQ for Q4 quantization
- Or use Ollama: `ollama pull llama3.1:70b-q4_0`
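Once pulled, the model is served over Ollama's local HTTP API. A minimal sketch of building a request for the `/api/generate` endpoint (the payload is only constructed here, not sent; the default port 11434 is assumed):

```python
import json

def build_generate_request(model: str, prompt: str, num_ctx: int = 8192) -> dict:
    """Build a JSON payload for Ollama's /api/generate endpoint."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # Cap the context at the 8K practical limit from the specifications above.
        "options": {"num_ctx": num_ctx},
    }

payload = build_generate_request("llama3.1:70b-q4_0", "Summarize TICKET-017 in one line.")
print(json.dumps(payload, indent=2))
# Send with: POST http://localhost:11434/api/generate
```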
**Performance Characteristics:**

- Coding: ⭐⭐⭐⭐⭐ (Excellent)
- Research: ⭐⭐⭐⭐⭐ (Excellent)
- Function Calling: ✅ Native support
- Speed: Medium (acceptable for work tasks)
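Native function calling means MCP tool schemas can be passed straight through to the model. A hedged sketch of a chat request advertising one tool, using Ollama's `/api/chat` tools format (the `search_notes` tool name and schema are illustrative, not from this document):

```python
def build_chat_with_tools(model: str, user_msg: str) -> dict:
    """Build an Ollama /api/chat payload advertising one MCP-backed tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "search_notes",  # hypothetical MCP tool
                "description": "Search the user's notes",
                "parameters": {
                    "type": "object",
                    "properties": {"query": {"type": "string"}},
                    "required": ["query"],
                },
            },
        }],
        "stream": False,
    }

req = build_chat_with_tools("llama3.1:70b-q4_0", "Find my notes on TICKET-018")
print(req["tools"][0]["function"]["name"])
```

If the model decides to call the tool, the response carries a `tool_calls` entry that the MCP server (TICKET-030) would execute and feed back as a follow-up message.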
## Family Agent Model Selection (GTX 1050)

### Selected Model: **Phi-3 Mini 3.8B Q4**

**Rationale:**

- Excellent instruction following (critical for the family agent)
- Very fast inference (<1s latency for interactive use)
- Low VRAM usage (~2.5GB, comfortable margin)
- Good function calling support
- Large context window (128K tokens in the 128k-instruct variant, 8K practical limit)
- Microsoft-backed and well-maintained
**Specifications:**

- **Model**: microsoft/Phi-3-mini-128k-instruct
- **Quantization**: Q4 (4-bit)
- **VRAM Usage**: ~2.5GB
- **Context Window**: 8K tokens (practical limit)
- **Expected Latency**: ~50-100ms to first token, ~1-1.5s for 100 tokens
- **Concurrency**: 1-2 requests maximum
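The ~2.5GB figure is consistent with a back-of-the-envelope Q4 estimate: roughly 0.5 bytes per parameter for the weights plus runtime overhead for the KV cache and buffers (the 0.5GB overhead figure is an assumption, not from this document):

```python
def estimate_q4_vram_gb(params_billion: float, overhead_gb: float = 0.5) -> float:
    """Rough VRAM estimate for a 4-bit quantized model: ~0.5 bytes/param + overhead."""
    weights_gb = params_billion * 1e9 * 0.5 / 1e9  # 4 bits = 0.5 bytes per parameter
    return weights_gb + overhead_gb

print(round(estimate_q4_vram_gb(3.8), 1))  # prints 2.4, consistent with the ~2.5GB above
```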
**Alternative Model:**

- **Qwen2.5 1.5B Q4** - if more VRAM headroom is needed
  - Smaller VRAM footprint (~1.2GB)
  - Still fast inference
  - Slightly less capable than Phi-3 Mini
**Model Source:**

- Hugging Face: `microsoft/Phi-3-mini-128k-instruct`
- Quantized version: use llama.cpp for Q4 quantization
- Or use Ollama: `ollama pull phi3:mini-q4_0`
**Performance Characteristics:**

- Instruction Following: ⭐⭐⭐⭐⭐ (Excellent)
- Function Calling: ✅ Native support
- Speed: Very fast (<1s latency)
- Efficiency: High (low power consumption)
## Selection Summary

| Agent | Model | Size | Quantization | VRAM | Context | Latency (100 tokens) |
|-------|-------|------|--------------|------|---------|----------------------|
| **Work** | Llama 3.1 70B | 70B | Q4 | ~14GB | 8K | ~3-4s |
| **Family** | Phi-3 Mini 3.8B | 3.8B | Q4 | ~2.5GB | 8K | ~1-1.5s |
## Implementation Plan

### Phase 1: Download and Test

1. Download the Llama 3.1 70B Q4 quantized model
2. Download the Phi-3 Mini 3.8B Q4 quantized model
3. Test on the actual hardware (4080 and 1050)
4. Benchmark actual VRAM usage and latency
5. Verify function calling support
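The latency benchmark in step 4 can be a thin timing wrapper around whichever client call each inference server ends up exposing. A sketch (the lambda below is a stand-in for a real Ollama/vLLM client call, not an actual client):

```python
import time
from typing import Callable

def benchmark_latency(generate: Callable[[str], str], prompt: str, runs: int = 5) -> dict:
    """Time repeated calls to a text-generation callable; report min/mean seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt)
        timings.append(time.perf_counter() - start)
    return {"min_s": min(timings), "mean_s": sum(timings) / len(timings)}

# Stand-in generator for illustration; swap in a real client call.
stats = benchmark_latency(lambda p: p.upper(), "warmup prompt")
print(stats)
```

Note this measures whole-call time; measuring time-to-first-token separately requires a streaming client.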
### Phase 2: Set Up Inference Servers

1. Set up Ollama or vLLM on the 4080 (TICKET-021)
2. Set up llama.cpp or Ollama on the 1050 (TICKET-022)
3. Configure context windows (8K for both)
4. Test concurrent request handling
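With Ollama, the 8K context cap from step 3 can be baked into a derived model via a Modelfile (`num_ctx` is Ollama's context-length parameter; the base tag is the one pulled earlier):

```
# Modelfile: derive an 8K-context variant of the work-agent model
FROM llama3.1:70b-q4_0
PARAMETER num_ctx 8192
```

Build it with `ollama create work-agent -f Modelfile`; the same pattern applies to the family-agent model.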
### Phase 3: Integration

1. Integrate with the MCP server (TICKET-030)
2. Test function calling end-to-end
3. Optimize based on real-world performance
## Model Files Location

**Recommended Structure:**

```
models/
├── work-agent/
│   └── llama-3.1-70b-q4.gguf
├── family-agent/
│   └── phi-3-mini-3.8b-q4.gguf
└── backups/
```
## Cost Analysis

Based on `docs/LLM_USAGE_AND_COSTS.md`:

- **Work Agent (4080)**: ~$1.08-1.80/month (2 hours/day of usage)
- **Family Agent (1050)**: ~$1.44-2.40/month (always-on, 8 hours/day of active use)
- **Total**: ~$2.52-4.20/month
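These ranges follow from a simple electricity model. The wattages (~300W under load for the 4080 box, ~100W for the 1050 box) and the $0.06-0.10/kWh rate below are assumptions chosen to reproduce the ranges, not figures stated in this document:

```python
def monthly_cost_usd(watts: float, hours_per_day: float, rate_per_kwh: float,
                     days: int = 30) -> float:
    """Electricity cost per month: kW x hours/day x days x $/kWh."""
    return watts / 1000 * hours_per_day * days * rate_per_kwh

work = [round(monthly_cost_usd(300, 2, r), 2) for r in (0.06, 0.10)]
family = [round(monthly_cost_usd(100, 8, r), 2) for r in (0.06, 0.10)]
print(work, family)  # prints [1.08, 1.8] [1.44, 2.4]
```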
## Next Steps

1. ✅ Model selection complete (TICKET-019, TICKET-020)
2. Download the selected models
3. Set up inference servers (TICKET-021, TICKET-022)
4. Test and benchmark on actual hardware
5. Integrate with MCP (TICKET-030)
## References

- Model Survey: `docs/LLM_MODEL_SURVEY.md`
- Capacity Assessment: `docs/LLM_CAPACITY.md`
- Usage & Costs: `docs/LLM_USAGE_AND_COSTS.md`
---

**Last Updated**: 2024-01-XX

**Status**: Selection Finalized - Ready for Implementation (TICKET-021, TICKET-022)