# LLM Model Survey

## Overview

This document surveys and evaluates open-weight LLMs for the Atlas voice agent system, with separate recommendations for the work agent (RTX 4080) and the family agent (RTX 1050).

**Hardware Constraints:**

- **RTX 4080**: 16GB VRAM - work agent, high-capability tasks
- **RTX 1050**: 4GB VRAM - family agent, always-on, low-latency

## Evaluation Criteria

### Work Agent (RTX 4080) Requirements

- **Coding capabilities**: Code generation, debugging, code review
- **Research capabilities**: Analysis, reasoning, documentation
- **Function calling**: Must support tool/function calling for MCP integration
- **Context window**: 8K-16K tokens minimum
- **VRAM fit**: Must fit in 16GB with quantization
- **Performance**: Reasonable latency (< 5s for typical responses)

### Family Agent (RTX 1050) Requirements

- **Instruction following**: Good at following conversational instructions
- **Function calling**: Must support tool/function calling
- **Low latency**: < 1s response time for interactive use
- **VRAM fit**: Must fit in 4GB with quantization
- **Efficiency**: Low power consumption for always-on operation
- **Context window**: 4K-8K tokens sufficient

## Model Comparison Matrix

### RTX 4080 Candidates (Work Agent)

| Model | Size | Quantization | VRAM Usage | Coding | Research | Function Call | Context | Speed | Recommendation |
|-------|------|--------------|------------|--------|----------|---------------|---------|-------|----------------|
| **Llama 3.1 70B** | 70B | Q4 | ~14GB | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ✅ | 128K | Medium | **⭐ Top Choice** |
| **Llama 3.1 70B** | 70B | Q5 | ~16GB | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ✅ | 128K | Medium | Good quality |
| **DeepSeek Coder 33B** | 33B | Q4 | ~8GB | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ✅ | 16K | Fast | **Best for coding** |
| **Qwen 2.5 72B** | 72B | Q4 | ~14GB | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ✅ | 32K | Medium | Strong alternative |
| **Mistral Large 2 67B** | 67B | Q4 | ~13GB | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ✅ | 128K | Medium | Good option |
| **Llama 3.1 8B** | 8B | Q4 | ~5GB | ⭐⭐⭐ | ⭐⭐⭐ | ✅ | 128K | Very Fast | Too small for work |

**Recommendation for 4080:**

1. **Primary**: **Llama 3.1 70B Q4** - Best overall balance
2. **Alternative**: **DeepSeek Coder 33B Q4** - If coding is the primary focus
3. **Fallback**: **Qwen 2.5 72B Q4** - Strong alternative

### RTX 1050 Candidates (Family Agent)

| Model | Size | Quantization | VRAM Usage | Instruction | Function Call | Context | Speed | Latency | Recommendation |
|-------|------|--------------|------------|-------------|---------------|---------|-------|---------|----------------|
| **Phi-3 Mini 3.8B** | 3.8B | Q4 | ~2.5GB | ⭐⭐⭐⭐⭐ | ✅ | 128K | Very Fast | <1s | **⭐ Top Choice** |
| **TinyLlama 1.1B** | 1.1B | Q4 | ~0.8GB | ⭐⭐⭐ | ✅ | 2K | Extremely Fast | <0.5s | Lightweight option |
| **Gemma 2B** | 2B | Q4 | ~1.5GB | ⭐⭐⭐⭐ | ✅ | 8K | Very Fast | <0.8s | Good alternative |
| **Qwen2.5 1.5B** | 1.5B | Q4 | ~1.2GB | ⭐⭐⭐⭐ | ✅ | 32K | Very Fast | <0.7s | Strong option |
| **Phi-2 2.7B** | 2.7B | Q4 | ~1.8GB | ⭐⭐⭐⭐ | ✅ | 2K | Fast | <1s | Older, less capable |
| **Llama 3.2 3B** | 3B | Q4 | ~2GB | ⭐⭐⭐⭐ | ✅ | 128K | Fast | <1s | Good but larger |

**Recommendation for 1050:**

1. **Primary**: **Phi-3 Mini 3.8B Q4** - Best instruction following, good speed
2. **Alternative**: **Qwen2.5 1.5B Q4** - Smaller, still capable
3. **Fallback**: **TinyLlama 1.1B Q4** - If VRAM is tight

## Detailed Model Analysis

### Work Agent Models

#### Llama 3.1 70B Q4/Q5

**Pros:**

- Excellent coding and research capabilities
- Large context window (128K tokens)
- Strong function calling support
- Well-documented and widely used
- Good balance of quality and speed

**Cons:**

- Q5 uses the full 16GB (tight fit)
- Slower than smaller models
- Higher power consumption

**VRAM Usage:**

- Q4: ~14GB (comfortable margin)
- Q5: ~16GB (tight, but better quality)

**Best For:** General work tasks, coding, research, complex reasoning

#### DeepSeek Coder 33B Q4

**Pros:**

- Excellent coding capabilities (specialized)
- Faster than 70B models
- Lower VRAM usage (~8GB)
- Good function calling support
- Strong for code generation and debugging

**Cons:**

- Less capable for general research/analysis
- Smaller context window (16K vs. 128K)
- Less general-purpose than Llama 3.1

**Best For:** Coding-focused work, code generation, debugging

#### Qwen 2.5 72B Q4

**Pros:**

- Strong multilingual support
- Good coding and research capabilities
- Large context (32K tokens)
- Competitive with Llama 3.1

**Cons:**

- Less community support than Llama
- Slightly less polished tool calling

**Best For:** Multilingual work, research, general tasks

### Family Agent Models

#### Phi-3 Mini 3.8B Q4

**Pros:**

- Excellent instruction following
- Very fast inference (<1s)
- Low VRAM usage (~2.5GB)
- Good function calling support
- Large context (128K tokens)
- Microsoft-backed, well-maintained

**Cons:**

- Slightly larger than alternatives
- May be overkill for simple tasks

**Best For:** Family conversations, task management, general Q&A

#### Qwen2.5 1.5B Q4

**Pros:**

- Very small VRAM footprint (~1.2GB)
- Fast inference
- Good instruction following
- Large context (32K tokens)
- Efficient for always-on use

**Cons:**

- Less capable than Phi-3 Mini
- May struggle with complex requests

**Best For:** Lightweight always-on agent, simple tasks

#### TinyLlama 1.1B Q4

**Pros:**

- Extremely small (~0.8GB VRAM)
- Very fast inference
- Minimal resource usage

**Cons:**

- Limited capabilities
- Small context window (2K tokens)
- May not handle complex conversations well

**Best For:** Very resource-constrained scenarios

## Quantization Comparison

All VRAM figures below are relative to an 8-bit (Q8) baseline.

### Q4 (4-bit)

- **Quality**: ~95-98% of full precision
- **VRAM**: ~50% of the 8-bit size
- **Speed**: Fast
- **Recommendation**: ✅ **Use for both agents**

### Q5 (5-bit)

- **Quality**: ~98-99% of full precision
- **VRAM**: ~62% of the 8-bit size
- **Speed**: Slightly slower than Q4
- **Recommendation**: Consider for the 4080 if quality is critical

### Q6 (6-bit)

- **Quality**: ~99% of full precision
- **VRAM**: ~75% of the 8-bit size
- **Speed**: Slower
- **Recommendation**: Not recommended (marginal quality gain)

### Q8 (8-bit)

- **Quality**: Near full precision
- **VRAM**: Baseline (100%)
- **Speed**: Slowest
- **Recommendation**: Not recommended (doesn't fit within the VRAM constraints)

## Function Calling Support

All recommended models support function calling:

- **Llama 3.1**: Native function calling via the `tools` parameter
- **DeepSeek Coder**: Function calling support
- **Qwen 2.5**: Function calling support
- **Phi-3 Mini**: Function calling support
- **TinyLlama**: Basic function calling (may need fine-tuning)

## Performance Benchmarks (Estimated)

### RTX 4080 (16GB VRAM)

| Model | Tokens/sec | Latency (first token) | Latency (100 tokens) |
|-------|------------|----------------------|----------------------|
| Llama 3.1 70B Q4 | ~25-35 | ~200-300ms | ~3-4s |
| Llama 3.1 70B Q5 | ~20-30 | ~250-350ms | ~3.5-5s |
| DeepSeek Coder 33B Q4 | ~40-60 | ~100-200ms | ~2-3s |
| Qwen 2.5 72B Q4 | ~25-35 | ~200-300ms | ~3-4s |

### RTX 1050 (4GB VRAM)

| Model | Tokens/sec | Latency (first token) | Latency (100 tokens) |
|-------|------------|----------------------|----------------------|
| Phi-3 Mini 3.8B Q4 | ~80-120 | ~50-100ms | ~1-1.5s |
| Qwen2.5 1.5B Q4 | ~100-150 | ~30-60ms | ~0.7-1s |
| TinyLlama 1.1B Q4 | ~150-200 | ~20-40ms | ~0.5-0.7s |

## Final Recommendations

### Work Agent (RTX 4080)

**Primary Choice: Llama 3.1 70B Q4**

- Best overall capabilities
- Fits comfortably in 16GB VRAM
- Excellent for coding, research, and general work tasks
- Strong function calling support
- Large context window (128K)

**Alternative: DeepSeek Coder 33B Q4**

- If coding is the primary use case
- Faster inference
- Lower VRAM usage allows for more headroom

### Family Agent (RTX 1050)

**Primary Choice: Phi-3 Mini 3.8B Q4**

- Excellent instruction following
- Fast inference (<1s latency)
- Low VRAM usage (~2.5GB)
- Good function calling support
- Large context window (128K)

**Alternative: Qwen2.5 1.5B Q4**

- If VRAM is very tight
- Still capable for simple tasks
- Very fast inference

## Implementation Notes

### Model Sources

- **Hugging Face**: Primary source for all models
- **Ollama**: Pre-configured models (easier setup)
- **Direct download**: For custom quantization

### Inference Servers

- **Ollama**: Easiest setup, good for prototyping
- **vLLM**: Best throughput, batching support
- **llama.cpp**: Lightweight, efficient, good for the 1050

### Quantization Tools

- **llama.cpp**: Built-in quantization
- **AutoGPTQ**: For GPTQ quantization
- **AWQ**: Alternative quantization method

## Next Steps

1. ✅ Complete this survey (TICKET-017)
2. Complete capacity assessment (TICKET-018)
3. Finalize model selection (TICKET-019, TICKET-020)
4. Download and test selected models
5. Benchmark on actual hardware
6. Set up inference servers (TICKET-021, TICKET-022)

## References

- [Llama 3.1](https://llama.meta.com/llama-3-1/)
- [DeepSeek Coder](https://github.com/deepseek-ai/DeepSeek-Coder)
- [Phi-3](https://www.microsoft.com/en-us/research/blog/phi-3/)
- [Qwen 2.5](https://qwenlm.github.io/blog/qwen2.5/)
- [Model Quantization Guide](https://github.com/ggerganov/llama.cpp)

---

**Last Updated**: 2024-01-XX
**Status**: Survey Complete - Ready for TICKET-018 (Capacity Assessment)
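
The function-calling support discussed above is exercised through an OpenAI-compatible `tools` array when the models are served via Ollama or vLLM. A minimal sketch of constructing such a request payload, assuming a hypothetical `get_weather` tool and an illustrative Ollama model tag (no network call is made here):

```python
import json

# Sketch of an OpenAI-compatible chat request carrying a `tools` array,
# as accepted by Ollama's and vLLM's OpenAI-compatible endpoints.
# The `get_weather` tool and the "llama3.1:70b" tag are illustrative assumptions.

def build_tool_request(model: str, user_message: str) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "description": "Look up the current weather for a city",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }
        ],
    }

payload = build_tool_request("llama3.1:70b", "What's the weather in Oslo?")
print(json.dumps(payload, indent=2))
```

The same payload shape works for both agents; only the `model` field changes, which keeps the MCP integration layer identical across the two GPUs.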
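
The "Latency (100 tokens)" columns in the benchmark tables follow from a simple model: total time ≈ time-to-first-token + tokens / decode throughput. A small sketch of that arithmetic, using the estimated (not measured) figures from the tables:

```python
# End-to-end latency estimate for a response of n_tokens tokens:
# time-to-first-token plus decode time at the estimated throughput.
# Inputs are the survey's estimates, not measurements.

def response_latency_s(ttft_ms: float, tokens_per_s: float, n_tokens: int = 100) -> float:
    return round(ttft_ms / 1000 + n_tokens / tokens_per_s, 2)

# Llama 3.1 70B Q4: ~30 tok/s, ~250ms TTFT -> within the ~3-4s estimate
print(response_latency_s(250, 30))   # 3.58
# Phi-3 Mini 3.8B Q4: ~100 tok/s, ~100ms TTFT -> within the ~1-1.5s estimate
print(response_latency_s(100, 100))  # 1.1
```

This is why the smaller 1050 candidates stay interactive: at 100+ tokens/sec, decode time dominates TTFT and a typical short reply completes in around a second.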