# LLM Model Survey
## Overview
This document surveys and evaluates open-weight LLM models for the Atlas voice agent system, with separate recommendations for the work agent (RTX 4080) and the family agent (GTX 1050).
**Hardware Constraints:**
- **RTX 4080**: 16GB VRAM - Work agent, high-capability tasks
- **GTX 1050**: 4GB VRAM - Family agent, always-on, low-latency
## Evaluation Criteria
### Work Agent (RTX 4080) Requirements
- **Coding capabilities**: Code generation, debugging, code review
- **Research capabilities**: Analysis, reasoning, documentation
- **Function calling**: Must support tool/function calling for MCP integration
- **Context window**: 8K-16K tokens minimum
- **VRAM fit**: Must fit in 16GB with quantization
- **Performance**: Reasonable latency (< 5s for typical responses)
### Family Agent (GTX 1050) Requirements
- **Instruction following**: Good at following conversational instructions
- **Function calling**: Must support tool/function calling
- **Low latency**: < 1s response time for interactive use
- **VRAM fit**: Must fit in 4GB with quantization
- **Efficiency**: Low power consumption for always-on operation
- **Context window**: 4K-8K tokens sufficient
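The VRAM-fit criterion in both lists can be sanity-checked with a rough rule of thumb: weight memory is parameter count times bits per weight, plus overhead for the KV cache and runtime buffers. A minimal sketch; the flat overhead figure and the ~4.5 effective bits per weight are assumptions (actual GGUF sizes vary by quantization scheme and context length):

```python
def est_vram_gb(params_billion: float, bits_per_weight: float,
                overhead_gb: float = 0.5) -> float:
    """Rough VRAM estimate: weight memory (params * bits / 8 bytes)
    plus an assumed flat overhead for KV cache and runtime buffers."""
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb

# Phi-3 Mini at ~4.5 effective bits per weight (typical of Q4-style quants)
print(round(est_vram_gb(3.8, 4.5), 1))  # ~2.6 GB
```

For the family agent this lands near the ~2.5GB figure quoted below for Phi-3 Mini; note that long contexts grow the KV cache well beyond a flat overhead, so always-on budgets should leave headroom.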
## Model Comparison Matrix
### RTX 4080 Candidates (Work Agent)
| Model | Size | Quantization | VRAM Usage | Coding | Research | Function Call | Context | Speed | Recommendation |
|-------|------|--------------|------------|--------|----------|---------------|---------|-------|----------------|
| **Llama 3.1 70B** | 70B | Q4 | ~14GB | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Yes | 128K | Medium | **Top Choice** |
| **Llama 3.1 70B** | 70B | Q5 | ~16GB | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Yes | 128K | Medium | Good quality |
| **DeepSeek Coder 33B** | 33B | Q4 | ~8GB | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Yes | 16K | Fast | **Best for coding** |
| **Qwen 2.5 72B** | 72B | Q4 | ~14GB | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Yes | 32K | Medium | Strong alternative |
| **Mistral Large 2 67B** | 67B | Q4 | ~13GB | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Yes | 128K | Medium | Good option |
| **Llama 3.1 8B** | 8B | Q4 | ~5GB | ⭐⭐⭐ | ⭐⭐⭐ | Yes | 128K | Very Fast | Too small for work |
**Recommendation for 4080:**
1. **Primary**: **Llama 3.1 70B Q4** - Best overall balance
2. **Alternative**: **DeepSeek Coder 33B Q4** - If coding is primary focus
3. **Fallback**: **Qwen 2.5 72B Q4** - Strong alternative
### GTX 1050 Candidates (Family Agent)
| Model | Size | Quantization | VRAM Usage | Instruction | Function Call | Context | Speed | Latency | Recommendation |
|-------|------|--------------|------------|-------------|---------------|---------|-------|---------|----------------|
| **Phi-3 Mini 3.8B** | 3.8B | Q4 | ~2.5GB | ⭐⭐⭐⭐⭐ | Yes | 128K | Very Fast | <1s | **Top Choice** |
| **TinyLlama 1.1B** | 1.1B | Q4 | ~0.8GB | ⭐⭐⭐ | Basic | 2K | Extremely Fast | <0.5s | Lightweight option |
| **Gemma 2B** | 2B | Q4 | ~1.5GB | ⭐⭐⭐⭐ | Basic | 8K | Very Fast | <0.8s | Good alternative |
| **Qwen2.5 1.5B** | 1.5B | Q4 | ~1.2GB | ⭐⭐⭐⭐ | Yes | 32K | Very Fast | <0.7s | Strong option |
| **Phi-2 2.7B** | 2.7B | Q4 | ~1.8GB | ⭐⭐⭐⭐ | Basic | 2K | Fast | <1s | Older, less capable |
| **Llama 3.2 3B** | 3B | Q4 | ~2GB | ⭐⭐⭐⭐ | Yes | 128K | Fast | <1s | Good but larger |
**Recommendation for 1050:**
1. **Primary**: **Phi-3 Mini 3.8B Q4** - Best instruction following, good speed
2. **Alternative**: **Qwen2.5 1.5B Q4** - Smaller, still capable
3. **Fallback**: **TinyLlama 1.1B Q4** - If VRAM is tight
## Detailed Model Analysis
### Work Agent Models
#### Llama 3.1 70B Q4/Q5
**Pros:**
- Excellent coding and research capabilities
- Large context window (128K tokens)
- Strong function calling support
- Well-documented and widely used
- Good balance of quality and speed
**Cons:**
- Q5 uses full 16GB (tight fit)
- Slower than smaller models
- Higher power consumption
**VRAM Usage:**
- Q4: ~14GB (comfortable margin)
- Q5: ~16GB (tight, but better quality)
**Best For:** General work tasks, coding, research, complex reasoning
#### DeepSeek Coder 33B Q4
**Pros:**
- Excellent coding capabilities (specialized)
- Faster than 70B models
- Lower VRAM usage (~8GB)
- Good function calling support
- Strong for code generation and debugging
**Cons:**
- Less capable for general research/analysis
- Smaller context window (16K vs 128K)
- Less general-purpose than Llama 3.1
**Best For:** Coding-focused work, code generation, debugging
#### Qwen 2.5 72B Q4
**Pros:**
- Strong multilingual support
- Good coding and research capabilities
- Large context (32K tokens)
- Competitive with Llama 3.1
**Cons:**
- Less community support than Llama
- Slightly less polished tool calling
**Best For:** Multilingual work, research, general tasks
### Family Agent Models
#### Phi-3 Mini 3.8B Q4
**Pros:**
- Excellent instruction following
- Very fast inference (<1s)
- Low VRAM usage (~2.5GB)
- Good function calling support
- Large context (128K tokens)
- Microsoft-backed, well-maintained
**Cons:**
- Slightly larger than alternatives
- May be overkill for simple tasks
**Best For:** Family conversations, task management, general Q&A
#### Qwen2.5 1.5B Q4
**Pros:**
- Very small VRAM footprint (~1.2GB)
- Fast inference
- Good instruction following
- Large context (32K tokens)
- Efficient for always-on use
**Cons:**
- Less capable than Phi-3 Mini
- May struggle with complex requests
**Best For:** Lightweight always-on agent, simple tasks
#### TinyLlama 1.1B Q4
**Pros:**
- Extremely small (~0.8GB VRAM)
- Very fast inference
- Minimal resource usage
**Cons:**
- Limited capabilities
- Small context window (2K tokens)
- May not handle complex conversations well
**Best For:** Very resource-constrained scenarios
## Quantization Comparison
VRAM figures below are relative to an 8-bit (Q8) build of the same model; quality figures are relative to full precision.
### Q4 (4-bit)
- **Quality**: ~95-98% of full precision
- **VRAM**: ~50% of Q8 size
- **Speed**: Fast
- **Recommendation**: **Use for both agents**
### Q5 (5-bit)
- **Quality**: ~98-99% of full precision
- **VRAM**: ~62% of Q8 size
- **Speed**: Slightly slower than Q4
- **Recommendation**: Consider for the 4080 if quality is critical
### Q6 (6-bit)
- **Quality**: ~99% of full precision
- **VRAM**: ~75% of Q8 size
- **Speed**: Slower
- **Recommendation**: Not recommended (marginal quality gain over Q5)
### Q8 (8-bit)
- **Quality**: Near full precision
- **VRAM**: Baseline (100%)
- **Speed**: Slowest
- **Recommendation**: Not recommended (too large for either GPU's VRAM budget)
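Assuming the VRAM percentages above are measured against the 8-bit baseline, the arithmetic is simply bits divided by eight:

```python
def size_vs_q8(bits: int) -> float:
    """Approximate model size relative to an 8-bit quantization of the
    same model (ignores per-block scale metadata, which adds a little)."""
    return bits / 8

for bits in (4, 5, 6, 8):
    print(f"Q{bits}: {size_vs_q8(bits):.0%}")
```

Real GGUF quant types (Q4_K_M, Q5_K_M, etc.) carry extra scale metadata, so measured file sizes run slightly above these idealized ratios.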
## Function Calling Support
All recommended models support function calling:
- **Llama 3.1**: Native function calling via `tools` parameter
- **DeepSeek Coder**: Function calling support
- **Qwen 2.5**: Function calling support
- **Phi-3 Mini**: Function calling support
- **TinyLlama**: Basic function calling (may need fine-tuning)
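In practice, function calling is exercised through an OpenAI-style `tools` array, which common local inference servers (Ollama, vLLM) accept on their chat endpoints. A sketch of building one tool entry; the `get_weather` tool here is a hypothetical example, not an Atlas tool:

```python
def make_tool(name: str, description: str,
              properties: dict, required: list) -> dict:
    """Build one OpenAI-style function-tool entry for a chat request."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }

# Hypothetical tool for illustration only
weather_tool = make_tool(
    "get_weather",
    "Look up the current weather for a city",
    {"city": {"type": "string", "description": "City name"}},
    ["city"],
)
```

Because every recommended model speaks this same payload shape, swapping models behind the MCP layer should not require changing tool definitions.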
## Performance Benchmarks (Estimated)
### RTX 4080 (16GB VRAM)
| Model | Tokens/sec | Latency (first token) | Latency (100 tokens) |
|-------|------------|----------------------|----------------------|
| Llama 3.1 70B Q4 | ~25-35 | ~200-300ms | ~3-4s |
| Llama 3.1 70B Q5 | ~20-30 | ~250-350ms | ~3.5-5s |
| DeepSeek Coder 33B Q4 | ~40-60 | ~100-200ms | ~2-3s |
| Qwen 2.5 72B Q4 | ~25-35 | ~200-300ms | ~3-4s |
### GTX 1050 (4GB VRAM)
| Model | Tokens/sec | Latency (first token) | Latency (100 tokens) |
|-------|------------|----------------------|----------------------|
| Phi-3 Mini 3.8B Q4 | ~80-120 | ~50-100ms | ~1-1.5s |
| Qwen2.5 1.5B Q4 | ~100-150 | ~30-60ms | ~0.7-1s |
| TinyLlama 1.1B Q4 | ~150-200 | ~20-40ms | ~0.5-0.7s |
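The 100-token latency columns follow from the first two: total time is roughly time-to-first-token plus token count divided by decode speed. A quick sketch using midpoint figures from the tables above:

```python
def est_total_latency_s(ttft_s: float, tokens_per_s: float,
                        n_tokens: int = 100) -> float:
    """First-token latency plus steady-state decode time for n_tokens."""
    return ttft_s + n_tokens / tokens_per_s

# Llama 3.1 70B Q4 at ~30 tok/s with ~250ms to first token
print(round(est_total_latency_s(0.25, 30), 1))  # ~3.6s, inside the ~3-4s band
```

The same arithmetic is a useful cross-check when benchmarking on real hardware in the next-steps phase.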
## Final Recommendations
### Work Agent (RTX 4080)
**Primary Choice: Llama 3.1 70B Q4**
- Best overall capabilities
- Fits comfortably in 16GB VRAM
- Excellent for coding, research, and general work tasks
- Strong function calling support
- Large context window (128K)
**Alternative: DeepSeek Coder 33B Q4**
- If coding is the primary use case
- Faster inference
- Lower VRAM usage allows for more headroom
### Family Agent (GTX 1050)
**Primary Choice: Phi-3 Mini 3.8B Q4**
- Excellent instruction following
- Fast inference (<1s latency)
- Low VRAM usage (~2.5GB)
- Good function calling support
- Large context window (128K)
**Alternative: Qwen2.5 1.5B Q4**
- If VRAM is very tight
- Still capable for simple tasks
- Very fast inference
## Implementation Notes
### Model Sources
- **Hugging Face**: Primary source for all models
- **Ollama**: Pre-configured models (easier setup)
- **Direct download**: For custom quantization
### Inference Servers
- **Ollama**: Easiest setup, good for prototyping
- **vLLM**: Best throughput, batching support
- **llama.cpp**: Lightweight, efficient, good for 1050
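Whichever server is chosen, agent code can stay server-agnostic by constructing a plain chat-request body. A minimal sketch of the JSON payload shape for Ollama's `/api/chat` endpoint; the `phi3:mini` model tag is an assumed example:

```python
import json

def build_chat_request(model: str, prompt: str, stream: bool = False) -> dict:
    """Assemble the JSON body for a chat request to a local
    inference server (Ollama-style /api/chat shape)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

body = json.dumps(build_chat_request("phi3:mini", "Add milk to the shopping list"))
```

Keeping request construction in one helper like this makes it cheap to switch between Ollama for prototyping and vLLM or llama.cpp in production.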
### Quantization Tools
- **llama.cpp**: Built-in quantization
- **AutoGPTQ**: For GPTQ quantization
- **AWQ**: Alternative quantization method
## Next Steps
1. Complete this survey (TICKET-017)
2. Complete capacity assessment (TICKET-018)
3. Finalize model selection (TICKET-019, TICKET-020)
4. Download and test selected models
5. Benchmark on actual hardware
6. Set up inference servers (TICKET-021, TICKET-022)
## References
- [Llama 3.1](https://llama.meta.com/llama-3-1/)
- [DeepSeek Coder](https://github.com/deepseek-ai/DeepSeek-Coder)
- [Phi-3](https://www.microsoft.com/en-us/research/blog/phi-3/)
- [Qwen 2.5](https://qwenlm.github.io/blog/qwen2.5/)
- [Model Quantization Guide](https://github.com/ggerganov/llama.cpp)
---
**Last Updated**: 2024-01-XX
**Status**: Survey Complete - Ready for TICKET-018 (Capacity Assessment)