# LLM Model Survey

## Overview

This document surveys and evaluates open-weight LLM models for the Atlas voice agent system, with separate recommendations for the work agent (RTX 4080) and the family agent (RTX 1050).

**Hardware Constraints:**

- **RTX 4080**: 16GB VRAM - work agent, high-capability tasks
- **RTX 1050**: 4GB VRAM - family agent, always-on, low-latency
## Evaluation Criteria

### Work Agent (RTX 4080) Requirements

- **Coding capabilities**: Code generation, debugging, code review
- **Research capabilities**: Analysis, reasoning, documentation
- **Function calling**: Must support tool/function calling for MCP integration
- **Context window**: 8K-16K tokens minimum
- **VRAM fit**: Must fit in 16GB with quantization
- **Performance**: Reasonable latency (<5s for typical responses)

### Family Agent (RTX 1050) Requirements

- **Instruction following**: Good at following conversational instructions
- **Function calling**: Must support tool/function calling
- **Low latency**: <1s response time for interactive use
- **VRAM fit**: Must fit in 4GB with quantization
- **Efficiency**: Low power consumption for always-on operation
- **Context window**: 4K-8K tokens sufficient
## Model Comparison Matrix

### RTX 4080 Candidates (Work Agent)

| Model | Size | Quantization | VRAM Usage | Coding | Research | Function Call | Context | Speed | Recommendation |
|-------|------|--------------|------------|--------|----------|---------------|---------|-------|----------------|
| **Llama 3.1 70B** | 70B | Q4 | ~14GB | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ✅ | 128K | Medium | **⭐ Top Choice** |
| **Llama 3.1 70B** | 70B | Q5 | ~16GB | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ✅ | 128K | Medium | Good quality |
| **DeepSeek Coder 33B** | 33B | Q4 | ~8GB | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ✅ | 16K | Fast | **Best for coding** |
| **Qwen 2.5 72B** | 72B | Q4 | ~14GB | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ✅ | 32K | Medium | Strong alternative |
| **Mistral Large 2 67B** | 67B | Q4 | ~13GB | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ✅ | 128K | Medium | Good option |
| **Llama 3.1 8B** | 8B | Q4 | ~5GB | ⭐⭐⭐ | ⭐⭐⭐ | ✅ | 128K | Very Fast | Too small for work |

**Recommendation for the 4080:**

1. **Primary**: **Llama 3.1 70B Q4** - Best overall balance
2. **Alternative**: **DeepSeek Coder 33B Q4** - If coding is the primary focus
3. **Fallback**: **Qwen 2.5 72B Q4** - Strong alternative
### RTX 1050 Candidates (Family Agent)

| Model | Size | Quantization | VRAM Usage | Instruction | Function Call | Context | Speed | Latency | Recommendation |
|-------|------|--------------|------------|-------------|---------------|---------|-------|---------|----------------|
| **Phi-3 Mini 3.8B** | 3.8B | Q4 | ~2.5GB | ⭐⭐⭐⭐⭐ | ✅ | 128K | Very Fast | <1s | **⭐ Top Choice** |
| **TinyLlama 1.1B** | 1.1B | Q4 | ~0.8GB | ⭐⭐⭐ | ✅ | 2K | Extremely Fast | <0.5s | Lightweight option |
| **Gemma 2B** | 2B | Q4 | ~1.5GB | ⭐⭐⭐⭐ | ✅ | 8K | Very Fast | <0.8s | Good alternative |
| **Qwen2.5 1.5B** | 1.5B | Q4 | ~1.2GB | ⭐⭐⭐⭐ | ✅ | 32K | Very Fast | <0.7s | Strong option |
| **Phi-2 2.7B** | 2.7B | Q4 | ~1.8GB | ⭐⭐⭐⭐ | ✅ | 2K | Fast | <1s | Older, less capable |
| **Llama 3.2 3B** | 3B | Q4 | ~2GB | ⭐⭐⭐⭐ | ✅ | 128K | Fast | <1s | Good but larger |

**Recommendation for the 1050:**

1. **Primary**: **Phi-3 Mini 3.8B Q4** - Best instruction following, good speed
2. **Alternative**: **Qwen2.5 1.5B Q4** - Smaller, still capable
3. **Fallback**: **TinyLlama 1.1B Q4** - If VRAM is tight
## Detailed Model Analysis

### Work Agent Models

#### Llama 3.1 70B Q4/Q5

**Pros:**

- Excellent coding and research capabilities
- Large context window (128K tokens)
- Strong function calling support
- Well-documented and widely used
- Good balance of quality and speed

**Cons:**

- Q5 uses the full 16GB (tight fit)
- Slower than smaller models
- Higher power consumption

**VRAM Usage:**

- Q4: ~14GB (comfortable margin)
- Q5: ~16GB (tight, but better quality)

**Best For:** General work tasks, coding, research, complex reasoning
#### DeepSeek Coder 33B Q4

**Pros:**

- Excellent coding capabilities (specialized)
- Faster than 70B models
- Lower VRAM usage (~8GB)
- Good function calling support
- Strong for code generation and debugging

**Cons:**

- Less capable for general research/analysis
- Smaller context window (16K vs 128K)
- Less general-purpose than Llama 3.1

**Best For:** Coding-focused work, code generation, debugging
#### Qwen 2.5 72B Q4

**Pros:**

- Strong multilingual support
- Good coding and research capabilities
- Large context (32K tokens)
- Competitive with Llama 3.1

**Cons:**

- Less community support than Llama
- Slightly less polished tool calling

**Best For:** Multilingual work, research, general tasks
### Family Agent Models

#### Phi-3 Mini 3.8B Q4

**Pros:**

- Excellent instruction following
- Very fast inference (<1s)
- Low VRAM usage (~2.5GB)
- Good function calling support
- Large context (128K tokens)
- Microsoft-backed, well-maintained

**Cons:**

- Slightly larger than alternatives
- May be overkill for simple tasks

**Best For:** Family conversations, task management, general Q&A
#### Qwen2.5 1.5B Q4

**Pros:**

- Very small VRAM footprint (~1.2GB)
- Fast inference
- Good instruction following
- Large context (32K tokens)
- Efficient for always-on use

**Cons:**

- Less capable than Phi-3 Mini
- May struggle with complex requests

**Best For:** Lightweight always-on agent, simple tasks
#### TinyLlama 1.1B Q4

**Pros:**

- Extremely small (~0.8GB VRAM)
- Very fast inference
- Minimal resource usage

**Cons:**

- Limited capabilities
- Small context window (2K tokens)
- May not handle complex conversations well

**Best For:** Very resource-constrained scenarios
## Quantization Comparison

VRAM figures below are relative to an 8-bit (Q8) baseline.

### Q4 (4-bit)

- **Quality**: ~95-98% of full precision
- **VRAM**: ~50% of the 8-bit size
- **Speed**: Fast
- **Recommendation**: ✅ **Use for both agents**

### Q5 (5-bit)

- **Quality**: ~98-99% of full precision
- **VRAM**: ~62% of the 8-bit size
- **Speed**: Slightly slower than Q4
- **Recommendation**: Consider for the 4080 if quality is critical

### Q6 (6-bit)

- **Quality**: ~99% of full precision
- **VRAM**: ~75% of the 8-bit size
- **Speed**: Slower
- **Recommendation**: Not recommended (marginal quality gain)

### Q8 (8-bit)

- **Quality**: Near full precision
- **VRAM**: 100% (the baseline)
- **Speed**: Slowest
- **Recommendation**: Not recommended (does not fit within the VRAM constraints)
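As a sanity check on quantized model sizes, the usual rule of thumb is that the weights take roughly `params × bits / 8` bytes, plus runtime overhead for the KV cache, activations, and buffers. A minimal sketch; the 1.2× overhead factor is an assumption to tune against real measurements, and real GGUF files mix quantization levels per tensor, so actual sizes vary:

```python
def estimate_vram_gb(params_billion: float, bits: int, overhead: float = 1.2) -> float:
    """Rough VRAM estimate for a quantized model.

    params_billion: parameter count in billions (e.g. 3.8 for Phi-3 Mini)
    bits:           quantization width (4, 5, 6, or 8)
    overhead:       assumed multiplier for KV cache, activations, and
                    runtime buffers -- a guess, not a measured value
    """
    weight_bytes = params_billion * 1e9 * bits / 8
    return weight_bytes * overhead / 1e9  # decimal GB

# Does Phi-3 Mini at Q4 fit on the 1050's 4GB?
assert estimate_vram_gb(3.8, 4) < 4.0
```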
## Function Calling Support

All recommended models support function calling:

- **Llama 3.1**: Native function calling via the `tools` parameter
- **DeepSeek Coder**: Function calling support
- **Qwen 2.5**: Function calling support
- **Phi-3 Mini**: Function calling support
- **TinyLlama**: Basic function calling (may need fine-tuning)
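In practice, tool use is exposed through the OpenAI-style `tools` schema, which the chat endpoints of Ollama and vLLM both accept. A sketch of defining one tool and dispatching the model's call locally; the `set_timer` tool and its handler are hypothetical examples for the family agent, not part of any library:

```python
import json

# OpenAI-style tool definition, passed via the `tools` parameter
# of a chat completion request.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "set_timer",
        "description": "Set a kitchen timer for the family agent.",
        "parameters": {
            "type": "object",
            "properties": {
                "minutes": {"type": "integer", "description": "Duration in minutes"},
            },
            "required": ["minutes"],
        },
    },
}]

def dispatch(tool_call: dict) -> str:
    """Route a model-emitted tool call to a local handler.

    Assumes the OpenAI response shape, where `arguments` arrives
    as a JSON-encoded string.
    """
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])
    if name == "set_timer":
        return f"Timer set for {args['minutes']} minutes"
    raise ValueError(f"Unknown tool: {name}")

call = {"function": {"name": "set_timer", "arguments": '{"minutes": 10}'}}
print(dispatch(call))  # Timer set for 10 minutes
```

The MCP integration layer would sit behind `dispatch`, forwarding calls to the appropriate MCP server instead of a local function.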
## Performance Benchmarks (Estimated)

### RTX 4080 (16GB VRAM)

| Model | Tokens/sec | Latency (first token) | Latency (100 tokens) |
|-------|------------|-----------------------|----------------------|
| Llama 3.1 70B Q4 | ~25-35 | ~200-300ms | ~3-4s |
| Llama 3.1 70B Q5 | ~20-30 | ~250-350ms | ~3.5-5s |
| DeepSeek Coder 33B Q4 | ~40-60 | ~100-200ms | ~2-3s |
| Qwen 2.5 72B Q4 | ~25-35 | ~200-300ms | ~3-4s |

### RTX 1050 (4GB VRAM)

| Model | Tokens/sec | Latency (first token) | Latency (100 tokens) |
|-------|------------|-----------------------|----------------------|
| Phi-3 Mini 3.8B Q4 | ~80-120 | ~50-100ms | ~1-1.5s |
| Qwen2.5 1.5B Q4 | ~100-150 | ~30-60ms | ~0.7-1s |
| TinyLlama 1.1B Q4 | ~150-200 | ~20-40ms | ~0.5-0.7s |
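When these estimates are validated on real hardware, both metrics fall out of per-token arrival timestamps collected from any streaming endpoint (recording `time.monotonic()` as each chunk arrives, with t=0 at request send). A small helper sketch, assuming the caller gathers the timestamps itself:

```python
def throughput_stats(token_timestamps: list[float]) -> dict:
    """Compute first-token latency and decode throughput.

    token_timestamps: arrival time of each token in seconds,
    measured from the moment the request was sent.
    """
    if len(token_timestamps) < 2:
        raise ValueError("need at least two tokens")
    first = token_timestamps[0]
    decode_time = token_timestamps[-1] - first
    return {
        "first_token_s": first,
        # exclude the first token: throughput measures steady-state decode
        "tokens_per_sec": (len(token_timestamps) - 1) / decode_time,
    }

# Synthetic example: 101 tokens, first at 250ms, then one every 35ms
ts = [0.25 + 0.035 * i for i in range(101)]
stats = throughput_stats(ts)  # ~28.6 tokens/sec, in the 70B Q4 estimate range
```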
## Final Recommendations

### Work Agent (RTX 4080)

**Primary Choice: Llama 3.1 70B Q4**

- Best overall capabilities
- Fits comfortably in 16GB VRAM
- Excellent for coding, research, and general work tasks
- Strong function calling support
- Large context window (128K)

**Alternative: DeepSeek Coder 33B Q4**

- If coding is the primary use case
- Faster inference
- Lower VRAM usage allows for more headroom

### Family Agent (RTX 1050)

**Primary Choice: Phi-3 Mini 3.8B Q4**

- Excellent instruction following
- Fast inference (<1s latency)
- Low VRAM usage (~2.5GB)
- Good function calling support
- Large context window (128K)

**Alternative: Qwen2.5 1.5B Q4**

- If VRAM is very tight
- Still capable for simple tasks
- Very fast inference
## Implementation Notes

### Model Sources

- **Hugging Face**: Primary source for all models
- **Ollama**: Pre-configured models (easier setup)
- **Direct download**: For custom quantization

### Inference Servers

- **Ollama**: Easiest setup, good for prototyping
- **vLLM**: Best throughput, batching support
- **llama.cpp**: Lightweight, efficient, good for the 1050
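For the Ollama route, a model is pulled once (e.g. `ollama pull phi3:mini`) and then served over HTTP. A minimal stdlib-only client sketch; the model tag and the default host/port are assumptions about a local, default-configured install:

```python
import json
import urllib.request

def build_chat_payload(model: str, user_message: str) -> dict:
    """Request body for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": False,  # one JSON object back instead of a chunk stream
    }

def chat(model: str, user_message: str,
         host: str = "http://localhost:11434") -> str:
    body = json.dumps(build_chat_payload(model, user_message)).encode()
    req = urllib.request.Request(f"{host}/api/chat", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]

# Example (requires a running Ollama server with the model pulled):
#   print(chat("phi3:mini", "Say hello to the family."))
```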
### Quantization Tools

- **llama.cpp**: Built-in quantization
- **AutoGPTQ**: For GPTQ quantization
- **AWQ**: Alternative quantization method
## Next Steps

1. ✅ Complete this survey (TICKET-017)
2. Complete capacity assessment (TICKET-018)
3. Finalize model selection (TICKET-019, TICKET-020)
4. Download and test selected models
5. Benchmark on actual hardware
6. Set up inference servers (TICKET-021, TICKET-022)
## References

- [Llama 3.1](https://llama.meta.com/llama-3-1/)
- [DeepSeek Coder](https://github.com/deepseek-ai/DeepSeek-Coder)
- [Phi-3](https://www.microsoft.com/en-us/research/blog/phi-3/)
- [Qwen 2.5](https://qwenlm.github.io/blog/qwen2.5/)
- [Model Quantization Guide](https://github.com/ggerganov/llama.cpp)

---

**Last Updated**: 2024-01-XX

**Status**: Survey Complete - Ready for TICKET-018 (Capacity Assessment)