# LLM Model Survey
## Overview
This document surveys and evaluates open-weight LLM models for the Atlas voice agent system, with separate recommendations for the work agent (RTX 4080) and the family agent (GTX 1050).
**Hardware Constraints:**
- **RTX 4080**: 16GB VRAM - Work agent, high-capability tasks
- **GTX 1050**: 4GB VRAM - Family agent, always-on, low-latency
## Evaluation Criteria
### Work Agent (RTX 4080) Requirements
- **Coding capabilities**: Code generation, debugging, code review
- **Research capabilities**: Analysis, reasoning, documentation
- **Function calling**: Must support tool/function calling for MCP integration
- **Context window**: 8K-16K tokens minimum
- **VRAM fit**: Must fit in 16GB with quantization
- **Performance**: Reasonable latency (< 5s for typical responses)
### Family Agent (GTX 1050) Requirements
- **Instruction following**: Good at following conversational instructions
- **Function calling**: Must support tool/function calling
- **Low latency**: < 1s response time for interactive use
- **VRAM fit**: Must fit in 4GB with quantization
- **Efficiency**: Low power consumption for always-on operation
- **Context window**: 4K-8K tokens sufficient
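The VRAM-fit criterion in both lists can be sanity-checked with a rough rule of thumb: weight memory is parameter count times bits per weight, plus overhead for the KV cache and runtime buffers. A minimal sketch; the flat overhead figure and the ~4.5 effective bits per weight are assumptions (actual GGUF sizes vary by quantization scheme and context length):

```python
def est_vram_gb(params_billion: float, bits_per_weight: float,
                overhead_gb: float = 0.5) -> float:
    """Rough VRAM estimate: weight memory (params * bits / 8 bytes)
    plus an assumed flat overhead for KV cache and runtime buffers."""
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb

# Phi-3 Mini at ~4.5 effective bits per weight (typical of Q4-style quants)
print(round(est_vram_gb(3.8, 4.5), 1))  # ~2.6 GB
```

For the family agent this lands near the ~2.5GB figure quoted below for Phi-3 Mini; note that long contexts grow the KV cache well beyond a flat overhead, so always-on budgets should leave headroom.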
## Model Comparison Matrix
### RTX 4080 Candidates (Work Agent)
| Model | Size | Quantization | VRAM Usage | Coding | Research | Function Call | Context | Speed | Recommendation |
|-------|------|--------------|------------|--------|----------|---------------|---------|-------|----------------|
| **Llama 3.1 70B** | 70B | Q4 | ~14GB | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Yes | 128K | Medium | **Top Choice** |
| **Llama 3.1 70B** | 70B | Q5 | ~16GB | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Yes | 128K | Medium | Good quality |
| **DeepSeek Coder 33B** | 33B | Q4 | ~8GB | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Yes | 16K | Fast | **Best for coding** |
| **Qwen 2.5 72B** | 72B | Q4 | ~14GB | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Yes | 32K | Medium | Strong alternative |
| **Mistral Large 2 67B** | 67B | Q4 | ~13GB | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Yes | 128K | Medium | Good option |
| **Llama 3.1 8B** | 8B | Q4 | ~5GB | ⭐⭐⭐ | ⭐⭐⭐ | Yes | 128K | Very Fast | Too small for work |
**Recommendation for 4080:**
1. **Primary**: **Llama 3.1 70B Q4** - Best overall balance
2. **Alternative**: **DeepSeek Coder 33B Q4** - If coding is primary focus
3. **Fallback**: **Qwen 2.5 72B Q4** - Strong alternative
### GTX 1050 Candidates (Family Agent)
| Model | Size | Quantization | VRAM Usage | Instruction | Function Call | Context | Speed | Latency | Recommendation |
|-------|------|--------------|------------|-------------|---------------|---------|-------|---------|----------------|
| **Phi-3 Mini 3.8B** | 3.8B | Q4 | ~2.5GB | ⭐⭐⭐⭐⭐ | Yes | 128K | Very Fast | <1s | **Top Choice** |
| **TinyLlama 1.1B** | 1.1B | Q4 | ~0.8GB | ⭐⭐⭐ | Basic | 2K | Extremely Fast | <0.5s | Lightweight option |
| **Gemma 2B** | 2B | Q4 | ~1.5GB | ⭐⭐⭐⭐ | Basic | 8K | Very Fast | <0.8s | Good alternative |
| **Qwen2.5 1.5B** | 1.5B | Q4 | ~1.2GB | ⭐⭐⭐⭐ | Yes | 32K | Very Fast | <0.7s | Strong option |
| **Phi-2 2.7B** | 2.7B | Q4 | ~1.8GB | ⭐⭐⭐⭐ | Basic | 2K | Fast | <1s | Older, less capable |
| **Llama 3.2 3B** | 3B | Q4 | ~2GB | ⭐⭐⭐⭐ | Yes | 128K | Fast | <1s | Good but larger |
**Recommendation for 1050:**
1. **Primary**: **Phi-3 Mini 3.8B Q4** - Best instruction following, good speed
2. **Alternative**: **Qwen2.5 1.5B Q4** - Smaller, still capable
3. **Fallback**: **TinyLlama 1.1B Q4** - If VRAM is tight
## Detailed Model Analysis
### Work Agent Models
#### Llama 3.1 70B Q4/Q5
**Pros:**
- Excellent coding and research capabilities
- Large context window (128K tokens)
- Strong function calling support
- Well-documented and widely used
- Good balance of quality and speed
**Cons:**
- Q5 uses full 16GB (tight fit)
- Slower than smaller models
- Higher power consumption
**VRAM Usage:**
- Q4: ~14GB (comfortable margin)
- Q5: ~16GB (tight, but better quality)
**Best For:** General work tasks, coding, research, complex reasoning
#### DeepSeek Coder 33B Q4
**Pros:**
- Excellent coding capabilities (specialized)
- Faster than 70B models
- Lower VRAM usage (~8GB)
- Good function calling support
- Strong for code generation and debugging
**Cons:**
- Less capable for general research/analysis
- Smaller context window (16K vs 128K)
- Less general-purpose than Llama 3.1
**Best For:** Coding-focused work, code generation, debugging
#### Qwen 2.5 72B Q4
**Pros:**
- Strong multilingual support
- Good coding and research capabilities
- Large context (32K tokens)
- Competitive with Llama 3.1
**Cons:**
- Less community support than Llama
- Slightly less polished tool calling
**Best For:** Multilingual work, research, general tasks
### Family Agent Models
#### Phi-3 Mini 3.8B Q4
**Pros:**
- Excellent instruction following
- Very fast inference (<1s)
- Low VRAM usage (~2.5GB)
- Good function calling support
- Large context (128K tokens)
- Microsoft-backed, well-maintained
**Cons:**
- Slightly larger than alternatives
- May be overkill for simple tasks
**Best For:** Family conversations, task management, general Q&A
#### Qwen2.5 1.5B Q4
**Pros:**
- Very small VRAM footprint (~1.2GB)
- Fast inference
- Good instruction following
- Large context (32K tokens)
- Efficient for always-on use
**Cons:**
- Less capable than Phi-3 Mini
- May struggle with complex requests
**Best For:** Lightweight always-on agent, simple tasks
#### TinyLlama 1.1B Q4
**Pros:**
- Extremely small (~0.8GB VRAM)
- Very fast inference
- Minimal resource usage
**Cons:**
- Limited capabilities
- Small context window (2K tokens)
- May not handle complex conversations well
**Best For:** Very resource-constrained scenarios
## Quantization Comparison
VRAM figures below are relative to an 8-bit (Q8) build of the same model; quality figures are relative to full precision.
### Q4 (4-bit)
- **Quality**: ~95-98% of full precision
- **VRAM**: ~50% of Q8 size
- **Speed**: Fast
- **Recommendation**: **Use for both agents**
### Q5 (5-bit)
- **Quality**: ~98-99% of full precision
- **VRAM**: ~62% of Q8 size
- **Speed**: Slightly slower than Q4
- **Recommendation**: Consider for the 4080 if quality is critical
### Q6 (6-bit)
- **Quality**: ~99% of full precision
- **VRAM**: ~75% of Q8 size
- **Speed**: Slower
- **Recommendation**: Not recommended (marginal quality gain over Q5)
### Q8 (8-bit)
- **Quality**: Near full precision
- **VRAM**: Baseline (100%)
- **Speed**: Slowest
- **Recommendation**: Not recommended (too large for either GPU's VRAM budget)
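Assuming the VRAM percentages above are measured against the 8-bit baseline, the arithmetic is simply bits divided by eight:

```python
def size_vs_q8(bits: int) -> float:
    """Approximate model size relative to an 8-bit quantization of the
    same model (ignores per-block scale metadata, which adds a little)."""
    return bits / 8

for bits in (4, 5, 6, 8):
    print(f"Q{bits}: {size_vs_q8(bits):.0%}")
```

Real GGUF quant types (Q4_K_M, Q5_K_M, etc.) carry extra scale metadata, so measured file sizes run slightly above these idealized ratios.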
## Function Calling Support
All recommended models support function calling:
- **Llama 3.1**: Native function calling via `tools` parameter
- **DeepSeek Coder**: Function calling support
- **Qwen 2.5**: Function calling support
- **Phi-3 Mini**: Function calling support
- **TinyLlama**: Basic function calling (may need fine-tuning)
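In practice, function calling is exercised through an OpenAI-style `tools` array, which common local inference servers (Ollama, vLLM) accept on their chat endpoints. A sketch of building one tool entry; the `get_weather` tool here is a hypothetical example, not an Atlas tool:

```python
def make_tool(name: str, description: str,
              properties: dict, required: list) -> dict:
    """Build one OpenAI-style function-tool entry for a chat request."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }

# Hypothetical tool for illustration only
weather_tool = make_tool(
    "get_weather",
    "Look up the current weather for a city",
    {"city": {"type": "string", "description": "City name"}},
    ["city"],
)
```

Because every recommended model speaks this same payload shape, swapping models behind the MCP layer should not require changing tool definitions.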
## Performance Benchmarks (Estimated)
### RTX 4080 (16GB VRAM)
| Model | Tokens/sec | Latency (first token) | Latency (100 tokens) |
|-------|------------|----------------------|----------------------|
| Llama 3.1 70B Q4 | ~25-35 | ~200-300ms | ~3-4s |
| Llama 3.1 70B Q5 | ~20-30 | ~250-350ms | ~3.5-5s |
| DeepSeek Coder 33B Q4 | ~40-60 | ~100-200ms | ~2-3s |
| Qwen 2.5 72B Q4 | ~25-35 | ~200-300ms | ~3-4s |
### GTX 1050 (4GB VRAM)
| Model | Tokens/sec | Latency (first token) | Latency (100 tokens) |
|-------|------------|----------------------|----------------------|
| Phi-3 Mini 3.8B Q4 | ~80-120 | ~50-100ms | ~1-1.5s |
| Qwen2.5 1.5B Q4 | ~100-150 | ~30-60ms | ~0.7-1s |
| TinyLlama 1.1B Q4 | ~150-200 | ~20-40ms | ~0.5-0.7s |
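The 100-token latency columns follow from the first two: total time is roughly time-to-first-token plus token count divided by decode speed. A quick sketch using midpoint figures from the tables above:

```python
def est_total_latency_s(ttft_s: float, tokens_per_s: float,
                        n_tokens: int = 100) -> float:
    """First-token latency plus steady-state decode time for n_tokens."""
    return ttft_s + n_tokens / tokens_per_s

# Llama 3.1 70B Q4 at ~30 tok/s with ~250ms to first token
print(round(est_total_latency_s(0.25, 30), 1))  # ~3.6s, inside the ~3-4s band
```

The same arithmetic is a useful cross-check when benchmarking on real hardware in the next-steps phase.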
## Final Recommendations
### Work Agent (RTX 4080)
**Primary Choice: Llama 3.1 70B Q4**
- Best overall capabilities
- Fits comfortably in 16GB VRAM
- Excellent for coding, research, and general work tasks
- Strong function calling support
- Large context window (128K)
**Alternative: DeepSeek Coder 33B Q4**
- If coding is the primary use case
- Faster inference
- Lower VRAM usage allows for more headroom
### Family Agent (GTX 1050)
**Primary Choice: Phi-3 Mini 3.8B Q4**
- Excellent instruction following
- Fast inference (<1s latency)
- Low VRAM usage (~2.5GB)
- Good function calling support
- Large context window (128K)
**Alternative: Qwen2.5 1.5B Q4**
- If VRAM is very tight
- Still capable for simple tasks
- Very fast inference
## Implementation Notes
### Model Sources
- **Hugging Face**: Primary source for all models
- **Ollama**: Pre-configured models (easier setup)
- **Direct download**: For custom quantization
### Inference Servers
- **Ollama**: Easiest setup, good for prototyping
- **vLLM**: Best throughput, batching support
- **llama.cpp**: Lightweight, efficient, good for 1050
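Whichever server is chosen, agent code can stay server-agnostic by constructing a plain chat-request body. A minimal sketch of the JSON payload shape for Ollama's `/api/chat` endpoint; the `phi3:mini` model tag is an assumed example:

```python
import json

def build_chat_request(model: str, prompt: str, stream: bool = False) -> dict:
    """Assemble the JSON body for a chat request to a local
    inference server (Ollama-style /api/chat shape)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

body = json.dumps(build_chat_request("phi3:mini", "Add milk to the shopping list"))
```

Keeping request construction in one helper like this makes it cheap to switch between Ollama for prototyping and vLLM or llama.cpp in production.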
### Quantization Tools
- **llama.cpp**: Built-in quantization
- **AutoGPTQ**: For GPTQ quantization
- **AWQ**: Alternative quantization method
## Next Steps
1. Complete this survey (TICKET-017)
2. Complete capacity assessment (TICKET-018)
3. Finalize model selection (TICKET-019, TICKET-020)
4. Download and test selected models
5. Benchmark on actual hardware
6. Set up inference servers (TICKET-021, TICKET-022)
## References
- [Llama 3.1](https://llama.meta.com/llama-3-1/)
- [DeepSeek Coder](https://github.com/deepseek-ai/DeepSeek-Coder)
- [Phi-3](https://www.microsoft.com/en-us/research/blog/phi-3/)
- [Qwen 2.5](https://qwenlm.github.io/blog/qwen2.5/)
- [Model Quantization Guide](https://github.com/ggerganov/llama.cpp)
---
**Last Updated**: 2024-01-XX
**Status**: Survey Complete - Ready for TICKET-018 (Capacity Assessment)