# LLM Capacity Assessment
## Overview
This document assesses VRAM capacity, context window limits, and memory requirements for running LLMs on RTX 4080 (16GB) and GTX 1050 (4GB) hardware.
## VRAM Capacity Analysis
### RTX 4080 (16GB VRAM)
**Available VRAM**: ~15.5GB (after system overhead)
#### Model Size Capacity
| Model Size | Quantization | VRAM Usage | Status | Notes |
|------------|--------------|------------|--------|-------|
| 70B | Q4 | ~14GB | ✅ Comfortable | Recommended |
| 70B | Q5 | ~16GB | ⚠️ Tight | Possible but no headroom |
| 70B | Q6 | ~18GB | ❌ Won't fit | Too large |
| 72B | Q4 | ~14.5GB | ✅ Comfortable | Qwen 2.5 72B |
| 67B | Q4 | ~13.5GB | ✅ Comfortable | DeepSeek LLM 67B |
| 33B | Q4 | ~8GB | ✅ Plenty of room | DeepSeek Coder |
| 8B | Q4 | ~5GB | ✅ Plenty of room | Too small for work agent |
**Recommendation**:
- **Q4 quantization** for 70B models (comfortable margin)
- **Q5 possible** but tight (not recommended unless quality critical)
- **33B models** leave plenty of room for larger context windows
#### Context Window Capacity
Context window size affects VRAM usage through the KV cache. The figures below are conservative estimates for an unquantized (fp16) cache; llama.cpp's quantized KV cache (e.g. `q8_0`) roughly halves the cache size:

| Context Size | KV Cache (70B Q4) | Total VRAM | Status |
|--------------|-------------------|------------|--------|
| 4K tokens | ~2GB | ~16GB | ✅ Fits (with cache quantization) |
| 8K tokens | ~4GB | ~18GB | ⚠️ Tight (requires cache quantization) |
| 16K tokens | ~8GB | ~22GB | ❌ Won't fit |
| 32K tokens | ~16GB | ~30GB | ❌ Won't fit |
| 128K tokens | ~64GB | ~78GB | ❌ Won't fit |
**Practical Limits for 70B Q4:**
- **Max context**: ~8K tokens (with a quantized KV cache)
- **Recommended context**: 4K-8K tokens
- **128K context**: Not practical (would need Q2 quantization or a smaller model)
**For 33B Q4 (DeepSeek Coder):**
- **Max context**: ~16K tokens (comfortable)
- **Recommended context**: 8K-16K tokens
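The KV-cache column can be sanity-checked from the model's published hyperparameters. A minimal sketch using Llama 3.1 70B's shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128) and an fp16 cache; grouped-query attention is why the computed values come in below the tables' conservative estimates:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache holds two tensors (K and V) per layer, each of shape
    [n_kv_heads, context_tokens, head_dim]."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_elem / 1e9

# Llama 3.1 70B: 80 layers, 8 KV heads (grouped-query attention), head dim 128
for ctx in (4096, 8192, 16384):
    print(f"{ctx // 1024}K context: ~{kv_cache_gb(80, 8, 128, ctx):.1f} GB")
```

The cache grows linearly with context length, which is why doubling the window doubles this term while the weight footprint stays fixed.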
#### Batch Size and Concurrency
| Configuration | VRAM Usage | Throughput | Recommendation |
|----------------|------------|------------|----------------|
| Single request | ~14GB | 1x | Baseline |
| 2 concurrent | ~15GB | 1.8x | ✅ Recommended |
| 3 concurrent | ~16GB | 2.5x | ⚠️ Possible but tight |
| 4 concurrent | ~17GB | 3x | ❌ Won't fit |
**Recommendation**: 2 concurrent requests maximum for 70B Q4
### GTX 1050 (4GB VRAM)
**Available VRAM**: ~3.8GB (after system overhead)
#### Model Size Capacity
| Model Size | Quantization | VRAM Usage | Status | Notes |
|------------|--------------|------------|--------|-------|
| 3.8B | Q4 | ~2.5GB | ✅ Comfortable | Phi-3 Mini |
| 3B | Q4 | ~2GB | ✅ Comfortable | Llama 3.2 3B |
| 2.7B | Q4 | ~1.8GB | ✅ Comfortable | Phi-2 |
| 2B | Q4 | ~1.5GB | ✅ Comfortable | Gemma 2B |
| 1.5B | Q4 | ~1.2GB | ✅ Plenty of room | Qwen2.5 1.5B |
| 1.1B | Q4 | ~0.8GB | ✅ Plenty of room | TinyLlama |
| 7B | Q4 | ~4.5GB | ❌ Won't fit | Too large |
| 8B | Q4 | ~5GB | ❌ Won't fit | Too large |
**Recommendation**:
- **3.8B Q4** (Phi-3 Mini) - Best balance
- **1.5B Q4** (Qwen2.5) - If more headroom needed
- **1.1B Q4** (TinyLlama) - Maximum headroom
#### Context Window Capacity
| Context Size | KV Cache (3.8B Q4) | Total VRAM | Status |
|--------------|-------------------|-----------|--------|
| 2K tokens | ~0.3GB | ~2.8GB | ✅ Fits easily |
| 4K tokens | ~0.6GB | ~3.1GB | ✅ Comfortable |
| 8K tokens | ~1.2GB | ~3.7GB | ✅ Fits |
| 16K tokens | ~2.4GB | ~4.9GB | ❌ Won't fit |
| 32K tokens | ~4.8GB | ~7.3GB | ❌ Won't fit |
| 128K tokens | ~19GB | ~21.5GB | ❌ Won't fit |
**Practical Limits for 3.8B Q4:**
- **Max context**: ~8K tokens (comfortable)
- **Recommended context**: 4K-8K tokens
- **128K context**: Not practical (model supports it but VRAM doesn't)
**For 1.5B Q4 (Qwen2.5):**
- **Max context**: ~16K tokens (comfortable)
- **Recommended context**: 8K-16K tokens
#### Batch Size and Concurrency
| Configuration | VRAM Usage | Throughput | Recommendation |
|----------------|------------|------------|----------------|
| Single request | ~2.5GB | 1x | Baseline |
| 2 concurrent | ~3.5GB | 1.8x | ✅ Recommended |
| 3 concurrent | ~4.2GB | 2.5x | ❌ Won't fit |
**Recommendation**: 1-2 concurrent requests for 3.8B Q4
## Memory Requirements Summary
### RTX 4080 (Work Agent)
**Recommended Configuration:**
- **Model**: Llama 3.1 70B Q4
- **VRAM Usage**: ~14GB
- **Context Window**: 4K-8K tokens
- **Concurrency**: 2 requests max
- **Headroom**: ~1.5GB for system/KV cache
**Alternative Configuration:**
- **Model**: DeepSeek Coder 33B Q4
- **VRAM Usage**: ~8GB
- **Context Window**: 8K-16K tokens
- **Concurrency**: 3-4 requests possible
- **Headroom**: ~7.5GB for system/KV cache
### GTX 1050 (Family Agent)
**Recommended Configuration:**
- **Model**: Phi-3 Mini 3.8B Q4
- **VRAM Usage**: ~2.5GB
- **Context Window**: 4K-8K tokens
- **Concurrency**: 1-2 requests
- **Headroom**: ~1.3GB for system/KV cache
**Alternative Configuration:**
- **Model**: Qwen2.5 1.5B Q4
- **VRAM Usage**: ~1.2GB
- **Context Window**: 8K-16K tokens
- **Concurrency**: 2-3 requests possible
- **Headroom**: ~2.6GB for system/KV cache
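The recommended and alternative configurations above can be captured in one place so both agents deploy from the same source of truth. A sketch; the keys and model identifiers are illustrative, not a fixed schema:

```python
# Deployment parameters taken from the summary above; model IDs are illustrative.
AGENTS = {
    "work": {
        "gpu": "RTX 4080 (16GB)",
        "model": "llama-3.1-70b-q4",    # alternative: deepseek-coder-33b-q4
        "context_tokens": 8192,
        "max_concurrent": 2,
    },
    "family": {
        "gpu": "GTX 1050 (4GB)",
        "model": "phi-3-mini-3.8b-q4",  # alternative: qwen2.5-1.5b-q4
        "context_tokens": 8192,
        "max_concurrent": 2,
    },
}

for name, cfg in AGENTS.items():
    print(f"{name}: {cfg['model']} @ {cfg['context_tokens']} tokens")
```

Keeping both agents in one structure makes it harder for context or concurrency limits to drift apart when the model selection changes.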
## Context Window Trade-offs
### Large Context Windows (128K+)
**Pros:**
- Can handle very long conversations
- More context for complex tasks
- Less need for summarization
**Cons:**
- **Not practical on 4080/1050** - would require one of:
  - Q2 quantization (significant quality loss)
  - A much smaller model (capability loss)
  - External memory (complexity)
**Recommendation**: Use 4K-8K context with summarization strategy
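The summarization strategy can be sketched as a compaction pass over the conversation history: once the estimated token count exceeds the context budget, the oldest turns are folded into a single summary produced by the model itself. The `summarize` callable and the tokens-per-word ratio below are assumptions, not measured values:

```python
CONTEXT_BUDGET = 8192  # tokens, per the practical limit above
KEEP_RECENT = 4        # newest turns are always kept verbatim

def estimate_tokens(text: str) -> int:
    # crude heuristic: roughly 0.75 English words per token
    return int(len(text.split()) / 0.75)

def compact_history(turns: list[str], summarize) -> list[str]:
    """Fold the oldest turns into one summary once the budget is exceeded."""
    if sum(estimate_tokens(t) for t in turns) <= CONTEXT_BUDGET or len(turns) <= KEEP_RECENT:
        return turns
    old, recent = turns[:-KEEP_RECENT], turns[-KEEP_RECENT:]
    return ["Summary: " + summarize(" ".join(old))] + recent

# Stand-in summarizer; in practice this would be a call to the local model.
history = ["word " * 1000] * 12
compacted = compact_history(history, summarize=lambda text: "earlier discussion, condensed")
print(len(history), "->", len(compacted))
```

A proper tokenizer count should replace the word-based estimate once the model is chosen, but the control flow stays the same.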
### Practical Context Windows
**4K tokens** (~3,000 words):
- ✅ Fits comfortably on both GPUs
- ✅ Good for most conversations
- ✅ Fast inference
- ⚠️ May need summarization for long chats
**8K tokens** (~6,000 words):
- ✅ Fits on both GPUs
- ✅ Better for longer conversations
- ✅ Still fast inference
- ✅ Good balance
**16K tokens** (~12,000 words):
- ✅ Fits on 1050 with smaller models (1.5B)
- ⚠️ Tight on 4080 with 70B (not recommended)
- ✅ Fits on 4080 with 33B models
## System Memory (RAM) Requirements
### RTX 4080 System
- **Minimum**: 16GB RAM
- **Recommended**: 32GB RAM
- **For**: Model loading, system processes, KV cache overflow
### GTX 1050 System
- **Minimum**: 8GB RAM
- **Recommended**: 16GB RAM
- **For**: Model loading, system processes, KV cache overflow
## Storage Requirements
### Model Files
| Model | Size (Q4) | Download Time | Storage |
|-------|-----------|--------------|---------|
| Llama 3.1 70B Q4 | ~40GB | ~2-4 hours | SSD recommended |
| DeepSeek Coder 33B Q4 | ~20GB | ~1-2 hours | SSD recommended |
| Phi-3 Mini 3.8B Q4 | ~2.5GB | ~5-10 minutes | Any storage |
| Qwen2.5 1.5B Q4 | ~1GB | ~2-5 minutes | Any storage |
**Total Storage Needed**: ~60-80GB for all models + backups
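Before pulling the large model files it is worth confirming the target drive actually has that headroom; a small sketch using the standard library (the 80GB figure is the upper end of the estimate above):

```python
import shutil

def has_room(path: str, required_gb: float) -> bool:
    """True if the filesystem holding `path` has at least `required_gb` free."""
    return shutil.disk_usage(path).free / 1e9 >= required_gb

# Check the model directory's drive before starting a multi-hour download.
print(has_room("/", 80))
```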
## Performance Impact of Context Size
### Latency vs Context Size
**RTX 4080 (70B Q4):**
- 4K context: ~200ms first token, ~3s for 100 tokens
- 8K context: ~250ms first token, ~4s for 100 tokens
- 16K context: ~400ms first token, ~6s for 100 tokens (if fits)
**GTX 1050 (3.8B Q4):**
- 4K context: ~50ms first token, ~1s for 100 tokens
- 8K context: ~70ms first token, ~1.2s for 100 tokens
- 16K context: ~100ms first token, ~1.5s for 100 tokens (if fits)
**Recommendation**: Keep context at 4K-8K for optimal latency
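First-token latency can be measured generically by timing the gap before the first chunk of a streamed response arrives. A sketch with a stand-in generator; in practice the stream would be the local inference server's streaming response:

```python
import time
from typing import Iterable, List, Optional, Tuple

def time_to_first_token(stream: Iterable[str]) -> Tuple[Optional[float], List[str]]:
    """Consume a token stream; return (seconds until the first token, all tokens)."""
    start = time.perf_counter()
    first_at, tokens = None, []
    for tok in stream:
        if first_at is None:
            first_at = time.perf_counter() - start  # prompt-processing latency
        tokens.append(tok)
    return first_at, tokens

def fake_stream():
    time.sleep(0.05)  # simulated prompt processing before the first token
    yield "Hello"
    yield " world"

latency, toks = time_to_first_token(fake_stream())
print(f"first token after ~{latency * 1000:.0f} ms, {len(toks)} tokens total")
```

Running this against the real server at 4K, 8K, and 16K contexts would validate or correct the estimates in the tables above.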
## Recommendations
### For RTX 4080 (Work Agent)
1. **Use Q4 quantization** - Best balance of quality and VRAM
2. **Context window**: 4K-8K tokens (practical limit)
3. **Model**: Llama 3.1 70B Q4 (primary) or DeepSeek Coder 33B Q4 (alternative)
4. **Concurrency**: 2 requests maximum
5. **Summarization**: Implement for conversations >8K tokens
### For GTX 1050 (Family Agent)
1. **Use Q4 quantization** - Only option that fits
2. **Context window**: 4K-8K tokens (practical limit)
3. **Model**: Phi-3 Mini 3.8B Q4 (primary) or Qwen2.5 1.5B Q4 (alternative)
4. **Concurrency**: 1-2 requests maximum
5. **Summarization**: Implement for conversations >8K tokens
## Next Steps
1. ✅ Complete capacity assessment (TICKET-018)
2. Finalize model selection based on this assessment (TICKET-019, TICKET-020)
3. Test selected models on actual hardware
4. Benchmark actual VRAM usage
5. Adjust context windows based on real-world performance
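Step 4, benchmarking actual VRAM usage, can start with `nvidia-smi`'s CSV query mode. A sketch; the parsing step is separated out so it can be exercised without a GPU present:

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_vram(csv_line: str) -> tuple:
    """Parse one 'used, total' line (values in MiB) from the CSV output."""
    used, total = (int(field) for field in csv_line.split(","))
    return used, total

def read_vram() -> tuple:
    """Query the first GPU's memory usage via nvidia-smi."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_vram(out.splitlines()[0])

print(parse_vram("14012, 16384"))  # sample line in nvidia-smi's output format
```

Sampling this while a model serves requests at different context sizes gives the real numbers to compare against the estimates in this document.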
## References
- [VRAM Calculator](https://huggingface.co/spaces/awf/VRAM-calculator)
- [Model Quantization Guide](https://github.com/ggerganov/llama.cpp)
- [Context Window Scaling](https://arxiv.org/abs/2305.13245)
---
**Last Updated**: 2024-01-XX
**Status**: Assessment Complete - Ready for Model Selection (TICKET-019, TICKET-020)