# LLM Capacity Assessment

## Overview

This document assesses VRAM capacity, context window limits, and memory requirements for running LLMs on RTX 4080 (16GB) and RTX 1050 (4GB) hardware.

## VRAM Capacity Analysis

### RTX 4080 (16GB VRAM)

**Available VRAM**: ~15.5GB (after system overhead)

#### Model Size Capacity

| Model Size | Quantization | VRAM Usage | Status | Notes |
|------------|--------------|------------|--------|-------|
| 70B | Q4 | ~14GB | ✅ Comfortable | Recommended |
| 70B | Q5 | ~16GB | ⚠️ Tight | Possible but no headroom |
| 70B | Q6 | ~18GB | ❌ Won't fit | Too large |
| 72B | Q4 | ~14.5GB | ✅ Comfortable | Qwen 2.5 72B |
| 67B | Q4 | ~13.5GB | ✅ Comfortable | DeepSeek 67B |
| 33B | Q4 | ~8GB | ✅ Plenty of room | DeepSeek Coder |
| 8B | Q4 | ~5GB | ✅ Plenty of room | Too small for work agent |

**Recommendations**:

- **Q4 quantization** for 70B models (comfortable margin)
- **Q5 possible** but tight (not recommended unless quality is critical)
- **33B models** leave plenty of room for larger context windows

#### Context Window Capacity

Context window size affects VRAM usage through the KV cache:

| Context Size | KV Cache (70B Q4) | Total VRAM | Status |
|--------------|-------------------|------------|--------|
| 4K tokens | ~2GB | ~16GB | ✅ Fits |
| 8K tokens | ~4GB | ~18GB | ⚠️ Tight |
| 16K tokens | ~8GB | ~22GB | ❌ Won't fit |
| 32K tokens | ~16GB | ~30GB | ❌ Won't fit |
| 128K tokens | ~64GB | ~78GB | ❌ Won't fit |

**Practical limits for 70B Q4:**

- **Max context**: ~8K tokens (comfortable)
- **Recommended context**: 4K-8K tokens
- **128K context**: not practical (would need Q2 or a smaller model)

**For 33B Q4 (DeepSeek Coder):**

- **Max context**: ~16K tokens (comfortable)
- **Recommended context**: 8K-16K tokens

#### Batch Size and Concurrency

| Configuration | VRAM Usage | Throughput | Recommendation |
|---------------|------------|------------|----------------|
| Single request | ~14GB | 1x | Baseline |
| 2 concurrent | ~15GB | 1.8x | ✅ Recommended |
| 3 concurrent | ~16GB | 2.5x | ⚠️ Possible but tight |
| 4 concurrent | ~17GB | 3x | ❌ Won't fit |

**Recommendation**: 2 concurrent requests maximum for 70B Q4

### RTX 1050 (4GB VRAM)

**Available VRAM**: ~3.8GB (after system overhead)

#### Model Size Capacity

| Model Size | Quantization | VRAM Usage | Status | Notes |
|------------|--------------|------------|--------|-------|
| 3.8B | Q4 | ~2.5GB | ✅ Comfortable | Phi-3 Mini |
| 3B | Q4 | ~2GB | ✅ Comfortable | Llama 3.2 3B |
| 2.7B | Q4 | ~1.8GB | ✅ Comfortable | Phi-2 |
| 2B | Q4 | ~1.5GB | ✅ Comfortable | Gemma 2B |
| 1.5B | Q4 | ~1.2GB | ✅ Plenty of room | Qwen2.5 1.5B |
| 1.1B | Q4 | ~0.8GB | ✅ Plenty of room | TinyLlama |
| 7B | Q4 | ~4.5GB | ❌ Won't fit | Too large |
| 8B | Q4 | ~5GB | ❌ Won't fit | Too large |

**Recommendations**:

- **3.8B Q4** (Phi-3 Mini) - best balance
- **1.5B Q4** (Qwen2.5) - if more headroom is needed
- **1.1B Q4** (TinyLlama) - maximum headroom

#### Context Window Capacity

| Context Size | KV Cache (3.8B Q4) | Total VRAM | Status |
|--------------|--------------------|------------|--------|
| 2K tokens | ~0.3GB | ~2.8GB | ✅ Fits easily |
| 4K tokens | ~0.6GB | ~3.1GB | ✅ Comfortable |
| 8K tokens | ~1.2GB | ~3.7GB | ✅ Fits |
| 16K tokens | ~2.4GB | ~4.9GB | ⚠️ Tight |
| 32K tokens | ~4.8GB | ~7.3GB | ❌ Won't fit |
| 128K tokens | ~19GB | ~21.5GB | ❌ Won't fit |

**Practical limits for 3.8B Q4:**

- **Max context**: ~8K tokens (comfortable)
- **Recommended context**: 4K-8K tokens
- **128K context**: not practical (the model supports it but VRAM doesn't)

**For 1.5B Q4 (Qwen2.5):**

- **Max context**: ~16K tokens (comfortable)
- **Recommended context**: 8K-16K tokens

#### Batch Size and Concurrency

| Configuration | VRAM Usage | Throughput | Recommendation |
|---------------|------------|------------|----------------|
| Single request | ~2.5GB | 1x | Baseline |
| 2 concurrent | ~3.5GB | 1.8x | ✅ Recommended |
| 3 concurrent | ~4.2GB | 2.5x | ⚠️ Possible but tight |
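The KV-cache rows in the context tables above scale linearly with context length, so they can be ballparked from a model's architecture. A minimal sketch, assuming an fp16 cache and Llama 3.1 70B's published configuration (80 layers, 8 grouped-query KV heads, head dimension 128); real runtimes add allocator and activation overhead, so treat the results as rough lower bounds:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache size in GB: 2 (K and V) x layers x KV heads x head dim x tokens x bytes."""
    return 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_elem / 1e9

# Llama 3.1 70B: 80 layers, 8 KV heads (grouped-query attention), head_dim 128
for ctx in (4096, 8192, 16384):
    print(f"{ctx:>6} tokens -> ~{kv_cache_gb(80, 8, 128, ctx):.1f} GB KV cache")
```

Because the formula is linear in `context_tokens`, doubling the context doubles the cache, which is why the 32K and 128K rows above blow past both cards' VRAM budgets.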
**Recommendation**: 1-2 concurrent requests for 3.8B Q4

## Memory Requirements Summary

### RTX 4080 (Work Agent)

**Recommended Configuration:**

- **Model**: Llama 3.1 70B Q4
- **VRAM Usage**: ~14GB
- **Context Window**: 4K-8K tokens
- **Concurrency**: 2 requests max
- **Headroom**: ~1.5GB for system/KV cache

**Alternative Configuration:**

- **Model**: DeepSeek Coder 33B Q4
- **VRAM Usage**: ~8GB
- **Context Window**: 8K-16K tokens
- **Concurrency**: 3-4 requests possible
- **Headroom**: ~7.5GB for system/KV cache

### RTX 1050 (Family Agent)

**Recommended Configuration:**

- **Model**: Phi-3 Mini 3.8B Q4
- **VRAM Usage**: ~2.5GB
- **Context Window**: 4K-8K tokens
- **Concurrency**: 1-2 requests
- **Headroom**: ~1.3GB for system/KV cache

**Alternative Configuration:**

- **Model**: Qwen2.5 1.5B Q4
- **VRAM Usage**: ~1.2GB
- **Context Window**: 8K-16K tokens
- **Concurrency**: 2-3 requests possible
- **Headroom**: ~2.6GB for system/KV cache

## Context Window Trade-offs

### Large Context Windows (128K+)

**Pros:**

- Can handle very long conversations
- More context for complex tasks
- Less need for summarization

**Cons:**

- **Not practical on 4080/1050** - would require:
  - Q2 quantization (significant quality loss)
  - Or much smaller models (capability loss)
  - Or external memory (complexity)

**Recommendation**: Use 4K-8K context with a summarization strategy

### Practical Context Windows

**4K tokens** (~3,000 words):

- ✅ Fits comfortably on both GPUs
- ✅ Good for most conversations
- ✅ Fast inference
- ⚠️ May need summarization for long chats

**8K tokens** (~6,000 words):

- ✅ Fits on both GPUs
- ✅ Better for longer conversations
- ✅ Still fast inference
- ✅ Good balance

**16K tokens** (~12,000 words):

- ✅ Fits on 1050 with smaller models (1.5B)
- ⚠️ Tight on 4080 with 70B (not recommended)
- ✅ Fits on 4080 with 33B models

## System Memory (RAM) Requirements

### RTX 4080 System

- **Minimum**: 16GB RAM
- **Recommended**: 32GB RAM
- **For**: Model loading, system processes, KV cache overflow

### RTX 1050 System

- **Minimum**: 8GB RAM
- **Recommended**: 16GB RAM
- **For**: Model loading, system processes, KV cache overflow

## Storage Requirements

### Model Files

| Model | Size (Q4) | Download Time | Storage |
|-------|-----------|---------------|---------|
| Llama 3.1 70B Q4 | ~40GB | ~2-4 hours | SSD recommended |
| DeepSeek Coder 33B Q4 | ~20GB | ~1-2 hours | SSD recommended |
| Phi-3 Mini 3.8B Q4 | ~2.5GB | ~5-10 minutes | Any storage |
| Qwen2.5 1.5B Q4 | ~1GB | ~2-5 minutes | Any storage |

**Total Storage Needed**: ~60-80GB for all models plus backups

## Performance Impact of Context Size

### Latency vs Context Size

**RTX 4080 (70B Q4):**

- 4K context: ~200ms first token, ~3s for 100 tokens
- 8K context: ~250ms first token, ~4s for 100 tokens
- 16K context: ~400ms first token, ~6s for 100 tokens (if it fits)

**RTX 1050 (3.8B Q4):**

- 4K context: ~50ms first token, ~1s for 100 tokens
- 8K context: ~70ms first token, ~1.2s for 100 tokens
- 16K context: ~100ms first token, ~1.5s for 100 tokens (if it fits)

**Recommendation**: Keep context at 4K-8K for optimal latency

## Recommendations

### For RTX 4080 (Work Agent)

1. **Use Q4 quantization** - best balance of quality and VRAM
2. **Context window**: 4K-8K tokens (practical limit)
3. **Model**: Llama 3.1 70B Q4 (primary) or DeepSeek Coder 33B Q4 (alternative)
4. **Concurrency**: 2 requests maximum
5. **Summarization**: Implement for conversations >8K tokens

### For RTX 1050 (Family Agent)

1. **Use Q4 quantization** - only option that fits
2. **Context window**: 4K-8K tokens (practical limit)
3. **Model**: Phi-3 Mini 3.8B Q4 (primary) or Qwen2.5 1.5B Q4 (alternative)
4. **Concurrency**: 1-2 requests maximum
5. **Summarization**: Implement for conversations >8K tokens

## Next Steps

1. ✅ Complete capacity assessment (TICKET-018)
2. Finalize model selection based on this assessment (TICKET-019, TICKET-020)
3. Test selected models on actual hardware
4. Benchmark actual VRAM usage
5. Adjust context windows based on real-world performance

## References

- [VRAM Calculator](https://huggingface.co/spaces/awf/VRAM-calculator)
- [Model Quantization Guide](https://github.com/ggerganov/llama.cpp)
- [Context Window Scaling](https://arxiv.org/abs/2305.13245)

---

**Last Updated**: 2024-01-XX
**Status**: Assessment Complete - Ready for Model Selection (TICKET-019, TICKET-020)
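For step 4 of the next steps (benchmarking actual VRAM usage), headroom is better measured than estimated. A minimal sketch that reads per-GPU memory from `nvidia-smi`'s standard CSV query interface (assumes the NVIDIA driver and `nvidia-smi` are on the PATH; `gpu_memory_mib` and `parse_row` are illustrative names, not an existing API):

```python
import subprocess

def parse_row(line: str) -> tuple[int, int]:
    """Parse one nvidia-smi CSV row like '14230, 16384' into (used, total) MiB."""
    used, total = (int(field) for field in line.split(","))
    return used, total

def gpu_memory_mib() -> list[tuple[int, int]]:
    """Return (used, total) MiB for each GPU via nvidia-smi's CSV query output."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [parse_row(line) for line in out.strip().splitlines()]

if __name__ == "__main__":
    for i, (used, total) in enumerate(gpu_memory_mib()):
        print(f"GPU {i}: {used} / {total} MiB used ({total - used} MiB headroom)")
```

Running this once after model load and again under load at the target context size gives the real headroom figures to compare against the capacity tables above.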