# LLM Capacity Assessment
## Overview

This document assesses VRAM capacity, context-window limits, and memory requirements for running LLMs on RTX 4080 (16GB) and GTX 1050 (4GB) hardware.

## VRAM Capacity Analysis
### RTX 4080 (16GB VRAM)

**Available VRAM**: ~15.5GB (after system overhead)

#### Model Size Capacity

| Model Size | Quantization | VRAM Usage | Status | Notes |
|------------|--------------|------------|--------|-------|
| 70B | Q4 | ~14GB | ✅ Comfortable | Recommended |
| 70B | Q5 | ~16GB | ⚠️ Tight | Possible but no headroom |
| 70B | Q6 | ~18GB | ❌ Won't fit | Too large |
| 72B | Q4 | ~14.5GB | ✅ Comfortable | Qwen 2.5 72B |
| 67B | Q4 | ~13.5GB | ✅ Comfortable | DeepSeek LLM 67B |
| 33B | Q4 | ~8GB | ✅ Plenty of room | DeepSeek Coder |
| 8B | Q4 | ~5GB | ✅ Plenty of room | Too small for work agent |

Note: a 70B Q4 weight file is ~40GB (see Storage Requirements below), so the VRAM figures for 70B-class models assume only part of the model is GPU-resident, with the remainder offloaded to system RAM.

**Recommendation**:

- **Q4 quantization** for 70B models (comfortable margin)
- **Q5 possible** but tight (not recommended unless quality is critical)
- **33B models** leave plenty of room for larger context windows

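
As a sanity check on the table above, weight memory can be estimated from parameter count and bits per weight. This is a rule-of-thumb sketch: the ~1.1 runtime-overhead factor is an assumption, and real GGUF quants (e.g. Q4_K_M) mix bit-widths. It also makes the constraint explicit: a 70B model at ~4.5 bits/weight is ~40GB of weights (consistent with the file sizes under Storage Requirements), so a 16GB card can hold only part of it.

```python
# Rough weight-memory estimate: params * bits-per-weight, plus a fudge
# factor for runtime buffers. Effective bits/weight for GGUF quants are
# approximations (real quants like Q4_K_M mix bit-widths).
BITS_PER_WEIGHT = {"Q4": 4.5, "Q5": 5.5, "Q6": 6.5, "Q8": 8.5, "FP16": 16}

def weight_vram_gb(params_billion: float, quant: str, overhead: float = 1.1) -> float:
    bytes_total = params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8
    return round(bytes_total * overhead / 1e9, 1)

for size, quant in [(70, "Q4"), (70, "Q5"), (33, "Q4"), (3.8, "Q4")]:
    print(f"{size}B {quant}: ~{weight_vram_gb(size, quant)} GB")  # 70B Q4 → ~43.3 GB
```

The 3.8B and 33B estimates (~2.4GB and ~20.4GB) line up with the figures quoted elsewhere in this document; the 70B estimate shows why partial offload is unavoidable on a 16GB card.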
#### Context Window Capacity

Context window size affects VRAM usage through the KV cache:

| Context Size | KV Cache (70B Q4) | Total VRAM | Status |
|--------------|-------------------|------------|--------|
| 4K tokens | ~2GB | ~16GB | ✅ Fits |
| 8K tokens | ~4GB | ~18GB | ⚠️ Tight |
| 16K tokens | ~8GB | ~22GB | ❌ Won't fit |
| 32K tokens | ~16GB | ~30GB | ❌ Won't fit |
| 128K tokens | ~64GB | ~78GB | ❌ Won't fit |

Totals above the ~15.5GB budget imply spilling KV cache or model layers to system RAM, at a significant throughput cost.

**Practical Limits for 70B Q4:**

- **Max context**: ~8K tokens
- **Recommended context**: 4K-8K tokens
- **128K context**: not practical (would require Q2 quantization or a much smaller model)

**For 33B Q4 (DeepSeek Coder):**

- **Max context**: ~16K tokens (comfortable)
- **Recommended context**: 8K-16K tokens

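
The KV-cache column can be cross-checked against the standard formula — cache size grows linearly with context length. The model shape below (80 layers, 8 KV heads via GQA, head dim 128, FP16 cache) is an assumption matching Llama-3.1-70B-class models; it yields lower figures than the table, which appears to budget extra for activations and scratch buffers.

```python
# KV-cache size grows linearly with context length:
#   bytes/token = 2 (K and V) * n_layers * n_kv_heads * head_dim * bytes/elem
# Shape defaults are illustrative assumptions (Llama-3.1-70B-like GQA).
def kv_cache_gb(tokens: int, n_layers: int = 80, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return round(tokens * per_token / 1e9, 2)

for ctx in (4096, 8192, 16384):
    print(f"{ctx} tokens: ~{kv_cache_gb(ctx)} GB FP16 KV cache")
```

Halving `bytes_per_elem` models a Q8-quantized KV cache, a common lever for squeezing more context into a fixed budget.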
#### Batch Size and Concurrency

| Configuration | VRAM Usage | Throughput | Recommendation |
|---------------|------------|------------|----------------|
| Single request | ~14GB | 1x | Baseline |
| 2 concurrent | ~15GB | 1.8x | ✅ Recommended |
| 3 concurrent | ~16GB | 2.5x | ⚠️ Possible but tight |
| 4 concurrent | ~17GB | 3x | ❌ Won't fit |

**Recommendation**: 2 concurrent requests maximum for 70B Q4.

### GTX 1050 (4GB VRAM)

**Available VRAM**: ~3.8GB (after system overhead)

#### Model Size Capacity

| Model Size | Quantization | VRAM Usage | Status | Notes |
|------------|--------------|------------|--------|-------|
| 3.8B | Q4 | ~2.5GB | ✅ Comfortable | Phi-3 Mini |
| 3B | Q4 | ~2GB | ✅ Comfortable | Llama 3.2 3B |
| 2.7B | Q4 | ~1.8GB | ✅ Comfortable | Phi-2 |
| 2B | Q4 | ~1.5GB | ✅ Comfortable | Gemma 2B |
| 1.5B | Q4 | ~1.2GB | ✅ Plenty of room | Qwen2.5 1.5B |
| 1.1B | Q4 | ~0.8GB | ✅ Plenty of room | TinyLlama |
| 7B | Q4 | ~4.5GB | ❌ Won't fit | Too large |
| 8B | Q4 | ~5GB | ❌ Won't fit | Too large |

**Recommendation**:

- **3.8B Q4** (Phi-3 Mini) - best balance of capability and headroom
- **1.5B Q4** (Qwen2.5) - if more headroom is needed
- **1.1B Q4** (TinyLlama) - maximum headroom

#### Context Window Capacity

| Context Size | KV Cache (3.8B Q4) | Total VRAM | Status |
|--------------|--------------------|------------|--------|
| 2K tokens | ~0.3GB | ~2.8GB | ✅ Fits easily |
| 4K tokens | ~0.6GB | ~3.1GB | ✅ Comfortable |
| 8K tokens | ~1.2GB | ~3.7GB | ✅ Fits |
| 16K tokens | ~2.4GB | ~4.9GB | ❌ Won't fit |
| 32K tokens | ~4.8GB | ~7.3GB | ❌ Won't fit |
| 128K tokens | ~19GB | ~21.5GB | ❌ Won't fit |

**Practical Limits for 3.8B Q4:**

- **Max context**: ~8K tokens
- **Recommended context**: 4K-8K tokens
- **128K context**: not practical (the model supports it, but VRAM doesn't)

**For 1.5B Q4 (Qwen2.5):**

- **Max context**: ~16K tokens (comfortable)
- **Recommended context**: 8K-16K tokens

#### Batch Size and Concurrency

| Configuration | VRAM Usage | Throughput | Recommendation |
|---------------|------------|------------|----------------|
| Single request | ~2.5GB | 1x | Baseline |
| 2 concurrent | ~3.5GB | 1.8x | ✅ Recommended |
| 3 concurrent | ~4.2GB | 2.5x | ❌ Won't fit |

**Recommendation**: 1-2 concurrent requests for 3.8B Q4.

## Memory Requirements Summary

### RTX 4080 (Work Agent)

**Recommended Configuration:**

- **Model**: Llama 3.1 70B Q4
- **VRAM Usage**: ~14GB
- **Context Window**: 4K-8K tokens
- **Concurrency**: 2 requests max
- **Headroom**: ~1.5GB for system/KV cache

**Alternative Configuration:**

- **Model**: DeepSeek Coder 33B Q4
- **VRAM Usage**: ~8GB
- **Context Window**: 8K-16K tokens
- **Concurrency**: 3-4 requests possible
- **Headroom**: ~7.5GB for system/KV cache

### GTX 1050 (Family Agent)

**Recommended Configuration:**

- **Model**: Phi-3 Mini 3.8B Q4
- **VRAM Usage**: ~2.5GB
- **Context Window**: 4K-8K tokens
- **Concurrency**: 1-2 requests
- **Headroom**: ~1.3GB for system/KV cache

**Alternative Configuration:**

- **Model**: Qwen2.5 1.5B Q4
- **VRAM Usage**: ~1.2GB
- **Context Window**: 8K-16K tokens
- **Concurrency**: 2-3 requests possible
- **Headroom**: ~2.6GB for system/KV cache

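
The two recommended configurations can be pinned in a small config structure so the serving code reads its limits from one place. A sketch — the key names and the `limits` helper are illustrative, not an existing schema:

```python
# Deployment parameters from the summaries above; the dict layout is
# illustrative, not an existing config schema.
AGENT_CONFIG = {
    "work": {  # 16GB GPU
        "model": "llama-3.1-70b-q4",
        "context_tokens": 8192,
        "max_concurrent": 2,
    },
    "family": {  # 4GB GPU
        "model": "phi-3-mini-3.8b-q4",
        "context_tokens": 8192,
        "max_concurrent": 2,
    },
}

def limits(agent: str) -> tuple[int, int]:
    """Return (context budget, concurrency cap) for an agent."""
    cfg = AGENT_CONFIG[agent]
    return cfg["context_tokens"], cfg["max_concurrent"]
```

Keeping these numbers in one structure makes it easy to adjust them after the real-hardware benchmarks planned in Next Steps.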
## Context Window Trade-offs

### Large Context Windows (128K+)

**Pros:**

- Can handle very long conversations
- More context for complex tasks
- Less need for summarization

**Cons:**

- **Not practical on the 4080/1050** - would require:
  - Q2 quantization (significant quality loss)
  - Or much smaller models (capability loss)
  - Or external memory (complexity)

**Recommendation**: Use 4K-8K context with a summarization strategy.

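
A minimal sketch of that summarization strategy: estimate tokens cheaply, evict the oldest turns once the budget is exceeded, and replace them with a summary. The 4-characters-per-token heuristic and the `summarize` stub are assumptions — in practice the summary would be produced by the model itself:

```python
def rough_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def summarize(turns: list[str]) -> str:
    # Stub: a real implementation would ask the LLM for a short summary.
    return f"[summary of {len(turns)} earlier turns]"

def fit_to_budget(history: list[str], budget_tokens: int = 8192) -> list[str]:
    """Fold the oldest turns into a summary until the history fits the budget."""
    dropped: list[str] = []
    while history and sum(rough_tokens(t) for t in history) > budget_tokens:
        dropped.append(history.pop(0))  # evict oldest turn first
    if dropped:
        history.insert(0, summarize(dropped))
    return history
```

Run before each request, this keeps the prompt inside the 4K-8K window while preserving a compressed memory of earlier turns.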
### Practical Context Windows

**4K tokens** (~3,000 words):

- ✅ Fits comfortably on both GPUs
- ✅ Good for most conversations
- ✅ Fast inference
- ⚠️ May need summarization for long chats

**8K tokens** (~6,000 words):

- ✅ Fits on both GPUs
- ✅ Better for longer conversations
- ✅ Still fast inference
- ✅ Good balance

**16K tokens** (~12,000 words):

- ✅ Fits on the 1050 with smaller models (1.5B)
- ⚠️ Tight on the 4080 with 70B (not recommended)
- ✅ Fits on the 4080 with 33B models

## System Memory (RAM) Requirements

### RTX 4080 System

- **Minimum**: 16GB RAM
- **Recommended**: 32GB RAM
- **For**: model loading, system processes, KV-cache overflow

### GTX 1050 System

- **Minimum**: 8GB RAM
- **Recommended**: 16GB RAM
- **For**: model loading, system processes, KV-cache overflow

## Storage Requirements

### Model Files

| Model | Size (Q4) | Download Time | Storage |
|-------|-----------|---------------|---------|
| Llama 3.1 70B Q4 | ~40GB | ~2-4 hours | SSD recommended |
| DeepSeek Coder 33B Q4 | ~20GB | ~1-2 hours | SSD recommended |
| Phi-3 Mini 3.8B Q4 | ~2.5GB | ~5-10 minutes | Any storage |
| Qwen2.5 1.5B Q4 | ~1GB | ~2-5 minutes | Any storage |

**Total Storage Needed**: ~60-80GB for all models plus backups

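
The download-time column is just file size over line rate; as a cross-check, ~40GB at an effective 25-50 Mbit/s works out to roughly 2-4 hours, matching the table:

```python
def download_hours(size_gb: float, rate_mbps: float) -> float:
    """Hours to download size_gb gigabytes at rate_mbps megabits per second."""
    seconds = size_gb * 8e9 / (rate_mbps * 1e6)
    return round(seconds / 3600, 1)

print(download_hours(40, 50))  # 70B Q4 at 50 Mbit/s → 1.8 hours
print(download_hours(40, 25))  # 70B Q4 at 25 Mbit/s → 3.6 hours
```

Faster links shorten this proportionally; actual effective throughput from model hosts will vary.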
## Performance Impact of Context Size

### Latency vs Context Size

**RTX 4080 (70B Q4):**

- 4K context: ~200ms first token, ~3s for 100 tokens
- 8K context: ~250ms first token, ~4s for 100 tokens
- 16K context: ~400ms first token, ~6s for 100 tokens (if it fits)

**GTX 1050 (3.8B Q4):**

- 4K context: ~50ms first token, ~1s for 100 tokens
- 8K context: ~70ms first token, ~1.2s for 100 tokens
- 16K context: ~100ms first token, ~1.5s for 100 tokens (if it fits)

**Recommendation**: Keep context at 4K-8K for optimal latency.

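
For planning, the latency figures above convert directly into total response time (time to first token plus decode time at the quoted rate):

```python
def response_time_s(first_token_ms: float, tokens: int, per_100_tokens_s: float) -> float:
    """Estimated total latency: time to first token plus decode time."""
    tokens_per_s = 100 / per_100_tokens_s  # e.g. 4s/100 tokens -> 25 tok/s
    return round(first_token_ms / 1000 + tokens / tokens_per_s, 2)

# RTX 4080 at 8K context: ~250ms TTFT, ~4s per 100 tokens (~25 tok/s),
# so a 300-token reply takes roughly 12 seconds.
print(response_time_s(250, 300, 4.0))  # → 12.25
```

This is why the shorter 4K-8K windows are recommended: context size affects both terms of this sum.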
## Recommendations

### For RTX 4080 (Work Agent)

1. **Use Q4 quantization** - best balance of quality and VRAM
2. **Context window**: 4K-8K tokens (practical limit)
3. **Model**: Llama 3.1 70B Q4 (primary) or DeepSeek Coder 33B Q4 (alternative)
4. **Concurrency**: 2 requests maximum
5. **Summarization**: implement for conversations >8K tokens

### For GTX 1050 (Family Agent)

1. **Use Q4 quantization** - the only option that fits
2. **Context window**: 4K-8K tokens (practical limit)
3. **Model**: Phi-3 Mini 3.8B Q4 (primary) or Qwen2.5 1.5B Q4 (alternative)
4. **Concurrency**: 1-2 requests maximum
5. **Summarization**: implement for conversations >8K tokens

## Next Steps

1. ✅ Complete capacity assessment (TICKET-018)
2. Finalize model selection based on this assessment (TICKET-019, TICKET-020)
3. Test selected models on actual hardware
4. Benchmark actual VRAM usage
5. Adjust context windows based on real-world performance

---

**Last Updated**: 2024-01-XX

**Status**: Assessment Complete - Ready for Model Selection (TICKET-019, TICKET-020)