# LLM Capacity Assessment

## Overview
This document assesses VRAM capacity, context window limits, and memory requirements for running LLMs on RTX 4080 (16GB) and RTX 1050 (4GB) hardware.
## VRAM Capacity Analysis

### RTX 4080 (16GB VRAM)

**Available VRAM**: ~15.5GB (after system overhead)

#### Model Size Capacity
| Model Size | Quantization | VRAM Usage | Status | Notes |
|---|---|---|---|---|
| 70B | Q4 | ~14GB | ✅ Comfortable | Recommended |
| 70B | Q5 | ~16GB | ❌ Over budget | Exceeds the ~15.5GB available |
| 70B | Q6 | ~18GB | ❌ Won't fit | Too large |
| 72B | Q4 | ~14.5GB | ✅ Comfortable | Qwen 2.5 72B |
| 67B | Q4 | ~13.5GB | ✅ Comfortable | DeepSeek LLM 67B |
| 33B | Q4 | ~8GB | ✅ Plenty of room | DeepSeek Coder |
| 8B | Q4 | ~5GB | ✅ Plenty of room | Too small for work agent |
**Recommendation**:
- Q4 quantization for 70B models (comfortable margin)
- Q5 possible but tight (not recommended unless quality critical)
- 33B models leave plenty of room for larger context windows
#### Context Window Capacity
Context window size affects VRAM usage through KV cache:
| Context Size | KV Cache (70B Q4) | Total VRAM | Status |
|---|---|---|---|
| 4K tokens | ~2GB | ~16GB | ⚠️ At the 16GB limit |
| 8K tokens | ~4GB | ~18GB | ❌ Exceeds 16GB at FP16 |
| 16K tokens | ~8GB | ~22GB | ❌ Won't fit |
| 32K tokens | ~16GB | ~30GB | ❌ Won't fit |
| 128K tokens | ~64GB | ~78GB | ❌ Won't fit |

These are conservative FP16 estimates. Llama 3.1's grouped-query attention caches only 8 of its 64 heads, and the KV cache can itself be quantized (Q8), so the real footprint is several times smaller. That is what makes 4K-8K contexts practical on a 16GB card.
**Practical Limits for 70B Q4**:
- Max context: ~8K tokens (achievable with GQA and a quantized KV cache)
- Recommended context: 4K-8K tokens
- 128K context: not practical (would require Q2 quantization or a much smaller model)
**For 33B Q4 (DeepSeek Coder)**:
- Max context: ~16K tokens (comfortable)
- Recommended context: 8K-16K tokens
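The KV-cache budgets above can be sanity-checked with a one-line formula: per token, the cache stores one key and one value vector for each cached head in each layer. A minimal sketch using the published Llama 3.1 70B shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128); note that GQA makes the true footprint smaller than the conservative per-token figures in the tables above:

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: a key and a value (factor 2) per layer,
    per cached head, per token, at bytes_per_elem precision."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

# Llama 3.1 70B: 80 layers, 8 KV heads (GQA), head_dim 128, FP16 cache
gib = kv_cache_bytes(8192, n_layers=80, n_kv_heads=8, head_dim=128) / 2**30
print(f"8K context: ~{gib:.1f} GiB")  # 8K context: ~2.5 GiB
```

Switching `bytes_per_elem` to 1 models a Q8-quantized KV cache and halves the result, which is the main lever for stretching context on a fixed VRAM budget.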
#### Batch Size and Concurrency
| Configuration | VRAM Usage | Throughput | Recommendation |
|---|---|---|---|
| Single request | ~14GB | 1x | Baseline |
| 2 concurrent | ~15GB | 1.8x | ✅ Recommended |
| 3 concurrent | ~16GB | 2.5x | ❌ Exceeds the ~15.5GB budget |
| 4 concurrent | ~17GB | 3x | ❌ Won't fit |

**Recommendation**: 2 concurrent requests maximum for 70B Q4.
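The concurrency cap can also be enforced at the application layer rather than relying on the inference server. A minimal sketch using `asyncio`; `call_llm` is a hypothetical stand-in for whatever inference client the project ends up using:

```python
import asyncio

async def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for the real inference client."""
    await asyncio.sleep(0.01)  # simulate generation latency
    return f"ok:{prompt}"

async def run_batch(prompts: list[str], max_concurrent: int = 2) -> list[str]:
    """Answer all prompts, but let at most max_concurrent hit the GPU at once."""
    slots = asyncio.Semaphore(max_concurrent)

    async def generate(prompt: str) -> str:
        async with slots:  # excess requests queue here instead of spiking VRAM
            return await call_llm(prompt)

    return await asyncio.gather(*(generate(p) for p in prompts))

print(asyncio.run(run_batch(["q1", "q2", "q3"])))  # ['ok:q1', 'ok:q2', 'ok:q3']
```

Requests beyond the cap simply wait in the semaphore queue, trading latency for a bounded VRAM footprint.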
### RTX 1050 (4GB VRAM)

**Available VRAM**: ~3.8GB (after system overhead)

#### Model Size Capacity
| Model Size | Quantization | VRAM Usage | Status | Notes |
|---|---|---|---|---|
| 3.8B | Q4 | ~2.5GB | ✅ Comfortable | Phi-3 Mini |
| 3B | Q4 | ~2GB | ✅ Comfortable | Llama 3.2 3B |
| 2.7B | Q4 | ~1.8GB | ✅ Comfortable | Phi-2 |
| 2B | Q4 | ~1.5GB | ✅ Comfortable | Gemma 2B |
| 1.5B | Q4 | ~1.2GB | ✅ Plenty of room | Qwen2.5 1.5B |
| 1.1B | Q4 | ~0.8GB | ✅ Plenty of room | TinyLlama |
| 7B | Q4 | ~4.5GB | ❌ Won't fit | Too large |
| 8B | Q4 | ~5GB | ❌ Won't fit | Too large |
**Recommendation**:
- 3.8B Q4 (Phi-3 Mini) - Best balance
- 1.5B Q4 (Qwen2.5) - If more headroom needed
- 1.1B Q4 (TinyLlama) - Maximum headroom
#### Context Window Capacity
| Context Size | KV Cache (3.8B Q4) | Total VRAM | Status |
|---|---|---|---|
| 2K tokens | ~0.3GB | ~2.8GB | ✅ Fits easily |
| 4K tokens | ~0.6GB | ~3.1GB | ✅ Comfortable |
| 8K tokens | ~1.2GB | ~3.7GB | ⚠️ At the ~3.8GB limit |
| 16K tokens | ~2.4GB | ~4.9GB | ❌ Won't fit |
| 32K tokens | ~4.8GB | ~7.3GB | ❌ Won't fit |
| 128K tokens | ~19GB | ~21.5GB | ❌ Won't fit |
**Practical Limits for 3.8B Q4**:
- Max context: ~8K tokens (at the edge of the ~3.8GB budget)
- Recommended context: 4K-8K tokens
- 128K context: not practical (the model supports it, but VRAM does not)
**For 1.5B Q4 (Qwen2.5)**:
- Max context: ~16K tokens (comfortable)
- Recommended context: 8K-16K tokens
#### Batch Size and Concurrency
| Configuration | VRAM Usage | Throughput | Recommendation |
|---|---|---|---|
| Single request | ~2.5GB | 1x | Baseline |
| 2 concurrent | ~3.5GB | 1.8x | ✅ Recommended |
| 3 concurrent | ~4.2GB | 2.5x | ❌ Exceeds the ~3.8GB budget |

**Recommendation**: 1-2 concurrent requests for 3.8B Q4.
## Memory Requirements Summary

### RTX 4080 (Work Agent)

**Recommended Configuration**:
- Model: Llama 3.1 70B Q4
- VRAM Usage: ~14GB
- Context Window: 4K-8K tokens
- Concurrency: 2 requests max
- Headroom: ~1.5GB for system/KV cache
**Alternative Configuration**:
- Model: DeepSeek Coder 33B Q4
- VRAM Usage: ~8GB
- Context Window: 8K-16K tokens
- Concurrency: 3-4 requests possible
- Headroom: ~7.5GB for system/KV cache
### RTX 1050 (Family Agent)

**Recommended Configuration**:
- Model: Phi-3 Mini 3.8B Q4
- VRAM Usage: ~2.5GB
- Context Window: 4K-8K tokens
- Concurrency: 1-2 requests
- Headroom: ~1.3GB for system/KV cache
**Alternative Configuration**:
- Model: Qwen2.5 1.5B Q4
- VRAM Usage: ~1.2GB
- Context Window: 8K-16K tokens
- Concurrency: 2-3 requests possible
- Headroom: ~2.6GB for system/KV cache
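The headroom figures in these configurations are simply available VRAM minus the model's resident footprint; a deployment script could compute and check this before loading a model. A trivial sketch using this document's numbers:

```python
def headroom_gb(available_vram_gb: float, model_gb: float) -> float:
    """VRAM left over for KV cache, activations, and system use."""
    return available_vram_gb - model_gb

# RTX 4080 + Llama 3.1 70B Q4, and RTX 1050 + Phi-3 Mini 3.8B Q4
print(f"{headroom_gb(15.5, 14.0):.1f} GB")  # 1.5 GB
print(f"{headroom_gb(3.8, 2.5):.1f} GB")    # 1.3 GB
```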
## Context Window Trade-offs

### Large Context Windows (128K+)

**Pros**:
- Can handle very long conversations
- More context for complex tasks
- Less need for summarization
**Cons**:
- Not practical on the 4080/1050; this would require:
  - Q2 quantization (significant quality loss)
  - a much smaller model (capability loss)
  - external memory/retrieval (added complexity)
**Recommendation**: Use 4K-8K context with a summarization strategy.
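That strategy can be as simple as folding the oldest turns into a summary once the history nears the token budget. A minimal sketch, assuming a rough 4-characters-per-token heuristic; `summarize` is a hypothetical callback (in practice a call back into the LLM itself):

```python
def approx_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def fit_history(turns: list[str], budget: int, summarize) -> list[str]:
    """Pop the oldest turns until the history fits, then prepend one summary."""
    turns = list(turns)
    dropped: list[str] = []
    while turns and sum(approx_tokens(t) for t in turns) > budget:
        dropped.append(turns.pop(0))
    if dropped:
        # `summarize` is a hypothetical hook; its output also costs tokens,
        # so leave some slack in the budget for it.
        turns.insert(0, summarize(" ".join(dropped)))
    return turns

history = ["x" * 400, "y" * 400, "z" * 400]  # ~100 tokens each
trimmed = fit_history(history, budget=250, summarize=lambda _: "[summary]")
print(trimmed[0], len(trimmed))  # [summary] 3
```

A production version would count tokens with the model's real tokenizer instead of the character heuristic.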
### Practical Context Windows

**4K tokens (~3,000 words)**:
- ✅ Fits comfortably on both GPUs
- ✅ Good for most conversations
- ✅ Fast inference
- ⚠️ May need summarization for long chats
**8K tokens (~6,000 words)**:
- ✅ Fits on both GPUs
- ✅ Better for longer conversations
- ✅ Still fast inference
- ✅ Good balance
**16K tokens (~12,000 words)**:
- ✅ Fits on the 1050 with smaller models (1.5B)
- ❌ Does not fit on the 4080 with 70B models
- ✅ Fits on the 4080 with 33B models
## System Memory (RAM) Requirements

### RTX 4080 System
- Minimum: 16GB RAM
- Recommended: 32GB RAM
- For: Model loading, system processes, KV cache overflow
### RTX 1050 System
- Minimum: 8GB RAM
- Recommended: 16GB RAM
- For: Model loading, system processes, KV cache overflow
## Storage Requirements

### Model Files
| Model | Size (Q4) | Download Time | Storage |
|---|---|---|---|
| Llama 3.1 70B Q4 | ~40GB | ~2-4 hours | SSD recommended |
| DeepSeek Coder 33B Q4 | ~20GB | ~1-2 hours | SSD recommended |
| Phi-3 Mini 3.8B Q4 | ~2.5GB | ~5-10 minutes | Any storage |
| Qwen2.5 1.5B Q4 | ~1GB | ~2-5 minutes | Any storage |
**Total Storage Needed**: ~60-80GB for all models plus backups.
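The file sizes above follow directly from parameter count times bits per weight. A back-of-envelope sketch, assuming roughly 4.5 effective bits per weight for Q4 GGUF variants (an approximation; actual files vary with the quantization mix):

```python
def q4_file_size_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate GGUF file size: parameters x effective bits per weight."""
    return params_billion * bits_per_weight / 8  # billions of bytes ~= GB

for name, params in [("Llama 3.1 70B", 70), ("DeepSeek Coder 33B", 33),
                     ("Phi-3 Mini", 3.8), ("Qwen2.5 1.5B", 1.5)]:
    print(f"{name}: ~{q4_file_size_gb(params):.1f} GB")
```

The outputs (~39, ~19, ~2, ~1 GB) line up with the table's rounded figures.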
## Performance Impact of Context Size

### Latency vs Context Size

**RTX 4080 (70B Q4)**:
- 4K context: ~200ms first token, ~3s for 100 tokens
- 8K context: ~250ms first token, ~4s for 100 tokens
- 16K context: ~400ms first token, ~6s for 100 tokens (if fits)
**RTX 1050 (3.8B Q4)**:
- 4K context: ~50ms first token, ~1s for 100 tokens
- 8K context: ~70ms first token, ~1.2s for 100 tokens
- 16K context: ~100ms first token, ~1.5s for 100 tokens (if fits)
**Recommendation**: Keep context at 4K-8K tokens for optimal latency.
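The figures above imply a simple two-term latency model: time to first token, plus output length divided by decode throughput. A sketch; the ~33 tokens/s figure is derived from "100 tokens in ~3s" in the table, not measured:

```python
def response_time_s(n_tokens: int, ttft_s: float, tokens_per_s: float) -> float:
    """Wall-clock estimate: time-to-first-token plus steady-state decoding."""
    return ttft_s + n_tokens / tokens_per_s

# 70B Q4 at 4K context (RTX 4080): ~0.2s TTFT, ~33 tok/s per the table above
print(f"300-token reply: ~{response_time_s(300, 0.2, 33):.1f}s")  # ~9.3s
```

This makes the trade-off explicit: prompt length mostly moves the first term, while reply length dominates the second.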
## Recommendations

### For RTX 4080 (Work Agent)

- **Quantization**: Q4 (best balance of quality and VRAM)
- **Context window**: 4K-8K tokens (practical limit)
- **Model**: Llama 3.1 70B Q4 (primary) or DeepSeek Coder 33B Q4 (alternative)
- **Concurrency**: 2 requests maximum
- **Summarization**: implement for conversations over 8K tokens
### For RTX 1050 (Family Agent)

- **Quantization**: Q4 (the only option that fits)
- **Context window**: 4K-8K tokens (practical limit)
- **Model**: Phi-3 Mini 3.8B Q4 (primary) or Qwen2.5 1.5B Q4 (alternative)
- **Concurrency**: 1-2 requests maximum
- **Summarization**: implement for conversations over 8K tokens
## Next Steps
- ✅ Complete capacity assessment (TICKET-018)
- Finalize model selection based on this assessment (TICKET-019, TICKET-020)
- Test selected models on actual hardware
- Benchmark actual VRAM usage
- Adjust context windows based on real-world performance
**Last Updated**: 2024-01-XX
**Status**: Assessment complete; ready for model selection (TICKET-019, TICKET-020)