# LLM Capacity Assessment
## Overview

This document assesses VRAM capacity, context-window limits, and memory requirements for running LLMs on RTX 4080 (16GB) and GTX 1050 (4GB) hardware.

## VRAM Capacity Analysis
### RTX 4080 (16GB VRAM)

**Available VRAM**: ~15.5GB (after system overhead)

#### Model Size Capacity

| Model Size | Quantization | VRAM Usage | Status | Notes |
|------------|--------------|------------|--------|-------|
| 70B | Q4 | ~14GB | ✅ Comfortable | Recommended |
| 70B | Q5 | ~16GB | ⚠️ Tight | Possible but no headroom |
| 70B | Q6 | ~18GB | ❌ Won't fit | Too large |
| 72B | Q4 | ~14.5GB | ✅ Comfortable | Qwen 2.5 72B |
| 67B | Q4 | ~13.5GB | ✅ Comfortable | DeepSeek LLM 67B |
| 33B | Q4 | ~8GB | ✅ Plenty of room | DeepSeek Coder |
| 8B | Q4 | ~5GB | ✅ Plenty of room | Too small for work agent |

Note: a 70B Q4 weight file is ~40GB (see Storage Requirements below), so the VRAM figures for 70B-class models assume only part of the model is GPU-resident, with the remainder offloaded to system RAM.

**Recommendation**:

- **Q4 quantization** for 70B models (comfortable margin)
- **Q5 possible** but tight (not recommended unless quality is critical)
- **33B models** leave plenty of room for larger context windows

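
As a sanity check on the table above, weight memory can be estimated from parameter count and bits per weight. This is a rule-of-thumb sketch: the ~1.1 runtime-overhead factor is an assumption, and real GGUF quants (e.g. Q4_K_M) mix bit-widths. It also makes the constraint explicit: a 70B model at ~4.5 bits/weight is ~40GB of weights (consistent with the file sizes under Storage Requirements), so a 16GB card can hold only part of it.

```python
# Rough weight-memory estimate: params * bits-per-weight, plus a fudge
# factor for runtime buffers. Effective bits/weight for GGUF quants are
# approximations (real quants like Q4_K_M mix bit-widths).
BITS_PER_WEIGHT = {"Q4": 4.5, "Q5": 5.5, "Q6": 6.5, "Q8": 8.5, "FP16": 16}

def weight_vram_gb(params_billion: float, quant: str, overhead: float = 1.1) -> float:
    bytes_total = params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8
    return round(bytes_total * overhead / 1e9, 1)

for size, quant in [(70, "Q4"), (70, "Q5"), (33, "Q4"), (3.8, "Q4")]:
    print(f"{size}B {quant}: ~{weight_vram_gb(size, quant)} GB")  # 70B Q4 → ~43.3 GB
```

The 3.8B and 33B estimates (~2.4GB and ~20.4GB) line up with the figures quoted elsewhere in this document; the 70B estimate shows why partial offload is unavoidable on a 16GB card.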
#### Context Window Capacity

Context window size affects VRAM usage through the KV cache:

| Context Size | KV Cache (70B Q4) | Total VRAM | Status |
|--------------|-------------------|------------|--------|
| 4K tokens | ~2GB | ~16GB | ✅ Fits |
| 8K tokens | ~4GB | ~18GB | ⚠️ Tight |
| 16K tokens | ~8GB | ~22GB | ❌ Won't fit |
| 32K tokens | ~16GB | ~30GB | ❌ Won't fit |
| 128K tokens | ~64GB | ~78GB | ❌ Won't fit |

Totals above the ~15.5GB budget imply spilling KV cache or model layers to system RAM, at a significant throughput cost.

**Practical Limits for 70B Q4:**

- **Max context**: ~8K tokens
- **Recommended context**: 4K-8K tokens
- **128K context**: not practical (would require Q2 quantization or a much smaller model)

**For 33B Q4 (DeepSeek Coder):**

- **Max context**: ~16K tokens (comfortable)
- **Recommended context**: 8K-16K tokens

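
The KV-cache column can be cross-checked against the standard formula — cache size grows linearly with context length. The model shape below (80 layers, 8 KV heads via GQA, head dim 128, FP16 cache) is an assumption matching Llama-3.1-70B-class models; it yields lower figures than the table, which appears to budget extra for activations and scratch buffers.

```python
# KV-cache size grows linearly with context length:
#   bytes/token = 2 (K and V) * n_layers * n_kv_heads * head_dim * bytes/elem
# Shape defaults are illustrative assumptions (Llama-3.1-70B-like GQA).
def kv_cache_gb(tokens: int, n_layers: int = 80, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return round(tokens * per_token / 1e9, 2)

for ctx in (4096, 8192, 16384):
    print(f"{ctx} tokens: ~{kv_cache_gb(ctx)} GB FP16 KV cache")
```

Halving `bytes_per_elem` models a Q8-quantized KV cache, a common lever for squeezing more context into a fixed budget.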
#### Batch Size and Concurrency

| Configuration | VRAM Usage | Throughput | Recommendation |
|---------------|------------|------------|----------------|
| Single request | ~14GB | 1x | Baseline |
| 2 concurrent | ~15GB | 1.8x | ✅ Recommended |
| 3 concurrent | ~16GB | 2.5x | ⚠️ Possible but tight |
| 4 concurrent | ~17GB | 3x | ❌ Won't fit |

**Recommendation**: 2 concurrent requests maximum for 70B Q4.

### GTX 1050 (4GB VRAM)

**Available VRAM**: ~3.8GB (after system overhead)

#### Model Size Capacity

| Model Size | Quantization | VRAM Usage | Status | Notes |
|------------|--------------|------------|--------|-------|
| 3.8B | Q4 | ~2.5GB | ✅ Comfortable | Phi-3 Mini |
| 3B | Q4 | ~2GB | ✅ Comfortable | Llama 3.2 3B |
| 2.7B | Q4 | ~1.8GB | ✅ Comfortable | Phi-2 |
| 2B | Q4 | ~1.5GB | ✅ Comfortable | Gemma 2B |
| 1.5B | Q4 | ~1.2GB | ✅ Plenty of room | Qwen2.5 1.5B |
| 1.1B | Q4 | ~0.8GB | ✅ Plenty of room | TinyLlama |
| 7B | Q4 | ~4.5GB | ❌ Won't fit | Too large |
| 8B | Q4 | ~5GB | ❌ Won't fit | Too large |

**Recommendation**:

- **3.8B Q4** (Phi-3 Mini) - best balance of capability and headroom
- **1.5B Q4** (Qwen2.5) - if more headroom is needed
- **1.1B Q4** (TinyLlama) - maximum headroom

#### Context Window Capacity

| Context Size | KV Cache (3.8B Q4) | Total VRAM | Status |
|--------------|--------------------|------------|--------|
| 2K tokens | ~0.3GB | ~2.8GB | ✅ Fits easily |
| 4K tokens | ~0.6GB | ~3.1GB | ✅ Comfortable |
| 8K tokens | ~1.2GB | ~3.7GB | ✅ Fits |
| 16K tokens | ~2.4GB | ~4.9GB | ❌ Won't fit |
| 32K tokens | ~4.8GB | ~7.3GB | ❌ Won't fit |
| 128K tokens | ~19GB | ~21.5GB | ❌ Won't fit |

**Practical Limits for 3.8B Q4:**

- **Max context**: ~8K tokens
- **Recommended context**: 4K-8K tokens
- **128K context**: not practical (the model supports it, but VRAM doesn't)

**For 1.5B Q4 (Qwen2.5):**

- **Max context**: ~16K tokens (comfortable)
- **Recommended context**: 8K-16K tokens

#### Batch Size and Concurrency

| Configuration | VRAM Usage | Throughput | Recommendation |
|---------------|------------|------------|----------------|
| Single request | ~2.5GB | 1x | Baseline |
| 2 concurrent | ~3.5GB | 1.8x | ✅ Recommended |
| 3 concurrent | ~4.2GB | 2.5x | ❌ Won't fit |

**Recommendation**: 1-2 concurrent requests for 3.8B Q4.

## Memory Requirements Summary

### RTX 4080 (Work Agent)

**Recommended Configuration:**

- **Model**: Llama 3.1 70B Q4
- **VRAM Usage**: ~14GB
- **Context Window**: 4K-8K tokens
- **Concurrency**: 2 requests max
- **Headroom**: ~1.5GB for system/KV cache

**Alternative Configuration:**

- **Model**: DeepSeek Coder 33B Q4
- **VRAM Usage**: ~8GB
- **Context Window**: 8K-16K tokens
- **Concurrency**: 3-4 requests possible
- **Headroom**: ~7.5GB for system/KV cache

### GTX 1050 (Family Agent)

**Recommended Configuration:**

- **Model**: Phi-3 Mini 3.8B Q4
- **VRAM Usage**: ~2.5GB
- **Context Window**: 4K-8K tokens
- **Concurrency**: 1-2 requests
- **Headroom**: ~1.3GB for system/KV cache

**Alternative Configuration:**

- **Model**: Qwen2.5 1.5B Q4
- **VRAM Usage**: ~1.2GB
- **Context Window**: 8K-16K tokens
- **Concurrency**: 2-3 requests possible
- **Headroom**: ~2.6GB for system/KV cache

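
The two recommended configurations can be pinned in a small config structure so the serving code reads its limits from one place. A sketch — the key names and the `limits` helper are illustrative, not an existing schema:

```python
# Deployment parameters from the summaries above; the dict layout is
# illustrative, not an existing config schema.
AGENT_CONFIG = {
    "work": {  # 16GB GPU
        "model": "llama-3.1-70b-q4",
        "context_tokens": 8192,
        "max_concurrent": 2,
    },
    "family": {  # 4GB GPU
        "model": "phi-3-mini-3.8b-q4",
        "context_tokens": 8192,
        "max_concurrent": 2,
    },
}

def limits(agent: str) -> tuple[int, int]:
    """Return (context budget, concurrency cap) for an agent."""
    cfg = AGENT_CONFIG[agent]
    return cfg["context_tokens"], cfg["max_concurrent"]
```

Keeping these numbers in one structure makes it easy to adjust them after the real-hardware benchmarks planned in Next Steps.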
## Context Window Trade-offs

### Large Context Windows (128K+)

**Pros:**

- Can handle very long conversations
- More context for complex tasks
- Less need for summarization

**Cons:**

- **Not practical on the 4080/1050** - would require:
  - Q2 quantization (significant quality loss)
  - Or much smaller models (capability loss)
  - Or external memory (complexity)

**Recommendation**: Use 4K-8K context with a summarization strategy.

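
A minimal sketch of that summarization strategy: estimate tokens cheaply, evict the oldest turns once the budget is exceeded, and replace them with a summary. The 4-characters-per-token heuristic and the `summarize` stub are assumptions — in practice the summary would be produced by the model itself:

```python
def rough_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def summarize(turns: list[str]) -> str:
    # Stub: a real implementation would ask the LLM for a short summary.
    return f"[summary of {len(turns)} earlier turns]"

def fit_to_budget(history: list[str], budget_tokens: int = 8192) -> list[str]:
    """Fold the oldest turns into a summary until the history fits the budget."""
    dropped: list[str] = []
    while history and sum(rough_tokens(t) for t in history) > budget_tokens:
        dropped.append(history.pop(0))  # evict oldest turn first
    if dropped:
        history.insert(0, summarize(dropped))
    return history
```

Run before each request, this keeps the prompt inside the 4K-8K window while preserving a compressed memory of earlier turns.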
### Practical Context Windows

**4K tokens** (~3,000 words):

- ✅ Fits comfortably on both GPUs
- ✅ Good for most conversations
- ✅ Fast inference
- ⚠️ May need summarization for long chats

**8K tokens** (~6,000 words):

- ✅ Fits on both GPUs
- ✅ Better for longer conversations
- ✅ Still fast inference
- ✅ Good balance

**16K tokens** (~12,000 words):

- ✅ Fits on the 1050 with smaller models (1.5B)
- ⚠️ Tight on the 4080 with 70B (not recommended)
- ✅ Fits on the 4080 with 33B models

## System Memory (RAM) Requirements

### RTX 4080 System

- **Minimum**: 16GB RAM
- **Recommended**: 32GB RAM
- **For**: model loading, system processes, KV-cache overflow

### GTX 1050 System

- **Minimum**: 8GB RAM
- **Recommended**: 16GB RAM
- **For**: model loading, system processes, KV-cache overflow

## Storage Requirements

### Model Files

| Model | Size (Q4) | Download Time | Storage |
|-------|-----------|---------------|---------|
| Llama 3.1 70B Q4 | ~40GB | ~2-4 hours | SSD recommended |
| DeepSeek Coder 33B Q4 | ~20GB | ~1-2 hours | SSD recommended |
| Phi-3 Mini 3.8B Q4 | ~2.5GB | ~5-10 minutes | Any storage |
| Qwen2.5 1.5B Q4 | ~1GB | ~2-5 minutes | Any storage |

**Total Storage Needed**: ~60-80GB for all models plus backups

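
The download-time column is just file size over line rate; as a cross-check, ~40GB at an effective 25-50 Mbit/s works out to roughly 2-4 hours, matching the table:

```python
def download_hours(size_gb: float, rate_mbps: float) -> float:
    """Hours to download size_gb gigabytes at rate_mbps megabits per second."""
    seconds = size_gb * 8e9 / (rate_mbps * 1e6)
    return round(seconds / 3600, 1)

print(download_hours(40, 50))  # 70B Q4 at 50 Mbit/s → 1.8 hours
print(download_hours(40, 25))  # 70B Q4 at 25 Mbit/s → 3.6 hours
```

Faster links shorten this proportionally; actual effective throughput from model hosts will vary.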
## Performance Impact of Context Size

### Latency vs Context Size

**RTX 4080 (70B Q4):**

- 4K context: ~200ms first token, ~3s for 100 tokens
- 8K context: ~250ms first token, ~4s for 100 tokens
- 16K context: ~400ms first token, ~6s for 100 tokens (if it fits)

**GTX 1050 (3.8B Q4):**

- 4K context: ~50ms first token, ~1s for 100 tokens
- 8K context: ~70ms first token, ~1.2s for 100 tokens
- 16K context: ~100ms first token, ~1.5s for 100 tokens (if it fits)

**Recommendation**: Keep context at 4K-8K for optimal latency.

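
For planning, the latency figures above convert directly into total response time (time to first token plus decode time at the quoted rate):

```python
def response_time_s(first_token_ms: float, tokens: int, per_100_tokens_s: float) -> float:
    """Estimated total latency: time to first token plus decode time."""
    tokens_per_s = 100 / per_100_tokens_s  # e.g. 4s/100 tokens -> 25 tok/s
    return round(first_token_ms / 1000 + tokens / tokens_per_s, 2)

# RTX 4080 at 8K context: ~250ms TTFT, ~4s per 100 tokens (~25 tok/s),
# so a 300-token reply takes roughly 12 seconds.
print(response_time_s(250, 300, 4.0))  # → 12.25
```

This is why the shorter 4K-8K windows are recommended: context size affects both terms of this sum.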
## Recommendations

### For RTX 4080 (Work Agent)

1. **Use Q4 quantization** - best balance of quality and VRAM
2. **Context window**: 4K-8K tokens (practical limit)
3. **Model**: Llama 3.1 70B Q4 (primary) or DeepSeek Coder 33B Q4 (alternative)
4. **Concurrency**: 2 requests maximum
5. **Summarization**: implement for conversations >8K tokens

### For GTX 1050 (Family Agent)

1. **Use Q4 quantization** - the only option that fits
2. **Context window**: 4K-8K tokens (practical limit)
3. **Model**: Phi-3 Mini 3.8B Q4 (primary) or Qwen2.5 1.5B Q4 (alternative)
4. **Concurrency**: 1-2 requests maximum
5. **Summarization**: implement for conversations >8K tokens

## Next Steps

1. ✅ Complete capacity assessment (TICKET-018)
2. Finalize model selection based on this assessment (TICKET-019, TICKET-020)
3. Test selected models on actual hardware
4. Benchmark actual VRAM usage
5. Adjust context windows based on real-world performance

---

**Last Updated**: 2024-01-XX

**Status**: Assessment Complete - Ready for Model Selection (TICKET-019, TICKET-020)