
LLM Capacity Assessment

Overview

This document assesses VRAM capacity, context window limits, and memory requirements for running LLMs on RTX 4080 (16GB) and RTX 1050 (4GB) hardware.

VRAM Capacity Analysis

RTX 4080 (16GB VRAM)

Available VRAM: ~15.5GB (after system overhead)

Model Size Capacity

| Model Size | Quantization | VRAM Usage | Status | Notes |
|---|---|---|---|---|
| 70B | Q4 | ~14GB | Comfortable | Recommended |
| 70B | Q5 | ~16GB | ⚠️ Tight | Possible but no headroom |
| 70B | Q6 | ~18GB | Won't fit | Too large |
| 72B | Q4 | ~14.5GB | Comfortable | Qwen 2.5 72B |
| 67B | Q4 | ~13.5GB | Comfortable | DeepSeek LLM 67B |
| 33B | Q4 | ~8GB | Plenty of room | DeepSeek Coder |
| 8B | Q4 | ~5GB | Plenty of room | Too small for work agent |

Recommendation:

  • Q4 quantization for 70B models (comfortable margin)
  • Q5 possible but tight (not recommended unless quality is critical)
  • 33B models leave plenty of room for larger context windows
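Before committing to a model, the headroom assumed above can be checked directly on the machine. A minimal sketch querying free VRAM via `nvidia-smi` (assumes the NVIDIA driver is installed; the helper names are illustrative, not project code):

```python
import subprocess

def parse_free_mib(csv_line: str) -> int:
    """Parse `--format=csv,noheader,nounits` output into an integer MiB value."""
    return int(csv_line.strip())

def free_vram_mib(gpu_index: int = 0) -> int:
    """Query free VRAM on one GPU via nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.free",
         "--format=csv,noheader,nounits", "-i", str(gpu_index)],
        text=True,
    )
    return parse_free_mib(out)
```

Comparing the returned value against a model's expected footprint before loading avoids a failed (or silently CPU-offloaded) load.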

Context Window Capacity

Context window size affects VRAM usage through KV cache:

| Context Size | KV Cache (70B Q4) | Total VRAM | Status |
|---|---|---|---|
| 4K tokens | ~2GB | ~16GB | ⚠️ Tight |
| 8K tokens | ~4GB | ~18GB | Won't fit at FP16 KV |
| 16K tokens | ~8GB | ~22GB | Won't fit |
| 32K tokens | ~16GB | ~30GB | Won't fit |
| 128K tokens | ~64GB | ~78GB | Won't fit |

Practical Limits for 70B Q4:

  • Max context: ~4K tokens at FP16 KV precision; ~8K is feasible only with 8-bit KV-cache quantization
  • Recommended context: 4K-8K tokens
  • 128K context: Not practical (would need Q2 or a much smaller model)
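The KV-cache scaling can be approximated with the standard formula: two tensors (K and V) per layer, per token. A sketch using Llama 3.1 70B's published architecture (80 layers, 8 KV heads under grouped-query attention, head dimension 128); note GQA makes this smaller than the conservative table estimates:

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: K and V tensors per layer per token, FP16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

# Llama 3.1 70B: 80 layers, 8 KV heads (GQA), head_dim 128
gb = kv_cache_bytes(8192, 80, 8, 128) / 1e9
print(f"8K context ≈ {gb:.1f} GB")  # roughly 2.7 GB at FP16
```

Halving `bytes_per_elem` to 1 models an 8-bit quantized cache, which is what makes the larger contexts borderline-feasible.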

For 33B Q4 (DeepSeek Coder):

  • Max context: ~16K tokens (comfortable)
  • Recommended context: 8K-16K tokens

Batch Size and Concurrency

| Configuration | VRAM Usage | Throughput | Recommendation |
|---|---|---|---|
| Single request | ~14GB | 1x | Baseline |
| 2 concurrent | ~15GB | 1.8x | Recommended |
| 3 concurrent | ~16GB | 2.5x | ⚠️ Possible but tight |
| 4 concurrent | ~17GB | 3x | Won't fit |

Recommendation: 2 concurrent requests maximum for 70B Q4
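The concurrency cap can be enforced in application code rather than relying on the inference server. A sketch using an asyncio semaphore; `generate()` here is a hypothetical stand-in for the real inference call:

```python
import asyncio

MAX_CONCURRENT = 2  # matches the 2-request cap for 70B Q4

async def generate(prompt: str) -> str:
    """Placeholder for the actual call to the local LLM server."""
    await asyncio.sleep(0.01)
    return f"response to: {prompt}"

async def generate_limited(sem: asyncio.Semaphore, prompt: str) -> str:
    async with sem:  # waits while MAX_CONCURRENT requests are in flight
        return await generate(prompt)

async def main() -> list[str]:
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    return await asyncio.gather(
        *(generate_limited(sem, f"request {i}") for i in range(5))
    )

results = asyncio.run(main())
print(len(results))  # 5
```

Excess requests queue instead of failing, so callers see higher latency rather than out-of-memory errors.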

RTX 1050 (4GB VRAM)

Available VRAM: ~3.8GB (after system overhead)

Model Size Capacity

| Model Size | Quantization | VRAM Usage | Status | Notes |
|---|---|---|---|---|
| 3.8B | Q4 | ~2.5GB | Comfortable | Phi-3 Mini |
| 3B | Q4 | ~2GB | Comfortable | Llama 3.2 3B |
| 2.7B | Q4 | ~1.8GB | Comfortable | Phi-2 |
| 2B | Q4 | ~1.5GB | Comfortable | Gemma 2B |
| 1.5B | Q4 | ~1.2GB | Plenty of room | Qwen2.5 1.5B |
| 1.1B | Q4 | ~0.8GB | Plenty of room | TinyLlama |
| 7B | Q4 | ~4.5GB | Won't fit | Too large |
| 8B | Q4 | ~5GB | Won't fit | Too large |

Recommendation:

  • 3.8B Q4 (Phi-3 Mini) - Best balance
  • 1.5B Q4 (Qwen2.5) - If more headroom needed
  • 1.1B Q4 (TinyLlama) - Maximum headroom

Context Window Capacity

| Context Size | KV Cache (3.8B Q4) | Total VRAM | Status |
|---|---|---|---|
| 2K tokens | ~0.3GB | ~2.8GB | Fits easily |
| 4K tokens | ~0.6GB | ~3.1GB | Comfortable |
| 8K tokens | ~1.2GB | ~3.7GB | ⚠️ Tight |
| 16K tokens | ~2.4GB | ~4.9GB | Won't fit |
| 32K tokens | ~4.8GB | ~7.3GB | Won't fit |
| 128K tokens | ~19GB | ~21.5GB | Won't fit |

Practical Limits for 3.8B Q4:

  • Max context: ~8K tokens (fits, but with only ~0.1GB headroom)
  • Recommended context: 4K-8K tokens
  • 128K context: Not practical (the model supports it, but VRAM doesn't)

For 1.5B Q4 (Qwen2.5):

  • Max context: ~16K tokens (comfortable)
  • Recommended context: 8K-16K tokens

Batch Size and Concurrency

| Configuration | VRAM Usage | Throughput | Recommendation |
|---|---|---|---|
| Single request | ~2.5GB | 1x | Baseline |
| 2 concurrent | ~3.5GB | 1.8x | Recommended |
| 3 concurrent | ~4.2GB | 2.5x | Won't fit (~3.8GB available) |

Recommendation: 1-2 concurrent requests for 3.8B Q4

Memory Requirements Summary

RTX 4080 (Work Agent)

Recommended Configuration:

  • Model: Llama 3.1 70B Q4
  • VRAM Usage: ~14GB
  • Context Window: 4K-8K tokens
  • Concurrency: 2 requests max
  • Headroom: ~1.5GB for system/KV cache

Alternative Configuration:

  • Model: DeepSeek Coder 33B Q4
  • VRAM Usage: ~8GB
  • Context Window: 8K-16K tokens
  • Concurrency: 3-4 requests possible
  • Headroom: ~7.5GB for system/KV cache

RTX 1050 (Family Agent)

Recommended Configuration:

  • Model: Phi-3 Mini 3.8B Q4
  • VRAM Usage: ~2.5GB
  • Context Window: 4K-8K tokens
  • Concurrency: 1-2 requests
  • Headroom: ~1.3GB for system/KV cache

Alternative Configuration:

  • Model: Qwen2.5 1.5B Q4
  • VRAM Usage: ~1.2GB
  • Context Window: 8K-16K tokens
  • Concurrency: 2-3 requests possible
  • Headroom: ~2.6GB for system/KV cache

Context Window Trade-offs

Large Context Windows (128K+)

Pros:

  • Can handle very long conversations
  • More context for complex tasks
  • Less need for summarization

Cons:

  • Not practical on 4080/1050 - Would require:
    • Q2 quantization (significant quality loss)
    • Or much smaller models (capability loss)
    • Or external memory (complexity)

Recommendation: Use 4K-8K context with summarization strategy
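The summarization strategy can be sketched as rolling compaction: when the conversation exceeds the token budget, fold the oldest turns into a summary and keep the recent ones verbatim. Everything here is illustrative; `summarize()` stands in for a call to the local model, and the token estimate is a crude word-count heuristic:

```python
def summarize(text: str) -> str:
    """Stand-in: the real version would ask the local model for a summary."""
    return text[:80]

def estimate_tokens(text: str) -> int:
    """Rough estimate: ~1.3 tokens per English word."""
    return int(len(text.split()) * 1.3)

def compact_history(turns: list[str], budget: int = 8000,
                    keep_recent: int = 4) -> list[str]:
    """Fold old turns into one summary entry once the budget is exceeded."""
    total = sum(estimate_tokens(t) for t in turns)
    if total <= budget or len(turns) <= keep_recent:
        return turns
    old, recent = turns[:-keep_recent], turns[-keep_recent:]
    summary = summarize("\n".join(old))
    return [f"[summary] {summary}"] + recent
```

Running this before each request keeps the prompt inside the 4K-8K window at the cost of one extra summarization call when the threshold is crossed.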

Practical Context Windows

4K tokens (~3,000 words):

  • Fits comfortably on both GPUs
  • Good for most conversations
  • Fast inference
  • ⚠️ May need summarization for long chats

8K tokens (~6,000 words):

  • Fits on both GPUs
  • Better for longer conversations
  • Still fast inference
  • Good balance

16K tokens (~12,000 words):

  • Fits on 1050 with smaller models (1.5B)
  • Won't fit on 4080 with 70B (~22GB total)
  • Fits on 4080 with 33B models

System Memory (RAM) Requirements

RTX 4080 System

  • Minimum: 16GB RAM
  • Recommended: 32GB RAM
  • For: Model loading, system processes, KV cache overflow

RTX 1050 System

  • Minimum: 8GB RAM
  • Recommended: 16GB RAM
  • For: Model loading, system processes, KV cache overflow

Storage Requirements

Model Files

| Model | Size (Q4) | Download Time | Storage |
|---|---|---|---|
| Llama 3.1 70B | ~40GB | ~2-4 hours | SSD recommended |
| DeepSeek Coder 33B | ~20GB | ~1-2 hours | SSD recommended |
| Phi-3 Mini 3.8B | ~2.5GB | ~5-10 minutes | Any storage |
| Qwen2.5 1.5B | ~1GB | ~2-5 minutes | Any storage |

Total Storage Needed: ~60-80GB for all models + backups

Performance Impact of Context Size

Latency vs Context Size

RTX 4080 (70B Q4):

  • 4K context: ~200ms first token, ~3s for 100 tokens
  • 8K context: ~250ms first token, ~4s for 100 tokens
  • 16K context: ~400ms first token, ~6s for 100 tokens (if fits)

RTX 1050 (3.8B Q4):

  • 4K context: ~50ms first token, ~1s for 100 tokens
  • 8K context: ~70ms first token, ~1.2s for 100 tokens
  • 16K context: ~100ms first token, ~1.5s for 100 tokens (if fits)

Recommendation: Keep context at 4K-8K for optimal latency
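First-token latency and throughput are straightforward to verify against any streaming endpoint. A sketch with a fake token generator standing in for the model's response stream:

```python
import time

def measure_stream(tokens):
    """Return (time-to-first-token in s, tokens/sec) for a token iterator."""
    it = iter(tokens)
    start = time.perf_counter()
    next(it)                          # wait for the first token
    ttft = time.perf_counter() - start
    count = 1 + sum(1 for _ in it)    # drain the rest of the stream
    elapsed = time.perf_counter() - start
    return ttft, count / elapsed

def fake_stream(n=100, delay=0.0001):
    """Stand-in generator for a streaming LLM response."""
    for i in range(n):
        time.sleep(delay)
        yield f"tok{i}"

ttft, tps = measure_stream(fake_stream())
print(f"first token: {ttft * 1000:.1f} ms, throughput: {tps:.0f} tok/s")
```

Pointing `measure_stream` at the real server's token stream gives the numbers needed to validate the estimates above on actual hardware.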

Recommendations

For RTX 4080 (Work Agent)

  1. Use Q4 quantization - Best balance of quality and VRAM
  2. Context window: 4K-8K tokens (practical limit)
  3. Model: Llama 3.1 70B Q4 (primary) or DeepSeek Coder 33B Q4 (alternative)
  4. Concurrency: 2 requests maximum
  5. Summarization: Implement for conversations >8K tokens

For RTX 1050 (Family Agent)

  1. Use Q4 quantization - Only option that fits
  2. Context window: 4K-8K tokens (practical limit)
  3. Model: Phi-3 Mini 3.8B Q4 (primary) or Qwen2.5 1.5B Q4 (alternative)
  4. Concurrency: 1-2 requests maximum
  5. Summarization: Implement for conversations >8K tokens

Next Steps

  1. Complete capacity assessment (TICKET-018)
  2. Finalize model selection based on this assessment (TICKET-019, TICKET-020)
  3. Test selected models on actual hardware
  4. Benchmark actual VRAM usage
  5. Adjust context windows based on real-world performance

Last Updated: 2024-01-XX
Status: Assessment Complete - Ready for Model Selection (TICKET-019, TICKET-020)