# LLM Capacity Assessment

## Overview
This document assesses VRAM capacity, context window limits, and memory requirements for running LLMs on RTX 4080 (16GB) and RTX 1050 (4GB) hardware.
## VRAM Capacity Analysis

### RTX 4080 (16GB VRAM)

**Available VRAM**: ~15.5GB (after system overhead)

#### Model Size Capacity
| Model Size | Quantization | VRAM Usage | Status | Notes |
|---|---|---|---|---|
| 70B | Q4 | ~14GB | ✅ Comfortable | Recommended |
| 70B | Q5 | ~16GB | ❌ Over budget | Exceeds the ~15.5GB available |
| 70B | Q6 | ~18GB | ❌ Won't fit | Too large |
| 72B | Q4 | ~14.5GB | ✅ Comfortable | Qwen 2.5 72B |
| 67B | Q4 | ~13.5GB | ✅ Comfortable | DeepSeek LLM 67B |
| 33B | Q4 | ~8GB | ✅ Plenty of room | DeepSeek Coder |
| 8B | Q4 | ~5GB | ✅ Plenty of room | Too small for work agent |
**Recommendation**:
- Q4 quantization for 70B models (comfortable margin)
- Q5 possible but tight (not recommended unless quality critical)
- 33B models leave plenty of room for larger context windows
#### Context Window Capacity
Context window size affects VRAM usage through KV cache:
| Context Size | KV Cache (70B Q4) | Total VRAM | Status |
|---|---|---|---|
| 4K tokens | ~2GB | ~16GB | ⚠️ At the 16GB limit |
| 8K tokens | ~4GB | ~18GB | ❌ Exceeds 16GB at FP16 |
| 16K tokens | ~8GB | ~22GB | ❌ Won't fit |
| 32K tokens | ~16GB | ~30GB | ❌ Won't fit |
| 128K tokens | ~64GB | ~78GB | ❌ Won't fit |

These are conservative FP16 estimates. Llama 3.1's grouped-query attention caches only 8 of its 64 heads, and the KV cache can itself be quantized (Q8), so the real footprint is several times smaller. That is what makes 4K-8K contexts practical on a 16GB card.
**Practical Limits for 70B Q4**:
- Max context: ~8K tokens (achievable with GQA and a quantized KV cache)
- Recommended context: 4K-8K tokens
- 128K context: not practical (would require Q2 quantization or a much smaller model)
**For 33B Q4 (DeepSeek Coder)**:
- Max context: ~16K tokens (comfortable)
- Recommended context: 8K-16K tokens
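The KV-cache budgets above can be sanity-checked with a one-line formula: per token, the cache stores one key and one value vector for each cached head in each layer. A minimal sketch using the published Llama 3.1 70B shape (80 layers, 8 KV heads under grouped-query attention, head dimension 128); note that GQA makes the true footprint smaller than the conservative per-token figures in the tables above:

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: a key and a value (factor 2) per layer,
    per cached head, per token, at bytes_per_elem precision."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

# Llama 3.1 70B: 80 layers, 8 KV heads (GQA), head_dim 128, FP16 cache
gib = kv_cache_bytes(8192, n_layers=80, n_kv_heads=8, head_dim=128) / 2**30
print(f"8K context: ~{gib:.1f} GiB")  # 8K context: ~2.5 GiB
```

Switching `bytes_per_elem` to 1 models a Q8-quantized KV cache and halves the result, which is the main lever for stretching context on a fixed VRAM budget.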
#### Batch Size and Concurrency
| Configuration | VRAM Usage | Throughput | Recommendation |
|---|---|---|---|
| Single request | ~14GB | 1x | Baseline |
| 2 concurrent | ~15GB | 1.8x | ✅ Recommended |
| 3 concurrent | ~16GB | 2.5x | ❌ Exceeds the ~15.5GB budget |
| 4 concurrent | ~17GB | 3x | ❌ Won't fit |

**Recommendation**: 2 concurrent requests maximum for 70B Q4.
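The concurrency cap can also be enforced at the application layer rather than relying on the inference server. A minimal sketch using `asyncio`; `call_llm` is a hypothetical stand-in for whatever inference client the project ends up using:

```python
import asyncio

async def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for the real inference client."""
    await asyncio.sleep(0.01)  # simulate generation latency
    return f"ok:{prompt}"

async def run_batch(prompts: list[str], max_concurrent: int = 2) -> list[str]:
    """Answer all prompts, but let at most max_concurrent hit the GPU at once."""
    slots = asyncio.Semaphore(max_concurrent)

    async def generate(prompt: str) -> str:
        async with slots:  # excess requests queue here instead of spiking VRAM
            return await call_llm(prompt)

    return await asyncio.gather(*(generate(p) for p in prompts))

print(asyncio.run(run_batch(["q1", "q2", "q3"])))  # ['ok:q1', 'ok:q2', 'ok:q3']
```

Requests beyond the cap simply wait in the semaphore queue, trading latency for a bounded VRAM footprint.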
### RTX 1050 (4GB VRAM)

**Available VRAM**: ~3.8GB (after system overhead)

#### Model Size Capacity
| Model Size | Quantization | VRAM Usage | Status | Notes |
|---|---|---|---|---|
| 3.8B | Q4 | ~2.5GB | ✅ Comfortable | Phi-3 Mini |
| 3B | Q4 | ~2GB | ✅ Comfortable | Llama 3.2 3B |
| 2.7B | Q4 | ~1.8GB | ✅ Comfortable | Phi-2 |
| 2B | Q4 | ~1.5GB | ✅ Comfortable | Gemma 2B |
| 1.5B | Q4 | ~1.2GB | ✅ Plenty of room | Qwen2.5 1.5B |
| 1.1B | Q4 | ~0.8GB | ✅ Plenty of room | TinyLlama |
| 7B | Q4 | ~4.5GB | ❌ Won't fit | Too large |
| 8B | Q4 | ~5GB | ❌ Won't fit | Too large |
**Recommendation**:
- 3.8B Q4 (Phi-3 Mini) - Best balance
- 1.5B Q4 (Qwen2.5) - If more headroom needed
- 1.1B Q4 (TinyLlama) - Maximum headroom
#### Context Window Capacity
| Context Size | KV Cache (3.8B Q4) | Total VRAM | Status |
|---|---|---|---|
| 2K tokens | ~0.3GB | ~2.8GB | ✅ Fits easily |
| 4K tokens | ~0.6GB | ~3.1GB | ✅ Comfortable |
| 8K tokens | ~1.2GB | ~3.7GB | ⚠️ At the ~3.8GB limit |
| 16K tokens | ~2.4GB | ~4.9GB | ❌ Won't fit |
| 32K tokens | ~4.8GB | ~7.3GB | ❌ Won't fit |
| 128K tokens | ~19GB | ~21.5GB | ❌ Won't fit |
**Practical Limits for 3.8B Q4**:
- Max context: ~8K tokens (at the edge of the ~3.8GB budget)
- Recommended context: 4K-8K tokens
- 128K context: not practical (the model supports it, but VRAM does not)
**For 1.5B Q4 (Qwen2.5)**:
- Max context: ~16K tokens (comfortable)
- Recommended context: 8K-16K tokens
#### Batch Size and Concurrency
| Configuration | VRAM Usage | Throughput | Recommendation |
|---|---|---|---|
| Single request | ~2.5GB | 1x | Baseline |
| 2 concurrent | ~3.5GB | 1.8x | ✅ Recommended |
| 3 concurrent | ~4.2GB | 2.5x | ❌ Exceeds the ~3.8GB budget |

**Recommendation**: 1-2 concurrent requests for 3.8B Q4.
## Memory Requirements Summary

### RTX 4080 (Work Agent)

**Recommended Configuration**:
- Model: Llama 3.1 70B Q4
- VRAM Usage: ~14GB
- Context Window: 4K-8K tokens
- Concurrency: 2 requests max
- Headroom: ~1.5GB for system/KV cache
**Alternative Configuration**:
- Model: DeepSeek Coder 33B Q4
- VRAM Usage: ~8GB
- Context Window: 8K-16K tokens
- Concurrency: 3-4 requests possible
- Headroom: ~7.5GB for system/KV cache
### RTX 1050 (Family Agent)

**Recommended Configuration**:
- Model: Phi-3 Mini 3.8B Q4
- VRAM Usage: ~2.5GB
- Context Window: 4K-8K tokens
- Concurrency: 1-2 requests
- Headroom: ~1.3GB for system/KV cache
**Alternative Configuration**:
- Model: Qwen2.5 1.5B Q4
- VRAM Usage: ~1.2GB
- Context Window: 8K-16K tokens
- Concurrency: 2-3 requests possible
- Headroom: ~2.6GB for system/KV cache
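The headroom figures in these configurations are simply available VRAM minus the model's resident footprint; a deployment script could compute and check this before loading a model. A trivial sketch using this document's numbers:

```python
def headroom_gb(available_vram_gb: float, model_gb: float) -> float:
    """VRAM left over for KV cache, activations, and system use."""
    return available_vram_gb - model_gb

# RTX 4080 + Llama 3.1 70B Q4, and RTX 1050 + Phi-3 Mini 3.8B Q4
print(f"{headroom_gb(15.5, 14.0):.1f} GB")  # 1.5 GB
print(f"{headroom_gb(3.8, 2.5):.1f} GB")    # 1.3 GB
```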
## Context Window Trade-offs

### Large Context Windows (128K+)

**Pros**:
- Can handle very long conversations
- More context for complex tasks
- Less need for summarization
**Cons**:
- Not practical on the 4080/1050; this would require:
  - Q2 quantization (significant quality loss)
  - a much smaller model (capability loss)
  - external memory/retrieval (added complexity)
**Recommendation**: Use 4K-8K context with a summarization strategy.
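That strategy can be as simple as folding the oldest turns into a summary once the history nears the token budget. A minimal sketch, assuming a rough 4-characters-per-token heuristic; `summarize` is a hypothetical callback (in practice a call back into the LLM itself):

```python
def approx_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def fit_history(turns: list[str], budget: int, summarize) -> list[str]:
    """Pop the oldest turns until the history fits, then prepend one summary."""
    turns = list(turns)
    dropped: list[str] = []
    while turns and sum(approx_tokens(t) for t in turns) > budget:
        dropped.append(turns.pop(0))
    if dropped:
        # `summarize` is a hypothetical hook; its output also costs tokens,
        # so leave some slack in the budget for it.
        turns.insert(0, summarize(" ".join(dropped)))
    return turns

history = ["x" * 400, "y" * 400, "z" * 400]  # ~100 tokens each
trimmed = fit_history(history, budget=250, summarize=lambda _: "[summary]")
print(trimmed[0], len(trimmed))  # [summary] 3
```

A production version would count tokens with the model's real tokenizer instead of the character heuristic.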
### Practical Context Windows

**4K tokens (~3,000 words)**:
- ✅ Fits comfortably on both GPUs
- ✅ Good for most conversations
- ✅ Fast inference
- ⚠️ May need summarization for long chats
**8K tokens (~6,000 words)**:
- ✅ Fits on both GPUs
- ✅ Better for longer conversations
- ✅ Still fast inference
- ✅ Good balance
**16K tokens (~12,000 words)**:
- ✅ Fits on the 1050 with smaller models (1.5B)
- ❌ Does not fit on the 4080 with 70B models
- ✅ Fits on the 4080 with 33B models
## System Memory (RAM) Requirements

### RTX 4080 System
- Minimum: 16GB RAM
- Recommended: 32GB RAM
- For: Model loading, system processes, KV cache overflow
### RTX 1050 System
- Minimum: 8GB RAM
- Recommended: 16GB RAM
- For: Model loading, system processes, KV cache overflow
## Storage Requirements

### Model Files
| Model | Size (Q4) | Download Time | Storage |
|---|---|---|---|
| Llama 3.1 70B Q4 | ~40GB | ~2-4 hours | SSD recommended |
| DeepSeek Coder 33B Q4 | ~20GB | ~1-2 hours | SSD recommended |
| Phi-3 Mini 3.8B Q4 | ~2.5GB | ~5-10 minutes | Any storage |
| Qwen2.5 1.5B Q4 | ~1GB | ~2-5 minutes | Any storage |
**Total Storage Needed**: ~60-80GB for all models plus backups.
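The file sizes above follow directly from parameter count times bits per weight. A back-of-envelope sketch, assuming roughly 4.5 effective bits per weight for Q4 GGUF variants (an approximation; actual files vary with the quantization mix):

```python
def q4_file_size_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate GGUF file size: parameters x effective bits per weight."""
    return params_billion * bits_per_weight / 8  # billions of bytes ~= GB

for name, params in [("Llama 3.1 70B", 70), ("DeepSeek Coder 33B", 33),
                     ("Phi-3 Mini", 3.8), ("Qwen2.5 1.5B", 1.5)]:
    print(f"{name}: ~{q4_file_size_gb(params):.1f} GB")
```

The outputs (~39, ~19, ~2, ~1 GB) line up with the table's rounded figures.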
## Performance Impact of Context Size

### Latency vs Context Size

**RTX 4080 (70B Q4)**:
- 4K context: ~200ms first token, ~3s for 100 tokens
- 8K context: ~250ms first token, ~4s for 100 tokens
- 16K context: ~400ms first token, ~6s for 100 tokens (if fits)
**RTX 1050 (3.8B Q4)**:
- 4K context: ~50ms first token, ~1s for 100 tokens
- 8K context: ~70ms first token, ~1.2s for 100 tokens
- 16K context: ~100ms first token, ~1.5s for 100 tokens (if fits)
**Recommendation**: Keep context at 4K-8K tokens for optimal latency.
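The figures above imply a simple two-term latency model: time to first token, plus output length divided by decode throughput. A sketch; the ~33 tokens/s figure is derived from "100 tokens in ~3s" in the table, not measured:

```python
def response_time_s(n_tokens: int, ttft_s: float, tokens_per_s: float) -> float:
    """Wall-clock estimate: time-to-first-token plus steady-state decoding."""
    return ttft_s + n_tokens / tokens_per_s

# 70B Q4 at 4K context (RTX 4080): ~0.2s TTFT, ~33 tok/s per the table above
print(f"300-token reply: ~{response_time_s(300, 0.2, 33):.1f}s")  # ~9.3s
```

This makes the trade-off explicit: prompt length mostly moves the first term, while reply length dominates the second.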
## Recommendations

### For RTX 4080 (Work Agent)

- **Quantization**: Q4 (best balance of quality and VRAM)
- **Context window**: 4K-8K tokens (practical limit)
- **Model**: Llama 3.1 70B Q4 (primary) or DeepSeek Coder 33B Q4 (alternative)
- **Concurrency**: 2 requests maximum
- **Summarization**: implement for conversations over 8K tokens
### For RTX 1050 (Family Agent)

- **Quantization**: Q4 (the only option that fits)
- **Context window**: 4K-8K tokens (practical limit)
- **Model**: Phi-3 Mini 3.8B Q4 (primary) or Qwen2.5 1.5B Q4 (alternative)
- **Concurrency**: 1-2 requests maximum
- **Summarization**: implement for conversations over 8K tokens
## Next Steps
- ✅ Complete capacity assessment (TICKET-018)
- Finalize model selection based on this assessment (TICKET-019, TICKET-020)
- Test selected models on actual hardware
- Benchmark actual VRAM usage
- Adjust context windows based on real-world performance
**Last Updated**: 2024-01-XX
**Status**: Assessment complete; ready for model selection (TICKET-019, TICKET-020)