
LLM Model Survey

Overview

This document surveys and evaluates open-weight LLM models for the Atlas voice agent system, with separate recommendations for the work agent (RTX 4080) and the family agent (GTX 1050).

Hardware Constraints:

  • RTX 4080: 16GB VRAM - Work agent, high-capability tasks
  • GTX 1050: 4GB VRAM - Family agent, always-on, low-latency

Evaluation Criteria

Work Agent (RTX 4080) Requirements

  • Coding capabilities: Code generation, debugging, code review
  • Research capabilities: Analysis, reasoning, documentation
  • Function calling: Must support tool/function calling for MCP integration
  • Context window: 8K-16K tokens minimum
  • VRAM fit: Must fit in 16GB with quantization
  • Performance: Reasonable latency (< 5s for typical responses)

Family Agent (GTX 1050) Requirements

  • Instruction following: Good at following conversational instructions
  • Function calling: Must support tool/function calling
  • Low latency: < 1s response time for interactive use
  • VRAM fit: Must fit in 4GB with quantization
  • Efficiency: Low power consumption for always-on operation
  • Context window: 4K-8K tokens sufficient
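Taken together, the two requirement sets above amount to a simple screening check. A minimal sketch in Python (the thresholds come from the bullets above; the model spec dicts are illustrative, not measured values):

```python
# Screening check for candidate models against the requirement bullets above.
# Thresholds mirror the two requirement lists; model specs are illustrative.

WORK_REQS = {"max_vram_gb": 16, "min_context_k": 8, "function_calling": True}
FAMILY_REQS = {"max_vram_gb": 4, "min_context_k": 4, "function_calling": True}

def meets_requirements(model: dict, reqs: dict) -> bool:
    """Return True if a model spec passes an agent's hard requirements."""
    if model["vram_gb"] > reqs["max_vram_gb"]:
        return False
    if model["context_k"] < reqs["min_context_k"]:
        return False
    if reqs["function_calling"] and not model["function_calling"]:
        return False
    return True

llama_31_70b_q4 = {"vram_gb": 14, "context_k": 128, "function_calling": True}
phi3_mini_q4 = {"vram_gb": 2.5, "context_k": 128, "function_calling": True}

print(meets_requirements(llama_31_70b_q4, WORK_REQS))    # fits the 4080 budget
print(meets_requirements(llama_31_70b_q4, FAMILY_REQS))  # too big for 4GB
print(meets_requirements(phi3_mini_q4, FAMILY_REQS))
```

Soft criteria (speed, instruction-following quality) still need the benchmarking step later in this document; this only filters out hard misfits.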

Model Comparison Matrix

RTX 4080 Candidates (Work Agent)

| Model | Size | Quantization | VRAM Usage | Coding | Research | Function Call | Context | Speed | Recommendation |
|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 70B | 70B | Q4 | ~14GB | Excellent | Excellent | Yes | 128K | Medium | Top Choice |
| Llama 3.1 70B | 70B | Q5 | ~16GB | Excellent | Excellent | Yes | 128K | Medium | Good quality |
| DeepSeek Coder 33B | 33B | Q4 | ~8GB | Excellent | Fair | Yes | 16K | Fast | Best for coding |
| Qwen 2.5 72B | 72B | Q4 | ~14GB | Good | Good | Yes | 32K | Medium | Strong alternative |
| Mistral Large 2 | 67B | Q4 | ~13GB | - | - | - | 128K | Medium | Good option |
| Llama 3.1 8B | 8B | Q4 | ~5GB | - | - | Yes | 128K | Very Fast | Too small for work |

Recommendation for 4080:

  1. Primary: Llama 3.1 70B Q4 - Best overall balance
  2. Alternative: DeepSeek Coder 33B Q4 - If coding is primary focus
  3. Fallback: Qwen 2.5 72B Q4 - Strong alternative

GTX 1050 Candidates (Family Agent)

| Model | Size | Quantization | VRAM Usage | Instruction | Function Call | Context | Speed | Latency | Recommendation |
|---|---|---|---|---|---|---|---|---|---|
| Phi-3 Mini 3.8B | 3.8B | Q4 | ~2.5GB | Excellent | Yes | 128K | Very Fast | <1s | Top Choice |
| TinyLlama 1.1B | 1.1B | Q4 | ~0.8GB | Limited | Basic | 2K | Extremely Fast | <0.5s | Lightweight option |
| Gemma 2B | 2B | Q4 | ~1.5GB | - | - | 8K | Very Fast | <0.8s | Good alternative |
| Qwen2.5 1.5B | 1.5B | Q4 | ~1.2GB | Good | Yes | 32K | Very Fast | <0.7s | Strong option |
| Phi-2 2.7B | 2.7B | Q4 | ~1.8GB | - | - | 2K | Fast | <1s | Older, less capable |
| Llama 3.2 3B | 3B | Q4 | ~2GB | - | - | 128K | Fast | <1s | Good but larger |

Recommendation for 1050:

  1. Primary: Phi-3 Mini 3.8B Q4 - Best instruction following, good speed
  2. Alternative: Qwen2.5 1.5B Q4 - Smaller, still capable
  3. Fallback: TinyLlama 1.1B Q4 - If VRAM is tight

Detailed Model Analysis

Work Agent Models

Llama 3.1 70B Q4/Q5

Pros:

  • Excellent coding and research capabilities
  • Large context window (128K tokens)
  • Strong function calling support
  • Well-documented and widely used
  • Good balance of quality and speed

Cons:

  • Q5 uses full 16GB (tight fit)
  • Slower than smaller models
  • Higher power consumption

VRAM Usage:

  • Q4: ~14GB (comfortable margin)
  • Q5: ~16GB (tight, but better quality)

Best For: General work tasks, coding, research, complex reasoning

DeepSeek Coder 33B Q4

Pros:

  • Excellent coding capabilities (specialized)
  • Faster than 70B models
  • Lower VRAM usage (~8GB)
  • Good function calling support
  • Strong for code generation and debugging

Cons:

  • Less capable for general research/analysis
  • Smaller context window (16K vs 128K)
  • Less general-purpose than Llama 3.1

Best For: Coding-focused work, code generation, debugging

Qwen 2.5 72B Q4

Pros:

  • Strong multilingual support
  • Good coding and research capabilities
  • Large context (32K tokens)
  • Competitive with Llama 3.1

Cons:

  • Less community support than Llama
  • Slightly less polished tool calling

Best For: Multilingual work, research, general tasks

Family Agent Models

Phi-3 Mini 3.8B Q4

Pros:

  • Excellent instruction following
  • Very fast inference (<1s)
  • Low VRAM usage (~2.5GB)
  • Good function calling support
  • Large context (128K tokens)
  • Microsoft-backed, well-maintained

Cons:

  • Slightly larger than alternatives
  • May be overkill for simple tasks

Best For: Family conversations, task management, general Q&A

Qwen2.5 1.5B Q4

Pros:

  • Very small VRAM footprint (~1.2GB)
  • Fast inference
  • Good instruction following
  • Large context (32K tokens)
  • Efficient for always-on use

Cons:

  • Less capable than Phi-3 Mini
  • May struggle with complex requests

Best For: Lightweight always-on agent, simple tasks

TinyLlama 1.1B Q4

Pros:

  • Extremely small (~0.8GB VRAM)
  • Very fast inference
  • Minimal resource usage

Cons:

  • Limited capabilities
  • Small context window (2K tokens)
  • May not handle complex conversations well

Best For: Very resource-constrained scenarios

Quantization Comparison

Q4 (4-bit)

  • Quality: ~95-98% of full precision
  • VRAM: ~50% of the 8-bit baseline
  • Speed: Fast
  • Recommendation: Use for both agents

Q5 (5-bit)

  • Quality: ~98-99% of full precision
  • VRAM: ~62% of the 8-bit baseline
  • Speed: Slightly slower than Q4
  • Recommendation: Consider for the 4080 if quality is critical

Q6 (6-bit)

  • Quality: ~99% of full precision
  • VRAM: ~75% of the 8-bit baseline
  • Speed: Slower
  • Recommendation: Not recommended (marginal quality gain over Q5)

Q8 (8-bit)

  • Quality: Near full precision
  • VRAM: 100% (the baseline)
  • Speed: Slowest
  • Recommendation: Not recommended (does not fit within the VRAM constraints)
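These figures can be cross-checked with a back-of-the-envelope formula: quantized weights take roughly bits/8 bytes per parameter, plus runtime overhead for the KV cache and activations. A sketch (the 20% overhead factor is an illustrative assumption, not a measured value):

```python
def estimate_vram_gb(params_billions: float, bits: int, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight bytes (bits/8 per parameter) plus overhead."""
    weight_gb = params_billions * bits / 8
    return round(weight_gb * overhead, 1)

# Phi-3 Mini at Q4: 3.8B params * 0.5 bytes/param ≈ 1.9GB of weights,
# ~2.3GB with overhead, in line with the ~2.5GB family-agent table figure.
print(estimate_vram_gb(3.8, 4))
```

Real usage also depends on context length (the KV cache grows with it), so treat this as a lower bound when planning for long contexts.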

Function Calling Support

All recommended models support function calling, though to varying degrees:

  • Llama 3.1: Native function calling via tools parameter
  • DeepSeek Coder: Function calling support
  • Qwen 2.5: Function calling support
  • Phi-3 Mini: Function calling support
  • TinyLlama: Basic function calling (may need fine-tuning)
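When served through an OpenAI-compatible endpoint (which Ollama and vLLM both provide), tool calling uses the `tools` field of the chat request. A minimal request-body sketch (the model tag and the `set_timer` tool are illustrative, not part of any planned Atlas API):

```python
import json

# Illustrative tool definition in the OpenAI-compatible schema accepted by
# Ollama and vLLM. The set_timer function itself is hypothetical.
tools = [{
    "type": "function",
    "function": {
        "name": "set_timer",
        "description": "Start a kitchen timer.",
        "parameters": {
            "type": "object",
            "properties": {"minutes": {"type": "integer"}},
            "required": ["minutes"],
        },
    },
}]

request_body = {
    "model": "llama3.1:70b",  # illustrative Ollama model tag
    "messages": [{"role": "user", "content": "Set a timer for 10 minutes"}],
    "tools": tools,
}

# The serialized body is what gets POSTed to /v1/chat/completions.
print(json.dumps(request_body, indent=2))
```

The model replies with a `tool_calls` entry naming the function and its arguments, which the MCP layer then executes.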

Performance Benchmarks (Estimated)

RTX 4080 (16GB VRAM)

| Model | Tokens/sec | Latency (first token) | Latency (100 tokens) |
|---|---|---|---|
| Llama 3.1 70B Q4 | ~25-35 | ~200-300ms | ~3-4s |
| Llama 3.1 70B Q5 | ~20-30 | ~250-350ms | ~3.5-5s |
| DeepSeek Coder 33B Q4 | ~40-60 | ~100-200ms | ~2-3s |
| Qwen 2.5 72B Q4 | ~25-35 | ~200-300ms | ~3-4s |

GTX 1050 (4GB VRAM)

| Model | Tokens/sec | Latency (first token) | Latency (100 tokens) |
|---|---|---|---|
| Phi-3 Mini 3.8B Q4 | ~80-120 | ~50-100ms | ~1-1.5s |
| Qwen2.5 1.5B Q4 | ~100-150 | ~30-60ms | ~0.7-1s |
| TinyLlama 1.1B Q4 | ~150-200 | ~20-40ms | ~0.5-0.7s |
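The 100-token latency column is just first-token latency plus steady-state generation time; a quick sketch of the arithmetic:

```python
def total_latency_s(first_token_ms: float, tokens_per_sec: float,
                    n_tokens: int = 100) -> float:
    """Time to first token plus generation time for n_tokens at a steady rate."""
    return round(first_token_ms / 1000 + n_tokens / tokens_per_sec, 1)

# Llama 3.1 70B Q4 at the fast end of its range: ~35 tok/s, ~200ms first token
print(total_latency_s(200, 35))   # low end of the ~3-4s estimate
# Phi-3 Mini at ~100 tok/s, ~75ms first token
print(total_latency_s(75, 100))
```

For interactive voice use, first-token latency matters more than total time, since speech synthesis can begin streaming as soon as tokens arrive.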

Final Recommendations

Work Agent (RTX 4080)

Primary Choice: Llama 3.1 70B Q4

  • Best overall capabilities
  • Fits comfortably in 16GB VRAM
  • Excellent for coding, research, and general work tasks
  • Strong function calling support
  • Large context window (128K)

Alternative: DeepSeek Coder 33B Q4

  • If coding is the primary use case
  • Faster inference
  • Lower VRAM usage allows for more headroom

Family Agent (GTX 1050)

Primary Choice: Phi-3 Mini 3.8B Q4

  • Excellent instruction following
  • Fast inference (<1s latency)
  • Low VRAM usage (~2.5GB)
  • Good function calling support
  • Large context window (128K)

Alternative: Qwen2.5 1.5B Q4

  • If VRAM is very tight
  • Still capable for simple tasks
  • Very fast inference
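The primary/alternative/fallback ordering above can be encoded as a preference list that degrades gracefully based on available VRAM. A sketch (the model tags are illustrative Ollama-style names; the VRAM figures are this survey's estimates):

```python
# Preference-ordered candidates from the recommendations above:
# (illustrative model tag, estimated VRAM in GB).
WORK_CANDIDATES = [
    ("llama3.1:70b-q4", 14.0),      # primary
    ("deepseek-coder:33b-q4", 8.0), # alternative
]
FAMILY_CANDIDATES = [
    ("phi3:mini-q4", 2.5),          # primary
    ("qwen2.5:1.5b-q4", 1.2),       # alternative
    ("tinyllama:1.1b-q4", 0.8),     # fallback
]

def pick_model(candidates: list, free_vram_gb: float):
    """Return the first (most preferred) model that fits in free VRAM."""
    for name, vram_gb in candidates:
        if vram_gb <= free_vram_gb:
            return name
    return None

print(pick_model(WORK_CANDIDATES, 16.0))
print(pick_model(FAMILY_CANDIDATES, 2.0))  # Phi-3 does not fit; fall back
```

In practice free VRAM should be probed at startup (e.g. via `nvidia-smi`) rather than hard-coded, since the ASR and TTS components also claim GPU memory.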

Implementation Notes

Model Sources

  • Hugging Face: Primary source for all models
  • Ollama: Pre-configured models (easier setup)
  • Direct download: For custom quantization

Inference Servers

  • Ollama: Easiest setup, good for prototyping
  • vLLM: Best throughput, batching support
  • llama.cpp: Lightweight, efficient, good for 1050

Quantization Tools

  • llama.cpp: Built-in quantization
  • AutoGPTQ: For GPTQ quantization
  • AWQ: Alternative quantization method

Next Steps

  1. Complete this survey (TICKET-017)
  2. Complete capacity assessment (TICKET-018)
  3. Finalize model selection (TICKET-019, TICKET-020)
  4. Download and test selected models
  5. Benchmark on actual hardware
  6. Set up inference servers (TICKET-021, TICKET-022)

Last Updated: 2024-01-XX
Status: Survey Complete - Ready for TICKET-018 (Capacity Assessment)