
LLM Model Survey

Overview

This document surveys and evaluates open-weight LLM models for the Atlas voice agent system, with separate recommendations for the work agent (RTX 4080) and the family agent (GTX 1050).

Hardware Constraints:

  • RTX 4080: 16GB VRAM - Work agent, high-capability tasks
  • GTX 1050: 4GB VRAM - Family agent, always-on, low-latency

Evaluation Criteria

Work Agent (RTX 4080) Requirements

  • Coding capabilities: Code generation, debugging, code review
  • Research capabilities: Analysis, reasoning, documentation
  • Function calling: Must support tool/function calling for MCP integration
  • Context window: 8K-16K tokens minimum
  • VRAM fit: Must fit in 16GB with quantization
  • Performance: Reasonable latency (< 5s for typical responses)

Family Agent (GTX 1050) Requirements

  • Instruction following: Good at following conversational instructions
  • Function calling: Must support tool/function calling
  • Low latency: < 1s response time for interactive use
  • VRAM fit: Must fit in 4GB with quantization
  • Efficiency: Low power consumption for always-on operation
  • Context window: 4K-8K tokens sufficient
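Taken together, the two requirement sets above amount to a simple screening check. A minimal sketch in Python (the thresholds come from the bullets above; the model spec dicts are illustrative, not measured values):

```python
# Screening check for candidate models against the requirement bullets above.
# Thresholds mirror the two requirement lists; model specs are illustrative.

WORK_REQS = {"max_vram_gb": 16, "min_context_k": 8, "function_calling": True}
FAMILY_REQS = {"max_vram_gb": 4, "min_context_k": 4, "function_calling": True}

def meets_requirements(model: dict, reqs: dict) -> bool:
    """Return True if a model spec passes an agent's hard requirements."""
    if model["vram_gb"] > reqs["max_vram_gb"]:
        return False
    if model["context_k"] < reqs["min_context_k"]:
        return False
    if reqs["function_calling"] and not model["function_calling"]:
        return False
    return True

llama_31_70b_q4 = {"vram_gb": 14, "context_k": 128, "function_calling": True}
phi3_mini_q4 = {"vram_gb": 2.5, "context_k": 128, "function_calling": True}

print(meets_requirements(llama_31_70b_q4, WORK_REQS))    # fits the 4080 budget
print(meets_requirements(llama_31_70b_q4, FAMILY_REQS))  # too big for 4GB
print(meets_requirements(phi3_mini_q4, FAMILY_REQS))
```

Soft criteria (speed, instruction-following quality) still need the benchmarking step later in this document; this only filters out hard misfits.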

Model Comparison Matrix

RTX 4080 Candidates (Work Agent)

| Model | Size | Quantization | VRAM Usage | Coding | Research | Function Call | Context | Speed | Recommendation |
|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 70B | 70B | Q4 | ~14GB | Excellent | Excellent | Yes | 128K | Medium | Top Choice |
| Llama 3.1 70B | 70B | Q5 | ~16GB | Excellent | Excellent | Yes | 128K | Medium | Good quality |
| DeepSeek Coder 33B | 33B | Q4 | ~8GB | Excellent | Fair | Yes | 16K | Fast | Best for coding |
| Qwen 2.5 72B | 72B | Q4 | ~14GB | Good | Good | Yes | 32K | Medium | Strong alternative |
| Mistral Large 2 | 67B | Q4 | ~13GB | - | - | - | 128K | Medium | Good option |
| Llama 3.1 8B | 8B | Q4 | ~5GB | - | - | Yes | 128K | Very Fast | Too small for work |

Recommendation for 4080:

  1. Primary: Llama 3.1 70B Q4 - Best overall balance
  2. Alternative: DeepSeek Coder 33B Q4 - If coding is primary focus
  3. Fallback: Qwen 2.5 72B Q4 - Strong alternative

GTX 1050 Candidates (Family Agent)

| Model | Size | Quantization | VRAM Usage | Instruction | Function Call | Context | Speed | Latency | Recommendation |
|---|---|---|---|---|---|---|---|---|---|
| Phi-3 Mini 3.8B | 3.8B | Q4 | ~2.5GB | Excellent | Yes | 128K | Very Fast | <1s | Top Choice |
| TinyLlama 1.1B | 1.1B | Q4 | ~0.8GB | Limited | Basic | 2K | Extremely Fast | <0.5s | Lightweight option |
| Gemma 2B | 2B | Q4 | ~1.5GB | - | - | 8K | Very Fast | <0.8s | Good alternative |
| Qwen2.5 1.5B | 1.5B | Q4 | ~1.2GB | Good | Yes | 32K | Very Fast | <0.7s | Strong option |
| Phi-2 2.7B | 2.7B | Q4 | ~1.8GB | - | - | 2K | Fast | <1s | Older, less capable |
| Llama 3.2 3B | 3B | Q4 | ~2GB | - | - | 128K | Fast | <1s | Good but larger |

Recommendation for 1050:

  1. Primary: Phi-3 Mini 3.8B Q4 - Best instruction following, good speed
  2. Alternative: Qwen2.5 1.5B Q4 - Smaller, still capable
  3. Fallback: TinyLlama 1.1B Q4 - If VRAM is tight

Detailed Model Analysis

Work Agent Models

Llama 3.1 70B Q4/Q5

Pros:

  • Excellent coding and research capabilities
  • Large context window (128K tokens)
  • Strong function calling support
  • Well-documented and widely used
  • Good balance of quality and speed

Cons:

  • Q5 uses full 16GB (tight fit)
  • Slower than smaller models
  • Higher power consumption

VRAM Usage:

  • Q4: ~14GB (comfortable margin)
  • Q5: ~16GB (tight, but better quality)

Best For: General work tasks, coding, research, complex reasoning

DeepSeek Coder 33B Q4

Pros:

  • Excellent coding capabilities (specialized)
  • Faster than 70B models
  • Lower VRAM usage (~8GB)
  • Good function calling support
  • Strong for code generation and debugging

Cons:

  • Less capable for general research/analysis
  • Smaller context window (16K vs 128K)
  • Less general-purpose than Llama 3.1

Best For: Coding-focused work, code generation, debugging

Qwen 2.5 72B Q4

Pros:

  • Strong multilingual support
  • Good coding and research capabilities
  • Large context (32K tokens)
  • Competitive with Llama 3.1

Cons:

  • Less community support than Llama
  • Slightly less polished tool calling

Best For: Multilingual work, research, general tasks

Family Agent Models

Phi-3 Mini 3.8B Q4

Pros:

  • Excellent instruction following
  • Very fast inference (<1s)
  • Low VRAM usage (~2.5GB)
  • Good function calling support
  • Large context (128K tokens)
  • Microsoft-backed, well-maintained

Cons:

  • Slightly larger than alternatives
  • May be overkill for simple tasks

Best For: Family conversations, task management, general Q&A

Qwen2.5 1.5B Q4

Pros:

  • Very small VRAM footprint (~1.2GB)
  • Fast inference
  • Good instruction following
  • Large context (32K tokens)
  • Efficient for always-on use

Cons:

  • Less capable than Phi-3 Mini
  • May struggle with complex requests

Best For: Lightweight always-on agent, simple tasks

TinyLlama 1.1B Q4

Pros:

  • Extremely small (~0.8GB VRAM)
  • Very fast inference
  • Minimal resource usage

Cons:

  • Limited capabilities
  • Small context window (2K tokens)
  • May not handle complex conversations well

Best For: Very resource-constrained scenarios

Quantization Comparison

Q4 (4-bit)

  • Quality: ~95-98% of full precision
  • VRAM: ~50% of the 8-bit baseline
  • Speed: Fast
  • Recommendation: Use for both agents

Q5 (5-bit)

  • Quality: ~98-99% of full precision
  • VRAM: ~62% of the 8-bit baseline
  • Speed: Slightly slower than Q4
  • Recommendation: Consider for the 4080 if quality is critical

Q6 (6-bit)

  • Quality: ~99% of full precision
  • VRAM: ~75% of the 8-bit baseline
  • Speed: Slower
  • Recommendation: Not recommended (marginal quality gain over Q5)

Q8 (8-bit)

  • Quality: Near full precision
  • VRAM: 100% (the baseline)
  • Speed: Slowest
  • Recommendation: Not recommended (does not fit within the VRAM constraints)
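These figures can be cross-checked with a back-of-the-envelope formula: quantized weights take roughly bits/8 bytes per parameter, plus runtime overhead for the KV cache and activations. A sketch (the 20% overhead factor is an illustrative assumption, not a measured value):

```python
def estimate_vram_gb(params_billions: float, bits: int, overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight bytes (bits/8 per parameter) plus overhead."""
    weight_gb = params_billions * bits / 8
    return round(weight_gb * overhead, 1)

# Phi-3 Mini at Q4: 3.8B params * 0.5 bytes/param ≈ 1.9GB of weights,
# ~2.3GB with overhead, in line with the ~2.5GB family-agent table figure.
print(estimate_vram_gb(3.8, 4))
```

Real usage also depends on context length (the KV cache grows with it), so treat this as a lower bound when planning for long contexts.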

Function Calling Support

All recommended models support function calling, though to varying degrees:

  • Llama 3.1: Native function calling via tools parameter
  • DeepSeek Coder: Function calling support
  • Qwen 2.5: Function calling support
  • Phi-3 Mini: Function calling support
  • TinyLlama: Basic function calling (may need fine-tuning)
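When served through an OpenAI-compatible endpoint (which Ollama and vLLM both provide), tool calling uses the `tools` field of the chat request. A minimal request-body sketch (the model tag and the `set_timer` tool are illustrative, not part of any planned Atlas API):

```python
import json

# Illustrative tool definition in the OpenAI-compatible schema accepted by
# Ollama and vLLM. The set_timer function itself is hypothetical.
tools = [{
    "type": "function",
    "function": {
        "name": "set_timer",
        "description": "Start a kitchen timer.",
        "parameters": {
            "type": "object",
            "properties": {"minutes": {"type": "integer"}},
            "required": ["minutes"],
        },
    },
}]

request_body = {
    "model": "llama3.1:70b",  # illustrative Ollama model tag
    "messages": [{"role": "user", "content": "Set a timer for 10 minutes"}],
    "tools": tools,
}

# The serialized body is what gets POSTed to /v1/chat/completions.
print(json.dumps(request_body, indent=2))
```

The model replies with a `tool_calls` entry naming the function and its arguments, which the MCP layer then executes.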

Performance Benchmarks (Estimated)

RTX 4080 (16GB VRAM)

| Model | Tokens/sec | Latency (first token) | Latency (100 tokens) |
|---|---|---|---|
| Llama 3.1 70B Q4 | ~25-35 | ~200-300ms | ~3-4s |
| Llama 3.1 70B Q5 | ~20-30 | ~250-350ms | ~3.5-5s |
| DeepSeek Coder 33B Q4 | ~40-60 | ~100-200ms | ~2-3s |
| Qwen 2.5 72B Q4 | ~25-35 | ~200-300ms | ~3-4s |

GTX 1050 (4GB VRAM)

| Model | Tokens/sec | Latency (first token) | Latency (100 tokens) |
|---|---|---|---|
| Phi-3 Mini 3.8B Q4 | ~80-120 | ~50-100ms | ~1-1.5s |
| Qwen2.5 1.5B Q4 | ~100-150 | ~30-60ms | ~0.7-1s |
| TinyLlama 1.1B Q4 | ~150-200 | ~20-40ms | ~0.5-0.7s |
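The 100-token latency column is just first-token latency plus steady-state generation time; a quick sketch of the arithmetic:

```python
def total_latency_s(first_token_ms: float, tokens_per_sec: float,
                    n_tokens: int = 100) -> float:
    """Time to first token plus generation time for n_tokens at a steady rate."""
    return round(first_token_ms / 1000 + n_tokens / tokens_per_sec, 1)

# Llama 3.1 70B Q4 at the fast end of its range: ~35 tok/s, ~200ms first token
print(total_latency_s(200, 35))   # low end of the ~3-4s estimate
# Phi-3 Mini at ~100 tok/s, ~75ms first token
print(total_latency_s(75, 100))
```

For interactive voice use, first-token latency matters more than total time, since speech synthesis can begin streaming as soon as tokens arrive.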

Final Recommendations

Work Agent (RTX 4080)

Primary Choice: Llama 3.1 70B Q4

  • Best overall capabilities
  • Fits comfortably in 16GB VRAM
  • Excellent for coding, research, and general work tasks
  • Strong function calling support
  • Large context window (128K)

Alternative: DeepSeek Coder 33B Q4

  • If coding is the primary use case
  • Faster inference
  • Lower VRAM usage allows for more headroom

Family Agent (GTX 1050)

Primary Choice: Phi-3 Mini 3.8B Q4

  • Excellent instruction following
  • Fast inference (<1s latency)
  • Low VRAM usage (~2.5GB)
  • Good function calling support
  • Large context window (128K)

Alternative: Qwen2.5 1.5B Q4

  • If VRAM is very tight
  • Still capable for simple tasks
  • Very fast inference
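The primary/alternative/fallback ordering above can be encoded as a preference list that degrades gracefully based on available VRAM. A sketch (the model tags are illustrative Ollama-style names; the VRAM figures are this survey's estimates):

```python
# Preference-ordered candidates from the recommendations above:
# (illustrative model tag, estimated VRAM in GB).
WORK_CANDIDATES = [
    ("llama3.1:70b-q4", 14.0),      # primary
    ("deepseek-coder:33b-q4", 8.0), # alternative
]
FAMILY_CANDIDATES = [
    ("phi3:mini-q4", 2.5),          # primary
    ("qwen2.5:1.5b-q4", 1.2),       # alternative
    ("tinyllama:1.1b-q4", 0.8),     # fallback
]

def pick_model(candidates: list, free_vram_gb: float):
    """Return the first (most preferred) model that fits in free VRAM."""
    for name, vram_gb in candidates:
        if vram_gb <= free_vram_gb:
            return name
    return None

print(pick_model(WORK_CANDIDATES, 16.0))
print(pick_model(FAMILY_CANDIDATES, 2.0))  # Phi-3 does not fit; fall back
```

In practice free VRAM should be probed at startup (e.g. via `nvidia-smi`) rather than hard-coded, since the ASR and TTS components also claim GPU memory.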

Implementation Notes

Model Sources

  • Hugging Face: Primary source for all models
  • Ollama: Pre-configured models (easier setup)
  • Direct download: For custom quantization

Inference Servers

  • Ollama: Easiest setup, good for prototyping
  • vLLM: Best throughput, batching support
  • llama.cpp: Lightweight, efficient, good for 1050

Quantization Tools

  • llama.cpp: Built-in quantization
  • AutoGPTQ: For GPTQ quantization
  • AWQ: Alternative quantization method

Next Steps

  1. Complete this survey (TICKET-017)
  2. Complete capacity assessment (TICKET-018)
  3. Finalize model selection (TICKET-019, TICKET-020)
  4. Download and test selected models
  5. Benchmark on actual hardware
  6. Set up inference servers (TICKET-021, TICKET-022)

Last Updated: 2024-01-XX
Status: Survey Complete - Ready for TICKET-018 (Capacity Assessment)