
Final Model Selection

Overview

This document finalizes the LLM model selections for the Atlas voice agent system based on the model survey (TICKET-017) and capacity assessment (TICKET-018).

Work Agent Model Selection (RTX 4080)

Selected Model: Llama 3.1 70B Q4

Rationale:

  • Best overall balance of coding and research capabilities
  • Excellent function calling support (required for MCP integration)
  • Fits comfortably in 16GB VRAM (~14GB usage)
  • Large context window (128K tokens, practical limit 8K)
  • Well-documented and widely supported
  • Strong performance for both coding and general research tasks

Specifications:

  • Model: meta-llama/Meta-Llama-3.1-70B-Instruct
  • Quantization: Q4 (4-bit)
  • VRAM Usage: ~14GB
  • Context Window: 8K tokens (practical limit)
  • Expected Latency: ~200-300ms first token, ~3-4s for 100 tokens
  • Concurrency: 2 requests maximum

Alternative Model:

  • DeepSeek Coder 33B Q4 - If coding is the primary focus
    • Faster inference (~100-200ms first token)
    • Lower VRAM usage (~8GB)
    • Larger practical context (16K tokens)
    • Less capable for general research

Model Source:

  • Hugging Face: meta-llama/Meta-Llama-3.1-70B-Instruct
  • Quantized version: Use llama.cpp or AutoGPTQ for Q4 quantization
  • Or use Ollama: ollama pull llama3.1:70b-q4_0
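Once the model is pulled, a quick smoke test can go straight at Ollama's default HTTP endpoint (`/api/generate` on port 11434) using only the standard library. This is a minimal sketch: the model tag matches the pull command above, and the prompt is an arbitrary example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(model, prompt):
    """Build a non-streaming generate request for the Ollama HTTP API."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model, prompt, timeout=120):
    """Send one prompt to a locally running Ollama server and return the reply text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["response"]

if __name__ == "__main__":
    # Requires a running Ollama server with the model already pulled.
    print(generate("llama3.1:70b-q4_0", "Write a Python one-liner to reverse a list."))
```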

Performance Characteristics:

  • Coding: Excellent
  • Research: Excellent
  • Function Calling: Native support
  • Speed: Medium (acceptable for work tasks)

Family Agent Model Selection (RTX 1050)

Selected Model: Phi-3 Mini 3.8B Q4

Rationale:

  • Excellent instruction following (critical for family agent)
  • Very fast inference (<1s latency for interactive use)
  • Low VRAM usage (~2.5GB, comfortable margin)
  • Good function calling support
  • Large context window (128K tokens, practical limit 8K)
  • Microsoft-backed, well-maintained

Specifications:

  • Model: microsoft/Phi-3-mini-128k-instruct
  • Quantization: Q4 (4-bit)
  • VRAM Usage: ~2.5GB
  • Context Window: 8K tokens (practical limit)
  • Expected Latency: ~50-100ms first token, ~1-1.5s for 100 tokens
  • Concurrency: 1-2 requests maximum
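As a rough sanity check on the VRAM figure above: Q4 weights take about 0.5 bytes per parameter, plus runtime overhead for the KV cache and buffers. The ~0.6GB overhead constant below is an assumed allowance, not a measured value.

```python
def estimate_vram_gb(params_billions, bits_per_weight=4, overhead_gb=0.6):
    """Back-of-envelope VRAM estimate for a quantized model.

    Weights: params * bits / 8 gives GB per billion parameters.
    overhead_gb is an assumed allowance for KV cache and runtime buffers.
    """
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb + overhead_gb

# Phi-3 Mini 3.8B at Q4: ~1.9GB of weights + overhead, near the ~2.5GB figure above.
print(estimate_vram_gb(3.8))
```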

Alternative Model:

  • Qwen2.5 1.5B Q4 - If more VRAM headroom needed
    • Smaller VRAM footprint (~1.2GB)
    • Still fast inference
    • Slightly less capable than Phi-3 Mini

Model Source:

  • Hugging Face: microsoft/Phi-3-mini-128k-instruct
  • Quantized version: Use llama.cpp for Q4 quantization
  • Or use Ollama: ollama pull phi3:mini-q4_0

Performance Characteristics:

  • Instruction Following: Excellent
  • Function Calling: Native support
  • Speed: Very Fast (<1s latency)
  • Efficiency: High (low power consumption)

Selection Summary

| Agent  | Model           | Size | Quantization | VRAM   | Context | Latency (100 tokens) |
|--------|-----------------|------|--------------|--------|---------|----------------------|
| Work   | Llama 3.1 70B   | 70B  | Q4           | ~14GB  | 8K      | ~3-4s                |
| Family | Phi-3 Mini 3.8B | 3.8B | Q4           | ~2.5GB | 8K      | ~1-1.5s              |

Implementation Plan

Phase 1: Download and Test

  1. Download Llama 3.1 70B Q4 quantized model
  2. Download Phi-3 Mini 3.8B Q4 quantized model
  3. Test on actual hardware (4080 and 1050)
  4. Benchmark actual VRAM usage and latency
  5. Verify function calling support
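For step 4, Ollama's non-streaming responses already carry the counters needed to compute throughput: `eval_count` (generated tokens) and `eval_duration` (nanoseconds). A minimal helper for turning a response into tokens/sec:

```python
def throughput_tokens_per_sec(ollama_response):
    """Derive generation throughput from Ollama's eval counters.

    eval_count is the number of generated tokens; eval_duration is in nanoseconds.
    """
    return ollama_response["eval_count"] / (ollama_response["eval_duration"] / 1e9)

# Example: 100 tokens in 4 seconds -> 25 tokens/sec (the work agent's target range).
sample = {"eval_count": 100, "eval_duration": 4_000_000_000}
print(throughput_tokens_per_sec(sample))  # → 25.0
```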

Phase 2: Set Up Inference Servers

  1. Set up Ollama or vLLM for 4080 (TICKET-021)
  2. Set up llama.cpp or Ollama for 1050 (TICKET-022)
  3. Configure context windows (8K for both)
  4. Test concurrent request handling

Phase 3: Integration

  1. Integrate with MCP server (TICKET-030)
  2. Test function calling end-to-end
  3. Optimize based on real-world performance

Model Files Location

Recommended Structure:

models/
├── work-agent/
│   └── llama-3.1-70b-q4.gguf
├── family-agent/
│   └── phi-3-mini-3.8b-q4.gguf
└── backups/

Cost Analysis

Based on docs/LLM_USAGE_AND_COSTS.md:

  • Work Agent (4080): ~$1.08-1.80/month (2 hours/day usage)
  • Family Agent (1050): ~$1.44-2.40/month (always-on, 8 hours/day)
  • Total: ~$2.52-4.20/month
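These are electricity-only estimates. The wattage and tariff values below are assumptions chosen to reproduce the quoted ranges (roughly 300W for the 4080 under load, 100W for the always-on 1050 box, at $0.06-0.10/kWh); substitute measured draw and the local rate for real numbers.

```python
def monthly_cost_usd(watts, hours_per_day, rate_usd_per_kwh, days=30):
    """Electricity cost per month: kWh consumed times the tariff."""
    kwh = watts / 1000 * hours_per_day * days
    return kwh * rate_usd_per_kwh

# Assumed figures: 300W work GPU (2h/day), 100W family box (8h/day), $0.06-0.10/kWh.
print(monthly_cost_usd(300, 2, 0.06), monthly_cost_usd(300, 2, 0.10))  # work agent range
print(monthly_cost_usd(100, 8, 0.06), monthly_cost_usd(100, 8, 0.10))  # family agent range
```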

Next Steps

  1. Model selection complete (TICKET-019, TICKET-020)
  2. Download selected models
  3. Set up inference servers (TICKET-021, TICKET-022)
  4. Test and benchmark on actual hardware
  5. Integrate with MCP (TICKET-030)

References

  • Model Survey: docs/LLM_MODEL_SURVEY.md
  • Capacity Assessment: docs/LLM_CAPACITY.md
  • Usage & Costs: docs/LLM_USAGE_AND_COSTS.md

Last Updated: 2024-01-XX
Status: Selection Finalized - Ready for Implementation (TICKET-021, TICKET-022)