
Final Model Selection

Overview

This document finalizes the LLM model selections for the Atlas voice agent system based on the model survey (TICKET-017) and capacity assessment (TICKET-018).

Work Agent Model Selection (RTX 4080)

Selected Model: Llama 3.1 70B Q4

Rationale:

  • Best overall balance of coding and research capabilities
  • Excellent function calling support (required for MCP integration)
  • Fits comfortably in 16GB VRAM (~14GB usage)
  • Large context window (128K tokens, practical limit 8K)
  • Well-documented and widely supported
  • Strong performance for both coding and general research tasks

Specifications:

  • Model: meta-llama/Meta-Llama-3.1-70B-Instruct
  • Quantization: Q4 (4-bit)
  • VRAM Usage: ~14GB
  • Context Window: 8K tokens (practical limit)
  • Expected Latency: ~200-300ms first token, ~3-4s for 100 tokens
  • Concurrency: 2 requests maximum

Alternative Model:

  • DeepSeek Coder 33B Q4 - If coding is the primary focus
    • Faster inference (~100-200ms first token)
    • Lower VRAM usage (~8GB)
    • Larger practical context (16K tokens)
    • Less capable for general research

Model Source:

  • Hugging Face: meta-llama/Meta-Llama-3.1-70B-Instruct
  • Quantized version: Use llama.cpp or AutoGPTQ for Q4 quantization
  • Or use Ollama: ollama pull llama3.1:70b-q4_0
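Once the model is pulled, a quick smoke test can go straight at Ollama's default HTTP endpoint (`/api/generate` on port 11434) using only the standard library. This is a minimal sketch: the model tag matches the pull command above, and the prompt is an arbitrary example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_payload(model, prompt):
    """Build a non-streaming generate request for the Ollama HTTP API."""
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model, prompt, timeout=120):
    """Send one prompt to a locally running Ollama server and return the reply text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["response"]

if __name__ == "__main__":
    # Requires a running Ollama server with the model already pulled.
    print(generate("llama3.1:70b-q4_0", "Write a Python one-liner to reverse a list."))
```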

Performance Characteristics:

  • Coding: Excellent
  • Research: Excellent
  • Function Calling: Native support
  • Speed: Medium (acceptable for work tasks)

Family Agent Model Selection (RTX 1050)

Selected Model: Phi-3 Mini 3.8B Q4

Rationale:

  • Excellent instruction following (critical for family agent)
  • Very fast inference (<1s latency for interactive use)
  • Low VRAM usage (~2.5GB, comfortable margin)
  • Good function calling support
  • Large context window (128K tokens, practical limit 8K)
  • Microsoft-backed, well-maintained

Specifications:

  • Model: microsoft/Phi-3-mini-128k-instruct
  • Quantization: Q4 (4-bit)
  • VRAM Usage: ~2.5GB
  • Context Window: 8K tokens (practical limit)
  • Expected Latency: ~50-100ms first token, ~1-1.5s for 100 tokens
  • Concurrency: 1-2 requests maximum
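As a rough sanity check on the VRAM figure above: Q4 weights take about 0.5 bytes per parameter, plus runtime overhead for the KV cache and buffers. The ~0.6GB overhead constant below is an assumed allowance, not a measured value.

```python
def estimate_vram_gb(params_billions, bits_per_weight=4, overhead_gb=0.6):
    """Back-of-envelope VRAM estimate for a quantized model.

    Weights: params * bits / 8 gives GB per billion parameters.
    overhead_gb is an assumed allowance for KV cache and runtime buffers.
    """
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb + overhead_gb

# Phi-3 Mini 3.8B at Q4: ~1.9GB of weights + overhead, near the ~2.5GB figure above.
print(estimate_vram_gb(3.8))
```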

Alternative Model:

  • Qwen2.5 1.5B Q4 - If more VRAM headroom needed
    • Smaller VRAM footprint (~1.2GB)
    • Still fast inference
    • Slightly less capable than Phi-3 Mini

Model Source:

  • Hugging Face: microsoft/Phi-3-mini-128k-instruct
  • Quantized version: Use llama.cpp for Q4 quantization
  • Or use Ollama: ollama pull phi3:mini-q4_0

Performance Characteristics:

  • Instruction Following: Excellent
  • Function Calling: Native support
  • Speed: Very Fast (<1s latency)
  • Efficiency: High (low power consumption)

Selection Summary

| Agent  | Model           | Size | Quantization | VRAM   | Context | Latency (100 tokens) |
|--------|-----------------|------|--------------|--------|---------|----------------------|
| Work   | Llama 3.1 70B   | 70B  | Q4           | ~14GB  | 8K      | ~3-4s                |
| Family | Phi-3 Mini 3.8B | 3.8B | Q4           | ~2.5GB | 8K      | ~1-1.5s              |

Implementation Plan

Phase 1: Download and Test

  1. Download Llama 3.1 70B Q4 quantized model
  2. Download Phi-3 Mini 3.8B Q4 quantized model
  3. Test on actual hardware (4080 and 1050)
  4. Benchmark actual VRAM usage and latency
  5. Verify function calling support
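For step 4, Ollama's non-streaming responses already carry the counters needed to compute throughput: `eval_count` (generated tokens) and `eval_duration` (nanoseconds). A minimal helper for turning a response into tokens/sec:

```python
def throughput_tokens_per_sec(ollama_response):
    """Derive generation throughput from Ollama's eval counters.

    eval_count is the number of generated tokens; eval_duration is in nanoseconds.
    """
    return ollama_response["eval_count"] / (ollama_response["eval_duration"] / 1e9)

# Example: 100 tokens in 4 seconds -> 25 tokens/sec (the work agent's target range).
sample = {"eval_count": 100, "eval_duration": 4_000_000_000}
print(throughput_tokens_per_sec(sample))  # → 25.0
```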

Phase 2: Set Up Inference Servers

  1. Set up Ollama or vLLM for 4080 (TICKET-021)
  2. Set up llama.cpp or Ollama for 1050 (TICKET-022)
  3. Configure context windows (8K for both)
  4. Test concurrent request handling

Phase 3: Integration

  1. Integrate with MCP server (TICKET-030)
  2. Test function calling end-to-end
  3. Optimize based on real-world performance

Model Files Location

Recommended Structure:

models/
├── work-agent/
│   └── llama-3.1-70b-q4.gguf
├── family-agent/
│   └── phi-3-mini-3.8b-q4.gguf
└── backups/

Cost Analysis

Based on docs/LLM_USAGE_AND_COSTS.md:

  • Work Agent (4080): ~$1.08-1.80/month (2 hours/day usage)
  • Family Agent (1050): ~$1.44-2.40/month (always-on, 8 hours/day)
  • Total: ~$2.52-4.20/month
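These are electricity-only estimates. The wattage and tariff values below are assumptions chosen to reproduce the quoted ranges (roughly 300W for the 4080 under load, 100W for the always-on 1050 box, at $0.06-0.10/kWh); substitute measured draw and the local rate for real numbers.

```python
def monthly_cost_usd(watts, hours_per_day, rate_usd_per_kwh, days=30):
    """Electricity cost per month: kWh consumed times the tariff."""
    kwh = watts / 1000 * hours_per_day * days
    return kwh * rate_usd_per_kwh

# Assumed figures: 300W work GPU (2h/day), 100W family box (8h/day), $0.06-0.10/kWh.
print(monthly_cost_usd(300, 2, 0.06), monthly_cost_usd(300, 2, 0.10))  # work agent range
print(monthly_cost_usd(100, 8, 0.06), monthly_cost_usd(100, 8, 0.10))  # family agent range
```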

Next Steps

  1. Model selection complete (TICKET-019, TICKET-020)
  2. Download selected models
  3. Set up inference servers (TICKET-021, TICKET-022)
  4. Test and benchmark on actual hardware
  5. Integrate with MCP (TICKET-030)

References

  • Model Survey: docs/LLM_MODEL_SURVEY.md
  • Capacity Assessment: docs/LLM_CAPACITY.md
  • Usage & Costs: docs/LLM_USAGE_AND_COSTS.md

Last Updated: 2024-01-XX
Status: Selection Finalized - Ready for Implementation (TICKET-021, TICKET-022)