Architecture Documentation
Overview
This document describes the architecture, design patterns, and technical decisions for the Atlas home voice agent project.
Atlas is a local, privacy-focused voice agent system with separate work and family agents, running on dedicated hardware (RTX 4080 for the work agent, GTX 1050 for the family agent).
System Architecture
High-Level Design
The system consists of 5 parallel tracks:
- Voice I/O: Wake-word detection, ASR (Automatic Speech Recognition), TTS (Text-to-Speech)
- LLM Infrastructure: Two separate LLM servers (4080 for work, 1050 for family)
- Tools/MCP: Model Context Protocol (MCP) tool servers for weather, tasks, notes, etc.
- Clients/UI: Phone PWA and web LAN dashboard
- Safety/Memory: Long-term memory, conversation management, safety controls
Component Architecture
┌─────────────────────────────────────────────────────────────┐
│ Clients Layer │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Phone PWA │ │ Web Dashboard│ │
│ └──────┬───────┘ └──────┬───────┘ │
└─────────┼──────────────────────────────┼──────────────────┘
│ │
│ WebSocket/HTTP │
│ │
┌─────────┼──────────────────────────────┼──────────────────┐
│ │ Voice Stack │ │
│ ┌──────▼──────┐ ┌──────────┐ ┌─────▼──────┐ │
│ │ Wake-Word │ │ ASR │ │ TTS │ │
│ │ Node │─▶│ Service │ │ Service │ │
│ └─────────────┘ └────┬─────┘ └────────────┘ │
└────────────────────────┼────────────────────────────────┘
│
┌────────────────────────┼────────────────────────────────┐
│ │ LLM Infrastructure │
│ ┌──────▼──────┐ ┌──────────────┐ │
│ │ 4080 Server │ │ 1050 Server │ │
│ │ (Work Agent)│ │(Family Agent)│ │
│ └──────┬──────┘ └──────┬───────┘ │
│ │ │ │
│ └────────────┬───────────────┘ │
│ │ │
│ ┌───────▼────────┐ │
│ │ Routing Layer │ │
│ └───────┬────────┘ │
└──────────────────────┼─────────────────────────────────┘
│
┌──────────────────────┼─────────────────────────────────┐
│ │ MCP Tools Layer │
│ ┌──────▼──────────┐ │
│ │ MCP Server │ │
│ │ ┌───────────┐ │ │
│ │ │ Weather │ │ │
│ │ │ Tasks │ │ │
│ │ │ Timers │ │ │
│ │ │ Notes │ │ │
│ │ └───────────┘ │ │
│ └──────────────────┘ │
└────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ │ Safety & Memory │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Memory │ │ Boundaries │ │
│ │ Store │ │ Enforcement │ │
│ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────┘
Technology Stack
- Languages: Python (backend services), TypeScript/JavaScript (clients)
- LLM Servers: Ollama, vLLM, or llama.cpp
- ASR: faster-whisper or whisper.cpp
- TTS: Piper, Mimic 3, or Coqui TTS
- Wake-Word: openWakeWord (see docs/WAKE_WORD_EVALUATION.md for details)
- Protocols: MCP (Model Context Protocol), WebSocket, HTTP/gRPC
- Storage: SQLite (memory, sessions), Markdown files (tasks, notes)
- Infrastructure: Docker, systemd, Linux
TTS Selection
For initial development, Piper has been selected as the primary Text-to-Speech (TTS) engine. This decision is based on its high performance, low resource requirements, and permissive license, which are ideal for prototyping and early-stage implementation. Coqui TTS is identified as a potential future upgrade for a high-quality voice when more resources can be allocated.
For a detailed comparison of all evaluated options, see the TTS Evaluation document.
Design Patterns
Core Patterns
- Microservices Architecture: Separate services for wake-word, ASR, TTS, LLM servers, MCP tools
- Event-Driven: Wake-word events trigger ASR capture, tool calls trigger actions
- API Gateway Pattern: Routing layer directs requests to appropriate LLM server
- Repository Pattern: Separate config repo for family agent (no work content)
- Tool Pattern: MCP tools as independent, composable services
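The API gateway pattern above can be sketched as a small dispatcher that maps an agent type to its upstream LLM server. This is an illustrative sketch only; the hostnames, port, and `route_request` name are assumptions, not taken from the actual codebase.

```python
# Hypothetical routing-layer sketch: pick an LLM server endpoint by agent type.
# Hostnames and ports are placeholders for illustration.

LLM_SERVERS = {
    "work": "http://4080-server.lan:8000/v1/chat/completions",
    "family": "http://1050-server.lan:8000/v1/chat/completions",
}

def route_request(agent_type: str) -> str:
    """Return the upstream LLM endpoint for the given agent type."""
    try:
        return LLM_SERVERS[agent_type]
    except KeyError:
        raise ValueError(f"unknown agent type: {agent_type!r}")
```

Keeping the mapping in one place makes it easy to add health checks or failover later without touching the callers.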
Architectural Patterns
- Separation of Concerns: Clear boundaries between work and family agents
- Layered Architecture: Clients → Voice Stack → LLM Infrastructure → MCP Tools → Safety/Memory
- Service-Oriented: Each component is an independent service with defined APIs
- Privacy by Design: Local processing, minimal external dependencies
Project Structure
Repository Structure
This project uses a mono-repo for the main application code and a separate repository for family-specific configurations, ensuring a clean separation of concerns.
home-voice-agent (Mono-repo)
This repository contains all the code for the voice agent, its services, and clients.
home-voice-agent/
├── llm-servers/ # LLM inference servers
│ ├── 4080/ # Work agent server (e.g., Llama 70B)
│ └── 1050/ # Family agent server (e.g., Phi-2)
├── mcp-server/ # MCP (Model Context Protocol) tool server
│ └── tools/ # Individual tool implementations (e.g., weather, time)
├── wake-word/ # Wake-word detection node
├── asr/ # ASR (Automatic Speech Recognition) service
├── tts/ # TTS (Text-to-Speech) service
├── clients/ # Front-end applications
│ ├── phone/ # Phone PWA (Progressive Web App)
│ └── web-dashboard/ # Web-based administration dashboard
├── routing/ # LLM routing layer to direct requests
├── conversation/ # Conversation management and history
├── memory/ # Long-term memory storage and retrieval
├── safety/ # Safety, boundary enforcement, and content filtering
├── admin/ # Administration and monitoring tools
└── infrastructure/ # Deployment scripts, Dockerfiles, and IaC
family-agent-config (Configuration Repo)
This repository stores all personal and family-related configurations. It is kept separate to maintain privacy and prevent work-related data from mixing with family data.
family-agent-config/
├── prompts/ # System prompts and character definitions
├── tools/ # Tool configurations and settings
├── secrets/ # Credentials and API keys (e.g., weather API)
└── tasks/ # Markdown-based Kanban board for home tasks
└── home/ # Tasks for the home
Atlas Project (This Repo)
atlas/
├── tickets/ # Kanban tickets
│ ├── backlog/ # Future work
│ ├── todo/ # Ready to work on
│ ├── in-progress/ # Active work
│ ├── review/ # Awaiting review
│ └── done/ # Completed
├── docs/ # Documentation
├── ARCHITECTURE.md # This file
└── README.md # Project overview
Data Models
Memory Schema
Long-term memory stores personal facts, preferences, and routines:
MemoryEntry:
- id: str
- category: str # personal, family, preferences, routines
- content: str
- timestamp: datetime
- confidence: float
- source: str # conversation, explicit, inferred
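The schema above maps directly onto a Python dataclass; a minimal sketch (field names follow the schema, the class itself is illustrative):

```python
# Illustrative dataclass mirroring the MemoryEntry schema.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class MemoryEntry:
    id: str
    category: str       # personal, family, preferences, routines
    content: str
    timestamp: datetime
    confidence: float   # how sure the system is this fact is true
    source: str         # conversation, explicit, inferred
```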
Conversation Session
Session:
- session_id: str
- agent_type: str # "work" or "family"
- messages: List[Message]
- created_at: datetime
- last_activity: datetime
- summary: str # after summarization
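A session model in the same style, with an idle check that a session manager could use to decide when to summarize and archive. The 30-minute idle cutoff and the `is_idle` helper are assumptions for illustration:

```python
# Illustrative Session model; the idle timeout is an assumed default.
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List

@dataclass
class Message:
    role: str      # "user" or "assistant"
    content: str

@dataclass
class Session:
    session_id: str
    agent_type: str                # "work" or "family"
    messages: List[Message] = field(default_factory=list)
    created_at: datetime = field(default_factory=datetime.now)
    last_activity: datetime = field(default_factory=datetime.now)
    summary: str = ""              # filled in after summarization

    def is_idle(self, timeout: timedelta = timedelta(minutes=30)) -> bool:
        """True once no activity has arrived within the timeout window."""
        return datetime.now() - self.last_activity > timeout
```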
Task Model (Markdown Kanban)
---
id: TICKET-XXX
title: Task title
status: backlog|todo|in-progress|review|done
priority: high|medium|low
created: YYYY-MM-DD
updated: YYYY-MM-DD
assignee: name
tags: [tag1, tag2]
---
Task description...
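Reading the Markdown Kanban format amounts to splitting off the front matter and parsing its key/value lines. A stdlib-only sketch (a real implementation would likely use a YAML library):

```python
# Minimal front-matter parser for the task format above (stdlib only).
# Handles flat "key: value" lines; nested YAML is out of scope for this sketch.

def parse_task(text: str) -> tuple[dict, str]:
    """Split a task file into (front-matter dict, markdown body)."""
    _, raw_meta, body = text.split("---\n", 2)
    meta = {}
    for line in raw_meta.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta, body.strip()
```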
MCP Tool Definition
{
"name": "tool_name",
"description": "Tool description",
"inputSchema": {
"type": "object",
"properties": {...}
}
}
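As a concrete instance of the definition above, a weather tool might be declared like this. The field names follow the MCP tool schema shown; the specific parameters (`location`, `units`) are assumptions for illustration:

```python
# Hypothetical MCP tool definition for a weather tool.
# Parameter names and the enum values are illustrative assumptions.
WEATHER_TOOL = {
    "name": "weather",
    "description": "Get the current weather for a location",
    "inputSchema": {
        "type": "object",
        "properties": {
            "location": {"type": "string", "description": "City name"},
            "units": {"type": "string", "enum": ["metric", "imperial"]},
        },
        "required": ["location"],
    },
}
```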
API Design
LLM Server API
Endpoint: POST /v1/chat/completions
{
"model": "work-agent" | "family-agent",
"messages": [...],
"tools": [...],
"temperature": 0.7
}
ASR Service API
Endpoint: WebSocket /asr/stream
- Input: Audio stream (PCM, 16kHz, mono)
- Output: Text segments with timestamps
{
"text": "transcribed text",
"timestamp": 1234.56,
"confidence": 0.95,
"is_final": false
}
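On the consuming side, partial segments (`is_final: false`) should overwrite a pending buffer while final segments get committed to the transcript. A sketch of that accumulation logic (the class and its names are assumptions):

```python
# Sketch of consuming the ASR stream: partials overwrite a pending buffer,
# finals are committed. Class and method names are illustrative.

class TranscriptAssembler:
    def __init__(self):
        self.committed: list[str] = []
        self.pending: str = ""

    def feed(self, segment: dict) -> None:
        """Ingest one segment in the output format shown above."""
        if segment["is_final"]:
            self.committed.append(segment["text"])
            self.pending = ""
        else:
            self.pending = segment["text"]

    @property
    def text(self) -> str:
        parts = self.committed + ([self.pending] if self.pending else [])
        return " ".join(parts)
```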
TTS Service API
Endpoint: POST /tts/synthesize
{
"text": "Text to speak",
"voice": "family-voice",
"stream": true
}
Response: Audio stream (WAV or MP3)
MCP Server API
Protocol: JSON-RPC 2.0
Methods:
- tools/list: List available tools
- tools/call: Execute a tool
{
"jsonrpc": "2.0",
"method": "tools/call",
"params": {
"name": "weather",
"arguments": {...}
},
"id": 1
}
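Server-side, the two methods reduce to a small JSON-RPC dispatch. A minimal sketch, assuming a registry of tool handlers (the stub weather handler and error codes below follow JSON-RPC 2.0 conventions but are otherwise illustrative):

```python
# Minimal JSON-RPC 2.0 dispatch for the two MCP methods above.
# The tool registry and stub handler are assumptions for illustration.

TOOLS = {"weather": lambda args: {"temp_c": 21}}  # stub handler

def handle_rpc(request: dict) -> dict:
    """Dispatch one JSON-RPC request to tools/list or tools/call."""
    req_id = request.get("id")
    if request.get("jsonrpc") != "2.0":
        return {"jsonrpc": "2.0", "id": req_id,
                "error": {"code": -32600, "message": "Invalid Request"}}
    method = request.get("method")
    if method == "tools/list":
        result = {"tools": sorted(TOOLS)}
    elif method == "tools/call":
        params = request.get("params", {})
        result = TOOLS[params["name"]](params.get("arguments", {}))
    else:
        return {"jsonrpc": "2.0", "id": req_id,
                "error": {"code": -32601, "message": "Method not found"}}
    return {"jsonrpc": "2.0", "id": req_id, "result": result}
```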
Security Considerations
Privacy Policy
- Core Principle: No external APIs for ASR/LLM processing
- Exception: Weather API (documented exception)
- Local Processing: All voice and LLM processing runs locally
- Data Retention: Configurable retention policies for conversations
Boundary Enforcement
- Repository Separation: Family agent config in separate repo, no work content
- Path Whitelists: Tools can only access whitelisted directories
- Network Isolation: Containers/namespaces prevent cross-access
- Firewall Rules: Block family agent from accessing work repo paths
- Static Analysis: CI/CD checks reject code that would grant cross-access
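The path-whitelist check is worth spelling out, because naive string-prefix comparisons can be escaped with `..` or symlinks. A sketch that resolves the path first (the directory names are assumptions for illustration):

```python
# Sketch of the path-whitelist check: tools may only touch files under
# explicitly allowed directories. Directory names are illustrative.
from pathlib import Path

ALLOWED_DIRS = [
    Path("/srv/family-agent-config/tasks"),
    Path("/srv/family-agent-config/prompts"),
]

def is_path_allowed(candidate: str) -> bool:
    """Resolve symlinks and '..' first so traversal cannot escape the whitelist."""
    resolved = Path(candidate).resolve()
    return any(resolved.is_relative_to(d) for d in ALLOWED_DIRS)
```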
Confirmation Flows
- High-Risk Actions: Email send, calendar changes, file edits outside safe areas
- Confirmation Tokens: Signed tokens required from client, not just model intent
- User Approval: Explicit "Yes/No" confirmation for sensitive operations
- Audit Logging: All confirmations and high-risk actions logged
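The confirmation-token idea can be sketched with an HMAC: the client signs the specific approved action, and the server verifies the signature before executing, so the model cannot fabricate approval. The key handling below is deliberately simplified (in practice the secret would be per-device and the token would include an expiry):

```python
# Sketch of signed confirmation tokens: the client, not the model, produces
# the token, and the server verifies it before acting. Simplified key handling.
import hashlib
import hmac

SECRET_KEY = b"replace-with-per-device-secret"  # placeholder for illustration

def sign_action(action: str) -> str:
    """Client side: sign the exact action string the user approved."""
    return hmac.new(SECRET_KEY, action.encode(), hashlib.sha256).hexdigest()

def verify_confirmation(action: str, token: str) -> bool:
    """Server side: constant-time check that the token matches this action."""
    return hmac.compare_digest(sign_action(action), token)
```

Because the token is bound to the action string, approval for one email send cannot be replayed to authorize a different one.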
Authentication & Authorization
- Token-Based: Separate tokens for work vs family agents
- Revocation: System to disable compromised tokens/devices
- Admin Controls: Kill switches for services, tools, or entire agent
Performance Considerations
Latency Targets
- Wake-Word Detection: < 200ms
- ASR Processing: < 2s end-to-end (audio in → text out)
- LLM Response: < 3s for family agent (1050), < 5s for work agent (4080)
- TTS Synthesis: < 500ms first chunk, streaming thereafter
- Tool Execution: < 1s for simple tools (weather, time)
Resource Allocation
- 4080 (Work Agent):
  - Model: 8-14B or 30B quantized (Q4-Q6)
  - Context: 8K-16K tokens
  - Concurrency: 2-4 requests
- 1050 (Family Agent):
  - Model: 1B-3B quantized (Q4-Q5)
  - Context: 4K-8K tokens
  - Concurrency: 1-2 requests (always-on, low latency)
Optimization Strategies
- Model Quantization: Q4-Q6 for 4080, Q4-Q5 for 1050
- Context Management: Summarization and pruning for long conversations
- Caching: Weather API responses, tool results
- Streaming: ASR and TTS use streaming for lower perceived latency
- Batching: LLM requests where possible (work agent)
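The caching strategy for tool results such as weather lookups can be sketched as a small TTL cache; the 10-minute default is an assumption, and a real deployment might use an existing caching library instead:

```python
# Sketch of a TTL cache for tool results (e.g. weather API responses).
# The default TTL is an assumed value for illustration.
import time

class TTLCache:
    def __init__(self, ttl_seconds: float = 600):
        self.ttl = ttl_seconds
        self._store: dict = {}

    def get(self, key):
        """Return the cached value, or None if missing or expired."""
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() > expires_at:
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)
```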
Deployment
Hardware Requirements
- RTX 4080 Server: Work agent LLM, ASR (optional)
- GTX 1050 Server: Family agent LLM (always-on)
- Wake-Word Node: Raspberry Pi 4+, NUC, or SFF PC
- Microphones: USB mics or array mic for living room/office
- Storage: SSD for logs, HDD for archives
Service Deployment
- LLM Servers: Systemd services or Docker containers
- MCP Server: Systemd service with auto-restart
- Voice Services: ASR and TTS as systemd services
- Wake-Word Node: Standalone service on dedicated hardware
- Clients: PWA served via web server, web dashboard on LAN
Configuration Management
- Family Agent Config: Separate family-agent-config/ repo
- Secrets: Environment variables, separate .env files
- Prompts: Version-controlled in config repo
- Tool Configs: YAML/JSON files in config repo
Monitoring & Logging
- Structured Logging: JSON logs for all services
- Metrics: GPU usage, latency, error rates
- Admin Dashboard: Web UI for logs, metrics, controls
- Alerting: System notifications for errors or high resource usage
Development Workflow
Ticket-Based Development
- Select Ticket: Choose from tickets/backlog/ or tickets/todo/
- Check Dependencies: Review ticket dependencies before starting
- Move to In-Progress: Move ticket to tickets/in-progress/
- Implement: Follow architecture patterns and conventions
- Test: Write and run tests
- Document: Update relevant documentation
- Move to Review: Move ticket to tickets/review/ when complete
- Move to Done: Move to tickets/done/ after approval
Parallel Development
Many tickets can be worked on simultaneously:
- Voice I/O: Independent of LLM and MCP
- LLM Infrastructure: Can proceed after model selection
- MCP Tools: Can start with minimal server, add tools incrementally
- Clients/UI: Can mock APIs early, integrate later
Milestone Progression
- Milestone 1: Foundation and surveys (TICKET-002 through TICKET-029)
- Milestone 2: MVP with voice chat, weather, tasks (core functionality)
- Milestone 3: Memory, reminders, safety features
- Milestone 4: Optional integrations (email, calendar, smart home)
Future Considerations
Planned Enhancements
- Semantic Search: Add embeddings for note search (beyond ripgrep)
- Routine Learning: Automatically learn and suggest routines from memory
- Multi-Device: Support multiple wake-word nodes and clients
- Offline Mode: Enhanced offline capabilities for clients
- Voice Cloning: Custom voice profiles for family members
Technical Debt
- Start with basic implementations, optimize later
- Initial memory system: simple schema, enhance with better retrieval
- Tool permissions: Start with whitelists, add more granular control later
- Logging: Start with files, migrate to time-series DB if needed
Scalability
- Current design supports single household
- Future: Multi-user support with user-specific memory and preferences
- Future: Distributed deployment across multiple nodes
Related Documentation
- Tickets: See tickets/TICKETS_SUMMARY.md for all 46 tickets
- Quick Start: See tickets/QUICK_START.md for recommended starting order
- Ticket Template: See tickets/TICKET_TEMPLATE.md for creating new tickets
- Privacy Policy: See docs/PRIVACY_POLICY.md for details on data handling
- Safety Constraints: See docs/SAFETY_CONSTRAINTS.md for details on security boundaries
Note: Update this document as the architecture evolves.