# Implementation Guide - Milestone 2

## Overview

This guide provides step-by-step instructions for implementing Milestone 2 core infrastructure. All planning and evaluation work is complete - ready to build!
## Prerequisites

✅ Completed:

- Model selections finalized (Llama 3.1 70B Q4, Phi-3 Mini 3.8B Q4)
- ASR engine selected (faster-whisper)
- MCP architecture documented
- Hardware plan ready
## Implementation Order

### Phase 1: Core Infrastructure (Priority 1)

#### 1. LLM Servers (TICKET-021, TICKET-022)

**Why First:** Everything else depends on LLM infrastructure

##### TICKET-021: 4080 LLM Service

**Recommended Approach: Ollama**

1. Install Ollama

   ```bash
   curl -fsSL https://ollama.com/install.sh | sh
   ```

2. Download Model

   ```bash
   ollama pull llama3.1:70b-q4_0  # Or use custom quantized model
   ```

3. Start Ollama Service

   ```bash
   ollama serve  # Runs on http://localhost:11434
   ```

4. Test Function Calling

   ```bash
   curl http://localhost:11434/api/chat -d '{
     "model": "llama3.1:70b-q4_0",
     "messages": [{"role": "user", "content": "Hello"}],
     "tools": [...]
   }'
   ```

5. Create Systemd Service (for auto-start)

   ```ini
   [Unit]
   Description=Ollama LLM Server (4080)
   After=network.target

   [Service]
   Type=simple
   User=atlas
   ExecStart=/usr/local/bin/ollama serve
   Restart=always

   [Install]
   WantedBy=multi-user.target
   ```
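The `"tools": [...]` placeholder in the function-calling test needs a concrete schema. A sketch of what a full payload might look like, built in Python so it can be validated before sending (the `get_weather` tool here is a hypothetical stub, not part of any shipped tool set):

```python
import json

# Hypothetical OpenAI-style tool schema, as accepted by Ollama's /api/chat
# "tools" parameter. Only the payload is constructed here; no request is sent.
payload = {
    "model": "llama3.1:70b-q4_0",
    "messages": [{"role": "user", "content": "What's the weather in Berlin?"}],
    "stream": False,
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "city": {"type": "string", "description": "City name"}
                    },
                    "required": ["city"],
                },
            },
        }
    ],
}

# Serialized form, usable as the curl -d body.
body = json.dumps(payload)
```

If the model supports function calling, the response should contain a `tool_calls` entry naming `get_weather` rather than a plain text answer.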
**Alternative: vLLM** (if you need batching/higher throughput)

- More complex setup
- Better for multiple concurrent requests
- See vLLM documentation
##### TICKET-022: 1050 LLM Service

**Recommended Approach: Ollama** (same as 4080)

1. Install Ollama (on the 1050 machine)

2. Download Model

   ```bash
   ollama pull phi3:mini-q4_0
   ```

3. Start Service

   ```bash
   # ollama serve has no --host flag; bind address is set via OLLAMA_HOST
   OLLAMA_HOST=0.0.0.0 ollama serve  # Runs on http://<1050-ip>:11434
   ```

4. Test

   ```bash
   curl http://<1050-ip>:11434/api/chat -d '{
     "model": "phi3:mini-q4_0",
     "messages": [{"role": "user", "content": "Hello"}]
   }'
   ```
**Key Differences:**

- Different model (Phi-3 Mini vs Llama 3.1)
- Different port or IP binding
- Lower resource usage
#### 2. MCP Server (TICKET-029)

**Why Second:** Foundation for all tools

**Implementation Steps:**

1. Create Project Structure

   ```
   home-voice-agent/
   └── mcp-server/
       ├── __init__.py
       ├── server.py          # Main JSON-RPC server
       ├── tools/
       │   ├── __init__.py
       │   ├── weather.py
       │   └── echo.py
       └── requirements.txt
   ```

2. Install Dependencies

   ```bash
   pip install jsonrpc-base jsonrpc-websocket fastapi uvicorn
   ```

3. Implement JSON-RPC 2.0 Server

   - Use `jsonrpc-base` or implement manually
   - Handle `tools/list` and `tools/call` methods
   - Error handling with proper JSON-RPC error codes

4. Create Example Tools

   - **Echo Tool**: Simple echo for testing
   - **Weather Tool**: Stub implementation (real API later)

5. Test Server

   ```bash
   # Start server
   python mcp-server/server.py

   # Test tools/list
   curl -X POST http://localhost:8000/mcp \
     -H "Content-Type: application/json" \
     -d '{"jsonrpc": "2.0", "method": "tools/list", "id": 1}'

   # Test tools/call
   curl -X POST http://localhost:8000/mcp \
     -H "Content-Type: application/json" \
     -d '{
       "jsonrpc": "2.0",
       "method": "tools/call",
       "params": {"name": "echo", "arguments": {"text": "hello"}},
       "id": 2
     }'
   ```
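The `tools/list` and `tools/call` handling can be sketched as a minimal in-process dispatcher, independent of the transport (FastAPI, WebSocket, or stdin). Everything below is illustrative - the registry shape and function names are assumptions, not part of any MCP SDK:

```python
import json

# Minimal tool registry: name -> description + handler.
def echo(arguments):
    """Echo tool: returns the given text unchanged."""
    return arguments.get("text", "")

TOOLS = {"echo": {"description": "Echo back the given text", "handler": echo}}

def handle_request(raw: str) -> str:
    """Dispatch one JSON-RPC 2.0 request; returns the JSON-RPC response string."""
    req = json.loads(raw)
    rid = req.get("id")
    method = req.get("method")
    if method == "tools/list":
        result = {"tools": [{"name": n, "description": t["description"]}
                            for n, t in TOOLS.items()]}
    elif method == "tools/call":
        params = req.get("params", {})
        tool = TOOLS.get(params.get("name"))
        if tool is None:
            # -32602: invalid params per the JSON-RPC 2.0 spec
            return json.dumps({"jsonrpc": "2.0", "id": rid,
                               "error": {"code": -32602, "message": "Unknown tool"}})
        result = {"content": tool["handler"](params.get("arguments", {}))}
    else:
        # -32601: method not found per the JSON-RPC 2.0 spec
        return json.dumps({"jsonrpc": "2.0", "id": rid,
                           "error": {"code": -32601, "message": "Method not found"}})
    return json.dumps({"jsonrpc": "2.0", "id": rid, "result": result})
```

The curl tests above would exercise exactly these two methods once the dispatcher is mounted behind an HTTP endpoint.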
### Phase 2: Voice I/O Services (Priority 2)

#### 3. Wake-Word Node (TICKET-006)

**Prerequisites:** Hardware (microphone, always-on node)

**Implementation Steps:**

1. Install openWakeWord (or selected engine)

   ```bash
   pip install openwakeword
   ```

2. Create Wake-Word Service

   - Audio capture (PyAudio)
   - Wake-word detection loop
   - Event emission (WebSocket/MQTT/HTTP)

3. Test Detection

   - Train/configure "Hey Atlas" wake-word
   - Test false positive/negative rates
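The detection loop boils down to scoring audio frames and emitting an event when the score crosses a threshold, with a refractory period so one utterance doesn't trigger twice. A sketch with the scorer stubbed out (the threshold and frame counts are placeholder values to tune; a real node would feed per-frame scores from openWakeWord and emit over WebSocket/MQTT):

```python
def detect_wake_words(frame_scores, threshold=0.5, refractory_frames=20):
    """Return indices of frames that trigger a wake event.

    frame_scores: per-frame wake-word confidence scores (stand-in for the
    engine's output on a live PyAudio stream). After a trigger, suppress
    re-triggering for `refractory_frames` frames.
    """
    events = []
    cooldown = 0
    for i, score in enumerate(frame_scores):
        if cooldown > 0:
            cooldown -= 1
            continue
        if score >= threshold:
            events.append(i)  # a real node would emit the wake event here
            cooldown = refractory_frames
    return events
```

Measuring false positives/negatives then reduces to comparing these trigger indices against labeled audio.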
#### 4. ASR Service (TICKET-010)

**Prerequisites:** faster-whisper selected

**Implementation Steps:**

1. Install faster-whisper

   ```bash
   pip install faster-whisper
   ```

2. Download Model

   ```python
   from faster_whisper import WhisperModel

   model = WhisperModel("small", device="cuda", compute_type="float16")
   ```

3. Create WebSocket Service

   - Audio streaming endpoint
   - Real-time transcription
   - Text segment output

4. Integrate with Wake-Word

   - Start ASR on wake-word event
   - Stop on silence or user command
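"Stop on silence" usually means tracking frame energy and ending the utterance after N consecutive quiet frames. A minimal RMS-based endpointer sketch (the threshold and frame count are placeholder values; real deployments often use a proper VAD instead):

```python
import math

def rms(samples):
    """Root-mean-square energy of one frame of int16 PCM samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def utterance_ended(frames, silence_threshold=500.0, silence_frames=3):
    """True once `silence_frames` consecutive frames fall below the threshold."""
    quiet = 0
    for frame in frames:
        if rms(frame) < silence_threshold:
            quiet += 1
            if quiet >= silence_frames:
                return True  # a real service would stop streaming to ASR here
        else:
            quiet = 0
    return False
```

The ASR service would run this check as frames arrive and close the stream (or finalize the transcript) when it returns true.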
#### 5. TTS Service (TICKET-014)

**Prerequisites:** TTS evaluation complete

**Implementation Steps:**

1. Install Piper (or selected TTS)

   ```bash
   wget https://github.com/rhasspy/piper/releases/download/v1.2.0/piper_amd64.tar.gz
   tar -xzf piper_amd64.tar.gz
   ```

2. Download Voice Model

   ```bash
   wget https://huggingface.co/rhasspy/piper-voices/resolve/main/en/en_US/lessac/medium/en_US-lessac-medium.onnx
   ```

3. Create HTTP Service

   - Text input → audio output
   - Streaming support
   - Voice selection
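For streaming support, the service can split incoming text into sentence-sized chunks and synthesize each as it arrives instead of waiting for the full LLM reply. A small splitter sketch (the punctuation set is an assumption; adjust for abbreviations as needed):

```python
import re

def split_for_tts(text):
    """Split text into sentence-sized chunks for incremental synthesis.

    Splits on ., !, or ? followed by whitespace, keeping the punctuation
    with its chunk so prosody cues survive.
    """
    chunks = [c.strip() for c in re.split(r"(?<=[.!?])\s+", text.strip())]
    return [c for c in chunks if c]
```

Each chunk is then fed to the synthesizer in order, so the first audio plays while later sentences are still being generated.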
## Quick Start Checklist

### Week 1: Core Infrastructure

- [ ] Set up 4080 LLM server (TICKET-021)
- [ ] Set up 1050 LLM server (TICKET-022)
- [ ] Test both servers independently
- [ ] Implement minimal MCP server (TICKET-029)
- [ ] Test MCP server with echo tool

### Week 2: Voice Services

- [ ] Prototype wake-word node (TICKET-006) - if hardware ready
- [ ] Implement ASR service (TICKET-010)
- [ ] Implement TTS service (TICKET-014)
- [ ] Test voice pipeline end-to-end

### Week 3: Integration

- [ ] Implement MCP-LLM adapter (TICKET-030)
- [ ] Add core tools (weather, time, tasks)
- [ ] Create routing layer (TICKET-023)
- [ ] Test full system
## Common Issues & Solutions

### LLM Server Issues

**Problem:** Model doesn't fit in VRAM
- **Solution:** Use Q4 quantization, reduce context window

**Problem:** Slow inference
- **Solution:** Check GPU utilization, use GPU-accelerated inference

**Problem:** Function calling not working
- **Solution:** Verify the model supports function calling, check prompt format

### MCP Server Issues

**Problem:** JSON-RPC errors
- **Solution:** Validate request format, check error codes

**Problem:** Tools not discovered
- **Solution:** Verify tool registration, check the `tools/list` response

### Voice Services Issues

**Problem:** High latency
- **Solution:** Use GPU for ASR, optimize model size

**Problem:** Poor accuracy
- **Solution:** Use a larger model, improve audio quality
## Testing Strategy

### Unit Tests

- Test each service independently
- Mock dependencies where needed

### Integration Tests

- Test LLM → MCP → Tool flow
- Test Wake-word → ASR → LLM → TTS flow

### End-to-End Tests

- Full voice interaction
- Tool calling scenarios
- Error handling
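The integration flow can be exercised before any real service exists by wiring stub stages together and asserting on the result. A sketch with stand-in functions for each stage (all names and return values here are illustrative; each stub would become a network call to the corresponding service):

```python
# Stub stages standing in for the real services: ASR WebSocket,
# LLM HTTP API, and TTS HTTP API respectively.
def fake_asr(audio):
    return "what time is it"

def fake_llm(text):
    return "It is 3 PM." if "time" in text else "Sorry, I didn't catch that."

def fake_tts(text):
    return b"RIFF" + text.encode()  # pretend WAV bytes

def run_pipeline(audio):
    """Wake-word event assumed already fired; run ASR -> LLM -> TTS."""
    transcript = fake_asr(audio)
    reply = fake_llm(transcript)
    return fake_tts(reply)
```

Swapping one stub at a time for the real service gives an incremental path from unit tests to full end-to-end tests.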
## Next Steps After Milestone 2

Once core infrastructure is working:

- Add more MCP tools (TICKET-031, TICKET-032, TICKET-033, TICKET-034)
- Implement phone client (TICKET-039)
- Add system prompts (TICKET-025)
- Implement conversation handling (TICKET-027)
## References

- Ollama Docs: https://ollama.com/docs
- vLLM Docs: https://docs.vllm.ai
- faster-whisper: https://github.com/guillaumekln/faster-whisper
- MCP Spec: https://modelcontextprotocol.io/specification
- Model Selection: `docs/MODEL_SELECTION.md`
- ASR Evaluation: `docs/ASR_EVALUATION.md`
- MCP Architecture: `docs/MCP_ARCHITECTURE.md`

---

**Last Updated:** 2024-01-XX
**Status:** Ready for Implementation