- Enhanced `ARCHITECTURE.md` with details on the LLM models for the work agent (Llama 3.1 70B Q4) and family agents (Phi-3 Mini 3.8B Q4).
- Introduced new documents:
  - `ASR_EVALUATION.md` for ASR engine evaluation and selection.
  - `HARDWARE.md` outlining hardware requirements and purchase plans.
  - `IMPLEMENTATION_GUIDE.md` for Milestone 2 implementation steps.
  - `LLM_CAPACITY.md` assessing VRAM and context window limits.
  - `LLM_MODEL_SURVEY.md` surveying open-weight LLM models.
  - `LLM_USAGE_AND_COSTS.md` detailing LLM usage and operational costs.
  - `MCP_ARCHITECTURE.md` describing the Model Context Protocol architecture.
  - `MCP_IMPLEMENTATION_SUMMARY.md` summarizing MCP implementation status.

These updates provide comprehensive guidance for the next phases of development and ensure clarity in project documentation.
# 4080 LLM Server (Work Agent)

LLM server for the work agent, running Llama 3.1 70B Q4 on an RTX 4080.
## Setup
### Option 1: Ollama (Recommended - Easiest)
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Download model
ollama pull llama3.1:70b-q4_0

# Start server
ollama serve
# Runs on http://localhost:11434
```
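Once the server is up, a quick sanity check (assuming the default port and the model tag above) is to list installed models and send a minimal generation request:

```bash
# List models known to the local Ollama instance
curl http://localhost:11434/api/tags

# Minimal generation request to confirm the model loads and responds
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:70b-q4_0",
  "prompt": "Say hello in one sentence.",
  "stream": false
}'
```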
### Option 2: vLLM (For Higher Throughput)
```bash
# Install vLLM
pip install vllm

# Start server
# Note: --quantization awq expects an AWQ-quantized checkpoint;
# meta-llama/Meta-Llama-3.1-70B-Instruct is the full-precision repo,
# so point --model at an AWQ build of the model instead.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --quantization awq \
  --tensor-parallel-size 1 \
  --host 0.0.0.0 \
  --port 8000
```
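vLLM exposes the standard OpenAI-compatible routes, so a smoke test against the port configured above might look like this (model name and prompt here are placeholders):

```bash
# Smoke test against vLLM's OpenAI-compatible chat endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64
  }'
```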
## Configuration
- Model: Llama 3.1 70B Q4
- Context Window: 8K tokens (practical limit)
- VRAM Usage: ~14GB
- Concurrency: 2 requests max
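With Ollama, these limits can be pinned explicitly rather than left to defaults. A sketch, assuming the model tag above and Ollama's `OLLAMA_NUM_PARALLEL` environment variable and per-request `num_ctx` option:

```bash
# Cap concurrent requests at the server level
export OLLAMA_NUM_PARALLEL=2
ollama serve

# Pin the context window per request via the num_ctx option
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:70b-q4_0",
  "messages": [{"role": "user", "content": "Hello"}],
  "options": {"num_ctx": 8192},
  "stream": false
}'
```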
## API
The example below uses Ollama's native chat endpoint (`/api/chat`); Ollama also exposes an OpenAI-compatible API at `/v1/chat/completions`:
```bash
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:70b-q4_0",
  "messages": [
    {"role": "user", "content": "Hello"}
  ],
  "stream": false
}'
```
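For clients that already speak the OpenAI API (e.g. existing agent tooling), the same request can go through Ollama's OpenAI-compatible route; a sketch assuming the default port:

```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:70b-q4_0",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```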
## Systemd Service
See `ollama-4080.service` for the systemd unit configuration.
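The unit file itself is not reproduced here. As a rough sketch only (the contents below are an assumption, not the actual `ollama-4080.service`), a minimal service of this kind could be installed like so:

```bash
# Hypothetical minimal unit; the real ollama-4080.service may differ
sudo tee /etc/systemd/system/ollama-4080.service >/dev/null <<'EOF'
[Unit]
Description=Ollama LLM server (RTX 4080, work agent)
After=network-online.target

[Service]
ExecStart=/usr/bin/ollama serve
Environment=OLLAMA_HOST=0.0.0.0:11434
Environment=OLLAMA_NUM_PARALLEL=2
Restart=always

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now ollama-4080.service
```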