4080 LLM Server (Work Agent)

LLM server for the work agent, running Llama 3.1 70B Q4 on an RTX 4080.

Setup

Option 1: Ollama

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Download model
ollama pull llama3.1:70b-q4_0

# Start server
ollama serve
# Runs on http://localhost:11434
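
A quick sanity check that the server is up and the model is present (assuming the default port; /api/tags lists locally pulled models):

# List models known to this Ollama instance
curl http://localhost:11434/api/tags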

Option 2: vLLM (For Higher Throughput)

# Install vLLM
pip install vllm

# Start server. Note: --quantization awq expects an AWQ-quantized
# checkpoint (e.g. hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4),
# not the full-precision meta-llama weights.
python -m vllm.entrypoints.openai.api_server \
  --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
  --quantization awq \
  --tensor-parallel-size 1 \
  --host 0.0.0.0 \
  --port 8000
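
vLLM serves the standard OpenAI-compatible API; a minimal smoke test (the model name must match the --model value above):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    "messages": [{"role": "user", "content": "Hello"}]
  }'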

Configuration

  • Model: Llama 3.1 70B Q4
  • Context Window: 8K tokens (practical limit)
  • VRAM Usage: ~14GB
  • Concurrency: 2 requests max (see the sketch below for enforcing these limits)
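
These limits are not enforced automatically; one way to apply them with Ollama (OLLAMA_NUM_PARALLEL caps concurrent requests, num_ctx caps per-request context):

# Limit the server to 2 concurrent requests (set before ollama serve)
export OLLAMA_NUM_PARALLEL=2

# Cap the context window per request via the API "options" field:
#   "options": {"num_ctx": 8192}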

API

Ollama exposes a native chat API as well as an OpenAI-compatible endpoint. Native API:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:70b-q4_0",
  "messages": [
    {"role": "user", "content": "Hello"}
  ],
  "stream": false
}'
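
The same request through the OpenAI-compatible endpoint, for clients that already speak the OpenAI API:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:70b-q4_0",
    "messages": [{"role": "user", "content": "Hello"}]
  }'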

Systemd Service

See ollama-4080.service for systemd configuration.
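
For reference, a minimal sketch of what such a unit might look like (the install path and service user are assumptions; the checked-in ollama-4080.service is authoritative):

[Unit]
Description=Ollama LLM server (work agent, RTX 4080)
After=network-online.target

[Service]
# Assumed install path and service user; adjust to match the host
ExecStart=/usr/local/bin/ollama serve
Environment=OLLAMA_NUM_PARALLEL=2
User=ollama
Restart=on-failure

[Install]
WantedBy=multi-user.target

Enable it with: sudo systemctl enable --now ollama-4080.service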