# 4080 LLM Server (Work Agent)
LLM server for the work agent, running Llama 3.1 70B Q4 on an RTX 4080.
## Setup
### Option 1: Ollama (Recommended - Easiest)
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Download the model
ollama pull llama3.1:70b-q4_0

# Start the server
ollama serve
# Serves on http://localhost:11434
```
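Once the server is up, it is worth confirming the model is actually available before pointing agents at it. Both checks below use standard Ollama tooling:
```bash
# List models known to the local Ollama instance
curl -s http://localhost:11434/api/tags

# Same check via the CLI
ollama list
```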
### Option 2: vLLM (For Higher Throughput)
```bash
# Install vLLM
pip install vllm

# Start the OpenAI-compatible server.
# NOTE: --quantization awq expects pre-quantized AWQ weights; the FP16
# meta-llama repo will not load with this flag. Point --model at an
# AWQ-INT4 export, e.g. hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4.
python -m vllm.entrypoints.openai.api_server \
  --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
  --quantization awq \
  --tensor-parallel-size 1 \
  --host 0.0.0.0 \
  --port 8000
```
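vLLM serves the OpenAI-compatible API on the configured port, so a quick smoke test looks like the following (a sketch, assuming the server above is up on port 8000; the model name in the request must match whatever `--model` registered):
```bash
# Confirm the registered model name
curl -s http://localhost:8000/v1/models

# Minimal chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64
  }'
```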
## Configuration
- **Model**: Llama 3.1 70B Q4
- **Context Window**: 8K tokens (practical limit)
- **VRAM Usage**: ~14GB
- **Concurrency**: 2 requests max
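These limits can be pinned at the server rather than trusted to clients. A sketch for the Ollama path, using Ollama's standard server environment variables with values mirroring the budget above:
```bash
# Cap concurrency at 2 in-flight requests
export OLLAMA_NUM_PARALLEL=2
# Keep only one model resident so VRAM usage stays predictable
export OLLAMA_MAX_LOADED_MODELS=1
ollama serve
```
The 8K context limit is applied per request via the `num_ctx` field of the `options` object in the API payload.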
## API
Ollama exposes a native REST API at `/api/chat` (an OpenAI-compatible endpoint is also available; see below):
```bash
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:70b-q4_0",
  "messages": [
    {"role": "user", "content": "Hello"}
  ],
  "stream": false
}'
```
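For clients built against the OpenAI SDK, the same request works through Ollama's OpenAI-compatible route at `/v1/chat/completions`:
```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:70b-q4_0",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```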
## Systemd Service
See `ollama-4080.service` for systemd configuration.
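A sketch of the install steps, assuming the unit file sits in this repo and the server should start at boot:
```bash
# Install the unit, reload systemd, and enable it at boot
sudo cp ollama-4080.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now ollama-4080.service

# Follow logs to confirm the server came up
journalctl -u ollama-4080.service -f
```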