# 4080 LLM Server (Work Agent)

LLM server for work agent running Llama 3.1 70B Q4 on RTX 4080.

## Setup

### Option 1: Ollama (Recommended - Easiest)

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Download model
ollama pull llama3.1:70b-q4_0

# Start server
ollama serve
# Runs on http://localhost:11434
```
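
After the steps above, a quick sanity check confirms the install. This is a sketch, not part of the documented setup; it only reports status and degrades gracefully when `ollama` or its daemon is absent:

```bash
# Sanity check: confirm the ollama binary is on PATH and list pulled models.
# Falls back to a message when ollama or its daemon is unavailable.
if command -v ollama > /dev/null 2>&1; then
  STATUS=$(ollama list 2>&1 || echo "ollama daemon not reachable")
else
  STATUS="ollama not installed"
fi
echo "$STATUS"
```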

### Option 2: vLLM (For Higher Throughput)

```bash
# Install vLLM
pip install vllm

# Start server (note: --quantization awq expects an AWQ-quantized checkpoint)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --quantization awq \
  --tensor-parallel-size 1 \
  --host 0.0.0.0 \
  --port 8000
```
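
The vLLM server exposes an OpenAI-compatible API, so it can be exercised with a plain `curl`. A minimal sketch (the guard keeps it from failing when nothing is listening on port 8000):

```bash
# Request payload for vLLM's OpenAI-compatible chat endpoint
PAYLOAD='{"model": "meta-llama/Meta-Llama-3.1-70B-Instruct", "messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'

# Only send the request when a server is actually reachable on port 8000
if curl -s --max-time 2 http://localhost:8000/v1/models > /dev/null 2>&1; then
  curl -s http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "$PAYLOAD"
else
  echo "vLLM server not reachable on port 8000"
fi
```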

## Configuration

- **Model**: Llama 3.1 70B Q4
- **Context Window**: 8K tokens (practical limit)
- **VRAM Usage**: ~14GB
- **Concurrency**: 2 requests max

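One way to enforce the 2-request concurrency cap with Ollama is through its environment variables, set before `ollama serve` starts (a sketch; the exact values to use belong in the service environment):

```bash
# Cap parallel request handling to match the concurrency limit above
export OLLAMA_NUM_PARALLEL=2
# Keep only one model resident in the 4080's VRAM at a time
export OLLAMA_MAX_LOADED_MODELS=1
echo "OLLAMA_NUM_PARALLEL=$OLLAMA_NUM_PARALLEL"
```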
## API

## API

The example below uses Ollama's native chat API; Ollama also exposes an OpenAI-compatible API under `/v1`:

```bash
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:70b-q4_0",
  "messages": [
    {"role": "user", "content": "Hello"}
  ],
  "stream": false
}'
```
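
In the non-streaming response, the assistant's reply sits under `.message.content`. A guarded sketch that extracts just the text with `jq` (assumes `jq` is installed; prints a notice instead when the server is down):

```bash
# Ask a question and pull only the reply text out of the JSON response.
# Guarded so the snippet is a no-op when the Ollama server is not running.
if curl -s --max-time 2 http://localhost:11434/api/tags > /dev/null 2>&1; then
  REPLY=$(curl -s http://localhost:11434/api/chat -d '{
    "model": "llama3.1:70b-q4_0",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": false
  }' | jq -r '.message.content')
else
  REPLY="Ollama server not reachable on port 11434"
fi
echo "$REPLY"
```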

## Systemd Service

See `ollama-4080.service` for systemd configuration.
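
That file in the repository is the source of truth; for orientation only, a unit of this kind typically looks something like the following (a hypothetical sketch, not the actual file):

```ini
# Hypothetical sketch; see ollama-4080.service for the real configuration.
[Unit]
Description=Ollama LLM server (RTX 4080, work agent)
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
Environment=OLLAMA_NUM_PARALLEL=2
Restart=on-failure

[Install]
WantedBy=multi-user.target
```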