# 4080 LLM Server (Work Agent)
LLM server for the work agent, running Llama 3.1 70B Q4 on an RTX 4080.
## Setup
### Option 1: Ollama (Recommended - Easiest)
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Download the model
ollama pull llama3.1:70b-q4_0

# Start the server
ollama serve
# Serves on http://localhost:11434
```
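Once the server is up, it is worth confirming the model is actually available before pointing agents at it. Both checks below use standard Ollama tooling:
```bash
# List models known to the local Ollama instance
curl -s http://localhost:11434/api/tags

# Same check via the CLI
ollama list
```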
### Option 2: vLLM (For Higher Throughput)
```bash
# Install vLLM
pip install vllm

# Start the OpenAI-compatible server.
# NOTE: --quantization awq expects pre-quantized AWQ weights; the FP16
# meta-llama repo will not load with this flag. Point --model at an
# AWQ-INT4 export, e.g. hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4.
python -m vllm.entrypoints.openai.api_server \
  --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 \
  --quantization awq \
  --tensor-parallel-size 1 \
  --host 0.0.0.0 \
  --port 8000
```
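vLLM serves the OpenAI-compatible API on the configured port, so a quick smoke test looks like the following (a sketch, assuming the server above is up on port 8000; the model name in the request must match whatever `--model` registered):
```bash
# Confirm the registered model name
curl -s http://localhost:8000/v1/models

# Minimal chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64
  }'
```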
## Configuration
- **Model**: Llama 3.1 70B Q4
- **Context Window**: 8K tokens (practical limit)
- **VRAM Usage**: ~14GB
- **Concurrency**: 2 requests max
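These limits can be pinned at the server rather than trusted to clients. A sketch for the Ollama path, using Ollama's standard server environment variables with values mirroring the budget above:
```bash
# Cap concurrency at 2 in-flight requests
export OLLAMA_NUM_PARALLEL=2
# Keep only one model resident so VRAM usage stays predictable
export OLLAMA_MAX_LOADED_MODELS=1
ollama serve
```
The 8K context limit is applied per request via the `num_ctx` field of the `options` object in the API payload.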
## API
Ollama exposes a native REST API at `/api/chat` (an OpenAI-compatible endpoint is also available; see below):
```bash
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:70b-q4_0",
  "messages": [
    {"role": "user", "content": "Hello"}
  ],
  "stream": false
}'
```
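For clients built against the OpenAI SDK, the same request works through Ollama's OpenAI-compatible route at `/v1/chat/completions`:
```bash
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:70b-q4_0",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```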
## Systemd Service
See `ollama-4080.service` for systemd configuration.
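A sketch of the install steps, assuming the unit file sits in this repo and the server should start at boot:
```bash
# Install the unit, reload systemd, and enable it at boot
sudo cp ollama-4080.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now ollama-4080.service

# Follow logs to confirm the server came up
journalctl -u ollama-4080.service -f
```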