
# Deployment Guide

## Overview

LLM Council can be deployed in several configurations depending on your needs:
- **Local Development**: Everything runs on your local machine
- **Hybrid**: Frontend/Backend local, LLM server on remote GPU VM
- **Full Remote**: Everything on a server/VM
- **Production**: Hardened deployment behind a reverse proxy, with TLS, authentication, and rate limiting

## Architecture Options

### Option 1: Hybrid (Recommended for Development)

**Setup:**
- Frontend + Backend: Run on your local machine
- LLM Server (Ollama): Run on remote GPU VM

**Pros:**
- Easy development and debugging
- GPU resources available remotely
- No need to deploy frontend/backend code
- Fast iteration

**Cons:**
- Requires network connectivity to GPU VM
- Adds network latency to every LLM request

**Configuration:**
```bash
# .env on local machine
USE_LOCAL_OLLAMA=false
OPENAI_COMPAT_BASE_URL=http://YOUR_GPU_VM_IP:11434
```

### Option 2: Full Remote Deployment

**Setup:**
- Everything runs on the GPU VM or dedicated server

**Pros:**
- Centralized deployment
- Can be accessed from multiple machines
- Better for team use

**Cons:**
- More complex setup
- Requires proper security configuration
- Slower development iteration

### Option 3: Production Deployment (Professional)

**Recommended Stack:**
- **Frontend**: Serve static build via nginx/CDN
- **Backend**: Run via systemd/gunicorn/uvicorn with reverse proxy
- **LLM Server**: Separate service on GPU VM
- **Security**: TLS/HTTPS, authentication, rate limiting

## GPU VM Setup

### Prerequisites

1. GPU VM with:
   - NVIDIA GPU with CUDA support
   - Sufficient VRAM for your models
   - Network access from your local machine

2. Ollama installed on GPU VM:
   ```bash
   curl -fsSL https://ollama.ai/install.sh | sh
   ```

### Step 1: Configure Ollama to Accept Remote Connections

**On GPU VM:**

```bash
# Option A: Environment variable (temporary; only affects an `ollama serve` started from this shell)
export OLLAMA_HOST=0.0.0.0:11434

# Option B: Systemd service (persistent - recommended)
sudo systemctl edit ollama
```

Add to the override file:
```ini
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=24h"
Environment="OLLAMA_MAX_LOADED_MODELS=3"
```

Then restart:
```bash
sudo systemctl daemon-reload
sudo systemctl restart ollama
```
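To confirm the override took effect after the restart, you can inspect the environment that systemd passes to the service. The following is a minimal sketch: `service_env_value` is a hypothetical helper name, and it assumes the unit is called `ollama` and that the values contain no whitespace.

```bash
# Hypothetical helper: print the value of one variable from the output of
# `systemctl show <unit> --property=Environment` (read from stdin).
# Assumes values contain no whitespace, which holds for the settings above.
service_env_value() {
  local name="$1" line word
  IFS= read -r line || true
  line="${line#Environment=}"
  for word in $line; do
    case "$word" in
      "$name"=*) printf '%s\n' "${word#*=}"; return 0 ;;
    esac
  done
  return 1
}

# Example (on GPU VM):
#   systemctl show ollama --property=Environment | service_env_value OLLAMA_HOST
```

If the helper prints nothing and returns non-zero, the variable is not set for the service, which usually means the override file was not saved or the service was not restarted.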

### Step 2: Configure Firewall

**On GPU VM:**

```bash
# Allow port 11434 from a single trusted IP (recommended)
sudo ufw allow from YOUR_LOCAL_IP to any port 11434 proto tcp
# Or allow from any address (less secure; Ollama has no authentication)
sudo ufw allow 11434/tcp
```

### Step 3: Pull Required Models

**On GPU VM:**

```bash
ollama pull qwen2.5:7b
ollama pull llama3.1:8b
ollama pull qwen2.5:14b
ollama pull qwen2:latest
```
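If the model list changes often, it can help to pull from the same comma-separated string you use for `COUNCIL_MODELS`, so the two do not drift apart. This is a sketch only: `split_models` and `pull_models` are hypothetical helper names, not part of LLM Council or Ollama.

```bash
# Hypothetical helper: split a comma-separated model list into one model per
# line, trimming surrounding whitespace and dropping empty entries.
split_models() {
  printf '%s\n' "$1" | tr ',' '\n' \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d'
}

# Hypothetical helper: pull every model in the list, stopping at the first failure.
pull_models() {
  local model
  for model in $(split_models "$1"); do
    ollama pull "$model" || return 1
  done
}

# Example (on GPU VM):
#   pull_models "qwen2.5:7b,llama3.1:8b,qwen2:latest,qwen2.5:14b"
```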

### Step 3.5: Ensure Ollama Uses the GPU and Stores Data on /mnt/data

If your VM has a small root disk, keep Ollama's model storage and `HOME` off `/`; a full root disk is a common cause of hard-to-diagnose failures.
On this setup, `OLLAMA_LLM_LIBRARY=cuda` caused Ollama to *skip* the CUDA libraries; use `cuda_v12` instead.

**On GPU VM:**

```bash
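# Create the data directories. The chown assumes the service runs as the
# `ollama` user created by the install script; adjust if yours differs.
sudo mkdir -p /mnt/data/ollama/home
sudo chown -R ollama:ollama /mnt/data/ollama

# This overwrites the override from Step 1; re-add OLLAMA_MAX_LOADED_MODELS below if you need it.
sudo mkdir -p /etc/systemd/system/ollama.service.d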
sudo tee /etc/systemd/system/ollama.service.d/override.conf >/dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=24h"
Environment="OLLAMA_MODELS=/mnt/data/ollama"
Environment="HOME=/mnt/data/ollama/home"
Environment="OLLAMA_LLM_LIBRARY=cuda_v12"
Environment="LD_LIBRARY_PATH=/usr/local/lib/ollama:/usr/local/lib/ollama/cuda_v12"
EOF

sudo systemctl daemon-reload
sudo systemctl restart ollama
```

**Verify GPU offload (on GPU VM):**

```bash
ollama run qwen2:latest "Write 80 words about GPUs."
ollama ps
# The PROCESSOR column should report GPU (for example "100% GPU"), not CPU
```
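With several models loaded, scanning the `ollama ps` table by eye gets tedious. The sketch below flags any model that is running fully or partly on the CPU. It assumes the column layout of recent Ollama releases (model name in the first column, a PROCESSOR value such as `100% GPU` or a CPU/GPU split); `models_using_cpu` is a hypothetical helper name.

```bash
# Hypothetical helper: read `ollama ps` output on stdin and print the name of
# every loaded model whose row mentions CPU (full CPU or a CPU/GPU split).
# Assumes the first line is the header and the model name is the first column.
models_using_cpu() {
  awk 'NR > 1 && /CPU/ { print $1 }'
}

# Example (on GPU VM):
#   ollama ps | models_using_cpu
```

An empty result suggests every loaded model is fully offloaded to the GPU.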

### Step 4: Verify Remote Access

**From local machine:**

```bash
# Native Ollama API: should return the list of pulled models
curl http://YOUR_GPU_VM_IP:11434/api/tags
# OpenAI-compatible API: should list the same models
curl http://YOUR_GPU_VM_IP:11434/v1/models
```
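Beyond checking that the endpoint answers, it is worth confirming that every model you plan to configure is actually present on the VM. The following sketch compares a list of model names against the `/api/tags` response. It uses `grep` and `sed` rather than a JSON parser to avoid extra dependencies, and it assumes each model entry carries a `"name"` field; `list_model_names` and `missing_models` are hypothetical helper names.

```bash
# Hypothetical helper: read an /api/tags JSON document on stdin and print one
# model name per line. Assumes every model object has a "name" string field.
list_model_names() {
  grep -o '"name"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | sed -e 's/^"name"[[:space:]]*:[[:space:]]*"//' -e 's/"$//'
}

# Hypothetical helper: print each model named in the arguments that is absent
# from the JSON read on stdin.
missing_models() {
  local available model
  available="$(list_model_names)"
  for model in "$@"; do
    printf '%s\n' "$available" | grep -qxF -- "$model" || printf '%s\n' "$model"
  done
}

# Example (from local machine):
#   curl -fsS http://YOUR_GPU_VM_IP:11434/api/tags \
#     | missing_models qwen2.5:7b llama3.1:8b qwen2:latest qwen2.5:14b
```

No output means all listed models are available; any name printed still needs an `ollama pull` on the GPU VM.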

### Step 5: Configure LLM Council

**On local machine `.env`:**

```bash
USE_LOCAL_OLLAMA=false
OPENAI_COMPAT_BASE_URL=http://YOUR_GPU_VM_IP:11434
# Local (small) example:
# COUNCIL_MODELS=llama3.2:1b,qwen2.5:0.5b,gemma2:2b
# CHAIRMAN_MODEL=llama3.2:3b

# GPU (available models):
COUNCIL_MODELS=qwen2.5:7b,llama3.1:8b,qwen2:latest
CHAIRMAN_MODEL=qwen2.5:14b
```

## Security Considerations

### For Development/Internal Use

1. **Network Security:**
   - Use VPN or private network
   - Restrict firewall to specific IPs
   - Consider SSH tunnel for extra security

2. **Ollama Security:**
   - Ollama has no built-in authentication
   - Only expose on trusted networks
   - Consider reverse proxy with auth (nginx + basic auth)
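The SSH tunnel option mentioned above can be sketched as follows. With a tunnel, Ollama can stay bound to the VM's loopback interface and port 11434 does not need to be opened in the firewall. `open_ollama_tunnel` is a hypothetical helper name, and `user@YOUR_GPU_VM_IP` is a placeholder for your own SSH destination.

```bash
# Hypothetical helper: forward local port 11434 to Ollama on the GPU VM over SSH.
# $1 is the SSH destination, for example user@YOUR_GPU_VM_IP (placeholder).
open_ollama_tunnel() {
  # -N: do not run a remote command
  # -L: forward local port 11434 to port 11434 on the VM's loopback interface
  ssh -N -L 11434:127.0.0.1:11434 "$1"
}

# Example (from local machine; runs in the foreground until interrupted):
#   open_ollama_tunnel user@YOUR_GPU_VM_IP
# Then point the backend at the local end of the tunnel in .env:
#   OPENAI_COMPAT_BASE_URL=http://127.0.0.1:11434
```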

### For Production

1. **Authentication:**
   - Add API key authentication to backend
   - Use session-based auth for frontend
   - Implement rate limiting

2. **Network Security:**
   - Use HTTPS/TLS everywhere
   - Set up proper firewall rules
   - Consider using a reverse proxy (nginx/traefik)

3. **Infrastructure:**
   - Use container orchestration (Docker Compose/Kubernetes)
   - Set up monitoring and logging
   - Implement backup strategy for conversations

## Deployment Scripts

### Quick Start (Local + Remote Ollama)

```bash
# 1. Start Ollama on GPU VM (already running if systemd configured)
# 2. On local machine:
./start.sh
```

### Full Remote Deployment

See `docs/DEPLOYMENT_FULL.md` for complete remote deployment instructions.

## Troubleshooting

### Connection Timeouts

1. Check Ollama is listening on all interfaces:
   ```bash
   # On GPU VM (use `sudo ss -tlnp` if netstat is not installed)
   sudo netstat -tlnp | grep 11434
   # Should show 0.0.0.0:11434 or :::11434, not 127.0.0.1:11434
   ```

2. Check firewall rules:
   ```bash
   # On GPU VM
   sudo ufw status
   ```

3. Test connectivity:
   ```bash
   # From local machine
   curl -v http://YOUR_GPU_VM_IP:11434/api/tags
   ```
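The three checks above map fairly directly onto curl's exit codes, so a small wrapper can point you at the likely cause. This is a sketch under the assumption that the standard curl exit codes apply (6 for DNS failure, 7 for a failed connection, 28 for a timeout); `explain_curl_exit` and `check_ollama` are hypothetical helper names.

```bash
# Hypothetical helper: translate a curl exit code into a troubleshooting hint.
explain_curl_exit() {
  case "$1" in
    0)  echo "reachable" ;;
    6)  echo "could not resolve host: check the hostname or IP" ;;
    7)  echo "connection failed: check that Ollama listens on a non-loopback address" ;;
    28) echo "timed out: check firewall rules and the network route" ;;
    *)  echo "curl failed with exit code $1" ;;
  esac
}

# Hypothetical helper: probe /api/tags with short timeouts and explain the result.
# $1 is the base URL, for example http://YOUR_GPU_VM_IP:11434
check_ollama() {
  local rc=0
  curl -fsS -o /dev/null --connect-timeout 5 --max-time 15 "$1/api/tags" || rc=$?
  explain_curl_exit "$rc"
  return "$rc"
}

# Example (from local machine):
#   check_ollama http://YOUR_GPU_VM_IP:11434
```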

### Model Loading Issues

1. Check available VRAM:
   ```bash
   nvidia-smi
   ```

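2. Lower `OLLAMA_MAX_LOADED_MODELS` if the loaded models do not fit in VRAM together. Note that the GPU example in Step 5 uses four distinct models (three council members plus the chairman), so with a limit of 3 at least one model must be unloaded and reloaded during a full council run.

3. Compare model sizes (`ollama list`) against the free VRAM reported by `nvidia-smi`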
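To see how many models a configuration needs resident at once, you can count the distinct names across `COUNCIL_MODELS` and `CHAIRMAN_MODEL` and compare the result with `OLLAMA_MAX_LOADED_MODELS`. A minimal sketch, where `count_distinct_models` is a hypothetical helper name:

```bash
# Hypothetical helper: count distinct model names across a comma-separated
# council list ($1) and a chairman model ($2).
count_distinct_models() {
  printf '%s,%s\n' "$1" "$2" | tr ',' '\n' \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' -e '/^$/d' \
    | sort -u | wc -l | tr -d '[:space:]'
}

# Example with the GPU configuration from this guide:
#   count_distinct_models "qwen2.5:7b,llama3.1:8b,qwen2:latest" "qwen2.5:14b"
```

If the count exceeds `OLLAMA_MAX_LOADED_MODELS`, expect extra load time whenever a model has to be swapped back in.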

## Performance Tuning

### Ollama Settings

```ini
# On GPU VM, in the systemd override (comments must be on their own lines)
# Keep models loaded between requests
Environment="OLLAMA_KEEP_ALIVE=24h"
# Maximum number of models loaded at once
Environment="OLLAMA_MAX_LOADED_MODELS=3"
# Parallel requests per loaded model
Environment="OLLAMA_NUM_PARALLEL=1"
```

### LLM Council Timeouts

Adjust in `.env`:
```bash
# For slow models
LLM_TIMEOUT_SECONDS=600.0
CHAIRMAN_TIMEOUT_SECONDS=600.0
OPENAI_COMPAT_TIMEOUT_SECONDS=600.0
OPENAI_COMPAT_CONNECT_TIMEOUT_SECONDS=30.0
OPENAI_COMPAT_WRITE_TIMEOUT_SECONDS=30.0
OPENAI_COMPAT_POOL_TIMEOUT_SECONDS=30.0
```
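To pick timeout values from measurements rather than guesses, you can time one complete, non-streaming generation against Ollama's `/api/generate` endpoint. The sketch below is only a rough guide: the short prompt is an arbitrary choice, real council prompts will take longer, and `generation_payload` and `time_generation` are hypothetical helper names.

```bash
# Hypothetical helper: build a non-streaming /api/generate request body.
# $1 is the model name (assumed to contain no characters that need JSON escaping).
generation_payload() {
  printf '{"model":"%s","prompt":"Say hello.","stream":false}' "$1"
}

# Hypothetical helper: print the wall-clock seconds for one full generation.
# $1 is the base URL, $2 is the model name.
time_generation() {
  curl -sS -o /dev/null -w '%{time_total}\n' \
    -H 'Content-Type: application/json' \
    -d "$(generation_payload "$2")" \
    "$1/api/generate"
}

# Example (from local machine):
#   time_generation http://YOUR_GPU_VM_IP:11434 qwen2.5:14b
```

Running it once right after a restart and once more immediately afterwards gives a feel for cold-load versus warm latency; the timeouts above should comfortably exceed the cold figure.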