# Deployment Guide

## Overview

LLM Council can be deployed in several configurations depending on your needs:

- **Local Development**: Everything runs on your local machine
- **Hybrid**: Frontend/backend local, LLM server on a remote GPU VM
- **Full Remote**: Everything on a server/VM
- **Production**: Professional deployment with proper infrastructure

## Architecture Options

### Option 1: Hybrid (Recommended for Development)

**Setup:**

- Frontend + Backend: run on your local machine
- LLM Server (Ollama): runs on a remote GPU VM

**Pros:**

- Easy development and debugging
- GPU resources available remotely
- No need to deploy frontend/backend code
- Fast iteration

**Cons:**

- Requires network connectivity to the GPU VM
- Added latency for LLM requests

**Configuration:**

```bash
# .env on local machine
USE_LOCAL_OLLAMA=false
OPENAI_COMPAT_BASE_URL=http://your-gpu-vm-ip:11434
```

### Option 2: Full Remote Deployment

**Setup:**

- Everything runs on the GPU VM or a dedicated server

**Pros:**

- Centralized deployment
- Can be accessed from multiple machines
- Better for team use

**Cons:**

- More complex setup
- Requires proper security configuration
- Slower development iteration

### Option 3: Production Deployment (Professional)

**Recommended Stack:**

- **Frontend**: Serve the static build via nginx/CDN
- **Backend**: Run via systemd/gunicorn/uvicorn behind a reverse proxy
- **LLM Server**: Separate service on the GPU VM
- **Security**: TLS/HTTPS, authentication, rate limiting

## GPU VM Setup

### Prerequisites

1. A GPU VM with:
   - NVIDIA GPU with CUDA support
   - Sufficient VRAM for your models
   - Network access from your local machine
2. Ollama installed on the GPU VM:

   ```bash
   curl -fsSL https://ollama.ai/install.sh | sh
   ```
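Before working through the setup steps, a quick preflight check can confirm the basic tools are in place. A minimal sketch, assuming only standard POSIX utilities; the `check_cmd` helper name is illustrative, not part of LLM Council:

```shell
#!/bin/sh
# Preflight sketch: confirm a required command is on PATH before proceeding.
check_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Commands the following steps rely on; warn rather than abort on a miss.
check_cmd curl || echo "install curl before continuing"
check_cmd nvidia-smi || echo "no NVIDIA driver detected on this machine"
```

Run it on the GPU VM first; a missing `nvidia-smi` there usually means the driver install failed, not Ollama.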
### Step 1: Configure Ollama to Accept Remote Connections

**On GPU VM:**

```bash
# Option A: Environment variable (temporary)
export OLLAMA_HOST=0.0.0.0:11434

# Option B: Systemd service (persistent - recommended)
sudo systemctl edit ollama
```

Add to the override file:

```ini
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=24h"
Environment="OLLAMA_MAX_LOADED_MODELS=3"
```

Then restart:

```bash
sudo systemctl daemon-reload
sudo systemctl restart ollama
```

### Step 2: Configure Firewall

**On GPU VM:**

```bash
# Allow port 11434 from your local machine's IP only
sudo ufw allow from YOUR_LOCAL_IP to any port 11434

# Or allow from any source (less secure)
sudo ufw allow 11434/tcp
```

### Step 3: Pull Required Models

**On GPU VM:**

```bash
ollama pull qwen2.5:7b
ollama pull llama3.1:8b
ollama pull qwen2.5:14b
ollama pull qwen2:latest
```

### Step 3.5 (GPU VM): Ensure Ollama Uses GPU and Stores Data on /mnt/data

If your VM has a small root disk, keep Ollama's model storage and HOME off `/` (a common cause of hard-to-diagnose failures). Also note that on this setup, `OLLAMA_LLM_LIBRARY=cuda` caused Ollama to *skip* its CUDA libraries; use `cuda_v12` instead.

**On GPU VM:**

```bash
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf >/dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=24h"
Environment="OLLAMA_MODELS=/mnt/data/ollama"
Environment="HOME=/mnt/data/ollama/home"
Environment="OLLAMA_LLM_LIBRARY=cuda_v12"
Environment="LD_LIBRARY_PATH=/usr/local/lib/ollama:/usr/local/lib/ollama/cuda_v12"
EOF
sudo systemctl daemon-reload
sudo systemctl restart ollama
```

**Verify GPU offload (on GPU VM):**

```bash
ollama run qwen2:latest "Write 80 words about GPUs."
ollama ps
```
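In the `ollama ps` output, a fully GPU-resident model shows `100% GPU` in the PROCESSOR column. A sketch of checking that programmatically; the sample output below is illustrative, and on the VM you would pipe the real `ollama ps` instead:

```shell
#!/bin/sh
# Illustrative `ollama ps` output; on the VM, substitute: ollama ps | grep ...
sample_ps="NAME          SIZE    PROCESSOR  UNTIL
qwen2:latest  5.3 GB  100% GPU   24 hours from now"

if echo "$sample_ps" | grep -q '100% GPU'; then
  echo "fully offloaded to GPU"
else
  echo "partial or CPU-only - recheck OLLAMA_LLM_LIBRARY and LD_LIBRARY_PATH"
fi
```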
### Step 4: Verify Remote Access

**From local machine:**

```bash
curl http://YOUR_GPU_VM_IP:11434/api/tags
# Should return a list of available models

curl http://YOUR_GPU_VM_IP:11434/v1/models
```

### Step 5: Configure LLM Council

**On local machine, in `.env`:**

```bash
USE_LOCAL_OLLAMA=false
OPENAI_COMPAT_BASE_URL=http://YOUR_GPU_VM_IP:11434

# Local (small) example:
# COUNCIL_MODELS=llama3.2:1b,qwen2.5:0.5b,gemma2:2b
# CHAIRMAN_MODEL=llama3.2:3b

# GPU (available models):
COUNCIL_MODELS=qwen2.5:7b,llama3.1:8b,qwen2:latest
CHAIRMAN_MODEL=qwen2.5:14b
```

## Security Considerations

### For Development/Internal Use

1. **Network Security:**
   - Use a VPN or private network
   - Restrict the firewall to specific IPs
   - Consider an SSH tunnel for extra security
2. **Ollama Security:**
   - Ollama has no built-in authentication
   - Only expose it on trusted networks
   - Consider a reverse proxy with auth (nginx + basic auth)

### For Production

1. **Authentication:**
   - Add API key authentication to the backend
   - Use session-based auth for the frontend
   - Implement rate limiting
2. **Network Security:**
   - Use HTTPS/TLS everywhere
   - Set up proper firewall rules
   - Consider a reverse proxy (nginx/traefik)
3. **Infrastructure:**
   - Use container orchestration (Docker Compose/Kubernetes)
   - Set up monitoring and logging
   - Implement a backup strategy for conversations

## Deployment Scripts

### Quick Start (Local + Remote Ollama)

```bash
# 1. Start Ollama on the GPU VM (already running if systemd is configured)
# 2. On local machine:
./start.sh
```

### Full Remote Deployment

See `docs/DEPLOYMENT_FULL.md` for complete remote deployment instructions.

## Troubleshooting

### Connection Timeouts

1. Check that Ollama is listening on all interfaces:

   ```bash
   # On GPU VM
   sudo netstat -tlnp | grep 11434
   # Should show 0.0.0.0:11434, not 127.0.0.1:11434
   ```

2. Check firewall rules:

   ```bash
   # On GPU VM
   sudo ufw status
   ```

3. Test connectivity:

   ```bash
   # From local machine
   curl -v http://GPU_VM_IP:11434/api/tags
   ```
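The connectivity test can be wrapped with an explicit timeout so a silently dropped connection (for example, a firewall DROP rule) fails fast instead of hanging. A sketch, using the non-routable documentation address `192.0.2.10` as a stand-in for your VM's IP:

```shell
#!/bin/sh
# Probe the Ollama API with a hard 5-second cap.
GPU_VM_IP="192.0.2.10"   # placeholder; replace with your VM's address

probe() {
  if curl --silent --max-time 5 "http://${GPU_VM_IP}:11434/api/tags" >/dev/null; then
    echo "reachable"
  else
    echo "unreachable - check ufw rules and OLLAMA_HOST"
  fi
}

probe
```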
### Model Loading Issues

1. Check available VRAM:

   ```bash
   nvidia-smi
   ```

2. Adjust `OLLAMA_MAX_LOADED_MODELS` if needed
3. Check model sizes against available memory

## Performance Tuning

### Ollama Settings

```bash
# On GPU VM, edit the systemd override:
Environment="OLLAMA_KEEP_ALIVE=24h"        # Keep models loaded
Environment="OLLAMA_MAX_LOADED_MODELS=3"   # Max concurrently loaded models
Environment="OLLAMA_NUM_PARALLEL=1"        # Parallel requests per model
```

### LLM Council Timeouts

Adjust in `.env`:

```bash
LLM_TIMEOUT_SECONDS=600.0            # For slow models
CHAIRMAN_TIMEOUT_SECONDS=600.0
OPENAI_COMPAT_TIMEOUT_SECONDS=600.0
OPENAI_COMPAT_CONNECT_TIMEOUT_SECONDS=30.0
OPENAI_COMPAT_WRITE_TIMEOUT_SECONDS=30.0
OPENAI_COMPAT_POOL_TIMEOUT_SECONDS=30.0
```
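If you launch the backend from a wrapper script rather than relying on `.env` alone, shell parameter defaults keep the fallbacks in one place. A sketch using the same keys as above; the `600.0` fallback values simply mirror the example, and whether LLM Council applies identical defaults internally is an assumption:

```shell
#!/bin/sh
# Apply fallback timeout values only when the environment doesn't already set them.
: "${LLM_TIMEOUT_SECONDS:=600.0}"
: "${CHAIRMAN_TIMEOUT_SECONDS:=600.0}"
: "${OPENAI_COMPAT_TIMEOUT_SECONDS:=600.0}"
export LLM_TIMEOUT_SECONDS CHAIRMAN_TIMEOUT_SECONDS OPENAI_COMPAT_TIMEOUT_SECONDS

echo "LLM timeout: ${LLM_TIMEOUT_SECONDS}s"
```

Values already exported in the environment (or sourced from `.env`) win; the `:=` expansion only fills in gaps.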