## Features Added
### Document Reference System
- Implemented numbered document references (@1, @2, etc.) with autocomplete dropdown
- Added fuzzy filename matching for @filename references
- Document filtering now prioritizes numeric refs > filename refs > all documents
- Autocomplete dropdown appears when typing @ with keyboard navigation (Up/Down, Enter/Tab, Escape)
- Document numbers displayed in UI for easy reference
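The resolution order above can be sketched as follows. This is illustrative only: the function name and signature are not the actual `docs_context.py` API, and a simple case-insensitive substring match stands in for the fuzzy filename matching.

```python
import re


def resolve_references(text: str, documents: list[str]) -> list[str]:
    """Resolve @N and @filename references against an ordered document list.

    Priority, as described above: numeric refs, then filename refs,
    then fall back to all documents when nothing matches.
    """
    matched: list[str] = []
    for token in re.findall(r"@(\w[\w.\-]*)", text):
        if token.isdigit():  # @1, @2, ... are 1-based document numbers
            idx = int(token) - 1
            if 0 <= idx < len(documents):
                matched.append(documents[idx])
        else:  # stand-in for fuzzy matching: case-insensitive substring
            matched.extend(d for d in documents if token.lower() in d.lower())
    return matched or list(documents)


docs = ["design.md", "notes.txt", "report.pdf"]
print(resolve_references("compare @1 with @notes", docs))
```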
### Conversation Management
- Added conversation rename functionality with inline editing
- Implemented conversation search (by title and content)
- Search box always visible, even when no conversations exist
- Export reports now replace @N references with actual filenames
### UI/UX Improvements
- Removed debug toggle button
- Improved text contrast in dark mode (better visibility)
- Made input textarea expand to full available width
- Fixed file text color for better readability
- Enhanced document display with numbered badges
### Configuration & Timeouts
- Made HTTP client timeouts configurable (connect, write, pool)
- Added .env.example with all configuration options
- Updated timeout documentation
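A minimal sketch of how the configurable timeouts can be read from the environment. The env var names come from `.env.example`; the defaults here mirror the example values in the deployment docs and the helper itself is hypothetical, not the actual `config.py` code.

```python
import os

# Env vars from .env.example, paired with illustrative defaults.
_TIMEOUT_ENV = {
    "read": ("OPENAI_COMPAT_TIMEOUT_SECONDS", 600.0),
    "connect": ("OPENAI_COMPAT_CONNECT_TIMEOUT_SECONDS", 30.0),
    "write": ("OPENAI_COMPAT_WRITE_TIMEOUT_SECONDS", 30.0),
    "pool": ("OPENAI_COMPAT_POOL_TIMEOUT_SECONDS", 30.0),
}


def load_timeouts() -> dict:
    """Read each timeout from the environment, falling back to its default."""
    return {
        kind: float(os.getenv(var, str(default)))
        for kind, (var, default) in _TIMEOUT_ENV.items()
    }
```

If the HTTP client is httpx, the resulting dict maps directly onto `httpx.Timeout(**load_timeouts())`, since `httpx.Timeout` accepts `connect`, `read`, `write`, and `pool` keywords.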
### Developer Experience
- Added `make test-setup` target for automated test conversation creation
- Test setup script supports TEST_MESSAGE and TEST_DOCS env vars
- Improved Makefile with dev and test-setup targets
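For example (the values below are illustrative; see `scripts/test_setup.py` for the exact semantics of each variable):

```bash
# Seed a test conversation with a custom prompt and document set
TEST_MESSAGE="Summarize @1" TEST_DOCS="docs/sample.md" make test-setup
```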
### Documentation
- Updated ARCHITECTURE.md with all new features
- Created comprehensive deployment documentation
- Added GPU VM setup guides
- Removed unnecessary markdown files (CLAUDE.md, CONTRIBUTING.md, header.jpg)
- Organized documentation in docs/ directory
### GPU VM / Ollama (Stability + GPU Offload)
- Updated GPU VM docs to reflect the working systemd environment for remote Ollama
- Standardized remote Ollama port to 11434 (and added /v1/models verification)
- Documented required env for GPU offload on this VM:
- `OLLAMA_MODELS=/mnt/data/ollama`, `HOME=/mnt/data/ollama/home`
- `OLLAMA_LLM_LIBRARY=cuda_v12` (not `cuda`)
- `LD_LIBRARY_PATH=/usr/local/lib/ollama:/usr/local/lib/ollama/cuda_v12`
## Technical Changes
### Backend
- Enhanced `docs_context.py` with reference parsing (numeric and filename)
- Added `update_conversation_title` to storage.py
- New endpoints: PATCH /api/conversations/{id}/title, GET /api/conversations/search
- Improved report generation with filename substitution
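The new endpoints can be exercised with curl. Assumptions here: the backend listens on localhost:8000, the rename takes a JSON `{"title": ...}` body, and search uses a `q` query parameter; check `main.py` for the actual shapes.

```bash
# Rename a conversation (assumed request shape)
curl -X PATCH http://localhost:8000/api/conversations/CONV_ID/title \
  -H "Content-Type: application/json" \
  -d '{"title": "Renamed conversation"}'

# Search conversations by title and content (assumed query param)
curl "http://localhost:8000/api/conversations/search?q=budget"
```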
### Frontend
- Removed debugMode state and related code
- Added autocomplete dropdown component
- Implemented search functionality in Sidebar
- Enhanced ChatInterface with autocomplete and improved textarea sizing
- Updated CSS for better contrast and responsive design
## Files Changed
- Backend: config.py, council.py, docs_context.py, main.py, storage.py
- Frontend: App.jsx, ChatInterface.jsx, Sidebar.jsx, and related CSS files
- Documentation: README.md, ARCHITECTURE.md, new docs/ directory
- Configuration: .env.example, Makefile
- Scripts: scripts/test_setup.py
## Breaking Changes
None - all changes are backward compatible
## Testing
- All existing tests pass
- New test-setup script validates conversation creation workflow
- Manual testing of autocomplete, search, and rename features
# Deployment Guide
## Overview

LLM Council can be deployed in several configurations depending on your needs:

- **Local Development**: Everything runs on your local machine
- **Hybrid**: Frontend/Backend local, LLM server on remote GPU VM
- **Full Remote**: Everything on a server/VM
- **Production**: Professional deployment with proper infrastructure

## Architecture Options

### Option 1: Hybrid (Recommended for Development)

**Setup:**

- Frontend + Backend: Run on your local machine
- LLM Server (Ollama): Run on remote GPU VM

**Pros:**

- Easy development and debugging
- GPU resources available remotely
- No need to deploy frontend/backend code
- Fast iteration

**Cons:**

- Requires network connectivity to the GPU VM
- Added latency for LLM requests

**Configuration:**

```bash
# .env on local machine
USE_LOCAL_OLLAMA=false
OPENAI_COMPAT_BASE_URL=http://your-gpu-vm-ip:11434
```
### Option 2: Full Remote Deployment

**Setup:**

- Everything runs on the GPU VM or a dedicated server

**Pros:**

- Centralized deployment
- Can be accessed from multiple machines
- Better for team use

**Cons:**

- More complex setup
- Requires proper security configuration
- Slower development iteration

### Option 3: Production Deployment (Professional)

**Recommended Stack:**

- **Frontend**: Serve the static build via nginx/CDN
- **Backend**: Run via systemd/gunicorn/uvicorn behind a reverse proxy
- **LLM Server**: Separate service on the GPU VM
- **Security**: TLS/HTTPS, authentication, rate limiting
## GPU VM Setup

### Prerequisites

1. GPU VM with:
   - NVIDIA GPU with CUDA support
   - Sufficient VRAM for your models
   - Network access from your local machine

2. Ollama installed on the GPU VM:

   ```bash
   curl -fsSL https://ollama.ai/install.sh | sh
   ```
### Step 1: Configure Ollama to Accept Remote Connections

**On GPU VM:**

```bash
# Option A: Environment variable (temporary)
export OLLAMA_HOST=0.0.0.0:11434

# Option B: Systemd service (persistent - recommended)
sudo systemctl edit ollama
```

Add to the override file:

```ini
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=24h"
Environment="OLLAMA_MAX_LOADED_MODELS=3"
```

Then restart:

```bash
sudo systemctl daemon-reload
sudo systemctl restart ollama
```
### Step 2: Configure Firewall

**On GPU VM:**

```bash
# Allow port 11434 from your local machine's IP only (recommended)
sudo ufw allow from YOUR_LOCAL_IP to any port 11434

# Or allow from anywhere (less secure)
sudo ufw allow 11434/tcp
```
### Step 3: Pull Required Models

**On GPU VM:**

```bash
ollama pull qwen2.5:7b
ollama pull llama3.1:8b
ollama pull qwen2.5:14b
ollama pull qwen2:latest
```
### Step 3.5 (GPU VM): Ensure Ollama Uses GPU + Stores Data on /mnt/data

If your VM has a small root disk, keep Ollama's storage and HOME off `/` (a full root disk is a common cause of hard-to-diagnose failures). Also note that on this setup, `OLLAMA_LLM_LIBRARY=cuda` caused Ollama to *skip* CUDA libraries; use `cuda_v12`.

**On GPU VM:**

```bash
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf >/dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_KEEP_ALIVE=24h"
Environment="OLLAMA_MODELS=/mnt/data/ollama"
Environment="HOME=/mnt/data/ollama/home"
Environment="OLLAMA_LLM_LIBRARY=cuda_v12"
Environment="LD_LIBRARY_PATH=/usr/local/lib/ollama:/usr/local/lib/ollama/cuda_v12"
EOF

sudo systemctl daemon-reload
sudo systemctl restart ollama
```

**Verify GPU offload (on GPU VM):**

```bash
ollama run qwen2:latest "Write 80 words about GPUs."
ollama ps
```
### Step 4: Verify Remote Access

**From local machine:**

```bash
curl http://YOUR_GPU_VM_IP:11434/api/tags
# Should return a list of available models
curl http://YOUR_GPU_VM_IP:11434/v1/models
```
### Step 5: Configure LLM Council

**On local machine `.env`:**

```bash
USE_LOCAL_OLLAMA=false
OPENAI_COMPAT_BASE_URL=http://YOUR_GPU_VM_IP:11434

# Local (small) example:
# COUNCIL_MODELS=llama3.2:1b,qwen2.5:0.5b,gemma2:2b
# CHAIRMAN_MODEL=llama3.2:3b

# GPU (available models):
COUNCIL_MODELS=qwen2.5:7b,llama3.1:8b,qwen2:latest
CHAIRMAN_MODEL=qwen2.5:14b
```
## Security Considerations

### For Development/Internal Use

1. **Network Security:**
   - Use a VPN or private network
   - Restrict the firewall to specific IPs
   - Consider an SSH tunnel for extra security

2. **Ollama Security:**
   - Ollama has no built-in authentication
   - Only expose it on trusted networks
   - Consider a reverse proxy with auth (nginx + basic auth)
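A minimal sketch of the SSH-tunnel option (hostnames are placeholders): forward a local port to Ollama on the VM, so port 11434 never needs to be opened in the VM's firewall at all.

```bash
# Forward localhost:11434 to Ollama on the GPU VM over SSH
ssh -N -L 11434:localhost:11434 user@YOUR_GPU_VM_IP

# Then point LLM Council at the tunnel in .env:
# OPENAI_COMPAT_BASE_URL=http://localhost:11434
```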
### For Production

1. **Authentication:**
   - Add API key authentication to the backend
   - Use session-based auth for the frontend
   - Implement rate limiting

2. **Network Security:**
   - Use HTTPS/TLS everywhere
   - Set up proper firewall rules
   - Consider using a reverse proxy (nginx/traefik)

3. **Infrastructure:**
   - Use container orchestration (Docker Compose/Kubernetes)
   - Set up monitoring and logging
   - Implement a backup strategy for conversations
## Deployment Scripts

### Quick Start (Local + Remote Ollama)

```bash
# 1. Start Ollama on the GPU VM (already running if systemd is configured)
# 2. On local machine:
./start.sh
```

### Full Remote Deployment

See `docs/DEPLOYMENT_FULL.md` for complete remote deployment instructions.
## Troubleshooting

### Connection Timeouts

1. Check that Ollama is listening on all interfaces:

   ```bash
   # On GPU VM
   sudo netstat -tlnp | grep 11434
   # Should show 0.0.0.0:11434, not 127.0.0.1:11434
   ```

2. Check firewall rules:

   ```bash
   # On GPU VM
   sudo ufw status
   ```

3. Test connectivity:

   ```bash
   # From local machine
   curl -v http://GPU_VM_IP:11434/api/tags
   ```
### Model Loading Issues

1. Check available VRAM:

   ```bash
   nvidia-smi
   ```

2. Adjust `OLLAMA_MAX_LOADED_MODELS` if needed

3. Check model sizes against available memory
## Performance Tuning

### Ollama Settings

```bash
# On GPU VM, edit the systemd override:
Environment="OLLAMA_KEEP_ALIVE=24h"        # Keep models loaded
Environment="OLLAMA_MAX_LOADED_MODELS=3"   # Max concurrent models
Environment="OLLAMA_NUM_PARALLEL=1"        # Parallel requests
```

### LLM Council Timeouts

Adjust in `.env`:

```bash
LLM_TIMEOUT_SECONDS=600.0          # For slow models
CHAIRMAN_TIMEOUT_SECONDS=600.0
OPENAI_COMPAT_TIMEOUT_SECONDS=600.0
OPENAI_COMPAT_CONNECT_TIMEOUT_SECONDS=30.0
OPENAI_COMPAT_WRITE_TIMEOUT_SECONDS=30.0
OPENAI_COMPAT_POOL_TIMEOUT_SECONDS=30.0
```
|