llm_council/docs/DEPLOYMENT_RECOMMENDATIONS.md
Irina Levit 3546c04348
Some checks failed
CI / backend-test (push) Successful in 4m9s
CI / frontend-test (push) Failing after 3m48s
CI / lint-python (push) Successful in 1m41s
CI / secret-scanning (push) Successful in 1m20s
CI / dependency-scan (push) Successful in 10m50s
CI / workflow-summary (push) Successful in 1m11s
feat: Major UI/UX improvements and production readiness
## Features Added

### Document Reference System
- Implemented numbered document references (@1, @2, etc.) with autocomplete dropdown
- Added fuzzy filename matching for @filename references
- Document filtering now prioritizes numeric refs > filename refs > all documents
- Autocomplete dropdown appears when typing @ with keyboard navigation (Up/Down, Enter/Tab, Escape)
- Document numbers displayed in UI for easy reference

### Conversation Management
- Added conversation rename functionality with inline editing
- Implemented conversation search (by title and content)
- Search box always visible, even when no conversations exist
- Export reports now replace @N references with actual filenames

### UI/UX Improvements
- Removed debug toggle button
- Improved text contrast in dark mode (better visibility)
- Made input textarea expand to full available width
- Fixed file text color for better readability
- Enhanced document display with numbered badges

### Configuration & Timeouts
- Made HTTP client timeouts configurable (connect, write, pool)
- Added .env.example with all configuration options
- Updated timeout documentation

### Developer Experience
- Added `make test-setup` target for automated test conversation creation
- Test setup script supports TEST_MESSAGE and TEST_DOCS env vars
- Improved Makefile with dev and test-setup targets

### Documentation
- Updated ARCHITECTURE.md with all new features
- Created comprehensive deployment documentation
- Added GPU VM setup guides
- Removed unnecessary markdown files (CLAUDE.md, CONTRIBUTING.md, header.jpg)
- Organized documentation in docs/ directory

### GPU VM / Ollama (Stability + GPU Offload)
- Updated GPU VM docs to reflect the working systemd environment for remote Ollama
- Standardized remote Ollama port to 11434 (and added /v1/models verification)
- Documented required env for GPU offload on this VM:
  - `OLLAMA_MODELS=/mnt/data/ollama`, `HOME=/mnt/data/ollama/home`
  - `OLLAMA_LLM_LIBRARY=cuda_v12` (not `cuda`)
  - `LD_LIBRARY_PATH=/usr/local/lib/ollama:/usr/local/lib/ollama/cuda_v12`

## Technical Changes

### Backend
- Enhanced `docs_context.py` with reference parsing (numeric and filename)
- Added `update_conversation_title` to storage.py
- New endpoints: PATCH /api/conversations/{id}/title, GET /api/conversations/search
- Improved report generation with filename substitution

### Frontend
- Removed debugMode state and related code
- Added autocomplete dropdown component
- Implemented search functionality in Sidebar
- Enhanced ChatInterface with autocomplete and improved textarea sizing
- Updated CSS for better contrast and responsive design

## Files Changed
- Backend: config.py, council.py, docs_context.py, main.py, storage.py
- Frontend: App.jsx, ChatInterface.jsx, Sidebar.jsx, and related CSS files
- Documentation: README.md, ARCHITECTURE.md, new docs/ directory
- Configuration: .env.example, Makefile
- Scripts: scripts/test_setup.py

## Breaking Changes
None - all changes are backward compatible

## Testing
- All existing tests pass
- New test-setup script validates conversation creation workflow
- Manual testing of autocomplete, search, and rename features
2025-12-28 18:15:02 -05:00

5.1 KiB

Professional Deployment Recommendations

For Development/Personal Use

Hybrid Approach (Recommended):

┌─────────────────┐         ┌──────────────────┐
│  Local Machine  │         │   GPU VM         │
│                 │         │                  │
│  Frontend       │         │  Ollama Server   │
│  (React/Vite)   │         │  (LLM Models)    │
│                 │◄────────┤                  │
│  Backend        │  HTTP   │  Port 11434      │
│  (FastAPI)      │         │                  │
└─────────────────┘         └──────────────────┘

Why this is best:

  • Fast development iteration
  • Easy debugging (logs on local machine)
  • GPU resources available remotely
  • No complex deployment needed
  • Can work offline (if models cached locally)

For Production/Team Use

Full Remote Deployment:

┌─────────────────┐
│   Users         │
│  (Browsers)     │
└────────┬────────┘
         │ HTTPS
         ▼
┌─────────────────┐
│  Reverse Proxy  │
│  (nginx/traefik)│
└────────┬────────┘
         │
    ┌────┴────┐
    │         │
    ▼         ▼
┌────────┐ ┌──────────┐
│Frontend│ │ Backend  │
│(Static)│ │(FastAPI) │
└────────┘ └────┬─────┘
                │ HTTP
                ▼
         ┌──────────────┐
         │  GPU VM      │
         │  Ollama      │
         └──────────────┘

Comparison of Approaches

Aspect Hybrid (Local + Remote) Full Remote Production
Setup Complexity Low Medium High
Development Speed Fast Medium Slow
Security Medium Medium High
Scalability Low Medium High
Cost Low Medium High
Best For Dev/Personal Team/Internal Public/Enterprise

Security Best Practices

Development/Internal Network

  1. Network Isolation:

    • Use private network/VPN
    • Restrict firewall to specific IPs
    • Consider SSH tunnel for Ollama
  2. Ollama Access:

    # Only allow from specific IPs
    sudo ufw allow from YOUR_IP to any port 11434
    
  3. SSH Tunnel Alternative:

    # More secure - no direct network exposure
    ssh -L 11434:localhost:11434 user@gpu-vm
    # Then use localhost:11434 in .env
    

Production Deployment

  1. Authentication:

    • Add API keys to backend
    • Implement user sessions
    • Use OAuth2/JWT for API
  2. Network Security:

    • HTTPS/TLS everywhere
    • WAF (Web Application Firewall)
    • Rate limiting
    • DDoS protection
  3. Infrastructure:

    • Container orchestration (Docker/K8s)
    • Service mesh for internal communication
    • Monitoring and alerting
    • Automated backups

Deployment Checklist

  • GPU VM has Ollama installed
  • Ollama configured to listen on 0.0.0.0:11434
  • Firewall allows connections from local machine
  • Models pulled on GPU VM
  • Local .env configured with GPU VM IP
  • Test connection: curl http://GPU_VM_IP:11434/api/tags

Full Remote Setup

  • Server/VM provisioned
  • Frontend built and served (nginx/static host)
  • Backend running as service (systemd/supervisor)
  • Reverse proxy configured (nginx/traefik)
  • SSL certificates installed
  • Authentication implemented
  • Monitoring set up
  • Backup strategy in place

Production Setup

  • All of Full Remote checklist
  • Load balancing configured
  • Database for conversations (optional upgrade)
  • Logging and monitoring (Prometheus/Grafana)
  • CI/CD pipeline
  • Security audit
  • Documentation for ops team
  • Disaster recovery plan

Cost Considerations

Hybrid (Local + Remote GPU VM)

  • Cost: GPU VM only (~$0.50-2/hour depending on GPU)
  • Best for: Development, personal projects, small teams

Full Remote

  • Cost: GPU VM + Application Server (~$1-3/hour)
  • Best for: Teams, internal tools

Production

  • Cost: $100-1000+/month depending on scale
  • Best for: Public services, enterprise

Migration Path

  1. Start: Hybrid (local dev, remote GPU)
  2. Grow: Full remote (when team needs it)
  3. Scale: Production (when going public/enterprise)

Recommendation

For your use case (development/personal):

Use the Hybrid approach:

  • Run frontend + backend locally
  • Connect to Ollama on GPU VM
  • Use SSH tunnel for extra security if needed
  • Simple, fast, cost-effective

This gives you:

  • Fast development iteration
  • Easy debugging
  • GPU resources when needed
  • Minimal infrastructure complexity
  • Low cost