## CI Status
Some checks failed:
- backend-test (push): passed in 4m9s
- frontend-test (push): **failed** after 3m48s
- lint-python (push): passed in 1m41s
- secret-scanning (push): passed in 1m20s
- dependency-scan (push): passed in 10m50s
- workflow-summary (push): passed in 1m11s
## Features Added
### Document Reference System
- Implemented numbered document references (@1, @2, etc.) with autocomplete dropdown
- Added fuzzy filename matching for @filename references
- Document filtering now prioritizes numeric refs > filename refs > all documents
- Autocomplete dropdown appears when typing @ with keyboard navigation (Up/Down, Enter/Tab, Escape)
- Document numbers displayed in UI for easy reference
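The resolution order described above (numeric refs, then filename refs, then all documents) can be sketched roughly like this; the function name and regexes are illustrative, not the actual `docs_context.py` code:

```python
import re

def resolve_references(query: str, documents: list[str]) -> list[str]:
    """Pick which documents a message refers to.

    Priority: numeric refs (@1, @2) > filename refs (@report) > all documents.
    """
    # Numeric references: @N indexes 1-based into the numbered document list.
    numeric = [int(n) for n in re.findall(r"@(\d+)\b", query)]
    hits = [documents[n - 1] for n in numeric if 1 <= n <= len(documents)]
    if hits:
        return hits

    # Filename references: @name matches any document whose name contains it.
    names = re.findall(r"@([\w.\-]+)", query)
    hits = [d for d in documents if any(n.lower() in d.lower() for n in names)]
    if hits:
        return hits

    # No usable references: fall back to every document.
    return documents
```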
### Conversation Management
- Added conversation rename functionality with inline editing
- Implemented conversation search (by title and content)
- Search box always visible, even when no conversations exist
- Export reports now replace @N references with actual filenames
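The @N substitution for exported reports amounts to a small regex pass; a hedged sketch (`substitute_refs` is a hypothetical name, not the real report code):

```python
import re

def substitute_refs(text: str, documents: list[str]) -> str:
    """Replace @N references with the corresponding document filename."""
    def repl(m: re.Match) -> str:
        n = int(m.group(1))
        if 1 <= n <= len(documents):
            return documents[n - 1]
        return m.group(0)  # leave out-of-range refs untouched
    return re.sub(r"@(\d+)\b", repl, text)
```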
### UI/UX Improvements
- Removed debug toggle button
- Improved text contrast in dark mode (better visibility)
- Made input textarea expand to full available width
- Fixed file text color for better readability
- Enhanced document display with numbered badges
### Configuration & Timeouts
- Made HTTP client timeouts configurable (connect, write, pool)
- Added .env.example with all configuration options
- Updated timeout documentation
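Assuming the backend reads per-phase timeouts from environment variables (the variable names below are illustrative; see `.env.example` for the real option names), the resolution might look like:

```python
import os

def resolve_timeouts() -> dict[str, float]:
    """Read per-phase HTTP timeouts from the environment.

    Variable names here are illustrative. The returned values could be
    passed to e.g. httpx.Timeout(connect=..., read=..., write=..., pool=...).
    Invalid or unset values fall back to the defaults.
    """
    defaults = {"connect": 10.0, "read": 300.0, "write": 60.0, "pool": 10.0}
    out = {}
    for phase, default in defaults.items():
        raw = os.getenv(f"LLM_{phase.upper()}_TIMEOUT", "").strip()
        try:
            out[phase] = float(raw) if raw else default
        except ValueError:
            out[phase] = default
    return out
```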
### Developer Experience
- Added `make test-setup` target for automated test conversation creation
- Test setup script supports TEST_MESSAGE and TEST_DOCS env vars
- Improved Makefile with dev and test-setup targets
### Documentation
- Updated ARCHITECTURE.md with all new features
- Created comprehensive deployment documentation
- Added GPU VM setup guides
- Removed unneeded repo files (CLAUDE.md, CONTRIBUTING.md, header.jpg)
- Organized documentation in docs/ directory
### GPU VM / Ollama (Stability + GPU Offload)
- Updated GPU VM docs to reflect the working systemd environment for remote Ollama
- Standardized remote Ollama port to 11434 (and added /v1/models verification)
- Documented required env for GPU offload on this VM:
  - `OLLAMA_MODELS=/mnt/data/ollama`, `HOME=/mnt/data/ollama/home`
  - `OLLAMA_LLM_LIBRARY=cuda_v12` (not `cuda`)
  - `LD_LIBRARY_PATH=/usr/local/lib/ollama:/usr/local/lib/ollama/cuda_v12`
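A quick way to verify the `/v1/models` route on the remote VM, sketched with the stdlib (the host placeholder is yours to fill in; the response shape assumed here is the standard OpenAI-compatible list):

```python
import json
from urllib.request import urlopen

def parse_models(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-compatible /v1/models response."""
    # Expected shape: {"object": "list", "data": [{"id": "..."}, ...]}
    return [m["id"] for m in payload.get("data", [])]

def check_remote_ollama(base_url: str) -> list[str]:
    """Fetch /v1/models from remote Ollama, e.g. http://<gpu-vm-ip>:11434."""
    with urlopen(f"{base_url}/v1/models", timeout=10) as resp:
        return parse_models(json.load(resp))
```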
## Technical Changes
### Backend
- Enhanced `docs_context.py` with reference parsing (numeric and filename)
- Added `update_conversation_title` to storage.py
- New endpoints: PATCH /api/conversations/{id}/title, GET /api/conversations/search
- Improved report generation with filename substitution
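The search endpoint's matching logic (by title and content, case-insensitive) could be a framework-independent helper along these lines; the function name and record shape are assumptions, not the actual `storage.py` code:

```python
def search_conversations(conversations: list[dict], query: str) -> list[dict]:
    """Match a query against each conversation's title or message content."""
    q = query.lower()

    def matches(conv: dict) -> bool:
        if q in conv.get("title", "").lower():
            return True
        return any(
            q in m.get("content", "").lower()
            for m in conv.get("messages", [])
        )

    return [c for c in conversations if matches(c)]
```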
### Frontend
- Removed debugMode state and related code
- Added autocomplete dropdown component
- Implemented search functionality in Sidebar
- Enhanced ChatInterface with autocomplete and improved textarea sizing
- Updated CSS for better contrast and responsive design
## Files Changed
- Backend: config.py, council.py, docs_context.py, main.py, storage.py
- Frontend: App.jsx, ChatInterface.jsx, Sidebar.jsx, and related CSS files
- Documentation: README.md, ARCHITECTURE.md, new docs/ directory
- Configuration: .env.example, Makefile
- Scripts: scripts/test_setup.py
## Breaking Changes
None; all changes are backward compatible.
## Testing
- All existing tests pass
- New test-setup script validates conversation creation workflow
- Manual testing of autocomplete, search, and rename features
"""Unified LLM client.
|
|
|
|
This module routes LLM requests to OpenAI-compatible servers (Ollama, vLLM, TGI, etc.).
|
|
|
|
The base URL is determined by:
|
|
- If USE_LOCAL_OLLAMA=true: uses http://localhost:11434
|
|
- Else if OPENAI_COMPAT_BASE_URL is set: uses that URL
|
|
- Else: raises an error (base URL must be configured)
|
|
"""
|
|
|
|
from __future__ import annotations
|
|
|
|
import os
|
|
from typing import Any, Dict, List, Optional
|
|
|
|
from .config import MAX_TOKENS, OPENAI_COMPAT_BASE_URL, LLM_TIMEOUT_SECONDS, DEBUG
|
|
|
|
|
|
def _get_provider_name() -> str:
|
|
"""Returns the provider name (always 'openai_compat' now)."""
|
|
return "openai_compat"
|
|
|
|
|
|
def _get_max_concurrency() -> int:
|
|
"""
|
|
Maximum number of in-flight model requests when calling query_models_parallel.
|
|
|
|
- If LLM_MAX_CONCURRENCY is unset/empty/invalid: unlimited (0)
|
|
- If set to 1: strictly sequential
|
|
- If set to N>1: at most N in flight
|
|
"""
|
|
raw = (os.getenv("LLM_MAX_CONCURRENCY") or "").strip()
|
|
if not raw:
|
|
return 0
|
|
try:
|
|
v = int(raw)
|
|
except ValueError:
|
|
return 0
|
|
return max(0, v)
|
|
|
|
|
|
def get_provider_info() -> Dict[str, Any]:
|
|
"""Get information about the configured provider."""
|
|
from .config import OPENAI_COMPAT_BASE_URL
|
|
return {
|
|
"provider": "openai_compat",
|
|
"base_url": OPENAI_COMPAT_BASE_URL
|
|
}
|
|
|
|
|
|
async def list_models() -> Optional[List[str]]:
|
|
"""List available models from the OpenAI-compatible server."""
|
|
from .openai_compat import list_models as _list
|
|
return await _list()
|
|
|
|
|
|
async def query_model(
|
|
model: str,
|
|
messages: List[Dict[str, str]],
|
|
timeout: Optional[float] = None,
|
|
max_tokens_override: Optional[int] = None,
|
|
) -> Optional[Dict[str, Any]]:
|
|
"""Query a model via OpenAI-compatible API."""
|
|
from .openai_compat import query_model as _query
|
|
|
|
max_tokens = max_tokens_override if max_tokens_override is not None else MAX_TOKENS
|
|
resolved_timeout = timeout if timeout is not None else LLM_TIMEOUT_SECONDS
|
|
|
|
return await _query(
|
|
model,
|
|
messages,
|
|
max_tokens=max_tokens,
|
|
timeout=resolved_timeout,
|
|
)
|
|
|
|
|
|
async def query_models_parallel(
|
|
models: List[str],
|
|
messages: List[Dict[str, str]],
|
|
timeout: Optional[float] = None,
|
|
max_tokens_override: Optional[int] = None,
|
|
) -> Dict[str, Optional[Dict[str, Any]]]:
|
|
import asyncio
|
|
|
|
resolved_timeout = timeout if timeout is not None else LLM_TIMEOUT_SECONDS
|
|
limit = _get_max_concurrency()
|
|
|
|
# If limit is 1, run completely sequentially (one at a time, wait for each to finish)
|
|
if limit == 1:
|
|
results = {}
|
|
for model in models:
|
|
if DEBUG:
|
|
print(f"[DEBUG] Running model '{model}' sequentially (concurrency=1)")
|
|
results[model] = await query_model(
|
|
model,
|
|
messages,
|
|
timeout=resolved_timeout,
|
|
max_tokens_override=max_tokens_override,
|
|
)
|
|
return results
|
|
|
|
# If limit <= 0 or >= len(models), run all in parallel (no limit)
|
|
if limit <= 0 or limit >= len(models):
|
|
tasks = [
|
|
query_model(
|
|
model,
|
|
messages,
|
|
timeout=resolved_timeout,
|
|
max_tokens_override=max_tokens_override,
|
|
)
|
|
for model in models
|
|
]
|
|
responses = await asyncio.gather(*tasks)
|
|
return {model: response for model, response in zip(models, responses)}
|
|
|
|
# Otherwise, use semaphore to limit concurrency (2, 3, etc.)
|
|
sem = asyncio.Semaphore(limit)
|
|
|
|
async def _run_one(model: str) -> Optional[Dict[str, Any]]:
|
|
async with sem:
|
|
return await query_model(
|
|
model,
|
|
messages,
|
|
timeout=resolved_timeout,
|
|
max_tokens_override=max_tokens_override,
|
|
)
|
|
|
|
tasks = [_run_one(model) for model in models]
|
|
responses = await asyncio.gather(*tasks)
|
|
return {model: response for model, response in zip(models, responses)}
|
|
|
|
|
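The semaphore-limited fan-out used by `query_models_parallel` can be exercised in isolation. This standalone sketch swaps in a stub worker (not the real `query_model`) and records the concurrency high-water mark to show the cap holds:

```python
import asyncio

async def run_limited(models, worker, limit=2):
    """Run worker(model) for each model with at most `limit` in flight."""
    sem = asyncio.Semaphore(limit)

    async def one(m):
        async with sem:
            return await worker(m)

    responses = await asyncio.gather(*(one(m) for m in models))
    return dict(zip(models, responses))

async def main():
    active = 0   # currently running workers
    peak = 0     # highest observed concurrency

    async def fake_query(model):
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)  # stand-in for a real LLM call
        active -= 1
        return {"model": model}

    results = await run_limited(["a", "b", "c", "d"], fake_query, limit=2)
    return results, peak

results, peak = asyncio.run(main())
```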