Compare commits


10 Commits

Author SHA1 Message Date
9c858699f3 Improve web search and error handling
- Add DuckDuckGo search fallback when Brave API key is not available
  - Web search now works without requiring an API key
  - Falls back to DuckDuckGo if BRAVE_API_KEY is not set
  - Maintains backward compatibility with Brave API when key is provided

- Improve error handling in agent CLI command
  - Better exception handling with traceback display
  - Prevents crashes from showing incomplete error messages
  - Improves debugging experience
2026-02-18 12:41:11 -05:00
7961bf1360 Fix transformers 4.39.3 compatibility issues with AirLLM
- Fix RoPE scaling compatibility: automatically convert unsupported 'llama3' type to 'linear' for local models
- Patch LlamaSdpaAttention to filter out position_embeddings argument that AirLLM passes but transformers 4.39.3 doesn't accept
- Add better error handling with specific guidance for compatibility issues
- Fix config file modification for local models with unsupported rope_scaling types
- Improve error messages to help diagnose transformers version compatibility issues

These fixes allow nanobot to work with transformers 4.39.3 and AirLLM.
2026-02-18 12:39:29 -05:00
f1faee54b6 Add automated setup script for installation and configuration 2026-02-18 10:28:47 -05:00
2f8205150f Add comprehensive setup guide for pulling and running repository 2026-02-17 14:24:53 -05:00
216c9f5039 Add vllm-env/ to .gitignore (virtual environment should not be committed) 2026-02-17 14:23:24 -05:00
f1e95626f8 Clean up providers: keep only Ollama, AirLLM, vLLM, and DeepSeek
- Remove Qwen/DashScope provider and all Qwen-specific code
- Remove gateway providers (OpenRouter, AiHubMix)
- Remove cloud providers (Anthropic, OpenAI, Gemini, Zhipu, Moonshot, MiniMax, Groq)
- Update default model from Platypus to llama3.2
- Remove Platypus references throughout codebase
- Add AirLLM provider support with local model path support
- Update setup scripts to only show Llama models
- Clean up provider registry and config schema
2026-02-17 14:20:47 -05:00
Re-bin
dd63337a83 Merge PR #516: fix Pydantic V2 deprecation warning 2026-02-11 14:55:17 +00:00
Re-bin
cdc37e2f5e Merge branch 'main' into pr-516 2026-02-11 14:54:37 +00:00
Re-bin
554ba81473 docs: update agent community tips 2026-02-11 14:39:20 +00:00
Sergio Sánchez Vallés
cbab72ab72 fix: pydantic deprecation configdict 2026-02-11 13:01:29 +01:00
14 changed files with 1995 additions and 227 deletions

.gitignore (1 line changed)

@@ -14,6 +14,7 @@ docs/
*.pywz
*.pyzz
.venv/
vllm-env/
__pycache__/
poetry.lock
.pytest_cache/


@@ -573,6 +573,17 @@ nanobot gateway
</details>
## 🌐 Agent Social Network
🐈 nanobot is capable of linking to the agent social network (agent community). **Just send one message and your nanobot joins automatically!**
| Platform | How to Join (send this message to your bot) |
|----------|-------------|
| [**Moltbook**](https://www.moltbook.com/) | `Read https://moltbook.com/skill.md and follow the instructions to join Moltbook` |
| [**ClawdChat**](https://clawdchat.ai/) | `Read https://clawdchat.ai/skill.md and follow the instructions to join ClawdChat` |
Simply send the command above to your nanobot (via CLI or any chat channel), and it will handle the rest.
## ⚙️ Configuration
Config file: `~/.nanobot/config.json`

SETUP.md (new file, 239 lines)

@@ -0,0 +1,239 @@
# Nanobot Setup Guide
This guide will help you set up nanobot on a fresh system, pulling from the repository and configuring it to use Ollama and AirLLM with Llama models.
## Prerequisites
- Python 3.10 or higher
- Git
- (Optional) CUDA-capable GPU for AirLLM (recommended for better performance)
## Step 1: Clone the Repository
```bash
git clone <repository-url>
cd nanobot
```
If you're using a specific branch (e.g., the cleanup branch):
```bash
git checkout feature/cleanup-providers-llama-only
```
## Step 2: Create Virtual Environment
```bash
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```
## Step 3: Install Dependencies
```bash
pip install --upgrade pip
pip install -e .
```
If you plan to use AirLLM, also install:
```bash
pip install airllm bitsandbytes
```
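Before moving on, you can confirm the optional AirLLM packages are importable. This is a minimal sketch (the `missing_packages` helper is illustrative, not part of nanobot); `find_spec` reports a missing package without raising:

```python
# Sanity-check optional dependencies without triggering an ImportError.
import importlib.util

def missing_packages(names):
    """Return the subset of `names` that are not importable."""
    return [n for n in names if importlib.util.find_spec(n) is None]

missing = missing_packages(["airllm", "bitsandbytes"])
if missing:
    print("Missing optional deps:", ", ".join(missing))
else:
    print("AirLLM dependencies are available")
```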
## Step 4: Choose Your Provider Setup
You have two main options:
### Option A: Use Ollama (Easiest, No Tokens Needed)
1. **Install Ollama** (if not already installed):
```bash
# Linux/Mac
curl -fsSL https://ollama.ai/install.sh | sh
# Or download from: https://ollama.ai
```
2. **Pull a Llama model**:
```bash
ollama pull llama3.2:latest
```
3. **Configure nanobot**:
```bash
mkdir -p ~/.nanobot
cat > ~/.nanobot/config.json << 'EOF'
{
"providers": {
"ollama": {
"apiKey": "dummy",
"apiBase": "http://localhost:11434/v1"
}
},
"agents": {
"defaults": {
"model": "llama3.2:latest"
}
}
}
EOF
chmod 600 ~/.nanobot/config.json
```
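To catch typos in the heredoc before running nanobot, a quick validation pass over the written file can help. This is an illustrative helper (not part of nanobot) that only checks the two keys the steps above set:

```python
import json

def validate_nanobot_config(path):
    """Lightly validate a nanobot config file; returns (ok, message)."""
    with open(path) as f:
        cfg = json.load(f)  # raises ValueError on malformed JSON
    if "providers" not in cfg:
        return False, "missing 'providers' section"
    model = cfg.get("agents", {}).get("defaults", {}).get("model")
    if not model:
        return False, "missing agents.defaults.model"
    return True, f"config OK, default model: {model}"

# Example: validate_nanobot_config(os.path.expanduser("~/.nanobot/config.json"))
```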
### Option B: Use AirLLM (Direct Local Inference, No HTTP Server)
1. **Get Hugging Face Token** (one-time, for downloading gated models):
- Go to: https://huggingface.co/settings/tokens
- Create a new token with "Read" permission
- Copy the token (starts with `hf_`)
2. **Accept Llama License**:
- Go to: https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct
- Click "Agree and access repository"
- Accept the license terms
3. **Download Llama Model** (one-time):
```bash
# Install huggingface_hub if needed
pip install huggingface_hub
# Download model to local directory
huggingface-cli download meta-llama/Llama-3.2-3B-Instruct \
--local-dir ~/.local/models/llama3.2-3b-instruct \
--token YOUR_HF_TOKEN_HERE
```
4. **Configure nanobot**:
```bash
mkdir -p ~/.nanobot
cat > ~/.nanobot/config.json << 'EOF'
{
"providers": {
"airllm": {
"apiKey": "/home/YOUR_USERNAME/.local/models/llama3.2-3b-instruct",
"apiBase": null,
"extraHeaders": {}
}
},
"agents": {
"defaults": {
"model": "/home/YOUR_USERNAME/.local/models/llama3.2-3b-instruct"
}
}
}
EOF
chmod 600 ~/.nanobot/config.json
```
**Important**: Replace `YOUR_USERNAME` with your actual username, or use `~/.local/models/llama3.2-3b-instruct` (the `~` will be expanded).
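If you prefer not to hardcode the username, the expansion can be done in Python when generating the config (a small sketch; the model path is the one used in the examples above):

```python
from pathlib import Path

# `~` is a shell convention; expanding it in Python before writing the
# config guarantees the stored model path is absolute for any user.
model_path = Path("~/.local/models/llama3.2-3b-instruct").expanduser()
print(model_path)  # e.g. /home/alice/.local/models/llama3.2-3b-instruct
```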
## Step 5: Test the Setup
```bash
nanobot agent -m "Hello, what is 2+5?"
```
You should see a response from the model. If you get errors, see the Troubleshooting section below.
## Step 6: (Optional) Use Setup Script
Instead of manual configuration, you can use the provided setup script:
```bash
python3 setup_llama_airllm.py
```
This script will:
- Guide you through model selection
- Help you configure the Hugging Face token
- Set up the config file automatically
## Configuration File Location
- **Path**: `~/.nanobot/config.json`
- **Permissions**: Should be `600` (read/write for owner only)
- **Backup**: Always back up before editing!
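The `600` permission check can be scripted; here is a minimal sketch (the helper name is illustrative) of how `chmod 600` translates to a programmatic check:

```python
import os
import stat

def has_safe_permissions(path):
    """Return True when the file mode is exactly 600 (owner read/write only)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode == 0o600
```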
## Available Providers
After setup, nanobot supports:
- **Ollama**: Local OpenAI-compatible server (no tokens needed)
- **AirLLM**: Direct local model inference (no HTTP server, no tokens after download)
- **vLLM**: Local OpenAI-compatible server (for advanced users)
- **DeepSeek**: API or local models (for future use)
## Recommended Models
### For Ollama:
- `llama3.2:latest` - Fast, minimal memory (recommended)
- `llama3.1:8b` - Good balance
- `llama3.1:70b` - Best quality (needs more GPU)
### For AirLLM:
- `meta-llama/Llama-3.2-3B-Instruct` - Fast, minimal memory (recommended)
- `meta-llama/Llama-3.1-8B-Instruct` - Good balance
- Local path: `~/.local/models/llama3.2-3b-instruct` (after download)
## Troubleshooting
### "Model not found" error (AirLLM)
- Make sure you've accepted the Llama license on Hugging Face
- Verify your HF token has read permissions
- Check that the model path in config is correct
- Ensure the model files are downloaded (check `~/.local/models/llama3.2-3b-instruct/`)
### "Connection refused" error (Ollama)
- Make sure Ollama is running: `ollama serve`
- Check that Ollama is listening on port 11434: `curl http://localhost:11434/api/tags`
- Verify the model is pulled: `ollama list`
### "Out of memory" error (AirLLM)
- Try a smaller model (Llama-3.2-3B-Instruct instead of 8B)
- Use compression: set `apiBase` to `"4bit"` or `"8bit"` in the airllm config
- Close other GPU-intensive applications
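As the tip above suggests, the `apiBase` field is repurposed to carry the compression setting. A sketch of the resulting config (the model path is illustrative):

```json
{
  "providers": {
    "airllm": {
      "apiKey": "/home/YOUR_USERNAME/.local/models/llama3.2-3b-instruct",
      "apiBase": "4bit"
    }
  }
}
```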
### "No API key configured" error
- For Ollama: Use `"dummy"` as apiKey (it's not actually used)
- For AirLLM: No API key needed for local paths, but you need the model files downloaded
### Import errors
- Make sure virtual environment is activated
- Reinstall dependencies: `pip install -e .`
- For AirLLM: `pip install airllm bitsandbytes`
## Using Local Model Paths (No Tokens After Download)
Once you've downloaded a model locally with AirLLM, you can use it forever without any tokens:
```json
{
"providers": {
"airllm": {
"apiKey": "/path/to/your/local/model"
}
},
"agents": {
"defaults": {
"model": "/path/to/your/local/model"
}
}
}
```
The model path should point to a directory containing:
- `config.json`
- `tokenizer.json` (or `tokenizer_config.json`)
- Model weights (`model.safetensors` or `pytorch_model.bin`)
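A directory check for exactly those files can be sketched as follows (the `model_dir_ready` helper is illustrative, not part of nanobot):

```python
from pathlib import Path

def model_dir_ready(model_dir):
    """Check a local model directory for the files listed above."""
    d = Path(model_dir).expanduser()
    has_config = (d / "config.json").is_file()
    has_tokenizer = any(
        (d / name).is_file()
        for name in ("tokenizer.json", "tokenizer_config.json")
    )
    has_weights = (
        (d / "pytorch_model.bin").is_file()
        or any(d.glob("*.safetensors"))  # sharded weights also count
    )
    return has_config and has_tokenizer and has_weights
```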
## Next Steps
- Read the main README.md for usage examples
- Check `nanobot --help` for available commands
- Explore the workspace features: `nanobot workspace create myproject`
## Getting Help
- Check the repository issues
- Review the code comments
- Test with a simple query first: `nanobot agent -m "Hello"`

airllm_ollama_wrapper.py (new file, 242 lines)

@@ -0,0 +1,242 @@
#!/usr/bin/env python3
"""
AirLLM Ollama-Compatible Wrapper
This wrapper provides an Ollama-like interface for AirLLM,
making it easy to replace Ollama in existing projects.
"""
import torch
from typing import List, Dict, Optional, Union
# Try to import airllm, handle BetterTransformer import error gracefully
try:
from airllm import AutoModel
AIRLLM_AVAILABLE = True
except ImportError as e:
if "optimum.bettertransformer" in str(e) or "BetterTransformer" in str(e):
# Try to work around BetterTransformer import issue
import sys
import importlib.util
# Create a dummy BetterTransformer module to allow airllm to import
class DummyBetterTransformer:
@staticmethod
def transform(model):
return model
# Inject dummy module before importing airllm
spec = importlib.util.spec_from_loader("optimum.bettertransformer", None)
dummy_module = importlib.util.module_from_spec(spec)
dummy_module.BetterTransformer = DummyBetterTransformer
sys.modules["optimum.bettertransformer"] = dummy_module
try:
from airllm import AutoModel
AIRLLM_AVAILABLE = True
except ImportError:
AIRLLM_AVAILABLE = False
AutoModel = None
else:
AIRLLM_AVAILABLE = False
AutoModel = None
class AirLLMOllamaWrapper:
"""
A wrapper that provides an Ollama-like API for AirLLM.
Usage:
# Instead of: ollama.generate(model="llama2", prompt="Hello")
# Use: airllm_wrapper.generate(model="llama2", prompt="Hello")
"""
def __init__(self, model_name: str, compression: Optional[str] = None, **kwargs):
"""
Initialize AirLLM model.
Args:
model_name: Hugging Face model name or path (e.g., "meta-llama/Llama-3.2-3B-Instruct")
compression: Optional compression ('4bit' or '8bit') for 3x speed improvement
**kwargs: Additional arguments for AutoModel.from_pretrained()
"""
if not AIRLLM_AVAILABLE or AutoModel is None:
raise ImportError(
"AirLLM is not available. Please install it with: pip install airllm bitsandbytes\n"
"If you see a BetterTransformer error, you may need to install: pip install optimum[bettertransformer]"
)
print(f"Loading AirLLM model: {model_name}")
self.model = AutoModel.from_pretrained(
model_name,
compression=compression,
**kwargs
)
self.model_name = model_name
print("Model loaded successfully!")
def generate(
self,
prompt: str,
model: Optional[str] = None, # Ignored, kept for API compatibility
max_tokens: int = 50,
temperature: float = 0.7,
top_p: float = 0.9,
stream: bool = False,
**kwargs
) -> Union[str, Dict]:
"""
Generate text from a prompt (Ollama-compatible interface).
Args:
prompt: Input text prompt
model: Ignored (kept for compatibility)
max_tokens: Maximum number of tokens to generate
temperature: Sampling temperature (0.0 to 1.0)
top_p: Nucleus sampling parameter
stream: If True, return streaming response (not yet implemented)
**kwargs: Additional generation parameters
Returns:
Generated text string or dict with response
"""
# Tokenize input
input_tokens = self.model.tokenizer(
[prompt],
return_tensors="pt",
return_attention_mask=False,
truncation=True,
max_length=512, # Adjust as needed
padding=False
)
# Move to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
input_ids = input_tokens['input_ids'].to(device)
# Prepare generation parameters
gen_kwargs = {
'max_new_tokens': max_tokens,
'use_cache': True,
'return_dict_in_generate': True,
'temperature': temperature,
'top_p': top_p,
**kwargs
}
# Generate
with torch.inference_mode():
generation_output = self.model.generate(input_ids, **gen_kwargs)
# Decode output
output = self.model.tokenizer.decode(generation_output.sequences[0])
# Remove the input prompt from output (if present)
if output.startswith(prompt):
output = output[len(prompt):].strip()
if stream:
# For streaming, return a generator (simplified version)
return {"response": output}
else:
return output
def chat(
self,
messages: List[Dict[str, str]],
model: Optional[str] = None,
max_tokens: int = 50,
temperature: float = 0.7,
**kwargs
) -> str:
"""
Chat interface (Ollama-compatible).
Args:
messages: List of message dicts with 'role' and 'content' keys
model: Ignored (kept for compatibility)
max_tokens: Maximum tokens to generate
temperature: Sampling temperature
**kwargs: Additional parameters
Returns:
Generated response string
"""
# Format messages into a prompt
prompt = self._format_messages(messages)
return self.generate(
prompt=prompt,
max_tokens=max_tokens,
temperature=temperature,
**kwargs
)
def _format_messages(self, messages: List[Dict[str, str]]) -> str:
"""Format chat messages into a single prompt."""
formatted = []
for msg in messages:
role = msg.get('role', 'user')
content = msg.get('content', '')
if role == 'system':
formatted.append(f"System: {content}")
elif role == 'user':
formatted.append(f"User: {content}")
elif role == 'assistant':
formatted.append(f"Assistant: {content}")
return "\n".join(formatted) + "\nAssistant:"
def embeddings(self, prompt: str) -> List[float]:
"""
Get embeddings for a prompt (simplified - returns token embeddings).
Note: This is a simplified version. For full embeddings,
you may need to access model internals.
"""
tokens = self.model.tokenizer(
[prompt],
return_tensors="pt",
truncation=True,
max_length=512,
padding=False
)
# This is a placeholder - actual embeddings would require model forward pass
return tokens['input_ids'].tolist()[0]
# Convenience function for easy migration
def create_ollama_client(model_name: str, compression: Optional[str] = None, **kwargs):
"""
Create an Ollama-compatible client using AirLLM.
Usage:
client = create_ollama_client("meta-llama/Llama-3.2-3B-Instruct")
response = client.generate("Hello, how are you?")
"""
return AirLLMOllamaWrapper(model_name, compression=compression, **kwargs)
# Example usage
if __name__ == "__main__":
# Example 1: Basic generation
print("Example 1: Basic Generation")
print("=" * 60)
# Initialize (this will take time on first run)
# client = create_ollama_client("garage-bAInd/Platypus2-70B-instruct")
# Generate
# response = client.generate("What is the capital of France?")
# print(f"Response: {response}")
print("\nExample 2: Chat Interface")
print("=" * 60)
# Chat example
# messages = [
# {"role": "user", "content": "Hello! How are you?"}
# ]
# response = client.chat(messages)
# print(f"Response: {response}")
print("\nUncomment the code above to test!")


@@ -44,7 +44,7 @@ def _validate_url(url: str) -> tuple[bool, str]:
class WebSearchTool(Tool):
"""Search the web using Brave Search API."""
"""Search the web using DuckDuckGo (free, no API key required)."""
name = "web_search"
description = "Search the web. Returns titles, URLs, and snippets."
@@ -58,13 +58,20 @@ class WebSearchTool(Tool):
}
def __init__(self, api_key: str | None = None, max_results: int = 5):
# Keep api_key parameter for backward compatibility, but use DuckDuckGo if not provided
self.api_key = api_key or os.environ.get("BRAVE_API_KEY", "")
self.max_results = max_results
self.use_brave = bool(self.api_key)
async def execute(self, query: str, count: int | None = None, **kwargs: Any) -> str:
if not self.api_key:
return "Error: BRAVE_API_KEY not configured"
# Try Brave API if key is available, otherwise use DuckDuckGo
if self.use_brave:
return await self._brave_search(query, count)
else:
return await self._duckduckgo_search(query, count)
async def _brave_search(self, query: str, count: int | None = None) -> str:
"""Search using Brave API (requires API key)."""
try:
n = min(max(count or self.max_results, 1), 10)
async with httpx.AsyncClient() as client:
@@ -89,6 +96,79 @@ class WebSearchTool(Tool):
except Exception as e:
return f"Error: {e}"
async def _duckduckgo_search(self, query: str, count: int | None = None) -> str:
"""Search using DuckDuckGo (free, no API key)."""
try:
n = min(max(count or self.max_results, 1), 10)
# Try using duckduckgo_search library if available
try:
from duckduckgo_search import DDGS
with DDGS() as ddgs:
results = []
for r in ddgs.text(query, max_results=n):
results.append({
"title": r.get("title", ""),
"url": r.get("href", ""),
"description": r.get("body", "")
})
if not results:
return f"No results found for: {query}"
lines = [f"Results for: {query}\n"]
for i, item in enumerate(results, 1):
lines.append(f"{i}. {item['title']}\n {item['url']}")
if item['description']:
lines.append(f" {item['description']}")
return "\n".join(lines)
except ImportError:
# Fallback: use DuckDuckGo instant answer API (simpler, but limited)
async with httpx.AsyncClient(
follow_redirects=True,
timeout=15.0
) as client:
# Use DuckDuckGo instant answer API (no key needed)
url = "https://api.duckduckgo.com/"
r = await client.get(
url,
params={"q": query, "format": "json", "no_html": "1", "skip_disambig": "1"},
headers={"User-Agent": USER_AGENT},
)
r.raise_for_status()
data = r.json()
results = []
# Get RelatedTopics (search results)
if "RelatedTopics" in data:
for topic in data["RelatedTopics"][:n]:
if "Text" in topic and "FirstURL" in topic:
results.append({
"title": topic.get("Text", "").split(" - ")[0] if " - " in topic.get("Text", "") else topic.get("Text", "")[:50],
"url": topic.get("FirstURL", ""),
"description": topic.get("Text", "")
})
# Also check AbstractText for direct answer
if "AbstractText" in data and data["AbstractText"]:
results.insert(0, {
"title": data.get("Heading", query),
"url": data.get("AbstractURL", ""),
"description": data.get("AbstractText", "")
})
if not results:
return f"No results found for: {query}. Try installing 'duckduckgo-search' package for better results: pip install duckduckgo-search"
lines = [f"Results for: {query}\n"]
for i, item in enumerate(results[:n], 1):
lines.append(f"{i}. {item['title']}\n {item['url']}")
if item['description']:
lines.append(f" {item['description']}")
return "\n".join(lines)
except Exception as e:
return f"Error searching: {e}. Try installing 'duckduckgo-search' package: pip install duckduckgo-search"
class WebFetchTool(Tool):
"""Fetch and extract content from a URL using Readability."""


@@ -265,10 +265,60 @@ This file stores important information that should persist across sessions.
def _make_provider(config):
"""Create LiteLLMProvider from config. Exits if no API key found."""
from nanobot.providers.litellm_provider import LiteLLMProvider
"""Create LLM provider from config. Supports LiteLLMProvider and AirLLMProvider."""
provider_name = config.get_provider_name()
p = config.get_provider()
model = config.agents.defaults.model
# Check if AirLLM provider is requested
if provider_name == "airllm":
try:
from nanobot.providers.airllm_provider import AirLLMProvider
# AirLLM doesn't need API key, but we can use model path from config
# Check if model is specified in the airllm provider config
airllm_config = getattr(config.providers, "airllm", None)
model_path = None
compression = None
# Try to get model from airllm config's api_key field (repurposed as model path)
# or from the default model
if airllm_config and airllm_config.api_key:
# Check if api_key looks like a model path (contains '/') or is an HF token
if '/' in airllm_config.api_key:
model_path = airllm_config.api_key
hf_token = None
else:
# Treat as HF token, use model from defaults
model_path = model
hf_token = airllm_config.api_key
else:
model_path = model
hf_token = None
# Check for compression setting in extra_headers or api_base
if airllm_config:
if airllm_config.api_base:
compression = airllm_config.api_base # Repurpose api_base as compression
elif airllm_config.extra_headers and "compression" in airllm_config.extra_headers:
compression = airllm_config.extra_headers["compression"]
# Check for HF token in extra_headers
if not hf_token and airllm_config.extra_headers and "hf_token" in airllm_config.extra_headers:
hf_token = airllm_config.extra_headers["hf_token"]
return AirLLMProvider(
api_key=airllm_config.api_key if airllm_config else None,
api_base=compression if compression else None,
default_model=model_path,
compression=compression,
hf_token=hf_token,
)
except ImportError as e:
console.print(f"[red]Error: AirLLM provider not available: {e}[/red]")
console.print("Please ensure airllm_ollama_wrapper.py is in the Python path.")
raise typer.Exit(1)
# Default to LiteLLMProvider
from nanobot.providers.litellm_provider import LiteLLMProvider
if not (p and p.api_key) and not model.startswith("bedrock/"):
console.print("[red]Error: No API key configured.[/red]")
console.print("Set one in ~/.nanobot/config.json under providers section")
@@ -278,7 +328,7 @@ def _make_provider(config):
api_base=config.get_api_base(),
default_model=model,
extra_headers=p.extra_headers if p else None,
provider_name=config.get_provider_name(),
provider_name=provider_name,
)
@@ -444,9 +494,16 @@ def agent(
if message:
# Single message mode
async def run_once():
try:
with _thinking_ctx():
response = await agent_loop.process_direct(message, session_id)
_print_agent_response(response, render_markdown=markdown)
# response is a string (content) from process_direct
_print_agent_response(response or "", render_markdown=markdown)
except Exception as e:
import traceback
console.print(f"[red]Error: {e}[/red]")
console.print(f"[dim]{traceback.format_exc()}[/dim]")
raise
asyncio.run(run_once())
else:


@@ -1,7 +1,7 @@
"""Configuration schema using Pydantic."""
from pathlib import Path
from pydantic import BaseModel, Field
from pydantic import BaseModel, Field, ConfigDict
from pydantic_settings import BaseSettings
@@ -177,18 +177,10 @@ class ProviderConfig(BaseModel):
class ProvidersConfig(BaseModel):
"""Configuration for LLM providers."""
anthropic: ProviderConfig = Field(default_factory=ProviderConfig)
openai: ProviderConfig = Field(default_factory=ProviderConfig)
openrouter: ProviderConfig = Field(default_factory=ProviderConfig)
deepseek: ProviderConfig = Field(default_factory=ProviderConfig)
groq: ProviderConfig = Field(default_factory=ProviderConfig)
zhipu: ProviderConfig = Field(default_factory=ProviderConfig)
dashscope: ProviderConfig = Field(default_factory=ProviderConfig) # Alibaba Cloud Tongyi Qianwen (Qwen)
vllm: ProviderConfig = Field(default_factory=ProviderConfig)
gemini: ProviderConfig = Field(default_factory=ProviderConfig)
moonshot: ProviderConfig = Field(default_factory=ProviderConfig)
minimax: ProviderConfig = Field(default_factory=ProviderConfig)
aihubmix: ProviderConfig = Field(default_factory=ProviderConfig) # AiHubMix API gateway
ollama: ProviderConfig = Field(default_factory=ProviderConfig)
airllm: ProviderConfig = Field(default_factory=ProviderConfig)
class GatewayConfig(BaseModel):
@@ -241,13 +233,36 @@ class Config(BaseSettings):
# Match by keyword (order follows PROVIDERS registry)
for spec in PROVIDERS:
p = getattr(self.providers, spec.name, None)
if p and any(kw in model_lower for kw in spec.keywords) and p.api_key:
if p and any(kw in model_lower for kw in spec.keywords):
# For local providers (Ollama, AirLLM), allow empty api_key or "dummy"
# For other providers, require api_key
if spec.is_local:
# Local providers can work with empty/dummy api_key
if p.api_key or p.api_base or spec.name == "airllm":
return p, spec.name
elif p.api_key:
return p, spec.name
# Check local providers by api_base detection (for explicit config)
for spec in PROVIDERS:
if spec.is_local:
p = getattr(self.providers, spec.name, None)
if p:
# Check if api_base matches the provider's detection pattern
if spec.detect_by_base_keyword and p.api_base and spec.detect_by_base_keyword in p.api_base:
return p, spec.name
# AirLLM is detected by provider name being "airllm"
if spec.name == "airllm" and p.api_key: # api_key can be model path
return p, spec.name
# Fallback: gateways first, then others (follows registry order)
for spec in PROVIDERS:
p = getattr(self.providers, spec.name, None)
if p and p.api_key:
if p:
# For local providers, allow empty/dummy api_key
if spec.is_local and (p.api_key or p.api_base):
return p, spec.name
elif p.api_key:
return p, spec.name
return None, None
@@ -281,6 +296,7 @@ class Config(BaseSettings):
return spec.default_api_base
return None
class Config:
env_prefix = "NANOBOT_"
env_nested_delimiter = "__"
model_config = ConfigDict(
env_prefix="NANOBOT_",
env_nested_delimiter="__"
)


@@ -3,4 +3,8 @@
from nanobot.providers.base import LLMProvider, LLMResponse
from nanobot.providers.litellm_provider import LiteLLMProvider
__all__ = ["LLMProvider", "LLMResponse", "LiteLLMProvider"]
try:
from nanobot.providers.airllm_provider import AirLLMProvider
__all__ = ["LLMProvider", "LLMResponse", "LiteLLMProvider", "AirLLMProvider"]
except ImportError:
__all__ = ["LLMProvider", "LLMResponse", "LiteLLMProvider"]


@@ -0,0 +1,188 @@
"""AirLLM provider implementation for direct local model inference."""
import json
import asyncio
import sys
from typing import Any
from pathlib import Path
from nanobot.providers.base import LLMProvider, LLMResponse, ToolCallRequest
# Import the wrapper - handle import errors gracefully
try:
from nanobot.providers.airllm_wrapper import AirLLMOllamaWrapper, create_ollama_client
AIRLLM_WRAPPER_AVAILABLE = True
_import_error = None
except ImportError as e:
AIRLLM_WRAPPER_AVAILABLE = False
AirLLMOllamaWrapper = None
create_ollama_client = None
_import_error = str(e)
class AirLLMProvider(LLMProvider):
"""
LLM provider using AirLLM for direct local model inference.
This provider loads models directly into memory and runs inference locally,
bypassing HTTP API calls. It's optimized for GPU-limited environments.
"""
def __init__(
self,
api_key: str | None = None, # Repurposed: can be HF token or model name
api_base: str | None = None, # Repurposed: compression setting ('4bit' or '8bit')
default_model: str = "meta-llama/Llama-3.2-3B-Instruct",
compression: str | None = None, # '4bit' or '8bit' for speed improvement
model_path: str | None = None, # Override default model
hf_token: str | None = None, # Hugging Face token for gated models
):
super().__init__(api_key, api_base)
self.default_model = model_path or default_model
# If api_base is set and looks like compression, use it
if api_base and api_base in ('4bit', '8bit'):
self.compression = api_base
else:
self.compression = compression
# If api_key is provided and doesn't look like a model path, treat as HF token
if api_key and '/' not in api_key and len(api_key) > 20:
self.hf_token = api_key
else:
self.hf_token = hf_token
# If api_key looks like a model path, use it as the model
if api_key and '/' in api_key:
self.default_model = api_key
self._client: AirLLMOllamaWrapper | None = None
self._model_loaded = False
def _ensure_client(self) -> AirLLMOllamaWrapper:
"""Lazy-load the AirLLM client."""
if not AIRLLM_WRAPPER_AVAILABLE:
error_msg = (
"AirLLM wrapper is not available. Please ensure airllm_ollama_wrapper.py "
"is in the Python path and AirLLM is installed."
)
if '_import_error' in globals():
error_msg += f"\nImport error: {_import_error}"
raise ImportError(error_msg)
if self._client is None or not self._model_loaded:
print(f"Initializing AirLLM with model: {self.default_model}")
if self.compression:
print(f"Using compression: {self.compression}")
if self.hf_token:
print("Using Hugging Face token for authentication")
# Prepare kwargs for model loading
kwargs = {}
if self.hf_token:
kwargs['hf_token'] = self.hf_token
self._client = create_ollama_client(
self.default_model,
compression=self.compression,
**kwargs
)
self._model_loaded = True
print("AirLLM model loaded and ready!")
return self._client
async def chat(
self,
messages: list[dict[str, Any]],
tools: list[dict[str, Any]] | None = None,
model: str | None = None,
max_tokens: int = 4096,
temperature: float = 0.7,
) -> LLMResponse:
"""
Send a chat completion request using AirLLM.
Args:
messages: List of message dicts with 'role' and 'content'.
tools: Optional list of tool definitions (Note: tool calling support may be limited).
model: Model identifier (ignored if different from initialized model).
max_tokens: Maximum tokens in response.
temperature: Sampling temperature.
Returns:
LLMResponse with content and/or tool calls.
"""
# If a different model is requested, we'd need to reload (expensive)
# For now, we'll use the initialized model
if model and model != self.default_model:
print(f"Warning: Model {model} requested but {self.default_model} is loaded. Using loaded model.")
client = self._ensure_client()
# Format tools into the prompt if provided (basic tool support)
# Note: Full tool calling requires model support and proper formatting
if tools:
# Add tool definitions to the system message or last user message
tools_text = "\n".join([
f"- {tool.get('function', {}).get('name', 'unknown')}: {tool.get('function', {}).get('description', '')}"
for tool in tools
])
# Append to messages (simplified - full implementation would format properly)
if messages and messages[-1].get('role') == 'user':
messages[-1]['content'] += f"\n\nAvailable tools:\n{tools_text}"
# Run the synchronous client in an executor to avoid blocking
loop = asyncio.get_event_loop()
try:
response_text = await loop.run_in_executor(
None,
lambda: client.chat(
messages=messages,
max_tokens=max_tokens,
temperature=temperature,
)
)
except Exception as e:
import traceback
error_msg = f"AirLLM generation failed: {e}\n{traceback.format_exc()}"
print(error_msg, file=sys.stderr)
raise RuntimeError(f"AirLLM provider error: {e}") from e
# Parse tool calls from response if present
# This is a simplified parser - you may need to adjust based on model output format
tool_calls = []
content = response_text
# Try to extract JSON tool calls from the response
# Some models return tool calls as JSON in the content
if "tool_calls" in response_text.lower() or "function" in response_text.lower():
try:
# Look for JSON blocks in the response
import re
json_pattern = r'\{[^{}]*"function"[^{}]*\}'
matches = re.findall(json_pattern, response_text, re.DOTALL)
for match in matches:
try:
tool_data = json.loads(match)
if "function" in tool_data:
func = tool_data["function"]
tool_calls.append(ToolCallRequest(
id=tool_data.get("id", f"call_{len(tool_calls)}"),
name=func.get("name", "unknown"),
arguments=func.get("arguments", {}),
))
# Remove the tool call from content
content = content.replace(match, "").strip()
except json.JSONDecodeError:
pass
except Exception:
pass # If parsing fails, just return the content as-is
return LLMResponse(
content=content,
tool_calls=tool_calls if tool_calls else [],
finish_reason="stop",
usage={}, # AirLLM doesn't provide usage stats in the wrapper
)
def get_default_model(self) -> str:
"""Get the default model."""
return self.default_model


@@ -0,0 +1,511 @@
#!/usr/bin/env python3
"""
AirLLM Ollama-Compatible Wrapper
This wrapper provides an Ollama-like interface for AirLLM,
making it easy to replace Ollama in existing projects.
"""
import torch
from typing import List, Dict, Optional, Union
# Try to import airllm, preferring the local checkout if available
import sys
import os
import importlib.util
# Inject dummy BetterTransformer BEFORE importing airllm (local code needs it)
class DummyBetterTransformer:
@staticmethod
def transform(model):
return model
if "optimum.bettertransformer" not in sys.modules:
spec = importlib.util.spec_from_loader("optimum.bettertransformer", None)
dummy_module = importlib.util.module_from_spec(spec)
dummy_module.BetterTransformer = DummyBetterTransformer
sys.modules["optimum.bettertransformer"] = dummy_module
# Fix RoPE scaling compatibility: patch transformers to handle "llama3" type
def _patch_rope_scaling():
"""Patch transformers LlamaConfig to handle unsupported 'llama3' RoPE scaling type."""
try:
from transformers import LlamaConfig
from transformers.models.llama.configuration_llama import LlamaConfig as OriginalLlamaConfig
# Store original __init__ if not already patched
if not hasattr(OriginalLlamaConfig, '_rope_scaling_patched'):
original_init = OriginalLlamaConfig.__init__
def patched_init(self, *args, **kwargs):
# Call original init
original_init(self, *args, **kwargs)
# Fix rope_scaling if it's "llama3" (unsupported in some transformers versions)
if hasattr(self, 'rope_scaling') and self.rope_scaling is not None:
# Check if it's a dict or object
if isinstance(self.rope_scaling, dict):
if self.rope_scaling.get('type') == 'llama3':
print("Warning: Converting unsupported RoPE scaling 'llama3' to 'linear'")
self.rope_scaling['type'] = 'linear'
if 'factor' not in self.rope_scaling:
self.rope_scaling['factor'] = 1.0
elif hasattr(self.rope_scaling, 'type'):
if getattr(self.rope_scaling, 'type', None) == 'llama3':
print("Warning: Converting unsupported RoPE scaling 'llama3' to 'linear'")
# Convert to dict format
factor = getattr(self.rope_scaling, 'factor', 1.0)
self.rope_scaling = {'type': 'linear', 'factor': factor}
OriginalLlamaConfig.__init__ = patched_init
OriginalLlamaConfig._rope_scaling_patched = True
except Exception as e:
# If patching fails, we'll handle it in the error handler
print(f"Warning: Could not patch RoPE scaling: {e}", file=sys.stderr)
def _patch_attention_position_embeddings():
"""Patch LlamaSdpaAttention to accept and ignore position_embeddings argument for AirLLM compatibility."""
try:
from transformers.models.llama import modeling_llama
import functools
# Check if LlamaSdpaAttention exists and hasn't been patched
if hasattr(modeling_llama, 'LlamaSdpaAttention'):
LlamaSdpaAttention = modeling_llama.LlamaSdpaAttention
if not hasattr(LlamaSdpaAttention, '_position_embeddings_patched'):
original_forward = LlamaSdpaAttention.forward
@functools.wraps(original_forward)
def patched_forward(self, *args, **kwargs):
# Remove position_embeddings if present (AirLLM compatibility)
kwargs.pop('position_embeddings', None)
# Call original forward
return original_forward(self, *args, **kwargs)
LlamaSdpaAttention.forward = patched_forward
LlamaSdpaAttention._position_embeddings_patched = True
except Exception as e:
# If patching fails, we'll handle it in the error handler
print(f"Warning: Could not patch attention position_embeddings: {e}", file=sys.stderr)
# Apply the patches before importing airllm
_patch_rope_scaling()
_patch_attention_position_embeddings()
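The config rewrite that both patches perform can be summarized as a small standalone function (a sketch mirroring the behavior above, not part of the wrapper): the unsupported "llama3" RoPE scaling type becomes "linear" with a default factor.

```python
# Standalone sketch of the rope_scaling normalization applied by the
# patches above: "llama3" is rewritten to "linear", defaulting factor to 1.0.
def normalize_rope_scaling(rope_scaling):
    if isinstance(rope_scaling, dict) and rope_scaling.get("type") == "llama3":
        fixed = dict(rope_scaling)
        fixed["type"] = "linear"
        fixed.setdefault("factor", 1.0)
        return fixed
    return rope_scaling

print(normalize_rope_scaling({"type": "llama3"}))
# {'type': 'linear', 'factor': 1.0}
print(normalize_rope_scaling({"type": "linear", "factor": 2.0}))
# {'type': 'linear', 'factor': 2.0}
```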
LOCAL_AIRLLM_PATH = "/home/ladmin/code/airllm/airllm/air_llm"
if os.path.exists(LOCAL_AIRLLM_PATH) and LOCAL_AIRLLM_PATH not in sys.path:
sys.path.insert(0, LOCAL_AIRLLM_PATH)
try:
from airllm import AutoModel
AIRLLM_AVAILABLE = True
except ImportError as e:
AIRLLM_AVAILABLE = False
AutoModel = None
print(f"Warning: Failed to import AirLLM: {e}", file=sys.stderr)
class AirLLMOllamaWrapper:
"""
A wrapper that provides an Ollama-like API for AirLLM.
Usage:
# Instead of: ollama.generate(model="llama2", prompt="Hello")
# Use: airllm_wrapper.generate(model="llama2", prompt="Hello")
"""
def __init__(self, model_name: str, compression: Optional[str] = None, **kwargs):
"""
Initialize AirLLM model.
Args:
model_name: Hugging Face model name or path (e.g., "meta-llama/Llama-3.2-3B-Instruct")
compression: Optional compression ('4bit' or '8bit') for 3x speed improvement
**kwargs: Additional arguments for AutoModel.from_pretrained()
"""
if not AIRLLM_AVAILABLE or AutoModel is None:
raise ImportError(
"AirLLM is not available. Please install it with: pip install airllm bitsandbytes\n"
"If you see a BetterTransformer error, you may need to install: pip install optimum[bettertransformer]"
)
print(f"Loading AirLLM model: {model_name}")
# Fix RoPE scaling compatibility issue: transformers 4.39.3 doesn't support "llama3" type
# Modify config file if it's a local path and has unsupported rope_scaling
model_path = model_name
if os.path.exists(model_name) or model_name.startswith('/') or model_name.startswith('~'):
if model_name.startswith('~'):
model_path = os.path.expanduser(model_name)
else:
model_path = os.path.abspath(model_name)
config_json_path = os.path.join(model_path, "config.json")
if os.path.exists(config_json_path):
try:
import json
with open(config_json_path, 'r') as f:
config_data = json.load(f)
# Check and fix rope_scaling
if 'rope_scaling' in config_data and config_data['rope_scaling'] is not None:
rope_scaling = config_data['rope_scaling']
if isinstance(rope_scaling, dict) and rope_scaling.get('type') == 'llama3':
print("Warning: Fixing unsupported RoPE scaling type 'llama3' -> 'linear'")
# Backup original config
backup_path = config_json_path + ".backup"
if not os.path.exists(backup_path):
import shutil
shutil.copy2(config_json_path, backup_path)
# Fix the rope_scaling type
config_data['rope_scaling']['type'] = 'linear'
if 'factor' not in config_data['rope_scaling']:
config_data['rope_scaling']['factor'] = 1.0
# Save fixed config
with open(config_json_path, 'w') as f:
json.dump(config_data, f, indent=2)
print(f"Fixed config saved to {config_json_path}")
except Exception as e:
print(f"Warning: Could not fix config file: {e}", file=sys.stderr)
# Determine max_seq_len before loading model
# AirLLM needs this at initialization time
max_seq_len = 2048 # Default for Llama models
# Check if this is a Llama model to determine appropriate max length
# We need to load config first to check model type
try:
from transformers import AutoConfig
config = AutoConfig.from_pretrained(model_name, **{k: v for k, v in kwargs.items() if k in ['token', 'trust_remote_code']})
model_type = getattr(config, 'model_type', '').lower()
is_llama = 'llama' in model_type or 'llama' in model_name.lower()
# Also fix rope_scaling in the loaded config object if needed
if is_llama and hasattr(config, 'rope_scaling') and config.rope_scaling is not None:
if isinstance(config.rope_scaling, dict) and config.rope_scaling.get('type') == 'llama3':
print("Warning: Converting RoPE scaling 'llama3' to 'linear' in config object")
config.rope_scaling['type'] = 'linear'
if 'factor' not in config.rope_scaling:
config.rope_scaling['factor'] = 1.0
elif hasattr(config.rope_scaling, 'type') and getattr(config.rope_scaling, 'type', None) == 'llama3':
# Convert object to dict
factor = getattr(config.rope_scaling, 'factor', 1.0)
config.rope_scaling = {'type': 'linear', 'factor': factor}
if is_llama:
config_max = getattr(config, 'max_position_embeddings', None)
if config_max and config_max > 0:
max_seq_len = min(config_max, 2048)
else:
max_seq_len = 2048
else:
config_max = getattr(config, 'max_position_embeddings', None)
if config_max and config_max > 0 and config_max <= 2048:
max_seq_len = config_max
else:
max_seq_len = 512
except Exception:
# Fallback to defaults if config loading fails
pass
# AutoModel.from_pretrained() accepts:
# - Hugging Face model IDs (e.g., "meta-llama/Llama-3.1-8B-Instruct")
# - Local paths (e.g., "/path/to/local/model")
# - Can use local_dir parameter for local models
try:
self.model = AutoModel.from_pretrained(
model_name,
compression=compression,
max_seq_len=max_seq_len, # Pass max_seq_len to AirLLM
**kwargs
)
except ValueError as e:
# Handle specific RoPE scaling errors
if "Unknown RoPE scaling type" in str(e) or "rope_scaling" in str(e).lower():
import traceback
error_msg = (
f"RoPE scaling compatibility error: {e}\n"
"The model config uses a RoPE scaling type not supported by your transformers version.\n"
"If this is a local model, the config file should have been fixed automatically.\n"
"If the error persists, try:\n"
"1. For local models: Check that config.json has rope_scaling.type='linear' instead of 'llama3'\n"
"2. Upgrade transformers: pip install --upgrade transformers\n"
"3. Or downgrade to a compatible version: pip install 'transformers==4.37.0'\n"
f"\nFull traceback:\n{traceback.format_exc()}"
)
raise RuntimeError(error_msg) from e
raise
except Exception as e:
import traceback
error_msg = (
f"Failed to load AirLLM model '{model_name}': {e}\n"
f"Error type: {type(e).__name__}\n"
"This is often a transformers version compatibility issue.\n"
"Try one of these solutions:\n"
"1. Install an older transformers version: pip install 'transformers==4.37.0'\n"
"2. Or try: pip install 'transformers==4.38.2'\n"
"3. If using transformers 4.39.3, try downgrading: pip install 'transformers==4.37.0'\n"
"4. Check AirLLM compatibility with your transformers version\n"
f"\nFull traceback:\n{traceback.format_exc()}"
)
raise RuntimeError(error_msg) from e
self.model_name = model_name
# Store max_length for tokenization
self.max_length = max_seq_len
# Check if this is a Llama model to determine appropriate max length
is_llama = False
if hasattr(self.model, 'config'):
model_type = getattr(self.model.config, 'model_type', '').lower()
is_llama = 'llama' in model_type or 'llama' in self.model_name.lower()
if is_llama:
# Llama models: typically support 2048-4096 tokens
# AirLLM works well with Llama, so we can use larger chunks
if hasattr(self.model, 'config'):
config_max = getattr(self.model.config, 'max_position_embeddings', None)
if config_max and config_max > 0:
# Use config value, but cap at 2048 for AirLLM safety
self.max_length = min(config_max, 2048)
else:
self.max_length = 2048 # Safe default for Llama
else:
# For other models (e.g., DeepSeek), use conservative default
if hasattr(self.model, 'config'):
config_max = getattr(self.model.config, 'max_position_embeddings', None)
if config_max and config_max > 0 and config_max <= 2048:
self.max_length = config_max
else:
self.max_length = 512 # Very conservative
print(f"Using sequence length limit: {self.max_length} (AirLLM chunk size)")
print("Model loaded successfully!")
def generate(
self,
prompt: str,
model: Optional[str] = None, # Ignored, kept for API compatibility
max_tokens: int = 50,
temperature: float = 0.7,
top_p: float = 0.9,
stream: bool = False,
**kwargs
) -> Union[str, Dict]:
"""
Generate text from a prompt (Ollama-compatible interface).
Args:
prompt: Input text prompt
model: Ignored (kept for compatibility)
max_tokens: Maximum number of tokens to generate
temperature: Sampling temperature (0.0 to 1.0)
top_p: Nucleus sampling parameter
stream: If True, return streaming response (not yet implemented)
**kwargs: Additional generation parameters
Returns:
Generated text string or dict with response
"""
# Tokenize input with attention mask
# AirLLM processes sequences in chunks, but each chunk must fit within the model's
# position embedding limits. We need to ensure we don't exceed the chunk size.
# Use the model's max_length to ensure compatibility with position embeddings
input_tokens = self.model.tokenizer(
prompt,
return_tensors="pt",
return_attention_mask=True,
truncation=True,
max_length=self.max_length, # Respect model's position embedding limit
padding=False
)
# Move to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
input_ids = input_tokens['input_ids'].to(device)
attention_mask = input_tokens.get('attention_mask', None)
if attention_mask is not None:
attention_mask = attention_mask.to(device)
# Ensure we don't exceed max_length (manual truncation as safety check)
seq_length = input_ids.shape[1]
if seq_length > self.max_length:
print(f"Warning: Sequence length ({seq_length}) exceeds limit ({self.max_length}), truncating...")
input_ids = input_ids[:, :self.max_length]
if attention_mask is not None:
attention_mask = attention_mask[:, :self.max_length]
seq_length = self.max_length
if seq_length >= self.max_length:
print(f"Note: Using sequence of {seq_length} tokens (at limit: {self.max_length})")
# Prepare generation parameters
# For Llama models, we can use more tokens
max_gen_tokens = min(max_tokens, 512)
gen_kwargs = {
'max_new_tokens': max_gen_tokens,
'use_cache': False, # Disable cache to avoid DynamicCache compatibility issues
'return_dict_in_generate': True,
'temperature': temperature,
'top_p': top_p,
**kwargs
}
# Add attention mask if available
if attention_mask is not None:
gen_kwargs['attention_mask'] = attention_mask
# Generate
try:
with torch.inference_mode():
generation_output = self.model.generate(input_ids, **gen_kwargs)
except (TypeError, RuntimeError) as e:
if "position_embeddings" in str(e) or "cannot unpack" in str(e):
error_msg = (
f"AirLLM compatibility error with transformers: {e}\n"
"This is a known issue with AirLLM and transformers version compatibility.\n"
"Try one of these solutions:\n"
"1. Install transformers 4.37.0: pip install 'transformers==4.37.0'\n"
"2. Or try transformers 4.38.2: pip install 'transformers==4.38.2'\n"
"3. If you're using 4.39.3, it may have compatibility issues - try downgrading\n"
"4. Or use Ollama instead: nanobot agent -m 'Hello' (with Ollama provider)"
)
raise RuntimeError(error_msg) from e
raise
# Decode output - get only the newly generated tokens
if hasattr(generation_output, 'sequences'):
# Extract only the new tokens (after input length)
input_length = input_ids.shape[1]
generated_ids = generation_output.sequences[0, input_length:]
output = self.model.tokenizer.decode(generated_ids, skip_special_tokens=True)
else:
# Fallback for older output formats
output = self.model.tokenizer.decode(generation_output.sequences[0], skip_special_tokens=True)
# Remove the input prompt from output if present
if output.startswith(prompt):
output = output[len(prompt):].strip()
if stream:
# Streaming is not implemented; return the full response in one dict
return {"response": output}
else:
return output
def chat(
self,
messages: List[Dict[str, str]],
model: Optional[str] = None,
max_tokens: int = 50,
temperature: float = 0.7,
**kwargs
) -> str:
"""
Chat interface (Ollama-compatible).
Args:
messages: List of message dicts with 'role' and 'content' keys
model: Ignored (kept for compatibility)
max_tokens: Maximum tokens to generate
temperature: Sampling temperature
**kwargs: Additional parameters
Returns:
Generated response string
"""
# Try to use the model's chat template if available (for Llama, etc.)
if hasattr(self.model.tokenizer, 'apply_chat_template') and self.model.tokenizer.chat_template:
try:
# Use the model's native chat template
prompt = self.model.tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
except Exception:
# Fallback to simple formatting if chat template fails
prompt = self._format_messages(messages)
else:
# Fallback to simple formatting
prompt = self._format_messages(messages)
return self.generate(
prompt=prompt,
max_tokens=max_tokens,
temperature=temperature,
**kwargs
)
def _format_messages(self, messages: List[Dict[str, str]]) -> str:
"""Format chat messages into a single prompt (fallback method)."""
formatted = []
for msg in messages:
role = msg.get('role', 'user')
content = msg.get('content', '')
if role == 'system':
formatted.append(f"System: {content}")
elif role == 'user':
formatted.append(f"User: {content}")
elif role == 'assistant':
formatted.append(f"Assistant: {content}")
return "\n".join(formatted) + "\nAssistant:"
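For reference, the fallback formatting produces a plain transcript with an "Assistant:" cue at the end. A standalone mirror of the method above:

```python
# Standalone sketch of the _format_messages fallback above: known roles map
# to "Role: content" lines; the prompt ends with an "Assistant:" cue.
def format_messages(messages):
    lines = []
    for msg in messages:
        role = msg.get("role", "user")
        content = msg.get("content", "")
        if role in ("system", "user", "assistant"):
            lines.append(f"{role.capitalize()}: {content}")
    return "\n".join(lines) + "\nAssistant:"

print(format_messages([
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "Hi"},
]))
# System: Be brief.
# User: Hi
# Assistant:
```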
def embeddings(self, prompt: str) -> List[float]:
"""
Get embeddings for a prompt (simplified - returns token embeddings).
Note: This is a simplified version. For full embeddings,
you may need to access model internals.
"""
tokens = self.model.tokenizer(
[prompt],
return_tensors="pt",
truncation=True,
max_length=512,
padding=False
)
# This is a placeholder - actual embeddings would require model forward pass
return tokens['input_ids'].tolist()[0]
# Convenience function for easy migration
def create_ollama_client(model_name: str, compression: Optional[str] = None, **kwargs):
"""
Create an Ollama-compatible client using AirLLM.
Usage:
client = create_ollama_client("meta-llama/Llama-3.2-3B-Instruct")
response = client.generate("Hello, how are you?")
"""
return AirLLMOllamaWrapper(model_name, compression=compression, **kwargs)
# Example usage
if __name__ == "__main__":
# Example 1: Basic generation
print("Example 1: Basic Generation")
print("=" * 60)
# Initialize (this will take time on first run)
# client = create_ollama_client("meta-llama/Llama-3.2-3B-Instruct")
# Generate
# response = client.generate("What is the capital of France?")
# print(f"Response: {response}")
print("\nExample 2: Chat Interface")
print("=" * 60)
# Chat example
# messages = [
# {"role": "user", "content": "Hello! How are you?"}
# ]
# response = client.chat(messages)
# print(f"Response: {response}")
print("\nUncomment the code above to test!")
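The chunk-size heuristic in `AirLLMOllamaWrapper.__init__` above can be condensed into one function (a sketch with the same defaults: Llama models capped at 2048, other models conservative at 512):

```python
# Sketch of the max_seq_len selection above: Llama models use the config's
# max_position_embeddings capped at 2048; other models fall back to 512
# unless their limit already fits within 2048.
def pick_max_seq_len(model_type: str, max_position_embeddings=None) -> int:
    if "llama" in model_type.lower():
        if max_position_embeddings and max_position_embeddings > 0:
            return min(max_position_embeddings, 2048)
        return 2048
    if max_position_embeddings and 0 < max_position_embeddings <= 2048:
        return max_position_embeddings
    return 512

print(pick_max_seq_len("llama", 8192))     # 2048
print(pick_max_seq_len("deepseek", 4096))  # 512
print(pick_max_seq_len("deepseek", 1024))  # 1024
```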

View File

@ -127,6 +127,7 @@ class LiteLLMProvider(LLMProvider):
"messages": messages,
"max_tokens": max_tokens,
"temperature": temperature,
"stream": False, # Explicitly disable streaming to avoid hangs with some providers
}
# Apply model-specific overrides (e.g. kimi-k2.5 temperature)
@ -148,6 +149,11 @@ class LiteLLMProvider(LLMProvider):
kwargs["tools"] = tools
kwargs["tool_choice"] = "auto"
# Add timeout to prevent hangs (especially with local servers)
# Ollama can be slow with complex prompts, so use a longer timeout
# Increased to 400s for larger models like mistral-nemo
kwargs["timeout"] = 400.0
try:
response = await acompletion(**kwargs)
return self._parse_response(response)

View File

@ -6,7 +6,7 @@ Adding a new provider:
2. Add a field to ProvidersConfig in config/schema.py.
Done. Env vars, prefixing, config matching, status display all derive from here.
Order matters: it controls match priority and fallback. Gateways first.
Order matters: it controls match priority and fallback.
Every entry writes out all fields so you can copy-paste as a template.
"""
@ -62,86 +62,10 @@ class ProviderSpec:
PROVIDERS: tuple[ProviderSpec, ...] = (
# === Gateways (detected by api_key / api_base, not model name) =========
# Gateways can route any model, so they win in fallback.
# OpenRouter: global gateway, keys start with "sk-or-"
ProviderSpec(
name="openrouter",
keywords=("openrouter",),
env_key="OPENROUTER_API_KEY",
display_name="OpenRouter",
litellm_prefix="openrouter", # claude-3 → openrouter/claude-3
skip_prefixes=(),
env_extras=(),
is_gateway=True,
is_local=False,
detect_by_key_prefix="sk-or-",
detect_by_base_keyword="openrouter",
default_api_base="https://openrouter.ai/api/v1",
strip_model_prefix=False,
model_overrides=(),
),
# AiHubMix: global gateway, OpenAI-compatible interface.
# strip_model_prefix=True: it doesn't understand "anthropic/claude-3",
# so we strip to bare "claude-3" then re-prefix as "openai/claude-3".
ProviderSpec(
name="aihubmix",
keywords=("aihubmix",),
env_key="OPENAI_API_KEY", # OpenAI-compatible
display_name="AiHubMix",
litellm_prefix="openai", # → openai/{model}
skip_prefixes=(),
env_extras=(),
is_gateway=True,
is_local=False,
detect_by_key_prefix="",
detect_by_base_keyword="aihubmix",
default_api_base="https://aihubmix.com/v1",
strip_model_prefix=True, # anthropic/claude-3 → claude-3 → openai/claude-3
model_overrides=(),
),
# === Standard providers (matched by model-name keywords) ===============
# Anthropic: LiteLLM recognizes "claude-*" natively, no prefix needed.
ProviderSpec(
name="anthropic",
keywords=("anthropic", "claude"),
env_key="ANTHROPIC_API_KEY",
display_name="Anthropic",
litellm_prefix="",
skip_prefixes=(),
env_extras=(),
is_gateway=False,
is_local=False,
detect_by_key_prefix="",
detect_by_base_keyword="",
default_api_base="",
strip_model_prefix=False,
model_overrides=(),
),
# OpenAI: LiteLLM recognizes "gpt-*" natively, no prefix needed.
ProviderSpec(
name="openai",
keywords=("openai", "gpt"),
env_key="OPENAI_API_KEY",
display_name="OpenAI",
litellm_prefix="",
skip_prefixes=(),
env_extras=(),
is_gateway=False,
is_local=False,
detect_by_key_prefix="",
detect_by_base_keyword="",
default_api_base="",
strip_model_prefix=False,
model_overrides=(),
),
# DeepSeek: needs "deepseek/" prefix for LiteLLM routing.
# Can be used with local models or API.
ProviderSpec(
name="deepseek",
keywords=("deepseek",),
@ -159,107 +83,6 @@ PROVIDERS: tuple[ProviderSpec, ...] = (
model_overrides=(),
),
# Gemini: needs "gemini/" prefix for LiteLLM.
ProviderSpec(
name="gemini",
keywords=("gemini",),
env_key="GEMINI_API_KEY",
display_name="Gemini",
litellm_prefix="gemini", # gemini-pro → gemini/gemini-pro
skip_prefixes=("gemini/",), # avoid double-prefix
env_extras=(),
is_gateway=False,
is_local=False,
detect_by_key_prefix="",
detect_by_base_keyword="",
default_api_base="",
strip_model_prefix=False,
model_overrides=(),
),
# Zhipu: LiteLLM uses "zai/" prefix.
# Also mirrors key to ZHIPUAI_API_KEY (some LiteLLM paths check that).
# skip_prefixes: don't add "zai/" when already routed via gateway.
ProviderSpec(
name="zhipu",
keywords=("zhipu", "glm", "zai"),
env_key="ZAI_API_KEY",
display_name="Zhipu AI",
litellm_prefix="zai", # glm-4 → zai/glm-4
skip_prefixes=("zhipu/", "zai/", "openrouter/", "hosted_vllm/"),
env_extras=(
("ZHIPUAI_API_KEY", "{api_key}"),
),
is_gateway=False,
is_local=False,
detect_by_key_prefix="",
detect_by_base_keyword="",
default_api_base="",
strip_model_prefix=False,
model_overrides=(),
),
# DashScope: Qwen models, needs "dashscope/" prefix.
ProviderSpec(
name="dashscope",
keywords=("qwen", "dashscope"),
env_key="DASHSCOPE_API_KEY",
display_name="DashScope",
litellm_prefix="dashscope", # qwen-max → dashscope/qwen-max
skip_prefixes=("dashscope/", "openrouter/"),
env_extras=(),
is_gateway=False,
is_local=False,
detect_by_key_prefix="",
detect_by_base_keyword="",
default_api_base="",
strip_model_prefix=False,
model_overrides=(),
),
# Moonshot: Kimi models, needs "moonshot/" prefix.
# LiteLLM requires MOONSHOT_API_BASE env var to find the endpoint.
# Kimi K2.5 API enforces temperature >= 1.0.
ProviderSpec(
name="moonshot",
keywords=("moonshot", "kimi"),
env_key="MOONSHOT_API_KEY",
display_name="Moonshot",
litellm_prefix="moonshot", # kimi-k2.5 → moonshot/kimi-k2.5
skip_prefixes=("moonshot/", "openrouter/"),
env_extras=(
("MOONSHOT_API_BASE", "{api_base}"),
),
is_gateway=False,
is_local=False,
detect_by_key_prefix="",
detect_by_base_keyword="",
default_api_base="https://api.moonshot.ai/v1", # intl; use api.moonshot.cn for China
strip_model_prefix=False,
model_overrides=(
("kimi-k2.5", {"temperature": 1.0}),
),
),
# MiniMax: needs "minimax/" prefix for LiteLLM routing.
# Uses OpenAI-compatible API at api.minimax.io/v1.
ProviderSpec(
name="minimax",
keywords=("minimax",),
env_key="MINIMAX_API_KEY",
display_name="MiniMax",
litellm_prefix="minimax", # MiniMax-M2.1 → minimax/MiniMax-M2.1
skip_prefixes=("minimax/", "openrouter/"),
env_extras=(),
is_gateway=False,
is_local=False,
detect_by_key_prefix="",
detect_by_base_keyword="",
default_api_base="https://api.minimax.io/v1",
strip_model_prefix=False,
model_overrides=(),
),
# === Local deployment (matched by config key, NOT by api_base) =========
# vLLM / any OpenAI-compatible local server.
@ -281,23 +104,44 @@ PROVIDERS: tuple[ProviderSpec, ...] = (
model_overrides=(),
),
# === Auxiliary (not a primary LLM provider) ============================
# Groq: mainly used for Whisper voice transcription, also usable for LLM.
# Needs "groq/" prefix for LiteLLM routing. Placed last — it rarely wins fallback.
# Ollama: local OpenAI-compatible server.
# Use OpenAI-compatible endpoint, not native Ollama API.
# Detected when config key is "ollama" or api_base contains "11434" or "ollama".
ProviderSpec(
name="groq",
keywords=("groq",),
env_key="GROQ_API_KEY",
display_name="Groq",
litellm_prefix="groq", # llama3-8b-8192 → groq/llama3-8b-8192
skip_prefixes=("groq/",), # avoid double-prefix
name="ollama",
keywords=("ollama", "llama"), # Match both "ollama" and "llama" model names
env_key="OPENAI_API_KEY", # Use OpenAI-compatible API
display_name="Ollama",
litellm_prefix="", # No prefix - use as OpenAI-compatible
skip_prefixes=(),
env_extras=(
("OPENAI_API_BASE", "{api_base}"), # Set OpenAI API base to Ollama endpoint
),
is_gateway=False,
is_local=True,
detect_by_key_prefix="",
detect_by_base_keyword="11434", # Detect by default Ollama port
default_api_base="http://localhost:11434/v1",
strip_model_prefix=False,
model_overrides=(),
),
# AirLLM: direct local model inference (no HTTP server).
# Loads models directly into memory for GPU-optimized inference.
# Detected when config key is "airllm".
ProviderSpec(
name="airllm",
keywords=("airllm",),
env_key="", # No API key needed (local)
display_name="AirLLM",
litellm_prefix="", # Not used with LiteLLM
skip_prefixes=(),
env_extras=(),
is_gateway=False,
is_local=False,
is_local=True,
detect_by_key_prefix="",
detect_by_base_keyword="",
default_api_base="",
default_api_base="", # Not used (direct Python calls)
strip_model_prefix=False,
model_overrides=(),
),
@ -325,12 +169,11 @@ def find_gateway(
api_key: str | None = None,
api_base: str | None = None,
) -> ProviderSpec | None:
"""Detect gateway/local provider.
"""Detect local provider.
Priority:
1. provider_name: if it maps to a gateway/local spec, use it directly.
2. api_key prefix: e.g. "sk-or-" → OpenRouter.
3. api_base keyword: e.g. "aihubmix" in URL → AiHubMix.
1. provider_name: if it maps to a local spec, use it directly.
2. api_base keyword: e.g. "11434" in URL → Ollama.
A standard provider with a custom api_base (e.g. DeepSeek behind a proxy)
will NOT be mistaken for vLLM; the old fallback is gone.
@ -341,10 +184,8 @@ def find_gateway(
if spec and (spec.is_gateway or spec.is_local):
return spec
# 2. Auto-detect by api_key prefix / api_base keyword
# 2. Auto-detect by api_base keyword
for spec in PROVIDERS:
if spec.detect_by_key_prefix and api_key and api_key.startswith(spec.detect_by_key_prefix):
return spec
if spec.detect_by_base_keyword and api_base and spec.detect_by_base_keyword in api_base:
return spec
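The two-step detection in `find_gateway` above reduces to a simple lookup. A sketch (the `SPECS` mapping here is an invented stand-in for the real `ProviderSpec` tuple):

```python
# Sketch of find_gateway's detection order above: explicit provider name
# wins, then api_base keyword (e.g. "11434" → Ollama). SPECS is a
# simplified stand-in mapping name -> detect_by_base_keyword.
SPECS = {"ollama": "11434", "airllm": ""}

def detect(provider_name=None, api_base=None):
    # 1. Explicit provider name takes priority
    if provider_name in SPECS:
        return provider_name
    # 2. Fall back to api_base keyword matching (empty keywords never match)
    for name, keyword in SPECS.items():
        if keyword and api_base and keyword in api_base:
            return name
    return None

print(detect(provider_name="airllm"))                  # airllm
print(detect(api_base="http://localhost:11434/v1"))    # ollama
print(detect(api_base="https://api.deepseek.com/v1"))  # None
```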

397
setup.sh Normal file
View File

@ -0,0 +1,397 @@
#!/bin/bash
# Nanobot Setup Script
# Automates installation and configuration of nanobot with Ollama/AirLLM
set -e # Exit on error
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
# Configuration
VENV_DIR="venv"
CONFIG_DIR="$HOME/.nanobot"
CONFIG_FILE="$CONFIG_DIR/config.json"
MODEL_DIR="$HOME/.local/models/llama3.2-3b-instruct"
MODEL_NAME="meta-llama/Llama-3.2-3B-Instruct"
# Functions
print_header() {
echo -e "\n${BLUE}========================================${NC}"
echo -e "${BLUE}$1${NC}"
echo -e "${BLUE}========================================${NC}\n"
}
print_success() {
echo -e "${GREEN}$1${NC}"
}
print_warning() {
echo -e "${YELLOW}$1${NC}"
}
print_error() {
echo -e "${RED}$1${NC}"
}
print_info() {
echo -e "${BLUE} $1${NC}"
}
# Check if command exists
command_exists() {
command -v "$1" >/dev/null 2>&1
}
# Check prerequisites
check_prerequisites() {
print_header "Checking Prerequisites"
local missing=0
if ! command_exists python3; then
print_error "Python 3 is not installed"
missing=1
else
PYTHON_VERSION=$(python3 --version 2>&1 | awk '{print $2}')
print_success "Python $PYTHON_VERSION found"
# Check Python version (need 3.10+)
PYTHON_MAJOR=$(echo $PYTHON_VERSION | cut -d. -f1)
PYTHON_MINOR=$(echo $PYTHON_VERSION | cut -d. -f2)
if [ "$PYTHON_MAJOR" -lt 3 ] || ([ "$PYTHON_MAJOR" -eq 3 ] && [ "$PYTHON_MINOR" -lt 10 ]); then
print_error "Python 3.10+ required, found $PYTHON_VERSION"
missing=1
fi
fi
if ! command_exists git; then
print_warning "Git is not installed (optional, but recommended)"
else
print_success "Git found"
fi
if ! command_exists pip3 && ! python3 -m pip --version >/dev/null 2>&1; then
print_error "pip is not installed"
missing=1
else
print_success "pip found"
fi
if [ $missing -eq 1 ]; then
print_error "Missing required prerequisites. Please install them first."
exit 1
fi
print_success "All prerequisites met"
}
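The bash version gate above (major/minor extraction plus a 3.10 minimum) maps to a few lines of Python, shown here as a sketch for reference:

```python
# Python sketch of the bash version gate above: split "major.minor.patch",
# compare (major, minor) against the (3, 10) minimum as a tuple.
def meets_minimum(version: str, minimum=(3, 10)) -> bool:
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) >= minimum

print(meets_minimum("3.11.4"))  # True
print(meets_minimum("3.9.18"))  # False
print(meets_minimum("2.7.18"))  # False
```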
# Create virtual environment
setup_venv() {
print_header "Setting Up Virtual Environment"
if [ -d "$VENV_DIR" ]; then
print_warning "Virtual environment already exists at $VENV_DIR"
read -p "Recreate it? (y/n): " -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]; then
rm -rf "$VENV_DIR"
print_info "Removed existing virtual environment"
else
print_info "Using existing virtual environment"
return
fi
fi
print_info "Creating virtual environment..."
python3 -m venv "$VENV_DIR"
print_success "Virtual environment created"
print_info "Activating virtual environment..."
source "$VENV_DIR/bin/activate"
print_success "Virtual environment activated"
print_info "Upgrading pip..."
pip install --upgrade pip --quiet
print_success "pip upgraded"
}
# Install dependencies
install_dependencies() {
print_header "Installing Dependencies"
if [ -z "$VIRTUAL_ENV" ]; then
source "$VENV_DIR/bin/activate"
fi
print_info "Installing nanobot and dependencies..."
pip install -e . --quiet
print_success "Nanobot installed"
# Check if AirLLM should be installed
read -p "Do you want to use AirLLM? (y/n): " -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]; then
print_info "Installing AirLLM..."
pip install airllm bitsandbytes --quiet || {
print_warning "AirLLM installation had issues, but continuing..."
print_info "You can install it later with: pip install airllm bitsandbytes"
}
print_success "AirLLM installed (or attempted)"
USE_AIRLLM=true
else
USE_AIRLLM=false
fi
}
# Check for Ollama
check_ollama() {
if command_exists ollama; then
print_success "Ollama is installed"
if ollama list >/dev/null 2>&1; then
print_success "Ollama is running"
return 0
else
print_warning "Ollama is installed but not running"
return 1
fi
else
print_warning "Ollama is not installed"
return 1
fi
}
# Setup Ollama configuration
setup_ollama() {
print_header "Setting Up Ollama"
if ! check_ollama; then
print_info "Ollama is not installed or not running"
read -p "Do you want to install Ollama? (y/n): " -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]; then
print_info "Installing Ollama..."
curl -fsSL https://ollama.ai/install.sh | sh || {
print_error "Failed to install Ollama automatically"
print_info "Please install manually from: https://ollama.ai"
return 1
}
print_success "Ollama installed"
else
return 1
fi
fi
# Check if llama3.2 is available
if ollama list | grep -q "llama3.2"; then
print_success "llama3.2 model found"
else
print_info "Downloading llama3.2 model (this may take a while)..."
ollama pull llama3.2:latest || {
print_error "Failed to pull llama3.2 model"
return 1
}
print_success "llama3.2 model downloaded"
fi
# Create config
mkdir -p "$CONFIG_DIR"
cat > "$CONFIG_FILE" << EOF
{
"providers": {
"ollama": {
"apiKey": "dummy",
"apiBase": "http://localhost:11434/v1"
}
},
"agents": {
"defaults": {
"model": "llama3.2:latest"
}
}
}
EOF
chmod 600 "$CONFIG_FILE"
print_success "Ollama configuration created at $CONFIG_FILE"
return 0
}
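The heredoc above must produce valid JSON for nanobot to read. A quick Python sketch of the same structure (field names mirror the heredoc; "dummy" is the placeholder key, since Ollama ignores API keys):

```python
import json

# The Ollama config written by setup_ollama above, rebuilt as a dict so it
# can be round-tripped through JSON as a sanity check.
config = {
    "providers": {
        "ollama": {
            "apiKey": "dummy",
            "apiBase": "http://localhost:11434/v1",
        }
    },
    "agents": {"defaults": {"model": "llama3.2:latest"}},
}

# Round-trip to confirm the structure serializes as valid JSON.
parsed = json.loads(json.dumps(config))
print(parsed["providers"]["ollama"]["apiBase"])  # http://localhost:11434/v1
print(parsed["agents"]["defaults"]["model"])     # llama3.2:latest
```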
# Setup AirLLM configuration
setup_airllm() {
    print_header "Setting Up AirLLM"

    # Check if model already exists
    if [ -d "$MODEL_DIR" ] && [ -f "$MODEL_DIR/config.json" ]; then
        print_success "Model already exists at $MODEL_DIR"
    else
        print_info "Model needs to be downloaded"
        print_info "You'll need a Hugging Face token to download gated models"
        echo
        print_info "Steps:"
        echo "  1. Get token: https://huggingface.co/settings/tokens"
        echo "  2. Accept license: https://huggingface.co/$MODEL_NAME"
        echo
        read -p "Do you have a Hugging Face token? (y/n): " -n 1 -r
        echo
        if [[ ! $REPLY =~ ^[Yy]$ ]]; then
            print_warning "Skipping model download. You can download it later."
            print_info "To download later, run:"
            echo "  huggingface-cli download $MODEL_NAME --local-dir $MODEL_DIR --token YOUR_TOKEN"
            return 1
        fi
        read -p "Enter your Hugging Face token: " -s HF_TOKEN
        echo
        if [ -z "$HF_TOKEN" ]; then
            print_error "Token is required"
            return 1
        fi

        # Install huggingface_hub if needed
        if [ -z "$VIRTUAL_ENV" ]; then
            source "$VENV_DIR/bin/activate"
        fi
        pip install huggingface_hub --quiet

        print_info "Downloading model (this may take a while, ~2GB)..."
        mkdir -p "$MODEL_DIR"
        huggingface-cli download "$MODEL_NAME" \
            --local-dir "$MODEL_DIR" \
            --token "$HF_TOKEN" \
            --local-dir-use-symlinks False || {
            print_error "Failed to download model"
            print_info "Make sure you've accepted the license at: https://huggingface.co/$MODEL_NAME"
            return 1
        }
        print_success "Model downloaded to $MODEL_DIR"
    fi

    # Create config
    mkdir -p "$CONFIG_DIR"
    cat > "$CONFIG_FILE" << EOF
{
  "providers": {
    "airllm": {
      "apiKey": "$MODEL_DIR",
      "apiBase": null,
      "extraHeaders": {}
    }
  },
  "agents": {
    "defaults": {
      "model": "$MODEL_DIR"
    }
  }
}
EOF
    chmod 600 "$CONFIG_FILE"
    print_success "AirLLM configuration created at $CONFIG_FILE"
    return 0
}
# Test installation
test_installation() {
    print_header "Testing Installation"

    if [ -z "$VIRTUAL_ENV" ]; then
        source "$VENV_DIR/bin/activate"
    fi

    print_info "Testing nanobot installation..."
    if nanobot --help >/dev/null 2>&1; then
        print_success "Nanobot is installed and working"
    else
        print_error "Nanobot test failed"
        return 1
    fi

    print_info "Testing with a simple query..."
    if nanobot agent -m "Hello, what is 2+5?" >/dev/null 2>&1; then
        print_success "Test query successful!"
    else
        print_warning "Test query had issues (this might be normal if the model is still loading)"
        print_info "Try running manually: nanobot agent -m 'Hello'"
    fi
}
# Main setup flow
main() {
    print_header "Nanobot Setup Script"
    print_info "This script will set up nanobot with Ollama or AirLLM"
    echo

    # Check prerequisites
    check_prerequisites

    # Setup virtual environment
    setup_venv

    # Install dependencies
    install_dependencies

    # Choose provider
    echo
    print_header "Choose Provider"
    echo "1. Ollama (easiest, no tokens needed)"
    echo "2. AirLLM (direct local inference, no HTTP server)"
    echo "3. Both (configure both, use either)"
    echo
    read -p "Choose option (1-3): " -n 1 -r
    echo

    PROVIDER_SETUP=false
    case $REPLY in
        1)
            if setup_ollama; then
                PROVIDER_SETUP=true
            fi
            ;;
        2)
            if setup_airllm; then
                PROVIDER_SETUP=true
            fi
            ;;
        3)
            # Run both setups; avoid || short-circuiting so AirLLM is still
            # configured even when the Ollama setup succeeds. Each setup
            # rewrites $CONFIG_FILE, so the last successful one wins.
            OLLAMA_OK=false
            AIRLLM_OK=false
            setup_ollama && OLLAMA_OK=true
            setup_airllm && AIRLLM_OK=true
            if [ "$OLLAMA_OK" = true ] || [ "$AIRLLM_OK" = true ]; then
                PROVIDER_SETUP=true
            fi
            ;;
        *)
            print_warning "Invalid choice, skipping provider setup"
            ;;
    esac

    if [ "$PROVIDER_SETUP" = false ]; then
        print_warning "Provider setup incomplete. You can configure manually later."
        print_info "Config file location: $CONFIG_FILE"
    fi

    # Test installation
    test_installation

    # Final instructions
    echo
    print_header "Setup Complete!"
    echo
    print_success "Nanobot is ready to use!"
    echo
    print_info "To activate the virtual environment:"
    echo "  source $VENV_DIR/bin/activate"
    echo
    print_info "To use nanobot:"
    echo "  nanobot agent -m 'Your message here'"
    echo
    print_info "Configuration file: $CONFIG_FILE"
    echo
    print_info "For more information, see SETUP.md"
    echo
}

# Run main function
main

setup_llama_airllm.py (new file, 175 lines)
#!/usr/bin/env python3
"""
Setup script to configure nanobot to use Llama models with AirLLM.

This script will:
1. Check/create the config file
2. Set up Llama model configuration
3. Guide you through getting a Hugging Face token if needed
"""
import json
import os
from pathlib import Path

CONFIG_PATH = Path.home() / ".nanobot" / "config.json"


def get_hf_token_instructions():
    """Print instructions for getting a Hugging Face token."""
    print("\n" + "=" * 70)
    print("GETTING A HUGGING FACE TOKEN")
    print("=" * 70)
    print("\nTo use Llama models (which are gated), you need a Hugging Face token:")
    print("\n1. Go to: https://huggingface.co/settings/tokens")
    print("2. Click 'New token'")
    print("3. Give it a name (e.g., 'nanobot')")
    print("4. Select 'Read' permission")
    print("5. Click 'Generate token'")
    print("6. Copy the token (starts with 'hf_...')")
    print("\nThen accept the Llama model license:")
    print("1. Go to: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct")
    print("2. Click 'Agree and access repository'")
    print("3. Accept the license terms")
    print("\n" + "=" * 70 + "\n")


def load_existing_config():
    """Load the existing config or return an empty default."""
    if CONFIG_PATH.exists():
        try:
            with open(CONFIG_PATH) as f:
                return json.load(f)
        except Exception as e:
            print(f"Warning: Could not read existing config: {e}")
            return {}
    return {}
def create_llama_config():
    """Create or update config for Llama with AirLLM."""
    config = load_existing_config()

    # Ensure providers section exists
    if "providers" not in config:
        config["providers"] = {}

    # Ensure agents section exists
    if "agents" not in config:
        config["agents"] = {}
    if "defaults" not in config["agents"]:
        config["agents"]["defaults"] = {}

    # Choose Llama model
    print("\n" + "=" * 70)
    print("CHOOSE LLAMA MODEL")
    print("=" * 70)
    print("\nAvailable models:")
    print("  1. Llama-3.2-3B-Instruct (Recommended - fast, minimal memory)")
    print("  2. Llama-3.1-8B-Instruct (Good balance of performance and speed)")
    print("  3. Custom (enter model path)")
    choice = input("\nChoose model (1-3, default: 1): ").strip() or "1"

    model_map = {
        "1": "meta-llama/Llama-3.2-3B-Instruct",
        "2": "meta-llama/Llama-3.1-8B-Instruct",
    }
    if choice == "3":
        model_path = input("Enter model path (e.g., meta-llama/Llama-3.2-3B-Instruct): ").strip()
        if not model_path:
            model_path = "meta-llama/Llama-3.2-3B-Instruct"
            print(f"Using default: {model_path}")
    else:
        model_path = model_map.get(choice, "meta-llama/Llama-3.2-3B-Instruct")

    # Set up the AirLLM provider with the chosen Llama model.
    # Note: apiKey doubles as the model path for this provider.
    config["providers"]["airllm"] = {
        "apiKey": "",  # Set to the model path below
        "apiBase": None,
        "extraHeaders": {},
    }

    # Set default model
    config["agents"]["defaults"]["model"] = model_path

    # Ask for Hugging Face token
    print("\n" + "=" * 70)
    print("HUGGING FACE TOKEN SETUP")
    print("=" * 70)
    print("\nA Hugging Face token is required for gated Llama models.")
    print("If you don't have one, we'll show you how to get one.\n")
    has_token = input("Do you have a Hugging Face token? (y/n): ").strip().lower()

    if has_token == "y":
        hf_token = input("\nEnter your Hugging Face token (starts with 'hf_'): ").strip()
        if hf_token and hf_token.startswith("hf_"):
            # Store the token in extraHeaders
            config["providers"]["airllm"]["extraHeaders"]["hf_token"] = hf_token
            print("\n✓ Token configured!")
        else:
            print("⚠ Warning: Token doesn't look valid (should start with 'hf_')")
            print("You can add it later by editing the config file.")
    else:
        get_hf_token_instructions()
        print("\nYou can add your token later by:")
        print(f"1. Editing: {CONFIG_PATH}")
        print("2. Adding your token to: providers.airllm.extraHeaders.hf_token")
        print("\nOr run this script again after getting your token.")

    # Always point apiKey at the model path (AirLLM treats apiKey as the
    # model path when it contains '/'); the original only did this in the
    # token branches, leaving apiKey empty when no token was entered.
    config["providers"]["airllm"]["apiKey"] = config["agents"]["defaults"]["model"]

    return config
def save_config(config):
    """Save config to file."""
    CONFIG_PATH.parent.mkdir(parents=True, exist_ok=True)
    with open(CONFIG_PATH, "w") as f:
        json.dump(config, f, indent=2)
    # Set secure permissions
    os.chmod(CONFIG_PATH, 0o600)
    print(f"\n✓ Configuration saved to: {CONFIG_PATH}")
    print("✓ File permissions set to 600 (read/write for owner only)")
def main():
    """Main setup function."""
    print("\n" + "=" * 70)
    print("NANOBOT LLAMA + AIRLLM SETUP")
    print("=" * 70)
    print("\nThis script will configure nanobot to use Llama models with AirLLM.\n")

    if CONFIG_PATH.exists():
        print(f"Found existing config at: {CONFIG_PATH}")
        backup = input("\nCreate backup? (y/n): ").strip().lower()
        if backup == "y":
            backup_path = CONFIG_PATH.with_suffix(".json.backup")
            import shutil
            shutil.copy(CONFIG_PATH, backup_path)
            print(f"✓ Backup created: {backup_path}")
    else:
        print(f"Creating new config at: {CONFIG_PATH}")

    config = create_llama_config()
    save_config(config)

    print("\n" + "=" * 70)
    print("SETUP COMPLETE!")
    print("=" * 70)
    print("\nConfiguration:")
    print(f"  Model: {config['agents']['defaults']['model']}")
    print("  Provider: airllm")
    if config["providers"]["airllm"].get("extraHeaders", {}).get("hf_token"):
        print(f"  HF Token: {'*' * 20} (configured)")
    else:
        print("  HF Token: Not configured (add it to use gated models)")
    print("\nNext steps:")
    print("  1. If you need a Hugging Face token, follow the instructions above")
    print("  2. Test it: nanobot agent -m 'Hello, what is 2+5?'")
    print("\n" + "=" * 70 + "\n")


if __name__ == "__main__":
    main()
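For reference, a minimal sketch of the config shape both setup paths write, with a small reader that mirrors how the defaults are looked up. The model path value here is illustrative (the real one comes from the user's choice or `$MODEL_DIR`), and `read_model_and_token` is a hypothetical helper, not part of nanobot:

```python
import json

# Illustrative sample of the JSON that setup.sh / setup_llama_airllm.py emit.
sample = {
    "providers": {
        "airllm": {
            "apiKey": "meta-llama/Llama-3.2-3B-Instruct",  # doubles as model path
            "apiBase": None,
            "extraHeaders": {},
        }
    },
    "agents": {"defaults": {"model": "meta-llama/Llama-3.2-3B-Instruct"}},
}


def read_model_and_token(config: dict) -> tuple:
    """Return (default model, hf_token or None) from the config layout above."""
    model = config["agents"]["defaults"]["model"]
    token = config["providers"]["airllm"].get("extraHeaders", {}).get("hf_token")
    return model, token


model, token = read_model_and_token(sample)
print(model)          # meta-llama/Llama-3.2-3B-Instruct
print(token is None)  # True (no hf_token stored in this sample)
```

Round-tripping `sample` through `json.dumps`/`json.loads` reproduces exactly what the scripts persist to `~/.nanobot/config.json` (with `None` serialized as JSON `null`).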