# 4080 LLM Server (Work Agent)

LLM server for the work agent, running Llama 3.1 70B Q4 on an RTX 4080.

## Setup

### Option 1: Ollama (Recommended - Easiest)

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Download model
ollama pull llama3.1:70b-q4_0

# Start server
ollama serve  # Runs on http://localhost:11434
```

### Option 2: vLLM (For Higher Throughput)

```bash
# Install vLLM
pip install vllm

# Start server
# NOTE: --quantization awq expects AWQ-quantized weights, so --model should
# point at an AWQ checkpoint of Llama 3.1 70B Instruct rather than the
# full-precision repo shown here.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-70B-Instruct \
  --quantization awq \
  --tensor-parallel-size 1 \
  --host 0.0.0.0 \
  --port 8000
```

## Configuration

- **Model**: Llama 3.1 70B Q4
- **Context Window**: 8K tokens (practical limit)
- **VRAM Usage**: ~14GB on the GPU (a Q4 70B model is roughly 40GB in total, so the remaining layers are offloaded to CPU/RAM)
- **Concurrency**: 2 requests max

## API

The example below uses Ollama's native chat endpoint (`/api/chat`); Ollama also exposes an OpenAI-compatible API at `/v1/chat/completions` for clients that expect the OpenAI schema:

```bash
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:70b-q4_0",
  "messages": [
    {"role": "user", "content": "Hello"}
  ],
  "stream": false
}'
```

## Systemd Service

See `ollama-4080.service` for the systemd configuration.
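For reference, a unit like `ollama-4080.service` might look roughly like the sketch below. The actual file in this repo is authoritative; `OLLAMA_HOST` and `OLLAMA_NUM_PARALLEL` are real Ollama environment variables, but the paths, user, and values here are assumptions chosen to match the limits listed under Configuration:

```ini
# Illustrative sketch only — see ollama-4080.service for the real unit.
[Unit]
Description=Ollama LLM server (RTX 4080 work agent)
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
# Bind address/port and cap in-flight requests at 2, per the limits above.
Environment=OLLAMA_HOST=0.0.0.0:11434
Environment=OLLAMA_NUM_PARALLEL=2
Restart=always

[Install]
WantedBy=multi-user.target
```

After editing a unit file, `systemctl daemon-reload` followed by `systemctl restart ollama-4080` applies the change.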
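The curl request above can also be made from Python with only the standard library. This is a minimal sketch; the helper names (`build_chat_payload`, `extract_reply`, `chat`) are illustrative, not part of any Ollama SDK:

```python
"""Minimal client sketch for Ollama's native /api/chat endpoint.

Assumes the Ollama server from the setup section is listening on
localhost:11434. Helper names here are illustrative only.
"""
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"


def build_chat_payload(model: str, user_message: str) -> dict:
    # Same request shape as the curl example; stream=False asks the
    # server for a single JSON object instead of a stream of chunks.
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": False,
    }


def extract_reply(response: dict) -> str:
    # A non-streaming /api/chat response carries the assistant's text
    # under message.content.
    return response["message"]["content"]


def chat(model: str, user_message: str) -> str:
    payload = json.dumps(build_chat_payload(model, user_message)).encode()
    req = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.load(resp))
```

With the server running, `chat("llama3.1:70b-q4_0", "Hello")` would return the model's reply as a string.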