# Security Audit Report

**Last audit:** 2026-05-23 (re-run after SSH keys + `make maintenance`)

**Previous audit:** 2026-05-20

**Auditor:** `scripts/security-audit-*.sh`, Ansible `maintenance` + `maintenance_cron` roles

**Repo baseline** (`roles/ssh/defaults/main.yml`): `PermitRootLogin prohibit-password`, `PasswordAuthentication no`, UFW enabled.

---

## 2026-05-23 — Actions completed

| Action | Status |
|--------|--------|
| SSH keys → caseware, auto, cal, vikunja, mailcow, listmonk | ✅ All six reachable as `root` |
| SSH keys → mailcow/listmonk VMs | ✅ Via brief VM shutdown + disk inject on pve201 (no guest agent) |
| Inventory rename `vikanjans` → `vikunja` | ✅ `hosts` + `proxmox_vmid=301` |
| `apt upgrade` fleet (skip reboot) | ✅ 14 hosts via Ansible; auto via `pct exec` on pve10 |
| Tier 1 cron (journal + apt) | ✅ `roles/maintenance_cron` on PVE, sites, comms, ansible, hermes, etc. |
| Tier 2 cron (docker prune) | ✅ identity, monitoring, vikunja; git-ci-01 keeps `docker-prune-ci` |
| VM 104 (GPU-Dev) RAM 72→64 GiB | ✅ pve201; host free RAM ~1.7→10 GiB |
| Fix broken `host_vars` (ansibleVM, listmonk) | ✅ Plain YAML; old blobs → `*.vault-bak` |
| Vault `vault_*_become_password` + maintenance vaultwardenVM | ✅ 2026-05-23 |
| caddy root SSH + maintenance | ✅ `bootstrap-root-ssh-caddy`; inventory `ansible_user=root` |
| ansibleVM maintenance | ✅ become password in vault |

### Post-maintenance SSH reachability

| Host | SSH | Notes |
|------|-----|-------|
| caseware | ✅ | |
| auto | ✅ | Was slow from laptop earlier; OK after upgrade |
| cal | ✅ | |
| vikunja | ✅ | LXC 301 @ 10.0.10.159 |
| mailcow | ✅ | ~1 min downtime for key inject |
| listmonk | ✅ | ~1 min downtime for key inject |

### Maintenance playbook recap (`skip_reboot=true`)

| Host | Result |
|------|--------|
| pve201, pve10, caseware, cal, vikunja, mailcow, listmonk, identity, monitoring, hermes, levkin, portfolio, git-ci-01, sonarqube-01 | ✅ upgraded |
| caddy | ✅ (as `root`; no `sudo` package on host) |
| ansibleVM | ✅ (`vault_ansiblevm_become_password`) |
| vaultwardenVM | ✅ (`vault_vaultwarden_become_password`) |

### Open security gaps (unchanged until `make security`)

| Control | Fleet status | Risk if fixed wrong |
|---------|--------------|---------------------|
| `PasswordAuthentication yes` | Most LXCs + both PVE | **Low break risk** if SSH keys tested first in a second session |
| `PermitRootLogin yes` | pve201, pve10, sonarqube-01 | Same — use `prohibit-password`, not `no`, if you need root+key |
| fail2ban | Off everywhere | Enabling is safe; may lock you out only if you brute-force yourself |
| UFW | Off (except one dev LXC) | **Medium risk** — wrong rules drop SSH/80/443; apply via Ansible `roles/ssh` after allowlist |
| unattended-upgrades | hermes, ansibleVM only | Safe; schedule reboots separately |
| Proxmox :8006 | Open on LAN | Restrict in PVE firewall — **won't break VMs** |
| Docker on `0.0.0.0` | identity, monitoring, vaultwarden, qBit | Bind to `127.0.0.1` — **can break access** if Caddy route missing; test URL after |
| Tailscale | **Deferred** | Off by choice; remote access via **UniFi VPN** to LAN |

See [Risk explanations (2026-05-23)](#risk-explanations-2026-05-23) and [fail2ban vs password SSH](#fail2ban-vs-password-ssh) below.

---

## GPU-Dev (pve201 VM 104) — Ollama / LLMs

| Resource | Current |
|----------|---------|
| Host | pve201, VMID **104**, `GPU-Dev-Debian` |
| LAN IP | **10.0.10.122** (inventory `devGPU` @ 10.0.30.63 is a different network — use `.122` from LAN) |
| RAM | **64 GiB** guest (~60 GiB available when idle) |
| GPU | **RTX 4080 16 GiB** (PCI passthrough `hostpci0`) |
| Workload | **Ollama** already running (~3.6 GiB VRAM in sample) |

### Getting the most from RAM + GPU

1. **Right-size models to VRAM** — On a 16 GiB 4080, prefer quantised models that fit entirely in VRAM (e.g. 7B–14B Q4/Q5, or 32B Q2/Q3 if you accept quality trade-offs). If a model spills to CPU RAM, throughput drops sharply.
2. **One heavy model at a time** — Ollama loads models on demand; set `OLLAMA_MAX_LOADED_MODELS=1` (or keep only one client) so you do not fragment 64 GiB RAM + 16 GiB VRAM across several large weights.
3. **Parallel requests** — `OLLAMA_NUM_PARALLEL` defaults are conservative; raise only if VRAM headroom exists (watch `nvidia-smi` while under load).
4. **Keep guest RAM for KV cache** — With 64 GiB you can run larger context windows; set `OLLAMA_CONTEXT_LENGTH` / model `num_ctx` to what you need, not maximum “just because”.
5. **CPU offload only when needed** — `num_gpu` layers = all layers for speed; partial offload is for models that do not fit in VRAM, not for tuning.
6. **Disk** — Store models on fast local disk (not NFS); `ollama pull` once, prune old tags periodically (`ollama list` / remove unused).
7. **Proxmox** — Do not balloon GPU VM RAM; GPU passthrough already reserves most of the 64 GiB. Freeing pve201 meant lowering this VM from 72→64 GiB, not overcommitting other guests on 201.
8. **Optional** — [Open WebUI](https://github.com/open-webui/open-webui) on localhost + Caddy TLS; bind Ollama to `127.0.0.1:11434` only (LAN via VPN).

**Not in Ansible yet:** add `devGPU` / `10.0.10.122` to inventory when you want playbooks (cron, hardening) on this box.

---

## fail2ban vs password SSH

**What fail2ban does:** After too many failed SSH logins from an IP, it adds a **temporary firewall ban** for that IP (typically 10–60 minutes). It does **not** disable password authentication globally.

**Can passwords stay on if fail2ban is on?** Technically yes — fail2ban only rate-limits brute force; passwords are still weaker than keys. Best practice on servers: **keys + `PasswordAuthentication no` + fail2ban** (defence in depth).

**Your Proxmox console fallback:** If you lock yourself out of SSH on a guest, you can still use **Proxmox → VM → Console** or `pct enter` / `qm guest exec` from pve201/pve10. That is a good break-glass path, but it is **not** a substitute for keys on hosts you manage daily — console is slow and easy to misconfigure under pressure.

**Recommendation:** Enable fail2ban via `make security` with `ignoreip` including `10.0.10.0/24` and your UniFi VPN client subnet. Then disable password SSH once keys work everywhere you care about.

---

## Risk explanations (2026-05-23)

### Password SSH (`PasswordAuthentication yes`)

**How bad:** High on internet-facing IPs; medium on `10.0.10.0/24` only. Anyone who can reach :22 can try passwords indefinitely (no fail2ban).

**Will fixing break things?** No, if you (1) confirm key login works, (2) set `PasswordAuthentication no`, (3) keep a second SSH session open, (4) reload sshd. Breakage happens only if keys are missing/wrong.

### Root login (`PermitRootLogin yes` on hypervisors)

**How bad:** High — root + password on PVE is full cluster compromise.

**Will fixing break things?** Use `prohibit-password` (keys only), not `no`, unless you have another admin user with sudo. Ansible playbooks expect root on PVE today.

### fail2ban off

**How bad:** Medium — relies on LAN trust; SSH noise from scanners still fills logs.

**Will fixing break things?** Rarely. Tune `ignoreip` to your admin IP/subnet so your own typos don't ban you.

### UFW off

**How bad:** Medium on segmented LAN; high if any host has a public IP.

**Will fixing break things?** **Yes, if misconfigured** — default deny without allowing 22 from admin IP, 80/443 from Caddy, or Docker-published ports you still need. Use Ansible `roles/ssh` (UFW after SSH rules) and test.

### unattended-upgrades off

**How bad:** Medium — security patches lag until manual maintenance.

**Will fixing break things?** Usually no. Kernel updates may require reboot; use `Unattended-Upgrade::Automatic-Reboot "false"` until you want reboot windows.

### Proxmox UI :8006 exposed

**How bad:** **Critical** on untrusted networks — API gives VM/storage control.

**Will fixing break things?** Restricting to `10.0.10.0/24` does not break normal LAN admin access.

### HTTP services on all interfaces (8080, 3000, …)

**How bad:** High without TLS/auth at the edge; medium behind Caddy + LAN only.

**Will fixing break things?** **Yes** if you bind to `127.0.0.1` before Caddy `reverse_proxy` is updated. Order: Caddy route → test → then bind Docker to localhost.

### Remote access (Tailscale deferred)

**Decision:** Tailscale off; use **UniFi site-to-site / VPN** into `10.0.10.0/24` for admin and Ollama/GPU access.

**Security:** Ensure VPN is required for SSH and Proxmox :8006 from outside; do not port-forward :22/:8006 on the router without IP allowlists.

### pve201 RAM (was 97% used)

**How bad:** **Critical** — OOM kills guests, swap thrashing.

**Mitigation done:** VM 104 reduced 73728→65536 MiB (~8 GiB freed on hypervisor). Still tight; consider moving git-ci-01 or other workloads to pve10.

---

## 2026-05-20 — Original audit

**Scope:** Proxmox nodes `pve201` (10.0.10.201) and `pve10` (10.0.10.10), all LXCs via `pct exec`, SSH deep-dive on hypervisors.

---

## Executive summary

| Area | Critical | High | Medium |
|------|----------|------|--------|
| Hypervisors (201, 10) | 2 | 4 | 2 |
| LXCs on 201 (10 running) | 0 | 10 | 8 |
| LXCs on 10 (3 running) | 0 | 3 | 3 |

**Top priorities**

1. Harden **SSH on both Proxmox hosts** (root + passwords currently allowed).
2. Restrict **Proxmox API/UI port 8006** to admin IPs.
3. Disable **password SSH on all LXCs**; deploy keys + `make copy-ssh-keys` for inventory IPs.
4. Patch hosts with **40–105** pending apt upgrades (hypervisors worst).
5. Put **HTTP services** (8080, 8000, qBit, etc.) behind reverse proxy + TLS or bind to internal IPs.
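Priority 5 starts with knowing which ports actually listen on every interface. As a minimal sketch (the function name `wildcard_listeners` is hypothetical and is not part of the repo's audit scripts), the following filters `ss -tln` output for TCP sockets bound to `0.0.0.0`, `*`, or `[::]`:

```bash
#!/usr/bin/env bash
# Sketch: list TCP ports that listen on all interfaces.
# Reads `ss -tln` output on stdin; the header line is skipped automatically
# because its first field is not LISTEN.
wildcard_listeners() {
  awk '$1 == "LISTEN" {
    port = $4
    sub(/.*:/, "", port)                                   # keep text after the last colon
    addr = substr($4, 1, length($4) - length(port) - 1)    # everything before that colon
    if (addr == "0.0.0.0" || addr == "*" || addr == "[::]") print port
  }' | sort -n -u
}
```

On a host this would be run as `ss -tln | wildcard_listeners`. Ports in the output that are only meant to be reached through Caddy are the candidates for a `127.0.0.1` bind, once the Caddy route has been tested.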
---

## Proxmox hypervisors

### pve201 — 10.0.10.201 (`pve`)

| Resource | Status |
|----------|--------|
| OS | Debian 12, PVE 8.4.16, kernel 6.8.12-18-pve |
| RAM free | ~2.5 GB / 126 GB (**critical**) |
| Pending apt | **105** |
| UFW / fail2ban / unattended-upgrades | **None** |

#### SSH audit (dedicated)

| Setting | Current | Target |
|---------|---------|--------|
| `permitrootlogin` | **yes** | `prohibit-password` |
| `passwordauthentication` | **yes** | `no` |
| `pubkeyauthentication` | yes | yes |
| `maxauthtries` | 6 | 3–4 |
| `x11forwarding` | yes | no (on servers) |
| Root keys | 3 keys in `authorized_keys` | audit/remove unused |

#### Exposed services

| Port | Service | Risk |
|------|---------|------|
| 22 | SSH | Brute-force (no fail2ban) |
| 8006 | Proxmox API/UI | **Critical** — full cluster control |
| 3128 | spiceproxy | Medium |
| 111 | rpcbind | Low — reduce exposure |

#### Fixes (pve201)

```bash
# 1) SSH — prefer Ansible after limiting to your IP
make copy-ssh-key HOST=pve201   # if needed
# Manual quick fix on host:
sed -i 's/^#*PermitRootLogin.*/PermitRootLogin prohibit-password/' /etc/ssh/sshd_config
sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
sshd -t && systemctl reload sshd

# 2) Proxmox firewall — Datacenter → Firewall → restrict 8006 to 10.0.10.0/24 or admin IP
# Or iptables on host for port 8006

# 3) fail2ban
apt install fail2ban -y
systemctl enable --now fail2ban

# 4) Auto security updates
apt install unattended-upgrades apt-listchanges -y
dpkg-reconfigure -plow unattended-upgrades

# 5) Patch
apt update && apt upgrade -y
```

**Ansible (when ready):** add `pve201` / `pve10` to a `proxmox` group play with `roles/ssh` + `roles/monitoring_server` (fail2ban). Do **not** lock yourself out — test with second session first.
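The SSH audit table for pve201 uses the lowercase keys that `sshd -T` prints, so the Current/Target comparison can be scripted. A minimal sketch, assuming the targets from that table; the function name `check_sshd_targets` is hypothetical, and the repo's `security-audit-ssh.sh` may work differently:

```bash
#!/usr/bin/env bash
# Sketch: compare effective sshd settings (as printed by `sshd -T`) with the
# targets from the SSH audit table. Reads the dump on stdin, prints one line
# per checked setting, and returns non-zero if any setting misses its target.
check_sshd_targets() {
  awk '
    BEGIN {
      # without-password is the older spelling of prohibit-password,
      # so both values are accepted for permitrootlogin.
      want["permitrootlogin"]        = "prohibit-password|without-password"
      want["passwordauthentication"] = "no"
      want["pubkeyauthentication"]   = "yes"
      want["x11forwarding"]          = "no"
    }
    ($1 in want) {
      if ($2 ~ ("^(" want[$1] ")$")) {
        printf "OK   %s %s\n", $1, $2
      } else {
        printf "FAIL %s %s (want %s)\n", $1, $2, want[$1]
        bad = 1
      }
    }
    $1 == "maxauthtries" {
      # The table targets 3-4; anything above 4 is flagged.
      if ($2 + 0 <= 4) {
        printf "OK   %s %s\n", $1, $2
      } else {
        printf "FAIL %s %s (want <= 4)\n", $1, $2
        bad = 1
      }
    }
    END { exit (bad ? 1 : 0) }
  '
}
```

On a hypervisor this would be run as root with `sshd -T | check_sshd_targets`, before and after the `sed` edits, while the second SSH session is still open.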
---

### pve10 — 10.0.10.10 (`PVENAS`)

| Resource | Status |
|----------|--------|
| OS | Debian 13 (trixie), PVE, kernel 6.17.13-3-pve |
| Load | **~30** on 24 CPUs (overloaded) |
| Pending apt | **92** |
| UFW / fail2ban / unattended-upgrades | **None** |
| ZFS `NAS.SP00` | **inactive** (I/O suspended) |
| PBS `PVEBUVD00` → 10.0.10.200:8007 | **unreachable** |

#### SSH audit (dedicated)

Same as pve201: `permitrootlogin yes`, `passwordauthentication yes`, 3 root authorized_keys.

#### Exposed services

| Port | Service | Risk |
|------|---------|------|
| 22 | SSH | High |
| 8006 | Proxmox API/UI | **Critical** |
| 2049, mountd, statd | NFS/RPC | High on LAN |
| 3128 | spiceproxy | Medium |

#### Fixes (pve10)

Same SSH / fail2ban / unattended-upgrades / patch steps as pve201. Additional:

```bash
# Investigate ZFS pool
zpool status NAS.SP00
# Fix PBS connectivity or remove stale datastore from Proxmox UI
```

---

## LXCs on pve201 (via `pct exec`)

| VMID | Name | IP | Status | SSH root | Password auth | UFW | fail2ban | Upgrades | Public services |
|------|------|-----|--------|----------|---------------|-----|----------|----------|-----------------|
| 301 | vikunja-debian | 10.0.10.159 | running | without-password | **yes** | no | no | 0 | **3456**, 22 |
| 302 | qbit-debian | 10.0.10.91 | running | without-password | **yes** | no | no | 0 | **8080** (qBit), 22 |
| 303 | searchXNG-debian | 10.0.10.70 | running | without-password | **yes** | no | no | **83** | **8080**, 22 |
| 304 | wireguard-debian | 10.0.10.192 | running | without-password | **yes** | no | no | 0 | 22 |
| 305 | kuma-debian | 10.0.10.197 | **stopped** | — | — | — | — | — | replaced by LXC 218 |
| 306 | portfolio | — | **destroyed** | — | — | — | — | — | migrated → pve10 LXC **219** @ `10.0.10.106` (purged 2026-05-22) |
| 307 | jobber-delian | 10.0.10.178 | running | without-password | **yes** | no | no | **83** | **3005**, 22 |
| 308 | stirling-pdf | 10.0.10.43 | running | without-password | **yes** | no | no | 0 | **8080**, 22 |
| 9001 | pote-dev | 10.0.10.114 | **stopped** | — | — | — | — | — | — |
| 9101 | punimTagFE-dev | 10.0.10.121 | running | without-password | **yes** | **active** | no | **89** | **8000**, 111, 22 |
| 9401 | mirrormatch-dev | 10.0.10.141 | **stopped** | — | — | — | — | — | — |

**Inventory mapping:** `vikunja` → 159 (LXC 301), `qBittorrent` → 91, `punimTag` app → 121.

### Common LXC issues (pve201)

| Issue | Severity | Fix |
|-------|----------|-----|
| `passwordauthentication yes` on all LXCs | High | Set `PasswordAuthentication no` in `/etc/ssh/sshd_config`, reload sshd |
| No fail2ban | High | Install fail2ban or rely on Proxmox FW + LAN segmentation |
| Apps on `0.0.0.0:8080` / 8000 / 3456 | High | Bind to localhost + Caddy, or restrict via Proxmox guest firewall (`firewall=1` on net0 — enable rules) |
| 83–89 pending upgrades on several CTs | Medium | `pct exec <vmid> -- bash -c 'apt update && apt upgrade -y'` (the `bash -c` wrapper keeps `apt upgrade` inside the container) |
| Stopped dev CTs (9001, 9401) | Low | Start when needed or keep stopped to reduce attack surface |

### Per-LXC fixes (pve201)

```bash
# Example: harden vikunja (301) from the Proxmox host
pct exec 301 -- sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
pct exec 301 -- systemctl reload ssh

# Patch a container with pending upgrades (303)
pct exec 303 -- bash -c 'apt update && apt upgrade -y'

# Copy your SSH key (from Mac, once password/key works)
make copy-ssh-key HOST=vikunja      # 10.0.10.159
make copy-ssh-key HOST=qBittorrent  # 10.0.10.91
```

**punimTagFE-dev (9101):** Only LXC with **UFW active** — extend rules to deny inbound except 22 from admin subnet; still disable password auth.
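To patch every running container on a node instead of one VMID at a time, the `pct exec` call can be generated from `pct list`. A minimal sketch, assuming the usual `pct list` layout (header row, VMID in the first column, status in the second); the function names `running_ctids` and `print_patch_commands` are hypothetical:

```bash
#!/usr/bin/env bash
# Print the VMIDs of running containers from `pct list` output on stdin.
running_ctids() {
  awk 'NR > 1 && $2 == "running" { print $1 }'
}

# Print (do not run) one patch command per running container, so the list
# can be reviewed before anything is executed.
print_patch_commands() {
  running_ctids | while read -r ctid; do
    printf "pct exec %s -- bash -c 'apt update && apt upgrade -y'\n" "$ctid"
  done
}
```

On the Proxmox host, `pct list | print_patch_commands > patch-cts.sh` writes the commands to a file (the file name is only an example) that can be inspected and then run with `sh patch-cts.sh`. Writing to a file first, instead of piping straight into a shell, keeps `pct exec` from reading the remaining commands as its own stdin.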
---

## LXCs on pve10 (via `pct exec`)

| VMID | Name | IP | Status | SSH root | Password auth | UFW | fail2ban | Upgrades | Public services |
|------|------|-----|--------|----------|---------------|-----|----------|----------|-----------------|
| 210 | cal | 10.0.10.228 | running | without-password | **yes** | no | no | 0 | **3000**, 22 |
| 215 | caseware | 10.0.10.105 | running | without-password | **yes** | no | no | **40** | **80** (nginx), 22 |
| 216 | auto | 10.0.10.59 | running | without-password | **yes** | no | no | **40** | **80** (nginx), 22 |

**Inventory mapping:** `caseware` → 105, `auto` → 59.

### Fixes (pve10 LXCs)

```bash
# SSH harden caseware (215)
pct exec 215 -- sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
pct exec 215 -- systemctl reload sshd

# Patch (bash -c keeps `apt upgrade` inside the container, not on the host)
pct exec 215 -- bash -c 'apt update && apt upgrade -y'
pct exec 216 -- bash -c 'apt update && apt upgrade -y'

# Deploy keys from Mac
make copy-ssh-key HOST=caseware
make copy-ssh-key HOST=auto
```

**HTTP port 80 on caseware/auto:** Ensure TLS termination on Caddy (inventory host `caddy` 10.0.10.50) and no plain HTTP from WAN if exposed.

---

## SSH hardening checklist (all Linux targets)

Use this order to avoid lockout:

1. Confirm your key works: `ssh -o BatchMode=yes root@<host> true`
2. Set `PasswordAuthentication no`
3. Set `PermitRootLogin prohibit-password` (LXCs already `without-password` — equivalent for keys-only)
4. `sshd -t && systemctl reload sshd`
5. Open **second terminal** and test before closing first
6. Optional: change SSH port, `MaxAuthTries 4`, disable `X11Forwarding`

**Ansible alignment:**

```bash
# After keys on host
make dev HOST=<host> --tags security
# or role ssh via playbooks that include roles/ssh
```

---

## Re-run audits

```bash
# Hypervisor full audit
ssh root@10.0.10.201 'bash -s' < scripts/security-audit-remote.sh
ssh root@10.0.10.10 'bash -s' < scripts/security-audit-remote.sh

# Hypervisor SSH-only
ssh root@10.0.10.201 'bash -s' < scripts/security-audit-ssh.sh

# All LXCs on a node
ssh root@10.0.10.201 'bash -s' < scripts/security-audit-lxc-via-pve.sh
ssh root@10.0.10.10 'bash -s' < scripts/security-audit-lxc-via-pve.sh
```

---

## Tracking

| Item | Date | Status |
|------|------|--------|
| SSH keys caseware, auto, cal, vikunja, mailcow, listmonk | 2026-05-23 | ☑ |
| Fleet `apt upgrade` (no reboot) | 2026-05-23 | ☑ all previously failed hosts fixed |
| Tier 1 cron (journal + apt) | 2026-05-23 | ☑ PVE + most hosts via Ansible |
| Tier 2 cron (docker prune) | 2026-05-23 | ☑ identity, monitoring, vikunja, git-ci-01 |
| VM 104 RAM 72→64 GiB | 2026-05-23 | ☑ |
| Inventory `vikunja` rename | 2026-05-23 | ☑ |
| Fix `host_vars` ansibleVM / listmonk merge | 2026-05-23 | ☑ plain YAML (review `*.vault-bak`) |
| SSH harden pve201 | | ☐ |
| SSH harden pve10 | | ☐ |
| Restrict 8006 on both nodes | | ☐ |
| fail2ban on hypervisors | | ☐ |
| `make security` on production groups | | ☐ |
| Disable password SSH on all LXCs | | ☐ |
| `copy-ssh-keys` remaining inventory | | ☐ partial |
| TLS / localhost bind for :8080 services | | ☐ |
| unattended-upgrades all production | | ☐ |
| Tailscale re-auth | | ⏸ deferred (UniFi VPN) |
| Fix ZFS NAS.SP00 on pve10 | | ☐ |
| caddy Ansible as root | 2026-05-23 | ☑ |
| vaultwardenVM / ansibleVM become in vault | 2026-05-23 | ☑ |
| Add GPU-Dev `10.0.10.122` to inventory | | ☐ |
| Ollama bind localhost + optional Open WebUI | | ☐ |

---

## Next steps (priority)

1. **`make security`** on one site host (e.g. caseware) with a second SSH session open — disable password SSH, enable UFW + fail2ban (`ignoreip` = LAN + VPN pool).
2. **Restrict Proxmox :8006** to `10.0.10.0/24` + VPN subnet on pve201 and pve10.
3. **Bind internal Docker ports** on identity / monitoring / vaultwarden to `127.0.0.1` after confirming Caddy routes.
4. **GPU-Dev:** point clients at `http://10.0.10.122:11434` over VPN; tune Ollama env vars; add host to inventory when automating.
5. **unattended-upgrades** on production LXCs (reboot policy manual).
6. Review `host_vars/*.vault-bak` and merge any secrets still needed into vault + plain host_vars.

---

## References

- **[Security remediation plan](security-remediation-plan.md)** — phased fixes (critical → low) and login model
- [Security hardening guide](security.md)
- [SECURITY_HARDENING_PLAN.md](../SECURITY_HARDENING_PLAN.md)
- Role defaults: `roles/ssh/defaults/main.yml`