ansible/docs/guides/security-audit-report.md
ilia f17a1a3bcc
Add homelab SSO, maintenance cron, and inventory cleanup.
Cal Authentik OIDC playbook/role (deferred until license), Vikunja OIDC
docs and vault secrets, SSO matrix, mailcow LAN proxy fix, extended
security audit docs, maintenance_cron role with group_vars split, and
inventory updates (vikunja rename, identity/monitoring/cal host_vars).

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-23 20:23:10 -04:00


Security Audit Report

Last audit: 2026-05-23 (re-run after SSH keys + make maintenance)
Previous audit: 2026-05-20
Auditor: scripts/security-audit-*.sh, Ansible maintenance + maintenance_cron roles
Repo baseline (roles/ssh/defaults/main.yml): PermitRootLogin prohibit-password, PasswordAuthentication no, UFW enabled.


2026-05-23 — Actions completed

| Action | Status |
| --- | --- |
| SSH keys → caseware, auto, cal, vikunja, mailcow, listmonk | All six reachable as root |
| SSH keys → mailcow/listmonk VMs | Via brief VM shutdown + disk inject on pve201 (no guest agent) |
| Inventory rename | vikanjans → vikunja hosts + proxmox_vmid=301 |
| apt upgrade fleet (skip reboot) | 14 hosts via Ansible; auto via pct exec on pve10 |
| Tier 1 cron (journal + apt) | roles/maintenance_cron on PVE, sites, comms, ansible, hermes, etc. |
| Tier 2 cron (docker prune) | identity, monitoring, vikunja; git-ci-01 keeps docker-prune-ci |
| VM 104 (GPU-Dev) RAM 72→64 GiB | pve201; host free RAM ~1.7→10 GiB |
| Fix broken host_vars (ansibleVM, listmonk) | Plain YAML; old blobs → *.vault-bak |
| Vault vault_*_become_password + maintenance | vaultwardenVM 2026-05-23 |
| caddy root SSH + maintenance | bootstrap-root-ssh-caddy; inventory ansible_user=root |
| ansibleVM maintenance | become password in vault |

Post-maintenance SSH reachability

| Host | SSH | Notes |
| --- | --- | --- |
| caseware | | |
| auto | | Was slow from laptop earlier; OK after upgrade |
| cal | | |
| vikunja | | LXC 301 @ 10.0.10.159 |
| mailcow | | ~1 min downtime for key inject |
| listmonk | | ~1 min downtime for key inject |

Maintenance playbook recap (skip_reboot=true)

| Host | Result |
| --- | --- |
| pve201, pve10, caseware, cal, vikunja, mailcow, listmonk, identity, monitoring, hermes, levkin, portfolio, git-ci-01, sonarqube-01 | upgraded |
| caddy | (as root; no sudo package on host) |
| ansibleVM | (vault_ansiblevm_become_password) |
| vaultwardenVM | (vault_vaultwarden_become_password) |

Open security gaps (unchanged until make security)

| Control | Fleet status | Risk if fixed wrong |
| --- | --- | --- |
| PasswordAuthentication yes | Most LXCs + both PVE | Low break risk if SSH keys tested first in a second session |
| PermitRootLogin yes | pve201, pve10, sonarqube-01 | Same: use prohibit-password, not no, if you need root+key |
| fail2ban | Off everywhere | Enabling is safe; may lock you out only if you brute-force yourself |
| UFW | Off (except one dev LXC) | Medium risk: wrong rules drop SSH/80/443; apply via Ansible roles/ssh after allowlist |
| unattended-upgrades | hermes, ansibleVM only | Safe; schedule reboots separately |
| Proxmox :8006 | Open on LAN | Restrict in PVE firewall; won't break VMs |
| Docker on 0.0.0.0 | identity, monitoring, vaultwarden, qBit | Bind to 127.0.0.1; can break access if Caddy route missing, so test URL after |
| Tailscale | Deferred | Off by choice; remote access via UniFi VPN to LAN |

See Risk explanations (2026-05-23) and fail2ban vs password SSH below.


GPU-Dev (pve201 VM 104) — Ollama / LLMs

| Resource | Current |
| --- | --- |
| Host | pve201, VMID 104, GPU-Dev-Debian |
| LAN IP | 10.0.10.122 (inventory devGPU @ 10.0.30.63 is a different network; use .122 from LAN) |
| RAM | 64 GiB guest (~60 GiB available when idle) |
| GPU | RTX 4080 16 GiB (PCI passthrough hostpci0) |
| Workload | Ollama already running (~3.6 GiB VRAM in sample) |

Getting the most from RAM + GPU

  1. Right-size models to VRAM: On a 16 GiB 4080, prefer quantised models that fit entirely in VRAM (e.g. 7B–14B Q4/Q5, or 32B Q2/Q3 if you accept quality trade-offs). If a model spills to CPU RAM, throughput drops sharply.
  2. One heavy model at a time: Ollama loads models on demand; set OLLAMA_MAX_LOADED_MODELS=1 (or keep only one client) so you do not fragment 64 GiB RAM + 16 GiB VRAM across several large weights.
  3. Parallel requests: OLLAMA_NUM_PARALLEL defaults are conservative; raise it only if VRAM headroom exists (watch nvidia-smi under load).
  4. Keep guest RAM for KV cache: With 64 GiB you can run larger context windows; set OLLAMA_CONTEXT_LENGTH / model num_ctx to what you need, not to the maximum "just because".
  5. CPU offload only when needed: set num_gpu so that all layers run on the GPU for speed; partial offload is for models that do not fit in VRAM, not a tuning knob.
  6. Disk: Store models on fast local disk (not NFS); ollama pull once, then prune old tags periodically (ollama list, then ollama rm for unused ones).
  7. Proxmox: Do not balloon GPU VM RAM; PCI passthrough pins guest memory on the host, so the 64 GiB is effectively reserved. Freeing pve201 meant lowering this VM from 72→64 GiB, not overcommitting other guests on 201.
  8. Optional: Open WebUI on localhost + Caddy TLS; bind Ollama to 127.0.0.1:11434 only (LAN via VPN).
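The environment variables named in the list can be collected in one systemd drop-in. The sketch below only prints such a drop-in for review. It assumes Ollama runs as a systemd unit called ollama.service on GPU-Dev, and the function name and the numeric values (one loaded model, one parallel request, an 8192-token context) are illustrative assumptions, not measured settings for this box.

```shell
#!/usr/bin/env bash
# Sketch: print a systemd drop-in for ollama.service with the tuning knobs
# from the list above. All numeric values are illustrative assumptions.

render_ollama_override() {
  local max_models="${1:-1}"   # OLLAMA_MAX_LOADED_MODELS
  local parallel="${2:-1}"     # OLLAMA_NUM_PARALLEL
  local ctx="${3:-8192}"       # OLLAMA_CONTEXT_LENGTH (hypothetical value)
  cat <<EOF
[Service]
Environment="OLLAMA_HOST=127.0.0.1:11434"
Environment="OLLAMA_MAX_LOADED_MODELS=${max_models}"
Environment="OLLAMA_NUM_PARALLEL=${parallel}"
Environment="OLLAMA_CONTEXT_LENGTH=${ctx}"
EOF
}

# Print to stdout for review; nothing is written to disk here.
render_ollama_override 1 1 8192
```

If the output looks right, it could be saved as /etc/systemd/system/ollama.service.d/override.conf on the guest, followed by systemctl daemon-reload and systemctl restart ollama. With OLLAMA_HOST bound to 127.0.0.1, LAN clients would need to go through the reverse proxy rather than port 11434 directly.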

Not in Ansible yet: add devGPU / 10.0.10.122 to inventory when you want playbooks (cron, hardening) on this box.


fail2ban vs password SSH

What fail2ban does: After too many failed SSH logins from an IP, it adds a temporary firewall ban for that IP (typically 10–60 minutes). It does not disable password authentication globally.

Can passwords stay on if fail2ban is on? Technically yes — fail2ban only rate-limits brute force; passwords are still weaker than keys. Best practice on servers: keys + PasswordAuthentication no + fail2ban (defence in depth).

Your Proxmox console fallback: If you lock yourself out of SSH on a guest, you can still use Proxmox → VM → Console, pct enter (LXCs), or qm guest exec (VMs) from pve201/pve10. Note that qm guest exec needs the QEMU guest agent, which the mailcow and listmonk VMs do not run. That is a good break-glass path, but it is not a substitute for keys on hosts you manage daily: console is slow and easy to misconfigure under pressure.

Recommendation: Enable fail2ban via make security with ignoreip including 10.0.10.0/24 and your UniFi VPN client subnet. Then disable password SSH once keys work everywhere you care about.
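As a rough illustration of that recommendation, the sketch below prints a jail.local with the LAN and a VPN pool in ignoreip. The function name is hypothetical, 192.0.2.0/24 is a documentation placeholder for the real UniFi VPN client subnet, and the ban timings are assumptions. What make security actually deploys may differ.

```shell
#!/usr/bin/env bash
# Sketch: print a fail2ban jail.local that whitelists the admin networks.
# Usage: render_jail_local LAN_CIDR VPN_CIDR

render_jail_local() {
  local lan="$1" vpn="$2"
  cat <<EOF
[DEFAULT]
# Never ban loopback, the admin LAN, or the VPN client pool.
ignoreip = 127.0.0.1/8 ::1 ${lan} ${vpn}
bantime  = 10m
findtime = 10m
maxretry = 5

[sshd]
enabled = true
# Assumption: hosts that log only to journald (no /var/log/auth.log)
# need the systemd backend for the sshd jail to start.
backend = systemd
EOF
}

# 192.0.2.0/24 is a placeholder; substitute the real VPN client subnet.
render_jail_local "10.0.10.0/24" "192.0.2.0/24"
```

The output would normally go to /etc/fail2ban/jail.local, after which fail2ban-client status sshd shows whether the jail is active.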


Risk explanations (2026-05-23)

Password SSH (PasswordAuthentication yes)

How bad: High on internet-facing IPs; medium on 10.0.10.0/24 only. Anyone who can reach :22 can try passwords indefinitely (no fail2ban).

Will fixing break things? No, if you (1) confirm key login works, (2) set PasswordAuthentication no, (3) keep a second SSH session open, (4) reload sshd. Breakage happens only if keys are missing/wrong.
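One way to confirm the result after a reload is to read the effective configuration instead of the file. The sketch below parses sshd -T style output from stdin; the function name is hypothetical and the canned input only mirrors the hypervisor findings from the audit.

```shell
#!/usr/bin/env bash
# Sketch: decide from `sshd -T` output whether password logins are off and
# root is limited to keys. Reads lowercase "key value" lines on stdin.

sshd_is_hardened() {
  local key value pass="" root=""
  # The extra test keeps a final line that lacks a trailing newline.
  while read -r key value || [ -n "$key" ]; do
    case "$key" in
      passwordauthentication) pass="$value" ;;
      permitrootlogin)        root="$value" ;;
    esac
  done
  echo "passwordauthentication=${pass:-unset} permitrootlogin=${root:-unset}"
  [ "$pass" = "no" ] || return 1
  case "$root" in
    no|prohibit-password|without-password) return 0 ;;
    *) return 1 ;;
  esac
}

# Demo on canned input that mirrors the 2026-05-20 hypervisor findings.
printf 'permitrootlogin yes\npasswordauthentication yes\n' | sshd_is_hardened \
  || echo "not hardened yet"
```

On a real host the call would be sshd -T | sshd_is_hardened, run as root from the second session that is still open.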

Root login (PermitRootLogin yes on hypervisors)

How bad: High — root + password on PVE is full cluster compromise.

Will fixing break things? Use prohibit-password (keys only), not no, unless you have another admin user with sudo. Ansible playbooks expect root on PVE today.

fail2ban off

How bad: Medium — relies on LAN trust; SSH noise from scanners still fills logs.

Will fixing break things? Rarely. Tune ignoreip to your admin IP/subnet so your own typos don't ban you.

UFW off

How bad: Medium on segmented LAN; high if any host has a public IP.

Will fixing break things? Yes, if misconfigured — default deny without allowing 22 from admin IP, 80/443 from Caddy, or Docker-published ports you still need. Use Ansible roles/ssh (UFW after SSH rules) and test.
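To make the ordering concrete, the sketch below prints a UFW command plan instead of running it, so the rules can be reviewed before anything changes. The function name is hypothetical, and the choice to allow web ports only from the Caddy host (10.0.10.50 in the inventory) is an assumption about the desired policy. It is not what roles/ssh does.

```shell
#!/usr/bin/env bash
# Sketch: print (not execute) a UFW rule plan with SSH allowed first.
# Usage: ufw_plan ADMIN_SUBNET PROXY_IP [WEB_PORT...]

ufw_plan() {
  local admin="$1" proxy="$2" port
  shift 2
  echo "ufw default deny incoming"
  echo "ufw default allow outgoing"
  # SSH goes in before the firewall is enabled, so the admin session survives.
  echo "ufw allow from ${admin} to any port 22 proto tcp"
  for port in "$@"; do
    echo "ufw allow from ${proxy} to any port ${port} proto tcp"
  done
  echo "ufw --force enable"
}

ufw_plan "10.0.10.0/24" "10.0.10.50" 80 443
```

After review, the plan could be piped to sh on the target host. Docker-published ports are usually handled by Docker's own iptables rules and may not be covered by these UFW rules, so they need a separate check.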

unattended-upgrades off

How bad: Medium — security patches lag until manual maintenance.

Will fixing break things? Usually no. Kernel updates may require reboot; use Unattended-Upgrade::Automatic-Reboot "false" until you want reboot windows.
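A minimal sketch of the matching APT configuration follows. It prints the three settings for review; the function name is hypothetical, and the target file name under /etc/apt/apt.conf.d/ is left to the operator.

```shell
#!/usr/bin/env bash
# Sketch: print APT settings that enable daily unattended upgrades while
# leaving reboots manual. Pass "true" to allow automatic reboots instead.

render_auto_upgrades() {
  local reboot="${1:-false}"
  cat <<EOF
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
Unattended-Upgrade::Automatic-Reboot "${reboot}";
EOF
}

render_auto_upgrades false
```

A dry run with unattended-upgrade --dry-run --debug on the host shows which packages would be picked up before the timer fires.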

Proxmox UI :8006 exposed

How bad: Critical on untrusted networks — API gives VM/storage control.

Will fixing break things? Restricting to 10.0.10.0/24 does not break normal LAN admin access.

HTTP services on all interfaces (8080, 3000, …)

How bad: High without TLS/auth at the edge; medium behind Caddy + LAN only.

Will fixing break things? Yes if you bind to 127.0.0.1 before Caddy reverse_proxy is updated. Order: Caddy route → test → then bind Docker to localhost.
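Before and after rebinding, it helps to list which TCP ports still listen on every interface. The sketch below filters ss -H -tln output for wildcard binds; the function name is hypothetical and the sample lines are made up for the demo.

```shell
#!/usr/bin/env bash
# Sketch: from `ss -H -tln` output on stdin, print the ports that listen on
# all interfaces (0.0.0.0, [::] or *), one per line, sorted numerically.

wildcard_listeners() {
  awk '$4 ~ /^(0\.0\.0\.0|\*|\[::\]):[0-9]+$/ {
         n = split($4, parts, ":")   # the port is the last ":" field
         print parts[n]
       }' | sort -n -u
}

# Demo with made-up listener lines; the 127.0.0.1 entry is filtered out.
wildcard_listeners <<'EOF'
LISTEN 0 4096 0.0.0.0:8080 0.0.0.0:*
LISTEN 0 128  127.0.0.1:3000 0.0.0.0:*
LISTEN 0 128  [::]:22 [::]:*
EOF
```

On a host the call would be ss -H -tln | wildcard_listeners. A port that disappears from the list after the Docker change is bound to localhost only, which is the point at which the Caddy route has to be working.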

Remote access (Tailscale deferred)

Decision: Tailscale off; use UniFi site-to-site / VPN into 10.0.10.0/24 for admin and Ollama/GPU access.

Security: Ensure VPN is required for SSH and Proxmox :8006 from outside; do not port-forward :22/:8006 on the router without IP allowlists.

pve201 RAM (was 97% used)

How bad: Critical — OOM kills guests, swap thrashing.

Mitigation done: VM 104 reduced 73728→65536 MiB (~8 GiB freed on hypervisor). Still tight; consider moving git-ci-01 or other workloads to pve10.
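The MiB figures follow directly from the GiB sizes. A small worked check, assuming nothing beyond the numbers quoted above (the helper name is hypothetical):

```shell
#!/usr/bin/env bash
# Worked arithmetic for the VM 104 resize: GiB to MiB and the amount freed.

gib_to_mib() { echo $(( $1 * 1024 )); }

old_mib="$(gib_to_mib 72)"
new_mib="$(gib_to_mib 64)"
freed_gib=$(( (old_mib - new_mib) / 1024 ))

echo "VM 104: ${old_mib} -> ${new_mib} MiB, ${freed_gib} GiB returned to pve201"
# Proxmox takes the memory size in MiB; this only prints the command.
echo "qm set 104 --memory ${new_mib}"
```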


2026-05-20 — Original audit

Scope: Proxmox nodes pve201 (10.0.10.201) and pve10 (10.0.10.10), all LXCs via pct exec, SSH deep-dive on hypervisors.


Executive summary

| Area | Critical | High | Medium |
| --- | --- | --- | --- |
| Hypervisors (201, 10) | 2 | 4 | 2 |
| LXCs on 201 (10 running) | 0 | 10 | 8 |
| LXCs on 10 (3 running) | 0 | 3 | 3 |

Top priorities

  1. Harden SSH on both Proxmox hosts (root + passwords currently allowed).
  2. Restrict Proxmox API/UI port 8006 to admin IPs.
  3. Disable password SSH on all LXCs; deploy keys + make copy-ssh-keys for inventory IPs.
  4. Patch hosts with 40–105 pending apt upgrades (hypervisors worst).
  5. Put HTTP services (8080, 8000, qBit, etc.) behind reverse proxy + TLS or bind to internal IPs.

Proxmox hypervisors

pve201 — 10.0.10.201 (pve)

| Resource | Status |
| --- | --- |
| OS | Debian 12, PVE 8.4.16, kernel 6.8.12-18-pve |
| RAM free | ~2.5 GB / 126 GB (critical) |
| Pending apt | 105 |
| UFW / fail2ban / unattended-upgrades | None |

SSH audit (dedicated)

| Setting | Current | Target |
| --- | --- | --- |
| permitrootlogin | yes | prohibit-password |
| passwordauthentication | yes | no |
| pubkeyauthentication | yes | yes |
| maxauthtries | 6 | 3–4 |
| x11forwarding | yes | no (on servers) |
| Root keys | 3 keys in authorized_keys | audit/remove unused |

Exposed services

| Port | Service | Risk |
| --- | --- | --- |
| 22 | SSH | Brute-force (no fail2ban) |
| 8006 | Proxmox API/UI | Critical: full cluster control |
| 3128 | spiceproxy | Medium |
| 111 | rpcbind | Low: reduce exposure |

Fixes (pve201)

# 1) SSH — prefer Ansible after limiting to your IP
make copy-ssh-key HOST=pve201   # if needed
# Manual quick fix on host:
sed -i 's/^#*PermitRootLogin.*/PermitRootLogin prohibit-password/' /etc/ssh/sshd_config
sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
sshd -t && systemctl reload sshd

# 2) Proxmox firewall — Datacenter → Firewall → restrict 8006 to 10.0.10.0/24 or admin IP
# Or iptables on host for port 8006

# 3) fail2ban
apt install fail2ban -y
systemctl enable --now fail2ban

# 4) Auto security updates
apt install unattended-upgrades apt-listchanges -y
dpkg-reconfigure -plow unattended-upgrades

# 5) Patch
apt update && apt upgrade -y

Ansible (when ready): add pve201 / pve10 to a proxmox group play with roles/ssh + roles/monitoring_server (fail2ban). Do not lock yourself out — test with second session first.


pve10 — 10.0.10.10 (PVENAS)

| Resource | Status |
| --- | --- |
| OS | Debian 13 (trixie), PVE, kernel 6.17.13-3-pve |
| Load | ~30 on 24 CPUs (overloaded) |
| Pending apt | 92 |
| UFW / fail2ban / unattended-upgrades | None |
| ZFS | NAS.SP00 inactive (I/O suspended) |
| PBS | PVEBUVD00 → 10.0.10.200:8007 unreachable |

SSH audit (dedicated)

Same as pve201: permitrootlogin yes, passwordauthentication yes, 3 root authorized_keys.

Exposed services

| Port | Service | Risk |
| --- | --- | --- |
| 22 | SSH | High |
| 8006 | Proxmox API/UI | Critical |
| 2049, mountd, statd | NFS/RPC | High on LAN |
| 3128 | spiceproxy | Medium |

Fixes (pve10)

Same SSH / fail2ban / unattended-upgrades / patch steps as pve201.

Additional:

# Investigate ZFS pool
zpool status NAS.SP00
# Fix PBS connectivity or remove stale datastore from Proxmox UI

LXCs on pve201 (via pct exec)

| VMID | Name | IP | Status | SSH root | Password auth | UFW | fail2ban | Upgrades | Public services |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 301 | vikunja-debian | 10.0.10.159 | running | without-password | yes | no | no | 0 | 3456, 22 |
| 302 | qbit-debian | 10.0.10.91 | running | without-password | yes | no | no | 0 | 8080 (qBit), 22 |
| 303 | searchXNG-debian | 10.0.10.70 | running | without-password | yes | no | no | 83 | 8080, 22 |
| 304 | wireguard-debian | 10.0.10.192 | running | without-password | yes | no | no | 0 | 22 |
| 305 | kuma-debian | 10.0.10.197 | stopped (replaced by LXC 218) | | | | | | |
| 306 | portfolio | | destroyed (migrated → pve10 LXC 219 @ 10.0.10.106; purged 2026-05-22) | | | | | | |
| 307 | jobber-delian | 10.0.10.178 | running | without-password | yes | no | no | 83 | 3005, 22 |
| 308 | stirling-pdf | 10.0.10.43 | running | without-password | yes | no | no | 0 | 8080, 22 |
| 9001 | pote-dev | 10.0.10.114 | stopped | | | | | | |
| 9101 | punimTagFE-dev | 10.0.10.121 | running | without-password | yes | active | no | 89 | 8000, 111, 22 |
| 9401 | mirrormatch-dev | 10.0.10.141 | stopped | | | | | | |

Inventory mapping: vikunja → 159 (LXC 301), qBittorrent → 91, punimTag app → 121.

Common LXC issues (pve201)

| Issue | Severity | Fix |
| --- | --- | --- |
| passwordauthentication yes on all LXCs | High | Set PasswordAuthentication no in /etc/ssh/sshd_config, reload sshd |
| No fail2ban | High | Install fail2ban or rely on Proxmox FW + LAN segmentation |
| Apps on 0.0.0.0:8080 / 8000 / 3456 | High | Bind to localhost + Caddy, or restrict via Proxmox guest firewall (firewall=1 on net0, then enable rules) |
| 79–89 pending upgrades on several CTs | Medium | pct exec <id> -- bash -c 'apt update && apt upgrade -y' |
| Stopped dev CTs (9001, 9401) | Low | Start when needed or keep stopped to reduce attack surface |

Per-LXC fixes (pve201)

# Example: harden + patch vikunja (301) from Proxmox host
pct exec 301 -- sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
pct exec 301 -- systemctl reload ssh

# Patch container
pct exec 303 -- bash -c 'apt update && apt upgrade -y'

# Copy your SSH key (from Mac, once password/key works)
make copy-ssh-key HOST=vikunja   # 10.0.10.159
make copy-ssh-key HOST=qBittorrent # 10.0.10.91

punimTagFE-dev (9101): Only LXC with UFW active — extend rules to deny inbound except 22 from admin subnet; still disable password auth.


LXCs on pve10 (via pct exec)

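| VMID | Name | IP | Status | SSH root | Password auth | UFW | fail2ban | Upgrades | Public services |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 210 | cal | 10.0.10.228 | running | without-password | yes | no | no | 0 | 3000, 22 |
| 215 | caseware | 10.0.10.105 | running | without-password | yes | no | no | 40 | 80 (nginx), 22 |
| 216 | auto | 10.0.10.59 | running | without-password | yes | no | no | 40 | 80 (nginx), 22 |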

Inventory mapping: caseware → 105, auto → 59.

Fixes (pve10 LXCs)

# SSH harden caseware (215)
pct exec 215 -- sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
pct exec 215 -- systemctl reload sshd

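# Patch (bash -c keeps "&&" inside the container; without it, "apt upgrade" runs on the hypervisor)
pct exec 215 -- bash -c 'apt update && apt upgrade -y'
pct exec 216 -- bash -c 'apt update && apt upgrade -y'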

# Deploy keys from Mac
make copy-ssh-key HOST=caseware
make copy-ssh-key HOST=auto
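In a line of the form pct exec ID -- cmd1 && cmd2, the hypervisor's shell parses the &&, so cmd2 runs on the Proxmox host, not in the container. A small wrapper that always passes the script as one bash -c argument avoids that. The function names and the PCT override variable below are hypothetical; setting PCT=echo gives a dry run on a machine without Proxmox.

```shell
#!/usr/bin/env bash
# Sketch: run a multi-command script inside one or more containers.
# PCT can be overridden (e.g. PCT=echo) to preview the calls.

PCT="${PCT:-pct}"

ct_run() {
  local vmid="$1" script="$2"
  # The whole script is a single argument to bash -c, so && runs in the CT.
  "$PCT" exec "$vmid" -- bash -c "$script"
}

patch_cts() {
  local vmid
  for vmid in "$@"; do
    ct_run "$vmid" 'apt-get update && apt-get -y upgrade'
  done
}

# Dry run: prints the two pct invocations instead of executing them.
PCT=echo patch_cts 215 216
```

On pve10 the real call would be patch_cts 215 216 with PCT left unset.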

HTTP port 80 on caseware/auto: Ensure TLS termination on Caddy (inventory host caddy 10.0.10.50) and no plain HTTP from WAN if exposed.


SSH hardening checklist (all Linux targets)

Use this order to avoid lockout:

  1. Confirm your key works: ssh -o BatchMode=yes root@<ip> true
  2. Set PasswordAuthentication no
  3. Set PermitRootLogin prohibit-password (LXCs already report without-password, the deprecated alias for the same keys-only behaviour)
  4. sshd -t && systemctl reload sshd
  5. Open second terminal and test before closing first
  6. Optional: change SSH port, MaxAuthTries 4, disable X11Forwarding
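An alternative to editing /etc/ssh/sshd_config in place is a drop-in file, which keeps the distribution file untouched. The sketch below prints one covering steps 2, 3 and 6. It assumes the main config has an Include for /etc/ssh/sshd_config.d/*.conf near the top, as recent Debian releases ship it. The function name and the file name in the usage note are hypothetical, and this is not how roles/ssh applies the settings.

```shell
#!/usr/bin/env bash
# Sketch: print an sshd drop-in with the checklist's target settings.
# Usage: render_sshd_dropin [MAX_AUTH_TRIES]

render_sshd_dropin() {
  local max_tries="${1:-4}"
  cat <<EOF
# Keys only; root may log in with a key but never with a password.
PasswordAuthentication no
PermitRootLogin prohibit-password
MaxAuthTries ${max_tries}
X11Forwarding no
EOF
}

render_sshd_dropin 4
```

Saved as, say, /etc/ssh/sshd_config.d/00-hardening.conf, the settings should take precedence because sshd keeps the first value it reads for each keyword. Steps 4 and 5 still apply: run sshd -t, reload, and test from a second terminal.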

Ansible alignment:

# After keys on host
make dev HOST=<hostname> --tags security
# or role ssh via playbooks that include roles/ssh

Re-run audits

# Hypervisor full audit
ssh root@10.0.10.201 'bash -s' < scripts/security-audit-remote.sh
ssh root@10.0.10.10  'bash -s' < scripts/security-audit-remote.sh

# Hypervisor SSH-only
ssh root@10.0.10.201 'bash -s' < scripts/security-audit-ssh.sh

# All LXCs on a node
ssh root@10.0.10.201 'bash -s' < scripts/security-audit-lxc-via-pve.sh
ssh root@10.0.10.10  'bash -s' < scripts/security-audit-lxc-via-pve.sh

Tracking

Item Owner Status
SSH keys caseware, auto, cal, vikunja, mailcow, listmonk 2026-05-23
Fleet apt upgrade (no reboot) 2026-05-23 ☑ all previously failed hosts fixed
Tier 1 cron (journal + apt) 2026-05-23 ☑ PVE + most hosts via Ansible
Tier 2 cron (docker prune) 2026-05-23 ☑ identity, monitoring, vikunja, git-ci-01
VM 104 RAM 72→64 GiB 2026-05-23
Inventory vikunja rename 2026-05-23
Fix host_vars ansibleVM / listmonk merge 2026-05-23 ☑ plain YAML (review *.vault-bak)
SSH harden pve201
SSH harden pve10
Restrict 8006 on both nodes
fail2ban on hypervisors
make security on production groups
Disable password SSH on all LXCs
copy-ssh-keys remaining inventory ☐ partial
TLS / localhost bind for :8080 services
unattended-upgrades all production
Tailscale re-auth ⏸ deferred (UniFi VPN)
Fix ZFS NAS.SP00 on pve10
caddy Ansible as root 2026-05-23
vaultwardenVM / ansibleVM become in vault 2026-05-23
Add GPU-Dev 10.0.10.122 to inventory
Ollama bind localhost + optional Open WebUI

Next steps (priority)

  1. make security on one site host (e.g. caseware) with a second SSH session open — disable password SSH, enable UFW + fail2ban (ignoreip = LAN + VPN pool).
  2. Restrict Proxmox :8006 to 10.0.10.0/24 + VPN subnet on pve201 and pve10.
  3. Bind internal Docker ports on identity / monitoring / vaultwarden to 127.0.0.1 after confirming Caddy routes.
  4. GPU-Dev: point clients at http://10.0.10.122:11434 over VPN; tune Ollama env vars; add host to inventory when automating.
  5. unattended-upgrades on production LXCs (reboot policy manual).
  6. Review host_vars/*.vault-bak and merge any secrets still needed into vault + plain host_vars.

References