Security Audit Report
Last audit: 2026-05-23 (re-run after SSH keys + make maintenance)
Previous audit: 2026-05-20
Auditor: scripts/security-audit-*.sh, Ansible maintenance + maintenance_cron roles
Repo baseline (roles/ssh/defaults/main.yml): PermitRootLogin prohibit-password, PasswordAuthentication no, UFW enabled.
2026-05-23 — Actions completed
| Action | Status |
|---|---|
| SSH keys → caseware, auto, cal, vikunja, mailcow, listmonk | ✅ All six reachable as root |
| SSH keys → mailcow/listmonk VMs | ✅ Via brief VM shutdown + disk inject on pve201 (no guest agent) |
| Inventory rename vikanjans → vikunja | ✅ hosts + proxmox_vmid=301 |
| apt upgrade fleet (skip reboot) | ✅ 14 hosts via Ansible; auto via pct exec on pve10 |
| Tier 1 cron (journal + apt) | ✅ roles/maintenance_cron on PVE, sites, comms, ansible, hermes, etc. |
| Tier 2 cron (docker prune) | ✅ identity, monitoring, vikunja; git-ci-01 keeps docker-prune-ci |
| VM 104 (GPU-Dev) RAM 72→64 GiB | ✅ pve201; host free RAM ~1.7→10 GiB |
| Fix broken host_vars (ansibleVM, listmonk) | ✅ Plain YAML; old blobs → *.vault-bak |
| Vault vault_*_become_password + maintenance vaultwardenVM | ✅ 2026-05-23 |
| caddy root SSH + maintenance | ✅ bootstrap-root-ssh-caddy; inventory ansible_user=root |
| ansibleVM maintenance | ✅ become password in vault |
Post-maintenance SSH reachability
| Host | SSH | Notes |
|---|---|---|
| caseware | ✅ | |
| auto | ✅ | Was slow from laptop earlier; OK after upgrade |
| cal | ✅ | |
| vikunja | ✅ | LXC 301 @ 10.0.10.159 |
| mailcow | ✅ | ~1 min downtime for key inject |
| listmonk | ✅ | ~1 min downtime for key inject |
Maintenance playbook recap (skip_reboot=true)
| Host | Result |
|---|---|
| pve201, pve10, caseware, cal, vikunja, mailcow, listmonk, identity, monitoring, hermes, levkin, portfolio, git-ci-01, sonarqube-01 | ✅ upgraded |
| caddy | ✅ (as root; no sudo package on host) |
| ansibleVM | ✅ (vault_ansiblevm_become_password) |
| vaultwardenVM | ✅ (vault_vaultwarden_become_password) |
Open security gaps (unchanged until make security)
| Control | Fleet status | Risk if fixed wrong |
|---|---|---|
| PasswordAuthentication yes | Most LXCs + both PVE | Low break risk if SSH keys tested first in a second session |
| PermitRootLogin yes | pve201, pve10, sonarqube-01 | Same; use prohibit-password, not no, if you need root+key |
| fail2ban | Off everywhere | Enabling is safe; may lock you out only if you brute-force yourself |
| UFW | Off (except one dev LXC) | Medium risk — wrong rules drop SSH/80/443; apply via Ansible roles/ssh after allowlist |
| unattended-upgrades | hermes, ansibleVM only | Safe; schedule reboots separately |
| Proxmox :8006 | Open on LAN | Restrict in PVE firewall — won't break VMs |
| Docker on 0.0.0.0 | identity, monitoring, vaultwarden, qBit | Bind to 127.0.0.1: can break access if Caddy route missing; test URL after |
| Tailscale | Deferred | Off by choice; remote access via UniFi VPN to LAN |
See Risk explanations (2026-05-23) and fail2ban vs password SSH below.
GPU-Dev (pve201 VM 104) — Ollama / LLMs
| Resource | Current |
|---|---|
| Host | pve201, VMID 104, GPU-Dev-Debian |
| LAN IP | 10.0.10.122 (inventory devGPU @ 10.0.30.63 is a different network — use .122 from LAN) |
| RAM | 64 GiB guest (~60 GiB available when idle) |
| GPU | RTX 4080 16 GiB (PCI passthrough hostpci0) |
| Workload | Ollama already running (~3.6 GiB VRAM in sample) |
Getting the most from RAM + GPU
- Right-size models to VRAM: On a 16 GiB 4080, prefer quantised models that fit entirely in VRAM (e.g. 7B–14B Q4/Q5, or 32B Q2/Q3 if you accept quality trade-offs). If a model spills to CPU RAM, throughput drops sharply.
- One heavy model at a time: Ollama loads models on demand; set `OLLAMA_MAX_LOADED_MODELS=1` (or keep only one client) so you do not fragment 64 GiB RAM + 16 GiB VRAM across several large weights.
- Parallel requests: `OLLAMA_NUM_PARALLEL` defaults are conservative; raise only if VRAM headroom exists (watch `nvidia-smi` while under load).
- Keep guest RAM for KV cache: With 64 GiB you can run larger context windows; set `OLLAMA_CONTEXT_LENGTH` / model `num_ctx` to what you need, not maximum “just because”.
- CPU offload only when needed: `num_gpu` layers = all layers for speed; partial offload is for models that do not fit in VRAM, not for tuning.
- Disk: Store models on fast local disk (not NFS); `ollama pull` once, prune old tags periodically (`ollama list` / remove unused).
- Proxmox: Do not balloon GPU VM RAM; GPU passthrough already reserves most of the 64 GiB. Freeing pve201 meant lowering this VM from 72→64 GiB, not overcommitting other guests on 201.
- Optional: Open WebUI on localhost + Caddy TLS; bind Ollama to `127.0.0.1:11434` only (LAN via VPN).
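The tuning points in this list could be collected into a systemd drop-in. This is a minimal sketch, assuming Ollama runs as the `ollama.service` unit created by the upstream install script (not confirmed for this VM). The helper name `write_ollama_dropin` is hypothetical, and the parallelism and context-length values are placeholders to tune, not measured settings.

```sh
# Sketch: write a systemd drop-in with the Ollama settings discussed above.
# Assumption: the unit is ollama.service; the values are placeholders.
write_ollama_dropin() {
  dropin_dir="$1"   # e.g. /etc/systemd/system/ollama.service.d
  mkdir -p "$dropin_dir"
  cat > "$dropin_dir/override.conf" <<'EOF'
[Service]
# Listen on loopback only; reach it through Caddy or the VPN.
Environment="OLLAMA_HOST=127.0.0.1:11434"
# Keep a single large model resident at a time.
Environment="OLLAMA_MAX_LOADED_MODELS=1"
# Raise only after checking VRAM headroom with nvidia-smi.
Environment="OLLAMA_NUM_PARALLEL=1"
# Default context window (placeholder value).
Environment="OLLAMA_CONTEXT_LENGTH=8192"
EOF
}

# On the GPU VM (as root), then reload and restart:
#   write_ollama_dropin /etc/systemd/system/ollama.service.d
#   systemctl daemon-reload && systemctl restart ollama
```

Keeping the settings in a drop-in rather than editing the unit file means a package or installer update is less likely to overwrite them.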
Not in Ansible yet: add devGPU / 10.0.10.122 to inventory when you want playbooks (cron, hardening) on this box.
fail2ban vs password SSH
What fail2ban does: After too many failed SSH logins from an IP, it adds a temporary firewall ban for that IP (typically 10–60 minutes). It does not disable password authentication globally.
Can passwords stay on if fail2ban is on? Technically yes — fail2ban only rate-limits brute force; passwords are still weaker than keys. Best practice on servers: keys + PasswordAuthentication no + fail2ban (defence in depth).
Your Proxmox console fallback: If you lock yourself out of SSH on a guest, you can still use Proxmox → VM → Console or pct enter / qm guest exec from pve201/pve10. That is a good break-glass path, but it is not a substitute for keys on hosts you manage daily — console is slow and easy to misconfigure under pressure.
Recommendation: Enable fail2ban via make security with ignoreip including 10.0.10.0/24 and your UniFi VPN client subnet. Then disable password SSH once keys work everywhere you care about.
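For hosts handled by hand rather than through `make security`, the allowlist could look like the jail fragment below. This is a sketch under stated assumptions: `render_sshd_jail` is a hypothetical helper, the retry and ban values are illustrative, and `192.0.2.0/24` in the usage comment is a documentation placeholder for the real VPN client subnet.

```sh
# Sketch: print a fail2ban jail fragment for sshd with an admin allowlist.
# Assumption: the second argument is the VPN client subnet (placeholder in the example).
render_sshd_jail() {
  lan_net="$1"
  vpn_net="$2"
  cat <<EOF
[sshd]
enabled  = true
# Never ban loopback, the admin LAN, or VPN clients.
ignoreip = 127.0.0.1/8 ::1 ${lan_net} ${vpn_net}
maxretry = 5
findtime = 10m
bantime  = 30m
EOF
}

# Example (as root):
#   render_sshd_jail 10.0.10.0/24 192.0.2.0/24 > /etc/fail2ban/jail.d/sshd-allowlist.local
#   fail2ban-client reload && fail2ban-client status sshd
```

On Debian systems that log only to journald, the jail may also need `backend = systemd`; check `fail2ban-client status sshd` after the reload.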
Risk explanations (2026-05-23)
Password SSH (PasswordAuthentication yes)
How bad: High on internet-facing IPs; medium on 10.0.10.0/24 only. Anyone who can reach :22 can try passwords indefinitely (no fail2ban).
Will fixing break things? No, if you (1) confirm key login works, (2) set PasswordAuthentication no, (3) keep a second SSH session open, (4) reload sshd. Breakage happens only if keys are missing/wrong.
Root login (PermitRootLogin yes on hypervisors)
How bad: High — root + password on PVE is full cluster compromise.
Will fixing break things? Use prohibit-password (keys only), not no, unless you have another admin user with sudo. Ansible playbooks expect root on PVE today.
fail2ban off
How bad: Medium — relies on LAN trust; SSH noise from scanners still fills logs.
Will fixing break things? Rarely. Tune ignoreip to your admin IP/subnet so your own typos don't ban you.
UFW off
How bad: Medium on segmented LAN; high if any host has a public IP.
Will fixing break things? Yes, if misconfigured — default deny without allowing 22 from admin IP, 80/443 from Caddy, or Docker-published ports you still need. Use Ansible roles/ssh (UFW after SSH rules) and test.
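The ordering matters more than the individual rules: allow first, deny by default second, enable last. The sketch below only prints the commands so they can be reviewed before anything changes. `ufw_plan` is a hypothetical helper, and the rule set is an assumption based on the ports named above, not the contents of `roles/ssh`.

```sh
# Sketch: print UFW commands in a lockout-safe order.
# Dry run only: review the output, then run it as root (for example by piping to sh).
ufw_plan() {
  admin_net="$1"   # subnet allowed to reach SSH
  caddy_ip="$2"    # reverse proxy allowed to reach 80/443
  echo "ufw allow from ${admin_net} to any port 22 proto tcp"
  echo "ufw allow from ${caddy_ip} to any port 80 proto tcp"
  echo "ufw allow from ${caddy_ip} to any port 443 proto tcp"
  echo "ufw default deny incoming"
  echo "ufw default allow outgoing"
  echo "ufw --force enable"
}

# Example:
#   ufw_plan 10.0.10.0/24 10.0.10.50
```

Ports published by Docker are handled through Docker's own iptables rules and are typically not filtered by UFW, so they need the localhost bind or a separate restriction.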
unattended-upgrades off
How bad: Medium — security patches lag until manual maintenance.
Will fixing break things? Usually no. Kernel updates may require reboot; use Unattended-Upgrade::Automatic-Reboot "false" until you want reboot windows.
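A minimal apt configuration for this policy might look as follows. The helper name `render_auto_upgrades` and the target file name in the usage comment are hypothetical; the two `APT::Periodic` lines are the usual settings that switch the daily run on.

```sh
# Sketch: print an apt.conf.d fragment that patches daily but never reboots on its own.
render_auto_upgrades() {
  cat <<'EOF'
// Refresh package lists and run unattended-upgrade daily.
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
// Install patches but leave reboots to a manual window.
Unattended-Upgrade::Automatic-Reboot "false";
EOF
}

# Example (as root; the file name is an assumption):
#   render_auto_upgrades > /etc/apt/apt.conf.d/52local-unattended-upgrades
#   unattended-upgrade --dry-run --debug
```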
Proxmox UI :8006 exposed
How bad: Critical on untrusted networks — API gives VM/storage control.
Will fixing break things? Restricting to 10.0.10.0/24 does not break normal LAN admin access.
HTTP services on all interfaces (8080, 3000, …)
How bad: High without TLS/auth at the edge; medium behind Caddy + LAN only.
Will fixing break things? Yes if you bind to 127.0.0.1 before Caddy reverse_proxy is updated. Order: Caddy route → test → then bind Docker to localhost.
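To see which ports are still bound to every interface before and after the change, the listening-socket table can be filtered. This sketch assumes the iproute2 `ss` tool and its `-Htln` column layout (local address in the fourth column); `public_listeners` is a hypothetical helper.

```sh
# Sketch: read `ss -Htln` output on stdin and print ports bound to all interfaces.
public_listeners() {
  awk '{
    addr = $4                          # e.g. 0.0.0.0:8080 or [::]:22
    port = addr
    sub(/.*:/, "", port)               # keep the text after the last colon
    host = substr(addr, 1, length(addr) - length(port) - 1)
    if (host == "0.0.0.0" || host == "[::]" || host == "*") print port
  }' | sort -un
}

# On a host:
#   ss -Htln | public_listeners
```

A port that disappears from this list after the localhost bind, while its URL still answers through Caddy, confirms the order was followed.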
Remote access (Tailscale deferred)
Decision: Tailscale off; use UniFi site-to-site / VPN into 10.0.10.0/24 for admin and Ollama/GPU access.
Security: Ensure VPN is required for SSH and Proxmox :8006 from outside; do not port-forward :22/:8006 on the router without IP allowlists.
pve201 RAM (was 97% used)
How bad: Critical — OOM kills guests, swap thrashing.
Mitigation done: VM 104 reduced 73728→65536 MiB (~8 GiB freed on hypervisor). Still tight; consider moving git-ci-01 or other workloads to pve10.
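The MiB figures convert to the GiB values quoted elsewhere in this report as follows. The `qm set` line in the comment is one way such a change could be applied, not a record of how it was done.

```sh
# VM 104 memory before and after the change, in MiB (values from this report).
old_mib=73728
new_mib=65536

# 1 GiB = 1024 MiB
freed_gib=$(( (old_mib - new_mib) / 1024 ))
echo "old=$(( old_mib / 1024 )) GiB new=$(( new_mib / 1024 )) GiB freed=${freed_gib} GiB"
# prints: old=72 GiB new=64 GiB freed=8 GiB

# A change like this could be applied on pve201 with:
#   qm set 104 --memory 65536
```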
2026-05-20 — Original audit
Scope: Proxmox nodes pve201 (10.0.10.201) and pve10 (10.0.10.10), all LXCs via pct exec, SSH deep-dive on hypervisors.
Executive summary
| Area | Critical | High | Medium |
|---|---|---|---|
| Hypervisors (201, 10) | 2 | 4 | 2 |
| LXCs on 201 (10 running) | 0 | 10 | 8 |
| LXCs on 10 (3 running) | 0 | 3 | 3 |
Top priorities
- Harden SSH on both Proxmox hosts (root + passwords currently allowed).
- Restrict Proxmox API/UI port 8006 to admin IPs.
- Disable password SSH on all LXCs; deploy keys + `make copy-ssh-keys` for inventory IPs.
- Patch hosts with 40–105 pending apt upgrades (hypervisors worst).
- Put HTTP services (8080, 8000, qBit, etc.) behind reverse proxy + TLS or bind to internal IPs.
Proxmox hypervisors
pve201 — 10.0.10.201 (pve)
| Resource | Status |
|---|---|
| OS | Debian 12, PVE 8.4.16, kernel 6.8.12-18-pve |
| RAM free | ~2.5 GB / 126 GB (critical) |
| Pending apt | 105 |
| UFW / fail2ban / unattended-upgrades | None |
SSH audit (dedicated)
| Setting | Current | Target |
|---|---|---|
| permitrootlogin | yes | prohibit-password |
| passwordauthentication | yes | no |
| pubkeyauthentication | yes | yes |
| maxauthtries | 6 | 3–4 |
| x11forwarding | yes | no (on servers) |
| Root keys | 3 keys in authorized_keys | audit/remove unused |
Exposed services
| Port | Service | Risk |
|---|---|---|
| 22 | SSH | Brute-force (no fail2ban) |
| 8006 | Proxmox API/UI | Critical — full cluster control |
| 3128 | spiceproxy | Medium |
| 111 | rpcbind | Low — reduce exposure |
Fixes (pve201)
# 1) SSH — prefer Ansible after limiting to your IP
make copy-ssh-key HOST=pve201 # if needed
# Manual quick fix on host:
sed -i 's/^#*PermitRootLogin.*/PermitRootLogin prohibit-password/' /etc/ssh/sshd_config
sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
sshd -t && systemctl reload sshd
# 2) Proxmox firewall — Datacenter → Firewall → restrict 8006 to 10.0.10.0/24 or admin IP
# Or iptables on host for port 8006
# 3) fail2ban
apt install fail2ban -y
systemctl enable --now fail2ban
# 4) Auto security updates
apt install unattended-upgrades apt-listchanges -y
dpkg-reconfigure -plow unattended-upgrades
# 5) Patch
apt update && apt upgrade -y
Ansible (when ready): add pve201 / pve10 to a proxmox group play with roles/ssh + roles/monitoring_server (fail2ban).
Do not lock yourself out — test with second session first.
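Before closing the first session it helps to confirm what sshd actually resolved, because on recent Debian releases `/etc/ssh/sshd_config` includes `/etc/ssh/sshd_config.d/*.conf` and the first value found wins, so a drop-in can override an edit to the main file. The sketch below filters `sshd -T` output; `sshd_gaps` is a hypothetical helper, and it accepts `without-password` because some sshd versions print that alias.

```sh
# Sketch: read `sshd -T` output on stdin and print settings that miss the target.
sshd_gaps() {
  awk '
    $1 == "permitrootlogin" && $2 != "prohibit-password" && $2 != "without-password" && $2 != "no" { print $1 "=" $2 }
    $1 == "passwordauthentication" && $2 != "no" { print $1 "=" $2 }
  '
}

# On a host (as root); no output means both targets are met:
#   sshd -T | sshd_gaps
```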
pve10 — 10.0.10.10 (PVENAS)
| Resource | Status |
|---|---|
| OS | Debian 13 (trixie), PVE, kernel 6.17.13-3-pve |
| Load | ~30 on 24 CPUs (overloaded) |
| Pending apt | 92 |
| UFW / fail2ban / unattended-upgrades | None |
| ZFS NAS.SP00 | inactive (I/O suspended) |
| PBS PVEBUVD00 → 10.0.10.200:8007 | unreachable |
SSH audit (dedicated)
Same as pve201: permitrootlogin yes, passwordauthentication yes, 3 root authorized_keys.
Exposed services
| Port | Service | Risk |
|---|---|---|
| 22 | SSH | High |
| 8006 | Proxmox API/UI | Critical |
| 2049, mountd, statd | NFS/RPC | High on LAN |
| 3128 | spiceproxy | Medium |
Fixes (pve10)
Same SSH / fail2ban / unattended-upgrades / patch steps as pve201.
Additional:
# Investigate ZFS pool
zpool status NAS.SP00
# Fix PBS connectivity or remove stale datastore from Proxmox UI
LXCs on pve201 (via pct exec)
| VMID | Name | IP | Status | SSH root | Password auth | UFW | fail2ban | Upgrades | Public services |
|---|---|---|---|---|---|---|---|---|---|
| 301 | vikunja-debian | 10.0.10.159 | running | without-password | yes | no | no | 0 | 3456, 22 |
| 302 | qbit-debian | 10.0.10.91 | running | without-password | yes | no | no | 0 | 8080 (qBit), 22 |
| 303 | searchXNG-debian | 10.0.10.70 | running | without-password | yes | no | no | 83 | 8080, 22 |
| 304 | wireguard-debian | 10.0.10.192 | running | without-password | yes | no | no | 0 | 22 |
| 305 | kuma-debian | 10.0.10.197 | stopped | — | — | — | — | — | replaced by LXC 218 |
| 306 | portfolio | — | destroyed | — | — | — | — | — | migrated → pve10 LXC 219 @ 10.0.10.106 (purged 2026-05-22) |
| 307 | jobber-delian | 10.0.10.178 | running | without-password | yes | no | no | 83 | 3005, 22 |
| 308 | stirling-pdf | 10.0.10.43 | running | without-password | yes | no | no | 0 | 8080, 22 |
| 9001 | pote-dev | 10.0.10.114 | stopped | — | — | — | — | — | — |
| 9101 | punimTagFE-dev | 10.0.10.121 | running | without-password | yes | active | no | 89 | 8000, 111, 22 |
| 9401 | mirrormatch-dev | 10.0.10.141 | stopped | — | — | — | — | — | — |
Inventory mapping: vikunja → 159 (LXC 301), qBittorrent → 91, punimTag app → 121.
Common LXC issues (pve201)
| Issue | Severity | Fix |
|---|---|---|
| passwordauthentication yes on all LXCs | High | Set PasswordAuthentication no in /etc/ssh/sshd_config, reload sshd |
| No fail2ban | High | Install fail2ban or rely on Proxmox FW + LAN segmentation |
| Apps on 0.0.0.0:8080 / 8000 / 3456 | High | Bind to localhost + Caddy, or restrict via Proxmox guest firewall (firewall=1 on net0; enable rules) |
| 79–89 pending upgrades on several CTs | Medium | pct exec <id> -- bash -c 'apt update && apt upgrade -y' |
| Stopped dev CTs (9001, 9401) | Low | Start when needed or keep stopped to reduce attack surface |
Per-LXC fixes (pve201)
# Example: harden + patch vikunja (301) from Proxmox host
pct exec 301 -- sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
pct exec 301 -- systemctl reload ssh
# Patch container
pct exec 303 -- bash -c 'apt update && apt upgrade -y'
# Copy your SSH key (from Mac, once password/key works)
make copy-ssh-key HOST=vikunja # 10.0.10.159
make copy-ssh-key HOST=qBittorrent # 10.0.10.91
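Patching every running container on a node could be scripted by filtering `pct list`. This sketch assumes the usual `pct list` layout (header row, VMID in the first column, status in the second); `running_ctids` is a hypothetical helper. Wrapping the chain in `bash -c` keeps both apt commands inside the container, whereas an unquoted `&&` would be parsed by the shell on the node.

```sh
# Sketch: read `pct list` output on stdin and print VMIDs whose status is "running".
running_ctids() {
  awk 'NR > 1 && $2 == "running" { print $1 }'
}

# On a Proxmox node (as root):
#   for id in $(pct list | running_ctids); do
#     pct exec "$id" -- bash -c 'apt update && apt upgrade -y'
#   done
```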
punimTagFE-dev (9101): Only LXC with UFW active — extend rules to deny inbound except 22 from admin subnet; still disable password auth.
LXCs on pve10 (via pct exec)
| VMID | Name | IP | Status | SSH root | Password auth | UFW | fail2ban | Upgrades | Public services |
|---|---|---|---|---|---|---|---|---|---|
| 210 | cal | 10.0.10.228 | running | without-password | yes | no | no | 0 | 3000, 22 |
| 215 | caseware | 10.0.10.105 | running | without-password | yes | no | no | 40 | 80 (nginx), 22 |
| 216 | auto | 10.0.10.59 | running | without-password | yes | no | no | 40 | 80 (nginx), 22 |
Inventory mapping: caseware → 105, auto → 59.
Fixes (pve10 LXCs)
# SSH harden caseware (215)
pct exec 215 -- sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
pct exec 215 -- systemctl reload sshd
# Patch
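# bash -c keeps the whole chain inside the container; an unquoted && would run apt upgrade on the node
pct exec 215 -- bash -c 'apt update && apt upgrade -y'
pct exec 216 -- bash -c 'apt update && apt upgrade -y'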
# Deploy keys from Mac
make copy-ssh-key HOST=caseware
make copy-ssh-key HOST=auto
HTTP port 80 on caseware/auto: Ensure TLS termination on Caddy (inventory host caddy 10.0.10.50) and no plain HTTP from WAN if exposed.
SSH hardening checklist (all Linux targets)
Use this order to avoid lockout:
- Confirm your key works: `ssh -o BatchMode=yes root@<ip> true`
- Set `PasswordAuthentication no`
- Set `PermitRootLogin prohibit-password` (LXCs already report `without-password`, the deprecated alias for the same keys-only setting)
- `sshd -t && systemctl reload sshd`
- Open second terminal and test before closing first
- Optional: change SSH port, `MaxAuthTries 4`, disable `X11Forwarding`
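The first step of the checklist can be run across several hosts at once. This is a sketch, assuming root key login as used elsewhere in this report; `probe_host` and `check_keys` are hypothetical helper names.

```sh
# Sketch: confirm key-only login on each host before disabling password auth.
probe_host() {
  # BatchMode makes ssh fail instead of prompting for a password.
  ssh -o BatchMode=yes -o ConnectTimeout=5 "root@$1" true
}

check_keys() {
  failed=0
  for host in "$@"; do
    if probe_host "$host" >/dev/null 2>&1; then
      echo "OK   $host"
    else
      echo "FAIL $host"
      failed=1
    fi
  done
  return "$failed"
}

# Example:
#   check_keys 10.0.10.105 10.0.10.59 && echo "safe to disable password auth"
```

The function returns non-zero if any host fails, so it can gate the later steps in a script.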
Ansible alignment:
# After keys on host
make dev HOST=<hostname> --tags security
# or role ssh via playbooks that include roles/ssh
Re-run audits
# Hypervisor full audit
ssh root@10.0.10.201 'bash -s' < scripts/security-audit-remote.sh
ssh root@10.0.10.10 'bash -s' < scripts/security-audit-remote.sh
# Hypervisor SSH-only
ssh root@10.0.10.201 'bash -s' < scripts/security-audit-ssh.sh
# All LXCs on a node
ssh root@10.0.10.201 'bash -s' < scripts/security-audit-lxc-via-pve.sh
ssh root@10.0.10.10 'bash -s' < scripts/security-audit-lxc-via-pve.sh
Tracking
| Item | Completed | Status |
|---|---|---|
| SSH keys caseware, auto, cal, vikunja, mailcow, listmonk | 2026-05-23 | ☑ |
| Fleet apt upgrade (no reboot) | 2026-05-23 | ☑ all previously failed hosts fixed |
| Tier 1 cron (journal + apt) | 2026-05-23 | ☑ PVE + most hosts via Ansible |
| Tier 2 cron (docker prune) | 2026-05-23 | ☑ identity, monitoring, vikunja, git-ci-01 |
| VM 104 RAM 72→64 GiB | 2026-05-23 | ☑ |
| Inventory vikunja rename | 2026-05-23 | ☑ |
| Fix host_vars ansibleVM / listmonk merge | 2026-05-23 | ☑ plain YAML (review *.vault-bak) |
| SSH harden pve201 | | ☐ |
| SSH harden pve10 | | ☐ |
| Restrict 8006 on both nodes | | ☐ |
| fail2ban on hypervisors | | ☐ |
| make security on production groups | | ☐ |
| Disable password SSH on all LXCs | | ☐ |
| copy-ssh-keys remaining inventory | | ☐ partial |
| TLS / localhost bind for :8080 services | | ☐ |
| unattended-upgrades all production | | ☐ |
| Tailscale re-auth | | ⏸ deferred (UniFi VPN) |
| Fix ZFS NAS.SP00 on pve10 | | ☐ |
| caddy Ansible as root | 2026-05-23 | ☑ |
| vaultwardenVM / ansibleVM become in vault | 2026-05-23 | ☑ |
| Add GPU-Dev 10.0.10.122 to inventory | | ☐ |
| Ollama bind localhost + optional Open WebUI | | ☐ |
Next steps (priority)
- `make security` on one site host (e.g. caseware) with a second SSH session open: disable password SSH, enable UFW + fail2ban (`ignoreip` = LAN + VPN pool).
- Restrict Proxmox :8006 to `10.0.10.0/24` + VPN subnet on pve201 and pve10.
- Bind internal Docker ports on identity / monitoring / vaultwarden to `127.0.0.1` after confirming Caddy routes.
- GPU-Dev: point clients at `http://10.0.10.122:11434` over VPN; tune Ollama env vars; add host to inventory when automating.
- unattended-upgrades on production LXCs (reboot policy manual).
- Review `host_vars/*.vault-bak` and merge any secrets still needed into vault + plain host_vars.
References
- Security remediation plan — phased fixes (critical → low) and login model
- Security hardening guide
- SECURITY_HARDENING_PLAN.md
- Role defaults: `roles/ssh/defaults/main.yml`