ilia f17a1a3bcc

Add homelab SSO, maintenance cron, and inventory cleanup.

Cal Authentik OIDC playbook/role (deferred until license), Vikunja OIDC
docs and vault secrets, SSO matrix, mailcow LAN proxy fix, extended
security audit docs, maintenance_cron role with group_vars split, and
inventory updates (vikunja rename, identity/monitoring/cal host_vars).

Co-authored-by: Cursor <cursoragent@cursor.com>

2026-05-23 20:23:10 -04:00


Security Audit Report

Last audit: 2026-05-23 (re-run after SSH keys + make maintenance)
Previous audit: 2026-05-20
Auditor: scripts/security-audit-*.sh, Ansible maintenance + maintenance_cron roles
Repo baseline (roles/ssh/defaults/main.yml): PermitRootLogin prohibit-password, PasswordAuthentication no, UFW enabled.

2026-05-23 — Actions completed

Action	Status
SSH keys → caseware, auto, cal, vikunja, mailcow, listmonk	✅ All six reachable as `root`
SSH keys → mailcow/listmonk VMs	✅ Via brief VM shutdown + disk inject on pve201 (no guest agent)
Inventory rename `vikanjans` → `vikunja`	✅ `hosts` + `proxmox_vmid=301`
`apt upgrade` fleet (skip reboot)	✅ 14 hosts via Ansible; auto via `pct exec` on pve10
Tier 1 cron (journal + apt)	✅ `roles/maintenance_cron` on PVE, sites, comms, ansible, hermes, etc.
Tier 2 cron (docker prune)	✅ identity, monitoring, vikunja; git-ci-01 keeps `docker-prune-ci`
VM 104 (GPU-Dev) RAM 72→64 GiB	✅ pve201; host free RAM ~1.7→10 GiB
Fix broken `host_vars` (ansibleVM, listmonk)	✅ Plain YAML; old blobs → `*.vault-bak`
Vault `vault_*_become_password` + maintenance vaultwardenVM	✅ 2026-05-23
caddy root SSH + maintenance	✅ `bootstrap-root-ssh-caddy`; inventory `ansible_user=root`
ansibleVM maintenance	✅ become password in vault

Post-maintenance SSH reachability

Host	SSH	Notes
caseware	✅
auto	✅	Was slow from laptop earlier; OK after upgrade
cal	✅
vikunja	✅	LXC 301 @ 10.0.10.159
mailcow	✅	~1 min downtime for key inject
listmonk	✅	~1 min downtime for key inject

Maintenance playbook recap (`skip_reboot=true`)

Host	Result
pve201, pve10, caseware, cal, vikunja, mailcow, listmonk, identity, monitoring, hermes, levkin, portfolio, git-ci-01, sonarqube-01	✅ upgraded
caddy	✅ (as `root`; no `sudo` package on host)
ansibleVM	✅ (`vault_ansiblevm_become_password`)
vaultwardenVM	✅ (`vault_vaultwarden_become_password`)

Open security gaps (unchanged until `make security`)

Control	Fleet status	Risk if fixed wrong
`PasswordAuthentication yes`	Most LXCs + both PVE	Low break risk if SSH keys tested first in a second session
`PermitRootLogin yes`	pve201, pve10, sonarqube-01	Same — use `prohibit-password`, not `no`, if you need root+key
fail2ban	Off everywhere	Enabling is safe; may lock you out only if you brute-force yourself
UFW	Off (except one dev LXC)	Medium risk — wrong rules drop SSH/80/443; apply via Ansible `roles/ssh` after allowlist
unattended-upgrades	hermes, ansibleVM only	Safe; schedule reboots separately
Proxmox :8006	Open on LAN	Restrict in PVE firewall — won't break VMs
Docker on `0.0.0.0`	identity, monitoring, vaultwarden, qBit	Bind to `127.0.0.1` — can break access if Caddy route missing; test URL after
Tailscale	Deferred	Off by choice; remote access via UniFi VPN to LAN

See Risk explanations (2026-05-23) and fail2ban vs password SSH below.

GPU-Dev (pve201 VM 104) — Ollama / LLMs

Resource	Current
Host	pve201, VMID 104, `GPU-Dev-Debian`
LAN IP	10.0.10.122 (inventory `devGPU` @ 10.0.30.63 is a different network — use `.122` from LAN)
RAM	64 GiB guest (~60 GiB available when idle)
GPU	RTX 4080 16 GiB (PCI passthrough `hostpci0`)
Workload	Ollama already running (~3.6 GiB VRAM in sample)

Getting the most from RAM + GPU

Right-size models to VRAM: On a 16 GiB 4080, prefer quantised models that fit entirely in VRAM (e.g. 7B–14B at Q4/Q5, or 32B at Q2/Q3 if you accept the quality trade-off). If a model spills to CPU RAM, throughput drops sharply.
One heavy model at a time: Ollama loads models on demand; set OLLAMA_MAX_LOADED_MODELS=1 (or keep to a single client) so that several large models do not compete for the 64 GiB RAM and 16 GiB VRAM.
Parallel requests: The OLLAMA_NUM_PARALLEL default is conservative; raise it only if VRAM headroom exists (watch nvidia-smi under load).
Keep guest RAM for KV cache: With 64 GiB you can run larger context windows; set OLLAMA_CONTEXT_LENGTH / model num_ctx to what you need, not to the maximum “just because”.
CPU offload only when needed: Keep num_gpu high enough that every layer runs on the GPU; partial offload is a fallback for models that do not fit in VRAM, not a tuning knob.
Disk: Store models on fast local disk (not NFS); run ollama pull once and prune old tags periodically (ollama list, then ollama rm for unused models).
Proxmox: Do not enable ballooning on the GPU VM; PCI passthrough pins the guest's full 64 GiB on the host, so ballooning cannot reclaim it. pve201 was relieved by lowering this VM from 72 to 64 GiB, not by overcommitting other guests on pve201.
Optional: Open WebUI on localhost behind Caddy TLS; bind Ollama to 127.0.0.1:11434 only (LAN access via VPN).
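
The tuning knobs above could be collected in a systemd drop-in. The sketch below only prints such a file. It assumes Ollama runs as `ollama.service` (not confirmed by this report), and the function name and the sample values (2 parallel requests, 8192 context) are illustrative, not measured recommendations for this box.

```shell
# Print a systemd drop-in for ollama.service with the variables named above.
# $1: max loaded models, $2: parallel requests, $3: default context length.
# The service name and the values passed in are assumptions for illustration.
render_ollama_override() {
  local max_models="${1:-1}" parallel="${2:-1}" ctx="${3:-8192}"
  cat <<EOF
[Service]
Environment="OLLAMA_HOST=127.0.0.1:11434"
Environment="OLLAMA_MAX_LOADED_MODELS=${max_models}"
Environment="OLLAMA_NUM_PARALLEL=${parallel}"
Environment="OLLAMA_CONTEXT_LENGTH=${ctx}"
EOF
}

# On the guest, as root, something like:
#   mkdir -p /etc/systemd/system/ollama.service.d
#   render_ollama_override 1 2 8192 > /etc/systemd/system/ollama.service.d/override.conf
#   systemctl daemon-reload && systemctl restart ollama
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv   # check headroom under load
```

Printing the file instead of writing it keeps the function safe to run anywhere; review the output before redirecting it into place.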

Not in Ansible yet: add devGPU / 10.0.10.122 to inventory when you want playbooks (cron, hardening) on this box.

fail2ban vs password SSH

What fail2ban does: After too many failed SSH logins from an IP, it adds a temporary firewall ban for that IP (typically 10–60 minutes). It does not disable password authentication globally.

Can passwords stay on if fail2ban is on? Technically yes — fail2ban only rate-limits brute force; passwords are still weaker than keys. Best practice on servers: keys + PasswordAuthentication no + fail2ban (defence in depth).

Your Proxmox console fallback: If you lock yourself out of SSH on a guest, you can still use Proxmox → VM → Console or pct enter / qm guest exec from pve201/pve10. That is a good break-glass path, but it is not a substitute for keys on hosts you manage daily — console is slow and easy to misconfigure under pressure.

Recommendation: Enable fail2ban via make security with ignoreip including 10.0.10.0/24 and your UniFi VPN client subnet. Then disable password SSH once keys work everywhere you care about.
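
As a rough illustration of the recommendation, the sketch below prints an `[sshd]` jail with an `ignoreip` allowlist. The VPN pool `10.0.20.0/24` is a placeholder, not a value from this report; the retry and ban numbers are illustrative; and `backend = systemd` assumes the host logs SSH to journald rather than `/var/log/auth.log`. What `make security` actually deploys may differ.

```shell
# Print a fail2ban jail.local fragment for the sshd jail.
# $1: space-separated CIDRs that must never be banned (admin LAN, VPN pool).
render_jail_local() {
  local trusted="$1"
  cat <<EOF
[sshd]
enabled  = true
backend  = systemd
maxretry = 5
findtime = 10m
bantime  = 30m
ignoreip = 127.0.0.1/8 ::1 ${trusted}
EOF
}

# Example (the second CIDR is a hypothetical VPN client pool):
#   render_jail_local "10.0.10.0/24 10.0.20.0/24" > /etc/fail2ban/jail.local
#   systemctl restart fail2ban && fail2ban-client status sshd
```

With this in place, failed logins from the allowlisted subnets are still logged but never banned, which covers the "own typos" case.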

Risk explanations (2026-05-23)

Password SSH (`PasswordAuthentication yes`)

How bad: High on internet-facing IPs; medium on 10.0.10.0/24 only. Anyone who can reach :22 can try passwords indefinitely (no fail2ban).

Will fixing break things? No, if you (1) confirm key login works, (2) set PasswordAuthentication no, (3) keep a second SSH session open, (4) reload sshd. Breakage happens only if keys are missing/wrong.

Root login (`PermitRootLogin yes` on hypervisors)

How bad: High — root + password on PVE is full cluster compromise.

Will fixing break things? Use prohibit-password (keys only), not no, unless you have another admin user with sudo. Ansible playbooks expect root on PVE today.

fail2ban off

How bad: Medium — relies on LAN trust; SSH noise from scanners still fills logs.

Will fixing break things? Rarely. Tune ignoreip to your admin IP/subnet so your own typos don't ban you.

UFW off

How bad: Medium on segmented LAN; high if any host has a public IP.

Will fixing break things? Yes, if misconfigured — default deny without allowing 22 from admin IP, 80/443 from Caddy, or Docker-published ports you still need. Use Ansible roles/ssh (UFW after SSH rules) and test.
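
To make the rule order concrete, here is a dry-run sketch that prints the UFW commands for a default-deny host without executing them. The function name is hypothetical and `roles/ssh` may apply different rules; the subnet and Caddy address are taken from this report.

```shell
# Print (do not run) the UFW commands for a default-deny host.
# $1: admin subnet allowed to SSH, $2: reverse proxy IP allowed to reach web ports.
ufw_plan() {
  local admin_net="$1" proxy_ip="$2" port
  echo "ufw default deny incoming"
  echo "ufw default allow outgoing"
  # The SSH rule is emitted before "enable" so the current session survives.
  echo "ufw allow from ${admin_net} to any port 22 proto tcp"
  for port in 80 443; do
    echo "ufw allow from ${proxy_ip} to any port ${port} proto tcp"
  done
  echo "ufw --force enable"
}

# Review, then apply on the target host:
#   ufw_plan 10.0.10.0/24 10.0.10.50
#   ufw_plan 10.0.10.0/24 10.0.10.50 | sh
```

Note that ports published by Docker are handled by Docker's own iptables rules and are generally not filtered by UFW, so the localhost-bind step for containers is still needed.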

unattended-upgrades off

How bad: Medium — security patches lag until manual maintenance.

Will fixing break things? Usually no. Kernel updates may require reboot; use Unattended-Upgrade::Automatic-Reboot "false" until you want reboot windows.
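
A minimal sketch of the APT settings involved, assuming the stock Debian `unattended-upgrades` package. The drop-in file name `52unattended-upgrades-local` is an assumption; any file that sorts after `50unattended-upgrades` should take effect.

```shell
# Print the periodic schedule: refresh package lists and run upgrades daily.
render_auto_upgrades() {
  cat <<'EOF'
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
EOF
}

# Print the reboot policy: never reboot automatically.
render_no_reboot() {
  cat <<'EOF'
Unattended-Upgrade::Automatic-Reboot "false";
EOF
}

# Succeeds if the flag file left by a kernel or libc update exists.
# $1: path to the flag file (defaults to the Debian location).
reboot_pending() {
  [ -f "${1:-/var/run/reboot-required}" ]
}

# On the host, as root:
#   render_auto_upgrades > /etc/apt/apt.conf.d/20auto-upgrades
#   render_no_reboot     > /etc/apt/apt.conf.d/52unattended-upgrades-local
#   unattended-upgrade --dry-run --debug
#   reboot_pending && echo "reboot needed; schedule a window"
```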

Proxmox UI :8006 exposed

How bad: Critical on untrusted networks — API gives VM/storage control.

Will fixing break things? Restricting to 10.0.10.0/24 does not break normal LAN admin access.
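
If the host-level iptables route is chosen instead of the Proxmox firewall, the rule order might look like the dry-run sketch below. It only prints commands; the function name is hypothetical, and rules added this way do not persist across reboots unless saved separately.

```shell
# Print iptables rules that allow TCP 8006 only from the given CIDRs.
# Allow rules come first; the final rule drops everything else on that port.
restrict_8006_plan() {
  local cidr
  for cidr in "$@"; do
    echo "iptables -A INPUT -p tcp --dport 8006 -s ${cidr} -j ACCEPT"
  done
  echo "iptables -A INPUT -p tcp --dport 8006 -j DROP"
}

# Review the plan before running it on pve201 or pve10:
#   restrict_8006_plan 10.0.10.0/24
```

Add the VPN client subnet as a second argument once it is known, so remote admin access through the VPN keeps working.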

HTTP services on all interfaces (8080, 3000, …)

How bad: High without TLS/auth at the edge; medium behind Caddy + LAN only.

Will fixing break things? Yes if you bind to 127.0.0.1 before Caddy reverse_proxy is updated. Order: Caddy route → test → then bind Docker to localhost.
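
One way to find the affected services, and to confirm the change afterwards, is to filter `ss` output for wildcard binds. This is a sketch: the function name is hypothetical and it assumes the usual `ss -tlnH` column layout, with the local address in the fourth field.

```shell
# Read `ss -tlnH` output on stdin; print listeners bound to all interfaces
# (0.0.0.0, [::], or the dual-stack "*").
wildcard_listeners() {
  awk '$4 ~ /^(0\.0\.0\.0|\*|\[::\]):[0-9]+$/ { print $4 }'
}

# On a host:
#   ss -tlnH | wildcard_listeners
# After publishing a container on localhost only (e.g. -p 127.0.0.1:8080:8080),
# its port should disappear from this list; then test the URL through Caddy.
```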

Remote access (Tailscale deferred)

Decision: Tailscale off; use UniFi site-to-site / VPN into 10.0.10.0/24 for admin and Ollama/GPU access.

Security: Ensure VPN is required for SSH and Proxmox :8006 from outside; do not port-forward :22/:8006 on the router without IP allowlists.

pve201 RAM (was 97% used)

How bad: Critical — OOM kills guests, swap thrashing.

Mitigation done: VM 104 reduced 73728→65536 MiB (~8 GiB freed on hypervisor). Still tight; consider moving git-ci-01 or other workloads to pve10.
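
The arithmetic behind the "~8 GiB freed" figure, as a small helper (the function name is illustrative):

```shell
# GiB freed on the hypervisor when a guest shrinks from $1 MiB to $2 MiB.
freed_gib() {
  echo $(( ($1 - $2) / 1024 ))
}

freed_gib 73728 65536   # prints 8

# The resize itself is done on the hypervisor, e.g.:
#   qm set 104 --memory 65536
```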

2026-05-20 — Original audit

Scope: Proxmox nodes pve201 (10.0.10.201) and pve10 (10.0.10.10), all LXCs via pct exec, SSH deep-dive on hypervisors.

Executive summary

Area	Critical	High	Medium
Hypervisors (201, 10)	2	4	2
LXCs on 201 (10 running)	0	10	8
LXCs on 10 (3 running)	0	3	3

Top priorities

Harden SSH on both Proxmox hosts (root + passwords currently allowed).
Restrict Proxmox API/UI port 8006 to admin IPs.
Disable password SSH on all LXCs; deploy keys + make copy-ssh-keys for inventory IPs.
Patch hosts with 40–105 pending apt upgrades (hypervisors worst).
Put HTTP services (8080, 8000, qBit, etc.) behind reverse proxy + TLS or bind to internal IPs.

Proxmox hypervisors

pve201 — 10.0.10.201 (`pve`)

Resource	Status
OS	Debian 12, PVE 8.4.16, kernel 6.8.12-18-pve
RAM free	~2.5 GB / 126 GB (critical)
Pending apt	105
UFW / fail2ban / unattended-upgrades	None

SSH audit (dedicated)

Setting	Current	Target
`permitrootlogin`	yes	`prohibit-password`
`passwordauthentication`	yes	`no`
`pubkeyauthentication`	yes	yes
`maxauthtries`	6	3–4
`x11forwarding`	yes	no (on servers)
Root keys	3 keys in `authorized_keys`	audit/remove unused

Exposed services

Port	Service	Risk
22	SSH	Brute-force (no fail2ban)
8006	Proxmox API/UI	Critical — full cluster control
3128	spiceproxy	Medium
111	rpcbind	Low — reduce exposure

Fixes (pve201)

# 1) SSH — prefer Ansible after limiting to your IP
make copy-ssh-key HOST=pve201   # if needed
# Manual quick fix on host:
sed -i 's/^#*PermitRootLogin.*/PermitRootLogin prohibit-password/' /etc/ssh/sshd_config
sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
sshd -t && systemctl reload sshd

# 2) Proxmox firewall — Datacenter → Firewall → restrict 8006 to 10.0.10.0/24 or admin IP
# Or iptables on host for port 8006

# 3) fail2ban
apt install fail2ban -y
systemctl enable --now fail2ban

# 4) Auto security updates
apt install unattended-upgrades apt-listchanges -y
dpkg-reconfigure -plow unattended-upgrades

# 5) Patch
apt update && apt upgrade -y

Ansible (when ready): add pve201 / pve10 to a proxmox group play with roles/ssh + roles/monitoring_server (fail2ban). Do not lock yourself out — test with second session first.

pve10 — 10.0.10.10 (`PVENAS`)

Resource	Status
OS	Debian 13 (trixie), PVE, kernel 6.17.13-3-pve
Load	~30 on 24 CPUs (overloaded)
Pending apt	92
UFW / fail2ban / unattended-upgrades	None
ZFS `NAS.SP00`	inactive (I/O suspended)
PBS `PVEBUVD00` → 10.0.10.200:8007	unreachable

SSH audit (dedicated)

Same as pve201: permitrootlogin yes, passwordauthentication yes, 3 root authorized_keys.

Exposed services

Port	Service	Risk
22	SSH	High
8006	Proxmox API/UI	Critical
2049, mountd, statd	NFS/RPC	High on LAN
3128	spiceproxy	Medium

Fixes (pve10)

Same SSH / fail2ban / unattended-upgrades / patch steps as pve201.

Additional:

# Investigate ZFS pool
zpool status NAS.SP00
# Fix PBS connectivity or remove stale datastore from Proxmox UI

LXCs on pve201 (via `pct exec`)

VMID	Name	IP	Status	SSH root	Password auth	UFW	fail2ban	Upgrades	Public services
301	vikunja-debian	10.0.10.159	running	without-password	yes	no	no	0	3456, 22
302	qbit-debian	10.0.10.91	running	without-password	yes	no	no	0	8080 (qBit), 22
303	searchXNG-debian	10.0.10.70	running	without-password	yes	no	no	83	8080, 22
304	wireguard-debian	10.0.10.192	running	without-password	yes	no	no	0	22
305	kuma-debian	10.0.10.197	stopped	—	—	—	—	—	replaced by LXC 218
306	portfolio	—	destroyed	—	—	—	—	—	migrated → pve10 LXC 219 @ `10.0.10.106` (purged 2026-05-22)
307	jobber-delian	10.0.10.178	running	without-password	yes	no	no	83	3005, 22
308	stirling-pdf	10.0.10.43	running	without-password	yes	no	no	0	8080, 22
9001	pote-dev	10.0.10.114	stopped	—	—	—	—	—	—
9101	punimTagFE-dev	10.0.10.121	running	without-password	yes	active	no	89	8000, 111, 22
9401	mirrormatch-dev	10.0.10.141	stopped	—	—	—	—	—	—

Inventory mapping: vikunja → 159 (LXC 301), qBittorrent → 91, punimTag app → 121.

Common LXC issues (pve201)

Issue	Severity	Fix
`passwordauthentication yes` on all LXCs	High	Set `PasswordAuthentication no` in `/etc/ssh/sshd_config`, reload sshd
No fail2ban	High	Install fail2ban or rely on Proxmox FW + LAN segmentation
Apps on `0.0.0.0:8080` / 8000 / 3456	High	Bind to localhost + Caddy, or restrict via Proxmox guest firewall (`firewall=1` on net0 — enable rules)
79–89 pending upgrades on several CTs	Medium	`pct exec <id> -- bash -c 'apt update && apt upgrade -y'`
Stopped dev CTs (9001, 9401)	Low	Start when needed or keep stopped to reduce attack surface

Per-LXC fixes (pve201)

# Example: harden + patch vikunja (301) from Proxmox host
pct exec 301 -- sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
pct exec 301 -- systemctl reload ssh

# Patch container
pct exec 303 -- bash -c 'apt update && apt upgrade -y'

# Copy your SSH key (from Mac, once password/key works)
make copy-ssh-key HOST=vikunja   # 10.0.10.159
make copy-ssh-key HOST=qBittorrent # 10.0.10.91

punimTagFE-dev (9101): Only LXC with UFW active — extend rules to deny inbound except 22 from admin subnet; still disable password auth.

LXCs on pve10 (via `pct exec`)

VMID	Name	IP	Status	SSH root	Password auth	UFW	fail2ban	Upgrades	Public services
210	cal	10.0.10.228	running	without-password	yes	no	no	0	3000, 22
215	caseware	10.0.10.105	running	without-password	yes	no	no	40	80 (nginx), 22
216	auto	10.0.10.59	running	without-password	yes	no	no	40	80 (nginx), 22

Inventory mapping: caseware → 105, auto → 59.

Fixes (pve10 LXCs)

# SSH harden caseware (215)
pct exec 215 -- sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
pct exec 215 -- systemctl reload sshd

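# Patch (bash -c keeps both commands inside the container;
# without it, "&& apt upgrade -y" would run on the Proxmox host)
pct exec 215 -- bash -c 'apt update && apt upgrade -y'
pct exec 216 -- bash -c 'apt update && apt upgrade -y'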

# Deploy keys from Mac
make copy-ssh-key HOST=caseware
make copy-ssh-key HOST=auto

HTTP port 80 on caseware/auto: Ensure TLS termination on Caddy (inventory host caddy 10.0.10.50) and no plain HTTP from WAN if exposed.

SSH hardening checklist (all Linux targets)

Use this order to avoid lockout:

Confirm your key works: ssh -o BatchMode=yes root@<ip> true
Set PasswordAuthentication no
Set PermitRootLogin prohibit-password (LXCs already report without-password, the legacy alias of prohibit-password, so root is already keys-only there)
sshd -t && systemctl reload sshd
Open second terminal and test before closing first
Optional: change SSH port, MaxAuthTries 4, disable X11Forwarding
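
After the reload, the effective settings can be checked against the targets in this report by parsing `sshd -T`. The sketch below is one way to do that; the function name is hypothetical, and it treats `without-password` as equal to `prohibit-password` because `sshd -T` prints the legacy spelling on many OpenSSH versions.

```shell
# Read `sshd -T` output on stdin; print each setting that misses its target.
# Exit status is 0 when every checked setting matches, 1 otherwise.
check_sshd_effective() {
  awk '
    BEGIN {
      want["passwordauthentication"] = "no"
      want["permitrootlogin"]        = "prohibit-password"
      want["pubkeyauthentication"]   = "yes"
      want["x11forwarding"]          = "no"
    }
    ($1 in want) {
      seen[$1] = 1
      # Normalise the legacy alias before comparing.
      val = ($2 == "without-password") ? "prohibit-password" : $2
      if (val != want[$1]) {
        printf "%s: %s (want %s)\n", $1, $2, want[$1]
        bad = 1
      }
    }
    END {
      # A setting that never appeared in the input counts as a failure.
      for (k in want) if (!(k in seen)) { printf "%s: missing\n", k; bad = 1 }
      exit bad
    }'
}

# On the target, as root:
#   sshd -T | check_sshd_effective && echo "sshd matches targets"
```

Checking the effective configuration rather than grepping `sshd_config` also catches overrides from drop-in files.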

Ansible alignment:

# After keys on host
make dev HOST=<hostname> --tags security
# or role ssh via playbooks that include roles/ssh

Re-run audits

# Hypervisor full audit
ssh root@10.0.10.201 'bash -s' < scripts/security-audit-remote.sh
ssh root@10.0.10.10  'bash -s' < scripts/security-audit-remote.sh

# Hypervisor SSH-only
ssh root@10.0.10.201 'bash -s' < scripts/security-audit-ssh.sh

# All LXCs on a node
ssh root@10.0.10.201 'bash -s' < scripts/security-audit-lxc-via-pve.sh
ssh root@10.0.10.10  'bash -s' < scripts/security-audit-lxc-via-pve.sh

Tracking

Item	Completed	Status
SSH keys caseware, auto, cal, vikunja, mailcow, listmonk	2026-05-23	☑
Fleet `apt upgrade` (no reboot)	2026-05-23	☑ all previously failed hosts fixed
Tier 1 cron (journal + apt)	2026-05-23	☑ PVE + most hosts via Ansible
Tier 2 cron (docker prune)	2026-05-23	☑ identity, monitoring, vikunja, git-ci-01
VM 104 RAM 72→64 GiB	2026-05-23	☑
Inventory `vikunja` rename	2026-05-23	☑
Fix `host_vars` ansibleVM / listmonk merge	2026-05-23	☑ plain YAML (review `*.vault-bak`)
SSH harden pve201		☐
SSH harden pve10		☐
Restrict 8006 on both nodes		☐
fail2ban on hypervisors		☐
`make security` on production groups		☐
Disable password SSH on all LXCs		☐
`copy-ssh-keys` remaining inventory		☐ partial
TLS / localhost bind for :8080 services		☐
unattended-upgrades all production		☐
Tailscale re-auth		⏸ deferred (UniFi VPN)
Fix ZFS NAS.SP00 on pve10		☐
caddy Ansible as root	2026-05-23	☑
vaultwardenVM / ansibleVM become in vault	2026-05-23	☑
Add GPU-Dev `10.0.10.122` to inventory		☐
Ollama bind localhost + optional Open WebUI		☐

Next steps (priority)

make security on one site host (e.g. caseware) with a second SSH session open — disable password SSH, enable UFW + fail2ban (ignoreip = LAN + VPN pool).
Restrict Proxmox :8006 to 10.0.10.0/24 + VPN subnet on pve201 and pve10.
Bind internal Docker ports on identity / monitoring / vaultwarden to 127.0.0.1 after confirming Caddy routes.
GPU-Dev: point clients at http://10.0.10.122:11434 over VPN; tune Ollama env vars; add host to inventory when automating.
unattended-upgrades on production LXCs (reboot policy manual).
Review host_vars/*.vault-bak and merge any secrets still needed into vault + plain host_vars.

References

Security remediation plan — phased fixes (critical → low) and login model
Security hardening guide
SECURITY_HARDENING_PLAN.md
Role defaults: roles/ssh/defaults/main.yml
