ansible/docs/guides/security-audit-report.md
ilia f17a1a3bcc
Add homelab SSO, maintenance cron, and inventory cleanup.
Cal Authentik OIDC playbook/role (deferred until license), Vikunja OIDC
docs and vault secrets, SSO matrix, mailcow LAN proxy fix, extended
security audit docs, maintenance_cron role with group_vars split, and
inventory updates (vikunja rename, identity/monitoring/cal host_vars).

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-23 20:23:10 -04:00

# Security Audit Report
**Last audit:** 2026-05-23 (re-run after SSH keys + `make maintenance`)
**Previous audit:** 2026-05-20
**Auditor:** `scripts/security-audit-*.sh`, Ansible `maintenance` + `maintenance_cron` roles
**Repo baseline** (`roles/ssh/defaults/main.yml`): `PermitRootLogin prohibit-password`, `PasswordAuthentication no`, UFW enabled.
---
## 2026-05-23 — Actions completed
| Action | Status |
|--------|--------|
| SSH keys → caseware, auto, cal, vikunja, mailcow, listmonk | ✅ All six reachable as `root` |
| SSH keys → mailcow/listmonk VMs | ✅ Via brief VM shutdown + disk inject on pve201 (no guest agent) |
| Inventory rename `vikanjans` → `vikunja` | ✅ `hosts` + `proxmox_vmid=301` |
| `apt upgrade` fleet (skip reboot) | ✅ 14 hosts via Ansible; auto via `pct exec` on pve10 |
| Tier 1 cron (journal + apt) | ✅ `roles/maintenance_cron` on PVE, sites, comms, ansible, hermes, etc. |
| Tier 2 cron (docker prune) | ✅ identity, monitoring, vikunja; git-ci-01 keeps `docker-prune-ci` |
| VM 104 (GPU-Dev) RAM 72→64 GiB | ✅ pve201; host free RAM ~1.7→10 GiB |
| Fix broken `host_vars` (ansibleVM, listmonk) | ✅ Plain YAML; old blobs → `*.vault-bak` |
| Vault `vault_*_become_password` + maintenance vaultwardenVM | ✅ 2026-05-23 |
| caddy root SSH + maintenance | ✅ `bootstrap-root-ssh-caddy`; inventory `ansible_user=root` |
| ansibleVM maintenance | ✅ become password in vault |
### Post-maintenance SSH reachability
| Host | SSH | Notes |
|------|-----|-------|
| caseware | ✅ | |
| auto | ✅ | Was slow from laptop earlier; OK after upgrade |
| cal | ✅ | |
| vikunja | ✅ | LXC 301 @ 10.0.10.159 |
| mailcow | ✅ | ~1 min downtime for key inject |
| listmonk | ✅ | ~1 min downtime for key inject |
### Maintenance playbook recap (`skip_reboot=true`)
| Host | Result |
|------|--------|
| pve201, pve10, caseware, cal, vikunja, mailcow, listmonk, identity, monitoring, hermes, levkin, portfolio, git-ci-01, sonarqube-01 | ✅ upgraded |
| caddy | ✅ (as `root`; no `sudo` package on host) |
| ansibleVM | ✅ (`vault_ansiblevm_become_password`) |
| vaultwardenVM | ✅ (`vault_vaultwarden_become_password`) |
### Open security gaps (unchanged until `make security`)
| Control | Fleet status | Risk if fixed wrong |
|---------|--------------|---------------------|
| `PasswordAuthentication yes` | Most LXCs + both PVE | **Low break risk** if SSH keys tested first in a second session |
| `PermitRootLogin yes` | pve201, pve10, sonarqube-01 | Same — use `prohibit-password`, not `no`, if you need root+key |
| fail2ban | Off everywhere | Enabling is safe; may lock you out only if you brute-force yourself |
| UFW | Off (except one dev LXC) | **Medium risk** — wrong rules drop SSH/80/443; apply via Ansible `roles/ssh` after allowlist |
| unattended-upgrades | hermes, ansibleVM only | Safe; schedule reboots separately |
| Proxmox :8006 | Open on LAN | Restrict in PVE firewall — **won't break VMs** |
| Docker on `0.0.0.0` | identity, monitoring, vaultwarden, qBit | Bind to `127.0.0.1` — **can break access** if Caddy route missing; test URL after |
| Tailscale | **Deferred** | Off by choice; remote access via **UniFi VPN** to LAN |
See [Risk explanations (2026-05-23)](#risk-explanations-2026-05-23) and [fail2ban vs password SSH](#fail2ban-vs-password-ssh) below.
---
## GPU-Dev (pve201 VM 104) — Ollama / LLMs
| Resource | Current |
|----------|---------|
| Host | pve201, VMID **104**, `GPU-Dev-Debian` |
| LAN IP | **10.0.10.122** (inventory `devGPU` @ 10.0.30.63 is a different network — use `.122` from LAN) |
| RAM | **64 GiB** guest (~60 GiB available when idle) |
| GPU | **RTX 4080 16 GiB** (PCI passthrough `hostpci0`) |
| Workload | **Ollama** already running (~3.6 GiB VRAM in sample) |
### Getting the most from RAM + GPU
1. **Right-size models to VRAM** — On a 16 GiB 4080, prefer quantised models that fit entirely in VRAM (e.g. 7B–14B Q4/Q5, or 32B Q2/Q3 if you accept quality trade-offs). If a model spills to CPU RAM, throughput drops sharply.
2. **One heavy model at a time** — Ollama loads models on demand; set `OLLAMA_MAX_LOADED_MODELS=1` (or keep only one client) so you do not fragment 64 GiB RAM + 16 GiB VRAM across several large weights.
3. **Parallel requests** — `OLLAMA_NUM_PARALLEL` defaults are conservative; raise only if VRAM headroom exists (watch `nvidia-smi` while under load).
4. **Keep guest RAM for KV cache** — With 64 GiB you can run larger context windows; set `OLLAMA_CONTEXT_LENGTH` / model `num_ctx` to what you need, not maximum “just because”.
5. **CPU offload only when needed** — `num_gpu` layers = all layers for speed; partial offload is for models that do not fit in VRAM, not for tuning.
6. **Disk** — Store models on fast local disk (not NFS); `ollama pull` once, prune old tags periodically (`ollama list` / remove unused).
7. **Proxmox** — Do not balloon GPU VM RAM; PCI passthrough pins the guest's full memory allocation, so all 64 GiB stays reserved on the host. Freeing pve201 meant lowering this VM from 72→64 GiB, not overcommitting other guests on 201.
8. **Optional** — [Open WebUI](https://github.com/open-webui/open-webui) on localhost + Caddy TLS; bind Ollama to `127.0.0.1:11434` only (LAN via VPN).
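The environment variables in the list above can be collected in a systemd drop-in. The following is a minimal sketch, assuming Ollama runs as a systemd unit named `ollama.service` (not confirmed in this report); the function name `write_ollama_override` is hypothetical, and the values for `OLLAMA_NUM_PARALLEL` and `OLLAMA_CONTEXT_LENGTH` are placeholders to tune against `nvidia-smi`, not measured recommendations.

```bash
#!/usr/bin/env bash
# Sketch: write a systemd drop-in carrying the Ollama settings discussed above.
# Assumptions: the unit is ollama.service; NUM_PARALLEL and CONTEXT_LENGTH
# values are illustrative placeholders, not tuned for this host.
write_ollama_override() {
  local dropin_dir="${1:-/etc/systemd/system/ollama.service.d}"
  mkdir -p "$dropin_dir"
  cat > "$dropin_dir/override.conf" <<'EOF'
[Service]
Environment="OLLAMA_HOST=127.0.0.1:11434"
Environment="OLLAMA_MAX_LOADED_MODELS=1"
Environment="OLLAMA_NUM_PARALLEL=2"
Environment="OLLAMA_CONTEXT_LENGTH=8192"
EOF
}
```

To apply it on the VM, run `write_ollama_override && systemctl daemon-reload && systemctl restart ollama`, then watch `nvidia-smi` under load. Binding to `127.0.0.1` means LAN clients can only reach Ollama through a local reverse proxy.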
**Not in Ansible yet:** add `devGPU` / `10.0.10.122` to inventory when you want playbooks (cron, hardening) on this box.
---
## fail2ban vs password SSH
**What fail2ban does:** After too many failed SSH logins from an IP, it adds a **temporary firewall ban** for that IP (typically 10–60 minutes). It does **not** disable password authentication globally.
**Can passwords stay on if fail2ban is on?** Technically yes — fail2ban only rate-limits brute force; passwords are still weaker than keys. Best practice on servers: **keys + `PasswordAuthentication no` + fail2ban** (defence in depth).
**Your Proxmox console fallback:** If you lock yourself out of SSH on a guest, you can still use **Proxmox → VM → Console**, `pct enter` for LXCs, or `qm guest exec` for VMs from pve201/pve10 (`qm guest exec` needs the QEMU guest agent, which the mailcow and listmonk VMs do not run). That is a good break-glass path, but it is **not** a substitute for keys on hosts you manage daily — console is slow and easy to misconfigure under pressure.
**Recommendation:** Enable fail2ban via `make security` with `ignoreip` including `10.0.10.0/24` and your UniFi VPN client subnet. Then disable password SSH once keys work everywhere you care about.
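As a manual illustration of the `ignoreip` recommendation (separate from whatever `make security` templates), the sketch below writes a local jail override. The function name `write_sshd_jail`, the file name `sshd-lan.local`, and the retry/ban values are assumptions; the VPN subnet is not stated in this report, so it must be passed in.

```bash
#!/usr/bin/env bash
# Sketch: fail2ban sshd jail that never bans the admin LAN or the VPN pool.
# Assumptions: file name and maxretry/findtime/bantime values are illustrative;
# the UniFi VPN client subnet is unknown here and must be supplied as $2.
write_sshd_jail() {
  local jail_dir="${1:-/etc/fail2ban/jail.d}"
  local vpn_subnet="${2:?pass the UniFi VPN client subnet as the second argument}"
  mkdir -p "$jail_dir"
  cat > "$jail_dir/sshd-lan.local" <<EOF
[sshd]
enabled = true
ignoreip = 127.0.0.1/8 ::1 10.0.10.0/24 ${vpn_subnet}
maxretry = 5
findtime = 600
bantime = 600
EOF
}
```

After writing the file, `systemctl restart fail2ban` and `fail2ban-client status sshd` should show the jail. On hosts that log only to the journal (no `/var/log/auth.log`), the jail may also need `backend = systemd`.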
---
## Risk explanations (2026-05-23)
### Password SSH (`PasswordAuthentication yes`)
**How bad:** High on internet-facing IPs; medium on `10.0.10.0/24` only. Anyone who can reach :22 can try passwords indefinitely (no fail2ban).
**Will fixing break things?** No, if you (1) confirm key login works, (2) set `PasswordAuthentication no`, (3) keep a second SSH session open, (4) reload sshd. Breakage happens only if keys are missing/wrong.
### Root login (`PermitRootLogin yes` on hypervisors)
**How bad:** High — root + password on PVE is full cluster compromise.
**Will fixing break things?** Use `prohibit-password` (keys only), not `no`, unless you have another admin user with sudo. Ansible playbooks expect root on PVE today.
### fail2ban off
**How bad:** Medium — relies on LAN trust; SSH noise from scanners still fills logs.
**Will fixing break things?** Rarely. Tune `ignoreip` to your admin IP/subnet so your own typos don't ban you.
### UFW off
**How bad:** Medium on segmented LAN; high if any host has a public IP.
**Will fixing break things?** **Yes, if misconfigured** — default deny without allowing 22 from admin IP, 80/443 from Caddy, or Docker-published ports you still need. Use Ansible `roles/ssh` (UFW after SSH rules) and test.
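To make the allowlist order concrete, here is a hedged sketch that only prints the `ufw` commands so they can be reviewed before anything is applied. It is not what `roles/ssh` does internally; the function name `ufw_plan` is hypothetical, and the defaults assume the admin subnet `10.0.10.0/24` and the Caddy host `10.0.10.50` mentioned elsewhere in this report.

```bash
#!/usr/bin/env bash
# Sketch: print a reviewable UFW plan (allow rules first, enable last).
# Assumptions: admin subnet and Caddy IP defaults come from this report;
# adjust both per host before applying.
ufw_plan() {
  local admin_subnet="${1:-10.0.10.0/24}"
  local caddy_ip="${2:-10.0.10.50}"
  cat <<EOF
ufw default deny incoming
ufw default allow outgoing
ufw allow from ${admin_subnet} to any port 22 proto tcp
ufw allow from ${caddy_ip} to any port 80,443 proto tcp
ufw --force enable
EOF
}
```

Run `ufw_plan` to inspect the commands, then `ufw_plan | sh` as root with a second SSH session open. Ports published by Docker are handled in Docker's own iptables chains and are generally not filtered by these rules, so check those hosts separately.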
### unattended-upgrades off
**How bad:** Medium — security patches lag until manual maintenance.
**Will fixing break things?** Usually no. Kernel updates may require reboot; use `Unattended-Upgrade::Automatic-Reboot "false"` until you want reboot windows.
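A minimal sketch of the no-reboot policy follows, assuming a local override file in `/etc/apt/apt.conf.d/`. The function name `write_unattended_conf` and the file name `52unattended-upgrades-local` are hypothetical; only the `Automatic-Reboot` setting comes from the text above.

```bash
#!/usr/bin/env bash
# Sketch: enable daily unattended upgrades but never reboot automatically.
# Assumption: the override file name is illustrative; it sorts after the
# packaged 20auto-upgrades and 50unattended-upgrades files.
write_unattended_conf() {
  local apt_conf_dir="${1:-/etc/apt/apt.conf.d}"
  mkdir -p "$apt_conf_dir"
  cat > "$apt_conf_dir/52unattended-upgrades-local" <<'EOF'
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
Unattended-Upgrade::Automatic-Reboot "false";
EOF
}
```

After writing the file, `unattended-upgrade --dry-run --debug` shows what would be installed without changing the system.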
### Proxmox UI :8006 exposed
**How bad:** **Critical** on untrusted networks — API gives VM/storage control.
**Will fixing break things?** Restricting to `10.0.10.0/24` does not break normal LAN admin access.
### HTTP services on all interfaces (8080, 3000, …)
**How bad:** High without TLS/auth at the edge; medium behind Caddy + LAN only.
**Will fixing break things?** **Yes** if you bind to `127.0.0.1` before Caddy `reverse_proxy` is updated. Order: Caddy route → test → then bind Docker to localhost.
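One way to verify each step of that order is to list which TCP ports still listen on all interfaces. The helper below is a sketch that parses `ss -tlnH` output from stdin; the function name `wildcard_ports` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: read `ss -tlnH` output on stdin and print the ports that listen on
# every interface (0.0.0.0, * or [::]), one per line, sorted and de-duplicated.
wildcard_ports() {
  awk '$4 ~ /^(0\.0\.0\.0|\*|\[::\]):[0-9]+$/ { sub(/.*:/, "", $4); print $4 }' \
    | sort -un
}
```

Usage on a host: `ss -tlnH | wildcard_ports`, or from a hypervisor `pct exec <id> -- ss -tlnH | wildcard_ports`. Run it before and after re-binding a service; the port should disappear from the list once it listens on `127.0.0.1` only.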
### Remote access (Tailscale deferred)
**Decision:** Tailscale off; use **UniFi site-to-site / VPN** into `10.0.10.0/24` for admin and Ollama/GPU access.
**Security:** Ensure VPN is required for SSH and Proxmox :8006 from outside; do not port-forward :22/:8006 on the router without IP allowlists.
### pve201 RAM (was 97% used)
**How bad:** **Critical** — OOM kills guests, swap thrashing.
**Mitigation done:** VM 104 reduced 73728→65536 MiB (~8 GiB freed on hypervisor). Still tight; consider moving git-ci-01 or other workloads to pve10.
---
## 2026-05-20 — Original audit
**Scope:** Proxmox nodes `pve201` (10.0.10.201) and `pve10` (10.0.10.10), all LXCs via `pct exec`, SSH deep-dive on hypervisors.
---
## Executive summary
| Area | Critical | High | Medium |
|------|----------|------|--------|
| Hypervisors (201, 10) | 2 | 4 | 2 |
| LXCs on 201 (10 running) | 0 | 10 | 8 |
| LXCs on 10 (3 running) | 0 | 3 | 3 |
**Top priorities**
1. Harden **SSH on both Proxmox hosts** (root + passwords currently allowed).
2. Restrict **Proxmox API/UI port 8006** to admin IPs.
3. Disable **password SSH on all LXCs**; deploy keys + `make copy-ssh-keys` for inventory IPs.
4. Patch hosts with **40–105** pending apt upgrades (hypervisors worst).
5. Put **HTTP services** (8080, 8000, qBit, etc.) behind reverse proxy + TLS or bind to internal IPs.
---
## Proxmox hypervisors
### pve201 — 10.0.10.201 (`pve`)
| Resource | Status |
|----------|--------|
| OS | Debian 12, PVE 8.4.16, kernel 6.8.12-18-pve |
| RAM free | ~2.5 GB / 126 GB (**critical**) |
| Pending apt | **105** |
| UFW / fail2ban / unattended-upgrades | **None** |
#### SSH audit (dedicated)
| Setting | Current | Target |
|---------|---------|--------|
| `permitrootlogin` | **yes** | `prohibit-password` |
| `passwordauthentication` | **yes** | `no` |
| `pubkeyauthentication` | yes | yes |
| `maxauthtries` | 6 | 3–4 |
| `x11forwarding` | yes | no (on servers) |
| Root keys | 3 keys in `authorized_keys` | audit/remove unused |
#### Exposed services
| Port | Service | Risk |
|------|---------|------|
| 22 | SSH | Brute-force (no fail2ban) |
| 8006 | Proxmox API/UI | **Critical** — full cluster control |
| 3128 | spiceproxy | Medium |
| 111 | rpcbind | Low — reduce exposure |
#### Fixes (pve201)
```bash
# 1) SSH — prefer Ansible after limiting to your IP
make copy-ssh-key HOST=pve201 # if needed
# Manual quick fix on host:
sed -i 's/^#*PermitRootLogin.*/PermitRootLogin prohibit-password/' /etc/ssh/sshd_config
sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
sshd -t && systemctl reload sshd
# 2) Proxmox firewall — Datacenter → Firewall → restrict 8006 to 10.0.10.0/24 or admin IP
# Or iptables on host for port 8006
# 3) fail2ban
apt install fail2ban -y
systemctl enable --now fail2ban
# 4) Auto security updates
apt install unattended-upgrades apt-listchanges -y
dpkg-reconfigure -plow unattended-upgrades
# 5) Patch
apt update && apt upgrade -y
```
**Ansible (when ready):** add `pve201` / `pve10` to a `proxmox` group play with `roles/ssh` + `roles/monitoring_server` (fail2ban).
Do **not** lock yourself out — test with second session first.
---
### pve10 — 10.0.10.10 (`PVENAS`)
| Resource | Status |
|----------|--------|
| OS | Debian 13 (trixie), PVE, kernel 6.17.13-3-pve |
| Load | **~30** on 24 CPUs (overloaded) |
| Pending apt | **92** |
| UFW / fail2ban / unattended-upgrades | **None** |
| ZFS `NAS.SP00` | **inactive** (I/O suspended) |
| PBS `PVEBUVD00` → 10.0.10.200:8007 | **unreachable** |
#### SSH audit (dedicated)
Same as pve201: `permitrootlogin yes`, `passwordauthentication yes`, 3 root authorized_keys.
#### Exposed services
| Port | Service | Risk |
|------|---------|------|
| 22 | SSH | High |
| 8006 | Proxmox API/UI | **Critical** |
| 2049, mountd, statd | NFS/RPC | High on LAN |
| 3128 | spiceproxy | Medium |
#### Fixes (pve10)
Same SSH / fail2ban / unattended-upgrades / patch steps as pve201.
Additional:
```bash
# Investigate ZFS pool
zpool status NAS.SP00
# Fix PBS connectivity or remove stale datastore from Proxmox UI
```
---
## LXCs on pve201 (via `pct exec`)
| VMID | Name | IP | Status | SSH root | Password auth | UFW | fail2ban | Upgrades | Public services |
|------|------|-----|--------|----------|---------------|-----|----------|----------|-----------------|
| 301 | vikunja-debian | 10.0.10.159 | running | without-password | **yes** | no | no | 0 | **3456**, 22 |
| 302 | qbit-debian | 10.0.10.91 | running | without-password | **yes** | no | no | 0 | **8080** (qBit), 22 |
| 303 | searchXNG-debian | 10.0.10.70 | running | without-password | **yes** | no | no | **83** | **8080**, 22 |
| 304 | wireguard-debian | 10.0.10.192 | running | without-password | **yes** | no | no | 0 | 22 |
| 305 | kuma-debian | 10.0.10.197 | **stopped** | — | — | — | — | — | replaced by LXC 218 |
| 306 | portfolio | — | **destroyed** | — | — | — | — | — | migrated → pve10 LXC **219** @ `10.0.10.106` (purged 2026-05-22) |
| 307 | jobber-delian | 10.0.10.178 | running | without-password | **yes** | no | no | **83** | **3005**, 22 |
| 308 | stirling-pdf | 10.0.10.43 | running | without-password | **yes** | no | no | 0 | **8080**, 22 |
| 9001 | pote-dev | 10.0.10.114 | **stopped** | — | — | — | — | — | — |
| 9101 | punimTagFE-dev | 10.0.10.121 | running | without-password | **yes** | **active** | no | **89** | **8000**, 111, 22 |
| 9401 | mirrormatch-dev | 10.0.10.141 | **stopped** | — | — | — | — | — | — |
**Inventory mapping:** `vikunja` → 159 (LXC 301), `qBittorrent` → 91, `punimTag` app → 121.
### Common LXC issues (pve201)
| Issue | Severity | Fix |
|-------|----------|-----|
| `passwordauthentication yes` on all LXCs | High | Set `PasswordAuthentication no` in `/etc/ssh/sshd_config`, reload sshd |
| No fail2ban | High | Install fail2ban or rely on Proxmox FW + LAN segmentation |
| Apps on `0.0.0.0:8080` / 8000 / 3456 | High | Bind to localhost + Caddy, or restrict via Proxmox guest firewall (`firewall=1` on net0 — enable rules) |
| 79–89 pending upgrades on several CTs | Medium | `pct exec <id> -- bash -c 'apt update && apt upgrade -y'` |
| Stopped dev CTs (9001, 9401) | Low | Start when needed or keep stopped to reduce attack surface |
### Per-LXC fixes (pve201)
```bash
# Example: harden + patch vikunja (301) from Proxmox host
pct exec 301 -- sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
pct exec 301 -- systemctl reload ssh
# Patch container
pct exec 303 -- bash -c 'apt update && apt upgrade -y'
# Copy your SSH key (from Mac, once password/key works)
make copy-ssh-key HOST=vikunja # 10.0.10.159
make copy-ssh-key HOST=qBittorrent # 10.0.10.91
```
**punimTagFE-dev (9101):** Only LXC with **UFW active** — extend rules to deny inbound except 22 from admin subnet; still disable password auth.
---
## LXCs on pve10 (via `pct exec`)
| VMID | Name | IP | Status | SSH root | Password auth | UFW | fail2ban | Upgrades | Public services |
|------|------|-----|--------|----------|---------------|-----|----------|----------|-----------------|
| 210 | cal | 10.0.10.228 | running | without-password | **yes** | no | no | 0 | **3000**, 22 |
| 215 | caseware | 10.0.10.105 | running | without-password | **yes** | no | no | **40** | **80** (nginx), 22 |
| 216 | auto | 10.0.10.59 | running | without-password | **yes** | no | no | **40** | **80** (nginx), 22 |
**Inventory mapping:** `caseware` → 105, `auto` → 59.
### Fixes (pve10 LXCs)
```bash
# SSH harden caseware (215)
pct exec 215 -- sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
pct exec 215 -- systemctl reload sshd
# Patch
# Wrap in bash -c so both commands run inside the container, not on the host
pct exec 215 -- bash -c 'apt update && apt upgrade -y'
pct exec 216 -- bash -c 'apt update && apt upgrade -y'
# Deploy keys from Mac
make copy-ssh-key HOST=caseware
make copy-ssh-key HOST=auto
```
**HTTP port 80 on caseware/auto:** Ensure TLS termination on Caddy (inventory host `caddy` 10.0.10.50) and no plain HTTP from WAN if exposed.
---
## SSH hardening checklist (all Linux targets)
Use this order to avoid lockout:
1. Confirm your key works: `ssh -o BatchMode=yes root@<ip> true`
2. Set `PasswordAuthentication no`
3. Set `PermitRootLogin prohibit-password` (the LXCs already report `without-password`, the deprecated alias for the same keys-only setting)
4. `sshd -t && systemctl reload sshd`
5. Open **second terminal** and test before closing first
6. Optional: change SSH port, `MaxAuthTries 4`, disable `X11Forwarding`
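Steps 2, 3 and 6 can be applied idempotently with a small helper. This is a sketch, not the `roles/ssh` implementation; the function name `set_sshd_option` is hypothetical. It replaces the first active or commented occurrence of a keyword, drops later duplicates, and appends the keyword if it is absent.

```bash
#!/usr/bin/env bash
# Sketch: set one sshd_config keyword in place without leaving duplicates.
# Usage: set_sshd_option <file> <Keyword> <value>
set_sshd_option() {
  local file="$1" key="$2" value="$3" tmp
  tmp="$(mktemp)"
  awk -v key="$key" -v value="$value" '
    BEGIN { pattern = "^[# \t]*" key "[ \t]" }
    $0 ~ pattern {
      # Emit the desired setting once; skip any further matches.
      if (!done) { print key " " value; done = 1 }
      next
    }
    { print }
    END { if (!done) print key " " value }
  ' "$file" > "$tmp" && cat "$tmp" > "$file"
  rm -f "$tmp"
}
```

Example: `set_sshd_option /etc/ssh/sshd_config PasswordAuthentication no`, followed by step 4 and the second-terminal test in step 5. On Debian, files under `/etc/ssh/sshd_config.d/` are included first and take precedence, so check there if `sshd -T` still reports the old value.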
**Ansible alignment:**
```bash
# After keys on host
make dev HOST=<hostname> --tags security
# or role ssh via playbooks that include roles/ssh
```
---
## Re-run audits
```bash
# Hypervisor full audit
ssh root@10.0.10.201 'bash -s' < scripts/security-audit-remote.sh
ssh root@10.0.10.10 'bash -s' < scripts/security-audit-remote.sh
# Hypervisor SSH-only
ssh root@10.0.10.201 'bash -s' < scripts/security-audit-ssh.sh
# All LXCs on a node
ssh root@10.0.10.201 'bash -s' < scripts/security-audit-lxc-via-pve.sh
ssh root@10.0.10.10 'bash -s' < scripts/security-audit-lxc-via-pve.sh
```
---
## Tracking
| Item | Date completed | Status |
|------|-------|--------|
| SSH keys caseware, auto, cal, vikunja, mailcow, listmonk | 2026-05-23 | ☑ |
| Fleet `apt upgrade` (no reboot) | 2026-05-23 | ☑ all previously failed hosts fixed |
| Tier 1 cron (journal + apt) | 2026-05-23 | ☑ PVE + most hosts via Ansible |
| Tier 2 cron (docker prune) | 2026-05-23 | ☑ identity, monitoring, vikunja, git-ci-01 |
| VM 104 RAM 72→64 GiB | 2026-05-23 | ☑ |
| Inventory `vikunja` rename | 2026-05-23 | ☑ |
| Fix `host_vars` ansibleVM / listmonk merge | 2026-05-23 | ☑ plain YAML (review `*.vault-bak`) |
| SSH harden pve201 | | ☐ |
| SSH harden pve10 | | ☐ |
| Restrict 8006 on both nodes | | ☐ |
| fail2ban on hypervisors | | ☐ |
| `make security` on production groups | | ☐ |
| Disable password SSH on all LXCs | | ☐ |
| `copy-ssh-keys` remaining inventory | | ☐ partial |
| TLS / localhost bind for :8080 services | | ☐ |
| unattended-upgrades all production | | ☐ |
| Tailscale re-auth | | ⏸ deferred (UniFi VPN) |
| Fix ZFS NAS.SP00 on pve10 | | ☐ |
| caddy Ansible as root | 2026-05-23 | ☑ |
| vaultwardenVM / ansibleVM become in vault | 2026-05-23 | ☑ |
| Add GPU-Dev `10.0.10.122` to inventory | | ☐ |
| Ollama bind localhost + optional Open WebUI | | ☐ |
---
## Next steps (priority)
1. **`make security`** on one site host (e.g. caseware) with a second SSH session open — disable password SSH, enable UFW + fail2ban (`ignoreip` = LAN + VPN pool).
2. **Restrict Proxmox :8006** to `10.0.10.0/24` + VPN subnet on pve201 and pve10.
3. **Bind internal Docker ports** on identity / monitoring / vaultwarden to `127.0.0.1` after confirming Caddy routes.
4. **GPU-Dev:** point clients at `http://10.0.10.122:11434` over VPN; tune Ollama env vars; add host to inventory when automating.
5. **unattended-upgrades** on production LXCs (reboot policy manual).
6. Review `host_vars/*.vault-bak` and merge any secrets still needed into vault + plain host_vars.
---
## References
- **[Security remediation plan](security-remediation-plan.md)** — phased fixes (critical → low) and login model
- [Security hardening guide](security.md)
- [SECURITY_HARDENING_PLAN.md](../SECURITY_HARDENING_PLAN.md)
- Role defaults: `roles/ssh/defaults/main.yml`