ansible/docs/guides/security-remediation-plan.md
ilia 8a507eddee
Fix CI: ansible-lint playbook schema and markdownlint for new guides.
Use ansible.builtin.su, spaces in caddy blockinfile, relax MD060/MD036
and line length for homelab documentation tables.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-22 17:10:33 -04:00


Security Remediation Plan

Based on: security-audit-report.md (2026-05-20)
Goal: Align hosts with roles/ssh (keys only, no password SSH) without locking yourself out.


How you should log in (not “ladmin → root” everywhere)

Your inventory uses different users on purpose. After hardening, the pattern is:

| Host type | Inventory user | How you work | Root access |
| --- | --- | --- | --- |
| Proxmox (pve201, pve10) | root | ssh root@10.0.10.201 with your SSH key | Direct root (keys only, no password) |
| Dev / QA (dev01, git-ci-01, …) | ladmin (or beast, master) | ssh ladmin@host with key | sudo for admin tasks; Ansible become: true |
| Services (caddy, jellyfin, …) | often root | ssh root@host with key | Direct root (keys only) |
| Optional bootstrap | | make bootstrap-root-ssh HOST=x | One-time: key on ladmin → su to install root key → then harden SSH |

You do not need “SSH ladmin then su root” on Proxmox if you keep managing them as root in inventory — you need root + SSH key + passwords disabled.

You do use ladmin → sudo on dev/qa boxes where ansible_user=ladmin. That is normal: unprivileged (or sudo) login + elevation, not password guessing on root.

PermitRootLogin prohibit-password means: root may log in only with a key, never with a password. It does not mean “ban root; use ladmin only.”

PasswordAuthentication no means: nobody (root, ladmin, etc.) can SSH with a password — keys only.
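To confirm what a host actually enforces, `sshd -T` prints the effective configuration with lowercase keywords. The following is a minimal sketch; the helper name `effective_auth_settings` is hypothetical and not part of the repo. Older OpenSSH releases print `without-password`, the deprecated alias of `prohibit-password`.

```shell
# Hypothetical helper: reduce `sshd -T` output (read from stdin) to the two
# directives discussed above. sshd -T prints keywords in lowercase.
effective_auth_settings() {
  grep -E '^(permitrootlogin|passwordauthentication) '
}

# Usage on a host, as root:
#   sshd -T | effective_auth_settings
```

After hardening, the expected values are `prohibit-password` (or `without-password`) and `no`.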


Phases overview

| Phase | When | Focus |
| --- | --- | --- |
| 0: Backup + prep | Before any change | Snapshots, sshd copies, git commit, keys, second SSH session |
| 1: Critical | Week 1 | Proxmox SSH + 8006, keys everywhere, RAM on 201 |
| 2: High | Week 1–2 | LXCs SSH, fail2ban, patching, app ports |
| 3: Medium | Week 2–4 | unattended-upgrades, Ansible make security, TLS |
| 4: Low | Ongoing | rpcbind, naming, stopped CTs, Mac, docs |

Phase 0 — Backup (before any hardening)

Yes — back up first. SSH and firewall mistakes can lock you out; patches can break services. Use the right backup type per layer.

What to back up (by layer)

| Layer | What | Method | Rollback if SSH breaks |
| --- | --- | --- | --- |
| Your Mac | Ansible repo + ~/.ansible-vault-pass (secure copy) + SSH keys | Time Machine / git commit / copy ~/.ssh | N/A |
| Proxmox hosts | /etc/ssh/sshd_config, /etc/pve/, firewall rules | Copy files + Proxmox snapshot optional | Console in web UI (pct enter / VM console) |
| Each LXC/VM | Full guest state | Proxmox snapshot or vzdump | Restore snapshot or rollback CT |
| Dev workstations | OS + home (if Timeshift installed) | make timeshift-snapshot HOST=dev02 | make timeshift-restore |
| Central PBS | Not reliable today: 10.0.10.200 unreachable | Fix PBS later; don't depend on it for this work | |

0A — Mac / repo (5 minutes)

cd ~/Documents/code/ansible
git status
git add -A && git commit -m "Pre-security-hardening baseline"   # if you want a restore point

# Store vault passphrase somewhere safe (password manager), NOT only on disk
# Optional: encrypted copy of ~/.ansible-vault-pass offline

0B — Proxmox: config files (both nodes)

for pve in 10.0.10.201 10.0.10.10; do
  ssh root@$pve "mkdir -p /root/pre-hardening-$(date +%Y%m%d) && \
    cp -a /etc/ssh/sshd_config /root/pre-hardening-$(date +%Y%m%d)/ && \
    cp -a /etc/pve /root/pre-hardening-$(date +%Y%m%d)/pve-etc 2>/dev/null; \
    ls -la /root/pre-hardening-$(date +%Y%m%d)/"
done

Running LXCs on pve201 (from audit): 301–308 and 9101. Snapshot each one before making SSH changes via pct exec.

Running LXCs on pve10: 210, 215, 216.

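# On pve201: snapshot (fast, local-lvm; needs free space)
# The awk field references are escaped so the remote shell does not expand them;
# $NF picks the Name column whether or not the Lock column is filled in.
ssh root@10.0.10.201 'for id in 301 302 303 304 305 306 307 308 9101; do
  name=$(pct list | awk -v i="$id" "\$1==i {print \$NF}")
  echo "Snapshot vmid=$id ($name)"
  pct snapshot "$id" pre-ssh-hardening-$(date +%Y%m%d) || echo "FAILED $id"
done'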

# On pve10
ssh root@10.0.10.10 'for id in 210 215 216; do
  pct snapshot $id pre-ssh-hardening-$(date +%Y%m%d) || echo "FAILED $id"
done'

Optional full backup (slower, larger) — important CTs only if snapshots fail (low disk on 201):

vzdump <vmid> --storage local --mode snapshot --compress zstd

Check space on pve201 first (~2.5 GB RAM + disk — snapshot needs free space on local-lvm):

ssh root@10.0.10.201 'pvesm status; free -h'

If snapshots fail for lack of space: do 0B only on PVE, then harden SSH using Proxmox console as safety net (no snapshot).
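Before snapshotting, it can help to read the usage of `local-lvm` as a single number. This is a minimal sketch that assumes the usual `pvesm status` layout (storage name in the first column, usage percentage in the last); the helper name `storage_used_pct` is hypothetical.

```shell
# Hypothetical helper: read `pvesm status` output on stdin and print the
# usage percentage (last column, "%" stripped) for one storage name.
storage_used_pct() {
  awk -v name="$1" '$1 == name { gsub(/%/, "", $NF); print $NF }'
}

# Usage on pve201:
#   pvesm status | storage_used_pct local-lvm
```

If the number is already high, skipping snapshots and relying on the console fallback described above may be the safer choice.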

0D — Inventory VMs with Timeshift (dev group)

Only where Timeshift is already installed (e.g. dev02):

make timeshift-snapshot HOST=dev02
make timeshift-list HOST=dev02

Not used on Proxmox or most LXCs by default.

0E — Export current SSH settings (audit trail)

mkdir -p ~/security-hardening-backup-$(date +%Y%m%d)
ssh root@10.0.10.201 'bash -s' < scripts/security-audit-ssh.sh > ~/security-hardening-backup-$(date +%Y%m%d)/pve201-ssh.txt
ssh root@10.0.10.10  'bash -s' < scripts/security-audit-ssh.sh > ~/security-hardening-backup-$(date +%Y%m%d)/pve10-ssh.txt
ssh root@10.0.10.201 'bash -s' < scripts/security-audit-lxc-via-pve.sh > ~/security-hardening-backup-$(date +%Y%m%d)/pve201-lxc.txt

Backup exit criteria (do not skip)

  • Git commit (or branch) for ansible repo
  • sshd_config (+ optional /etc/pve) copied on both PVE nodes
  • Proxmox snapshots or documented reason skipped (disk/RAM)
  • Second SSH session tested to pve201 / pve10
  • You know how to open Proxmox → VM/CT → Console if SSH fails

Rollback quick reference

| Problem | Rollback |
| --- | --- |
| Bad sshd_config on PVE | Console → restore /root/pre-hardening-*/sshd_config → systemctl reload sshd |
| Bad LXC SSH | pct rollback <vmid> pre-ssh-hardening-YYYYMMDD |
| Bad patch on CT | Same snapshot rollback |
| Locked out of LAN on 8006 | Console → disable the datacenter firewall rule (or stop the firewall with pve-firewall stop) |

Phase 0 — Prep (after backups)

| # | Task | Command / notes |
| --- | --- | --- |
| 0.1 | Confirm vault password file | ~/.ansible-vault-pass |
| 0.2 | Bootstrap control node | make bootstrap |
| 0.3 | Verify key on Proxmox | ssh -o BatchMode=yes root@10.0.10.201 true |
| 0.4 | Copy keys to inventory | make copy-ssh-keys (or per group) |
| 0.5 | Document admin IP | e.g. 10.0.10.127 for firewall rules |
| 0.6 | Open second terminal before changing sshd | Test login before closing first session |

Exit criteria: Backups done (above) + key login works to pve201, pve10, and hosts you will harden next.
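The key-login exit criterion can be checked for several hosts in one pass. A minimal sketch, assuming a hypothetical helper named `check_key_login`; `BatchMode=yes` makes ssh fail rather than fall back to a password prompt, so a success here means the key really works.

```shell
# Hypothetical helper: succeed only if key-based login works for every target.
# BatchMode=yes makes ssh fail instead of prompting for a password.
check_key_login() {
  rc=0
  for target in "$@"; do
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$target" true </dev/null; then
      echo "OK   $target"
    else
      echo "FAIL $target"
      rc=1
    fi
  done
  return "$rc"
}

# Usage:
#   check_key_login root@10.0.10.201 root@10.0.10.10
```

Any FAIL line means that host is not ready for password authentication to be switched off.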


Phase 1 — Critical

1.1 Proxmox SSH (pve201 + pve10)

Issue: PermitRootLogin yes + PasswordAuthentication yes — password brute force on root.

Fix (per host, after 0.3):

# On pve201 OR pve10 — keep existing session open!
sed -i 's/^#*PermitRootLogin.*/PermitRootLogin prohibit-password/' /etc/ssh/sshd_config
sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
sshd -t && systemctl reload sshd

Verify (new terminal): ssh -o BatchMode=yes root@10.0.10.201 true
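The sed commands above only rewrite lines that already exist in the file. As a hedged alternative, the sketch below also adds the directive when it is missing; `set_sshd_option` is a hypothetical helper, not something shipped in the repo. It places a missing directive on the first line because sshd uses the first value it reads for each keyword, and the top of the file is outside any Match block.

```shell
# Hypothetical helper: set KEY to VALUE in an sshd_config-style FILE.
# Rewrites an existing (possibly commented-out) line; if none exists, puts
# the directive on the first line so it sits outside any Match block.
set_sshd_option() {
  file=$1 key=$2 value=$3
  tmp=$(mktemp) || return 1
  sed "s/^#*${key}[[:space:]].*/${key} ${value}/" "$file" > "$tmp"
  if ! grep -q "^${key} " "$tmp"; then
    { printf '%s %s\n' "$key" "$value"; cat "$tmp"; } > "$tmp.new"
    mv "$tmp.new" "$tmp"
  fi
  # Overwrite in place (keeps the original file's owner and mode).
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Usage on the host, with a second session still open:
#   set_sshd_option /etc/ssh/sshd_config PermitRootLogin prohibit-password
#   set_sshd_option /etc/ssh/sshd_config PasswordAuthentication no
#   sshd -t && systemctl reload sshd
```

Files under /etc/ssh/sshd_config.d/ can still set these keywords if the main config includes them, so checking the result with `sshd -T` remains worthwhile.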

Ansible (later): dedicated play for [proxmox] with roles/ssh (today make security only targets dev playbook).

| Host | Priority |
| --- | --- |
| pve201 | P0 |
| pve10 | P0 |

1.2 Restrict Proxmox UI/API (port 8006)

Issue: Anyone on LAN can hit full cluster API.

Fix (choose one):

  • A — Proxmox firewall (recommended): Datacenter → Firewall → add rule: accept 8006 from 10.0.10.0/24 and/or your Mac IP; drop others.
  • B — SSH tunnel only: no LAN exposure; ssh -L 8006:127.0.0.1:8006 root@10.0.10.201 → browser https://127.0.0.1:8006.

Do not block 8006 globally without A or B in place.
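Whichever option is chosen, the result can be verified from a machine that should no longer reach the UI. A minimal sketch, assuming curl is available there; `ui_reachable` is a hypothetical helper name, and `-k` is used only because Proxmox typically serves a self-signed certificate.

```shell
# Hypothetical helper: succeed if the Proxmox UI on HOST answers on port 8006
# within three seconds. -k skips certificate validation (self-signed cert).
ui_reachable() {
  curl -k -s -o /dev/null --connect-timeout 3 "https://$1:8006/"
}

# Usage, from a machine outside the allowed source range:
#   ui_reachable 10.0.10.201 && echo "8006 still reachable" || echo "8006 blocked"
```

Running the same check from the admin machine should still succeed with option A, and should fail with option B unless the tunnel is open and the target is 127.0.0.1.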


1.3 RAM on pve201 (~2.5 GB free)

Issue: New guests or updates risk OOM.

Fix:

ssh root@10.0.10.201 'free -h; pct list'
# Stop non-essential CTs/VMs or migrate workload to pve10

Review running guests from make proxmox-info ALL=true; stop labs you do not need.


1.4 Deploy SSH keys to unreachable inventory hosts

Issue: Cannot audit or Ansible-manage hosts without keys.

Order:

  1. make copy-ssh-key HOST=caddy (and each [services] host)
  2. make bootstrap-root-ssh HOST=listmonk where root password still works but key does not
  3. make copy-ssh-keys GROUP=qa for ladmin hosts

Exit criteria: make ping succeeds for each group you will harden in phase 2.


Phase 2 — High

2.1 LXC SSH — disable password auth (all running CTs)

Issue: passwordauthentication yes on every audited LXC.

Fix from Proxmox host (no Mac SSH to CT required):

# pve201: repeat for each running VMID
for id in 301 302 303 304 305 306 307 308 9101; do
  pct exec "$id" -- sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
  # Reload only if the config validates; the unit is named sshd or ssh depending on the distro
  pct exec "$id" -- bash -c 'sshd -t && (systemctl reload sshd || systemctl reload ssh)' || echo "FAILED $id"
done

# pve10
for id in 210 215 216; do
  pct exec "$id" -- sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
  pct exec "$id" -- bash -c 'sshd -t && (systemctl reload sshd || systemctl reload ssh)' || echo "FAILED $id"
done

Before disable: install your key on CTs you need (make copy-ssh-key HOST=vikanjans, etc.).

Note: CTs already have permitrootlogin without-password — keep that; only turn off passwords.
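Hard-coded VMID lists drift as containers are added or destroyed. As a sketch, the running IDs can be derived from `pct list` instead; `running_ctids` is a hypothetical helper that assumes the usual column order (VMID, Status, Lock, Name).

```shell
# Hypothetical helper: read `pct list` output on stdin and print the VMIDs
# of running containers (column 1 = VMID, column 2 = Status, header skipped).
running_ctids() {
  awk 'NR > 1 && $2 == "running" { print $1 }'
}

# Usage on a PVE node, to confirm the setting in every running CT:
#   for id in $(pct list | running_ctids); do
#     echo "== $id"; pct exec "$id" -- sshd -T | grep -i '^passwordauthentication'
#   done
```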


2.2 fail2ban on hypervisors

Issue: No brute-force protection on SSH (and eventually 8006 if proxied).

ssh root@10.0.10.201 'apt install -y fail2ban && systemctl enable --now fail2ban'
ssh root@10.0.10.10  'apt install -y fail2ban && systemctl enable --now fail2ban'

Optional: extend to high-value LXCs via roles/monitoring_server or manual install.
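Installing the package is usually not the whole job; an explicit sshd jail makes the thresholds visible. The sketch below writes a small override file. The helper name `write_sshd_jail` and the threshold values are illustrative assumptions, not settings taken from the audit.

```shell
# Hypothetical helper: write a minimal sshd jail override to the given path
# (default /etc/fail2ban/jail.local). Thresholds are example values.
write_sshd_jail() {
  target=${1:-/etc/fail2ban/jail.local}
  cat > "$target" <<'EOF'
[sshd]
enabled  = true
maxretry = 5
findtime = 10m
bantime  = 1h
EOF
}

# Usage on each hypervisor, as root:
#   write_sshd_jail && systemctl restart fail2ban
#   fail2ban-client status sshd
```

On hosts that log only to the systemd journal, the jail may also need `backend = systemd`; if `fail2ban-client status sshd` reports an error, that is the first thing to check.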


2.3 Patch backlog

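| Target | Pending | Action |
| --- | --- | --- |
| pve201 | ~105 | apt update && apt dist-upgrade -y (maintenance window; Proxmox recommends dist-upgrade rather than plain upgrade) |
| pve10 | ~92 | same |
| LXCs 303, 306, 307, 9101 | 79–89 | pct exec <id> -- bash -c 'apt update && apt upgrade -y' (without bash -c, the second command runs on the host) |
| caseware, auto (pve10) | ~40 | same |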

Order: hypervisors first (after snapshot), then LXCs one by one.
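The pending counts in the table can be refreshed without changing anything by using apt's simulation mode. A minimal sketch; `count_pending_upgrades` is a hypothetical helper that counts the `Inst` lines apt-get prints for each package it would install or upgrade.

```shell
# Hypothetical helper: count packages that a simulated upgrade would install
# or upgrade. Reads `apt-get -s dist-upgrade` output on stdin; each affected
# package appears on a line starting with "Inst".
count_pending_upgrades() {
  grep -c '^Inst ' || true
}

# Usage:
#   apt-get -s dist-upgrade | count_pending_upgrades                      # on a host
#   pct exec <vmid> -- apt-get -s dist-upgrade | count_pending_upgrades   # for a CT
```

Run `apt update` first so the simulation works from current package lists.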


2.4 Application ports on 0.0.0.0

Issue: HTTP services exposed on LAN without TLS/auth.

| LXC / host | Port | Fix |
| --- | --- | --- |
| qbit (91) | 8080 | Prefer VPN; or Caddy + auth; bind to internal IP |
| searchXNG (70) | 8080 | Same |
| punimTagFE (121) | 8000 | Behind Caddy; firewall allow only 10.0.10.0/24 |
| vaultwarden (142) | 8080 | Already in inventory; reverse proxy + TLS |
| portfolio | 106:80 (pve10 LXC 219, nginx) | Migrated 2026-05-22; pve201 LXC 306 destroyed |
| vikunja (159) | 3456 | Proxy via Caddy (todo.levkin.ca) |

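Pattern: the app should not listen on 0.0.0.0. If Caddy runs on the same host, bind the app to 127.0.0.1; otherwise bind it to the guest's LAN IP and firewall the port so that only Caddy (10.0.10.50) can reach it. Caddy terminates TLS for the public URLs in inventory.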
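To find services that still listen on every interface, the output of `ss` can be filtered inside each guest. A minimal sketch, assuming iproute2's `ss` is installed; `wildcard_listeners` is a hypothetical helper name.

```shell
# Hypothetical helper: read `ss -tlnH` output on stdin and print listening
# sockets bound to every interface (0.0.0.0, [::] or *). Column 4 is the
# local address:port.
wildcard_listeners() {
  awk '$4 ~ /^(0\.0\.0\.0|\[::\]|\*):/ { print $4 }'
}

# Usage inside a guest (or via pct exec <vmid> -- ss -tlnH from the PVE host):
#   ss -tlnH | wildcard_listeners
```

Port 22 will normally appear in the list; the entries to act on are the application ports from the table.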


2.5 pve10 infrastructure

| Issue | Fix |
| --- | --- |
| ZFS NAS.SP00 suspended | zpool status; import/clear errors |
| PBS 10.0.10.200 unreachable | Fix network/service or remove stale datastore |
| Load ~30 | Identify heavy VMs; migrate or stop |

Phase 3 — Medium

3.1 unattended-upgrades

Hypervisors + important LXCs:

apt install -y unattended-upgrades apt-listchanges
dpkg-reconfigure -plow unattended-upgrades
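For non-interactive installs (for example through pct exec), the two settings that the dpkg-reconfigure dialog enables can be written directly. A minimal sketch; `write_auto_upgrades` is a hypothetical helper, and the default path is the file that dialog normally manages.

```shell
# Hypothetical helper: write the two APT periodic settings that enable daily
# package-list refresh and unattended upgrades.
write_auto_upgrades() {
  target=${1:-/etc/apt/apt.conf.d/20auto-upgrades}
  cat > "$target" <<'EOF'
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
EOF
}

# Usage as root, followed by a dry run to see what would be installed:
#   write_auto_upgrades
#   unattended-upgrade --dry-run --debug
```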

3.2 Ansible security roles (by group)

Today make security runs playbooks/development.yml on dev only.

Expand with new/changed playbooks:

| Group | Playbook idea | Roles |
| --- | --- | --- |
| [proxmox] | playbooks/infrastructure/proxmox-hardening.yml | ssh, monitoring_server |
| [services] | extend playbooks/servers.yml | ssh, base, fail2ban |
| [qa] | tag run on qa hosts | ssh |
| LXCs | optional pct + Ansible over SSH after keys | ssh |

Workflow:

make check HOST=pve201          # after proxmox play exists
make dev HOST=dev01 --tags security

3.3 UFW on LXCs

Only punimTagFE-dev has UFW today. Template for others:

  • Allow 22 from 10.0.10.0/24
  • Allow app port only if needed on LAN
  • Default deny incoming

Use roles/ssh UFW tasks or Proxmox guest firewall (firewall=1 on net0).
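For a manual install, the template above maps onto a handful of ufw commands. This is a sketch under assumptions: `ufw_baseline`, `LAN_CIDR`, `APP_PORT` and `DRY_RUN` are hypothetical names introduced here. The SSH rule is added before the firewall is enabled so the current session is not cut off.

```shell
# Hypothetical helper: apply the template above with ufw. With DRY_RUN=1 the
# commands are printed instead of executed. APP_PORT is optional.
ufw_baseline() {
  lan=${LAN_CIDR:-10.0.10.0/24}
  run() { if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi; }
  run ufw default deny incoming
  run ufw allow from "$lan" to any port 22 proto tcp
  if [ -n "${APP_PORT:-}" ]; then
    run ufw allow from "$lan" to any port "$APP_PORT" proto tcp
  fi
  run ufw --force enable
}

# Usage inside a guest, as root:
#   DRY_RUN=1 APP_PORT=8000 ufw_baseline   # preview the commands
#   APP_PORT=8000 ufw_baseline             # apply them
```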

3.4 Align names / inventory

| Proxmox name | Ansible | Action |
| --- | --- | --- |
| punimTagFE-dev | punimTag-dev | Rename CT or update app_projects name |
| vikunja-debian | vikanjans | OK (IP 159) |
| qbit-debian | qBittorrent | OK (IP 91) |

3.5 Mac (control machine)

| Issue | Fix |
| --- | --- |
| Firewall off | System Settings → Firewall → On |
| FileVault off | Enable FileVault |
| Docker on *:3000 | Bind to 127.0.0.1 unless LAN needed |

Phase 4 — Low

| Item | Fix |
| --- | --- |
| rpcbind (111) on pve201 / 9101 | Disable if unused (for example, no NFS storage): systemctl disable --now rpcbind.service rpcbind.socket |
| X11Forwarding on Proxmox | Set X11Forwarding no in sshd_config |
| Stopped CTs 9001, 9401 | Leave stopped or destroy if unused |
| make security-audit target | Add Makefile target → runs audit scripts, appends to report |
| Quarterly re-audit | Re-run scripts/security-audit-lxc-via-pve.sh |

Suggested calendar

| Week | Critical | High | Medium |
| --- | --- | --- | --- |
| 1 | 0.x prep, 1.1 SSH both PVE, 1.2 firewall 8006, 1.4 keys | 2.1 LXC passwords off (after keys), 2.2 fail2ban | |
| 2 | 1.3 RAM 201 | 2.3 patch PVE + LXCs, 2.4 Caddy for 8080 services | 3.1 unattended-upgrades |
| 3 | | 2.5 pve10 ZFS/PBS/load | 3.2 Ansible plays for proxmox + services |
| 4 | | | 3.3 UFW, 3.4 naming, 3.5 Mac |

Rollback (if locked out of SSH)

  • Proxmox: use console in web UI (or physical/IPMI) → edit /etc/ssh/sshd_config → set PasswordAuthentication yes temporarily → reload sshd.
  • LXC: pct enter <vmid> from PVE host.

Tracking checklist

Copy into your issue tracker or tick in security-audit-report.md:

Backup (Phase 0 — before everything)

  • Git commit / branch for ansible repo
  • PVE sshd_config backup on 201 + 10
  • Proxmox CT snapshots (or vzdump) on critical LXCs
  • Audit outputs saved locally (security-hardening-backup-*)
  • Console access tested in Proxmox UI

Critical

  • pve201 SSH: prohibit-password + no passwords
  • pve10 SSH: same
  • 8006 restricted to admin subnet/IP
  • SSH keys on all inventory hosts
  • pve201 RAM relieved

High

  • All running LXCs: PasswordAuthentication no
  • fail2ban on pve201 + pve10
  • Patch pve201, pve10, LXCs with 40+ upgrades
  • qBit / searchXNG / punimTag / vaultwarden port exposure reduced
  • pve10 ZFS + PBS investigated

Medium

  • unattended-upgrades on PVE + key LXCs
  • make security (or new plays) for proxmox, services, qa
  • UFW on critical LXCs
  • Mac firewall + FileVault

Low

  • rpcbind, X11, audit Makefile, naming cleanup

Quick reference: your login after plan

# Proxmox
ssh root@10.0.10.201          # key only

# Dev / QA
ssh ladmin@10.0.10.223        # key only → sudo -i when you need root

# Services (inventory root)
ssh root@10.0.10.50           # key only

# Proxmox UI (if 8006 restricted)
ssh -L 8006:127.0.0.1:8006 root@10.0.10.201
# → https://127.0.0.1:8006