ansible/docs/guides/security-remediation-plan.md
ilia 8a507eddee
Fix CI: ansible-lint playbook schema and markdownlint for new guides.
Use ansible.builtin.su, spaces in caddy blockinfile, relax MD060/MD036
and line length for homelab documentation tables.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-22 17:10:33 -04:00


# Security Remediation Plan
**Based on:** [security-audit-report.md](security-audit-report.md) (2026-05-20)
**Goal:** Align hosts with `roles/ssh` (keys only, no password SSH) without locking yourself out.
---
## How you should log in (not “ladmin → root” everywhere)
Your inventory uses **different users on purpose**. After hardening, the pattern is:
| Host type | Inventory user | How you work | Root access |
|-----------|----------------|--------------|-------------|
| **Proxmox** (`pve201`, `pve10`) | `root` | `ssh root@10.0.10.201` with **your SSH key** | Direct root (keys only, no password) |
| **Dev / QA** (`dev01`, `git-ci-01`, …) | `ladmin` (or `beast`, `master`) | `ssh ladmin@host` with **key** | `sudo` for admin tasks; Ansible `become: true` |
| **Services** (caddy, jellyfin, …) | often `root` | `ssh root@host` with **key** | Direct root (keys only) |
| **Optional bootstrap** | n/a | `make bootstrap-root-ssh HOST=x` | One-time: key on `ladmin` → `su` to install **root** key → then harden SSH |
**You do not need** “SSH ladmin then su root” on Proxmox if you keep managing them as `root` in inventory — you need **root + SSH key + passwords disabled**.
**You do** use ladmin → sudo on dev/qa boxes where `ansible_user=ladmin`. That is normal: unprivileged (or sudo) login + elevation, not password guessing on root.
**`PermitRootLogin prohibit-password`** means: root may log in **only with a key**, never with a password. It does **not** mean “ban root; use ladmin only.”
**`PasswordAuthentication no`** means: **nobody** (root, ladmin, etc.) can SSH with a password — keys only.
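To confirm what a host actually enforces, the effective values can be read from `sshd -T` instead of the config file. The following is a minimal sketch, assuming `sshd -T` output is piped in; the function name `check_sshd_policy` is hypothetical. It accepts both `prohibit-password` and the older spelling `without-password`, since either may be reported.

```bash
# Hypothetical helper: read `sshd -T` output on stdin and report whether the
# two settings described above are in effect. Returns 0 only if both are.
check_sshd_policy() {
  awk '
    $1 == "permitrootlogin"        { root = $2 }
    $1 == "passwordauthentication" { pw = $2 }
    END {
      ok = 1
      if (root != "prohibit-password" && root != "without-password") {
        print "FAIL permitrootlogin=" root; ok = 0
      }
      if (pw != "no") { print "FAIL passwordauthentication=" pw; ok = 0 }
      if (ok) print "OK keys only"
      exit !ok
    }'
}

# Usage (as root on the target host):
#   sshd -T | check_sshd_policy
```

Because it reads the effective configuration, this also catches a setting that is overridden somewhere other than the main `sshd_config`.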
---
## Phases overview
| Phase | When | Focus |
|-------|------|--------|
| **0: Backup + prep** | Before any change | Snapshots, `sshd` copies, git commit, keys, second SSH session |
| **1: Critical** | Week 1 | Proxmox SSH + 8006, keys everywhere, RAM on 201 |
| **2: High** | Week 1–2 | LXCs SSH, fail2ban, patching, app ports |
| **3: Medium** | Week 2–4 | unattended-upgrades, Ansible `make security`, TLS |
| **4: Low** | Ongoing | rpcbind, naming, stopped CTs, Mac, docs |
---
## Phase 0 — Backup (before any hardening)
**Yes — back up first.** SSH and firewall mistakes can lock you out; patches can break services. Use the right backup type per layer.
### What to back up (by layer)
| Layer | What | Method | Rollback if SSH breaks |
|-------|------|--------|-------------------------|
| **Your Mac** | Ansible repo + `~/.ansible-vault-pass` (secure copy) + SSH keys | Time Machine / git commit / copy `~/.ssh` | N/A |
| **Proxmox hosts** | `/etc/ssh/sshd_config`, `/etc/pve/`, firewall rules | Copy files + **Proxmox snapshot** optional | **Console** in web UI (`pct enter` / VM console) |
| **Each LXC/VM** | Full guest state | **Proxmox snapshot** or `vzdump` | Restore snapshot or rollback CT |
| **Dev workstations** | OS + home (if Timeshift installed) | `make timeshift-snapshot HOST=dev02` | `make timeshift-restore` |
| **Central PBS** | n/a | **Not reliable today**: `10.0.10.200` unreachable | Fix PBS later; don't depend on it for this work |
### 0A — Mac / repo (5 minutes)
```bash
cd ~/Documents/code/ansible
git status
git add -A && git commit -m "Pre-security-hardening baseline" # if you want a restore point
# Store vault passphrase somewhere safe (password manager), NOT only on disk
# Optional: encrypted copy of ~/.ansible-vault-pass offline
```
### 0B — Proxmox: config files (both nodes)
```bash
for pve in 10.0.10.201 10.0.10.10; do
ssh root@$pve "mkdir -p /root/pre-hardening-$(date +%Y%m%d) && \
cp -a /etc/ssh/sshd_config /root/pre-hardening-$(date +%Y%m%d)/ && \
cp -a /etc/pve /root/pre-hardening-$(date +%Y%m%d)/pve-etc 2>/dev/null; \
ls -la /root/pre-hardening-$(date +%Y%m%d)/"
done
```
### 0C — Proxmox: snapshots (recommended before SSH/firewall on PVE)
**Running LXCs on pve201** (from audit): 301–308 and 9101. Snapshot each before `pct exec` SSH changes.
**Running LXCs on pve10:** 210, 215, 216.
```bash
# On pve201: snapshot (fast, local-lvm; needs free space)
ssh root@10.0.10.201 'for id in 301 302 303 304 305 306 307 308 9101; do
  # awk fields are escaped (\$1, \$NF) so the remote shell does not expand them
  name=$(pct list | awk -v i="$id" "\$1==i {print \$NF}")
  echo "Snapshot vmid=$id ($name)"
  pct snapshot "$id" "pre-ssh-hardening-$(date +%Y%m%d)" || echo "FAILED $id"
done'
# On pve10
ssh root@10.0.10.10 'for id in 210 215 216; do
pct snapshot $id pre-ssh-hardening-$(date +%Y%m%d) || echo "FAILED $id"
done'
```
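After the loops finish, it may be worth confirming that each snapshot really exists before relying on it for rollback. The following is a minimal sketch, assuming the snapshot name appears as a whitespace-separated field in `pct listsnapshot` output; the helper name `has_snapshot` is hypothetical.

```bash
# Hypothetical helper: succeed if `pct listsnapshot <vmid>` output (on stdin)
# contains a snapshot with exactly the given name.
has_snapshot() {
  awk -v want="$1" '
    { for (i = 1; i <= NF; i++) if ($i == want) found = 1 }
    END { exit !found }'
}

# Usage on a PVE node:
#   snap="pre-ssh-hardening-$(date +%Y%m%d)"
#   for id in 210 215 216; do
#     pct listsnapshot "$id" | has_snapshot "$snap" || echo "MISSING $id"
#   done
```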
**Optional full backup** (slower, larger) — important CTs only if snapshots fail (low disk on 201):
```bash
vzdump <vmid> --storage local --mode snapshot --compress zstd
```
**Check space on pve201 first** (~2.5 GB RAM + disk — snapshot needs free space on `local-lvm`):
```bash
ssh root@10.0.10.201 'pvesm status; free -h'
```
If snapshots fail for lack of space: do **0B only** on PVE, then harden SSH using **Proxmox console** as safety net (no snapshot).
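The space check can be turned into a simple go/no-go test. The following is a minimal sketch, assuming `pvesm status` prints the columns `Name Type Status Total Used Available %`; the helper name `storage_free_pct` and the 20% threshold in the usage lines are illustrative assumptions, not values from the audit.

```bash
# Hypothetical helper: read `pvesm status` output on stdin and print the
# percentage of free space for one storage, computed from the Total and
# Available columns (so the unit of those columns does not matter).
storage_free_pct() {
  awk -v name="$1" '
    $1 == name && $4 > 0 { printf "%d\n", ($6 * 100) / $4; found = 1 }
    END { exit !found }'
}

# Usage from the Mac:
#   free_pct=$(ssh root@10.0.10.201 pvesm status | storage_free_pct local-lvm)
#   [ "$free_pct" -ge 20 ] || echo "local-lvm is low on space: do 0B only"
```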
### 0D — Inventory VMs with Timeshift (`dev` group)
Only where Timeshift is already installed (e.g. `dev02`):
```bash
make timeshift-snapshot HOST=dev02
make timeshift-list HOST=dev02
```
Not used on Proxmox or most LXCs by default.
### 0E — Export current SSH settings (audit trail)
```bash
mkdir -p ~/security-hardening-backup-$(date +%Y%m%d)
ssh root@10.0.10.201 'bash -s' < scripts/security-audit-ssh.sh > ~/security-hardening-backup-$(date +%Y%m%d)/pve201-ssh.txt
ssh root@10.0.10.10 'bash -s' < scripts/security-audit-ssh.sh > ~/security-hardening-backup-$(date +%Y%m%d)/pve10-ssh.txt
ssh root@10.0.10.201 'bash -s' < scripts/security-audit-lxc-via-pve.sh > ~/security-hardening-backup-$(date +%Y%m%d)/pve201-lxc.txt
```
### Backup exit criteria (do not skip)
- [ ] Git commit (or branch) for ansible repo
- [ ] `sshd_config` (+ optional `/etc/pve`) copied on **both** PVE nodes
- [ ] Proxmox snapshots **or** documented reason skipped (disk/RAM)
- [ ] Second SSH session tested to `pve201` / `pve10`
- [ ] You know how to open **Proxmox → VM/CT → Console** if SSH fails
### Rollback quick reference
| Problem | Rollback |
|---------|----------|
| Bad `sshd_config` on PVE | Console → restore `/root/pre-hardening-*/sshd_config` → `systemctl reload sshd` |
| Bad LXC SSH | `pct rollback <vmid> pre-ssh-hardening-YYYYMMDD` |
| Bad patch on CT | Same snapshot rollback |
| Locked out of LAN on 8006 | Console → disable the offending datacenter firewall rule (or stop the firewall with `pve-firewall stop`) |
---
## Phase 0 — Prep (after backups)
| # | Task | Command / notes |
|---|------|----------------|
| 0.1 | Confirm vault password file | `~/.ansible-vault-pass` |
| 0.2 | Bootstrap control node | `make bootstrap` |
| 0.3 | Verify key on Proxmox | `ssh -o BatchMode=yes root@10.0.10.201 true` |
| 0.4 | Copy keys to inventory | `make copy-ssh-keys` (or per group) |
| 0.5 | Document admin IP | e.g. `10.0.10.127` for firewall rules |
| 0.6 | Open **second terminal** before changing `sshd` | Test login before closing first session |
**Exit criteria:** Backups done (above) + key login works to `pve201`, `pve10`, and hosts you will harden next.
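The key-login exit criterion can be checked in one pass over several hosts. The following is a minimal sketch; the function name `check_key_logins` is hypothetical, and the targets in the usage line are the two Proxmox nodes from this plan.

```bash
# Hypothetical helper: try a key-only login to each target and report the
# result. BatchMode=yes makes ssh fail instead of prompting for a password,
# so a host that still depends on password login shows up as FAIL.
check_key_logins() {
  failed=0
  for target in "$@"; do
    if ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$target" true 2>/dev/null; then
      echo "OK   $target"
    else
      echo "FAIL $target"
      failed=1
    fi
  done
  return "$failed"
}

# Usage:
#   check_key_logins root@10.0.10.201 root@10.0.10.10
```

A non-zero return means at least one host is not ready for `PasswordAuthentication no`.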
---
## Phase 1 — Critical
### 1.1 Proxmox SSH (pve201 + pve10)
**Issue:** `PermitRootLogin yes` + `PasswordAuthentication yes` — password brute force on root.
**Fix (per host, after 0.3):**
```bash
# On pve201 OR pve10 — keep existing session open!
sed -i 's/^#*PermitRootLogin.*/PermitRootLogin prohibit-password/' /etc/ssh/sshd_config
sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
sshd -t && systemctl reload sshd
```
**Verify (new terminal):** `ssh -o BatchMode=yes root@10.0.10.201 true`
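The `sed` commands above only rewrite a directive that is already present (commented or not); if the line is missing, the file is left unchanged. The following is a minimal sketch of a replace-or-append variant, using a hypothetical helper named `set_sshd_option`. It takes the file path as an argument, so it can be tried on a copy first.

```bash
# Hypothetical helper: set "Key value" in an sshd_config-style file.
# Rewrites a matching line (with or without leading #) if one exists,
# otherwise appends the directive at the end of the file.
set_sshd_option() {
  file=$1 key=$2 value=$3
  if grep -Eq "^#*${key}[[:space:]]" "$file"; then
    sed -i -E "s/^#*${key}[[:space:]].*/${key} ${value}/" "$file"
  else
    printf '%s %s\n' "$key" "$value" >> "$file"
  fi
}

# Usage (keep the existing session open):
#   cp -a /etc/ssh/sshd_config /tmp/sshd_config.test
#   set_sshd_option /tmp/sshd_config.test PermitRootLogin prohibit-password
#   set_sshd_option /tmp/sshd_config.test PasswordAuthentication no
#   sshd -t -f /tmp/sshd_config.test
```

sshd uses the first value it reads for each keyword, so on systems whose `sshd_config` includes `/etc/ssh/sshd_config.d/*.conf` a drop-in file can still override the main file. Checking `sshd -T` after the reload remains worthwhile.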
**Ansible (later):** dedicated play for `[proxmox]` with `roles/ssh` (today `make security` only targets `dev` playbook).
| Host | Priority |
|------|----------|
| pve201 | P0 |
| pve10 | P0 |
---
### 1.2 Restrict Proxmox UI/API (port 8006)
**Issue:** Anyone on LAN can hit full cluster API.
**Fix (choose one):**
- **A — Proxmox firewall (recommended):** Datacenter → Firewall → add rule: accept `8006` from `10.0.10.0/24` and/or your Mac IP; drop others.
- **B — SSH tunnel only:** no LAN exposure; `ssh -L 8006:127.0.0.1:8006 root@10.0.10.201` → browser `https://127.0.0.1:8006`.
**Do not** block 8006 globally without A or B in place.
---
### 1.3 RAM on pve201 (~2.5 GB free)
**Issue:** New guests or updates risk OOM.
**Fix:**
```bash
ssh root@10.0.10.201 'free -h; pct list'
# Stop non-essential CTs/VMs or migrate workload to pve10
```
Review running guests from `make proxmox-info ALL=true`; stop labs you do not need.
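To make the RAM check repeatable, the available memory can be extracted from `free -m` and compared against a floor. The following is a minimal sketch, assuming the usual `free` layout where `available` is the last column of the `Mem:` row; the helper name `mem_available_mib` and the 4096 MiB floor in the usage lines are illustrative assumptions.

```bash
# Hypothetical helper: read `free -m` output on stdin and print the
# "available" value (MiB) from the Mem: row.
mem_available_mib() {
  awk '$1 == "Mem:" { print $7; found = 1 } END { exit !found }'
}

# Usage from the Mac:
#   avail=$(ssh root@10.0.10.201 free -m | mem_available_mib)
#   [ "$avail" -ge 4096 ] || echo "pve201: only ${avail} MiB available"
```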
---
### 1.4 Deploy SSH keys to unreachable inventory hosts
**Issue:** Cannot audit or Ansible-manage hosts without keys.
**Order:**
1. `make copy-ssh-key HOST=caddy` (and each `[services]` host)
2. `make bootstrap-root-ssh HOST=listmonk` where root password still works but key does not
3. `make copy-ssh-keys GROUP=qa` for `ladmin` hosts
**Exit criteria:** `make ping` succeeds for each group you will harden in phase 2.
---
## Phase 2 — High
### 2.1 LXC SSH — disable password auth (all running CTs)
**Issue:** `passwordauthentication yes` on every audited LXC.
**Fix from Proxmox host (no Mac SSH to CT required):**
```bash
# pve201: example for each running VMID
for id in 301 302 303 304 305 306 307 308 9101; do
  pct exec "$id" -- sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
  # Reload only if the config validates; the unit may be named sshd or ssh
  pct exec "$id" -- sh -c 'sshd -t && (systemctl reload sshd || systemctl reload ssh)' || echo "FAILED $id"
done
# pve10
for id in 210 215 216; do
  pct exec "$id" -- sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
  pct exec "$id" -- sh -c 'sshd -t && (systemctl reload sshd || systemctl reload ssh)' || echo "FAILED $id"
done
```
**Before disable:** install your key on CTs you need (`make copy-ssh-key HOST=vikanjans`, etc.).
**Note:** CTs already have `permitrootlogin without-password` — keep that; only turn off passwords.
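Before turning passwords off, a quick check from the Proxmox host can flag containers that have no root key at all. The following is a minimal sketch, assuming root logins and the default `/root/.ssh/authorized_keys` path; the helper name `count_authorized_keys` is hypothetical. It only counts lines that look like public keys and does not prove that the key is yours.

```bash
# Hypothetical helper: count lines on stdin that look like public keys,
# skipping blank lines and comments.
count_authorized_keys() {
  awk '!/^[ \t]*(#|$)/ && /(ssh-ed25519|ssh-rsa|ecdsa-sha2-|sk-ecdsa-)/ { n++ }
       END { print n + 0 }'
}

# Usage on a PVE node:
#   for id in 210 215 216; do
#     n=$(pct exec "$id" -- cat /root/.ssh/authorized_keys 2>/dev/null | count_authorized_keys)
#     [ "$n" -gt 0 ] || echo "CT $id: no root key found, keep passwords on for now"
#   done
```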
---
### 2.2 fail2ban on hypervisors
**Issue:** No brute-force protection on SSH (and eventually 8006 if proxied).
```bash
ssh root@10.0.10.201 'apt install -y fail2ban && systemctl enable --now fail2ban'
ssh root@10.0.10.10 'apt install -y fail2ban && systemctl enable --now fail2ban'
```
Optional: extend to high-value LXCs via `roles/monitoring_server` or manual install.
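Installing the package is only the first step; the `sshd` jail usually needs a small local override. The following is a minimal sketch with a hypothetical helper `write_sshd_jail`. Every value in the jail is an illustrative assumption, not a setting taken from the audit, and `backend = systemd` assumes the host logs SSH failures to the journal and not to `/var/log/auth.log`.

```bash
# Hypothetical helper: write a minimal sshd jail to the given path.
write_sshd_jail() {
  cat > "$1" <<'EOF'
[sshd]
enabled  = true
# Assumption: read the systemd journal instead of /var/log/auth.log
backend  = systemd
maxretry = 5
findtime = 10m
bantime  = 1h
# Never ban the admin workstation (example IP from task 0.5)
ignoreip = 127.0.0.1/8 10.0.10.127
EOF
}

# Usage on each hypervisor:
#   write_sshd_jail /etc/fail2ban/jail.d/sshd.local
#   systemctl restart fail2ban && fail2ban-client status sshd
```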
---
### 2.3 Patch backlog
| Target | Pending | Action |
|--------|---------|--------|
| pve201 | ~105 | `apt update && apt full-upgrade -y` (maintenance window; Proxmox recommends `full-upgrade`/`dist-upgrade` over plain `upgrade`) |
| pve10 | ~92 | same |
| LXCs 303, 306, 307, 9101 | 79–89 | `pct exec <id> -- sh -c 'apt update && apt upgrade -y'` (quoted so both commands run inside the CT, not on the host) |
| caseware, auto (pve10) | ~40 | same |
**Order:** hypervisors first (after snapshot), then LXCs one by one.
---
### 2.4 Application ports on `0.0.0.0`
**Issue:** HTTP services exposed on LAN without TLS/auth.
| LXC / host | Port | Fix |
|------------|------|-----|
| qbit (91) | 8080 | Prefer VPN; or Caddy + auth; bind to internal IP |
| searchXNG (70) | 8080 | Same |
| punimTagFE (121) | 8000 | Behind Caddy; firewall allow only 10.0.10.0/24 |
| vaultwarden (142) | 8080 | Already in inventory — reverse proxy + TLS |
| portfolio | **106:80** (pve10 LXC 219, nginx) | Migrated 2026-05-22; pve201 LXC **306 destroyed** |
| vikunja (159) | 3456 | Proxy via Caddy (`todo.levkin.ca`) |
**Pattern:** App listens `127.0.0.1` only; **Caddy** (`10.0.10.50`) terminates TLS for public URLs in inventory.
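To see which services still break this pattern, the listeners bound to all interfaces can be listed per container. The following is a minimal sketch, assuming `ss -tlnH` output is piped in; the helper name `wildcard_listeners` is hypothetical.

```bash
# Hypothetical helper: read `ss -tlnH` output on stdin and print the local
# address:port of every TCP listener bound to all interfaces. Peer columns
# end in ":*", so requiring a numeric port selects the local address only.
wildcard_listeners() {
  awk '{
    for (i = 1; i <= NF; i++)
      if ($i ~ /^(0\.0\.0\.0|\*|\[::\]):[0-9]+$/) { print $i; break }
  }'
}

# Usage on a PVE node:
#   pct exec 301 -- ss -tlnH | wildcard_listeners
```

An empty result means every listener in that container is bound to a specific address such as `127.0.0.1`.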
---
### 2.5 pve10 infrastructure
| Issue | Fix |
|-------|-----|
| ZFS `NAS.SP00` suspended | `zpool status`; import/clear errors |
| PBS 10.0.10.200 unreachable | Fix network/service or remove stale datastore |
| Load ~30 | Identify heavy VMs; migrate or stop |
---
## Phase 3 — Medium
### 3.1 unattended-upgrades
Hypervisors + important LXCs:
```bash
apt install -y unattended-upgrades apt-listchanges
dpkg-reconfigure -plow unattended-upgrades
```
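`dpkg-reconfigure` asks an interactive question, which is awkward inside a `pct exec` loop. The following is a minimal non-interactive sketch that writes the two periodic settings directly; the helper name `write_auto_upgrades` is hypothetical, and the target path in the usage lines is the standard `20auto-upgrades` file.

```bash
# Hypothetical helper: enable the daily package-list refresh and the
# unattended-upgrade run by writing the APT periodic settings to a file.
write_auto_upgrades() {
  cat > "$1" <<'EOF'
APT::Periodic::Update-Package-Lists "1";
APT::Periodic::Unattended-Upgrade "1";
EOF
}

# Usage on a host or inside a CT:
#   write_auto_upgrades /etc/apt/apt.conf.d/20auto-upgrades
#   unattended-upgrade --dry-run --debug   # preview what would be installed
```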
### 3.2 Ansible security roles (by group)
Today `make security` runs `playbooks/development.yml` on **`dev` only**.
**Expand with new/changed playbooks:**
| Group | Playbook idea | Roles |
|-------|---------------|-------|
| `[proxmox]` | `playbooks/infrastructure/proxmox-hardening.yml` | `ssh`, monitoring_server |
| `[services]` | extend `playbooks/servers.yml` | `ssh`, `base`, fail2ban |
| `[qa]` | tag run on qa hosts | `ssh` |
| LXCs | optional `pct` + Ansible over SSH after keys | `ssh` |
**Workflow:**
```bash
make check HOST=pve201 # after proxmox play exists
make dev HOST=dev01 --tags security
```
### 3.3 UFW on LXCs
Only **punimTagFE-dev** has UFW today. Template for others:
- Allow 22 from `10.0.10.0/24`
- Allow app port only if needed on LAN
- Default deny incoming
Use `roles/ssh` UFW tasks or Proxmox guest firewall (`firewall=1` on `net0`).
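For a manual rollout, the template above maps onto a handful of `ufw` commands. The following is a minimal sketch with a hypothetical helper `ufw_plan` that only prints the commands, so they can be reviewed before anything is applied. The subnet and port in the usage lines are examples from this plan.

```bash
# Hypothetical helper: print (not run) the ufw commands for the template.
# $1 = admin subnet, $2 = optional app port to allow from that subnet.
ufw_plan() {
  subnet=$1 app_port=${2:-}
  echo "ufw default deny incoming"
  echo "ufw allow from $subnet to any port 22 proto tcp"
  if [ -n "$app_port" ]; then
    echo "ufw allow from $subnet to any port $app_port proto tcp"
  fi
  echo "ufw --force enable"
}

# Usage inside the CT, as root:
#   ufw_plan 10.0.10.0/24 8000        # review
#   ufw_plan 10.0.10.0/24 8000 | sh   # apply
```

The SSH rule is printed before `enable` so that the current session's subnet is already allowed when the firewall comes up.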
### 3.4 Align names / inventory
| Proxmox name | Ansible | Action |
|--------------|---------|--------|
| punimTagFE-dev | punimTag-dev | Rename CT or update `app_projects` name |
| vikunja-debian | vikanjans | OK (IP 159) |
| qbit-debian | qBittorrent | OK (IP 91) |
### 3.5 Mac (control machine)
| Issue | Fix |
|-------|-----|
| Firewall off | System Settings → Firewall → On |
| FileVault off | Enable FileVault |
| Docker on `*:3000` | Bind to `127.0.0.1` unless LAN needed |
---
## Phase 4 — Low
| Item | Fix |
|------|-----|
| rpcbind (111) on pve201 / 9101 | Disable if unused: `systemctl disable --now rpcbind.service rpcbind.socket` |
| X11Forwarding on Proxmox | Set `X11Forwarding no` in `/etc/ssh/sshd_config` |
| Stopped CTs 9001, 9401 | Leave stopped or destroy if unused |
| `make security-audit` target | Add Makefile → runs audit scripts, appends to report |
| Quarterly re-audit | Re-run `scripts/security-audit-lxc-via-pve.sh` |
---
## Suggested calendar
| Week | Critical | High | Medium |
|------|----------|------|--------|
| **1** | 0.x prep, 1.1 SSH both PVE, 1.2 firewall 8006, 1.4 keys | 2.1 LXC passwords off (after keys), 2.2 fail2ban | — |
| **2** | 1.3 RAM 201 | 2.3 patch PVE + LXCs, 2.4 Caddy for 8080 services | 3.1 unattended-upgrades |
| **3** | — | 2.5 pve10 ZFS/PBS/load | 3.2 Ansible plays for proxmox + services |
| **4** | — | — | 3.3 UFW, 3.4 naming, 3.5 Mac |
---
## Rollback (if locked out of SSH)
- Proxmox: use **console** in web UI (or physical/IPMI) → edit `/etc/ssh/sshd_config` → set `PasswordAuthentication yes` temporarily → reload sshd.
- LXC: `pct enter <vmid>` from PVE host.
---
## Tracking checklist
Copy into your issue tracker or tick in [security-audit-report.md](security-audit-report.md):
### Backup (Phase 0, before everything)
- [ ] Git commit / branch for ansible repo
- [ ] PVE `sshd_config` backup on 201 + 10
- [ ] Proxmox CT snapshots (or vzdump) on critical LXCs
- [ ] Audit outputs saved locally (`security-hardening-backup-*`)
- [ ] Console access tested in Proxmox UI
### Critical
- [ ] pve201 SSH: prohibit-password + no passwords
- [ ] pve10 SSH: same
- [ ] 8006 restricted to admin subnet/IP
- [ ] SSH keys on all inventory hosts
- [ ] pve201 RAM relieved
### High
- [ ] All running LXCs: PasswordAuthentication no
- [ ] fail2ban on pve201 + pve10
- [ ] Patch pve201, pve10, LXCs with 40+ upgrades
- [ ] qBit / searchXNG / punimTag / vaultwarden port exposure reduced
- [ ] pve10 ZFS + PBS investigated
### Medium
- [ ] unattended-upgrades on PVE + key LXCs
- [ ] `make security` (or new plays) for proxmox, services, qa
- [ ] UFW on critical LXCs
- [ ] Mac firewall + FileVault
### Low
- [ ] rpcbind, X11, audit Makefile, naming cleanup
---
## Quick reference: your login after plan
```bash
# Proxmox
ssh root@10.0.10.201 # key only
# Dev / QA
ssh ladmin@10.0.10.223 # key only → sudo -i when you need root
# Services (inventory root)
ssh root@10.0.10.50 # key only
# Proxmox UI (if 8006 restricted)
ssh -L 8006:127.0.0.1:8006 root@10.0.10.201
# → https://127.0.0.1:8006
```