# Security Remediation Plan

**Based on:** [security-audit-report.md](security-audit-report.md) (2026-05-20)

**Goal:** Align hosts with `roles/ssh` (keys only, no password SSH) without locking yourself out.

---

## How you should log in (not “ladmin → root” everywhere)

Your inventory uses **different users on purpose**. After hardening, the pattern is:

| Host type | Inventory user | How you work | Root access |
|-----------|----------------|--------------|-------------|
| **Proxmox** (`pve201`, `pve10`) | `root` | `ssh root@10.0.10.201` with **your SSH key** | Direct root (keys only, no password) |
| **Dev / QA** (`dev01`, `git-ci-01`, …) | `ladmin` (or `beast`, `master`) | `ssh ladmin@host` with **key** | `sudo` for admin tasks; Ansible `become: true` |
| **Services** (caddy, jellyfin, …) | often `root` | `ssh root@host` with **key** | Direct root (keys only) |
| **Optional bootstrap** | — | `make bootstrap-root-ssh HOST=x` | One-time: key on `ladmin` → `su` to install **root** key → then harden SSH |

**You do not need** “SSH ladmin then su root” on Proxmox if you keep managing them as `root` in inventory — you need **root + SSH key + passwords disabled**.

**You do** use ladmin → sudo on dev/qa boxes where `ansible_user=ladmin`. That is normal: unprivileged (or sudo) login + elevation, not password guessing on root.

**`PermitRootLogin prohibit-password`** means: root may log in **only with a key**, never with a password. It does **not** mean “ban root; use ladmin only.”

**`PasswordAuthentication no`** means: **nobody** (root, ladmin, etc.) can SSH with a password — keys only.

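One way to confirm what these two directives resolve to on a host is to read the effective configuration that `sshd -T` prints (lowercase keywords, one per line). The sketch below is illustrative only: `ssh_auth_summary` is a hypothetical helper, not part of this repo, and it looks only at the two keywords discussed above. Feed it the output of, for example, `ssh root@10.0.10.201 'sshd -T'`.

```bash
# Hypothetical helper: classify SSH auth policy from `sshd -T`-style text on stdin.
ssh_auth_summary() {
  awk '
    $1 == "permitrootlogin"        { root = $2 }
    $1 == "passwordauthentication" { pw = $2 }
    END {
      # "without-password" is the older spelling of "prohibit-password".
      root_keys_only = (root == "prohibit-password" || root == "without-password")
      if (pw == "no" && root_keys_only) print "OK keys-only (root allowed with key)"
      else if (pw == "no")              print "CHECK passwords off, permitrootlogin=" root
      else                              print "FAIL passwordauthentication=" pw
    }'
}

# Usage:
#   ssh root@10.0.10.201 'sshd -T' | ssh_auth_summary
```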
---

## Phases overview

| Phase | When | Focus |
|-------|------|--------|
| **0 — Backup + prep** | Before any change | Snapshots, `sshd` copies, git commit, keys, second SSH session |
| **1 — Critical** | Week 1 | Proxmox SSH + 8006, keys everywhere, RAM on 201 |
| **2 — High** | Week 1–2 | LXCs SSH, fail2ban, patching, app ports |
| **3 — Medium** | Week 2–4 | unattended-upgrades, Ansible `make security`, TLS |
| **4 — Low** | Ongoing | rpcbind, naming, stopped CTs, Mac, docs |

---

## Phase 0 — Backup (before any hardening)

**Yes — back up first.** SSH and firewall mistakes can lock you out; patches can break services. Use the right backup type per layer.

### What to back up (by layer)

| Layer | What | Method | Rollback if SSH breaks |
|-------|------|--------|-------------------------|
| **Your Mac** | Ansible repo + `~/.ansible-vault-pass` (secure copy) + SSH keys | Time Machine / git commit / copy `~/.ssh` | N/A |
| **Proxmox hosts** | `/etc/ssh/sshd_config`, `/etc/pve/`, firewall rules | Copy files + **Proxmox snapshot** optional | **Console** in web UI (`pct enter` / VM console) |
| **Each LXC/VM** | Full guest state | **Proxmox snapshot** or `vzdump` | Restore snapshot or rollback CT |
| **Dev workstations** | OS + home (if Timeshift installed) | `make timeshift-snapshot HOST=dev02` | `make timeshift-restore` |
| **Central PBS** | — | **Not reliable today** — `10.0.10.200` unreachable | Fix PBS later; don’t depend on it for this work |

### 0A — Mac / repo (5 minutes)

```bash
cd ~/Documents/code/ansible
git status
git add -A && git commit -m "Pre-security-hardening baseline"   # if you want a restore point

# Store vault passphrase somewhere safe (password manager), NOT only on disk
# Optional: encrypted copy of ~/.ansible-vault-pass offline
```

### 0B — Proxmox: config files (both nodes)

```bash
# The date is evaluated once on the Mac, so both nodes get the same directory name
backup_dir="/root/pre-hardening-$(date +%Y%m%d)"
for pve in 10.0.10.201 10.0.10.10; do
  ssh "root@$pve" "mkdir -p $backup_dir && \
    cp -a /etc/ssh/sshd_config $backup_dir/ && \
    cp -a /etc/pve $backup_dir/pve-etc 2>/dev/null; \
    ls -la $backup_dir/"
done
```

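If a rollback is needed later, the copy made above has to be found again from the console. A small sketch for picking the newest backup, assuming the `pre-hardening-YYYYMMDD` naming used above so that a plain lexical sort is also chronological; `latest_prehardening_dir` is a hypothetical name, not an existing script in this repo.

```bash
# Hypothetical helper: print the newest pre-hardening-YYYYMMDD directory under $1 (default /root).
latest_prehardening_dir() {
  base=${1:-/root}
  latest=""
  for d in "$base"/pre-hardening-*/; do
    [ -d "$d" ] && latest=${d%/}   # glob expands in sorted order; keep the last match
  done
  [ -n "$latest" ] && printf '%s\n' "$latest"
}

# Usage on the PVE console:
#   dir=$(latest_prehardening_dir) && cp -a "$dir/sshd_config" /etc/ssh/sshd_config &&
#     sshd -t && systemctl reload sshd
```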
### 0C — Proxmox: snapshots (recommended before SSH/firewall on PVE)

**Running LXCs on pve201** (from audit): 301–308, 9101 — snapshot each before `pct exec` SSH changes. LXC 306 (portfolio) was destroyed after the audit, on 2026-05-22 (see 2.4), so expect `FAILED 306` or drop it from the list.

**Running LXCs on pve10:** 210, 215, 216. The migrated portfolio LXC 219 is not in the audit list; confirm with `pct list` and add it if it is running.

```bash
# On pve201 — snapshot (fast, local-lvm; needs free space)
# The hostname is read from each CT's config; \$2 is escaped for the remote shell
ssh root@10.0.10.201 'for id in 301 302 303 304 305 306 307 308 9101; do
  name=$(pct config $id | awk "/^hostname:/ {print \$2}")
  echo "Snapshot vmid=$id ($name)"
  pct snapshot $id pre-ssh-hardening-$(date +%Y%m%d) || echo "FAILED $id"
done'

# On pve10
ssh root@10.0.10.10 'for id in 210 215 216; do
  pct snapshot $id pre-ssh-hardening-$(date +%Y%m%d) || echo "FAILED $id"
done'
```

**Optional full backup** (slower, larger) — important CTs only if snapshots fail (low disk on 201):

```bash
vzdump <vmid> --storage local --mode snapshot --compress zstd
```

**Check space on pve201 first** (about 2.5 GB RAM free per the audit; snapshots also need free space on `local-lvm`):

```bash
ssh root@10.0.10.201 'pvesm status; free -h'
```

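The `pvesm status` output can be reduced to a single number before attempting snapshots. This is a minimal sketch, assuming `pvesm status` prints one row per storage with the columns `Name Type Status Total Used Available %` and sizes in KiB; `storage_avail_gib` is a hypothetical helper, and the 20 GiB threshold in the usage comment is an arbitrary example rather than a figure from the audit.

```bash
# Hypothetical helper: print free GiB for one storage from `pvesm status` text on stdin.
# Assumes column 1 is the storage name and column 6 is Available in KiB.
storage_avail_gib() {
  awk -v name="$1" '$1 == name { printf "%d\n", $6 / 1048576; found = 1 }
                    END { exit !found }'
}

# Usage (20 is an example threshold):
#   avail=$(ssh root@10.0.10.201 pvesm status | storage_avail_gib local-lvm) &&
#     [ "$avail" -ge 20 ] || echo "low space or storage missing: skip snapshots, do 0B only"
```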
If snapshots fail for lack of space: do **0B only** on PVE, then harden SSH using **Proxmox console** as safety net (no snapshot).

### 0D — Inventory VMs with Timeshift (`dev` group)

Only where Timeshift is already installed (e.g. `dev02`):

```bash
make timeshift-snapshot HOST=dev02
make timeshift-list HOST=dev02
```

Not used on Proxmox or most LXCs by default.

### 0E — Export current SSH settings (audit trail)

```bash
# Run from the repo root so the scripts/ paths resolve
out=~/security-hardening-backup-$(date +%Y%m%d)
mkdir -p "$out"
ssh root@10.0.10.201 'bash -s' < scripts/security-audit-ssh.sh > "$out/pve201-ssh.txt"
ssh root@10.0.10.10 'bash -s' < scripts/security-audit-ssh.sh > "$out/pve10-ssh.txt"
ssh root@10.0.10.201 'bash -s' < scripts/security-audit-lxc-via-pve.sh > "$out/pve201-lxc.txt"
```

### Backup exit criteria (do not skip)

- [ ] Git commit (or branch) for ansible repo
- [ ] `sshd_config` (+ optional `/etc/pve`) copied on **both** PVE nodes
- [ ] Proxmox snapshots **or** documented reason skipped (disk/RAM)
- [ ] Second SSH session tested to `pve201` / `pve10`
- [ ] You know how to open **Proxmox → VM/CT → Console** if SSH fails

### Rollback quick reference

| Problem | Rollback |
|---------|----------|
| Bad `sshd_config` on PVE | Console → restore `/root/pre-hardening-*/sshd_config` → `sshd -t && systemctl reload sshd` |
| Bad LXC SSH | `pct rollback <vmid> pre-ssh-hardening-YYYYMMDD` |
| Bad patch on CT | Same snapshot rollback |
| Locked out of LAN on 8006 | Console → disable the datacenter firewall rule |

---

## Phase 0 — Prep (after backups)

| # | Task | Command / notes |
|---|------|----------------|
| 0.1 | Confirm vault password file | `~/.ansible-vault-pass` |
| 0.2 | Bootstrap control node | `make bootstrap` |
| 0.3 | Verify key on Proxmox | `ssh -o BatchMode=yes root@10.0.10.201 true` |
| 0.4 | Copy keys to inventory | `make copy-ssh-keys` (or per group) |
| 0.5 | Document admin IP | e.g. `10.0.10.127` for firewall rules |
| 0.6 | Open **second terminal** before changing `sshd` | Test login before closing first session |

**Exit criteria:** Backups done (above) + key login works to `pve201`, `pve10`, and hosts you will harden next.

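The key-login part of this exit criterion can be checked in one pass. A minimal sketch that repeats the `BatchMode` test from step 0.3 for every target given; `check_key_login` is a hypothetical helper, and the 5-second connect timeout is an assumption, not a value from the audit.

```bash
# Hypothetical helper: report which targets accept key-only (non-interactive) login.
# Returns non-zero if any target fails, so it can gate the next phase.
check_key_login() {
  rc=0
  for target in "$@"; do
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$target" true 2>/dev/null; then
      echo "ok   $target"
    else
      echo "FAIL $target"
      rc=1
    fi
  done
  return "$rc"
}

# Usage:
#   check_key_login root@10.0.10.201 root@10.0.10.10
```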
---

## Phase 1 — Critical

### 1.1 Proxmox SSH (pve201 + pve10)

**Issue:** `PermitRootLogin yes` + `PasswordAuthentication yes` leave root open to password brute force.

**Fix (per host, after 0.3):**

```bash
# On pve201 OR pve10 — keep existing session open!
sed -i 's/^#*PermitRootLogin.*/PermitRootLogin prohibit-password/' /etc/ssh/sshd_config
sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
sshd -t && systemctl reload sshd
```

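The `sed` edits above touch only `/etc/ssh/sshd_config`. On Debian-based hosts that file usually includes `/etc/ssh/sshd_config.d/*.conf`, and sshd keeps the first value it reads for a keyword, so a drop-in can leave passwords enabled even after the edit. A minimal sketch for spotting such overrides before the reload; `sshd_dropin_overrides` is a hypothetical helper, and the drop-in path is the Debian default rather than something confirmed by the audit.

```bash
# Hypothetical helper: list drop-in files that set a given sshd keyword (case-insensitive).
# $1 = keyword, $2 = drop-in directory (default: Debian's /etc/ssh/sshd_config.d).
sshd_dropin_overrides() {
  keyword=$1
  dir=${2:-/etc/ssh/sshd_config.d}
  # -l prints matching file names; -s hides errors when no *.conf file exists.
  grep -lis "^[[:space:]]*${keyword}[[:space:]]" "$dir"/*.conf
}

# Usage on a PVE host (no output means no drop-in sets the keyword):
#   sshd_dropin_overrides PasswordAuthentication
#   sshd -T | grep -Ei '^(permitrootlogin|passwordauthentication) '
```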
**Verify (new terminal):** `ssh -o BatchMode=yes root@10.0.10.201 true`

**Ansible (later):** dedicated play for `[proxmox]` with `roles/ssh` (today `make security` only targets the `dev` playbook).

| Host | Priority |
|------|----------|
| pve201 | P0 |
| pve10 | P0 |

---

### 1.2 Restrict Proxmox UI/API (port 8006)

**Issue:** Anyone on LAN can hit full cluster API.

**Fix (choose one):**

- **A — Proxmox firewall (recommended):** Datacenter → Firewall → add rule: accept `8006` from `10.0.10.0/24` and/or your Mac IP; drop others. Allowing all of `10.0.10.0/24` only helps if untrusted devices sit outside that subnet; otherwise prefer the single admin IP from step 0.5.
- **B — SSH tunnel only:** no LAN exposure; `ssh -L 8006:127.0.0.1:8006 root@10.0.10.201` → browser `https://127.0.0.1:8006`.

**Do not** block 8006 globally without A or B in place.

---

### 1.3 RAM on pve201 (~2.5 GB free)

**Issue:** New guests or updates risk OOM.

**Fix:**

```bash
ssh root@10.0.10.201 'free -h; pct list'
# Stop non-essential CTs/VMs or migrate workload to pve10
```

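The `free -h` output above is meant for reading; for a scripted check, `free -m` is easier to parse. A minimal sketch, assuming the procps layout where `available` is the seventh field of the `Mem:` row; `mem_available_mib` is a hypothetical helper, and the 2048 MiB floor in the usage comment is an example value.

```bash
# Hypothetical helper: print available memory in MiB from `free -m` text on stdin.
mem_available_mib() {
  awk '$1 == "Mem:" { print $7; found = 1 } END { exit !found }'
}

# Usage (2048 is an example floor, not a value from the audit):
#   avail=$(ssh root@10.0.10.201 free -m | mem_available_mib)
#   [ "$avail" -ge 2048 ] || echo "pve201 low on RAM: stop or migrate guests first"
```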
Review running guests from `make proxmox-info ALL=true`; stop labs you do not need.

---

### 1.4 Deploy SSH keys to unreachable inventory hosts

**Issue:** Cannot audit or Ansible-manage hosts without keys.

**Order:**

1. `make copy-ssh-key HOST=caddy` (and each `[services]` host)
2. `make bootstrap-root-ssh HOST=listmonk` where root password still works but key does not
3. `make copy-ssh-keys GROUP=qa` for `ladmin` hosts

**Exit criteria:** `make ping` succeeds for each group you will harden in phase 2.

---

## Phase 2 — High

### 2.1 LXC SSH — disable password auth (all running CTs)

**Issue:** `passwordauthentication yes` on every audited LXC.

**Fix from Proxmox host (no Mac SSH to CT required):**

```bash
# pve201 — example for each running VMID
for id in 301 302 303 304 305 306 307 308 9101; do
  pct exec $id -- sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
  # Validate first, then reload; the unit is `sshd` on some distros and `ssh` on Debian/Ubuntu
  pct exec $id -- bash -c 'sshd -t && (systemctl reload sshd || systemctl reload ssh)'
done

# pve10
for id in 210 215 216; do
  pct exec $id -- sed -i 's/^#*PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
  pct exec $id -- bash -c 'sshd -t && (systemctl reload sshd || systemctl reload ssh)'
done
```

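Because the loops above turn passwords off unconditionally, it helps to confirm first that each CT already holds at least one public key. A minimal sketch, assuming root's keys live in `/root/.ssh/authorized_keys` inside the CT; `has_authorized_key` is a hypothetical helper that reads the file content on stdin.

```bash
# Hypothetical helper: succeed if stdin contains at least one public-key line.
# Lines may start with options, so the key type is matched after any whitespace.
has_authorized_key() {
  grep -v '^[[:space:]]*#' |
    grep -Eq '(^|[[:space:]])(ssh-(rsa|ed25519)|ecdsa-sha2-[^[:space:]]+|sk-[^[:space:]]+) '
}

# Usage on the PVE host, before the PasswordAuthentication loop (same IDs as above):
#   for id in 301 302 303; do
#     pct exec "$id" -- cat /root/.ssh/authorized_keys 2>/dev/null | has_authorized_key \
#       || echo "CT $id: no key installed, skip hardening for now"
#   done
```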
**Before disabling:** install your key on CTs you need (`make copy-ssh-key HOST=vikanjans`, etc.).

**Note:** CTs already have `permitrootlogin without-password` — keep that; only turn off passwords.

---

### 2.2 fail2ban on hypervisors

**Issue:** No brute-force protection on SSH (and eventually 8006 if proxied).

```bash
ssh root@10.0.10.201 'apt install -y fail2ban && systemctl enable --now fail2ban'
ssh root@10.0.10.10 'apt install -y fail2ban && systemctl enable --now fail2ban'
```

Optional: extend to high-value LXCs via `roles/monitoring_server` or manual install.

---

### 2.3 Patch backlog

| Target | Pending | Action |
|--------|---------|--------|
| pve201 | ~105 | `apt update && apt full-upgrade -y` (maintenance window; Proxmox recommends `full-upgrade`/`dist-upgrade` over plain `upgrade` on the hosts) |
| pve10 | ~92 | same |
| LXCs 303, 306, 307, 9101 | 79–89 | `pct exec <id> -- bash -c 'apt update && apt upgrade -y'` (skip 306 if it is already destroyed, see 2.4) |
| caseware, auto (pve10) | ~40 | same as the LXC row |

**Order:** hypervisors first (after snapshot), then LXCs one by one.

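To track the backlog as patching proceeds, the pending count can be recomputed per target after an `apt update`. A minimal sketch, assuming the `[upgradable from: ...]` suffix that `apt list --upgradable` prints for each package; `count_upgradable` is a hypothetical helper.

```bash
# Hypothetical helper: count upgradable packages from `apt list --upgradable` text on stdin.
count_upgradable() {
  # grep -c exits 1 when the count is 0, so mask the status and keep the number.
  grep -c '\[upgradable from:' || true
}

# Usage:
#   ssh root@10.0.10.201 'apt list --upgradable 2>/dev/null' | count_upgradable
#   ssh root@10.0.10.201 'pct exec 303 -- apt list --upgradable 2>/dev/null' | count_upgradable
```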
---

### 2.4 Application ports on `0.0.0.0`

**Issue:** HTTP services exposed on LAN without TLS/auth.

| LXC / host (last IP octet) | Port | Fix |
|------------|------|-----|
| qbit (91) | 8080 | Prefer VPN; or Caddy + auth; bind to internal IP |
| searchXNG (70) | 8080 | Same |
| punimTagFE (121) | 8000 | Behind Caddy; firewall allow only 10.0.10.0/24 |
| vaultwarden (142) | 8080 | Already in inventory — reverse proxy + TLS |
| portfolio (106; pve10 LXC 219, nginx) | 80 | Migrated 2026-05-22; pve201 LXC **306 destroyed** |
| vikunja (159) | 3456 | Proxy via Caddy (`todo.levkin.ca`) |

**Pattern:** App listens on `127.0.0.1` only when the proxy runs on the same host; when **Caddy** (`10.0.10.50`) is a separate host, bind the app to the CT's internal IP and allow the port only from Caddy, which terminates TLS for public URLs in inventory.

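To find which services still need this treatment, list the sockets bound to all interfaces. A minimal sketch, assuming `ss -tlnH` output where the fourth field is the local `address:port`; `wildcard_listeners` is a hypothetical helper.

```bash
# Hypothetical helper: print local address:port for TCP listeners bound to all interfaces.
# Reads `ss -tlnH` text on stdin (field 4 = Local Address:Port).
wildcard_listeners() {
  awk '$4 ~ /^(0\.0\.0\.0|\*|\[::\]):[0-9]+$/ { print $4 }'
}

# Usage on a PVE host, for one CT:
#   pct exec <vmid> -- ss -tlnH | wildcard_listeners
```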
---

### 2.5 pve10 infrastructure

| Issue | Fix |
|-------|-----|
| ZFS `NAS.SP00` suspended | `zpool status`; import/clear errors |
| PBS 10.0.10.200 unreachable | Fix network/service or remove stale datastore |
| Load ~30 | Identify heavy VMs; migrate or stop |

---

## Phase 3 — Medium

### 3.1 unattended-upgrades

Hypervisors + important LXCs:

```bash
apt install -y unattended-upgrades apt-listchanges
dpkg-reconfigure -plow unattended-upgrades
```

### 3.2 Ansible security roles (by group)

Today `make security` runs `playbooks/development.yml` on **`dev` only**.

**Expand with new/changed playbooks:**

| Group | Playbook idea | Roles |
|-------|---------------|-------|
| `[proxmox]` | `playbooks/infrastructure/proxmox-hardening.yml` | `ssh`, `monitoring_server` |
| `[services]` | extend `playbooks/servers.yml` | `ssh`, `base`, fail2ban |
| `[qa]` | tag run on qa hosts | `ssh` |
| LXCs | optional `pct` + Ansible over SSH after keys | `ssh` |

**Workflow:**

```bash
make check HOST=pve201        # after proxmox play exists
# make parses `--tags` as one of its own options; pass tags through whatever variable the Makefile exposes
make dev HOST=dev01 --tags security
```

### 3.3 UFW on LXCs

Only **punimTagFE-dev** has UFW today. Template for others:

- Allow 22 from `10.0.10.0/24`
- Allow app port only if needed on LAN
- Default deny incoming

Use `roles/ssh` UFW tasks or Proxmox guest firewall (`firewall=1` on `net0`).

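For a manual install, the template translates to a short command sequence. The sketch below only prints the commands so they can be reviewed before running them inside a CT; `ufw_template` is a hypothetical helper, and `10.0.10.0/24` as the admin subnet comes from the list above. The SSH allow rule is emitted before `ufw enable` so that an active session is not cut off.

```bash
# Hypothetical helper: print UFW commands for the template (dry run, nothing is executed).
# $1 = admin subnet, remaining args = app ports to open on the LAN (optional).
ufw_template() {
  subnet=$1; shift
  echo "ufw default deny incoming"
  echo "ufw allow from $subnet to any port 22 proto tcp"
  for port in "$@"; do
    echo "ufw allow from $subnet to any port $port proto tcp"
  done
  echo "ufw --force enable"
}

# Usage: review the output first, then run the lines inside the CT:
#   ufw_template 10.0.10.0/24 8000
```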
### 3.4 Align names / inventory

| Proxmox name | Ansible | Action |
|--------------|---------|--------|
| punimTagFE-dev | punimTag-dev | Rename CT or update `app_projects` name |
| vikunja-debian | vikanjans | OK (IP 159) |
| qbit-debian | qBittorrent | OK (IP 91) |

### 3.5 Mac (control machine)

| Issue | Fix |
|-------|-----|
| Firewall off | System Settings → Firewall → On |
| FileVault off | Enable FileVault |
| Docker on `*:3000` | Bind to `127.0.0.1` unless LAN needed |

---

## Phase 4 — Low

| Item | Fix |
|------|-----|
| rpcbind (111) on pve201 / 9101 | Disable if unused: `systemctl disable --now rpcbind.service rpcbind.socket` (the socket unit would otherwise restart it on demand) |
| X11Forwarding on Proxmox | Set `no` in sshd |
| Stopped CTs 9001, 9401 | Leave stopped or destroy if unused |
| `make security-audit` target | Add a Makefile target that runs the audit scripts and appends to the report |
| Quarterly re-audit | Re-run `scripts/security-audit-lxc-via-pve.sh` |

---

## Suggested calendar

| Week | Critical | High | Medium |
|------|----------|------|--------|
| **1** | 0.x prep, 1.1 SSH both PVE, 1.2 firewall 8006, 1.4 keys | 2.1 LXC passwords off (after keys), 2.2 fail2ban | — |
| **2** | 1.3 RAM 201 | 2.3 patch PVE + LXCs, 2.4 Caddy for 8080 services | 3.1 unattended-upgrades |
| **3** | — | 2.5 pve10 ZFS/PBS/load | 3.2 Ansible plays for proxmox + services |
| **4** | — | — | 3.3 UFW, 3.4 naming, 3.5 Mac |

---

## Rollback (if locked out of SSH)

- Proxmox: use **console** in web UI (or physical/IPMI) → edit `/etc/ssh/sshd_config` → `PasswordAuthentication yes` temporarily → reload sshd.
- LXC: `pct enter <vmid>` from PVE host.

---

## Tracking checklist

Copy into your issue tracker or tick in [security-audit-report.md](security-audit-report.md):

**Backup (Phase 0 — before everything)**

- [ ] Git commit / branch for ansible repo
- [ ] PVE `sshd_config` backup on 201 + 10
- [ ] Proxmox CT snapshots (or vzdump) on critical LXCs
- [ ] Audit outputs saved locally (`security-hardening-backup-*`)
- [ ] Console access tested in Proxmox UI

**Critical**

- [ ] pve201 SSH: prohibit-password + no passwords
- [ ] pve10 SSH: same
- [ ] 8006 restricted to admin subnet/IP
- [ ] SSH keys on all inventory hosts
- [ ] pve201 RAM relieved

**High**

- [ ] All running LXCs: PasswordAuthentication no
- [ ] fail2ban on pve201 + pve10
- [ ] Patch pve201, pve10, LXCs with 40+ upgrades
- [ ] qBit / searchXNG / punimTag / vaultwarden port exposure reduced
- [ ] pve10 ZFS + PBS investigated

**Medium**

- [ ] unattended-upgrades on PVE + key LXCs
- [ ] `make security` (or new plays) for proxmox, services, qa
- [ ] UFW on critical LXCs
- [ ] Mac firewall + FileVault

**Low**

- [ ] rpcbind, X11, audit Makefile, naming cleanup

---

## Quick reference: your login after plan

```bash
# Proxmox
ssh root@10.0.10.201          # key only

# Dev / QA
ssh ladmin@10.0.10.223        # key only → sudo -i when you need root

# Services (inventory root)
ssh root@10.0.10.50           # key only

# Proxmox UI (if 8006 restricted)
ssh -L 8006:127.0.0.1:8006 root@10.0.10.201
# → https://127.0.0.1:8006
```