ansible/docs/guides/levkin-selfhost-plan-2.md
ilia 0f34c51fc8
Complete homelab post-sprint: SSO docs, monitoring scripts, phase 0/1 closure.
Consolidate sprint status into handoff docs, add Listmonk/Mattermost/Mailcow
and Vikunja SSO guides, Beszel alerts script, mattermost inventory, and
mark phases 0–1 complete with phase 2 backlog for edge Caddy and security.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-24 12:13:55 -04:00


Levkin self-hosted stack — plan & decisions

Reference doc for the Proxmox homelab. Lives alongside the Cursor project that has the Proxmox info.

Conventions:

  • All groups run inside an LXC unless marked VM.
  • Inside each LXC: one docker-compose.yml, managed by Dockge where applicable.
  • Caddy on the edge LXC is the only thing exposed to the internet.
  • Authentik on the identity LXC is the source of truth for who you are.
  • Vaultwarden stays standalone (it's the break-glass path if Authentik dies).

Progress summary (updated 2026-05-24)

| Area | Status |
|---|---|
| Phase 0 Foundation | Done: pve10 LXCs static; UniFi VM DHCP reservations; auth + apex DNS; Caddy on VM 106 @ .50 (edge LXC = Phase 1.5) |
| Phase 1 Identity (Authentik) | LXC 217 @ 10.0.10.21, admin + TOTP |
| Phase 2 Monitoring | LXC 218: Kuma (17 monitors), Dockge, Umami, Beszel (16 agents), SMTP |
| Phase 3 Cal.com | LXC 210: booking + auto consult button; OIDC deferred (no enterprise license) |
| Phase 4 SSO | Vikunja, Listmonk, Mattermost, Mailcow: browser smoke tests remaining |
| Phase 5–8 | Immich, Crater, Outline, automation depth: after P0 backlog |
| Comms health | Mailcow + Listmonk restored 2026-05-23 (mailcow-lan-proxy-fix.md) |
| Site consolidation | Partial: git LXCs + levkin.ca LXC 220; optional later: static on Caddy VM |
| dev-apps | punimTag 9101 on pve201 until testing done |
| Nextcloud retire | VM 201 stopped, onboot 0, Caddy removed (~8 GiB RAM freed) |
| Portainer retire | VM 109 destroyed 2026-05-23 (~16 GiB on pve10) |
| Security pass | 🟡 Partial: SSH keys + apt + cron 2026-05-23 (security-remediation-plan.md) |

Capacity headroom (live check 2026-05-24)

Use this before adding LXCs/VMs. Re-check with pvesm status and free -h on each node.
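
A minimal sketch of that re-check, assuming the procps `free -m` layout in which the seventh column of the `Mem:` row is `available`. The helper names `mem_available_mib` and `fits_on_node` and the 4096 MiB host reserve are illustrative assumptions, not part of this plan.

```shell
#!/bin/sh
# Sketch: read `free -m` output on stdin and print the "available" column (MiB).
# Assumes the procps layout: Mem: total used free shared buff/cache available.
mem_available_mib() {
    awk '/^Mem:/ { print $7 }'
}

# Sketch: succeed if a guest needing NEED MiB fits into AVAIL MiB while a
# reserve is kept for the host. The 4096 MiB default reserve is an assumption.
fits_on_node() {
    avail=$1; need=$2; reserve=${3:-4096}
    [ $((avail - reserve)) -ge "$need" ]
}

# Usage on a node (not run automatically here):
#   avail=$(free -m | mem_available_mib)
#   fits_on_node "$avail" 4096 && echo "room for a 4 GiB LXC"
#   pvesm status    # storage side of the same check
```

Keeping the parser separate from the decision makes it possible to paste saved `free -m` output from either node into the same check.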

pve10 (PVENAS) — primary place for new homelab services

| Resource | Total | Used | Available | Notes |
|---|---|---|---|---|
| local-lvm (thin) | ~1.67 TiB | ~22% | ~1.30 TiB | New guests on local-lvm only (NAS SP00 degraded) |
| RAM (host) | 62 GiB | ~40 GiB | ~22 GiB | Portainer 109 + Nextcloud 201 freed |

Running: LXCs 210, 215–221; VMs 102–108, 117, 150, 200. Stopped: 101 Jellyfin, 201 Nextcloud.

Headroom: ~20+ GiB RAM for Immich, Crater, or dev-apps LXC.

Still available to free:

| Stop / retire | Frees (maxmem) |
|---|---|
| Portainer VM 109 | 16 GiB freed |
| Nextcloud VM 201 | 8 GiB freed |
| Hermes VM 117 (if not needed) | 16 GiB |
| Site LXCs 215/216 → Caddy static (optional) | ~1 GiB |

pve201 (pve) — do not add new homelab services

| Resource | Total | Used | Available | Notes |
|---|---|---|---|---|
| local-lvm | ~1.67 TiB | ~46% | ~922 GiB | Disk OK |
| RAM | 125 GiB | ~105 GiB | ~19 GiB | GPU 104 (64 GB), DebianDesktop 100 (24 GB rebooted), punim 9101 (16 GB) |

Verdict: New stacks on pve10 only. pve201: stop/migrate punim after testing.


Current state (May 2026)

Already running:

  • Caddy reverse proxy — currently on a VM (should migrate to LXC, see "Caddy migration" section)
  • Mailcow — VM, mail domain is levkine.ca (with e)
  • Vaultwarden, Vikunja, n8n, Listmonk, Mattermost — across various LXCs/VMs
  • Cal.com — LXC id 210, cal.levkin.ca, Postgres included, admin user ilia, 15-min consult event live at cal.levkin.ca/ilia/consult with Jitsi link
  • Caddy entries live for: levkin.ca, caseware.levkin.ca, auto.levkin.ca, iliadobkin.com, cal.levkin.ca, listmonk.levkin.ca, pdf.levkin.ca, search.levkin.ca, auth.levkin.ca, stats.levkin.ca, status.levkin.ca
  • Authentik — LXC 217 @ 10.0.10.21, https://auth.levkin.ca, admin + TOTP enrolled
  • Monitoring — LXC 218 @ 10.0.10.22: Uptime Kuma :3001, Dockge :5001, Umami :3000 (LAN-only) — monitoring-stack.md
  • Umami + Authentik admin/TOTP/backup codes — done
  • Uptime Kuma — monitors live; email alerts via Mailcow — see monitoring-stack.md
  • Dockge on 218 — manages local /opt/monitoring stack
  • Snapshots backup-20260522 on LXCs 217, 218
  • Jellyfin (VM 101) — stopped
  • LXC 210, 215–221 static via pct set; Caddy VM 106 static in-guest at .50
  • Nextcloud VM 201 — retired (stopped, onboot 0, Caddy removed)
  • Portainer VM 109 removed 2026-05-23 (~16 GiB RAM freed on pve10)
  • Marketing sites — LXC 220 (levkin.ca), 215/216/219 (git deploy), not yet on Caddy VM static roots
  • punimTag dev — pve201 LXC 9101 @ 10.0.10.121 (16 GB) — leave until testing done; then dev-apps on pve10

Decisions locked in:

  • Container manager: Dockge (not Portainer, not Coolify/Dokploy/CapRover)
  • Chat: Mattermost only — no Matrix/Synapse
  • Knowledge tool: Outline for client-facing, SiYuan if/when PhD work picks up (don't run Affine + Trilium too)
  • Bookmark manager: Linkwarden (full-page archive is the killer feature)
  • Authentik is the SSO target; Vaultwarden stays standalone

LXC / VM grouping table

| Group | What's inside | Why grouped | LXC or VM |
|---|---|---|---|
| edge | Caddy reverse proxy, Crowdsec/Fail2ban | The front door: small, stable, restart rarely | LXC, 1 vCPU, 1GB RAM |
| identity | Authentik (+ Postgres + Redis), Vaultwarden | Auth-critical: touch rarely, back up religiously | LXC, 2 vCPU, 2GB RAM |
| comms | Mailcow | Mailcow's compose is huge (15+ containers) and self-contained, so it wants its own host | VM, 4GB RAM |
| automation | n8n, Windmill (later), Huginn (later) | Active workloads, frequent updates, you'll touch these a lot | LXC, 2–4 vCPU, 4GB RAM |
| productivity | Vikunja, Listmonk, Outline, Mealie, Linkwarden | Personal/team productivity, low-resource | LXC, 2 vCPU, 4GB RAM |
| media | Immich, Nextcloud, Paperless-ngx | Large storage, GPU passthrough useful for Immich ML | VM if GPU passthrough, else LXC. Lots of disk. |
| business | Cal.com, Crater | Client-facing, financial: back up often | LXC, 2 vCPU, 2GB RAM |
| monitoring | Uptime Kuma, Dockge, Umami, Beszel (later) | Ops stack on LXC 218 | LXC, 2 vCPU, 2GB RAM |
| labs | Anything experimental: Flowise, Trigger.dev | Things you're trying out, can be wiped | LXC, scratch space |

Why this grouping (cheat sheet)

  • One service goes bad → only its group restarts.
  • Need a kernel upgrade for one stack → snapshot the LXC, upgrade, roll back if broken.
  • Mailcow's huge surface area is isolated in its own VM.
  • Edge LXC is tiny and stable → perfect for the layer everything depends on.
  • Backup cadence per group (see Backups section).
  • Resource limits per LXC mean a runaway container can't eat n8n's RAM.

Subdomains

Only expose what actually needs to be public. Internal services use Tailscale/Wireguard for remote access.

Expose publicly

| Subdomain | Service | Group | Why public | Status |
|---|---|---|---|---|
| levkin.ca | Company site (spec + /folders) | edge | Main brand | LXC 220; DNS must point to home IP (was parked elsewhere) |
| caseware.levkin.ca | Static site | edge | Marketing | live |
| auto.levkin.ca | Static site | edge | Marketing | live |
| iliadobkin.com | Portfolio (SDET) | edge | Personal site | live (pve10 LXC 219) |
| cal.levkin.ca | Cal.com | business | Clients book on it | live |
| listmonk.levkin.ca | Listmonk | productivity | Unsubscribe URLs must resolve | live |
| mail.levkine.ca | Mailcow | comms | Mail server | live |
| auth.levkin.ca | Authentik | identity | OIDC redirect URLs need external resolution | live |
| bill.levkin.ca | Crater | business | Clients view invoices | Phase 6 |
| cloud.levkin.ca | Nextcloud | media | Retiring: decommission VM 201 after cutover | 🗑️ |
| photos.levkin.ca | Immich | media | Mobile apps need public hostname | Phase 5 |
| vault.levkin.ca | Vaultwarden | identity | Mobile clients need public hostname | |
| notes.levkin.ca | Outline | productivity | Sharing docs with clients | |
| chat.levkin.ca | Mattermost | comms | Only if inviting outside users | optional |

Keep internal only (no public DNS, no Caddy block)

Reachable only via local network or Tailscale/Wireguard:

| Service | Reason |
|---|---|
| Umami admin UI | Only you need the dashboard. Tracking endpoint can be public, dashboard isn't. |
| Uptime Kuma | Status dashboard is for you. Don't advertise infrastructure. |
| Beszel | Metrics are admin-only. |
| Dockge | Admin UI, local only. |
| n8n | Editor UI shouldn't be exposed. Webhooks go on hooks.levkin.ca if needed. |
| Huginn / Windmill / Flowise | Admin tools. |
| Vikunja | Personal task manager. |
| Mealie | Family recipes. |
| Trigger.dev | Internal automation. |
| Paperless-ngx | Personal documents. Never expose. |
| SiYuan | Personal knowledge. |
| Linkwarden | Personal bookmarks. |

Borderline (decide per service)

| Subdomain | Service | Notes |
|---|---|---|
| stats.levkin.ca | Umami | Public tracker script; admin UI prefer LAN :3000 |
| status.levkin.ca | Uptime Kuma | Public status page only (not admin UI) |
| (none) | Beszel | LAN/Tailscale 10.0.10.22:8090; host metrics, no public DNS |

Phased rollout

Phase 0 — Foundation

  1. Caddy running (on VM — migrate to LXC in Phase 1.5)
  2. Static IP audit — all pve10 LXCs pinned via pct set; Caddy VM static .50; homelab VMs pinned via UniFi DHCP — see host-list.md
  3. DNS for auth.levkin.ca + levkin.ca apex → home IP
  4. identity LXC 217 @ 10.0.10.21 (2 vCPU, 2GB RAM, 20GB local-lvm, Debian 12 + Docker Compose)

Phase 1 — Identity

  1. Deploy Authentik in identity LXC (Authentik + Postgres + Redis, official compose at /opt/authentik)
  2. Caddy: auth.levkin.ca → 10.0.10.21:9000 (simple passthrough, no forward-auth)
  3. Admin user (admin), TOTP enrolled
  4. authentik Admins group (skip custom users group until more accounts)
  5. Static backup codes; don't OIDC other apps until Cal.com test
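
The passthrough in step 2 can be sketched as a small generator that prints a Caddyfile site block. The function name `caddy_passthrough` is hypothetical, and the block is only the simplest form of a `reverse_proxy` site; the real Caddyfile on the VM may carry extra snippets such as security headers.

```shell
#!/bin/sh
# Sketch: print a Caddyfile site block that forwards HOST straight to UPSTREAM,
# with no forward-auth in front of it.
caddy_passthrough() {
    host=$1; upstream=$2
    printf '%s {\n\treverse_proxy %s\n}\n' "$host" "$upstream"
}

# Values taken from step 2 above.
caddy_passthrough auth.levkin.ca 10.0.10.21:9000
```

Assuming the Debian package layout, the output would be appended to /etc/caddy/Caddyfile and checked with `caddy validate --config /etc/caddy/Caddyfile` before reloading Caddy.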

Phase 2 — Next infra (was Phase 1.5) — Caddy migration to LXC

Deferred until after sprint merge. Authentik + SSO are stable; edge migration is the next structural change.

Why Caddy belongs in an LXC, not a VM:

  • ~50MB OS overhead vs ~512MB for a VM
  • Boot/restart in 2-5s vs 20-40s (matters when reloading config)
  • Snapshot/backup is faster
  • Caddy is a Go binary doing reverse-proxy work — no need for kernel isolation
  • Near-native network performance

Steps:

  1. Create edge LXC: Debian 12, 1 vCPU, 512MB RAM, 8GB disk, static IP from host list
  2. Install Caddy via official Debian repo:
    apt install -y debian-keyring debian-archive-keyring apt-transport-https
    curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' | gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
    curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' | tee /etc/apt/sources.list.d/caddy-stable.list
    apt update && apt install caddy
    
  3. Copy Caddyfile + custom snippets ((security-headers) etc.) from the VM
  4. Add a test subdomain (e.g. test.levkin.ca) pointing at the new LXC — verify TLS issues and routing works
  5. Cut over: update router port-forward (80/443) to the new LXC IP. DNS A records don't need to change if they point to your home IP.
  6. Watch Mailcow, Cal.com, Listmonk, the marketing sites for ~24h
  7. Keep the old VM snapshot for a week, then delete
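
One way to exercise steps 4 to 6 from the LAN is to point curl at the new LXC directly with `--resolve`, so the request bypasses the router port-forward. This is a sketch under assumptions: `probe_via`, `status_ok`, and the `NEW_EDGE_IP` variable are hypothetical names, and if the new Caddy has not obtained certificates yet the probe will report 000 rather than a real status.

```shell
#!/bin/sh
# Sketch: request https://HOST/ but force the connection to EDGE_IP, so the new
# LXC can be tested before the port-forward moves. Prints the HTTP status code.
probe_via() {
    host=$1; edge_ip=$2
    curl -sS -o /dev/null -m 10 -w '%{http_code}' \
        --resolve "$host:443:$edge_ip" "https://$host/"
}

# Treat 2xx and 3xx as healthy; curl prints 000 when the connection fails.
status_ok() {
    case "$1" in
        2??|3??) return 0 ;;
        *)       return 1 ;;
    esac
}

# Runs only when NEW_EDGE_IP is set, e.g. NEW_EDGE_IP=10.0.10.20 sh check-edge.sh
if [ -n "${NEW_EDGE_IP:-}" ]; then
    for site in cal.levkin.ca listmonk.levkin.ca auth.levkin.ca levkin.ca; do
        code=$(probe_via "$site" "$NEW_EDGE_IP") || code=000
        if status_ok "$code"; then
            echo "OK   $site $code"
        else
            echo "FAIL $site $code"
        fi
    done
fi
```

The same loop can be rerun from outside the LAN without `--resolve` during the 24h watch window.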

Phase 2 — Quick wins

  1. Umami — tracking on levkin.ca, caseware, auto, and iliadobkin.com (portfolio)
  2. Uptime Kuma — monitors in UI
  3. Dockge — logged in; register /opt/monitoring stack (see monitoring-stack.md)
  4. Kuma email alerts — SMTP via Mailcow — monitoring-stack.md

Phase 3 — Cal.com (mostly done)

  1. Cal.com deployed in business LXC (id 210, Postgres included)
  2. cal.levkin.ca proxied via Caddy
  3. Booking link live at cal.levkin.ca/ilia/consult with Jitsi location
  4. Email working via cal@levkine.ca SMTP through Mailcow
  5. Cal.com OIDC: deferred (cal-authentik-oidc.md); needs enterprise CALCOM_LICENSE_KEY
  6. auto.levkin.ca consult button → cal.levkin.ca/ilia/consult

Phase 4 — SSO migration

  1. Vikunja: vikunja-authentik-oidc.md
  2. Nextcloud: skipped (VM 201 retired)
  3. Listmonk: listmonk-authentik-oidc.md (v6.1.0)
  4. Mattermost: mattermost-authentik-gitlab-oauth.md
  5. Mailcow: mailcow-authentik-oidc.md

Remaining: browser smoke tests as ilia; rotate OIDC secrets when done.

For each: keep a local admin password as a break-glass account.

Phase 5 — Family / personal wins (~1 evening)

  1. Immich in media VM — install mobile apps for you and family, enable auto-upload. Face recognition runs in background; "my kids 2024" works within a couple days.
  2. Skip PhotoPrism — Immich covers it.

Phase 6 — Business / consulting (~1–2 evenings)

  1. Crater in business LXC — tax rates, company info, Stripe integration if you want online payment
  2. Beszel hub in monitoring LXC + agents on each LXC — one dashboard for resource usage

Phase 7 — Automation depth (ongoing)

Only when you have a real use case:

  1. Huginn in automation — first agent: competitor pages, kosher product availability, grant deadlines
  2. Windmill in automation — first script: rewrite an n8n flow with too many code nodes
  3. Flowise in labs — first flow: chat-with-docs against your consulting notes

Phase 8 — Knowledge / research

  1. Outline in productivity LXC — client-facing wiki + your notes
  2. Linkwarden in productivity LXC — bookmarks with full-page archive
  3. Paperless-ngx in media — scan and OCR the paper that's accumulating
  4. SiYuan — only if/when PhD or long-form research becomes relevant

Static IP audit

Maintain a host-list.md file (in this Cursor project, alongside this plan) with every LXC/VM, its current IP, its target static IP, and DHCP/static status. Cursor will use this as the source of truth when scripting changes.

Suggested format:

| LXC/VM ID | Name | Role | Current IP | Target static IP | DHCP/Static | Notes |
|---|---|---|---|---|---|---|
| 210 | cal | Cal.com | 10.0.10.228/24 (DHCP) | 10.0.10.228/24 | static | Convert ASAP |
| ... | ... | ... | ... | ... | ... | ... |

Keep everything inside 10.0.10.0/24 (or whatever your LAN is) and assign role-based ranges by last octet so the list is scannable:

| Range | Reserved for |
|---|---|
| .1 - .9 | Network gear (router, switches, APs) |
| .10 - .19 | Proxmox host(s) + PBS |
| .20 - .39 | Edge / identity / comms (critical infra) |
| .40 - .79 | Application LXCs (productivity, automation, business, monitoring) |
| .80 - .99 | Media VM(s) |
| .100 - .199 | DHCP pool (clients, phones, laptops) |
| .200 - .249 | Labs / experimental |
| .250 - .254 | Reserved |
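
The range plan can be checked mechanically. The sketch below maps an address to its reserved role by last octet; `role_for_ip` and the role labels are hypothetical names, and it assumes every host lives in the single /24 described above.

```shell
#!/bin/sh
# Sketch: map the last octet of an address (optional /prefix allowed) to the
# role that the range plan reserves for it.
role_for_ip() {
    addr=${1%%/*}        # drop a trailing /24 if present
    octet=${addr##*.}    # keep only the last octet
    if   [ "$octet" -ge 1 ]   && [ "$octet" -le 9 ];   then echo "network-gear"
    elif [ "$octet" -ge 10 ]  && [ "$octet" -le 19 ];  then echo "proxmox-pbs"
    elif [ "$octet" -ge 20 ]  && [ "$octet" -le 39 ];  then echo "critical-infra"
    elif [ "$octet" -ge 40 ]  && [ "$octet" -le 79 ];  then echo "applications"
    elif [ "$octet" -ge 80 ]  && [ "$octet" -le 99 ];  then echo "media"
    elif [ "$octet" -ge 100 ] && [ "$octet" -le 199 ]; then echo "dhcp-pool"
    elif [ "$octet" -ge 200 ] && [ "$octet" -le 249 ]; then echo "labs"
    elif [ "$octet" -ge 250 ] && [ "$octet" -le 254 ]; then echo "reserved"
    else echo "out-of-plan"
    fi
}

role_for_ip 10.0.10.21       # identity LXC 217
role_for_ip 10.0.10.228/24   # Cal.com LXC 210
```

Run against host-list.md entries, a check like this flags hosts whose current address sits outside the range intended for their role; for example, .228 falls in the labs range even though Cal.com is a business service.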

How to set static on a Proxmox LXC

Two methods — pick one and stick with it:

Method A — Proxmox CLI (recommended, survives reboots cleanly):

pct set <ID> -net0 name=eth0,bridge=vmbr0,ip=10.0.10.X/24,gw=10.0.10.1
pct reboot <ID>

Method B — Router DHCP reservation:

  • Reserve the IP in your router's DHCP table by MAC address. LXC stays "DHCP" technically, but always gets the same IP.
  • Easier if you have many hosts and one router.
  • Risk: if the LXC's MAC changes (rebuild from snapshot to new ID), reservation breaks.

Recommendation: Method A (pct set) for everything critical (edge, identity, comms, business). Method B is fine for labs/experimental LXCs.
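
For converting several containers with Method A, a dry-run generator keeps the change reviewable. This is a sketch: `pct_static_cmds` is a hypothetical helper, the bridge `vmbr0` and gateway `10.0.10.1` are copied from the example command, and the ID/IP pairs are the ones this plan already lists. Since `-net0` takes a complete definition, options not repeated in it (for example a pinned MAC address) may be reset, so compare against `pct config <ID>` first.

```shell
#!/bin/sh
# Sketch: print (not run) the Method A commands for one container so they can
# be reviewed before being pasted on the Proxmox host.
pct_static_cmds() {
    id=$1; ip=$2
    echo "pct set $id -net0 name=eth0,bridge=vmbr0,ip=$ip/24,gw=10.0.10.1"
    echo "pct reboot $id"
}

# Batch input: "ID IP" pairs, one per line.
while read -r ctid addr; do
    pct_static_cmds "$ctid" "$addr"
done <<'EOF'
217 10.0.10.21
218 10.0.10.22
EOF
```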

Audit checklist

  1. List every LXC: pct list
  2. List every VM: qm list
  3. For each, run pct exec <ID> -- ip a (or qm guest exec <ID> -- ip a for VMs) and check whether the IP came from DHCP
  4. Fill in host-list.md
  5. Pick target IPs from the range plan above
  6. Convert one at a time, lowest-risk first (labs → productivity → business → comms → identity → edge)
  7. After each conversion, verify the Caddy reverse-proxy entry still works (curl from outside)
  8. Update host-list.md status column
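
Step 3 of the checklist can also be answered from the container config instead of from inside the guest. The sketch below classifies the `net0` line of `pct config <ID>` output; `net0_mode` is a hypothetical helper, and it only covers LXCs whose network is defined in the Proxmox config, not VMs configured in-guest.

```shell
#!/bin/sh
# Sketch: read `pct config <ID>` output on stdin and report whether net0 is
# configured as "dhcp", "static", or "unknown" (no ip= field found).
net0_mode() {
    ip=$(sed -n 's/^net0:.*[ ,]ip=\([^,]*\).*/\1/p')
    case "$ip" in
        "")   echo "unknown" ;;
        dhcp) echo "dhcp" ;;
        *)    echo "static" ;;
    esac
}

# On the Proxmox host (needs root; not run here):
#   for id in $(pct list | awk 'NR > 1 { print $1 }'); do
#       echo "$id $(pct config "$id" | net0_mode)"
#   done
```

The loop output is a two-column list that can be copied into the DHCP/Static column of host-list.md.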

Hosts known to need conversion right now

  • LXC 210 (cal) — static at 10.0.10.228
  • Site LXCs 220, 215/216/219 — static; served via Caddy → nginx on each LXC (git deploy). Optional future: static files on Caddy VM only.

Backlog (priority order)

P0 — status (2026-05-24)

| # | Item | Status |
|---|---|---|
| 1 | Umami / Kuma / Dockge | |
| 2 | Portainer VM 109 | removed |
| 3 | Nextcloud VM 201 | retired |
| 4 | Listmonk → LXC 221 + SMTP + VM 113 destroyed | |
| 5 | Beszel agents | 16 systems |
| 6 | Kuma monitors + email | 17 monitors, all alert-linked |
| 7 | DNS levkin.ca apex | |
| 8 | Vikunja OIDC | infra live; browser test as ilia still manual |
| 9 | UniFi DHCP | listmonk MAC manual @ UniFi |
| 10 | NAS / Jellyfin / DebianDesktop | deferred |
| 11 | Cal OIDC | deferred (no license) |

P1 — next

See handoff-next-steps.md — SSO smoke tests, secret rotation.

Phase 2 backlog (was P1 infra)

  1. Caddy → edge LXC @ 10.0.10.20
  2. Security remediation: security-remediation-plan.md
  3. NAS / Jellyfin — disk W4J0L3PY

P1 — when ready

  • Outline — wiki for client docs
  • Linkwarden — bookmarks with full-page archive
  • Plane — Jira-lite project management (pair with Mattermost)

P2 — when you have a real need

  • Crater — invoicing (Phase 6)
  • Immich — photos (Phase 5)
  • Paperless-ngx — document scanning (Phase 8)
  • Huginn — first when you have a monitoring use case
  • Windmill — when n8n hits limits
  • Trigger.dev — durable background jobs in code (better fit than Windmill for QA work)
  • PrivateBin — encrypted paste for sharing secrets with contractors
  • Addy.io — email aliases
  • SiYuan — if PhD work picks up
  • Flowise — labs only, when LLM workflow use case appears

Skip / declined

  • PhotoPrism — Immich covers it
  • Activepieces — you already have n8n
  • Affine / Trilium — picked Outline + SiYuan instead
  • Matrix/Synapse + Element — staying on Mattermost
  • Coolify / Dokploy / CapRover — Dockge is enough; revisit only if writing many custom apps

Backup strategy

  • Proxmox Backup Server (PBS) or vzdump to a NAS — snapshot each LXC/VM nightly
  • Critical groups (identity, comms, business): 7 daily + 4 weekly + 12 monthly
  • Productivity/automation: 7 daily + 4 weekly
  • Labs: 3 daily, no long retention
  • Off-site copy of identity and business LXCs — these contain auth and billing data. Encrypted copy to Wasabi or Backblaze B2.

The whole LXC gets snapshotted — much simpler than file-level container backup.

Done on pve10 (2026-05-22): pct snapshot backup-20260522 on LXCs 217 (identity) and 218 (monitoring).
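
The retention tiers above map onto vzdump's `--prune-backups` option. The sketch below only prints the commands; `retention_for` and `backup_cmd` are hypothetical helpers, and `backup-store` is a placeholder storage ID rather than a name from this plan.

```shell
#!/bin/sh
# Sketch: translate a group name into the vzdump --prune-backups value that
# matches the retention tiers in this section.
retention_for() {
    case "$1" in
        identity|comms|business) echo "keep-daily=7,keep-weekly=4,keep-monthly=12" ;;
        productivity|automation) echo "keep-daily=7,keep-weekly=4" ;;
        labs)                    echo "keep-daily=3" ;;
        *)                       return 1 ;;
    esac
}

# Print (not run) the nightly backup command for one guest.
# "backup-store" is a placeholder storage ID (assumption).
backup_cmd() {
    id=$1; group=$2
    keep=$(retention_for "$group") || { echo "unknown group: $group" >&2; return 1; }
    echo "vzdump $id --mode snapshot --storage backup-store --prune-backups $keep"
}

backup_cmd 217 identity
backup_cmd 210 business
```

Groups without a stated tier (edge, media, monitoring) are rejected on purpose so that a retention choice has to be made explicitly.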


Next steps (priority order)

See handoff-2026-05-24.md for sprint status checklist.

# Task Status Effort Frees / unlocks
1 Kuma SMTP done
2 Cal.com → Authentik OIDC deferred Needs CALCOM_LICENSE_KEY; infra ready — sso-selfhosted-matrix.md
3 auto.levkin.ca → Cal booking link Consult button live
4 Stop Portainer VM 109 Removed 2026-05-23; ~16 GiB RAM on pve10
5 Retire Nextcloud VM 201 ~8 GiB RAM freed
6 Vikunja → Authentik OIDC 🟡 infra OK 15 min Browser login as ilia
7 UniFi DHCP reservations 20 min unifi-static-dhcp.md
8 DNS levkin.ca apex 142.180.237.136
9 Beszel + Kuma 16 Beszel agents; 17 Kuma monitors
10 Listmonk SMTP UI + vault
10 NAS.SP00 disk → Jellyfin hardware VM 101
11 DebianDesktop reboot VM 100 rebooted; 24 GB active on pve201
12 Caddy → edge LXC .20 defer ~30 min Phase 1.5
13 dev-apps LXC defer half day After punim testing
14 Static sites → Caddy VM optional 1 h Defer

Defer: Immich, Crater, Outline; Listmonk/Mattermost/Mailcow SSO after Vikunja; Cal OIDC until license.

Adding a new service — quick rule

| Want to add… | Node | RAM budget | Prerequisite |
|---|---|---|---|
| Small app (Mealie, Linkwarden) | pve10 | 2 GB LXC | ~22 GiB free on pve10 |
| Medium (Outline, Crater) | pve10 | 4 GB LXC | Portainer + Nextcloud already freed |
| Heavy (Immich + ML) | pve10 or pve201 GPU | 4–8 GB+ | NAS healthy; pve201 only after GPU/punim sized down |
| Dev sandbox | pve10 dev-apps | 6–8 GB | punim 9101 migration only after testing |

Nextcloud decommission (VM 201)

  1. Confirm export in exports/nextcloud-2026-05-21/ is complete
  2. Delete Nextcloud monitor in Kuma
  3. Remove nextcloud.levkin.ca from Caddy VM
  4. Stop VM 201; update host-list.md
  5. After NAS healthy: optional vzdump archive then delete disk

Important rules

  1. Never put Authentik behind itself. auth.levkin.ca is a simple Caddy passthrough: no forward-auth, no fancy dependencies. If the login page were gated by Authentik and Authentik went down, you would be locked out of the one service you need to fix it.
  2. Vaultwarden stays standalone. It's your break-glass path if Authentik dies. Don't OIDC it.
  3. Keep a local admin password on every SSO-wired app. OIDC integrations break during upgrades — you need to log in to fix them.
  4. Local admin to Proxmox host. Independent of Authentik and Vaultwarden. Written down somewhere physical.
  5. Don't expose admin UIs publicly. Dockge, Beszel, Uptime Kuma admin, n8n editor — use Tailscale or Wireguard for remote access.
  6. Static IPs for every LXC. DHCP will eventually move them and Caddy will break. Set via pct set <id> -net0 ...ip=10.0.10.X/24,gw=... or a router reservation.
  7. Cal.com LXC (210): static at .228.
  8. Maintain host-list.md as the single source of truth for IPs. Update it whenever a new LXC/VM is created or migrated.