
# Levkin self-hosted stack — plan & decisions

Reference doc for the Proxmox homelab. Lives alongside the Cursor project that has the Proxmox info.

**Conventions:**
- All groups run inside an LXC unless marked **VM**.
- Inside each LXC: one `docker-compose.yml`, managed by **Dockge** where applicable.
- Caddy on the `edge` LXC is the only thing exposed to the internet.
- Authentik on the `identity` LXC is the source of truth for who you are.
- Vaultwarden stays standalone (it's the break-glass path if Authentik dies).

---

## Current state (May 2026)

**Already running:**
- Caddy reverse proxy — currently on a **VM** (should migrate to LXC, see Phase 1.5)
- Mailcow — VM, mail domain is `levkine.ca` (with e)
- Vaultwarden, Vikunja, n8n, Listmonk, Mattermost — across various LXCs; Nextcloud on VM 201
- **Cal.com** — LXC id `210`, `cal.levkin.ca`, Postgres included, admin user `ilia`, 15-min consult event live at `cal.levkin.ca/ilia/consult` with Jitsi link
- Caddy entries live for: `caseware.levkin.ca`, `auto.levkin.ca`, `iliadobkin.com`, `cal.levkin.ca`, `listmonk.levkin.ca`, `pdf.levkin.ca`, `search.levkin.ca`, `auth.levkin.ca`
- **Authentik** — LXC **217** @ `10.0.10.21`, `https://auth.levkin.ca`, admin + TOTP enrolled
- **Monitoring** — LXC **218** @ `10.0.10.22`: Uptime Kuma `:3001`, Dockge `:5001`, Umami `:3000` (LAN-only) — [monitoring-stack.md](monitoring-stack.md)
- **Umami** + **Authentik** admin/TOTP/backup codes — done
- **Uptime Kuma** — monitors live; email alerts via Mailcow — see [monitoring-stack.md](monitoring-stack.md)
- **Dockge** on 218 — manages local `/opt/monitoring` stack
- **Snapshots** `backup-20260522` on LXCs **217**, **218**
- **Jellyfin** (VM 101) — stopped
- LXC **210, 215–219** — static via `pct set`; **Caddy VM 106** — static IP `.50` set in-guest
- **Nextcloud VM 201** — export done; **retire soon** (no SSO, remove Kuma monitor + Caddy block when off)

**Decisions locked in:**
- Container manager: **Dockge** (not Portainer, not Coolify/Dokploy/CapRover)
- Chat: **Mattermost only** — no Matrix/Synapse
- Knowledge tool: **Outline** for client-facing, **SiYuan** if/when PhD work picks up (don't run Affine + Trilium too)
- Bookmark manager: **Linkwarden** (full-page archive is the killer feature)
- Authentik is the SSO target; Vaultwarden stays standalone

---

## LXC / VM grouping table

| Group | What's inside | Why grouped | LXC or VM |
|---|---|---|---|
| **edge** | Caddy reverse proxy, CrowdSec/Fail2ban | The front door — small, stable, restart rarely | LXC, 1 vCPU, 1GB RAM |
| **identity** | Authentik (+ Postgres + Redis), Vaultwarden | Auth-critical — touch rarely, back up religiously | LXC, 2 vCPU, 2GB RAM |
| **comms** | Mailcow | Mailcow's compose is huge (15+ containers) and self-contained — wants its own host | **VM**, 4GB RAM |
| **automation** | n8n, Windmill (later), Huginn (later) | Active workloads, frequent updates, you'll touch these a lot | LXC, 2–4 vCPU, 4GB RAM |
| **productivity** | Vikunja, Listmonk, Outline, Mealie, Linkwarden | Personal/team productivity, low-resource | LXC, 2 vCPU, 4GB RAM |
| **media** | Immich, Nextcloud, Paperless-ngx | Large storage, GPU passthrough useful for Immich ML | **VM** if GPU passthrough, else LXC. Lots of disk. |
| **business** | Cal.com ✅, Crater | Client-facing, financial — back up often | LXC, 2 vCPU, 2GB RAM |
| **monitoring** | Uptime Kuma ✅, Dockge ✅, Umami ✅, Beszel (later) | Ops stack on LXC **218** | LXC, 2 vCPU, 2GB RAM |
| **labs** | Anything experimental — Flowise, Trigger.dev | Things you're trying out, can be wiped | LXC, scratch space |

### Why this grouping (cheat sheet)

- One service goes bad → only its group restarts.
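- Need a risky OS or package upgrade for one stack → snapshot the LXC, upgrade, roll back if broken. (LXCs share the host kernel, so kernel upgrades happen on the Proxmox host, not per stack.)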
- Mailcow's huge surface area is isolated in its own VM.
- Edge LXC is tiny and stable → perfect for the layer everything depends on.
- Backup cadence per group (see Backups section).
- Resource limits per LXC mean a runaway container can't eat n8n's RAM.
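The snapshot, upgrade, roll back flow and the per-LXC resource caps from the list above could look roughly like this on the Proxmox host. This is a dry-run sketch: the helper names (`plan_upgrade`, `plan_limits`) and the snapshot name `pre-upgrade` are illustrative assumptions, and the functions only print `pct` commands so you can review them before running anything by hand.

```bash
# Dry-run helpers: they only PRINT pct commands, nothing is executed.
# Function names and the snapshot name are illustrative, not part of the plan.

# Print a snapshot -> upgrade -> (optional) rollback sequence for one LXC.
plan_upgrade() {
  local id="$1" snap="$2"
  echo "pct snapshot ${id} ${snap}"
  echo "pct exec ${id} -- sh -c 'apt update && apt -y full-upgrade'"
  echo "# only if the stack is broken afterwards:"
  echo "pct rollback ${id} ${snap}"
}

# Print the command that caps cores and memory (in MB) for one LXC.
plan_limits() {
  local id="$1" cores="$2" mem_mb="$3"
  echo "pct set ${id} -cores ${cores} -memory ${mem_mb}"
}

plan_upgrade 218 pre-upgrade
plan_limits 218 2 2048
```

Printing instead of executing keeps the sketch safe to paste; pipe the output to `sh` on the host only once the IDs and sizes match the grouping table.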

---

## Subdomains

Only expose what actually needs to be public. Internal services use Tailscale/WireGuard for remote access.

### Expose publicly

| Subdomain | Service | Group | Why public | Status |
|---|---|---|---|---|
| `caseware.levkin.ca` | Static site | edge | Marketing | ✅ live |
| `auto.levkin.ca` | Static site | edge | Marketing | ✅ live |
| `iliadobkin.com` | Portfolio (SDET) | edge | Personal site | ✅ live (pve10 LXC 219) |
| `cal.levkin.ca` | Cal.com | business | Clients book on it | ✅ live |
| `listmonk.levkin.ca` | Listmonk | productivity | Unsubscribe URLs must resolve | ✅ live |
| `mail.levkine.ca` | Mailcow | comms | Mail server | ✅ live |
| `auth.levkin.ca` | Authentik | identity | OIDC redirect URLs need external resolution | ✅ live |
| `bill.levkin.ca` | Crater | business | Clients view invoices | ⏳ Phase 6 |
| `cloud.levkin.ca` | Nextcloud | media | **Retiring** — decommission VM 201 after cutover | 🗑️ |
| `photos.levkin.ca` | Immich | media | Mobile apps need public hostname | ⏳ Phase 5 |
| `vault.levkin.ca` | Vaultwarden | identity | Mobile clients need public hostname | ⏳ |
| `notes.levkin.ca` | Outline | productivity | Sharing docs with clients | ⏳ |
| `chat.levkin.ca` | Mattermost | comms | Only if inviting outside users | ⏳ optional |

### Keep internal only (no public DNS, no Caddy block)

Reachable only via local network or Tailscale/WireGuard:

| Service | Reason |
|---|---|
| Umami admin UI | Only you need the dashboard. Tracking endpoint can be public, dashboard isn't. |
| Uptime Kuma | Status dashboard is for you. Don't advertise infrastructure. |
| Beszel | Metrics are admin-only. |
| Dockge | Admin UI — local only. |
| n8n editor | UI shouldn't be exposed. Webhooks go on `hooks.levkin.ca` if needed. |
| Huginn / Windmill / Flowise | Admin tools. |
| Vikunja | Personal task manager. |
| Mealie | Family recipes. |
| Trigger.dev | Internal automation. |
| Paperless-ngx | Personal documents. Never expose. |
| SiYuan | Personal knowledge. |
| Linkwarden | Personal bookmarks. |

### Borderline (decide per service)

| Subdomain | Service | Notes |
|---|---|---|
| `stats.levkin.ca` | Umami collector | Only the tracking script endpoint needs to be public; admin UI stays internal |
| `status.levkin.ca` | Uptime Kuma | Kuma supports a separate public status page URL — that one can be public |

---

## Phased rollout

### Phase 0 — Foundation
1. ✅ Caddy running (on VM — migrate to LXC in Phase 1.5)
2. ✅ **Static IP audit (partial)** — all LXCs on pve10 pinned; Caddy VM static `.50`; remaining VMs on stable DHCP — see [host-list.md](host-list.md)
3. ✅ DNS for `auth.levkin.ca` → home IP (verified 2026-05-22)
4. ✅ `identity` LXC **217** @ `10.0.10.21` (2 vCPU, 2GB RAM, 20GB `local-lvm`, Debian 12 + Docker Compose)

### Phase 1 — Identity ✅
1. ✅ Deploy Authentik in `identity` LXC (Authentik + Postgres + Redis, official compose at `/opt/authentik`)
2. ✅ Caddy: `auth.levkin.ca` → `10.0.10.21:9000` (simple passthrough, no forward-auth)
3. ✅ Admin user (`admin`), TOTP enrolled
4. ✅ `authentik Admins` group (skip custom `users` group until more accounts)
5. ✅ Static backup codes; **don't OIDC other apps until Cal.com test**
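For reference, the "simple passthrough" in step 2 amounts to a Caddyfile site block with a plain `reverse_proxy` and no `forward_auth` directive. The sketch below prints such a block; the function name is illustrative, and the real Caddyfile on the VM may carry extra snippets (for example `(security-headers)`) that are not shown here.

```bash
# Print a minimal Caddyfile site block: plain reverse proxy, no forward_auth.
# caddy_passthrough is a hypothetical helper name.
caddy_passthrough() {
  local host="$1" upstream="$2"
  printf '%s {\n\treverse_proxy %s\n}\n' "$host" "$upstream"
}

caddy_passthrough auth.levkin.ca 10.0.10.21:9000
```

Keeping this block free of any auth dependency is what lets you still reach the Authentik login page when everything else is broken.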

### Phase 1.5 — Caddy migration to LXC (~30 min)

Why now (after Phase 1, before bulk SSO work in Phase 4): Authentik is stable enough to absorb a small change, but you haven't yet built the dependency web of OIDC integrations that would make a Caddy reload risky.

Why Caddy belongs in an LXC, not a VM:
- ~50MB OS overhead vs ~512MB for a VM
- Boot/restart in 2-5s vs 20-40s (matters when reloading config)
- Snapshot/backup is faster
- Caddy is a Go binary doing reverse-proxy work — no need for kernel isolation
- Near-native network performance

Steps:
1. Create `edge` LXC: Debian 12, 1 vCPU, 512MB RAM, 8GB disk, **static IP from host list**
2. Install Caddy via official Debian repo:
   ```bash
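   # curl and gnupg are not always present on a minimal Debian 12 template
   apt update && apt install -y debian-keyring debian-archive-keyring apt-transport-https curl gnupg
   # Add the Caddy repo signing key and source list
   curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/gpg.key' | gpg --dearmor -o /usr/share/keyrings/caddy-stable-archive-keyring.gpg
   curl -1sLf 'https://dl.cloudsmith.io/public/caddy/stable/debian.deb.txt' | tee /etc/apt/sources.list.d/caddy-stable.list
   # Install Caddy from the new repo (runs as a systemd service)
   apt update && apt install -y caddy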
   ```
3. Copy `Caddyfile` + custom snippets (`(security-headers)` etc.) from the VM
4. Add a **test subdomain** (e.g. `test.levkin.ca`) pointing at the new LXC — verify TLS issues and routing works
5. Cut over: update router port-forward (80/443) to the new LXC IP. DNS A records don't need to change if they point to your home IP.
6. Watch Mailcow, Cal.com, Listmonk, the marketing sites for ~24h
7. Keep the old VM snapshot for a week, then delete
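Before step 5, the copied config can be checked without touching DNS or the router. The sketch below prints the checks rather than running them. It assumes the edge LXC ends up at `10.0.10.20` (the `.20` from the next-steps table) and that the Caddyfile sits at the Debian package default path; the function name is hypothetical.

```bash
# Print pre-cutover checks for the new edge LXC (nothing is executed).
#   $1    = static IP of the new edge LXC (assumed, see lead-in)
#   $2... = hostnames served by Caddy
precutover_checks() {
  local new_ip="$1" host
  shift
  # Syntax-check the copied config (path is the Debian package default).
  printf '%s\n' "caddy validate --config /etc/caddy/Caddyfile"
  for host in "$@"; do
    # --resolve pins the hostname to the new LXC for this one request only.
    printf '%s\n' "curl -sS -o /dev/null -w '%{http_code}\n' --resolve ${host}:80:${new_ip} http://${host}/"
  done
}

precutover_checks 10.0.10.20 cal.levkin.ca auth.levkin.ca
```

With automatic HTTPS, Caddy typically answers plain-HTTP requests with a redirect, so a 3xx status here suggests the site block loaded. Certificate issuance itself can usually only be confirmed once ports 80/443 reach the new LXC, unless a DNS challenge is configured.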

### Phase 2 — Quick wins ✅
1. ✅ **Umami** — tracking on caseware, auto, and iliadobkin.com (portfolio)
2. ✅ **Uptime Kuma** — monitors in UI
3. ✅ **Dockge** — logged in; register `/opt/monitoring` stack (see [monitoring-stack.md](monitoring-stack.md))
4. ⏳ **Kuma email alerts** — SMTP via Mailcow `alerts@levkine.ca` → your inbox (steps in monitoring-stack.md)

### Phase 3 — Cal.com (mostly done) ✅
1. ✅ Cal.com deployed in `business` LXC (id 210, Postgres included)
2. ✅ `cal.levkin.ca` proxied via Caddy
3. ✅ Booking link live at `cal.levkin.ca/ilia/consult` with Jitsi location
4. ✅ Email working via `cal@levkine.ca` SMTP through Mailcow
5. ⏳ **Wire Cal.com to Authentik via OIDC** (first real SSO connection — do this after Phase 1)
6. ⏳ Update `auto.levkin.ca` button → `cal.levkin.ca/ilia/consult` (currently points to placeholder)

### Phase 4 — SSO migration (~half a day, staged)
Wire each to Authentik, least-risky first:
1. **Vikunja** (OIDC native) — easy, single-user impact
2. ~~**Nextcloud**~~ — **skipped** (VM 201 retiring)
3. **Listmonk** (OIDC native, admin only) — easy
4. **Mattermost** (SAML or OIDC native) — moderate
5. **Mailcow** (OIDC) — last, because mail-critical

For each: keep a local admin password as a break-glass account.
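When wiring each app, most OIDC clients only need the provider's discovery URL plus a client ID and secret. Authentik normally publishes one discovery document per application under `/application/o/<slug>/`. The sketch below builds that URL; the slug `vikunja` is an assumption (it is whatever you name the application in Authentik), and the helper name is illustrative.

```bash
# Build the OIDC discovery URL for an Authentik application.
#   $1 = Authentik base URL, $2 = application slug (assumed; set in Authentik)
oidc_discovery_url() {
  local base="$1" slug="$2"
  # ${base%/} drops a trailing slash so the path is not doubled.
  echo "${base%/}/application/o/${slug}/.well-known/openid-configuration"
}

oidc_discovery_url https://auth.levkin.ca vikunja

# To inspect the live document (needs curl, and the application must exist):
#   curl -fsS "$(oidc_discovery_url https://auth.levkin.ca vikunja)"
```

Fetching the document before configuring the app is a cheap way to confirm that the slug and the external hostname are right.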

### Phase 5 — Family / personal wins (~1 evening)
1. **Immich** in `media` VM — install mobile apps for you and family, enable auto-upload. Face recognition runs in background; "my kids 2024" works within a couple days.
2. Skip PhotoPrism — Immich covers it.

### Phase 6 — Business / consulting (~1–2 evenings)
1. **Crater** in `business` LXC — tax rates, company info, Stripe integration if you want online payment
2. **Beszel** hub in `monitoring` LXC + agents on each LXC — one dashboard for resource usage

### Phase 7 — Automation depth (ongoing)
Only when you have a real use case:
1. **Huginn** in `automation` — first agent: competitor pages, kosher product availability, grant deadlines
2. **Windmill** in `automation` — first script: rewrite an n8n flow with too many code nodes
3. **Flowise** in `labs` — first flow: chat-with-docs against your consulting notes

### Phase 8 — Knowledge / research
1. **Outline** in `productivity` LXC — client-facing wiki + your notes
2. **Linkwarden** in `productivity` LXC — bookmarks with full-page archive
3. **Paperless-ngx** in `media` — scan and OCR the paper that's accumulating
4. **SiYuan** — only if/when PhD or long-form research becomes relevant

---

## Static IP audit

**Maintain a `host-list.md` file** (in this Cursor project, alongside this plan) with every LXC/VM, its current IP, its target static IP, and DHCP/static status. Cursor will use this as the source of truth when scripting changes.

Suggested format:

| LXC/VM ID | Name | Role | Current IP | Target static IP | DHCP/Static | Notes |
|---|---|---|---|---|---|---|
| 210 | cal | Cal.com | 10.0.10.228/24 (static) | 10.0.10.228/24 | ✅ static | Pinned via `pct set` |
| ... | ... | ... | ... | ... | ... | ... |

### Recommended IP plan

Use role-based ranges within the single `10.0.10.0/24` subnet (or whatever your LAN is) so it's scannable:

| Range | Reserved for |
|---|---|
| `.1 - .9` | Network gear (router, switches, APs) |
| `.10 - .19` | Proxmox host(s) + PBS |
| `.20 - .39` | Edge / identity / comms (critical infra) |
| `.40 - .79` | Application LXCs (productivity, automation, business, monitoring) |
| `.80 - .99` | Media VM(s) |
| `.100 - .199` | DHCP pool (clients, phones, laptops) |
| `.200 - .249` | Labs / experimental |
| `.250 - .254` | Reserved |
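The range table can be turned into a small lookup for use while filling in `host-list.md`. This is a sketch only: the function name and the short labels are illustrative, and the ranges are copied from the table above.

```bash
# Map an address (with or without a /24 suffix) to the role range it falls in.
# role_for_ip is a hypothetical helper; labels abbreviate the table above.
role_for_ip() {
  local octet="${1##*.}"   # keep what follows the last dot, e.g. "228/24"
  octet="${octet%%/*}"     # drop an optional prefix length, e.g. "228"
  if   [ "$octet" -ge 1 ]   && [ "$octet" -le 9 ];   then echo "network gear"
  elif [ "$octet" -ge 10 ]  && [ "$octet" -le 19 ];  then echo "proxmox hosts + pbs"
  elif [ "$octet" -ge 20 ]  && [ "$octet" -le 39 ];  then echo "critical infra"
  elif [ "$octet" -ge 40 ]  && [ "$octet" -le 79 ];  then echo "application lxcs"
  elif [ "$octet" -ge 80 ]  && [ "$octet" -le 99 ];  then echo "media"
  elif [ "$octet" -ge 100 ] && [ "$octet" -le 199 ]; then echo "dhcp pool"
  elif [ "$octet" -ge 200 ] && [ "$octet" -le 249 ]; then echo "labs"
  elif [ "$octet" -ge 250 ] && [ "$octet" -le 254 ]; then echo "reserved"
  else echo "outside plan"
  fi
}

role_for_ip 10.0.10.21      # identity LXC 217
role_for_ip 10.0.10.228/24  # Cal.com LXC 210
```

Running it against current addresses is a quick way to spot hosts that predate the plan: `.228` falls in the labs range even though LXC 210 is a business host.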

### How to set static on a Proxmox LXC

Two methods — pick one and stick with it:

**Method A — Proxmox CLI (recommended, survives reboots cleanly):**
```bash
pct set <ID> -net0 name=eth0,bridge=vmbr0,ip=10.0.10.X/24,gw=10.0.10.1
pct reboot <ID>
```

**Method B — Router DHCP reservation:**
- Reserve the IP in your router's DHCP table by MAC address. LXC stays "DHCP" technically, but always gets the same IP.
- Easier if you have many hosts and one router.
- Risk: if the LXC's MAC changes (rebuild from snapshot to new ID), reservation breaks.

**Recommendation:** Method A (`pct set`) for everything critical (edge, identity, comms, business). Method B is fine for labs/experimental LXCs.

### Audit checklist

1. List every LXC: `pct list`
2. List every VM: `qm list`
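3. For each, run `pct exec <ID> -- ip a` (or `qm guest exec <ID> -- ip a` for VMs, which needs the QEMU guest agent running in the guest) and check whether the IP came from DHCP. DHCP-assigned addresses are usually flagged `dynamic` in the output; for LXCs, `pct config <ID>` also shows `ip=dhcp` on the `net0` line.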
4. Fill in `host-list.md`
5. Pick target IPs from the range plan above
6. Convert one at a time, lowest-risk first (labs → productivity → business → comms → identity → edge)
7. **After each conversion**, verify the Caddy reverse-proxy entry still works (curl from outside)
8. Update `host-list.md` status column
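Steps 1 and 3 can be partly automated for LXCs by reading the `net0` line from `pct config` instead of logging in to each container. The sketch below is a starting point under assumptions: the function names are hypothetical, it covers LXCs only (not VMs), and `audit_lxcs` has to run as root on the Proxmox host.

```bash
# Classify a "net0: ..." line from `pct config <ID>` as dhcp or static.
net0_mode() {
  local line="${1#net0: }" ip
  # Split the key=value list on commas and keep the ip= value (not ip6=).
  ip="$(printf '%s\n' "$line" | tr ',' '\n' | sed -n 's/^ip=//p')"
  case "$ip" in
    dhcp) echo "dhcp" ;;
    */*)  echo "static ${ip}" ;;
    *)    echo "unknown" ;;
  esac
}

# Print "<ID><TAB><mode>" for every LXC on this host.
audit_lxcs() {
  local id
  for id in $(pct list | awk 'NR>1 {print $1}'); do
    printf '%s\t%s\n' "$id" "$(net0_mode "$(pct config "$id" | grep '^net0:')")"
  done
}

# Example with a sample line (the MAC address is made up):
net0_mode 'net0: name=eth0,bridge=vmbr0,gw=10.0.10.1,hwaddr=AA:BB:CC:DD:EE:FF,ip=10.0.10.21/24,type=veth'
```

The output maps directly onto the DHCP/Static column of `host-list.md`; anything reported as `unknown` is worth checking by hand.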

### Hosts known to need conversion right now

- **LXC 210 (cal)** — ✅ converted; static at `10.0.10.228/24` (was DHCP at the same address)
- Remaining VMs are still on DHCP (see Phase 0, step 2) and are the next candidates

---

## Backlog (priority order)

### P0 — next batch after Phase 1 admin bootstrap
1. ✅ **Umami** — analytics on landing pages, 10 min to deploy, immediate signal (done in Phase 2)
2. ✅ **Uptime Kuma** — monitor what you already have (done in Phase 2)
3. ✅ **Dockge** — UI over existing compose (done in Phase 2)
4. **Beszel** — homelab resource visibility
5. **Mealie** — family recipes, simple win

### P1 — when ready
- **Outline** — wiki for client docs
- **Linkwarden** — bookmarks with full-page archive
- **Plane** — Jira-lite project management (pair with Mattermost)

### P2 — when you have a real need
- **Crater** — invoicing (Phase 6)
- **Immich** — photos (Phase 5)
- **Paperless-ngx** — document scanning (Phase 8)
- **Huginn** — once you have a monitoring use case
- **Windmill** — when n8n hits limits
- **Trigger.dev** — durable background jobs in code (better fit than Windmill for QA work)
- **PrivateBin** — encrypted paste for sharing secrets with contractors
- **Addy.io** — email aliases
- **SiYuan** — if PhD work picks up
- **Flowise** — labs only, when LLM workflow use case appears

### Skip / declined
- ~~PhotoPrism~~ — Immich covers it
- ~~Activepieces~~ — you already have n8n
- ~~Affine / Trilium~~ — picked Outline + SiYuan instead
- ~~Matrix/Synapse + Element~~ — staying on Mattermost
- ~~Coolify / Dokploy / CapRover~~ — Dockge is enough; revisit only if writing many custom apps

---

## Backup strategy

- **Proxmox Backup Server (PBS)** or `vzdump` to a NAS — back up each LXC/VM nightly
- **Critical groups** (`identity`, `comms`, `business`): 7 daily + 4 weekly + 12 monthly
- **Productivity/automation**: 7 daily + 4 weekly
- **Labs**: 3 daily, no long retention
- **Off-site copy** of `identity` and `business` LXCs — these contain auth and billing data. Encrypted copy to Wasabi or Backblaze B2.

Each LXC is backed up as a whole, which is much simpler than file-level container backup.
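If you go the `vzdump` route, the retention tiers above map onto its `--prune-backups` option. The sketch below prints the command per group. The helper names and the storage name `nas-backups` are assumptions, and groups without a stated tier (media, monitoring) are deliberately rejected. With PBS, retention is usually configured as a prune job on the PBS side instead.

```bash
# Translate a group name into vzdump retention settings (tiers from the list above).
retention_for_group() {
  case "$1" in
    identity|comms|business) echo "keep-daily=7,keep-weekly=4,keep-monthly=12" ;;
    productivity|automation) echo "keep-daily=7,keep-weekly=4" ;;
    labs)                    echo "keep-daily=3" ;;
    *) echo "no retention tier defined for group: $1" >&2; return 1 ;;
  esac
}

# Print (not run) the nightly backup command for one guest.
#   $1 = LXC/VM ID, $2 = group, $3 = Proxmox storage name (hypothetical)
backup_cmd() {
  local id="$1" group="$2" storage="$3" keep
  keep="$(retention_for_group "$group")" || return 1
  echo "vzdump ${id} --storage ${storage} --mode snapshot --prune-backups ${keep}"
}

backup_cmd 217 identity nas-backups
```

In practice the same settings can be entered once per backup job in the Proxmox UI; the script form is mainly useful for keeping the tiers reviewable alongside this plan.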

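**Done on pve10 (2026-05-22):** `pct snapshot` **`backup-20260522`** on LXCs **217** (identity) and **218** (monitoring). These snapshots sit on the same storage as the containers, so they are a rollback point, not a replacement for the nightly backups.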

---

## Next steps (priority order)

See **[homelab-status-2026-05-22.md](homelab-status-2026-05-22.md)** for done vs todo.

| # | Task | Effort | Doc |
|---|------|--------|-----|
| 1 | **Kuma SMTP** test in UI | 5 min | [monitoring-stack.md](monitoring-stack.md) |
| 2 | **UniFi DHCP reservations** | 20 min | [unifi-static-dhcp.md](unifi-static-dhcp.md) |
| 3 | **Cal.com → Authentik OIDC** | 1–2 h | Phase 3 above |
| 4 | **Retire Nextcloud VM 201** | 30 min | [nextcloud-export-2026-05-21.md](nextcloud-export-2026-05-21.md) |
| 5 | **NAS.SP00** disk replace → Jellyfin | hardware | [nas-sp00-drive-failure-report.md](nas-sp00-drive-failure-report.md) |
| 6 | **Caddy → edge LXC `.20`** | ~30 min | Phase 1.5 |

**Defer:** Immich, Crater, Beszel until above are done. Nextcloud SSO is dropped, since VM 201 is retiring.

### Nextcloud decommission (VM 201)

1. Confirm export in `exports/nextcloud-2026-05-21/` is complete
2. Delete **Nextcloud** monitor in Kuma
3. Remove the Nextcloud site block from the Caddy VM (`nextcloud.levkin.ca` or `cloud.levkin.ca`, whichever the Caddyfile actually uses)
4. Stop VM 201; update [host-list.md](host-list.md)
5. After NAS healthy: optional `vzdump` archive then delete disk

---

## Important rules

1. **Never put Authentik behind itself.** `auth.levkin.ca` is a simple Caddy passthrough with no forward-auth and no fancy dependencies. If Authentik sat behind its own login and then broke, you could not log in to Authentik to fix it.
2. **Vaultwarden stays standalone.** It's your break-glass path if Authentik dies. Don't OIDC it.
3. **Keep a local admin password on every SSO-wired app.** OIDC integrations break during upgrades — you need to log in to fix them.
4. **Local admin to Proxmox host.** Independent of Authentik and Vaultwarden. Written down somewhere physical.
5. **Don't expose admin UIs publicly.** Dockge, Beszel, Uptime Kuma admin, n8n editor — use Tailscale or WireGuard for remote access.
6. **Static IPs for every LXC.** DHCP will eventually move them and Caddy will break. Set via `pct set <id> -net0 ...ip=10.0.10.X/24,gw=...` or a router reservation.
7. **Cal.com LXC (210)** — static at `.228` ✅.
8. **Maintain `host-list.md`** as the single source of truth for IPs. Update it whenever a new LXC/VM is created or migrated.