
NAS.SP00 drive failure — IT report

Date: 2026-05-21
Host: PVENAS (Proxmox VE) — 10.0.10.10
Pool: ZFS NAS.SP00 (~9 TB, ~862 GB used)
Prepared for: IT / hardware replacement
SMART audit: nas-sp00-smart-audit-2026-05-21.md

Executive summary

One disk in a four-drive ZFS pool (two mirrored pairs) has failed at the hardware level. The pool is DEGRADED but online, with no known data errors at this time. The failed drive must be physically replaced and the pool resilvered. Until then, mirror-0 has no redundancy: a failure of its remaining disk (W4J0L0BA) could cause data loss.

This issue also caused a host-wide I/O wedge (pool SUSPENDED → stuck sync()), which blocked LXC/VM operations unrelated to the pool (e.g. Cal.com on local-lvm). That was cleared by a forced node reboot; replacing the drive remains required.

Pool layout

Vdev	Role	Disk A	Disk B	Status
mirror-0	RAID1 pair	`W4J0L0BA` (sda, 5 TB)	`W4J0L3PY` (sdb)	DEGRADED — sdb UNAVAIL
mirror-1	RAID1 pair	`W4J0LKCD` (sdd, 5 TB)	`W4J0K9V7` (sdc, 5 TB)	ONLINE

Model family (healthy drives): Seagate ST5000DM000-1FK178 (5 TB, 5900 RPM).

Failed drive identification

Field	Expected	Observed
Serial	W4J0L3PY	W4J0L3PY
Model	ST5000DM000-1FK178	ST5000DM000 (truncated reporting)
WWN	—	`5000c50082cc8bbb`
Firmware	—	CC48
Capacity	~5,000,981,078,016 bytes (5.00 TB)	137,438,952,960 bytes (~137 GB)
Linux device	`/dev/sdb`	`/dev/sdb`
ZFS state	ONLINE	UNAVAIL — label missing/invalid

ZFS last known path:
/dev/disk/by-id/ata-ST5000DM000-1FK178_W4J0L3PY-part1

Symptoms and evidence

1. Capacity collapse (primary indicator)

The drive is detected as ~137 GB instead of 5 TB, about 2.7% of its rated capacity. ZFS cannot use a partition label created for a 5 TB disk on a device that exposes so little of it. This pattern is typical of:

Failed HDD (media/controller failure)
Bad SATA cable, backplane port, or HBA port
USB/SATA bridge failure (if applicable)
Severe firmware/HPA corruption (less common)
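
The observed byte count can be checked with a little arithmetic. The sketch below assumes 512-byte logical sectors (the report does not state the sector size) and uses the two byte counts from the identification table. Under that assumption the drive exposes exactly 2^28 - 1 sectors, which is the ceiling of 28-bit LBA addressing. That would be consistent with the drive or its link falling back to a minimal identity, but it is an inference from the numbers, not something the SMART audit confirms.

```shell
# Byte counts taken from the "Failed drive identification" table.
expected_bytes=5000981078016
observed_bytes=137438952960
sector_bytes=512   # assumption: 512-byte logical sectors

observed_sectors=$((observed_bytes / sector_bytes))
lba28_max=$(( (1 << 28) - 1 ))

# Share of the expected capacity still visible, in hundredths of a percent.
visible_bp=$((observed_bytes * 10000 / expected_bytes))

echo "observed sectors: $observed_sectors"
echo "LBA28 maximum:    $lba28_max"
printf 'visible capacity: %d.%02d%% of expected\n' $((visible_bp / 100)) $((visible_bp % 100))
```

Both sector figures come out as 268435455, and the visible share is 2.74%.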

2. SMART / SCSI errors

smartctl against /dev/sdb:

Read SMART Data failed: scsi error aborted command
Overall health: UNKNOWN (attributes unreadable)
Multiple log read commands fail (Error Log, Self-test Log, GP Log, etc.)

Healthy sibling in same mirror (/dev/sda, W4J0L0BA): SMART PASSED, full 5 TB capacity.
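
A minimal sketch of how the two SMART outcomes above could be told apart in a script. It assumes the ATA-style `smartctl -H` output, where a healthy drive prints a line beginning `SMART overall-health self-assessment test result:`. The function name `smart_verdict` is hypothetical. A drive that aborts the SMART read prints no such line, so it is reported as UNKNOWN rather than FAILED.

```shell
# Classify `smartctl -H` text read from stdin as PASSED, FAILED, or UNKNOWN.
# UNKNOWN covers the case in this report: the drive aborts the SMART read,
# so no overall-health line is printed at all.
smart_verdict() {
    verdict=$(sed -n 's/^SMART overall-health self-assessment test result: *\([A-Za-z!]*\).*/\1/p' | head -n 1)
    case "$verdict" in
        PASSED) echo PASSED ;;
        "")     echo UNKNOWN ;;
        *)      echo FAILED ;;
    esac
}
```

Typical use would be `smartctl -H /dev/sda | smart_verdict` for the healthy sibling and the same against `/dev/sdb` for the failed leg.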

3. Kernel log (`dmesg` at boot, 2026-05-21 ~21:27)

Repeated on sdb:

Buffer I/O error on dev sdb
Sense Key: Medium Error
Add. Sense: Unrecovered read error
critical medium error, dev sdb, sector N op 0x0:(READ)

Indicates the block device cannot reliably read media — hardware or link layer, not a ZFS configuration issue.
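
To quantify the kernel messages rather than eyeball them, the log can be filtered per device. This is a sketch only: `count_medium_errors` is a hypothetical helper, and the match string is taken from the `critical medium error, dev sdb, sector N` line quoted above. Other kernel versions may word the message differently.

```shell
# Count kernel-log lines that report a critical medium error for one device.
# Reads log text on stdin; $1 is the kernel device name, e.g. sdb.
# The trailing comma in the pattern keeps "sdb" from also matching "sdb1".
count_medium_errors() {
    grep -c "critical medium error, dev $1," || true
}
```

For example, `dmesg | count_medium_errors sdb` or `journalctl -k -b | count_medium_errors sdb` for the current boot. A non-zero count on `sdb` alongside zero on `sda`, `sdc`, and `sdd` would support the single-drive diagnosis.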

4. ZFS pool history

Pool previously entered SUSPENDED state (I/O failures on faulted devices).
After node reboot: pool DEGRADED, short resilver completed with 0 errors (healing scan on remaining devices).
Current: No known data errors in zpool status.
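
The pool states in this history (SUSPENDED, DEGRADED, ONLINE) can be read out of `zpool status` text for logging or alerting. The helpers below are a hypothetical sketch that assumes the usual `state:` and `errors:` lines in that output. They parse text on stdin, so they can be tried against a saved copy of the output.

```shell
# Print the pool state from `zpool status` text on stdin,
# e.g. ONLINE, DEGRADED, or SUSPENDED.
pool_state() {
    awk '$1 == "state:" { print $2; exit }'
}

# Succeed only if the text carries the "No known data errors" line.
no_known_data_errors() {
    grep -q 'errors: No known data errors'
}
```

One way to use them: capture the output once with `status=$(zpool status NAS.SP00)`, then run `printf '%s\n' "$status" | pool_state` and `printf '%s\n' "$status" | no_known_data_errors`.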

Impact

Storage / services on `NAS.SP00`

Proxmox guests with disks on this pool (non-exhaustive):

VMID	Name	NAS-backed storage
101	Jellyfin	1 TB zvol
105	TrueNAS	1 TB zvol
108	actual-debian	10 GB
200	PVE.BU.SVR	1 TB
201	NextcloudAIO-debian	8 TB

Risk: With mirror-0 degraded, blocks stored only on the surviving mirror-0 disk have no redundancy until the failed drive is replaced and resilver completes.

Unrelated workloads

Guests on local-lvm (NVMe, e.g. Cal.com LXC 210, Caddy VM 106) are not stored on NAS.SP00 but were affected when the pool suspended and blocked system-wide sync().

Backup target

Proxmox datastore PVEBUVD00 (PBS @ 10.0.10.200:8007) reports unreachable from this node — separate issue; verify PBS host/network.

Diagnosis

Question	Answer
Is this a ZFS misconfiguration?	No — config is consistent; three drives show correct 5 TB labels.
Is the pool lost?	No — degraded but importable; no known data errors currently.
Which disk to replace?	Seagate W4J0L3PY (`/dev/sdb`, mirror-0 failed leg).
Can we fix it in software?	Unlikely — capacity and SMART failures point to hardware.
Safe to reseat first?	Optional trial — power down or hot-swap per chassis policy; if capacity still reads ~137 GB, replace disk.

Recommended actions

Immediate (IT / on-site)

Identify physical slot for serial W4J0L3PY (compare to inventory/asset tags).
Reseat SATA/SAS cable and backplane connection once (if hot-swap policy allows). Reboot or rescan SCSI bus.
If capacity is still wrong or SMART still fails → replace with new 5 TB+ enterprise/NAS-class HDD (match class of ST5000DM000 or better).
Do not remove the UNAVAIL device from the pool until replacement is in place.
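
The first two steps can be supported from the shell. The sketch below uses two hypothetical helpers: `find_by_serial` lists the `/dev/disk/by-id` entries that carry a given serial, and `rescan_scsi_hosts` asks each host adapter to rescan after a reseat by writing to the standard sysfs `scan` file. Neither identifies the physical slot; that still needs the asset inventory.

```shell
# List by-id entries whose name contains a serial number, with the device
# each one resolves to. $2 overrides the directory (useful for dry runs).
find_by_serial() {
    dir=${2:-/dev/disk/by-id}
    for link in "$dir"/*"$1"*; do
        [ -e "$link" ] || [ -L "$link" ] || continue
        printf '%s -> %s\n' "${link##*/}" "$(readlink -f "$link")"
    done
}

# Ask every SCSI/SATA host adapter to rescan its bus (run as root).
rescan_scsi_hosts() {
    for scan in /sys/class/scsi_host/host*/scan; do
        if [ -w "$scan" ]; then
            echo "- - -" > "$scan"
        fi
    done
}
```

After a reseat, `rescan_scsi_hosts` followed by `find_by_serial W4J0L3PY` and `blockdev --getsize64 /dev/sdb` shows whether the drive has come back at full capacity.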

After new disk is installed

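On PVENAS as root (replace sdX and the by-id path with the new drive's values):

# Verify the new disk reports ~5 TB and passes SMART
lsblk -b -o NAME,SIZE,MODEL,SERIAL /dev/sdX
smartctl -H /dev/sdX

# Replace the failed vdev. Take the old device name (or GUID) from: zpool status NAS.SP00
# Pass the new disk as a whole-disk by-id path; ZFS partitions a blank disk itself.
# A -part1 path exists only if the disk was partitioned beforehand.
zpool replace NAS.SP00 ata-ST5000DM000-1FK178_W4J0L3PY-part1 /dev/disk/by-id/ata-NEW_MODEL_NEW_SERIAL

# Monitor until resilver completes
zpool status -v NAS.SP00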
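
A guard around the replacement step can catch an undersized or misreporting disk before ZFS is touched. The sketch below is hypothetical, not part of the original procedure. It compares the new disk's byte count with the 5,000,981,078,016 bytes expected for the old one, on the basis that `zpool replace` needs a new device at least as large as the smallest device in the mirror. The raw-disk comparison is an approximation of that rule. `zpool wait` is assumed to be available, which requires OpenZFS 2.0 or later.

```shell
# Succeed if a replacement of $1 bytes can stand in for a device of $2 bytes.
replacement_big_enough() {
    [ "$1" -ge "$2" ]
}

# Hypothetical wrapper: check the new disk's size, then replace and wait.
# $1 = old vdev name or GUID from `zpool status`
# $2 = new device path, as it would be passed to `zpool replace`
replace_failed_leg() {
    old_bytes=5000981078016   # expected capacity from this report
    new_bytes=$(blockdev --getsize64 "$2") || return 1
    if ! replacement_big_enough "$new_bytes" "$old_bytes"; then
        echo "refusing: $2 is $new_bytes bytes, need at least $old_bytes" >&2
        return 1
    fi
    zpool replace NAS.SP00 "$1" "$2" && zpool wait -t resilver NAS.SP00
}
```

The size check alone would have rejected the failed drive in its current state, since 137,438,952,960 bytes is far below the threshold.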

Post-resilver

Run zpool scrub NAS.SP00 during a maintenance window.
Confirm PVEBUVD00 / PBS connectivity if backups depend on it.
Review whether Nextcloud VM 201 (8 TB on degraded pool) should remain running until healthy.

Not recommended

Ignoring degraded state for extended periods.
Running heavy I/O on large VMs (e.g. 8 TB Nextcloud) during extended degraded operation.
zpool clear without addressing hardware — does not fix a dead disk.

Reference — healthy disks (for spare matching)

Serial	Device	Capacity	SMART
W4J0L0BA	sda	5.00 TB	PASSED
W4J0K9V7	sdc	5.00 TB	PASSED
W4J0LKCD	sdd	5.00 TB	PASSED

Timeline (brief)

When	Event
Prior to 2026-05-21	`W4J0L3PY` accumulated read/write errors; pool faulted
2026-05-21	Pool SUSPENDED; host `sync()` wedged; Cal LXC start failed
2026-05-21 ~21:28	Forced node reboot; pool DEGRADED, resilver finished, 0 errors
2026-05-21	`sdb` still reports ~137 GB, UNAVAIL — replacement still required

Contact / handoff notes

Node: Proxmox VE 8.x on PVENAS (10.0.10.10)
Pool name in Proxmox: NAS.SP00 (zfspool, active, degraded)
Failed serial: W4J0L3PY
Replacement type: 5 TB+ HDD, same or better class as Seagate ST5000DM000-1FK178

For questions about homelab service impact (Cal, Caddy, Phase 0 rollout), see levkin-selfhost-plan-2.md.

TL;DR

Pool NAS.SP00 on PVENAS (10.0.10.10) had a disk failure (W4J0L3PY)
Pool went SUSPENDED; required forced reboot and is now DEGRADED
Immediate action: Replace the failed drive with a spare (same or larger size; see healthy serials in the reference table above)
Use zpool replace command with correct device paths (see main procedure)
Monitor resilver to completion; run zpool scrub after
Backup services and large VMs (e.g. Nextcloud, 8 TB) depend on pool health; keep degraded time short
Reach out if unsure about pool status or downstream service risk
