NAS.SP00 drive failure — IT report
Date: 2026-05-21
Host: PVENAS (Proxmox VE) — 10.0.10.10
Pool: ZFS NAS.SP00 (~9 TB, ~862 GB used)
Prepared for: IT / hardware replacement
SMART audit: nas-sp00-smart-audit-2026-05-21.md
Executive summary
One disk in the four-drive ZFS pool (two mirrored pairs) has failed at the hardware level. The pool is DEGRADED but online, with no known data errors at this time. The failed drive must be physically replaced and the pool resilvered. Until then, mirror-0 has no redundancy: a failure of its remaining disk (W4J0L0BA) could cause data loss.
This issue also caused a host-wide I/O wedge (pool SUSPENDED → stuck sync()), which blocked LXC/VM operations unrelated to the pool (e.g. Cal.com on local-lvm). That was cleared by a forced node reboot; replacing the drive remains required.
Pool layout
| Vdev | Role | Disk A | Disk B | Status |
|---|---|---|---|---|
| mirror-0 | RAID1 pair | W4J0L0BA (sda, 5 TB) | W4J0L3PY (sdb) | DEGRADED (sdb UNAVAIL) |
| mirror-1 | RAID1 pair | W4J0LKCD (sdd, 5 TB) | W4J0K9V7 (sdc, 5 TB) | ONLINE |
Model family (healthy drives): Seagate ST5000DM000-1FK178 (5 TB, 5900 RPM).
Failed drive identification
| Field | Expected | Observed |
|---|---|---|
| Serial | W4J0L3PY | W4J0L3PY |
| Model | ST5000DM000-1FK178 | ST5000DM000 (truncated reporting) |
| WWN | — | 5000c50082cc8bbb |
| Firmware | — | CC48 |
| Capacity | ~5,000,981,078,016 bytes (5.00 TB) | 137,438,952,960 bytes (~137 GB) |
| Linux device | /dev/sdb | /dev/sdb |
| ZFS state | ONLINE | UNAVAIL — label missing/invalid |
ZFS last known path:
/dev/disk/by-id/ata-ST5000DM000-1FK178_W4J0L3PY-part1
Symptoms and evidence
1. Capacity collapse (primary indicator)
The drive is detected as ~137 GB instead of 5 TB, under 3% of its rated capacity. ZFS cannot use a partition label created for a 5 TB disk on a device that exposes so little of it. This pattern is typical of:
- Failed HDD (media/controller failure)
- Bad SATA cable, backplane port, or HBA port
- USB/SATA bridge failure (if applicable)
- Severe firmware/HPA corruption (less common)
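The observed size is not arbitrary: 137,438,952,960 bytes equals (2^28 - 1) sectors of 512 bytes, the largest capacity addressable with 28-bit LBA. That may hint that the drive or link is answering with a legacy identify response and is not a genuinely smaller disk, but this is an interpretation the report itself does not establish. A minimal sketch of the arithmetic, assuming 512-byte logical sectors (the function names are hypothetical):

```bash
#!/usr/bin/env bash
# Sketch: compare a reported byte capacity with the 28-bit LBA ceiling.
# Assumption: 512-byte logical sectors.

# Largest capacity addressable with 28-bit LBA: (2^28 - 1) sectors * 512 bytes.
lba28_ceiling_bytes() {
    echo $(( ((1 << 28) - 1) * 512 ))
}

# Succeeds (exit 0) when the given byte count equals that ceiling exactly.
is_lba28_ceiling() {
    local bytes=$1
    [ "$bytes" -eq "$(lba28_ceiling_bytes)" ]
}

# Possible use on the host (not run here):
#   is_lba28_ceiling "$(blockdev --getsize64 /dev/sdb)" && echo "LBA28 ceiling"
```

If the check still succeeds after a reseat, that is consistent with the report's conclusion that the disk should be replaced.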
2. SMART / SCSI errors
smartctl against /dev/sdb:
- Read SMART Data failed: scsi error aborted command
- Overall health: UNKNOWN (attributes unreadable)
- Multiple log read commands fail (Error Log, Self-test Log, GP Log, etc.)
Healthy sibling in same mirror (/dev/sda, W4J0L0BA): SMART PASSED, full 5 TB capacity.
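To repeat the health comparison across all four disks, the verdict line of smartctl -H can be reduced to a single word. This is a sketch only: the function name is hypothetical, and it assumes the ATA-style "test result:" line that smartctl prints for SATA drives.

```bash
#!/usr/bin/env bash
# Sketch: reduce `smartctl -H` output (read from stdin) to one word.
# Prints PASSED, FAILED, or UNKNOWN when no verdict line is present,
# which is what an unreadable drive such as sdb would produce.

smart_health() {
    local line result="UNKNOWN"
    while IFS= read -r line; do
        case "$line" in
            *"test result: PASSED"*) result="PASSED" ;;
            *"test result: FAILED"*) result="FAILED" ;;
        esac
    done
    echo "$result"
}

# Possible use on the host (not run here):
#   for d in sda sdb sdc sdd; do
#       printf '%s: ' "$d"; smartctl -H "/dev/$d" | smart_health
#   done
```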
3. Kernel log (dmesg at boot, 2026-05-21 ~21:27)
Repeated on sdb:
Buffer I/O error on dev sdb
Sense Key: Medium Error
Add. Sense: Unrecovered read error
critical medium error, dev sdb, sector N op 0x0:(READ)
Indicates the block device cannot reliably read media — hardware or link layer, not a ZFS configuration issue.
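To quantify these messages before and after a reseat, the kernel log can be filtered per device. The sketch below counts only the "critical medium error" lines quoted above; the function name is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: count "critical medium error" kernel messages for one device.
# $1 is the kernel device name (for example sdb); the log is read from stdin.

count_medium_errors() {
    local dev=$1
    # grep -c prints 0 and exits non-zero when nothing matches,
    # so "|| true" keeps the function's exit status clean.
    grep -c "critical medium error, dev ${dev}," || true
}

# Possible use on the host (not run here):
#   dmesg | count_medium_errors sdb
#   journalctl -k -b | count_medium_errors sdb
```

A count that keeps growing after the cable and port have been reseated points toward the disk and away from the link.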
4. ZFS pool history
- Pool previously entered SUSPENDED state (I/O failures on faulted devices).
- After node reboot: pool DEGRADED, short resilver completed with 0 errors (healing scan on remaining devices).
- Current: zpool status reports no known data errors.
Impact
Storage / services on NAS.SP00
Proxmox guests with disks on this pool (non-exhaustive):
| VMID | Name | NAS-backed storage |
|---|---|---|
| 101 | Jellyfin | 1 TB zvol |
| 105 | TrueNAS | 1 TB zvol |
| 108 | actual-debian | 10 GB |
| 200 | PVE.BU.SVR | 1 TB |
| 201 | NextcloudAIO-debian | 8 TB |
Risk: With mirror-0 degraded, blocks stored only on the surviving mirror-0 disk have no redundancy until the failed drive is replaced and resilver completes.
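Because the table above is non-exhaustive, the guest list can be rebuilt from the Proxmox guest configuration files. The sketch below assumes disk lines of the usual form "scsi0: NAS.SP00:vm-101-disk-0,..." (the volume name is illustrative). The function name is hypothetical, and snapshot sections in a config file will also match.

```bash
#!/usr/bin/env bash
# Sketch: list guest IDs whose config references a given storage ID.
# $1 is the storage ID, $2 a directory of <vmid>.conf files.

guests_on_storage() {
    local storage=$1 confdir=$2 f
    for f in "$confdir"/*.conf; do
        [ -e "$f" ] || continue            # glob matched nothing
        # Fixed-string match on ": <storage>:" as used in disk/rootfs lines.
        if grep -qF ": ${storage}:" "$f"; then
            basename "$f" .conf
        fi
    done
}

# Possible use on the host (not run here):
#   guests_on_storage NAS.SP00 /etc/pve/qemu-server   # VMs
#   guests_on_storage NAS.SP00 /etc/pve/lxc           # containers
```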
Unrelated workloads
Guests on local-lvm (NVMe, e.g. Cal.com LXC 210, Caddy VM 106) are not stored on NAS.SP00 but were affected when the pool suspended and blocked system-wide sync().
Backup target
Proxmox datastore PVEBUVD00 (PBS @ 10.0.10.200:8007) is reported as unreachable from this node. This is a separate issue; verify the PBS host and network.
Diagnosis
| Question | Answer |
|---|---|
| Is this a ZFS misconfiguration? | No — config is consistent; three drives show correct 5 TB labels. |
| Is the pool lost? | No — degraded but importable; no known data errors currently. |
| Which disk to replace? | Seagate W4J0L3PY (/dev/sdb, mirror-0 failed leg). |
| Can we fix it in software? | Unlikely — capacity and SMART failures point to hardware. |
| Safe to reseat first? | Optional trial — power down or hot-swap per chassis policy; if capacity still reads ~137 GB, replace disk. |
Recommended actions
Immediate (IT / on-site)
- Identify physical slot for serial W4J0L3PY (compare to inventory/asset tags).
- Reseat SATA/SAS cable and backplane connection once (if hot-swap policy allows). Reboot or rescan SCSI bus.
- If capacity is still wrong or SMART still fails → replace with new 5 TB+ enterprise/NAS-class HDD (match class of ST5000DM000 or better).
- Do not remove the UNAVAIL device from the pool until replacement is in place.
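Kernel device names such as sdb can change across reboots or reseats, so it may help to resolve the device from the serial each time. The helper below is a hypothetical sketch that parses lsblk -dn -o NAME,SERIAL output.

```bash
#!/usr/bin/env bash
# Sketch: map a drive serial to its current /dev node.
# $1 is the serial; `lsblk -dn -o NAME,SERIAL` output is read from stdin.

dev_for_serial() {
    local serial=$1
    awk -v s="$serial" '$2 == s { print "/dev/" $1 }'
}

# Possible use on the host (not run here):
#   lsblk -dn -o NAME,SERIAL | dev_for_serial W4J0L3PY
#
# After a reseat without a reboot, a SCSI bus rescan is typically triggered
# per host adapter (hostN is a placeholder for the real adapter number):
#   echo "- - -" > /sys/class/scsi_host/hostN/scan
```

Empty output means no attached disk currently reports that serial, which is itself worth noting in the handoff.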
After new disk is installed
On PVENAS as root (adjust /dev/disk/by-id/... to the new drive’s partition 1):
# Verify new disk shows ~5 TB
lsblk /dev/sdX
smartctl -H /dev/sdX
# Replace failed vdev (use ID from: zpool status NAS.SP00)
zpool replace NAS.SP00 ata-ST5000DM000-1FK178_W4J0L3PY-part1 /dev/disk/by-id/ata-NEW_SERIAL-part1
# Monitor until resilver completes
zpool status -v NAS.SP00
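Resilvering a 5 TB mirror leg can take hours, so an unattended wait may be preferable to rerunning the status command by hand. The sketch below classifies the "scan:" line of zpool status. The function name is hypothetical, and the matched phrases assume the usual OpenZFS wording.

```bash
#!/usr/bin/env bash
# Sketch: classify resilver progress from `zpool status` output on stdin.
# Prints in_progress, done, or none.

resilver_state() {
    local line state="none"
    while IFS= read -r line; do
        case "$line" in
            *"resilver in progress"*) state="in_progress" ;;
            *"scan: resilvered "*)    state="done" ;;
        esac
    done
    echo "$state"
}

# Possible use on the host (not run here): poll once a minute until finished.
#   while [ "$(zpool status NAS.SP00 | resilver_state)" = "in_progress" ]; do
#       sleep 60
#   done
#   zpool status -v NAS.SP00
```

A "done" result only means the scan finished; the error counts in the full zpool status output still need to be read.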
Post-resilver
- Run zpool scrub NAS.SP00 during a maintenance window.
- Confirm PVEBUVD00 / PBS connectivity if backups depend on it.
- Review whether Nextcloud VM 201 (8 TB on degraded pool) should remain running until healthy.
Not recommended
- Ignoring degraded state for extended periods.
- Running heavy I/O on large VMs (e.g. 8 TB Nextcloud) during extended degraded operation.
- Running zpool clear without addressing the hardware; it does not fix a dead disk.
Reference — healthy disks (for spare matching)
| Serial | Device | Capacity | SMART |
|---|---|---|---|
| W4J0L0BA | sda | 5.00 TB | PASSED |
| W4J0K9V7 | sdc | 5.00 TB | PASSED |
| W4J0LKCD | sdd | 5.00 TB | PASSED |
Timeline (brief)
| When | Event |
|---|---|
| Prior to 2026-05-21 | W4J0L3PY accumulated read/write errors; pool faulted |
| 2026-05-21 | Pool SUSPENDED; host sync() wedged; Cal.com LXC start failed |
| 2026-05-21 ~21:28 | Forced node reboot; pool DEGRADED, resilver finished, 0 errors |
| 2026-05-21 | sdb still reports ~137 GB, UNAVAIL — replacement still required |
Contact / handoff notes
- Node: Proxmox VE 8.x on PVENAS (10.0.10.10)
- Pool name in Proxmox: NAS.SP00 (zfspool, active, degraded)
- Failed serial: W4J0L3PY
- Replacement type: 5 TB+ HDD, same or better class as Seagate ST5000DM000-1FK178
For questions about homelab service impact (Cal, Caddy, Phase 0 rollout), see levkin-selfhost-plan-2.md.
TL;DR
- Pool NAS.SP00 on PVENAS (10.0.10.10) had a disk failure (W4J0L3PY).
- The pool went SUSPENDED, required a forced reboot, and is now DEGRADED.
- Immediate action: replace the failed drive with a spare (same or larger size; see the healthy-disk reference table above).
- Use zpool replace with the correct device paths (see the main procedure).
- Monitor the resilver to completion; run zpool scrub afterwards.
- Backup services and large VMs (e.g. the 8 TB Nextcloud VM) depend on pool health; keep degraded time short.
- Reach out if unsure about pool status or downstream service risk.