ansible/docs/guides/nas-sp00-drive-failure-report.md
ilia de49b34cdc
2026-05-22 16:25:07 -04:00


NAS.SP00 drive failure — IT report

Date: 2026-05-21
Host: PVENAS (Proxmox VE) — 10.0.10.10
Pool: ZFS NAS.SP00 (~9 TB, ~862 GB used)
Prepared for: IT / hardware replacement
SMART audit: nas-sp00-smart-audit-2026-05-21.md


Executive summary

One disk in a four-drive ZFS pool (two mirrored pairs) has failed at the hardware level. The pool is DEGRADED but online, with no known data errors at this time. The failed drive must be physically replaced and the pool resilvered. Until then, mirror-0 has no redundancy: a second failure on the remaining disk in that mirror (W4J0L0BA) could cause data loss.

This issue also caused a host-wide I/O wedge (pool SUSPENDED → stuck sync()), which blocked LXC/VM operations unrelated to the pool (e.g. Cal.com on local-lvm). That was cleared by a forced node reboot; replacing the drive remains required.


Pool layout

| Vdev | Role | Disk A | Disk B | Status |
|---|---|---|---|---|
| mirror-0 | RAID1 pair | W4J0L0BA (sda, 5 TB) | W4J0L3PY (sdb) | DEGRADED (sdb UNAVAIL) |
| mirror-1 | RAID1 pair | W4J0LKCD (sdd, 5 TB) | W4J0K9V7 (sdc, 5 TB) | ONLINE |

Model family (healthy drives): Seagate ST5000DM000-1FK178 (5 TB, 5900 RPM desktop-class).


Failed drive identification

| Field | Expected | Observed |
|---|---|---|
| Serial | W4J0L3PY | W4J0L3PY |
| Model | ST5000DM000-1FK178 | ST5000DM000 (truncated reporting) |
| WWN | 5000c50082cc8bbb | |
| Firmware | CC48 | |
| Capacity | ~5,000,981,078,016 bytes (5.00 TB) | 137,438,952,960 bytes (~137 GB) |
| Linux device | /dev/sdb | /dev/sdb |
| ZFS state | ONLINE | UNAVAIL (label missing/invalid) |

ZFS last known path:
/dev/disk/by-id/ata-ST5000DM000-1FK178_W4J0L3PY-part1


Symptoms and evidence

1. Capacity collapse (primary indicator)

The drive is detected as ~137 GB instead of 5 TB. ZFS cannot use a partition label created for a 5 TB disk on a device that exposes only a tiny fraction of capacity. This pattern is typical of:

  • Failed HDD (media/controller failure)
  • Bad SATA cable, backplane port, or HBA port
  • USB/SATA bridge failure (if applicable)
  • Severe firmware/HPA corruption (less common)
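The observed size is not arbitrary. As a quick arithmetic check (a sketch, assuming the drive presents 512-byte logical sectors, which the report does not state), 137,438,952,960 bytes is exactly the largest capacity addressable with 28-bit LBA:

```shell
#!/bin/sh
# Observed capacity, copied from the identification table in this report.
observed_bytes=137438952960
# Assumption: the drive presents 512-byte logical sectors.
sector_bytes=512

# Highest sector count addressable with 28-bit LBA: 2^28 - 1.
lba28_sectors=$(( (1 << 28) - 1 ))
lba28_bytes=$(( lba28_sectors * sector_bytes ))

echo "28-bit LBA ceiling: ${lba28_bytes} bytes"
if [ "$observed_bytes" -eq "$lba28_bytes" ]; then
    echo "observed capacity equals the 28-bit LBA ceiling"
else
    echo "observed capacity differs from the 28-bit LBA ceiling"
fi
```

A match suggests the host is only seeing the drive's legacy 28-bit capacity field, which would be consistent with incomplete or corrupted identity data. It does not by itself distinguish between the drive, cable, and port causes listed above.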

2. SMART / SCSI errors

smartctl against /dev/sdb:

  • Read SMART Data failed: scsi error aborted command
  • Overall health: UNKNOWN (attributes unreadable)
  • Multiple log read commands fail (Error Log, Self-test Log, GP Log, etc.)

Healthy sibling in same mirror (/dev/sda, W4J0L0BA): SMART PASSED, full 5 TB capacity.
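To repeat the SMART comparison across all four disks, a small filter over `smartctl -H` output may help. This is a minimal sketch: the helper name `smart_verdict` is hypothetical, and it only recognises the ATA overall-health line, treating anything else (including the read failure seen on sdb) as UNKNOWN.

```shell
#!/bin/sh
# smart_verdict (hypothetical helper): read `smartctl -H` output on stdin
# and print PASSED, FAILED, or UNKNOWN.
smart_verdict() {
    awk '
        /overall-health self-assessment test result: PASSED/ { v = "PASSED" }
        /overall-health self-assessment test result: FAILED/ { v = "FAILED" }
        END { print (v == "" ? "UNKNOWN" : v) }
    '
}

# Usage on the host (requires root and smartmontools):
#   for dev in /dev/sda /dev/sdb /dev/sdc /dev/sdd; do
#       printf '%s: ' "$dev"
#       smartctl -H "$dev" 2>&1 | smart_verdict
#   done
```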

3. Kernel log (dmesg at boot, 2026-05-21 ~21:27)

Repeated on sdb:

```
Buffer I/O error on dev sdb
Sense Key: Medium Error
Add. Sense: Unrecovered read error
critical medium error, dev sdb, sector N op 0x0:(READ)
```

These errors indicate that the block device cannot reliably read the media, which points to a hardware or link-layer fault rather than a ZFS configuration issue.
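To quantify how often the kernel reported this for a given device, the log can be filtered by the message text quoted above. A minimal sketch, with `count_medium_errors` as a hypothetical helper name:

```shell
#!/bin/sh
# count_medium_errors DEV (hypothetical helper): read kernel log text on
# stdin and print how many "critical medium error" lines name DEV.
count_medium_errors() {
    awk -v pat="critical medium error, dev $1," '
        index($0, pat) { n++ }
        END { print n + 0 }
    '
}

# Usage on the host:
#   dmesg | count_medium_errors sdb
#   journalctl -k -b | count_medium_errors sdb   # current boot only
```

A non-zero count for sdb alongside zero for sda, sdc, and sdd would support the conclusion that the fault is isolated to one device or its link.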

4. ZFS pool history

  • Pool previously entered SUSPENDED state (I/O failures on faulted devices).
  • After node reboot: pool DEGRADED, short resilver completed with 0 errors (healing scan on remaining devices).
  • Current: No known data errors in zpool status.

Impact

Storage / services on NAS.SP00

Proxmox guests with disks on this pool (non-exhaustive):

| VMID | Name | NAS-backed storage |
|---|---|---|
| 101 | Jellyfin | 1 TB zvol |
| 105 | TrueNAS | 1 TB zvol |
| 108 | actual-debian | 10 GB |
| 200 | PVE.BU.SVR | 1 TB |
| 201 | NextcloudAIO-debian | 8 TB |

Risk: With mirror-0 degraded, blocks stored only on the surviving mirror-0 disk have no redundancy until the failed drive is replaced and resilver completes.

Unrelated workloads

Guests on local-lvm (NVMe, e.g. Cal.com LXC 210, Caddy VM 106) are not stored on NAS.SP00 but were affected when the pool suspended and blocked system-wide sync().
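The report does not record how the wedge was diagnosed. One common way to spot this condition is to look for processes stuck in uninterruptible sleep (state D). A sketch, with `list_d_state` as a hypothetical helper name:

```shell
#!/bin/sh
# list_d_state (hypothetical helper): read `ps -eo pid,stat,wchan:32,comm`
# output on stdin and print only the rows whose STAT column starts with D
# (uninterruptible sleep). The header row is skipped.
list_d_state() {
    awk 'NR > 1 && $2 ~ /^D/'
}

# Usage on the host:
#   ps -eo pid,stat,wchan:32,comm | list_d_state
```

A `sync` process, or guest start tasks, lingering in this list while the pool is SUSPENDED would match the behaviour described above.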

Backup target

Proxmox datastore PVEBUVD00 (PBS @ 10.0.10.200:8007) reports unreachable from this node — separate issue; verify PBS host/network.


Diagnosis

| Question | Answer |
|---|---|
| Is this a ZFS misconfiguration? | No. Config is consistent; three drives show correct 5 TB labels. |
| Is the pool lost? | No. Degraded but importable; no known data errors currently. |
| Which disk to replace? | Seagate W4J0L3PY (/dev/sdb, mirror-0 failed leg). |
| Can we fix it in software? | Unlikely. Capacity and SMART failures point to hardware. |
| Safe to reseat first? | Optional trial. Power down or hot-swap per chassis policy; if capacity still reads ~137 GB, replace disk. |

Immediate (IT / on-site)

  1. Identify physical slot for serial W4J0L3PY (compare to inventory/asset tags).
  2. Reseat SATA/SAS cable and backplane connection once (if hot-swap policy allows). Reboot or rescan SCSI bus.
  3. If capacity is still wrong or SMART still fails → replace with new 5 TB+ enterprise/NAS-class HDD (match class of ST5000DM000 or better).
  4. Do not remove the UNAVAIL device from the pool until replacement is in place.
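Steps 1 to 3 hinge on finding the disk by serial and checking whether its capacity has recovered after a reseat. The sketch below does both from `lsblk` output. The helper name `check_disk` and the 1% tolerance are assumptions, and the expected size is taken from the identification table.

```shell
#!/bin/sh
# Expected capacity in bytes, from the identification table in this report.
EXPECTED_BYTES=5000981078016

# check_disk SERIAL (hypothetical helper): read the output of
#   lsblk -b -d -n -o NAME,SERIAL,SIZE
# on stdin and print "<name> ok", "<name> wrong-size <bytes>", or "not-found".
check_disk() {
    awk -v serial="$1" -v want="$EXPECTED_BYTES" '
        $2 == serial {
            found = 1
            diff = $3 - want
            if (diff < 0) diff = -diff
            # Accept sizes within 1% of the expected capacity (arbitrary tolerance).
            if (diff * 100 <= want) print $1, "ok"
            else print $1, "wrong-size", $3
        }
        END { if (!found) print "not-found" }
    '
}

# Usage on the host after reseating (hostN is a placeholder for the SCSI host):
#   echo "- - -" > /sys/class/scsi_host/hostN/scan
#   lsblk -b -d -n -o NAME,SERIAL,SIZE | check_disk W4J0L3PY
```

A result of `wrong-size` after the reseat corresponds to step 3: replace the disk. A result of `not-found` may simply mean the serial could not be read, so fall back to the by-id path or physical inspection.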

After new disk is installed

On PVENAS as root (replace ata-NEW_SERIAL with the new drive's name under /dev/disk/by-id/):

```shell
# Verify new disk shows ~5 TB
lsblk /dev/sdX
smartctl -H /dev/sdX

# Replace failed vdev (use ID from: zpool status NAS.SP00).
# Pass the new disk's whole-disk by-id path: a blank disk has no -part1 yet,
# and ZFS creates the partition itself.
zpool replace NAS.SP00 ata-ST5000DM000-1FK178_W4J0L3PY-part1 /dev/disk/by-id/ata-NEW_SERIAL

# Monitor until resilver completes
zpool status -v NAS.SP00
```
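Rather than re-running `zpool status` by hand, the scan line can be polled. This is a minimal sketch with a hypothetical helper name, `resilver_state`; it relies only on the "resilver in progress" and "resilvered" wording of the `scan:` line.

```shell
#!/bin/sh
# resilver_state (hypothetical helper): read `zpool status` output on stdin
# and print "running", "done", or "none" based on the scan: line.
resilver_state() {
    awk '
        /^[ \t]*scan:/ {
            if ($0 ~ /resilver in progress/) s = "running"
            else if ($0 ~ /resilvered/) s = "done"
        }
        END { print (s == "" ? "none" : s) }
    '
}

# Usage on the host: poll every 5 minutes, then show the final status.
#   while [ "$(zpool status NAS.SP00 | resilver_state)" = "running" ]; do
#       sleep 300
#   done
#   zpool status -v NAS.SP00
```

On OpenZFS 2.0 or later, `zpool wait -t resilver NAS.SP00` should block until the resilver finishes and may be simpler than polling.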

Post-resilver

  • Run zpool scrub NAS.SP00 during a maintenance window.
  • Confirm PVEBUVD00 / PBS connectivity if backups depend on it.
  • Review whether Nextcloud VM 201 (8 TB on degraded pool) should remain running until healthy.

What to avoid

  • Ignoring the degraded state for extended periods.
  • Running heavy I/O on large VMs (e.g. 8 TB Nextcloud) during extended degraded operation.
  • Running zpool clear without addressing the hardware; it does not fix a dead disk.
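Before scrubbing or declaring the incident closed, it can help to gate on the pool actually being healthy. A sketch, assuming the usual `state:` and `errors:` lines of `zpool status`; the helper name `pool_healthy` is hypothetical.

```shell
#!/bin/sh
# pool_healthy (hypothetical helper): read `zpool status` output on stdin.
# Exit 0 only if the pool state is ONLINE and no data errors are reported.
pool_healthy() {
    awk '
        /^[ \t]*state:/ && $2 == "ONLINE"      { online = 1 }
        /^[ \t]*errors: No known data errors/  { clean = 1 }
        END { exit !(online && clean) }
    '
}

# Usage on the host:
#   if zpool status NAS.SP00 | pool_healthy; then
#       zpool scrub NAS.SP00
#   else
#       echo "NAS.SP00 is not healthy yet; do not close the incident" >&2
#   fi
```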

Reference — healthy disks (for spare matching)

| Serial | Device | Capacity | SMART |
|---|---|---|---|
| W4J0L0BA | sda | 5.00 TB | PASSED |
| W4J0K9V7 | sdc | 5.00 TB | PASSED |
| W4J0LKCD | sdd | 5.00 TB | PASSED |

Timeline (brief)

| When | Event |
|---|---|
| Prior to 2026-05-21 | W4J0L3PY accumulated read/write errors; pool faulted |
| 2026-05-21 | Pool SUSPENDED; host sync() wedged; Cal LXC start failed |
| 2026-05-21 ~21:28 | Forced node reboot; pool DEGRADED, resilver finished, 0 errors |
| 2026-05-21 | sdb still reports ~137 GB, UNAVAIL; replacement still required |

Contact / handoff notes

  • Node: Proxmox VE 8.x on PVENAS (10.0.10.10)
  • Pool name in Proxmox: NAS.SP00 (zfspool, active, degraded)
  • Failed serial: W4J0L3PY
  • Replacement type: 5 TB+ HDD, same or better class as Seagate ST5000DM000-1FK178

For questions about homelab service impact (Cal, Caddy, Phase 0 rollout), see levkin-selfhost-plan-2.md.

TL;DR

  • Pool NAS.SP00 on PVENAS (10.0.10.10) had a disk failure (W4J0L3PY)
  • Pool went SUSPENDED; required forced reboot and is now DEGRADED
  • Immediate action: Replace the failed drive with a spare (same or larger size; see healthy serials in the reference table above)
  • Use zpool replace command with correct device paths (see main procedure)
  • Monitor resilver to completion; run zpool scrub after
  • Backup services and large VMs (e.g. Nextcloud 8TB) depend on pool health—keep degraded time short
  • Reach out if unsure about pool status or downstream service risk