ansible/docs/guides/nas-sp00-drive-failure-report.md
ilia de49b34cdc
Add homelab monitoring, portfolio site, and vault tooling.
Document pve10 static IPs, monitoring stack, and site LXCs; add portfolio
to inventory; Mailcow mailbox automation; vault import/export scripts;
security audit guides and UniFi DHCP reference.

Co-authored-by: Cursor <cursoragent@cursor.com>
2026-05-22 16:25:07 -04:00

# NAS.SP00 drive failure — IT report
**Date:** 2026-05-21
**Host:** PVENAS (Proxmox VE) — `10.0.10.10`
**Pool:** ZFS `NAS.SP00` (~9 TB, ~862 GB used)
**Prepared for:** IT / hardware replacement
**SMART audit:** [nas-sp00-smart-audit-2026-05-21.md](nas-sp00-smart-audit-2026-05-21.md)
---
## Executive summary
One disk in a four-drive ZFS pool (two mirrored pairs) has **failed at the hardware level**. The pool is **DEGRADED** but **online** with **no known data errors** at this time. The failed drive must be **physically replaced** and the pool **resilvered**. Until then, **mirror-0 has no redundancy**: a second failure on the remaining disk in that mirror (`W4J0L0BA`) could cause data loss.
This issue also caused a **host-wide I/O wedge** (pool SUSPENDED → stuck `sync()`), which blocked LXC/VM operations unrelated to the pool (e.g. Cal.com on `local-lvm`). That was cleared by a forced node reboot; **replacing the drive remains required**.
---
## Pool layout
| Vdev | Role | Disk A | Disk B | Status |
|------|------|--------|--------|--------|
| mirror-0 | RAID1 pair | `W4J0L0BA` (sda, 5 TB) | `W4J0L3PY` (sdb) | **DEGRADED** — sdb UNAVAIL |
| mirror-1 | RAID1 pair | `W4J0LKCD` (sdd, 5 TB) | `W4J0K9V7` (sdc, 5 TB) | **ONLINE** |
Model family (healthy drives): Seagate **ST5000DM000-1FK178** (5 TB, 5900 RPM).
---
## Failed drive identification
| Field | Expected | Observed |
|-------|----------|----------|
| **Serial** | W4J0L3PY | W4J0L3PY |
| **Model** | ST5000DM000-1FK178 | ST5000DM000 (truncated reporting) |
| **WWN** | — | `5000c50082cc8bbb` |
| **Firmware** | — | CC48 |
| **Capacity** | ~5,000,981,078,016 bytes (**5.00 TB**) | **137,438,952,960 bytes (~137 GB)** |
| **Linux device** | `/dev/sdb` | `/dev/sdb` |
| **ZFS state** | ONLINE | **UNAVAIL** — label missing/invalid |
ZFS last known path:
`/dev/disk/by-id/ata-ST5000DM000-1FK178_W4J0L3PY-part1`
---
## Symptoms and evidence
### 1. Capacity collapse (primary indicator)
The drive is detected as **~137 GB** instead of **5 TB**. ZFS cannot use a partition label created for a 5 TB disk on a device that exposes only a tiny fraction of capacity. This pattern is typical of:
- **Failed HDD** (media/controller failure)
- **Bad SATA cable, backplane port, or HBA port**
- **USB/SATA bridge failure** (if applicable)
- **Severe firmware/HPA corruption** (less common)
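
As a quick arithmetic check on the two sizes in the identification table, the sketch below converts the observed byte count to sectors and compares it with the 28-bit LBA ceiling. It assumes 512-byte logical sectors, which this report does not state. Reading a match as the drive or a bridge falling back to 28-bit addressing is an interpretation, not something the collected evidence confirms.

```bash
# Sizes copied from the "Failed drive identification" table.
expected_bytes=5000981078016
observed_bytes=137438952960
sector_bytes=512   # assumption: 512-byte logical sectors (not stated in this report)

observed_sectors=$(( observed_bytes / sector_bytes ))
lba28_max=$(( (1 << 28) - 1 ))   # highest sector count expressible in 28 bits

echo "observed sectors: ${observed_sectors}"
echo "28-bit LBA max:   ${lba28_max}"
if [ "${observed_sectors}" -eq "${lba28_max}" ]; then
  echo "observed size equals the 28-bit LBA ceiling"
fi

# Fraction of the expected capacity still exposed, in parts per thousand.
permille=$(( observed_bytes * 1000 / expected_bytes ))
echo "exposed capacity: ${permille}/1000 of expected"
```

With the figures above, the observed size works out to exactly 268,435,455 sectors, and the drive exposes under 3% of its rated capacity.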
### 2. SMART / SCSI errors
`smartctl` against `/dev/sdb`:
- **Read SMART Data failed:** scsi error aborted command
- **Overall health:** UNKNOWN (attributes unreadable)
- Multiple log read commands fail (Error Log, Self-test Log, GP Log, etc.)
Healthy sibling in same mirror (`/dev/sda`, W4J0L0BA): **SMART PASSED**, full 5 TB capacity.
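
The per-disk comparison above could be repeated with a small helper like the one below. It is a minimal sketch: `smart_verdict` is a hypothetical name, and it assumes the ATA-style health line that `smartctl -H` prints for SATA drives. Anything without that line, such as the "Read SMART Data failed" output seen on `/dev/sdb`, is reported as UNKNOWN.

```bash
# Classify `smartctl -H` output (read from stdin) as PASSED, FAILED, or UNKNOWN.
# UNKNOWN covers the case where the health line is missing entirely,
# as with the "Read SMART Data failed" output reported for /dev/sdb.
smart_verdict() {
  out=$(cat)
  case "$out" in
    *"test result: PASSED"*) echo PASSED ;;
    *"test result: FAILED"*) echo FAILED ;;
    *) echo UNKNOWN ;;
  esac
}
```

On the host this might be run as `for d in sda sdb sdc sdd; do printf '%s: ' "$d"; smartctl -H "/dev/$d" | smart_verdict; done`.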
### 3. Kernel log (`dmesg` at boot, 2026-05-21 ~21:27)
Repeated on **`sdb`**:
```
Buffer I/O error on dev sdb
Sense Key: Medium Error
Add. Sense: Unrecovered read error
critical medium error, dev sdb, sector N op 0x0:(READ)
```
These errors indicate that the block device cannot reliably read its media, which points to a **hardware or link-layer** fault, not a ZFS configuration issue.
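
To confirm that the medium errors are confined to one device, the kernel log can be tallied per disk. The helper below is a sketch under the assumption that the log lines follow the `critical medium error, dev <name>,` shape quoted above; `count_medium_errors` is a hypothetical name.

```bash
# Count "critical medium error" lines per block device in kernel log text on stdin.
count_medium_errors() {
  awk '
    /critical medium error, dev / {
      for (i = 1; i < NF; i++) {
        if ($i == "dev") {          # the device name follows the literal word "dev"
          name = $(i + 1)
          sub(/,$/, "", name)       # strip the trailing comma: "sdb," -> "sdb"
          count[name]++
          break
        }
      }
    }
    END { for (name in count) print name, count[name] }
  '
}
```

A typical invocation would be `dmesg | count_medium_errors`. Errors showing up on more than one device would hint at a shared cable, backplane, or controller problem rather than a single failed disk.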
### 4. ZFS pool history
- Pool previously entered **SUSPENDED** state (I/O failures on faulted devices).
- After node reboot: pool **DEGRADED**, short **resilver** completed with **0 errors** (healing scan on remaining devices).
- Current: **No known data errors** in `zpool status`.
---
## Impact
### Storage / services on `NAS.SP00`
Proxmox guests with disks on this pool (non-exhaustive):
| VMID | Name | NAS-backed storage |
|------|------|-------------------|
| 101 | Jellyfin | 1 TB zvol |
| 105 | TrueNAS | 1 TB zvol |
| 108 | actual-debian | 10 GB |
| 200 | PVE.BU.SVR | 1 TB |
| 201 | NextcloudAIO-debian | 8 TB |
**Risk:** With mirror-0 degraded, blocks stored only on the surviving mirror-0 disk have **no redundancy** until the failed drive is replaced and resilver completes.
### Unrelated workloads
Guests on **`local-lvm`** (NVMe, e.g. Cal.com LXC 210, Caddy VM 106) are **not stored on NAS.SP00** but were affected when the pool suspended and blocked system-wide `sync()`.
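
One way to spot this kind of wedge is to list processes stuck in uninterruptible sleep (state `D`), which is where a blocked `sync()` typically sits. This report does not say how the wedge was diagnosed, so treat the helper below as an illustrative sketch; `list_blocked` is a hypothetical name.

```bash
# Print "PID COMMAND" for processes in uninterruptible sleep (state D),
# given `ps -eo state=,pid=,comm=` output on stdin.
list_blocked() {
  awk '$1 ~ /^D/ { print $2, $3 }'
}
```

On the node this could be run as `ps -eo state=,pid=,comm= | list_blocked`. A long-lived `sync` or `zfs`-related entry in the output would be consistent with the suspended-pool behavior described above.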
### Backup target
Proxmox datastore **PVEBUVD00** (PBS @ `10.0.10.200:8007`) reports **unreachable** from this node — separate issue; verify PBS host/network.
---
## Diagnosis
| Question | Answer |
|----------|--------|
| Is this a ZFS misconfiguration? | **No** — config is consistent; three drives show correct 5 TB labels. |
| Is the pool lost? | **No** — degraded but importable; no known data errors currently. |
| Which disk to replace? | **Seagate W4J0L3PY** (`/dev/sdb`, mirror-0 failed leg). |
| Can we fix it in software? | **Unlikely** — capacity and SMART failures point to hardware. |
| Safe to reseat first? | **Optional trial** — power down or hot-swap per chassis policy; if capacity still reads ~137 GB, **replace disk**. |
---
## Recommended actions
### Immediate (IT / on-site)
1. **Identify physical slot** for serial **W4J0L3PY** (compare to inventory/asset tags).
2. **Reseat** SATA/SAS cable and backplane connection once (if hot-swap policy allows). Reboot or rescan SCSI bus.
3. If capacity is still wrong or SMART still fails → **replace with new 5 TB+ enterprise/NAS-class HDD** (match class of ST5000DM000 or better).
4. Do **not** remove the UNAVAIL device from the pool until replacement is in place.
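
Steps 1 to 3 can be checked from the shell after a reseat. The sketch below looks up a disk by serial in `lsblk` output and flags it if the reported size is below a threshold. `check_disk` is a hypothetical helper, and the 5,000,000,000,000-byte threshold in the usage line is an illustrative choice, not a value from this report.

```bash
# Read `lsblk -d -n -b -o NAME,SIZE,SERIAL` output on stdin and report the
# device carrying the given serial, flagging it if it is below min_bytes.
check_disk() {
  serial=$1
  min_bytes=$2
  awk -v serial="$serial" -v min="$min_bytes" '
    $3 == serial {
      verdict = ($2 + 0 >= min + 0) ? "OK" : "TOO_SMALL"
      print $1, $2, verdict
      found = 1
    }
    END { if (!found) print "NOT_FOUND" }
  '
}
```

For example, `lsblk -d -n -b -o NAME,SIZE,SERIAL | check_disk W4J0L3PY 5000000000000` would print `TOO_SMALL` for as long as the drive keeps reporting ~137 GB.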
### After new disk is installed
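On **PVENAS** as root (replace the placeholders with the **new** drive's device name and its whole-disk `/dev/disk/by-id/...` path):
```bash
# Verify the new disk reports ~5 TB and passes its SMART health check
lsblk -o NAME,SIZE,MODEL,SERIAL /dev/sdX
smartctl -H /dev/sdX
# Replace the failed vdev. Take the old device name (or GUID) from: zpool status NAS.SP00
# Pass the new disk's whole-disk by-id path; a fresh disk has no -part1 yet and ZFS partitions it itself.
zpool replace NAS.SP00 ata-ST5000DM000-1FK178_W4J0L3PY-part1 /dev/disk/by-id/ata-NEW_MODEL_NEW_SERIAL
# Monitor until resilver completes
zpool status -v NAS.SP00
```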
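
Rather than re-running `zpool status` by hand, the resilver can be polled until it finishes. This is a minimal sketch: the function names are hypothetical, the 60-second interval is arbitrary, and it relies on `zpool status` printing `resilver in progress` on its `scan:` line while the resilver runs.

```bash
# Succeeds (exit 0) while `zpool status` text on stdin reports an active resilver.
resilver_running() {
  grep -q 'resilver in progress'
}

# Poll the pool until the resilver is no longer reported as in progress,
# then print the final status for review.
wait_for_resilver() {
  pool=$1
  while zpool status "$pool" | resilver_running; do
    sleep 60   # arbitrary polling interval
  done
  zpool status -v "$pool"
}
```

Usage on the node would be `wait_for_resilver NAS.SP00`. The final status output still needs a manual look to confirm the pool is ONLINE and reports 0 errors.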
### Post-resilver
- Run **`zpool scrub NAS.SP00`** during a maintenance window.
- Confirm **PVEBUVD00** / PBS connectivity if backups depend on it.
- Review whether **Nextcloud VM 201** (8 TB on degraded pool) should remain running until healthy.
### Not recommended
- Ignoring degraded state for extended periods.
- Running heavy I/O on large VMs (e.g. 8 TB Nextcloud) during extended degraded operation.
- `zpool clear` without addressing hardware — does not fix a dead disk.
---
## Reference — healthy disks (for spare matching)
| Serial | Device | Capacity | SMART |
|--------|--------|----------|-------|
| W4J0L0BA | sda | 5.00 TB | PASSED |
| W4J0K9V7 | sdc | 5.00 TB | PASSED |
| W4J0LKCD | sdd | 5.00 TB | PASSED |
---
## Timeline (brief)
| When | Event |
|------|--------|
| Prior to 2026-05-21 | `W4J0L3PY` accumulated read/write errors; pool faulted |
| 2026-05-21 | Pool **SUSPENDED**; host `sync()` wedged; Cal.com LXC 210 failed to start |
| 2026-05-21 ~21:28 | Forced node reboot; pool **DEGRADED**, resilver finished, 0 errors |
| 2026-05-21 | `sdb` still reports **~137 GB**, UNAVAIL — **replacement still required** |
---
## Contact / handoff notes
- **Node:** Proxmox VE 8.x on **PVENAS** (`10.0.10.10`)
- **Pool name in Proxmox:** `NAS.SP00` (zfspool, active, degraded)
- **Failed serial:** **W4J0L3PY**
- **Replacement type:** 5 TB+ HDD, same or better class as Seagate ST5000DM000-1FK178
For questions about homelab service impact (Cal, Caddy, Phase 0 rollout), see [`levkin-selfhost-plan-2.md`](levkin-selfhost-plan-2.md).
## TL;DR
- Pool `NAS.SP00` on `PVENAS` (`10.0.10.10`) has a failed disk: serial `W4J0L3PY` (`/dev/sdb`, mirror-0).
- The pool went **SUSPENDED**, was recovered by a forced reboot, and is now **DEGRADED**.
- **Immediate action:** replace the failed drive with a spare of the same or larger size (the healthy drives are listed in the reference table above for spare matching).
- Run `zpool replace` with the correct device paths (see "After new disk is installed").
- Monitor the resilver to completion, then run `zpool scrub NAS.SP00`.
- Backup services and large VMs (e.g. the 8 TB Nextcloud VM 201) depend on pool health, so keep the degraded period short.
- Reach out if unsure about pool status or downstream service risk.