
# NAS.SP00 drive failure — IT report

**Date:** 2026-05-21
**Host:** PVENAS (Proxmox VE) — `10.0.10.10`
**Pool:** ZFS `NAS.SP00` (~9 TB, ~862 GB used)
**Prepared for:** IT / hardware replacement
**SMART audit:** [nas-sp00-smart-audit-2026-05-21.md](nas-sp00-smart-audit-2026-05-21.md)

---

## Executive summary

One disk in the four-drive ZFS pool (two mirrored pairs) has **failed at the hardware level**. The pool is **DEGRADED** but **online**, with **no known data errors** at this time. The failed drive must be **physically replaced** and the pool **resilvered**. Until then, **mirror-0 has no redundancy**: a failure of the remaining disk in that mirror (`W4J0L0BA`) could cause data loss across the whole pool, since ZFS stripes data over both mirrors.

This issue also caused a **host-wide I/O wedge** (pool SUSPENDED → stuck `sync()`), which blocked LXC/VM operations unrelated to the pool (e.g. Cal.com on `local-lvm`). That was cleared by a forced node reboot; **replacing the drive remains required**.

---

## Pool layout

| Vdev | Role | Disk A | Disk B | Status |
|------|------|--------|--------|--------|
| mirror-0 | RAID1 pair | `W4J0L0BA` (sda, 5 TB) | `W4J0L3PY` (sdb) | **DEGRADED** — sdb UNAVAIL |
| mirror-1 | RAID1 pair | `W4J0LKCD` (sdd, 5 TB) | `W4J0K9V7` (sdc, 5 TB) | **ONLINE** |

Model family (healthy drives): Seagate **ST5000DM000-1FK178** (5 TB, 5900 RPM class).

---

## Failed drive identification

| Field | Expected | Observed |
|-------|----------|----------|
| **Serial** | W4J0L3PY | W4J0L3PY |
| **Model** | ST5000DM000-1FK178 | ST5000DM000 (truncated reporting) |
| **WWN** | — | `5000c50082cc8bbb` |
| **Firmware** | — | CC48 |
| **Capacity** | ~5,000,981,078,016 bytes (**5.00 TB**) | **137,438,952,960 bytes (~137 GB)** |
| **Linux device** | `/dev/sdb` | `/dev/sdb` |
| **ZFS state** | ONLINE | **UNAVAIL** — label missing/invalid |

ZFS last known path:
`/dev/disk/by-id/ata-ST5000DM000-1FK178_W4J0L3PY-part1`
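
To confirm which `/dev/disk/by-id` entry belongs to a given serial (for the failed drive now, and for the replacement later), a small helper along these lines may be useful. This is a sketch only: the function name `find_by_serial` is hypothetical, and it assumes the serial appears in the link name, as it does for `ata-*` links. Because the failed drive currently reports a truncated model string, its link may be named differently from the path ZFS recorded, or be missing entirely.

```bash
# List whole-disk by-id links whose name contains a given serial.
# $1 = serial, $2 = directory to search (defaults to /dev/disk/by-id).
find_by_serial() {
  local serial=$1
  local dir=${2:-/dev/disk/by-id}
  local link
  for link in "$dir"/*"$serial"*; do
    # An unmatched glob stays literal, so skip anything that does not exist.
    [ -e "$link" ] || [ -L "$link" ] || continue
    case "$link" in
      *-part[0-9]*) ;;              # ignore partition links
      *) basename "$link" ;;        # print the whole-disk link name
    esac
  done
}

# Example on PVENAS:
#   find_by_serial W4J0L3PY
```

An empty result for `W4J0L3PY` would be consistent with the drive no longer identifying itself correctly.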

---

## Symptoms and evidence

### 1. Capacity collapse (primary indicator)

The drive is detected as **~137 GB** instead of **5 TB**, about 2.7% of its rated capacity. ZFS cannot use a partition label created for a 5 TB disk on a device that exposes so little of it. This pattern is typical of:

- **Failed HDD** (media/controller failure)
- **Bad SATA cable, backplane port, or HBA port**
- **USB/SATA bridge failure** (if applicable)
- **Severe firmware/HPA corruption** (less common)
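
As a cross-check on the reported size, the arithmetic below divides the observed capacity by an assumed 512-byte logical sector size. The result equals 2^28 - 1, the highest sector count addressable with 28-bit LBA, which may suggest the drive or link has fallen back to a legacy addressing limit rather than reporting an arbitrary value. This is an observation from the numbers only, not a confirmed root cause.

```bash
# Reported capacity of the failed drive, from the identification table.
reported_bytes=137438952960
sector_size=512                      # assumption: 512-byte logical sectors

sectors=$(( reported_bytes / sector_size ))
lba28_max=$(( (1 << 28) - 1 ))       # largest 28-bit sector count

echo "sectors reported: ${sectors}"
echo "28-bit LBA max:   ${lba28_max}"
if [ "${sectors}" -eq "${lba28_max}" ]; then
  echo "capacity matches the 28-bit LBA ceiling"
fi
```

Both values come out to 268435455, so the final message is printed.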

### 2. SMART / SCSI errors

`smartctl` against `/dev/sdb`:

- **Read SMART Data failed:** scsi error aborted command
- **Overall health:** UNKNOWN (attributes unreadable)
- Multiple log read commands fail (Error Log, Self-test Log, GP Log, etc.)

Healthy sibling in same mirror (`/dev/sda`, W4J0L0BA): **SMART PASSED**, full 5 TB capacity.
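
To repeat this check across all four disks after a reseat or replacement, the health line of `smartctl -H` can be reduced to a single verdict. The sketch below assumes ATA-style output containing `test result: PASSED` or `test result: FAILED`; anything else, including the aborted-command error seen on `sdb`, is reported as `UNKNOWN`. The function name `smart_verdict` is hypothetical.

```bash
# Reads `smartctl -H` output on stdin and prints PASSED, FAILED, or UNKNOWN.
smart_verdict() {
  local out
  out=$(cat)
  case "$out" in
    *"test result: PASSED"*) echo PASSED ;;
    *"test result: FAILED"*) echo FAILED ;;
    *)                       echo UNKNOWN ;;   # unreadable or unexpected output
  esac
}

# Example on PVENAS (as root):
#   for d in sda sdb sdc sdd; do
#     printf '%s: ' "$d"; smartctl -H "/dev/$d" | smart_verdict
#   done
```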

### 3. Kernel log (`dmesg` at boot, 2026-05-21 ~21:27)

Repeated on **`sdb`**:

```
Buffer I/O error on dev sdb
Sense Key: Medium Error
Add. Sense: Unrecovered read error
critical medium error, dev sdb, sector N op 0x0:(READ)
```

These errors indicate that the block device cannot reliably read the media, which points to the **hardware or link layer**, not a ZFS configuration issue.
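
To compare error volume before and after a reseat, the kernel messages can be counted per device. The helper below is a minimal sketch that assumes the message format shown in the excerpt above (`critical medium error, dev sdb, ...`); the name `count_medium_errors` is hypothetical.

```bash
# Counts "critical medium error" lines for one device in kernel log text.
# $1 = device name (e.g. sdb); log text is read from stdin.
count_medium_errors() {
  # grep -c prints 0 but exits non-zero when nothing matches; keep exit status 0.
  grep -c "critical medium error, dev $1," || true
}

# Example on PVENAS:
#   dmesg | count_medium_errors sdb
#   journalctl -k -b | count_medium_errors sdb
```

A count that stays at zero after a reseat and reboot would support a cable or port fault; a count that keeps growing points back at the disk.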

### 4. ZFS pool history

- Pool previously entered **SUSPENDED** state (I/O failures on faulted devices).
- After node reboot: pool **DEGRADED**, short **resilver** completed with **0 errors** (healing scan on remaining devices).
- Current: **No known data errors** in `zpool status`.

---

## Impact

### Storage / services on `NAS.SP00`

Proxmox guests with disks on this pool (non-exhaustive):

| VMID | Name | NAS-backed storage |
|------|------|-------------------|
| 101 | Jellyfin | 1 TB zvol |
| 105 | TrueNAS | 1 TB zvol |
| 108 | actual-debian | 10 GB |
| 200 | PVE.BU.SVR | 1 TB |
| 201 | NextcloudAIO-debian | 8 TB |

**Risk:** With mirror-0 degraded, blocks stored only on the surviving mirror-0 disk have **no redundancy** until the failed drive is replaced and resilver completes.
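
Because the guest table above is marked non-exhaustive, the full list can be derived from the guest configuration files. The sketch below assumes the standard Proxmox layout (`/etc/pve/qemu-server/*.conf` and `/etc/pve/lxc/*.conf`) and volume references of the form `NAS.SP00:<volume>`; the function name `guests_on_storage` is hypothetical. It matches any line containing the storage ID, so snapshot sections and unused disks are included.

```bash
# Prints the VMIDs of guests whose config references a given storage ID.
# $1 = storage ID, $2 = Proxmox config root (defaults to /etc/pve).
guests_on_storage() {
  local storage=$1
  local root=${2:-/etc/pve}
  # -F: match the storage ID literally (it contains a dot); -l: file names only.
  grep -lF -- "${storage}:" "$root"/qemu-server/*.conf "$root"/lxc/*.conf 2>/dev/null \
    | sed 's#.*/##; s#\.conf$##' \
    | sort -n
}

# Example on PVENAS:
#   guests_on_storage NAS.SP00
```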

### Unrelated workloads

Guests on **`local-lvm`** (NVMe, e.g. Cal.com LXC 210, Caddy VM 106) are **not stored on NAS.SP00** but were affected when the pool suspended and blocked system-wide `sync()`.

### Backup target

Proxmox datastore **PVEBUVD00** (PBS @ `10.0.10.200:8007`) reports **unreachable** from this node — separate issue; verify PBS host/network.

---

## Diagnosis

| Question | Answer |
|----------|--------|
| Is this a ZFS misconfiguration? | **No** — config is consistent; three drives show correct 5 TB labels. |
| Is the pool lost? | **No** — degraded but importable; no known data errors currently. |
| Which disk to replace? | **Seagate W4J0L3PY** (`/dev/sdb`, mirror-0 failed leg). |
| Can we fix it in software? | **Unlikely** — capacity and SMART failures point to hardware. |
| Safe to reseat first? | **Optional trial** — power down or hot-swap per chassis policy; if capacity still reads ~137 GB, **replace disk**. |

---

## Recommended actions

### Immediate (IT / on-site)

1. **Identify physical slot** for serial **W4J0L3PY** (compare to inventory/asset tags).
2. **Reseat** SATA/SAS cable and backplane connection once (if hot-swap policy allows). Reboot or rescan SCSI bus.
3. If capacity is still wrong or SMART still fails → **replace with new 5 TB+ enterprise/NAS-class HDD** (match class of ST5000DM000 or better).
4. Do **not** remove the UNAVAIL device from the pool until replacement is in place.
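
The capacity check in steps 2 and 3 can be scripted so the reseat result is unambiguous. The sketch below is illustrative: the function name `capacity_ok` and the 99% threshold are assumptions, and the expected size is the 5 TB figure from the identification table.

```bash
# Succeeds when the observed size is at least 99% of the expected size.
# $1 = observed size in bytes, $2 = expected size in bytes.
capacity_ok() {
  [ "$1" -ge $(( $2 / 100 * 99 )) ]
}

expected_bytes=5000981078016         # healthy 5 TB drive, per this report

# Example on PVENAS (as root), after reseating:
#   echo "- - -" > /sys/class/scsi_host/hostN/scan    # rescan one SCSI host
#   if capacity_ok "$(blockdev --getsize64 /dev/sdb)" "$expected_bytes"; then
#     echo "sdb reports full capacity"
#   else
#     echo "sdb capacity still wrong: replace the disk"
#   fi
```

With the values in this report, the observed 137,438,952,960 bytes fails the check and a healthy 5 TB disk passes.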

### After new disk is installed

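On **PVENAS** as root (replace `sdX` and the `/dev/disk/by-id/...` path with the **new** drive’s values):

```bash
# Verify the new disk reports ~5 TB and passes SMART
lsblk -o NAME,SIZE,MODEL,SERIAL /dev/sdX
smartctl -H /dev/sdX

# Confirm how the failed vdev is named (by-id name or numeric GUID)
zpool status NAS.SP00

# Replace the failed vdev. Pass the new WHOLE disk, not -part1: a blank disk
# has no partitions yet, and ZFS creates them itself.
# ata-NEW_SERIAL is a placeholder for the new drive's by-id name.
zpool replace NAS.SP00 ata-ST5000DM000-1FK178_W4J0L3PY-part1 /dev/disk/by-id/ata-NEW_SERIAL

# Monitor until resilver completes
zpool status -v NAS.SP00
```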
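
For unattended monitoring, the `scan:` line of `zpool status` can be reduced to a simple state. The sketch below assumes the usual OpenZFS wording (`resilver in progress` while running, `resilvered ... with N errors` when finished); the function name `resilver_state` is hypothetical.

```bash
# Reads `zpool status` output on stdin; prints running, done, or none.
resilver_state() {
  local out
  out=$(cat)
  case "$out" in
    *"resilver in progress"*) echo running ;;
    *"resilvered "*)          echo done ;;
    *)                        echo none ;;
  esac
}

# Example on PVENAS: poll once a minute until the resilver is no longer running.
#   while [ "$(zpool status NAS.SP00 | resilver_state)" = running ]; do
#     sleep 60
#   done
#   zpool status -v NAS.SP00
```

On OpenZFS 2.0 or later, `zpool wait -t resilver NAS.SP00` should block until the resilver finishes and may be simpler than polling.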

### Post-resilver

- Run **`zpool scrub NAS.SP00`** during a maintenance window.
- Confirm **PVEBUVD00** / PBS connectivity if backups depend on it.
- Review whether **Nextcloud VM 201** (8 TB on degraded pool) should remain running until healthy.

### Not recommended

- Ignoring degraded state for extended periods.
- Running heavy I/O on large VMs (e.g. 8 TB Nextcloud) during extended degraded operation.
- `zpool clear` without addressing hardware — does not fix a dead disk.

---

## Reference — healthy disks (for spare matching)

| Serial | Device | Capacity | SMART |
|--------|--------|----------|-------|
| W4J0L0BA | sda | 5.00 TB | PASSED |
| W4J0K9V7 | sdc | 5.00 TB | PASSED |
| W4J0LKCD | sdd | 5.00 TB | PASSED |

---

## Timeline (brief)

| When | Event |
|------|--------|
| Prior to 2026-05-21 | `W4J0L3PY` accumulated read/write errors; pool faulted |
| 2026-05-21 | Pool **SUSPENDED**; host `sync()` wedged; Cal.com LXC 210 start failed |
| 2026-05-21 ~21:28 | Forced node reboot; pool **DEGRADED**, resilver finished, 0 errors |
| 2026-05-21 | `sdb` still reports **~137 GB**, UNAVAIL — **replacement still required** |

---

## Contact / handoff notes

- **Node:** Proxmox VE 8.x on **PVENAS** (`10.0.10.10`)
- **Pool name in Proxmox:** `NAS.SP00` (zfspool, active, degraded)
- **Failed serial:** **W4J0L3PY**
- **Replacement type:** 5 TB+ HDD, same or better class as Seagate ST5000DM000-1FK178

For questions about homelab service impact (Cal, Caddy, Phase 0 rollout), see [`levkin-selfhost-plan-2.md`](levkin-selfhost-plan-2.md).

---

## TL;DR

- Pool `NAS.SP00` on `PVENAS` (`10.0.10.10`) lost one disk (`W4J0L3PY`, `/dev/sdb`, mirror-0).
- The pool went **SUSPENDED**, needed a forced node reboot, and is now **DEGRADED**.
- **Immediate action:** replace the failed drive with a disk of the same or larger size (see the healthy-disk reference table above for matching).
- Use `zpool replace` with the correct device paths (see "After new disk is installed").
- Monitor the resilver to completion, then run `zpool scrub`.
- Backup services and large VMs (e.g. Nextcloud, 8 TB) depend on pool health, so keep the degraded period short.
- Reach out if unsure about pool status or downstream service risk.