# NAS.SP00 drive failure: IT report

**Date:** 2026-05-21
**Host:** PVENAS (Proxmox VE), `10.0.10.10`
**Pool:** ZFS `NAS.SP00` (~9 TB, ~862 GB used)
**Prepared for:** IT / hardware replacement
**SMART audit:** [nas-sp00-smart-audit-2026-05-21.md](nas-sp00-smart-audit-2026-05-21.md)

---

## Executive summary

One disk in the four-drive ZFS pool (two mirrored pairs) has **failed at the hardware level**. The pool is **DEGRADED** but **online** with **no known data errors** at this time. The failed drive must be **physically replaced** and the pool **resilvered**. Until then, **mirror-0 has no redundancy**: a second failure on the remaining disk in that mirror (`W4J0L0BA`) could cause data loss.

The failure also caused a **host-wide I/O wedge** (pool SUSPENDED → stuck `sync()`), which blocked LXC/VM operations unrelated to the pool (e.g. Cal.com on `local-lvm`). A forced node reboot cleared the wedge; **replacing the drive remains required**.

---

## Pool layout

| Vdev | Role | Disk A | Disk B | Status |
|------|------|--------|--------|--------|
| mirror-0 | RAID1 pair | `W4J0L0BA` (sda, 5 TB) | `W4J0L3PY` (sdb) | **DEGRADED** (sdb UNAVAIL) |
| mirror-1 | RAID1 pair | `W4J0LKCD` (sdd, 5 TB) | `W4J0K9V7` (sdc, 5 TB) | **ONLINE** |

Model family (healthy drives): Seagate **ST5000DM000-1FK178** (5 TB, 5900 RPM).

---

## Failed drive identification

| Field | Expected | Observed |
|-------|----------|----------|
| **Serial** | W4J0L3PY | W4J0L3PY |
| **Model** | ST5000DM000-1FK178 | ST5000DM000 (truncated reporting) |
| **WWN** | n/a | `5000c50082cc8bbb` |
| **Firmware** | n/a | CC48 |
| **Capacity** | ~5,000,981,078,016 bytes (**5.00 TB**) | **137,438,952,960 bytes (~137 GB)** |
| **Linux device** | `/dev/sdb` | `/dev/sdb` |
| **ZFS state** | ONLINE | **UNAVAIL** (label missing/invalid) |

ZFS last known path:
`/dev/disk/by-id/ata-ST5000DM000-1FK178_W4J0L3PY-part1`

---

## Symptoms and evidence

### 1. Capacity collapse (primary indicator)

The drive is detected as **~137 GB** instead of **5 TB**. ZFS cannot use a partition label created for a 5 TB disk on a device that exposes only about 3% of that capacity. This pattern is typical of:

- **Failed HDD** (media/controller failure)
- **Bad SATA cable, backplane port, or HBA port**
- **USB/SATA bridge failure** (if applicable)
- **Severe firmware/HPA corruption** (less common)

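
The observed size is not an arbitrary number, which may help narrow the diagnosis. The arithmetic sketch below assumes 512-byte logical sectors (the report does not state the sector size) and shows that 137,438,952,960 bytes is exactly 2^28 - 1 sectors (`0x0FFFFFFF`), the value at which the 28-bit LBA capacity field saturates. That is consistent with the drive, link, or bridge falling back to a legacy capacity report, although it does not by itself distinguish between the causes listed above.

```bash
#!/usr/bin/env bash
# Arithmetic check on the capacities from the identification table.
# ASSUMPTION: 512-byte logical sectors; confirm with `blockdev --getss /dev/sdb`.
SECTOR_BYTES=512
observed_bytes=137438952960     # what /dev/sdb reports now
expected_bytes=5000981078016    # what a healthy 5 TB sibling reports

# Largest sector count a 28-bit LBA capacity field can hold.
lba28_max_sectors=$(( (1 << 28) - 1 ))
lba28_max_bytes=$(( lba28_max_sectors * SECTOR_BYTES ))
expected_sectors=$(( expected_bytes / SECTOR_BYTES ))

echo "28-bit LBA ceiling: ${lba28_max_bytes} bytes"
echo "observed on sdb:    ${observed_bytes} bytes"
echo "expected sectors:   ${expected_sectors}"
```

If the first two numbers match, the disk is reporting the 28-bit ceiling rather than a real, smaller capacity.
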
### 2. SMART / SCSI errors

`smartctl` against `/dev/sdb`:

- **Read SMART Data failed:** scsi error aborted command
- **Overall health:** UNKNOWN (attributes unreadable)
- Multiple log read commands fail (Error Log, Self-test Log, GP Log, etc.)

Healthy sibling in same mirror (`/dev/sda`, W4J0L0BA): **SMART PASSED**, full 5 TB capacity.

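
To repeat the size and SMART comparison across all four disks after a reseat or replacement, something like the following could be used. It is a sketch, not the procedure used for the audit: the function names `size_ok` and `audit_disks` and the 1% tolerance are illustrative choices, and the expected byte count is taken from the identification table.

```bash
#!/usr/bin/env bash
# Expected capacity of a healthy 5 TB member, from the identification table.
EXPECTED_BYTES=5000981078016

# size_ok BYTES: succeeds when BYTES is within 1% of EXPECTED_BYTES.
# (The 1% tolerance is an arbitrary illustrative threshold.)
size_ok() {
  local bytes="$1" delta
  delta=$(( bytes > EXPECTED_BYTES ? bytes - EXPECTED_BYTES : EXPECTED_BYTES - bytes ))
  [ $(( delta * 100 )) -le "$EXPECTED_BYTES" ]
}

# audit_disks DEV...: print a size verdict and the SMART health line per disk.
# Run as root on the host; defined here but not called.
audit_disks() {
  local dev bytes
  for dev in "$@"; do
    bytes=$(lsblk -b -d -n -o SIZE "/dev/$dev")
    if size_ok "$bytes"; then
      echo "$dev: size OK ($bytes bytes)"
    else
      echo "$dev: SIZE MISMATCH ($bytes bytes)"
    fi
    smartctl -H "/dev/$dev" | grep -i "overall-health" \
      || echo "$dev: SMART health unreadable"
  done
}
```

On the host this would be run as `audit_disks sda sdb sdc sdd`.
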
### 3. Kernel log (`dmesg` at boot, 2026-05-21 ~21:27)

Repeated on **`sdb`**:

```
Buffer I/O error on dev sdb
Sense Key: Medium Error
Add. Sense: Unrecovered read error
critical medium error, dev sdb, sector N op 0x0:(READ)
```

These errors indicate that the block device cannot reliably read the media, which points to the **hardware or link layer**, not a ZFS configuration issue.

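
A quick way to check whether these errors recur after a reseat is to count them per device. The helper below is a hypothetical sketch (`count_medium_errors` is not an existing tool); it matches only the "critical medium error" line format quoted above.

```bash
#!/usr/bin/env bash
# count_medium_errors DEV: count "critical medium error" lines for DEV
# in kernel log text supplied on stdin. Prints 0 when there are none.
count_medium_errors() {
  grep -c "critical medium error, dev $1," || true
}
```

Typical use on the host would be `dmesg | count_medium_errors sdb`. A count that keeps growing after the reseat suggests the disk itself rather than the cabling.
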
### 4. ZFS pool history

- Pool previously entered **SUSPENDED** state (I/O failures on faulted devices).
- After node reboot: pool **DEGRADED**, short **resilver** completed with **0 errors** (healing scan on remaining devices).
- Current: **No known data errors** in `zpool status`.

---

## Impact

### Storage / services on `NAS.SP00`

Proxmox guests with disks on this pool (non-exhaustive):

| VMID | Name | NAS-backed storage |
|------|------|-------------------|
| 101 | Jellyfin | 1 TB zvol |
| 105 | TrueNAS | 1 TB zvol |
| 108 | actual-debian | 10 GB |
| 200 | PVE.BU.SVR | 1 TB |
| 201 | NextcloudAIO-debian | 8 TB |

**Risk:** With mirror-0 degraded, blocks stored only on the surviving mirror-0 disk have **no redundancy** until the failed drive is replaced and the resilver completes.

### Unrelated workloads

Guests on **`local-lvm`** (NVMe, e.g. Cal.com LXC 210, Caddy VM 106) are **not stored on NAS.SP00** but were affected when the pool suspended and blocked system-wide `sync()`.

### Backup target

Proxmox datastore **PVEBUVD00** (PBS @ `10.0.10.200:8007`) reports **unreachable** from this node. This is a separate issue; verify the PBS host and network.

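
As a first step in that verification, a plain TCP check from the node separates "host or network down" from "PBS service or authentication problem". This is a minimal sketch: `port_open` and `check_pbs` are hypothetical helper names, and it relies on bash's `/dev/tcp` redirection and the coreutils `timeout` command being available.

```bash
#!/usr/bin/env bash
# port_open HOST PORT [TIMEOUT_S]: succeeds if a TCP connection can be opened.
# Uses bash's built-in /dev/tcp pseudo-device, so no extra tools are needed.
port_open() {
  local host="$1" port="$2" wait_s="${3:-3}"
  timeout "$wait_s" bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# check_pbs: report whether the PBS address from this report accepts connections.
# Defined here but not called.
check_pbs() {
  if port_open 10.0.10.200 8007; then
    echo "PBS port 8007 reachable"
  else
    echo "PBS port 8007 NOT reachable: check PBS host, firewall, and routing"
  fi
}
```

If the port is reachable but the datastore still shows as unreachable, the problem is more likely on the PBS side (service, credentials, or fingerprint) than in the network path.
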
---
## Diagnosis

| Question | Answer |
|----------|--------|
| Is this a ZFS misconfiguration? | **No**: config is consistent; three drives show correct 5 TB labels. |
| Is the pool lost? | **No**: degraded but importable; no known data errors currently. |
| Which disk to replace? | **Seagate W4J0L3PY** (`/dev/sdb`, mirror-0 failed leg). |
| Can we fix it in software? | **Unlikely**: capacity and SMART failures point to hardware. |
| Safe to reseat first? | **Optional trial**: power down or hot-swap per chassis policy; if capacity still reads ~137 GB, **replace disk**. |

---

## Recommended actions

### Immediate (IT / on-site)

1. **Identify the physical slot** for serial **W4J0L3PY** (compare to inventory/asset tags).
2. **Reseat** the SATA/SAS cable and backplane connection once (if hot-swap policy allows), then reboot or rescan the SCSI bus.
3. If the capacity is still wrong or SMART still fails, **replace the disk with a new 5 TB+ enterprise/NAS-class HDD** (match the class of the ST5000DM000 or better).
4. Do **not** remove the UNAVAIL device from the pool until the replacement is in place.

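
Steps 1 and 2 can be supported from the shell. The sketch below is illustrative only: the helper names are hypothetical, and it assumes the drive still reports its serial to `lsblk`, which the identification table suggests but a further-degraded drive might not do.

```bash
#!/usr/bin/env bash
# dev_for_serial SERIAL: read "NAME SERIAL" pairs on stdin and print the
# matching device path. Exits non-zero when the serial is not found.
dev_for_serial() {
  awk -v want="$1" '$2 == want { print "/dev/" $1; found = 1 } END { exit !found }'
}

# locate_disk SERIAL: map a serial to its device node and by-id links.
# Run on the host; defined here but not called.
locate_disk() {
  lsblk -d -n -o NAME,SERIAL | dev_for_serial "$1"
  ls -l /dev/disk/by-id/ | grep -- "$1" || true
}

# rescan_scsi_hosts: ask every SCSI host adapter to rescan its bus (needs root).
rescan_scsi_hosts() {
  local host
  for host in /sys/class/scsi_host/host*; do
    echo "- - -" > "$host/scan"
  done
}
```

For example, `locate_disk W4J0L3PY` before pulling the drive, then `rescan_scsi_hosts` after reseating if a reboot is not practical.
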
### After new disk is installed

On **PVENAS** as root (adjust `/dev/disk/by-id/...` to the **new** drive's whole-disk ID; a blank disk has no `-part1` link yet, and ZFS partitions the disk itself):

```bash
# Verify new disk shows ~5 TB and passes the SMART health check
lsblk /dev/sdX
smartctl -H /dev/sdX

# Replace failed vdev. Use the old device name exactly as shown by
# `zpool status NAS.SP00` (it may appear as a numeric GUID while UNAVAIL).
zpool replace NAS.SP00 ata-ST5000DM000-1FK178_W4J0L3PY-part1 /dev/disk/by-id/ata-NEW_SERIAL

# Monitor until resilver completes
zpool status -v NAS.SP00
```

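
Because the resilver can take hours, polling by hand is easy to forget. The following is a minimal sketch of an unattended wait, assuming the usual `zpool status` wording ("resilver in progress" on the scan line and a `state:` line); the function names are hypothetical.

```bash
#!/usr/bin/env bash
POOL="NAS.SP00"

# resilver_running: succeeds while `zpool status` text on stdin reports
# an active resilver.
resilver_running() {
  grep -q "resilver in progress"
}

# pool_state: print the value of the "state:" line from `zpool status`
# text on stdin (e.g. ONLINE or DEGRADED).
pool_state() {
  awk '$1 == "state:" { print $2; exit }'
}

# wait_for_resilver [INTERVAL_S]: poll until the resilver finishes, showing
# progress lines, then print the final pool state. Defined but not called.
wait_for_resilver() {
  local interval="${1:-60}"
  while zpool status "$POOL" | resilver_running; do
    zpool status "$POOL" | grep -E "resilvered|to go" || true
    sleep "$interval"
  done
  zpool status "$POOL" | pool_state
}
```

Running `wait_for_resilver 300` in a `tmux` or `screen` session should end by printing `ONLINE` if the replacement succeeded. On recent OpenZFS releases, `zpool wait -t resilver NAS.SP00` may be a simpler alternative.
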
### Post-resilver

- Run **`zpool scrub NAS.SP00`** during a maintenance window.
- Confirm **PVEBUVD00** / PBS connectivity if backups depend on it.
- Review whether **Nextcloud VM 201** (8 TB on degraded pool) should remain running until the pool is healthy.

### Not recommended

- Ignoring the degraded state for extended periods.
- Running heavy I/O on large VMs (e.g. 8 TB Nextcloud) during extended degraded operation.
- Running `zpool clear` without addressing the hardware; it does not fix a dead disk.

---

## Reference: healthy disks (for spare matching)

| Serial | Device | Capacity | SMART |
|--------|--------|----------|-------|
| W4J0L0BA | sda | 5.00 TB | PASSED |
| W4J0K9V7 | sdc | 5.00 TB | PASSED |
| W4J0LKCD | sdd | 5.00 TB | PASSED |

---

## Timeline (brief)

| When | Event |
|------|--------|
| Prior to 2026-05-21 | `W4J0L3PY` accumulated read/write errors; pool faulted |
| 2026-05-21 | Pool **SUSPENDED**; host `sync()` wedged; Cal.com LXC start failed |
| 2026-05-21 ~21:28 | Forced node reboot; pool **DEGRADED**, resilver finished, 0 errors |
| 2026-05-21 | `sdb` still reports **~137 GB**, UNAVAIL; **replacement still required** |

---

## Contact / handoff notes

- **Node:** Proxmox VE 8.x on **PVENAS** (`10.0.10.10`)
- **Pool name in Proxmox:** `NAS.SP00` (zfspool, active, degraded)
- **Failed serial:** **W4J0L3PY**
- **Replacement type:** 5 TB+ HDD, same or better class as Seagate ST5000DM000-1FK178

For questions about homelab service impact (Cal, Caddy, Phase 0 rollout), see [`levkin-selfhost-plan-2.md`](levkin-selfhost-plan-2.md).

## TL;DR

- Pool `NAS.SP00` on `PVENAS` (`10.0.10.10`) has a failed disk (`W4J0L3PY`, `/dev/sdb`).
- The pool went **SUSPENDED**, required a forced reboot, and is now **DEGRADED**.
- **Immediate action:** replace the failed drive with a 5 TB or larger disk (see the healthy-disk reference table above for the model to match).
- Use `zpool replace` with the correct device paths (see "After new disk is installed").
- Monitor the resilver to completion, then run `zpool scrub`.
- Backup services and large VMs (e.g. the 8 TB Nextcloud VM) depend on pool health, so keep the degraded period short.
- Reach out if unsure about pool status or downstream service risk.