> This is not a bug report. [...] The goal is constructive, not a complaint.
Er, I appreciate trying to be constructive, but in what possible situation is it not a bug that a power cycle can lose the pool? And if it's not technically a "bug" because BTRFS officially specifies that it can fail like that, why is that not in big bold text at the start of any docs on it? 'Cuz that's kind of a big deal for users to know.
EDIT: From the longer write-up:
> Initial damage. A hard power cycle interrupted a commit at generation 18958 to 18959. Both DUP copies of several metadata blocks were written with inconsistent parent and child generations.
Did the author disable safety mechanisms for that to happen? I'm coming from being more familiar with ZFS, but I would have expected BTRFS to also use a CoW model where it wasn't possible to have multiple inconsistent metadata blocks in a way that didn't just revert you to the last fully-good commit. If it does that by default but there's a way to disable that protection in the name of improving performance, that would significantly change my view of this whole thing.
As far as I can see, no: nothing in what the author documented suggests he disabled any such safety mechanism.
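For what it's worth, the committed generation recorded in each device's superblock can be inspected with btrfs-progs. A sketch, assuming the pool's devices are /dev/sdb and /dev/sdc (paths illustrative):

```shell
# Print each superblock's committed generation; after a clean commit
# every device in the pool should report the same number.
for dev in /dev/sdb /dev/sdc; do
    echo "== $dev =="
    btrfs inspect-internal dump-super "$dev" | grep -E '^generation'
done
```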
I suspect that the author's intent is less "I do not view this as a bug" and more "I do not think it's useful to get into angry debates over whether something is a bug". I do not know whether this is a common thing on btrfs discussions, but I have certainly seen debates to that effect elsewhere.
(My personal favorite remains "it's not a data loss bug if someone could technically theoretically write something to recover the data". Perhaps, technically, that's true, but if nobody is writing such a tool, nobody is going to care about the semantics there.)
> I suspect that the author's intent is less "I do not view this as a bug" and more "I do not think it's useful to get into angry debates over whether something is a bug".
Agreed, and I appreciate the attempt to channel things into a productive conversation.
btrfs's reputation is not great in this regard.
Unless I missed it the writeup never identifies a causal bug, only things that made recovery harder.
Added to my list of reasons to never use btrfs in production.
Using DUP as the metadata profile sounds insane.
Changing the metadata profile to at least raid1 is a good idea, especially for anyone, against recommendations, using raid5 or raid6 for a btrfs array (raid1c3 for raid6). That would make it very difficult for metadata to get corrupted, which is the lion's share of the higher-impact problems with raid5/6 btrfs.
To check the current profiles, run `btrfs filesystem df` on the mount point; to convert the metadata, run a balance with `-mconvert` (make sure it's -mconvert, m is for metadata, not -dconvert, which would switch profiles for the data instead, messing up your array).

This is obviously LLM output, but perhaps LLM output that corresponds to a real scenario. It's plausible that Claude was able to autonomously recover a corrupted fs, but I would not trust its "insights" by default. I'd love to see a btrfs dev's take on this!
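Assuming a pool mounted at /mnt (path illustrative), the check-then-convert sequence might look like:

```shell
# Show current block group profiles; the Metadata line reveals
# whether the pool is still on DUP
btrfs filesystem df /mnt

# Rewrite only the metadata block groups (use raid1c3 instead if the
# data is on raid6); -mconvert targets metadata, while -dconvert would
# rewrite the data profile instead
btrfs balance start -mconvert=raid1 /mnt
```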
This was also my first impulse. The second was: if this happened to me, I would not be able to recover it. All the custom C tool talk... if you ask Claude Code, it will code something up.
Well, that he recovered the disks is amazing in itself. I would have given up and just pulled a backup.
However, I would like to see a dev saying: why didn't you use the --<flag> we created for exactly this use case?
I was assuming real scenario with heavy LLM help to recover. Would be nice for the author to clarify. And, separately, for BTRFS devs to weigh in, though I'd somewhat prefer to get some indication that it's real before spending their time.
An LLM wouldn't make a mistake like "One paragraph summary"
> Case study: recovery of a severely corrupted 12 TB multi-device pool, plus constructive gap analysis and reference tool set #1107
Please don't be btrfs please don't be btrfs please don't be btrfs...
I mean, the only other option was bcachefs, which might have been funny if this LLM-generated blogpost were written by the OpenClaw instance the developer has decided is sentient:
https://www.reddit.com/r/bcachefs/comments/1rblll1/the_blog_...
But no. It was btrfs.
As a side note, it's somewhat impressive that an LLM agent was able to produce a suite of custom tools that were apparently successfully used to recover some data from a corrupted btrfs array, even ad-hoc.
It could be ZFS. I'd be much more surprised, but it can still have bugs.
ZFS on Linux has had many bugs over the years, notably with ZFS-native encryption and especially sending/receiving encrypted volumes. Another issue is that using swap on ZFS is still guaranteed to hang the kernel in low memory scenarios, because ZFS needs to allocate memory to write to swap.
The zero copy that zero copied unencrypted blocks onto encrypted file systems was genius. It’s almost like they don’t test.
To the author: did you continue using btrfs after this ordeal? An FS that will not eat (all) your data upon a hard power cycle only at the cost of 14 custom C tools is a hard pass from me, no matter how many distros try to push it down my throat as 'production-ready'...
Also, impressive work!
What are the alternatives to btrfs? At 12 TB, data checksums are a must unless the data can tolerate bit-rot. And if one wants to stick with the official kernel without out-of-tree modules, btrfs is the only choice.
I tried btrfs on three different occasions. Three times it managed to corrupt itself. I'll admit I was too enthusiastic the first time, trying it less than a year after it appeared in major distros. But the latter two are unforgivable (I had to reinstall my mom's laptop).
I've been using ZFS for my NAS-like thing since then. It's been rock solid (*).
(*): I know about the block cloning bug, and the encryption bug. Luckily I avoided those (I don't tend to enable new features like block cloning, and I didn't have an encrypted dataset at the time). Still, all in all it's been really good in comparison to btrfs.
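For anyone similarly cautious: whether block cloning is even in use on a pool can be checked through its feature flag (pool name `tank` is illustrative):

```shell
# "disabled" = never enabled, "enabled" = available but unused,
# "active" = cloned blocks exist on disk
zpool get feature@block_cloning tank
```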
> if one wants to stick with the official kernel without out-of-tree modules
I wonder how a requirement like that could possibly arise. Especially with an obvious exception for zfs.
Bcachefs also fulfills the requirement of checksums (and multi device support).
Also out of tree.
Does it not also eat data though?
lvm offers lvmraid, integrity, and snapshots as one example. It's old unsexy tech, but losing data is not to my taste lately...
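A sketch of that lvm route, assuming a volume group `vg0` with enough physical volumes (names and sizes illustrative; `--raidintegrity` needs a reasonably recent lvm2):

```shell
# Mirrored LV with a dm-integrity layer under each leg, so a
# bit-rotted read is detected and served from the healthy mirror
lvcreate --type raid1 -m1 --raidintegrity y -L 100G -n data vg0

# Classic copy-on-write snapshot of the same LV
lvcreate --snapshot -L 10G -n data_snap vg0/data
```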
Could try ZFS or CephFS... even if several host roles are in VM containers (45Drives has a product setup that way.)
The btrfs solution has a mixed history, and was prone to a lot of the same issues DRBD could hit. They are great until some hardware or kernel module eventually goes sideways, and then the auto-healing cluster filesystems start to make a lot more sense. Note that with cluster-based complete-file copy/repair object features, the damage is localized to single files at worst, and folks don't have to wait 3 days to bring the cluster back up after a crash.
Best of luck, =3