How Libcorrect Corrects Errors, Part I

(quiet.github.io)

56 points | by brian-armstrong 2041 days ago

1 comment

  • FooBarWidget 2038 days ago
    So there's the phenomenon of bit rot. Filesystems like ZFS add data checksums, and they recommend that you scrub the data once in a while, and then when errors are detected you are supposed to restore from a replica/backup (whether restoration happens automatically or manually is beside the point; the point is that a replica/backup is required in order to restore).

    Why can't error correction codes be used instead of replicas/backups, as the primary means to recover from bitrot? (NOTE: I am talking about the primary means; obviously a backup is still necessary for more serious forms of disaster recovery. I am not advocating abolishing backups altogether.) That would make things a lot easier for single-disk use cases, e.g. laptops or consumer devices. It also saves space, which matters because laptop SSDs and phone SD cards aren't that big. A replica doubles the space requirements, while an error correction code takes less than that.

    • xenadu02 2038 days ago
      That's what RAID-5 (and variants) are. Btrfs (and maybe ZFS?) support making certain files or volumes use parity stripes so corruption is recoverable in-place.

      Per Shannon, these things are all related. You can think of error-correcting codes as a form of replica compression: get the reliability of multiple replicas at a fraction of the size.
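
      As a rough sketch of that framing (a toy example of my own, not any particular filesystem's scheme): one XOR parity block over four data blocks lets you rebuild any single lost block, much like a full mirror would, for 25% extra space instead of 100%.

        # Toy "replica compression": 4 data blocks + 1 XOR parity block.
        # Losing any single block is survivable, just like with a mirror.
        import os
        data = [os.urandom(16) for _ in range(4)]   # four 16-byte data blocks
        parity = bytes(a ^ b ^ c ^ d for a, b, c, d in zip(*data))
        lost = 2                                    # pretend block 2 went bad
        survivors = [blk for i, blk in enumerate(data) if i != lost] + [parity]
        rebuilt = bytes(w ^ x ^ y ^ z for w, x, y, z in zip(*survivors))
        assert rebuilt == data[lost]
        # Space cost: 1 parity block per 4 data blocks (25%) vs 100% for a mirror;
        # real systems use Reed-Solomon-style codes to survive more than one loss.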

      As with compression, there are CPU-time tradeoffs as well. Writing two replicas can happen completely in parallel (they're independent) and it imposes basically no CPU cost. Writing an error-correcting code imposes a CPU penalty, and for files that get mutated it imposes write overhead (though SSDs have that anyway, making it less of a concern).

      I think a lot of people assume bitrot happens over time and you need backups anyway, so why take the hit on the primary for some small protection? The reasonableness of that depends on your specific use case and risk tolerance.

      • sp332 2038 days ago
        Note that btrfs support for RAID 5/6 is still in development and not to be trusted (https://btrfs.wiki.kernel.org/index.php/Status). It does store checksums though, so if you run dup or "raid 1" (which is not really RAID 1), it can figure out which copy is correct and recover. Not as efficient, though, since you need two complete copies of the data.
      • FooBarWidget 2038 days ago
        > That's what RAID-5 (and variants) are.

        How feasible is it to set up RAID-5 on a laptop or a phone?

        > I think a lot of people assume bitrot happens over time and you need backups anyway, so why take the hit on the primary for some small protection?

        As I said: laptop and consumer device use cases. SSD storage on a Macbook is crazy expensive.

    • guitarbill 2038 days ago
      > and then when errors are detected you are supposed to restore from a replica/backup.

      ZFS will recover/self-heal from errors if it can definitively figure out which data is correct.

      many storage technologies do protect against errors; CD/DVD/HDD all have ECC in the physical layer. but without control/knowledge of the physical medium (so at filesystem level), how do you distribute the ECC sensibly? you can't.

      another issue is if the other hardware doesn't do ECC, for example with non-ECC RAM. RAM is arguably more susceptible to bitflips due to its dynamic nature. can you recover from errors if the information in your RAM can't be trusted? it's a hard problem, and ZFS pretty much requires ECC RAM for any data integrity guarantees to work.

      it also doesn't always matter. a single bit flip in e.g. a jpg or mp4 file doesn't necessarily render it unusable, so people don't care.

      finally, ECC is a bit useless if your whole drive or device fails, which is a much more common failure mode.

      storage is cheap nowadays, and even double or triple redundancy is cheaper and more straightforward than trying to be clever.

      • toast0 2038 days ago
        > finally, ECC is a bit useless if your whole drive or device fails, which is a much more common failure mode.

        I'm not sure that this is true. Almost all the platter-based drives I've had fail gave some warning signs before they became completely inaccessible. (The exception I can remember was the one whose circuit board I damaged with the wrong type of screw.) My personal experience is just anecdotes, but we had a couple thousand disks at work that I managed, and we ended up swapping about one disk a week. SMART defect counts were very predictive of future disk failures, although even then the disks would usually be partially readable.

        With SSDs, on the other hand, the failure rate was much lower, but the failure mode was always the disk disappearing from the bus. We could never figure out a way to predict that (our write volumes aren't very high). Occasionally we'd see a large increase in defect count and a slowdown in access for a while as a drive reallocated a large block, but if we waited for it to settle, everything would be fine afterward.

      • FooBarWidget 2038 days ago
        > how do you distribute the ECC sensibly? you can't.

        I am not an expert on ECC. Are you saying that you can't store the ECC just anywhere? Just storing it alongside the data is not good enough? Why does ECC need special treatment?

        > storage is cheap nowadays, and even double or triple redundancy is cheaper and more straightforward than trying to be clever.

        Tell that to Apple, who charges $500+ for a 1 TB SSD upgrade in a MacBook Pro. :-( "Cheap" is relative. I am worried about bitrot on my laptop, but I also don't want to halve my disk space in order to protect against that.

        • pwg 2038 days ago
          > > how do you distribute the ECC sensibly? you can't.

          > I am not an expert on ECC. Are you saying that you can't store the ECC just anywhere?

          Well, you can put it "just anywhere", but /where/ you put it determines /what failure types/ you can recover from.

          > Just storing it alongside the data is not good enough? Why does ECC need special treatment?

          If you want to recover from bitrot, then putting the ECC data for a sector alongside the data in the same sector is sufficient (you'll have fewer bytes stored per sector, but if a bit flips, you can recover the original data).

          But storing the ECC in the same sector with the data it protects will not protect against losing the entire sector (a "drive can't read the whole sector" error). In this instance both the data and the ECC are lost simultaneously, so the ECC cannot help. So if you want to protect against loss of a sector, you need your ECC stored somewhere else (i.e., on a different sector that is unlikely to be correlated with the lost one in a failure situation) so that you still have the ECC available when the sector you are protecting goes away.

          But, if you are protecting against loss of an entire physical drive, then the ECC for the drive needs to be on another physical drive (same reasons apply as for a "sector", just at the level of a whole physical disk).

          It is all tradeoffs. You /can/ put it anywhere, but where you choose to store it determines which failure types you can recover from.
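
          To make the "same sector" case concrete (an illustrative sketch only, with Hamming(7,4) standing in for whatever far stronger code a real drive uses): check bits stored inline with the data can fix a flipped bit, but they disappear along with the data when the whole sector becomes unreadable.

            # Hamming(7,4): 3 check bits protect 4 data bits against one flip.
            def encode(d1, d2, d3, d4):
                p1 = d1 ^ d2 ^ d4
                p2 = d1 ^ d3 ^ d4
                p4 = d2 ^ d3 ^ d4
                return [p1, p2, d1, p4, d2, d3, d4]   # check bits stored inline

            def decode(c):
                s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
                s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
                s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
                pos = s1 + 2 * s2 + 4 * s4            # 1-based position of the flip
                if pos:
                    c[pos - 1] ^= 1                   # repair in place
                return [c[2], c[4], c[5], c[6]]

            word = encode(1, 0, 1, 1)
            word[5] ^= 1                              # bit rot inside the "sector"
            assert decode(word) == [1, 0, 1, 1]       # recovered from the sector itself
            # But if the whole sector is unreadable, these inline check bits are
            # gone with it; recovery then needs redundancy stored somewhere else.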

          • blattimwind 2038 days ago
            Disk drives use sector-level ECC; silent sector corruption should be (and IME is) rarer than unrecoverable sectors.
        • TheAceOfHearts 2038 days ago
          Buy an external HDD and sync important files when you're at home. It's unlikely you'd suffer from bitrot in files created while you're away from home.
      • blattimwind 2038 days ago
        > many storage technologies do protect against errors; CD/DVD/HDD all have ECC in the physical layer. but without control/knowledge of the physical medium (so at filesystem level), how do you distribute the ECC sensibly? you can't.

        SSDs and other flash storage heavily rely on ECC as well.

    • BurnGpuBurn 2038 days ago
      To do error correction you need more bits than when you only need to detect errors.

      With hard disks, error correction is done by adding a CRC code [0] (or something similarly short) to each block, so that when the device reads a block and the CRC check fails it can be fairly certain that that block isn't performing well anymore and stop using it. SSDs behave differently as far as I know, but there's a similar mechanism in use.

      To be able to not only detect errors but also correct the data you need more bits of error correction information. Offering up bits to store error correction information isn't worth the penalty of not having those bits available to store data in a lot of cases.

      [0] https://en.wikipedia.org/wiki/Cyclic_redundancy_check
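
      A rough way to see the bit cost (illustrative numbers only, not any specific drive's on-disk format): a checksum like CRC-32 only has to say "this block changed", while a correcting code has to pin down which bits changed, and that takes more check bits as the block grows.

        import zlib
        sector = bytes(4096)                         # pretend 4 KiB block
        stored_crc = zlib.crc32(sector)
        corrupted = bytearray(sector)
        corrupted[100] ^= 0x01                       # one bit of rot
        assert zlib.crc32(corrupted) != stored_crc   # detected: block is known bad
        # To *correct* any single flipped bit among m data bits, a Hamming-style
        # code needs r check bits with 2**r >= m + r + 1:
        m = 4096 * 8
        r = 1
        while 2 ** r < m + r + 1:
            r += 1
        print("detect one flip: 1 parity bit; correct it: at least", r, "check bits")
        # r == 16 here, and real drives spend far more than that (Reed-Solomon,
        # LDPC) so they can correct whole bursts of bad bits, not just one.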

      • blattimwind 2038 days ago
        > With hard disks, error correction is done by adding a CRC code [0] (or something similarly short) to each block, so that when the device reads a block and the CRC check fails it can be fairly certain that that block isn't performing well anymore and stop using it. SSDs behave differently as far as I know, but there's a similar mechanism in use.

        > To be able to not only detect errors but also correct the data you need more bits of error correction information. Offering up bits to store error correction information isn't worth the penalty of not having those bits available to store data in a lot of cases.

        Hard disks use forward error correction to make more effective use of the medium; since the "medium" is a noisy channel, using FEC to inflate the data while still being able to recover from errors lets you do much, MUCH better on the BER/bandwidth front [increasing the "bandwidth" of the "channel" means more data stored] than storing data naively. That's why everyone is doing it. Same goes for SSDs.

        The 520-byte sector approach is/was still used, but it is not the physical layer of hard disks and hasn't been for ... an unsigned long long time.
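
        Back-of-the-envelope version of that trade (with a made-up raw error rate, just to show the shape of it): redundancy turns a noisy medium into a much cleaner one, so the medium can be pushed harder and packed denser.

          from math import comb
          p = 1e-4            # assumed raw bit error rate of the medium (made up)
          # Naive storage: every raw error is a user-visible error.
          # (7,4) Hamming blocks: a block only fails if 2+ of its 7 bits flip.
          block_fail = sum(comb(7, k) * p**k * (1 - p)**(7 - k) for k in range(2, 8))
          print(f"raw BER {p:.0e} -> post-FEC block failure ~{block_fail:.0e}")
          # ~2e-07: orders of magnitude better for 3 check bits per 4 data bits.
          # Drive FEC (Reed-Solomon, LDPC) does better still at lower relative
          # overhead, which is part of what lets areal density keep climbing.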

        • BurnGpuBurn 2024 days ago
          I didn't know that; I thought those techniques were only used in data transmission over wired or wireless network connections. Thanks for the heads up.
    • brian-armstrong 2038 days ago
      Your comment brought to mind the Backblaze project that does just that, though I think it operates on a larger scale than what you're suggesting.

      https://www.backblaze.com/blog/reed-solomon/

      • FooBarWidget 2038 days ago
        I am hoping for something built into the filesystem so that it works transparently. :(
        • lathiat 2038 days ago
          ZFS does have the option of storing 2 copies even on the same disk (zfs set copies=2). Unfortunately it doesn't have an ECC-style option for a single disk; you need to use multiple disks, in which case it will transparently fix things in a RAID-Z if it can.
          • toast0 2038 days ago
            You can set the copies setting differently for different parts of the filesystem, which could potentially help -- if you have /home/ set to two copies, at least you're not paying double for programs.

            Alternatively, you could set up multiple partitions and run raidz across them, but performance is going to be awful.