Backblaze Durability Is Eleven 9s – And Why It Doesn’t Matter

(backblaze.com)

319 points | by ingve 2081 days ago

28 comments

  • londons_explore 2081 days ago
    This analysis is simplistic.

    Correlated failures are common in drives. That could be a power surge taking out a whole rack, a firmware bug in the drives making them stop working in the year 2038, an errant software engineer reformatting the wrong thing, etc.

    When calculating your chance of failure, you have to include that, or your result is bogus.

    E.g. drive model A has a failure rate of 1% per year, but when it fails the symptom is a failure to spin up from cold; if it is already spinning it will keep working as normal.

    3 years later, the datacenter goes down due to a grid power outage, and a dispute with diesel suppliers means the generators go down too. It's a controlled shutdown, so you believe no data is lost.

    2 days later when grid power is back on, you boot everything back up, only to find out that 3% of drives have failed.

    Not a problem. Our 17 out of 20 redundancy can recover up to 15% failure!

    However, each customer's data is split into files of around 8MB, which are in turn split into the 20 redundancy chunks. Each customer stores, say, 1TB with you. That means each customer has ~100k files.

    The chances that you only have 16 good drives for a file is about (0.97^16 * 0.03^4) * 20*19*18 = 0.3%

    Yet your customer has 100k files! The chance they can recover all their data is only (1-0.003)^100000... Which means every customer suffers data loss :-(

    • Johnny555 2081 days ago
      That could be a power surge taking out a whole rack

      This failure mode, at least, is already accounted for by sharding data across cabinets:

      Each file is stored as 20 shards: 17 data shards and 3 parity shards. Because those shards are distributed across 20 storage pods in 20 cabinets, the Vault is resilient to the failure of a storage pod, or even a power loss to an entire cabinet.

      https://www.backblaze.com/blog/vault-cloud-storage-architect...

      However, they don't seem to offer multi-datacenter (or multi-region) redundancy so are still susceptible to a datacenter fire/failure.

      In comparison, AWS S3 distributes data across 3 AZ's (datacenters), and you can further replicate across regions if you choose. Though you pay for that added redundancy with a 3-4X higher cost.

      • ChuckMcM 2081 days ago
        A better example would be the ceramic bearing fiasco that NetApp experienced with Seagate. Seagate had switched to a floating ceramic bearing on one family of their fiber channel drives. In those drives one or more of the bearings would shatter and start spreading ceramic dust across the disk surface. This happened between 3 and 4 years of run time, and the disk would rapidly fail after that happened. People who bought a filer with several hundred drives started seeing large numbers fail suddenly. Up to the point they failed, there was no indication. But it was run-time-hours related and thus correlated across all drives that started running at the same time. Putting those drives in different data centers, on different racks, on different servers wouldn't have mattered: if you started 20 at the same time to be your 'tome', when they started failing it was possible to lose them all in a week.
        • tlb 2081 days ago
          There was also a failure mode in Seagate drives where the bearing increased in stiction. As long as it was spinning, there was no problem. But if you spun it down, it might not spin up again. If you had a group of disks powered up for a long time, many could fail together at the next power cycle.

          A chaos monkey that randomly powers down disks one at a time can prevent this.
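
          A minimal sketch of that idea; power_cycle is a hypothetical hook for whatever your hardware actually exposes (hot-swap controller, smart PDU, etc.), so this version only prints what it would do:

              import random
              import time

              DISKS = [f"disk{i}" for i in range(20)]   # placeholder inventory

              def power_cycle(disk):
                  # Hypothetical hook: spin the disk down, wait, spin it back up,
                  # and verify it rejoins the array before touching anything else.
                  print(f"would power-cycle {disk}")

              # One disk at a time, on a slow schedule, so a stiction failure
              # surfaces while the rest of the group is still healthy enough
              # to rebuild from.
              while True:
                  power_cycle(random.choice(DISKS))
                  time.sleep(7 * 24 * 3600)             # roughly one disk per week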

          • Dylan16807 2080 days ago
            Though that sounds like a fixable failure mode, so it's more annoyance than tragedy.
        • tinus_hn 2080 days ago
          You can tell from the reports they put out that Backblaze doesn't put all their eggs in one basket; they use different brands and drive types.
      • gm-conspiracy 2081 days ago
        I thought AZs were in the same physical location, just separate networks, no?
        • Johnny555 2081 days ago
          They are separate datacenters. I don't think they make any promises about how far apart they are, but at least in some regions they are 10+ miles apart.

          The AWS Cloud infrastructure is built around AWS Regions and Availability Zones. An AWS Region is a physical location in the world where we have multiple Availability Zones. Availability Zones consist of one or more discrete data centers, each with redundant power, networking, and connectivity, housed in separate facilities

          https://docs.aws.amazon.com/aws-technical-content/latest/aws...

          • gm-conspiracy 2080 days ago
            So, a single AZ could span across multiple physical locations, as well?

            Am I reading that correctly?

        • strong_silent_t 2081 days ago
          Others have answered, but I think the general principle for a Region is that between the AZs you have less than 1ms latency, physical separation between the data centers, but they may still be in the same floodplain, be able to be hit by the same hurricane, etc.

          More info here if you want to see how regions are structured at a high level of detail: https://youtu.be/AyOAjFNPAbA?t=15m21s

        • jsjohnst 2081 days ago
          Nope. They aren't even in the same city. For example, the AZs in us-west-2 are separated by about 50-60 miles.

          This doesn’t list all the locations, but is a good map to get an idea:

          https://www.google.com/maps/d/u/0/viewer?ll=50.9584270000000...

          • fs111 2081 days ago
            That map is incorrect: us-west-2a, 2b and 2c are not static names for the AZs. Every user gets their own mapping of which physical location is a, which one is b and which one is c. My us-west-2a may be your us-west-2c. They are not the same.
            • gregsadetsky 2080 days ago
              Extremely interesting, thanks for this clarification. For those curious, it's documented here[0] -- search for "To ensure that resources are distributed across the Availability Zones for a region"...

              [0] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-re...

            • jsjohnst 2080 days ago
              Yes, I know they pseudo randomize the allocations. That is irrelevant to my core point, that each AZ is not just separate networks in the same building or even adjacent buildings, but rather they are truly isolated by a non-insignificant distance of somewhere around 50mi on average.
              • Johnny555 2080 days ago
                It's a cool map, but I can't find any reference for the source of the data center locations.

                AWS purposely doesn't publish that information, and while I can believe it's possible to crowdsource the data by doing a little sleuthing (or working for certain vendors), it's hard to trust the map without knowing the sources.

        • coderholic 2081 days ago
          No, they're different physical locations.

          From https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Conce...:

          > Each AWS Region has multiple, isolated locations known as Availability Zones.

          And from https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-re...:

          > Each region is completely independent. Each Availability Zone is isolated, but the Availability Zones in a region are connected through low-latency links.

      • catwell 2081 days ago
        Multi-region redundancy was on the roadmap for their B2 offering 10 months ago, not sure if it has shipped yet.

        https://news.ycombinator.com/item?id=15125643

        • Johnny555 2081 days ago
          I poked around their pricing page a bit before I posted to see if I could find multi-site redundancy options, but couldn't find anything.
          • atYevP 2080 days ago
            Yev from Backblaze here -> Not out yet, but we're working towards it!
    • jrockway 2081 days ago
      I'm not sure how anyone, in their disaster recovery plans, ever expects anything less than 100% failure of a data center. The scenario is: a tornado hits the data center.

      Your power outage causing 3% of drives to fail is just a subset of that.

      • stubish 2080 days ago
        It's just risk analysis: the cost involved in splitting things over multiple data centers vs. the chance of your single DC getting wiped out. In many cases, declaring bankruptcy if it happens will be the best business decision; insurance makes sense if the risk is higher, or if the owners can't afford the loss or have liability.
      • viraptor 2080 days ago
        If it's your data center, you can plan the location to prevent most of this. There are locations where the natural hazards can be completely managed. (No tornados, fires, tsunamis, earthquakes, ...) So the power outage is the most likely thing to happen.
        • jrockway 2080 days ago
          I guess. I feel like "a tornado can never happen" is a lot like those lines in the logs like "error: can't happen". It can't happen, but it does.
          • kijin 2080 days ago
            Theoretically, yes, it can happen.

            Realistically, the chance of a tornado taking out the Swedish datacenter built inside a former nuclear bunker under 100ft of granite bedrock is so small that it probably doesn't affect the number of 9's that you can claim.

            • not_kurt_godel 2080 days ago
              Sadly not even Swedish nuclear bunkers are safe from disaster: https://www.theguardian.com/environment/2017/may/19/arctic-s...

              > Arctic stronghold of world’s seeds flooded after permafrost melts

              > It was designed as an impregnable deep-freeze to protect the world’s most precious seeds from any global disaster and ensure humanity’s food supply forever. But the Global Seed Vault, buried in a mountain deep inside the Arctic circle, has been breached after global warming produced extraordinary temperatures over the winter, sending meltwater gushing into the entrance tunnel.

              There are always unforeseen and unforeseeable risks associated with any location. You can mitigate them but you can't claim X number of 9s for a single physical datacenter.

              • Dylan16807 2080 days ago
                Not a tornado, though.

                > There are always unforeseen and unforeseeable risks associated with any location. You can mitigate them but you can't claim X number of 9s for a single physical datacenter.

                What is X, here? I'm pretty sure I can claim 99% for a single datacenter.

                • not_kurt_godel 2080 days ago
                  What I meant is you can't necessarily amortize loss in the event of a localized catastrophe. Failure modes in a single location are by definition not always statistically independent. You could have 99.99999% durability for 20 years, but if something happens to the datacenter that causes total loss, you're SOL. Geographical redundancy vastly reduces the risk of freak occurrences that you can't predict.
                  • kijin 2080 days ago
                    If a datacenter boasts flawless durability for 19 years and loses everything in the 20th year, then they have an infinite number of 9's for the first 19 years and zero for the 20th year. It's all about probability.

                    Nobody can promise 100%, but that doesn't mean that all those 9's are meaningless. They mean a lot for budgeting, and even more for insurance purposes -- which is exactly what we as a civilization have come up with as a way to amortize loss in the event of a local catastrophe. Your premiums are going to be much higher if you don't have enough 9's in a critical part of your money-making infrastructure.

                    No one here is saying that you don't need geographical redundancy. First we need to figure out how many 9's we can realistically expect in order to determine how much redundancy makes financial sense.

                    • not_kurt_godel 2080 days ago
                      > No one here is saying that you don't need geographical redundancy

                      I mean, that's kind of what Backblaze is saying in the article, isn't it? They don't have geographical redundancy, yet there's not a single mention of that fact or the importance thereof in an entire article dedicated to teaching the unwashed masses about the limitations of mathematical theory in analyzing durability, even going so far as to say:

                      > somewhere around the 8th nine we start moving from practical to purely academic... it’s far more likely that...Earthquakes / floods / pests / or other events known as “Acts of God” destroy multiple data centers [emphasis my own]

                      Seems like a pretty serious omission given their claimed authority as "the bottom line for data durability" and being "like all the other serious cloud providers" who do have geo redundancy, don't ya think?

                      • kijin 2080 days ago
                        As mentioned elsewhere in this thread, Backblaze is working on adding another datacenter.

                        Personally, I don't care whether a single provider has multiple datacenters or not, because I prefer to have redundancy across providers. But that's not the kind of recommendation that we're likely to see on the blog of one of those providers.

                      • mijamo 2080 days ago
                        I don't think geo redundancy helps much. Your data is more likely to be corrupted by the provider's software, or by common hardware used across the provider, than by some random storm.

                        If you need to be safe about your data, you NEED several cloud providers in different places, with different software and in different countries.

                        Especially for data it is pretty easy to just back it up in 2 really different places at different providers. Relying on the geo redundancy of ONE provider, and having to pay for it, seems a bit useless to me.

                • Hello71 2080 days ago
                  hell, I can probably do a solid 9% just visiting the local coffee shop twice a week.
      • arminiusreturns 2080 days ago
        The first datacenter I ever brought up to Tier 2 was tornado-proof (a former military base).
    • ChuckMcM 2081 days ago
      I was going to post essentially the same thing, so here is an upvote :-)

      While I always find storage analysis interesting (I spent 5 years at NetApp where it was sort of a religion :-)) some of the assumptions that Brian was tossing out are not good ones to make. (like the lack of correlation, or that Drive Savers will exist as a company 10 years from now).

      Still, it does help you to understand that they take data availability seriously, which is the underlying message.

      • brianwski 2081 days ago
        Disclaimer: I'm the author of the blog post. :-)

        > some of the assumptions that Brian was tossing out are not good ones to make.

        We COMPLETELY welcome other analysis and listing other assumptions. Internally, we argued endlessly about why this or that wasn't totally accurate, and finally decided to publish the math WITH all of our assumptions exposed so you could be the judge. If Amazon wants to publish their assumptions for S3 for comparison, we're all ears.

        > that Drive Savers will exist as a company 10 years from now

        Absolutely true, this calculation is only good RIGHT NOW. For example, one of the things that came up internally was "well, when drives get more dense the rebuild time rises, so this calculation will no longer be accurate in two years". But at the same time, we have some additional tricks and optimizations to make which we have not done yet to cut the 6 day average drive rebuild time down to 3 days. Also, drives last us about 5 years, so your data will be migrated to totally new drives 5 years from now. Those drives will absolutely have a different drive failure rate (maybe higher, maybe lower) so the calculation will no longer have the same result 5 years from now.

        • ChuckMcM 2081 days ago
          Hey Brian! For the record I love that you are transparent about your assumptions, it is really helpful. I have been in very very similar debates, both at NetApp and at Google of all places. Peter Corbett, the guy who invented the dual parity scheme that NetApp uses, wrote a similar analysis as well for Fast '04 [1].

          As someone who likes to geek out on failure proof systems and perfectly secure systems, neither of which are attainable but can be asymptotically approached, I think you are seeing the "there is always one level deeper" kinds of discussions. Personally I think of them as endorsements because if the exceptions get too extreme (say 'what if an asteroid hits?') then you know you've got all the bases covered.

          It's only a problem if the person analyzing the analysis finds something that you really did not even consider. Then it opens up an opportunity to look at the problem in a whole new way.

          [1] https://www.usenix.org/legacy/publications/library/proceedin...

          • brianwski 2081 days ago
            Good link! I've both forwarded it on, and will study it when I have some time later.
            • jorangreef 2080 days ago
              Hi Brian, thanks for open-sourcing Backblaze's JavaReedSolomon, which is really well-written. A few months ago I ran into an issue with Reed Solomon coding throughput not saturating the write throughput of 16 drives, and wrote a new Reed Solomon module based on Cauchy matrices: https://github.com/ronomon/reed-solomon

              The Cauchy matrices remove the need for a table lookup to do the Galois multiply, replacing it with pure XOR. Together with other optimizations, this gives nearly 3x-5x more coding throughput for the same (17,3) parameters, assuming you're still using your open-sourced JavaReedSolomon in production. I don't know if Reed Solomon coding throughput is a factor in your rebuild times?
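
              For anyone wondering what the table lookup being removed looks like, here is a minimal sketch of the classic log/antilog multiply in GF(2^8) (assuming the common 0x11D reduction polynomial; real libraries typically use full multiplication tables or SIMD, but the per-byte lookups are exactly what the XOR-only Cauchy construction avoids):

                  # Build log/antilog tables for GF(2^8) with generator 2.
                  EXP = [0] * 512
                  LOG = [0] * 256
                  x = 1
                  for i in range(255):
                      EXP[i] = x
                      LOG[x] = i
                      x <<= 1
                      if x & 0x100:
                          x ^= 0x11D         # assumed reduction polynomial
                  for i in range(255, 512):
                      EXP[i] = EXP[i - 255]  # avoids a modulo in the hot path

                  def gf_mul(a, b):
                      # Three table lookups and one add per multiply.
                      if a == 0 or b == 0:
                          return 0
                      return EXP[LOG[a] + LOG[b]]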

              • kdkeyser 2080 days ago
                There is also Intel's ISA-L: https://github.com/01org/isa-l

                It contains optimized Galois Field multiplication, resulting in Reed Solomon (both Vandermonde as well as Cauchy) at multiple GB/s on a modern x86 CPU.

                Still, I doubt that the Reed Solomon coding speed is the limiting factor in their rebuild time. There is a mention of a 6-day duration, so even with a very slow Reed Solomon implementation ( ~ 100 MB/s) that should not be a bottleneck for a 10 TB drive rebuild (assuming a distributed rebuild approach, not a traditional RAID style rebuild).
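
                Back-of-the-envelope version of that claim:

                    drive_bytes = 10e12     # 10 TB rebuild target
                    coding_rate = 100e6     # pessimistic 100 MB/s Reed-Solomon throughput
                    hours = drive_bytes / coding_rate / 3600
                    print(f"{hours:.0f} hours")   # ~28 hours, well inside a 6-day window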

        • voidmain 2080 days ago
          > If Amazon wants to publish their assumptions for S3 for comparison, we're all ears.

          The 2010 S3 calculation is obviously also wrong! I totally feel your pain in terms of wanting to have a directly comparable answer, but the reasons you give that "it doesn't matter" (and others, including correlated faults, software bugs, model risk, and security breaches) are actually reasons why the stated durability number is wrong. Honestly IMO a probability of 1-10^-11 is the wrong answer to pretty much any question; model risk is going to dominate that for any problem more complex than 1+1=2.

          That said, although neither your system nor Amazon's should be expected to have anywhere near "eleven nines" of durability in reality, if as I understand it S3 is split across availability zones in a region and your product has all splits in a monolithic DC, I would expect S3 to come out ahead in a more careful analysis. (But note that S3 is not really a seamless multi region product, though there is an option to set up cross region replication.)

          • pas 2079 days ago
            As far as I know, if they claim multi-AZ in a city (region), that really means at least separate buildings, but likely separate locations within the region too.

            Though I'd welcome hard data on this very much.

    • ChrisLomont 2081 days ago
      >The chance they can recover all their data is only (1-0.003)^100000... Which means every customer suffers data loss :-(

      You just made the same mistake you're criticizing. You assumed the 100K files were uniformly and independently spread. They're also likely clustered, and perhaps not even at the same data center. Given the variety of drives Backblaze uses, the drives are also not likely to all be the same model, so your failure mode is also unlikely.

      You also only looked at the case where there are exactly 16 good drives. The proper failure estimate is 1 - (odds of 17 good + odds of 18 + odds of 19 + odds of 20). I'm not sure where you got the 20 x 19 x 18 part either. Did you mean 20 choose 16 or something like that? Using the proper 1-... method I get 0.00267, not 0.03.

      • londons_explore 2081 days ago
        >You just made the same mistake you're criticizing

        Yes. Even if you take it into account, the vast majority of customers will see data loss, assuming random (but not even) shard distribution.

        I rounded and approximated to avoid explaining too many probability rules... As you can see, it gives the same result at the end.

        • ChrisLomont 2081 days ago
          >it gives the same result at the end

          No, it doesn't. You picked 3% of all drives failing out of the blue. Your next estimate was an order of magnitude too high. Your last assumption of p^# drives is not reasonable.

          The proof is in reality. Backblaze has run for over a decade, with all sorts of hardware failures, server configs, running many drive models through their lifetime, across manufacturers, across technologies, across multiple datacenters, and has not seen the level of failures you claim they will.

          So I suspect their method of estimating is more accurate than yours. So far it matches reality much better.

    • mmt 2081 days ago
      > Correlated failures are common in drives.

      This is why, when I was building DIY arrays for startups (around the same time Backblaze published their first pod design [1]), I went through the extra effort of sourcing disks from as many different vendors as possible.

      Although it was somewhat more time consuming, and limited how good a price I could get and how fast the delivery could be, it meant that, for any given disk drive size, I could build an array as large as 12 drives where no 2 were identical in model and manufacturing batch [2].

      Of course, it's still a vanishingly rare risk, and "nobody" cares about hardware any more. It does help to remember, at least once in a while, that, on some level, cloud computing really is "someone else's servers" and to hope that someone else still maintains this expertise.

      [1] though I used SuperMicro SAS expander backplane chassis for performance reasons

      [2] and firmware from the factory, although this is somewhat irrelevant, as one can explicitly load specific firmware versions, and, IIRC, the advantages of consistent firmware across drives, behind a hardware RAID card, outweighed the disadvantages

      • foobarian 2080 days ago
        I've been doing a poor man's version of this for home videos using 3 hard drives and rsync. It's easy to replace a drive and they are not likely to go out at the same time. But one thing that bugs me is that unless the drive fails hard (e.g. noticed by SMART or unable to read at all) how do I know the data on the drive is not corrupted without reading it? Are there best practices to continuously compare the replicas in the background? Does that impact durability of the drives?
        • mmt 2080 days ago
          > how do I know the data on the drive is not corrupted without reading it? Are there best practices to continuously compare the replicas in the background?

          I assume you're talking about already-written sectors becoming unreadable or a similar failure. Unfortunately, I don't think you can. This is what I believe the "patrol read" feature of RAID cards is meant to address.

          Fortunately, however, I don't believe there's evidence that if the data is readable, it would ever be different from what had been written, so comparison isn't needed. The main exception to this is the case of firmware bugs that return sectors full of all-zeros.

          > Does that impact durability of the drives?

          I haven't read the studies (from Google, mostly, IIRC) in a while, and I'm not sure if they've released anything lately for more modern drives [1]. However, I believe you'll find an occasional "patrol read" won't noticeably reduce drive life/durability.

          [1] Especially for something like SMR, whose tradeoffs would seem particularly attractive for something like this archival-like use case.

          • jorangreef 2080 days ago
            "I don't believe there's evidence that if the data is readable, it would ever be different from what had been written, so comparison isn't needed. The main exception to this is the case of firmware bugs that return sectors full of all-zeros."

            Comparison is needed to address misdirected writes and bit rot in the very least, see "An Analysis of Data Corruption in the Storage Stack" [1]. You can't count on your drive firmware or RAID firmware to get this right. You need bigger end-to-end checksums, and you need to scrub.

            [1] - http://www.cs.toronto.edu/~bianca/papers/fast08.pdf
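
            For the home-replica case upthread, a minimal sketch of what a userspace scrub can look like, assuming the replicas are plain directory trees (the mount points are placeholders; a file missing from one replica will raise here):

                import hashlib
                import os

                REPLICAS = ["/mnt/disk1/videos", "/mnt/disk2/videos", "/mnt/disk3/videos"]

                def sha256(path, bufsize=1 << 20):
                    h = hashlib.sha256()
                    with open(path, "rb") as f:
                        while chunk := f.read(bufsize):
                            h.update(chunk)
                    return h.hexdigest()

                # Walk the first replica and compare each file's digest against the
                # others. A mismatch alone doesn't say which copy rotted; a stored
                # manifest of known-good hashes (or a third copy) breaks the tie.
                base = REPLICAS[0]
                for root, _, files in os.walk(base):
                    for name in files:
                        rel = os.path.relpath(os.path.join(root, name), base)
                        digests = {r: sha256(os.path.join(r, rel)) for r in REPLICAS}
                        if len(set(digests.values())) > 1:
                            print("mismatch:", rel, digests)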

            • mmt 2080 days ago
              Thanks! I had either missed that paper or had taken away more of the message that these errors are more likely to be from events like misdirected writes, cache flush problems (hence the high correlation with systems resets and not-ready-conditions), and firmware bugs (on-drive and further up the stack), rather than bit-rot.

              Still:

              > On average, each disk developed 0.26 checksum mismatches.

              > The maximum number of mismatches observed for any single drive is 33,000.

              Considering the latter can represent 132 MB on a modern, 4K-sectored drive, that's a remarkable amount of data loss, enough to warrant checksumming higher up (such as in the filesystem)... in theory [1].

              However, the fact that this was NetApp using their custom hardware as the testbed makes me wonder if the data are skewed, and if the numbers would be nearly this bad from a more "commodity" setup, such as at Google. The paper alludes to this when referring to the extra hardware for the "nearline" disks, and I'm always suspicious of huge discrepancies in statistics between "enterprise" and other disks, even more so when there's a drastic difference in comparison methodology.

              It would be interesting to see if there are any numbers for more modern drives, especially as the distinction between "enterprise" and "consumer" drives is disappearing, if only because demand for the latter is disappearing.

              [1] In practice, an individual isn't aware of the 16.5 MB/132 MB loss risk, which is vanishingly small compared to other risks anyway, and businesses don't tend to care and have survived OK regardless.

        • walterbell 2080 days ago
          Use ZFS, it can perform periodic integrity checks.
          • mmt 2079 days ago
            I've never done so outside of FreeNAS appliances, partly because I remain persuaded that offloading the RAID portion to a card is more cost-effective and higher-performance, especially on otherwise RAM- and/or IO-constrained servers, and partly because the ZFS support under Linux has, historically, been less than ideal.

            Higher-level checksum failures are, however, a situation where I would appreciate an integration between filesystem and RAID, as I'd want a checksum error to mark a drive as bad, just like any other read error.

            Do you happen to know if ZFS does that?

            • walterbell 2079 days ago
              • mmt 2078 days ago
                Unfortunately, no, it doesn't really say how ZFS behaves when an error is encountered.

                This is super-disturbing and a dealbreaker, if it's still true:

                > The scrub operation will consume any and all I/O resources on the system (there are supposed to be throttles in place, but I’ve yet to see them work effectively), so you definitely want to run it when your system isn’t busy servicing your customers.

                I browsed a little of the oracle.com ZFS documentation but couldn't find much in the way of what triggers it to decide that a device is "faulted" other than being totally unreachable.

      • jquast 2080 days ago
        > I went through the extra effort of sourcing disks from as many different vendors as possible.

        This is very good advice!

        If you already built your array, consider this advice: "replace a bad disk with a different brand, whenever possible".

        Over time, you naturally migrate away from the bad vendors/models/batches. After following this practice, it seems ridiculous to me now to keep replacing the same bad disks with the same vendor+model.

        • mmt 2080 days ago
          Although I wouldn't go so far as to insist on switching brands (especially since, as another commenter pointed out, there has been so much consolidation, there remain only 3), I agree that replacing with at least a different model, or, failing that, a different batch, is a best practice for an already-built homogenous array.

          Some of this can also be achieved ahead of time if one has multiple arrays with hot spares, by shuffling hot spares around, assuming there's some model diversity between the arrays but not within them.

          I doubt I'll ever again have the luxury of being able to perform this kind of engineering, however. Even a minor increase in cost or cognitive/procedure complexity or a decrease in convenience just serves to encourage a "let's move everything to the cloud" reaction, so I keep my mouth shut.

    • mrb 2080 days ago
      «The chances that you only have 16 good drives for a file is about (0.97^16 ∗ 0.03^4)∗20∗19∗18 = 0.3% Yet your customer has 100k files! The chance they can recover all their data is only (1-0.003)^100000... Which means every customer suffers data loss :-(»

      Your math is completely wrong. In reality 92% of customers suffer no data loss.

      The chance of a file having 4 failed drives (16 good drives) is: .03^4 = 0.00008100%

      The chance of a file being irrecoverable is the chance of having 4 or more failed drives: .03^4 + .03^5 + .03^6 + ... = 0.00008351%

      The chance of a file being recoverable is: 1 - .03^4 - .03^5 - .03^6 - ... = 99.99992151%

      The chance of a customer's 100k files all being recoverable is: (1 - .03^4 - .03^5 - .03^6 - ...)^100000 = 92.0%

      Therefore only 8% of customers encounter one or more 8MB file that is irrecoverable.

      • barbegal 2080 days ago
        The original maths is correct. You have neglected the fact that the 4 failed drives can be any of the 20. Your calculations are for 4 specific drives failing.
        • mrb 2080 days ago
          Oh you are right.

          The OP still has a slight error though. C(20,4) = 4845 possible combinations of failing drives (4 out of 20.)

          Therefore the chance a file is irrecoverable is: .97^16 × .03^4 × 4845 = 0.241% (not 0.3%)

          But the OP's conclusion is largely correct: every customer will have some irrecoverable files.

          • majewsky 2080 days ago
            I think you're not considering that failed drives will be replaced and the data on them reconstructed from the other shards. This failure mode requires 4 of 20 drives to fail in such a short amount of time that reconstruction cannot be completed.
            • mrb 2080 days ago
              Yes, but this was the OP's scenario: what happens when 3% of drives all fail at the same time when the DC is powered back on.

              Edit: actually the math is still wrong. The chance that any 4 out of 20 drives are failing is: .03^4 × C(20,4) = .03^4 × 4845 = 0.392% — There is no need to multiply by .97^16 as the status of the other 16 drives is irrelevant.

              • mrb 2079 days ago
                Decidedly, statistics is hard. Everything above is wrong. Let's label the twenty drives D0 through D19. There are 2^20 possible scenarios, which can be represented as a string of 20 bits:

                • 00000000000000000000 = all 20 drives are working

                • 00000000000000000001 = D19-D1 working, D0 failing

                • 00000000000000000010 = D19-D2 working, D1 failing, D0 working

                • 00000000000000000011 = D19-D2 working, D1-D0 failing

                • etc

                The probability of each of these scenarios is:

                • 00000000000000000000: .97^20

                • 00000000000000000001: .97^19 × .03

                • 00000000000000000010: .97^19 × .03

                • 00000000000000000011: .97^18 × .03^2

                • etc

                There are C(20,4) = 4845 scenarios with exactly four failing drives (four "1" bits.) The probability of each scenario is .97^16 × .03^4. Therefore the probability of 4 failing drives (any drive) is the sum of the probability of each scenario: .97^16 × .03^4 × C(20,4) like I said 3 comments above.

                However the probability of a file being irrecoverable is P(4 failing drives) + P(5 failing drives) + ... + P(20 failing drives):

                    .97^16 × .03^4 × C(20,4)
                  + .97^15 × .03^5 × C(20,5)
                  + ...
                  + .97^0 × .03^20 × C(20,20)
                  = 0.267%
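
                A quick numerical check of that tail sum (math.comb is the standard-library binomial coefficient), also applied to the 100k-file customer from the top of the thread:

                    from math import comb

                    p = 0.03   # per-drive failure probability in the correlated scenario

                    # P(file irrecoverable) = P(4 or more of its 20 shards sit on failed drives)
                    p_file = sum(comb(20, k) * p**k * (1 - p)**(20 - k) for k in range(4, 21))
                    print(f"per-file loss probability: {p_file:.3%}")              # ~0.267%

                    # Chance that a customer with ~100k files loses nothing at all
                    print(f"customer fully intact: {(1 - p_file)**100_000:.1e}")   # effectively zero
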
    • hinkley 2080 days ago
      You could also have a bad batch of drives, causing a bunch of failures to happen over the period of a couple of months. Sources of failure don't take a number. Like in your example, they can and will overlap, which is why when we try to design bulletproof systems we aren't satisfied until every bad event requires a number of separate things to go wrong all at once.

      But if you have faulty power or bad hardware chipping away at your equipment, your depth of resiliency is degraded until the issue is corrected.

    • httpz 2081 days ago
      Also, there are only 3 hard drive manufacturers left. If one of them has a bug that affects their whole product line, that can take out 1/3 of all hard drives.
      • kolpa 2081 days ago
        It's vanishingly unlikely that a bug would affect all their drives (across all recent models, and only after burn-in) simultaneously, unless the drives are managed by a remote server with a SPOF.
        • hinkley 2080 days ago
          We're talking about 6+ 9's here. Putting an order of magnitude on your definition of 'vanishingly small' seems compulsory when we're already so far right of the decimal point.

          How many years between incidents are you talking, and when was the last time a manufacturer had a multigenerational bug? (and is quality improving, or decreasing? That is, are we more or less likely to see failures in the next 10 years than we did in the last 10?)

        • voidmain 2080 days ago
          The Crucial M4 SSD had a firmware bug [1] that caused it to fail repeatedly starting after 5,184 power on hours.

          Disk failures could also be triggered by datacenter environmental factors shared among many drives like temperature or noise.

          [1] https://www.storagereview.com/node/2676

        • sitkack 2080 days ago
          You never experienced Death Star.
    • jdcarter 2080 days ago
      I recently looked into the research on this topic, and it absolutely agrees with you--failures are not independent. Papers of note:

      https://www.usenix.org/legacy/event/fast08/tech/full_papers/...

      http://www.cs.toronto.edu/~bianca/papers/fast07.pdf

      Take note of section 5 in both papers for their statistical models and how they match with real-world data.

      • hinkley 2080 days ago
        There was a statistics class in college that kicked my ass. I never quite understood how to determine if two variables were independent or dependent. You get a vastly different answer if you get it wrong.

        I run into people all the time that seem to have the same problem, to the point that it makes me wary of any software developer putting forth numbers that seem fantastical.

        My gut reaction is that if Backblaze wants to keep reporting their disaster preparedness numbers that they need the assistance of an actuary to calculate them.

    • thefifthsetpin 2081 days ago
      I think that your point is what they were trying to address by saying that anything beyond eight nines is impractical. Their examples of correlated failures included earthquakes and floods which might have huge reach rather than just impacting a single rack, but I think it's the same general idea.
    • Confusion 2081 days ago

        Which means every customer suffers data loss 
      
      They suffer partial backup loss.

      The customer only suffers data loss if they lost their 'master' copy of the data as well during the outage. Iff they don't have a secondary backup solution.

      • mmt 2081 days ago
        Are they positioning their "B2" product solely as a backup solution, rather than something closer to an S3 competitor?

        If it's the latter, "data loss" seems an appropriate characterization.

        • Confusion 2080 days ago
          You are right; I hadn't realized BackBlaze was also offering cloud storage and thought this was about their backup product.
    • KindOne 2080 days ago
      Year 2038 seems simple enough to test. Can you just set the BIOS date to January 19 2038 03:00:00 UTC and wait 15 minutes to see what happens?
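
      (For reference, the rollover of a signed 32-bit time_t lands a few minutes past 03:00 -- a quick check in Python, assuming the usual Unix epoch:)

          from datetime import datetime, timezone

          # Largest value a signed 32-bit time_t can hold.
          t_max = 2**31 - 1
          print(datetime.fromtimestamp(t_max, tz=timezone.utc))
          # -> 2038-01-19 03:14:07+00:00, so 03:00 UTC plus ~15 minutes covers it
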
    • cjhanks 2081 days ago
      Nice analysis.
  • garettmd 2081 days ago
    This was an interesting read, both for the points made about durability and for the in-depth math. However, what stood out to me most was the line:

    Because at these probability levels, it’s far more likely that:

    - An armed conflict takes out data center(s).

    - Earthquakes / floods / pests / or other events known as “Acts of God” destroy multiple data centers.

    - There’s a prolonged billing problem and your account data is deleted.

    The point being that once you get to a certain level of durability (at least as far as hardware/software is concerned), you're chasing diminishing returns for any improvement. But the risks that are still there (and have been big issues for people lately) are things like billing issues. I think it's an important point that operational procedures (even in non-technical areas like billing and support) are critical factors in data "durability".

    • klodolph 2081 days ago
      I've posted the math here before but if we assume that an asteroid hits the earth every 65 million years and wipes out the dominant life forms, then this fact alone puts your yearly durability at a maximum of ~8 nines.
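
      (The back-of-the-envelope version, taking that impact rate as an annual probability:)

          import math

          p_extinction = 1 / 65e6   # assumed annual probability of such an impact
          print(f"{-math.log10(p_extinction):.1f} nines")   # ~7.8, i.e. roughly eight nines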

      The point about billing is better, though.

      My other concern is that a software bug, operator error, or malicious operator deletes your data.

      • tjoff 2081 days ago
        That's why no sane entity would use one earth. Use two and you can quickly recover.
        • klodolph 2081 days ago
          Our goal is N+2, that way when one Earth is down for planned maintenance, you can endure unplanned loss of a second Earth.

          N, of course, is always 1.

        • atYevP 2080 days ago
          Why build one when you can have two at twice the price?
        • metalliqaz 2081 days ago
          Recovery isn't quick at all unless you have developed the Ansible tech tree.
        • minitoar 2080 days ago
          If you have two you have one, if you have one you have none.
      • thaumasiotes 2081 days ago
        > if we assume that an asteroid hits the earth every 65 million years and wipes out the dominant life forms, then this fact alone puts your yearly durability at a maximum of ~8 nines

        I don't think this is a useful definition of your yearly durability. If your data center is down for maintenance during a period in which it is guaranteed that nobody wants to access it, that doesn't reduce your availability at all -- if your only failure is an asteroid that kills all of your customers, it would be more accurate to say you have 100.000000% availability than 99.999999%.

      • toolslive 2081 days ago
        Isn't the expected life span of the company more limiting? Plenty of cloud storage companies go out of business (typically, they run out of money). You can apply Gott's law to this. It's pretty grim.
      • GauntletWizard 2081 days ago
        The event that wiped out the Dinosaurs was 65 million years ago, but the Mesozoic era lasted for ~200 million years. Your point stands, though.
        • klodolph 2081 days ago
          The question of how many data points to use is a subtle one, though. I can say that I picked one data point because I was lazy and doing back of the envelope math, which is reasonable because we can be somewhat assured that I didn't choose a number of data points that was convenient for the hypothesis.

          But if you're choosing two data points, my question is... why two? If you are choosing whether or not to reply based on whether or not the second data point fits with the first, then you're introducing selection bias. The chance that the second data point disagrees with the first by at least as much as the 200 My interval disagrees with the 1/65 My rate is equal to 1-(exp(-65/200)-exp(-200/65)) = 0.32, which is not especially high.
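
          (Numerically, treating the interval between impacts as exponential with mean 65 My:)

              import math

              mean, observed = 65.0, 200.0   # My
              # Disagreeing "by at least as much" in ratio terms means an interval
              # either longer than 200 My or shorter than 65^2/200 ~= 21 My.
              p = math.exp(-observed / mean) + (1 - math.exp(-mean / observed))
              print(f"{p:.2f}")   # ~0.32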

    • schoen 2081 days ago
      I wonder if there's a general term in engineering for the case where a particular risk has been reduced well below the likelihood of more serious but exotic risks. I've heard about this most in cryptography where we can sometimes say that the risk of, say, a system failing due to an unexpected hash collision is drastically less than the risk of the system failing due to cosmic radiation or various natural disasters. At that point it doesn't seem important or worthwhile to consider this risk, because it's dwarfed by the others.

      This seems like a form of the same argument, and I wonder where else this arises and how people describe it.

      • matthewmcg 2081 days ago
        I think in other contexts, you might say that these signals are "below the noise floor."
        • entropicdrifter 2081 days ago
          As an audio engineer (as well as a dev), I definitely agree with this use of that phrase.
      • pixl97 2081 days ago
      • matt4077 2081 days ago
        It’s not a perfect match, but Rumsfeldian “unknown unknowns” come to mind.

        Specifically: every X-nines durability design will be compromised by some failure mode you didn’t think of.

        • schoen 2080 days ago
          I don't think unknown unknowns are what I'm thinking of here. In this case the argument involves a very specific risk, and sometimes a very specific lower bound for its probability.

          For example, in the hash collision case the argument says that it's not worth worrying about the (known) probability of a software error due to an unexpected hash collision because it's dominated by the (known) probability of a comparable error due to cosmic radiation. (The former probability can be calculated using the birthday paradox formula, and the latter has been characterized experimentally in different kinds of semiconductor chips.)

          This kind of argument doesn't rely on the idea that there are other risks that we can't identify or quantify. It's about comparing two failure modes that we did think of, in order to argue that one of them is acceptable or at least not worth further attempts to mitigate.

        • fanpuns 2080 days ago
          Although Rumsfeld often gets credit for this statement, it has been around since long before him: https://en.wikipedia.org/wiki/There_are_known_knowns
          • otakucode 2080 days ago
            I think Rumsfeld gets credit for it because he was using it in the most degenerate, disingenuous form possible. Rather than guarding against legitimate concerns and pursuing actual handling of potential issues, he was just trying to rationalize continuing policies that were demonstrably counterproductive. It's one thing to say there might be factors we don't know. It's another to say that simply because there might be such things, we should dedicate significant resources and lives to blindly flailing away under the assumption that it will help. A presumption of unknown unknowns puts you in a position of not acting, normally. There is no way to know that you aren't exacerbating and making a problem worse if you know that little.
        • macintux 2081 days ago
          I tend to assume pessimistically that the durability design will itself cause a problem. Redundant switches to survive a hardware failure, e.g., strike me as inviting trouble.
          • jzwinck 2080 days ago
            Indeed. I once had critical systems routed via a large redundant Cisco switch which claimed to be 1+1. Turns out there was a single "supervisor" component which failed (after just a year or two) and made the pair of switches useless. Apparently the designer worked in a team where nobody does anything when the boss is out.
    • TheAceOfHearts 2081 days ago
      For a consumer, the cheapest and easiest way to back up important documents or files is to encrypt them and store them across multiple storage providers, e.g. Dropbox and Google Drive.

      They usually give you a reasonable amount of free storage, and it's unlikely all accounts would be terminated or locked at the same time.

      And of course, you should always have your local backups as well.
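
      A minimal sketch of the encrypt-before-upload step using Python's cryptography package (Fernet is just one convenient authenticated scheme, and the filenames are placeholders; keep the key somewhere that is not one of the providers):

          from cryptography.fernet import Fernet

          # Generate once; store the key offline (password manager, printed copy, ...).
          key = Fernet.generate_key()
          print("keep this safe:", key.decode())

          with open("important-documents.tar", "rb") as f:      # placeholder archive
              token = Fernet(key).encrypt(f.read())

          with open("important-documents.tar.enc", "wb") as f:
              f.write(token)
          # Upload the .enc file to Dropbox, Google Drive, etc.; without the key
          # it is opaque bytes to every provider.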

      • e12e 2080 days ago
        Assume you want access to files over the next 20 years. What are the odds Google will have bought Dropbox in that time; and what are the odds an automated system monitor at Google AdWords will have disabled your Google account in that time span?

        Replace with Amazon and/or crashplan as appropriate..

        • scarface74 2080 days ago
          Backblaze suggests a “3-2-1” backup strategy. You should always have at least one backup on site. If a remote backup becomes inaccessible, you could move over to another remote access provider.
  • Animats 2081 days ago
    Financial failure or service shutdown by the provider is the highest risk for long term storage. The backup services CrashPlan, Dell DataSafe, Symantec, Ubuntu One, and Nirvanix all shut down. Nirvanix only gave two weeks notice for users to save their data.[1]

    [1] https://www.computerweekly.com/opinion/Nirvanix-failure-a-bl...

    • brador 2081 days ago
      I would add a third risk of an account ban by some type of future automated copyright content ID. Especially if it is silent/without warning.
      • pbhjpbhj 2081 days ago
        How about the USA government instructing the owners to cease access, or taking ownership, like with Kim Dotcom's New Zealand (?) based storage company.

        I don't recall hearing about anyone recovering access to their data in that case?

        This makes every additional user an increase in risk, as even without a warrant it seems if USA TLAs consider someone a valid target then those servers are going down (or getting taken over by people with unknown service standards in order to run a sting, or ...).

        tl;dr you have to worry about accusations (of law breaking or copyright infringement) against others too as some jurisdictions have a strong overreach in such cases.

      • the8472 2081 days ago
        That's one of the reasons to encrypt locally and only store encrypted data on backup services.
        • e12e 2080 days ago
          If Google suspends your account for background music playing in a YouTube video, you might still lose access to your files in Google drive / cloud - even if the files are encrypted.
          • the8472 2080 days ago
            You should compartmentalize those.
            • kalleboo 2080 days ago
              That can be difficult to maintain though, as Google is pretty eager to link accounts.

              You have horror stories like https://www.reddit.com/r/tifu/comments/8kvias/tifu_by_gettin...

              > Eventually someone realized that their non-work accounts were banned as well. It wasn't until yesterday that someone made the connection. Anyone who had their accounts as a recovery option were also caught in the ban wave.

              • earenndil 2079 days ago
                I'm pretty sure that was found to be fake, no?
            • e12e 2080 days ago
              I did. I had a YouTube account. And a Google account. Then Google bought YouTube.

              (this isn't quite true in my case, but Google did go to some lengths to merge YouTube accounts into Google accounts recently).

      • stefan_ 2081 days ago
        Forget account bans, there is a nonzero risk that a false positive from the automated kiddie porn search all the cloud storage providers do gets your home searched and puts you in handcuffs.
        • e12e 2080 days ago
          While I'm sceptical of content filters, even with a home search it seems unlikely you'd end up in cuffs unless a) the filter caught actual illegal content, or b) the search turned up something illegal.

          You might get killed in the course of the initial police raid though..

          • stefan_ 2080 days ago
            Well you take a picture of your kid in the bathtub, now who can tell the difference?
            • e12e 2079 days ago
              If pictures of naked kids are illegal in your jurisdiction, you've got bigger problems. I guess that's true for some locations, though. Still, nude people != pornography.

              (kids below age of consent sexting each other is another, related, problem)

        • pbhjpbhj 2081 days ago
          Surely the far greater risk is that one other person using the service does something the feds don't like (justified or otherwise) and they take every server that person's file fragments have touched.
    • chx 2081 days ago
      Just for fun, here's my report on Nirvanix from 2008. I have no problem sharing it since both NowPublic and Nirvanix are long gone. Let's just say there were reasons they went under:

      1) Sideloading. I was unable to benchmark or even to get this to work. Images requested to be loaded from media-src.nowpublic.com to node4 @ Nirvanix never showed up. I showed my code to [nirvanixcontact], who said that the code looked OK but someone did a load test on node4 without informing Nirvanix, and that steps were taken so that such a situation won't occur again. He told me that the requests are not lost but they still have not landed. Again, my code could be at fault and I would be happy to run example sideload code.

      2) Upload speeds. I have uploaded from d2 to Nirvanix 100 images each between 180-8584 kbytes totalling almost exactly 100 MB (101 844 931 bytes). The upload was a single HTTP request. The uploads took 18-19 minutes (I repeated the experiment). To give us a comparison I changed the URL in the script to a one line PHP script on another server (at hostignition.com) which just echo'd the number of uploaded files. This took 16.25 seconds and echo'd 100 so seemingly the files landed.

      2a) I tried to get another node via the LoginProxy method, which we would need for uploads anyway. While LoginProxy itself did work, GetStorageNodeExtended https://services.nirvanix.com/ws/IMFS/GetStorageNodeExtended... always fails with ResponseCode 80006, ErrorMessage: Session not found for ip = 67.15.102.70, token = e7b00d25-fc35-431c-9437-9a4302767f46. Seemingly, it does not pick up the consumerIP.

      3) The image conversion itself is blazing fast though. These images took a total of 101.58 seconds to convert and this includes 300 HTTP requests (200 sent to d2 from Nirvanix, 100 to Nirvanix).

    • toolslive 2081 days ago
      add Bitcasa to the list: https://en.wikipedia.org/wiki/Bitcasa
  • aikinai 2080 days ago
    I really want to like Backblaze and they seem to do a lot of good work, but whenever this comes up, I also feel responsible to let people know the dark side so they're informed at least.

    I've written in more detail before[0], but just to share the gotchas in case anyone here is thinking of switching to Backblaze:

    1. They backup almost no file metadata.

    2. The client is very slow (days or more) to add new files and there's no transparency (it claims everything is backed up when it's not).

    3. There are still bugs in the client that can put your backup into an invalid state where it gets deleted.

    4. Support is terrible, and won't be any help when you run into these bugs.

    [0] https://news.ycombinator.com/item?id=16301626

    • rendaw 2080 days ago
      As much as this is off topic, I'd like to continue this conversation. Do you have any details/reference for point 1?

      I've been using rclone (got the recommendation here) which has been reliable.

      Also, does anyone know if Backblaze has any plans to offer u2f? I've switched dns and email providers to get u2f.

      • aikinai 2080 days ago
        Yeah, it's a bit old, but here's an article about Backblaze not supporting metadata. [0] "It fails all but one of the Backup Bouncer tests, discarding file permissions, symlinks, Finder flags and locks, creation dates (despite claims), modification date (timezone-shifted), extended attributes (which include Finder tags and the “where from” URL), and Finder comments."

        And I don't know if it supports U2F, but it does support TOTP.

        [0] https://mjtsai.com/blog/2014/05/22/what-backblaze-doesnt-bac...

    • Dylan16807 2080 days ago
      I didn't like their main client very much (though it was a while ago) but I'm still planning to use B2 for some things. So it depends on what you're buying from them.
  • Pissompons 2081 days ago
    It's a really nice blog post but coming from Backblaze, it would have been nice if they wrote it _after_ bringing the Phoenix DC fully online. When Amazon or Google say 11 9s, I can believe it but Backblaze still only has a single datacenter for most data. All it takes is an earthquake.
    • sneak 2080 days ago
      Or an overzealous prosecutor.
  • ujjain 2081 days ago
    It still wouldn't upload my 1TB of back-ups in an entire month. Amazon Drive back-up completed in 3 days.

    Their pricing is amazing, but saving money on a back-up solution that doesn't seem as good as the other cloud storage providers is a dangerous game.

    • kalleboo 2080 days ago
      When I've done a clean backup on Backblaze it's taken just over 24 hours to back up about 650 GB. And I'm not even in the US, so my data has to cross the Pacific.

      Backblaze is actually faster than Apple Time Machine is on my LAN, which slightly bothers me. It also has lower CPU usage.

      I originally chose Backblaze after benchmarking the other offers available at the time (Carbonite, Crashplan etc) and Backblaze was by far the fastest.

    • GuB-42 2081 days ago
      Not that dangerous if you do your own maths. I don't really trust cloud backup. As they said, the 11 9s doesn't matter. You are more likely to encounter a billing problem (as they said), but you could also get hacked or have a problem with your internet connection; many things can go wrong.

      That's why you need another solution if you are serious about your data, maybe a set of external hard drives (local backup). This way, you have redundancy and little correlation in failure, which greatly improves your general durability. That local storage can be paid for with the money you save by getting an "inferior" cloud backup provider.

    • fencepost 2081 days ago
      Where are you geographically? Was your Amazon upload to a datacenter physically much nearer to you than California? 30MBit/s sustained for 3 days isn't unreasonable for a business connection, but seems high compared to most of what I see available at least in the US.
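
      (Rough arithmetic behind that figure:)

          bits = 1e12 * 8             # 1 TB of backups expressed in bits
          seconds = 3 * 24 * 3600     # three days
          print(f"{bits / seconds / 1e6:.1f} Mbit/s")   # ~30.9 Mbit/s sustained
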
      • brianwski 2081 days ago
        Disclaimer: I work at Backblaze and live in California.

        > 30MBit/s sustained for 3 days isn't unreasonable for a business

        We (Backblaze) are seeing more and more consumer internet connections in the USA with 20 Mbit/sec upstreams, I thought they were available most everywhere if you were willing to upgrade your internet package "just a bit". 30 Mbits is a little unusual for "consumer", but not unheard of. Of course, there is a "selection bias" when you look at online backup users. :-)

        • fencepost 2080 days ago
          Yes regarding the available speed, but on a consumer connection with Comcast at least, if you push 30 Mbit for 3 days straight you may get a call, or at least start seeing popups in your browser on any http (not https) traffic.

          From Comcast: "The Terabyte Internet Data Usage Plan is a new data usage plan for XFINITY Internet service that provides you with a terabyte (1 TB or 1024 GB) of Internet data usage each month as part of your monthly service. If you choose to use more than 1 TB in a month, we will automatically add blocks of 50 GB to your account for an additional fee of $10 each. Your charges, however, will not exceed $200 each month, no matter how much you use. And, we're offering you two courtesy months, so you will not be billed the first two times you exceed a terabyte."

          Also: "All customers in locations with an Internet Data Usage Plan receive a terabyte per month, regardless of their Internet tier of service." and "The data usage plan does not currently apply to XFINITY Internet customers on our Gigabit Pro tier of service. The plan also does not apply to Business Internet customers, customers on Bulk Internet agreements, and customers with Prepaid Internet."

    • anderspitman 2081 days ago
      I've found the upload speeds to B2 to be fantastic. Especially if running multiple connections.
    • 3pt14159 2081 days ago
      I use B2 and I have no problems uploading large files quickly. Is this for the consumer option?
    • DenisM 2081 days ago
      How do you mean? Is upload too slow?
  • eloff 2081 days ago
    This was an interesting read from a technical point of view, but also well written and refreshingly transparent.

    I found the discussion about why it doesn't matter when you start talking about 11 nines of reliability to be hilariously true.

    At the end of the day we're still flawed humans living in a hostile universe, and no matter how foolproof we make the technology, there are some weaknesses that just can't be eliminated.

    • brianwski 2081 days ago
      Disclaimer: I'm the author of the blog post. :-)

      > well written and refreshingly transparent

      Thank you! In the interests of full transparency, the blog post was a collaborative affair and was proofread and edited for clarity by several people at Backblaze.

      > discussion about why it doesn't matter

      One of the philosophies Backblaze uses is to build a reliable component out of several inexpensive and unrelated components. So combine 20 cheap drives into an ultra-reliable vault. We have two or three inexpensive network connections into each datacenter instead of buying one REALLY expensive connection for 8x the price. Etc.

      Personally, I recommend customers do the same. Instead of storing two copies of your data in two regions in Amazon for 2x the price, store your data in one region of Amazon and put one copy in Backblaze B2 for 1.25x the price. We believe this will result in higher availability and higher durability than two copies in Amazon, because Amazon S3 and Backblaze B2 don't share datacenters in common (that we know of), don't share network links, don't share the software stack, etc. For bonus points, use different credit cards to pay for each, and have different IT people's credentials (alert email address) on each. That way, if one IT person leaves your company and you don't get an alert that the credit card has expired, hopefully your other copy will be OK.

      • k__ 2080 days ago
        BB and S3 both have eleven 9 durability, how much does using both increase this?
        • brianwski 2079 days ago
          > BB and S3 both have eleven 9 durability, how much does using both increase this?

          Putting your data in either Backblaze B2 or Amazon S3 alone suffers from other failure modes outside of the durability of the raw system. For example, let's say your IT person is poking around in their Amazon S3 account and accidentally clicks the "delete" button and all your data is gone. Or what if your credit card has a transaction declined, and your IT guy has left your company or the emails from Amazon are being put in the "Spam" folder of your email program? Or maybe a malicious Amazon employee writes a program to delete all the data in Amazon S3 from all customers? What if one of your employees is really disgruntled and logs into your Amazon S3 account and, just to spite you, deletes all your data?

          In every one of these situations, if you have a copy in Backblaze B2 and also another copy in Amazon S3, you can recover your data from the other vendor.

          I recommend using a separate credit card to pay for your Amazon S3 account and your Backblaze B2 account. They should expire a year apart. And don't give the logins to both systems to one disgruntled employee in your organization. Only give that disgruntled employee access to one or the other.

          Make sense?

        • bhelkey 2080 days ago
          Depends what you are modeling. The probabilities of random disk failures are probably independent.

          However, there are risks that are not necessarily independent such as the US Government ordering these two services to delete your data, or, as the article mentions, an armed conflict destroying data centers.
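
          As a toy calculation under that independence assumption (taking the marketing numbers at face value):

              # assume per-object annual loss probabilities are independent
              # across providers -- the correlated risks above break this
              p_single = 1e-11          # "eleven 9s" per provider, per year
              p_both = p_single ** 2    # both copies lost in the same year
              print(p_both)             # 1e-22, far below any correlated risk

          At that point the correlated failure modes are all that is left to worry about.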

          • e12e 2080 days ago
            Or your credit card being suspended and both services deny you access / delete your data.
    • kraftman 2081 days ago
      It's 'foolproof', like dog proof but for fools.
      • eloff 2081 days ago
        Hah, good catch, I know that but somehow made the error anyway. I corrected it.
  • londons_explore 2081 days ago
    I'm very disappointed their recovery time is 6 days!

    Recovery workload should be spread across the whole cluster, so that the recovered data gets distributed evenly. In that case, assuming 10,000 drives, to recover one dead 12TB drive and a recovery rate of even 10 MB/secs per machine, recovery of one drive should be done in under a second. Maybe 10 seconds with some sluggish tail machines.

    Why do you need it done in under a second? While the data is down one replica, it is at dramatically higher risk. Also, drive failures can be dramatically accelerated, for example in the case of a bad software release erasing data - you need to be able to move data faster than bad software gets released. And releasing software at a rate of one machine per second still means a release takes 3 hours!

    • ChuckMcM 2081 days ago
      Be careful here: it isn't 6 days until data is recovered, it is 6 days until it is fully protected again; there is a big difference. During the 6 days the data would be available; it just might have to be reconstructed on the fly by the error-correcting code rather than read directly.

      In most systems we assume that "primary traffic" (read/write stuff) is prioritized over "rebuild traffic" which is recovering lost shards. So when you specify these things it is best to specify "how long to rebuild a shard while the array is providing storage services at its maximum specified rate." This assures the customer that if they have a 24/7/365 non-stop traffic pattern their data will still stay protected in the face of drive failures.

      • brianwski 2081 days ago
        Disclaimer: I'm the author of the blog post. :-)

        > During the 6 days the data would be available; it just might have to be reconstructed on the fly by the error-correcting code rather than read directly.

        Correct. More specifically, the FIRST time the data is accessed in any 24 hour period it must ALWAYS be reconstructed from the Reed-Solomon encoded parts on 17 other drives on 17 other machines. Any 17 is fine, so it's totally fine if 1 or 2 drives are not available. Once reconstructed it is stored in a set of front end cache computers that have fast SSDs for this purpose.

        The second time the same file is accessed in a 24 hour period, it will be fetched out of the SSD cache layer so it won't even hit the spinning drives and won't care if all 20 drives are offline.

        > "primary traffic" (read/write stuff) is prioritized over "rebuild traffic"

        Yes. Backblaze balances between the two if only one drive has failed, but as a tome (20 drive group spread across 20 computers) becomes more badly degraded Backblaze begins favoring the rebuild. When two drives have failed out of 20, Backblaze stops allowing any writes to that tome because more writes will tend to fail yet another drive. Fewer writes offloads the tome. But we still allow reads. At Backblaze, we have never been 3 drives degraded out of 20 (knock on wood), but if this ever occurs the 20 drive tome is now running without parity -> so in that case we even stop allowing reads AT ALL until we are returned to at least 1 drive of fully redundant parity.
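
        As a rough illustration of how forgiving "any 17 of 20" is, here is a sketch with a made-up per-drive availability (illustrative numbers, not our real ones):

            # probability that at least 17 of a tome's 20 drives are up,
            # assuming each drive is independently available with probability p
            from math import comb

            def tome_readable(p, n=20, k=17):
                return sum(comb(n, up) * p**up * (1 - p)**(n - up)
                           for up in range(k, n + 1))

            print(tome_readable(0.99))  # ~0.99996 even with 1% of drives down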

    • jsjohnst 2081 days ago
      > In that case, assuming 10,000 drives, to recover one dead 12TB drive and a recovery rate of even 10 MB/secs per machine, recovery of one drive should be done in under a second.

      I want to know where you can find a drive that can write 12TB/sec of data!

      (In other words, you clearly missed half the problem. To add a new replacement drive, you have to be able to write to it the data from an original drive. Also RS code calculation is fast these days, but it ain’t that fast)

      • riobard 2081 days ago
        Additionally, it implies that data needs to be spread across 10,000 drives, which is unrealistic anyway.
        • brianwski 2081 days ago
          Disclaimer: I wrote the blog post. :-)

          > spread across 10,000 drives, which is unrealistic

          I claim it is also undesirable. Backblaze specifically made the conscious decision that the parts of any one single "large file" (these can be up to 10 TBytes each) are all stored within the same "vault". A vault is 20 computers in 20 separate racks. This allows a single vault to check the consistency and integrity of a large file periodically without communicating to other vaults in the datacenter.

          The vaults have been a really good unit of scaling for Backblaze. If the vaults can maintain their performance, then we know we can just stamp out more vaults because there is almost no communication between vaults.

        • Dylan16807 2080 days ago
          As to the one second claim, I think that's a math error, because even 10k * 10MB is only 100GB.

          But spreading the data over 10k drives isn't unrealistic, it's a different architecture. Pick a different 20 drives for each file.

          Working it through: Assume 200 machines with 50 drives each. Each machine has to read 1TB, transmit it over the network, do a parity calculation, and write out ~50GB. With dual 10gbps ports the bottleneck is the network, and if we dedicate one on each machine to the rebuild we get a 15 minute clock.

          Not that having such a monolithic architecture is worth the complication and extra bugs.
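
          Rough arithmetic behind that 15 minute figure (assumed numbers, network only, ignoring disk and parity-computation time):

              machines = 200
              dead_drive_tb = 12
              read_amplification = 17      # read 17 surviving shards per lost shard
              per_machine_tb = dead_drive_tb * read_amplification / machines  # ~1 TB
              nic_bits_per_s = 10e9        # one dedicated 10 gbps port per machine
              print(per_machine_tb * 8e12 / nic_bits_per_s / 60)  # ~14 minutes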

      • londons_explore 2081 days ago
        It's assuming that the newly recovered data is spread across free space of the other 10,000 drives, a few megabytes on each.

        It will also be recovered by reading the recovery data, which should also be approximately evenly spread across the 10,000 drives.

    • TheDong 2081 days ago
      > you need to be able to move data faster than bad software gets released. And releasing software at a rate of one machine per second still means a release takes 3 hours!

      Since others have already demonstrated why the remainder of your comment is overly simplistic, I'll tackle this bit.

      Generally, software releases are not rolled out at a constant rate to all machines. A typical thing to do is to release it to staging, then to a "canary" subset of machines (e.g. to 1% or 5% of the machines). Once all seems well there (e.g. metrics are clean and the canaries have handled X writes, reads, and simulated drive failures), it can be rolled out to a larger subset, and eventually to all machines.

      In that way, the release can take whatever total amount of time is desired while still catching any such bugs fairly reliably.

      Ideally, at Backblaze they could ensure that their canary instances are "data-redundancy aware" such that even if the 5% they roll to for the canary test all explode, data is still safe.

      Regardless, any talk of "recovering data faster than software releases" is completely silly and totally misses the reality of how releases are done, how recovery is done, and what sort of bugs might happen. The math based on faulty assumptions about rate is also pointless.

  • riku_iki 2081 days ago
    > if you store 1 million objects in B2 for 10 million years, you would expect to lose 1 file.

    Can this be reformulated: if you store 10 trillion objects (e.g. 100TB of 10-byte records), you lose 1 record each year?

    Also curious what are the stats from other providers.
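
    Quick check of that scaling (just the same rate at a different size):

        rate = 1 / (1_000_000 * 10_000_000)   # expected losses per object-year
        print(rate * 10**13)                  # ~1 loss per year at 10 trillion objects
        print(10**13 * 10 / 1e12)             # 10-byte records -> 100 TB total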

    • jsjohnst 2081 days ago
      > Can this be reformulated

      Roughly, yes.

      > Also curious what are the stats from other providers.

      As to other providers, most that I've looked at are 6+ 9s, with many in the 8-9 range. Anything over 8 is (as they admitted) essentially marketing porn and not a useful metric (for reasons they mentioned as well as ones said by other comments here).

      • berns 2081 days ago
        Azure:

        locally redundant storage: 99.999999999 % (11 9's)

        zone redundant storage: 99.9999999999 % (12 9's)

        geographically redundant storage: 99.99999999999999 % (16 9's)

        https://azure.microsoft.com/en-us/pricing/details/storage/

      • riku_iki 2081 days ago
        And yet, there aren't many complaints on the internet about lost Gmail messages due to disk failures. And Gmail likely stores much more than 10 trillion objects and 100TB of data.
        • brianwski 2081 days ago
          A number of years ago a Gmail "insider" that I know admitted they had lost customer emails and their policy (at that time at least) was to simply ignore it and not tell anybody because most customers never notice if it is a small number.

          I think a much bigger scandal is that all major laptop Operating System vendors (Microsoft and Apple) absolutely know when your laptop drive loses files or even is starting to go bad in some cases, and they NEVER tell the customer. I think an excellent product offering would be a 3rd party piece of software and cloud service which was a "verification service". It wouldn't store your files offsite; it would store the name, size, and SHA1 offsite and periodically check that no bits have been flipped on your local drive unless you intended it. For example, a week after I take a photo, I absolutely never want the photo to change. Ever. Same with music I (legally) download.
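
          A minimal sketch of what I mean (hypothetical code, not a product; you would keep the manifest somewhere other than the drive it describes):

              # record name, size, and SHA-1 now; re-check for silent changes later
              import hashlib, os

              def sha1(path, bufsize=1 << 20):
                  h = hashlib.sha1()
                  with open(path, "rb") as f:
                      while chunk := f.read(bufsize):
                          h.update(chunk)
                  return h.hexdigest()

              def snapshot(root):
                  manifest = {}
                  for dirpath, _, names in os.walk(root):
                      for name in names:
                          p = os.path.join(dirpath, name)
                          manifest[p] = (os.path.getsize(p), sha1(p))
                  return manifest

              def verify(manifest):
                  for p, (size, digest) in manifest.items():
                      if not os.path.exists(p) or os.path.getsize(p) != size or sha1(p) != digest:
                          print("missing or silently changed:", p)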

        • sdmike1 2081 days ago
          I would be curious to see what the chance is that any given email is read and how much it drops off after the first reading. I would be willing to bet that out of 1000 people who receive a reasonable amount of mail, you could probably find missing emails.
    • Dylan16807 2080 days ago
      The raw number would imply that, but I'm pretty sure the math breaks down when you're storing 10 byte records.

      Chunks of your data are going to be stored together, so it's a very small chance of losing a big block of 10 byte files. There's no failure mode that loses just one, and does so often.

      • riku_iki 2080 days ago
        > Chunks of your data are going to be stored together

        It probably depends on their infrastructure, e.g. if storage is something like Cassandra, records would be evenly distributed by key hash.

        I agree that the numbers are likely not like that, though; I just wanted to demonstrate that such a calculation approach can lead to unexpected conclusions.

  • duxup 2081 days ago
    I just wish Backblaze would not go "oh man a lot of your data changed, you should probably check for integrity and start your backup with us again later...."

    Oh gosh thanks Backblaze, I'll just dig through several TB of stuff....

  • toolslive 2081 days ago
    The best durability is probably achieved by Amplidata, but it does not matter.

    You need to do the same calculation for your metadata, which is probably not erasure coded. If you lose this, you don't lose your data, but you no longer know where you put it.

    So you probably add your metadata to your data as well in some kind of recoverable format. That's fine; it means that you can harvest the metadata again.

    But how long does this take ?

  • simonebrunozzi 2080 days ago
    The Backblaze blog post points to a blog post by Amazon's CTO, Werner Vogels, in which he states that "...These techniques allow us to design our service for 99.999999999% durability."

    (side note: Werner is a great person)

    Unfortunately, there is a difference, a huge difference, between a system "designed" for 11 9s of durability, and a system "offering" 11 9s of durability.

    I wish Backblaze, or Amazon, or anybody else, would clarify durability using very honest terms.

    An example?

    "This system offers X 9s of durability over a period of one year, on average. This is a technical paper that describes how we tested that durability", followed by measurements and test specifics.

    Any other claim has much less value to me.

  • shub 2081 days ago
    Just one missing piece: the actual loss rate.
    • mikeryan 2081 days ago
      There's another piece too. Practically, they're a backup service, so what matters is knowing how often someone needs to recover their data without having had an opportunity to reset their backup based on their current state. I've used Backblaze for years and only needed it once. (I also back up to a Time Capsule as well, since data recovery is more practical/easier from that)
  • Jabbles 2080 days ago
    to lose a file, we have to have four drives fail before we had a chance to rebuild the first one.

    I wonder how many times they've had 2 or 3 drives fail before they've rebuilt and if that matches their predictions.

  • HelloNurse 2080 days ago
    "Our nines go to eleven"

    Well written, but there are other significant risks like losing access credentials (e.g. a password stored only on one device that is destroyed in the same accident in which its only user, who remembered the password by heart, dies) or being hacked by someone who gains access to cloud storage and intentionally erases or corrupts data.

    Specialization is good, but if Backblaze is strictly in the business of storing data on hard disks, who's going to help with designing and maintaining the reliable complete system on top of their service that users actually need?

  • anderspitman 2081 days ago
    Off topic but I just want to say how happy I've been as a Backblaze customer. B2 is a fantastic product for my backup needs, and their hard drive stats have always been handy when selecting drives.
    • atYevP 2081 days ago
      Yev from Backblaze -> Thanks! Glad you're enjoying it :D
  • fencepost 2081 days ago
    One question since I know some of the Backblaze folks respond to these threads:

    In addition to calling (and possibly getting blocked/ignored), does your customer service staff send text messages? I suspect that a big percentage of the phone numbers you have are for cell phones these days, and I see a lot less SMS spam than I do telemarketing. SMS would also allow you to get a bit of info visible to recipients (e.g. "Backblaze CC Expired") with more detail once a message is opened.

    • atYevP 2080 days ago
      Yev from Backblaze here -> I believe we do send SMSs in the case of a Cap or Alert getting reached, so yes that could be possible - though I'm not sure if an SMS is part of our billing failure process - that's an interesting question!
      • u02sgb 2080 days ago
        Backblaze B2 customer here. My credit card stopped accepting your billing and there was no SMS for me. It took me a month or so to notice the emails and update my details. I've got SMS alerts active for Caps. It would be worth adding that, as it was a bit scary when I noticed the mail (I think it was the third one you'd sent!).
        • atYevP 2079 days ago
          VERY good to know! Thank you!
  • cascom 2081 days ago
    One simple question - what is the math if one data center goes down?

    https://www.nytimes.com/interactive/2018/05/24/us/disasters-...

  • sneak 2080 days ago
    This article supposes that a meteor impact is more likely than a disaster that renders Northern California (their only data centers are in Oakland and Sacramento AFAIK) without power or civil order within ten million years.

    As someone else pointed out, it’s overly simplistic. They’re a great low-cost alternative to S3, sure. But keep a backup on another continent if you need your data 100 years from now.

  • juancn 2078 days ago
    My experience is that it's bullshit. I had a backup (damn, I still have one) with Backblaze; when attempting to restore it, maybe 30% of the files survived the restore. The rest went up in smoke.

    They don't have any way to detect corruption in the data, or if they do, the backup clients are oblivious to it.

    I lost about 150GB of family photos and videos.

  • contravariant 2080 days ago
    >The sub result for 4 simultaneous drive failures in 156 hours = 1.89187284e-13. That means the probability of it NOT happening in 156 hours is (1 – 1.89187284e-13) which equals 0.999999999999810812715 (12 nines).

    Minor nitpick, this ignores the possibility of more than 4 failures, although this error only affects the fourth digit after the nines. Much more egregious is the following:

    >there are 56 “156 hour intervals” in a given year

    This is too simplistic: there are in fact infinitely many 156-hour intervals in a year, some of them just happen to overlap. This overlap can't simply be ignored, because even if none of their 56 disjoint intervals contains 4 events, this does not rule out the possibility of there being 4 events in some 156-hour interval they didn't take into account. In fact, failing to take into account even one of the infinitely many intervals creates a blind spot (consider what happens if the drives happen to fail precisely at the start and end of a particular interval). You can still get a lower bound by e.g. ensuring none of the 56 intervals contains more than 1 failure, or by adding more intervals and ensuring none of them contains more than 2 failures, etc.

    Their binomial calculation contains the same mistake.

    A quick improved lower bound can be obtained by calculating the probability that any failure is followed by (at least) 3 other failures within 156 hours. For one failure this probability is given by the Poisson distribution and is

        Pc = 1 - \sum_{k<3} e^{-λ} λ^k / k! = 5.18413e-10.
    
    Now we get into some trouble because the failures and the probability of a 'catastrophic' failure are dependent; however, the probability that any particular failure turns catastrophic is constant, so the expected number of catastrophic failures can't be greater than the expected number of failures times that constant. This gives a lower bound of

        Pc (365·24·λ) = 6.63154e-9
    
    this is a lower bound, but that's still three fewer nines left than their claim.

    Anyway let's just hope their data centres are more reliable than their statistics.

    Edit: This last calculation can be justified by noting that the probability that 1 critical failure starts in a particular time interval is Pc times the probability of 1 failure in that interval, plus some constant times the probability of more than one failure in that interval. Similarly, the probability of more than one critical failure is at most the probability of more than one failure.

    Now the probability of more than one failure in a time interval is dominated by the length of the interval; therefore, if you calculate the density, those parts fall away and you're left with a density of Pc·λ critical failures per hour.

    This seems to be an exact expression for the expected number of critical failures, and not just a lower bound. Although it is still a lower bound for the probability of a critical failure, albeit a fairly tight one.
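
    For anyone who wants to see the gap numerically, a small sketch (λ is an assumed per-vault failure rate, purely illustrative, not the blog's exact figure):

        import math

        lam = 2e-5                  # assumed failures per hour for one 20-drive vault
        mu = lam * 156              # expected failures per 156-hour rebuild window

        def poisson_tail(m, k):     # P(Poisson(m) >= k)
            return 1 - sum(math.exp(-m) * m**i / math.factorial(i) for i in range(k))

        naive = 56 * poisson_tail(mu, 4)                 # >=4 failures in one of 56 fixed windows
        sliding = 365 * 24 * lam * poisson_tail(mu, 3)   # a failure followed by >=3 more within 156 h

        print(naive, sliding)       # the sliding-window estimate comes out several times larger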

  • hartator 2081 days ago
    > The math on calculating all this is extremely complex.

    Hum, I would be more reassured by past statistics than by a probability evaluation. Do they happen to have loss data since their creation?

  • skybrian 2080 days ago
    The billing problem is mentioned, but just left dangling. I wonder if anyone has done any interesting work at fixing this?
  • rossdavidh 2080 days ago
    Fun read, but the whole time I was reading it I could hear Nassim Nicholas Taleb making exasperated noises of outrage.
  • zimbatm 2080 days ago
    answer: because all the files are in the same, single data center, which has a lot fewer nines
  • egonschiele 2080 days ago
    why calculate Poisson and binomial when both are ultimately Gaussian?
    • RA_Fisher 2080 days ago
      They're not ultimately Gaussian. :) The normal distribution is continuous and has unbounded support, whilst neither the binomial nor the Poisson is continuous and unbounded.
  • rafaelgarrido 2081 days ago
    is the website down?
    • hartator 2081 days ago
      It was the 0.00000001%.