A high-availability durable filesystem is a difficult problem to solve. It usually starts with NFS, which is one big single point of failure. Depending on the nature of the application, this might be good enough.
But if it's not, you'll typically want cross-datacenter replication so if one rack goes down you don't lose all your data. So then you're looking at something like Glusterfs/MooseFS/Ceph. But the latencies involved with synchronously replicating to multiple datacenters can really kill your performance. For example, try git cloning a large project onto a Glusterfs mount with >20ms ping between nodes. It's brutal.
Other products try to do asynchronous replication, EdgeFS is one I was looking at recently. This follows the general industry trend, like it or not, of "eventually consistent is consistent enough". Not much better than a cron job + rsync, in my opinion, but for some workloads it's good enough. If there's a partition you'll lose data.
Eventually you just give up and realize that a perfectly synchronized geographically distributed POSIX filesystem is a pipe dream, you bite the bullet and re-write your app against S3 and call it a day.
What's a small write in your use case? How many KBs is the write op, and how many ops/sec?
We can store small files on NVMe disks in metadata, so that you can read them with a latency of just a couple of ms, and can scale to 10s of thousands of concurrent reads/writes per second with small files.
Depending on the use-case, Veritas InfoScale might be a better option. It can aggregate different tiers of storage, and present a multi-node shared-nothing file-system, with options for replication between AZs, regions, or different cloud providers.
> For example, try git cloning a large project onto a Glusterfs mount with >20ms ping between nodes. It's brutal.
That may be true but also is due to applications often having very sequential IO patterns even when they don't need to be.
I hope we'll get some convenience wrappers around io_uring that make batching of many small IO calls in a synchronous manner simple and easy for cases where you don't want to deal with async runtimes. E.g. bulk_statx() or fsync_many() would be prime candidates for batching.
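Until such wrappers exist, you can approximate the effect in userspace by keeping many synchronous calls in flight at once. A rough Python sketch of a bulk-stat helper (`bulk_stat` is a hypothetical name; a real io_uring-based version would batch at the syscall level rather than burning threads):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def bulk_stat(paths, max_workers=32):
    """Stat many paths concurrently, returning results in input order.

    The thread pool stands in for io_uring's submission queue: many
    small metadata IOs are in flight at once, amortizing per-call latency.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(os.stat, paths))

# Demo: create a few files and stat them in one batch.
with tempfile.TemporaryDirectory() as d:
    paths = []
    for i in range(8):
        p = os.path.join(d, f"f{i}")
        with open(p, "w") as f:
            f.write("x" * i)
        paths.append(p)
    sizes = [s.st_size for s in bulk_stat(paths)]

print(sizes)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

The win over a plain loop is largest when each call has high latency (network filesystems), which is exactly the Glusterfs case above.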
... and then later you notice that you need a special SLA and Amazon Redshift to guarantee that a read from S3 will return the same value as the last write.
Even S3 is only eventually consistent and especially if a US user uploads images into your US bucket and then you serve the URLs to EU users, you might have loading problems. The correct solution, according to our support contact, is that we wait a second after uploading to S3 to make sure at least the metadata has replicated to the EU before we post the image url. That, or pay for Redshift to tell us how long to wait more precisely.
Because contrary to what everyone expects, S3 requests to US buckets might be served by delayed replicas in other countries if the request comes from outside the US.
> The correct solution, according to our support contact, is that we wait a second after uploading to S3
I'm shocked that you got this answer. This is definitely not how you are supposed to operate.
If you need to ensure the sequentiality of a write followed by a read on S3, the idiomatic way is to enable versioning on your bucket, issue a write, and provide the version ID to whoever needs to read after that write.
Not only will that transparently ensure that you never read stale data, it will even ensure that you read the result of that particular write, and not any later write that could have happened in between.
This pattern is very easy to implement for sequential operations in a single function, like:
version = s3.put_object(Bucket=bucket, Key=key, Body=data)["VersionId"]
data = s3.get_object(Bucket=bucket, Key=key, VersionId=version)["Body"].read()
I agree that when you deal with various processes it can become messy.
In practice it never bothered me too much though. I prefer having explicit synchronization through versions, rather than having a blocking write waiting for all caches to be synchronized.
Also, this should only be necessary if you write to an already existing object. New keys will always trigger a cache invalidation from my understanding.
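For illustration, here is a toy in-memory model (not real S3 internals) of why pinning the version ID gives read-your-write even when an unversioned read can hit a lagging replica:

```python
import itertools

class VersionedStore:
    """Toy model of a versioned bucket with a lagging read replica.

    An unversioned GET may be served from a stale replica, while a GET
    pinned to a version ID can always be answered exactly.
    """
    _ids = itertools.count(1)

    def __init__(self):
        self.versions = {}        # (key, version_id) -> value
        self.replica_latest = {}  # stale view used by unversioned reads

    def put(self, key, value):
        vid = next(self._ids)
        self.versions[(key, vid)] = value
        # replica_latest is deliberately NOT updated here, modelling
        # replication lag between the write and the read path.
        return vid

    def get(self, key, version_id=None):
        if version_id is not None:
            return self.versions[(key, version_id)]  # always consistent
        return self.replica_latest.get(key)          # may be stale

store = VersionedStore()
store.replica_latest["img"] = b"old"  # an older, fully replicated write
vid = store.put("img", b"new")

print(store.get("img"))                   # b'old' -- stale unversioned read
print(store.get("img", version_id=vid))   # b'new' -- pinned read is exact
```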
Interesting, do you have more information on how you monitored these S3 metadata nodes? Never heard of that.
Why would an unversioned put be less costly than a versioned put?
To me it should be pretty much the same. I would almost suspect unversioned puts are hacked around versioned ones.
Out of curiosity what kind of workload did you perform? I have been moving terabytes of files (some very small 1kB files, some huge 150GB files) and never noticed a single blip of change in behavior from S3 (and certainly not anything going "off")
> S3 requests to US buckets might be served by delayed replicas in other countries if the request comes from outside the US
What? That makes no sense. Do you have a source for that? I thought the explicit choice of region when creating a bucket limits where data is located. Why would they give you geo-replication for free? Also: "Amazon S3 creates buckets in a Region you specify. To optimize latency, minimize costs, or address regulatory requirements, choose any AWS Region that is geographically close to you." - https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingBucket....
> For new objects, you get read-after-write consistency and the example you gave contradicts that.
Mind the documented caveat for this case:
> The caveat is that if you make a HEAD or GET request to a key name before the object is created, then create the object shortly after that, a subsequent GET might not return the object due to eventual consistency.
Indeed, although that seems irrelevant for the mentioned use case. After all, why would the client request an image before it's uploaded? It seems like a straightforward "server-side upload -> generate URL and send it to client -> client requests image" flow. It's perfectly covered by the consistency provided.
I'm surprised about the conclusion of the AWS support that this is intended behavior. I'm curious about the specifics of the behavior of S3 in this case. Do you have any more detailed information from the AWS support you could share by any chance? If you don't mind, I'd be also grateful if you could drop some answers to the questions below.
Do you use S3 replication (CRR/SRR)?
Before uploading, do you make any request to the object key? If yes, what kind of request?
For downloading, do you use pre-signed URLs or is the S3 bucket public?
What's the error the 0.01% of users encounter? Is it a NoSuchKey error?
They also say "A process writes a new object to Amazon S3 and immediately lists keys within its bucket. Until the change is fully propagated, the object might not appear in the list" which implies that you cannot read immediately after write.
You can minimize the effects of eventual consistency by distinguishing the distant replicas from the same-datacenter ones. Cross datacenter/region might be eventually consistent, but you will only see it rarely since the same-datacenter replicas are fully consistent with each other.
The main downside is that you have to pick one datacenter as the main datacenter at any given time. This can change if the old one goes down or due to business needs, but you can't have two of them as the main datacenter at once for a given dataset.
ObjectiveFS takes a different approach to high-availability durable file system by using S3 as the storage backend and building a log structured file system on top. This leverages the durability and scalability of S3 while allowing e.g. git cloning/diffing of a large project (even cross region) to remain fast. To compare we have some small file benchmark results at https://objectivefs.com/howto/performance-amazon-efs-vs-obje... .
A Unix file system is not only a map from paths to inodes to bytes. Unix file systems support a host of additional APIs, some of which are hard to implement on top of something like S3 (e.g. atomically renaming a parent path).
Even if you drop the Unix (posix) part, most practically used file systems have features that are hard to guarantee in a distributed setting, and in either case simply don't exist in S3.
The point of my post was that storage implemented using keys and values is fundamental to all filesystems. To say that the very model that is fundamental to all filesystems is the thing that somehow precludes S3 from being considered a filesystem is kind of bizarre.
Also, nowhere did I say or even imply that a key/value store is all a Unix file system is. Different filesystems have different features. Object storage is a type of filesystem with a very specific feature set.
It is an eventually consistent key-value store with some built-in optimization (indexes) for filtering/searching object timestamps and keys (e.g. listing keys by prefix), which allows for presenting them in a UI similar to how it is possible with files in a directory. That's about it.
Reading through your article, this solution is built on top of S3. So moving and listing files is faster, presumably due to a new metadata system you've built for tracking files. The trade-off is that writes must be strictly slower than they were previously, because you've added a network hop. All read and write data now flows through these workers, which adds a point of failure: if you stream too much data through them, you could potentially OOM them. Reads are potentially faster, but that also depends on the hit rate of the block cache. It would be nice to see a more transparent post listing the pros and cons, rather than what reads as a technical advertisement.
I'm one of the co-authors. The numbers for writes are in the paper, so it is very unfair to call it an advertisement. And it is a global cache - if the block is cached somewhere, it will be used in reads.
The parent does make a good point about centralization of requests being a problem. S3 load balances under the hood, so different key prefixes within a bucket are usually serviced in isolation -- a DoS to one prefix will usually not affect other prefixes.
It seems like you'd be limiting yourself for concurrent access -- if everything is flowing through the MySQL cluster -- not a bad thing! Just perhaps warrants a caveat note. I'd expect S3 to smoke HopFS on concurrency.
Read the paper, it doesn't smoke HopsFS on concurrency. In the 2nd paper, we get 1.6m ops/sec on HopsFS HA over 3 availability zones (without S3 as a backing layer). Can you get that on S3 (maybe if you are Netflix, otherwise your starting quota is 1000s ops/sec)?
"Amazon S3 automatically scales to high request rates. For example, your application can achieve at least 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix in a bucket. There are no limits to the number of prefixes in a bucket. You can increase your read or write performance by parallelizing reads. For example, if you create 10 prefixes in an Amazon S3 bucket to parallelize reads, you could scale your read performance to 55,000 read requests per second."
With 500 prefixes it doesn't matter if you're Netflix or a student working on a distributed computing project; you can get that 1.6m requests/sec.
There is no limit to the number of prefixes in a bucket. You can scale to be arbitrarily large.
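The "500 prefixes" figure checks out with quick arithmetic against the documented per-prefix rates and the 1.6m ops/sec number from upthread (illustrative Python):

```python
import math

# Documented S3 per-prefix request rates (requests/sec).
PUTS_PER_PREFIX = 3_500
GETS_PER_PREFIX = 5_500

def prefixes_needed(target_ops, per_prefix_rate):
    """Minimum prefix count to sustain target_ops requests/sec."""
    return math.ceil(target_ops / per_prefix_rate)

target = 1_600_000  # the 1.6m ops/sec HopsFS figure quoted above
print(prefixes_needed(target, GETS_PER_PREFIX))  # 291 prefixes for reads
print(prefixes_needed(target, PUTS_PER_PREFIX))  # 458 prefixes for writes
```

So under 500 prefixes covers 1.6m ops/sec even for the lower PUT rate, matching the claim above.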
Funnily enough, I wasn't aware of ObjectiveFS - I guess it's because I can't find a research paper for it.
HopsFS on S3 is similar to ADLS on Azure (built on Azure block storage). Internally, ADLS and HopsFS are different - HopsFS has a scaleout consistent metadata layer with a CDC API, while ADLS doesn't. But ADLS-v2 is also very good.
"HopsFS is a new implementation of the Hadoop Filesystem (HDFS), that supports multiple stateless NameNodes, where the metadata is stored in MySQL Cluster, an in-memory distributed database. HopsFS enables more scalable clusters than Apache HDFS (up to ten times larger clusters), and enables NameNode metadata to be both customized and analyzed, because it can now be easily accessed via a SQL API."
Systems like Hadoop and Spark will run multiple versions of the same task writing out to a FS at once but to a temporary directory and when the first finishes the output data is just moved to the final place. It's not uncommon for a job to "complete" writing data to S3 and just sit there and hang as the FS move commands run copying the data and deleting the old version. It is just assumed a rename/move is a no-op in some systems.
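The commit pattern is easy to sketch; on a local POSIX filesystem the final step is a single atomic rename, which is exactly the operation S3 lacks:

```python
import os
import tempfile

# Toy version of the Hadoop-style output commit described above: each
# task attempt writes under a temporary directory, and the winner
# "commits" by renaming its output into place. On a POSIX filesystem
# the rename is an atomic metadata update; on S3 there is no rename,
# so the same step degrades to copy-then-delete for every object.
with tempfile.TemporaryDirectory() as d:
    attempt = os.path.join(d, "_temporary", "attempt_0")
    os.makedirs(attempt)
    with open(os.path.join(attempt, "part-00000"), "w") as f:
        f.write("results")

    final = os.path.join(d, "part-00000")
    os.rename(os.path.join(attempt, "part-00000"), final)  # the commit

    committed = os.path.exists(final)

print(committed)  # True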
Also, I presume that "move" means "update the index, held in another object or objects" (no slight! If I understand correctly, that's what most on-disk filesystems do as well). That's the only way I can imagine getting performance that much better.
In HopsFS, you can also tag objects with arbitrary metadata (the xattr API), and the JSON objects you tag with are automatically (eventually consistently) replicated to Elasticsearch. So you can search the FS namespace with free-text search on Elasticsearch. However, the lag is up to 100ms from the MySQL Cluster metadata. See the epipe paper in the blog post for details.
From the title, I thought that this was a technology which gives you 100x faster static sites compared to S3, which sounded really cool. Discovering that it was about move operations was really underwhelming. The title is on the edge of being misleading.
We see similar results using JuiceFS, and are glad to see that another player has moved in the same direction as us. JuiceFS was released to the public in 2017 and is available on most public cloud vendors.
The architecture of JuiceFS is simpler than HopsFS: it does not have worker nodes (the client accesses S3 and manages the cache directly), and the metadata is stored in a highly optimized metadata server (similar to the NameNode in HDFS, and it can be scaled out by adding more nodes).
JuiceFS provides a POSIX client (using FUSE), a Hadoop SDK (in Java), an S3 gateway, and a WebDAV gateway.
Dude, there's literally no rename/move in the official S3 API; you are going to need to do a PUT object operation, which is going to be as fast as a PUT operation is. If you judge a fish by its ability to fly, it will feel stupid.
I was gonna ask how it compares with Minio. Minio looks incredibly easy to "deploy" (single executable!) . Looks nice to start hosting files on a single machine if there's no immediate need for the industrial strength features of AWS S3, or even to run locally for development.
We haven't run mv/list experiments on 100,000 and 1,000,000 files for the blog. However, we expect the gap in latency between HopsFS/S3 and EMRFS would increase even further with larger directories. In our original HopsFS paper, we showed that the latency for mv/rename of a directory with 1 million files was around 5.8 seconds.
I have VMs for my data scientists already in GCP, and my datasets live in Google Cloud Storage. Can I take advantage of HopsFS for a shared file system across my VMs? Google Filestore is ridiculously expensive, and the minimum they give you is 1TB. Multi-writer only supports 2 VMs.
If Filestore (our managed NFS product) is too large for you, I'd suggest having gcsfuse on each box (or just use the GCS Connector for Hadoop). You won't get the kind of NFS-style locking semantics that a real distributed filesystem would support, but it sounds like you mostly need your data scientists to be able to read and write from buckets as if they're a local filesystem (and you wouldn't expect them to actively be overwriting each other or something where caching gets in the way).
Edit: We used gcsfuse for our Supercomputing 2016 run with the folks at Fermilab. There wasn't time to rewrite anything (we went from idea => conference in a few weeks) and since we mostly just cared about throughput it worked great.
Thanks for replying. Are there any performance numbers for gcsfuse? We use small files and large files: we write small files at a high rate (Jupyter notebooks), and we read and write large files (models).
I also started looking into the multi-writer option, and am wondering if there are new developments (nodes >2).
More than anything, gcsfuse and similar projects that don’t put a separate metadata / data cache in between, reflect GCS’s latency and throughput (with a bit of an extra burden for being done through fuse).
GCS has pretty high latency. So even if you write a single byte, expect it to be like 50+ms. This is slower than a spinning disk in an old computer. If you’re just updating a single person’s notebook, they’ll feel it a bit on save (but obviously each person and file is independent).
But you can also do about 1 Gbps per network stream (and putting several of those together, 32+ Gbps per VM) such that even a 1 MB file is also probably done in about that much time. I think for streaming writes (save model) gcsfuse may still do a local copy until some threshold before writing out.
I’d probably put your models directly on GCS though. That’s what we do in colab and with TensorFlow (sadly, it seems from a quick search that PyTorch doesn’t have this out of the box).
Filestore and multi-writer PD will naturally improve over time. But I’m guessing you need something “today”. Feel free to send me a note (contact in profile) if you want to share specifics.
Reading/writing SMALL files is SUPER slow on things like S3, Google Drive, Backblaze. Using a lot of threads only helps a little; it's nowhere near the read/write speed of e.g. a single 600MB file.
> it's nowhere near reading/writing speeds of e.g. a single 600MB file.
I have to disagree here. If you look at benchmarks on the internet, yes, it will look like S3 is dead slow. But that is a client problem, not an S3 problem. For instance, boto3 (s3transfer) is an awful implementation that is so overengineered, with a reimplementation of futures, task stealing, etc., that the download throughput is pathetic. Most often it will make you top out below 200MB/s.
But S3 itself scales very well if you know how to use it, and skip boto3.
From my experience and benchmarks, each S3 connection will deliver up to 80MB/s, and with range requests you can easily have multiple parallel blocks downloaded.
I wrote a simple library that does this called s3pd (https://github.com/NewbiZ/s3pd). It's not perfect and is process based instead of coroutines, but that will give you an idea.
For reference, using s3pd I can saturate any EC2 network interface that I found (tested up to 10Gb/s interfaces, with download speed >1GB/s).
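The core trick in tools like s3pd is just splitting the object into byte ranges and fetching them over parallel connections. A minimal sketch of the range computation (`chunk_ranges` is a hypothetical helper, not the s3pd API):

```python
def chunk_ranges(total_size, chunk_size):
    """Split [0, total_size) into HTTP Range request intervals.

    Each (start, end) pair maps to a header like 'Range: bytes=start-end'
    (end is inclusive, per RFC 7233). A client can then download the
    chunks concurrently over several connections and reassemble them,
    sidestepping the ~80MB/s per-connection ceiling.
    """
    ranges = []
    start = 0
    while start < total_size:
        end = min(start + chunk_size, total_size) - 1
        ranges.append((start, end))
        start = end + 1
    return ranges

print(chunk_ranges(10, 4))  # [(0, 3), (4, 7), (8, 9)]
```

Each range then becomes one `GET` with a `Range` header, run from its own thread, process, or coroutine.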
It's not surprising that it's slow. Handling files has two costs associated with it: one is a fixed setup cost, which includes looking up the file metadata, creating connections between all the services that make it possible to access the file, and starting the read/write. The other is actually transferring the file content.
For small files the fixed cost will be the most important factor. The "transfer time" after this "first byte latency" might actually be 0, since all the data could be transferred within a single write call on each node.
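To put numbers on that fixed-cost-vs-transfer split, here is a sketch using illustrative figures mentioned elsewhere in this thread (50ms first-byte latency, 80MB/s per connection; both are assumptions, not measurements):

```python
def transfer_time(size_bytes, latency_s=0.050, bandwidth_Bps=80e6):
    """First-byte latency plus streaming time, in seconds."""
    return latency_s + size_bytes / bandwidth_Bps

small = transfer_time(1_000)        # a 1 KB object
large = transfer_time(600_000_000)  # a single 600 MB object

# The 1 KB transfer is >99.9% fixed latency; for 600 MB the fixed
# cost is under 1% of the total, so throughput dominates instead.
print(f"{small:.4f}s {large:.2f}s")  # 0.0500s 7.55s
```

This is why batching many small objects into one larger object (or pipelining requests) changes the picture far more than adding threads does.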
You're right that S3 isn't for small files. For a lot of small files (think 500 bytes), I either pipe them into S3 through Kinesis Firehose or fit them gzipped into DynamoDB (depending on access patterns).
One could also consider Amazon FSx, which can do IO to S3 natively.
> Reading/writing SMALL files are SUPER slow on things like S3
My experience was the opposite. Small files work acceptably quickly, the cost can't be beat, and the reliability probably beats out anything else I've seen. We saw latencies of ~50ms to write out small files. Slower than an SSD, yes, but not really that slow.
You do have to make sure you're not doing things like re-establishing the connection to get that, though. If you have to do a TLS handshake… it won't be fast. Also, if you're in Python, boto's API encourages, IMO, people to call `get_bucket`, which will do an existence check on the bucket (and thus double the latency due to the extra round-trip); usually, you have out-of-band knowledge that the bucket will exist, and can skip the check. (If you're wrong, the subsequent object GET/PUT will error anyways.)
Yes, this paper did not compare small file performance in HopsFS with EMRFS/S3. But HopsFS on S3 also supports storing small files on NVMe disks in the metadata layer. Previously, we have shown that we can get up to 66X higher throughput for writes for small files versus HDFS (peer reviewed, of course) -
Does it matter? I often see 'our product is much faster than thing X at cloud Y!' and find myself asking why. Why would I want something less integrated for a performance change I don't need, a software change I'll have to write and an extra overhead for dealing with another source?
It's great that one individual thing is better than one other individual thing, but if you look at the bigger picture it generally isn't that individual thing by itself that you are using.
It does matter, because not every use case requires everything but the kitchen sink. If you're building things that only ever require the same mass produced components for every application, well that stifles the possibilities of what can be built.
So say you have a solution for your object storage and it is plenty. Then a different product pops up, solves the same problem in exactly the same way, and costs the same. Migrating isn't free and at the end you have two suppliers to manage. Does that make any sense at all? I think not.
There is a different case that makes sense: you have object storage and it's not sufficient, so you go look for object storage suppliers that deliver something different so it suits your need. Now it makes sense to look for a service that is relatively similar to what you are already using but is better in a factor that is significant for your application (i.e. speed of object key changes), now it does make sense.
Marketing just says: "Look at us, we are faster". I think that message is not going to matter unless that happens to be your exact problem in isolation, which isn't exactly common; systems don't tend to run in isolation.
Cool work! I love seeing people pushing distributed storage.
IIUC though, you make a similar choice as Avere and others. You're treating the object store as a distributed block store:
> In HopsFS-S3, we added configuration parameters to allow users to provide their Amazon S3 bucket to be used as the block data store. Similar to HopsFS, HopsFSS3 stores the small files, < 128 KB, associated with the file system’s metadata. For large files, > 128 KB, HopsFS-S3 will store the files in the user-provided bucket.
> HopsFSS3 implements variable-sized block storage to allow for any new appends to a file to be treated as new objects rather than overwriting existing objects
It's somewhat unclear to me, but I think the combination of these statements means "S3 is always treated as a block store, but sometimes the File == Variably-Sized-Block == Object". Is that right?
Using S3 / GCS / any object store as a block-store with a different frontend is a fine assumption for dedicated client or applications like HDFS-based ones. But it does mean you throw away interop with other services. For example, if your HDFS-speaking data pipeline produces a bunch of output and you want to read it via some tool that only speaks S3 (like something in Sagemaker or whatever), you're kind of trapped.
It sounds like you're already prepared to support variably-sized chunks / blocks, so I'd encourage you to have a "transparent mode". So many users love things like s3fs, gcsfuse and so on, because even if they're slow, they preserve interop. That's why we haven't gone the "blocks" route in the GCS Connector for Hadoop, interop is too valuable.
P.S. I'd love to see which things get easier for you if you are also able to use GCS directly (or at least know you're relying on our stronger semantics). A while back we finally ripped out all the consistency cache stuff in the Hadoop Connector once we'd rolled out the Megastore => Spanner migration. Being able to use Dual-Region buckets that are metadata consistent while actively running Hadoop workloads in two regions is kind of awesome.
>It's somewhat unclear to me, but I think the combination of these statements means "S3 is always treated as a block store, but sometimes the File == Variably-Sized-Block == Object". Is that right?
If the file is "small" (under a configurable size, typically 128KB), it is stored in the metadata layer, not on S3. Otherwise, if you just write the file once in one session (and it is under the 5TB object size limit in S3), there will be one object in S3 (variable size - blocks in HDFS are by default fixed size). However, if you append to the file, then we add a new object (as a block) for the append.
We have a new version under development (working prototype) where we can (in the background) rewrite all the blocks in a single file as a single object, and make the object readable by a S3 API. It will be released some time next year. The idea is that you can mark directories as "S3 compatible" and only pay for rebalancing those ones as needed. But you then have the choice of doing the rebalancing on-demand or as a background task, and prioritizing, and so on. You know the tradeoffs.
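As a toy model of the placement and append behavior described above (the 128KB threshold comes from the quoted paper text; the class and naming scheme are illustrative, not HopsFS internals):

```python
SMALL_FILE_LIMIT = 128 * 1024  # small files stay in the metadata layer

def placement(file_size):
    """Route small files to the NVMe metadata layer, large ones to S3."""
    return "metadata" if file_size < SMALL_FILE_LIMIT else "s3"

class LargeFile:
    """A large file as a list of variable-sized blocks, one S3 object each.

    Appends add a new object instead of overwriting an existing one,
    which is why a multi-append file is not directly S3-API-readable
    until its blocks are rewritten into a single object.
    """
    def __init__(self, key):
        self.key = key
        self.blocks = []  # (object_key, length), in file order

    def append(self, length):
        obj_key = f"{self.key}.block{len(self.blocks)}"  # hypothetical naming
        self.blocks.append((obj_key, length))
        return obj_key

f = LargeFile("data/log")
f.append(5 * 2**20)  # initial write: a single variable-sized object
f.append(1 * 2**20)  # append: a second object, no overwrite

print(placement(4_096), len(f.blocks))  # metadata 2
```

The background "rebalancing" described above would then correspond to merging `f.blocks` into one object so plain S3 clients can read it.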
Yes, it would be easier to do this with GCS. But we did AWS and Azure first, as we feel GCS is more hostile to third-party vendors. The talks we have given at google (to the colossus team a couple of years ago and to Google Cloud/AI - https://www.meetup.com/SF-Big-Analytics/discussions/57666504... ) are like black holes of information transfer.
Your upcoming flexibility sounds awesome. I assume many people would just mark the entire bucket as “compatible” to support arbitrary renames/mv of directories, but being able to say “keep this directory in compat mode” for people who use a single mega bucket and split into teams / datasets at the top will be nice.
I’m sorry if you’ve tried to talk to us and we’ve been unhelpful. I’d be happy to put you in touch with some GCS people specifically — the Colossus folks are multiple layers below, while the AI folks are multiple layers above. They were probably mostly not sure what to say!
We worked quite openly and frankly with the Twitter folks on our GCS connector. I'd be happy to help support doing the same with you. My contact info is in my profile.
(Though I’d definitely agree that we’ve also been surprisingly reticent to talk about Colossus, until recently the only public talk was some slides at FAST).
They are different when there are reads/writes from multiple clients at the same time. HopsFS can use the metadata service to synchronize them with low latency (about a ms), but ObjectiveFS has to use S3 to do the synchronization, which has much higher latency (>20ms).
We chatted about this with the founders of ObjectiveFS before creating JuiceFS; they did NOT recommend using ObjectiveFS for big data workloads with Hadoop/Spark. That's why we started building JuiceFS in 2016.
I'd say that the main difference with ObjectiveFS are metadata operations. From the documentation of ObjectiveFS:
`doesn’t fully support regular file system semantics or consistency guarantees (e.g. atomic rename of directories, mutual exclusion of open exclusive, append to file requires rewriting the whole file and no hard links).`
HopsFS does provide strongly consistent metadata operations like atomic directory rename, which is essential if you are running frameworks like Apache Spark.
That quote from the ObjectiveFS documentation is out of context. It was describing limitations in s3fs, not ObjectiveFS. My understanding is that because ObjectiveFS is a log-structured filesystem that uses S3 as underlying storage, it doesn't have those limitations.
From a quick read of ObjectiveFS, it doesn't appear to be a distributed filesystem. It appears to be a single service (log-structured storage on a server) that is backed by S3. Am I wrong? (HopsFS is a distributed hierarchical FS.)
If I understand the distinction you're making, then yes, you're wrong. The really beautiful thing about ObjectiveFS is that it's distributed, but the user doesn't really have to be concerned about that. A user can mount the same S3-backed ObjectiveFS filesystem on multiple machines, and they will somehow coordinate their reads and writes to that one S3 bucket, without having to communicate with any central component besides S3 itself and the ObjectiveFS licensing server.
That's not what I meant by a distributed file system. Is the metadata layer of ObjectiveFS distributed - scale-out. Can you add nodes to increase the metadata layer's capacity and throughput, so it can scale to handle millions of ops/sec? That's what I meant by a distributed file system (not just a client server, with a single metadata server).
There is no metadata server. The clients access S3 directly, and keep their own caches of both metadata and data. I don't know how the clients coordinate their writes and invalidate their caches, but somehow they do. From my perspective as a user, it just works.
It's conceptually similar to EMR in the way it works. You connect your AWS account and we'll deploy a cluster there. HopsFS will run on top of an S3 bucket in your organization.
You get a fully featured Spark environment (with metrics and logging included - no need for CloudWatch), a UI with Jupyter notebooks, the Hopsworks feature store, and ML capabilities that EMR does not provide.
Yeah, https://hopsworks.ai is an alternative to EMR that includes HopsFS out-of-the-box. You get $4k free credits right now - it includes Spark, Flink, Python, Hive, Feature Store, GPUs (tensorflow), Jupyter, notebooks as jobs, AirFlow. And a UI for development and operations. So kind of like Databricks with more of a ML focus, but still supporting Spark.
You could, but I would do thorough real-world benchmarks first.
EMRFS has dozens of optimisations for Spark/Hadoop workloads, e.g. S3 Select, partition pruning, optimised committers, etc., and since EMR is a core product it is continually being improved. Using HopsFS would negate all of that.