Distributed Filesystems for Deep Learning

(logicalclocks.com)

91 points | by jamesblonde 2015 days ago

12 comments

  • manigandham 2015 days ago
    What does NVMe have to do with any of this? That's hardware storage technology, and it's being compared to Python versions of deep learning frameworks?

    The whole point of having a file system is that programs are abstracted from storage. What does native support even mean unless the file system somehow hooks into the running code?

    This makes no sense. It seems like a post trying to push HopsFS as more performant, but comes across as inaccurate and misleading.

    • aogl 2015 days ago
      I agree; it appears as though NVMe was simply added in there to look impressive and tick another box, when in fact it's not relevant to this layer of tech. This should be abstracted beyond hardware storage and be based solely on the filesystem itself.
  • habitue 2015 days ago
    I am baffled as to why they have a matrix of which data science libraries they support. They support reading and writing CSVs and Parquet (and, I assume, arbitrary files as well). How is this HopsFS-specific support for Pandas? Pandas is dumping CSV data to a file object which happens to write to HopsFS.

    This is a pretty weak sales pitch.

    Edit: I'm not actually baffled. It's clearly SEO to catch people googling around for "pandas support" or the like.

    • jamesblonde 2015 days ago
      Author of the post here. Part of the point of the matrix was to show that not all distributed filesystems have native support for the Python frameworks commonly used for data science - Pandas/PySpark/TensorFlow. The other part is support for NVMe hardware. Having been a professor for 17 years, I barely know what SEO is.
      • dekhn 2015 days ago
        Can you explain what it means to have native support? Are you saying you hook into Python's file I/O to implement remote I/O? Or is the filesystem mounted via FUSE or another technology? I don't think that Pandas has any HopsFS-specific code in it.
        • jamesblonde 2015 days ago
          "In Pandas, the only change we need to make to our code, compared to a local filesystem, is to replace open_file(..) with h.open_file(..), where h is a file handle to HDFS/HopsFS."

          HopsFS is a drop-in replacement for HDFS. Here are some more details on native HDFS connectors in Python by Wes McKinney: http://wesmckinney.com/blog/python-hdfs-interfaces/
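
          As a rough sketch of what that looks like with Pandas, here is one of the Python HDFS connectors from that post (pyarrow's) used against an HDFS-compatible namenode; the host, port, and paths are placeholders, not from the article:

            import pandas as pd
            import pyarrow as pa

            # Connect to an HDFS-compatible namenode; HopsFS is wire-compatible
            # with HDFS, so the same connector works against it.
            fs = pa.hdfs.connect(host="namenode", port=8020)

            # Read a CSV through the filesystem handle instead of the local open()
            with fs.open("/Projects/demo/data/train.csv", "rb") as f:
                df = pd.read_csv(f)

            # Write results back through the same handle
            with fs.open("/Projects/demo/data/out.csv", "wb") as f:
                f.write(df.to_csv(index=False).encode("utf-8"))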

          • dekhn 2015 days ago
            I think what you are saying is that you're wire-compatible with HDFS, and Pandas has access libraries (special file objects) that support HDFS.
          • daviesliu 2015 days ago
            In my opinion, CephFS and GlusterFS have much better support for Pandas and the others: you don't need to change any code between local development and distributed training, other than a path.

            Second, the Python library for HDFS provides only a subset of the filesystem API, which limits what you can do with it.
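
            For illustration, a minimal sketch of the "only the path changes" point, assuming a CephFS/GlusterFS volume mounted at /mnt/cephfs (the mount point and file names are made up):

              import pandas as pd

              # On a POSIX mount the same Pandas code runs locally and on the cluster;
              # only the directory prefix differs between the two environments.
              DATA_DIR = "/mnt/cephfs/datasets"   # locally this might just be "./data"

              df = pd.read_csv(f"{DATA_DIR}/train.csv")
              df.to_parquet(f"{DATA_DIR}/train.parquet")   # needs a Parquet engine, e.g. pyarrow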

      • rpedela 2015 days ago
        I get a 404 for the first link in your article. Thanks for the reference to the Baidu study; it is very interesting.
    • mtanski 2015 days ago
      Sadly there is no way to downvote stories. This post is factually incorrect at best... And, the author shows up and just shills harder.
      • rpedela 2015 days ago
        You can flag.
  • xfactor973 2015 days ago
    Gluster/Ceph support NVMe. That matrix isn't right.
  • regularfry 2015 days ago
    Not by that graph, you don't. Pick a number in the middle: 2^24 x 10^6 words. That's 130 TB, give or take. Sounds like a lot, right? Not when you remember that 16 TB drives and 15-bay RAID enclosures exist...
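
    Back-of-envelope for that figure, assuming 8 bytes per word (an assumption; the graph doesn't say):

      words = 2**24 * 10**6          # the "number in the middle" from the graph
      terabytes = words * 8 / 1e12   # 8 bytes per word (assumed)
      print(terabytes)               # ~134 TB, i.e. "130 TB, give or take"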
  • wskinner 2015 days ago
    Can anyone working on this project comment on how HopsFS compares to similar alternatives? The post and the paper compare it to HDFS, but that's not really fair - HDFS was not designed to handle huge quantities of small files, and if you use it for that, you will of course not get good performance. Competitive offerings would be things like Qumulo, Amazon EFS, Objective FS, perhaps BeeGFS & Lustre, and so on.
    • jamesblonde 2015 days ago
      HopsFS is a derivative work (a fork) of HDFS - it is still 100% wire-compatible with HDFS. It has distributed metadata (not metadata stored on the heap of a single JVM). It also supports putting small files on NVMe storage - https://www.logicalclocks.com/millions-and-millions-of-files... It's also open-source. So, it works in places HDFS works. HopsFS is, to the best of our knowledge, the only distributed filesystem with distributed metadata (apart from Google Colossus - the hidden jewel in their crown).

      Lustre and BeeGFS are HPC filesystems, I believe. No data locality. Amazon EFS feels like a MapR clone, but they expose some tunings - higher latency for higher throughput and larger clusters. We don't make that trade-off. I don't know enough about Qumulo and ObjectiveFS to comment on them.

      • notacoward 2015 days ago
        > HopsFS is, to the best of our knowledge, the only distributed filesystem with distributed metadata

        Then you haven't done your homework, since Gluster has had that architecture for a decade. And it supports real POSIX semantics, too - not just the HDFS subset.

        • jamesblonde 2014 days ago
          As I wrote below in a similar comment about Ceph, there is no global distributed metadata layer. If you want to rename a directory or a file, it is not an atomic metadata operation. It may involve copying across servers (bricks).

          Reference: http://www.admin-magazine.com/Articles/Comparing-Ceph-and-Gl...

        • gnufx 2015 days ago
          Indeed, and (as notacoward will know) the HPC filesystems OrangeFS/PVFS2, Lustre, BeeGFS, and GPFS all have -- or can have -- distributed metadata, and POSIX semantics (or POSIX modulo locking).
          • notacoward 2015 days ago
            Correct. I didn't mean to exclude those, but they weren't part of the original comparison. I didn't mention Ceph because somebody else already had, plus I'm not a maintainer for that as I am for Gluster. ;)

            It might also be useful to point out some semi-useful distinctions regarding levels of distribution. The PVFS2 lineage has had fully distributed metadata for even longer than Gluster. OTOH, both Lustre and Ceph have separate metadata servers, more than one but still fewer than there are data servers. That doesn't necessarily limit scalability, but it does increase operational complexity. I'm honestly not sure about BeeGFS or GPFS; I don't pay much attention to them since one is only fake open source and the other doesn't even try. If we want to include proprietary offerings we could also add EFS, ObjectiveFS, OneFS, PanFS, etc.

            BTW, for those here who know me, at $dayjob I've switched from working on Gluster to working on a more HDFS-like data store. (They chose not to call it a filesystem because it isn't one, which I consider a refreshing bit of honesty in a field full of pretenders.) It scales way beyond any of the things mentioned here, but it's proprietary so I can't compare notes as much as I was able to with Gluster. :(

            • jamesblonde 2014 days ago
              We have different interpretations of what we mean by "distributed system". I mean a group of servers that provide a single service. Or, more precisely, metadata is a single distributed shared-memory layer, where filesystem semantics are the same regardless of which partition/shard of the metadata a metadata operation is performed on.

              What Ceph and many others do is support partitioned metadata. One volume == one metadata partition. But cross-partition metadata operations do not have the same semantics as intra-partition metadata operations. So, moving a file/dir between volumes (partitions) is not atomic - they do not solve the hard consensus problem. We do. Gluster also does not solve the consensus problem. The end result is a leaky filesystem abstraction. If you move a file between two directories in the same volume, it may go fast. But between two directories in different volumes, it can go very slow. Renaming a file in Gluster can cause it to move between hosts.

              • notacoward 2014 days ago
                > Renaming a file in Gluster can cause it to move between hosts.

                No. It does not. Within a volume, files will always be renamed in place on the replicas where they currently reside. There are other problems with rename in Gluster, having to do with how the locations of files are tracked when they're not where they're "supposed" to be according to the hashing scheme, but movement of data as part of rename is not one of those problems.

                "Across volumes" is not an interesting or relevant case, because the whole point of separate volumes in Gluster (and most other distributed filesystems) is to provide complete isolation. You seem to have this idea that users would want to have lots of volumes and still move data around between them. Is there something weird about your architecture that forces them to do that? Small volumes are operationally complex and strand resources. Generally, you'd have just one volume. Access control, quota, etc. could still be applied at the namespace or directory level, below volumes. It worked rather well at the world's largest Gluster installation, which I helped run for a year and a half, and I believe most other installations are similar.

                Please try to learn about other systems before you make wild claims about them, lest your audience perceive them as deliberate lies to benefit your business. I don't think your claims about Ceph are anywhere near accurate either, but I'll leave those for one of my Ceph friends to address.

                • jamesblonde 2014 days ago
                  My claim is exactly about what constitutes a distributed namespace - I claim it is a namespace that spans multiple hosts and requires (distributed) consensus algorithms so that the nodes that provide the metadata service can execute operations like rename purely in metadata. Volumes are not just for isolation. They also partition the distributed namespace and make it easier to scale - but at a cost. Operations that cross volumes do not solve the consensus problem, so they do not execute operations like move atomically. Whether that is a copy-then-delete or something else is not the issue. They still do not provide a single distributed shared namespace abstraction with atomic operations.
                  • notacoward 2013 days ago
                    "i claim it is a namespace that spans multiple hosts and requires (distributed) consensus algorithms..."

                    That's a very unique definition of "distributed". To many in the field, "spans multiple hosts" alone is sufficient. Some might refine it to preclude single nodes with special roles or by adding requirements such as a single view of the data etc., but it's a joke to add atomicity or implementation details such as how a consensus algorithm is used. There are plenty of distributed data stores that don't provide atomicity for all operations, and users are happy with that. There are plenty that use leader election instead of consensus to deal with consistency issues, and users are happy with that too. Inventing a novel and highly specific definition to exclude them seems more than a bit disingenuous. If we want to go that route, I'll point out that HopsFS is misnamed because it's not a filesystem according to commonly used definitions. There are many hard problems that it wimps out on, that a real filesystem must address.

                    "Operations that cross volumes do not solve the consensus problem, so they do not execute operations like move atomically."

                    Again, you seem to be using "volume" in a very unique way. Your architecture might misuse "volume" to mean some sort of internal convenience that's practically invisible to users, but others don't. To most, a volume is a self-contained universe, much like a virtual machine. Users are well aware that moving data between either will not have the same semantics as moving data within.

                    • jamesblonde 2013 days ago
                      My definition of distributed metadata for a FS is the same as you would have for a distributed database. If I shard a database across many servers and do not define semantics for transactions that cross shards, that is what you are talking about. I am talking about a system that supports cross partition transactions.
                      • notacoward 2013 days ago
                        Let's recap. You've gone from claiming that Gluster doesn't have distributed metadata (untrue) to claiming that it copies data on rename (untrue) to claiming that renames aren't atomic across partitions (untrue because such partitions aren't even part of Gluster's architecture). Those are some very mobile goalposts. You've made similarly untrue claims about Ceph, which does have a sort of metadata partitioning but not the way you seem to think. Along the way, you've tried to redefine terms like "volume" which have well established domain specific meanings.

                        Such desperate attempts to disparage systems you obviously haven't even tried to understand are not helpful to users, and are insulting to people who have spent years solving hard problems to address those users' actual requirements. Believe it or not, systems exist which don't work like yours, and which prioritize different features or guarantees, and there's demonstrated demand for those differentiators. If you want to learn about those differences, to compare and contrast in ways that actually advance the state of the art, please begin. I say "begin" because I see no evidence that you've attempted such a thing. Does your academic employer know that you're misinforming students?

                        • jamesblonde 2013 days ago
                          I disagree. I tried to define a distributed namespace for you, and you won't accept my definition. And you don't accept that one of the reasons volumes exist is to partition namespaces to make them easier to scale. Luckily for me, peer-reviewed conferences and journals do accept my definition.
            • gnufx 2013 days ago
              > I didn't mean to exclude those, but they weren't part of the original comparison.

              I didn't mean to imply otherwise, just to emphasize the hat I'm wearing in this discussion, assuming words mean what everyone takes them to mean. (I don't know if Gluster has moved too far from its roots to be an HPC filesystem too.) Good observations about distribution and misleading marketing.

              For anyone still following and interested, OrangeFS uses Giga+ [1] to distribute metadata. I don't know where the designs of the other free systems mentioned are available for study, and I've heard people being rude about Lustre's design anyway.

              1. http://www.pdl.cmu.edu/PDL-FTP/PDSI/fast2011-gigaplus.pdf

      • gnufx 2015 days ago
        It's incorrect to imply that HPC filesystems generally can't specify data distribution, or that it matters greatly on HPC network fabrics; there are publications demonstrating better performance on Hadoop benchmarks over them than over HDFS. Also, HPC filesystem installations routinely have millions and millions of files.
      • IvanVergiliev 2015 days ago
        > the only distributed filesystem with distributed metadata (apart from Google Colossus - the hidden jewel in their crown).

        Isn't that the case for S3 as well? I haven't seen AWS mention it explicitly, but the consistency guarantees and the "unlimited scalability" claims seem to point in that direction.

        • dekhn 2015 days ago
          S3 is a blobject store, not a filesystem.
      • bdeorus 2015 days ago
        MapR-FS also has distributed metadata and NVMe support.
      • nathanwh 2015 days ago
        Does Ceph not have distributed metadata as well?
        • jamesblonde 2015 days ago
          You can have multiple metadata servers, but then you have multiple volumes - like MapR. When you move a file between two different volumes, the move operation is no longer a simple metadata operation (which is atomic). Instead, the move operation involves removing the file from one metadata server, then adding it to the other server - an agreement problem. Except they don't solve the agreement problem properly. You will notice it more when you move a large directory, as then it will go really slow - and you hope you don't have a failure while the operation is executing.

          reference: http://docs.ceph.com/docs/master/cephfs/add-remove-mds/

          • teraflop 2015 days ago
            That reference does not contain any information to back up your claim.

            The Ceph team considers configurations with multiple active metadata servers to be stable and supported; where is your evidence that this is false?

  • jey 2015 days ago
    What's the relevance of NVMe? Is that I/O latency still relevant in the presence of networking and scheduling overheads?
    • jamesblonde 2015 days ago
      If your V100 GPU can process 150 images/sec, can S3 keep up? What if you now have 10 GPUs or 100 GPUs? Now HDFS can't keep up. But HopsFS can...
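
      A rough back-of-envelope, with an assumed image size (the ~100 KB figure is not from the post):

        gpus = 100
        images_per_sec = 150             # per GPU, as above
        bytes_per_image = 100 * 1024     # ~100 KB per image (assumed)

        required = gpus * images_per_sec * bytes_per_image
        print(f"{required / 1e9:.2f} GB/s sustained read throughput")   # ~1.54 GB/s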
      • dekhn 2015 days ago
        Many people on these forums train directly from S3 and know how to get that kind of throughput. As with any blobject store, the key is to use multithreaded readers with high concurrency and to set up your storage properly.
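
        A minimal sketch of that pattern (the bucket name, prefix, and worker count are illustrative, not a recipe from any particular framework):

          import concurrent.futures as cf
          import boto3

          s3 = boto3.client("s3")
          BUCKET, PREFIX = "training-data", "images/"   # hypothetical bucket/prefix

          def fetch(key):
              # Each worker issues its own GET; many requests in flight hide per-request latency.
              return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()

          resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
          keys = [obj["Key"] for obj in resp.get("Contents", [])]

          # High concurrency keeps the GPU input pipeline fed.
          with cf.ThreadPoolExecutor(max_workers=64) as pool:
              for blob in pool.map(fetch, keys):
                  pass  # decode and enqueue into the training pipeline here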
      • jey 2014 days ago
        Interesting. What's the bottleneck for HDFS, is it that the namenode has high latency under load?
  • sgt101 2015 days ago
    I'm wondering if Kudu + Impala + RAPIDS + YARN + TonY might do instead... maybe with more support?
  • huntaub 2015 days ago
    What about Elastic File System?
    • jamesblonde 2015 days ago
      It's the same as S3 - no support for NVMe.
      • dekhn 2015 days ago
        That's a non sequitur - EFS is a way to mount filesystems via NFS. What EFS does internally is kind of irrelevant - Amazon could implement it with NVMe, or whatever technology they build internally, and the client would see the performance.
      • manigandham 2015 days ago
        What exactly are you talking about? None of these things are comparable.

        S3 is object storage. Ceph is a networked POSIX file system. HDFS is in between. All of them can run on NVMe devices.

  • _zoltan_ 2015 days ago
    We're using Spectrum Scale with its HDFS connector. While we're talking about options, I thought I'd mention it as well.
    • gnufx 2013 days ago
      For what it's worth, you can do the same freely with at least OrangeFS, Gluster, and Lustre if you must do things a Hadoop-ish way; I assume also with Ceph.
  • thstart 2015 days ago
    I haven't seen a practical use of DL or ML; it is all overhyped. The biggest problem: nobody seems to understand what happens inside the neural network and why it works at all. All NNs do is tweak weights until they match the desired outcome - all of this very fast. But without a deep understanding of how it works, I doubt it will progress that way. Fact is, after the first evaluation at MIT it was decided the desired outcome was achievable within 5 years - now we are 50 years later at the same point, with the only difference being fast hardware for matrix computations. The software very quickly produces a black-box model which nobody understands.
  • xvilka 2015 days ago
    A problem with both HDFS and HopsFS: they are written in Java, which makes them slower and means they use much more memory than native filesystems like Ceph.