Apache Arrow and MinIO

(blog.min.io)

71 points | by jtsymonds 1307 days ago

10 comments

  • hardwaresofton 1306 days ago
    Am I the only one that just can't grok what Arrow is doing and how minio in particular made it faster? I read/skimmed this submission and I've worked out this much:

    - Arrow is an efficient object format that helps skip serialization

    - Spark works in memory

    - Minio makes storage of data scalable and cheaper than hard disks

    - (from the article) Spark jobs don't (didn't??) saturate the network

    - (from the article) Spark jobs needed to write the same data to many buffers to get anywhere (exporting?)

    - memory mapped IO is here to help (though this seems like they're mixing up memory-mapped device IO with mmap'd files)

    Is the gist that because arrow does not require much serialization/deserialization, memory mapped network devices can shoot it out to another server (w/ S3 as an intermediary) at network speed, skip the disk, and get used at the destination faster? I guess the other method was program-1 -> disk -> network -> disk -> program-2 and now it's program-1 -> s3 -> program-2 ?
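
    If I have that roughly right, the path looks something like this in pyarrow (the MinIO endpoint, credentials, and bucket/key names below are made up; this is just a sketch of the "program-2 reads from S3" half):

        import pyarrow.fs as fs
        import pyarrow.parquet as pq

        # Point pyarrow's S3 filesystem at a MinIO endpoint instead of AWS
        # (endpoint and credentials are placeholders).
        minio = fs.S3FileSystem(
            access_key="minioadmin",
            secret_key="minioadmin",
            scheme="http",
            endpoint_override="localhost:9000",
        )

        # "program-2" side: columnar data lands straight in Arrow memory,
        # with no row-by-row deserialization step in between.
        table = pq.read_table("my-bucket/events.parquet", filesystem=minio)
        print(table.num_rows, table.schema)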

    • gas9S9zw3P9c 1306 days ago
      You're not the only one. I've worked quite a bit with Minio, Arrow, and Spark, and I don't really understand the point the article is trying to make or how it's related to Minio. It's either badly explained or just a fluff piece throwing together a bunch of technologies.

      To me, the article says "columnar formats that don't require deserialization are good, oh, and you can use Minio to store data!"

    • mindhash 1306 days ago
      I was involved in the Arrow project for a while. The goal of the project is to make transfers easier and faster (zero-copy) between dev ecosystems (java->python, python->go) and across the network (to a database, S3, a Spark server, etc).

      And yes, it has nothing to do with MinIO any more than it has with S3 or any cloud FS. There are easy-to-use writers and readers that also handle streaming.

      Last time I checked,

      - The most common usage is parquet handling.

      - Python-to-Java IPC was in place.

      - The communication across systems uses protobuf, which is optimised for web-level transfers rather than large-scale data transfers, so transfers still have work to be done there.

      There was also an in-memory object sharing module. I think it's no longer part of the repo now.
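
      To give a flavour of the reader/writer side, here is a minimal pyarrow sketch of the IPC file path (column and file names are made up; reading back through a memory map is where the zero-copy part comes in):

          import pyarrow as pa

          table = pa.table({"id": [1, 2, 3], "score": [0.1, 0.2, 0.3]})

          # Write the table in the Arrow IPC file format.
          with pa.OSFile("data.arrow", "wb") as sink:
              with pa.ipc.new_file(sink, table.schema) as writer:
                  writer.write_table(table)

          # Read it back via mmap: record batches reference the mapped
          # memory directly instead of being deserialized row by row.
          with pa.memory_map("data.arrow", "r") as source:
              loaded = pa.ipc.open_file(source).read_all()

          print(loaded.equals(table))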

      • hardwaresofton 1305 days ago
        Yeah, so I didn't know about Arrow and I'm glad I looked it up (while trying to figure out what exactly was going on); it's a really cool technology. A lot of server time is spent encoding/decoding JSON -- but I wonder if it could help that use case? I know how to compare thrift/protobuf/cbor/etc but I don't really know if Arrow can work for the general use case -- it seems like it deals in tables and sets.

        Most of the time I don't necessarily want disparate languages in one system but it seems inevitable and knowing that Arrow exists seems like it will help me one day.

  • wesm 1306 days ago
    (Wes here) I appreciate the Arrow shout-out but note that Apache Arrow has been a major open source community collaboration and not something I can take sole credit for.
  • vikrantrathore 1306 days ago
    Just be careful when you use MinIO, as it does not support all of the S3 APIs. The project maintainers will close tickets for errors hit when using standard S3 libraries and ask you to write MinIO-specific custom code for things that work fine against Amazon S3 [1].

    So if your company is ready to invest in MinIO without full S3 compatibility, it's nice software, and my kudos to the team who put in the effort to build it. It's just that it's not fully S3 compatible, and MinIO buckets do not behave the same as S3 buckets.

    [1] https://github.com/minio/minio/issues/10160

    • kdkeyser 1306 days ago
      I have not yet seen a storage system that is actually 100% S3 compatible. Either functionality is missing, and/or there are corner cases where the behavior is just slightly different. It is also a complicated API that often looks like a reflection of internal engineering choices (e.g. the object metadata structure AWS uses is probably the reason why the list call you mentioned behaves the way it does on AWS S3).

      However, in this case, MinIO makes the deliberate choice to deviate from AWS S3 behavior on a very common call (listing objects). I do agree that this risks breaking applications in non-obvious ways: the call succeeds, but your application behaves differently than it would against AWS S3, due to the missing directory entry.
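
      For context, this is the shape of the call in question; a minimal boto3 sketch (endpoint, credentials, bucket and prefix are made up), where the split between Contents and CommonPrefixes is exactly the place implementations tend to diverge:

          import boto3

          s3 = boto3.client(
              "s3",
              endpoint_url="http://localhost:9000",   # e.g. a MinIO endpoint
              aws_access_key_id="minioadmin",
              aws_secret_access_key="minioadmin",
          )

          resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="data/", Delimiter="/")

          # Objects vs. "directory" prefixes -- the edge cases here are where
          # S3 implementations differ.
          print([o["Key"] for o in resp.get("Contents", [])])
          print([p["Prefix"] for p in resp.get("CommonPrefixes", [])])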

      I also find their argumentation in the ticket a bit worrisome, calling the AWS S3 behavior a "blunder". I saw the same hubris when looking at their data path, where they chose to ignore battle-tested approaches like Paxos/Raft for distributed coordination and instead built their own distributed lock algorithm and implementation: it does not seem to persist state, so resilience to power failures and server/process crashes might not be what you expect.

      • cyberpunk 1306 days ago
        Check out leofs -- we've yet to hit any compatibility issues with it, and it even has an NFS server, which is quite nice.
    • polskibus 1306 days ago
      That's really bad. Their marketing says upfront that they are designed for the S3 API. If they don't strive for maximum compatibility, then their value prop goes out of the window. I was wondering if I should use Minio as a compat layer to be able to use other tools that integrate with S3; what you pointed out probably saved me lots of investigation time in the future.
    • stingraycharles 1306 days ago
      We’re using Minio with Amazon’s own S3 SDK, and while they don’t support the latest S3 protocol version, it’s definitely not as bad as you say it is. We had more problems with Google Cloud Storage’s S3 API than with Minio’s (I believe there was something related to listing directories with their API as well, but it returns a flat-out error instead).

      The key thing to note is that it’s probably not fully compatible, but it’s definitely compatible enough for many use cases.

      I can also see in the issue you reported that you changed the title of the issue from "Minio is not fully S3 compatible" to "Minio is not S3 compatible", which strikes me as you having some personal beef with them.

      • vikrantrathore 1306 days ago
        There isn't any personal beef; I just take objection to the marketing line "The de facto standard for Amazon S3 compatibility".

        In my view it's misleading, as MinIO is not a drop-in replacement for S3-API-based applications. My team spent significant effort diagnosing the error, because I believed that line and asked them to retry in different ways, changing the code again and again, even though the code was working fine with Amazon S3.

        I changed the title to warn others, so they don't shoot themselves in the foot the same way by believing MinIO is a drop-in replacement for Amazon S3-based applications. If you read the thread, you can see there isn't any plan to support the POSIX-like folder API in MinIO that Amazon S3 supports.

        • stingraycharles 1306 days ago
          Of course, but marketing is marketing; if you blindly believe marketing headlines before doing your due diligence, I'm sorry to say, that's a bit naive. If you look at pretty much any protocol out there that's implemented by multiple vendors, there are always small nuances and incompatibilities. Even protocols with extremely detailed specifications have incompatibilities; S3 not having a formal specification should make you expect even worse.

          If you look at S3 implementations from the various clouds and NAS appliances and whatnot, you'll see that none of them is 1:1 compatible with the others.

          If you wanted to warn others that Minio does not provide S3 compatibility at all (which is what the title edit suggests), rather than that it does not provide "full" S3 compatibility, then I believe you are mistaken. Minio definitely provides S3 compatibility; all the main tools and SDKs can talk perfectly well with Minio. It's just that there are different implementation details around the edges.

          • vikrantrathore 1306 days ago
            > marketing is marketing

            I didn't believe it blindly; that's why there is a ticket, filed after a thorough investigation of the error in the code as well as the self-hosted instance's configuration and log files.

            Also, to make sure others do not need to go through the same long debugging process, I tried to get the project maintainers to change this marketing claim, as is evident in the ticket. Please check the ticket and see for yourself. I did it so that others facing the same problem can find it before spending too much effort on debugging.

            I can see the amount of work the MinIO team has put into this project, and again, kudos to the team. Open source is hard and requires a lot of commitment.

            This is how open source works: you give back to the community in the form of feedback, documentation, issues, and, if you're able, code.

  • nicolast 1306 days ago
    Not an Arrow expert at all, so I may be missing something, but I fail to understand the "The Science of Reading/Writing Data" section, or rather, its relevance to the article (and Arrow).

    From what I could find, Arrow supports reading (writing?) data from (to?) memory-mapped files (i.e., memory regions created through mmap and friends). However, this has no relation to how the I/O itself is performed, and hence nothing to do with accessing I/O devices through either I/O ports or memory mappings (DMA and such).

    This section seems to be mixing up two fairly distinct concepts: it talks about ways to access I/O devices and transfer data to/from them (among which memory mapping is one option), whereas the memory mapping (mmap of files) used by Arrow is something rather different.

    • formerly_proven 1306 days ago
      The section is completely false. Memory-mapping a file maps pages from the page cache, which lives in main memory. I suppose the author confused this with memory-mapped I/O and then confused port-based I/O with applications using syscalls. I can see how you end up there when you're only viewing the system through high-level abstractions in Java.

      > Thanks to Wes McKinney for this brilliant innovation, its not a surprise that such an idea came from him and team, as he is well known as the creator of Pandas in Python. He calls Arrow as the future of data transfer.

      I assume the confusion lies with the author of the blog post and not Wes McKinney, so this callout in that context is a real disservice.

      > The output which displays the time, shows the power of this approach. The performance boost is tremendous, almost 50%.

      Keep in mind that this is reading a 2 MB file with 100k entries, which somehow manages to consume half a second of CPU time. The author compares wall time and not CPU time; both runs consume somewhere between 600 ms and over a second of wall time (again, handling 2 MB of data). I wouldn't be surprised if the first call simply takes so long because it is lazily loading a bunch of code.
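
      (For anyone who wants to see the distinction concretely, a toy Python sketch -- the sleep stands in for I/O waits or lazy module loading:)

          import time

          wall_start = time.perf_counter()   # wall-clock time
          cpu_start = time.process_time()    # CPU time of this process only

          sum(i * i for i in range(1_000_000))  # actual CPU work
          time.sleep(0.5)                       # waiting: wall time grows, CPU time doesn't

          print(f"wall: {time.perf_counter() - wall_start:.3f}s")
          print(f"cpu:  {time.process_time() - cpu_start:.3f}s")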

      Later on memory consumption is measured, and one of the file format readers manages to consume -1 MB.

      This article has a very bad smell.

  • stefanha 1306 days ago
    > Usually applications use Port mapped I/O.

    This is not true. On a Linux system run "cat /proc/ioports" to see the Port mapped I/O space. There are no network card or storage controller registers in Port mapped I/O space on my machine so network and disk I/O does not involve Port mapped I/O. Modern devices tend to use Memory Mapped I/O registers (see "cat /proc/iomem").

    Is the author confusing MMIO vs PIO with mmap vs syscalls? They are two different things. Using mmap can be more efficient than using syscalls because the application has direct access to the page cache, no syscall overhead, and no memory copies.
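
    To make the distinction concrete: the mmap that Arrow (and the article's benchmark) actually relies on is mapping a regular file into the process, not touching device registers. A small Python sketch (the file name is made up):

        import mmap
        import os

        fd = os.open("data.arrow", os.O_RDONLY)
        try:
            # Pages are served from the page cache; once mapped, reads are
            # plain memory accesses with no read() syscall or extra copy.
            with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as buf:
                print(buf[:8])
        finally:
            os.close(fd)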

  • georgewfraser 1306 days ago
    I am truly baffled by the desire to build new data lakes. I understand why Hadoop data lakes came to be: Teradata charged a lot for storage. But now there are database systems that use object storage under-the-hood: Snowflake, BigQuery, Redshift RA3, Databricks Delta Lake. They give you the same low cost of object storage, but with a nice high-level relational interface. I am mystified by why someone would want to build their own data lake rather than just use one of these systems.
    • jmngomes 1306 days ago
      All the tools you mention are great for their specific use cases, e.g. Snowflake, BigQuery and Redshift are great for analytics over big data, the very common COUNT/SUM/AVG/PARTITION/GROUP BY analysis. But as with any other tool, they're not a good fit for every analysis method over every data type, which I think is what a data lake aims at. Analyzing a large JSON dataset on Snowflake is possible, but either too slow or too expensive compared with more appropriate tools (e.g. Elasticsearch or a Python notebook running PySpark).

      Having a data lake - which I understand as a repository of raw data of diverse types, regardless of the tools - structured in a tool like S3 is very useful when you have multiple use cases over data of different kinds.

      For example, you could store audio files from customer calls and have them processed automatically by Spark jobs (e.g. for transcript and stats generation), structure and store call stats on a database for analytics, and do further analysis via notebooks on data science initiatives (e.g. sentiment analysis). This is akin to having a staging area for complex and diverse data types, and S3 is useful for this because of its speed, scale and management features.

      Teradata or Snowflake aren't a great fit for use cases like these, but they are great if the use case is to get answers to questions like "top 3 operators per team in volume of calls, by department and region, in last quarter" if the volume of calls is big.

      If I understood correctly, I think your comment was more focused on why use new tools when the existing are mature, but I think big data tools have had to become more specialized and targeted for specific use cases. But if the question is "why build more than one data lake", the only reason I can see is organizational: teams or different areas of an organization either need their own data lake because they have specific needs (which is rare) or won't/can't collaborate with others to have a shared asset.

      • georgewfraser 1305 days ago
        I agree that no matter what you're going to store unstructured binary data, like audio files, in object storage. But that is perfectly compatible with storing structured data in a relational database.

        > they are great if the use case is to get answers to questions like "top 3 operators per team in volume of calls..."

        You are straw-manning Snowflake/BQ. Just because they are SQL database systems doesn't mean you have to do 100% of your analysis in SQL. You can use other systems, like Spark, PyTorch, Tensorflow, to work with data that you manage inside a RDBMS. There are some issues with bandwidth getting data between systems, but these issues are getting solved (by Arrow!) and in the meantime unload-to/load-from S3 is a good workaround.

        I've heard a lot of people make these same arguments and I've tentatively concluded it's mostly motivated reasoning. Engineers like to engineer things. They start by trying to make the obvious, boring system work, but when they run into an obstacle they immediately jump to "I need to build a new system using $TECHNOLOGY."

    • MrPowers 1306 days ago
      Snowflake and BigQuery are quite expensive for big datasets.

      Databricks Delta Lake has its use cases, but some aspects are rough around the edges: vacuuming is very slow, the design decision to store partitioned data on disk in folders has certain pros/cons, etc.

      There are a lot of great products in the data lake space, but lots more innovation is needed going forward.

      • georgewfraser 1305 days ago
        > Snowflake and BigQuery are quite expensive for big datasets.

        This is just flatly untrue---they are nearly the same cost/TB as object storage, and they store everything in compressed columnarized format, so they're about as efficient as you can get.

        I have heard many people make the same claim. I can't figure it out. Is there something wrong with my calculator???

        • bwc150 1305 days ago
          The storage costs are only a tiny fraction of the costs involved in accessing and analyzing the data using Snowflake "credits".
          • georgewfraser 1305 days ago
            For sure, but you're not going to fix that by making your own data lake using, for example, Parquet-on-S3. You're still going to pay the cost of compute when you analyze that data, and a well-optimized commercial database system is extremely hard to beat. Even if you look at Presto, and you exclude the people costs of managing it yourself, it still can't beat the commercial systems: https://fivetran.com/blog/warehouse-benchmark
          • hodgesrm 1305 days ago
            That's because Snowflake happens to charge relatively little for backing storage at the moment. As I recall it's about the same as object storage. Virtual data warehouses on the other hand are quite expensive, especially if they have a lot of compute.
    • sgt101 1306 days ago
      Surely those are all tools for building a data lake, and the stack described in the article is just another tool for doing the same? It's a reasonable architectural choice...
  • nl 1306 days ago
    Back a few years ago (2017, I think) another dev and I took a Spark-based system that was taking 3+ days to do aggregate processing over HBase-stored data.

    We switched it to Parquet on HDFS and did some (ok, a lot of...) performance tuning, and exactly the same processing ended up at 6 seconds.

    TL;DR: You really, really need to understand how Spark does data processing.

    • idunno246 1306 days ago
      I’ve had some bad luck with HBase for similar workloads, but haven’t tried Spark - it’s better optimized for writers than readers, IIRC. Pinterest [1] did an interesting thing where they worked directly on the HFiles. Everywhere I’ve worked, I feel like I could pick any random data processing job that took more than a few minutes and get an orders-of-magnitude improvement; there’s just so much inefficiency.

      [1]https://medium.com/pinterest-engineering/open-sourcing-terra...

      • anshumaniax 1306 days ago
        What do you mean, inefficiency? We use the bulk upload feature and do billions of puts in an hour, and our scans can go against 3 billion rows an hour. HBase scales linearly, and we are already operating it at 5 times what we had designed it for.
        • idunno246 1306 days ago
          Ah, I meant that as a separate comment, not about HBase specifically, but about data pipelines needing updates over time. Generally, what’s worth optimizing for changes as the size increases, and there’s always that years-old pipeline that takes hours to run but with a few changes could take minutes.
    • anshumaniax 1306 days ago
      Really, we went the exact opposite way for one of our use cases, which required table scans, and Spark just lost. HBase just stores bytes in sorted order, and once you know how to optimize for storage, lots of wins can be achieved. I guess the use case here was aggregation, so I can definitely see some Spark advantages.
      • nl 1306 days ago
        Yes, HBase index scans are very fast.

        (Any index scan is extremely fast really. You can build your own indexes as separate Parquet files if you want to avoid HBase for some reason)

  • anderspitman 1306 days ago
    Is Arrow data commonly written to disk? I was under the impression (as someone who doesn't use it currently) it's usually converted to Parquet for this purpose.
    • andygrove 1306 days ago
      Arrow is a memory format and optimized for efficient vectorized processing in memory. Although it is possible to persist the Arrow format to disk, it is more common to use Parquet.
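
      A small pyarrow sketch of that split (file names are made up): the same in-memory table can be persisted as Parquet, or as the Arrow IPC format on disk (Feather V2) if you really do want to keep it in Arrow form.

          import pyarrow as pa
          import pyarrow.feather as feather
          import pyarrow.parquet as pq

          table = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})

          # Common path: persist as Parquet (compressed, storage-oriented).
          pq.write_table(table, "data.parquet")

          # Also possible: persist the Arrow format itself (Feather V2 = Arrow IPC on disk).
          feather.write_feather(table, "data.feather")

          print(pq.read_table("data.parquet").equals(feather.read_table("data.feather")))
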
    • thundergolfer 1306 days ago
      I'd be interested to know any benefits of converting to Parquet if you know your reader supports Arrow.
      • lmeyerov 1306 days ago
        Compression. For our app, we do multilevel caches, where warm is Arrow (in-memory, in-flight, streaming, ...) and cold (persistent/scale-out) is Parquet (SSD, S3, ...).
  • mekster 1306 days ago
    What is the use case for Minio? Typically server disk is much more expensive, and turning it into S3-compatible storage doesn't seem like a good way of utilizing it.
    • markonen 1306 days ago
      We’re using Minio storage on our own hardware to save on S3 egress transfer fees. It wouldn’t be very attractive based on the storage fee alone, but if you have any traffic at all, the traffic will dominate your calculations.

      Of course, you can also get these savings with an online service like Wasabi or Backblaze B2. I just feel that using S3 as your only storage is fine, but once you move to the budget options you need an external backup. So we are using Minio and backing up to Wasabi, and still saving a lot.

      • mekster 1306 days ago
        I wonder, though: if you have access to the actual file system, why add the S3 layer, unless the app is already built with S3 in mind?
    • stingraycharles 1306 days ago
      We use minio in our on-prem dev and CI environments, seems to be perfect for this use case.
    • selfhoster11 1306 days ago
      Non-commercial workloads that don't need 11 nines of reliability. S3 is fairly expensive for those use cases, so Minio is a good option to replace it if you need an object-based data store (for backups, for instance).
      • mekster 1306 days ago
        Why do you need an S3 API layer for backups when you have the actual file system, over which it's easy to provide SFTP or rsync via OpenSSH?
        • selfhoster11 1305 days ago
          S3 is HTTPS-based, and all you need to implement it is access to HTTPS and HMAC hashing (for computing signatures). This allows it to be used from virtually any platform and programming language, because uploading a file is as simple as making an HTTP PUT request with a few extra headers set (assuming there isn't already an official or unofficial SDK that you can leverage instead).

          S3 is considerably simpler than SFTP or rsync over SSH on non-Unix-like platforms, which makes it a very interoperable API. Even Wordpress or a small Go service can implement S3 support without much effort or any extra dependencies. Being HTTPS-based, it can also be reverse proxied more easily because HTTP presents a Host header, unlike SSH.
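
          For instance, the upload side against a self-hosted endpoint can be as small as this boto3 sketch (endpoint, credentials, bucket and key are made up; the SDK takes care of the request signing):

              import boto3

              s3 = boto3.client(
                  "s3",
                  endpoint_url="https://minio.example.com",  # self-hosted endpoint
                  aws_access_key_id="ACCESS_KEY",
                  aws_secret_access_key="SECRET_KEY",
              )

              # Under the hood this is a single HTTP PUT with a signed
              # Authorization header and a few extra headers.
              with open("backup.tar.gz", "rb") as f:
                  s3.put_object(Bucket="backups", Key="2020/backup.tar.gz", Body=f)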

    • MogwaiAllOnYou 1306 days ago
      We use it for on-prem deployments where the cloud deployments use S3. It's also useful to spin up for local dev with Vagrant.
    • bwc150 1305 days ago
      Isn't MinIO just a better, distributed Samba/NFS server?
  • gizmodo59 1306 days ago
    Are there any specific optimizations in the MinIO code for Arrow, or is it just that reading from the Arrow format in general gives high performance in analytics?