4 comments

  • marknadal 1854 days ago
    I really love the design and style, qri! It is fun!

    Can I ask why, for a git-style system, IPFS was chosen instead of GUN or SSB?

    Certainly, images/files/etc. are better in IPFS than GUN or SSB.

    But, you're gonna have a nightmare doing any git-style index/patch/object/etc. operations with it - both GUN & SSB's algorithms are meant to handle this type of stuff.

    Did you guys do any analysis?

    • b_fiive 1854 days ago
      hey, qri dev here. Delighted you like the design, we're hoping to make data a little more "approachable" :)

      We did look into SSB. I'll admit to not hearing about it until only a few months ago, but the main reason we chose IPFS was its single-swarm behaviour, which allows for natural deduplication of content (a really nice property for dataset versioning).

      The majority of our work has been in the exact area you mentioned, building up a dataset document model that will version, branch, and convert to different formats. We've gone so far as to write our own structured data differ (https://github.com/qri-io/deepdiff). I'm very happy with the progress we've made on this frontier so far.
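
      (For the curious: a very rough Go sketch of what "diffing structured data" means. This is illustrative only, not the actual deepdiff API.)

        package main

        import "fmt"

        // Delta is a toy model of a single change between two versions of a
        // JSON-like document. It is NOT the real deepdiff type.
        type Delta struct {
            Op    string      // "add", "remove", "update"
            Path  string      // where in the document the change happened
            Value interface{} // the new (or removed) value
        }

        // diffMaps compares two flat documents and emits a list of deltas.
        func diffMaps(path string, a, b map[string]interface{}) []Delta {
            var ds []Delta
            for k, av := range a {
                bv, ok := b[k]
                switch {
                case !ok:
                    ds = append(ds, Delta{"remove", path + "/" + k, av})
                case fmt.Sprint(av) != fmt.Sprint(bv):
                    ds = append(ds, Delta{"update", path + "/" + k, bv})
                }
            }
            for k, bv := range b {
                if _, ok := a[k]; !ok {
                    ds = append(ds, Delta{"add", path + "/" + k, bv})
                }
            }
            return ds
        }

        func main() {
            v1 := map[string]interface{}{"country": "CA", "population": 36}
            v2 := map[string]interface{}{"country": "CA", "population": 37, "year": 2019}
            for _, d := range diffMaps("", v1, v2) {
                fmt.Printf("%s %s %v\n", d.Op, d.Path, d.Value)
            }
        }

      A real differ also has to recurse into nested structures and detect moves; that's where the interesting work is.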

      I'm a huge fan of SSB, but I don't think it's well suited for making datasets globally discoverable across the network. In the end the libp2p project tipped the scales for us, providing a nice set of primitives to build on.

  • DocSavage 1854 days ago
    Interesting project, particularly with the choice of IPFS and DCAT -- something I'll have to look into. There have been other efforts to handle mostly file-based scientific data with versioning in both distributed (Dat https://blog.datproject.org/tag/science/) and centralized ways (DataHub https://datahub.csail.mit.edu/www/). Juan Benet visited our research center to give a talk about IPFS a few years ago. Really fantastic stuff.

    I'm the creator of DVID (http://dvid.io), which has an entirely different approach to how we might handle distributed versioning of scientific data primarily at a larger scale (100 GB to petabytes). Like Qri and IPFS, DVID is written in Go. Our research group works in Connectomics. We start with massive 3D brain image volumes and apply automated and manual segmentation to mine the neurons and synapses of all that data. There's also a lot of associated data to manage the production of connectomes.

    One of our requirements, though, is having low-latency reads and writes to the data. We decided to create a Science API that shields clients from how the data is actually represented, and for now we've used an ordered key-value store for the backend. Pluggable "datatypes" provide the Science API and also translate requests into the underlying key-value pairs, which are the units for versioning. It's worked out pretty well for us, and I'm now working on overhauling the store interface and improving the movement of versions between servers. At our scale, it's useful to be able to mail a hard drive to a collaborator to establish the base DAG data and then let them eventually do a "pull request" for their relatively small modifications.
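
    (To make the shape of that concrete, here's a hand-wavy Go sketch of a pluggable datatype over a key-value backend. The names and signatures are made up for illustration; they're not DVID's actual interfaces.)

      package main

      import (
          "errors"
          "fmt"
      )

      // KVStore is the ordered key-value backend. Versioned key-value
      // pairs are the unit of storage and of versioning.
      type KVStore interface {
          Get(key []byte) ([]byte, error)
          Put(key, value []byte) error
      }

      // memStore is a trivial in-memory stand-in for the real backend.
      type memStore map[string][]byte

      func (m memStore) Get(k []byte) ([]byte, error) {
          if v, ok := m[string(k)]; ok {
              return v, nil
          }
          return nil, errors.New("not found")
      }

      func (m memStore) Put(k, v []byte) error { m[string(k)] = v; return nil }

      // Datatype is a pluggable type: it exposes a science-level API and
      // translates requests into version-scoped key-value operations.
      type Datatype interface {
          Name() string
          GetBlock(s KVStore, version string, x, y, z int) ([]byte, error)
      }

      // grayscale is a toy datatype for blocks of a 3D image volume.
      type grayscale struct{}

      func (grayscale) Name() string { return "grayscale" }

      func (grayscale) GetBlock(s KVStore, version string, x, y, z int) ([]byte, error) {
          // The versioned key is what actually gets stored, diffed, and moved.
          key := fmt.Sprintf("grayscale/%s/%d_%d_%d", version, x, y, z)
          return s.Get([]byte(key))
      }

      func main() {
          s := memStore{}
          s.Put([]byte("grayscale/v1/0_0_0"), []byte("...block bytes..."))
          var dt Datatype = grayscale{}
          b, err := dt.GetBlock(s, "v1", 0, 0, 0)
          fmt.Println(dt.Name(), string(b), err)
      }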

    We've published some of our data online (http://emdata.janelia.org) and visitors can actually browse through the 3D images using a Google-developed web app, Neuroglancer. It's running on a relatively small VM so I imagine any significant HN traffic might crush it :/ We are still figuring out the best way to handle the public-facing side.

    I think a lot of people are coming up with their own ideas about how to version scientific data, so maybe we should establish a meeting or workshop to discuss how some of these systems might interoperate? The RDA (https://rd-alliance.org/) has been trying to establish working groups and standards, although they weren't really looking at distributed versioning a few years ago. We need something like a Github for scientific data where papers can reference data at a particular commit and then offer improvements through pull requests.

    • amirouche 1854 days ago
      > We need something like a Github for scientific data where papers can reference data at a particular commit and then offer improvements through pull requests.

      Exactly my thought. Do you know of any working group that is working toward that goal?

      • DocSavage 1854 days ago
        If by working group you mean a cross-company collection of people, I don't know of any, or I would've joined them :) I've been working toward that goal for the last 5 years, but primarily with an eye to our kinds of data problems in the Connectomics field. I've been meaning to look at RDA again but I'm reluctant to start a working group myself.
        • b_fiive 1854 days ago
          Hey DocSavage! I'm one of these Qri folks, and I'd love to see that working group exist. I have a friend or two at the RDA. Maybe we should get an email going on the subject? Projects like these are bigger than any one company or tool :)
        • benhamner 1854 days ago
          Any way we can help at Kaggle? Is https://www.kaggle.com/datasets helpful for your work in connectomics?
      • brynb 1854 days ago
        We’re building something along these lines at Axon (http://axon.science). Sign up for our beta if you’re interested in checking it out, and we should be able to get you set up in the next few days (we’re just starting to roll things out to the public this week).

        The basic idea is distributed version control, like git, but over p2p swarms rather than clusters around “central” repositories. We have special handling for large datasets (but still using git) to improve transfer efficiency and diffing.

        There’s a UI layer for collaboration (discussion, PRs, review) that supports deep linking to and embedding of files at specific commits, which sounds a bit like what you’re looking for.

        Feedback is very much appreciated!

        • DocSavage 1854 days ago
          That looks very interesting, particularly the UI layer for collaboration. Your website says it supports “massive data sets” but I would spell out what you mean since data for different fields vary by several orders of magnitude. (Massive for me starts at TBs and goes to petabytes.)

          One of the issues for me is file-based versioning, which then requires the means to parse the format. A number of ventures and organizations (e.g., NeuroData without Borders) address versioning of the entire ecosystem necessary to correctly use the underlying data files, so I'm not sure if that's an explicit part of your ecosystem. Most importantly, is your stack going to be open source?

      • benhamner 1854 days ago
        We're working on that through Kaggle Datasets https://www.kaggle.com/datasets

        We currently support data versioning, interactive web previews, seamless loading into hosted Jupyter notebooks (Kaggle Kernels), seeing/sharing analytic results built on a specific data version, and adding direct collaborators.

        We don't support a data-oriented version of an "issue" or a "pull request" quite yet, but these needs are definitely on our radar.

      • mbreese 1854 days ago
        It's probably too late for this year, but ISMB is one of the traditional locations for such a working group in the biological sciences. It might be interesting for the meeting next year though. If anyone is interested in putting together a proposal, let me know. I'd be happy to help.
      • j88439h84 1854 days ago
        http://Dvc.org does this
        • DocSavage 1854 days ago
          What are the differences between Dvc and Pachyderm.io, which I should have mentioned earlier?
          • dmpetrov 1852 days ago
            "From a very high level perspective - Pachyderm is a data engineering tool designed with ML in mind, DVC.org is a tool to organize and version control an ML project. Probably, one way to think would be Spark/Hadoop or Airflow vs Git/Github." from https://news.ycombinator.com/item?id=19130499
      • smarx007 1854 days ago
        Zenodo?
    • ktpsns 1854 days ago
      > scientific data primarily at a larger scale (100 GB to petabytes)

      Buying hard disks (100 TB for a few tens of thousands of EUR, a few years ago) is a real investment at our institute. As far as I understand, with distributed storage each participant volunteers to share their disk to store their own (and others') data. Here's the devil's advocate question: why should I share my expensively bought disk space with you?

      • DocSavage 1854 days ago
        Some institutions won't pay for others, but in our space, big non-profit science institutions like Janelia and Allen Brain foot the bill for making the data available. Depending on the utility of the data, Amazon (https://aws.amazon.com/opendata/public-datasets/) or Google (https://cloud.google.com/public-datasets/) could also handle the cost of storing and distributing the data.

        With versioned data, you could leverage the largesse of the big institutions to provide the base data, and then only the deltas for the child versions need to be handled by users making changes.
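
        (A toy sketch of what I mean, with hypothetical types: the institution publishes the base once, and a child version is just its parent's ID plus the changed key-value pairs.)

          package main

          import "fmt"

          // Version is a node in the version DAG. The huge base volume is
          // published once; children carry only their deltas.
          type Version struct {
              ID      string
              Parents []string          // DAG edges back toward the base
              Deltas  map[string][]byte // only the key-value pairs that changed
          }

          // materialize applies a chain of deltas on top of the base data.
          func materialize(base map[string][]byte, chain []Version) map[string][]byte {
              out := make(map[string][]byte, len(base))
              for k, v := range base {
                  out[k] = v
              }
              for _, ver := range chain {
                  for k, v := range ver.Deltas {
                      out[k] = v
                  }
              }
              return out
          }

          func main() {
              // The "mailed hard drive": terabytes of base blocks.
              base := map[string][]byte{"block/0_0_0": []byte("original segmentation")}
              // The collaborator's pull request: a few corrected blocks.
              fix := Version{
                  ID:      "v2",
                  Parents: []string{"v1"},
                  Deltas:  map[string][]byte{"block/0_0_0": []byte("corrected segmentation")},
              }
              fmt.Println(string(materialize(base, []Version{fix})["block/0_0_0"]))
          }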

  • guywhocodes 1854 days ago
    What are the benefits of using qri over ipfs? At a glance it seems very similar, just narrower.
    • b_fiive 1854 days ago
      Imagine git were built on top of IPFS, and aimed specifically at datasets. Qri uses IPFS to store & move data, so all versions are just normal IPFS hashes. E.g. this: https://app.qri.io/b5/world_bank_population is just referencing this IPFS hash: https://ipfs.io/ipfs/QmXwh5kNGsNAysRx66jcMiw1grtFf9j7zLFGbK9...
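
      (The property being leaned on is content addressing: a version's identifier is just a hash of its content, so identical content deduplicates across the whole network. Toy sketch below; IPFS really uses multihash-encoded CIDs, and this is not Qri's actual code.)

        package main

        import (
            "crypto/sha256"
            "fmt"
        )

        // contentAddress returns the identifier of a dataset version: a hash
        // of its bytes. Anyone holding the same bytes derives the same ID,
        // which is what makes dedup and verification fall out for free.
        func contentAddress(commit []byte) string {
            sum := sha256.Sum256(commit)
            return fmt.Sprintf("%x", sum[:])
        }

        func main() {
            v1 := []byte(`{"title": "world_bank_population", "body": "<hash of body>"}`)
            fmt.Println("version id:", contentAddress(v1))
        }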

      full disclosure: I work at Qri

      • guywhocodes 1854 days ago
        Ah, that's excellent. Thanks for your time
      • sjapkee 1853 days ago
        > and aimed specifically at datasets

        What are the benefits of that? What was it about git that fell short?

    • ekianjo 1854 days ago
      In IPFS you can't search from within the protocol, as far as I understand. Qri focuses on datasets and provides a search layer directly from its tools.
    • teawrecks 1854 days ago
      IPFS is listed as a dependency
  • mewwts 1854 days ago
    I love how the distributed web is seemingly built more and more in golang these days.

    - https://github.com/ethereum/go-ethereum

    - https://github.com/ipfs/go-ipfs

    - https://github.com/textileio/go-textile

    - https://github.com/lightningnetwork/lnd

    to name a few other projects.

    • rolleiflex 1854 days ago
      Mine is also (Aether - https://getaether.net). I’ve also gotten comments reflecting on this same thing. I love Go. It is boring: it makes sure that I focus on doing interesting things, not on writing interesting code.
      • b_fiive 1854 days ago
        aether is the coolest
    • Protostome 1854 days ago
      Why do you love that it's Go in particular? (Seriously asking, out of curiosity: why Go over all other languages, e.g. Rust and such?)
      • sheeshkebab 1854 days ago
        Go is simpler than most other high-performance languages - it's easy to read and understand unfamiliar codebases. It helps that Go compiles to native binaries for various platforms and runs with no or minimal dependencies.
      • yahyaheee 1854 days ago
        I would mostly attribute this to Go's compositional and prescriptive nature. Go sort of pushes you toward building highly reusable pieces that can be combined to create a system. It does that in a way that's incredibly easy to grok, which allows developer communities to more easily grow around products.
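
        (A tiny standard-library example of the kind of composition I mean: every layer only knows about io.Reader, yet they stack freely. The file name is made up.)

          package main

          import (
              "bufio"
              "compress/gzip"
              "fmt"
              "os"
          )

          func main() {
              // file -> buffered reader -> gzip decompressor -> line scanner,
              // each piece written without any knowledge of the others.
              f, err := os.Open("data.csv.gz")
              if err != nil {
                  fmt.Fprintln(os.Stderr, err)
                  return
              }
              defer f.Close()

              gz, err := gzip.NewReader(bufio.NewReader(f))
              if err != nil {
                  fmt.Fprintln(os.Stderr, err)
                  return
              }
              defer gz.Close()

              sc := bufio.NewScanner(gz)
              for sc.Scan() {
                  fmt.Println(sc.Text())
              }
          }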
        • Ericson2314 1854 days ago
          Prescriptive? Maybe. Compositional? Absolutely not.

          I blame Go for these things not being done sooner.

          • yahyaheee 1854 days ago
            By "compositional" I mean composed of core elements that are combined to create something else.
            • Ericson2314 1854 days ago
              I know what it means, I'm saying Go doesn't have it (relative to other languages).
              • yahyaheee 1854 days ago
                I would say that Go is very compositional in a simple manner that makes it easy to grok, and hence the tools end up being highly reusable. Not all languages push you toward decomposition, but I would argue it's the most important trait of a language and its community. But you know how programming language discussions go =P
            • gameswithgo 1854 days ago
              Well that’s everything
      • maccio92 1854 days ago
        Fanboyism
        • stingraycharles 1854 days ago
          On a more serious note, I do think group identity (or "tribes", as it's described in popular media) probably explains it.

          A large project using their language of choice (Go in this instance) gives external validation that their tribe is growing, and thus that they made the correct choice to join it.

        • res0nat0r 1854 days ago
          Cross compilation and static binaries.
    • sjapkee 1853 days ago
      It only means that all of this will soon die. It's the Ruby of 2017-2019.