Launch HN: Datafold (YC S20) – Diff Tool for SQL Databases

Hi HN! My name is Gleb. I'm here with my co-founder Alex to tell you about our company Datafold (https://datafold.com). Datafold lets you diff large datasets for fast and powerful regression testing. We support databases such as PostgreSQL, Snowflake, BigQuery, and Redshift.

One of the biggest pain points in developing ETL pipelines – chains of jobs that move, clean, merge and aggregate analytical data – has been regression testing: verifying how a change in source code (mostly, SQL) affects the produced data.

Early in my career, as an on-call data engineer at Lyft, I accidentally introduced a breaking code change while attempting to ship a hotfix at 4AM to a SQL job that computed tables for core business analytics. A seemingly small change in filtering logic ended up corrupting data for all downstream pipelines and breaking dashboards for the entire company. Apart from being a silly mistake, this highlighted the lack of proper tooling for testing changes. If there had been a way to quickly compare the data computed by production code vs. the hotfix branch, I would have immediately spotted the alarming divergence and avoided merging the breaking change.

Without a diffing tool, the typical options for regression testing are: (1) Data “unit tests” (e.g. check primary key uniqueness, ensure values are within an interval, etc.) – these are helpful, but a costly investment. Frameworks such as dbt make it easier, but it’s often still prohibitively hard to verify all assumptions in a large table. (2) Write custom SQL queries to compare the data produced by the prod and dev versions of the source code (e.g. compare counts, match primary keys – see the sketch below). This can easily take 100+ lines of SQL and hours of unsatisfying work, which no one really wants to do. (3) "Fuck It, Ship It" is always an option but too risky nowadays, as analytical data not only powers dashboards but also production ML models.
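
For a sense of what option (2) looks like by hand, here is a minimal sketch, assuming hypothetical prod.orders and dev.orders tables keyed on id (the real thing grows quickly once you start comparing values column by column):

  -- Compare row counts between the prod and dev outputs
  SELECT 'prod' AS side, COUNT(*) AS n FROM prod.orders
  UNION ALL
  SELECT 'dev', COUNT(*) FROM dev.orders;

  -- Find primary keys present on only one side
  SELECT COALESCE(p.id, d.id) AS id,
         CASE WHEN p.id IS NULL THEN 'dev only' ELSE 'prod only' END AS side
  FROM prod.orders p
  FULL OUTER JOIN dev.orders d ON p.id = d.id
  WHERE p.id IS NULL OR d.id IS NULL;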

As this problem is common in data engineering, some large organizations have built and open-sourced their solutions – for example, BigDiffy by Spotify. However, most of these tools are CLI-based and produce results in a plain-text format which is hard to comprehend when you are dealing with complex data.

To fit our users’ existing workflows, we’ve built a web interface with interactive charts showing both diff summary statistics (e.g. % of differing values by column) and a value-level side-by-side comparison (git diff style). But since the mission of the tool is to save engineers as much time as possible, we also opened an API for automation through Airflow or other orchestrators, and built a GitHub workflow that runs a diff on every pull request that changes ETL code. Since billion-row-scale datasets are not uncommon nowadays, there is an optional sampling feature that helps keep compute costs low and returns results within a few minutes no matter how large the dataset is.

We've found Datafold to be a good fit for the following workflows: (1) Developing data transformations – before an ETL job is shipped to production, it undergoes multiple iterations. It's often important to see how the data changes between iterations, which is particularly useful if you have 1M+ rows and 100+ columns where “SELECT *” becomes useless. (2) Code review & testing: large organizations have hundreds of people committing to ETL codebases. Understanding the impact of even a modest SQL diff is daunting. Datafold can produce a data diff for every commit in minutes so changes are well understood. (3) Data transfer validation: moving large volumes of data between databases is error-prone, especially if done via change data capture (CDC): a single lost event can affect the resulting dataset in a way that is tricky to debug. We allow comparing datasets across different databases, e.g. PostgreSQL & Snowflake.

We've set up a sandbox at https://app.datafold.com/hackernews so you can see how diffing works. Shoot us an email (hn@datafold.com) to set up a trial and use it with your own data.

We are passionate about improving tooling for data engineers and would love to hear about your experience with developing data pipelines and ensuring data quality. Also, if you think that dataset diffing can be helpful in other domains, we are very curious to learn from you!

189 points | by hichkaker 1356 days ago

23 comments

  • GordonS 1356 days ago
    I was curious about pricing, but I see it's "call me pricing" with buttons to schedule a demo, so at least I can see this is squarely aimed at the enterprise. If I'm being honest, I don't like seeing "call me pricing" on HN; there are no rules against it, but it just doesn't feel right on HN.

    Are you able to say anything about pricing here?

    • hichkaker 1356 days ago
      Definitely a fair point!

      The primary reason we haven't provided pricing is that we have just launched and wanted to collect more data points before making the pricing public.

      Our current offering is:

      1) Free for diffing datasets < 1M rows

      2) $90 / mo / user for diffing datasets of unlimited size

      3) Cross-database diff, on-prem (AWS/GCP/data center) deploy, and single sign-on are in a custom-priced enterprise bucket.

      We would love to hear your thoughts on this.

      • simonebrunozzi 1356 days ago
        I have a quick suggestion for you: two options to mitigate this "issue".

        Option 1: make it free, up to a certain dataset size. You can harvest interested leads like the gentleman above.

        Option 2: (if you don't want to deal with a huge volume) offer it for $50 one-time fee, up to X size, for Y months (e.g. $50, up to 1 GB, valid for 3 months). Nice way to filter qualified leads.

        There are variations from the two options above, but I think you can easily get the general idea.

        Thoughts?

        • hichkaker 1356 days ago
          Thank you for the suggestion!

          We're leaning towards Option 1: free diffing for datasets < 1M rows. Option 2 seems a bit trickier since we are in a way creating a new tool category, and it can be harder to convince someone to pay before they try it and understand the value (unlike, say, a BI tool – everyone knows they need one of some kind).

      • GordonS 1356 days ago
        Does your solution diff database schema too, or is it purely for diffing data in identical schemas?

        I'm also really keen to hear why you built this tool - what use cases you expect. I've used free diffing tools a few times in the past, but I think every time it was to make sure I hadn't messed up "manual" data migrations (which obviously aren't a good idea).

        • hichkaker 1356 days ago
          Yes, it does diff the schema too.

          The main use cases we've seen:

          1) You made a change to some code that transforms data (SQL/Python/Spark) and want to make sure the changes in the data output are as expected.

          2) Same as (1), but there is also a code review process: in addition to checking someone's source code diff, you can see the data diff.

          3) You copy datasets between databases, e.g. PostgreSQL to Redshift, and want to validate the correctness of the copy (either ad hoc or on a regular basis).

          We have a signup-free sandbox where you can see the exact views we provide, including schema diff: https://app.datafold.com/hackernews

          • ironchef 1356 days ago
            Most folks I know are doing this (1 and 2) by testing against a replica (or in the case of Snowflake just copying the DB or schema) ... then running data tests locally and downstream (great_expectations, dbt tests, or some Airflow-driven tests).

            Is the value prop “you don’t need all the grunt work” as opposed to the above approach?

            • hichkaker 1356 days ago
              You raised a great point.

              Data testing methods can perhaps be broken down into two main categories:

              1. "Unit testing" – validating assumptions about the data that you define explicitly and upfront (e.g. "x <= value < Y", "COUNT(*) = COUNT(DISTINCT X)" etc.) – what dbt and great_expectations helps you do. This is a great approach for testing data against your business expectations. However, it has several problems: (1) You need to define all tests upfront and maintain them going forward. This can be daunting if your table has 50-100+ columns and you likely have 50+ important tables. (2) This testing approach is only as good as the effort you put to define the tests, back to #1. (3) the more tests you have, the more test failures you'll be encountering, as the data is highly dynamic, and the value of such test suites diminishes with alert fatigue.

              2. Diff – identifies differences between datasets (e.g. prod vs. dev or source DB vs. destination DB). Specifically for code regression testing, a diff tool shows how the data has changed without requiring manual work from the user. A good diff tool also scales well: it doesn't matter how wide/long the table is – it'll highlight all differences. The downside of this approach is the lack of business context: e.g. is the difference in 0.6% of rows in column X acceptable or not? So it requires triaging.

              Ideally, you have both at your disposal: unit tests to check your most important assumptions about the data, and diff to detect anomalies and regressions during code changes.
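
              To make the "unit testing" category concrete, here are a couple of typical checks, sketched in plain SQL with made-up table and column names – each query should return zero rows:

                -- primary key uniqueness
                SELECT user_id
                FROM analytics.users
                GROUP BY user_id
                HAVING COUNT(*) > 1;

                -- values within the expected interval
                SELECT *
                FROM analytics.orders
                WHERE amount < 0 OR amount >= 1000000;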

              • sails 1353 days ago
                I think a deeper analysis of why this is a good tool in addition to dbt would be useful for me. Locally Optimistic [] has a Slack channel and does vendor demos, with a _very_ competent data analytics/engineering membership. I think you'd do well to join and do a demo!

                [] https://locallyoptimistic.com/community/

      • GordonS 1356 days ago
        $90/m/user is a lot, especially when you consider the several other SaaS services you could get together for that price. That's definitely enterprise pricing, which fits with the "call me pricing" I guess.

        It's a niche, and a small one at that, but I don't doubt you'll find some enterprises willing to pay what you ask. But outside of enterprise I just can't see anyone paying that, especially when free tools exist (albeit nowhere near as polished and featureful as yours).

        I've got to wonder about YC backing for what seems like such a small niche though - very possible I'm not seeing something you have planned for further down the line.

        • hichkaker 1356 days ago
          Thank you for the feedback!

          Agree with you about the niche. Diff is our first tool that helps test changes in the ETL code, and the impact is correlated with the size and complexity of the codebase.

          Diff also provides us a wedge into the workflow and a technical foundation to build the next set of features to track and alert on changes in data: monitoring both metrics that you explicitly care about and finding anomalies in datasets. We've learned that this is something a much larger number of companies can benefit from.

          • sterlinm 1355 days ago
            I don't agree that $90/user/month is unreasonable in every context. Yeah it's probably too much for consumers but honestly the consumer need for this tool seems pretty niche to me. It's also probably too much for large enterprises where you'd have a lot of people who want the tool, but they're probably going to either build it themselves or pay $$$$ for big lame ETL tools.

            For mid-sized companies though that could be a bargain. I worked on a data migration project at a company with <100 people and <5 engineers where we had to hack together our own data-diff tools and this would have been a bargain.

      • lukevp 1356 days ago
        Could you use the free tier for regression testing a subset? Like up to 1M either first, last or random sample? Or do the datasets themselves have to be prefiltered down to 1M results?
        • hichkaker 1356 days ago
          Yes, you could. The limit is for the source dataset size, but you can prefilter it (there is an option to pass in a free-form SQL query instead of a table name when creating a diff).

          For the majority of diffs we see with sampling applied, sample sizes are <1M rows (more is often impractical: little information gain for higher compute costs), especially if your goal is to assess the magnitude of the difference as opposed to getting every single diverging row.
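
          For example, instead of a table name you could pass a prefilter like this (hypothetical table, PostgreSQL syntax) to diff only the most recent week:

            SELECT * FROM events WHERE created_at >= CURRENT_DATE - 7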

    • quickthrower2 1356 days ago
      > I don't like seeing "call me pricing" on HN

      Hmm. Why not?

      • GordonS 1356 days ago
        It feels dishonest - what have you got to hide? Are you going to suggest a different price depending on who I am? Why are you going to make me jump through hoops, wasting time listening to a spiel by a sales person?

        Aside from that, it just doesn't seem to fit the spirit of HN - this is just personal opinion, of course.

  • throwaway_pdp09 1356 days ago
    I see a lot of these things and I don't understand them. I've done too much ETL so I'm not naive. Now either 1) people are making a mountain out of a molehill (not saying that's happening here, but in other cases I think so) 2) there's something my experience of ETL hasn't taught me or 3) these tools are specialised for niches. This one talks about 'large datasets' but I don't know how large that is.

    Some questions then

    > It's often important to see how the data changes between iterations, which is particularly useful if you have 1M+ rows and 100+ columns where “SELECT *” becomes useless.

    select is fine for diffing. You just do an either-way except, something like

      (
      select f1, f2, f3 ... f100
      from t1
      except
      select f1, f2, f3 ... f100
      from t2
      )
      union 
      (
      select f1, f2, f3 ... f100
      from t2
      except
      select f1, f2, f3 ... f100
      from t1
      )
    
    used this and it's fine on many rows (millions is fine but I do recommend an index and a DB with a halfway decent optimiser).

    > (2)

    Interesting. OK.

    > (3) Data transfer validation: moving large volumes of data between databases is error-prone

    Really? I never had a problem. What is 'large'? what problems have you seen? There are easy solutions with checksums, error correction (comes free with networks) or round-tripping, is that a problem?

    Edit: just did that with mssql tables, 8 cols, 38 bytes per row, ~776,000 rows (identical but for one row). Diff as above takes 2 seconds without an index (with PK it takes 5 seconds. Sigh. Well done MS). The single row discrepancy shows up fine. Totally trivial to extend it to 100 columns (did that too in a previous job).

    • forrestb 1356 days ago
      Data diffs are easy when the tables are already exactly the same. The union/except query you wrote returns no results.

      It's much harder when many things are wrong at once, and you have to debug:

      - Do schemas match? (same column names, column order, data types)

      - Are pk's unique?

      - Do pk's intersect? (pk's exclusive to one table?)

      - For intersecting pk's, how is data different? (string truncation, timestamp truncation, rounding, upper vs lower case, empty string vs null, etc)

      • throwaway_pdp09 1356 days ago
        > Data diffs are easy when the tables are already exactly the same. The union/except query you wrote returns no results.

        It won't if the tables are identical. And if the tables differ, it a) tells you and b) gives you a very good idea where to look.

        In all honesty, none of these things are a problem.

        > It's much harder when many things are wrong at once

        Then you find and fix each in turn. BTDT.

        > Do schemas match? (same column names, column order, data types)

        Well, if they don't and you're responsible you've really messed up, but that does happen (cough). Just do a similar diff via information_schema to show you schema discrepancies.
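
        Something like this (standard information_schema, so it works on most DBs):

          (select column_name, data_type
           from information_schema.columns where table_name = 't1'
           except
           select column_name, data_type
           from information_schema.columns where table_name = 't2')
          union
          (select column_name, data_type
           from information_schema.columns where table_name = 't2'
           except
           select column_name, data_type
           from information_schema.columns where table_name = 't1')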

        > Do pk's intersect?

        this is just a subset of full-table diffing. Use my except/union but just on the PK fields (consult infoschema for the columns which are the PKs. If it has to be done for many tables, build the union/except dynamically then execute the string to allow for different tables with different column names - I've done this too. Annoying but once done you can just reuse it).

        > For intersecting pk's, how is data different? (string truncation, timestamp truncation, rounding, upper vs lower case, empty string vs null, etc)

        The aforementioned comparison using infoschema views will tell you where the schemas differ. Comparing the contents shows you where the data differs.

        All this is annoying but very little work compared to a whole ETL process, and - emphatically - is reusable on the next job.

        • forrestb 1356 days ago
          > Well, if they don't and you're responsible you've really messed up, but that does happen (cough). Just do a similar diff via information_schema to show you schema discrepancies.

          Analysts with 1-2 years experience in SQL are often writing ETL. Responsible Data Engineers can't be there every time they deploy to production.

          > All this is annoying but very little work compared to a whole ETL process, and - emphatically - is reusable on the next job.

          Yes, you can write your own tools for SQL QA. Making a standard tool that the entire company can and does use is another story.

          • mcrad 1356 days ago
            > Analysts with 1-2 years experience in SQL are often writing ETL

            This is the essence of Big Data.

    • etesial 1356 days ago
      Hi, I'm Alex from Datafold.

      By large tables we mean 1B+ rows. You likely don't want to do joins on them directly.

      > You just do an either-way except

      If one column in one of the tables is completely different, it'll return 1M rows. That's not exactly helpful, so let's add an order by and a limit. Now you'll need to spot differences in a 100-column table, since hiding matching values means even more hand-written / generated SQL.

      It's pretty good to know about all the problems upfront rather than finding and fixing them one by one, but creating a representative sample of differences is difficult to do with hand-written SQL.
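
      For example, just getting per-column mismatch rates by hand looks something like this (hypothetical t1/t2 keyed on id, PostgreSQL syntax) – and you still have to repeat it for every column and then fetch the actual diverging rows:

        SELECT
          AVG(CASE WHEN a.col1 IS DISTINCT FROM b.col1 THEN 1.0 ELSE 0 END) AS col1_diff_rate,
          AVG(CASE WHEN a.col2 IS DISTINCT FROM b.col2 THEN 1.0 ELSE 0 END) AS col2_diff_rate
          -- ...and so on for the remaining 98 columns
        FROM t1 a
        JOIN t2 b ON a.id = b.id;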

      Then what about cross-db comparison, sampling, integrating with CI / github PRs?

      > There are easy solutions with checksums, error correction (comes free with networks) or round-tripping, is that a problem

      Some companies are using complex pipelines that can and sometimes do lose events. Often they don't need their data to be 100% perfect. So they need to know if they lost anything, what exactly, and whether it's within an acceptable range.

      As always, roll-your-own is a valid approach, as is using already available tools to save time. Many large companies like Uber and Spotify with data teams of 100+ have detailed roadmaps for diffing tools, because they find their data engineers – who are certainly proficient with SQL – spending days on regression testing.

      • throwaway_pdp09 1356 days ago
        Hi Alex,

        > You likely don't want to do joins on them directly.

        I can't see you have any alternative. Looking for any discrepancy in either table, you have to compare them both completely. Is there an alternative?

        (edit: actually there is, the checksumming I mentioned earlier. Never needed to do that).

        > so let's add an order by and a limit. Now you'll need to spot differences in a 100-column table, since hiding matching values means even more hand-written / generated SQL.

        Yes, I added a limit for this too, and spotting two different columns out of a hundred, well it's not a big deal, but having done that I agree some GUI/other assistance would make it a lot nicer.

        > It's pretty good to know about all the problems upfront rather than finding and fixing them one by one...

        Well, you compare the schemas first, and since they're small you can do this very quickly, so you can pick up errors there one after the other very easily. After that you compare the data, and I suppose you can run multiple queries one after the other in a batch file and check them when done. I guess I never had enough data to need to do that.

        > ...but creating a representative sample of differences is difficult to do with hand-written SQL.

        I'm not sure what you mean.

        > Then what about cross-db comparison,

        Straightforward. In fact if you aren't comparing data across DBs then you aren't doing ETL. (edit: ok I think ISWYM. Solutions are roundtripping if the amount of data is not overwhelming, checksumming otherwise).

        > sampling, integrating with CI / github PRs?

        I don't understand why any of these are needed, but I'm pretty certain that reflects on my lack of experience, so okay.

        > Some companies are using complex pipelines that can and sometimes do lose events

        Then you've got a software quality problem. It's also not difficult (with joins against primary keys) to efficiently pick up the difference and to make things idempotent - just re-run the script. I've done that too.

        > Often they don't need their data to be 100% perfect. So they need to know if they lost anything, what exactly, and whether it's within an acceptable range.

        Very good point.

        > As always, roll-your-own is a valid approach, as is using already available tools to save time

        this leads to a very interesting question about upskilling employees versus the cost of tools - good training (say, in SQL) is somewhat expensive but lack of knowledge is far more expensive. I don't think this is the right place to have such a discussion, and I don't want to detract from your product (I do seem to have dissed it, which is not my intention).

        > Many large companies like Uber and Spotify

        Okay, if we're talking on the scale of companies like this then we're talking about something well outside my experience.

        I'll leave it here - I think I've distracted enough from your launch, and I wish you well with it!

        • hichkaker 1356 days ago
          Thank you for asking deep questions and providing specific examples! The degree of scale and complexity varies significantly between companies, so we definitely can't claim that every ETL developer should use such a tool.

          Philosophically, I see our diff tool playing a similar role to autocomplete/autorefactor in an IDE. Can you type/change every word manually? Yes. Is it a good use of your time though?

        • zimpenfish 1356 days ago
          > you have to compare them both completely. Is there an alternative?

          When I was doing this stuff for a telco 10 years ago (comparing before/after for CDR mediation changes), I found it was much faster to dump the two x00M row tables as CSV, sort them, and then use a Perl script to compare on a row by row basis. The join approach taken by the Oracle expert took many hours; my dump-sort-scan took under an hour.

          • throwaway_pdp09 1356 days ago
            OMG. Something went very, very wrong if it's faster to do that outside the database than within.
            • zimpenfish 1356 days ago
              Yeah, I've no idea how it was that much slower but they were reasonably sized tables (x00M rows) with tens of columns and it was basically a full join on each column - probably would need an index on every column to make it sensible?

              [edit: And with CDR mediation, it's not as simple as "do all the columns match exactly?" because some of them are going to be different in certain allowed ways which needs to be considered a match and also you need to be able to identify which rows differ in one column to say "these look the same but XYZ is different?" Which is probably why the query was horrendous.]

        • etesial 1356 days ago
          > I can't see you have any alternative. Looking for any discrepancy in either table, you have to compare them both completely. Is there an alternative?

          The alternative is sampling. With analytical data you often have to accept that the data isn't perfect, and that it doesn't need to be. So if it can be guaranteed that the amount of "bad" data is, say, less than 0.001%, then it may just be good enough. Cheap checks, like "how many PKs are null", can be done on full tables; anything that requires joins or mass-comparing all values can be done on sampled subsets.
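
          A rough sketch of that split (PostgreSQL syntax, integer PK assumed; the modulo filter makes the sample deterministic, so both sides pick the same keys):

            -- cheap check, runs on the full table
            SELECT COUNT(*) FILTER (WHERE id IS NULL) AS null_pks FROM t1;

            -- expensive value comparison, on ~1/10000 of the keys
            SELECT COUNT(*) AS mismatched_rows
            FROM t1 a JOIN t2 b ON a.id = b.id
            WHERE a.id % 10000 = 42
              AND (a.col1 IS DISTINCT FROM b.col1
                OR a.col2 IS DISTINCT FROM b.col2);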

          >> ...but creating a representative sample of differences is difficult to do with hand-written SQL.

          > I'm not sure what you mean.

          Let's say column "a" has completely different values in the two tables. Column "b" has differences in 0.1% of rows. If we limit the diff output to 100 rows in a straightforward way, most likely it won't capture differences in column "b" at all. So you'll need to fix column "a" first or remove it from the diff, then you'll discover that "b" has problems too, then the issues with some column "c" may become visible. With the right sampling you'll get 20 examples for each of "a", "b", "c", and whatever else is different, and can fix the ETL in one go. Or maybe it'll help uncover the root issue beneath those discrepancies faster. One may very well do without this feature, but it saves time and effort.

          Generally that's what the tool does: it saves the time and effort of writing one-off SQL, or building more generic scripts that do codegen, or building a set of scripts that handle edge cases, have a nice UX and can be used by teammates, or putting a GUI on top of all that.

          > Solutions are roundtripping if the amount of data is not overwhelming, checksumming otherwise

          Agree, checksumming or sampling again. Since data is often expected to have small amounts of discrepancies, checksums would need to be done on blocks, to be able to drill down into failed blocks and see which rows caused the failures.
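
          E.g. something like this (PostgreSQL syntax, integer PK), run on both sides; blocks whose checksums disagree then get drilled into row by row:

            SELECT id / 100000 AS block,
                   COUNT(*) AS rows_in_block,
                   MD5(STRING_AGG(MD5(t::text), '' ORDER BY id)) AS block_checksum
            FROM t1 t
            GROUP BY 1
            ORDER BY 1;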

          > Then you've got a software quality problem. It's also not difficult (with joins against primary keys) to efficiently pick up the difference and to make things idempotent - just re-run the script. I've done that too.

          Agree, but software quality problems are bound to happen: requirements change, and what was once a perfect architecture becomes a pain point. Fixing some piece of the system often requires creating a new one from scratch and running both side by side during a transitional period. Some companies begin to realize that they don't have enough visibility into their data migration processes; some understand the need for tooling but can't afford to spend the time or resources to build it in-house, especially if they'll need it only during a migration. If we can help companies by providing a ready-to-go solution, that's a great thing.

          > this leads to a very interesting question about upskilling employees versus the cost of tools

          Right, it's a complicated question involving multiple tradeoffs...

          • throwaway_pdp09 1356 days ago
            I'll print this out, it deserves careful reading which I'll do shortly, so I'll reply briefly on one point.

            I've worked with data where a small amount of data loss was tolerable and indeed inevitable (once in the system it was entirely reliable but incoming data was sometimes junk and had to be discarded, and that was ok if occasional) so I understand.

            But a third-party tool that can be configured to tolerate a small amount of data loss could put you at a legal disadvantage if things go wrong, even if it is within the spec given for your product. If you have a very small amount of lost data then checksumming and patching the holes when transferring data might be a very good idea, legally speaking, not data-speaking, and low overhead too.

            Also, you just might be assuming that lost data is uncorrelated, i.e. scattered randomly throughout a table. Depending on where it's lost during the transfer, say some network outage, it may be lost in coherent chunks. That might, or might not, matter.

  • ishcheklein 1356 days ago
    Hey! Looks great! Is there an example of the GitHub integration - what does it look like?

    I'm one of the developers and maintainers of the DVC project, and we recently released CML.dev, which integrates with GitHub and can be used to run some checks on data as well. But in our case it's about analyzing files, more or less. I'm curious what that integration looks like in your case.

    • hichkaker 1356 days ago
      Thanks! We are going to add an example to our website and perhaps publish it as a GitHub App.

      The flow is: for every whitelisted (say, SQL) file changed by a pull request, our bot compares the production data (produced by the master branch) vs. the data output of the new branch, and posts the diff as a comment on the PR. Some YAML configuration is needed to, for example, adjust it to your repo file structure.

      • ishcheklein 1356 days ago
        Thanks! And how and where do you configure which database to use to run the query for a specific SQL file? Also, if it's part of some pipeline, will it have to run the whole pipeline from the very beginning?
        • hichkaker 1356 days ago
          So there are two main workflows:

          1) Run a diff as part of CI (on code change); that can be done via a GitHub workflow (or other CI). The GitHub workflow takes in a YAML config file and is also customizable so you can adapt it to your needs. For example, one of our customers names SQL files after the tables they populate, so when the bot sees a changed SQL file, it knows which table to compare.

          2) If you are looking to run a diff within the data pipeline itself (e.g. in Airflow), you can wrap Datafold API in an Airflow operator to integrate it as a task in your DAG.

          I hope that I covered your questions but please let me know if you are interested in specific use cases!

  • hodgesrm 1356 days ago
    It is always good to see new approaches to testing but I don't see how this one is going to work. I've worked at multiple database companies. Diff'ing data is one of the weakest and most cumbersome ways to verify correctness.

    Diffs are relatively slow, when they fail you get a blizzard of errors, and the oracles (i.e. the "good" output) have to be updated constantly as the product changes. Plus I don't see how this helps with schema migration or performance issues, which are major problems in data management. And don't get me started on handling things like dates, which change constantly, hence break diffs.

    If you really care about correctness it's better to use approaches like having focused test cases that check specific predicates on data. They can run blindingly fast and give you actionable data about regressions. They're also a pain to code but are most productive in the long run.

    • hichkaker 1355 days ago
      Thank you for sharing!

      I assume we are talking about analytical, not transactional data:

      > Diff'ing data is one of the weakest and most cumbersome ways to verify correctness.

      It depends on the use case: if the goal is to assess the impact of a change in source code on the resulting dataset (extremely common in the ETL dev workflow, in my experience), then isn't diff the natural solution? Of course, it depends on how the results are presented. A row-by-row output for a billion-row dataset is useless. That's why we provide diff stats across columns/rows and data distribution comparisons, while allowing the user to see a value-level diff if needed.

      > Diffs are relatively slow

      In general – yes, that's why we've implemented configurable sampling. In the majority of cases, the developer is looking to assess the magnitude of the difference and certain patterns, for which you don't need a large sample size. Our customers typically use ~1/10000 of the target table row count as a sample size.

      > when they fail you get a blizzard of errors

      We try to fail gracefully :)

      > I don't see how this helps with schema migration or performance issues.

      For schema migration, you can verify whether anything has changed in your dataset besides the intended schema changes (which certainly happened on my watch).

      > or performance issues

      We certainly don't claim to solve all DBA issues with diff, but here's an actual real example from our customer: they are optimizing their ETL jobs in BigQuery to lower GCP bill by reducing query runtime. After refactoring the code, they diff production vs. new code output to ensure that the data produced hasn't been affected.

      > If you really care about correctness it's better to use approaches like having focused test cases that check specific predicates on data.

      Possibly, but

      > They're also a pain to code

      ...which is often a prohibitive pain point if you have 300+ analytical datasets with 50+ columns each (a common layout for companies of 250+ people).

      And another problem: the more test cases, the more failures on every run. Unlike unit testing of app code, you can't expect the cases to stay relevant since the data is changing constantly, so those "unit testing" test suites require constant maintenance, and as soon as you stop keeping them up to date, their value drops to 0.

      I think that diffing and "unit testing" are complementary approaches and neither one is a panacea. So my recommendation has been to use both: 1) specific test cases to validate the most important assumptions about the data, and 2) a diff tool for regression testing.

      • hodgesrm 1354 days ago
        I'm unconvinced your approach works beyond a narrow range of use cases. The weakness is the "is this a problem" issue. You have a diff. Is it really significant? If it's significant, how did it arise? You can spend an inordinate amount of time answering those two questions, and you may have to do it again with every run. Diffs are cheap to implement but costly to use over time. That inversion of costs means users may end up bogged down maintaining the existing mechanism and unable to invest in other approaches.

        If I were going after the same problem I would try to do a couple of things.

        1. Reframe the QA problem to make it smaller. Reducing the number and size of pipelines is a good start. That has a bunch of knock-on benefits beyond correctness.

        2. Look at data cleaning technologies. QA on datasets is a variation on this problem. For example if you can develop predicates that check for common safety conditions on data like detecting bad addresses or SSANs you give users immediately usable quality information. There's a lot more you can do here.

        Assuming you are working on this project, I wish you good luck. You can contact me at rhodges at altinity dot com if you want to discuss further. I've been dealing with QA problems on data for a long time.

        • hodgesrm 1354 days ago
          p.s., to expand on #2: if you can "discover" useful safety conditions on data, you change the economics of testing, much as #1 does.
  • stephane-klein 1356 days ago
    At the moment I use https://github.com/djrobstep/migra to make PostgreSQL diffs.
    • random_savv 1356 days ago
      I also use migra - and love it. Migra only checks your schemas though, not your data.
      • simonebrunozzi 1356 days ago
        From what I recall you are correct. But I also vaguely remember that there was another little open source tool that integrated with Migra to check data as well? Or was it a custom-made script?

        Sorry, can't find it, I wish my comment could be more useful.

  • itpragmatik 1356 days ago
    1) Will it allow me to diff AWS RDS Aurora/MySQL serverA database schemaA against AWS RDS Aurora/MySQL serverB database schemaB? 2) Are there APIs to initiate and view/parse the diffs you generate, or is it all through the UI?
    • hichkaker 1356 days ago
      Yes, you can diff two physically distinct databases. MySQL support is in progress.

      There is a REST API that accepts the same parameters as the UI and returns the diff results in JSON and, optionally, materializes tables with the row-level differences in your DB.

  • igorludi 1356 days ago
    To whom it may concern, we have written a paper on the subject (I'm not affiliated with Datafold): https://www.researchgate.net/publication/323563341_A_Time-co...

    The paper describes the original algorithm with examples.

  • MaxwellM 1356 days ago
    Much needed! Analysts struggle with this all the time - trying to explain why an analysis is off and having to manually debug every column in a new database
  • kveykva 1355 days ago
    In case you're unaware, your logo unfortunately looks a bit uncannily similar to that of https://www.sigmacomputing.com/, given you are both in a similar broader product category! I actually thought your logo throughout the site was a reference to an integration with Sigma at first.
    • hichkaker 1355 days ago
      Thank you for pointing it out - I'd never noticed, but now that you've brought it up, we'll do a few more iterations on our design to avoid the collision.
  • greg 1356 days ago
    I got the chance to play with Datafold and I would have loved to have had it when I was working for Facebook on data pipelines.
  • nicoburns 1356 days ago
    I've recently discovered, and highly recommend, Daff [0]. It's an open source tool that can diff CSVs and SQLite database tables. The tabular diff format is fantastic.

    [0]: https://github.com/paulfitz/daff

  • brainless 1356 days ago
    Hey Gleb, congrats on the launch. This is an interesting tool.

    As the founder of a product in the space of tools for data-driven companies, I wanted to ask -

    Is your product aimed entirely at data engineers? The description seemed very technical, and the problem one that mostly very large companies would have. Did I understand correctly?

    • hichkaker 1356 days ago
      Thank you and that's a great question.

      It largely depends on how critical the data pipelines are to the business. We've seen big companies suffer the most from issues with their data quality. At the same time, we've had early-stage teams using our diff tool because their core product depends on the data they ingest from third-party vendors and performing regression testing of that data and the transformation code has been taking them a lot of time.

      The product in its current form could be useful to anyone developing data transformations (which is what data engineers typically do full time), and we are working on expanding it to help data consumers (analysts, PMs, etc.) have higher confidence in the quality of datasets and metrics they rely on.

  • chaps 1356 days ago
    Can this be used locally for datasets with non-transferable PII? Thinking about this for non-profit work.
    • hichkaker 1356 days ago
      Yes, we have on-prem AWS/GCP and K8s-based deployments available. We'd be very happy to support your work - just shoot me an email at hn@datafold.com.
  • gregwebs 1356 days ago
    Great tool! I am only interested in running such a tool locally (on-prem). This avoids security/privacy issues and data transfer time/cost issues.

    A good model for me would be a 30-day free license to get it integrated into our workflows.

    • hichkaker 1355 days ago
      Totally fair! We offer free trials for on-prem. Please ping us at hn@datafold.com to set it up.
  • jjirsa 1356 days ago
    For Cassandra: https://github.com/apache/cassandra-diff

    (Designed for correctness testing at petabyte scale)

  • sbr464 1356 days ago
    Can you import/work with .bcp files? (Microsoft SQL server bulk export files). For example, diffing 2 bcp files, even if you need to import and set them up as databases again.
    • hichkaker 1356 days ago
      If you can materialize those .bcp files as tables, then yes, absolutely.
  • lihan 1356 days ago
    How does it work behind the scenes? Does it simply sample a portion of the data and then do the diff? What if I need 100% accuracy?
    • hichkaker 1355 days ago
      If diffing datasets within the same physical database, we generate SQL, execute it in the database, and analyze and render the results.

      If diffing datasets across physically different databases, e.g. PostgreSQL <> Snowflake or two distinct MySQL servers, we pull the data into our engine from both sources, diff it, and show the results.

      Sampling is optional but helpful to keep compute costs low for large Mill/Bill/Trill-row datasets.

  • samblr 1356 days ago
    Congrats on the launch.

    How does it compare to enterprise ETL tools like Informatica and Talend - is it not possible to do this within them?

    • forrestb 1356 days ago
      At best, Informatica could output the diff of two tables to another table. But the paradigm is very limiting. I'm not sure if you can dynamically change the mappings every time you feed it a new table (different primary keys, columns, data types), and you'd still have to browse the results in SQL.
      • samblr 1356 days ago
        Thank you - not to undermine the work done in Datafold, but a visual schema diff is a small improvement next to giants like Informatica, which can do pretty much anything in the database space.

        It can be well argued that YC is taking in companies without realising what exists out there in the enterprise world. There are quite a few no-code/API-stitching tools which launched recently as well. And those at best resemble a poor imitation of powerful BI tools that have existed for almost a decade.

        • hichkaker 1356 days ago
          Those tools are definitely vastly powerful. Have you used either of them?

          TBH, I haven't, but judging from our current post-Informatica users and by reading questions on the official Informatica/Talend user forums, I concluded that the diffing problem (to be specific – not only schema, but data diffing) is not directly addressed by them (the answers are in the realm of "there is no diff feature but you can write SQL...").

          In general, we see data stacks becoming increasingly modularized and tools more specialized. For example, there are at least 20x more teams using OSS like Airflow/Luigi/Dagster for managing their data pipelines (and 2-5 other tools for the rest of the workflow) than using the end-to-end platforms you mentioned. We see Datafold as a regression testing tool in a modular stack.

          • samblr 1356 days ago
            Thank you for your reply.

            I used Talend in great depth 3 years ago, but I didn't have the use case of schema diff at the time. For data diff you can easily define a workflow, and I have to admit these workflows are crazy powerful and can even help fix the data with any transformation required (no-code or code).

            However, I'm seeing the use case for a lightweight tool with a visual aspect. I like this. But whether this problem is big enough for VC investment is the question. I see schema diff could be just a plugin in one of the existing database tools. And if you are getting into data diff, you've got to see what those tools do too.

            • hichkaker 1356 days ago
              > But whether this problem is big enough for VC investment is the question.

              That's a great question. Thinking about where problems arise in data pipelines, there are fundamentally two moving pieces: 1) Your data – you're continuously getting new data without a real ability to enforce your assumptions on its schema or shape. 2) Your code for ingestion and transformation that needs to evolve with the business and to adapt to changes in other parts of the infra.

              Datafold's Diff tool currently mostly addresses #2. It can add value to any company that runs ETL pipelines, but it is most impactful for large data engineering teams (a similar story to CI or automated testing tools).

              Regarding #1, wouldn't it be useful if we tracked ALL your datasets across time and alerted you on anomalies in those datasets? And I am not talking about rigid "unit" tests e.g. X <= value < Y, but actual stats-based anomaly detection, akin to what Uber does: https://eng.uber.com/monitoring-data-quality-at-scale/

              So, with diff, we already compute and store detailed statistical profiles on every column in the table. Next, we are going to track those profiles across time.
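
              In plain SQL terms, a per-column profile is roughly the following (illustrative table/column names), computed per run and tracked as a time series:

                SELECT
                  COUNT(*)                 AS row_count,
                  COUNT(*) - COUNT(amount) AS null_count,
                  COUNT(DISTINCT amount)   AS distinct_count,
                  MIN(amount)              AS min_value,
                  MAX(amount)              AS max_value,
                  AVG(amount)              AS mean_value
                FROM analytics.orders;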

              Diff is just the first tool we've built to get a wedge into the workflows of high-velocity data teams and start adding value, but it's just the beginning of a more comprehensive and, hopefully, valuable product we aspire to deliver.

              • samblr 1356 days ago
                Much appreciate your response
  • blunt69 1356 days ago
    Will there be integration with Oracle DBs?
    • _ZeD_ 1356 days ago
      This
      • hichkaker 1356 days ago
        Yes, we can add support for Oracle quite soon if you are interested.
  • chwolfe 1356 days ago
    Very cool, nice work! SQL Server / Azure SQL support available or on the roadmap?
    • hichkaker 1356 days ago
      Thank you!

      We are [almost] DB-agnostic (as long as there is a SQL interface and JOINs are supported) and have been adding support for particular DBs based on interest from our users. SQL Server & Azure SQL are on the roadmap - please let us know if you have a use case in mind to pilot Datafold on them.

    • atwebb 1356 days ago
      Have you tried RedGate SQL Compare?
  • FlashBlaze 1356 days ago
    It is interesting to see Ant Design used outside of my personal projects.
  • husseiny 1356 days ago
    Very cool! Does it work with MS SQL in our own DC?
    • hichkaker 1355 days ago
      Support for MS SQL is in progress. We can deploy in your DC using Docker or K8s.
  • greenie_beans 1356 days ago
    love it: fucked up, then made a startup with a solution so you'll never make that mistake again.