  • maxxxxx 2030 days ago
    Is this generally a good idea? We have data in multiple SQL databases and one of our departments wants to run analytics on the data. My plan was to export the data and then import it into a big consolidated SQL database. But our IT department talked them into using their Hadoop instance so now we export from SQL to JSON which then gets moved to Hadoop and queried with SQL and Hive. I still don't see the advantage of this over using a SQL database. Queries that should run in milliseconds take minutes in Hive. The data size is only a few terabytes so I wouldn't call it "big data".

    Is there any advantage in using Hadoop for such a use case?

    • wenc 2030 days ago
      Possibly not, as alluded to by other comments.

      Don't jump on the SQL-on-Hadoop bandwagon without understanding the trade-offs. Here are some:

      - If you use Parquet as your storage format, you lose indexes (ORC does support indexes, though). If you use JSON, you lose a bunch more, including predicate pushdown, which is critical for WHERE clauses.

      - With schema-on-read, you lose a bunch of potential performance optimizations that come with schema-on-write.

      - Joins on a distributed database are much more difficult and usually much less performant than on a single node.

      - Depending on the SQL-on-Hadoop system you choose, you stand to lose a ton of ability to perform complex analytic queries with pivots, windowing functions, etc. which are commonplace in most traditional SQL databases.

      - You also lose the ability to rewrite data easily -- UPDATES and upserts are much more difficult (HDFS is an append-only file system, and is best suited for storing completely immutable data; its optimal mode is write-once read-many).

      - HDFS also doesn't like small files, so you either have to buffer your writes so you can write nice, large 128 MB blocks, or have a nightly job to do small-file compaction (a sketch of such a job follows this list). For enterprise data, small rewrites are often needed to incorporate corrections, backfills, etc. due to data arriving late or errors that need to be fixed, and Hadoop does not handle this case well at all.
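
      To make the small-file and predicate-pushdown points concrete, here's a minimal PySpark sketch (paths and column names are hypothetical, not from the article) of the kind of compaction job you end up writing: coalesce a directory of small files into a few large Parquet files, then rely on column pruning and predicate pushdown when reading them back.

        from pyspark.sql import SparkSession
        from pyspark.sql import functions as F

        spark = SparkSession.builder.appName("compaction").getOrCreate()

        # Nightly compaction: read a day's worth of small landing files and
        # rewrite them as a handful of large Parquet files.
        small = spark.read.json("hdfs:///landing/orders/2018-06-01/")
        (small
            .coalesce(8)  # fewer, larger output files
            .write
            .mode("overwrite")
            .parquet("hdfs:///warehouse/orders/dt=2018-06-01/"))

        # Reading Parquet gets column pruning and predicate pushdown;
        # the same filter over raw JSON would scan every byte.
        orders = spark.read.parquet("hdfs:///warehouse/orders/")
        orders.filter(F.col("status") == "SHIPPED").select("order_id", "total").show()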

      So as you can see, traditional SQL databases, if your data fits in them, present you with a tightly coupled, optimized environment that distributed databases are often not able to match. They also support easy replication (real-time replication can be achieved via CDC).

      However, if the scale of your data is beyond what traditional SQL databases are able to handle (and this differs from database to database; some MPP SQL databases can handle large datasets just fine), and if you need redundancy, SQL-on-Hadoop solutions may make sense.

      One advantage SQL-on-Hadoop (via Spark) offers is the ability to resume jobs in a long data transformation pipeline. If large SQL queries fail, they typically return nothing. With Spark, if you cache the intermediate results, you can carry on where you left off.
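
      As a rough illustration (table and path names are invented), you can persist and materialize an intermediate DataFrame so a failure in a later stage doesn't force a recompute from the raw source:

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("pipeline").getOrCreate()

        # Expensive first stage: cache it and write it out so later stages
        # can restart from here rather than from the beginning.
        stage1 = (spark.read.parquet("hdfs:///warehouse/events/")
                  .groupBy("user_id")
                  .count()
                  .withColumnRenamed("count", "n_events"))
        stage1.persist()
        stage1.write.mode("overwrite").parquet("hdfs:///tmp/pipeline/stage1/")

        # Later stages read the materialized intermediate; if they fail,
        # you rerun only from this point.
        stage2 = spark.read.parquet("hdfs:///tmp/pipeline/stage1/")
        stage2.filter("n_events > 100").write.mode("overwrite").parquet("hdfs:///tmp/pipeline/stage2/")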

    • psds2 2030 days ago
      It sounds like maybe they should be using Sqoop to import directly into HDFS. It also sounds like they may not be exploiting parallelism properly. Depending on the queries, sometimes it's easier to create multiple tables in HBase from the SQL import and have your HBase schemas correlate to the most common analytics queries.
    • Triffids 2030 days ago
      Generally a good idea; analytical queries will be faster than on an RDBMS. Your problem is the choice of tools and format: Hive = MapReduce, JSON = a full scan over all the data. You need to replace MapReduce (Hive) with Impala or Spark 2.x, and JSON with a columnar format: Parquet or ORC.
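
      A minimal sketch of that conversion in Spark 2.x (paths and table names are placeholders): read the JSON export once, save it as a columnar table, and run SQL on it from then on.

        from pyspark.sql import SparkSession

        spark = (SparkSession.builder
                 .appName("json_to_columnar")
                 .enableHiveSupport()
                 .getOrCreate())

        # One-time conversion: JSON export -> columnar table
        # (Spark SQL's default data source is Parquet).
        spark.read.json("hdfs:///exports/sales/") \
             .write.mode("overwrite") \
             .saveAsTable("sales")

        # Queries now run on Spark SQL over columnar data instead of
        # Hive-on-MapReduce over JSON.
        spark.sql("SELECT region, SUM(amount) FROM sales GROUP BY region").show()
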
    • voycey 2030 days ago
      Check out Presto. It's a Facebook open source product designed exactly for your use case; it doesn't use MapReduce, can read Hive catalogs directly, and is super fast. We use ORC as our file format, but Parquet would work as well.
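
      For example, with the presto-python-client (host, user, and table names here are placeholders), queries go straight against tables already registered in the Hive metastore:

        import prestodb  # presto-python-client

        conn = prestodb.dbapi.connect(
            host="presto-coordinator.example.com",
            port=8080,
            user="analyst",
            catalog="hive",    # the existing Hive metastore tables
            schema="default",
        )
        cur = conn.cursor()
        cur.execute("SELECT region, count(*) FROM orders GROUP BY region")
        print(cur.fetchall())
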
    • kwillets 2030 days ago
      >Is there any advantage in using Hadoop for such a use case?

      Longer coffee breaks.

    • glogla 2030 days ago
      Depends on the data size. It's definitely not worth it for gigabytes, and for petabytes, there's not much else. Terabytes are in the middle.
      • maxxxxx 2030 days ago
        Aren't there petabyte-scale SQL databases in existence? You can't use Hadoop for transactional data, so I suppose that data still goes into SQL.
        • Triffids 2030 days ago
          It depends. There's Cloudera's Kudu engine, which can quite easily handle millions of atomic inserts per second, but it can't support true ACID transactions. On the other hand, SQL databases also can't use true ACID in high-load systems: serializable transactions get replaced by read-committed transactions, a "banking day", queues + background processes, etc.
    • js8 2030 days ago
      I agree, it would be nice if they compared e.g. PostgreSQL to the setup.
      • mmt 2029 days ago
        It does seem like a glaring omission, especially at the modest database sizes they used.

        Even something like PG with the Citus extension to compare parallelism would have been relevant.

        EDIT: Although another comment suggests that the data for the paper could have been gathered as long ago as 2016, which could mean Citus was not yet open source.

  • DandyDev 2031 days ago
    A pity that the paper doesn't include Hive LLAP in the comparison. With LLAP enabled, Hive becomes more like Impala and Drill, with persistent daemon processes providing much shorter response times to queries.

    I'm curious how Hive LLAP does, compared to Impala and Drill.

    • wenc 2030 days ago
      I found that a curious omission too -- it's hardly possible to have a meaningful study of SQL-on-Hadoop interactive analytics in 2017 (conservatively assuming there is a 1-year lag between the study and the paper) without including Hive LLAP and Presto.

      But having been part of the academic machine, I kind of understand. These things happen.

      EDIT: I notice that a short paper was submitted to IEEE in 2017, so the study was probably done in 2016, which may explain the omission.

    • gorbachev 2030 days ago
      There are better comparisons put out by vendors who are building custom solutions on top of the open source Hadoop SQL engines, and by Hortonworks/Cloudera as well.

      Hortonworks has been pushing their latest Hive LLAP benchmarks pretty hard. They're doing a kind of a Hive3 "roadshow" at the moment as well.

  • billman 2031 days ago
    I would have liked to see Presto in the analysis.
    • disgruntledphd2 2030 days ago
      I am willing to bet that Presto would have done better than all of the others.

      I was also really surprised that neither Hive nor Presto were included. Clearly the author is biased against FB originated projects \s.

  • c789a123 2030 days ago
    My comments (I'm experienced in using Spark, so could be biased): 1. As the worker machines have 16 vCPUs and 122 GB of RAM, setting spark.executor.memory to 8 GB seems too small; I'd be interested to see how it performs with 32 GB per 8 cores. 2. CDH lags behind Spark releases; the newer Spark 2.3 can be used together with CDH Hadoop for testing. 3. In Table 4, Spark is the only system without failures, which confirms it is a very robust system. 4. It is a performance test, but were the results verified?
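
    For reference, a sketch of how that configuration might look (the 32 GB / 8 core values are the commenter's suggestion, not the paper's settings):

      from pyspark.sql import SparkSession

      spark = (SparkSession.builder
               .appName("benchmark")
               .config("spark.executor.memory", "32g")   # instead of 8g
               .config("spark.executor.cores", "8")
               .getOrCreate())
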
  • sgt101 2030 days ago
    EC2 vs on prem is a big jump; does anyone have a view as to what this study means for on prem implementations?
    • dajohnson89 2030 days ago
      What else could it be, besides local machines?
      • sgt101 2030 days ago
        I expect that performance on prem will be rather different...