Ask HN: How to store billions of rows of time-series data?

I'm trying to build a system to store price data. According to my calculations, I'm going to be inserting about 20 million rows every day. This will quickly add up to billions of rows.
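
For scale, rough back-of-the-envelope arithmetic (the ~20 bytes per row is just a guess at the raw payload, before indexes or compression):

    rows_per_day = 20_000_000
    bytes_per_row = 20                      # guess: timestamp, two ids, two 32-bit prices
    rows_per_year = rows_per_day * 365      # ~7.3 billion rows per year
    gb_per_year = rows_per_year * bytes_per_row / 1e9
    print(rows_per_year, gb_per_year)       # ~7.3e9 rows, ~146 GB of raw data per year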

The queries will be very ad hoc, mostly things like histograms and grouping by certain attributes. It's also possible that some of the products will rarely or never be queried; I'm not sure if that makes a difference. Read performance is still more important.

Is something like TimescaleDB a good choice?

I would also like to not spend a fortune on this (a couple hundred a month at most). Is this even possible with that cost constraint?

17 points | by Kkoala 1481 days ago

10 comments

  • valyala 1479 days ago
    Take a look at VictoriaMetrics. It supports an OHLC rollup function (rollup_candlestick) via MetricsQL [1], which is backwards compatible with PromQL, the query language used in Prometheus. PromQL is optimized for typical queries over time series data [2].

    VictoriaMetrics works out of the box without the need to set up third-party systems or write complex configs. It is also very fast at both data ingestion and querying [3], [4].

    [1] https://github.com/VictoriaMetrics/VictoriaMetrics/wiki/Metr...

    [2] https://medium.com/@valyala/promql-tutorial-for-beginners-9a...

    [3] https://medium.com/@valyala/measuring-vertical-scalability-f...

    [4] https://medium.com/@valyala/billy-how-victoriametrics-deals-...
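
    As an illustration, a minimal sketch of pulling hourly candles over the Prometheus-compatible HTTP API (the address and the metric name price_usd are placeholders, not something from OP's setup):

      import requests

      # Query VictoriaMetrics' Prometheus-compatible range API.
      # rollup_candlestick() is a MetricsQL function returning open/high/low/close
      # series for each window.
      resp = requests.get(
          "http://localhost:8428/api/v1/query_range",
          params={
              "query": "rollup_candlestick(price_usd[1h])",
              "start": "2020-01-01T00:00:00Z",
              "end": "2020-01-02T00:00:00Z",
              "step": "1h",
          },
      )
      resp.raise_for_status()
      for series in resp.json()["data"]["result"]:
          print(series["metric"], series["values"][:2])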

  • nknealk 1480 days ago
    We use Kinesis Firehose and Athena as a SQL API for customer-interaction analytics on our digital properties (among other things). Data sizes are upwards of 10M rows per day. Athena will blow through months of data pretty quickly at very reasonable cost.

    Some gotchas: AWS lock-in, especially since S3 is your storage back end. Kinesis Firehose writes files by insert date, so you may need ETLs to partition by some other time dimension inherent to the data. This gets tricky if records come in really late, and you may have to decide to drop records older than N days to make such an ETL work at reasonable cost.

    Some pros: easy to POC, a purely variable-cost architecture, and serverless (also a downside, because if any of the services go down so do you) with little administrative work. It can get expensive if you don't think through partitioning for analytics, but Kinesis Firehose can be configured to write Parquet to S3 natively, allowing you to use existing Spark/Hive/whatever if your org has that as its SQL layer.
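
    To give a flavour of the partitioning point, a sketch of a partition-pruned Athena query via boto3 (database, table, column and bucket names are all made up):

      import boto3

      athena = boto3.client("athena", region_name="us-east-1")

      # Filtering on the partition column (here `dt`) keeps the bytes scanned
      # (and therefore the Athena bill) down.
      resp = athena.start_query_execution(
          QueryString="""
              SELECT product_id, approx_percentile(price, 0.5) AS median_price
              FROM prices
              WHERE dt BETWEEN '2020-01-01' AND '2020-01-31'
              GROUP BY product_id
          """,
          QueryExecutionContext={"Database": "analytics"},
          ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
      )
      print(resp["QueryExecutionId"])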

    Edit: come to think of it, there is another division in my company using BigQuery with whatever Google's streaming ingestion product is. They're achieving similar results at similar cost. I haven't personally done development in GCP, though, so I can't comment on it.

  • gwittel 1480 days ago
    A timeseries db may work fine. 20 million per day isn’t bad, unless you expect huge spikes in a short duration.

    What do your rows look like? What time ranges do you expect to query? Does your data arrive in a stream or large batches? This will help determine a rough IO and storage range needed (or at least worst case).

    At this scale, and aiming for low cost, there are trade-offs you can make that reduce cost and improve performance. I would try to keep it simple and test something like Timescale with dummy data first. Chunks of Parquet or ORC data queried with Presto/Athena may also work, depending on your workload.

    That aside, other things may make this even easier for you. Can you reduce the problem at all (e.g. reduce the number of rows you need)?

    For example:

    What is the minimum time period granularity you need for a single datapoint? Lower res = fewer rows.

    Can you aggregate upon ingest?

    For distributions can you use approximations (eg t-digest) or do you need exact values?
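
    On that last point, a minimal sketch of approximate percentiles with a t-digest (using the tdigest Python package as one option; not something OP mentioned):

      from tdigest import TDigest  # pip install tdigest

      digest = TDigest()
      for price in (100, 101, 99, 250, 98, 102):  # stand-in for a stream of prices
          digest.update(price)

      # Approximate percentiles without keeping every raw value around.
      print(digest.percentile(50), digest.percentile(99))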

    • Kkoala 1479 days ago
      Rows will be at least (id, other_id, price_1, price_2). Prices should be 32-bit integers, id should be a 64-bit integer or a string, and other_id can be a 32-bit int. id and other_id should reference other things. If normalization isn't possible, then each row would also carry a bunch of names and other metadata: at minimum 6 more strings, but more likely 10 or so.

      The data will be a stream basically.

      Granularity is quite important, but I think older data could be made less granular and turned into approximations.
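
      For reference, roughly what I have in mind as a table if I went the Postgres/Timescale route (just a sketch, column names provisional):

        import psycopg2

        conn = psycopg2.connect("dbname=prices")
        with conn, conn.cursor() as cur:
            # Roughly the row shape described above.
            cur.execute("""
                CREATE TABLE IF NOT EXISTS prices (
                    ts       timestamptz NOT NULL,
                    id       bigint      NOT NULL,  -- references a product elsewhere
                    other_id integer     NOT NULL,
                    price_1  integer,
                    price_2  integer
                );
            """)
            # TimescaleDB-specific: turn the table into a time-partitioned hypertable.
            cur.execute("SELECT create_hypertable('prices', 'ts', if_not_exists => TRUE);")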

  • seektable 1480 days ago
    Yandex ClickHouse is one more free/open-source solution that can be considered for this purpose. Setup/maintenance may be a bit tricky, but you'll be surprised how fast CH executes your ad-hoc queries (in SQL!) over billions of rows.
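
    A rough sketch of what such a table could look like, via the clickhouse-driver package (names are placeholders):

      from clickhouse_driver import Client  # pip install clickhouse-driver

      client = Client("localhost")
      # MergeTree is ClickHouse's workhorse engine; partitioning by month and
      # ordering by (id, ts) suits "group by attribute over a time range" queries.
      client.execute("""
          CREATE TABLE IF NOT EXISTS prices (
              ts       DateTime,
              id       UInt64,
              other_id UInt32,
              price_1  Int32,
              price_2  Int32
          )
          ENGINE = MergeTree()
          PARTITION BY toYYYYMM(ts)
          ORDER BY (id, ts)
      """)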
    • Kkoala 1479 days ago
      Interesting, haven't heard of ClickHouse before. I will check it out, thanks.
  • mfreed 1477 days ago
    (Timescale person here.)

    Many of our users store billions of rows and ingest tens of millions per day. Native compression will give you significant storage savings, plus very flexible indexing options to support various queries. And if some things are commonly queried, it's easy to build automated continuous rollups (e.g., rolling raw data up into OHLCV candlesticks).
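
    For instance, a continuous aggregate that rolls raw ticks up into hourly OHLCV candles can look roughly like this on recent TimescaleDB versions (table and column names are placeholders):

      import psycopg2

      conn = psycopg2.connect("dbname=prices")
      conn.autocommit = True  # continuous aggregates are created outside a transaction
      with conn.cursor() as cur:
          # time_bucket(), first() and last() are TimescaleDB functions;
          # count(*) stands in for volume here.
          cur.execute("""
              CREATE MATERIALIZED VIEW candles_1h
              WITH (timescaledb.continuous) AS
              SELECT time_bucket('1 hour', ts) AS bucket,
                     id,
                     first(price_1, ts) AS open,
                     max(price_1)       AS high,
                     min(price_1)       AS low,
                     last(price_1, ts)  AS close,
                     count(*)           AS volume
              FROM prices
              GROUP BY bucket, id;
          """)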

    You can deploy either self-managed or on Timescale Cloud (timescale.com/cloud), where you can find deployments in the $100s-per-month range.

    Cheers!

  • GauntletWizard 1480 days ago
    Yes, it sounds like what you want is a time-series DB. Prometheus can handle tens of millions of rows and billions of datapoints with ease, and its query language is designed for that kind of grouping via set arithmetic.
  • icsa 1481 days ago
    Try the 32-bit evaluation version of kdb+ (https://kx.com/). Twenty million rows of data per day is easy with kdb+ - even on a laptop.
    • anonu 1481 days ago
      That's not a great suggestion given the constraints. kdb+ costs way more than "a couple hundred per month". Also, a billion rows quickly falls outside the 32-bit addressable range, given that most of the calcs happen in memory.

      What's wrong with PostgreSQL in this context? OP doesn't mention any speed, latency, or bandwidth requirements...

      • icsa 1481 days ago
        I have built many billion+ row databases on a laptop with 16 GB RAM, using the 32-bit version. kdb+ uses memory mapping and can support databases much larger than available RAM.

        PostgreSQL is a fine choice and I use it for many projects. However, kdb+ is a much better choice when time-series analysis is required.

        I'm happy to provide more details if you're interested.

  • sman393 1480 days ago
    Depending on what your reporting needs are, I would definitely recommend TimescaleDB. I love Prometheus for storing server resource metrics or basic app metrics (counts/timers), but for more business-facing data, having the query functionality and visualization tooling of Postgres is a major win. I'm working on a similar use case to yours; we considered Cassandra, Timescale, and ClickHouse, and with our insert rates and reporting needs, we decided to start with TimescaleDB.
  • runT1ME 1480 days ago
    My current gig is using Scylla for storing time-series data; we're partitioning by day and inserting anywhere from 5-15 million rows per day.
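
    A sketch of that kind of day-partitioned layout in CQL, via the cassandra-driver package, which also speaks to Scylla (not our exact schema; one common variant partitions by (product, day) rather than day alone to keep partitions bounded):

      from cassandra.cluster import Cluster  # pip install cassandra-driver

      session = Cluster(["127.0.0.1"]).connect("prices_ks")
      # The composite partition key keeps each partition to one product-day,
      # with rows clustered by timestamp inside it.
      session.execute("""
          CREATE TABLE IF NOT EXISTS prices (
              product_id bigint,
              day        date,
              ts         timestamp,
              price      int,
              PRIMARY KEY ((product_id, day), ts)
          )
      """)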

    We've mostly been using Apache Beam/Cloud Dataflow for working with the data en masse for occasional reports, and we're planning on trying Presto soon as well. I'm interested in what solution you come up with; happy to chat more about this kind of thing on Twitter or by email.

  • jll29 1480 days ago
    Google BigQuery is your friend.
    • Kkoala 1479 days ago
      I have heard it's quite expensive. The storage itself seems quite cheap, but the queries cost too (billed per byte scanned, I believe), so that's where the costs come from.
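
      Something like a dry run would show the damage before running anything (a sketch with the google-cloud-bigquery client; project/dataset/table names are made up):

        from google.cloud import bigquery  # pip install google-cloud-bigquery

        client = bigquery.Client()
        job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
        # A dry run reports the bytes that *would* be scanned without billing anything.
        # The date filter only prunes data if the table is partitioned on ts.
        job = client.query(
            "SELECT product_id, APPROX_QUANTILES(price, 100)[OFFSET(50)] AS median "
            "FROM `my_project.prices.ticks` "
            "WHERE DATE(ts) BETWEEN '2020-01-01' AND '2020-01-31' "
            "GROUP BY product_id",
            job_config=job_config,
        )
        print(job.total_bytes_processed / 1e9, "GB would be scanned")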