The queries will be very ad hoc, mostly things like histograms and grouping by certain attributes. It's also possible that some of the products will rarely or never be queried; I'm not sure if that makes a difference. Read performance is still more important.
Is something like TimescaleDB a good choice?
I would also like not to spend a fortune on this (a couple hundred a month at most). Is that even possible with the cost constraint?
VictoriaMetrics works out of the box without the need to set up third-party systems or write complex configs. It is also very fast for both data ingestion and querying [3], [4]; a small usage sketch follows the links below.
[1] https://github.com/VictoriaMetrics/VictoriaMetrics/wiki/Metr...
[2] https://medium.com/@valyala/promql-tutorial-for-beginners-9a...
[3] https://medium.com/@valyala/measuring-vertical-scalability-f...
[4] https://medium.com/@valyala/billy-how-victoriametrics-deals-...
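For a feel of the API, here is a rough sketch of pushing a sample and running a range query against a single-node instance; the port (8428 is the default), metric name, and labels are just placeholders.

    # Minimal sketch: push one sample and run a range query against a
    # single-node VictoriaMetrics instance. Metric name and labels are
    # placeholders, not from the thread.
    import time
    import requests

    VM = "http://localhost:8428"

    # Ingest via the Influx line protocol endpoint; VM names the resulting
    # metric "{measurement}_{field}", i.e. product_views_count here.
    line = f"product_views,product_id=123 count=1 {int(time.time() * 1e9)}"
    requests.post(f"{VM}/write", data=line, timeout=5).raise_for_status()

    # Hourly totals per product over the last day, via PromQL/MetricsQL.
    resp = requests.get(
        f"{VM}/api/v1/query_range",
        params={
            "query": "sum(sum_over_time(product_views_count[1h])) by (product_id)",
            "start": int(time.time()) - 86400,
            "end": int(time.time()),
            "step": "1h",
        },
        timeout=10,
    )
    resp.raise_for_status()
    print(resp.json()["data"]["result"])

The Influx line protocol shown is only one of the supported ingestion paths; Prometheus remote write and other formats work as well.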
Some gotchas: AWS lock-in, especially since S3 is your storage backend. Kinesis Firehose writes files by insert date, so you may need ETLs to re-partition by some other time dimension inherent to the data. This gets tricky if records arrive really late, and you may have to decide to drop records older than N days to make such an ETL work at reasonable cost.
Some pros: easy to POC. Pure variable-cost architecture. Serverless (also a downside, because if any of the services go down, so do you) with little administrative work. It can get expensive if you don't think through partitioning for analytics, but Kinesis Firehose can be configured to write Parquet to S3 natively, allowing you to use existing Spark/Hive/whatever if your org has that as your SQL layer.
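As a concrete illustration of that re-partitioning ETL, something like the following is one starting point; it assumes Athena over the Firehose-written Parquet, and the database, table, bucket, and column names (including event_time) are made up.

    # Rough sketch of the re-partitioning ETL: take yesterday's Firehose
    # output (partitioned by ingest date) and rewrite it partitioned by the
    # event timestamp inside each record, using an Athena INSERT INTO.
    # Database, table, bucket, and column names are made up for illustration.
    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    query = """
    INSERT INTO analytics.events_by_event_date
    SELECT *, date_format(event_time, '%Y-%m-%d') AS event_date
    FROM analytics.events_raw
    WHERE ingest_date = date_format(current_date - interval '1' day, '%Y-%m-%d')
    """

    athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )

A scheduled job can run this daily; the WHERE clause is also where you would decide how far back to accept late-arriving records.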
Edit: come to think of it, there is another division in my company using BigQuery with whatever Google's streaming ingestion product is. They're achieving similar results at similar cost. I haven't personally done development in GCP, though, so I can't comment.
What do your rows look like? What time ranges do you expect to query? Does your data arrive in a stream or large batches? This will help determine a rough IO and storage range needed (or at least worst case).
At this scale, and aiming for low cost, there are some trade-offs that could help reduce costs and improve performance. I would try to keep it simple and test something like Timescale with dummy data first (as in the sketch below). Chunks of Parquet or ORC data and Presto/Athena may also work depending on your workload.
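For the dummy-data test, a sketch like this is one way to start; the connection string, schema, and row counts are placeholders, and it assumes the timescaledb extension is installed and preloaded.

    # Sketch of a dummy-data test against TimescaleDB. The connection string,
    # schema, and data volumes are placeholders.
    import psycopg2

    conn = psycopg2.connect("dbname=test user=postgres")
    conn.autocommit = True
    cur = conn.cursor()

    cur.execute("CREATE EXTENSION IF NOT EXISTS timescaledb;")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS readings (
            time       timestamptz      NOT NULL,
            product_id int              NOT NULL,
            value      double precision
        );
    """)
    cur.execute("SELECT create_hypertable('readings', 'time', if_not_exists => TRUE);")

    # ~8.6M dummy rows: one per second for 100 products over one day.
    cur.execute("""
        INSERT INTO readings
        SELECT t, p, random() * 100
        FROM generate_series(now() - interval '1 day', now(), interval '1 second') AS t,
             generate_series(1, 100) AS p;
    """)

    # The kind of ad hoc histogram / group-by query described above.
    cur.execute("""
        SELECT product_id, histogram(value, 0, 100, 10)
        FROM readings
        WHERE time > now() - interval '6 hours'
        GROUP BY product_id;
    """)
    print(cur.fetchmany(5))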
That aside, other things may make it even easier for you. Can you reduce the problem at all (e.g. reduce the number of rows you need)?
For example:
What is the minimum time period granularity you need for a single datapoint? Lower res = fewer rows.
Can you aggregate upon ingest?
For distributions, can you use approximations (e.g. t-digest, sketched below) or do you need exact values?
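If approximations are acceptable, one option is keeping a per-product t-digest at ingest time. A small sketch using the third-party tdigest package, with simulated product IDs and values:

    # Sketch of per-product percentile approximation with t-digests built at
    # ingest time, so raw rows can be downsampled or dropped later. Uses the
    # third-party "tdigest" package; product IDs and values are simulated.
    import random
    from tdigest import TDigest

    digests = {}  # product_id -> TDigest

    def ingest(product_id, value):
        digests.setdefault(product_id, TDigest()).update(value)

    # Simulated stream of datapoints.
    for _ in range(100_000):
        ingest(random.randint(1, 5), random.expovariate(1 / 50))

    # Approximate p50/p95/p99 per product without keeping every raw value.
    for pid, d in sorted(digests.items()):
        print(pid, round(d.percentile(50), 1), round(d.percentile(95), 1),
              round(d.percentile(99), 1))

Digests are small fixed-size summaries, so they can be kept for older time windows while the raw rows are downsampled or dropped.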
The data will basically be a stream.
Granularity is quite important, but I think older data could be made less granular and turned into approximations.
Many of our users store many billions of rows, adding tens of millions per day. Native compression will give you significant storage savings, plus very flexible indexing options to support various queries. And if some things are commonly queried, it's easy to build automated continuous rollups (e.g., a continuous rollup to OHLCV candlesticks; a sketch is below).
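For reference, here is a rough sketch of what such a rollup can look like, with compression enabled on the raw table. Syntax follows recent TimescaleDB releases; the trades schema is a placeholder and it assumes trades is already a hypertable.

    # Sketch of an hourly OHLCV continuous rollup plus native compression.
    # The "trades" table and its columns are placeholders, assumed to
    # already be a TimescaleDB hypertable.
    import psycopg2

    conn = psycopg2.connect("dbname=test user=postgres")
    conn.autocommit = True
    cur = conn.cursor()

    cur.execute("""
        CREATE MATERIALIZED VIEW ohlcv_1h
        WITH (timescaledb.continuous) AS
        SELECT time_bucket('1 hour', time) AS bucket,
               symbol,
               first(price, time) AS open,
               max(price)         AS high,
               min(price)         AS low,
               last(price, time)  AS close,
               sum(volume)        AS volume
        FROM trades
        GROUP BY bucket, symbol
        WITH NO DATA;
    """)
    # Keep the rollup refreshed automatically.
    cur.execute("""
        SELECT add_continuous_aggregate_policy('ohlcv_1h',
            start_offset => INTERVAL '1 day',
            end_offset   => INTERVAL '1 hour',
            schedule_interval => INTERVAL '1 hour');
    """)

    # Compress raw chunks older than a week to cut storage.
    cur.execute("""
        ALTER TABLE trades SET (
            timescaledb.compress,
            timescaledb.compress_segmentby = 'symbol');
    """)
    cur.execute("SELECT add_compression_policy('trades', INTERVAL '7 days');")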
You can deploy either self-managed or on Timescale Cloud (timescale.com/cloud), where you can find deployments within a few hundred dollars per month.
Cheers!
What's wrong with PostgreSQL in this context? OP doesn't mention any speed, latency, or bandwidth requirements...
PostgreSQL is a fine choice and I use it for many projects. However, kdb+ is a much better choice when time-series analysis is required.
I'm happy to provide more details if you're interested.
We've mostly been using Apache Beam/Cloud Dataflow for working with the data en masse for occasional reports, and we're planning on trying Presto soon as well. I'm interested in what solution you come up with; happy to chat more about this kind of thing on Twitter or email.