After four years of working together, we originally quit our jobs to set up a company focused on tech debt. We didn't manage to solve that problem, but we learned how important product analytics is for finding users, getting them to try your product, and understanding which features to focus on to have an impact.
However, when we installed product analytics, it bothered us that we needed to send our users' data to third parties. Exporting data from these tools costs $manyK a month, and it felt wrong from a privacy perspective. We designed PostHog to solve both problems.
We made PostHog to automatically capture every front-end click, removing the need to add track('event') calls - it has a toolbar for labelling important events after they're captured, so you spend less time fixing your tracking. You can also push events explicitly.
You get API/SQL access to the underlying data, plus analytics - funnels and event trends with segmentation based on event properties (like UTM tags). That gives you the best parts of the third-party analytics providers while being more privacy- and developer-friendly.
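Pushing an event from your own backend boils down to sending a small JSON body to the capture endpoint. A minimal sketch of what such a payload could look like - the field names here are assumptions for illustration, so check the repo docs for the exact schema your version expects:

```python
import json
from datetime import datetime, timezone

def build_capture_payload(api_key, distinct_id, event, properties=None):
    """Build an illustrative JSON body for one custom event.

    Field names are hypothetical -- verify against the PostHog docs."""
    return {
        "api_key": api_key,
        "event": event,
        "properties": {**(properties or {}), "distinct_id": distinct_id},
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

payload = build_capture_payload("phc_test", "user_42", "signed_up", {"plan": "free"})
print(json.dumps(payload, indent=2))
```

In practice you'd POST this with your HTTP client of choice, or just use one of the official libraries, which wrap exactly this kind of call.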
We're thinking of adding features around paths, retention, and pushing events to other tools (e.g. Slack or your CRM). We'd love to hear your feature requests.
We are platform and language agnostic, with a very simple setup. If you want Python/Ruby/Node, we give you a library; for anything else, there's an API. The repo has instructions for Heroku (one click!), Docker, or deploying from source.
We've launched this repo under the MIT license so any developer can use the tool. The goal is not to charge individual developers. We make money by charging a license fee for things like multiple users, user permissions, and integrations with other databases, and by providing a hosted version and support.
Give it a spin: https://github.com/posthog/posthog. Let us know what you think!
I am glad someone is tackling this problem.
A feature request (or perhaps an architectural direction): if you could put the backend behind GraphQL instead of Django+MySQL, there's potential for it to go fully serverless (frontend and backend) with JAMstack frameworks like Redwood.js [1] (backed by Apollo GraphQL) or using Cloudflare Workers [2].
Edit: Another question I have: is PostHog at 80% feature parity with Mixpanel / Amplitude / Heap already? If not, what do the timelines look like? (Asking since you're OSS, though it's understandable if you can't reveal that yet.) Maybe there needs to be a competitor-matrix page on the website?
[0] https://github.com/PostHog/posthog/commits/master?after=9ae6...
[1] https://redwoodjs.com/
[2] https://blog.cloudflare.com/jamstack-at-the-edge-how-we-buil...
- I'd stress how important feeling inspired by the idea was. Ian from Mattermost was really helpful, as were Dalton and the YC partners. Enjoying what we were working on probably tripled our speed.
- I'm meh technically, so we made sure Tim (CTO) could focus exclusively on development. We split things up pretty clearly to create the right environment: I did the design, product, website (Elementor/WP), and docs, while Aaron focussed on getting user feedback.
- We spent $1k on marketing to speed up user engagement early on, which helped shake out some bugs.
Will do a blog post if there's more interest in the journey.
We've already had requests from people to store events into different databases, but I hadn't considered doing it with graphql/JAM. That could be a really nice way of having the storage abstracted from the database.
In terms of feature parity, our goal is basically 100% parity. Anything you can do analytics wise in those tools you should be able to do in PostHog. We're going to try to keep up the same pace we've had for the last 4 weeks.
I'm the co-founder of a company that had a similar value proposition back in 2017. We got invited to the interview at the YC office but couldn't convince people, for a number of reasons:
1. GDPR was not a big deal back in 2017, so the idea of creating an open-source alternative was not attractive enough.
2. We were targeting companies that wanted to build their own data pipeline in the cloud, and cloud providers such as AWS were claiming that their products (specifically Kinesis & Redshift) made it dead easy to create such a pipeline. At first, we thought we were doing something complementary to the cloud providers, but we soon realized we were competing with them. Our potential customers would try to build such a pipeline on AWS thinking it would be simple, and AWS did make it easy to start. However, data enrichment and cost optimization get really tricky as your data grows, and our product was optimized for those workloads. AWS doesn't really need partners like us: we save customers money, but that loses AWS money in the long run. And the switching cost becomes more than just doubling your Redshift capacity once all your data lives in Redshift.
3. We're not native speakers, so we probably couldn't express ourselves well back then.
Time flies. We got into 500 Startups Batch 21 that year but had to pivot last year since we couldn't make enough money to build a sustainable business.
Shameless plug: right now, we provide the same feature set (segmentation, funnels, retention, and SQL) for different CDPs such as Segment, Snowplow, Firebase, and in-house solutions. You can think of it as Amplitude or Mixpanel, but on top of your data warehouse. We generate SQL queries and run them on your warehouse, just like a BI tool.
I would love to collaborate if you're open to partnerships, since we're now complementary to each other. :) You can see what the product looks like here: https://rakam.io/product
Also, thanks for making something really cool :)
They also loved the X-ray feature.
And here is a recently Show-HN'd OSS Metabase alternative: https://news.ycombinator.com/item?id=22347516
You might also like reading a recent news.yc discussion on BI tools: https://news.ycombinator.com/item?id=21513566
- (i) How to handle the request/response being essentially two events
- Should they be merged into one event?
- (ii) How to handle request/response bodies and HTTP headers?
- Passing the bodies as event properties does not seem intuitive, and it would be valuable to query them with XPath / JSONPath.
- HTTP headers could be passed as event properties.
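One way to resolve both questions is to fold each request/response pair into a single event, keeping the bodies as raw strings (so a query layer could later apply JSONPath/XPath to them) and flattening headers into prefixed properties. A sketch - all the names and shapes here are hypothetical, not anything PostHog defines:

```python
def merge_http_event(request, response):
    """Fold a request/response pair into one analytics event.

    Bodies stay as raw strings under dedicated properties; headers
    become flat, prefixed event properties. Illustrative only."""
    props = {
        "method": request["method"],
        "path": request["path"],
        "status": response["status"],
        "request_body": request.get("body", ""),
        "response_body": response.get("body", ""),
    }
    for k, v in request.get("headers", {}).items():
        props[f"req_header_{k.lower()}"] = v
    for k, v in response.get("headers", {}).items():
        props[f"res_header_{k.lower()}"] = v
    return {"event": "http_call", "properties": props}

merged = merge_http_event(
    {"method": "GET", "path": "/api/users", "headers": {"Accept": "application/json"}},
    {"status": 200, "body": "[]", "headers": {"Content-Type": "application/json"}},
)
print(merged["properties"]["status"])  # -> 200
```

The prefix convention keeps request and response headers from colliding while still leaving everything queryable as plain event properties.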
I think this can be a great business, we have funded startups following similar models like Gitlab, Mattermost, etc. Excited to keep funding more :)
Just because the code is open source doesn't mean you can't make money out of it.
https://www.google.com/search?q=post+hog
There are a bunch of great hosted tools out there across ETL, workflows, dashboards, etc. - think Fivetran, Segment, Matillion, Periscope. And then of course the warehouses like Snowflake, Redshift, etc.
But I think there are three issues with that stack, roughly like this (I've got to do some more thinking, would appreciate your input):
- Privacy: your customer data is flying around in all these different tools, and it's hard, if not impossible, to track your compliance.
- Cost: all these vendors charge in some way by data volume, MAUs, etc., so you get taxed multiple times for the same data stream. It all adds up.
- Control: your data is subject to pre-determined schemas, proprietary formats, black boxes, etc. That means mismatches for the same metric across different tools, and less flexibility to manipulate your data or pack up and go elsewhere.
I think there's a valid open source alternative for every layer of the stack:
Segment --> Rudder Labs, Snowplow
Matillion --> Airflow, dbt
Fivetran --> Stitch / Singer
Periscope, Looker, Tableau, etc. --> Metabase, Superset
Warehouses --> just yesterday I learned about materialize.io here on HN
And then add open source products like PostHog, that add additional value for very specific use cases (in this case product analytics).
Not arguing the value of the hosted products. They are amazing if you're just getting started. But there's a great open source "stack" available that will likely be more transparent, more flexible, and cheaper in the long term.
Would love your thoughts!
Shameless plug - we wrote a blog post on setting up an open source analytics stack with one of our deployments, highlighting these issues.
https://dev.to/rudderstack/how-to-build-an-open-source-analy...
PostHog looks awesome!! Congrats on your launch. Would love to collaborate and share notes.
I wouldn't take the self-deprecating language literally. It's probably a useful signal. It communicates "I'm a genuine community member sharing my work with the community" in a way that is hard for spammers to replicate.
They tend to have neither the resources nor the knowledge that there's a world out there beyond the usual big-corp model (FB, Google, ...): free to use, but you pay with everybody's data and consolidate their power.
I think the way forward would be something akin to CloudFormation with presets on your own server (DO / Scaleway / AWS / ...): something managed, over which you still have ownership and can unplug. My axis is particularly around marketing, but I suppose you could expand.
PS: I would add Redash to the list of BI tools.
I suppose Singer might be the closest thing from your list above, but still you'd have to build out a large amount of auth & end user tooling to get it to work.
Every B2B SaaS developer these days has to build in a ton of integrations. Even an affordable hosting service for this would work well (embedded integration where you don't require your customer to sign up or pay for a second product).
It's quickly becoming clear (GitLab/Mattermost/Sentry) that there are some great ways to build enormous companies like this though. And, that's of course assuming you want to build a huge company :)
At a personal level, we found that this kind of business is just more fun to build... making cool stuff in the open and if we do a good job, getting inbound interest from bigger companies that have developers who need to use our tech at scale.
I don't think it'd work for everything - if you are a tool that developers don't interact with much, then I'd imagine it's tougher to build a real community.
on the other hand, SaaS products are optimized to sell to business groups who want the hard parts taken care of for them and they perceive value in saving time/money thanks to the SaaS product. I've seen both situations (monetizing open source versus monetizing a SaaS product) first hand, and it's clear that open source companies can be at a bit of a disadvantage. If they DO monetize their users, it's usually via their own SaaS offering to augment the open source tool.
What are some of the selling points compared to more mature open-source solutions like Matomo? Also, isn't the enterprise version the opposite of your thesis? I.e. "it bothered us how we needed to send our users' data to 3rd parties", but then you provide a hosted version, which would do the same thing? How do you think about that?
The things you can do with PostHog that you can’t easily do with Matomo, are things like pulling up identifiable user event histories, or plotting trends in product usage over time.
The enterprise version is just a private repo we'd give you access to that's still self hosted. We can also provide hosted deployments of any version, but that's really just for people that can't set it up themselves... hosting it isn't our core focus.
Can you talk more about your tech stack?
As an aside - it seems like most of the analytics companies I've heard of went through YC: Mixpanel, Heap, Amplitude!
Or is this something you don't do?
Do you have a sample dataset to feed into our local environment or a demo environment to test out the UI? I'd love to poke around a bit before deploying to Heroku and setting it up on a site.
[0] https://127.0.0.1:8000/demo
Curious as to how deep you plan to go on the peripherals to product analytics - attaching additional attributes to users to group them (eg. Subscription level), getting a view into attribution channels for marketing strategy etc.
We don't aim to go "data science" deep with analytics, as we suspect you'd rather just integrate Metabase/Tableau/etc. We can see some cool ways to use it for attribution though - since you host it yourself, we don't need to charge you enormous fees if your MTUs are very big. We see lots of B2C companies using product analytics on the product but not the website, and struggling to track, say, UTM tags the whole way through.
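For the UTM case above, the first step is just pulling the tags off the landing URL so they can travel with the user's events as properties. A small sketch using only the standard library (the helper name is made up for illustration):

```python
from urllib.parse import urlparse, parse_qs

# The five standard UTM parameters used for campaign attribution.
UTM_KEYS = ("utm_source", "utm_medium", "utm_campaign", "utm_term", "utm_content")

def utm_properties(url):
    """Extract UTM tags from a landing URL so they can be attached
    as event properties and followed through the funnel."""
    qs = parse_qs(urlparse(url).query)
    return {k: qs[k][0] for k in UTM_KEYS if k in qs}

print(utm_properties("https://example.com/?utm_source=hn&utm_campaign=launch"))
# -> {'utm_source': 'hn', 'utm_campaign': 'launch'}
```

Attach the returned dict to the user's first event (or persist it on the user), and every later funnel step can be segmented by campaign.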
There are two "out there" areas that we're really interested in right now...
1) We're thinking of focussing more on precisely what a developer (not product, not marketing) needs, as we think there's an underserved and enormous group here. Imagine, while you're building something, running a command in your CLI and then opening a browser view with a good understanding of which pages/features are being used as you work. The point being: give developers user data so they know how to build for impact.
2) We also want to explore integrations with other platforms to push stuff to them. I can't stop refreshing our own product, so I think pushing an Action to Slack, for example, would be helpful and would get it into everyday workflows a bit more easily. We don't want to do too much here and kind of hope the community spot these kinds of things and run with them :)
What's your reaction to the above? I'd love to know if you had a specific pain point in mind
The challenge was tough enough for Heap, and PostHog is going to be at a huge disadvantage due to the lack of multi-tenancy. When you use Heap, your data is stored across Heap's entire cluster of machines. When you run a query, that query is run simultaneously against every single machine in Heap's cluster. Even though your data may take up something like 0.1% of the total disk space, when you run a query, 100% of the disk throughput of Heap's cluster goes to processing your query. It's not an overstatement to say this alone results in a >50x improvement to query performance.
I honestly think Heap wouldn't be possible without multi-tenancy. It's hard enough as is to get queries that process multiple terabytes of data to return in seconds when you have a fleet of dozens of i3s available. I'm not sure how you would do that with a fleet a tiny fraction of that size. If you're curious about Heap's infrastructure, Heap's CTO, Dan Robinson, has given a number of talks on how it works[0][1].
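The multi-tenant design described above is essentially scatter-gather: every query fans out to all shards in parallel and the partial results are merged. A toy illustration of the pattern, with in-memory lists standing in for machines (the data and function names are made up):

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-ins for machines: each shard holds a slice of all events.
shards = [
    [{"user": "a", "event": "click"}, {"user": "b", "event": "view"}],
    [{"user": "a", "event": "view"}],
    [{"user": "c", "event": "click"}, {"user": "a", "event": "click"}],
]

def count_on_shard(shard, user, event):
    """Each machine scans only its own slice of the data."""
    return sum(1 for e in shard if e["user"] == user and e["event"] == event)

def fanout_count(user, event):
    """Scatter the query to every shard in parallel, then merge."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda s: count_on_shard(s, user, event), shards)
    return sum(partials)

print(fanout_count("a", "click"))  # -> 2
```

The point of the comment above is that with multi-tenancy, one customer's query gets the parallelism of a cluster sized for everyone's data; a single-tenant deployment only ever has its own shards to fan out across.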
That's not to say that PostHog won't work for anyone. I previously tried (and failed) to start a company based on optimizing people's Postgres instances. One of the big takeaways I had was that no matter how you use it, Postgres will work completely fine as long as you have <5GB of data. I think if you have a modest amount of data, something like PostHog will work perfectly fine for you. Since the Postgres optimization business didn't work out, I wound up pivoting to freshpaint.io, which eliminates the need to set up event tracking for your analytics and marketing tools by automatically instrumenting your entire site. Since I started working on it, things have been going a lot better.
[0] https://www.youtube.com/watch?v=NVl9_6J1G60
[1] https://www.youtube.com/watch?v=iJLq3GV1Dyk
The nice thing about single-tenancy is that in reality lots of users have small enough datasets that scaling isn't a problem. Heap et al have to scale to all of their users combined (as you said, terabytes), we just have to scale to the biggest user. Postgres also allows you to get started very quickly and do lots of queries yourself.
In our docs we explain our thinking more. Postgres is great for the vast majority of use-cases, and we're working hard to optimise those queries. Once users get beyond Postgres, we have integrations with databases that can scale well across many hosts, and we provide services around this to help people size their servers correctly.
The hard part wasn't scaling the system to handle all users combined. The hard part was designing the system such that when an individual user runs a query, they would get their results back in a reasonable amount of time.
Having every user in a single cluster made this easier because an individual customer could make use of the compute power of a cluster that was sized to fit the data for everyone in it. In other words, if Heap doubled the number of customers, Heap would get twice as fast for everyone. That's not true for PostHog.
> Heap et al have to scale to all of their users combined (as you said, terabytes), we just have to scale to the biggest user.
A decent sized Heap customer had multiple terabytes of data with the largest being well beyond that. You're going to have to figure out how to scale PostHog to that point without the benefits of multi-tenancy.
> Once users get beyond Postgres, we have integrations with databases that can scale well across many hosts, and we provide services around this to help people size their servers correctly.
I think a cluster of servers that could churn through terabytes of data in seconds would be prohibitively expensive for any individual customer to purchase.
Surely you meant "5TB", not "5GB"?
I meant what I said. You can literally just set up a PG instance and it will work perfectly fine up to a few GB. At that point, you will probably start to see some slow queries due to bad query plans. All you need then is the basics of EXPLAIN ANALYZE and a few indexes. That will get you to ~100GB, at which point you'll have to deploy more serious optimizations like partitioning, denormalization, etc. Once you get to multi-TB Postgres instances, you'll have to look at ways to horizontally scale your DB. That can be done in the Postgres world with something like Citus, but you'd probably also want to look at non-Postgres-based alternatives.
Yes, if you are dealing with large databases, you need to learn about... dealing with large databases. 5GB is something a small laptop can handle.
I've been using PostHog with my app for about a week now, and so far the results have been good. Pretty straightforward to integrate with a Swift iOS app too!
In any case, will be looking closer at this. Looks very interesting. Thanks.
[1] https://imgur.com/a/wYxbKj4
Any idea what a moderately sized website (10k users per day, 500k events) would cost to run on Heroku?
We've seen 500k events work on the hobby dev dyno and database, which is $14/month. Depending on how much data history you want to keep, you could upgrade to standard-0 on Heroku or spin up an RDS instance, which is cheaper.
If you want to send events from your own backend to PostHog, there are instructions for Ruby/Python/Node/API in the docs[0].
[0] https://github.com/PostHog/posthog/wiki
Actual link: https://github.com/PostHog/posthog/tree/master/posthog
Will have to see where I can fit this in to a project.
Edit: Nevermind, just read your last paragraph :-)
edit: whoops didn’t read.