Creator here! Thanks for submitting - I hope people find this interesting. I’m excited to explore more use cases beyond listing and reformatting git log output! There’s a similar project called gitbase, but I had difficulty just spinning that up for ad-hoc CLI querying, so I created this project in the hope of a better UX for that type of use case. It uses SQLite virtual tables, which were something I had been meaning to play around with, and this seemed like a cool use of them!
Cool. It's taken me years to find an efficient (and most importantly, easily debuggable) way to transfer Git's history to SQL. Since you are looking for use cases, I'll share with you some of the ways that I'm leveraging SQL to help drive code analytics for Git.
Once Git's history is in an easy-to-query state, you can do some interesting things. For example, if you aggregate all the contributors in a Git repo, you can produce something like this:
which makes it very easy to identify project investment/commitment. In the above example, you can see that Microsoft is heavily invested in vscode, as a lot of the developers contributing to it have been doing so for more than 3 years. And if you aggregate contributions by file type, you can see how people are contributing to it as well. In the case of vscode, the contributions are mainly TypeScript.
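A contributor aggregation like the one described above can be sketched with plain SQL. The sketch below uses an in-memory SQLite database and a hypothetical flat `commits` table (`author_email`, `author_when`); the actual schema of any given tool will differ, and the sample rows are invented for illustration:

```python
import sqlite3

# Hypothetical schema: one row per commit, as a git-to-SQL tool might expose.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE commits (hash TEXT, author_email TEXT, author_when TEXT)")
conn.executemany(
    "INSERT INTO commits VALUES (?, ?, ?)",
    [
        ("a1", "alice@microsoft.com", "2017-01-10"),
        ("a2", "alice@microsoft.com", "2020-03-04"),
        ("b1", "bob@microsoft.com",   "2019-06-01"),
        ("c1", "carol@example.org",   "2020-02-15"),
    ],
)

# Per-contributor tenure: first commit, last commit, and the span in days.
# Long tenures across many contributors signal sustained investment.
rows = conn.execute("""
    SELECT author_email,
           MIN(author_when) AS first_commit,
           MAX(author_when) AS last_commit,
           CAST(julianday(MAX(author_when)) - julianday(MIN(author_when)) AS INT)
               AS tenure_days
    FROM commits
    GROUP BY author_email
    ORDER BY tenure_days DESC
""").fetchall()

for email, first, last, days in rows:
    print(email, first, last, days)
```

Grouping by the email's domain instead of the full address would give the per-company view described above.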
Here is another contributor example, which shows GitLab contributions:
What the above analytics show is that GitLab has a lot of contributors, and a lot of them are new (6 months or less), which makes sense since they were hiring aggressively not too long ago. I'm not sure if this is still the case with Covid-19, but that can easily be confirmed 6 months from now with the same chart.
Now for something more interesting, in my opinion: code review analytics.
It has taken a lot of research and development to get to this point, but once you can easily query Git, you can surface very interesting things by cross-referencing it with external systems, like GitHub's pull request system.
In the pull requests screenshot, I created a window that only considers open pull requests that were updated within the last 30 days. With this type of window, I can see what has changed across dozens, if not hundreds or thousands, of pull requests. For example, I can easily identify file collisions between pull requests, see when their last commits were, and so forth.
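The file-collision query described above can be sketched in SQL once PR metadata from the GitHub API sits next to the git data. Everything below is illustrative: the `pull_requests` and `pr_files` tables, their columns, and the sample rows are my invention, not the author's actual schema:

```python
import sqlite3

# Hypothetical tables holding PR metadata pulled from GitHub's API.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE pull_requests (number INT, state TEXT, updated_at TEXT);
    CREATE TABLE pr_files (pr_number INT, path TEXT);
""")
conn.executemany("INSERT INTO pull_requests VALUES (?, ?, ?)", [
    (1, "open", "2020-04-20"),
    (2, "open", "2020-04-18"),
    (3, "open", "2020-01-05"),   # stale: falls outside the 30-day window
    (4, "closed", "2020-04-19"), # closed: excluded regardless of date
])
conn.executemany("INSERT INTO pr_files VALUES (?, ?)", [
    (1, "src/editor.ts"), (1, "src/main.ts"),
    (2, "src/editor.ts"),
    (3, "src/editor.ts"),
    (4, "src/main.ts"),
])

# Files touched by more than one open PR updated in the last 30 days --
# the "window" described above. '2020-04-25' stands in for "now".
collisions = conn.execute("""
    SELECT f.path, COUNT(DISTINCT f.pr_number) AS n_prs
    FROM pr_files f
    JOIN pull_requests p ON p.number = f.pr_number
    WHERE p.state = 'open'
      AND julianday('2020-04-25') - julianday(p.updated_at) <= 30
    GROUP BY f.path
    HAVING n_prs > 1
""").fetchall()

print(collisions)
```

With the sample data, only `src/editor.ts` collides (touched by open PRs 1 and 2); PR 3 is filtered out by the window and PR 4 by its state.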
I'm still working on refining my code review analytics, but the goal is to get it to an advanced state, where you can see exactly what is happening between pull requests and to derive insights from those requests.
So those are just some of the use cases I've developed, which leverage being able to query Git's history with SQL.
Do you have a schema somewhere? I've been working on a dataset that includes git repositories and I've been muddling through it slowly. It's built around a dataset of ~126k builds I collected some time back, plus ~5k Pivotal Tracker stories for the same time period. It covers about 2.5 years of work by 3 teams.
The hardest parts have been (1) dealing with actual lines, which I gave up on and (2) very busy robot repos with hundreds of thousands of commits.
My goal is to release the data as a single integrated set, but there's a ways to go. For one thing I need to find everyone in it to ask if they're OK with me doing so.
Sorry, I haven't published it yet. It honestly took a long time to develop the schemas, but I might publish them in the future. The issue right now is that I'm a single founder, so I really have to be smart with my time, and publishing things will just add to my workload.
My goal is to make the indexed data easily accessible, so that you can easily cross reference Git's history with whatever external systems you may have. What I've created is really a search and analytics engine for Git, which is designed for querying via SQL or through a REST interface.
On my simple dev machine, which has 32 GB of RAM, 1 TB of NVMe storage, and a 2700X CPU, the search engine can easily index hundreds of millions of changes.
> Do you store lines or full blobs at all? That's really where I came unglued on my first pass. I still want to reintroduce them somehow so that researchers can study changes more closely
No, since Git already does a pretty good job of efficiently storing blobs. I would like to be able to execute
"select blob from blobs where sha = <some sha>"
but I can't justify the overhead of storing blobs in a database just yet. That isn't to say I won't in the future, but if I do, I'll probably introduce a key/value DB for it instead of using Postgres. I do index blobs and diffs with Lucene though, and I also store the diffs in Postgres.
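If blob storage ever did move out of Git, the key/value layout mentioned above is about as simple as it gets. Here is a minimal sketch using SQLite as the store; the table layout, function names, and zlib compression are my own choices for illustration, not anything the author described:

```python
import hashlib
import sqlite3
import zlib

# Minimal key/value blob store sketch: sha -> compressed content.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blobs (sha TEXT PRIMARY KEY, data BLOB)")

def put_blob(content: bytes) -> str:
    # Git hashes "blob <len>\0<content>", so using the same header here
    # makes the stored shas line up with git's own object ids.
    header = b"blob %d\x00" % len(content)
    sha = hashlib.sha1(header + content).hexdigest()
    conn.execute(
        "INSERT OR IGNORE INTO blobs VALUES (?, ?)",
        (sha, zlib.compress(content)),
    )
    return sha

def get_blob(sha: str) -> bytes:
    # The "select blob from blobs where sha = ..." query from above.
    row = conn.execute("SELECT data FROM blobs WHERE sha = ?", (sha,)).fetchone()
    return zlib.decompress(row[0])

sha = put_blob(b"hello world\n")
print(sha, get_blob(sha))
```

Content-addressing by sha makes writes idempotent (hence `INSERT OR IGNORE`), which is the same property Git's own object store relies on.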
> What I'm doing looks to be a first for VMware, so we're moving cautiously
At the moment, not easily, but someone else just asked about that and I think it's worth looking into! It makes sense and could be a cool use case to run queries on all the key repos in an org or project.
See also: Fossil, a version control system backed by a SQLite database. The author (of both Fossil and SQLite) points out that a real database gives you more flexible, efficient queries over your data. In git, by comparison, it is much easier to find the parents of a commit than its children, for example.
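The parents-vs-children asymmetry disappears once the commit graph sits in a relational table: "children of X" becomes an indexed lookup instead of a scan of all of history. A small sketch, with an illustrative edge table (Fossil keeps a similar parent/child link table, but the name and columns here are mine):

```python
import sqlite3

# One row per parent->child edge of the commit graph.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plink (parent TEXT, child TEXT)")
conn.execute("CREATE INDEX plink_parent ON plink (parent)")
conn.executemany("INSERT INTO plink VALUES (?, ?)", [
    ("a", "b"),  # a -> b
    ("b", "c"),  # b -> c
    ("b", "d"),  # b -> d: b is a branch point with two children
])

# Children of "b": a cheap indexed lookup. Git itself has no direct
# answer to this question; it stores only parent pointers.
children = [r[0] for r in conn.execute(
    "SELECT child FROM plink WHERE parent = ? ORDER BY child", ("b",)
)]
print(children)
```

The reverse question (parents of a commit) is just the symmetric query on `child`, so the relational form is cheap in both directions.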
I used fossil professionally for about a year before the company got bought out and switched to git. It was interesting. It’s been a few years since then, but here are my thoughts:
* the built-in web server was neat and useful
* when things got into a weird state (anyone learning Git knows what I mean), it was extremely difficult to find a solution. Doing a web search was a waste of time; no one is sharing their useful Fossil SCM knowledge online.
* cleartext passwords. To clarify this a bit, users are local to the clone, so it’s not like they’re being shared... but then again, cleartext passwords. I just looked it up again, and it looks like it’s SHA-1 now... bcrypt would be nice.
* Fossil SCM does not believe in altering history. You’re not going to find any squash commits or branch rebases. There is a ‘shun’ command for removing sensitive information, but it’s specifically designed not to work like git history edits and has weird behavior regarding clones.
Overall, I significantly prefer git. Being able to find solutions when something goes wrong is huge. Not having to train people on another tool when they join the team is good: you can put git as a skill requirement without limiting your options; you can’t do that with Fossil SCM. I learned Git first, so very likely I’m biased toward the git way vs. the fossil way. It was an interesting experience, though. A version control system with a built-in web server is certainly unique!
Well, no, in huge repositories with lots of branches it's not. But this is true for Fossil as well. IIRC Microsoft did a study with the Windows repository showing that a relational approach to VCS bloated the repository too much, and out of this grew the Bloom filter approach to speeding up git blame/log.
Scale. Microsoft couldn't make a SQL VCS work. Maybe it's because they tried to use SQL Server instead of PostgreSQL, I dunno :) but IIRC the amount of metadata involved essentially meant that they had no hope of doing git-like cloning: it was just too much to clone.
You would need to implement a persistent data structure, but without refcounts you would have to do a GC. I guess you could do a refcount tree and only RC at the last common node.
Really, atime is the devil. The funny thing is that all modern higher-level systems have some form of data access logging, which is what you want anyway. So why not form some easily compressible event log?
haha you're totally right! If it's just a matter of renaming the binary and dropping it in a particular directory to get git to know about it, I can leave this part up to the user. I'll definitely have to make a note in the README.
gitqlite is a little clunky for a name, but I wanted to make sure to convey that it's sqlite doing a lot of the heavy lifting here!
I mainly use fossil these days, with git as a secondary option. One of the main reasons I went with fossil is due to it being built on top of SQLite and being able to query against the database, as well as extend it for customized project management.
Interesting stuff. When time permits, I'll have to check out how this implementation for git differs from fossil.
This post reminds me of a product idea I've been pondering for a while: A versioned database backed by git.
It would only be useful for data that makes sense to represent as individual files (perhaps JSON files with the file name as the ID), with an API that allows the data to be accessed and written in a similar way to other databases. Perhaps similar to a NoSQL database, but because it's in git, everything is versioned and can be reverted.
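The idea above can be sketched in a few dozen lines: each document is a JSON file named by its ID, every write is a commit, and old versions are read back with `git show <rev>:<path>`. The class and method names are invented for this sketch, and it assumes `git` is on the PATH:

```python
import json
import os
import subprocess
import tempfile

class GitDocStore:
    """Toy document store: one JSON file per id, one commit per write."""

    def __init__(self, path):
        self.path = path
        subprocess.run(["git", "init", "-q", path], check=True)
        self._git("config", "user.email", "store@example.com")
        self._git("config", "user.name", "store")

    def _git(self, *args):
        return subprocess.run(
            ["git", "-C", self.path, *args],
            check=True, capture_output=True, text=True,
        ).stdout

    def put(self, doc_id, doc):
        # Write the document as a file and record the change as a commit.
        with open(os.path.join(self.path, doc_id + ".json"), "w") as f:
            json.dump(doc, f)
        self._git("add", doc_id + ".json")
        self._git("commit", "-q", "-m", "put " + doc_id)

    def get(self, doc_id, rev="HEAD"):
        # rev can be any git revision, e.g. "HEAD~1" for the prior version,
        # which is where the "everything can be reverted" property comes from.
        return json.loads(self._git("show", f"{rev}:{doc_id}.json"))

store = GitDocStore(tempfile.mkdtemp())
store.put("user-1", {"name": "Ada"})
store.put("user-1", {"name": "Ada Lovelace"})
print(store.get("user-1"))            # latest version
print(store.get("user-1", "HEAD~1"))  # previous version, straight from git
```

A real version would need batching (one commit per write is slow) and locking, but the versioned-read API falls out of git almost for free.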
I was really hoping this would be for finding projects on GitHub, not for searching within a repo. I often find it hard to do advanced searches using whatever special DSL GitHub has come up with (or any company/website, for that matter), and would love to use a SQL query to find specific git repos. I definitely feel the same way about YouTube channels and videos.