How short can Git abbreviate? (2013)

(blog.cuviper.com)

55 points | by pncnmnp 10 days ago

6 comments

jwilk 10 days ago

Today's histogram of commit abbreviation lengths:

  $ git rev-list --all --abbrev=0 --abbrev-commit | awk '{ a[length] += 1 } END { for (len in a) print len, a[len] }'
  5 84
  6 692222
  7 527029
  8 43802
  9 2791
  10 181
  11 8

sgbeal 10 days ago
Just FYI: showing hash prefix collision counts is a built-in feature of the Fossil SCM. e.g. the collisions of one particularly well-known repo can be seen at:
<https://sqlite.org/src/hash-collisions>
and fossil's own can be seen at:
<https://fossil-scm.org/home/hash-collisions>
banish-m4 10 days ago
For grins, I recently wrote a brute force program to inject a nonce into a shell script that also was the crc32 hash of a resulting modified script containing itself. One script had 2 hash collisions that satisfied this property in the entire search domain.
[-]
- ponco 10 days ago
  oh my goodness, TIL nonce has a technical meaning. It has a very different meaning in Australia.
  [-]
  - banish-m4 10 days ago
    https://en.wikipedia.org/wiki/Cryptographic_nonce
    In general, it's a made up one time value. In this case, it satisfies an additional property.
  - bananapub 10 days ago
    beats only knowing the other meaning: https://www.joe.co.uk/comedy/nonce-finance-responds-to-ridic...
  - cqqxo4zV46cp 10 days ago
    I myself improve my crypto code with rock spiders all the time!
  - wittierusername 10 days ago
    Despite finding out about "nonce" as a technical term when I was reading around PHP like 3 years ago, I still can't not react to it every time lmao
pcthrowaway 10 days ago
I realize the author probably won't see this since the post is ancient, but it looks like there's a bug on the output: The "how many total objects are ambiguous at that abbreviation [length]" isn't quite right. It's actually "how many total possible words of that length can reference 2 or more existing objects, (and therefore are insufficient to disambiguate the matching objects of that repo)"
aronhegedus 10 days ago
Reminds me of this project: https://github.com/trichner/gitc0ffee which is used to take a commit, append some header to it to force the hash to collide with some prefix, such as `c0ffee`
>6 character prefix: less than a second
>8 character prefix: in the order of one or more minutes
No affiliation with this, but I tested it, and it was fast!
ramses0 10 days ago
I've actually used this as an interview question for several years (and has been my first/only "interview-GPT" question), as it's a problem I ran into at work, can be solved by someone new to programming, and has lots of headroom and alternates for efficiency, optimization, and alternative implementation discussions.
The use case I ran into was an initial + updated "dictionary of values" use case. Imagine: `{ "foo:bar:baz": 1.23, "bleep:bloop:blah": 4.56, ...etc.. }`.
I wanted to send the first batch as a "full dictionary", and then send updated batches as: `{ "aa": 1.23, "bb": 4.56 }` whereby the client should already be able to reverse: `aa` => `foo:bar:baz`, and `bb`: `bleep:bloop:blah` since they could md5 the original key, and then match that to the "compressed" unique key.
For the interview question, asking "what's the minimum unique prefix length", and exactly in the context of "how short can you abbreviate a git commit" was an excellent, low-friction, easily understandable problem.
I was shocked when the initial GPT-3 nailed the naive implementation, and in retrospect, not-surprised when its superficially-correct "optimized" solution was only superficially correct. ;-) Better than 50% of the interviewees I'd asked this question of.