How short can Git abbreviate? (2013)

(blog.cuviper.com)

55 points | by pncnmnp 10 days ago

6 comments

  • jwilk 10 days ago
    Today's histogram of commit abbreviation lengths:

      $ git rev-list --all --abbrev=0 --abbrev-commit | awk '{ a[length] += 1 } END { for (len in a) print len, a[len] }'
      5 84
      6 692222
      7 527029
      8 43802
      9 2791
      10 181
      11 8
  • sgbeal 10 days ago
    Just FYI: showing hash prefix collision counts is a built-in feature of the Fossil SCM. e.g. the collisions of one particularly well-known repo can be seen at:

    <https://sqlite.org/src/hash-collisions>

    and fossil's own can be seen at:

    <https://fossil-scm.org/home/hash-collisions>

  • banish-m4 10 days ago
    For grins, I recently wrote a brute force program to inject a nonce into a shell script that also was the crc32 hash of a resulting modified script containing itself. One script had 2 hash collisions that satisfied this property in the entire search domain.
  • pcthrowaway 10 days ago
    I realize the author probably won't see this since the post is ancient, but it looks like there's a bug on the output: The "how many total objects are ambiguous at that abbreviation [length]" isn't quite right. It's actually "how many total possible words of that length can reference 2 or more existing objects, (and therefore are insufficient to disambiguate the matching objects of that repo)"
  • aronhegedus 10 days ago
    Reminds me of this project: https://github.com/trichner/gitc0ffee which is used to take a commit, append some header to it to force the hash to collide with some prefix, such as `c0ffee`

    >6 character prefix: less than a second

    >8 character prefix: in the order of one or more minutes

    No affiliation with this, but I tested it, and it was fast!

  • ramses0 10 days ago
    I've actually used this as an interview question for several years (and has been my first/only "interview-GPT" question), as it's a problem I ran into at work, can be solved by someone new to programming, and has lots of headroom and alternates for efficiency, optimization, and alternative implementation discussions.

    The use case I ran into was an initial + updated "dictionary of values" use case. Imagine: `{ "foo:bar:baz": 1.23, "bleep:bloop:blah": 4.56, ...etc.. }`.

    I wanted to send the first batch as a "full dictionary", and then send updated batches as: `{ "aa": 1.23, "bb": 4.56 }` whereby the client should already be able to reverse: `aa` => `foo:bar:baz`, and `bb`: `bleep:bloop:blah` since they could md5 the original key, and then match that to the "compressed" unique key.

    For the interview question, asking "what's the minimum unique prefix length", and exactly in the context of "how short can you abbreviate a git commit" was an excellent, low-friction, easily understandable problem.

    I was shocked when the initial GPT-3 nailed the naive implementation, and in retrospect, not-surprised when its superficially-correct "optimized" solution was only superficially correct. ;-) Better than 50% of the interviewees I'd asked this question of.