Exploding Git Repositories


447 points | by ingve 348 days ago


  • hathawsh 348 days ago

    I wonder what the author means by "a lot" of RAM and storage. I tried it for fun. The git process pegged one CPU core and swelled to 26 GB of RAM over 8 minutes, after which I had to kill it.

    • wscott 348 days ago

      Yeah I tried it too. Killed at 65G. Disappointed that Linux killed Chrome first.

          Oct 12 15:47:52 x99 kernel: [552390.074468] Out of memory: Kill process 7898 (git) score 956 or sacrifice child
          Oct 12 15:47:52 x99 kernel: [552390.074471] Killed process 7898 (git) total-vm:65304212kB, anon-rss:63789568kB, file-rss:1384kB, shmem-rss:0kB

      Interesting. Linux didn't kill Chrome, it died on its own.

          Oct 12 15:42:21 x99 kernel: [552060.423448] TaskSchedulerFo[8425]: segfault at 0 ip 000055618c430740 sp 00007f344cc093f0 error 6 in chrome[556188a1d000+55d1000]
          Oct 12 15:42:21 x99 kernel: [552060.439116] Core dump to |/usr/share/apport/apport 16093 11 0 16093 pipe failed
          Oct 12 15:42:21 x99 kernel: [552060.450561] traps: chrome[16409] trap invalid opcode ip:55af00f34b4c sp:7ffee985fb20 error:0
          Oct 12 15:42:21 x99 kernel: [552060.450564]  in chrome[55aeffb76000+55d1000]
          Oct 12 15:47:52 x99 kernel: [552390.074289] syncthing invoked oom-killer: gfp_mask=0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=0, order=0, oom_score_adj=0
      Seems Chrome faulted first, but it was probably capturing all signals and didn't handle the OOM condition. Then syncthing invoked the oom-killer, which correctly selected 'git' to kill.
      • Tharre 347 days ago

        > [..] and didn't handle OOM.

        How would Chrome 'handle' an OOM anyway? As far as I'm aware, malloc doesn't return ENOMEM when the system runs out of memory, only when you hit RLIMIT_AS and the like.
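        (A minimal sketch, not from the thread: under an address-space rlimit, allocation failures do surface inside the process, which Python sees as MemoryError. The 256 MB cap and 1 GiB request are arbitrary demo values; RLIMIT_AS enforcement assumes Linux.)

```python
import subprocess, sys

# Child process: cap its own address space, then over-allocate.
# Under RLIMIT_AS, malloc (and thus Python's allocator) fails cleanly
# instead of the kernel's OOM killer getting involved.
child_code = r"""
import resource
resource.setrlimit(resource.RLIMIT_AS, (256 * 1024 * 1024, 256 * 1024 * 1024))
try:
    blob = bytearray(1024 * 1024 * 1024)  # try to allocate 1 GiB
    print("allocated")
except MemoryError:
    print("allocation failed cleanly")
"""
result = subprocess.run([sys.executable, "-c", child_code],
                        capture_output=True, text=True)
print(result.stdout.strip())
```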

        • exikyut 347 days ago

          Or when you hit 4G VIRT on 32-bit.

          Took me a good day's worth of debugging before some bright spark piped up and said "wait, you said you were on x86-32...?"

          ...yeah, I use really old computers.

          • katastic 347 days ago

            I'm setting up my last machine for my wife for gaming: an Athlon X4 630 and 16 GB of RAM. I loaded Windows up and it said it had ~2 GB free, and I was like "oh crap, the RAM sticks must be dead" (because the last motherboard, which I had just replaced, broke some RAM slots).

            I fixed my old video card, a GTX 560, and wanted to see what it could run. I loaded Steam, and PUBG said "invalid platform error". It took me a moment. I hit Alt-Pause/Break and, presto: Windows 32-bit. Whoops.

            I hadn't had that problem in a long time, except at clients running ancient Windows Server versions, complaining about why Exchange 2003 won't work with their iPhones anymore: "It used to work and we didn't change anything!" (Yeah... but the iPhone DID change, including banning your insecure Exchange 2003 protocols.)

      • gabesullice 348 days ago

        Humblebrag ;)

    • porfirium 348 days ago

      If we all click "Download ZIP" on this repo we can crash GitHub together!

      Just click here: https://codeload.github.com/Katee/git-bomb/zip/master

      • AceJohnny2 348 days ago

        I hope and expect that GitHub has the basic infrastructure to monitor excessive processes and kill them.

        • exikyut 347 days ago

          Scratches head

          ...I clicked Download a few seconds ago.

          GitHub is still thinking. :/

          Edit: After about a minute I got a pink unicorn.

          • abritinthebay 347 days ago

            Wouldn't that just do a `git fetch` and therefore not have the issue?

            • minitech 347 days ago

              "Download ZIP" downloads the repository’s files as a zip. No Git involved for the downloader.

              • chii 347 days ago

                i expect the download zip to be implemented as running 'git archive --format zip | write-http-response-stream'

                • mschuster91 347 days ago

                  Hmm I'd hope they do a caching step in between ;)

          • timdorr 348 days ago

            I'm curious how this was uploaded to GitHub successfully. I guess they do less actual introspection on the repo's contents than I thought. Did it wreak havoc on any systems behind the scenes (similar to big repos like Homebrew's)?

            • stolee 348 days ago

              There isn't anything wrong with the objects. A 'fetch' succeeds but the 'checkout' is what blows up.

              • yes_or_gnome 348 days ago

                Good point. For those that are curious:

                Clone (--no-checkout):

                    $ git clone --no-checkout https://github.com/Katee/git-bomb.git
                    Cloning into 'git-bomb'...
                    remote: Counting objects: 18, done.
                    remote: Compressing objects: 100% (6/6), done.
                    remote: Total 18 (delta 2), reused 0 (delta 0), pack-reused 12
                    Unpacking objects: 100% (18/18), done.
                From there, you can do some operations like `git log` and `git cat-file -p HEAD` (I use the "dump" alias[1]: `git config --global alias.dump 'cat-file -p'`), but not others like `git checkout` or `git status`.

                [1] Thanks to Jim Weirich and Git-Immersion, http://gitimmersion.com/lab_23.html. I never knew the guy, but, ~~8yrs~~ (corrected below) 3.5yrs after his passing, I still go back to his presentations on Git and Ruby often.

                Edit: And, to see the whole tree:

                  NEXT_REF=HEAD
                  while [ -n "$NEXT_REF" ]; do
                    echo "$NEXT_REF"
                    git dump "${NEXT_REF}"
                    NEXT_REF=$(git dump "${NEXT_REF}"^{tree} 2>/dev/null | awk '{ if($4 == "d0" || $4 == "f0"){ print $3 } }')
                  done
                • matthewrudy 348 days ago

                  Sad one to nitpick, but Jim died in 2014. So ~3.5 years ago.

                  Had the pleasure of meeting him in Singapore in 2013.

                  Still so much great code of his we use all the time.

                  • yes_or_gnome 348 days ago

                    Thanks for the correction; he truly was a brilliant mind. One of my regrets was not being active and outgoing enough to go meet him myself. I lived in the Cincinnati area from 2007-2012. I first got started with Ruby in 2009, and quickly became aware of who he was (Rake, Bundler, etc.) and that he lived/worked close by. But, at the time, I wasn't interested in conferences, meetups, or simply emailing someone to say thanks.

              • enzanki_ars 348 days ago

                I too was curious about this.

                https://github.com/Katee/git-bomb/commit/45546f17e5801791d4b... shows:

                "Sorry, this diff is taking too long to generate. It may be too large to display on GitHub."

                ...so they must have some kind of backend limits that may have prevented this from becoming an issue.

                I wonder what would happen if it was hosted on a GitLab instance? Might have to try that sometime...

                • ballenf 348 days ago

                  Since GitHub paid a bounty and Ok'd release, perhaps they've patched some aspects of it already. Might be impossible to recreate the issue now.

                  My naive question is whether CLI "git" would need or could benefit from a patch. Part of me thinks it doesn't, since there are legitimate reasons for each individual aspect of creating the problematic repo. But I probably don't understand god deeply enough to know for sure.

                  • mnx 348 days ago

                    is this a git->god typo, or a statement about your feelings towards Linus?

                    • warent 348 days ago

                      Please don't let Linus read this

                  • ethomson 348 days ago

                    Yes, hosting providers need rate limiting mitigations in place. GitHub's is called gitmon (at least unofficially), and you can learn more at https://m.youtube.com/watch?v=f7ecUqHxD7o

                    Visual Studio Team Services has a fundamentally different architecture, but we use some similar mechanisms despite that. (I should do some talks about it - but it's always hard to know how much to say about your defenses lest it give attackers clever new ideas!)

                    • corobo 347 days ago

                      > how much to say about your defenses lest it give attackers clever new ideas

                      attackers will try clever new ideas anyway if their less clever old ideas don't work :P

                      • Sean1708 347 days ago

                        How does the saying go? Something like "security through obscurity isn't security"?

                        • ethomson 345 days ago

                          It's not security through obscurity. It's defense in depth.

                    • deckar01 348 days ago

                      GitLab uses a custom Git client called Gitaly [0].

                      > Project Goals

                      > Make the git data storage tier of large GitLab instances, and GitLab.com in particular, fast.

                      [0]: https://gitlab.com/gitlab-org/gitaly

                      Edit: It looks like Gitaly still spawns git for low level operations. It is probably affected.

                      • jychang 348 days ago

                        Spawning git doesn't mean that it can't just check for a timeout and stop the task with an error.
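                        A sketch of that idea (hypothetical, not GitLab's actual code): wrap each spawned process in a timeout and treat expiry as a failed operation. A sleeping child stands in for a runaway git process here.

```python
import subprocess, sys

def run_with_timeout(cmd, timeout_s):
    """Run a command, killing it if it exceeds timeout_s seconds.

    A host spawning git can bound each invocation like this instead of
    trusting repository contents to be cheap to process."""
    try:
        return subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None  # caller treats this as "operation too expensive"

# Stand-in for a runaway git process: a child that never finishes in time.
runaway = run_with_timeout(
    [sys.executable, "-c", "import time; time.sleep(60)"], timeout_s=2)
```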

                        Someone will probably have to actually try an experiment with Gitlab.

                      • lloeki 347 days ago

                        Tested locally on a GitLab instance: trying to push the repo results in a unicorn worker allocating ~3GB and pegging a core, then being killed on a timeout by the unicorn watchdog.

                            Counting objects: 18, done.
                            Delta compression using up to 4 threads.
                            Compressing objects: 100% (17/17), done.
                            Writing objects: 100% (18/18), 2.13 KiB | 0 bytes/s, done.
                            Total 18 (delta 3), reused 0 (delta 0)
                            remote: GitLab: Failed to authorize your Git request: internal API unreachable
                            To gitlab.example.com: lloeki/git-bomb.git
                             ! [remote rejected] master -> master (pre-receive hook declined)
                            error: failed to push some refs to 'git@gitlab.example.com:lloeki/git-bomb.git'
                        I had "Prevent committing secrets to Git" enabled, though. Disabling this makes the push work. The repo can then be browsed at the first level only from the web UI, but clicking into any folder breaks the whole thing down, with multiple git processes hanging on git rev-list.

                        EDIT: reported at https://gitlab.com/gitlab-org/gitlab-ce/issues/39093 (confidential).

                    • JoshMnem 348 days ago

                      Because that page is AMP by default, it takes about 7 seconds to load the page on my laptop. AMP is really slow in some cases.

                      Edit: see my comment below before you downvote me.

                    • pmoriarty 347 days ago

                      Why not just always run git under memory limits?

                      For example:

                        %  ulimit -a
                        -t: cpu time (seconds)              unlimited
                        -f: file size (blocks)              unlimited
                        -d: data seg size (kbytes)          unlimited
                        -s: stack size (kbytes)             8192
                        -c: core file size (blocks)         0
                        -m: resident set size (kbytes)      unlimited
                        -u: processes                       30127
                        -n: file descriptors                1024
                        -l: locked-in-memory size (kbytes)  unlimited
                        -v: address space (kbytes)          unlimited
                        -x: file locks                      unlimited
                        -i: pending signals                 30127
                        -q: bytes in POSIX msg queues       819200
                        -e: max nice                        30
                        -r: max rt priority                 99
                        -N 15:                              unlimited
                        %  ulimit -d $((100 * 1024)) # 100 MB
                        %  ulimit -m $((100 * 1024)) # 100 MB
                        %  ulimit -l $((100 * 1024)) # 100 MB
                        %  ulimit -v $((100 * 1024)) # 100 MB
                        %  git clone https://github.com/Katee/git-bomb.git
                        Cloning into 'git-bomb'...
                        remote: Counting objects: 18, done.
                        remote: Compressing objects: 100% (6/6), done.
                        remote: Total 18 (delta 2), reused 0 (delta 0), pack-reused 12
                        Unpacking objects: 100% (18/18), done.
                        fatal: Out of memory, malloc failed (tried to allocate 118 bytes)
                        warning: Clone succeeded, but checkout failed.
                        You can inspect what was checked out with 'git status'
                        and retry the checkout with 'git checkout -f HEAD'
                      • ericfrederich 348 days ago

                        Run this to create a 40K file which expands to 1GiB

                          yes | head -n536870912 | bzip2 -c > /tmp/foo.bz2
                        I would imagine you could do something really creative with ImageMagick too: create a giant PNG file that'll make browsers, viewers, and editors crash as well.
                      • warent 348 days ago

                        Odd. It's surprising to me that this example runs out of memory. What would be a possible solution?

                        Admittedly I don't know that much about the inner workings of git, but off the top of my head, perhaps traverse the tree depth-first and release resources as you hit the bottom?

                        • ericfrederich 348 days ago

                          You need a problem to have a solution to it. What do you consider to be the problem here?

                          This is essentially something that can be expressed in relatively few bytes that expands to something much larger.

                          Imagine I had a compressed file format for blank files ("0x00" the whole way through). It is implemented by writing, in ASCII, the size of the uncompressed file.

                          So the contents of a file called terabyte.blank is just the ASCII "1000000000000" ... and the contents of a file called petabyte.blank is "1000000000000000"

                          I cannot decompress these files... what is the solution?
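                          (One can at least stream such a format to disk in bounded memory; disk space remains the hard limit, which is the commenter's point. A sketch of the hypothetical '.blank' decompressor; all names here are made up.)

```python
import os, tempfile

def decompress_blank(size_text, out_path, chunk=1024 * 1024):
    """Expand the hypothetical '.blank' format (ASCII size -> that many
    zero bytes), writing fixed-size chunks so memory stays bounded."""
    remaining = int(size_text)
    zeros = b"\x00" * chunk
    with open(out_path, "wb") as f:
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(zeros[:n])
            remaining -= n

out_path = os.path.join(tempfile.gettempdir(), "demo.blank.out")
decompress_blank("1048576", out_path)  # a modest 1 MiB example
```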

                          • geezerjay 347 days ago

                            > You need a problem to have a solution to it. What do you consider to be the problem here?
                            >
                            > This is essentially something that can be expressed in relatively few bytes that expands to something much larger.

                            That seems to be the problem. I mean, if an object expands to something much larger to the point that it crashes services just by the sheer volume of the resources it takes... That is pretty much the definition of an attack vector of a denial-of-service attack.

                            • TeMPOraL 347 days ago

                              There is a problem here, but it's not with data. It's with the service.

                              Being able to express trees efficiently in a data format is a useful feature, but it requires the code processing it not to be lazy and assume people will never create pathological tree structures.

                            • warent 348 days ago

                              I'm not following; why can't you decompress it? Of course you can't decompress it into memory, but if it's trying to do that then there's a problem in the code (problem identified).

                              Naive solution, just write to the end of the file and make sure you have enough disk. More sophisticated solution, shard the file across multiple disks.

                              • Piskvorrr 346 days ago

                                That's not a solution, that's sweeping the problem under the rug: "just have the OS provide storage, therefore it's not my problem any more, solved. (Never mind that with a few more layers, the tree would decompress into a structure larger than all the storage ever available to mankind)"

                            • peff 348 days ago

                              Git assumes it can keep a small struct in memory for each file in the repository (not the file contents, but a fixed per-file size). This repository just has a very large number of files.

                              • glandium 347 days ago

                                Large as in 10 billion. Even if git only needed 1 byte of memory per file, it would need 10 GB.
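                                (The arithmetic, for the curious; depth and fan-out are the git-bomb repo's structure as described elsewhere in the thread.)

```python
# git-bomb: 10 levels of trees, each tree referencing the level below
# 10 times, so the number of blob paths is width ** depth.
depth, width = 10, 10
files = width ** depth            # 10,000,000,000 paths
per_file_bytes = 1                # absurdly optimistic bookkeeping cost
total = files * per_file_bytes
print(files, total / 1e9, "GB")
```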

                              • koenigdavidmj 347 days ago

                                One option is to modify each of the utilities so that it doesn't have a full representation of the whole tree in memory. I doubt this is feasible in all cases, though for something like 'git status' it should be doable.

                                If the tree object format was required to store its own path, then you wouldn't be able to repeat the tree a bunch of times. The in-memory representation would be the same size, but you would now need that same number of objects in the repository. No more exponential fanout.

                                But that would kind of defeat the purpose of Git for real use cases (renaming a directory shouldn't make the size of your repo blow up).

                                • TeMPOraL 347 days ago

                                  Have git (the client) monitor its own memory usage and abort if it gets above a set limit (say, default, 1GB), with a message that tells you how to change or disable the limit.

                                • gwerbin 348 days ago

                                  Would this be possible with a patch-based version control system like Darcs or Pijul? Does patch-based version control have other analogous security risks, or is it "better" in this case?

                                  • fanf2 347 days ago

                                    If the patch language includes a recursive copy, then it's possible to reproduce this problem in that setting.

                                    • geezerjay 347 days ago

                                      If I understood correctly, this problem isn't caused by recursive copies but simply by expanding references. The example shows that the reference expansion leads to an exponential increase in resources required by the service.

                                      • TeMPOraL 347 days ago

                                        It means the same thing in this context; if git were just expanding references one by one while walking the tree, this wouldn't happen - the bomb requires copies of expanded references to be held in memory.

                                  • TeMPOraL 347 days ago

                                     Going to the second level on GitHub breaks the commit name for me - it gets stuck on a "Fetching latest commit..." message. Curiously, go one level deeper and the commit message is correct again.


                                     (INB4: the article suggests GitHub is aware of this repo, so I have no qualms posting this link here.)

                                    • emeraldd 347 days ago

                                      Bare for the win.

                                          git clone https://github.com/Katee/git-bomb.git --bare
                                      • infinity0 347 days ago

                                        Directory hard links would "fix" this issue since `git checkout` could just create a directory hard link for each duplicated tree. I wonder why traditional UNIX does not support this for any filesystem.

                                        (Yes you would need to add a loop detector for paths and resolve ".." differently but it's not like doing this is conceptually hard.)

                                        • breakingcups 347 days ago

                                          Has anyone tried to see how well BitBucket and Gitlab handle this?

                                          • Retr0spectrum 348 days ago

                                            What happens if you try to make a recursive tree?

                                            • katee 348 days ago

                                              You can't make a valid recursive tree without a preimage attack against SHA-1. However, `git` doesn't actually verify the SHA-1s when it runs most commands. If you make a recursive tree and try `git status`, it will segfault because the directory walk gets stuck in infinite recursion.
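                                              (To see why self-reference needs a preimage: a git object id is the SHA-1 of a type/size header plus the body, so a tree containing its own id would be a SHA-1 fixed point. A sketch of git's hashing scheme; the blob example is just for illustration.)

```python
import hashlib

def git_object_sha1(obj_type, body):
    """Compute a git object id the way git does:
    sha1('<type> <size>\\0' + body)."""
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()

# A tree entry embeds the binary id of the subtree it points to; a tree
# pointing at itself would need its body to contain its own hash.
demo = git_object_sha1("blob", b"hello\n")
print(demo)
```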

                                              • ethomson 348 days ago

                                                As in a tree that points to itself? You cannot, since a tree would have to point to its own SHA1. So this would require you to know your own tree's SHA and embed it in the tree.

                                            • kowdermeister 348 days ago

                                              I thought it would self-destruct after cloning or forking before clicking :)