Containers, Security, and Echo Chambers

(blog.jessfraz.com)

49 points | by ipm42 2139 days ago

3 comments

  • erulabs 2138 days ago
    Dropping privileges from Docker containers and container isolation are both very interesting and important topics - ones I hear discussed constantly. This author would be much better served by dropping the self-congratulation - no reason at all for comments like “tech bros crying” and “container isolation is a hard problem - unless you’re me!”. Sorry, you’re not the only nerd smart enough to populate the capdrop table. Also, the default AppArmor profile for Docker leaves -a lot- to be desired. I hear the term “Echo Chamber” most often from people who deem themselves smarter than the rest, and always in an accusatory way... fairly ironic if you ask me.
  • benmmurphy 2138 days ago
    linux namespaces can be aggressively locked down. for example sandstorm by kenton varda has not been vulnerable to any of the recent linux kernel vulnerabilities except for badiret and a TLB bug (https://github.com/sandstorm-io/sandstorm/blob/master/docs/u...). it certainly wasn't vulnerable to dirty cow which is quite difficult to protect against.

    however, i think docker may have been vulnerable to dirty cow. [https://github.com/gebl/dirtycow-docker-vdso] i'm not sure if this was before or after jessie's work on securing docker.

    also, i don't think gvisor would have been vulnerable to dirty cow. it looks like it 'gates' all the mmap/munmap/madvise syscalls through a sentry process which does some kind of emulation of virtual memory. [https://github.com/google/gvisor/blob/797cda301677abc8523d5a...]. ultimately, i think the mmap() system calls still need to be executed in the monitored process, but i think they are only issued by the sentry using ptrace. if the sentry dies, i assume it is the root of the pid namespace, so the process it is monitoring is killed as well.

    • kentonv 2138 days ago
      It's been a while, but IIRC Docker wasn't affected by Dirty COW because they mount /proc read-only. (Sandstorm was unaffected because it doesn't mount /proc at all.)

      FWIW I haven't kept the security non-events up to date over the past year or two. There was at least one Linux kernel bug I can remember last year (CVE-2017-5123) that allowed a breakout from all container engines, because waitid() is too important a syscall to block. However, the vulnerability was newly-introduced and hadn't made its way into too many distros before it was fixed.

    • amscanne 2138 days ago
      DirtyCOW requires either ptrace or /proc/PID/mem, neither of which is natively accessible in gVisor. Even if they were, the VDSO, which was a good vector for privilege escalation, is isolated.

      I don’t think “locked down namespaces” means anything. Sandstorm was not vulnerable because it doesn’t mount /proc and doesn’t allow ptrace.
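      To make the /proc point concrete, here is a minimal sketch that reports whether /proc is mounted inside the current mount namespace and, if so, whether it is read-only. It only parses the mount table in the /proc/self/mounts layout; the helper names `proc_mount_options` and `describe_proc` are made up for illustration, and the result says nothing by itself about whether a given exploit would work.

```python
from typing import Optional, Set


def proc_mount_options(mounts_text: str, target: str = "/proc") -> Optional[Set[str]]:
    """Return the mount options for `target`, or None if it is not mounted.

    `mounts_text` is expected in the /proc/self/mounts layout:
    device mountpoint fstype options dump pass
    """
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        if fields[1] == target:
            # A later mount on the same path shadows an earlier one.
            options = set(fields[3].split(","))
    return options


def describe_proc(mounts_text: str) -> str:
    """Summarise the state of /proc as one of three strings."""
    options = proc_mount_options(mounts_text)
    if options is None:
        return "not mounted"
    return "read-only" if "ro" in options else "read-write"


if __name__ == "__main__":
    try:
        with open("/proc/self/mounts") as f:
            print("/proc is", describe_proc(f.read()))
    except OSError:
        # With no /proc at all, the mount table itself is unreachable.
        print("/proc is not readable here")
```

      Run inside a sandbox, the last branch is the interesting one: if /proc is not mounted at all, there is no /proc/PID/mem to open in the first place.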

  • mindhash 2138 days ago
    After reading this article I am more convinced about gvisor or something similar.

    The security of your systems is best left to experts.

    • lrvick 2138 days ago
      Speaking as someone who has been called a security "expert" by many for years: I actually have very little idea what I am doing, and neither do most of the people I know who discover vulnerabilities. I find issues by just reading code that clearly got minimal if any review, or by catching common design flaws that would never have happened if someone had taken the time to think about their threat profile and attack surface before implementation.

      The "security is someone else's problem" attitude of most engineers I encounter is the problem.

      If you are writing systems other people rely on to be secure, then security is your problem. You will do a much better job avoiding creating new security holes if you take the time to learn some basics yourself instead of expecting "experts" to do it for you.

      Namespacing features and system call filtering tools like gVisor, seccomp, SELinux, AppArmor, etc. should be your -last- line of defense, and they are only going to be useful tools for you if you invest the time to understand them and tune them to your specific needs.
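      As a rough illustration of what "tune them to your specific needs" can look like for seccomp, here is a sketch that turns a list of syscalls into a deny-by-default profile in the JSON layout Docker accepts for custom seccomp profiles (`defaultAction` plus `syscalls` entries with `names` and `action`). The `OBSERVED` list is a made-up placeholder, not a workable allowlist; a real workload needs far more syscalls, and the list has to come from actually studying what your program does.

```python
import json


def build_allowlist_profile(syscall_names):
    """Build a deny-by-default seccomp profile as a plain dict.

    Every syscall not named here fails with an error instead of running.
    """
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [
            {
                # De-duplicate and sort so the profile is stable in review.
                "names": sorted(set(syscall_names)),
                "action": "SCMP_ACT_ALLOW",
            }
        ],
    }


# Hypothetical placeholder list, for illustration only.
OBSERVED = ["read", "write", "close", "exit_group", "futex", "mmap", "write"]

profile = build_allowlist_profile(OBSERVED)
print(json.dumps(profile, indent=2))
```

      The output can be saved to a file and passed to `docker run --security-opt seccomp=<file>`. The hard part is not generating the JSON; it is knowing which names belong in the list and why.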

      • ithkuil 2138 days ago
        There is an interesting thin line dividing a "hardening feature" (e.g. a system call filtering system) and something perceived as a genuine execution environment category (e.g. a virtual machine).

        Most people like to be able to reason about general security implications in broad strokes, and concepts such as "virtual machine" convey a notion of isolation guarantees that makes them stand out as a primitive you can build upon, rather than an "additional layer".

        In order to achieve this standing, the "virtual machine" concept is rooted in the general idea that most of the traditional OS abstraction is moved inside the sandbox, leaving only a very small, easy to understand (and hence easy to secure) channel to the underlying shared resource. This is traditionally achieved by running a full kernel in the sandbox (guest OS) and having it interact with the host through a hypervisor. The optional assistance provided by hardware virtualisation features is often necessary to achieve good performance, mostly because of the necessity to move a traditional OS into the guest, an OS which was designed to work as a primary OS in the first place.

        The niche gVisor is trying to fill is the ability to approximate that abstraction without requiring the traditional hypervisor mode, which has some practical drawbacks that make it hard to deploy in some scenarios (think of the hardware support for nested virtualization that would be required to run your own virtualization solutions inside public cloud compute instances).

        gVisor achieves this by fundamentally implementing a user-space kernel, leveraging some of the aforementioned system call filtering tools as one of the possible ways of implementing the sandbox mechanism and the guest/host communication channel. So, while using the very same features that are commonly thought of as providing additional hardening, gVisor fills a different niche, more akin to what people usually call virtualization.

        It can be argued that the amount of host features exercised by gVisor is too high to be able to call it a virtualization feature proper (especially in its seccomp mode), but it shares with classic virtualization one very powerful property: when end-user workloads require a given OS feature (e.g. some new cgroups feature), only the guest "OS" needs to implement it.

        On the other hand, traditional seccomp/selinux/apparmor style hardening requires the host OS to implement all the features needed by the guest workload. Furthermore, it often requires the rules (e.g. syscall filters) to be updated to let the sandboxed workload use said features, and the amended rules can easily be applied incorrectly. Moreover, the filtering rules need to be expressive enough to implement some scenarios in the first place (e.g. seccomp-bpf cannot currently follow pointers).
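        The "cannot follow pointers" limitation is easy to show with a toy model. A seccomp-bpf filter is handed only the fields of the kernel's `struct seccomp_data` (syscall number, arch, instruction pointer, six raw argument words), so for open(2) it sees the address of the path buffer, never the string. The sketch below is a simulation in plain Python, not a real filter: `memory`, `PATH_BUF`, `naive_filter` and `open_call` are invented for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

SYS_OPEN = 2                    # open(2) syscall number on x86-64
AUDIT_ARCH_X86_64 = 0xC000003E  # arch value reported on x86-64
ALLOW, ERRNO = "ALLOW", "ERRNO"


@dataclass(frozen=True)
class SeccompData:
    """The same fields as the kernel's struct seccomp_data."""
    nr: int
    arch: int
    instruction_pointer: int
    args: Tuple[int, ...]


# Toy stand-in for process memory. The filter below never reads it,
# which is exactly the point: BPF has no way to dereference args.
PATH_BUF = 0x7F002000
memory = {PATH_BUF: "/tmp/scratch"}


def naive_filter(data: SeccompData) -> str:
    """Try to allow open() only on one path by matching its address."""
    if data.nr != SYS_OPEN:
        return ALLOW
    return ALLOW if data.args[0] == PATH_BUF else ERRNO


def open_call(path_addr: int) -> SeccompData:
    """Build the filter's view of open(path_addr, 0, 0)."""
    return SeccompData(SYS_OPEN, AUDIT_ARCH_X86_64, 0, (path_addr, 0, 0, 0, 0, 0))


before = naive_filter(open_call(PATH_BUF))
memory[PATH_BUF] = "/etc/shadow"  # the process rewrites its own buffer
after = naive_filter(open_call(PATH_BUF))
print(before, after, memory[PATH_BUF])  # ALLOW ALLOW /etc/shadow
```

        The filter's input is identical before and after the buffer is rewritten, so its verdict cannot change. Path-based policy therefore has to live somewhere that can read the string, such as an LSM or a user-space kernel.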