4 comments

  • zenlibs 9 days ago

    Fuchsia has an interesting take on filesystems [1]. A filesystem can be written entirely in user space, avoiding expensive kernel<-->user-space switching. Storage sandboxing comes as a free additional benefit, since each app can implement its own fs with the rest of the system unaware of its existence.

    I wish such a fully-user-space option existed for Linux. This work is philosophically in the opposite direction, moving more functionality into kernel space for perf benefits.

    [1]: https://fuchsia.dev/fuchsia-src/the-book/filesystems.md

    • comex 9 days ago

      That document describes filesystems which are accessed over IPC, not filesystem-as-library like you seem to be describing. In fact, it's the same basic idea as FUSE. One user process (accessing the filesystem) makes an IPC call to another user process (server that implements the filesystem), which necessarily passes through the kernel and performs a context switch in each direction. On the other hand, it's quite possible that Fuchsia's IPC is better optimized than FUSE, so it might have better performance in practice.

      • zenlibs 9 days ago

        libfs [1] is a userspace library offered by Fuchsia that abstracts the traditional VFS (virtual filesystem interface), allowing the fs to exist wholly in userspace, without a kernel component.

        Quoting: > Unlike more common monolithic kernels, Fuchsia’s filesystems live entirely within userspace. They are not linked nor loaded with the kernel; they are simply userspace processes which implement servers that can appear as filesystems

        [1]: https://fuchsia.googlesource.com/fuchsia/+/master/zircon/sys...

        • comex 9 days ago

          > which implement servers

          A "server" is an IPC mechanism; this is describing a way for one userspace process to serve filesystems to other userspace processes.

          It sounds like the kernel has no built-in notion of a "filesystem", and filesystems just take advantage of the kernel's generic IPC mechanism, which is also used by a lot of other things. That's great – but it's still true that IPC must go through the kernel, and switching from one user process (the client) to another (the server) is a context switch.

          It may be that the code also supports locating the client and server within the same process – I have not looked at it. But that's not what the documentation describes, so it's at least not the main intended operating mode.

          • zenlibs 9 days ago

            A userspace program can completely avoid kernel IPC if it has no intention to expose the fs to other processes. Client and server code can exist within the same "app", in the same process, without IPC.

            • geofft 9 days ago

              There are plenty of existing libraries that do exactly that. This isn't novel to Fuchsia. A good example is GNOME's GVfs https://en.wikipedia.org/wiki/GVfs , which is basically a plugin architecture to the standard GLib I/O routines. (Although as it happens, it still places the mounts in separate daemon processes.)

              Other things that come to mind are SQLite's VFS layer https://www.sqlite.org/vfs.html , Apache Commons VFS for Java https://commons.apache.org/proper/commons-vfs/ , glibc's fopencookie(3) which lets you provide a custom, in-process implementation of a FILE * http://man7.org/linux/man-pages/man3/fopencookie.3.html , libnfs which even comes with an LD_PRELOAD https://github.com/sahlberg/libnfs , etc.
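
              To make the in-process idea concrete, here is a minimal sketch in Python of a custom file object served by code in the same process, loosely in the spirit of fopencookie(3). The names InProcessRaw and open_in_process are made up for illustration; only the io module classes are real.

```python
import io


class InProcessRaw(io.RawIOBase):
    """Hypothetical raw stream whose bytes come from this process's memory,
    so reads never reach a kernel filesystem or another process."""

    def __init__(self, payload: bytes):
        super().__init__()
        self._payload = payload
        self._pos = 0

    def readable(self):
        return True

    def readinto(self, buffer):
        # Copy as much as fits into the caller's buffer; 0 signals EOF.
        chunk = self._payload[self._pos:self._pos + len(buffer)]
        buffer[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)


def open_in_process(payload: bytes):
    # Stack the standard buffering and text layers on the custom raw stream.
    return io.TextIOWrapper(io.BufferedReader(InProcessRaw(payload)), encoding="utf-8")


stream = open_in_process(b"alpha\nbeta\n")
lines = [line.rstrip("\n") for line in stream]
```

              The caller iterates over the stream like any other file, while every byte is produced by the readinto method above.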

              (And as others have pointed out, while client and server code can exist without IPC, as the names "client" and "server" would imply, that isn't the primary intention. The docs you link say, "To open a file, Fuchsia programs (clients) send RPC requests to filesystem servers ...." And even the terminology of a file system as a "server" isn't novel to Fuchsia; that's the approach the HURD and Plan 9 both take for filesystems, for instance.)

            • monocasa 9 days ago

              If you have the capabilities to the whole device.

          • wahern 9 days ago

            You said, "avoiding expensive kernel<-->user-space switching", which is wrong. Filesystems are implemented entirely in user space, just not in the same user space processes. Consumers exist in separate processes from the producers--plural, because the underlying block device storage may be managed by processes separate from the processes managing VFS state. Context switches are a necessary part of having separate user space processes, and context switching through kernel space (or at least some protected, privileged context) is necessary in order to authenticate messaging capabilities.

            Note that there are ways to minimize the amount of time spent in privileged contexts. Shared memory can be used to pass data directly, but unless you want all your CPUs pegged at 100% utilization the kernel must be involved somehow to optimize IPC polling. In any event, the same strategies can be used for in-kernel VFS services, so it's not a useful distinction.

            • zenlibs 9 days ago

              The only reason to use an IPC or go through the kernel is to expose the fs to the rest of the OS. If an app doesn't intend to expose the fs, the entirety of the fs can exist within the app process.

              Quoting: > "Unlike many other operating systems, the notion of “mounted filesystems” does not live in a globally accessible table. Instead, the question “what mountpoints exist?” can only be answered on a filesystem-specific basis -- an arbitrary filesystem may not have access to the information about what mountpoints exist elsewhere."

              • wahern 9 days ago

                libfs is an abstraction layer around some of the VFS bits. An analogous Unix approach would be shifting the burden of compact file descriptor allocation (where Unix open(2) must return the lowest numbered free descriptor) to the process rather than the kernel. (IIRC this is also actually done in Fuchsia as part of its POSIX personality library.)
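
                As a rough sketch of what process-side descriptor allocation could look like, the following Python keeps the "lowest numbered free descriptor" rule in a small table owned by the process. FdTable and its methods are hypothetical names, not anything taken from libfs or Fuchsia.

```python
import heapq


class FdTable:
    """Hypothetical per-process descriptor table (illustration only)."""

    def __init__(self):
        self._free = []   # min-heap of descriptor numbers that were released
        self._next = 0    # smallest number that has never been handed out
        self._open = {}   # descriptor number -> handle object

    def allocate(self, handle):
        # Released numbers are always smaller than _next, so the heap minimum
        # (when there is one) is the lowest free descriptor.
        if self._free:
            fd = heapq.heappop(self._free)
        else:
            fd = self._next
            self._next += 1
        self._open[fd] = handle
        return fd

    def release(self, fd):
        del self._open[fd]
        heapq.heappush(self._free, fd)


table = FdTable()
first = [table.allocate(name) for name in ("stdin", "stdout", "stderr")]
table.release(1)
table.release(0)
reused = [table.allocate("a"), table.allocate("b"), table.allocate("c")]
```

                Here first is [0, 1, 2] and reused is [0, 1, 3]: the two released numbers are handed out again, lowest first, before a new one is minted.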

                Notice in the above that it's implied that the actual filesystem server (e.g. the one that manages ext4 state on a block device) is in another process altogether. And so for every meaningful open, read, write, and close there's some sort of IPC involved.

                A process accessing a block device directly without any IPC is something that can already be done in Unix. For example, you can open a block device in read-write mode directly and mmap it into your VM space. Also, see BSD funopen(3) and GNU fopencookie(3)[1], which are realizations of similar consumer-side state management, except for ISO C stdio FILE handles; they are simpler because ISO C doesn't provide an interface for hierarchical FS management.
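
                A small sketch of the open-and-mmap approach, in Python. To keep it runnable without privileges, a temporary regular file stands in for the block device, and the 512-byte block size is an assumption; with a real device you would open its device node instead.

```python
import mmap
import os
import tempfile

BLOCK_SIZE = 512                 # assumed block size, for illustration
IMAGE_SIZE = 4 * BLOCK_SIZE

# A temporary file stands in for the block device.
with tempfile.NamedTemporaryFile(delete=False) as image:
    image.write(b"\0" * IMAGE_SIZE)
    image_path = image.name

fd = os.open(image_path, os.O_RDWR)
try:
    # Map the whole "device" into this process's address space.
    with mmap.mmap(fd, IMAGE_SIZE) as view:
        # Write into block 1 with a plain memory store, no write(2) call.
        view[BLOCK_SIZE:BLOCK_SIZE + 5] = b"hello"
        view.flush()
        block_one = bytes(view[BLOCK_SIZE:BLOCK_SIZE + 5])
finally:
    os.close(fd)

# Read the same bytes back through the ordinary file API as a check.
with open(image_path, "rb") as check:
    check.seek(BLOCK_SIZE)
    on_disk = check.read(5)
os.remove(image_path)
```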

                There's no denying that Fuchsia's approach is more flexible and the C interface architecture more easily adaptable to monolithic, single-process solutions. But it stems from its microkernel approach which has the side effect of forcing VFS implementation code to be implemented as a reusable library. There's no reason a Unix process couldn't reuse the Linux kernel's VFS and ext4 code directly except that it was written for a monolithic code base that assumes a privileged context. Contrast NetBSD's rump kernel architecture where you can more easily repurpose kernel code for user space solutions; in terms of traditional C library composition, NetBSD's subsystem implementations have always fallen somewhere between Linux and traditional microkernels and so were naturally more adaptable to the rump kernel architecture.

                [1] See also the more limited fmemopen and open_memstream interfaces adopted by POSIX.

                • lukeh 9 days ago

                  The catch is, if you want to securely mediate access to a shared resource, you need to have something outside your protection boundary do it, be it a kernel or user-space server.

            • microcolonel 9 days ago

              Cool, but isn't that just an embedded database that looks a bit like a filesystem? It's nothing all that new to use an mmapped embedded database in an application, and every major operating system with a GUI includes (or tends to be distributed with) a copy of sqlite.

              Also, if the point is that you want to expose the filesystem to other applications, but have it local to your own, then why not expose it through FUSE, even if it's in your own process?

          • tathougies 9 days ago

            Fuchsia doesn't have an interesting take. It's a take copied from microkernels, which I contend are the most frequently written kind of kernel, not because they're always technically superior, but because OS authors like writing them.

            • ori_b 9 days ago

              > avoiding expensive kernel<-->user-space switching

              And replacing it with expensive userspace<--->userspace switching.

              But because the kernel is still doing the context switch between the processes, now you're doing userspace<-->kernel<--->userspace switching.

              • nybble41 8 days ago

                In the ideal case the IPC mechanism is just a block of memory shared between processes running simultaneously on separate cores, so there is no need to context-switch. You still want some sort of kernel-based blocking semaphore to avoid busy-waiting when the IPC queue is empty, but since it's only used when the queue is empty it isn't part of the performance-critical path.
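
                A rough Python sketch of that shape, under loud assumptions: a deque stands in for the shared memory block, threads stand in for processes on separate cores, and threading.Event stands in for the kernel-based blocking primitive. WakeOnEmptyQueue is a hypothetical name.

```python
import threading
from collections import deque


class WakeOnEmptyQueue:
    """Hypothetical queue that touches the blocking primitive only when empty."""

    def __init__(self):
        self._items = deque()              # stand-in for the shared memory block
        self._wakeup = threading.Event()   # stand-in for a kernel semaphore
        self._consumer_waiting = False
        self.blocking_waits = 0            # how often the slow path was taken

    def put(self, item):
        self._items.append(item)
        # Only signal when the consumer has announced that it may sleep.
        if self._consumer_waiting:
            self._wakeup.set()

    def get(self):
        while True:
            try:
                return self._items.popleft()   # fast path: no blocking call
            except IndexError:
                pass
            # Slow path: announce intent to sleep, then re-check so that an
            # item added in the meantime is not missed.
            self._wakeup.clear()
            self._consumer_waiting = True
            if self._items:
                self._consumer_waiting = False
                continue
            self.blocking_waits += 1
            self._wakeup.wait()
            self._consumer_waiting = False


queue = WakeOnEmptyQueue()
producer = threading.Thread(target=lambda: [queue.put(i) for i in range(1000)])
producer.start()
received = [queue.get() for _ in range(1000)]
producer.join()
```

                The consumer receives all items in order, and blocking_waits counts how often it actually had to sleep; how small that number is depends on scheduling.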

              • monocasa 9 days ago

                Except you ultimately want your buffer cache, virtual memory, and vfs tightly coupled because they're three sides of the same coin.

                IMO, the ideal combo looks something like this FUSE/BPF work combined with XOK's capability based buffer cache rather than trying to split everything out into user mode.

                • wahern 9 days ago

                  This all gets back to the microkernel debates. As you say, it's far easier to implement such an architecture by stuffing a lot of the most important bits into a monolithic kernel. But various microkernel projects have shown this isn't necessary, and doing things this way has proven brittle and insecure.

                  So-called safe languages don't help, either, because the whole purpose of doing this in a monolithic, shared memory context is precisely because it's easier to move fast and break things in terms of unsafe optimizations (e.g. circular, direct pointer references) unburdened by careful, formal constraints. If written in Rust every other line of Linux code would be wrapped in unsafe{}. Most of the parts that needn't be would be better moved into user space, anyhow.

                  • monocasa 9 days ago

                    There is no uKernel out there that has anything on Linux wrt FS perf. The uKernels have not shown that they've solved the FS problem as well as monolithic kernels.

                    • nwmcsween 9 days ago

                      This isn't due to Linux being faster; a capability-based system that exposed an area of storage with an fs library would basically be app -> hardware vs app -> expensive ctx switch -> vfs -> fs -> hardware.

                      • geofft 9 days ago

                        Right, the claim is that no one has actually demonstrated that, though. It sounds like it should be possible in theory but no microkernel has actually done so.

                        One big reason, I suspect, is that very little of the work in the VFS layer touches hardware. Reads of directory structure, metadata, and data are all handled from cache whenever possible, and writes are buffered. When reading, blocks are prefetched so nearby data is in memory, and when flushing writes, the kernel optimizes and orders them to be most efficient. A library that always handles reads and writes from hardware will be slower because it actually goes to hardware, no matter how many context switches it saves. And if you can't share a read cache and write buffers across processes, you're effectively going to hardware for all initial reads and when the program exits.
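
                        A toy Python sketch of the caching and write buffering described above, to show how few requests reach the device. SlowDevice and BufferCache are hypothetical names, and the LRU policy is an assumption for illustration, not a description of the Linux page cache.

```python
from collections import OrderedDict


class SlowDevice:
    """Hypothetical block device that counts how often it is touched."""

    def __init__(self, block_count):
        self.blocks = [b""] * block_count
        self.reads = 0
        self.writes = 0

    def read(self, number):
        self.reads += 1
        return self.blocks[number]

    def write(self, number, data):
        self.writes += 1
        self.blocks[number] = data


class BufferCache:
    """Hypothetical LRU cache with write-back buffering in front of a device."""

    def __init__(self, device, capacity):
        self.device = device
        self.capacity = capacity
        self._cache = OrderedDict()   # block number -> (data, dirty flag)

    def read(self, number):
        if number in self._cache:
            self._cache.move_to_end(number)   # mark as recently used
            return self._cache[number][0]
        data = self.device.read(number)       # miss: go to the device once
        self._insert(number, data, dirty=False)
        return data

    def write(self, number, data):
        self._insert(number, data, dirty=True)   # buffered, not yet on device

    def _insert(self, number, data, dirty):
        self._cache[number] = (data, dirty)
        self._cache.move_to_end(number)
        while len(self._cache) > self.capacity:
            old, (old_data, old_dirty) = self._cache.popitem(last=False)
            if old_dirty:
                self.device.write(old, old_data)  # write back on eviction

    def flush(self):
        for number, (data, dirty) in list(self._cache.items()):
            if dirty:
                self.device.write(number, data)
                self._cache[number] = (data, False)


device = SlowDevice(block_count=8)
cache = BufferCache(device, capacity=4)
for _ in range(100):
    cache.read(3)
reads_after_loop = device.reads
cache.write(5, b"new")
cache.write(5, b"newer")
writes_before_flush = device.writes
cache.flush()
```

                        One hundred reads of the same block cost a single device read, and two buffered writes to one block cost a single device write at flush time. A library without such a cache would hit the device on every call.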

                        (Also, context switches aren't that expensive. They're not free, sure, but it's easily possible for software costs to outweigh it.)

                        • monocasa 9 days ago

                          The one system I've seen do that in a way that still allows you to multiplex the device and the buffer cache was XOK. But the case of sharing both securely required a very custom filesystem, and there was still a lot of work to totally validate the concept.

                  • blattimwind 9 days ago

                    FS hooking (à la usvfs) is very similar to what you want.

                    • Palomides 9 days ago

                      on linux, something like intel's SPDK will let you do everything, very quickly, in userspace

                      • monocasa 9 days ago

                        Which doesn't quite have the same semantics, as you can't multiplex the disk easily. Your one app totally owns the device.

                      • nwmcsween 9 days ago

                        welcome to exokernels circa 1994

                      • cyphar 9 days ago

                        I saw the authors' talk at Linux Conf last year. It seems like an awesome improvement but I'm actually far more interested in the "future work" which can be done. Namely, this system could be expanded to data "caching" whereby the kernel could route read(2) and write(2) to an underlying "struct file" in-kernel. This would allow for effectively zero-overhead FUSE-based overlay filesystems (which would be super useful for container runtimes -- especially once OCIv2 is usable).

                        • ashishbijlani 9 days ago

                          Author here. I’ve implemented this already and the kernel changes are available on GitHub. Please read section 5.2 in the Usenix paper for details.

                          • cyphar 9 days ago

                            Awesome! I hadn't worked through the entire paper before commenting. I will definitely make use of this for container runtimes (I think we talked about this after your talk). I believe you said you were working on your PhD at the time, I hope it's going well for you. :D

                        • Scaevolus 9 days ago

                          Here's a slide deck from the same authors: https://events.linuxfoundation.org/wp-content/uploads/2017/1...

                          tl;dr: perform metadata caching in the kernel using eBPF, avoiding context switches for common operations like listdir() and getattr(), and reduce FUSE overhead from ~18% to ~6%.
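
                          The general idea can be sketched in a few lines of Python. This is only a model of "answer getattr from a cache, fall back to the daemon on a miss, drop the entry when the file changes"; it is not the authors' eBPF implementation, and UserspaceDaemon and CachingFrontEnd are hypothetical names.

```python
class UserspaceDaemon:
    """Hypothetical stand-in for a FUSE daemon; counts requests it receives."""

    def __init__(self):
        self.attrs = {"/data/log": {"size": 100}}
        self.upcalls = 0

    def getattr(self, path):
        self.upcalls += 1
        return dict(self.attrs[path])

    def truncate(self, path, size):
        self.upcalls += 1
        self.attrs[path]["size"] = size


class CachingFrontEnd:
    """Hypothetical front end that serves repeated getattr calls from a cache."""

    def __init__(self, daemon):
        self._daemon = daemon
        self._attr_cache = {}

    def getattr(self, path):
        cached = self._attr_cache.get(path)
        if cached is None:
            cached = self._daemon.getattr(path)   # miss: one trip to the daemon
            self._attr_cache[path] = cached
        return cached

    def truncate(self, path, size):
        self._daemon.truncate(path, size)
        self._attr_cache.pop(path, None)          # invalidate stale metadata


daemon = UserspaceDaemon()
front = CachingFrontEnd(daemon)
sizes = {front.getattr("/data/log")["size"] for _ in range(50)}
upcalls_after_reads = daemon.upcalls
front.truncate("/data/log", 0)
size_after_truncate = front.getattr("/data/log")["size"]
```

                          Fifty getattr calls reach the daemon once; the truncate and the following getattr each cost one more trip, because the cached entry is dropped when the file changes.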

                          • jakegold 7 days ago

                            FUSE is particularly useful for writing virtual file systems. Unlike traditional file systems that essentially work with data on mass storage, virtual filesystems don't actually store data themselves. They act as a view or translation of an existing file system or storage device.

                            In principle, any resource available to a FUSE implementation can be exported as a file system.