Dropping cache didn’t drop cache

(blog.twitter.com)

175 points | by r4um 1081 days ago

6 comments

  • aetherspawn 1080 days ago
    I know that Linus hates unit tests, but these kinds of scenarios are perfect for regression tests. The investigation of the issue took so long that you'd really want to spend the extra few hours writing a test to save yourself the investigation next time. The patch doesn't include any comment about a race condition in the actual code, so let's assume that all the knowledge is more-or-less lost and forgotten in 12 months or so.
    • Denvercoder9 1080 days ago
      It's no longer true that the kernel doesn't have unit/regression tests. The kernel ships with a thousand or so selftests nowadays, Linus actually requires selftests for some patches, and there's also a unit testing framework (kunit).

      Regardless, from the commit message:

      > I didn't manage to come up with a reproducer in test environment, and the problem can't be reproduced after rebooting.
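
      For reference, a minimal KUnit test is only a few lines of kernel C. This is just a sketch; the suite and case below are invented for illustration, not a real in-tree test:

          #include <kunit/test.h>

          /* A made-up check, standing in for real assertions. */
          static void example_math_test(struct kunit *test)
          {
                  KUNIT_EXPECT_EQ(test, 4, 2 + 2);
          }

          static struct kunit_case example_test_cases[] = {
                  KUNIT_CASE(example_math_test),
                  {}
          };

          static struct kunit_suite example_test_suite = {
                  .name = "example",
                  .test_cases = example_test_cases,
          };
          kunit_test_suite(example_test_suite);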

    • biryani_chicken 1080 days ago
      If Linus doesn't like unit tests, can't a separate entity keep their own independent unit test repository, test the kernel against it and report regressions? I think it would be a worthy project.
      • nijave 1079 days ago
        Depending on the stability of the interfaces, you introduce a game of cat-and-mouse of sorts, where people make test-breaking changes and some other entity is constantly trying to keep the tests updated. Maybe it's less of an issue if you're not making changes that require test refactoring (which probably also comes down to the quality of the tests).
      • nitrogen 1079 days ago
        I haven't read it in years, but last time I did Phoronix was doing this implicitly by running every new kernel against the Phoronix Test Suite.
    • wolfi1 1080 days ago
      I don't see how a classical unit test would have caught this, as all other components are mocked away. Regression tests, on the other hand, can only catch errors that have already occurred. I'm not sure that was the case.
      • Izkata 1080 days ago
        The idea is to write a regression test now, to catch it if it's accidentally reintroduced by a future developer who knows nothing about this bug.
        • pydry 1080 days ago
          Still not something a unit test can really catch. An end to end test maybe (run multiple times coz it's a race condition).

          I've seen unit tests that try to reproduce race conditions. They're naive mirrors of the code that break as soon as you even think about changing the implementation. They're kind of pointless.
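
          A sketch of the kind of test being described, in userspace C with an invented racy counter; note it can only make the failure likely, never certain:

              #include <pthread.h>
              #include <stdio.h>

              /* volatile only keeps the compiler from collapsing the loop;
               * it does NOT make this thread-safe. */
              static volatile long counter;

              static void *incr(void *arg)
              {
                      for (int i = 0; i < 1000000; i++)
                              counter++;      /* racy: non-atomic load/add/store */
                      return NULL;
              }

              int main(void)
              {
                      /* Repeat the racy section; any single pass may get lucky. */
                      for (int run = 0; run < 20; run++) {
                              pthread_t a, b;
                              counter = 0;
                              pthread_create(&a, NULL, incr, NULL);
                              pthread_create(&b, NULL, incr, NULL);
                              pthread_join(a, NULL);
                              pthread_join(b, NULL);
                              if (counter != 2000000) {
                                      printf("race hit on run %d: %ld\n", run, (long)counter);
                                      return 1;   /* "failure" means the race reproduced */
                              }
                      }
                      puts("race not reproduced this time");
                      return 0;
              }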

          • mnahkies 1080 days ago
            Personally I've found that the value in the tests you are describing is in proving that you've fixed the issue.

            When you have a race condition that is difficult to reason about or reproduce, if you can distil it down to a test case that consistently fails until you fix it then you can be comfortable that your solution holds - yes it may be of little value going forward but it can be a very valuable part of the process.

            It can also serve as documentation going forward: whilst it may be brittle, when it does fail, hopefully the person will check the description/blame and instantly find the context as to why it exists, before using their judgement as to whether it is still valid. (Rather than changing something that looks odd and not realising that they've re-introduced the edge case.)

            • pydry 1079 days ago
              I don't find that this type of test provides any value. It's a hollow reflection of the code. At best it provides some sort of confidence that the code you intended to write was indeed the code you wrote.

              End to end tests do provide a lot of value, though, both in validating that your fix was correct and preventing regressions. They require a lot of up front investment in tooling, however.

              • mnahkies 1079 days ago
                I prefer to lean heavily on e2e/integration tests as much as possible, because I've found most bugs arise from the boundaries between discrete systems, particularly when they are developed by different teams.

                However the point I was trying to make was more around super specific scenarios where a somewhat artificial test can highlight the possibility and prove that it's correctly handled, as well as showing future readers that the code wasn't making an arbitrary decision but implementing desired behaviour.

                For example it's difficult to simulate rare/transient failure states in e2e testing, but in unit tests you can easily force these and verify that they are handled correctly.
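
                For instance, a minimal sketch in plain C; the fail_next hook and the EAGAIN retry logic are invented for illustration:

                    #include <errno.h>
                    #include <stdio.h>

                    static int fail_next;   /* test hook: force one transient failure */

                    static int do_io(void)
                    {
                            if (fail_next) {
                                    fail_next = 0;
                                    errno = EAGAIN;   /* simulated rare/transient error */
                                    return -1;
                            }
                            return 0;
                    }

                    static int do_io_with_retry(void)
                    {
                            for (int attempt = 0; attempt < 3; attempt++) {
                                    if (do_io() == 0)
                                            return 0;
                                    if (errno != EAGAIN)
                                            return -1;   /* only retry transient errors */
                            }
                            return -1;
                    }

                    int main(void)
                    {
                            fail_next = 1;   /* inject the rare failure */
                            if (do_io_with_retry() != 0) {
                                    puts("FAIL: transient error was not retried");
                                    return 1;
                            }
                            puts("ok: transient EAGAIN was retried");
                            return 0;
                    }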

                One thing that is tangential but springs to mind is when there are log statements that a monitoring system is observing. In these cases I'd add a unit test for that log line being emitted, not because I'm worried that the code doesn't work, but because I'm worried someone will remove it in future not realising that it was the backbone of some monitor.
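
                Such a test can be as dumb as pinning the exact string. A sketch; the buffer-backed logger and the message are made up:

                    #include <assert.h>
                    #include <stdio.h>
                    #include <string.h>

                    static char log_buf[256];   /* capture sink for the test */

                    static void log_line(const char *msg)
                    {
                            snprintf(log_buf, sizeof(log_buf), "%s", msg);
                    }

                    static void rebalance_shards(void)
                    {
                            /* ... real work elided ... */
                            log_line("shard rebalance complete");  /* a monitor alerts on this */
                    }

                    int main(void)
                    {
                            rebalance_shards();
                            /* If someone edits or deletes the log line, this fails and
                             * points them at the monitoring dependency. */
                            assert(strstr(log_buf, "shard rebalance complete") != NULL);
                            puts("ok: monitored log line still emitted");
                            return 0;
                    }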

                • nitrogen 1079 days ago
                  > I'm worried someone will remove it in future not realising that it was the backbone of some monitor

                  It's for this reason that defining a separate method for monitors might be a good idea. Then it's clear this isn't just someone's overeager noise, and the method could be changed to send data directly to a monitoring system in the future if the logging and monitoring stack changes.

            • aetherspawn 1080 days ago
              Yes, I think the most value comes from reminding someone that the problem exists in, say, 2 years' time.

              At work the other day I refactored a major system and discovered several completely out-of-left-field legacy behaviours because of regression tests that didn't seem to make sense (“Why would a sane person want THIS to happen? Eh? Ooooh”)

      • londons_explore 1080 days ago
        The ideal tests are not strictly unit or integration tests, but have a suitable mixture of both.

        They also run with sanitisers which can detect races and undefined behaviour.

        And really good tooling also does formal verification to try and find any codepaths or inputs that could lead to an assertion failure.

        I think any of the above 3 things would have had a reasonable chance of catching this issue.
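
        As a concrete example of the sanitiser point (userspace; in-kernel, KCSAN plays the analogous data-race-detection role): the racy-counter sketch further up, saved as a hypothetical racy.c and built with ThreadSanitizer, gets flagged deterministically on the first run instead of probabilistically:

            cc -g -fsanitize=thread racy.c -lpthread && ./a.out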

    • enw 1080 days ago
      This patch got me curious how the kernel is even tested. I have yet to see a single kernel commit that included any sort of tests.
      • YZF 1080 days ago
        The answer to how the kernel is tested is (historically): mostly manually, by a lot of people: https://stackoverflow.com/questions/3177338/how-is-the-linux...

        Basically what used to be known as alpha/beta testing. Seems like there's more automated testing these days.

        I work on a smaller piece of software, with many automated tests, that is significantly less reliable. I wonder if there are some lessons there. Another interesting tidbit is that our tests break a lot as well.

        In "The Olde Days" many a pieces of complex/reliable software was written without a single unit test or any automated test for that matter. Sure, some of it had bugs, maybe even nasty bugs, but so does most software today. Have we made progress? maybe.

        • josephg 1080 days ago
          Tests let you modify code that’s unfamiliar to you without fear you’ll accidentally break something. If you live in a codebase and work with it every day, automated testing is arguably less important because you build an intuition around what kind of changes might break things. I wouldn’t be surprised if the lack of testing has actively fostered the ownership model Linux has adopted. You would need something like that to keep the bugs at bay.

          But there’s lots of code out there which doesn’t (or can’t) have a dedicated owner in the same kind of way. I had an issue opened on one of my GitHub projects the other day. I’m the sole maintainer of this project, the code is fiendishly complex, and it had been working without any changes for about 18 months. Somebody found a bug! But after 18 months I’ve forgotten the mental context I had while writing that code - so I was terrified I’d break something else while fixing that problem. But I had an extensive test suite to fall back on. So when I fixed the bug and the existing tests all passed, I had the confidence to publish a new version. (And of course, I added the bug’s repro to the test suite for next time.)

          Lots of projects need guard rails like this. Maybe they’re solo projects. Or maybe there’s a lot of churn in the team. Sometimes there’s just more code than there are engineers to keep track of everything, so people are bouncing around a lot and don’t build much expertise in a single area. For the long tail of software, I’m pretty confident in saying unit tests have dramatically improved reliability beyond the standard we had a couple of decades ago. Have they made us lazier too? Maybe! But in my book they’re definitely a win on the whole.

      • caskstrength 1080 days ago
        You probably couldn't find any because tests are usually submitted as standalone commits. You can find tests for several kernel subsystems here: https://github.com/torvalds/linux/tree/master/tools/testing
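
        For the curious, the documented way to run them from a kernel source tree (see Documentation/dev-tools/kselftest.rst) is:

            make kselftest
            # or just one subsystem, e.g.:
            make -C tools/testing/selftests TARGETS=timers run_tests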
    • arve0 1080 days ago
      > The patch doesn't include any comment about a race condition in the actual code

      …but it’s in the commit message, which is basically the same as a comment.

      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

      • aetherspawn 1080 days ago
        Yes, as long as you know which line to run git blame on.
        • TeMPOraL 1080 days ago
          It'll work until the next big rewrite of that component :).

          I love it when people put this level of detail into their commit messages, but this technique has one weak spot I don't know what to do about: if someone makes a change that confuses git's semantic heuristics - rename a function, split a file into two, etc. (or maybe all of them in a single commit), it creates a boundary for blame/log-trace that's hard to bridge.

          But then, maybe I'm just not clever enough with git log -L ...

          • nijave 1079 days ago
            Hopefully whoever is git blaming understands enough about git to hop past any intermediate changes back to the source

            I've never done any kernel development, but with commercial code it's pretty common to have to dig backwards including going back to the issue tracker or even searching through company chat (Slack, email, etc) if the shop doesn't do [a good job with] documentation

          • nitrogen 1079 days ago
            The --follow, -M, and -C options have usually helped me out of most blame dead ends, but sometimes I do still have to look up each blamed commit manually to keep following history.
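
            Concretely, the invocations in question (all standard git; the kernel paths here are just examples):

                git log --follow -p -- fs/dcache.c      # follow the file across renames
                git blame -M -C -C fs/dcache.c          # credit moved/copied lines to their origin
                git log -L :d_invalidate:fs/dcache.c    # history of a single function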
    • quickthrower2 1080 days ago
      It seems so strange to me that in 2021 perhaps the most widely used OS kernel in the world isn't aiming for test coverage, given the billions in value obtained from it. I guess this is for legacy reasons; I know as a dev it's very hard to add tests to an untested system, both practically and psychologically, and that's in a team environment, not open source.
      • slver 1080 days ago
        I'm fine with that premise: it's strange.

        But if you think about the implications of this strangeness, it can be equally interpreted as "Linus is playing a dangerous game" or "maybe we've kind of exaggerated the importance of unit tests, seeing Linux is doing fine".

        Linux doesn't seem to stand out as exceptionally buggy. It has bugs, but then so does every project that's unit tested.

        • Retric 1079 days ago
          Alternatively, unit tests are simply a poor fit when context is as critical as it is for Linux kernel bugs, in much the same way that unit tests are really hard to write for concurrency issues.

          In theory they should still be useful, but in practice perhaps not.

          • slver 1079 days ago
            Unit tests are honestly a poor fit for many problems. Unfortunately it takes Linus levels of determination to resist and carve out an exception status for your project.
          • sangnoir 1079 days ago
            Unit tests also have to be maintained - which compounds the cost
      • londons_explore 1080 days ago
        The Linux kernel is pretty well laid out. There are lots of unit and integration tests that could be usefully run against it with no code changes.
    • iamgopal 1080 days ago
      Any rationale why he hates unit tests?
      • encryptluks2 1080 days ago
        He doesn't; he just wants the billion-dollar enterprises to be testing it, not the kernel team, which is more of an academic community. Makes sense to me.
        • bitcharmer 1080 days ago
          It doesn't make much sense to me.

          There is nothing positive in forfeiting principles of good software engineering in the name of a falsely perceived imbalance of stakes in the project. Instead of falling back on the biggest beneficiaries of the project for testing, we should have a participation model that welcomes contributions and improvements at all stages of development.

          I've contributed to the kernel multiple times over the last few years and it was always a pain to some degree. Sometimes due to the outdated contribution model, sometimes because of maintainers' personal preferences and allegiances.

          Sadly current kernel dev culture still promotes exclusion and building walls instead of reducing tension and removing barriers for entry.

          • unglaublich 1080 days ago
            They do welcome contributions at all stages. Just because it doesn't happen on your terms doesn't make it exclusive or about building walls. Do you think there is a contribution model that will make everyone happy AND the kernel good?
            • bitcharmer 1080 days ago
              > Just because it doesn't happen on your terms

              That's a good example of the toxic attitude I'm referring to. I never said anything about my terms. Just a healthy OSS ecosystem with more balanced control over what gets done and how.

              > Do you think there is a contribution model that will make everyone happy

              I don't know, do you? It'd be safe to assume however that the Linux project could benefit from some improvements. Or are you trying to suggest it's perfect in its current form?

              • encryptluks2 1080 days ago
                How many projects have you worked on that claimed they were coming in to restructure things to make them better, and they just made things a lot worse? Everyone has opinions on what could be improved, but rarely do they actually get things accomplished like the Linux kernel team has. They already have contribution documents as well:

                https://www.kernel.org/doc/html/latest/process/1.Intro.html

                > Sadly current kernel dev culture still promotes exclusion and building walls instead of reducing tension and removing barriers for entry

                I suggest that if you are being excluded then you say what that exclusion is. I keep seeing people say this, but then when asked about it they say things like... well, I wanted to develop a driver in JavaScript and they said no. If you have legitimate issues then bring them forward to the Code of Conduct committee:

                https://www.kernel.org/code-of-conduct.html

                • bitcharmer 1080 days ago
                  Ok, I'll bite, here's a good example...

                  Working for a large institution that heavily employs the latest hardware, we stumbled upon an issue with TSC calibration on Skylake platforms with an overclocked BCLK. The issue stemmed from the fact that CPUID for our particular SKU returned the nominal instead of the actual TSC frequency. We submitted a simple patch that just introduced a kcmdline parameter to disable CPUID-based calibration and fall back to the MSR-based method, which works fine. The maintainer shot it down only because _in theory_ the CPUID method should work even for overclocked cases. In practice, an Intel person admitted that the firmware for our particular SKU should be updated to fix this. Over two years later Intel still hasn't released the fw update, and our simple patch hasn't been accepted, because the maintainer sees the world as a perfect place. You know who helped us? RedHat. They saw the value of adding this workaround to allow their users to use Linux even if it was the CPU vendor that messed up.

                  Now tell me, how is this welcoming attitude on Linux kernel part? How can you blame me for my opinion when a partial fix to a real problem gets rejected for completely academic reasons?

                  • encryptluks2 1079 days ago
                    Where is the mailing list thread discussing the fix? I'd need more context. It does sound like Intel admitted that they need to update the firmware to resolve this. Are you loading the Intel microcode?

                    I found a related patch here:

                    https://lkml.org/lkml/2008/9/29/178

                    • bitcharmer 1079 days ago
                      Skylake was launched in 2015. Why someone would think we submitted a patch 7 years before that date is beyond me. Maybe you haven't read my comment after all, I get that impression on LKML sometimes...

                      Anyway, here it is: https://lkml.org/lkml/2018/7/13/631

                      • encryptluks2 1079 days ago
                        I didn't think that was you, I just thought it was related and an example of how a similar patch could get approved.

                        I looked at your example, and I'm sorry that you thought I was doubting you, but it looks like everything was resolved satisfactorily based on what I read. I didn't get the impression they were saying no, just that there were other ways to do it.

                        • bitcharmer 1079 days ago
                          > everything was resolved satisfactorily

                          What part of "our fix was not accepted" do you deem "resolved satisfactorily"? I provided you with broad context and a reference to the original discussion that clearly shows there was opposition to a practical solution and a lack of the promised fix on the hw side, and you still have the audacity to say it was "resolved". The patch was not accepted into the mainline, so please explain what you mean when you say the problem was resolved. Resolved how?

                          This is exactly what drives practitioners like myself away from contributing to the kernel.

                          • janto 1079 days ago
                            It doesn't look like you're that good at "reducing tension", as you describe it in your initial post. Maybe you should not be contributing to the kernel then?
              • janto 1080 days ago
                It's safer to assume that what you perceive as toxic is actually working much better than what you have in mind, given the success of the Linux ecosystem.
                • bitcharmer 1080 days ago
                  The level of toxicity and success are orthogonal. The warehouse part of Amazon's business is even more successful and more toxic (as a workplace) at the same time. So what is the point you are trying to make here?
                  • TeMPOraL 1080 days ago
                    Depends on the metric of success you pick. Comparing Linux Kernel to Amazon warehouse business is apples to elephants.

                    Amazon's warehousing is successful financially, perhaps less so reputationally, but the workers in general aren't very happy about the conditions - and they're not exactly volunteers[0]. A proprietary business is also by definition extremely exclusive - it's not like Amazon will give you their blueprints and operations book for free, to help you start your semi-automated warehouse, and it won't let you contribute to their business either, without a great degree of vetting.

                    Meanwhile, Linux Kernel is built by true volunteers, and is successful in terms of spread and reliability as a software engineering artifact. Its process is very much inclusive - they'll accept just about anything that contributes positively to the final artifact, according to goals the "core team" defines[1], without much regard for who you are or where you are from. The people contributing regularly don't seem to complain about conditions either.

                    There's no actual toxicity in there. Unfortunately, in recent years it's become popular for people to call out anything they don't like as "toxic". It often reads like a 5 year old complaining that their grandma makes them wear a hat outside in winter[2].

                    I've never heard that Linus "just wants the billion dollar enterprises to be testing it not the kernel team which are more of an academic community", but I can see this as a valid point of view. Neither he nor the other early Linux Kernel contributors wanted the whole planetary economy to become critically dependent on their little project. They don't automatically inherit responsibility for everyone else - especially as the whole project is open and inclusive, so anyone who wants to help can help - for instance by creating their own test suites and using them to create and submit bugfixes.

                    --

                    [0] - Not in the ways that matter. Sure, they've approached Amazon on their own, but they did that only because they don't have any better way of housing and feeding themselves and their kin. The market may not be putting a literal gun to your head, but that doesn't make economic desperation less of a threat to life.

                    [1] - Without that part, you'd have a ball of mud, not a software project.

                    [2] - Though every now and then, the "toxicity" line is more like playing with a WMD. Like a 5 year old who thinks threatening to call CPS on their parents for not giving them candy is being clever.

                    • bitcharmer 1080 days ago
                      I'm not comparing the Linux project to Amazon. I'm saying you can be highly successful and toxic at the same time, and I provide Amazon as an example.

                      > they'll accept just about anything that contributes positively to the final artifact

                      That is simply not true. See my example a few comments up.

                  • janto 1079 days ago
                    Labelling something as "toxic" likely implies "bad for the success of the ecosystem". So no, they are definitely not orthogonal. They are (presumably negatively) correlated.

                    The point made was: seeing that the Linux project is pretty successful already, it's not "safe to assume" that these changes you envision would lead to either a good kernel or a healthy ecosystem.

                    You seem to want to optimize for something else...?

              • unglaublich 1079 days ago
                > That's a good example of the toxic attitude I'm referring to. [...]

                Calling a counterargument toxic is very... ah well, you know.

                > [...] Just a healthy OSS ecosystem with more balanced control over what gets done and how. [...]

                Those are _your_ terms I was referring to.

                > I don't know, do you? [...]

                No, but I'm also not saying: "current kernel dev culture still promotes exclusion and building walls". If you come up with such bold allegations you'll have to back them up.

                > [...] Or are you trying to suggest it's perfect in its current form?

                No, because "perfect" is an individual opinion.

          • the_duke 1080 days ago
            I used to have the same opinion, but I think for Linux it actually makes sense.

            A large, automated test suite places a lot of extra burden on contributors and maintainers.

            You will be expected to run tests, fix old tests, investigate flaky tests, and write new ones that cover your bug fix or new feature.

            The kernel instead pushes the responsibility for writing test suites and validating new releases onto first-line consumers like Red Hat, IBM, Google, et al.

            This is probably the only viable way to make it work. Maintainer resources are stretched thin as it is.

        • dominotw 1079 days ago
          unit testing != QA testing
    • encryptluks2 1080 days ago
      It's not that he hates unit tests; it's that the billion-dollar enterprises and countless other people using Linux are the ones that should be testing it.
      • alpaca128 1080 days ago
        Doesn't make sense here, doesn't make sense with Windows 10 being mainly tested by actual users, doesn't make sense with AAA games being alpha/beta versions at release.

        The result is never as good as it could be.

    • TeeMassive 1080 days ago
      I used to be a huge fanboy of Linux kernel development. But when I actually tried to see how it's done, I just gave up. I mean, setting up your email so it's like the 80s? Coding standards based on old tty terminals? No unit testing? Toxic leaders? No thank you.

      While GNU/Linux was once 20 years ahead, it's now more than 10 years behind.

      Also relevant: https://www.usenix.org/system/files/1311_05-08_mickens.pdf

      > This is not the world of the systems hacker. When you debug a distributed system or an OS kernel, you do it Texas-style. You gather some mean, stoic people, people who have seen things die, and you get some primitive tools, like a compass and a rucksack and a stick that’s pointed on one end, and you walk into the wilderness and you look for trouble, possibly while using chewing tobacco. As a systems hacker, you must be prepared to do savage things, unspeakable things, to kill runaway threads with your bare hands, to write directly to network ports using telnet and an old copy of an RFC that you found in the Vatican. When you debug systems code, there are no high-level debates about font choices and the best kind of turquoise, because this is the Old Testament, an angry and monochromatic world, and it doesn’t matter whether your Arial is Bold or Condensed when people are covered in boils and pestilence and Egyptian pharaoh oppression. HCI people discover bugs by receiving a concerned email from their therapist. Systems people discover bugs by waking up and discovering that their first-born children are missing and “ETIMEDOUT” has been written in blood on the wall.

      • encryptluks2 1080 days ago
        > setting up your email so it's like the 80s

        Using email for development isn't the 80s. It is very streamlined and built into the Git code itself and mail clients that are actively maintained, like Mutt. Just because you can't imagine what life would be like without Gmail doesn't mean that these maintainers are less efficient or more antiquated than you are.

        > Coding standards based on old tty terminals

        Don't even know what this means. What coding standard is based on old TTY terminals? TTY is part of Linux.

        > No unit testing

        Linux is the most widely used OS in enterprise. The internet runs on Linux. If you want unit tests for everyone then start an open source project to do this, or complain about the enterprises that aren't testing.

        > Toxic leaders

        Who is toxic? I love Linus's approach. It has allowed him to get things done without toxic people in the community trying to social engineer their way into destroying something good.

        • loxias 1080 days ago
          > Don't even know what this means. What coding standard is based on old TTY terminals? TTY is part of Linux.

          I'm guessing they're trying to poke fun at people who care about your code being too wide, and hard to read?

          (But, while I no longer stick to a strict 80-column limit like I used to, wide lines in code suck. A proud mutt-using systems engineer.)

          • d110af5ccf 1080 days ago
            If you strictly adhere to a ~100 column width then you can easily fit 3 files side by side on a mainstream monitor. People complain about useful practices and characterize them as antiquated rather than taking the time to appreciate that not everyone uses the same workflow.
            • encryptluks2 1080 days ago
              Setting a fixed width is not something I agree with. You can enable line wrapping in pretty much every text editor.
              • cerved 1079 days ago
                Remember that long lines aren't just a problem for display.

                Longer lines might indicate that the code is either nested too deep, and likely too complicated, or poorly named withUnreadableRidicouslyLongWeirdWhatsGoingOnHereName.

                It's also problematic if the line ends up doing too much, i.e. lots of chained methods, since VCSes track changes per line.

                While editors do a good job at word wrapping, they're not always excellent. Humans may do a better job.

                Note that the kernel's style guide isn't fixed at 80 characters; it's a strongly preferred limit:

                "Coding style is all about readability and maintainability using commonly available tools.

                The limit on the length of lines is 80 columns and this is a strongly preferred limit.

                Statements longer than 80 columns will be broken into sensible chunks, unless exceeding 80 columns significantly increases readability and does not hide information. Descendants are always substantially shorter than the parent and are placed substantially to the right. The same applies to function headers with a long argument list. However, never break user-visible strings such as printk messages, because that breaks the ability to grep for them."

              • d110af5ccf 1079 days ago
                I'm late to respond, but automatic line wrapping completely butchers code formatting and creates a difficult-to-skim mess.

                I also don't understand why fixed width would cause any issues in practice. The only time I find myself coming anywhere near 80 characters is for API calls with way too many arguments (I'm looking at you, OpenGL). Even then, the fault arguably lies with me for my tendency to use overly descriptive variable names. And anyway, it's trivial to split such calls across multiple lines such that it actually improves the overall readability.

        • TeeMassive 1080 days ago
          > [...] It is very streamlined and built into the Git code itself and mail clients that are actively maintained, like Mutt. Just because you can't imagine what life would be like without Gmail doesn't mean that these maintainers are less efficient or more antiquated than you are.

          You assume that there are no other tools to submit code and participate in discussion. That's obviously not true. Also, the argument that "it is supported with git" is just a non-argument.

          > Don't even know what this means. What coding standard is based on old TTY terminals? TTY is part of Linux.

          You obviously don't know, indeed. The 80-column limit comes from old teletypewriters: http://www.navy-radio.com/tty.htm

          And even Linus disagreed recently (http://lkml.iu.edu/hypermail/linux/kernel/2005.3/08168.html) but the limit was only increased to 100 columns. So yeah.

          > Linux is the most widely used OS in enterprise. The internet runs on Linux.

          And?

          > If you want unit tests for everyone then start an open source project to do this, or complain about the enterprises that aren't testing.

          That's just arguing from a position of power. There is an open source project, by the way, named KUnit, but it's obviously left as an afterthought. Testing is left to "community testing", which is just a way of saying "compile the whole thing, test the whole thing, and hope somebody catches most bugs" - the naive way of doing things before automated testing became popular in the early 2000s.

          > Who is toxic? I love Linus's approach. It has allowed him to get things done without toxic people in the community trying to social engineer their way into destroying something good.

          There are tons of other successful open source projects that don't have to rely on base insults and abuse. The fact that the Linux kernel's way of working hasn't changed a bit in 30 years just shows how intolerant its community is to new ideas.

          • eitland 1080 days ago
            Let's see here:

            There's a whole community of otherwise often cutthroat rivals cooperating somewhat peacefully on a truly giant software infrastructure project for 25+ years. A place where no one even asks about your skin colour, gender, religion or other abilities as long as you can submit correct code in a way that everyone is comfortable with. I'm fairly certain even Donald Trump, bin Laden, a seven-year-old schoolgirl or a gay albino penguin could all get a patch accepted if the patch was OK. Edit: I'm fairly certain even Iranian and Israeli coders could get patches merged in the same release and nobody would even notice. Maybe it has even happened a number of times already for all I know. Same goes for hardcore Atheists, Buddhists, Catholics, Hindus, and what have you. We can't even know because the kernel people do not keep notes about that.

            Then there are people coming from the outside trying to insert themselves as authorities not only on what tooling should be used but also as moral authorities down to the level of deciding on what language should be used.

            Who were toxic did you say?

            • TeeMassive 1079 days ago
              > There's a whole community of otherwise often cutthroat rivals cooperating somewhat peacefully on a truly giant software infrastructure project for 25+ years.

              That's just appealing to tradition. There are other open source projects with "cutthroat rivals cooperating" that do not insist on keeping archaic systems. It's not like the viral clause of the GPL will cease to exist and stop compelling public contribution of useful changes just because patches are no longer sent as stripped-down emails.

              > A place where no one even asks about your skin colour, gender, religion or other abilities as long as you can submit correct code in a way that everyone is comfortable with. I'm fairly certain even Donald Trump, bin Laden, a seven-year-old schoolgirl or a gay albino penguin could all get a patch accepted if the patch was OK. Edit: I'm fairly certain even Iranian and Israeli coders could get patches merged in the same release and nobody would even notice. Maybe it has even happened a number of times already for all I know. Same goes for hardcore Atheists, Buddhists, Catholics, Hindus, and what have you. We can't even know because the kernel people do not keep notes about that.

              I never made the argument that they reject people based on their ethnic or religious background; why are you even bringing this up? Toxicity is not limited to being xenophobic...

              > Then there are people coming from the outside trying to insert themselves as authorities not only on what tooling should be used but also as moral authorities down to the level of deciding on what language should be used.

              That's just appealing to tradition again, and if we weren't talking about software I would be under the impression I was debating a hardcore conservative ranting against "outsiders", but I digress.

              It's not like critique from the "outside" can't be insightful. And frankly, when people immediately become defensive and feel that someone is threatening their "moral authority" with a simple critique, it only shows that they are not speaking from a position of confidence.

          • tankenmate 1080 days ago
              > You assume that there are no other tools to submit code and participate in discussion. That's obviously not true. Also, the argument that "it is supported with git" is just a non-argument.

              git was developed as a tool to support Linux kernel development; why would you not use the right tool for the job? git was (and is) better than all the alternatives, which is why kernel developers have stuck with it.

              I guess if git offends you, you could just use patch files.

            Or write something that is obviously so much better that people naturally switch to it.

            • TeeMassive 1079 days ago
              I'm not even arguing against git, what are you talking about?

                I'm only pointing out that Linux kernel development is about the only project (except maybe some GNU projects) that still insists on using archaic stuff, like sending patches by email.

                And in fact I remember reading about the maintainer of a GNU project (whose name I don't remember) ranting about how he's pretty much the only one maintaining the project. There's also a project to get people up and running to contribute to the kernel code base; most people stop at the email stuff. It's not like those things don't have tangible consequences.

                And so far I've never seen good counterarguments, just snark and logical fallacies in the form of "it's always been that way" and "just start your own", which only reinforces my point about toxic leaders.

  • ketzo 1080 days ago
    The saying I heard in my embedded programming class was:

    “There are two hard things in programming: naming things, cache invalidation, and off-by-one-errors.”

  • aeyes 1080 days ago
    I am amazed that Twitter employs kernel developers.
    • oblio 1080 days ago
      FAANG & co. discovered that not being afraid of software, but instead harnessing its power, lets them leave old school enterprises in the dust.

      And once you're no longer afraid of software, but instead an active creator of software, it's much cheaper and safer to instead employ some open source developers to take care of maintenance for your stuff. The old school enterprise way would have been to pay IBM, Red Hat, Oracle, whoever, millions and tens of millions of dollars in support for Cover-Your-A** type policies set by middle managers.

      FAANG & co. instead hire developers for this maintenance, and guess what, once you hire these people, they can do other things for you, because they're generally smart people (and also motivated, can work without supervision, etc.).

    • saagarjha 1080 days ago
      Most large companies employ at least a handful of kernel developers to keep their servers running right. Some (Google, Facebook, etc.) have large teams that contribute significant amounts of code upstream, often based on what they're running internally.
    • EnderWT 1080 days ago
      From this job posting[1]:

      "The Kernel & Operating System team at Twitter is responsible for shipping the Linux kernel and OS versions on which all of Twitter’s services run. Our job is to ensure the safe, reliable, fast, and efficient operation of the lowest levels of the Twitter software stack! We actively contribute many of our fixes upstream and are active members of the Linux OSS community."

      [1]https://careers.twitter.com/en/work-for-twitter/202101/005c4...

    • otterley 1079 days ago
      You shouldn’t be. At the level of scale at which Twitter runs, a custom kernel modification that improves performance by even a small percentage can yield millions of dollars per year in hardware and energy savings. Investing in a kernel developer makes financial sense when viewed through that lens.
    • bartvk 1080 days ago
      The writer recently joined, if their Twitter bio is correct (August 2020). They have a handle with a bunch of numbers, and zero tweets: https://twitter.com/YangShi05755293

      I couldn't really tell whether there are more people on the team.

      • saagarjha 1080 days ago
        A quick search shows that the team includes at least several more engineers, which is not very surprising.
  • flakiness 1080 days ago
    Great read, although I don't think I have enough knowledge to fully appreciate the details of the article.

    I had a similar problem before, and I didn't even notice that the cache was not cleared; I worked on pointless hypotheses until a coworker pointed out that there was a case where the kernel didn't evict the page cache. It's very hard to even detect that problem.

    Twitter's Engineering blog has several interesting posts recently btw. Kudos to them.

    • bartvk 1080 days ago
      What amazes me is that the patch is so small:

      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

      • lilyball 1080 days ago
        It’s so small because it’s deleting a bad optimization.
        • vanderZwan 1080 days ago
          Ah yes, the old

          "Doctor, when I do [x] it hurts"

          "Then stop doing that"

          approach

    • justaj 1079 days ago
      > Great read, although I don't think I have enough knowledge to fully appreciate the details of the article.

      Same here, but luckily I started reading the article expecting that. Primarily because:

      > The article requires some prerequisite knowledge about Linux kernel memory management subsystem.

      Oh how I wish all articles had this sort of disclaimer.

  • bboreham 1080 days ago
    I tried to debug a very similar-sounding issue a couple of years back. Many GB used in dentries, not shrinking when asked, no obvious cause.

    Sadly I have no kernel hacking skills, don’t even know what a dentry is. Kudos to the author.
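
    For reference, the "asking" here is the standard drop_caches interface (Documentation/admin-guide/sysctl/vm.rst), run as root:

        sync
        echo 2 > /proc/sys/vm/drop_caches   # reclaim dentries and inodes
        echo 3 > /proc/sys/vm/drop_caches   # page cache plus dentries/inodes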

    • a1369209993 1079 days ago
      FWIW, I'd assume that's "directory entry", the (file name, inode number) pair that's used to associate names of files in a directory with the data structures (inodes) describing those files. Not sure how you'd end up with many gigabytes of them though; they're usually on the order of length-of-name + 8-24 bytes IIRC, so that would be around tens of millions of files?
      • bboreham 1077 days ago
        From TFA: “it seems that dentry caches consume around 2GB memory under one memory cgroup”.

        My issue happened after stress-testing Kubernetes by repeatedly starting and stopping many containers. So over time there could be tens of millions of files bind-mounted.

        • bboreham 1076 days ago
          Here is the output from 'slabtop' that I captured at the time, right after a commanded cache drop - looks like dentry wasn't the biggest component, but there were 32 million of them:

                 OBJS    ACTIVE  USE OBJ SIZE   SLABS OBJ/SLAB CACHE SIZE NAME                   
            534740452 533096794  99%    0.03K 4312423      124  17249692K kmalloc-32
             32734926  23946797  73%    0.19K 1558806       21   6235224K dentry
                86560     86558  99%   64.00K   86560        1   5539840K kmalloc-65536
              2124646   1994340  93%    2.00K 1062323        2   4249292K kmalloc-2048
              1013664   1013606  99%    4.00K 1013664        1   4054656K kmalloc-4096
             14592544  11935392  81%    0.12K  456017       32   1824068K kmalloc-128
              1300528   1158942  89%    1.00K  325132        4   1300528K kmalloc-1024
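
          Those columns are internally consistent, as a quick sanity check on the dentry row shows:

              1558806 slabs * 21 objs/slab  = 32734926 dentries (the OBJS column)
              1558806 slabs * 4 KiB / slab  =  6235224 KiB, about 5.9 GiB (CACHE SIZE)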
  • say_it_as_it_is 1080 days ago
    How does a Linux kernel team prioritize work when it's not investigating issues?
    • reasonabl_human 1079 days ago
      Using a JIRA board, I suppose. Investigating issues would only be one facet of the job; other examples could be building optimizations for the low-level stack, evaluating upcoming releases, etc.