Node.js setInterval callback function unexpected halt

(github.com)

65 points | by luu 129 days ago

18 comments

  • pkulak 128 days ago

    Ouch, not a good look when someone keeps mentioning that you should never expect a month of uptime.

    • ddtaylor 128 days ago

      I am impressed the reporter of the bug kept his cool and did all the debugging work for them using libfaketime while they blobiated about code not supposed to be running for 25 days and being unable to reproduce anything.

      • jessaustin 128 days ago

        Eh, I've seen worse. Overall it seems everyone was pretty cooperative. We're just lucky this wasn't an npm bug...

      • gnulinux 128 days ago

        2^31 seconds is a little more than 68 years. But the thread discusses 2^31 milliseconds, which is a little less than a month.

        • manigandham 128 days ago

          It's milliseconds, the title is incorrect as stated in the comments here and the linked issue thread.

          • gnulinux 128 days ago

            Yes I fixed my comment.

      • chatmasta 128 days ago

        It’s amusing to see the bottom of issue threads like this grow with messages “issue was referenced from another project.” It would be cool to generate a graph of which projects have issues linking to each other. I suppose it would be sort of like a dependency graph, but one that only includes the most fragile dependencies. :)

        • thecatspaw 128 days ago

          > Note that what I meant by that isn't that your services aren't expected to be up for a month. It's that redundancy is expected (2+ instances behind a load balancer) so servers are resilient to failure.

          I guess thats why he mentioned gameservers specifically, as those often depend on being a single node

          • int0x80 128 days ago

            Even if not a gameserver, virtually anything should not be buggy after one month of uptime. That is just ridiculous. Redundancy is completely orthogonal to this issue.

            • baybal2 128 days ago

              I hate and like people behind JS development as a language at the same time.

              Node.js devs more than anybody else realize what's wrong with the general direction of mainstream webdev languages, while at the same time cultivating that culture around themselves like that.

              A good decision they made was not using high level IO abstractions that both eats performance, and obscures synchronicity implications.

              And the bad one is their general chabuduo attitude. While nobody in a sane mind will use node.js for an airplane autopilot,or a space mission software, there is no reason for them not to strive for improving node.js reliability to the extend demanded in the "serious software" industry

          • sp332 128 days ago

            And if you boot all your redundant servers at the same time, they're all going to go down pretty close to the same time as well.

            • pvorb 128 days ago

              And you also need to know about this bug in order to make your server go down after 2^31 ms. As stated in the Github issue, the service remained otherwise functional.

              • NullPrefix 128 days ago

                So the lesson learned: Dont' check if the service is up, just restart it anyways.

              • int0x80 128 days ago

                Exactly, that is one of the problems.

            • kryptiskt 128 days ago

              Here is the commit that fixes it: https://github.com/nodejs/node/pull/22214

              • samspenc 128 days ago

                TL;DR from the GitHub discussion: had to do with an integer overflow related to oxffffff hex string lengths and unsigned ints, introduced in one of the Node 10.x versions. This is the pull request that fixes it: https://github.com/nodejs/node/pull/22214

                • trias 128 days ago

                  doesn't this fix mean it crashes after 2^32 milliseconds now?

                  • moefh 128 days ago

                    No, it only uses an int if the value is less than 0xffffffff, otherwise it uses a double.

                    That means it's ok up to 2^53 milliseconds, at which point it starts losing precision (you can't represent every integer above 2^53 as a double, it gets rounded).

                    It doesn't really matter because no computer that exists today will still be running in 2^53 milliseconds.

                  • foreigner 128 days ago

                    AFAICT hasn't the fix just kicked the can down the road to 2^64 milliseconds? Future NodeJS coders 585 million years from now are going to be cursing at us...

                    • 128 days ago
                      [deleted]
                      • pvtmert 128 days ago

                        exactly same thing i thought when i see some timer bug (not recurring events)

                        i actually came to this bug couple times (nw.js user here) i re-initialize timers once in a week. :/

                        • dmitrygr 128 days ago
                          • _pmf_ 128 days ago

                            Seems similar to this beauty: https://github.com/nodejs/node/issues/12115

                            • canhascodez 128 days ago

                              You seem to be making some sort of negative judgment about that particular issue. Perhaps you'd care to give us the benefit of your reasoning.

                              • _pmf_ 127 days ago

                                > You seem to be making some sort of negative judgment about that particular issue. Perhaps you'd care to give us the benefit of your reasoning.

                                A freshman learns not to store data in floating point data types if the significant digits of the data's domain exceeds the floating points guaranteed 1:1 domain.

                                Some maintainers of node.js's file system module did not (or did not care about it)[].

                                Given that JS developers are hailed as the pinnacle of software engineering, this strikes me as odd.

                                [] "But JS does not have a proper 64 bit datatype / BigInts" ... that's not an argument

                            • 128 days ago
                              [deleted]
                              • congresstart 128 days ago

                                Incorrect title, should be 2^31 milliseconds

                                • femto113 128 days ago

                                  Which is about 25 days, or roughly a month, either of which would make a better title. I read the article hoping to discover how they ran a test that was 60 years long...

                                  • rtkwe 128 days ago

                                    I don't think so, seeing 2^31 (units) immediately tells me what's probably going wrong in a way that 25 days or ~1 month doesn't unless I go and convert that into milliseconds.

                                    • stevenwoo 128 days ago

                                      Just spitballing, could use a tool like the debugger to inject a value into a system library function to set it to 59.99 years or whatever value necessary to do tests.

                                      • Arnavion 128 days ago

                                        Well it would've been the same as what they did for the 25 days case I imagine - using libfaketime.

                                        • johannes1234321 128 days ago

                                          > I read the article hoping to discover how they ran a test that was 60 years long...

                                          If you read the link you could figure that out: ;)

                                          "I used libfaketime to speed things up by a factor of 1000 on a linux server "

                                        • mijoharas 128 days ago

                                          Agreed, I just read the title and thought: Who cares? there is no way this could be a problem yet.

                                          • peterkelly 128 days ago

                                            Even Windows 95 lasted longer than that: https://www.cnet.com/news/windows-may-crash-after-49-7-days/

                                          • snacktaster 128 days ago

                                            nodejs ecosystem is a dumpster full of just bad code

                                            edit: lots and lots of garbage in there

                                          • truth_seeker 128 days ago

                                            Its a feature rather than a bug ;)

                                            Can't think why someone would do that. Such a large number of milliseconds just not practical.

                                            • Vendan 128 days ago

                                              stuff like crons, logging stats, all kinds of things. Note that this apparently even affected some setTimeout calls after that long, so you couldn't schedule something to happen 500 ms from now, for example...

                                              • olliej 128 days ago

                                                Long runtime apps, calendar-esque scheduling, etc al@ seem like things that could reasonably* hit that

                                                * specifically they could reasonably end up with time steps >= 2^31ms. Even if the design isn’t ideal it should still work

                                                • avip 128 days ago

                                                  That does not sound like a sound implementation.

                                                  • zamalek 128 days ago

                                                    The bug occurs if your process reaches ~25 days, period. Regardless of the interval that you set - it could be 1ms and still fail. It has been 'fixed' and is now ~49 days, but setTimeout does not throw and so you have to rely on your process crashing due to some other condition.

                                                    If you have setTimeout anywhere in your code (including packages), you will need to force it to crash/exit once a month.

                                                    The implementation of setTimeout in Node is not sound.

                                                    • olliej 128 days ago

                                                      It’s perfectly reasonable:

                                                      Run task every 4 weeks:

                                                      Onesecond = 1000;

                                                      Oneminute = 60 * oneSecond;

                                                      oneHour = 60 * oneMinute;

                                                      oneDay = 24 * oneHour;

                                                      oneWeek = 7 * oneDay;

                                                      Then from that reasonable set of definitions you make a even fire in 4 * oneWeek.

                                                      Maybe you aren’t even using setTimeout, but you’re using a timer library that internally is just using aetTineout (what does node have besides setTimeout by the way?).

                                                      You end up with an accumulation of individually reasonable questions that don’t work, for no obvious reason.

                                                      That said someone else is saying this actually is a “timers stop working in general after X days”, rather than just long lifetime timers, which makes this even more obtuse.

                                                    • truth_seeker 128 days ago

                                                      IMHO Scheduling the task beyond 12-24 hours, especially when it requires the Runtime timers overhead won't be practical.

                                                    • truth_seeker 128 days ago

                                                      I thank everyone from bottom of my heart who downvoted the comment but know this you will NOT get a gift from Santa this christmas. ;)

                                                      • tedunangst 128 days ago

                                                        Maybe next time read the bug report. 500 is not an impractically large number of milliseconds.

                                                      • int0x80 128 days ago

                                                        Either it works or it does not. If it doesn't, you document the fact.

                                                        • toast0 128 days ago

                                                          If it doesn't work, the attempt to set a timer that won't ever fire should throw as well.

                                                          • truth_seeker 128 days ago

                                                            I was pointing such large number of milliseconds passed as argument.

                                                            • Vendan 128 days ago

                                                              The bug was discovered on a call with 500 passed as the number of milliseconds.

                                                        • justicezyx 128 days ago

                                                          So this is like a reminder to anyone who always reminder others that you do not need to solve problems like Google has.

                                                          • baybal2 128 days ago

                                                            Well, so does the settimeout. A lot of underlying code is limited to signed 32 bit integers as per prehistoric convention.

                                                            JS committee should consider turning timer values into js native signed double

                                                            • tomxor 128 days ago

                                                              > cumulative running time is not needed to calculate when next to call the interval

                                                              I don't see how that solves anything, you are just defining a new limit to be broken, which would occur when the timer resolution is next increased if not in practice.

                                                              It makes sense that either the actual interval or timeout periods are limited by the underlying types used to store them... but I see no intrinsic reason why cumulative time that setInterval runs need be limited by any underlying type (cumulative running time is not needed to calculate when next to call the interval).

                                                              • baybal2 128 days ago

                                                                >I don't see how that solves anything, you are just defining a new limit to be broken, which would occur when the timer resolution is next increased if not in practice.

                                                                I think that having some uniformity in a zoo of rolloverable values will make life easier to people who don't read the language spec meticulously to the letter.

                                                                And yes, by any extend of human imagination, no programming language can justify implemention of a plain periodic timer through (oldTime - nowTime) % interval.

                                                                A timer should be a timer, and modern opetating sysyems provide developers with a whole arsenal of high level timer abstractions just for those people to not to write their own hacks.

                                                            • jschulenklopper 128 days ago

                                                              Would someone seriously encounter a defect like that? Who would set a function to run after ~25 days (2^31 milliseconds) and expect that that process still runs after all those days? You're not safeguarding yourself against the very likely situation that processes could stop over such a long period.

                                                              Also, considering the length of that period, it's most likely that the delay for the function-to-be-called is from a business (and not some technical) requirement. So, a more fitting design would be to set up a delayed job for asynchronously executing long-running or later tasks in the background by a scheduler.

                                                              • tobr 128 days ago

                                                                Conceptually they are not scheduling a function to be run after a month, but scheduling it to run twice per second as long as the process is running. When the process is still running a month later, it is very unexpected that the scheduling would suddenly stop.

                                                                • jschulenklopper 128 days ago

                                                                  My bad, perhaps because of the original (or confusing) post title earlier. A scheduling process that stops after 25 days is indeed a more realistic bug than a function planned for execution in 25 days not being called.

                                                                • moefh 128 days ago

                                                                  The bug happened for setInterval() with any timeout (the bug report is for one with 500 milliseconds).

                                                                  The problem is that the function that returned the current time was overflowing a 32-bit integer, so interval calculations were messed up and setInteval() wouldn't fire.

                                                                  The fix[1] was simply to change an int to an unsigned int.

                                                                  [1] https://github.com/nodejs/node/pull/22214/commits/3bb7fdfbbe...

                                                                  • jschulenklopper 128 days ago

                                                                    So, kudos to the maintainer to solve the issue, as it is a valid bug to be fixed... but it's not likely (or a case of suboptimal design) that such a defect would ever be encountered for real.

                                                                    • Vendan 128 days ago

                                                                      I quite often write stuff with a function that gets called every few minutes or seconds... which is what the bug was actually discovered on. That's not suboptimal design.

                                                                      • jschulenklopper 128 days ago

                                                                        Agreed. A scheduling process that stops (does not call functions anymore) after 25 days is a strange phenomenon. I misread the post title.

                                                                    • tedunangst 128 days ago

                                                                      What? The scheduled delay isn't 25 days.

                                                                      • jschulenklopper 128 days ago

                                                                        No, I realized later. But also that case wouldn't have worked :-)