I am impressed that the bug reporter kept his cool and did all the debugging work for them using libfaketime, while they bloviated about code not being supposed to run for 25 days and about being unable to reproduce anything.
It’s amusing to see the bottom of issue threads like this grow with messages “issue was referenced from another project.” It would be cool to generate a graph of which projects have issues linking to each other. I suppose it would be sort of like a dependency graph, but one that only includes the most fragile dependencies. :)
> Note that what I meant by that isn't that your services aren't expected to be up for a month. It's that redundancy is expected (2+ instances behind a load balancer) so servers are resilient to failure.
I guess that's why he mentioned game servers specifically, as those often depend on being a single node.
I both hate and like the people behind the development of JS as a language.
Node.js devs realize better than anybody else what's wrong with the general direction of mainstream webdev languages, while at the same time cultivating that very culture around themselves.
A good decision they made was not using high-level IO abstractions that both eat performance and obscure synchronicity implications.
The bad one is their general chabuduo attitude. While nobody in their right mind will use node.js for an airplane autopilot or space mission software, there is no reason for them not to strive to improve node.js reliability to the extent demanded in the "serious software" industry.
TL;DR from the GitHub discussion: it had to do with an integer overflow related to 0xffffff hex string lengths and unsigned ints, introduced in one of the Node 10.x versions. This is the pull request that fixes it: https://github.com/nodejs/node/pull/22214
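For anyone wondering what a 32-bit overflow on a millisecond counter looks like, here is a minimal sketch. It only illustrates generic signed/unsigned 32-bit wraparound in JavaScript and is not taken from Node's timer code; the names `LIMIT`, `asInt32` and `asUint32` are made up for the example.

```javascript
// 2^31 milliseconds, the boundary discussed in this thread.
const LIMIT = 2 ** 31; // 2147483648

// Bitwise operators coerce numbers to 32-bit integers,
// which makes the wraparound easy to observe.
const asInt32 = (n) => n | 0;    // signed 32-bit view
const asUint32 = (n) => n >>> 0; // unsigned 32-bit view

const justBelow = asInt32(LIMIT - 1);      // 2147483647, still positive
const signedWrap = asInt32(LIMIT);         // -2147483648, wrapped negative
const unsignedOk = asUint32(LIMIT);        // 2147483648, fits when unsigned
const unsignedWrap = asUint32(2 ** 32);    // 0, unsigned wraps one bit later

console.log({ justBelow, signedWrap, unsignedOk, unsignedWrap });
```

A signed counter goes negative at 2^31, while an unsigned one keeps working until 2^32, which is one plausible reading of why the limit moves rather than disappears.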
Stuff like crons, logging stats, all kinds of things. Note that this apparently even affected some setTimeout calls after that long, so you couldn't schedule something to happen 500 ms from now, for example...
The bug occurs if your process reaches ~25 days, period. Regardless of the interval that you set - it could be 1ms and still fail. It has been 'fixed' and is now ~49 days, but setTimeout does not throw and so you have to rely on your process crashing due to some other condition.
If you have setTimeout anywhere in your code (including packages), you will need to force it to crash/exit once a month.
The implementation of setTimeout in Node is not sound.
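To put numbers on the ~25 and ~49 day figures and on the "force it to exit once a month" advice, here is a rough sketch. It assumes the two limits correspond to 2^31 and 2^32 milliseconds, and it assumes some external supervisor restarts the process after it exits. `shouldRestart` and the 20-day margin are hypothetical choices, not anything Node provides.

```javascript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Assumed limits: signed and unsigned 32-bit millisecond counters.
const signedLimitDays = 2 ** 31 / MS_PER_DAY;   // about 24.86 days
const unsignedLimitDays = 2 ** 32 / MS_PER_DAY; // about 49.71 days

// Hypothetical guard: true once uptime passes a chosen safety margin
// that sits comfortably below the limit.
function shouldRestart(uptimeSeconds, maxDays = 20) {
  return uptimeSeconds * 1000 >= maxDays * MS_PER_DAY;
}

// Usage sketch (commented out so the snippet has no side effects):
// call this from a code path that runs regularly, e.g. per request,
// and let the supervisor bring the process back up.
// if (shouldRestart(process.uptime())) process.exit(0);

console.log(signedLimitDays.toFixed(2), unsignedLimitDays.toFixed(2));
```

The guard deliberately avoids relying on a timer to trigger itself, since the timer is the thing that might have stopped firing.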
> cumulative running time is not needed to calculate when next to call the interval
I don't see how that solves anything. You are just defining a new limit to be broken, which would occur when the timer resolution is next increased, if not in practice.
It makes sense that either the actual interval or timeout periods are limited by the underlying types used to store them... but I see no intrinsic reason why cumulative time that setInterval runs need be limited by any underlying type (cumulative running time is not needed to calculate when next to call the interval).
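As a sketch of that point: a repeating timer can be built from chained one-shot timeouts that only ever store the relative delay, never the total time elapsed. This is an illustration of the idea and not how Node implements setInterval; `repeatEvery` and the injectable `schedule` parameter are hypothetical names, and chained timeouts like this will drift slightly over time.

```javascript
// Hypothetical helper: repeats `callback` using only a relative delay.
// `schedule` defaults to setTimeout but can be swapped for testing.
function repeatEvery(intervalMs, callback, schedule = setTimeout) {
  let stopped = false;

  function tick() {
    if (stopped) return;
    callback();
    // Only intervalMs is needed here; no cumulative running time is kept.
    schedule(tick, intervalMs);
  }

  schedule(tick, intervalMs);

  // Returns a function that cancels future ticks.
  return () => {
    stopped = true;
  };
}

// Usage sketch (commented out so the snippet does not keep running):
// const stop = repeatEvery(500, () => console.log("tick"));
// stop();
```

Whether the runtime's own clock underneath is sound is a separate question, which is exactly what the bug was about.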
Would someone seriously encounter a defect like that? Who would set a function to run after ~25 days (2^31 milliseconds) and expect that that process still runs after all those days? You're not safeguarding yourself against the very likely situation that processes could stop over such a long period.
Also, considering the length of that period, it's most likely that the delay for the function-to-be-called is from a business (and not some technical) requirement. So, a more fitting design would be to set up a delayed job for asynchronously executing long-running or later tasks in the background by a scheduler.
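A minimal sketch of what I mean by a delayed job, assuming jobs are stored with an absolute due time and picked up by whichever worker happens to be alive. `JobQueue`, `enqueue` and `takeDue` are hypothetical names, and the in-memory array stands in for a persistent store such as a database table.

```javascript
// Hypothetical in-memory stand-in for a persistent job store.
class JobQueue {
  constructor() {
    this.jobs = [];
  }

  // Record a job with the absolute time (ms since epoch) it becomes due.
  enqueue(name, runAtMs) {
    this.jobs.push({ name, runAtMs, done: false });
  }

  // Return the names of all jobs that are due and mark them as done,
  // so each job is handed out only once.
  takeDue(nowMs) {
    const due = this.jobs.filter((j) => !j.done && j.runAtMs <= nowMs);
    for (const j of due) j.done = true;
    return due.map((j) => j.name);
  }
}

// Usage sketch: a worker would call takeDue(Date.now()) periodically.
const jobQueue = new JobQueue();
jobQueue.enqueue("send-reminder", Date.now() + 30 * 24 * 60 * 60 * 1000);
```

Because the due time lives in the store and not in a pending timer, the job survives a process restart.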
Conceptually they are not scheduling a function to be run after a month, but scheduling it to run twice per second as long as the process is running. When the process is still running a month later, it is very unexpected that the scheduling would suddenly stop.
My bad, perhaps because of the original (or confusing) post title earlier. A scheduling process that stops after 25 days is indeed a more realistic bug than a function planned for execution in 25 days not being called.