A common mistake when using NumPy’s RNG with PyTorch

(tanelp.github.io)

261 points | by sunils34 1110 days ago

22 comments

  • _coveredInBees 1110 days ago
    Yeah, I'd run into this 2 years ago and ended up also reporting an issue on the CenterNet repo [1].

    The solution I have in that issue is adapted from the very helpful discussions in the original PyTorch issue [2]:

    `worker_init_fn=lambda id: np.random.seed(torch.initial_seed() // 2*32 + id)`

    I will admit that this is *very* easy to mess up, as evidenced by the fact that examples in the official tutorials for PyTorch and other well-known code-bases suffer from it. In the PyTorch training framework I've helped develop at work, we've implemented a custom `worker_init_fn` as outlined in [1] that is the default for all "trainer" instances responsible for instantiating DataLoaders in 99% of our training runs.
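
    Spelled out, that pattern looks roughly like this (a sketch in the same spirit, not the exact code from [1]; the seeding arithmetic and the toy dataset are illustrative):

        import numpy as np
        import torch
        from torch.utils.data import DataLoader, TensorDataset

        def seed_numpy_per_worker(worker_id: int) -> None:
            # torch.initial_seed() already differs per worker process; fold it
            # into NumPy's 32-bit seed range and offset by the worker id.
            np.random.seed((torch.initial_seed() + worker_id) % 2**32)

        dataset = TensorDataset(torch.arange(8))
        loader = DataLoader(dataset, num_workers=2,
                            worker_init_fn=seed_numpy_per_worker)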

    Also, as an aside, Holy Clickbaity title Batman! Maybe I should have blogged about this 2 years ago. Heck, every 6 months or so, I think that, and then I realize that I'd rather spend time with my kids and on my hobbies when I'm not working on interesting ML stuff and/or coding. An added side benefit is not having to worry about making idiotic clickbaity titles like this to farm karma, or provide high-quality unpaid labor for Medium in order for my efforts to be actually seen by people. But it could also just be that I'm lazy :-)

    [1] https://github.com/xingyizhou/CenterNet/issues/233

    [2] https://github.com/pytorch/pytorch/issues/5059

    • nurpax 1110 days ago
      This field contains the worker-specific seed:

          torch.utils.data.get_worker_info().seed
      
      So I guess something like the below (untested!) could work too:

          worker_init_fn=lambda id: np.random.seed(torch.utils.data.get_worker_info().seed % 2**32)
    • _coveredInBees 1110 days ago
      Should be 2**32 above. HN formatting swallowed an asterisk.
  • shoyer 1110 days ago
    This post is yet another example of why you should never use APIs for random number generation that rely upon and mutate hidden global state, like the functions in numpy.random. Instead, use APIs that explicitly deal with RNG state, e.g., by calling methods on an explicitly created numpy.random.Generator object. JAX takes this one step further: there are no mutable RNG objects at all, and the user has to explicitly manipulate RNG state with pure functions.

    It’s a little annoying to have to set and pass RNG state explicitly, but on the plus side you never hit these sorts of issues. Your code will also be completely reproducible, without any chance of spooky “action at a distance.” Once you’ve been burned by this a few times, you’ll never go back.

    You might think that explicitly seeding the global RNG would solve reproducibility issues, but it really doesn’t. If you call into any code you didn’t write, it might also be using the same global RNG.
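
    For instance (a minimal sketch; the seed values are arbitrary):

        import numpy as np

        rng = np.random.default_rng(12345)   # explicit, local RNG state
        noise = rng.normal(size=3)           # no hidden global state involved

        # Independent child streams (e.g. one per worker), derived explicitly:
        children = np.random.SeedSequence(12345).spawn(4)
        worker_rngs = [np.random.default_rng(s) for s in children]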

    • warsheep 1110 days ago
      The solution you suggest is irrelevant to the issue mentioned in the article. Even if you use np.random.RandomState, or any other "explicit RNG state", that state will still be copied in the fork() call.

      The post just stresses that one should be careful when using random states and multiprocessing, so you should either reseed after forking or use a multiprocess/multithread-aware RNG API.

      • nemetroid 1110 days ago
        I believe the point is that the error will be more obvious if the state is passed around explicitly.
        • acdha 1110 days ago
          Possibly but this is the kind of boilerplate which people tend to ignore, especially when a program is non-trivial. It’s really easy to notice if you’re doing something like `seed_rng(); fork();` but once there’s distance and more than one thing being passed around I’d be surprised if you didn’t find the same pattern, perhaps a bit less common.

          Fundamentally, there are two problems: fork() is a performance trick that tries to do setup only once, and seeding an RNG is a type of setup where it isn't intuitively obvious that it can't be optimized that way; and since most people learn from a tutorial or quick start, this is exactly the kind of important but non-core issue people omit or ignore in that context.

          • OskarS 1110 days ago
            Additionally, I think people make a hidden assumption that they don't even realize they're making: that when you ask for random numbers from numpy, they're more or less "true" random numbers, not seeded ones. Like, I think the intention of the programmers is just "give me a bunch of random numbers, I don't really care how as long as they're random", and they assume that that is what that numpy function does. But it doesn't: it provides you a pseudo-random sequence – not true randomness – so of course the sequence is identical after the fork.

            Like, they think they're reading from /dev/random, but they're not: they're just running rand() (metaphorically speaking).

            • acdha 1110 days ago
              Definitely - back when I supported a computational neuroscience group that came up multiple times (not numpy but similar contexts), along with the various quirks around floating point math. Even experienced people do things like that because they’re focused on the actual problem and this is a leaky implementation detail.
  • unityByFreedom 1110 days ago
    Great catch!

    > I downloaded and analysed over a hundred thousand repositories from GitHub that import PyTorch. I kept projects that use NumPy’s random number generator with multi-process data loading. Out of these, over 95% of the repositories are plagued by this problem. It’s inside PyTorch’s official tutorial, OpenAI’s code, NVIDIA’s projects, etc. [1]

    [1] https://github.com/pytorch/pytorch/issues/5059

  • timzaman 1110 days ago
    IIRC the bug Karpathy mentioned in his tweet was actually due to the seed being the same across multi-GPU data-parallel workers! You need to account for this too. So the author hasn't solved it.

    I know this bc I fixed the bug. And probably caused it. Hehe.

    Also, you don't just want to set your NumPy seed but also the native Python one and the Torch one.
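
    Something like this covers all three (a sketch, untested; how you derive the seed is up to you):

        import random
        import numpy as np
        import torch

        def worker_init_fn(worker_id: int) -> None:
            # torch.initial_seed() already differs per worker, so worker_id
            # isn't needed here; derive the other libraries' seeds from it.
            seed = torch.initial_seed() % 2**32
            random.seed(seed)
            np.random.seed(seed)
            torch.manual_seed(seed)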

    • Salgat 1110 days ago
      Yeah this isn't really a bug or even an issue with the library. If you instantiate an RNG with a seed and then fork your process, well duh of course the RNG will be repeated across the forks.
  • jeeeb 1110 days ago
    I always randomly log a sample of my inputs to TensorBoard to manually review what my training data actually looks like and (hopefully) pick up on bugs like these. Similarly I find logging high-loss inputs very informative.

    Coincidentally I find this article timely, as I was recently reviewing the PyTorch DataLoader docs regarding random number generator seeding. It’s the kind of thing unit tests don’t pick up, since it only occurs when you use separate worker processes.

  • Too 1110 days ago
    .NET has a similar pitfall, not due to forking but rather because the Random() default seed is based on the system clock. So starting several threads that construct new Random objects in the hope that they are unique might in fact give you the same RNG sequences.
  • jandrese 1110 days ago
    Forgetting to seed your RNG is a really classic bug. IMHO RNGs should auto seed unless explicitly set not to, but since the opposite behaviour was baked into C so many years ago it's kind of the default. The worst part is how easy a bug this is to miss unless you're explicitly printing out the first set of random numbers for some strange reason.
    • _delirium 1110 days ago
      NumPy does auto-seed the RNG if you don't pass a seed yourself, using platform-specific code to pull some entropy from the OS. So that common case is handled reasonably well, unlike with C. In fact if you want exactly reproducible results (e.g. in testcases), you have to seed with a known seed, to avoid that default behavior.

      The issue here is a little more subtle: if you fork 10 copies of your Python process, all 10 inherit the current RNG state, and will thereafter produce identical random number sequences. If you were manually forking, you might guess that was a potential problem, and re-seed the RNGs after forking. But PyTorch's data loaders fork a bunch of processes to do things in parallel, so users might not realize that they're using duplicate copies of their RNG state.
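
      A tiny demonstration of the duplicated state (a sketch; it needs a platform where the "fork" start method exists, so not Windows):

          import multiprocessing as mp
          import numpy as np

          def draw():
              # Uses the global RNG state copied from the parent at fork time.
              print(np.random.rand())

          if __name__ == "__main__":
              ctx = mp.get_context("fork")
              procs = [ctx.Process(target=draw) for _ in range(2)]
              for p in procs:
                  p.start()
              for p in procs:
                  p.join()
              # Both children print the same "random" number.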

      • jeeeb 1110 days ago
        It’s even slightly more subtle than that.

        Python multiprocessing doesn’t use fork on Windows. It starts a new process and so shouldn’t be affected by this.

        So to trigger this you need num_workers > 0 on your DataLoader and to be running on a non-Windows platform.

        • shmel 1110 days ago
          I get the desire to be pedantic, but does anyone at all train DL models on Windows? (barring toy projects for fun and perhaps debugging) The same can be said about num_workers > 0. You _have to_ fork worker processes unless you train something super tiny like MNIST and you load the whole dataset onto the GPU.
          • hermitdev 1110 days ago
            > does anyone at all train DL models on Windows?

            Yes. My last job was at a financial shop that was all Windows. They were doing ML with Python on Windows. Azure has boxes available for this.

        • ynik 1110 days ago
          Starting with Python 3.8, multiprocessing also spawns new processes by default on macOS (due to some system libraries not being fork-safe).

          IMHO cross-platform Python projects should call `multiprocessing.set_start_method('spawn')` to get the same behavior everywhere.
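
          For example (a sketch; "spawn" needs the usual __main__ guard because child processes re-import the module):

              import multiprocessing as mp

              def main():
                  ...  # build DataLoaders / worker pools here

              if __name__ == "__main__":
                  # Opt in to "spawn" so Linux behaves like Windows and newer macOS.
                  mp.set_start_method("spawn")
                  main()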

    • phonebucket 1110 days ago
      I’m of the opposite opinion and would get away from all auto RNG seeding:

      1) this will help reproducibility a great deal, which is a pain so often.

      2) forcing users to actually understand the seeding of RNGs from the point that they are novice programmers could help allay bugs of the sort seen in this post, which I believe stems from having too much faith that RNGs will simply work out of the box as substitutions for ‘real’ random variables.

      • rurban 1109 days ago
        I have again a different opinion. Allow both: srand() - explicit seed initialization, as well as autoseed.

        But you really need to change selected known bad seeds, which destroy the PRNG's statistical properties. Most PRNGs have a couple of known bad seeds, but nobody does anything about it. Same for hash functions.

      • 0-_-0 1110 days ago
        Indeed. Almost every time (like now) when you think you need a random number, you actually need a low-discrepancy sequence.
  • jamesfisher 1110 days ago
    Note official TensorFlow tutorials make the exact same mistake. I've reported it but it hasn't been fixed. [1]

    [1]: https://github.com/tensorflow/tensorflow/issues/47755

  • qd6pwu4 1110 days ago
    I notice that the web page of this article is beautifully justified on both sides instead of left-aligned, and there are hyphens at line breaks. Does anyone know how to achieve this in a web page? text-align: justify seems to produce inferior results compared to this page, e.g. rivers in the text.
    • rockostrich 1110 days ago
      `hyphens: auto` is probably it. It allows words to be split with a hyphen at the browser's discretion.
      • qd6pwu4 1110 days ago
        thanks, that seems to be the answer
  • tsimionescu 1110 days ago
    This seems like another reason to never use fork() without exec(). Fork is really a mine field when used this way (and a pretty big maintenance burden on the kernel, by my understanding, to provide the illusion of sharing read-only state with the parent process).
  • andrew_v4 1110 days ago
    Is there something specific about numpy here, or would it be any RNG?

    I'm looking at some code that uses random.random() to randomly apply augmentations, I suspect that will have the same issue right?

    • rurban 1109 days ago
      It is a well-known trap that forked or threaded RNGs keep using the same seed. It can be called a feature, e.g. when you need to process different ranges using the same sequence of random numbers for each range, but mostly it's a bug. You need to init your RNG seed for each thread or fork to get different random sequences.

      For splitting up sequential ranges, a good RNG typically has an advance function to advance the state for each range, so you can keep reproducibility.
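
      In NumPy's newer API, one way to give each range or worker its own reproducible stream is SeedSequence.spawn (a sketch; the PCG64 and Philox bit generators also expose advance()/jumped() for stepping along a single stream):

          import numpy as np

          # One independent, reproducible stream per worker or range:
          root = np.random.SeedSequence(2021)
          rngs = [np.random.default_rng(child) for child in root.spawn(4)]

          # Four different draws, reproducible across runs:
          print([int(rng.integers(0, 100)) for rng in rngs])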

  • formerly_proven 1110 days ago
    Python has os.register_at_fork nowadays, so why do we still have this kind of behavior? Not reseeding after fork has been a footgun for almost as long as fork has existed.
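
    A hook along these lines would reseed in every forked child (a sketch; it only covers the global NumPy and Python state, not RNG objects you manage yourself):

        import os
        import random
        import numpy as np

        def _reseed_after_fork():
            # Runs in the child right after fork(); pull fresh OS entropy.
            seed = int.from_bytes(os.urandom(4), "little")
            random.seed(seed)
            np.random.seed(seed)

        os.register_at_fork(after_in_child=_reseed_after_fork)
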
  • canjobear 1110 days ago
    I usually write my own data handling functions rather than trying to play PyTorch’s game here. I find their abstractions here confusing or not useful.
  • etiam 1110 days ago
    Would normally refrain from upvoting this on account of the title, but the actual topic was important enough that I think it can be worth an exception.
  • selimthegrim 1110 days ago
    A less clickbaity title might have been: Bugs are easy to make. Here’s how to make fewer bugs when modifying existing PyTorch and NumPy code.
  • Nimitz14 1110 days ago
    Oh wow, I've definitely made this mistake without realizing it...
  • ivoras 1110 days ago
    A lot of comments are criticising the frameworks or the developers, but surprisingly almost no one is criticising Python, which remains a language of the early '90s as far as parallelism is concerned.

    A bit like Stockholm syndrome - "Python doesn't do threading" is so ingrained in its users' minds (and I'm a user) that it's not even questioned as a potential source of problems.

    (No one said it's easy to do. That's why language developers and implementers are a special breed even today.)

  • anon_tor_12345 1110 days ago
    This is probably because I never read these kinds of blogposts, but this is one of the most flagrantly clickbait titles I've ever seen. The article doesn't even suggest ditching numpy in favor of jax or some kind of other hot take (which would at least warrant such a bombastic title); it literally just presents one instance in which you might be making a mistake when using numpy's rng (not even something unique to numpy). And the PyTorch team is aware of this and hence exposes `worker_init_fn`. So the title should actually be "Using fork without understanding fork? You might be making a mistake."
    • wokwokwok 1110 days ago
      I suppose...

      1) This is an issue from 2018 (https://github.com/pytorch/pytorch/issues/5059), which links to the closed numpy issue (https://github.com/numpy/numpy/issues/9248) which just says: seed your random numbers folk.

      2) The documentation in pytorch covers this (https://pytorch.org/docs/stable/data.html#randomness-in-mult...), but it's not really highlighted specifically in, eg. tutorials. (but it is in the FAQ https://pytorch.org/docs/stable/notes/faq.html#dataloader-wo...)

      3) It doesn't affect windows, which uses spawn instead of fork.

      4) To quote the author:

      > I downloaded and analysed over a hundred thousand repositories from GitHub that import PyTorch. I kept projects that use NumPy’s random number generator with multi-process data loading. Out of these, over 95% of the repositories are plagued by this problem.

      ^ No actual stats, just some vague hand waving; this just seems like nonsense.

      So, I suppose... there's some truth to it being a documentation issue, but I guess the title + (1-3) kind of say to me: OP thought they discovered something significant... turns out, they didn't.

      Oh well, spin it into some page views.

      • anon_tor_12345 1110 days ago
        >No actual stats, just some vague hand waving; this just seems like nonsense.

        i had exactly the same thought - if they'd actually crawled github they'd have some nice plots to back up the claim.

      • unityByFreedom 1110 days ago
        Better title? Over 95% of GitHub repos using NumPy and PyTorch aren't getting the random numbers they think they are.
        • sdfhbdf 1110 days ago
          Probably over the HN Title character limit.
          • unityByFreedom 1110 days ago
            95% of GitHub repos using NumPy/PyTorch don't get the randomness they intended.
    • gleenn 1110 days ago
      OP said they scanned and found this problem in thousands of projects, including some which are probably copied heavily as examples, like NVIDIA's. While the post might be a little strong, at least they back up their statement that many others are actually suffering from this problem.
    • coolreader18 1110 days ago
      Maybe the intent is for it to be read as "If you're using pytorch and numpy, it's _very_ likely you're making this mistake", but the effect is still that the headline is clickbait
      • nerdponx 1110 days ago
        It's so obviously clickbait that I wonder if it's meant to be tongue-in-cheek.
    • TruthWillHurt 1110 days ago
      "I downloaded over a hundred thousand repositories from GitHub that import PyTorch... Out of these, over 95% of the repositories are plagued by this problem."

      Title seems pretty accurate to me!

  • ummonk 1110 days ago
    "You're making a mistake" sounds like one shouldn't use PyTorch and NumPy together, when the actual message is "there might be a mistake in your code".
  • nxpnsv 1110 days ago
    So clickbaity. A proper title would be: be careful when using random numbers and multiprocessing...
    • rlanday 1110 days ago
      My initial reaction was, “it can’t possibly be as big as the mistake you’re making if you’re using TensorFlow!”
      • nxpnsv 1110 days ago
        Certainly if you use it without understanding how to use a random number generator... but that is not really a fault in numpy OR TensorFlow...
  • king_magic 1110 days ago
    Aside from the infuriating clickbait title (which I shall not dignify with an upvote), this is part of why I preprocess augmented images. I don't like too much magic in my custom derived (PyTorch) Dataset objects.
  • BlueTemplar 1110 days ago
    Taking the title at face value : yeah, you are making a mistake, using GAFAM tools like PyTorch, Github or OpenAI's GPTs.