
  • Galanwe 11 days ago
    There are some real world use cases for these.

    I do use some similar helpers, e.g. to create picklable weakrefs for passing large objects to forking multiprocessing pools as arguments, or to expose large objects in shared memory. It's not portable to non-CPython implementations, but as long as it's well documented and known, it's fine.

    • dabacaba 11 days ago
      Sharing Python objects between processes is not safe. Python assumes that access to objects is protected by the GIL, but this is not the case when using multiple processes (even if the processes only read from the objects, they still modify the refcounts). Sharing memory buffers can be done safely using mmap.
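
      For example, a minimal sketch of the mmap approach (an anonymous shared mapping inherited across fork(); assumes a Unix-like OS, names are illustrative):

          import mmap
          import os
          import struct

          # fileno=-1 gives an anonymous mapping; the default MAP_SHARED flag
          # makes writes visible across fork().
          buf = mmap.mmap(-1, 4096)

          pid = os.fork()
          if pid == 0:
              struct.pack_into("i", buf, 0, 42)   # child writes into the shared buffer
              os._exit(0)

          os.waitpid(pid, 0)
          print(struct.unpack_from("i", buf, 0)[0])   # parent sees 42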
      • Galanwe 11 days ago
        Sharing objects is completely fine with a forking multiprocessing pool. Each worker process ends up with a (lazy) copy of the parent's address space, including the whole Python interpreter state.

        When you think about it, you already rely on that to access imported modules, global variables, etc. in your worker processes. The GIL and reference counting are no obstacle either, as they are both copied from the parent process as well, so you can freely continue to read global state (and even modify it, though the modifications will be local to the worker process).

        That is, if you have large objects to share with your worker processes, and you don't need to get them back, you can simply assign them to global variables before starting your multiprocessing pool and have your workers read them, at zero transfer cost, and safely. This _trick_ has been used for decades in Python.
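
        A minimal sketch of that global-variable approach (hypothetical names; assumes the fork start method):

            from multiprocessing import Pool

            BIG = None   # module-level slot for the large objects

            def worker(i):
                return len(BIG[i])          # the child reads the inherited global; nothing is pickled

            def parent(frames):             # e.g. a list of large DataFrames
                global BIG
                BIG = frames                # assign before the pool forks
                with Pool(4) as pool:
                    return pool.map(worker, range(len(frames)))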

        Now the annoying thing is that this global-variable mumbo jumbo is kind of dirty and ad hoc to maintain. Using custom picklable weakrefs (using something similar to the trick with id() in this repo), you can create proxy weak objects to transfer to your workers, achieving the same effect while keeping a "function argument" interface, instead of proxying through global variables.

        • albertzeyer 11 days ago
          The problem is, once you access such shared objects in Python, it is never readonly access but actually read-write, because it modifies the refcount. The problem is also described here: https://ppwwyyxx.com/blog/2022/Demystify-RAM-Usage-in-Multip...

          But also, you're saying you would prefer such an unchecked memory access hack over using a global variable?

          But also, why does it need to be a global variable? After you fork(), all the local variables are available to the child process. No need for global variables.

          • Galanwe 11 days ago
            > The problem is, once you access such shared objects in Python, it is never readonly access but actually read-write, because it modifies the refcount.

            That is right, but it is a mere drop in the ocean. First, because reference counting is not intrusive in CPython (meaning the reference counting structures are outside the PyObject, last I checked), so you will mainly copy-on-write these small external structures anyway. Second, what I'm describing here is for when pickling objects across workers is prohibitively slow and memory consuming; typically that means sharing pandas dataframes of dozens or hundreds of gigabytes. A few copied refcount pages here and there are really not going to be the culprit.

            > But also, why does it need to be a global variable? When you fork(), afterwards all the local variables are available to the child process. No need for global variables.

            Right, but you need some way to access these variables, and once you're in a worker process you are simply in a different scope.

                from multiprocessing import Pool

                def workerfunc(x):
                    # I'm a poor worker in an empty scope
                    ...

                def parent():
                    juicy_variable = ...
                    with Pool(42) as pool:
                        result = pool.map(workerfunc, [1, 2, 3])
            • albertzeyer 11 days ago
              > reference counting is not intrusive in CPython (meaning the reference counting structures are outside the PyObject, last I checked)

              That's wrong. That was never the case.

              Recent CPython: https://github.com/python/cpython/blob/6d419db10c84cacbb3862...

              CPython 2.0: https://github.com/python/cpython/blob/2a9b0a93091b9ef7350a9...

              CPython 0.9.8: https://github.com/python/cpython/blob/dd104400dc551dd4098f3...

              Regarding multiprocessing.Pool, that would not work the way I said. I was thinking more about a plain fork, like this:

                  import os
                  import sys

                  def parent():
                      juicy_variable = ...

                      def workerfunc(x):
                          # I can access juicy_variable
                          ...

                      children = []
                      for i in [1, 2, 3]:
                          pid = os.fork()
                          if pid == 0:
                              workerfunc(i)
                              sys.exit()
                          children.append(pid)

                      # Wait for and clean up the children.
                      # Communicate somehow with the children to get back results.
                      ...
              • Galanwe 11 days ago
                >> reference counting is not intrusive in CPython (meaning the reference counting structures are outside the PyObject, last I checked)

                > That's wrong. That was never the case.

                You are right, I was mistaken.

                The point stands, though: PyObjects are not really an issue for the use cases where these tricks are needed.

                > I was thinking more about a plain fork, like this

                Right, well, you can recreate a multiprocessing pool of your own with different pros and cons, sure. That's another approach, I guess.

        • ptx 11 days ago
          Do you have any publicly available code demonstrating this pattern?
          • Galanwe 11 days ago
            I don't actually, but it can be explained in a few lines of code. Consider the following two simple functions:

                import ctypes

                def ref(obj):
                    # CPython detail: id() returns the object's memory address.
                    return id(obj)

                def deref(addr):
                    # Rebuild a Python reference from that address.
                    return ctypes.cast(addr, ctypes.py_object).value
            
            Basically, this relies on an implementation detail of `id()` in CPython: the unique id of an object is its memory address. `ref()` returns a reference to an object (think `&` in C), and `deref()` dereferences it back (think `*` in C). This is close to the standard `weakref` module in essence, but weakref is a black box.

            Now even though the call stack is cleared when the worker processes are forked, you still have the parent's objects available, properly tracked and refcounted, as you can check with `gc.get_objects()`. This is in fact a feature of `gc`, as explained in the docs (https://docs.python.org/3/library/gc.html):

            > If a process will fork() without exec(), avoiding unnecessary copy-on-write in child processes will maximize memory sharing and reduce overall memory usage. This requires both avoiding creation of freed “holes” in memory pages in the parent process and ensuring that GC collections in child processes won’t touch the gc_refs counter of long-lived objects originating in the parent process. To accomplish both, call gc.disable() early in the parent process, gc.freeze() right before fork(), and gc.enable() early in child processes.
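
            Applied to a pool, that recipe looks roughly like this (a sketch; `worker` and the bytearray are stand-ins):

                import gc
                from multiprocessing import Pool

                def worker(i):
                    ...

                if __name__ == "__main__":
                    gc.disable()                    # avoid freed "holes" in the parent's memory pages
                    big_state = [bytearray(10**8)]  # stand-in for the real large objects
                    gc.freeze()                     # move everything tracked so far to the permanent generation
                    with Pool(4, initializer=gc.enable) as pool:  # re-enable gc early in each child
                        pool.map(worker, range(8))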

            Now whenever you want to send large objects to a `multiprocessing.Pool` or `concurrent.futures.ProcessPoolExecutor`, you can avoid expensive pickling by just sending these references.

                from multiprocessing import Pool

                class BigObject:
                    def compute_something(self):
                        ...   # stands in for some expensive method

                def child(rbo):
                    bo = deref(rbo)               # resolve the parent's object from its address
                    return bo.compute_something()

                def parent():
                    bo1 = BigObject()
                    bo2 = BigObject()
                    with Pool(2) as pool:
                        result = pool.map(child, [ref(bo1), ref(bo2)])
            
            In a real codebase, though, there are some caveats. You cannot take a reference to just anything: there are temporaries, cached small integers, etc. You will need some form of higher-level wrapper around `ref()` to properly choose when and what to reference or copy.

            Also, it may be inconvenient to have your child functions explicitly dereference their parameters; it forces you to write _dereference wrappers_ around your original functions. A good strategy I've used is to create a proxy class that stores a reference and overrides `__getstate__`/`__setstate__`, so it pickles itself as a reference and unpickles itself as a proxy. That way, you can transparently pass these proxies to your original functions without any modification.
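
            A rough sketch of such a proxy (hypothetical class name; same CPython `id()` assumption as above):

                import ctypes

                class Proxy:
                    """Pickles as a raw address, unpickles as a live proxy to the parent's object."""

                    def __init__(self, obj):
                        self._obj = obj

                    def __getstate__(self):
                        return id(self._obj)      # CPython detail: id() is the address

                    def __setstate__(self, addr):
                        self._obj = ctypes.cast(addr, ctypes.py_object).value

                    def __getattr__(self, name):
                        if name == "_obj":        # avoid recursion before __setstate__ runs
                            raise AttributeError(name)
                        return getattr(self._obj, name)   # delegate attribute access to the real object

                # The child function can then call methods directly, e.g.:
                #     pool.map(child, [Proxy(bo1), Proxy(bo2)])
                # where child() simply does rbo.compute_something(), no deref() needed.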

            • ptx 11 days ago
              Oh, I see. You want to avoid serializing the objects since they will be copied anyway with fork(), but the parent needs a way to refer to a particular object when talking to the child, so it needs to pass some kind of ID.

              You could also do it without pointers and ctypes by using e.g. an array index as the ID:

                  inherited_objects = []
              
                  def ref(obj):
                      object_id = len(inherited_objects)
                      inherited_objects.append(obj)
                      return object_id
              
                  def deref(object_id):
                      return inherited_objects[object_id]
              
              Although this part needs a small change as well, so that the object ID is assigned before forking:

                  def parent():
                      bo1 = BigObject()
                      bo2 = BigObject()
                      refs = list(map(ref, [bo1, bo2]))
                      with mp.Pool(2) as pool:
                          result = pool.map(child, refs)
              • Galanwe 11 days ago
                > You want to avoid serializing the objects since they will be copied anyway with fork(), but the parent needs a way to refer to a particular object when talking to the child, so it needs to pass some kind of ID.

                Yes, that is exactly and succinctly the crux of the idea :-)

                As you found out, you can rely on indices or keys in a global object to achieve the same result. The annoying part, though, is that you need to pre-provision these objects before the pool, and clean them up afterwards to avoid keeping references to them. That means some explicit boilerplate every time you use a pool.
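
                One way to contain that boilerplate is a small context manager around the registry (a sketch, building on the `inherited_objects` idea above):

                    from contextlib import contextmanager

                    inherited_objects = []

                    @contextmanager
                    def inherit(*objs):
                        # Register objects before forking, drop the references afterwards.
                        start = len(inherited_objects)
                        inherited_objects.extend(objs)
                        try:
                            yield range(start, len(inherited_objects))   # the ids to pass to workers
                        finally:
                            del inherited_objects[start:]

                    # usage:
                    #     with inherit(bo1, bo2) as refs, Pool(2) as pool:
                    #         result = pool.map(child, refs)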

                The nice thing with the id() trick is that it's very unintrusive for the caller: the reference count stays the same in the parent process and is only increased in the child, unbeknownst to the parent.

      • hot_gril 11 days ago
        What do you mean by sharing objects between processes in Python? By default it sends a pickled copy to each subprocess, but there are special things like Manager I haven't used.
        • Galanwe 11 days ago
          The `multiprocessing` and `concurrent.futures` pools support a fork strategy on Linux (which is the default).

          What that means is that your worker processes run in an interpreter whose state is the same as the parent. Effectively you will then inherit imported modules, global variables, etc.

          The mechanisms for sending objects to the workers do rely on pickling them through a queue, yes. But if you know you will use a forking pool, you can just put your variables in globals before the fork and access them in your workers.

          Essentially, you can change this:

              from multiprocessing import Pool

              def worker(obj):
                  print(obj)

              def parent():
                  objs = [obj1, obj2]   # obj1/obj2 stand in for some large objects
                  with Pool(2) as pool:
                      pool.map(worker, objs)  # each obj is pickled and sent to the worker through a queue
          
          To this:

              OBJS = []

              def worker(idx):
                  print(OBJS[idx])

              def parent():
                  OBJS.append(obj1)
                  OBJS.append(obj2)
                  with Pool(2) as pool:
                      pool.map(worker, range(len(OBJS)))
                  OBJS.clear()   # drop the references once the pool is done
          
          > there are special things like Manager I haven't used.

          Managers are an abstraction where a server process holds the objects and hands out proxies to them that can be shared safely between processes. In my experience they're cumbersome to use and very limited in functionality.
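
          For reference, a minimal sketch of the Manager approach (illustrative names):

              from multiprocessing import Manager, Pool

              def worker(args):
                  shared, key = args
                  return shared[key]    # each access goes through a proxy to the manager process

              if __name__ == "__main__":
                  with Manager() as manager:
                      shared = manager.dict({"a": 1, "b": 2})   # lives in the manager's server process
                      with Pool(2) as pool:
                          print(pool.map(worker, [(shared, "a"), (shared, "b")]))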

          • hot_gril 11 days ago
            I tried this on a Linux machine in Py3 earlier, which I think is equivalent to your example:

              from multiprocessing import Pool
            
              shared = {}
            
              def f(x):
                global shared  # idk if this matters but tried without it too
                shared[x] = x
                print(shared)
            
              if __name__ == '__main__':
                with Pool(5) as p:
                  p.map(f, [1, 2, 3])
                print(shared)
            
            It printed {1: 1}, {2: 2}, {3: 3} then {} at the end. So the parent and other children didn't see the change made by each child. Doesn't seem like they're RW sharing.
            • Galanwe 11 days ago
              Yes, as mentioned, the workers get a copy of the interpreter state. Every modification a worker makes to these objects stays local to that worker. These tricks are for sending data to a worker, not for receiving data back.
  • owlstuffing 11 days ago
    Sergeant Frank Tree: You shouldn't touch the ordnance at all. But more specifically, you should never pull this hand-operating lever to the rear.

    Ward Douglas: Never.
