WIP: Handle dangling objects and rv_policy changes robustly #889

oremanj · 2025-01-28T17:56:54Z

This is a potential answer to #888, also inspired by the discussion in #589. Seeking feedback on the approach before I work on docs and tests.

When returning unique or shared ownership to Python when some Python instances already hold a non-owning reference to the same C++ object, we have a few choices:

Create a new owning instance in addition to the existing non-owning one. (This PR's approach.)
Upgrade the existing non-owning instance to owning. (More appealing in some ways, but hard to implement with shared ownership since nanobind doesn't have holders.)
Just return the non-owning instance and forget about the ownership. (nanobind's current approach, which sometimes causes problems.)

This PR also handles the fact that a non-owning instance can dangle. That is, the referenced C++ object could be destroyed while the Python instance is still alive -- especially on PyPy, where it's hard to control when a Python instance dies. Dangling instances are always non-owning, so they are mostly handled by the same logic that handles rv_policy changes. The remaining piece of support for dangling instances is to acknowledge that, if the C++ referent is freed, a new instance could be allocated with its same address, even one with internal storage. (I have observed this in production.) So, there is some new logic in inst_new_int to remove the previous must-be-dangling instances from inst_c2p, rather than crashing because they exist. This also implies a new state that a nanobind instance can be in: if inst->offset == 0, the instance refers to no C++ object and is not stored in the inst_c2p map.

This is a potential answer to wjakob#888, also inspired by the discussion in wjakob#589. Seeking feedback on the approach before I work on docs and tests. When returning unique or shared ownership to Python when some Python instances already hold a non-owning reference to the same C++ object, we have a few choices: * Create a new owning instance in addition to the existing non-owning one. (This PR's approach.) * Upgrade the existing non-owning instance to owning. (More appealing in some ways, but hard to implement with shared ownership since nanobind doesn't have holders.) * Just return the non-owning instance and forget about the ownership. (nanobind's current approach, which sometimes causes problems.) This PR also handles the fact that a non-owning instance can dangle. That is, the referenced C++ object could be destroyed while the Python instance is still alive -- especially on PyPy, where it's hard to control when a Python instance dies. Dangling instances are always non-owning, so they are mostly handled by the same logic that handles rv_policy changes. The remaining piece of support for dangling instances is to acknowledge that, if the C++ referent is freed, a new instance could be allocated with its same address, even one with internal storage. (I have observed this in production.) So, there is some new logic in `inst_new_int` to remove the previous must-be-dangling instances from `inst_c2p`, rather than crashing because they exist. This also implies a new state that a nanobind instance can be in: if inst->offset == 0, the instance refers to no C++ object and is not stored in the inst_c2p map.

wjakob · 2025-01-29T03:31:17Z

Hi @oremanj,

after thinking a bit about this, I think that my position on this is as follows: if you create a Python object with a rv_policy::reference RVP, you must guarantee that the C++ pointer cannot die before the Python instance. Normally that means that the object in question is completely static, or that there is some other mechanism (e.g. nb::keep_alive) that ensures the liveness of the memory region holding that pointer. If the C++ object dies prematurely, is replaced by another object at the same address, and that object is subsequently cast to Python, then you are using nanobind outside of its specification. Possibly, the specification should be more clear.

The PR you created is.. complicated. As far as I can tell, the main point is to enhance the capabilities of non-owning Python objects to do things that go beyond the spec, and to handle nontrivial consequences that this has on shared ownership. I am less clear about those parts, so it is possible that I misunderstood. There is one part of this PR that seems quite harmless, which is to append objects to a linked list in a different order. I had the impression that that might considerably reduce the fallout from "out of spec" usage (but not all of it?). In any case, I am much more open to a change of that magnitude.

That all said, in #589, you mentioned the following:

I just realized that there's a 100% valid way to produce this error: if you have a Python object created from a unique_ptr<T> returned by C++, and you pass its ownership back to C++ in order to initialize a new unique_ptr<T> argument. The C++ object address is still registered in inst_c2p, but C++ can validly destroy and deallocate it, which creates the same opportunity for a spurious collision assertion in inst_new_int later on.

I agree that this is a problem, and that should be fixed. I think it will be necessary to redesign the "relinquish" mechanism so that this failure mode does not exist. But this is more specific to unique pointers and not use of non-owning references in general.

Altogether, I had the impression that the contents of the PR aren't particularly related to #888, which was about confusion between return value policy and whether or not to search the instance map. (But please let me know if I missed something)

oremanj · 2025-01-29T06:24:14Z

Thanks for your feedback. I'm going to try to make the case for why I think this is a problem that is worth spending some complexity budget on. Of course, as always the final outcome is up to you.

I think that my position on this is as follows: if you create a Python object with a rv_policy::reference RVP, you must guarantee that the C++ pointer cannot die before the Python instance.

That is certainly a position that would simplify the constraints on nanobind. It doesn't seem very realistic to me; there are many C++ objects whose lifetimes cannot be tied to the lifetime of a Python object, which it is still useful to be able to inspect from Python. Setting rules for use of those objects in terms of visible operations in the Python program is feasible, though of course making it impossible to make lifetime errors is better where possible. Setting rules about when the Python object is destroyed -- something that Python goes out of its way to hide from the user, and that can be impacted in ways that many experts would have a hard time noticing, by commonplace actions like throwing an exception, attaching a debugger, or using PyPy -- seems like a standard that's almost impossible to actually meet. It's particularly pernicious because the existence of a dangling object creates no problems most of the time, but will sporadically fail in unclear ways when an address happens to get reused. I tend to think it's worth a fair bit of implementation complexity to avoid that sort of heisenbug.

To demonstrate the difficulty of adhering to the position you've laid out: Any C++ pointer passed to a Python callable will use rv_policy::reference by default, due to the use of rv_policy::automatic_reference in such contexts. If the callee throws an exception, then the Python instance will be kept alive at least as long as that exception object is, via exc.__traceback__.tb_frame.f_locals. It is quite likely that that exception could unwind past the scope in which the passed C++ pointer is valid. Rigorously adhering to the rule you've stated would thus require some serious hoop-jumping from anyone passing a mutable C++ object to a callback: they would need to make a Python-owned copy of it first, and then copy the changes from the Python-owned copy back to the C++ original if the callback completes successfully. Even that level of care wouldn't suffice on PyPy, since it can keep unreferenced instances alive for a pretty much arbitrary length of time (unless you call gc.collect(), but that's too slow to do often).

The PR you created is.. complicated.

I don't dispute this. I think it reflects the fact that the underlying problem is complicated. Lifetime issues, and specifically the mismatch in lifetime idioms between the different languages involved, are probably the hardest part of writing cross-language bindings. If the binding library handles reasonably predictable ways for lifetime issues to create surprises, then that has a force multiplier robustness impact on all authors of extension modules using the library, and all people who write Python scripts using those extension modules. On the other hand, if the binding library chooses to not acknowledge some of that underlying complexity, then the complexity will tend to show up in surprising and occasionally brittle ways for users who are much less experienced in understanding the nuances of these issues than we are.

I'm sure there are likely ways to achieve the same goals as this PR (or at least many of them) with less implementation complexity. I am extremely open to making changes along those lines; I made it as simple as I could without leaving any cases unhandled (as far as I could tell), but I don't know which cases might deserve to remain unhandled. I am more skeptical of the idea that the problems this PR solves categorically don't matter, because I have personally spent quite a number of hours debugging crashes that this PR would've prevented.

As far as I can tell, the main point is to enhance the capabilities of non-owning Python objects to do things that go beyond the spec, and to handle nontrivial consequences that this has on shared ownership.

I would say that the main point is to fix the fact that exposing an address to Python using the reference RVP creates surprising results when the same address is later exposed in a way that carries more ownership. Some examples that don't involve any dangling:

If you expose a reference to T followed by a shared_ptr<T> with the same referent, the shared ownership is lost.
If you expose the same object as reference_internal with two separate parents (which is plausibly correct if both parents keep it alive in C++ via shared ownership), the second parent's lifetime linkage is lost; only the first has an effect.

If we don't admit the existence of dangling objects, then a decent fraction of the value of this PR could be obtained more simply by refusing to reuse an object that carries no ownership when returning an object that does carry ownership. (Or if you do, attach the ownership to the existing object somehow.) "Ownership" here would include things like reference_internal and shared_ptr, so it's not state that is currently tracked explicitly, but doing so would be feasible.

Altogether, I had the impression that the contents of the PR aren't particularly related to #888, which was about confusion between return value policy and whether or not to search the instance map.

The relevance is mostly the example from #888 that says "For example, suppose that code returns an object twice -- once using rv_policy::reference, and later using rv_policy::take_ownership. The second ownership transfer will never occur." This PR fixes that. Another less-slam-dunk link: #888 says "It can also lead to weird/unexpected behavior. For example, [snippet] will not actually move the field when somebody else has previously created a reference to s.field." This PR arguably fixes that too, in that rv_policy::move will now reuse an owning object but not a non-owning one.

There is one part of this PR that seems quite harmless, which is to append objects to a linked list in a different order. I had the impression that that might considerably reduce the fallout from "out of spec" usage (but not all of it?). In any case, I am much more open to a change of that magnitude.

That helps ensure that when we have created multiple Python instances for the same C++ type and address, we return the one that carries more ownership. Unfortunately it doesn't help at all under the nanobind status quo where we never create multiple Python instances for the same C++ type and address.

I agree that this [reuse of relinquished unique_ptr addresses] is a problem, and that should be fixed. I think it will be necessary to redesign the "relinquish" mechanism so that this failure mode does not exist. But this is more specific to unique pointers and not use of non-owning references in general.

I mean, fundamentally if you pass a unique_ptr<T> from Python to C++ and then you later get a unique_ptr<T> passed from C++ to Python with the same address, there is no way to tell whether it's truly the "same object" or not. If you want to be able to revive the existing relinquished instance in order to better support cases where it is the same object, then you have to keep the instance pointer registered somewhere, which creates the opportunity for address collisions if the object is instead destroyed on the C++ side and never passed back to Python. So I think there is actually a fairly close relationship between supporting unique_ptr ownership transfer and allowing dangling instances to exist.

You could say that relinquished instances lose their C++ address identity and can never be revived, such that later passing the same C++ address back to Python always creates a new instance. I think that would be an unfortunate regression from the current status quo, but it is at least consistent and basically understandable.

But if you want to say that relinquished objects maintain their C++ address identity, then I think you have to admit that this means they can dangle, and thus nanobind ought to handle dangling objects in some way other than considering their existence to be evidence of a bug. I guess you could alternatively implement de-conflicting logic along the lines of this PR while refusing to do it for anything but a relinquished object... but that seems a little silly to me :-)

wjakob · 2025-02-02T06:46:07Z

Hi @oremanj,

thank you for your response.

It will take me a bit longer to properly review this PR and its implications. Do you know what the impact of its changes will be on issue #879?

oremanj · 2025-02-03T18:30:18Z

Sure, no rush! #879 is sort of the opposite problem: an instance returned using reference will reuse one that carries (relinquished) ownership. I didn't foresee that case, but I think it would only be a few lines more in nb_type_put_can_reuse to fix it.

wjakob mentioned this pull request Jan 29, 2025

Return value policy gray zone #888

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: Handle dangling objects and rv_policy changes robustly #889

WIP: Handle dangling objects and rv_policy changes robustly #889

oremanj commented Jan 28, 2025

wjakob commented Jan 29, 2025

oremanj commented Jan 29, 2025 •

edited

Loading

wjakob commented Feb 2, 2025 •

edited

Loading

oremanj commented Feb 3, 2025

WIP: Handle dangling objects and rv_policy changes robustly #889

Are you sure you want to change the base?

WIP: Handle dangling objects and rv_policy changes robustly #889

Conversation

oremanj commented Jan 28, 2025

wjakob commented Jan 29, 2025

oremanj commented Jan 29, 2025 • edited Loading

wjakob commented Feb 2, 2025 • edited Loading

oremanj commented Feb 3, 2025

oremanj commented Jan 29, 2025 •

edited

Loading

wjakob commented Feb 2, 2025 •

edited

Loading