Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return value policy gray zone #888

Open
wjakob opened this issue Jan 28, 2025 · 6 comments
Open

Return value policy gray zone #888

wjakob opened this issue Jan 28, 2025 · 6 comments

Comments

@wjakob
Copy link
Owner

wjakob commented Jan 28, 2025

cc @hawkinsp @vfdev-5 @oremanj

Nanobind exposes various return value policies that generally do an OK job. But there is a gray zone where their current behavior is IMO confusing. In particular, rv_policy::move, rv_policy::take_ownership, and rv_policy::reference_internal all check if an existing Python object is already associated with the pointer/reference. In that case, they directly return that and the return value policy is ignored.

This can lead to leaks. For example, suppose that code returns an object twice -- once using rv_policy::reference, and later using rv_policy::take_ownership. The second ownership transfer will never occur.

It can also lead to weird/unexpected behavior. For example,

.def("move_field", [](Struct &s) -> Field& { return s.field; }, rv_policy::move)

will not actually move the field when somebody else has previously created a reference to s.field.

Confusion aside, these lookups to check for the existence of an instance also have a non-negligible cost (hash table traversal) that would be nice to avoid.

But it is not so simple to change this behavior.

For example, one low hanging fruit that I was looking at just now was to disable the search for existing instances when the user passes rv_policy::move (which is used for pass-by value, so this would be great optimization that hits many usecases).

However, this breaks overloaded assignment operators (specifically nb::self() += nb::self() used in test14_operators). This is related to a PR by @oremanj (#803). The operator is overloaded with rv_policy::move and enforcing that now breaks preservation of the same object for an in-place update.

Taking a step back, the ideal behavior for in-place operators is to return the same Python object if the operator returns *this, and copy or move otherwise (perhaps move in the case of pass-by value, and copy in remaining cases?) We always move for pass-by value, so it seems that we need an return_existing_or_copy return value policy...

Next, I looked at disabling the instance search for rv_policy::take_ownership. This causes many unique pointer-related tests to fail in test_holders.py, and the test suite eventually segfaults in tests/test_thread.py.

The segfault is instructive: it happens in a function binding that essentially does [](Value *value) -> Value * { return value; } (pointer pass-through). The default value policy automatic turns into take_ownership, and that's the wrong one to use here. So making any change here will probably break quite a lot of user code.

I thought that it would be useful to have a discussion about whether others think this is a problem, and how to improve it.

One potential solution that I was thinking about is to separate the return value policy into two independent aspects:

  1. what happens when an existing instance is found, and
  2. what to do if not?

So we could, e.g., have

  • always_copy
  • always_move
  • always_take_ownership
  • return_existing_or_copy
  • return_existing_or_move
  • return_existing_or_take_ownership
  • ...

(with the current policy names mapping to the return_existing_* variants).

By numbering them suitably, the use of these (many) policies could be dealt with using bit arithmetic in the implementation. To take the negative position, this change makes something that users already find confusing even more overwhelming.

Your feedback would be greatly appreaciated!

oremanj added a commit to oremanj/nanobind that referenced this issue Jan 28, 2025
This is a potential answer to wjakob#888, also inspired by the discussion in wjakob#589. Seeking feedback on the approach before I work on docs and tests.

When returning unique or shared ownership to Python when some Python instances already hold a non-owning reference to the same C++ object, we have a few choices:
* Create a new owning instance in addition to the existing non-owning one. (This PR's approach.)
* Upgrade the existing non-owning instance to owning. (More appealing in some ways, but hard to implement with shared ownership since nanobind doesn't have holders.)
* Just return the non-owning instance and forget about the ownership. (nanobind's current approach, which sometimes causes problems.)

This PR also handles the fact that a non-owning instance can dangle. That is, the referenced C++ object could be destroyed while the Python instance is still alive -- especially on PyPy, where it's hard to control when a Python instance dies. Dangling instances are always non-owning, so they are mostly handled by the same logic that handles rv_policy changes. The remaining piece of support for dangling instances is to acknowledge that, if the C++ referent is freed, a new instance could be allocated with its same address, even one with internal storage. (I have observed this in production.) So, there is some new logic in `inst_new_int` to remove the previous must-be-dangling instances from `inst_c2p`, rather than crashing because they exist. This also implies a new state that a nanobind instance can be in: if inst->offset == 0, the instance refers to no C++ object and is not stored in the inst_c2p map.
oremanj added a commit to oremanj/nanobind that referenced this issue Jan 28, 2025
This is a potential answer to wjakob#888, also inspired by the discussion in wjakob#589. Seeking feedback on the approach before I work on docs and tests.

When returning unique or shared ownership to Python when some Python instances already hold a non-owning reference to the same C++ object, we have a few choices:
* Create a new owning instance in addition to the existing non-owning one. (This PR's approach.)
* Upgrade the existing non-owning instance to owning. (More appealing in some ways, but hard to implement with shared ownership since nanobind doesn't have holders.)
* Just return the non-owning instance and forget about the ownership. (nanobind's current approach, which sometimes causes problems.)

This PR also handles the fact that a non-owning instance can dangle. That is, the referenced C++ object could be destroyed while the Python instance is still alive -- especially on PyPy, where it's hard to control when a Python instance dies. Dangling instances are always non-owning, so they are mostly handled by the same logic that handles rv_policy changes. The remaining piece of support for dangling instances is to acknowledge that, if the C++ referent is freed, a new instance could be allocated with its same address, even one with internal storage. (I have observed this in production.) So, there is some new logic in `inst_new_int` to remove the previous must-be-dangling instances from `inst_c2p`, rather than crashing because they exist. This also implies a new state that a nanobind instance can be in: if inst->offset == 0, the instance refers to no C++ object and is not stored in the inst_c2p map.
@oremanj
Copy link
Contributor

oremanj commented Jan 28, 2025

As it happens, I attempted to solve this problem some months ago, following the discussion in #589. I gave up on it because I was running into a few too many fiddly corner cases, but with fresh eyes all of them were resolved pretty easily and I've uploaded the result as #889. That PR is focused on the correctness issues here, not the performance ones.

I don't think straight-up doubling the number of rv_policies is a very user-friendly solution (correctly choosing between always_take_ownership and return_existing_or_take_ownership is equivalent to choosing whether the already-exposed instances are non-owning or owning, which seems very difficult for the binding author to determine), but if we're willing to break the 3-bit barrier, I could at least see a use case for adding new copy/move policies:

  • return_existing_or_copy (possible shorter names copy_or_none, copy_if_new) would be the appropriate policy for augmented assignment operators, freeing up move to not look for an existing object
  • return_existing_or_move (move_or_none, move_if_new) might be the appropriate policy for functions that return an rvalue reference, while unconditional move is appropriate for functions that return a value

@wjakob
Copy link
Owner Author

wjakob commented Jan 29, 2025

Thank you @oremanj. I think you meant #889, I posted feedback there.

@oremanj
Copy link
Contributor

oremanj commented Jan 29, 2025

Sorry yes, #888 for #889 was a typo. I'll respond there.

@hpkfft
Copy link
Contributor

hpkfft commented Jan 29, 2025

A quick thought (that might appear simpler to developers) would be rvp::lookup | rvp::move instead of return_existing_or_move.

@hpkfft
Copy link
Contributor

hpkfft commented Jan 30, 2025

Is it useful to have both copy and move? I haven't had the occasion to use this aspect of nanobind much, so I may be missing something obvious. However, I'm wondering if the existing policies are unnecessarily mixing purely C++ policy with the relationship between Python objects and C++ objects.
C++ has copy elision, and if that is not used, selects move constructors or copy constructors as appropriate for the return statement.
Can nanobind have something like rvp::new_object to replace both copy and move?

rvp description
take_ownership Python will own and will eventually delete/free the C++ resources
new_object Python will own a new object that is distinct from the C++ one
lookup Python already has a corresponding object, which is looked up and returned
reference Python refers to a C++ object and C++ has lifetime responsibilities

And, reference_internal as before

@oremanj
Copy link
Contributor

oremanj commented Jan 30, 2025

Is it useful to have both copy and move?

They're at least needed internally in order to communicate the value category to nb_type_put, since a function's return value has been type-erased to void* by the time nanobind actually copies or moves it. We need to be able to tell that, by default, we should move out of a returned rvalue (it's guaranteed safe since we're the only ones with access to it) but not move out of a returned const T&.

It would be theoretically possible to distinguish between "user-facing" RVPs (which get mentioned in .def() or nb::cast() arguments, and ultimately passed to type_caster::from_cpp() which knows the static type of the thing that's being converted) versus "internal" RVPs (which get passed to nb_type_put and need to provide enough detail to operate on a void*). If we did that, then automatic/automatic_reference/new_object would be external-only, copy/move would be internal-only, and reference/lookup/take_ownership would be both. I'm not sure it's worth the trouble and mental overhead though.

For reference, the converter between "external" and "internal" RVPs in current nanobind is infer_policy. It's hard to tell because both use the same enum type.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants