Disambiguate what copy=True means for dask #866

Open
crusaderky opened this issue Dec 5, 2024 · 6 comments

Comments

@crusaderky

The current documentation for the copy parameter of asarray states:
https://data-apis.org/array-api/latest/API_specification/generated/array_api.asarray.html#asarray

copy (Optional[bool]) – boolean indicating whether or not to copy the input. If True, the function must always copy. If False, the function must never copy for input which supports the buffer protocol and must raise a ValueError in case a copy would be necessary. If None, the function must reuse existing memory buffer if possible and copy otherwise. Default: None.

The meaning of copy=True is ambiguous for dask.
There are two possible interpretations:

  1. updating the output array will never alter the input array; OR
  2. the output array will never share memory with the input array; in other words, dropping every reference to the input array can release its memory even if you still hold a reference to the output array.
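
For an eager, buffer-backed container the two interpretations coincide, which is why the distinction rarely comes up outside lazy libraries. The following is a minimal sketch using only the Python standard library (`array` and `memoryview`), not the array API or dask; the variable names are illustrative assumptions.

```python
from array import array

src = array("d", [1.0, 2.0, 3.0])

# A view shares the underlying buffer with src: no copy is made.
view = memoryview(src)
view[0] = 99.0            # this write is visible through src as well

# array(typecode, initializer) allocates a new buffer: a real copy.
dup = array("d", src)
dup[1] = -1.0             # src is not affected by this write

# Interpretation 1: updating the copy did not alter the input.
input_unchanged = src[1] == 2.0

# Interpretation 2: the copy does not share memory with the input.
# buffer_info() returns (address, length); different addresses mean
# the two objects are backed by different buffers.
shares_memory = src.buffer_info()[0] == dup.buffer_info()[0]
```

Here both properties hold for `dup` and neither holds for `view`. The question in this issue is what to require when a library can satisfy the first property without the second.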

In dask, updating a collection actually creates brand new graph nodes under the hood and repoints the collection to those nodes, so the original is never modified. However, the original chunks, generated e.g. by from_array, are still held inside the graph.
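
To make the behaviour described above concrete, here is a toy model of a graph-backed collection. It is a sketch under stated assumptions, not dask's actual API or internals: the `LazyArray` class, its methods, and the key names are all hypothetical and exist only for illustration.

```python
class LazyArray:
    """Toy stand-in for a lazy, graph-backed collection (NOT dask's API)."""

    def __init__(self, graph, key):
        self.graph = graph  # dict: key -> stored chunk, or (func, *input_keys)
        self.key = key      # the graph node this collection points to

    @classmethod
    def from_chunk(cls, chunk):
        return cls({"chunk-0": chunk}, "chunk-0")

    def copy(self):
        # New collection with its own graph dict, but the very same
        # chunk objects inside it: no chunk data is duplicated.
        return LazyArray(dict(self.graph), self.key)

    def set_item(self, index, value):
        # An "update" adds a new node and repoints the collection to it;
        # existing nodes and the chunks they hold are left untouched.
        def task(chunk):
            out = list(chunk)
            out[index] = value
            return out

        new_key = f"setitem-{len(self.graph)}"
        self.graph[new_key] = (task, self.key)
        self.key = new_key

    def compute(self, key=None):
        node = self.graph[self.key if key is None else key]
        if isinstance(node, tuple):
            func, *deps = node
            return func(*(self.compute(dep) for dep in deps))
        return node


chunk = [1, 2, 3]
a = LazyArray.from_chunk(chunk)
b = a.copy()
b.set_item(0, 99)

a_result = a.compute()                      # [1, 2, 3]
b_result = b.compute()                      # [99, 2, 3]
shares_chunk = b.graph["chunk-0"] is chunk  # True
```

In this model the first interpretation holds (updating `b` never changes what `a` computes) while the second does not (`b` still holds the original chunk object in its graph).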

I strongly prefer the first definition, as IMHO decisions around memory management should be considered low level and delegated to the individual libraries.

@rgommers
Member

rgommers commented Dec 5, 2024

I strongly prefer the first definition, as IMHO decisions around memory management should be considered low level and delegated to the individual libraries.

Yes, agreed. In all such cases we've cared about the semantics, not about physical memory layout. A library should always be able to reuse memory or apply similar optimizations, as long as it guarantees that doing so does not affect the semantics.

I'm pretty sure that using definition (1) is not controversial, and I remember clarifying this in some other place in the docs already - just can't remember where.

@ev-br

ev-br commented Dec 5, 2024

If anything, it'd be quite helpful to clarify what copy=None means for dask; see data-apis/array-api-compat#209

@crusaderky
Author

crusaderky commented Dec 5, 2024

If anything, it'd be quite helpful to clarify what copy=None means for dask; see data-apis/array-api-compat#209

We could change "if possible" to "if possible and reasonable".

@asmeurer
Member

asmeurer commented Dec 6, 2024

I'm pretty sure that using definition (1) is not controversial, and I remember clarifying this in some other place in the docs already - just can't remember where.

#788 (comment)

kgryte added this to the v2024 milestone Dec 6, 2024

@rgommers
Member

rgommers commented Dec 6, 2024

Thanks for the link @asmeurer!

It seems like this issue overlaps a lot with that one. The "logical semantics" point stands I think - the difference is that for JAX copies sometimes do matter, while for Dask they never matter according to @crusaderky. So I think for Dask it's fine to not copy memory - would you agree?

The "never" here is of course the point of interest: it is probably possible to write some code for which that "never" does not hold, especially if one starts mixing Dask and NumPy, which is fairly common.
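
One way such a case could arise is when a lazy object and an eager, mutable buffer refer to the same memory. The sketch below is a pure-Python toy, not Dask or NumPy code: `lazy_from_buffer` and `lazy_copy` are hypothetical helpers, and a plain list stands in for an eager array. It does not establish how any real library behaves, only why a "copy" that skips the memory copy can become observable.

```python
def lazy_from_buffer(buf):
    # Hypothetical lazy wrapper: keeps a reference to buf and reads it
    # only when the result is requested.
    return lambda: list(buf)


def lazy_copy(lazy):
    # A "copy" that creates a new lazy object over the same source,
    # without duplicating the underlying buffer.
    return lambda: lazy()


eager = [1, 2, 3]                  # stands in for a mutable eager array
original = lazy_from_buffer(eager)
duplicate = lazy_copy(original)

eager[0] = 99                      # in-place write outside the lazy graph

result = duplicate()               # [99, 2, 3]
```

The write to `eager` shows up in `duplicate`, so in this toy the copy is not isolated from the input once in-place mutation happens outside the lazy layer.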

@crusaderky
Author

xref #867


5 participants