tensorstore cannot open vlen UTF8 string written with Zarr-Python #103

Open
shoyer opened this issue Jun 5, 2023 · 3 comments


shoyer commented Jun 5, 2023

To reproduce:

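import numpy as np
import tensorstore
import zarr

# Write a variable-length string array with Zarr-Python.
zarr.save_array(
    '/tmp/string.zarr',
    np.array(['foo', 'bar'], dtype=object),
    dtype=str,
)
# Opening the same array with TensorStore's zarr driver fails.
tensorstore.open({'driver': 'zarr', 'kvstore': 'file:///tmp/string.zarr'}).result()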

Raises:

ValueError: FAILED_PRECONDITION: Error opening "zarr" driver: Error reading local file "/tmp/string.zarr/.zarray": Error parsing object member "dtype": Unsupported zarr dtype: "|O" [tensorstore_spec='{\"context\":{\"cache_pool\":{},\"data_copy_concurrency\":{},\"file_io_concurrency\":{}},\"driver\":\"zarr\",\"kvstore\":{\"driver\":\"file\",\"path\":\"/tmp/string.zarr/\"}}']

Zarr-Python writes these arrays (see documentation) with an "object" dtype and vlen-utf8 filter:

$ cat /tmp/string.zarr/.zarray
{
    "chunks": [
        2
    ],
    "compressor": {
        "blocksize": 0,
        "clevel": 5,
        "cname": "lz4",
        "id": "blosc",
        "shuffle": 1
    },
    "dtype": "|O",
    "fill_value": 0,
    "filters": [
        {
            "id": "vlen-utf8"
        }
    ],
    "order": "C",
    "shape": [
        2
    ],
    "zarr_format": 2
}
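
The combination that the zarr driver rejects can be recognized from the metadata alone. A minimal sketch using only the standard library; `uses_vlen_utf8` is a hypothetical helper name, not part of Zarr-Python or TensorStore:

```python
import json


def uses_vlen_utf8(zarray_text: str) -> bool:
    """Return True if .zarray metadata describes an object array with a vlen-utf8 filter."""
    meta = json.loads(zarray_text)
    # "filters" may be null or absent in the metadata, so normalize to a list.
    filters = meta.get("filters") or []
    has_vlen_filter = any(f.get("id") == "vlen-utf8" for f in filters)
    return meta.get("dtype") == "|O" and has_vlen_filter


# Trimmed-down version of the metadata shown above.
example = '{"dtype": "|O", "filters": [{"id": "vlen-utf8"}], "zarr_format": 2}'
print(uses_vlen_utf8(example))
```

A check like this could be run on `/tmp/string.zarr/.zarray` before handing the path to TensorStore, to fall back to Zarr-Python for string arrays.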

jbms commented Jun 16, 2023

vlen-utf8 isn't currently supported, but adding it would not be too difficult. I may be able to add support soon.

Will-Tyler commented

Hi, I would be interested in working on this issue if you support outside contributions and think this is a reasonable first issue. I used TensorStore for some small projects, so I have some familiarity with the C++ API. Let me know, and I will look into it.


laramiel commented Sep 25, 2024

We accept contributions; however, I'm not sure that this is a good first issue. We currently don't support variable-length datatypes, and I think that supporting them will be quite involved. We also don't support filters. Perhaps adding filter support would be a better starting point than adding vlen-utf8.

If you still want to look at this, start by poking around in tensorstore/driver/zarr, for example the dtype_test.
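
For orientation, a rough sketch of what decoding a vlen-utf8 chunk involves once the compressor has been undone. It assumes the byte layout used by the numcodecs VLenUTF8 codec is a 4-byte little-endian item count followed by each item as a 4-byte little-endian length plus its UTF-8 bytes; that layout is an assumption here and should be checked against the numcodecs source. The function names are hypothetical, and none of this is TensorStore code.

```python
import struct


def encode_vlen_utf8(items):
    """Pack strings as: uint32 count, then (uint32 length, UTF-8 bytes) per item.

    Assumed layout; verify against numcodecs before relying on it.
    """
    out = [struct.pack("<I", len(items))]
    for s in items:
        data = s.encode("utf-8")
        out.append(struct.pack("<I", len(data)))
        out.append(data)
    return b"".join(out)


def decode_vlen_utf8(buf):
    """Inverse of encode_vlen_utf8; raises ValueError on truncated input."""
    if len(buf) < 4:
        raise ValueError("buffer too short for item count header")
    (count,) = struct.unpack_from("<I", buf, 0)
    offset = 4
    items = []
    for _ in range(count):
        if offset + 4 > len(buf):
            raise ValueError("truncated length prefix")
        (length,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        if offset + length > len(buf):
            raise ValueError("truncated item data")
        items.append(buf[offset:offset + length].decode("utf-8"))
        offset += length
    return items


chunk = encode_vlen_utf8(["foo", "bar"])
print(decode_vlen_utf8(chunk))
```

Because each item occupies a different number of bytes, the element at index i cannot be located without walking the preceding length prefixes, which is part of why a driver built around fixed-size element types needs more than a new dtype entry to handle this.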
