
Store optional bounding box information in the column metadata #8

Closed
jorisvandenbossche opened this issue Feb 22, 2022 · 7 comments · Fixed by #21

jorisvandenbossche commented Feb 22, 2022

There was a bit of discussion around this in #4.

The proposal is to add an optional column metadata field (alongside the currently required "crs" and "encoding" fields) that describes the bounding box of the full file (so the overall bounding box or envelope of all geometries in the file).

In the geo-arrow-spec version of this metadata specification, we are already using it (https://github.com/geopandas/geo-arrow-spec/blob/main/metadata.md#bounding-boxes), and there it takes the form of a list that specifies the minimum and maximum values of each dimension. So for 2D data it would look like `"bbox": [<xmin>, <ymin>, <xmax>, <ymax>]`.

This formatting aligns with, for example, the GeoJSON spec (https://datatracker.ietf.org/doc/html/rfc7946#section-5).
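To make the proposed shape concrete, here is a minimal sketch (values and the `crs` placeholder are hypothetical) of computing a file-level bounding box and placing it in the column metadata alongside the required fields:

```python
import json

# Hypothetical sketch: compute the file-level bounding box of a set of
# 2D point coordinates and embed it in the proposed column metadata
# alongside the required "crs" and "encoding" fields.
points = [(0.0, 1.0), (2.5, 3.0), (5.0, 4.0)]

xs = [x for x, _ in points]
ys = [y for _, y in points]
bbox = [min(xs), min(ys), max(xs), max(ys)]  # [xmin, ymin, xmax, ymax]

column_meta = {
    "crs": None,        # placeholder; a real file would carry a CRS here
    "encoding": "WKB",
    "bbox": bbox,
}
print(json.dumps(column_meta["bbox"]))  # [0.0, 1.0, 5.0, 4.0]
```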


This optional information can be useful when processing the data. For example, in dask-geopandas we already use this feature to filter partitions (sub-datasets) of a dataset. With Parquet, people often use "partitioned datasets", where the dataset consists of (potentially nested directories of) many smaller Parquet files. In that situation, you can spatially sort the data when dividing it into partitions, so that each individual file contains the data of a certain region. If each individual Parquet file then stores the bounding box of its geometries, a spatial query only needs to read the files whose bounding boxes intersect the query window (a kind of "predicate pushdown", as can be done for Parquet based on column statistics).
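The file-skipping idea can be sketched as follows (file names and bbox values are made up for illustration; this is not the dask-geopandas implementation):

```python
# Hedged sketch of file-level predicate pushdown: given the per-file
# "bbox" from the geo metadata, skip files whose envelope cannot
# intersect the query window.
def intersects(a, b):
    """True if two [xmin, ymin, xmax, ymax] boxes overlap."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

dataset = {
    "part-0.parquet": [0.0, 0.0, 10.0, 10.0],
    "part-1.parquet": [10.0, 0.0, 20.0, 10.0],
    "part-2.parquet": [0.0, 10.0, 10.0, 20.0],
}

query = [12.0, 2.0, 15.0, 5.0]  # the user's spatial filter

to_read = [path for path, bbox in dataset.items() if intersects(bbox, query)]
print(to_read)  # only part-1.parquet needs to be read
```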

cholmes commented Feb 22, 2022

+1

I'm curious about the argument for making it optional rather than required, or at least recommended? Enabling spatial sorting of partitioned datasets seems like a pretty big win. I suppose we could have an 'extension' for the partitioned-data use case where the bounds are required.

@jorisvandenbossche

I suppose the main reason to have it optional is that it might require an additional computation to obtain the bbox values when writing the data (similarly, in Parquet, column statistics (min/max) are optional). But I don't feel strongly about keeping it optional, and having it "recommended" is certainly good.

cholmes commented Feb 22, 2022

similarly in Parquet, column statistics (min/max) are optional

Cool, that seems like a good precedent to follow. Let's go with 'recommended' then, and explain why it's good to have.

alasarr commented Feb 28, 2022

+1

@paleolimbot

+1 for "optional" (as is the case for most other spatial formats, which can then be written faster without computing anything extra).

It's worth considering the case of lon/lat here, where a rectangular bounding box is at worst "invalid" and at best "odd" once one gets close to the north pole, the south pole, or the international date line. S2's `S2LatLngRect` and PROJ's "area" can both return a rectangle with something like "left_lon" and "right_lon" (rather than min/max) to address that. For geodesic coordinates, an S2 "covering" is a better choice anyway.
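The date-line pitfall is easy to demonstrate (coordinates are hypothetical): two points just either side of the antimeridian produce a naive min/max bbox that spans almost the whole globe instead of the small region they occupy.

```python
# Two points straddling the antimeridian (about 1 degree apart).
lons = [179.5, -179.5]
lats = [0.0, 1.0]

naive_bbox = [min(lons), min(lats), max(lons), max(lats)]
print(naive_bbox)  # [-179.5, 0.0, 179.5, 1.0] -- 359 degrees wide

# A left/right convention (as in S2's S2LatLngRect or PROJ's area of use)
# keeps the box tight by allowing west > east across the date line:
left_right_bbox = [179.5, 0.0, -179.5, 1.0]  # [west, south, east, north]
```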

If a proper spatial index is an option (#13), that might be a better choice.

From a read perspective, if each rowgroup could get its own "bounding box" that would be even better than per-file.

Another thing to consider is that readers have to be very careful to invalidate the bounding box once a subset is computed (in the R bindings this is currently something that happens with a blind call to read_parquet()).

@TomAugspurger TomAugspurger added this to the 0.2 milestone Mar 1, 2022
@cholmes cholmes modified the milestones: 0.2, 0.1 Mar 1, 2022
cholmes commented Mar 1, 2022

It will be in the metadata, so it will be JSON: just an array of 4 numbers.

@jorisvandenbossche

I opened an initial PR for this at #21

It's worth considering the case of lon/lat here, where a rectangular bounding box is at worst "invalid" and at best "odd" ..

Yes, that is a good question, and something I am not fully sure what to do about. I also noted it on the PR (#21 (comment)). The GeoJSON spec mentions that the edges are basically planar straight lines.

From a read perspective, if each rowgroup could get its own "bounding box" that would be even better than per-file.

Unfortunately, in the Arrow implementation of Parquet, we currently don't have access to the row group's column chunk metadata (see a somewhat related issue at https://issues.apache.org/jira/browse/ARROW-15548).

Another thing to consider is that readers have to be very careful to invalidate the bounding box once a subset is computed

Yes, that's indeed a responsibility for the reader (although as long as you only take subsets, the bbox will not really be "invalid", just larger than strictly necessary).
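The reader-side responsibility can be sketched like this (the helper name and metadata shape are hypothetical, not part of the spec): after subsetting, either recompute the bbox or drop it, rather than propagating the stale file-level value.

```python
def subset_metadata(meta, subset_points):
    """Return metadata for a subset; recompute "bbox" if one was present."""
    meta = dict(meta)
    if "bbox" in meta:
        if subset_points:
            xs = [x for x, _ in subset_points]
            ys = [y for _, y in subset_points]
            meta["bbox"] = [min(xs), min(ys), max(xs), max(ys)]
        else:
            del meta["bbox"]  # no geometries left: no meaningful bbox
    return meta

meta = {"encoding": "WKB", "bbox": [0.0, 0.0, 10.0, 10.0]}
new_meta = subset_metadata(meta, [(1.0, 2.0), (3.0, 4.0)])
print(new_meta["bbox"])  # [1.0, 2.0, 3.0, 4.0]
```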
