
Store optional bounding box information in the column metadata #8

Closed
jorisvandenbossche opened this issue Feb 22, 2022 · 7 comments · Fixed by #21

jorisvandenbossche commented Feb 22, 2022

There was a bit of discussion around this in #4.

The proposal is to add an optional column metadata field (alongside the currently required "crs" and "encoding" fields) that describes the bounding box of the full file (so the overall bounding box or envelope of all geometries in the file).

In the geo-arrow-spec version of this metadata specification, we are already using it (https://github.com/geopandas/geo-arrow-spec/blob/main/metadata.md#bounding-boxes), and there it takes the form of a list that specifies the minimum and maximum values of each dimension. So for 2D data it would look like `"bbox": [<xmin>, <ymin>, <xmax>, <ymax>]`.

This formatting aligns with, for example, the GeoJSON spec (https://datatracker.ietf.org/doc/html/rfc7946#section-5).
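To make the proposed shape concrete, here is a minimal sketch (values and the `crs` placeholder are hypothetical) of computing a file-level bounding box and placing it in the column metadata alongside the required fields:

```python
import json

# Hypothetical sketch: compute the file-level bounding box of a set of
# 2D point coordinates and embed it in the proposed column metadata
# alongside the required "crs" and "encoding" fields.
points = [(0.0, 1.0), (2.5, 3.0), (5.0, 4.0)]

xs = [x for x, _ in points]
ys = [y for _, y in points]
bbox = [min(xs), min(ys), max(xs), max(ys)]  # [xmin, ymin, xmax, ymax]

column_meta = {
    "crs": None,        # placeholder; a real file would carry a CRS here
    "encoding": "WKB",
    "bbox": bbox,
}
print(json.dumps(column_meta["bbox"]))  # [0.0, 1.0, 5.0, 4.0]
```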


This optional information can be useful when processing the data. For example, in dask-geopandas we already use this feature to filter partitions (sub-datasets) of a dataset. With Parquet, people often use "partitioned datasets", where the dataset consists of (potentially nested directories of) many smaller Parquet files. In that situation, you can spatially sort the data when dividing it into partitions, so that each individual file contains the data of a certain region. If each individual Parquet file then stores the bounding box of its geometries, a spatial query only needs to read the files whose bounding boxes intersect the query window (a kind of "predicate pushdown", as can be done for Parquet based on column statistics).
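The file-skipping idea can be sketched as follows (file names and bbox values are made up for illustration; this is not the dask-geopandas implementation):

```python
# Hedged sketch of file-level predicate pushdown: given the per-file
# "bbox" from the geo metadata, skip files whose envelope cannot
# intersect the query window.
def intersects(a, b):
    """True if two [xmin, ymin, xmax, ymax] boxes overlap."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

dataset = {
    "part-0.parquet": [0.0, 0.0, 10.0, 10.0],
    "part-1.parquet": [10.0, 0.0, 20.0, 10.0],
    "part-2.parquet": [0.0, 10.0, 10.0, 20.0],
}

query = [12.0, 2.0, 15.0, 5.0]  # the user's spatial filter

to_read = [path for path, bbox in dataset.items() if intersects(bbox, query)]
print(to_read)  # only part-1.parquet needs to be read
```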

cholmes commented Feb 22, 2022

+1

I'm curious about the argument for making it optional rather than required, or at least recommended? Enabling spatial sorting of partitioned datasets seems like a pretty big win. I suppose we could have an 'extension' for the partitioned-data use case where the bounds are required.

@jorisvandenbossche

I suppose the main reason to have it optional is that it might require an additional computation to obtain the bbox values when writing the data (similarly, in Parquet, column statistics (min/max) are optional). But I don't feel strongly about keeping it optional, and having it "recommended" is certainly good.

cholmes commented Feb 22, 2022

similarly in Parquet, column statistics (min/max) are optional

Cool, that seems like a good precedent to follow. Let's go with 'recommended' then, and explain why it's good to have.

alasarr commented Feb 28, 2022

+1

@paleolimbot

+1 for "optional" (as is the case for most other spatial formats, which can then be written faster without computing anything extra).

It's worth considering the case of lon/lat here, where a rectangular bounding box is at worst "invalid" and at best "odd" once one gets close to the north pole, the south pole, or the international date line. S2's `S2LatLngRect` and PROJ's "area" can both return a rectangle with something like "left_lon" and "right_lon" (rather than min/max) to address that. For geodesic coordinates, an S2 "covering" is a better choice anyway.
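The date-line pitfall is easy to demonstrate (coordinates are hypothetical): two points just either side of the antimeridian produce a naive min/max bbox that spans almost the whole globe instead of the small region they occupy.

```python
# Two points straddling the antimeridian (about 1 degree apart).
lons = [179.5, -179.5]
lats = [0.0, 1.0]

naive_bbox = [min(lons), min(lats), max(lons), max(lats)]
print(naive_bbox)  # [-179.5, 0.0, 179.5, 1.0] -- 359 degrees wide

# A left/right convention (as in S2's S2LatLngRect or PROJ's area of use)
# keeps the box tight by allowing west > east across the date line:
left_right_bbox = [179.5, 0.0, -179.5, 1.0]  # [west, south, east, north]
```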

If a proper spatial index is an option (#13), that might be a better choice.

From a read perspective, if each rowgroup could get its own "bounding box" that would be even better than per-file.

Another thing to consider is that readers have to be very careful to invalidate the bounding box once a subset is computed (in the R bindings this is currently something that happens with a blind call to read_parquet()).

@TomAugspurger TomAugspurger added this to the 0.2 milestone Mar 1, 2022
@cholmes cholmes modified the milestones: 0.2, 0.1 Mar 1, 2022
cholmes commented Mar 1, 2022

It will be in the metadata, so it will be JSON: just an array of 4 numbers.

@jorisvandenbossche

I opened an initial PR for this at #21

It's worth considering the case of lon/lat here, where a rectangular bounding box is at worst "invalid" and at best "odd" ..

Yes, that is a good question, and something I am not fully sure what to do about. I also noted it on the PR (#21 (comment)). The GeoJSON spec mentions that the edges are basically planar straight lines.

From a read perspective, if each rowgroup could get its own "bounding box" that would be even better than per-file.

Unfortunately, in the Arrow implementation of Parquet, we currently don't have access to the row group's column chunk metadata (see a somewhat related issue at https://issues.apache.org/jira/browse/ARROW-15548).

Another thing to consider is that readers have to be very careful to invalidate the bounding box once a subset is computed

Yes, that's indeed a responsibility for the reader (although as long as you only take subsets, the bbox will not really be "invalid", just larger than strictly necessary).
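The reader-side responsibility can be sketched like this (the helper name and metadata shape are hypothetical, not part of the spec): after subsetting, either recompute the bbox or drop it, rather than propagating the stale file-level value.

```python
def subset_metadata(meta, subset_points):
    """Return metadata for a subset; recompute "bbox" if one was present."""
    meta = dict(meta)
    if "bbox" in meta:
        if subset_points:
            xs = [x for x, _ in subset_points]
            ys = [y for _, y in subset_points]
            meta["bbox"] = [min(xs), min(ys), max(xs), max(ys)]
        else:
            del meta["bbox"]  # no geometries left: no meaningful bbox
    return meta

meta = {"encoding": "WKB", "bbox": [0.0, 0.0, 10.0, 10.0]}
new_meta = subset_metadata(meta, [(1.0, 2.0), (3.0, 4.0)])
print(new_meta["bbox"])  # [1.0, 2.0, 3.0, 4.0]
```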
