Store optional bounding box information in the column metadata #8
Comments
+1 I'm curious about the argument for making it optional and not required? Or at least recommended? Enabling spatial sorting of partitioned datasets seems like a pretty big win. I suppose we could have an 'extension' for the partitioned data use case where the bounds are required.
I suppose the main reason to have it optional is that it might require an additional computation to obtain those bbox values when writing the data (similarly, in Parquet, column statistics (min/max) are optional). But I don't feel strongly about having it optional. And having it "recommended" is certainly good.
Cool, that seems like a good precedent to follow. Let's go with 'recommended' then, and explain why it's good to have.
+1
+1 for "optional" (as is the case for most other spatial formats, whose writing can be done faster without computing anything). It's worth considering the case of lon/lat here, where a rectangular bounding box is at worst "invalid" and at best "odd" once one gets close to the north pole, the south pole, or the international date line. S2's latlngrect and PROJ's "area" both can return a rectangle with something like "left_lon" and "right_lon" (rather than min/max) to address that. For geodetic coords, an S2 "covering" is a better choice anyway. If a proper spatial index is an option (#13), that might be a better choice. From a read perspective, if each rowgroup could get its own "bounding box", that would be even better than per-file. Another thing to consider is that readers have to be very careful to invalidate the bounding box once a subset is computed (in the R bindings this is currently something that happens with a blind call to
Will be in the metadata, so will be JSON. Just an array of 4 numbers.
I opened an initial PR for this at #21
Yes, that is a good question, and I am not fully sure what to do about it. I also noted that on the PR (#21 (comment)). The GeoJSON spec mentions that the edges are basically planar straight lines.
Unfortunately, in the Arrow implementation of Parquet, we currently don't have access to the rowgroup's column chunk metadata (see the somewhat related issue at https://issues.apache.org/jira/browse/ARROW-15548).
Yes, that's indeed a responsibility for a reader (although as long as you only take subsets, the bbox will not really be "invalid", just larger than strictly necessary).
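To illustrate the point about subsets, here is a minimal sketch in plain Python. The helpers `bbox_of` and `contains` and the sample points are hypothetical, and planar coordinates are assumed. It shows that a bbox stored for the full data still encloses any subset, even though it is no longer the tightest possible box.

```python
def bbox_of(points):
    """Return [xmin, ymin, xmax, ymax] for an iterable of (x, y) pairs."""
    xs, ys = zip(*points)
    return [min(xs), min(ys), max(xs), max(ys)]


def contains(outer, inner):
    """True if the box `outer` fully encloses the box `inner`."""
    return (
        outer[0] <= inner[0]
        and outer[1] <= inner[1]
        and outer[2] >= inner[2]
        and outer[3] >= inner[3]
    )


# Hypothetical point data standing in for the geometries of a file.
points = [(0.0, 0.0), (2.0, 5.0), (10.0, 3.0)]
stored_bbox = bbox_of(points)   # what a writer would have stored

subset = points[:2]             # a reader takes a subset of the rows
tight_bbox = bbox_of(subset)    # the exact bbox of that subset

# The stored bbox is still a valid (conservative) bound for the subset.
print(contains(stored_bbox, tight_bbox), stored_bbox, tight_bbox)
```

A reader that keeps the stored value after subsetting therefore stays correct for filtering purposes, but should recompute or drop it if a tight bound is expected downstream.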
There was a bit of discussion around this in #4.
The proposal is to add an optional column metadata field (alongside the currently required "crs" and "encoding" fields) that describes the bounding box of the full file (so the overall bounding box or envelope of all geometries in the file).
In the geo-arrow-spec version of this metadata specification, we are already using it (https://github.com/geopandas/geo-arrow-spec/blob/main/metadata.md#bounding-boxes), and there it takes the form of a list that specifies the minimum and maximum values of each dimension. So for 2D data it would look like:
"bbox": [<xmin>, <ymin>, <xmax>, <ymax>]
This formatting aligns with, for example, the GeoJSON spec (https://datatracker.ietf.org/doc/html/rfc7946#section-5).
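As a rough sketch of how a writer might produce such a value, the following plain-Python snippet computes the 2D bbox from a list of coordinates and serializes it as JSON. The `compute_bbox` helper and the sample coordinates are hypothetical, and the metadata dictionary shows only the `bbox` key; the other column metadata fields are omitted here.

```python
import json


def compute_bbox(coords):
    """Return [xmin, ymin, xmax, ymax] for an iterable of (x, y) pairs."""
    xs, ys = zip(*coords)
    return [min(xs), min(ys), max(xs), max(ys)]


# Hypothetical vertex coordinates of all geometries in one file.
coords = [(4.3, 50.8), (5.6, 51.2), (3.2, 51.0)]

# Assumed layout: only the "bbox" key is illustrated, other fields are left out.
column_metadata = {"bbox": compute_bbox(coords)}
print(json.dumps(column_metadata))
```

This is the extra pass over the data that makes the field costly for some writers, which is the main argument for keeping it optional.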
This optional information can be useful when processing this data. For example, in dask-geopandas we already make use of this feature to filter partitions (sub-datasets) of a dataset. When using Parquet, people often make use of "partitioned datasets", where the dataset consists of (potentially nested directories of) many smaller Parquet files. In such a situation, you could spatially sort the data when dividing it into partitions, so that each individual file contains the data of a certain region. If each individual Parquet file then stores the bounding box of its geometries, a reader can read only the files it needs when doing a spatial query on the dataset (a kind of "predicate pushdown", as can be done for Parquet based on column statistics).
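The filtering step described above can be sketched as follows. This is not how dask-geopandas implements it: the file names, the `file_bboxes` mapping, and the helpers `bboxes_intersect` and `select_files` are all hypothetical, and the intersection test assumes planar coordinates with no wrap-around at the date line.

```python
def bboxes_intersect(a, b):
    """True if two [xmin, ymin, xmax, ymax] boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def select_files(file_bboxes, query_bbox):
    """Return the paths whose stored bbox intersects the query bbox."""
    return [
        path
        for path, bbox in file_bboxes.items()
        if bboxes_intersect(bbox, query_bbox)
    ]


# Hypothetical per-file bboxes, as they might be read from each file's metadata.
file_bboxes = {
    "part-0.parquet": [0.0, 0.0, 10.0, 10.0],
    "part-1.parquet": [10.5, 0.0, 20.0, 10.0],
    "part-2.parquet": [0.0, 10.5, 10.0, 20.0],
}

query = [8.0, 8.0, 12.0, 9.0]
print(select_files(file_bboxes, query))
# prints ['part-0.parquet', 'part-1.parquet']
```

Only the selected files would then need to be opened and read; the geometries inside them still have to be checked against the query, since the bbox test is only a coarse filter.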