PARQUET-2471: Add GEOMETRY and GEOGRAPHY logical types #240

wgtmac · 2024-05-10T14:56:04Z

This PR adds the geometry and geography logical types to the Parquet spec.

This is joint work with Apache Iceberg and GeoParquet to add native geospatial support:

Design doc: https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI.
Iceberg PR: Spec: Support geo type iceberg#10981

Closes #429

jiayuasu · 2024-05-10T20:13:12Z

@wgtmac Thanks for the work. On the other hand, I'd like to highlight that GeoParquet (https://github.com/opengeospatial/geoparquet/tree/main) has been around for a while, and a lot of geospatial software has started to support reading and writing it.

Is the ultimate goal of this PR to merge the GeoParquet spec into Parquet completely, or might it end up creating a new spec that is not compatible with GeoParquet?

jiayuasu · 2024-05-10T20:15:13Z

Geo Iceberg does not need to conform to GeoParquet because people should not directly use a parquet reader to read iceberg parquet files anyways. So that's a separate story.

wgtmac · 2024-05-11T01:23:58Z

Is the ultimate goal of this PR to merge the GeoParquet spec into Parquet completely, or might it end up creating a new spec that is not compatible with GeoParquet?

@jiayuasu That's why I've asked about the possibility of direct compliance with the GeoParquet spec in the Iceberg design doc. I don't intend to create a new spec. Instead, it would be good if the proposal here could meet the requirements of both Iceberg and GeoParquet, or share the common parts to make the conversion between Iceberg Parquet and GeoParquet lightweight. We do need advice from the GeoParquet community to make that possible.

szehon-ho

From the Iceberg side, I am excited about this. I think defining the type formally in parquet-format will make geospatial interop easier in the long run, and it will also unlock row group filtering, for example for Iceberg's add_file for Parquet files. Perhaps there can be conversion utils for GeoParquet if we go ahead with this, and I'd definitely like to see what they think.

I'm new on the Parquet side, so I had some questions.

src/main/thrift/parquet.thrift

pitrou · 2024-05-15T08:24:29Z

@paleolimbot is quite knowledgeable on the topic and could probably give useful feedback.

pitrou · 2024-05-15T08:36:13Z

I wonder if purely informative metadata really needs to be represented as Thrift types. When we define canonical extension types in Arrow, metadata is generally serialized as a standalone JSON string.

Doing so in Parquet as well would lighten the maintenance workload on the serialization format, and would also allow easier evolution of geometry metadata to support additional information.

Edit: this seems to be the approach adopted by GeoParquet as well.

paleolimbot

I wonder if purely informative metadata really needs to be represented as Thrift types. When we define canonical extension types in Arrow, metadata is generally serialized as a standalone JSON string.

In reading this I do wonder if there should just be an extension mechanism here instead of attempting to enumerate all possible encodings in this repo. The people that are engaged and working on implementations are the right people to engage here, which is why GeoParquet and GeoArrow have been successful (we've engaged the people who care about this, and they are generally not paying attention to apache/parquet-format nor apache/arrow).

There are a few things that this PR solves in a way that might not be possible using EXTENSION, which is that of column statistics. It would be nice to have some geo-specific things there (although maybe that can also be part of the extension mechanism). Another thing that comes up frequently is where to put a spatial index (rtree)...I don't think there's any good place for that at the moment.

It would be nice to allow the format to be extended in a way that does not depend on schema-level metadata. The things we do in the GeoParquet standard (store bounding boxes, refer to columns by name) become stale because of the ways schema metadata is typically propagated through projections and concatenations.

src/main/thrift/parquet.thrift

wgtmac · 2024-05-17T15:46:24Z

I wonder if purely informative metadata really needs to be represented as Thrift types. When we define canonical extension types in Arrow, metadata is generally serialized as a standalone JSON string.

Doing so in Parquet as well would lighten the maintenance workload on the serialization format, and would also allow easier evolution of geometry metadata to support additional information.

Edit: this seems to be the approach adopted by GeoParquet as well.

@pitrou Yes, that might be an option. Then we can perhaps use the same json string defined in the iceberg doc. @jiayuasu @szehon-ho WDYT?

EDIT: I think we can remove those informative attributes like subtype, orientation, edges. Perhaps encoding can be removed as well if we only support WKB. dimension is something that we must be aware of because we need to build bbox which depends on whether the coordinate is represented as xy, xyz, xym and xyzm.
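
For readers following the dimension point, here is a minimal sketch of how a writer might detect whether a value is xy, xyz, xym or xyzm before building the bbox. It assumes ISO WKB, where the geometry type code carries the dimension in its thousands digit (for example, 1 is a 2D point and 1001 is a point with Z). The helper name `wkb_dimensions` is hypothetical and not part of this PR.

```python
import struct

# ISO WKB convention: type code // 1000 selects the coordinate dimensions.
DIMENSIONS = {0: "XY", 1: "XYZ", 2: "XYM", 3: "XYZM"}


def wkb_dimensions(wkb: bytes) -> str:
    """Return the coordinate dimensions declared in an ISO WKB header."""
    # Byte 0 is the byte-order flag: 1 = little-endian, 0 = big-endian.
    byte_order = "<" if wkb[0] == 1 else ">"
    # Bytes 1..4 hold the geometry type code as an unsigned 32-bit integer.
    (type_code,) = struct.unpack_from(byte_order + "I", wkb, 1)
    return DIMENSIONS[type_code // 1000]


# A 2D point and a point with Z, both little-endian.
point_xy = struct.pack("<BIdd", 1, 1, 10.0, 20.0)
point_xyz = struct.pack("<BIddd", 1, 1001, 10.0, 20.0, 5.0)
print(wkb_dimensions(point_xy), wkb_dimensions(point_xyz))
```

EWKB, as emitted by some tools, flags Z and M with high bits of the type code instead, so this sketch would not apply there as written.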

wgtmac · 2024-05-17T15:54:38Z

Another thing that comes up frequently is where to put a spatial index (rtree)

I thought this can be something similar to the page index or bloom filter in parquet, which are stored somewhere between row groups or before the footer. It can be row group level or file level as well.

It would be nice to allow the format to be extended in a way that does not depend on schema-level metadata.

I think we really need your advice here. If you were to rethink the design of GeoParquet, how could it do better if the Parquet format had some geospatial knowledge? @paleolimbot @jiayuasu

paleolimbot · 2024-05-17T19:48:56Z

If you rethink the design of GeoParquet, how can it do better if parquet format has some geospatial knowledge?

The main reason the schema-level metadata had to exist is that there was no way to put anything custom at the column level to give geometry-aware readers extra metadata about the column (CRS being the main one) and global column statistics (bbox). Bounding boxes at the feature level (worked around as a separate column) are the second somewhat ugly thing, which gives reasonable row group statistics for many things people might want to store. It seems like this PR would solve most of that.

I am not sure that a new logical type will catch on to the extent that GeoParquet will, although I'm new to this community and I may be very wrong. The GeoParquet working group is enthusiastic and encodings/strategies for storing/querying geospatial datasets in a data lake context are evolving rapidly. Even though it is a tiny bit of a hack, using extra columns and schema-level metadata to encode these things is very flexible and lets implementations be built on top of a number of underlying readers/underlying versions of the Parquet format.

wgtmac · 2024-05-18T02:46:21Z

@paleolimbot I'm happy to see the fast evolution of GeoParquet specs. I don't think the addition of geometry type aims to replace or deprecate something from GeoParquet. Instead, GeoParquet can simply ignore the new type as of now, or leverage the built-in bbox if beneficial. For additional (informative) attributes of the geometry type, if some of them are stable and make sense to store them natively into parquet column metadata, then perhaps we can work together to make it happen? I think the main goal of this addition is to enhance interoperability of geospatial data across systems and at the same time it takes little effort to convert to GeoParquet.

Kontinuation · 2024-05-18T06:15:01Z

Another thing that comes up frequently is where to put a spatial index (rtree)

I thought this can be something similar to the page index or bloom filter in parquet, which are stored somewhere between row groups or before the footer. It can be row group level or file level as well.

The bounding-box based sort order defined for geometry logical type is already good enough for performing row-level and page-level data skipping. Spatial index such as R-tree may not be suitable for Parquet. I am aware that flatgeobuf has optional static packed Hilbert R-tree index, but for the index to be effective, flatgeobuf supports random access of records and does not support compression. The minimal granularity of reading data in Parquet files is data pages, and the pages are usually compressed so it is impossible to access records within pages randomly.
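
As a rough illustration of the data skipping described above, the sketch below keeps only the row groups whose bounding box intersects a query box. The `BBox` type, the per-row-group statistics list, and the function names are all hypothetical stand-ins, not an API of any Parquet implementation, and the test assumes planar boxes with xmin <= xmax.

```python
from typing import List, NamedTuple


class BBox(NamedTuple):
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def intersects(a: BBox, b: BBox) -> bool:
    """True when two axis-aligned boxes overlap (touching edges count)."""
    return (
        a.xmin <= b.xmax
        and b.xmin <= a.xmax
        and a.ymin <= b.ymax
        and b.ymin <= a.ymax
    )


def row_groups_to_read(stats: List[BBox], query: BBox) -> List[int]:
    """Return indices of row groups that may contain matching geometries."""
    return [i for i, box in enumerate(stats) if intersects(box, query)]


# One bounding box per row group, as a reader might load from statistics.
stats = [BBox(0, 0, 10, 10), BBox(20, 20, 30, 30), BBox(5, 5, 25, 25)]
query = BBox(8, 8, 12, 12)
print(row_groups_to_read(stats, query))
```

The same test could be applied per page if page-level bounding boxes are available; rows inside a surviving page still have to be decoded and checked individually.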

paleolimbot · 2024-05-20T02:43:39Z

I'm happy to see the fast evolution of GeoParquet specs. I don't think the addition of geometry type aims to replace or deprecate something from GeoParquet.

I agree! I think first-class geometry support is great and I'm happy to help wherever I can. I see GeoParquet as a way for existing spatial libraries to leverage Parquet and is not well-suited to Parquet-native things like Iceberg (although others working on GeoParquet may have a different view).

Extension mechanisms are nice because they allow an external community to hash out the discipline-specific details where these evolve at an orthogonal rate to that of the format (e.g., GeoParquet), which generally results in buy-in. I'm not familiar with the speed at which the changes proposed here can evolve (or how long it generally takes readers to implement them), but if @pitrou's suggestion of encoding the type information or statistics in serialized form makes it easier for this to evolve it could provide some of that benefit.

Spatial index such as R-tree may not be suitable for Parquet

I also agree here (but it did come up a lot of times in the discussions around GeoParquet). I think developers of Parquet-native workflows are well aware that there are better formats for random access.

paleolimbot · 2024-05-21T13:32:08Z

I think we really need your advice here. If you were to rethink the design of GeoParquet, how could it do better if the Parquet format had some geospatial knowledge?

I opened up opengeospatial/geoparquet#222 to collect some thoughts on this...we discussed it at our community call and I think we mostly just never considered that the Parquet standard would be interested in supporting a first-class data type. I've put my thoughts there but I'll let others add their own opinions.

src/main/thrift/parquet.thrift

jorisvandenbossche · 2024-05-21T15:20:13Z

Just to ensure my understanding is correct:

This is proposing to add a new logical type annotating the BYTE_ARRAY physical type. For readers that expect just such a BYTE_ARRAY column (e.g. existing GeoParquet implementations), that is compatible if the column would start having a logical type as well? (although I assume this might depend on how the specific parquet reader implementation deals with an unknown logical type, i.e. error about that or automatically fall back to the physical type).
For such "legacy" readers (just reading the WKB values from a binary column), the only thing that actually changes (apart from the logical type annotation) are the values of the statistics? Now, I assume that right now no GeoParquet reader is using the statistics of the binary column, because the physical statistics for BYTE_ARRAY ("unsigned byte-wise comparison") are essentially useless in the case those binary blobs represent WKB geometries. So again that should probably not give any compatibility issues?
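
A small sketch of why byte-wise BYTE_ARRAY statistics carry no spatial meaning for WKB: the minimum and maximum under unsigned byte-wise comparison do not line up with the numeric extent of the coordinates. The points and the `wkb_point` helper are made up for illustration, and little-endian ISO WKB is assumed.

```python
import struct


def wkb_point(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian ISO WKB."""
    return struct.pack("<BIdd", 1, 1, x, y)


xs = [1.0, 2.0, -3.0, 0.5]
values = [wkb_point(x, 0.0) for x in xs]

# What BYTE_ARRAY min/max statistics would record: Python compares bytes
# objects lexicographically as unsigned values, like the physical order.
byte_min, byte_max = min(values), max(values)

# What a spatial reader actually needs: the numeric extent in x.
x_min, x_max = min(xs), max(xs)
```

Here the byte-wise minimum is the point with x = 2.0 and the byte-wise maximum is the point with x = 1.0, while the true x extent is [-3.0, 2.0].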

jorisvandenbossche · 2024-05-21T16:03:09Z

although I assume this might depend on how the specific parquet reader implementation deals with an unknown logical type, i.e. error about that or automatically fall back to the physical type

To answer this part myself, at least for the Parquet C++ implementation, it seems an error is raised for unknown logical types, and it doesn't fall back to the physical type. So that does complicate the compatibility story ..

wgtmac · 2024-05-21T16:09:38Z

@jorisvandenbossche I think your concern makes sense. It would be a bug if parquet-cpp fails on an unknown logical type, and we need to fix that. I also have concerns about a new ColumnOrder and need to do some testing. Adding a new logical type should not break anything for legacy readers.

jornfranke · 2024-05-21T19:55:14Z

Apache Iceberg is adding geospatial support: https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI. It would be good if Apache Parquet can support geometry type natively.

No one has really worked on the geo integration into Iceberg for some time: apache/iceberg#2586

szehon-ho · 2024-05-21T21:14:39Z

No one has really worked on the geo integration into Iceberg for some time: apache/iceberg#2586

Yes, there is now a concrete proposal, apache/iceberg#10260, and the current plan is to bring it up in the next community sync.

cholmes · 2024-05-23T20:55:53Z

Thanks for doing this @wgtmac - it's awesome to see this proposal! I helped initiate GeoParquet, and hope we can fully support your effort.

@paleolimbot I'm happy to see the fast evolution of GeoParquet specs. I don't think the addition of geometry type aims to replace or deprecate something from GeoParquet. Instead, GeoParquet can simply ignore the new type as of now, or leverage the built-in bbox if beneficial.

That makes sense, but I think we're also happy to have GeoParquet replaced! As long as it can 'scale up' to meet all the crazy things that hard core geospatial people need, while also being accessible to everyone else. If Parquet had geospatial types from the start we wouldn't have started GeoParquet. We spent a lot of time and effort trying to get the right balance between making it easy to implement for those who don't care about the complexity of geospatial (edges, coordinate reference systems, epochs, winding), while also having the right options to handle it for those who do. My hope has been that the decisions we made there will make it easier to add geospatial support to any new format - like that a 'geo-ORC' could use the same fields and options that we added.

For additional (informative) attributes of the geometry type, if some of them are stable and make sense to store them natively into parquet column metadata, then perhaps we can work together to make it happen? I think the main goal of this addition is to enhance interoperability of geospatial data across systems and at the same time it takes little effort to convert to GeoParquet.

Sounds great! Happy to have GeoParquet be a place to 'try out' things. But I think ideally the surface area of 'GeoParquet' would be very minimal or not even exist, and that Parquet would just be the ideal format to store geospatial data in. And I think if we can align well between this proposal and GeoParquet that should be possible.

rdblue · 2025-02-03T18:35:21Z

src/main/thrift/parquet.thrift

@@ -386,6 +409,61 @@ struct BsonType {
 struct VariantType {
 }

+/** Coordinate reference system (CRS) encoding for Geometry and Geography logical types */
+enum CRSEncoding {


Nit: looks like this is not used.

rdblue · 2025-02-03T18:39:51Z

Geospatial.md

+latitude based on the WGS84 datum.
+
+Custom CRS can be specified by a string value. It is recommended to use the
+identifier of the CRS like [Spatial reference identifier][srid] and [PROJJSON][projjson].


Is PROJJSON considered an identifier? I think it may be more clear if the reference to PROJJSON here were moved and clarified elsewhere as a convention for how you might pass a CRS definition as PROJJSON in a table or file property.

I have checked that PROJJSON has identifiers: https://proj.org/en/stable/specifications/projjson.html#identifiers. However I'm not an expert so perhaps @jiayuasu @paleolimbot could help answer it?

Yes, PROJJSON optionally embeds an identifier in its JSON structure if the CRS has one (however, some of the data we are trying to convince large organizations/governments to distribute in Parquet don't have an authority/code and some require more than one authority/code to specify the CRS for the x-y separately from the z).

Because we've gone in quite a few circles on this one, my preference is just a string representation of the CRS with no further specification (i.e., writer/reader is responsible for serializing and deserializing the CRS, respectively).

If that isn't acceptable, I would add "writers should write the most compact form of CRS that fully describes the CRS. Identifiers should be used where possible and written in the form authority:code (e.g., OGC:CRS84 to specify longitude/latitude on the WGS84 ellipsoid)." That definition would result in 99.9% of geometry columns having a compact (but self-contained) CRS definition (authority:code), while also allowing producers to write whatever an upstream library provided them.

Barring either of those being acceptable, I would just make the projjson:some_schema_metadata_field language explicit.

Because we've gone in quite a few circles on this one, my preference is just a string representation of the CRS with no further specification (i.e., writer/reader is responsible for serializing and deserializing the CRS, respectively).

I think this is exactly the current goal with the recommendation (not enforcement) of an ID-based CRS.

rdblue · 2025-02-03T18:43:50Z

Thanks for all the work here, @wgtmac and everyone that has helped review and refine this! It looks ready to me and I'm glad to see that it is simpler. I agree with @paleolimbot that what this has now reduced down to looks great.

szehon-ho · 2025-02-03T18:48:48Z

Geospatial.md

+
+## Bounding Box
+
+A geometry has at least two coordinate dimensions: X and Y for 2D coordinates


Nit: 'geometry' may be misleading as it's both geometry and geography. How about 'A bounding box value has at least...'?

A related question: should we rename the GeometryStatistics to GeospatialStatistics to avoid confusion?

Geospatial.md

src/main/thrift/parquet.thrift

szehon-ho · 2025-02-03T18:55:08Z

Left a few nits, mostly looks good to me as well! Thanks @wgtmac, @rdblue and everyone!

jorisvandenbossche · 2025-02-04T22:58:52Z

Geospatial.md

+The default CRS `OGC:CRS84` means that the objects must be stored in longitude,
+latitude based on the WGS84 datum.


It is explicitly mentioned here that this case means lon/lat. But does that imply that for other CRS values the stored coordinates follow the axis order defined by the CRS? And not always as x/y or lon/lat as GeoParquet specifies?

(in either case, I think it would be good to be more specific about this)

wgtmac · 2025-02-05T08:34:29Z

Updated the PR to address various comments:

Use geospatial features and geospatial instances to avoid confusion for both geometry and geography types.
Rename GeometryStatistics to GeospatialStatistics.
Clarify that axis order is defined by the CRS so we're not enforcing lon/lat.
Add a section for CRS customization recommendation.

Let me know what you think. @rdblue @paleolimbot @jorisvandenbossche @szehon-ho @jiayuasu

jorisvandenbossche · 2025-02-05T08:44:42Z

Clarify that axis order is defined by the CRS so we're not enforcing lon/lat.

Has this been discussed the last months to change this? This is a huge break in compatibility with GeoParquet, and I am not sure that is going to be practical, both for a transitional phase (for example it makes it more difficult to write parquet files that both use this new geometry type and still are valid GeoParquet files as well) as for readers/writers long term (AFAIK most libraries/engines that would consume such data would have to swap the coordinates then, and additionally also always have to inspect the details of the CRS while consuming the WKB).

wgtmac · 2025-02-05T09:16:19Z

Has this been discussed the last months to change this? This is a huge break in compatibility with GeoParquet, and I am not sure that is going to be practical, both for a transitional phase (for example it makes it more difficult to write parquet files that both use this new geometry type and still are valid GeoParquet files as well) as for readers/writers long term (AFAIK most libraries/engines that would consume such data would have to swap the coordinates then, and additionally also always have to inspect the details of the CRS while consuming the WKB).

There was a discussion with @rdblue @jiayuasu @szehon-ho about not wanting people to assume that X is always longitude. For the default CRS OGC:CRS84, we have explicitly specified that the axis order is lon/lat. Is it possible to use projjson as the CRS for GeoParquet to handle the transition when the axis order must be overridden?

paleolimbot · 2025-02-05T15:44:33Z

There was a discussion ... about not wanting people to assume that X is always longitude.

I believe the outcome of that discussion was that Iceberg wanted to leave the interpretation of the CRS completely to the engine/reader/writer, and that being explicit about axis order was not consistent with that. Joris is 100% correct that this language forces the Parquet implementation to be CRS aware (when I believe the intent was the opposite!).

The very short backstory is that if you ask any PostGIS/GeoArrow/GeoPackage/GeoParquet/Sedona/GeoPandas/R-sf/DuckDB/Almost anything else to reproject to "EPSG:4326" (or any representation of it), you will get longitude, latitude, even though the CRS definition says otherwise. Feel free to send a note to any one of us to get a long backstory 🙂

I really think you want Parquet to stay out of this one (or adopt GeoParquet/GeoArrow/GeoPackage's language).

Is it possible to use projjson as the CRS for GeoParquet to handle the transition when the axis order must be overridden?

I think it is more likely that producers will just write GeoParquet + binary and ignore the native geometry type.

jiayuasu · 2025-02-05T16:09:10Z

@jorisvandenbossche

There was a discussion on the Iceberg PR, where @rdblue explicitly asked to remove the lon/lat enforcement because it overrides what the CRS says, which might lead to inconsistent behavior.

I don't have a strong opinion either way. But I do want to point out that maintaining the compatibility between GeoParquet and Parquet Geo is not a goal anyways. Otherwise we will not be able to make any progress. When Parquet Geo is in, an explicit transformation step on the old GeoParquet files is needed.

rdblue · 2025-02-05T19:00:11Z

Sorry to introduce this issue! I didn't realize that my rationale conflicted with what GeoParquet was already doing.

My initial concern was this language:

X must be longitude and Y must be latitude. This explicitly overrides the axis order defined in CRS

This specifically states that the order of dimensions in bounding box metadata must differ from the CRS in some cases. To me, that seems like a big implementation risk if people don't know to swap them. In addition, the names that we use for the bounding box values (xmin, ymin, xmax, ymax) are misleading when the WKB values use x=latitude, y=longitude but x and y in metadata must be x=longitude, y=latitude.

Also, please correct me if I'm wrong here. My current understanding is that the WKB data will correspond to the CRS even if the bounding box dimensions override it.

I do prefer a spec in which this ambiguity doesn't exist. I also pointed out that it is strange that we allow xmin > xmax and ymin > ymax, depending on whether the x or y dimension is longitude. That is what led to the latest change that points out that y may be longitude. And I'm glad we added the clarification so that we caught this problem!

Would it work to change this to use longitude and latitude specifically? We could have longitude, latitude, z, and m dimensions, which would probably be clear. The downside is that it may still be difficult for people to produce this if they need to understand the CRS in order to correctly map Y to longitude in some cases. I think we want to avoid needing everything to understand the CRS, but this may require a specific flag to capture whether X and Y in the data values need to be reversed in metadata.

I'm undecided, but I'm actually leaning toward the current language where X and Y are consistent with the data values and never flipped (assuming that my understanding is correct). It seems to me that this makes the choice opaque and has much less implementation risk.
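
To make the xmin > xmax remark above concrete, here is a minimal sketch of a longitude range test. It assumes that a box with xmin > xmax is meant to wrap across the antimeridian, so a value matches when it lies on either side of the wrap; the function name `longitude_in_bbox` is hypothetical.

```python
def longitude_in_bbox(x: float, xmin: float, xmax: float) -> bool:
    """Test whether x falls in [xmin, xmax], allowing wraparound.

    Assumption: xmin > xmax denotes a range that crosses the antimeridian,
    i.e. it covers [xmin, 180] and [-180, xmax].
    """
    if xmin <= xmax:
        # Ordinary range that does not wrap.
        return xmin <= x <= xmax
    # Wrapped range: match on either side of the antimeridian.
    return x >= xmin or x <= xmax


# A box spanning 170 degrees east to 170 degrees west.
print(longitude_in_bbox(179.5, 170.0, -170.0))
print(longitude_in_bbox(0.0, 170.0, -170.0))
```

A reader applying this test has to know which bbox dimension holds longitude, which is one reason the axis-order question matters for the statistics.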

paleolimbot · 2025-02-05T20:04:57Z

I think we are all on the same page here! The high level intent for everybody is that the Parquet and Iceberg types are able to interoperate with the rest of the ecosystem (to maximize adoption for spatial and non-spatial libraries alike) with minimal ambiguity, and that Parquet and Iceberg should take on a minimum of spatial understanding.

It is often confusing, but I cannot stress enough that the language we have in GeoParquet is the industry standard. Requiring something else is a sufficient barrier to interoperability that there is a risk the "official" type will not be supported (This is not to say that I will not try to help...I will! But I can only do so much in the face of such a significant departure from the norm.)

This specifically states that the order of dimensions in bounding box metadata must differ from the CRS in some cases

I think it was always the (perhaps unclear) intent that the axes of the bounds/statistics were identical to the WKB (on purpose, so that implementations do not need to parse the CRS to iterate over the WKB and calculate the bounds). Perhaps there is a way to make this more clear?

I'm actually leaning toward the current language where X and Y are consistent with the data values and never flipped

I think this either requires that GeoArrow/GeoPandas/PostGIS/Every other library I'm aware of has to either (1) rewrite their WKB before writing to Parquet (slow) or (2) permute the axes of the CRS (which invalidates the identifier and requires some logic that isn't baked into most libraries today).

It seems to me that this makes the choice opaque and has much less implementation risk.

Some options that I think would have less implementation risk:

Include an optional permutation alongside the CRS (e.g., [0, 1] to indicate authority compliance), but assume GeoParquet/GeoArrow/GeoPackage/Industry standard otherwise (credit to Martin who pointed out this option in a thread on this PR).
Use the GeoParquet/GeoArrow/GeoPackage/Industry standard language and see if there are any issues (I'm not aware of any in several years of experience with GeoArrow/GeoParquet)
Make no assertions about axis order (i.e., CRS interpretation is purely up to the reader/writer). Because the industry standard is ubiquitous, I think this will cause fewer problems than being explicit about the opposite.

jorisvandenbossche · 2025-02-05T20:56:20Z

I follow what Dewey has already answered, but just trying to additionally clarify a few points from Ryan's post:

Also, please correct me if I'm wrong here. My current understanding is that the WKB data will correspond to the CRS even if the bounding box dimensions override it.

@rdblue if I understand you correctly, then yes I think that is not correct. WKB data is defined to be x/y, and almost any producer of WKB values or file format using WKB under the hood (including GeoParquet) will use the mapping of x=lon / y=lat.
So for example when using EPSG:4326 (defined with an axis order of lat/lon), the WKB will not correspond to the CRS.

This specifically states that the order of dimensions in bounding box metadata must differ from the CRS in some cases. To me, that seems like a big implementation risk if people don't know to swap them. In addition, the names that we use for the bounding box values (xmin, ymin, xmax, ymax) are misleading when the WKB values use x=latitude, y=longitude but x and y in metadata must be x=longitude, y=latitude.

So with my above answer, your last sentence here is also not correct (I am considering GeoParquet here for a moment). We define both the bbox and the WKB values to use the convention of x=lon / y=lat, so that the bbox and the WKB data are always consistent with each other.
This actually ensures that you can read and filter data based on the bbox statistics without having to inspect the CRS of the column. You mention "I think we want to avoid needing everything to understand the CRS", but so that is exactly what GeoParquet tries to achieve by saying that x=lon and y=lat. Because if you are not sure if the bbox and WKB data is lon/lat or lat/lon, then you always have to first inspect the CRS before you know how to specify the bbox filter and how to parse the WKB values.

This is clearly all confusing and easy to misunderstand / misinterpret each other, which is IMO a good reason to make this more explicit in the spec. So I am personally not a fan of Dewey's last suggestion of leaving this vague and then letting implementations choose how to handle this (which will then in practice be how GeoParquet does it, I would guess, but which is the opposite of what you could read in the current version of the spec)

rdblue · 2025-02-05T21:09:20Z

Thanks for the additional background, @paleolimbot and @jorisvandenbossche!

if I understand you correctly, then yes I think that is not correct. WKB data is defined to be x/y, and almost any producer of WKB values or file format using WKB under the hood (including GeoParquet) will use the mapping of x=lon / y=lat.

That's great! I think that the metadata and data should match. If it is already the convention of libraries to override the CRS definition and write both WKB and bboxes with x=longitude and y=latitude, then we should just make that clear.

What about saying that the x dimension always represents longitude and y always represents latitude, for both WKB values and for the bbox structs in metadata?

Use the GeoParquet/GeoArrow/GeoPackage/Industry standard language and see if there are any issues (I'm not aware of any in several years of experience with GeoArrow/GeoParquet)

What is this language? Does it state basically what I said above? From what @jorisvandenbossche says, it sounds like that is the case.

paleolimbot · 2025-02-05T21:50:30Z

What is this language?

The axis order of the coordinates in WKB stored in a GeoParquet follows the de facto standard for axis order in WKB and is therefore always (x, y) where x is easting or longitude and y is northing or latitude. This ordering explicitly overrides the axis order as specified in the CRS.

References: geoparquet, geopackage, geoarrow (I can dig up more if there is interest!)
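
Under that convention, a writer can compute bounds by reading coordinates in stored (x, y) order without ever parsing the CRS. The sketch below shows this for 2D points only, assuming ISO WKB; `point_xy` and `bbox` are hypothetical helper names, and real data would also need the other geometry types.

```python
import struct
from typing import Iterable, Tuple


def point_xy(wkb: bytes) -> Tuple[float, float]:
    """Read (x, y) from a 2D WKB point in stored order, ignoring the CRS."""
    byte_order = "<" if wkb[0] == 1 else ">"
    # Skip the 1-byte order flag and the 4-byte type code.
    x, y = struct.unpack_from(byte_order + "dd", wkb, 5)
    return x, y


def bbox(wkbs: Iterable[bytes]) -> Tuple[float, float, float, float]:
    """Return (xmin, ymin, xmax, ymax) over a non-empty set of WKB points."""
    xs, ys = zip(*(point_xy(w) for w in wkbs))
    return min(xs), min(ys), max(xs), max(ys)


# Two points written as x = longitude, y = latitude.
points = [
    struct.pack("<BIdd", 1, 1, -73.99, 40.73),
    struct.pack("<BIdd", 1, 1, 2.35, 48.86),
]
print(bbox(points))
```

Because the bounds use the same axis order as the WKB, a reader can filter on them in the same way, whatever axis order the CRS definition declares.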

rdblue · 2025-02-05T22:01:12Z

Sounds good to me! We should make sure that the statement clearly applies to both bounding box metadata and WKB values in Parquet.

wgtmac · 2025-02-06T02:34:41Z

I have added a section to explicitly define the axis order used in WKB and bbox. Let me know what you think. @paleolimbot @jorisvandenbossche

paleolimbot

Thank you for bearing with us on all of this!

wgtmac · 2025-02-10T02:23:56Z

The vote has passed: https://lists.apache.org/thread/tgkomrqsynzd5tm3385wm4tfk933lx6w. Let me merge this PR. Thanks everyone!

wgtmac force-pushed the geo branch from 4d36df9 to ad29afd May 10, 2024 15:01

szehon-ho reviewed May 11, 2024

wgtmac marked this pull request as ready for review May 11, 2024 16:13

wgtmac changed the title ~~WIP: Add geometry logical type~~ PARQUET-2471: Add geometry logical type May 11, 2024

emkornfield reviewed May 14, 2024

src/main/thrift/parquet.thrift

emkornfield reviewed May 14, 2024

src/main/thrift/parquet.thrift

emkornfield reviewed May 14, 2024

src/main/thrift/parquet.thrift

paleolimbot reviewed May 15, 2024

paleolimbot mentioned this pull request May 21, 2024

Thoughts about a first-class GEOMETRY data type in Parquet? opengeospatial/geoparquet#222

Open

jorisvandenbossche reviewed May 21, 2024

paleolimbot mentioned this pull request May 21, 2024

[Parquet][C++] Behaviour of unknown logical type when encountered in Parquet reader apache/arrow#41764

Open

wgtmac force-pushed the geo branch from 745ecb1 to f71c010 on May 25, 2024 15:22

use single string field for crs

082e31d

rdblue reviewed Feb 3, 2025

rdblue approved these changes Feb 3, 2025

szehon-ho reviewed Feb 3, 2025

remove unused CRSEncoding

df4f18e

jorisvandenbossche reviewed Feb 4, 2025

use geosptial and add CRS Customization

9c9c916

define axis order

3d91380

jiayuasu approved these changes Feb 6, 2025

paleolimbot approved these changes Feb 6, 2025

rdblue approved these changes Feb 6, 2025

wgtmac merged commit 94b9d63 into apache:master Feb 10, 2025
4 checks passed

This was referenced Feb 13, 2025

[Parquet][C++] Implement Geography and Geometry types in the C++ Parquet implementation apache/arrow#45522

Open

GH-45522: [Parquet][C++] Proof-of-concept Parquet GEOMETRY and GEOGRAPHY logical type implementations apache/arrow#45459

Open


		## Bounding Box

		A geometry has at least two coordinate dimensions: X and Y for 2D coordinates

		The default CRS `OGC:CRS84` means that the objects must be stored in longitude,
		latitude based on the WGS84 datum.
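
As a rough illustration of how a bounding box follows from the same x/y convention, the sketch below computes the extent of a list of (longitude, latitude) pairs. The function name `bounding_box` and the keys `xmin`, `xmax`, `ymin`, `ymax` are assumptions chosen for this example, not a quote of the Thrift definition. It deliberately ignores optional Z and M dimensions and any boxes that wrap around the antimeridian.

```python
def bounding_box(points):
    """Return the x/y extent of an iterable of (x, y) = (lon, lat) pairs."""
    points = list(points)
    if not points:
        raise ValueError("cannot compute a bounding box of zero points")
    xs = [x for x, _ in points]  # x is longitude (or easting)
    ys = [y for _, y in points]  # y is latitude (or northing)
    return {"xmin": min(xs), "xmax": max(xs), "ymin": min(ys), "ymax": max(ys)}


# Two points in OGC:CRS84 order: longitude first, then latitude.
box = bounding_box([(-122.4, 37.8), (-73.9, 40.7)])
```

Keeping x as longitude in both the WKB values and the box means a reader can compare the two without consulting the CRS.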
