GH-45522: [Parquet][C++] Proof-of-concept Parquet GEOMETRY and GEOGRAPHY logical type implementations #45459

Open
wants to merge 107 commits into
base: main

@paleolimbot commented Feb 7, 2025
Member

Rationale for this change

The GEOMETRY and GEOGRAPHY logical types are being proposed as an addition to the Parquet format.

What changes are included in this PR?

This is a continuation of @Kontinuation's initial PR (#43977) implementing apache/parquet-format#240, which included:

  • Added geometry logical types (printing, serialization, deserialization)
  • Added geometry column statistics (serialization, deserialization, writing)
  • Support reading/writing parquet files containing geometry columns

Changes after this were:

  • Rebasing on the latest apache/arrow
  • Split geography/geometry types
  • Synchronize the final parameter names (e.g., no more "encoding", "edges" -> "algorithm")
  • Simplify geometry_util_internal.h and use Status instead of exceptions according to suggestions from the previous PR

In order to write test files, I also:

  • Implemented conversion to/from the GeoArrow extension type
  • Wired the requisite options to pyarrow so that the files could be written from Python

Those last two are probably a bit much for this particular PR, and I'm happy to move them.

Some things that aren't in this PR (but should be in this one or a future PR):

  • Update the bounding box logic to implement the "wraparound" bounding boxes where xmin > xmax (and generally make sure the stats for geography are written for trivial cases)
  • Test more invalid WKB cases
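
To make the wraparound case concrete, here is a minimal sketch, assuming the convention from the format proposal that an X interval with xmin > xmax crosses the antimeridian and matches any x with x >= xmin or x <= xmax. The names `XInterval`, `IsWraparound`, and `Contains` are hypothetical and are not part of this PR.

```cpp
// Hypothetical helper, not from this PR: an X (longitude) interval that may
// wrap around the antimeridian.
struct XInterval {
  double xmin;
  double xmax;

  // Assumed convention: xmin > xmax marks a wraparound interval.
  bool IsWraparound() const { return xmin > xmax; }

  // True when x falls inside the interval, including the wraparound case.
  bool Contains(double x) const {
    if (IsWraparound()) {
      // The interval covers everything from xmin eastward plus everything
      // up to xmax on the other side of the antimeridian.
      return x >= xmin || x <= xmax;
    }
    return x >= xmin && x <= xmax;
  }
};
```

With this convention, an interval of xmin = 170 and xmax = -170 contains 175 and -175 but not 0.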

Are these changes tested?

Yes!

Are there any user-facing changes?

Yes!

Example from the included Python bindings:

import pyarrow as pa
from pyarrow import parquet
import geoarrow.pyarrow as ga  # For registering the extension type
import geopandas

path = "/Users/dewey/gh/parquet-testing/data/geospatial/example-crs_vermont-4326.parquet"
file = parquet.ParquetFile(path, arrow_extensions_enabled=True)
file.schema
#> <pyarrow._parquet.ParquetSchema object at 0x1136ee600>
#> required group field_id=-1 schema {
#>   optional binary field_id=-1 geometry (Geometry(crs=));
#> }
file.metadata.metadata
#> (eventually should contain any CRSes that were dumped there)
geometry_index = len(file.schema.names) - 1
file.metadata.row_group(0).column(geometry_index).geospatial_statistics
#> <pyarrow._parquet.GeospatialStatistics object at 0x117b07f40>
#>   geospatial_types: [3]
#>   xmin: -73.4296726142165
#>   xmax: -71.50351111518535
#>   ymin: 42.72708222103286
#>   ymax: 45.00831248634144
#>   zmin: None
#>   zmax: None
#>   mmin: None
#>   mmax: None

# Type and CRS should propagate through
file.schema_arrow.field("geometry").type
#> WkbType(geoarrow.wkb <OGC:CRS84>)

# GeoPandas should be able to take the result of this and ensure
# the CRS is not lost (and that the geometry column is picked up)
table = file.read()
df = geopandas.GeoDataFrame.from_arrow(table)
df.geometry.crs.name
#> 'WGS 84 (CRS84)'
df.geometry.head(5)
#> 0    POLYGON ((-72.45707 42.72708, -73.28203 42.743...
#> Name: geometry, dtype: geometry
parquet.write_table(table, "foofy.parquet", write_geospatial_logical_types=True)
parquet.read_table("foofy.parquet", arrow_extensions_enabled=True).schema
#> geometry: extension<geoarrow.wkb<WkbType>>

@paleolimbot changed the title from "(Updated) Proof-of-concept Parquet GEOMETRY and GEOGRAPHY logical type implementations" to "[GH-45522]: [Parquet][C++] Proof-of-concept Parquet GEOMETRY and GEOGRAPHY logical type implementations" on Feb 13, 2025
@paleolimbot marked this pull request as ready for review February 13, 2025 05:37
@paleolimbot requested a review from wgtmac as a code owner February 13, 2025 05:37
@paleolimbot
Member Author

@wgtmac This is ready for a first look! In the description, I've noted a few things about scope that could be dropped from this PR. I'm happy to do this in any order you'd like. Let me know!

@paleolimbot changed the title from "[GH-45522]: [Parquet][C++] Proof-of-concept Parquet GEOMETRY and GEOGRAPHY logical type implementations" to "GH-45522: [Parquet][C++] Proof-of-concept Parquet GEOMETRY and GEOGRAPHY logical type implementations" on Feb 13, 2025

⚠️ GitHub issue #45522 has been automatically assigned in GitHub to PR creator.

}
};

class WKBBuffer {
Contributor

please try to document classes.

Member Author

This is a great point! I'll circle back to this one this evening.

return *data_++;
}

::arrow::Result<uint32_t> ReadUInt32(bool swap) {
Contributor

Is there a reason all class implementations are in the header (a holdover from templating)?

Member Author

I'm happy to pull the implementations out into a .cc file, although I wonder if this is slightly easier to drop into the 3 or 4 other C++ Parquet implementations if kept together. I also wonder whether the compiler benefits from seeing the implementations (but I'm no expert here!).

@github-actions bot added the "awaiting changes" label and removed the "awaiting committer review" label on Feb 13, 2025
}

uint32_t value = ::arrow::util::SafeLoadAs<uint32_t>(data_);
data_ += sizeof(uint32_t);
Contributor

The data_ and size_ updates seem to be sprinkled around in a lot of different places. I wonder if it would pay to make generic methods like:

    template <typename T>
    T UnsafeConsume() {
      T t = ::arrow::util::SafeLoadAs<T>(data_);
      data_ += sizeof(T);
      size_ -= sizeof(T);
      return t;
    }

    template <typename T>
    ::arrow::Result<T> Consume() {
      if (sizeof(T) > size_) {
        // ... return error
      }
      return UnsafeConsume<T>();
    }

Member Author

I added versions of these! (I went with ReadXXX but I'm not particularly attached 🙂 )
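
For readers following along without the diff, the checked/unchecked split discussed here can be sketched in a self-contained way. This is only an illustration under stated assumptions: `ByteReader`, `Read`, and `UnsafeRead` are hypothetical names, `std::optional` stands in for `::arrow::Result` so the sketch has no Arrow dependency, and it is not the code that was added to the PR.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>

// Hypothetical stand-in for the PR's WKBBuffer.
class ByteReader {
 public:
  ByteReader(const uint8_t* data, size_t size) : data_(data), size_(size) {}

  // Bounds-checked read: returns std::nullopt when fewer than sizeof(T)
  // bytes remain, otherwise consumes sizeof(T) bytes.
  template <typename T>
  std::optional<T> Read() {
    if (sizeof(T) > size_) {
      return std::nullopt;
    }
    return UnsafeRead<T>();
  }

  // Number of bytes that have not been consumed yet.
  size_t size() const { return size_; }

 private:
  // Unchecked read: the single place where data_ and size_ are advanced.
  template <typename T>
  T UnsafeRead() {
    T value;
    std::memcpy(&value, data_, sizeof(T));  // safe for unaligned input
    data_ += sizeof(T);
    size_ -= sizeof(T);
    return value;
  }

  const uint8_t* data_;
  size_t size_;
};
```

Keeping the pointer and size bookkeeping in one private method means a caller cannot advance one without the other.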

};

static ::arrow::Result<geometry_type> FromWKB(uint32_t wkb_geometry_type) {
switch (wkb_geometry_type % 1000) {
Contributor

Can 1000 be made a mnemonic constant? (Is there a pointer to the spec on why 1000?)

Member Author

It's because ISO WKB defines geometry type codes such that / 1000 and % 1000 can be used to separate the dimensions and geometry type components, respectively. I moved the / 1000 and % 1000 next to each other and added a comment because I wasn't sure what exactly to name the constant, but I'm open to suggestions!
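
As a small illustration of that split, here is a sketch assuming the ISO WKB numbering (base type 1 = Point, 2 = LineString, 3 = Polygon, and so on, plus 1000 for Z, 2000 for M, or 3000 for ZM). The names `kWkbDimensionMultiplier`, `WkbTypeParts`, and `SplitWkbType` are hypothetical suggestions, not identifiers from this PR.

```cpp
#include <cstdint>

// Hypothetical name for the constant discussed above.
constexpr uint32_t kWkbDimensionMultiplier = 1000;

struct WkbTypeParts {
  uint32_t geometry_type;  // 1 = Point, 2 = LineString, 3 = Polygon, ...
  uint32_t dimensions;     // 0 = XY, 1 = XYZ, 2 = XYM, 3 = XYZM
};

// Separates an ISO WKB type code into its geometry type and dimensions.
constexpr WkbTypeParts SplitWkbType(uint32_t wkb_geometry_type) {
  return {wkb_geometry_type % kWkbDimensionMultiplier,
          wkb_geometry_type / kWkbDimensionMultiplier};
}
```

Under this scheme, the `geospatial_types: [3]` in the Python example from the description is a polygon with XY coordinates, while 1001 would be a Point with Z.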

}
};

struct GeometryType {
Contributor

Not sure what is standard in geo naming, but could this be called Geometry and the nested enum be called type?

Contributor

or maybe not nest this in a struct and just have the static methods here as top level functions? then GeometryType could be the enum?

Member Author

This was designed to mimic how enums are defined in types.h (e.g., TimeUnit::unit), but I agree that a normal enum is way better. I removed the functions that weren't essential and moved FromWKB into the WKB bounder where it's more clear what it's doing!

}
}

template <typename Coord, typename Func>
Contributor

nit: please document non trivial functions. A better name for Func might be Consume or CoordConsumer consumer

Member Author

I saw Visit in an Arrow header so I changed it to that (but happy to use something else if it's more clear!)

I will circle back to documentation this evening (it's a great point that there isn't any 😬 )


void UpdateXYZ(std::array<double, 3> coord) { UpdateInternal(coord); }

void UpdateXYM(std::array<double, 3> coord) {
Contributor

It might be worth passing std::array<double, 3> by reference (or, more generally, most of them by reference). I guess without a benchmark it might be hard to tell.

Member Author

I moved them all to be by reference here (I would be surprised if a compiler didn't inline these calls either way but I'm also not an expert!)


::arrow::Status ReadSequence(WKBBuffer* src, Dimensions::dimensions dimensions,
uint32_t n_coords, bool swap) {
using XY = std::array<double, 2>;
Contributor

defining these within a class or struct and commenting them, then using them in other UpdateXYZ methods might make some of the code more readable.

Member Author

I moved these into BoundingBox::XY[Z[M]]!
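
A rough sketch of what nested coordinate aliases and by-reference updates could look like follows. It is an assumption-laden illustration rather than the PR's actual `BoundingBox`: the member layout, `kInf`, and `UpdateDim` are hypothetical, and NaN handling is left out.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <limits>

// Hypothetical sketch, not the PR's BoundingBox.
struct BoundingBox {
  using XY = std::array<double, 2>;
  using XYZ = std::array<double, 3>;
  using XYM = std::array<double, 3>;  // same layout as XYZ, different meaning
  using XYZM = std::array<double, 4>;

  static constexpr double kInf = std::numeric_limits<double>::infinity();

  // Stored as {x, y, z, m}. An untouched dimension keeps min > max.
  std::array<double, 4> min{{kInf, kInf, kInf, kInf}};
  std::array<double, 4> max{{-kInf, -kInf, -kInf, -kInf}};

  void UpdateXY(const XY& c) {
    UpdateDim(0, c[0]);
    UpdateDim(1, c[1]);
  }

  void UpdateXYZ(const XYZ& c) {
    UpdateDim(0, c[0]);
    UpdateDim(1, c[1]);
    UpdateDim(2, c[2]);
  }

  // The third value of an XYM coordinate is M, so it updates slot 3, not 2.
  void UpdateXYM(const XYM& c) {
    UpdateDim(0, c[0]);
    UpdateDim(1, c[1]);
    UpdateDim(3, c[2]);
  }

  void UpdateXYZM(const XYZM& c) {
    for (size_t i = 0; i < 4; ++i) UpdateDim(i, c[i]);
  }

 private:
  // Widens the range of dimension i to include value.
  void UpdateDim(size_t i, double value) {
    min[i] = std::min(min[i], value);
    max[i] = std::max(max[i], value);
  }
};
```

Because XYZ and XYM share the same underlying array type, the method name rather than the type carries the meaning of the third value, which is one reason commented aliases help readability here.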

WKBGeometryBounder() = default;
WKBGeometryBounder(const WKBGeometryBounder&) = default;

::arrow::Status ReadGeometry(WKBBuffer* src, bool record_wkb_type = true) {
Contributor

From an API perspective, is it intended to let callers change record_wkb_type? If not, consider making ReadGeometry take no such parameter and moving this implementation to a private helper.

Member Author

I moved this to be internal!

@github-actions bot added the "awaiting change review" and "awaiting changes" labels and removed the "awaiting changes" and "awaiting change review" labels on Feb 13, 2025