Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated package verision does not handle units datatype anymore rapi_prepare: Unknown column type for prepare: INVALID #590

Open
wlangera opened this issue Nov 13, 2024 · 8 comments
Labels
help wanted ❤️ we'd love your help!

Comments

@wlangera
Copy link

Hi

I recently updated R and all my packages.
Now I have some code that does not work anymore.
It looks like duckdb cannot the handle units datatype anymore?
I do not know what my previous version of this package was. Now I have v1.1.2.
There are some related issues (solved and unsolved).

Reproducible example:

# Load required libraries
library(arrow)
library(duckdb)
library(sf)
library(dplyr)

# Define a CRS (e.g., EPSG:4326 for WGS84)
crs_epsg <- 4326

# Create a sample spatial dataset `data_sf` with points and set CRS
data_sf <- st_as_sf(data.frame(
    id = 1:5,
    value = runif(5, 10, 20),
    geometry = st_sfc(
      st_point(c(1, 1)),
      st_point(c(2, 2)),
      st_point(c(3, 3)),
      st_point(c(4, 4)),
      st_point(c(5, 5))
    )
  )) %>%
  st_set_crs(crs_epsg)  # Set CRS for `data_sf`

# Create a sample `cell_areas` dataset with polygons and set CRS
cell_areas <- st_as_sf(data.frame(
    id = 1:5,
    geometry = st_sfc(
      st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0)))),
      st_polygon(list(rbind(c(1, 1), c(2, 1), c(2, 2), c(1, 2), c(1, 1)))),
      st_polygon(list(rbind(c(2, 2), c(3, 2), c(3, 3), c(2, 3), c(2, 2)))),
      st_polygon(list(rbind(c(3, 3), c(4, 3), c(4, 4), c(3, 4), c(3, 3)))),
      st_polygon(list(rbind(c(4, 4), c(5, 4), c(5, 5), c(4, 5), c(4, 4))))
    )
  )) %>%
  st_set_crs(crs_epsg) %>%                # Set CRS for `cell_areas`
  mutate(area = st_area(geometry)) %>%    # Calculate area for each polygon
  st_drop_geometry()

# Define the temporary Parquet file
temparrow <- tempfile(fileext = ".parquet")

# Process and write data to Parquet
joined_data_sf <- data_sf %>%
  st_drop_geometry() %>%                  # Drop geometry from `data_sf`
  inner_join(cell_areas, by = "id") %>%   # Join with `cell_areas` by `id`
  arrow::write_dataset(path = temparrow)  # Write to Parquet

# Read the Parquet dataset and convert to DuckDB format
arrow_dataset <- arrow::open_dataset(temparrow)

arrow::to_duckdb(arrow_dataset)
#> Error in `db_query_fields.DBIConnection()`:
#> ! Can't query fields.
#> ℹ Using SQL: SELECT * FROM (FROM arrow_001) q01 WHERE (0 = 1)
#> Caused by error in `dbSendQuery()`:
#> ! rapi_prepare: Unknown column type for prepare: INVALID

Created on 2024-11-13 with reprex v2.1.1

If I save the area as a numeric variable. The code works:

# Load required libraries
library(arrow)
library(duckdb)
library(sf)
library(dplyr)

# Define a CRS (e.g., EPSG:4326 for WGS84)
crs_epsg <- 4326

# Create a sample spatial dataset `data_sf` with points and set CRS
data_sf <- st_as_sf(data.frame(
    id = 1:5,
    value = runif(5, 10, 20),
    geometry = st_sfc(
      st_point(c(1, 1)),
      st_point(c(2, 2)),
      st_point(c(3, 3)),
      st_point(c(4, 4)),
      st_point(c(5, 5))
    )
  )) %>%
  st_set_crs(crs_epsg)  # Set CRS for `data_sf`

# Create a sample `cell_areas` dataset with polygons and set CRS
cell_areas <- st_as_sf(data.frame(
    id = 1:5,
    geometry = st_sfc(
      st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0)))),
      st_polygon(list(rbind(c(1, 1), c(2, 1), c(2, 2), c(1, 2), c(1, 1)))),
      st_polygon(list(rbind(c(2, 2), c(3, 2), c(3, 3), c(2, 3), c(2, 2)))),
      st_polygon(list(rbind(c(3, 3), c(4, 3), c(4, 4), c(3, 4), c(3, 3)))),
      st_polygon(list(rbind(c(4, 4), c(5, 4), c(5, 5), c(4, 5), c(4, 4))))
    )
  )) %>%
  st_set_crs(crs_epsg) %>%                            # Set CRS
  mutate(area = as.numeric(st_area(geometry))) %>%    # Calculate area
  st_drop_geometry()

# Define the temporary Parquet file
temparrow <- tempfile(fileext = ".parquet")

# Process and write data to Parquet
joined_data_sf <- data_sf %>%
  st_drop_geometry() %>%                  # Drop geometry from `data_sf`
  inner_join(cell_areas, by = "id") %>%   # Join with `cell_areas` by `id`
  arrow::write_dataset(path = temparrow)  # Write to Parquet

# Read the Parquet dataset and convert to DuckDB format
arrow_dataset <- arrow::open_dataset(temparrow)

arrow::to_duckdb(arrow_dataset)
#> # Source:   table<arrow_001> [5 x 3]
#> # Database: DuckDB v1.1.2 [ward_langeraert@Windows 10 x64:R 4.4.2/:memory:]
#>      id value         area
#>   <int> <dbl>        <dbl>
#> 1     1  10.8 12364036567.
#> 2     2  17.7 12360269788.
#> 3     3  17.8 12352737380.
#> 4     4  19.4 12341441640.
#> 5     5  14.1 12326386013.

Created on 2024-11-13 with reprex v2.1.1

@krlmlr krlmlr added this to the 1.1.3-1 milestone Dec 4, 2024
@krlmlr
Copy link
Collaborator

krlmlr commented Dec 4, 2024

Thanks. Do you still have your old installation of R around? It would help to understand which version of duckdb and the other packages was installed at that time.

We see a different error (but still an error) with duckdb 1.1.0.

@krlmlr
Copy link
Collaborator

krlmlr commented Dec 4, 2024

It works with duckdb 1.0.0-2 or earlier, can you confirm?

It starts to break in 0b8e2b6 (upstream: duckdb/duckdb#13446, introduction of Arrow extension types), and the behavior changes again in 3c99573.

To me, it looks like, in the past, it worked "by accident". We would need to think through how to support these extension types properly.

@wlangera
Copy link
Author

wlangera commented Dec 4, 2024

I just got a new laptop so I cannot exactly say what version was the last exactly.
Based on the renv.lock file history. I can only say it probably worked in version 0.7.1-1 with R v4.3.2.

But I do not have these versions anymore.

@krlmlr
Copy link
Collaborator

krlmlr commented Dec 4, 2024

Thanks. By now, we know that 1.0.0-2 worked. Can you confirm by installing that version on your new laptop and checking again?

@krlmlr krlmlr removed this from the 1.1.3-1 milestone Dec 4, 2024
@wlangera
Copy link
Author

wlangera commented Dec 5, 2024

Yes, it works with v1.0.0-2.

packageurl <- "http://cran.r-project.org/src/contrib/Archive/duckdb/duckdb_1.0.0-2.tar.gz"
install.packages(packageurl, repos=NULL, type="source")
# Load required libraries
library(arrow)
library(duckdb)
library(sf)
library(dplyr)

# Define a CRS (e.g., EPSG:4326 for WGS84)
crs_epsg <- 4326

# Create a sample spatial dataset `data_sf` with points and set CRS
data_sf <- st_as_sf(data.frame(
    id = 1:5,
    value = runif(5, 10, 20),
    geometry = st_sfc(
      st_point(c(1, 1)),
      st_point(c(2, 2)),
      st_point(c(3, 3)),
      st_point(c(4, 4)),
      st_point(c(5, 5))
    )
  )) %>%
  st_set_crs(crs_epsg)  # Set CRS for `data_sf`

# Create a sample `cell_areas` dataset with polygons and set CRS
cell_areas <- st_as_sf(data.frame(
    id = 1:5,
    geometry = st_sfc(
      st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0)))),
      st_polygon(list(rbind(c(1, 1), c(2, 1), c(2, 2), c(1, 2), c(1, 1)))),
      st_polygon(list(rbind(c(2, 2), c(3, 2), c(3, 3), c(2, 3), c(2, 2)))),
      st_polygon(list(rbind(c(3, 3), c(4, 3), c(4, 4), c(3, 4), c(3, 3)))),
      st_polygon(list(rbind(c(4, 4), c(5, 4), c(5, 5), c(4, 5), c(4, 4))))
    )
  )) %>%
  st_set_crs(crs_epsg) %>%                # Set CRS for `cell_areas`
  mutate(area = st_area(geometry)) %>%    # Calculate area for each polygon
  st_drop_geometry()

# Define the temporary Parquet file
temparrow <- tempfile(fileext = ".parquet")

# Process and write data to Parquet
joined_data_sf <- data_sf %>%
  st_drop_geometry() %>%                  # Drop geometry from `data_sf`
  inner_join(cell_areas, by = "id") %>%   # Join with `cell_areas` by `id`
  arrow::write_dataset(path = temparrow)  # Write to Parquet

# Read the Parquet dataset and convert to DuckDB format
arrow_dataset <- arrow::open_dataset(temparrow)

arrow::to_duckdb(arrow_dataset)
#> # Source:   table<arrow_001> [5 x 3]
#> # Database: DuckDB v1.0.0 [ward_langeraert@Windows 10 x64:R 4.4.2/:memory:]
#>      id value         area
#>   <int> <dbl>        <dbl>
#> 1     1  11.2 12364036567.
#> 2     2  16.0 12360269788.
#> 3     3  14.7 12352737380.
#> 4     4  10.2 12341441640.
#> 5     5  19.1 12326386013.

Created on 2024-12-05 with reprex v2.1.1

@krlmlr
Copy link
Collaborator

krlmlr commented Dec 5, 2024

Thanks for confirming. Please stay on that version for now, not sure when and how we can make that part work again.

@krlmlr krlmlr added the help wanted ❤️ we'd love your help! label Dec 5, 2024
@toppyy
Copy link
Contributor

toppyy commented Jan 6, 2025

I started looking into this. The error I'm currently seeing with example above is a bit different as krlml noted:

# Load required libraries
library(arrow)
#> 
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#> 
#>     timestamp
library(duckdb)
#> Loading required package: DBI
library(sf)
#> Linking to GEOS 3.10.2, GDAL 3.4.1, PROJ 8.2.1; sf_use_s2() is TRUE
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# Define a CRS (e.g., EPSG:4326 for WGS84)
crs_epsg <- 4326

# Create a sample spatial dataset `data_sf` with points and set CRS
data_sf <- st_as_sf(data.frame(
    id = 1:5,
    value = runif(5, 10, 20),
    geometry = st_sfc(
      st_point(c(1, 1)),
      st_point(c(2, 2)),
      st_point(c(3, 3)),
      st_point(c(4, 4)),
      st_point(c(5, 5))
    )
  )) %>%
  st_set_crs(crs_epsg)  # Set CRS for `data_sf`

# Create a sample `cell_areas` dataset with polygons and set CRS
cell_areas <- st_as_sf(data.frame(
    id = 1:5,
    geometry = st_sfc(
      st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0)))),
      st_polygon(list(rbind(c(1, 1), c(2, 1), c(2, 2), c(1, 2), c(1, 1)))),
      st_polygon(list(rbind(c(2, 2), c(3, 2), c(3, 3), c(2, 3), c(2, 2)))),
      st_polygon(list(rbind(c(3, 3), c(4, 3), c(4, 4), c(3, 4), c(3, 3)))),
      st_polygon(list(rbind(c(4, 4), c(5, 4), c(5, 5), c(4, 5), c(4, 4))))
    )
  )) %>%
  st_set_crs(crs_epsg) %>%                # Set CRS for `cell_areas`
  mutate(area = st_area(geometry)) %>%    # Calculate area for each polygon
  st_drop_geometry()

# Define the temporary Parquet file
temparrow <- tempfile(fileext = ".parquet")

# Process and write data to Parquet
joined_data_sf <- data_sf %>%
  st_drop_geometry() %>%                  # Drop geometry from `data_sf`
  inner_join(cell_areas, by = "id") %>%   # Join with `cell_areas` by `id`
  arrow::write_dataset(path = temparrow)  # Write to Parquet

# Read the Parquet dataset and convert to DuckDB format
arrow_dataset <- arrow::open_dataset(temparrow)

arrow::to_duckdb(arrow_dataset)
#> Error in `db_query_fields.DBIConnection()`:
#> ! Can't query fields.
#> ℹ Using SQL: SELECT * FROM (FROM arrow_001) q01 WHERE (0 = 1)
#> Caused by error in `dbSendQuery()`:
#> ! rapi_prepare: Failed to prepare query SELECT *
#> FROM (FROM arrow_001) q01
#> WHERE (0 = 1)
#> Error: Serialization Error: Failed to parse JSON string: A
#> 3
#> 263169
#> 197888
#> 5
#> UTF-8
#> 782
#> 0
#> 1026
#> 1
#> 262153
#> 5
#> units
#> 787
#> 2
#> 16
#> 2
#> 262153
#> 1
#> m
#> 262153
#> 1
#> m
#> 16
#> 0
#> 1026
#> 1
#> 262153
#> 5
#> names
#> 16
#> 2
#> 262153
#> 9
#> numerator
#> 262153
#> 11
#> denominator
#> 1026
#> 1
#> 262153
#> 5
#> class
#> 16
#> 1
#> 262153
#> 14
#> symbolic_units
#> 254
#> 1026
#> 1023
#> 16
#> 1
#> 262153
#> 5
#> units
#> 254

Created on 2025-01-06 with reprex v2.1.1

It seems that the error occurs in duckdb when the constructor of ArrowSchemaMetadata tries to parse the metadata associated with the column area as JSON with StringUtil::ParseJSONMap().

The data logged is the metadata of area which apparently has the type info for arrow to interpret it as units.

arrow_dataset$schema$field(2)$type$extension_metadata_utf8()
#> A\n3\n263169\n197888\n5\nUTF-8\n782\n0\n1026\n1\n262153\n5\nunits\n787\n2\n16\n2\n262153\n1\nm\n262153\n1\nm\n16\n0\n1026\n1\n262153\n5\nnames\n16\n2\n262153\n9\nnumerator\n262153\n11\ndenominator\n1026\n1\n262153\n5\nclass\n16\n1\n262153\n14\nsymbolic_units\n254\n1026\n1023\n16\n1\n262153\n5\nunits\n254\n

Are we aiming to maintain the type info? Or would it suffice that area is returned as a double like when using as.numeric() in the example?

@krlmlr
Copy link
Collaborator

krlmlr commented Jan 6, 2025

@hannes: Who can help shed lights on Arrow extension types with DuckDB?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted ❤️ we'd love your help!
Projects
None yet
Development

No branches or pull requests

3 participants