Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New C callables to support tibble #1679

Open
10 tasks
krlmlr opened this issue Sep 24, 2022 · 5 comments
Open
10 tasks

New C callables to support tibble #1679

krlmlr opened this issue Sep 24, 2022 · 5 comments

Comments

@krlmlr
Copy link
Member

krlmlr commented Sep 24, 2022

Subsetting and subassignment is up to 5x slower than the corresponding data frame operations. I don't see potential to optimize at the R level without compromising readability. To implement efficiently in C, we'd need the following C callables:

  • vec_slice
  • vec_as_names
  • vec_set_names
  • vec_as_location
  • vec_as_location2
  • vec_as_subscript2
  • vec_is (*)
  • vec_recycle (*)
  • num_as_location (*)
  • vec_as_location2 (*)

Ideally, these C callables would return an error object instead of throwing an error. Do we have a good pattern for this?

@krlmlr
Copy link
Member Author

krlmlr commented Sep 25, 2022

Alternatively, vctrs could provide a trimmed-down version of tibble subsetting, in C, without tibble's compatibility code, based on https://github.com/DavisVaughan/standardize. We could evaluate beforehand what happens when we remove the compatibility code from tibble -- replacing all deprecate_warn() calls with deprecate_stop() .

@mgirlich
Copy link
Contributor

To support this case here is a benchmarks from my original issue in tibble:

f <- function(x, n = 10e3) {
  for (i in seq(n)) {
    x[["x"]] <- 1L
  }
}

t <- tibble::tibble(x = 1L)
df <- data.frame(x = 1L)
l <- list(x = 1L)

bench::mark(
  tibble = f(t),
  dataframe = f(df),
  list = f(l)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 3 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 tibble     523.22ms 523.22ms      1.91   211.5KB     15.3
#> 2 dataframe   78.31ms  80.04ms     12.0     80.9KB     18.0
#> 3 list         1.35ms   1.52ms    560.          0B     31.9

Created on 2022-08-25 with reprex v2.0.2

The difference between tibble and a bare list are so big that for me it made more sense to use the following workflow:

  • convert tibble to a list
  • update the column
  • convert list back to a tibble

See tidyverse/tidyr#1386 (comment) for a real life example where the performance loss was so drastic that I had to use this approach.

@lionel-
Copy link
Member

lionel- commented Sep 26, 2022

Ideally, these C callables would return an error object instead of throwing an error. Do we have a good pattern for this?

withCallingHandlers() is fast, so I think you could just wrap your .Call()? We're presumably not looking for maximal efficiency here.

@krlmlr
Copy link
Member Author

krlmlr commented Sep 27, 2022

Subsetting and subset assignment are basic operations that are also used in loops or repeatedly. The styler package got 30 % faster when replacing tibble subsetting with data.frame subsetting: r-lib/styler#1007.

I'd like to make tibble subsetting faster than data.frame subsetting. I wouldn't even mind to move the entire subsetting to vctrs to achieve this.

@krlmlr
Copy link
Member Author

krlmlr commented Oct 31, 2022

@lionel-: Are you suggesting to use withCallingHandlers() once, as a catch-all for all errors? I'll need to play with that to see if the behavior changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants