Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve create_age_groups #135

Open
Nic-Chr opened this issue Sep 19, 2024 · 1 comment
Open

Improve create_age_groups #135

Nic-Chr opened this issue Sep 19, 2024 · 1 comment

Comments

@Nic-Chr
Copy link
Contributor

Nic-Chr commented Sep 19, 2024

Hi, I think we can improve the speed of create_age_groups quite a bit and also remove the dependency on 'utils' package.

If we avoid cut() which is inefficient in creating factors as it goes through unnecessary unique() + match() steps internally.
We already have our cleaned age breaks which are unique and sorted, meaning we can avoid using cut() and directly use .bincode().
.bincode() is basically a low-level factor constructor and also what cut() uses as well.
To get a character vector, all that's needed is to subset our age breaks onto our bin codes.

On the topic of cut() inefficiency, there is a stack thread I opened a while ago: https://stackoverflow.com/questions/76867914/can-cut-be-improved

Proposed function and benchmark:

create_age_groups <- function(x, from = 0, to = 90, by = 5, as_factor = FALSE){
  
  if (!is.numeric(x)) {
    cli::cli_abort("{.arg x} must be a {.cls numeric} vector, not a {.cls {class(x)}} vector.")
  }
  
  breaks <- seq(from, to, by)
  breaks <- sort(unique(breaks))
  n_breaks <- length(breaks)
  n <- max(n_breaks - 1L, 0L)
  
  bands <- paste0(
    breaks[seq_len(n)], "-", 
    breaks[seq.int(to = n_breaks, length.out = n)] - 1L
  )
  
  rightmost_band <- paste0(breaks[n_breaks], "+")
  bands[n_breaks] <- rightmost_band
  
  codes <- .bincode(x, breaks = c(breaks, Inf), right = FALSE)
  
  if (as_factor) {
    out <- codes
    levels(out) <- bands
    class(out) <- c("ordered", "factor")
  }
  else {
    out <- bands[codes]
  }
  out
}
library(bench)

x <- 20
create_age_groups(x)
#> [1] "20-24"
phsmethods::create_age_groups(x)
#> [1] "20-24"

mark(create_age_groups(x), 
     phsmethods::create_age_groups(x))
#> # A tibble: 2 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                       <bch:tm> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 create_age_groups(x)               98.6µs   103µs     9202.    2.41KB    10.4 
#> 2 phsmethods::create_age_groups(x)  155.5µs   168µs     5697.    3.84KB     8.30
mark(create_age_groups(x, as_factor = TRUE), 
     phsmethods::create_age_groups(x, as_factor = TRUE))
#> # A tibble: 2 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 create_age_groups(x, as_factor = TR… 101µs  106µs     8998.    2.71KB    12.7 
#> 2 phsmethods::create_age_groups(x, as… 153µs  162µs     5868.    3.84KB     8.30

x <- sample(0:100, 10^7, T)
mark(create_age_groups(x), 
     phsmethods::create_age_groups(x))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression                            min  median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                       <bch:tm> <bch:t>     <dbl> <bch:byt>    <dbl>
#> 1 create_age_groups(x)                414ms   416ms      2.40     191MB     2.40
#> 2 phsmethods::create_age_groups(x)    701ms   701ms      1.43     420MB     4.28
mark(create_age_groups(x, as_factor = TRUE), 
     phsmethods::create_age_groups(x, as_factor = TRUE))
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                           <bch> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 create_age_groups(x, as_factor = TR… 361ms  361ms      2.77     153MB     2.77
#> 2 phsmethods::create_age_groups(x, as… 654ms  654ms      1.53     343MB     1.53

Created on 2024-09-19 with reprex v2.0.2

This obviously relates to issues #93 and #54, which I think are also worthwhile but as subsequent step.

@Moohan
Copy link
Member

Moohan commented Sep 19, 2024

I'm always keen on speed/memory improvements (especially if they work on larger datasets). My usual approach when I've made changes like this in the past to other functions is to start by expanding the tests for the function(s) so I'm 100% sure there's no unintended regressions or behaviour changes.

I've also been interested in https://lorenzwalthert.github.io/touchstone/ for a while which is meant for exactly these types of improvements - it involves a bit of setup but then you get a benchmark comment added to PRs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants