-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tweak archive construction and compactification tolerance interfaces #601
base: dev
Are you sure you want to change the base?
Conversation
- Make `new_epi_archive()` perform only basic checks and construction, and output an "unvalidated" `epi_archive`. - Make `validate_epi_archive()` operate on (unvalidated or already-validated) `epi_archive`s (rather than `as_epi_archive` arguments) and perform the more expensive checks omitted by `new_epi_archive`. (But not necessarily all required checks; it may be assumed that basic checks have already passed.) - Move compactification tolerance option to `as_epi_archive()`.
- Standardize to names & defaults of `compactify = TRUE, compactify_abs_tol = 0` in most places; `abs_tol` for compactify-specific functions. Add missing tolerance setting in `epix_merge()`. Keep `should_compactify` as-is in `revision_summary()` for now. - Change `compactify = NULL` possibility to `compactify = "message"`, and message instead of warn. - Make `compactify_abs_tol = 0` still compactify when exactly equal by using `<= compactify_abs_tol` rather than `<` via `dplyr::near()`. - Update examples and vignettes to not have unnecessary `compactify = TRUE`.
- Use compactification-with-tolerance also on "bare"/"unclassed" integer columns. - Don't use locf-with-tolerance on key columns, to avoid dropping epikeytimes entirely in some situations.
6b24dc5
to
889dbee
Compare
The extra validation
|
)) # nolint: object_usage_linter | ||
removed_by_compactify <- function(updates_df, ukey_names, abs_tol) { | ||
if (!is.data.table(updates_df) || !identical(key(updates_df), ukey_names)) { | ||
updates_df <- updates_df %>% arrange(!!!ukey_names) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: recommend using the same approach here and line 424 (updates_df <- updates_df %>% arrange(pick(all_of(ukey_names)))
).
# Messages about redundant rows if the number of rows decreased, and we didn't | ||
# explicitly say to compactify | ||
if (identical(compactify, "message") && nrow(compactified) < nrow_before_compactify) { | ||
elim <- removed_by_compactify(data_table, key_vars, compactify_abs_tol) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue: It seems expensive to rerun compactification just for this message. What if we always return both "keep" and "remove" results in apply_compactification
? Or possibly "keep" results and "remove" indices. Not sure how this works out for memory use vs runtime. Which we prefer depends on how often the "message" option will be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Always return" might be a performance issue for the compactify = TRUE
case, now the default. Keep/remove flags might be the answer to make this possible still semi-neatly. I'm not sure how much "message"
is going to be used though... I was considering just dropping that option altogether. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you were already considering dropping it before my comment, could be more reason to do so.
We could keep the "message" option, but not have it print out removed rows? Or just leave this as-is and improve later if we run into issues (it may be that compactifying is so fast that 2x the work is still not a problem).
@@ -98,8 +98,8 @@ revision_summary <- function(epi_arch, | |||
few_revisions = 3, | |||
abs_spread_threshold = NULL, | |||
rel_spread_threshold = 0.1, | |||
compactify_tol = .Machine$double.eps^0.5, | |||
should_compactify = TRUE) { | |||
should_compactify = TRUE, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: change this arg to be called compactify
to match naming in archive-creation fns.
Checklist
Please:
PR).
brookslogan, nmdefries.
DESCRIPTION
. Always incrementthe patch version number (the third number), unless you are making a
release PR from dev to main, in which case increment the minor version
number (the second number).
(backwards-incompatible changes to the documented interface) are noted.
Collect the changes under the next release number (e.g. if you are on
1.7.2, then write your changes under the 1.8 heading).
process.
Change explanations for reviewer
new
only doing very basic checks,validate
taking its output and doing heavier checks, and other functions handling other heavy operations.compactify_abs_tol = 0
works as the default. It's also been made available inepix_merge()
.This serves to help set up for and maintain compatibility with some more performant archive construction & sliding utilities in the works.
Magic GitHub syntax to mark associated Issue(s) as resolved when this is merged into the default branch
compactify_tol
interface & usage #570