(WIP) Blocking api redesign #2136
Todo/issues:
- Separately, we probably want a PR that turns all data passed into Splink into a SplinkDataFrame, because the lack of this is causing problems here.
- Fundamentally, unless we want to rewrite how blocking works so that it doesn't need a linker, we have to use a linker. The current implementation uses:
Closing in favour of #2180
Summary
This PR redesigns the user-facing API for the analysis of blocking rules.
Existing functions
At the moment we have the following functions for analysing blocking rules:
1. linker.count_num_comparisons_from_blocking_rule
2. linker._count_num_comparisons_from_blocking_rule_pre_filter_conditions
3. linker.cumulative_comparisons_from_blocking_rules_records
4. linker.cumulative_num_comparisons_from_blocking_rules_chart
5. linker.count_num_comparisons_from_blocking_rules_for_prediction
Functions 1 and 2 have the same objective, but they differ in the definition and speed of the computation, and in the result; see here.
Functions 3, 4 and 5 all perform the same underlying analysis; they vary only in the presentation.
Proposal
Reduce these to three functions, remove them from the linker, and make them available from splink.blocking_analysis:
count_comparisons_from_blocking_rule
cumulative_comparisons_from_blocking_rules_chart
cumulative_comparisons_from_blocking_rules_records(include_total:bool)
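To make the proposal concrete, here is a purely hypothetical pseudocode sketch of how the three functions might be called once they live in splink.blocking_analysis. Only the function names come from the proposal above; the argument lists are deliberately elided because they are not yet agreed:

```python
# Pseudocode sketch -- function names are from the proposal above;
# all arguments are placeholders, not the final API.
from splink import blocking_analysis

blocking_analysis.count_comparisons_from_blocking_rule(...)
blocking_analysis.cumulative_comparisons_from_blocking_rules_chart(...)
blocking_analysis.cumulative_comparisons_from_blocking_rules_records(
    ..., include_total=True
)
```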
Discussion
There are two areas of complexity in the analysis of blocking rules: the choice between a 'fast/estimate' and a 'slow/precise' counting methodology, and the computation of cumulative/marginal counts across a set of rules.
For more details on the 'fast/estimate' technique, see here.
One reason the 'fast/estimate' method is important is that a key objective of blocking rule analysis is to reject blocking rules that generate too many comparisons to be computationally feasible (e.g. block_on("first_name") could generate 1trn comparisons). But if you use the 'slow/precise' methodology to attempt to detect this, the computation will never end. So even when using the slow/precise method, it's generally desirable to first 'test' everything using the 'fast/estimate' method, to ensure the user isn't asking for a computation that will never complete.
Initial work
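To illustrate the distinction, here is a minimal pure-Python sketch (not Splink's implementation). For an equi-join blocking rule, the 'fast/estimate' count can be derived from group sizes alone, whereas the 'slow/precise' count materialises candidate pairs. On a pure equi-join the two agree; the point is that the estimate never enumerates pairs, so it stays cheap even when the true count would run into the trillions:

```python
from collections import Counter
from itertools import combinations

def estimate_comparisons(values):
    """Fast/estimate: only needs group sizes, never materialises pairs.
    For an equi-join blocking rule, a group of size n yields
    n*(n-1)/2 within-group comparisons (dedupe case)."""
    counts = Counter(values)
    return sum(n * (n - 1) // 2 for n in counts.values())

def precise_comparisons(values):
    """Slow/precise: materialise every candidate pair and apply the rule."""
    return sum(
        1
        for (i, a), (j, b) in combinations(enumerate(values), 2)
        if a == b
    )

first_names = ["ann", "bob", "ann", "cat", "ann", "bob"]
print(estimate_comparisons(first_names))  # 4  (3 'ann' pairs + 1 'bob' pair)
print(precise_comparisons(first_names))   # 4
```

The estimate runs in O(n) over the input; the precise count is O(n^2), which is exactly why it can "never end" on a bad rule.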
I've made an initial attempt that looks like this, but concluded that, prior to getting this going, we need to allow creating __splink__df_concat without the need for a linker.
Complexity
In order to compute the cumulative/marginal number of comparisons created by a set of blocking rules, we need to use a linker.
This is because of the complexity of the logic, particularly around exploding blocking rules.
Each blocking rule needs to be 'aware' of the comparisons generated by the previous rule, and in the case of exploding blocking rules, this involves creating a lookup table of ID-ID pairs that have been created.
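A toy sketch of this 'aware of previous rules' logic in pure Python (not Splink's SQL implementation; representing each blocking rule as a key function is an assumption made for illustration). A set of already-emitted ID pairs plays the role of the lookup table, and each rule counts only the pairs it is first to generate:

```python
from itertools import combinations

def cumulative_comparisons(records, rules):
    """For each rule, count only the pairs not already generated by an
    earlier rule -- the 'marginal' count. Each rule is modelled as a
    function mapping a record to its blocking key (an illustration,
    not Splink's representation)."""
    seen = set()  # ID-ID pairs created so far: the lookup table
    marginal_counts = []
    for rule in rules:
        new_pairs = {
            (i, j)
            for (i, a), (j, b) in combinations(enumerate(records), 2)
            if rule(a) == rule(b) and (i, j) not in seen
        }
        seen |= new_pairs
        marginal_counts.append(len(new_pairs))
    return marginal_counts

people = [
    {"first": "ann", "city": "leeds"},
    {"first": "ann", "city": "york"},
    {"first": "bob", "city": "leeds"},
    {"first": "ann", "city": "leeds"},
]
rules = [lambda r: r["first"], lambda r: r["city"]]
print(cumulative_comparisons(people, rules))  # [3, 2]
```

The second rule matches three pairs on city, but one of them was already produced by the first-name rule, so its marginal count is 2. The state threaded through the loop is what ties the computation to a linker in the current design.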
As a result, we end up in a less-than-ideal situation: when writing a count_comparisons_to_be_scored_from_blocking_rules function that exists independently of the main linker, we still need to create a linker under the hood so that the SQL is generated correctly.
TODO/Check
[ ] - Tidy up logic for making BlockingRule/BlockingRuleCreator uniform. Do type hinting
[ ] - Move internal logic to splink.internals
[ ] - Check both pure function, and use within 'estimate_probability_two_random_records_match' are consistent and correct in terms of the count of marginal rules, by comparing against Splink 3
[ ] - Check it works with spark and exploding blocking rules
[ ] - Check it works with duckdb irrespective of the type of input data (pandas df, duckdb tablename, etc)
[ ] - Check block_using_rules_sqls is called correctly and check arguments make sense
[ ] - Check deterministic_link works correctly with two dataset link only
[ ] - Check predict works correctly with two dataset link only
[ ] - Check estimate_probability_two_random_records_match works