Refactor of 1664: add ability to do efficient blocking based on list/array intersections #1692

RobinL · 2023-11-02T11:39:17Z

Original PR is here. Various additional comments there

RobinL · 2023-11-03T15:02:41Z

@nerskin @aymonwuolanne
Feel like I'm making good progress on this now but would appreciate your thoughts.

This is the plan:

The problem I created

Prior to your PR, there was a gotcha in block_using_rules_sql. In a nutshell, that function is supposed to return a SQL string only, with no side effects. But there were side effects. The code worked, but was hard to understand and debug. Entirely my fault (it was my code).

I've corrected it in this PR.

Building on this 'precedent' (you had no way of knowing it was a bad precedent!), the additional code you added had further side effects (the materialisation of the exploded_id_tables). So I figured I didn't want to layer additional issues on top of the original error.
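To make the distinction concrete, here is a minimal sketch of the two shapes of function. The function names, the `execute` callback and the SQL text are all made up for illustration and are not Splink's actual code.

```python
from typing import Callable, List


def block_sql_with_side_effect(rule: str, execute: Callable[[str], None]) -> str:
    """Anti-pattern: looks like a pure SQL builder but also runs a statement."""
    # Hidden side effect: a table gets materialised while "just building SQL".
    execute(f"CREATE TABLE ids_to_compare AS SELECT 1 /* {rule} */")
    return f"SELECT * FROM ids_to_compare /* {rule} */"


def block_sql_pure(rule: str) -> str:
    """Preferred shape: only builds and returns a string; nothing is executed."""
    return f"SELECT l.id AS id_l, r.id AS id_r FROM df AS l JOIN df AS r ON {rule}"


# Record every statement that gets executed so the side effect is visible.
executed: List[str] = []
block_sql_with_side_effect("l.city = r.city", executed.append)
sql = block_sql_pure("l.city = r.city")
```

The first function cannot be called twice, or called just to inspect the SQL, without touching the database. The second can.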

Further problems I created

In the BlockingRule class there were a lot of poor/ambiguous naming conventions and a lack of type hints, meaning code that used the BlockingRule class was hard to understand.

Solution (WIP)

Here's what I'm planning:

- Name things in the BlockingRule class better. [I might possibly do this as a separate PR for clarity]
- Promote as much code into the BlockingRule class as possible. For example the BlockingRule should have a function called something like marginal_exploded_id_pairs_table_sql, rather than this code being inlined in other functions. The idea is to substantially reduce the length and complexity of the block_using_rules_sql function.
- Similarly the SQL code that joins the input tables to the ids_to_compare table should live within the BlockingRule.
- Eliminate the side effects by moving the materialisation of exploded_id_tables to a separate function that runs prior to block_using_rules_sql.
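As a rough sketch of what pushing this logic into the class could look like: the class below is a hypothetical stand-in, not Splink's real BlockingRule. The attribute names, the simplified method name, and the DuckDB-style `unnest` syntax are assumptions, and the "marginal" part (excluding pairs already produced by earlier rules) is left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SketchBlockingRule:
    """Hypothetical stand-in for illustration; not Splink's real BlockingRule."""

    blocking_rule_sql: str
    array_columns_to_explode: List[str] = field(default_factory=list)

    @property
    def is_exploding(self) -> bool:
        return len(self.array_columns_to_explode) > 0

    def exploded_id_pairs_table_sql(self, input_table: str, unique_id_col: str) -> str:
        """Return SQL only; executing and materialising it is the caller's job."""
        if not self.is_exploding:
            raise ValueError("Rule has no array columns to explode")
        # unnest(...) is written DuckDB-style; other backends spell this differently.
        unnested = ", ".join(
            f"unnest({col}) AS {col}" for col in self.array_columns_to_explode
        )
        return (
            f"WITH exploded AS (SELECT {unique_id_col}, {unnested} FROM {input_table}) "
            f"SELECT l.{unique_id_col} AS {unique_id_col}_l, "
            f"r.{unique_id_col} AS {unique_id_col}_r "
            f"FROM exploded AS l INNER JOIN exploded AS r "
            f"ON {self.blocking_rule_sql} "
            f"WHERE l.{unique_id_col} < r.{unique_id_col}"
        )


rule = SketchBlockingRule("l.postcodes = r.postcodes", ["postcodes"])
id_pairs_sql = rule.exploded_id_pairs_table_sql("input_table", "unique_id")
```

With the SQL generation living on the rule, the calling function only needs to ask each rule for its SQL, which is what keeps it short.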

I might also see whether there's any mileage in having SaltedBlockingRule and ExplodingBlockingRule classes, possibly as mixins, or possibly inheriting from BlockingRule. But that might come in a further PR.

You can see some of this work already in the current PR.

Does this all sound sensible at this stage? In particular, is there anything in the plan that you think would be a showstopper?

aymonwuolanne · 2023-11-06T06:23:30Z

It hadn't clicked with me that there were side effects mixed in with a function that's meant to just return some SQL, that's a great thing to avoid if possible.

I think these steps look really good and they'll make it easier to follow.

Personally, I'd avoid SaltedBlockingRule and ExplodedBlockingRule as classes because then we might need a third SaltedExplodedBlockingRule and I think it'd be hard to avoid duplicating some code to account for the salting inside of the exploded_id_tables. I am happy to defer to others on that though.

One small suggestion: with the try except statement in materialise_exploded_id_tables, isn't the table cache already checked when you run linker._initialise_df_concat_with_tf()? I don't think the error handling approach is necessary there.

aymonwuolanne · 2023-12-04T04:54:20Z

Thanks for the extra work on this Robin. Does this mean the other Blocking Rule refactoring is mostly done now? Let me know if I can help out with this PR at all.

RobinL · 2023-12-04T09:54:48Z

@aymonwuolanne Yes - all the groundwork unrelated to exploding blocking rules is now in master, hopefully setting the stage to add back the exploding blocking rules (in this PR) in a way which is more self-contained. e.g. see here for the SaltedBlockingRule

Thanks for the offer. I'll do a little more on this and give you a heads up once I'm reasonably happy. Then would be good to get a review from you guys to make sure you're happy/suggest further improvements.

Building on all this, there are actually a bunch more improvements we have planned for blocking rules that will go into the work we're doing on Splink 4 fairly soon. If you're interested in talking more about Splink 4, feel free to give me a shout via email.

RobinL · 2023-12-11T16:07:44Z

@nerskin @aymonwuolanne Sorry it's taken so long, but I'm fairly happy with this now.

@nerskin would you mind having a look and maybe trying it out to verify it does what you expect (I've included the tests you wrote, plus one to cover the unique id/source dataset issue I previously mentioned, so it should be ok). It won't let me assign you as a reviewer formally, but would be good to get the OK from you before merging

The root cause of the challenges was that the materialisation of the id pairs table is a special case (unlike the other blocking code), so trying to make it fit into the overall codebase without things getting too complex was difficult.
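For readers new to the feature, the id pairs table can be pictured with a small pure-Python sketch. This is only an illustration of the explode-then-match idea on made-up records; the real implementation does this in SQL on the backend.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple

# Made-up records: unique id -> array column (e.g. a list of postcodes).
records: Dict[int, List[str]] = {
    1: ["AB1", "CD2"],
    2: ["CD2"],
    3: ["EF3"],
    4: ["AB1", "EF3"],
}

# "Explode": one row per (id, array element).
exploded: List[Tuple[int, str]] = [
    (record_id, value) for record_id, values in records.items() for value in values
]

# Group ids by element so that only records sharing an element are paired.
ids_by_value: Dict[str, Set[int]] = defaultdict(set)
for record_id, value in exploded:
    ids_by_value[value].add(record_id)

# Collect distinct id pairs; a set removes pairs that share several elements.
pairs: Set[Tuple[int, int]] = set()
for ids in ids_by_value.values():
    pairs.update(combinations(sorted(ids), 2))

id_pairs = sorted(pairs)
```

Records 1 and 3 never meet because their arrays do not intersect, which is where the efficiency comes from.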

A few notes:

- I explicitly disallowed combining salted and exploding blocking rules. I know your version worked, and I even had a working version with the new Salted and Exploding blocking rule classes, but in the end I don't think the additional complexity is worth it given likely usage patterns.
- Another reason this was tricky is that blocking rules are used quite widely in the codebase, for example in functions like find_matches_to_new_records and cumulative_num_comparisons_from_blocking_rules_chart.
- The materialisation of the id pairs results in three separate actions being needed any time blocking is conducted with exploding rules:
  - materialise_exploded_id_tables (a SQL statement that must be executed and materialised prior to doing any blocking)
  - block_using_rules_sqls (part of blocking)
  - drop_materialised_id_pairs_dataframe (only after the final blocked table has been executed/created can we clean up the id pairs tables that are no longer needed)
- But this means the code ends up being quite brittle because it's possible to forget to do all three.
- I have been unable to find a strategy that I'm happy with that 'fits in' to Splink and handles doing all three 'seamlessly'.
- Since tracking down all these uses is hard, I've instead locked down support to only predict() and deterministic_link(). The primary mechanism I've used is to raise an error when we attempt to use an exploded blocking rule. Specifically, if no materialised id pairs table is found, an error is raised.
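The ordering constraint and the guard can be sketched as follows. This is a toy model under stated assumptions: `FakeDatabase` and every function here are simplified, hypothetical stand-ins named after the three steps above, not the functions in this PR, and wrapping the steps in try/finally is just one possible way to keep them together.

```python
from typing import Dict, List, Tuple

IdPairs = List[Tuple[int, int]]


class FakeDatabase:
    """Toy in-memory stand-in for a backend that holds materialised tables."""

    def __init__(self) -> None:
        self.tables: Dict[str, IdPairs] = {}


def materialise_exploded_id_table(db: FakeDatabase, name: str, pairs: IdPairs) -> None:
    # Step 1: must run before any blocking that uses the exploding rule.
    db.tables[name] = list(pairs)


def block_using_exploded_rule(db: FakeDatabase, name: str) -> IdPairs:
    # Step 2: the guard. Fail loudly if step 1 was skipped.
    if name not in db.tables:
        raise RuntimeError(f"No materialised id pairs table named {name!r}")
    return list(db.tables[name])


def drop_materialised_id_pairs(db: FakeDatabase, name: str) -> None:
    # Step 3: clean up once the blocked table has been created.
    db.tables.pop(name, None)


def predict_sketch(db: FakeDatabase, pairs: IdPairs) -> IdPairs:
    """Run the three steps in order, cleaning up even if blocking fails."""
    name = "__sketch_exploded_id_pairs"
    materialise_exploded_id_table(db, name, pairs)
    try:
        return block_using_exploded_rule(db, name)
    finally:
        drop_materialised_id_pairs(db, name)


db = FakeDatabase()
blocked = predict_sketch(db, [(1, 2), (3, 4)])
```

Calling block_using_exploded_rule directly, without the materialisation step, raises instead of silently producing wrong results, which mirrors the lock-down described above.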

RobinL · 2024-01-05T11:19:00Z

@nerskin @aymonwuolanne. Just pinging you for the new year in case you want to have a look at this before we do an internal review and get it merged. Happy New Year!

aymonwuolanne · 2024-01-08T21:48:09Z

Thanks Robin, happy New Year to you too! I've read through the changes, and I think they look great, happy for your team to do a final review. There were a few parts where I thought it could be simplified but in retrospect the way it was done was necessary, so I'm happy with it as it is.

ADBond

This looks great! All makes sense, just one teensy comment but happy for you to merge 👍

ADBond · 2024-01-17T09:18:52Z

splink/blocking.py

+        input_dataframe = linker._initialise_df_concat_with_tf()
+
+        input_colnames = {col.name for col in input_dataframe.columns}


I think these can come outside the loop?
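A generic sketch of the suggestion, using made-up names rather than the code in splink/blocking.py: the lookup of input column names does not depend on the loop variable, so it can be computed once before iterating over the rules.

```python
from typing import Dict, List, Set

# Counts how often the (potentially expensive) initialisation is invoked.
call_count = {"n": 0}


def initialise_input_columns() -> Set[str]:
    """Hypothetical stand-in for fetching the input table and its column names."""
    call_count["n"] += 1
    return {"unique_id", "postcodes", "first_name"}


def columns_missing_per_rule(rules: Dict[str, List[str]]) -> Dict[str, List[str]]:
    # Hoisted: computed once, because it is the same for every rule.
    input_colnames = initialise_input_columns()
    missing: Dict[str, List[str]] = {}
    for rule_name, needed_columns in rules.items():
        missing[rule_name] = [c for c in needed_columns if c not in input_colnames]
    return missing


result = columns_missing_per_rule({"rule_a": ["postcodes"], "rule_b": ["emails"]})
```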

nerskin and others added 15 commits October 23, 2023 21:58

add tests for array-based blocking

ed6e47b

Add logic for blocking on array intersections by unnesting tables

f3024ef

update hardcoded hash in test_correctness_of_convergence.py

bfea53c

Merge branch 'moj-analytical-services:master' into master

4399142

linting/formatting

d3fe4dd

update table names for consistency with splink conventions

a48b0ec

Update tests

3b8222a

ensure that tables names are unique

1f1ea76

lint

eacdc04

wip

a9076ea

wip

8ed833f

move materialisation logic to separate function

217f633

rename for clarity

43298b1

improve clairty of names

90fadb5

pushing logic into blockingrule class

efb72c3

RobinL added 8 commits November 6, 2023 06:57

better names

4caa847

remove materialised tables after use

4c40ffd

is salted

836ea36

merge in master

46d834b

fix merge

fa096cd

exploding blocking rule class

8621d2d

all logic now pushed into blocking rules classes

4d40bd9

better names

1117737

This was referenced Nov 6, 2023

(WIP) Refactor blocking rule for clarity #1699

Closed

BlockingRule: Clarify name of sql property #1700

Merged

BlockingRule: Refactor to enable better iteration #1701

Merged

RobinL marked this pull request as draft November 7, 2023 10:01

change all files to current master

7c71849

RobinL added 7 commits December 11, 2023 10:50

fix spark

24cd2e7

Merge branch 'master' into refactor_ids_to_compare_creation

0b2f338

format better

1d63218

fix tests

12e0c90

make work with deterministic link

c2aea3c

put check supported logic in place most likely to be caught

a58deca

gives correct error messages

4021d3f

RobinL added 6 commits December 11, 2023 16:10

fix tests

0cede3d

add tests back in

26bf082

fix link type issue and unique ids

c7307eb

lint

8aecb27

fix line length

6798cb1

rename method for clarity

9ef61f1

RobinL changed the title ~~(WIP) refactoring exploded blocking rules~~ Refactor of 1664: add ability to do efficient blocking based on list/array intersections Dec 12, 2023

RobinL marked this pull request as ready for review December 12, 2023 13:19

ADBond self-requested a review January 16, 2024 09:10

ADBond approved these changes Jan 17, 2024


RobinL added 2 commits January 17, 2024 12:04

Merge branch 'master' into refactor_ids_to_compare_creation

1ece5d7

Move out of loop

ca0e202

RobinL merged commit 2661b08 into master Jan 17, 2024
10 checks passed

RobinL deleted the refactor_ids_to_compare_creation branch January 17, 2024 12:16

RobinL mentioned this pull request Jan 18, 2024

Add support for SaltedBlockingRule for EM training (again) #1853

Merged

ADBond mentioned this pull request Jan 23, 2024

Database API - updates and test conformance #1875

Merged

zmbc mentioned this pull request Mar 1, 2024

[FEAT] Add ability to do efficient blocking based on list/array intersections #1448

Closed

ADBond mentioned this pull request Jan 16, 2025

Should BlockingRules store identifier column information? #2588

Open
