Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor of 1664: add ability to do efficient blocking based on list/array intersections #1692

Merged
merged 43 commits into from
Jan 17, 2024

Conversation

RobinL
Copy link
Member

@RobinL RobinL commented Nov 2, 2023

Original PR is here. Various additional comments there

@RobinL
Copy link
Member Author

RobinL commented Nov 3, 2023

@nerskin @aymonwuolanne
Feel like I'm making good progress on this now but would appreciate your thoughts.

This is the plan:

The problem i created

Prior to your PR, there was a gotya in block_using_rules_sql In a nutshell, that function is supposed to return a sql string only with no side effects. But there were side effects. The code worked, but was hard to understand debug. Entirely my fault - (it was my code).

I've corrected it in this PR.

Building on this 'precedent' (you had no way of knowing it was a bad precedent!), the additional code you added had further side effects (the materialisation of the exploded_id_tables). So i figured this I didn't want to layer additional issues on the original error.

Further problems I created

In the BlockingRule class there were a lot of poor/ambiguous naming conventions and lack of type hints. Meaning code that used the BlockingRule class was hard to understand

Solution (WIP)

Here's what I'm planning:

  • Name things in the BlockingRule class better. [I might possibly do this as a separate PR for clarity]
  • Promote as much code into the BlockingRule class as possible. For example the BlockingRule should have a function called something like marginal_exploded_id_pairs_table_sql, rather than this code being inlined in other functions. The idea is to substantially reduce the length and complexity of the block_using_rules_sql function.
  • Similarly the sql code that joins the input tables to the ids_to_compare table should live within the BlockingRule
  • Eliminate the side effects by moving the materialisation of exploded_id_tables to a separate function that runs prior to the block_using_rules_sql

I might also see whether there's any milage in having SaltedBlockingRule and ExplodingBlockingRule classes, possibly as mixins, or possibly inheriting from BlockingRule. But that might come in a further PR

You can see some of this work already in the current PR.

Does this all sound sensible at this stage? In particular is there anything you think in the plan that would be a showstopper

@aymonwuolanne
Copy link
Contributor

It hadn't clicked with me that there were side effects mixed in with a function that's meant to just return some SQL, that's a great thing to avoid if possible.

I think these steps look really good and they'll make it easier to follow.

Personally, I'd avoid SaltedBlockingRule and ExplodedBlockingRule as classes because then we might need a third SaltedExplodedBlockingRule and I think it'd be hard to avoid duplicating some code to account for the salting inside of the exploded_id_tables. I am happy to defer to others on that though.

One small suggestion: with the try except statement in materialise_exploded_id_tables, isn't the table cache already checked when you run linker._initialise_df_concat_with_tf()? I don't think the error handling approach is necessary there.

@aymonwuolanne
Copy link
Contributor

Thanks for the extra work on this Robin. Does this mean the other Blocking Rule refactoring is mostly done now? Let me know if I can help out with this PR at all.

@RobinL
Copy link
Member Author

RobinL commented Dec 4, 2023

@aymonwuolanne Yes - all the groundwork unrelated to exploding blocking rules is now in master, hopefully setting the stage to add back the exploding blocking rules (in this PR) in a way which is more self-contained. e.g. see here for the SaltedBlockingRule

Thanks for the offer. I'll do a little more on this and give you a heads up once I'm reasonably happy. Then would be good to get a review from you guys to make sure you're happy/suggest further improvements.

Building on all this, there's actually a bunch more improvements we have planned to blocking rules that will go into the work we're doing on Splink 4 fairly soon. If you're interested in talking more about Splink 4 feel free to give me a shout via email

@RobinL
Copy link
Member Author

RobinL commented Dec 11, 2023

@nerskin @aymonwuolanne Sorry it's taken so long, but I'm fairly happy with this now.

@nerskin would you mind having a look and maybe trying it out to verify it does what you expect (I've included the tests you wrote, plus one to cover the unique id/source dataset issue I previously mentioned, so it should be ok). It won't let me assign you as a reviewer formally, but would be good to get the OK from you before merging

The root cause of the challenges was that the materialisation of the id pairs table is a special case (unlike the other blocking code), so trying to make it fit into the overall codebase without things getting too complex was difficult.

A few notes:

  • I explicitly disallowed salted and exploding blocking rules. I know your version worked, and I even had a working version with the new Salted and Exploding blocking rules classes, but in the end I dont think the additional complexity is worth it given likely usage
    patterns

  • Another reason this was tricky is that blocking rules are used quite widely in the codebase. For example, in functions like find_matches_to_new_records and cumulative_num_comparisons_from_blocking_rules_chart.

  • The materialisation of the id pairs results in three separate actions being needed any time blocking is conducted with exploding rules.

    • materialise_exploded_id_tables (a sql statement that must be executed and materialised prior to doing any blocking)
    • block_using_rules_sqls (part of blocking)
    • drop_materialised_id_pairs_dataframe (only after the final blocked table has been executed/created can we clean up the id pairs tables that are no longe rneeded)
    • But this means the code ends up being quite brittle because it's possible to forget to do all three
    • I have been unable to find a strategy that I'm happy with that 'fits in' to Splink, and handles doing all three 'seemlessly'
  • Since tracking down all these uses is hard, I've instead locked down support to only predict() and deterministic_link(). The primary mechanism I've used is to raise an error we attempt to use an exploded blocking rule. Specifically, if no materialised id pairs table is found, and error is raised

@RobinL RobinL changed the title (WIP) refactoring exploded blocking rules Refactor of 1664: add ability to do efficient blocking based on list/array intersections Dec 12, 2023
@RobinL RobinL marked this pull request as ready for review December 12, 2023 13:19
@RobinL
Copy link
Member Author

RobinL commented Jan 5, 2024

@nerskin @aymonwuolanne. Just pinging you for the new year in case you want to have a look at this before we do an internal review and get it merged. Happy New Year!

@aymonwuolanne
Copy link
Contributor

Thanks Robin, happy New Year to you too! I've read through the changes, and I think they look great, happy for your team to do a final review. There were a few parts where I thought it could be simplified but in retrospect the way it was done was necessary, so I'm happy with it as it is.

@ADBond ADBond self-requested a review January 16, 2024 09:10
Copy link
Contributor

@ADBond ADBond left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! All makes sense, just one teensy comment but happy for you to merge 👍

Comment on lines 393 to 395
input_dataframe = linker._initialise_df_concat_with_tf()

input_colnames = {col.name for col in input_dataframe.columns}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think these can come outside the loop?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants