Refactor of 1664: add ability to do efficient blocking based on list/array intersections #1692
@nerskin @aymonwuolanne This is the plan:

The problem I created

Prior to your PR, there was a gotcha in the existing code, which I've corrected in this PR. Building on this 'precedent' (you had no way of knowing it was a bad precedent!), the additional code you added had further side effects (the materialisation of the exploded_id_tables). So I didn't want to layer additional issues on the original error.

Further problems I created

In the BlockingRule class there were a lot of poor or ambiguous names and a lack of type hints, meaning code that used the BlockingRule class was hard to understand.

Solution (WIP)

Here's what I'm planning:

I might also see whether there's any mileage in having

You can see some of this work already in the current PR. Does this all sound sensible at this stage? In particular, is there anything in the plan that you think would be a showstopper?
It hadn't clicked with me that there were side effects mixed in with a function that's meant to just return some SQL; that's a great thing to avoid if possible. I think these steps look really good and they'll make it easier to follow. Personally, I'd avoid

One small suggestion: with the try except statement in
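To make the side-effect point concrete, here is a minimal sketch of the separation being discussed: one function that only returns SQL text, and a separate, explicitly named step that materialises a table. This is not Splink's code. The names `build_blocking_sql` and `materialise` are hypothetical, and SQLite stands in for whichever backend is actually in use.

```python
import sqlite3


def build_blocking_sql(blocking_rule: str, table: str = "df") -> str:
    """Pure function: returns SQL text and touches no database state."""
    return (
        "SELECT l.unique_id AS unique_id_l, r.unique_id AS unique_id_r "
        f"FROM {table} AS l INNER JOIN {table} AS r "
        f"ON {blocking_rule} "
        "WHERE l.unique_id < r.unique_id"
    )


def materialise(con: sqlite3.Connection, sql: str, output_table: str) -> None:
    """The side effect (creating a table) lives here, where callers can see it."""
    con.execute(f"CREATE TABLE {output_table} AS {sql}")


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE df (unique_id INTEGER, surname TEXT)")
con.executemany(
    "INSERT INTO df VALUES (?, ?)",
    [(1, "smith"), (2, "smith"), (3, "jones")],
)

# Building the SQL changes nothing; only materialise() writes to the database.
sql = build_blocking_sql("l.surname = r.surname")
materialise(con, sql, "blocked_pairs")

pairs = con.execute(
    "SELECT unique_id_l, unique_id_r FROM blocked_pairs ORDER BY 1, 2"
).fetchall()
```

Keeping the two apart means the SQL builder can be called, inspected, or tested without a database connection, and any table creation shows up as an explicit call at the call site.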
Thanks for the extra work on this Robin. Does this mean the other Blocking Rule refactoring is mostly done now? Let me know if I can help out with this PR at all.
@aymonwuolanne Yes - all the groundwork unrelated to exploding blocking rules is now in

Thanks for the offer. I'll do a little more on this and give you a heads up once I'm reasonably happy. Then it would be good to get a review from you guys to make sure you're happy or to suggest further improvements.

Building on all this, there are actually a bunch more improvements we have planned for blocking rules that will go into the work we're doing on Splink 4 fairly soon. If you're interested in talking more about Splink 4, feel free to give me a shout via email.
@nerskin @aymonwuolanne Sorry it's taken so long, but I'm fairly happy with this now. @nerskin would you mind having a look and maybe trying it out to verify it does what you expect? (I've included the tests you wrote, plus one to cover the unique id/source dataset issue I previously mentioned, so it should be OK.) It won't let me assign you as a reviewer formally, but it would be good to get the OK from you before merging.

The root cause of the challenges was that the materialisation of the id pairs table is a special case (unlike the other blocking code), so trying to make it fit into the overall codebase without things getting too complex was difficult.

A few notes:
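For anyone following the thread without the diff open, here is a rough sketch of why array-intersection blocking leads to a materialised id pairs table. It is a plain-Python stand-in for what would be an unnest followed by an equi-join in SQL, not the implementation in this PR. The function names `explode` and `id_pairs_from_exploded` and the sample records are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical input: each record id maps to an array-valued column.
records = {
    1: ["a@x.com", "b@x.com"],
    2: ["b@x.com"],
    3: ["c@x.com"],
    4: ["a@x.com", "b@x.com"],
}


def explode(records: dict) -> list:
    """Unnest each array into one (id, value) row per distinct element."""
    return [(uid, value) for uid, values in records.items() for value in set(values)]


def id_pairs_from_exploded(exploded: list) -> list:
    """Equi-join the exploded rows on value and keep each id pair once."""
    ids_by_value = defaultdict(set)
    for uid, value in exploded:
        ids_by_value[value].add(uid)

    pairs = set()
    for ids in ids_by_value.values():
        # Records sharing several values yield the same pair more than once,
        # so the set removes duplicates before the pairs are stored.
        pairs.update(combinations(sorted(ids), 2))
    return sorted(pairs)


exploded = explode(records)
id_pairs = id_pairs_from_exploded(exploded)
```

Records 1 and 4 share two values, so the join produces that pair twice before deduplication. Storing the deduplicated pairs as their own table is the step that does not fit the pattern of simply returning a SQL string.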
@nerskin @aymonwuolanne. Just pinging you for the new year in case you want to have a look at this before we do an internal review and get it merged. Happy New Year!
Thanks Robin, happy New Year to you too! I've read through the changes, and I think they look great; happy for your team to do a final review. There were a few parts where I thought it could be simplified, but in retrospect the way it was done was necessary, so I'm happy with it as it is.
This looks great! All makes sense, just one teensy comment but happy for you to merge 👍
splink/blocking.py
Outdated
input_dataframe = linker._initialise_df_concat_with_tf()

input_colnames = {col.name for col in input_dataframe.columns}
I think these can come outside the loop?
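A minimal sketch of that suggestion, assuming the two lines above sat inside a loop over blocking rules and do not depend on the loop variable. The `Stub*` classes and `process_rules` are hypothetical stand-ins so the snippet runs on its own; only `_initialise_df_concat_with_tf` and the set comprehension come from the diff.

```python
from dataclasses import dataclass


@dataclass
class StubColumn:
    name: str


@dataclass
class StubDataFrame:
    columns: list


@dataclass
class StubLinker:
    calls: int = 0  # counts how often the (potentially costly) method runs

    def _initialise_df_concat_with_tf(self) -> StubDataFrame:
        self.calls += 1
        return StubDataFrame([StubColumn("unique_id"), StubColumn("surname")])


def process_rules(linker: StubLinker, blocking_rules: list) -> list:
    # Loop-invariant work is hoisted: computed once, before the loop starts.
    input_dataframe = linker._initialise_df_concat_with_tf()
    input_colnames = {col.name for col in input_dataframe.columns}

    results = []
    for rule in blocking_rules:
        # Per-rule work only; here, a simple check against the column names.
        results.append((rule, rule in input_colnames))
    return results


linker = StubLinker()
results = process_rules(linker, ["surname", "postcode"])
```

With the hoist, the initialisation call runs once regardless of how many rules are processed, which matters if that call is expensive.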
Original PR is here. Various additional comments there