Pairwise string distance comparison #2517

Merged

Conversation

zmbc
Contributor

@zmbc zmbc commented Nov 20, 2024

Type of PR

  • BUG
  • FEAT
  • MAINT
  • DOC

Is your Pull Request linked to an existing Issue or Pull Request?

This is a follow up to #2195, addressing the PR comments there. Closes #1994.

Give a brief description for the solution you have provided

As discussed in the prior PR, this mostly models PairwiseStringDistanceFunctionAtThresholds and PairwiseStringDistanceFunctionLevel on DistanceFunctionAtThresholds and DistanceFunctionLevel, respectively.
The main differences are that the comparison is pairwise on an array column (duh) and that it accepts only a small list of string distance functions, which it transpiles, instead of the user passing an arbitrary SQL function.
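To make "pairwise on an array column" concrete, here is a minimal pure-Python sketch of the idea. It is not Splink's implementation, and it does not reflect the SQL that Splink generates. The helper names (`levenshtein`, `min_pairwise_distance`, `comparison_level`) and the level labels are hypothetical, chosen only for illustration. It assumes the comparison takes the smallest distance over every pair of elements drawn from the two arrays and then assigns the first threshold that this distance satisfies.

```python
from itertools import product


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,               # delete ca
                    curr[j - 1] + 1,           # insert cb
                    prev[j - 1] + (ca != cb),  # substitute (free if equal)
                )
            )
        prev = curr
    return prev[-1]


def min_pairwise_distance(left, right):
    """Smallest distance over all (left element, right element) pairs.

    Returns None when either array is missing or empty (hypothetical
    convention for this sketch).
    """
    if not left or not right:
        return None
    return min(levenshtein(x, y) for x, y in product(left, right))


def comparison_level(left, right, thresholds=(1, 2)):
    """Map the best pairwise distance to the first threshold it meets.

    The labels returned here are illustrative, not Splink's.
    """
    best = min_pairwise_distance(left, right)
    if best is None:
        return "null"
    for threshold in sorted(thresholds):
        if best <= threshold:
            return f"distance <= {threshold}"
    return "else"


# Example: two records whose array columns hold alternative name spellings.
print(comparison_level(["john", "jonathan"], ["joan", "mary"]))
```

For a similarity measure such as Jaro-Winkler, where larger values mean closer strings, the aggregation would presumably be a maximum rather than a minimum, with the threshold test reversed.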

PR Checklist

  • Added documentation for changes
  • Added feature to example notebooks or tutorial (if appropriate)
  • Added tests (if appropriate)
  • Updated CHANGELOG.md (if appropriate)
  • Made changes based off the latest version of Splink
  • Run the linter
  • Run the spellchecker (if appropriate)

Member

@RobinL RobinL left a comment

Thanks - I think this looks good. Minor comments below about the default argument and a suggested refactor of the test to align with the newer format, but other than that I think this is good to merge.

splink/internals/comparison_library.py (outdated)
tests/test_comparison_lib.py
splink/internals/comparison_library.py (outdated)
@zmbc
Contributor Author

zmbc commented Dec 2, 2024

@RobinL I believe I've addressed your comments. I don't understand why a test is failing -- it does not seem related in any way to these changes.

@ADBond
Contributor

ADBond commented Dec 3, 2024

> @RobinL I believe I've addressed your comments. I don't understand why a test is failing -- it does not seem related in any way to these changes.

@zmbc you are correct. Apologies, this is an unrelated issue, #2515 (which will be fixed shortly, so it should not be a problem going forward). I have re-run the check so that it passes, for clarity.

@RobinL
Member

RobinL commented Dec 3, 2024

Brilliant, thanks @zmbc and @JonnyShiUW, this is great.

@RobinL RobinL merged commit 02b702b into moj-analytical-services:master Dec 3, 2024
25 checks passed
@RobinL RobinL mentioned this pull request Dec 5, 2024
11 tasks
@zmbc zmbc deleted the pairwise_string_distance_comparison branch December 7, 2024 07:16
Development

Successfully merging this pull request may close these issues.

[FEAT] Allow fuzzy matches on array-valued columns
3 participants