Pairwise string comparison #51

Merged: 9 commits, Dec 16, 2024

ADBond
Owner

@ADBond ADBond commented Dec 12, 2024

Following Splink#2517 and Splink#2546, add Clickhouse-specific versions of PairwiseStringDistanceFunctionLevel and PairwiseStringDistanceFunctionAtThresholds.

These could not be implemented using only the dialect markers introduced in the latter PR, because the SQL in Splink assumes that the lambda is passed as the second argument to functions. In Clickhouse the lambda must be the first argument, and the parser fails if it appears anywhere else. This rules out options such as defining a UDF with the arguments switched, as the parser would still fail on the call. Rather than resorting to more involved string manipulation, we simply re-implement the SQL in a way that is appropriate for the Clickhouse dialect.
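
To illustrate the argument-order problem, here is a minimal sketch that builds the same "apply a lambda to every element" expression for both dialects. The helper names are hypothetical and this is not the SQL used in Splink or in this PR; it only relies on DuckDB's `list_transform(list, lambda)` and ClickHouse's `arrayMap(lambda, array)` signatures.

```python
def duckdb_style_transform(list_expr: str, lambda_expr: str) -> str:
    # DuckDB's list_transform takes the list first and the lambda second
    return f"list_transform({list_expr}, {lambda_expr})"


def clickhouse_style_transform(list_expr: str, lambda_expr: str) -> str:
    # ClickHouse's arrayMap takes the lambda first, followed by the array
    return f"arrayMap({lambda_expr}, {list_expr})"


# Hypothetical column name and lambda body, for illustration only
lambda_sql = "x -> upper(x)"
print(duckdb_style_transform("names", lambda_sql))
print(clickhouse_style_transform("names", lambda_sql))
```

Because the difference is in argument position rather than function name, a simple name substitution is not enough, which is why a dialect-specific implementation is the more straightforward route.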

This turned out to be doubly useful, as it is also not directly possible to unnest lists by a single level in Clickhouse, which requires a workaround (although, to be fair, this could still have been achieved with a UDF).
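
The following plain-Python sketch shows why single-level unnesting matters, assuming the comparison builds a nested list of value pairs (an assumption made for illustration, not taken from the PR). To my understanding, ClickHouse's `arrayFlatten` flattens nested arrays to any depth, which corresponds to `flatten_fully` here, whereas the pairwise logic needs the behaviour of `flatten_one_level`. The workaround actually used in this PR is not reproduced.

```python
def flatten_one_level(nested):
    # Remove only the outermost layer of nesting, keeping inner lists intact
    return [item for inner in nested for item in inner]


def flatten_fully(nested):
    # Recursively flatten every level of nesting
    out = []
    for item in nested:
        if isinstance(item, list):
            out.extend(flatten_fully(item))
        else:
            out.append(item)
    return out


# Hypothetical array values from the two records being compared
left = ["ann", "bob"]
right = ["anne", "rob"]

# One inner list of [a, b] pairs for each value on the left
pairs_by_left = [[[a, b] for b in right] for a in left]

print(flatten_one_level(pairs_by_left))  # list of pairs, ready for a distance function
print(flatten_fully(pairs_by_left))      # pair structure is lost
```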

Mainly to facilitate testing, this also allows ClickhouseAPI to correctly register pandas columns that are arrays of strings. For now, any other array columns will be coërced to arrays of strings (I think, though I haven't tested this).
For chdb this is currently less straightforward, as we rely on its native SELECT * FROM Python(input) rather than doing any manual processing, and this does not presently recognise array types.
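
As a rough sketch of the array handling described above for ClickhouseAPI, the snippet below detects list-valued columns and coerces their elements to strings. All function names are hypothetical, plain Python lists stand in for pandas column values, and treating a missing value as an empty array is an assumption; the actual registration code in this PR may differ.

```python
def is_array_column(values) -> bool:
    # Treat a column as an array column if every non-null value is a list or tuple
    non_null = [v for v in values if v is not None]
    return bool(non_null) and all(isinstance(v, (list, tuple)) for v in non_null)


def coerce_to_string_array(value):
    # Assumption: a missing value becomes an empty array
    if value is None:
        return []
    # Any element type is converted to its string form
    return [str(item) for item in value]


def clickhouse_type_for(values):
    # Returns the ClickHouse type name for array columns,
    # or None to signal that default handling should apply
    return "Array(String)" if is_array_column(values) else None


column = [["a", "b"], None, [1, 2.5]]
print(clickhouse_type_for(column))
print([coerce_to_string_array(v) for v in column])
```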

@ADBond ADBond changed the title from "Pairwse string comparison" to "Pairwise string comparison" Dec 12, 2024
@ADBond ADBond added the enhancement and comparisons labels Dec 12, 2024
@ADBond ADBond force-pushed the feature/pariwise-string-comparison branch from 94a513e to 06e53fe Compare December 16, 2024 09:44
@ADBond ADBond merged commit e8de8cd into main Dec 16, 2024
15 checks passed
@ADBond ADBond deleted the feature/pariwise-string-comparison branch December 16, 2024 09:59