Following Splink#2517 and Splink#2546, add Clickhouse-specific versions of `PairwiseStringDistanceFunctionLevel` and `PairwiseStringDistanceFunctionAtThresholds`.

These could not work using just the dialect markers introduced in the latter PR, because the SQL in Splink assumes that the lambda is passed as the second argument to functions. In Clickhouse, however, it must be passed as the first argument, and the parser fails if it appears anywhere else. This rules out options such as defining a UDF with the arguments switched, as the parser would still fail. Rather than resorting to more involved string manipulation, we simply re-implement the SQL in a way that is appropriate for the Clickhouse dialect.
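
To illustrate the argument-order problem, here is a minimal sketch in Python that builds the same element-wise transform for two dialects. The helper `transform_sql` is hypothetical and is not how Splink generates its SQL; it only assumes DuckDB's `list_transform(list, lambda)` and Clickhouse's `arrayMap(lambda, array)` signatures.

```python
def transform_sql(dialect: str, array_expr: str, var: str, body: str) -> str:
    """Build an element-wise array transform for the given dialect.

    Hypothetical helper for illustration only; not Splink's implementation.
    """
    lambda_expr = f"{var} -> {body}"
    if dialect == "clickhouse":
        # Clickhouse higher-order functions take the lambda as the first argument.
        return f"arrayMap({lambda_expr}, {array_expr})"
    if dialect == "duckdb":
        # DuckDB takes the list first and the lambda second.
        return f"list_transform({array_expr}, {lambda_expr})"
    raise ValueError(f"unsupported dialect: {dialect}")


clickhouse_sql = transform_sql("clickhouse", "names", "x", "lower(x)")
duckdb_sql = transform_sql("duckdb", "names", "x", "lower(x)")
```

Because the lambda moves to a different position, simply swapping the function name is not enough, which is why a separate SQL implementation is used here.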
This turned out to be doubly useful, as it is also not directly possible to unnest lists by a single level in Clickhouse, which requires a workaround (although, to be fair, this could still have been achieved with a UDF).
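
For clarity, the difference between unnesting by a single level and flattening completely can be sketched in plain Python. This illustrates the semantics only; it is not the Clickhouse workaround used in this PR, and the function names are made up for the example.

```python
from typing import Any


def flatten_one_level(nested: list[list[Any]]) -> list[Any]:
    """Remove exactly one level of nesting, keeping inner lists intact."""
    return [item for inner in nested for item in inner]


def flatten_fully(nested: list[Any]) -> list[Any]:
    """Recursively remove every level of nesting."""
    flat: list[Any] = []
    for item in nested:
        if isinstance(item, list):
            flat.extend(flatten_fully(item))
        else:
            flat.append(item)
    return flat


# A list of lists of string pairs (illustrative data).
grouped_pairs = [[["a", "b"], ["c", "d"]], [["e", "f"]]]

one_level = flatten_one_level(grouped_pairs)  # pairs are preserved
fully_flat = flatten_fully(grouped_pairs)     # pair structure is lost
```

Only the single-level version keeps each pair together, which is the behaviour needed when the inner lists are meaningful units.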
Mainly to facilitate testing, this also allows `ClickhouseAPI` to correctly register pandas columns that are arrays of strings. For now, any other array columns will be coerced to arrays of strings (I think, though I haven't tested this).

For `chdb` this is currently less straightforward, as we rely on its native `SELECT * FROM Python(input)` rather than doing any manual processing, and this does not presently recognise array types.
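
As a rough sketch of what coercing an array cell to an array of strings means, the snippet below converts each element with `str`. The function name and the treatment of missing cells as empty arrays are assumptions made for illustration; they are not taken from the `ClickhouseAPI` code.

```python
from typing import Any, Iterable, Optional


def coerce_to_string_array(cell: Optional[Iterable[Any]]) -> list[str]:
    """Coerce one array-valued cell to a list of strings.

    Hypothetical helper: a missing cell (None) becomes an empty list,
    which is an assumption rather than documented behaviour.
    """
    if cell is None:
        return []
    return [str(element) for element in cell]


string_cell = coerce_to_string_array(["ann", "bob"])
mixed_cell = coerce_to_string_array([1, 2.5, "a"])
missing_cell = coerce_to_string_array(None)
```

String arrays pass through unchanged, while numeric elements end up as their string representations.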