Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add comparison level validation check #1926

Conversation

ThomasHepworth
Copy link
Contributor

Type of PR

  • BUG
  • FEAT
  • MAINT
  • DOC

Is your Pull Request linked to an existing Issue or Pull Request?

Tags onto #1921 and addresses a "bug" that Chloe was getting in one of our pipeline runs.


Give a brief description for the solution you have provided

Previously, if a user attempted to use a comparison as a comparison level, they would be greeted with a "not JSON serializable" error:
Screenshot 2024-02-05 at 11 20 39
due to Comparison not being "dumpable" - https://github.com/moj-analytical-services/splink/blob/master/splink/validate_jsonschema.py#L65.

This is hard to interpret and debug for anyone unfamiliar with Splink.

This PR primarily aims to address this by adding an explicit validation check for invalid comparison levels (i.e. a level that is neither a dict or a ComparisonLevel). This check is performed within the check_comparison_level_types function, which simply loops through a Comparison and checks the data types within.

Any errors identified will be added to the ErrorLogger and printed to console once all settings have been checked.

Example of new breakdown with broken settings
from splink.duckdb.linker import DuckDBLinker
import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_level_library as cll
import splink.duckdb.comparison_template_library as ctl
from splink.duckdb.blocking_rule_library import block_on
from splink.datasets import splink_datasets

df = splink_datasets.fake_1000


def compare_postcodes(col_name: str):
    comparison_dict = {
        "output_column_name": col_name,
        "comparison_description": col_name + " distance",
        "comparison_levels": [
            cll.null_level(col_name),
            cl.exact_match(col_name, term_frequency_adjustments=False),
            "abcde",
            cll.else_level(),
        ],
    }

    return comparison_dict


settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on("first_name"),
        block_on("surname"),
    ],
    "comparisons": [
        {"comparison_levels": []},
        compare_postcodes("test"),
        cll.exact_match_level("email"),
        cll.exact_match_level("apple"),
    ],
}

linker = DuckDBLinker(df, settings)

Other Changes

  • The ErrorLogger class. This now has more descriptive method names, easier to interpret code and logs the errors in red text (more on this below).
  • All logging messages have been migrated to settings_validation_log_strings.py.
  • Removed sorted() method from evaluate_comparison_dialects and switched off SplinkDeprecation warnings unless requested by the user.

PR Checklist

  • Added documentation for changes
  • Added feature to example notebooks or tutorial (if appropriate)
  • Added tests (if appropriate)
  • Updated CHANGELOG.md (if appropriate)
  • Made changes based off the latest version of Splink
  • Run the linter

@ThomasHepworth ThomasHepworth requested a review from RobinL February 5, 2024 11:35
Copy link
Member

@RobinL RobinL left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is brilliant - great work!

@ThomasHepworth ThomasHepworth merged commit 670389e into update_load_settings_and_make_it_the_defacto_load_logic Feb 6, 2024
2 checks passed
@ThomasHepworth ThomasHepworth deleted the add_comparison_level_validation_check branch February 6, 2024 15:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants