Merge branch 'master' into refactor_ids_to_compare_creation

RobinL authored Dec 11, 2023
2 parents 24cd2e7 + 471525b commit 0b2f338
Showing 38 changed files with 1,772 additions and 1,427 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/pytest_postgres.yml
@@ -89,5 +89,5 @@ jobs:
- name: Run only postgres-marked tests
run: |
source .venv/bin/activate
pytest -v -m postgres_only tests/
pytest -v --durations=0 -m postgres_only tests/
2 changes: 1 addition & 1 deletion .github/workflows/pytest_run_tests_with_cache.yml
@@ -72,5 +72,5 @@ jobs:
- name: Run tests
run: |
source .venv/bin/activate
pytest tests/
pytest --durations=0 tests/
14 changes: 13 additions & 1 deletion CHANGELOG.md
@@ -11,6 +11,17 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Fixed

## [3.9.10] - 2023-12-07

### Changed

- Remove unused code from Athena linker ([#1775](https://github.com/moj-analytical-services/splink/pull/1775))
- Add argument for `register_udfs_automatically` ([#1774](https://github.com/moj-analytical-services/splink/pull/1774))

### Fixed

- Fixed issue with `_source_dataset_col` and `_source_dataset_input_column` ([#1731](https://github.com/moj-analytical-services/splink/pull/1731))
- Delete cached tables before resetting the cache ([#1752](https://github.com/moj-analytical-services/splink/pull/1752))

## [3.9.9] - 2023-11-14

@@ -46,6 +57,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Corrected path for Spark `.jar` file containing UDFs to work correctly for Spark < 3.0 ([#1622](https://github.com/moj-analytical-services/splink/pull/1622))
- Spark UDF `damerau_levensthein` is now only registered for Spark >= 3.0, as it is not compatible with earlier versions ([#1622](https://github.com/moj-analytical-services/splink/pull/1622))

[unreleased]: https://github.com/moj-analytical-services/splink/compare/3.9.9...HEAD
[unreleased]: https://github.com/moj-analytical-services/splink/compare/3.9.10...HEAD
[3.9.10]: https://github.com/moj-analytical-services/splink/compare/v3.9.9...3.9.10
[3.9.9]: https://github.com/moj-analytical-services/splink/compare/v3.9.8...3.9.9
[3.9.8]: https://github.com/moj-analytical-services/splink/compare/v3.9.7...v3.9.8
7 changes: 5 additions & 2 deletions README.md
@@ -166,13 +166,16 @@ To find the best place to ask a question, report a bug or get general advice, pl

## Awards

🥇 Analysis in Government Awards 2020: Innovative Methods: [Winner](https://www.gov.uk/government/news/launch-of-the-analysis-in-government-awards)
🥇 Analysis in Government Awards 2020: Innovative Methods - [Winner](https://www.gov.uk/government/news/launch-of-the-analysis-in-government-awards)

🥇 MoJ DASD Awards 2020: Innovation and Impact - Winner

🥇 Analysis in Government Awards 2022: People's Choice Award - [Winner](https://analysisfunction.civilservice.gov.uk/news/announcing-the-winner-of-the-first-analysis-in-government-peoples-choice-award/)

🥈 Analysis in Government Awards 2022: Innovative Methods [Runner up](https://twitter.com/gov_analysis/status/1616073633692274689?s=20&t=6TQyNLJRjnhsfJy28Zd6UQ)
🥈 Analysis in Government Awards 2022: Innovative Methods - [Runner up](https://twitter.com/gov_analysis/status/1616073633692274689?s=20&t=6TQyNLJRjnhsfJy28Zd6UQ)

🥈 Civil Service Awards 2023: Best Use of Data, Science, and Technology - [Runner up](https://www.civilserviceawards.com/best-use-of-data-science-and-technology-award-2/)


## Citation

2 changes: 1 addition & 1 deletion docs/blog/posts/2023-07-27-feature_update.md
@@ -1,5 +1,5 @@
---
date: 2022-07-27
date: 2023-07-27
authors:
- ross-k
- robin-l
119 changes: 119 additions & 0 deletions docs/blog/posts/2023-12-06-feature_update.md
@@ -0,0 +1,119 @@
---
date: 2023-12-06
authors:
- ross-k
categories:
- Feature Updates
---

# Splink Updates - December 2023

Welcome to the second installment of the Splink Blog!

Here are some of the highlights from the second half of 2023, and a taste of what is in store for 2024!

<!-- more -->

Latest Splink version: [v3.9.10](https://github.com/moj-analytical-services/splink/releases/tag/v3.9.10)

## :bar_chart: Charts Gallery

The Splink docs site now has a [Charts Gallery](../../charts/index.md) showing off all of the charts that come out of the box with Splink to make linking easier.

[![](../posts/img/charts_gallery.png){ width="400" }](../../charts/index.md)

Each chart now has an explanation of:

1. What the chart shows
2. How to interpret it
3. Actions to take as a result

This is the first step on a longer-term journey to provide more guidance on how to evaluate Splink models and linkages, so watch this space for more in the coming months!

## :chart_with_upwards_trend: New Charts

We are always adding more charts to Splink - to understand how these charts are built, see our new [Charts Developer Guide](../../dev_guides/charts/understanding_and_editing_charts.md).

Two of our latest additions are:

### :material-matrix: Confusion Matrix

When evaluating any classification model, a confusion matrix is a useful tool for summarizing performance by representing counts of true positive, true negative, false positive, and false negative predictions.

Splink now has its own [confusion matrix chart](../../charts/confusion_matrix_from_labels_table.ipynb) to show how model performance changes with a given match weight threshold.

[![](./img/confusion_matrix.png){ width="400" }](../../charts/confusion_matrix_from_labels_table.ipynb)

Note that labelled data is required to generate this chart.
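
As a rough sketch of how this looks in practice - the method names `register_labels_table` and `confusion_matrix_from_labels_table` are taken from the linked notebook, and `linker` / `labels_df` are assumed to be an already-trained linker and a pandas DataFrame of clerical labels:

```py
# Sketch only - assumes a trained `linker` and a labels DataFrame `labels_df`
# in the format expected by Splink's labelling tools; exact method names and
# signatures may vary slightly between Splink versions.
labels_table = linker.register_labels_table(labels_df)

# Plot counts of true/false positives and negatives as the match weight
# threshold is varied.
linker.confusion_matrix_from_labels_table(labels_table)
```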

### :material-table: Completeness Chart

When linking multiple datasets together, one of the most important factors for a successful linkage is the number of common fields across the datasets.

Splink now has the [completeness chart](../../charts/completeness_chart.ipynb), which gives a simple view of how well populated fields are across datasets.

[![](./img/completeness_chart.png)](../../charts/completeness_chart.ipynb)
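
As a rough sketch (assuming the `completeness_chart` linker method shown in the linked notebook, and two illustrative pandas DataFrames `df_left` and `df_right`):

```py
from splink.duckdb.linker import DuckDBLinker

# A linker can be created without a full settings dictionary for this kind of
# exploratory analysis.
linker = DuckDBLinker(
    [df_left, df_right],
    input_table_aliases=["dataset_a", "dataset_b"],  # illustrative names
)

# Show how well populated each column is in each input dataset
linker.completeness_chart()
```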


## :clipboard: Settings Validation

The [Settings dictionary](../../settings_dict_guide.md) is central to everything in Splink. It defines everything from the SQL dialect of your backend to how features are compared in a Splink model.

A common sticking point for users is how easy it is to make small errors when defining the Settings dictionary, and how unhelpful the resulting error messages can be.

To address this issue, the [Settings Validator](../../dev_guides/settings_validation/settings_validation_overview.md) provides clear, user-friendly feedback on what the issue is and how to fix it.
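
For context, here is a minimal sketch of the kind of settings dictionary being validated - the input DataFrame `df`, the column names and the comparison choices are purely illustrative:

```py
import splink.duckdb.comparison_library as cl
from splink.duckdb.linker import DuckDBLinker

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        "l.dob = r.dob",  # raw SQL blocking rules are accepted here
    ],
    "comparisons": [
        cl.exact_match("first_name"),
        cl.levenshtein_at_thresholds("surname", 2),
    ],
}

# A typo such as "blocking_rules_to_generate_prediction" (missing "s") is the
# kind of small error the Settings Validator now reports with a clear message.
linker = DuckDBLinker(df, settings)  # df: a pandas DataFrame of input records
```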


## :simple-adblock: Blocking Rule Library (Improved)

In our [previous blog](../posts/2023-07-27-feature_update.md) we introduced the Blocking Rule Library (BRL), built upon the `exact_match_rule` function. When testing this functionality we found it pretty verbose, particularly when blocking on multiple columns, so we figured we could do better. From Splink v3.9.6 we introduced the `block_on` function to supersede `exact_match_rule`.

For example, a block on `first_name` and `surname` now looks like:

```py
from splink.duckdb.blocking_rule_library import block_on
block_on(["first_name", "surname"])
```

as opposed to

```py
import splink.duckdb.blocking_rule_library as brl

brl.and_(
    brl.exact_match_rule("first_name"),
    brl.exact_match_rule("surname")
)
```

All of the [tutorials](../../demos/tutorials/03_Blocking.ipynb), [example notebooks](../../demos/examples/examples_index.md) and [API docs](../../blocking_rule_library.md) have been updated to use `block_on`.
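
In practice, `block_on` rules are usually passed straight into the settings dictionary - a minimal sketch with illustrative column names:

```py
from splink.duckdb.blocking_rule_library import block_on

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on(["first_name", "surname"]),  # exact match on both columns
        block_on(["dob"]),                    # fall-back rule on date of birth
    ],
}
```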

## :electric_plug: Backend Specific Installs

Some users have had difficulty installing Splink because of its additional dependencies, some of which may not be relevant to the backend they are using. To solve this, you can now install a minimal version of Splink for your chosen SQL engine.

For example, to install Splink purely for Spark, use the command:

```bash
pip install 'splink[spark]'
```

See the [Getting Started page](../../getting_started.md#backend-specific-installs) for further guidance.

## :no_entry_sign: Drop support for Python 3.7

From Splink 3.9.7, support has been dropped for Python 3.7. This decision was made to manage dependency clashes in Splink's backend.

If you are working with Python 3.7, please stick with Splink 3.9.6:

```bash
pip install splink==3.9.6
```

## :soon: What's in the pipeline?

* :four: Work on **Splink 4** is currently underway
* :material-thumbs-up-down: More guidance on how to evaluate Splink models and linkages




Binary file added docs/blog/posts/img/charts_gallery.png
Binary file added docs/blog/posts/img/completeness_chart.png
Binary file added docs/blog/posts/img/confusion_matrix.png
25 changes: 24 additions & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "splink"
version = "3.9.9"
version = "3.9.10"
description = "Fast probabilistic data linkage at scale"
authors = ["Robin Linacre <[email protected]>", "Sam Lindsay", "Theodore Manassis", "Tom Hepworth", "Andy Bond", "Ross Kennedy"]
license = "MIT"
@@ -59,6 +59,11 @@ optional = true
pytest-benchmark = "^4"
lzstring = "1.0.4"

[tool.poetry.group.typechecking]
optional = true
[tool.poetry.group.typechecking.dependencies]
mypy = "1.7.0"

[tool.poetry.group.demos]
[tool.poetry.group.demos.dependencies]
importlib-resources = "5.4.0"
@@ -118,3 +123,21 @@ markers = [
"sqlite",
"sqlite_only",
]

[tool.mypy]
packages = "splink"
# temporary exclusions
exclude = [
# modules getting substantial rewrites:
'.*comparison_imports\.py$',
'.*comparison.*library\.py',
'comparison_level_composition',
# modules with large number of errors
'.*linker\.py',
]
# for now at least allow implicit optionals
# to cut down on noise. Easy to fix.
implicit_optional = true
# for now, ignore missing imports
# can remove later and install stubs, where existent
ignore_missing_imports = true
2 changes: 1 addition & 1 deletion splink/__init__.py
@@ -1 +1 @@
__version__ = "3.9.9"
__version__ = "3.9.10"
5 changes: 4 additions & 1 deletion splink/athena/athena_helpers/athena_utils.py
@@ -1,5 +1,6 @@
import awswrangler as wr

from splink.exceptions import InvalidAWSBucketOrDatabase
from splink.misc import ensure_is_list
from splink.splink_dataframe import SplinkDataFrame

@@ -30,7 +31,9 @@ def _verify_athena_inputs(database, bucket, boto3_session):
if errors:
database_bucket_txt = " and ".join(errors)
do_does_grammar = ["does", "it"] if len(errors) == 1 else ["do", "them"]
raise Exception(athena_warning_text(database_bucket_txt, do_does_grammar))
raise InvalidAWSBucketOrDatabase(
athena_warning_text(database_bucket_txt, do_does_grammar)
)


def _garbage_collection(
85 changes: 56 additions & 29 deletions splink/athena/linker.py
@@ -2,6 +2,7 @@

import logging
import os
from typing import Union

import awswrangler as wr
import boto3
@@ -41,14 +42,29 @@ def columns(self):
def validate(self):
pass

def _drop_table_from_database(self, force_non_splink_table=False):
def _drop_table_from_database(
self, force_non_splink_table=False, delete_s3_data=True
):
# Check folder and table set for deletion
self._check_drop_folder_created_by_splink(force_non_splink_table)
self._check_drop_table_created_by_splink(force_non_splink_table)

# Delete the table from s3 and your database
self.linker._drop_table_from_database_if_exists(self.physical_name)
self.linker._delete_table_from_s3(self.physical_name)
table_deleted = self.linker._drop_table_from_database_if_exists(
self.physical_name
)
if delete_s3_data and table_deleted:
self.linker._delete_table_from_s3(self.physical_name)

def drop_table_from_database_and_remove_from_cache(
self,
force_non_splink_table=False,
delete_s3_data=True,
):
self._drop_table_from_database(
force_non_splink_table=force_non_splink_table, delete_s3_data=delete_s3_data
)
self.linker._remove_splinkdataframe_from_cache(self)

def _check_drop_folder_created_by_splink(self, force_non_splink_table=False):
filepath = self.linker.s3_output
@@ -443,7 +459,7 @@ def _extract_ctas_metadata(self, ctas_metadata):
def drop_all_tables_created_by_splink(
self,
delete_s3_folders=True,
tables_to_exclude=[],
tables_to_exclude: list[Union[SplinkDataFrame, str]] = [],
):
"""Run a cleanup process for the tables created by splink and
currently contained in your output database.
@@ -455,10 +471,15 @@
backing data contained on s3. If False, the tables created
by splink will be removed from your database, but the parquet
outputs will remain on s3. Defaults to True.
tables_to_exclude (list, optional): A list of input tables you wish to
add to an ignore list. These will not be removed during garbage
collection.
tables_to_exclude (list[SplinkDataFrame | str], optional): A list
of input tables you wish to add to an ignore list. These
will not be removed during garbage collection.
"""
# Run cleanup on the cache before checking the db
self.drop_tables_in_current_splink_run(
delete_s3_folders,
tables_to_exclude,
)
_garbage_collection(
self.output_schema,
self.boto3_session,
@@ -470,7 +491,7 @@ def drop_splink_tables_from_database(
self,
database_name: str,
delete_s3_folders: bool = True,
tables_to_exclude: list = [],
tables_to_exclude: list[Union[SplinkDataFrame, str]] = [],
):
"""Run a cleanup process for the tables created by splink
in a specified database.
@@ -483,9 +504,9 @@
backing data contained on s3. If False, the tables created
by splink will be removed from your database, but the parquet
outputs will remain on s3. Defaults to True.
tables_to_exclude (list, optional): A list of input tables you wish to
add to an ignore list. These will not be removed during garbage
collection.
tables_to_exclude (list[SplinkDataFrame | str], optional): A list
of input tables you wish to add to an ignore list. These
will not be removed during garbage collection.
"""
_garbage_collection(
database_name,
@@ -497,7 +518,7 @@ def drop_splink_tables_from_database(
def drop_tables_in_current_splink_run(
self,
delete_s3_folders: bool = True,
tables_to_exclude: list = [],
tables_to_exclude: list[Union[SplinkDataFrame, str]] = [],
):
"""Run a cleanup process for the tables created
by the current splink linker.
@@ -510,25 +531,31 @@ def drop_tables_in_current_splink_run(
backing data contained on s3. If False, the tables created
by splink will be removed from your database, but the parquet
outputs will remain on s3. Defaults to True.
tables_to_exclude (list, optional): A list of input tables you wish to
add to an ignore list. These will not be removed during garbage
collection.
tables_to_exclude (list[SplinkDataFrame | str], optional): A list
of input tables you wish to add to an ignore list. These
will not be removed during garbage collection.
"""

tables_to_exclude = ensure_is_list(tables_to_exclude)
tables_to_exclude = [
tables_to_exclude = {
df.physical_name if isinstance(df, SplinkDataFrame) else df
for df in tables_to_exclude
]
}

# Exclude tables that the user doesn't want to delete
tables = self._names_of_tables_created_by_splink.copy()
tables = [t for t in tables if t not in tables_to_exclude]

for table in tables:
_garbage_collection(
self.output_schema,
self.boto3_session,
delete_s3_folders,
name_prefix=table,
)
# pop from our tables created by splink list
self._names_of_tables_created_by_splink.remove(table)
cached_tables = self._intermediate_table_cache

# Loop through our cached tables and delete all those not in our exclusion
# list.
for splink_df in list(cached_tables.values()):
if (splink_df.physical_name not in tables_to_exclude) and (
splink_df.templated_name not in tables_to_exclude
):
splink_df.drop_table_from_database_and_remove_from_cache(
force_non_splink_table=False, delete_s3_data=delete_s3_folders
)
# As our cache contains duplicate term frequency tables and AWSwrangler
# run deletions asynchronously, add any previously seen tables to the
# list of tables to exclude from deletion.
# This prevents attempts to delete a table that has already been purged.
tables_to_exclude.add(splink_df.physical_name)
