Merge branch 'master' into refactor_ids_to_compare_creation

RobinL authored Dec 11, 2023
2 parents 24cd2e7 + 471525b commit 0b2f338
Showing 38 changed files with 1,772 additions and 1,427 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/pytest_postgres.yml
@@ -89,5 +89,5 @@ jobs:
- name: Run only postgres-marked tests
run: |
source .venv/bin/activate
pytest -v -m postgres_only tests/
pytest -v --durations=0 -m postgres_only tests/
2 changes: 1 addition & 1 deletion .github/workflows/pytest_run_tests_with_cache.yml
@@ -72,5 +72,5 @@ jobs:
- name: Run tests
run: |
source .venv/bin/activate
pytest tests/
pytest --durations=0 tests/
14 changes: 13 additions & 1 deletion CHANGELOG.md
@@ -11,6 +11,17 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Fixed

## [3.9.10] - 2023-12-07

### Changed

- Remove unused code from Athena linker ([#1775](https://github.com/moj-analytical-services/splink/pull/1775))
- Add argument for `register_udfs_automatically` ([#1774](https://github.com/moj-analytical-services/splink/pull/1774))

### Fixed

- Fixed issue with `_source_dataset_col` and `_source_dataset_input_column` ([#1731](https://github.com/moj-analytical-services/splink/pull/1731))
- Delete cached tables before resetting the cache ([#1752](https://github.com/moj-analytical-services/splink/pull/1752))

## [3.9.9] - 2023-11-14

@@ -46,6 +57,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Corrected path for Spark `.jar` file containing UDFs to work correctly for Spark < 3.0 ([#1622](https://github.com/moj-analytical-services/splink/pull/1622))
- Spark UDF `damerau_levensthein` is now only registered for Spark >= 3.0, as it is not compatible with earlier versions ([#1622](https://github.com/moj-analytical-services/splink/pull/1622))

[unreleased]: https://github.com/moj-analytical-services/splink/compare/3.9.9...HEAD
[unreleased]: https://github.com/moj-analytical-services/splink/compare/3.9.10...HEAD
[3.9.10]: https://github.com/moj-analytical-services/splink/compare/v3.9.9...3.9.10
[3.9.9]: https://github.com/moj-analytical-services/splink/compare/v3.9.8...3.9.9
[3.9.8]: https://github.com/moj-analytical-services/splink/compare/v3.9.7...v3.9.8
7 changes: 5 additions & 2 deletions README.md
@@ -166,13 +166,16 @@ To find the best place to ask a question, report a bug or get general advice, pl

## Awards

🥇 Analysis in Government Awards 2020: Innovative Methods: [Winner](https://www.gov.uk/government/news/launch-of-the-analysis-in-government-awards)
🥇 Analysis in Government Awards 2020: Innovative Methods - [Winner](https://www.gov.uk/government/news/launch-of-the-analysis-in-government-awards)

🥇 MoJ DASD Awards 2020: Innovation and Impact - Winner

🥇 Analysis in Government Awards 2022: People's Choice Award - [Winner](https://analysisfunction.civilservice.gov.uk/news/announcing-the-winner-of-the-first-analysis-in-government-peoples-choice-award/)

🥈 Analysis in Government Awards 2022: Innovative Methods [Runner up](https://twitter.com/gov_analysis/status/1616073633692274689?s=20&t=6TQyNLJRjnhsfJy28Zd6UQ)
🥈 Analysis in Government Awards 2022: Innovative Methods - [Runner up](https://twitter.com/gov_analysis/status/1616073633692274689?s=20&t=6TQyNLJRjnhsfJy28Zd6UQ)

🥈 Civil Service Awards 2023: Best Use of Data, Science, and Technology - [Runner up](https://www.civilserviceawards.com/best-use-of-data-science-and-technology-award-2/)


## Citation

2 changes: 1 addition & 1 deletion docs/blog/posts/2023-07-27-feature_update.md
@@ -1,5 +1,5 @@
---
date: 2022-07-27
date: 2023-07-27
authors:
- ross-k
- robin-l
119 changes: 119 additions & 0 deletions docs/blog/posts/2023-12-06-feature_update.md
@@ -0,0 +1,119 @@
---
date: 2023-12-06
authors:
- ross-k
categories:
- Feature Updates
---

# Splink Updates - December 2023

Welcome to the second installment of the Splink Blog!

Here are some of the highlights from the second half of 2023, and a taste of what is in store for 2024!

<!-- more -->

Latest Splink version: [v3.9.10](https://github.com/moj-analytical-services/splink/releases/tag/v3.9.10)

## :bar_chart: Charts Gallery

The Splink docs site now has a [Charts Gallery](../../charts/index.md) showing off all of the charts that come out of the box with Splink to make linking easier.

[![](../posts/img/charts_gallery.png){ width="400" }](../../charts/index.md)

Each chart now has an explanation of:

1. What the chart shows
2. How to interpret it
3. Actions to take as a result

This is the first step on a longer-term journey to provide more guidance on how to evaluate Splink models and linkages, so watch this space for more in the coming months!

## :chart_with_upwards_trend: New Charts

We are always adding more charts to Splink - to understand how these charts are built, see our new [Charts Developer Guide](../../dev_guides/charts/understanding_and_editing_charts.md).

Two of our latest additions are:

### :material-matrix: Confusion Matrix

When evaluating any classification model, a confusion matrix is a useful tool for summarizing performance by representing counts of true positive, true negative, false positive, and false negative predictions.

Splink now has its own [confusion matrix chart](../../charts/confusion_matrix_from_labels_table.ipynb) to show how model performance changes with a given match weight threshold.

[![](./img/confusion_matrix.png){ width="400" }](../../charts/confusion_matrix_from_labels_table.ipynb)

Note that labelled data is required to generate this chart.
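
As a rough sketch of how this looks in practice - the method names `register_labels_table` and `confusion_matrix_from_labels_table` are taken from the linked notebook, and `linker` / `labels_df` are assumed to be an already-trained linker and a pandas DataFrame of clerical labels:

```py
# Sketch only - assumes a trained `linker` and a labels DataFrame `labels_df`
# in the format expected by Splink's labelling tools; exact method names and
# signatures may vary slightly between Splink versions.
labels_table = linker.register_labels_table(labels_df)

# Plot counts of true/false positives and negatives as the match weight
# threshold is varied.
linker.confusion_matrix_from_labels_table(labels_table)
```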

### :material-table: Completeness Chart

When linking multiple datasets together, one of the most important factors for a successful linkage is the number of common fields across the datasets.

Splink now has the [completeness chart](../../charts/completeness_chart.ipynb), which gives a simple view of how well populated fields are across datasets.

[![](./img/completeness_chart.png)](../../charts/completeness_chart.ipynb)
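
As a rough sketch (assuming the `completeness_chart` linker method shown in the linked notebook, and two illustrative pandas DataFrames `df_left` and `df_right`):

```py
from splink.duckdb.linker import DuckDBLinker

# A linker can be created without a full settings dictionary for this kind of
# exploratory analysis.
linker = DuckDBLinker(
    [df_left, df_right],
    input_table_aliases=["dataset_a", "dataset_b"],  # illustrative names
)

# Show how well populated each column is in each input dataset
linker.completeness_chart()
```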


## :clipboard: Settings Validation

The [Settings dictionary](../../settings_dict_guide.md) is central to everything in Splink. It defines everything from the SQL dialect of your backend to how features are compared in a Splink model.

A common sticking point for users is how easy it is to make small errors when defining the Settings dictionary, and how unhelpful the resulting error messages can be.

To address this issue, the [Settings Validator](../../dev_guides/settings_validation/settings_validation_overview.md) provides clear, user-friendly feedback on what the issue is and how to fix it.
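
For context, here is a minimal sketch of the kind of settings dictionary being validated - the input DataFrame `df`, the column names and the comparison choices are purely illustrative:

```py
import splink.duckdb.comparison_library as cl
from splink.duckdb.linker import DuckDBLinker

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        "l.dob = r.dob",  # raw SQL blocking rules are accepted here
    ],
    "comparisons": [
        cl.exact_match("first_name"),
        cl.levenshtein_at_thresholds("surname", 2),
    ],
}

# A typo such as "blocking_rules_to_generate_prediction" (missing "s") is the
# kind of small error the Settings Validator now reports with a clear message.
linker = DuckDBLinker(df, settings)  # df: a pandas DataFrame of input records
```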


## :simple-adblock: Blocking Rule Library (Improved)

In our [previous blog](../posts/2023-07-27-feature_update.md) we introduced the Blocking Rule Library (BRL), built upon the `exact_match_rule` function. When testing this functionality we found it pretty verbose, particularly when blocking on multiple columns, so we figured we could do better. From Splink v3.9.6 we introduced the `block_on` function to supersede `exact_match_rule`.

For example, a block on `first_name` and `surname` now looks like:

```py
from splink.duckdb.blocking_rule_library import block_on
block_on(["first_name", "surname"])
```

as opposed to

```py
import splink.duckdb.blocking_rule_library as brl

brl.and_(
    brl.exact_match_rule("first_name"),
    brl.exact_match_rule("surname")
)
```

All of the [tutorials](../../demos/tutorials/03_Blocking.ipynb), [example notebooks](../../demos/examples/examples_index.md) and [API docs](../../blocking_rule_library.md) have been updated to use `block_on`.
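
In practice, `block_on` rules are usually passed straight into the settings dictionary - a minimal sketch with illustrative column names:

```py
from splink.duckdb.blocking_rule_library import block_on

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on(["first_name", "surname"]),  # exact match on both columns
        block_on(["dob"]),                    # fall-back rule on date of birth
    ],
}
```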

## :electric_plug: Backend Specific Installs

Some users have had difficulty installing Splink because of its additional dependencies, some of which may not be relevant to the backend they are using. To solve this, you can now install a minimal version of Splink for your chosen SQL engine.

For example, to install Splink purely for Spark, use the command:

```bash
pip install 'splink[spark]'
```

See the [Getting Started page](../../getting_started.md#backend-specific-installs) for further guidance.

## :no_entry_sign: Drop support for Python 3.7

From Splink 3.9.7, support has been dropped for Python 3.7. This decision was made to manage dependency clashes in Splink's backend.

If you are working with Python 3.7, please stick with Splink 3.9.6:

```bash
pip install splink==3.9.6
```

## :soon: What's in the pipeline?

* :four: Work on **Splink 4** is currently underway
* :material-thumbs-up-down: More guidance on how to evaluate Splink models and linkages




Binary file added docs/blog/posts/img/charts_gallery.png
Binary file added docs/blog/posts/img/completeness_chart.png
Binary file added docs/blog/posts/img/confusion_matrix.png
25 changes: 24 additions & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "splink"
version = "3.9.9"
version = "3.9.10"
description = "Fast probabilistic data linkage at scale"
authors = ["Robin Linacre <[email protected]>", "Sam Lindsay", "Theodore Manassis", "Tom Hepworth", "Andy Bond", "Ross Kennedy"]
license = "MIT"
@@ -59,6 +59,11 @@ optional = true
pytest-benchmark = "^4"
lzstring = "1.0.4"

[tool.poetry.group.typechecking]
optional = true
[tool.poetry.group.typechecking.dependencies]
mypy = "1.7.0"

[tool.poetry.group.demos]
[tool.poetry.group.demos.dependencies]
importlib-resources = "5.4.0"
@@ -118,3 +123,21 @@ markers = [
"sqlite",
"sqlite_only",
]

[tool.mypy]
packages = "splink"
# temporary exclusions
exclude = [
# modules getting substantial rewrites:
'.*comparison_imports\.py$',
'.*comparison.*library\.py',
'comparison_level_composition',
# modules with large number of errors
'.*linker\.py',
]
# for now at least allow implicit optionals
# to cut down on noise. Easy to fix.
implicit_optional = true
# for now, ignore missing imports
# can remove later and install stubs, where existent
ignore_missing_imports = true
2 changes: 1 addition & 1 deletion splink/__init__.py
@@ -1 +1 @@
__version__ = "3.9.9"
__version__ = "3.9.10"
5 changes: 4 additions & 1 deletion splink/athena/athena_helpers/athena_utils.py
@@ -1,5 +1,6 @@
import awswrangler as wr

from splink.exceptions import InvalidAWSBucketOrDatabase
from splink.misc import ensure_is_list
from splink.splink_dataframe import SplinkDataFrame

@@ -30,7 +31,9 @@ def _verify_athena_inputs(database, bucket, boto3_session):
if errors:
database_bucket_txt = " and ".join(errors)
do_does_grammar = ["does", "it"] if len(errors) == 1 else ["do", "them"]
raise Exception(athena_warning_text(database_bucket_txt, do_does_grammar))
raise InvalidAWSBucketOrDatabase(
athena_warning_text(database_bucket_txt, do_does_grammar)
)


def _garbage_collection(
85 changes: 56 additions & 29 deletions splink/athena/linker.py
@@ -2,6 +2,7 @@

import logging
import os
from typing import Union

import awswrangler as wr
import boto3
@@ -41,14 +42,29 @@ def columns(self):
def validate(self):
pass

def _drop_table_from_database(self, force_non_splink_table=False):
def _drop_table_from_database(
self, force_non_splink_table=False, delete_s3_data=True
):
# Check folder and table set for deletion
self._check_drop_folder_created_by_splink(force_non_splink_table)
self._check_drop_table_created_by_splink(force_non_splink_table)

# Delete the table from s3 and your database
self.linker._drop_table_from_database_if_exists(self.physical_name)
self.linker._delete_table_from_s3(self.physical_name)
table_deleted = self.linker._drop_table_from_database_if_exists(
self.physical_name
)
if delete_s3_data and table_deleted:
self.linker._delete_table_from_s3(self.physical_name)

def drop_table_from_database_and_remove_from_cache(
self,
force_non_splink_table=False,
delete_s3_data=True,
):
self._drop_table_from_database(
force_non_splink_table=force_non_splink_table, delete_s3_data=delete_s3_data
)
self.linker._remove_splinkdataframe_from_cache(self)

def _check_drop_folder_created_by_splink(self, force_non_splink_table=False):
filepath = self.linker.s3_output
@@ -443,7 +459,7 @@ def _extract_ctas_metadata(self, ctas_metadata):
def drop_all_tables_created_by_splink(
self,
delete_s3_folders=True,
tables_to_exclude=[],
tables_to_exclude: list[Union[SplinkDataFrame, str]] = [],
):
"""Run a cleanup process for the tables created by splink and
currently contained in your output database.
@@ -455,10 +471,15 @@
backing data contained on s3. If False, the tables created
by splink will be removed from your database, but the parquet
outputs will remain on s3. Defaults to True.
tables_to_exclude (list, optional): A list of input tables you wish to
add to an ignore list. These will not be removed during garbage
collection.
tables_to_exclude (list[SplinkDataFrame | str], optional): A list
of input tables you wish to add to an ignore list. These
will not be removed during garbage collection.
"""
# Run cleanup on the cache before checking the db
self.drop_tables_in_current_splink_run(
delete_s3_folders,
tables_to_exclude,
)
_garbage_collection(
self.output_schema,
self.boto3_session,
@@ -470,7 +491,7 @@ def drop_splink_tables_from_database(
self,
database_name: str,
delete_s3_folders: bool = True,
tables_to_exclude: list = [],
tables_to_exclude: list[Union[SplinkDataFrame, str]] = [],
):
"""Run a cleanup process for the tables created by splink
in a specified database.
@@ -483,9 +504,9 @@
backing data contained on s3. If False, the tables created
by splink will be removed from your database, but the parquet
outputs will remain on s3. Defaults to True.
tables_to_exclude (list, optional): A list of input tables you wish to
add to an ignore list. These will not be removed during garbage
collection.
tables_to_exclude (list[SplinkDataFrame | str], optional): A list
of input tables you wish to add to an ignore list. These
will not be removed during garbage collection.
"""
_garbage_collection(
database_name,
@@ -497,7 +518,7 @@ def drop_splink_tables_from_database(
def drop_tables_in_current_splink_run(
self,
delete_s3_folders: bool = True,
tables_to_exclude: list = [],
tables_to_exclude: list[Union[SplinkDataFrame, str]] = [],
):
"""Run a cleanup process for the tables created
by the current splink linker.
@@ -510,25 +531,31 @@ def drop_tables_in_current_splink_run(
backing data contained on s3. If False, the tables created
by splink will be removed from your database, but the parquet
outputs will remain on s3. Defaults to True.
tables_to_exclude (list, optional): A list of input tables you wish to
add to an ignore list. These will not be removed during garbage
collection.
tables_to_exclude (list[SplinkDataFrame | str], optional): A list
of input tables you wish to add to an ignore list. These
will not be removed during garbage collection.
"""

tables_to_exclude = ensure_is_list(tables_to_exclude)
tables_to_exclude = [
tables_to_exclude = {
df.physical_name if isinstance(df, SplinkDataFrame) else df
for df in tables_to_exclude
]
}

# Exclude tables that the user doesn't want to delete
tables = self._names_of_tables_created_by_splink.copy()
tables = [t for t in tables if t not in tables_to_exclude]

for table in tables:
_garbage_collection(
self.output_schema,
self.boto3_session,
delete_s3_folders,
name_prefix=table,
)
# pop from our tables created by splink list
self._names_of_tables_created_by_splink.remove(table)
cached_tables = self._intermediate_table_cache

# Loop through our cached tables and delete all those not in our exclusion
# list.
for splink_df in list(cached_tables.values()):
if (splink_df.physical_name not in tables_to_exclude) and (
splink_df.templated_name not in tables_to_exclude
):
splink_df.drop_table_from_database_and_remove_from_cache(
force_non_splink_table=False, delete_s3_data=delete_s3_folders
)
# As our cache contains duplicate term frequency tables and AWSwrangler
# run deletions asynchronously, add any previously seen tables to the
# list of tables to exclude from deletion.
# This prevents attempts to delete a table that has already been purged.
tables_to_exclude.add(splink_df.physical_name)
