Merge branch 'master' into refactor_ids_to_compare_creation
RobinL committed Jan 17, 2024
2 parents 9ef61f1 + f128871 commit 1ece5d7
Showing 45 changed files with 663 additions and 150 deletions.
1 change: 1 addition & 0 deletions .github/workflows/auto_update_script_contents.yml
@@ -31,6 +31,7 @@ jobs:
- name: Install Poetry
uses: snok/install-poetry@v1
with:
version: '1.7.0'
virtualenvs-create: true
virtualenvs-in-project: true
installer-parallel: true
9 changes: 7 additions & 2 deletions .github/workflows/autoblack.yml
@@ -1,5 +1,9 @@
name: autoblack
on: [pull_request]

env:
PYTHON_VERSION: "3.12.1"

jobs:
build:
runs-on: ubuntu-latest
@@ -9,10 +13,10 @@ jobs:
with:
ref: ${{ github.event.pull_request.head.ref }}
repository: ${{ github.event.pull_request.head.repo.full_name }}
- name: Set up Python ${{ matrix.python-version }}
- name: Set up Python ${{ env.PYTHON_VERSION }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
python-version: ${{ env.PYTHON_VERSION }}

- name: Load cached Poetry installation
uses: actions/cache@v2
@@ -22,6 +26,7 @@ jobs:
- name: Install Poetry
uses: snok/install-poetry@v1
with:
version: '1.7.0'
virtualenvs-create: true
virtualenvs-in-project: true
installer-parallel: true
1 change: 1 addition & 0 deletions .github/workflows/documentation.yml
@@ -39,6 +39,7 @@ jobs:
- name: Install Poetry
uses: snok/install-poetry@v1
with:
version: '1.7.0'
virtualenvs-create: true
virtualenvs-in-project: true
installer-parallel: true
12 changes: 6 additions & 6 deletions .github/workflows/lint.yml
@@ -1,20 +1,19 @@
name: Lint
on: [pull_request]

env:
PYTHON_VERSION: "3.12.1"

jobs:
build:
runs-on: ubuntu-latest
strategy:
max-parallel: 4
matrix:
python-version: [3.8]

steps:
- uses: actions/checkout@v3
- name: Set up Python ${{ matrix.python-version }}
- name: Set up Python ${{ env.PYTHON_VERSION }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
python-version: ${{ env.PYTHON_VERSION }}

- name: Load cached Poetry installation
uses: actions/cache@v2
@@ -24,6 +23,7 @@ jobs:
- name: Install Poetry
uses: snok/install-poetry@v1
with:
version: '1.7.0'
virtualenvs-create: true
virtualenvs-in-project: true
installer-parallel: true
1 change: 0 additions & 1 deletion .github/workflows/poetry_pypi_release.yml
@@ -11,7 +11,6 @@ jobs:
- uses: actions/checkout@v3
with:
ref: master
token: ${{ secrets.SPLINK_TOKEN }}
- name: Install poetry
run: pipx install poetry
- uses: actions/setup-python@v4
1 change: 1 addition & 0 deletions .github/workflows/pytest_benchmark_comment.yml
@@ -37,6 +37,7 @@ jobs:
- name: Install Poetry
uses: snok/install-poetry@v1
with:
version: '1.7.0'
virtualenvs-create: true
virtualenvs-in-project: true
installer-parallel: true
1 change: 1 addition & 0 deletions .github/workflows/pytest_benchmark_commit.yml
@@ -39,6 +39,7 @@ jobs:
- name: Install Poetry
uses: snok/install-poetry@v1
with:
version: '1.7.0'
virtualenvs-create: true
virtualenvs-in-project: true
installer-parallel: true
1 change: 1 addition & 0 deletions .github/workflows/run_demos_examples.yml
@@ -41,6 +41,7 @@ jobs:
- name: Install Poetry
uses: snok/install-poetry@v1
with:
version: '1.7.0'
virtualenvs-create: true
virtualenvs-in-project: true
installer-parallel: true
1 change: 1 addition & 0 deletions .github/workflows/run_demos_tutorials.yml
@@ -40,6 +40,7 @@ jobs:
- name: Install Poetry
uses: snok/install-poetry@v1
with:
version: '1.7.0'
virtualenvs-create: true
virtualenvs-in-project: true
installer-parallel: true
6 changes: 5 additions & 1 deletion CHANGELOG.md
@@ -9,8 +9,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Changed

- Splink now fully parallelises data linkage when using DuckDB ([#1796](https://github.com/moj-analytical-services/splink/pull/1796))

### Fixed

- Allow salting in EM training ([#1832](https://github.com/moj-analytical-services/splink/pull/1832))

## [3.9.10] - 2023-12-07

### Changed
@@ -20,7 +24,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Fixed

- Fixed issue with `_source_dataset_col` and `_source_dataset_input_column` ([#1731](https://github.com/moj-analytical-services/splink/pull/1731))
- Fixed issue with `_source_dataset_col` and `_source_dataset_input_column` ([#1731](https://github.com/moj-analytical-services/splink/pull/1731))
- Delete cached tables before resetting the cache ([#1752](https://github.com/moj-analytical-services/splink/pull/1752)

## [3.9.9] - 2023-11-14
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -18,7 +18,7 @@ Contributions to Splink are not limited to the code. Feedback and input on our d

Behind the scenes, the Splink documentation is split into 2 parts:

- The [Tutorials](./docs/demos/00_Tutorial_Introduction.ipynb) and [Example Notebooks](./docs/examples_index.md) are stored in a separate repo - [splink_demos](https://github.com/moj-analytical-services/splink_demos)
- The [Tutorials](./docs/demos/tutorials/00_Tutorial_Introduction.ipynb) and [Example Notebooks](./docs/demos/examples/examples_index.md) are stored in a separate repo - [splink_demos](https://github.com/moj-analytical-services/splink_demos)
- Everything else is stored in the Splink repo either in:
- the [docs folder](https://github.com/moj-analytical-services/splink/tree/master/docs)
- the Splink code itself. E.g. docstrings from [linker.py](https://github.com/moj-analytical-services/splink/blob/master/splink/linker.py) feed directly into the [Linker API docs](./docs/linker.md).
2 changes: 1 addition & 1 deletion README.md
@@ -109,7 +109,7 @@ Should you require a more bare-bones version of Splink **without DuckDB**, pleas

The following code demonstrates how to estimate the parameters of a deduplication model, use it to identify duplicate records, and then use clustering to generate an estimated unique person ID.

For more detailed tutorial, please see [here](https://moj-analytical-services.github.io/splink/demos/00_Tutorial_Introduction.html).
For more detailed tutorial, please see [here](https://moj-analytical-services.github.io/splink/demos/tutorials/00_Tutorial_Introduction.html).

```py
from splink.duckdb.linker import DuckDBLinker
4 changes: 2 additions & 2 deletions docs/blocking_rule_library.md
@@ -6,9 +6,9 @@ toc_depth: 2
---
# Documentation for `blocking_rules_library`

The `blocking_rules_library` contains a series of pre-made blocking rules available for use in the construction of blocking rule strategies and em training blocks [as described in this topic guide](./topic_guides/drivers_of_performance.html#blocking-rules).
The `blocking_rules_library` contains a series of pre-made blocking rules available for use in the construction of blocking rule strategies and em training blocks [as described in this topic guide](./topic_guides/blocking/blocking_rules.md).

These conform to a more performant standard that is outlined in detail [here](./topic_guides/drivers_of_performance.html#blocking-rules).
These conform to a more performant standard that is outlined in detail [here](./topic_guides/performance/drivers_of_performance.html#blocking-rules).


The detailed API for each of these are outlined below.
17 changes: 9 additions & 8 deletions docs/blog/.authors.yml
@@ -1,9 +1,10 @@
robin-l:
name: Robin Linacre
description: Creator
avatar: https://github.com/robinl.png
authors:
robin-l:
name: Robin Linacre
description: Creator
avatar: https://github.com/robinl.png

ross-k:
name: Ross Kennedy
description: Maintainer
avatar: https://github.com/rossken.png
ross-k:
name: Ross Kennedy
description: Maintainer
avatar: https://github.com/rossken.png
2 changes: 1 addition & 1 deletion docs/charts/profile_columns.ipynb
@@ -309,7 +309,7 @@
"\n",
"To take this skew into account, we can build Splink models with **Term Frequency Adjustments**. These adjustments will increase the amount of evidence for rare matching values and reduce the amount of evidence for common matching values.\n",
"\n",
"To understand how these work in more detail, check out the [Term Frequency Adjustments Topic Guide](../comparisons/term-frequency.md)\n",
"To understand how these work in more detail, check out the [Term Frequency Adjustments Topic Guide](../topic_guides/comparisons/term-frequency.md)\n",
"\n",
"<hr>"
]
2 changes: 1 addition & 1 deletion docs/comparison_helpers.md
@@ -5,7 +5,7 @@ tags:
---
# Documentation for `comparison_helpers` functions

The `comparison_helpers` functions are a set of functions to help users create better comparisons by helping them understand [string comparators](./topic_guides/choosing_comparators.ipynb#comparing-string-similarity-and-distance-scores) (fuzzy matching) and [phonetic matching](./topic_guides/choosing_comparators.ipynb#phonetic-matching).
The `comparison_helpers` functions are a set of functions to help users create better comparisons by helping them understand [string comparators](./topic_guides/choosing_comparators.ipynb#comparing-string-similarity-and-distance-scores) (fuzzy matching) and [phonetic matching](./topic_guides/comparisons/choosing_comparators.ipynb#phonetic-matching).

The detailed API for each of these are outlined below.

4 changes: 2 additions & 2 deletions docs/comparison_level_library.md
@@ -16,8 +16,8 @@ toc_depth: 2
# Documentation for `comparison_level_library`

The `comparison_level_library` contains pre-made comparison levels available for use to
construct custom comparisons [as described in this topic guide](./topic_guides/customising_comparisons.html#method-3-comparisonlevels).
However, not every comparison level is available for every [Splink-compatible SQL backend](./topic_guides/backends.html).
construct custom comparisons [as described in this topic guide](./topic_guides/comparisons/customising_comparisons.html#method-3-comparisonlevels).
However, not every comparison level is available for every [Splink-compatible SQL backend](./topic_guides/splink_fundamentals/backends.html).

The pre-made Splink comparison levels available for each SQL dialect are as given in this table:

4 changes: 2 additions & 2 deletions docs/comparison_library.md
@@ -12,8 +12,8 @@ toc_depth: 2
---
# Documentation for `comparison_library`

The `comparison_library` contains pre-made comparisons available for use directly [as described in this topic guide](./topic_guides/customising_comparisons.html#method-1-using-the-comparisonlibrary).
However, not every comparison is available for every [Splink-compatible SQL backend](./topic_guides/backends.html).
The `comparison_library` contains pre-made comparisons available for use directly [as described in this topic guide](./topic_guides/comparisons/customising_comparisons.html#method-1-using-the-comparisonlibrary).
However, not every comparison is available for every [Splink-compatible SQL backend](./topic_guides/splink_fundamentals/backends/backends.html).

The pre-made Splink comparisons available for each SQL dialect are as given in this table:

4 changes: 2 additions & 2 deletions docs/comparison_template_library.md
@@ -8,8 +8,8 @@ toc_depth: 2

# Documentation for `comparison_template_library`

The `comparison_template_library` contains pre-made comparisons with pre-defined parameters available for use directly [as described in this topic guide](./topic_guides/customising_comparisons.html#method-2-using-the-comparisontemplatelibrary).
However, not every comparison is available for every [Splink-compatible SQL backend](./topic_guides/backends.html). More detail on creating comparisons for specific data types is also [included in the topic guide.](./topic_guides/customising_comparisons.html#creating-comparisons-for-specific-data-types)
The `comparison_template_library` contains pre-made comparisons with pre-defined parameters available for use directly [as described in this topic guide](./topic_guides/comparisons/customising_comparisons.html#method-2-using-the-comparisontemplatelibrary).
However, not every comparison is available for every [Splink-compatible SQL backend](./topic_guides/splink_fundamentals/backends/backends.html). More detail on creating comparisons for specific data types is also [included in the topic guide.](./topic_guides/comparisons/customising_comparisons.html#creating-comparisons-for-specific-data-types)

The pre-made Splink comparison templates available for each SQL dialect are as given in this table:

2 changes: 1 addition & 1 deletion docs/demos/examples/examples_index.md
@@ -30,7 +30,7 @@ You can try these demos live in your web browser using the following link:

[Estimating m probabilities from pairwise labels](./duckdb/pairwise_labels.ipynb)

[Deduplicating 50,000 records with Deterministic Rules](./duckdb/examples/duckdb/deterministic_dedupe.ipynb)
[Deduplicating 50,000 records with Deterministic Rules](./duckdb/deterministic_dedupe.ipynb)

[Deduplicating the febrl3 dataset](./duckdb/febrl3.ipynb). Note this dataset comes from [febrl](http://users.cecs.anu.edu.au/~Peter.Christen/Febrl/febrl-0.3/febrldoc-0.3/manual.html), as referenced in A.2 [here](https://arxiv.org/pdf/2008.04443.pdf) and replicated [here](https://recordlinkage.readthedocs.io/en/latest/ref-datasets.html).

4 changes: 2 additions & 2 deletions docs/demos/tutorials/04_Estimating_model_parameters.ipynb
@@ -885,8 +885,8 @@
"\n",
" :simple-readme: For a deeper dive on:\n",
"\n",
" * choosing comparisons, please refer to the [Comparisons Topic Guides](../../topic_guides/customising_comparisons.ipynb)\n",
" * the underlying model theory, please refer to the [Fellegi Sunter Topic Guide](../../topic_guides/fellegi_sunter.md)\n",
" * choosing comparisons, please refer to the [Comparisons Topic Guides](../../topic_guides/comparisons/customising_comparisons.ipynb)\n",
" * the underlying model theory, please refer to the [Fellegi Sunter Topic Guide](../../topic_guides/theory/fellegi_sunter.md)\n",
" * model training, please refer to the Model Training Topic Guides (Coming Soon).\n",
"\n",
" :bar_chart: For more on the charts used in this tutorial, please refer to the [Charts Gallery](../../charts/index.md#model-training)."
4 changes: 2 additions & 2 deletions docs/dev_guides/changing_splink/testing.md
@@ -130,8 +130,8 @@ pytest -W ignore -q -x -m duckdb tests/test_estimate_prob_two_rr_match.py
Splink utilises [github actions](https://docs.github.com/en/actions) to run tests for each pull request. This consists of a few independent checks:

* The full test suite is run separately against several different python versions
* The [example notebooks](./examples_index.html) are checked to ensure they run without error
* The [tutorial notebooks](./demos/00_Tutorial_Introduction.html) are checked to ensure they run without error
* The [example notebooks](../../demos/examples_index.html) are checked to ensure they run without error
* The [tutorial notebooks](../../demos/00_Tutorial_Introduction.html) are checked to ensure they run without error

## Writing tests

4 changes: 2 additions & 2 deletions docs/dev_guides/charts/understanding_and_editing_charts.md
@@ -1,6 +1,6 @@
# Charts in Splink

Interactive charts are a key tool when linking data with Splink. To see all of the charts available, check out the [Splink Charts Gallery](../charts/index.md).
Interactive charts are a key tool when linking data with Splink. To see all of the charts available, check out the [Splink Charts Gallery](../../charts/index.md).


## How do charts work in Splink?
@@ -23,7 +23,7 @@ For a given chart, there is usually:

If you take any Altair chart in HTML format, you should be able to make changes pretty easily with the Vega-Lite Editor.

For example, consider the [`comparator_score_chart`](../topic_guides/comparisons/choosing_comparators.ipynb#comparing-string-similarity-and-distance-scores) from the [`comparison_helpers library`](../comparison_helpers.md#splink.comparison_helpers.comparator_score_chart):
For example, consider the [`comparator_score_chart`](../../topic_guides/comparisons/choosing_comparators.ipynb#comparing-string-similarity-and-distance-scores) from the [`comparison_helpers library`](../../comparison_helpers.md#splink.comparison_helpers.comparator_score_chart):

| Before | After |
| ------ | ----- |
@@ -6,7 +6,7 @@ tags:
---
# Extending existing comparisons and comparison levels

Creating a linkage (or deduplication) model necessitates making various comparisons between (or within) your data sets. There is some choice available in what kind of comparisons you will wish to do for the linkage problem you are dealing with. Splink comes with several [comparisons ready to use directly](../../topic_guides/customising_comparisons.html#method-1-using-the-comparisonlibrary), as well as several [comparison levels that you can use to construct your own comparison](../../topic_guides/customising_comparisons.html#method-3-comparisonlevels). You may find that within these you find yourself using a specialised version repeatedly, and would like to make a shorthand for this and contribute it to the Splink library for other users to benefit from - this page will aid you in this process.
Creating a linkage (or deduplication) model necessitates making various comparisons between (or within) your data sets. There is some choice available in what kind of comparisons you will wish to do for the linkage problem you are dealing with. Splink comes with several [comparisons ready to use directly](../../topic_guides/comparisons/customising_comparisons.html#method-1-using-the-comparisonlibrary), as well as several [comparison levels that you can use to construct your own comparison](../../topic_guides/comparisons/customising_comparisons.html#method-3-comparisonlevels). You may find that within these you find yourself using a specialised version repeatedly, and would like to make a shorthand for this and contribute it to the Splink library for other users to benefit from - this page will aid you in this process.

This guide supplements [the guide for adding entirely new comparisons and comparison levels](./new_library_comparisons_and_levels.md) to show how things work when you are extending existing entries.

@@ -6,7 +6,7 @@ tags:
# Creating new comparisons and comparison levels for libraries

The Fellegi-Sunter model that Splink implements depends on having several comparisons, which are each composed of two or more comparison levels.
Splink provides several _ready-made_ [comparisons](../../comparison_library.html) and [comparison levels](../../comparison_level_library.html) to use out-of-the-box, but you may find in your particular application that you have to [create your own custom versions](../../topic_guides/customising_comparisons.html#method-4-providing-the-spec-as-a-dictionary) if there is not a suitable comparison/level for the [SQL dialect you are working with](../../topic_guides/backends.html) (or for any available dialects).
Splink provides several _ready-made_ [comparisons](../../comparison_library.html) and [comparison levels](../../comparison_level_library.html) to use out-of-the-box, but you may find in your particular application that you have to [create your own custom versions](../../topic_guides/comparisons/customising_comparisons.html#method-4-providing-the-spec-as-a-dictionary) if there is not a suitable comparison/level for the [SQL dialect you are working with](../../topic_guides/splink_fundamentals/backends/backends.html) (or for any available dialects).

Having created a custom comparison you may decide that your use case is common enough that you want to contribute it to Splink for other users to benefit from. This guide will take you through the process of doing so. Looking at existing examples should also prove to be useful for further guidance, and to perhaps serve as a starting template.

2 changes: 1 addition & 1 deletion docs/dev_guides/index.md
@@ -27,6 +27,6 @@ Splink is quite a large, complex codebase. The guides in this section lay out so
* [Transpilation using sqlglot](./transpilation.md) - demonstrates how Splink translates SQL in order to be compatible with multiple SQL engines using the sqlglot package.
* [Performance and caching](./caching.md) - demonstrates how pipelining and caching is used to make Splink run more efficiently.
* [Comparison and Comparison Level Libraries](./comparisons/new_library_comparisons_and_levels.md) - demonstrates how `Comparison` Library and `ComparisonLevel` Library functions are structured within Splink, including how to add new functions and edit existing functions.
* [Charts](./charts.ipynb) - demonstrates how charts are built in Splink, including how to add new charts and edit existing charts.
* [Charts](./charts/understanding_and_editing_charts.md) - demonstrates how charts are built in Splink, including how to add new charts and edit existing charts.
* [User-Defined Functions](./udfs.md) - demonstrates how User Defined Functions (UDFs) are used to provide functionality within Splink that is not native to a given SQL backend.
* [Settings Validation](./settings_validation/settings_validation_overview.md) - summarises how to use and expand the existing settings schema and validation functions.