Add documentation for spellchecker and spellcheck docs (#2025)
* Documentation for using the spellchecker

* remove comment

* reorder dictionary manually

* add ability to ignore text in 'cards'

* correct spelling in docs

* update with except pattern delimiters

* add words to dictionary

* ignore icon text

* spellcheck and update dictionary

* spellcheck and add words to dictionary

* spellcheck docs and update dictionary

* spellchecked and updated dictionary

* updated name of pyspelling YAML

* update guidance

* update instructions for docs

* spellchecked docs and updated dictionary

* spellchecking

* update bash script to exclude auto-generated files

* ignore angle brackets

* spellcheck and update dictionary

* spellcheck

* add ignore pattern

* spellcheck

* update documentation with changes from master

* old docs guide added to general contribution guidance

* spellcheck and update dictionary

* remove Americanisms

* Capitalise authors' names

* update tables

* move `jaro_winkler_sim` to a code block

* splink -> Splink

* Capitalise names

* postgres -> Postgres

* Add some additional words to the dict

* add descriptive comments to yaml

* mention dictionary is British English

* spellchecking complete!

* alpha-sorted custom dictionary

* correct PyPI

* delete test file

* fixing links

* dummy commit

* update dictionary

* update documentation

* update action to ignore markdown file

* revert markdown

---------

Co-authored-by: Tom Hepworth <[email protected]>
zslade and ThomasHepworth authored Mar 28, 2024
1 parent e53a044 commit d1acc1d
Showing 42 changed files with 759 additions and 395 deletions.
1 change: 1 addition & 0 deletions .github/workflows/run_demos_examples.yml
@@ -7,6 +7,7 @@ on:
paths:
- "splink/**"
- "docs/demos/examples/**"
- "!docs/demos/examples/examples_index.md"
- "pyproject.toml"

workflow_dispatch:
10 changes: 5 additions & 5 deletions CONTRIBUTING.md
@@ -40,9 +40,9 @@ Thanks for your interest in contributing code to Splink!
There are a number of ways to get involved:

- Start work on an [existing issue](https://github.com/moj-analytical-services/splink/issues), there should be some with a [`good first issue`](https://github.com/moj-analytical-services/splink/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) flag which are a good place to start.
- Tackle a problem you have identified. If you have identified a feature or bug, the first step is to [create a new issue](https://github.com/moj-analytical-services/splink/issues/new/choose) to explain what you have identified and what you plan to implement, then you are free to fork the repo and get coding!
- Tackle a problem you have identified. If you have identified a feature or bug, the first step is to [create a new issue](https://github.com/moj-analytical-services/splink/issues/new/choose) to explain what you have identified and what you plan to implement, then you are free to fork the repository and get coding!

In either case, we ask that you assign yourself to the relevant issue and open up [a draft PR](https://github.blog/2019-02-14-introducing-draft-pull-requests/) while you are working on your feature/bug-fix. This helps the Splink dev team keep track of developments and means we can start supporting you sooner!
In either case, we ask that you assign yourself to the relevant issue and open up [a draft pull request (PR)](https://github.blog/2019-02-14-introducing-draft-pull-requests/) while you are working on your feature/bug-fix. This helps the Splink dev team keep track of developments and means we can start supporting you sooner!

You can always add further PRs to build extra functionality. Starting out with a minimum viable product and iterating makes for better software (in our opinion). It also helps get features out into the wild sooner.

@@ -57,10 +57,10 @@ When making code changes, we recommend:

### Branching Strategy

Typically, all pull requests (PRs) should be against `master`.
Typically, all pull requests (PRs) should target the `master` branch.
**However, currently the Splink team is working on a major update in Splink v4. For the time being, we are using the `splink4_dev` branch to develop v4.**
As a general rule, substantial new features should be PRed against that branch,
while bug fixes and documentation changes should be PRed against `master`.
As a general rule, substantial new features should be targeted at `splink4_dev`,
while bug fixes and documentation changes should be targeted at `master`.
If you are unsure which category your change falls into, please ask!

We believe that [small Pull Requests](https://essenceofcode.com/2019/10/29/the-art-of-small-pull-requests/) make better code. They:
6 changes: 3 additions & 3 deletions docs/blog/posts/2023-12-06-feature_update.md
@@ -38,7 +38,7 @@ Two of our latest additions are:

### :material-matrix: Confusion Matrix

When evaluating any classification model, a confusion matrix is a useful tool for summarizing performance by representing counts of true positive, true negative, false positive, and false negative predictions.
When evaluating any classification model, a confusion matrix is a useful tool for summarising performance by representing counts of true positive, true negative, false positive, and false negative predictions.

Splink now has its own [confusion matrix chart](../../charts/threshold_selection_tool_from_labels_table.ipynb) to show how model performance changes with a given match weight threshold.
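
As a rough illustration of what sits behind such a chart, the sketch below counts the four confusion-matrix cells at a single match weight threshold. The column names and data are invented for this example; the chart linked above computes these counts for you from a table of labelled predictions.

```python
# Illustrative only: counting confusion-matrix cells at one match weight threshold.
# The column names and values here are invented for this sketch.
import pandas as pd

df = pd.DataFrame({
    "match_weight": [12.3, 8.1, 2.4, -1.7, -6.0],       # predicted match weights
    "is_true_match": [True, True, False, True, False],  # clerical labels
})

threshold = 3.0  # match weight above which a pair is predicted to be a match
predicted_match = df["match_weight"] >= threshold

tp = (predicted_match & df["is_true_match"]).sum()
fp = (predicted_match & ~df["is_true_match"]).sum()
fn = (~predicted_match & df["is_true_match"]).sum()
tn = (~predicted_match & ~df["is_true_match"]).sum()

print(f"TP={tp}, FP={fp}, FN={fn}, TN={tn}")  # TP=2, FP=0, FN=1, TN=2
```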

@@ -57,7 +57,7 @@ Splink now has the [completeness chart](../../charts/completeness_chart.ipynb) w

## :clipboard: Settings Validation

The [Settings dictionary](../../settings_dict_guide.md) is central to everything in Splink. It defines everything from the sql dialect of your backend to how features are compared in Splink model.
The [Settings dictionary](../../settings_dict_guide.md) is central to everything in Splink. It defines everything from the SQL dialect of your backend to how features are compared in a Splink model.

A common sticking point with users is how easy it is to make small errors when defining the Settings dictionary, resulting in unhelpful error messages.
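
To make the kind of slip concrete, here is a minimal, hypothetical settings dictionary sketch; refer to the Settings dictionary guide for the authoritative key names, and note that the misspelt blocking-rules key is deliberate.

```python
# A minimal, hypothetical settings dictionary, for illustration only - see the
# Settings dictionary guide linked above for the authoritative key names.
settings = {
    "link_type": "dedupe_only",
    "comparisons": [],  # comparison definitions for each feature would go here
    # The key below is deliberately misspelt ("prediction" rather than
    # "predictions") - a small slip like this is exactly the sort of error
    # that can produce confusing failures, and that the validator described
    # below is designed to flag.
    "blocking_rules_to_generate_prediction": ["l.surname = r.surname"],
}
```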

@@ -66,7 +66,7 @@ To address this issue, the [Settings Validator](../../dev_guides/settings_valida

## :simple-adblock: Blocking Rule Library (Improved)

In our [previous blog](../posts/2023-12-06-feature_update.md#no_entry_sign-drop-support-for-python-37) we introduced the Blocking Rule Library (BRL) built upon the `exact_match_rule` function. When testing this functionality we found it pretty verbose, particularly when blocking on multiple columns, so figured we could do better. From Splink v3.9.6 we introduced the `block_on` function to supercede `exact_match_rule`.
In our [previous blog](../posts/2023-12-06-feature_update.md#no_entry_sign-drop-support-for-python-37) we introduced the Blocking Rule Library (BRL) built upon the `exact_match_rule` function. When testing this functionality we found it pretty verbose, particularly when blocking on multiple columns, so figured we could do better. From Splink v3.9.6 we introduced the `block_on` function to supersede `exact_match_rule`.

For example, a block on `first_name` and `surname` now looks like:

4 changes: 2 additions & 2 deletions docs/blog/posts/2024-01-25-ethics.md
@@ -25,7 +25,7 @@ Furthermore, data linkage is generally used at the start of analytical projects

Data ethics has been a foundational consideration throughout Splink’s development. For example, the decision to make Splink open-source was motivated by an ambition to make our data linking software fully transparent, accessible and auditable to users both inside and outside of government. The fact that this also empowers external users to expand and improve upon Splink’s functionality is another [huge benefit](https://www.robinlinacre.com/open_source_dividend/)!

Another core principle guiding the development of Splink has been explainability. Under the hood we use the [Felligi-Sunter model](../../topic_guides/theory/fellegi_sunter.md) which is an industry-standard, well-researched, explainable methodology. This, in combination with interactive charts such as the [waterfall chart](../../charts/waterfall_chart.ipynb), where model results can be easily broken down and visualised for individual record pairs, make Splink predictions easily interrogatable and explainable. Being able to interrogate predictions is especially valuable when things go wrong - if an incorrect link has been made you can trace it back see exactly why the model made the decision.
Another core principle guiding the development of Splink has been explainability. Under the hood we use the [Fellegi-Sunter model](../../topic_guides/theory/fellegi_sunter.md) which is an industry-standard, well-researched, explainable methodology. This, in combination with interactive charts such as the [waterfall chart](../../charts/waterfall_chart.ipynb), where model results can be easily broken down and visualised for individual record pairs, makes Splink predictions easily interrogatable and explainable. Being able to interrogate predictions is especially valuable when things go wrong - if an incorrect link has been made you can trace it back and see exactly why the model made the decision.

### What else should we be considering?

@@ -61,7 +61,7 @@ Sharing both our current knowledge and future discoveries on the ethics of data

As already mentioned, Splink comes with a variety of tools that support explainability. We will be updating the Splink documentation to convey the significance of these resources from a data ethics perspective to help give existing users, potential adopters and their customers greater confidence in building Splink models and model predictions.

Please visit the [Ethics in Data Linking discussion](https://github.com/moj-analytical-services/splink/discussions/1878) on Splink's GitHub repo to get involved in the conversation and share your thoughts - we'd love to hear them!
Please visit the [Ethics in Data Linking discussion](https://github.com/moj-analytical-services/splink/discussions/1878) on Splink's GitHub repository to get involved in the conversation and share your thoughts - we'd love to hear them!

<hr>

12 changes: 6 additions & 6 deletions docs/dev_guides/caching.md
@@ -11,9 +11,9 @@ For example, the `predict()` step:
- Inputs `__splink__df_comparison_vectors` and outputs `__splink__df_match_weight_parts`
- Inputs `__splink__df_match_weight_parts` and outputs `__splink__df_predict`

To make this run faster, two key optimisations are implmented:
To make this run faster, two key optimisations are implemented:

- Pipelining - combining multiple `select` statements into a single statemenet using `WITH`([CTE](https://www.postgresql.org/docs/current/queries-with.html)) queries
- Pipelining - combining multiple `select` statements into a single statement using `WITH`([CTE](https://www.postgresql.org/docs/current/queries-with.html)) queries
- Caching: saving the results of calculations so they don't need recalculating. This is especially useful because some intermediate calculations are reused multiple times during a typical Splink session

This article discusses the general implementation of caching and pipelining. The implementation needs some alterations for certain backends like Spark, which lazily evaluate SQL by default.
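
As a rough sketch of the pipelining idea (not Splink's actual implementation), a queue of SELECT statements can be collapsed into a single CTE query along these lines:

```python
# Simplified illustration of pipelining - not Splink's actual implementation.
# Each queued task is an (output_table_name, select_sql) pair; all but the last
# become CTEs, and the final SELECT produces the output.
queue = [
    ("__splink__df_comparison_vectors", "select ... from __splink__df_blocked"),
    ("__splink__df_match_weight_parts", "select ... from __splink__df_comparison_vectors"),
    ("__splink__df_predict", "select ... from __splink__df_match_weight_parts"),
]

ctes = ",\n".join(f"{name} as ({sql})" for name, sql in queue[:-1])
pipelined_sql = f"with {ctes}\n{queue[-1][1]}"
print(pipelined_sql)
```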
@@ -55,22 +55,22 @@ For example, when we run `linker.predict()`, Splink:
- Generates the SQL tasks
- Pipelines them into a single SQL statement
- Hashes the statement to create a physical name for the outputs `__splink__df_predict_cbc9833`
- Checks whether a table with physical name `__splink__df_predict_cbc9833` alredy exists in the database
- Checks whether a table with physical name `__splink__df_predict_cbc9833` already exists in the database
- If not, executes the SQL statement, creating table `__splink__df_predict_cbc9833` in the database.
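
A minimal sketch of this hash-then-cache pattern, assuming an in-memory dictionary stands in for the database, might look like the following (illustrative only, not Splink's actual code):

```python
# Illustrative hash-then-cache pattern - not Splink's actual implementation.
import hashlib

cache = {}  # physical table name -> materialised result (stands in for the database)

def sql_to_table(templated_name, pipelined_sql, execute):
    # Hash the SQL so that different SQL producing e.g. __splink__df_predict
    # in different contexts gets a distinct physical name.
    sql_hash = hashlib.sha256(pipelined_sql.encode()).hexdigest()[:7]
    physical_name = f"{templated_name}_{sql_hash}"

    if physical_name in cache:  # already materialised: no recomputation needed
        return cache[physical_name]

    cache[physical_name] = execute(pipelined_sql)  # run the SQL and cache the result
    return cache[physical_name]

# First call executes the SQL; an identical second call is served from the cache.
result = sql_to_table(
    "__splink__df_predict",
    "select * from __splink__df_match_weight_parts",
    execute=lambda sql: f"table built from: {sql}",
)
```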

In terms of implementation, the following happens:

- SQL statements are generated and put in the queue - see [here](https://github.com/moj-analytical-services/splink/blob/6e978a6a61058a73ef6c49039e0d796b12673c1b/splink/linker.py#L982-L983)
- Once all the tasks have been added to the queue, we call `_execute_sql_pipeline()` see [here](https://github.com/moj-analytical-services/splink/blob/6e978a6a61058a73ef6c49039e0d796b12673c1b/splink/linker.py#L994)
- The SQL is combined into a single pipelined statement [here](https://github.com/moj-analytical-services/splink/blob/6e978a6a61058a73ef6c49039e0d796b12673c1b/splink/linker.py#L339)
- We call `_sql_to_splink_dataframe()` which returns the table (from the cache if it already exists, or it executes the sql)
- We call `_sql_to_splink_dataframe()` which returns the table (from the cache if it already exists, or it executes the SQL)
- The table is returned as a `SplinkDataframe`, an abstraction over a table in a database. See [here](https://moj-analytical-services.github.io/splink/SplinkDataFrame.html).

#### Some cached tables do not need a hash

A hash is required to uniquely identify some outputs. For example, blocking is used in several places in Splink, with _different results_: the `__splink__df_blocked` needed to estimate parameters is different to the `__splink__df_blocked` needed in the `predict()` step.

As a result, we cannot materialise a single table called `__splink__df_blocked` in the database and reues it multiple times. This is why we append the hash of the SQL, so that we can uniquely identify the different versions of `__splink__df_blocked` which are needed in different contexts.
As a result, we cannot materialise a single table called `__splink__df_blocked` in the database and reuse it multiple times. This is why we append the hash of the SQL, so that we can uniquely identify the different versions of `__splink__df_blocked` which are needed in different contexts.

There are, however, some tables which are globally unique. They only take a single form, and if they exist in the cache they never need recomputing.

@@ -90,4 +90,4 @@ However, there are many intermediate outputs which are used by many different Sp

Performance can therefore be improved by computing and saving these intermediate outputs to a cache, to ensure they don't need to be computed repeatedly.

This is achieved by enqueueing SQL to a pipline and strategically calling `execute_sql_pipeline` to materialise results that need to cached.
This is achieved by enqueueing SQL to a pipeline and strategically calling `execute_sql_pipeline` to materialise results that need to be cached.
4 changes: 2 additions & 2 deletions docs/dev_guides/changing_splink/blog_posts.md
@@ -2,7 +2,7 @@

Thanks for considering making a contribution to the [Splink Blog](../../blog/index.md)! We are keen to use this blog as a forum for all things data linking and Splink!

This blog, and the docs as a whole, are built using the fantastic [mkdocs-material](https://squidfunk.github.io/mkdocs-material/), to understand more about how the blog works under the hood checkout out the mkdocs-material [blog documentation](https://squidfunk.github.io/mkdocs-material/blog/2022/09/12/blog-support-just-landed/).
This blog, and the docs as a whole, are built using the fantastic [MkDocs material](https://squidfunk.github.io/mkdocs-material/). To understand more about how the blog works under the hood, check out the MkDocs material [blog documentation](https://squidfunk.github.io/mkdocs-material/blog/2022/09/12/blog-support-just-landed/).

For more general guidance for contributing to Splink, check out our [Contributor Guide](../CONTRIBUTING.md).

@@ -22,4 +22,4 @@ If you are a new author, you will need to add yourself to the [.authors.yml file

## Testing your changes

Once you have made a first draft, check out how the deployed blog will look by [building the docs locally](./build_docs_locally.md).
Once you have made a first draft, check out how the deployed blog will look by [building the docs locally](./contributing_to_docs.md).
15 changes: 0 additions & 15 deletions docs/dev_guides/changing_splink/build_docs_locally.md

This file was deleted.

75 changes: 75 additions & 0 deletions docs/dev_guides/changing_splink/building_env_locally.md
@@ -0,0 +1,75 @@
## Creating a Virtual Environment for Splink

### Managing Dependencies with Poetry

Splink utilises `poetry` for managing its core dependencies, offering a clean and effective solution for tracking and resolving any ensuing package and version conflicts.

You can find a list of Splink's core dependencies within the [pyproject.toml](https://github.com/moj-analytical-services/splink/blob/master/pyproject.toml) file.

#### Fundamental Commands in Poetry

Below are some useful commands to help in the maintenance and upkeep of the [pyproject.toml](https://github.com/moj-analytical-services/splink/blob/master/pyproject.toml) file.

**Adding Packages**
- To incorporate a new package into Splink:
```sh
poetry add <package-name>
```
- To specify a version when adding a new package:
```sh
poetry add <package-name>==<version>
# Add quotes if you want to use other comparison operators (e.g. >=)
poetry add "<package-name> >= <version>"
```

**Modifying Packages**
- To remove a package from the project:
```sh
poetry remove <package-name>
```
- Updating an existing package to a specific version:
```sh
poetry add <package-name>==<version>
poetry add "<package-name> >= <version>"
```
- To update an existing package to the latest version:
```sh
poetry add <package-name>==<version>
poetry update <package-name>
```
Note: Direct updates can also be performed within the pyproject.toml file.

**Locking the Project**
- To update the existing `poetry.lock` file, thereby locking the project to ensure consistent dependency installation across different environments:
```sh
poetry lock
```
Note: This should be used sparingly due to our loose dependency requirements and the resulting time to solve the dependency graph. If you only need to update a single dependency, update it using `poetry add <pkg>==<version>` instead.

**Installing Dependencies**
- To install project dependencies as per the lock file:
```sh
poetry install
```
- For optional dependencies, additional flags are required. For instance, to install dependencies along with Spark support:
```sh
poetry install -E spark
```

A comprehensive list of Poetry commands is available in the [Poetry documentation](https://python-poetry.org/docs/cli/).

### Automating Virtual Environment Creation

To streamline the creation of a virtual environment via `venv`, you may use the [create_venv.sh](https://github.com/moj-analytical-services/splink/blob/master/scripts/create_venv.sh) script.

This script facilitates the automatic setup of a virtual environment, with the default environment name being **venv**.

**Default Environment Creation:**
```sh
source scripts/create_venv.sh
```

**Specifying a Custom Environment Name:**
```sh
source scripts/create_venv.sh <name_of_venv>
```
