From bf09d5a43f2aa7eb0c991279f6a1c84810c76770 Mon Sep 17 00:00:00 2001
From: Robin Linacre
Date: Mon, 8 Jul 2024 20:59:07 +0100
Subject: [PATCH 1/5] splink 4 release blog

---
 docs/blog/posts/2024-07-10-splink4_release.md | 54 +++++++++++++++++++
 1 file changed, 54 insertions(+)
 create mode 100644 docs/blog/posts/2024-07-10-splink4_release.md

diff --git a/docs/blog/posts/2024-07-10-splink4_release.md b/docs/blog/posts/2024-07-10-splink4_release.md
new file mode 100644
index 0000000000..7258a36480
--- /dev/null
+++ b/docs/blog/posts/2024-07-10-splink4_release.md
@@ -0,0 +1,54 @@
+---
+date: 2024-07-20
+authors:
+ - robin-l
+categories:
+ - Feature Updates
+---
+
+# Splink 4.0.0 released
+
+We're pleased to release Splink 4, which is more scalable and easier to use than Splink 3.
+
+The improvement we've made to the user experience mean that Splink 4 syntax is not backwards compatible, so Splink 3 scripts will need to be adjusted to work in Splink 4. However, the model serialisation format is unchanged, so models saved from Splink 3 in `.json` format can be imported into Splink 4.
+
+Version 4 is recommended to all new users. For existing users, there has been no change to the statistical methodology - Splink 3 and Splink 4 will give the same results, so there's no urgency to upgrade existing pipelines.
+
+To get started quickly with Splink 4, checkout the [examples](../../demos//examples/examples_index.md). You can see how things have changed by comparing them to the [Splink 3 examples](TODO).
+
+## Main enhancements
+
+- **User Experience**: We have revamped all aspects of the user-facing API. Functionality is easier to find, better named and better organised.
+
+- **Autocomplete everywhere**: All functions, most notably the settings object, have been rewritten to ensure autocomplete (IntelliSense/code completion) works. This means you no longer need to remember the specific name of the wide range of configuration options - a key like `blocking_rules_to_generate_predictions` will autocomplete. Where settings such as `link_type` have a predefined list of valid options, these will also autocomplete.
+
+- **Faster and more scalable** Our testing suggests that the internal changes have made Splink 4 significantly more scalable. Our testing also suggests Splink 4 is faster than Splink 3 for many workloads. This is in addition to [dramatic speedups](https://github.com/moj-analytical-services/splink/pull/1796) that were integrated into Splink 3 in January, meaning Splink is now 5x faster for a typical workload on a modern laptop than it was in November 2023. We welcome any feedback from users about speed and scalability, as it's hard for us to test the full range of scenarios.
+
+- **Improved backend code quality** The Splink 4 codebase represents a big refactor to focus on code quality. It should now be easier to contribute, and quicker and easier for the team to fix bugs.
+
+## Smaller enhancements
+
+Some highlights of other smaller improvements:
+
+- **Linker functionality is now organised into namespaces**. In Splink 3, a very large number of functions were available on the `linker` object, making it hard to find and remember what functionality exists. In Splink 4, functions are available in namespaces such as `linker.training` and `linker.inference`. Documentation [here](../../api_docs/api_docs_index.md).
+
+- **Blocking analysis**. The new blocking functions at `splink.blocking` include routines to ensure users don't accidentally run code that generates so many comparisons it never completes. Blocking analysis is also much faster. See the [blocking tutorial](../../demos/tutorials/03_Blocking.ipynb) for more.
+
+- **Switch between dialects more easily**. The backend SQL dialect (DuckDB, Spark etc.) is now imported using the relevant database API. This is passed into Splink functions (such as creation of the linker), meaning that switching between dialects is now a case of importing a different database API, no other code needs to change. For example, compare the [DuckDB](../../demos/examples/duckdb/deduplicate_50k_synthetic.ipynb) and [SQLite](../../demos/examples/sqlite/deduplicate_50k_synthetic.ipynb) examples.
+
+- **Exploratory analysis no longer needs a linker**. Exploratory analysis that is typically conducted before starting data linking can now be done in isolation, without the user needing to configure a linker. Exploratory analysis is now available at `splink.exploratory`. Similarly, blocking can be done without a linker using the functions at `splink.blocking`.
+
+- **Enhancements to API documentation**. Now that the codebase is better organised, it's made it much easier provide high quality API documentation - the new pages are [here](../../api_docs/api_docs_index.md).
+
+
+## Updating Splink 3 code
+
+Conceptually, there are no major changes in Splink 4. Splink 4 code follows the same steps as Splink 3. The same core estimation and prediction routines are used. Splink 4 code that uses the same settings will produce the same results (predictions) as Splink 3.
+
+That said, there have been significant changes to the syntax and a reorganisation of functions.
+
+For users wishing to familiarise themselves with Splink 4, we recommend the easiest way is to compare and contrast the new [examples](../../demos/examples/examples_index.md) with their [Splink 3 equivalents](TODO).
+
+You may also find the following screenshot useful, which shows the diff of a fairly standard Splink 3 workflow that has been rewritten in Splink 4.
+
+[TODO: Screenshot of diff]

From edbb3c27eeb313767006fdcef53f1439c7018ff4 Mon Sep 17 00:00:00 2001
From: Robin Linacre
Date: Tue, 16 Jul 2024 09:21:10 +0100
Subject: [PATCH 2/5] add screenshot

---
 docs/blog/posts/2024-07-10-splink4_release.md | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/docs/blog/posts/2024-07-10-splink4_release.md b/docs/blog/posts/2024-07-10-splink4_release.md
index 7258a36480..fa41161eed 100644
--- a/docs/blog/posts/2024-07-10-splink4_release.md
+++ b/docs/blog/posts/2024-07-10-splink4_release.md
@@ -10,7 +10,7 @@ categories:

 We're pleased to release Splink 4, which is more scalable and easier to use than Splink 3.

-The improvement we've made to the user experience mean that Splink 4 syntax is not backwards compatible, so Splink 3 scripts will need to be adjusted to work in Splink 4. However, the model serialisation format is unchanged, so models saved from Splink 3 in `.json` format can be imported into Splink 4.
+The improvements we've made to the user experience mean that Splink 4 syntax is not backwards compatible, so Splink 3 scripts will need to be adjusted to work in Splink 4. However, the model serialisation format is unchanged, so models saved from Splink 3 in `.json` format can be imported into Splink 4.

 Version 4 is recommended to all new users. For existing users, there has been no change to the statistical methodology - Splink 3 and Splink 4 will give the same results, so there's no urgency to upgrade existing pipelines.

@@ -51,4 +51,6 @@ For users wishing to familiarise themselves with Splink 4, we recommend the easi

 You may also find the following screenshot useful, which shows the diff of a fairly standard Splink 3 workflow that has been rewritten in Splink 4.

-[TODO: Screenshot of diff]
+![image](https://github.com/user-attachments/assets/7fe7c9e7-1a22-4744-a5ad-281d540a8deb)
+
+You can find the corresponding code [here](https://github.com/RobinL/temp_3_to_4/pull/1/files).

From f5bd67c7fad0bad8fd9c35a91076ce9480b2cb5a Mon Sep 17 00:00:00 2001
From: Robin Linacre
Date: Tue, 16 Jul 2024 09:23:31 +0100
Subject: [PATCH 3/5] add link

---
 docs/blog/posts/2024-07-10-splink4_release.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/blog/posts/2024-07-10-splink4_release.md b/docs/blog/posts/2024-07-10-splink4_release.md
index fa41161eed..f5dfa504a7 100644
--- a/docs/blog/posts/2024-07-10-splink4_release.md
+++ b/docs/blog/posts/2024-07-10-splink4_release.md
@@ -10,11 +10,11 @@ categories:

 We're pleased to release Splink 4, which is more scalable and easier to use than Splink 3.

-The improvements we've made to the user experience mean that Splink 4 syntax is not backwards compatible, so Splink 3 scripts will need to be adjusted to work in Splink 4. However, the model serialisation format is unchanged, so models saved from Splink 3 in `.json` format can be imported into Splink 4.
-
 Version 4 is recommended to all new users. For existing users, there has been no change to the statistical methodology - Splink 3 and Splink 4 will give the same results, so there's no urgency to upgrade existing pipelines.

-To get started quickly with Splink 4, checkout the [examples](../../demos//examples/examples_index.md). You can see how things have changed by comparing them to the [Splink 3 examples](TODO).
+The improvements we've made to the user experience mean that Splink 4 syntax is not backwards compatible, so Splink 3 scripts will need to be adjusted to work in Splink 4. However, the model serialisation format is unchanged, so models saved from Splink 3 in `.json` format can be imported into Splink 4.
+
+To get started quickly with Splink 4, checkout the [examples](../../demos/examples/examples_index.md). You can see how things have changed by comparing them to the [Splink 3 examples](https://moj-analytical-services.github.io/splink3_legacy_docs/demos/examples/examples_index.html).

 ## Main enhancements


From 8f1cbfba2c4645f8aa09c993d9aac207a8c20231 Mon Sep 17 00:00:00 2001
From: Robin Linacre
Date: Tue, 16 Jul 2024 09:35:49 +0100
Subject: [PATCH 4/5] add autocompelte example

---
 docs/blog/posts/2024-07-10-splink4_release.md | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/docs/blog/posts/2024-07-10-splink4_release.md b/docs/blog/posts/2024-07-10-splink4_release.md
index f5dfa504a7..b04dce4f8a 100644
--- a/docs/blog/posts/2024-07-10-splink4_release.md
+++ b/docs/blog/posts/2024-07-10-splink4_release.md
@@ -20,12 +20,14 @@ To get started quickly with Splink 4, checkout the [examples](../../demos/exampl

 - **User Experience**: We have revamped all aspects of the user-facing API. Functionality is easier to find, better named and better organised.

-- **Autocomplete everywhere**: All functions, most notably the settings object, have been rewritten to ensure autocomplete (IntelliSense/code completion) works. This means you no longer need to remember the specific name of the wide range of configuration options - a key like `blocking_rules_to_generate_predictions` will autocomplete. Where settings such as `link_type` have a predefined list of valid options, these will also autocomplete.
-
 - **Faster and more scalable** Our testing suggests that the internal changes have made Splink 4 significantly more scalable. Our testing also suggests Splink 4 is faster than Splink 3 for many workloads. This is in addition to [dramatic speedups](https://github.com/moj-analytical-services/splink/pull/1796) that were integrated into Splink 3 in January, meaning Splink is now 5x faster for a typical workload on a modern laptop than it was in November 2023. We welcome any feedback from users about speed and scalability, as it's hard for us to test the full range of scenarios.

 - **Improved backend code quality** The Splink 4 codebase represents a big refactor to focus on code quality. It should now be easier to contribute, and quicker and easier for the team to fix bugs.

+- **Autocomplete everywhere**: All functions, most notably the settings object, have been rewritten to ensure autocomplete (IntelliSense/code completion) works. This means you no longer need to remember the specific name of the wide range of configuration options - a key like `blocking_rules_to_generate_predictions` will autocomplete. Where settings such as `link_type` have a predefined list of valid options, these will also autocomplete.
+
+ ![Autocomplete example](https://github.com/user-attachments/assets/305b53ee-d11a-4104-b45f-e5b96db3c973)
+
 ## Smaller enhancements

 Some highlights of other smaller improvements:

From d14c306eac644cc9eb17e3edd2b70cc4ee571c7b Mon Sep 17 00:00:00 2001
From: Robin Linacre
Date: Tue, 16 Jul 2024 09:37:54 +0100
Subject: [PATCH 5/5] updates

---
 docs/blog/posts/2024-07-10-splink4_release.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/blog/posts/2024-07-10-splink4_release.md b/docs/blog/posts/2024-07-10-splink4_release.md
index b04dce4f8a..f7fa119cae 100644
--- a/docs/blog/posts/2024-07-10-splink4_release.md
+++ b/docs/blog/posts/2024-07-10-splink4_release.md
@@ -10,11 +10,11 @@ categories:

 We're pleased to release Splink 4, which is more scalable and easier to use than Splink 3.

-Version 4 is recommended to all new users. For existing users, there has been no change to the statistical methodology - Splink 3 and Splink 4 will give the same results, so there's no urgency to upgrade existing pipelines.
+Version 4 is recommended to all new users. For existing users, there has been no change to the statistical methodology. Version 3 and 4 will give the same results, so there's no urgency to upgrade existing pipelines.

 The improvements we've made to the user experience mean that Splink 4 syntax is not backwards compatible, so Splink 3 scripts will need to be adjusted to work in Splink 4. However, the model serialisation format is unchanged, so models saved from Splink 3 in `.json` format can be imported into Splink 4.

-To get started quickly with Splink 4, checkout the [examples](../../demos/examples/examples_index.md). You can see how things have changed by comparing them to the [Splink 3 examples](https://moj-analytical-services.github.io/splink3_legacy_docs/demos/examples/examples_index.html).
+To get started quickly with Splink 4, checkout the [examples](../../demos/examples/examples_index.md). You can see how things have changed by comparing them to the [Splink 3 examples](https://moj-analytical-services.github.io/splink3_legacy_docs/demos/examples/examples_index.html), or see the screenshot at the bottom of this post.

 ## Main enhancements