diff --git a/docs/datasets.md b/docs/datasets.md
index 7a06f40e5b..69280d0b89 100644
--- a/docs/datasets.md
+++ b/docs/datasets.md
@@ -10,30 +10,33 @@ tags:
 Splink has some datasets available for use to help you get up and running, test ideas, or explore Splink features. To use, simply import `splink_datasets`:
 
 ```py
-from splink.datasets import splink_datasets
+from splink import splink_datasets
 
 df = splink_datasets.fake_1000
 ```
 
 which you can then use to set up a linker:
 
 ```py
-from splink.datasets import splink_datasets
-from splink.duckdb.linker import DuckDBLinker
-import splink.duckdb.comparison_library as cl
-
+from splink import splink_datasets, Linker, DuckDBAPI, SettingsCreator
+import splink.comparison_library as cl
+
 df = splink_datasets.fake_1000
-linker = DuckDBLinker(
+linker = Linker(
     df,
-    {
-        "link_type": "dedupe_only",
-        "comparisons": [cl.exact_match("first_name"), cl.exact_match("surname")],
-    },
+    SettingsCreator(
+        link_type="dedupe_only",
+        comparisons=[
+            cl.ExactMatch("first_name"),
+            cl.ExactMatch("surname"),
+        ],
+    ),
+    db_api=DuckDBAPI(),
 )
 ```
 
 ??? tip "Troubleshooting"
 
-    If you get a `SSLCertVerificationError` when trying to use the inbuilt datasets, this can be fixed with the `ssl` package by running:
-    
+    If you get an `SSLCertVerificationError` when trying to use the inbuilt datasets, this can be fixed with the `ssl` package by running:
+
     `ssl._create_default_https_context = ssl._create_unverified_context`.
@@ -41,7 +44,7 @@ linker = DuckDBLinker(
 
 Each attribute of `splink_datasets` is a dataset available for use, which exists as a pandas `DataFrame`. These datasets are not packaged directly with Splink, but instead are downloaded only when they are requested. Once requested they are cached for future use.
-The cache can be cleared using [`splink_dataset_utils`](#splink_dataset_utils-object), 
+The cache can be cleared using [`splink_dataset_utils`](#splink_dataset_utils-object),
 which also contains information on available datasets, and which have already been cached.
 
 ### Available datasets
 
@@ -64,7 +67,7 @@ The datasets available are listed below:
 
 ## `splink_dataset_utils` API
 
-In addition to `splink_datasets`, you can also import `splink_dataset_utils`, 
-which has a few functions to help managing `splink_datasets`. This can be useful if you have limited internet connection and want to see what is already cached, or if you need to clear cache items (e.g. if datasets were to be updated, or if space is an issue).
+In addition to `splink_datasets`, you can also import `splink_dataset_utils`,
+which has a few functions to help manage `splink_datasets`. This can be useful if you have a limited internet connection and want to see what is already cached, or if you need to clear cache items (e.g. if datasets were to be updated, or if space is an issue).
 
 
diff --git a/pyproject.toml b/pyproject.toml
index 04ce0339d6..173c1c1d66 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -49,7 +49,6 @@ sqlalchemy = ">=1.4.0"
 # temporarily use binary version, to avoid issues with pg_config path
 psycopg2-binary = ">=2.8.0"
 igraph = ">=0.11.2"
-ipykernel = "^6.29.4"
 
 [tool.poetry.group.linting]
 [tool.poetry.group.linting.dependencies]
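
For reference, here is the linker example as it reads once this patch is applied, assembled into a single copy-pasteable snippet. The `db_api` keyword and the `cl.ExactMatch` comparison class follow the Splink 4 API assumed in the diff above; if your installed version differs, check the `Linker` signature:

```py
from splink import splink_datasets, Linker, DuckDBAPI, SettingsCreator
import splink.comparison_library as cl

# fake_1000 is downloaded on first access and cached for subsequent runs
df = splink_datasets.fake_1000

linker = Linker(
    df,
    SettingsCreator(
        link_type="dedupe_only",
        comparisons=[
            cl.ExactMatch("first_name"),
            cl.ExactMatch("surname"),
        ],
    ),
    db_api=DuckDBAPI(),  # in-memory DuckDB backend
)
```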
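The troubleshooting tip's one-liner needs `ssl` imported before it will run. A minimal, self-contained version of the workaround might look like the sketch below; note that it disables certificate verification for all subsequent HTTPS downloads in the process, so treat it as a stopgap rather than a fix:

```py
import ssl

# Workaround for SSLCertVerificationError when downloading the inbuilt
# datasets. This disables certificate verification process-wide, so prefer
# resolving the underlying certificate issue where possible.
ssl._create_default_https_context = ssl._create_unverified_context

from splink import splink_datasets

df = splink_datasets.fake_1000  # download now skips certificate verification
```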
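As a sketch of the cache-management workflow the `splink_dataset_utils` section describes: the import path and method names below (`list_all_datasets`, `list_downloaded_datasets`, `clear_downloaded_data`) are assumptions inferred from the prose, not confirmed by this diff, so check the rendered API table for the authoritative names:

```py
# Hypothetical usage of splink_dataset_utils, assuming it is importable from
# the top-level package in the same way as splink_datasets.
from splink import splink_datasets, splink_dataset_utils

print(splink_dataset_utils.list_all_datasets())         # everything available
print(splink_dataset_utils.list_downloaded_datasets())  # what is already cached

df = splink_datasets.fake_1000  # downloads, or reads from the cache

# Free disk space by clearing cached downloads (assumed signature: an
# optional list restricts clearing to the named datasets).
splink_dataset_utils.clear_downloaded_data(["fake_1000"])
```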