Commit: fix toml and typo

RobinL committed Jun 10, 2024
1 parent 671a58f commit 88a81c6

Showing 2 changed files with 12 additions and 13 deletions.
24 changes: 12 additions & 12 deletions docs/datasets.md
@@ -10,38 +10,38 @@ tags:
Splink has some datasets available for use to help you get up and running, test ideas, or explore Splink features.
To use, simply import `splink_datasets`:
```diff
-from splink.datasets import splink_datasets
+from splink import splink_datasets

 df = splink_datasets.fake_1000
```
which you can then use to set up a linker:
```diff
-from splink.datasets import splink_datasets
-from splink.duckdb.linker import DuckDBLinker
-import splink.duckdb.comparison_library as cl
+from splink import splink_datasets, Linker, DuckDBAPI, SettingsCreator

 df = splink_datasets.fake_1000
 linker = DuckDBLinker(
     df,
-    {
-        "link_type": "dedupe_only",
-        "comparisons": [cl.exact_match("first_name"), cl.exact_match("surname")],
-    },
+    SettingsCreator(
+        link_type="dedupe_only",
+        comparisons=[
+            cl.exact_match("first_name"),
+            cl.exact_match("surname"),
+        ],
+    )
 )
```

??? tip "Troubleshooting"

    If you get a `SSLCertVerificationError` when trying to use the inbuilt datasets, this can be fixed with the built-in `ssl` module by running:

    `import ssl; ssl._create_default_https_context = ssl._create_unverified_context`
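For context, the workaround above monkey-patches Python's default HTTPS context factory. A minimal, self-contained sketch (use with caution, since it disables certificate verification for every subsequent request that uses the default context):

```python
import ssl

# Replace the default HTTPS context factory with one that skips
# certificate verification. This is a global change, so prefer fixing
# the local certificate store where possible.
ssl._create_default_https_context = ssl._create_unverified_context
```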

## `splink_datasets`

Each attribute of `splink_datasets` is a dataset available for use, which exists as a pandas `DataFrame`.
These datasets are not packaged directly with Splink, but instead are downloaded only when they are requested.
Once requested they are cached for future use.
The cache can be cleared using [`splink_dataset_utils`](#splink_dataset_utils-object),
which also contains information on the available datasets and which of them have already been cached.
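The download-on-first-request behaviour described above is a standard download-then-cache pattern. Here is an illustrative stdlib-only sketch; the cache location, file names, and `fetch_dataset` helper are hypothetical, not splink's actual implementation:

```python
from pathlib import Path
import tempfile

# Hypothetical cache directory; splink's real cache lives elsewhere.
CACHE_DIR = Path(tempfile.mkdtemp())
downloads = []  # track how many simulated "downloads" happen


def fetch_dataset(name: str) -> str:
    """Return the dataset, downloading it only on first request."""
    cached = CACHE_DIR / f"{name}.csv"
    if not cached.exists():
        downloads.append(name)  # simulate the network download
        cached.write_text("id,first_name\n1,alice\n")
    return cached.read_text()


first = fetch_dataset("fake_1000")   # triggers the "download"
second = fetch_dataset("fake_1000")  # served from the cache
```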

### Available datasets

@@ -64,7 +64,7 @@ The datasets available are listed below:

## `splink_dataset_utils` API

In addition to `splink_datasets`, you can also import `splink_dataset_utils`,
which has a few functions to help manage `splink_datasets`.
This can be useful if you have limited internet connection and want to see what is already cached,
or if you need to clear cache items (e.g. if datasets were to be updated, or if space is an issue).
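As a rough illustration of what such cache utilities involve, here is a stdlib-only sketch; the `list_downloaded` and `clear_cache` helpers and the cache layout are hypothetical, not the actual `splink_dataset_utils` API:

```python
from pathlib import Path
import tempfile

# Hypothetical cache directory with one previously downloaded dataset.
cache_dir = Path(tempfile.mkdtemp())
(cache_dir / "fake_1000.csv").write_text("id,first_name\n1,alice\n")


def list_downloaded(cache: Path) -> list:
    """Name the datasets already present in the cache."""
    return sorted(p.stem for p in cache.glob("*.csv"))


def clear_cache(cache: Path) -> None:
    """Delete all cached dataset files, freeing disk space."""
    for p in cache.glob("*.csv"):
        p.unlink()


before = list_downloaded(cache_dir)  # ["fake_1000"]
clear_cache(cache_dir)
after = list_downloaded(cache_dir)   # []
```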
1 change: 0 additions & 1 deletion pyproject.toml
@@ -49,7 +49,6 @@
 sqlalchemy = ">=1.4.0"
 # temporarily use binary version, to avoid issues with pg_config path
 psycopg2-binary = ">=2.8.0"
 igraph = ">=0.11.2"
-ipykernel = "^6.29.4"

 [tool.poetry.group.linting]
 [tool.poetry.group.linting.dependencies]