Clusters topic guide #1883

Merged
merged 55 commits into from
Apr 4, 2024
Changes from 1 commit
a810626
Start of metrics topic guide
zslade Jan 24, 2024
978c3d9
Merge branch 'master' into clusters_topic_guide
zslade Jan 24, 2024
1445571
restructure intro
zslade Jan 30, 2024
3660259
update
zslade Feb 1, 2024
e705a1a
rearrange and fill in gaps
zslade Feb 1, 2024
4557558
updates
zslade Feb 2, 2024
5a18587
merge latest
zslade Feb 5, 2024
932528d
split out sections
zslade Feb 5, 2024
253f26a
fix sections
zslade Feb 5, 2024
232a14a
Update sections
zslade Feb 5, 2024
744226d
update overview/intro
zslade Feb 5, 2024
7bd6878
tweaking intro
zslade Feb 5, 2024
18851d3
tweaks
zslade Feb 6, 2024
20026b2
update density
zslade Feb 6, 2024
4a8bc93
update node degree
zslade Feb 6, 2024
5b70ed7
remove directed etc
zslade Feb 6, 2024
2bca2d7
tweak explanations
zslade Feb 6, 2024
7e4cc91
fleshing out how to guide
zslade Feb 6, 2024
3c46852
update how to and small tweaks
zslade Feb 6, 2024
c707179
Merge branch 'master' into clusters_topic_guide
zslade Feb 6, 2024
d3f9998
reorder
zslade Feb 12, 2024
21e2fd3
cluster centralisation
zslade Feb 12, 2024
c7d460e
small improvements
zslade Feb 12, 2024
bea498a
improvements
zslade Feb 12, 2024
e3ee66c
Merge branch 'master' into clusters_topic_guide
zslade Feb 12, 2024
5fea483
remove average and absolute
zslade Feb 12, 2024
e0e7495
improving centralisation explaination
zslade Feb 12, 2024
7f28ee7
update link
zslade Feb 12, 2024
05ea72d
small tweak
zslade Feb 17, 2024
f8e880c
remove graph definition
zslade Feb 17, 2024
f2d4ccb
Merge branch 'master' into clusters_topic_guide
zslade Feb 29, 2024
336c94a
Merge branch 'master' into clusters_topic_guide
zslade Mar 4, 2024
be5c9a6
Merge branch 'master' into clusters_topic_guide
RossKen Mar 28, 2024
6200739
minor edits
RossKen Mar 28, 2024
8ba2692
Merge branch 'master' into clusters_topic_guide
zslade Mar 28, 2024
016bb00
changes based off comments
zslade Mar 28, 2024
66ef851
Delete docs/comparison_level_library.md
zslade Mar 28, 2024
13e12b9
Delete docs/datasets.md
zslade Mar 28, 2024
f45d9bf
Delete docs/comparison_library.md
zslade Mar 28, 2024
38ab7b6
Delete docs/comparison_template_library.md
zslade Mar 28, 2024
7d694ca
Delete docs/comparison_level_composition.md
zslade Mar 28, 2024
e8cecfa
Merge branch 'master' into clusters_topic_guide
zslade Apr 2, 2024
9bfac5a
tweaks
zslade Apr 2, 2024
2e3f180
tweak
zslade Apr 2, 2024
858a16e
resolving comments and more tweaks
zslade Apr 2, 2024
22bf41d
update to notebook
zslade Apr 2, 2024
01b60b6
update and fix links
zslade Apr 2, 2024
39d0f34
spellcheck
zslade Apr 2, 2024
fd77b6e
add more graphic metric visuals
RossKen Apr 2, 2024
2032c8f
add cluster centralisation caveat
RossKen Apr 3, 2024
e7d08fe
Merge branch 'master' into clusters_topic_guide
RossKen Apr 3, 2024
d898a0a
add back useful density text
zslade Apr 3, 2024
c85ca71
re-add comparison libraries docs
RossKen Apr 4, 2024
100f2d2
add missing md doc
RossKen Apr 4, 2024
63496b3
fix clusters doc link
RossKen Apr 4, 2024
Start of metrics topic guide
zslade committed Jan 24, 2024
commit a81062672cfbedefc23f34ff05d03d47cb9dcfd7
60 changes: 59 additions & 1 deletion docs/topic_guides/evaluation/clusters.md
@@ -1,3 +1,61 @@
# Cluster Evaluation

This page is under construction - check back soon!
Graphs provide a natural way to think about linked data. Visualising linked data as a graph and utilising graph metrics are powerful routes to assessing linkage quality, as well as enhancing understanding of datasets and models. Insights gained can be used to refine linking strategies, resulting in more accurate predictions.

Graph metrics (see below) can be particularly useful for obtaining an overall picture of the quality of clusters generated by a Splink model. For example…

At the individual cluster level, Splink’s [Cluster Studio Dashboard]() enables users to visualise clusters and interrogate their members and the links between them. Applying metrics to individual clusters is particularly useful for graphs with many nodes, where it can be impossible to spot spurious links by eye alone.

!!! note
    It is important to bear in mind that whilst graph metrics can be very useful for assessing linkage quality, they are rarely definitive, especially when taken in isolation. It is often helpful to consider multiple metrics in conjunction with one another to build a comprehensive picture.

    It is also important to consider metrics within the context of their distribution and the underlying dataset. For example: a cluster density (see below) of 0.4 might seem low but could actually be above average for the dataset in question; a cluster of size 80 might be suspiciously large for one dataset but not for another.


Where do we answer the question of: what does good look like?

## Graph metrics and their application to linked data

A graph is defined as a collection of points (nodes) connected by lines (edges). In data linking, we refer to these collections of nodes as clusters, within which the nodes represent the entities to be linked (e.g. a person or a journey) and each edge represents a potential match, together with an associated Splink score.

[Include picture]

Graph metrics quantify the characteristics of a graph. A simple example of a graph metric is cluster size, which is the number of nodes in a cluster.
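
To make the cluster size definition concrete, here is a minimal sketch in plain Python. It assumes clusters are the connected components of an undirected edge list; the record IDs and the `find_clusters` helper are hypothetical and are not part of Splink.

```python
from collections import defaultdict


def find_clusters(edges):
    """Group nodes into clusters (connected components) of an undirected edge list."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    clusters = []
    for start in neighbours:
        if start in seen:
            continue
        # Walk outwards from `start`, collecting every node reachable from it
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(neighbours[node] - members)
        seen |= members
        clusters.append(members)
    return clusters


# Hypothetical record IDs; each pair stands for a predicted match
edges = [("a", "b"), ("b", "c"), ("d", "e")]

# Cluster size is simply the number of nodes in each cluster
cluster_sizes = sorted(len(cluster) for cluster in find_clusters(edges))
print(cluster_sizes)  # [2, 3]
```

Records with no edges at all never appear in the edge list, so this sketch does not report them as single-node clusters.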

For data linking with Splink, it is useful to sort graph metrics into three categories: cluster metrics, node metrics and edge metrics. These are defined below together with their relevance to data linking.

### :fontawesome-solid-circle-nodes: Cluster metrics

Cluster metrics refer to characteristics of a cluster as a whole, rather than the individual nodes and edges it contains.

#### Example: density
[picture]

The density of a cluster is given by the number of edges a cluster contains divided by the maximum possible number of edges. For an undirected cluster of N nodes, the maximum possible number of edges is N(N-1)/2. Density ranges from 0 to 1. A density of 1 means that every node is directly connected to every other node in the cluster.

Relevance to data linking: A high density (close to 1) is generally good as it means there are many edges in support of the records in a cluster being linked. A low density score might warrant further investigation.
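
As an illustration of the density calculation, the sketch below assumes an undirected cluster with at most one edge between any pair of nodes, so the maximum number of edges for N nodes is N(N-1)/2. The `cluster_density` function is a hypothetical helper, not Splink's implementation, and returning `None` for single-node clusters is an assumption made here.

```python
def cluster_density(n_nodes, n_edges):
    """Density = actual edges / maximum possible edges for an undirected cluster."""
    if n_nodes < 2:
        # Assumption: density is treated as undefined for a single-node cluster
        return None
    max_edges = n_nodes * (n_nodes - 1) / 2
    return n_edges / max_edges


# Four nodes can have at most 4 * 3 / 2 = 6 edges
fully_connected = cluster_density(n_nodes=4, n_edges=6)
sparse = cluster_density(n_nodes=4, n_edges=3)

print(fully_connected)  # 1.0
print(sparse)           # 0.5
```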

#### Example: cluster centralisation

TBC

### ⚫️ Node metrics

Node metrics refer to features of the nodes within clusters; for example, node degree, which is a count of how many edges (links) are joined to a node.

#### Example: node degree
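
A minimal sketch of node degree, assuming an undirected edge list of hypothetical record IDs. The `node_degrees` helper is illustrative only and is not a Splink function.

```python
from collections import Counter


def node_degrees(edges):
    """Count how many edges are joined to each node in an undirected edge list."""
    degrees = Counter()
    for a, b in edges:
        # Each edge contributes one to the degree of both of its endpoints
        degrees[a] += 1
        degrees[b] += 1
    return dict(degrees)


# A star-shaped cluster: "a" is linked to three other records
edges = [("a", "b"), ("a", "c"), ("a", "d")]
degrees = node_degrees(edges)
print(degrees)  # {'a': 3, 'b': 1, 'c': 1, 'd': 1}
```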

### 🔗 Edge metrics

These are a measure of the properties of edges within a cluster. Examples include edge betweenness and bridges*.

#### Example: is bridge

*acknowledge the slight difference between our definition and the literature.
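
For orientation, the sketch below uses the standard graph-theory definition of a bridge: an edge whose removal disconnects its two endpoints. As noted above, the definition used in Splink may differ slightly, so treat this as an illustration rather than Splink's method. The helper names and record IDs are hypothetical, and the brute-force approach is only suitable for small clusters.

```python
from collections import defaultdict


def is_connected_without(edges, skip_index, source, target):
    """Return True if `target` is reachable from `source` when edges[skip_index] is ignored."""
    neighbours = defaultdict(set)
    for i, (a, b) in enumerate(edges):
        if i == skip_index:
            continue
        neighbours[a].add(b)
        neighbours[b].add(a)

    stack, seen = [source], {source}
    while stack:
        node = stack.pop()
        if node == target:
            return True
        for nxt in neighbours[node] - seen:
            seen.add(nxt)
            stack.append(nxt)
    return False


def find_bridges(edges):
    """Return every edge whose removal would disconnect its two endpoints."""
    return [
        edge
        for i, edge in enumerate(edges)
        if not is_connected_without(edges, i, *edge)
    ]


# A triangle (a, b, c) with one extra record "d" attached only via "c"
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
bridges = find_bridges(edges)
print(bridges)  # [('c', 'd')]
```

Each edge in the triangle has an alternative path between its endpoints, so only the edge to "d" is flagged.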

## ⚡ How to harness the power of graph metrics with Splink ##

To enable users to calculate a variety of graph metrics for their linked data, Splink provides the `compute_graph_metrics()` method.

Other possible things to include:
Querying with linker.sql_query?
We have also made one of the metrics computed so far available to use for sampling in cluster studio dashboard?