Clusters topic guide #1883

Merged Apr 4, 2024 (55 commits)

Commits
a810626
Start of metrics topic guide
zslade Jan 24, 2024
978c3d9
Merge branch 'master' into clusters_topic_guide
zslade Jan 24, 2024
1445571
restructure intro
zslade Jan 30, 2024
3660259
update
zslade Feb 1, 2024
e705a1a
rearrange and fill in gaps
zslade Feb 1, 2024
4557558
updates
zslade Feb 2, 2024
5a18587
merge latest
zslade Feb 5, 2024
932528d
split out sections
zslade Feb 5, 2024
253f26a
fix sections
zslade Feb 5, 2024
232a14a
Update sections
zslade Feb 5, 2024
744226d
update overview/intro
zslade Feb 5, 2024
7bd6878
tweaking intro
zslade Feb 5, 2024
18851d3
tweaks
zslade Feb 6, 2024
20026b2
update density
zslade Feb 6, 2024
4a8bc93
update node degree
zslade Feb 6, 2024
5b70ed7
remove directed etc
zslade Feb 6, 2024
2bca2d7
tweak explanations
zslade Feb 6, 2024
7e4cc91
fleshing out how to guide
zslade Feb 6, 2024
3c46852
update how to and small tweaks
zslade Feb 6, 2024
c707179
Merge branch 'master' into clusters_topic_guide
zslade Feb 6, 2024
d3f9998
reorder
zslade Feb 12, 2024
21e2fd3
cluster centralisation
zslade Feb 12, 2024
c7d460e
small improvements
zslade Feb 12, 2024
bea498a
improvements
zslade Feb 12, 2024
e3ee66c
Merge branch 'master' into clusters_topic_guide
zslade Feb 12, 2024
5fea483
remove average and absolute
zslade Feb 12, 2024
e0e7495
improving centralisation explaination
zslade Feb 12, 2024
7f28ee7
update link
zslade Feb 12, 2024
05ea72d
small tweak
zslade Feb 17, 2024
f8e880c
remove graph definition
zslade Feb 17, 2024
f2d4ccb
Merge branch 'master' into clusters_topic_guide
zslade Feb 29, 2024
336c94a
Merge branch 'master' into clusters_topic_guide
zslade Mar 4, 2024
be5c9a6
Merge branch 'master' into clusters_topic_guide
RossKen Mar 28, 2024
6200739
minor edits
RossKen Mar 28, 2024
8ba2692
Merge branch 'master' into clusters_topic_guide
zslade Mar 28, 2024
016bb00
changes based off comments
zslade Mar 28, 2024
66ef851
Delete docs/comparison_level_library.md
zslade Mar 28, 2024
13e12b9
Delete docs/datasets.md
zslade Mar 28, 2024
f45d9bf
Delete docs/comparison_library.md
zslade Mar 28, 2024
38ab7b6
Delete docs/comparison_template_library.md
zslade Mar 28, 2024
7d694ca
Delete docs/comparison_level_composition.md
zslade Mar 28, 2024
e8cecfa
Merge branch 'master' into clusters_topic_guide
zslade Apr 2, 2024
9bfac5a
tweaks
zslade Apr 2, 2024
2e3f180
tweak
zslade Apr 2, 2024
858a16e
resolving comments and more tweaks
zslade Apr 2, 2024
22bf41d
update to notebook
zslade Apr 2, 2024
01b60b6
update and fix links
zslade Apr 2, 2024
39d0f34
spellcheck
zslade Apr 2, 2024
fd77b6e
add more graphic metric visuals
RossKen Apr 2, 2024
2032c8f
add cluster centralisation caveat
RossKen Apr 3, 2024
e7d08fe
Merge branch 'master' into clusters_topic_guide
RossKen Apr 3, 2024
d898a0a
add back useful density text
zslade Apr 3, 2024
c85ca71
re-add comparison libraries docs
RossKen Apr 4, 2024
100f2d2
add missing md doc
RossKen Apr 4, 2024
63496b3
fix clusters doc link
RossKen Apr 4, 2024
Binary file added docs/img/clusters/cluster_density.drawio.png
Binary file added docs/img/clusters/cluster_size.drawio.png
Binary file added docs/img/clusters/is_bridge.drawio.png
3 changes: 0 additions & 3 deletions docs/topic_guides/evaluation/clusters.md

This file was deleted.

141 changes: 141 additions & 0 deletions docs/topic_guides/evaluation/clusters/graph_metrics.md
@@ -0,0 +1,141 @@
# Graph metrics

Graph metrics quantify the characteristics of a graph. A simple example of a graph metric is [cluster size](#cluster-size), which is the number of nodes within a cluster.

For data linking with Splink, it is useful to sort graph metrics into three categories:

* [Node metrics](#node-metrics)
* [Edge metrics](#edge-metrics)
* [Cluster metrics](#cluster-metrics)

Each of these is defined below, together with examples and explanations of how it can be applied to linked data to evaluate cluster quality. The examples cover all metrics currently available in Splink.

!!! note

    It is important to bear in mind that whilst graph metrics can be very useful for assessing linkage quality, they are rarely definitive, especially when taken in isolation. A more comprehensive picture can be built by considering various metrics in conjunction with one another.

    It is also important to consider metrics within the context of their distribution and the underlying dataset. For example: a cluster density (see below) of 0.4 might seem low but could actually be above average for the dataset in question; a cluster of size 80 might be suspiciously large for one dataset but not for another.


## :purple_circle: Node metrics

Node metrics quantify the properties of the nodes which live within clusters.

### Node Degree

##### Definition

Node degree is the **number of edges connected to a node**.

##### Example

In the cluster below A has a node degree of 1, whereas D has a node degree of 3.

![Basic Graph - Records](../../../img/clusters/basic_graph_records.drawio.png){:width="80%"}
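
As a rough illustration, node degree can be computed by counting how often each node appears in an edge list. The edge list below is hypothetical: it was chosen so that A has degree 1 and D has degree 3, and it is not necessarily the graph shown in the figure. The function name `node_degrees` is also an assumption made for this sketch and is not part of Splink.

```python
from collections import Counter

# Hypothetical edges for a single cluster (illustrative only).
edges = [("A", "D"), ("B", "C"), ("B", "D"), ("C", "D")]


def node_degrees(edges):
    """Count how many edges touch each node."""
    degrees = Counter()
    for left, right in edges:
        degrees[left] += 1
        degrees[right] += 1
    return dict(degrees)


degrees = node_degrees(edges)
print(degrees)
```

In this sketch A is attached to the cluster by a single edge, which makes it a natural candidate for closer inspection.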

##### Application in Data Linkage

High node degree is generally considered good as it means there are many edges in support of records in a cluster being linked. Nodes with low node degree could indicate links being missed (false negatives) or be the result of a small number of false links (false positives).

However, erroneous links (false positives) could also be the reason for _high_ node degree, so it can be useful to validate the edges of highly connected nodes.

It is important to consider [cluster size](#cluster-size) when looking at node degree. By definition, larger clusters contain more nodes to form links between, allowing nodes within them to attain higher degrees compared to those in smaller clusters. Consequently, low node degree within larger clusters can carry greater significance.

Bear in mind that the degree of a single node in a cluster isn't necessarily representative of the overall connectedness of a cluster. This is where [cluster centralisation](#cluster-centralisation) can help.

<hr>

## :link: Edge metrics

Edge metrics quantify the properties of the edges within a cluster.

### 'is bridge'

##### Definition

An edge is classified as a 'bridge' if its **removal splits a cluster into two smaller clusters**.

##### Example

The removal of the link labelled "Bridge" below would break this cluster of 9 nodes into two clusters of 5 and 4 nodes, respectively.

![](../../../img/clusters/is_bridge.drawio.png){:width="70%"}
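
The definition can be sketched directly in code: remove each edge in turn and check whether the number of connected components increases. This brute-force approach is only intended to illustrate the idea and is not how Splink computes the metric. The 9-node edge list is hypothetical (it mirrors the 5 and 4 split described above but is not taken from the figure), and the helper names are assumptions.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Return the connected components of an undirected graph as sets of nodes."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


def find_bridges(nodes, edges):
    """Return the edges whose removal increases the number of components."""
    baseline = len(connected_components(nodes, edges))
    bridges = []
    for i, edge in enumerate(edges):
        remaining = edges[:i] + edges[i + 1:]
        if len(connected_components(nodes, remaining)) > baseline:
            bridges.append(edge)
    return bridges


# Hypothetical cluster: nodes 1-5 and 6-9 are each well connected,
# and a single edge (5, 6) joins the two groups.
nodes = list(range(1, 10))
edges = [
    (1, 2), (2, 3), (3, 4), (4, 5), (5, 1), (1, 3),
    (6, 7), (7, 8), (8, 9), (9, 6),
    (5, 6),
]

bridges = find_bridges(nodes, edges)
print(bridges)
```

Edges that sit on a cycle are never bridges, which is why only the edge joining the two groups is reported here.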

##### Application in Data Linkage

Bridges can signal false positives in linked data, especially when they join two highly connected sub-clusters. Examining bridges can shed light on issues in the linking process that lead to false positive links.

<hr>

## :fontawesome-solid-circle-nodes: Cluster metrics

Cluster metrics refer to the characteristics of a cluster as a whole, rather than the individual nodes and edges it contains.

### Cluster Size

##### Definition

Cluster size refers to the **number of nodes within a cluster**.

##### Example

The cluster below is of size 5.

![](../../../img/clusters/cluster_size.drawio.png){:width="30%"}

##### Application in Data Linkage

When thinking about cluster size, it is often useful to consider the biggest clusters produced and ask yourself if the sizes seem reasonable for the dataset being linked. For example, when linking people, does it make sense for an individual to appear more than 100 times in the linked data, producing a cluster of over 100 nodes? If the answer is no, then false positive links are probably being formed.

If you don't have an intuition of what seems reasonable, then it is worth inspecting a sample of the largest clusters in Splink's [Cluster Studio Dashboard](../../../charts/cluster_studio_dashboard.ipynb) to validate (or invalidate) links. From there you can develop an understanding of what maximum cluster size to expect for your linkage. Bear in mind that a large and highly dense cluster is usually less suspicious than a large low-density cluster.

There might also be a lower bound on cluster size. For example, when linking two datasets in which you know people appear at least once in each, the minimum expected cluster size is 2. Clusters smaller than the minimum size indicate that links have been missed.
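
The two checks described above (inspecting the largest clusters and flagging clusters below an expected minimum) can be sketched as follows. The record and cluster identifiers, the `assignments` structure, and the threshold are all hypothetical; in practice the input would come from your own clustered output.

```python
from collections import Counter

# Hypothetical (record_id, cluster_id) pairs from a clustered linkage.
assignments = [
    ("r1", "c1"), ("r2", "c1"), ("r3", "c1"), ("r4", "c1"), ("r5", "c1"),
    ("r6", "c2"), ("r7", "c2"),
    ("r8", "c3"),
]

# Cluster size is the number of records (nodes) assigned to each cluster.
sizes = Counter(cluster_id for _, cluster_id in assignments)

# Largest clusters first, as candidates for manual inspection.
largest = sizes.most_common(1)

# Assumed lower bound: every person should appear at least twice.
minimum_expected_size = 2
too_small = [c for c, n in sizes.items() if n < minimum_expected_size]

print(largest, too_small)
```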

### Cluster Density

##### Definition

The density of a cluster is given by the **number of edges it contains divided by the maximum possible number of edges**. Density ranges from 0 to 1. A density of 1 means that all nodes are connected to all other nodes in a cluster.

##### Example

The left cluster below has links between all nodes (giving a density of 1), whereas the right cluster has the minimum number of edges (4) to link 5 nodes together (giving a density of 0.4).

![](../../../img/clusters/cluster_density.drawio.png){:width="80%"}
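
To make the arithmetic explicit: an undirected cluster of N nodes can contain at most N(N-1)/2 edges, so 5 nodes allow up to 10 edges. A minimal sketch of the calculation is given below; the function name is an assumption for illustration, and returning `None` for clusters with fewer than two nodes is a choice made for this sketch rather than documented Splink behaviour.

```python
def cluster_density(n_nodes, n_edges):
    """Edges present divided by the maximum possible number of edges."""
    if n_nodes < 2:
        return None  # density is undefined when no edge is possible
    max_edges = n_nodes * (n_nodes - 1) / 2
    return n_edges / max_edges


fully_connected = cluster_density(n_nodes=5, n_edges=10)
minimally_connected = cluster_density(n_nodes=5, n_edges=4)
print(fully_connected, minimally_connected)
```

Because the maximum number of edges grows quadratically with cluster size, a minimally connected cluster has a lower density the larger it is.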

##### Application in Data Linkage

When evaluating clusters, a high density (closer to 1) is generally considered good as it means there are many edges in support of the records in a cluster being linked.

A low density could indicate links being missed. This could happen, for example, if blocking rules are too tight or the clustering threshold is too high.

A sample of low density clusters can be inspected in Splink's [Cluster Studio Dashboard](../../../charts/cluster_studio_dashboard.ipynb) via the option `sampling_method = "lowest_density_clusters_by_size"`, which performs stratified sampling across different cluster sizes. When inspecting a cluster, ask yourself the question: why aren't more links being formed between record nodes?


### Cluster Centralisation

!!! info "Work in Progress"

    We are still working out where Cluster Centralisation can be best used in the context of record linkage. At this stage, we do not have clear recommendations or guidance on the best places to use it - so if you have any expertise in this area we would love to [hear from you](https://github.com/moj-analytical-services/splink/discussions)!

    We will update this guidance as and when we have clearer strategies in this space.

##### Definition

[Cluster centralisation](https://en.wikipedia.org/wiki/Centrality#Degree_centrality) is defined as the deviation from maximum [node degree](#node-degree) normalised with respect to the maximum possible value. In other words, cluster centralisation tells us about the concentration of edges in a cluster. Centralisation ranges from 0 to 1.
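
One common way to make this definition concrete is Freeman's degree centralisation: sum the differences between the maximum node degree and each node's degree, then divide by (N-1)(N-2), the value that sum takes for a star-shaped cluster of N nodes. The sketch below assumes this formula; it is not confirmed here as Splink's exact implementation, and the function name and the handling of clusters with fewer than three nodes are assumptions.

```python
def cluster_centralisation(degrees):
    """Freeman degree centralisation from a list of node degrees (assumed formula)."""
    n = len(degrees)
    if n < 3:
        return None  # the normalising term (n - 1)(n - 2) is zero
    max_degree = max(degrees)
    deviation = sum(max_degree - d for d in degrees)
    return deviation / ((n - 1) * (n - 2))


# Star: one hub connected to four otherwise unconnected nodes.
star = cluster_centralisation([4, 1, 1, 1, 1])

# Fully connected cluster of five nodes: every node has degree 4.
complete = cluster_centralisation([4, 4, 4, 4, 4])

print(star, complete)
```

Under this formula a star-shaped cluster scores 1 and any cluster in which every node has the same degree scores 0.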

##### Example

Coming Soon

##### Application in Data Linkage

A high cluster centralisation (closer to 1) indicates that a few nodes have significantly more connections than the rest of the nodes in a cluster. This can help identify clusters containing nodes with fewer connections (low node degree) than is possible for that cluster.

Low centralisation suggests that edges are more evenly distributed amongst nodes in a cluster. This can be good if all nodes within a cluster enjoy many connections. However, low centralisation could also indicate that most nodes are not as highly connected as they could be. To check for this, look at low centralisation in conjunction with low [density](#cluster-density).

<hr>

The next chapter explains [how to compute the graph metrics](./how_to_compute_metrics.ipynb) described above with Splink.

Please note, this topic guide is a work in progress and we welcome any feedback.