This module finds a non-euclidean distance or similarity between two strings.
The Jaro and Jaro-Winkler equations provide a score between two short strings where errors are more likely toward the end of the string. Jaro's measure is the weighted sum of the percentage of matching and transposed characters from each string. Winkler's factor adds weight to Jaro's formula to increase the calculated measure when both strings start with the same sequence of characters (a common prefix).
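For reference, the two measures are commonly written as below. The symbols are assumptions introduced here, not names taken from this module: $m$ is the number of matching characters, $t$ the number of transpositions, $|s_1|$ and $|s_2|$ the string lengths, $\ell$ the length of the common prefix and $p$ the scaling factor.

```latex
\[
\mathrm{sim}_j =
\begin{cases}
0 & \text{if } m = 0,\\[4pt]
\dfrac{1}{3}\left(\dfrac{m}{|s_1|} + \dfrac{m}{|s_2|} + \dfrac{m - t}{m}\right) & \text{otherwise,}
\end{cases}
\qquad
\mathrm{sim}_w = \mathrm{sim}_j + \ell \, p \,\bigl(1 - \mathrm{sim}_j\bigr)
\]
```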
This version is based on the original C implementation of strcmp95 but does not attempt to normalize characters that look alike (e.g. `O` vs `0`).
- Impact of the prefix is limited to 4 characters, as originally defined by Winkler.
- Input strings are not modified beyond whitespace trimming.
- In-word whitespace and character case optionally affect the score.
- Returns a floating point number rounded to the desired decimals (defaults to `2`) using Python's `round`.
- Consider usual floating point arithmetic characteristics when working with this module.
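As a reminder of what those floating point characteristics look like in practice, here is a short illustration using only Python's built-in `round`. It does not call this module; it only shows behavior that any score rounded this way can exhibit.

```python
# Exact ties are rounded to the nearest even digit ("banker's rounding").
print(round(0.125, 2))   # 0.12 (0.125 is exactly representable in binary)
print(round(0.375, 2))   # 0.38

# 2.675 cannot be stored exactly; the nearest float is slightly below it.
print(round(2.675, 2))   # 2.67

# Sums of decimal fractions are not always exact either.
print(0.1 + 0.2 == 0.3)  # False
```

When comparing scores, prefer a tolerance (for example `math.isclose`) over strict equality.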
The complexity of this algorithm resides in finding the `matching` and `transposed` characters, because implementations differ in how they interpret the matching conditions and in how they define `transposed`. These differences make the score vary between implementations of this algorithm.
Here is how `matching` and `transposed` are defined in this module:
- A character of the first string at position `N` is `matching` if it is found at position `N`, or within `distance` positions on either side, in the second string.
- The `distance` is the length of the longest string divided by two, rounded down, minus one.
- Characters in the first string are matched only once against characters of the second string.
- Two characters are `transposed` if they previously matched and aren't at the same position in the matching character subset.
- Decimals are rounded according to the scientific method.
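The matching and transposition rules listed above can be sketched in Python as follows. This is a minimal sketch, not this module's actual code. The function names are hypothetical, and it makes two assumptions borrowed from strcmp95: the first unused occurrence inside the window is taken as the match, and the transposition count is half the number of matched characters that are out of order. Trimming, case handling and rounding are left out.

```python
def jaro_similarity(first: str, second: str) -> float:
    """Hypothetical helper: Jaro similarity under the assumptions stated above."""
    if not first or not second:
        return 0.0
    # Search distance: floor(longest length / 2) - 1, never negative.
    distance = max(max(len(first), len(second)) // 2 - 1, 0)
    used = [False] * len(second)   # characters of `second` already matched
    matched_first = []             # matched characters, in `first` order
    for i, char in enumerate(first):
        low = max(0, i - distance)
        high = min(len(second), i + distance + 1)
        for j in range(low, high):
            if not used[j] and second[j] == char:
                used[j] = True     # each character is matched only once
                matched_first.append(char)
                break
    m = len(matched_first)
    if m == 0:
        return 0.0
    matched_second = [c for c, flag in zip(second, used) if flag]
    # Assumption: t is half the count of matched characters that are out of order.
    t = sum(a != b for a, b in zip(matched_first, matched_second)) // 2
    return (m / len(first) + m / len(second) + (m - t) / m) / 3


def jaro_winkler_similarity(first: str, second: str, scaling: float = 0.1) -> float:
    """Hypothetical helper: adds the prefix bonus, capped at 4 characters."""
    jaro = jaro_similarity(first, second)
    prefix = 0
    for a, b in zip(first[:4], second[:4]):
        if a != b:
            break
        prefix += 1
    return jaro + prefix * scaling * (1 - jaro)


print(round(jaro_similarity("hello", "haloa"), 2))          # 0.73
print(round(jaro_winkler_similarity("hello", "haloa"), 2))  # 0.76
```

Under these assumptions the sketch reproduces the scores shown in this README's usage examples, but other implementations may legitimately differ on either point.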
TODO: The implementation should be refactored to use Python's `decimal` module from the standard library. That module has been available since Python 2.4.
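One possible shape for such a refactoring is sketched below. The helper name `round_decimal` is hypothetical, and which rounding mode corresponds to the "scientific method" mentioned above is not stated in this document, so the mode is left as a parameter rather than guessed.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def round_decimal(value: str, decimals: int = 2, rounding: str = ROUND_HALF_UP) -> Decimal:
    """Hypothetical helper: round a decimal string to `decimals` places."""
    quantum = Decimal(10) ** -decimals          # e.g. Decimal('0.01')
    return Decimal(value).quantize(quantum, rounding=rounding)


print(round_decimal("0.125"))                            # 0.13
print(round_decimal("0.125", rounding=ROUND_HALF_EVEN))  # 0.12
print(round_decimal("2.675"))                            # 2.68
```

Passing the value as a string keeps the input exact; building a `Decimal` from a float would carry the binary representation error along.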
Calculate the Jaro-Winkler similarity of `PENNSYLVANIA` and `PENNCISYLVNIA`:
    P E N N C I S Y L V N I A
  ┌──────────────────────────
P │ 1           ╎
E │   1           ╎
N │     1           ╎
N │       1           ╎          Symbols '╎' represent the sliding window
S │             1       ╎        boundary in the second string where we look
Y │ ╎             1       ╎      for the first string's character.
L │   ╎             1       ╎
V │     ╎             1          d = 5 in this example.
A │       ╎                 1
N │         ╎           1
I │           ╎           1
A │             ╎
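The arithmetic for this example can be written out as below. The parameter values are assumptions read off the matrix and the definitions above, not figures quoted from the module: $m = 11$ matches (the final `A` finds none), $t = 3$ transpositions, $|s_1| = 12$, $|s_2| = 13$, a common prefix `PENN` of length $\ell = 4$, and a scaling factor $p = 0.1$.

```latex
\[
\mathrm{sim}_j
  = \frac{1}{3}\left(\frac{11}{12} + \frac{11}{13} + \frac{11 - 3}{11}\right)
  \approx 0.830031
\]
\[
\mathrm{sim}_w
  = \mathrm{sim}_j + \ell \, p \,(1 - \mathrm{sim}_j)
  = 0.830031 + 4 \times 0.1 \times (1 - 0.830031)
  \approx 0.898019
\]
```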
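Considering the input parameters calculated above, we found that the Jaro similarity is 0.830031080031 and the Jaro-Winkler similarity is 0.898018648019, as the first two calls below show: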
from pyjarowinkler import distance
distance.get_jaro_similarity("PENNSYLVANIA", "PENNCISYLVNIA", decimals=12)
# 0.830031080031
distance.get_jaro_winkler_similarity("PENNSYLVANIA", "PENNCISYLVNIA", decimals=12)
# 0.898018648019
distance.get_jaro_distance("hello", "haloa", decimals=4)
# 0.2667
distance.get_jaro_similarity("hello", "haloa", decimals=2)
# 0.73
distance.get_jaro_winkler_distance("hello", "Haloa", scaling=0.1, ignore_case=False)
# 0.4
distance.get_jaro_winkler_distance("hello", "HaLoA", scaling=0.1, ignore_case=True)
# 0.24
distance.get_jaro_winkler_similarity("hello", "haloa", decimals=2)
# 0.76
You need to have `asdf` installed on your system. Then, running the commands below will set up your environment with the project's optional (dev) requirements and create the Python virtual environment necessary to run the test, lint, and build steps.
Typical order of execution is as follows:
$ cd ./jaro-winkler-distance
$ asdf install
$ pip install '.[dev]'
$ hatch python install 3.13 3.12 3.11 3.10 3.9
$ hatch env create
Other helpful commands:
hatch test
hatch fmt
hatch env show
hatch run test:unit
hatch run test:all
hatch run lint:all
$ ./release.sh help
Usage: release.sh [help|major|minor|patch]