A HashMap is a Map implementation. HashMaps are composed of two things: 1) a hash function and 2) a bucket array to store values.
Before going into the implementation details let’s give an overview of how it works. Let’s say we want to keep a tally of things and animals:
link:../../../src/data-structures/maps/hash-maps/hash-map.js[role=include]
How are the keys mapped to their values? Using a hash function. Here’s an illustration:
-
We use a hash function to transform the keys (e.g., dog, cat, rat, …) into an array index. This array is called bucket.
-
The bucket holds the values or list of values in case of collisions.
In the illustration, we have a bucket size of 10. In bucket 0, we have a collision. Both cat
and art
keys map to the same bucket even thought their hash codes are different.
In a HashMap, a collision is when different keys lead to the same index. They are nasty for performance since it can reduce the search time from O(1) to O(n).
Having a big bucket size can avoid a collision but also can waste too much memory. We are going to build an optimized HashMap that re-sizes itself when it is getting full. This auto-resizing avoids collisions and don’t need to allocate too much memory upfront. Let’s start with the hash function.
To minimize collisions, we need to create an excellent hash function.
Important
|
A perfect hash function is one that assigns a unique array index for every different key. |
It’s no practical and memory-wise wasteful to have a perfect hash function, so we are going to shoot for a cost-effective hash function instead.
-
A hash function converts keys into array indices.
-
A hash function is composed of two parts:
-
Hash Code: maps any key into an integer (unbonded)
-
Compression function: maps an arbitrary integer to integer in the range of [0… BUCKET_SIZE -1].
-
Before doing a great hash function, let’s see what a lousy hash function looks like. 😉
The goal of a hash code function is to convert any value given into a positive integer — a common way to accomplish with summing each string’s Unicode value.
link:../../../src/data-structures/maps/hash-maps/hashing.js[role=include]
This function uses codePointAt
to get the Unicode value of a character. E.g., a
has a value of 97, A
is 65, even emojis have codes; “😁” is 128513
.
string.charCodeAt
vs. string.codePointAt
The charCodeAt()
method returns an integer between 0
and 65535
representing the UTF-16 code unit at the given index. However, it doesn’t play nice with Unicode, so it’s better to use codePointAt
instead.
The codePointAt()
method returns a non-negative integer that is the Unicode code point value.
With this function we have the can convert some keys to numbers as follows:
link:../../../src/data-structures/maps/hash-maps/hashing.js[role=include]
Notice that rat
and art
have the same hash code! These are collisions that we need to solve.
Collisions happened because we are adding the letter’s Unicode and are not taking the order into account nor the type. We can do better by offsetting the character value based on their position in the string. We can also add the object type, so number 10
produce different output than string '10'
.
link:../../../src/data-structures/maps/hash-maps/hashing.js[role=include]
Since Unicode uses 20 bits, we can offset each character by 20 bits based on the position.
BigInt
BigInt allows operating beyond the maximum safe limit of integers.
Number.MAX_SAFE_INTEGER // => 9,007,199,254,740,991
BigInt has no virtual limits (until you run out of physical memory). It uses the suffix n
.
1n + 3n === 4n
As you can imagine, summing 20bits per letter leads to a humongous number! That’s the case even for three letters words. We are using BigInt
, so it doesn’t overflow.
link:../../../src/data-structures/maps/hash-maps/hashing.js[role=include]
We don’t have duplicates anymore! If the keys have different content or type, they have a different hash code. However, we need to represent these unbounded integers to finite buckets in an array. We do that using compression function. This function can be as simple as % BUCKET_SIZE
.
However, there’s an issue with the last implementation. It doesn’t matter how enormous (and different) is the hash code number if we at the end use the modulus to get an array index. The part of the hash code that truly matters is the last bits.
10 % 4 //↪️ 2
20 % 4 //↪️ 0
30 % 4 //↪️ 2
40 % 4 //↪️ 0
50 % 4 //↪️ 2
All the hash codes are different and still we get many collisions! 😱
Based on numbers properties, using a prime number as the modulus produce fewer collisions.
10 % 7 //↪️ 3
20 % 7 //↪️ 6
30 % 7 //↪️ 2
40 % 7 //↪️ 4
50 % 7 //↪️ 1
Now it’s more evenly distributed!! 😎👍
-
Bucket size should always be a prime number, so data is distributed more evenly and minimized collisions.
-
Hash code doesn’t have to be too big. At the end what matters is the few last digits.
Let’s design a better HashMap with what we learned.
We are going to use a battle-tested non-cryptographic hash function called FNV Hash.
It is a non-cryptographic hash function designed to be fast while maintaining a low collision rate. The high dispersion of the FNV hashes makes them well suited for hashing nearly identical strings such as URLs, keys, IP addresses, zip codes, and others.
Take a look at the following function:
link:../../../src/data-structures/maps/hash-maps/hash-map.js[role=include]
Is somewhat similar to what we did before, in the sense that we use each letter’s Unicode is used to compute the hash. The difference is:
-
We are using the XOR bitwise operation (
^
) to produce an avalanche effect, where a small change in two strings produces completely different hash codes. E.g.
hashCode('cat') //↪️ 4201630708
hashCode('cats') //↪️ 3304940933
A one letter change produce a very different output.
We are using the FVN-1a prime number (16777619
) and the offset (2166136261
) to reduce collisions even further. If you are curious where these numbers come from check out this link.
FVN-1a hash function is a good trade-off between speed and collision prevention.
Now that we have a proper hash function. Let’s move on with the rest of the HashMap implementation.
Let’s start by creating a class and its constructor to initialize the hash map. We are going to have an array called buckets
to hold all the data.
class HashMap {
link:../../../src/data-structures/maps/hash-maps/hash-map.js[role=include]
this.buckets = new Array(this.initialCapacity);
this.size = 0;
this.collisions = 0;
}
link:../../../src/data-structures/maps/hash-maps/hash-map.js[role=include]
}
Notice that we are also keeping track of collisions (for benchmarking purposes) and a load factor. The load factor measures how full the hash map is. We don’t want to be fuller than 75%. If the HashMap is getting too full, then we are going to fix it doing a rehash (more on that later).
To insert values into a HashMap, we first convert the key into an array index using the hash and compression function. Each bucket of the array will have an object with the shape of {key, value}
.
In code, it looks like this:
link:../../../src/data-structures/maps/hash-maps/hash-map.js[role=include]
-
Key doesn’t exist yet, so we create the new key/value pair.
-
Key already exists, then we will replace the value.
-
Key doesn’t exist, but the bucket already has other data, this is a collision! We push the new element to the bucket.
-
To keep insertion order, we keep track of the order of the keys using
keysTrackerArray
andkeysTrackerIndex
.
Notice, that we are using a function called getEntry
to check if the key already exists. It gets the index of the bucket corresponding to the key and then checks if the entry with the given key exists. We are going to implement this function in a bit.
For getting values out of the Map, we do something similar to inserting. We convert the key into an index
using the hash function, then we that index
we look for the value in the bucket.
link:../../../src/data-structures/maps/hash-maps/hash-map.js[role=include]
-
Convert key to an array index.
-
If the bucket is empty create a new linked list
-
Use Linked list’s part02-linear-data-structures.asc method to find value on the bucket.
-
Return
bucket
andentry
if found.
With the help of the getEntry
method, we can do the HashMap.get
and HashMap.has
methods:
link:../../../src/data-structures/maps/hash-maps/hash-map.js[role=include]
and also,
link:../../../src/data-structures/maps/hash-maps/hash-map.js[role=include]
For HashMap.has
we only care if the value exists or not, while that for HashMap.get
we want to return the value or undefined
if it doesn’t exist.
Removing items from a HashMap is not too different from what we did before.
link:../../../src/data-structures/maps/hash-maps/hash-map.js[role=include]
If the bucket doesn’t exist or is empty, we don’t have to do anything else. If the value exists, we use the
LinkedList.remove
method.
Rehashing is a technique to minimize collisions when a hash map is getting full. It doubles the size of the map and recomputes all the hash codes and insert data in the new buckets.
When we increase the map size, we try to find the next prime. We explained that keeping the bucket size a prime number is beneficial for minimizing collisions.
link:../../../src/data-structures/maps/hash-maps/hash-map.js[role=include]
In the prime.js file you can find the implementation for finding the next prime. Also, you can see the full HashMap implementation on this file: hashmap.js
Hash Map it’s very optimal for searching values by key in constant time O(1). However, searching by value is not any better than an array since we have to visit every value O(n).
Data Structure |
Searching By |
Insert |
Delete |
Space Complexity |
|
Index/Key |
Value |
||||
Hash Map (naïve) |
O(n) |
O(n) |
O(n) |
O(n) |
O(n) |
Hash Map (optimized) |
O(1)* |
O(n) |
O(1)* |
O(1)* |
O(1)* |
* = Amortized run time. E.g. rehashing might affect run time.
As you can notice we have amortized times since, in the unfortunate case of a rehash, it will take O(n) while it resizes. After that, it will be O(1).