Skip to content

Latest commit

 

History

History
310 lines (223 loc) · 13 KB

File metadata and controls

310 lines (223 loc) · 13 KB

HashMap

A HashMap is a Map implementation. HashMaps are composed of two things: 1) a hash function and 2) a bucket array to store values.

Before going into the implementation details let’s give an overview of how it works. Let’s say we want to keep a tally of things and animals:

HashMap example
link:../../../src/data-structures/maps/hash-maps/hash-map.js[role=include]

How are the keys mapped to their values? Using a hash function. Here’s an illustration:

image
Figure 1. Internal HashMap representation
This is the main idea:
  1. We use a hash function to transform the keys (e.g., dog, cat, rat, …) into an array index. This array is called bucket.

  2. The bucket holds the values or list of values in case of collisions.

In the illustration, we have a bucket size of 10. In bucket 0, we have a collision. Both cat and art keys map to the same bucket even thought their hash codes are different.

In a HashMap, a collision is when different keys lead to the same index. They are nasty for performance since it can reduce the search time from O(1) to O(n).

Having a big bucket size can avoid a collision but also can waste too much memory. We are going to build an optimized HashMap that re-sizes itself when it is getting full. This auto-resizing avoids collisions and don’t need to allocate too much memory upfront. Let’s start with the hash function.

Designing an optimized hash function

To minimize collisions, we need to create an excellent hash function.

Important
A perfect hash function is one that assigns a unique array index for every different key.

It’s no practical and memory-wise wasteful to have a perfect hash function, so we are going to shoot for a cost-effective hash function instead.

To recap:
  • A hash function converts keys into array indices.

  • A hash function is composed of two parts:

    1. Hash Code: maps any key into an integer (unbonded)

    2. Compression function: maps an arbitrary integer to integer in the range of [0… BUCKET_SIZE -1].

Before doing a great hash function, let’s see what a lousy hash function looks like. 😉

Analysing collisions on bad hash code functions

The goal of a hash code function is to convert any value given into a positive integer — a common way to accomplish with summing each string’s Unicode value.

Naïve hashing function implementation
link:../../../src/data-structures/maps/hash-maps/hashing.js[role=include]

This function uses codePointAt to get the Unicode value of a character. E.g., a has a value of 97, A is 65, even emojis have codes; “😁” is 128513.

JavaScript built-in string.charCodeAt vs. string.codePointAt

The charCodeAt() method returns an integer between 0 and 65535 representing the UTF-16 code unit at the given index. However, it doesn’t play nice with Unicode, so it’s better to use codePointAt instead.

The codePointAt() method returns a non-negative integer that is the Unicode code point value.

With this function we have the can convert some keys to numbers as follows:

Hashing examples
link:../../../src/data-structures/maps/hash-maps/hashing.js[role=include]

Notice that rat and art have the same hash code! These are collisions that we need to solve.

Collisions happened because we are adding the letter’s Unicode and are not taking the order into account nor the type. We can do better by offsetting the character value based on their position in the string. We can also add the object type, so number 10 produce different output than string '10'.

Hashing function implementation that offset character value based on the position
link:../../../src/data-structures/maps/hash-maps/hashing.js[role=include]

Since Unicode uses 20 bits, we can offset each character by 20 bits based on the position.

JavaScript built-in BigInt

BigInt allows operating beyond the maximum safe limit of integers.

Number.MAX_SAFE_INTEGER // => 9,007,199,254,740,991

BigInt has no virtual limits (until you run out of physical memory). It uses the suffix n.

1n + 3n === 4n

As you can imagine, summing 20bits per letter leads to a humongous number! That’s the case even for three letters words. We are using BigInt, so it doesn’t overflow.

Verifying there’s not hashing code duplicates
link:../../../src/data-structures/maps/hash-maps/hashing.js[role=include]

We don’t have duplicates anymore! If the keys have different content or type, they have a different hash code. However, we need to represent these unbounded integers to finite buckets in an array. We do that using compression function. This function can be as simple as % BUCKET_SIZE.

However, there’s an issue with the last implementation. It doesn’t matter how enormous (and different) is the hash code number if we at the end use the modulus to get an array index. The part of the hash code that truly matters is the last bits.

Look at this example with a bucket size of 4.
10 % 4 //↪️ 2
20 % 4 //↪️ 0
30 % 4 //↪️ 2
40 % 4 //↪️ 0
50 % 4 //↪️ 2

All the hash codes are different and still we get many collisions! 😱

Based on numbers properties, using a prime number as the modulus produce fewer collisions.

Let’s see what happens if the bucket size is a prime number:
10 % 7 //↪️ 3
20 % 7 //↪️ 6
30 % 7 //↪️ 2
40 % 7 //↪️ 4
50 % 7 //↪️ 1

Now it’s more evenly distributed!! 😎👍

So, to sum up:
  • Bucket size should always be a prime number, so data is distributed more evenly and minimized collisions.

  • Hash code doesn’t have to be too big. At the end what matters is the few last digits.

Let’s design a better HashMap with what we learned.

Implementing an optimized hash function

We are going to use a battle-tested non-cryptographic hash function called FNV Hash.

FNV (Fowler/Noll/Vo) Hash

It is a non-cryptographic hash function designed to be fast while maintaining a low collision rate. The high dispersion of the FNV hashes makes them well suited for hashing nearly identical strings such as URLs, keys, IP addresses, zip codes, and others.

Take a look at the following function:

Optimal Hash function
link:../../../src/data-structures/maps/hash-maps/hash-map.js[role=include]

Is somewhat similar to what we did before, in the sense that we use each letter’s Unicode is used to compute the hash. The difference is:

  1. We are using the XOR bitwise operation (^) to produce an avalanche effect, where a small change in two strings produces completely different hash codes. E.g.

Hash Code example using FVN1a
hashCode('cat') //↪️ 4201630708
hashCode('cats') //↪️ 3304940933

A one letter change produce a very different output.

We are using the FVN-1a prime number (16777619) and the offset (2166136261) to reduce collisions even further. If you are curious where these numbers come from check out this link.

FVN-1a hash function is a good trade-off between speed and collision prevention.

Now that we have a proper hash function. Let’s move on with the rest of the HashMap implementation.

Implementing a HashMap in JavaScript

Let’s start by creating a class and its constructor to initialize the hash map. We are going to have an array called buckets to hold all the data.

HashMap’s constructor
class HashMap {
link:../../../src/data-structures/maps/hash-maps/hash-map.js[role=include]
    this.buckets = new Array(this.initialCapacity);
    this.size = 0;
    this.collisions = 0;
  }

link:../../../src/data-structures/maps/hash-maps/hash-map.js[role=include]
}

Notice that we are also keeping track of collisions (for benchmarking purposes) and a load factor. The load factor measures how full the hash map is. We don’t want to be fuller than 75%. If the HashMap is getting too full, then we are going to fix it doing a rehash (more on that later).

Inserting elements in a HashMap

To insert values into a HashMap, we first convert the key into an array index using the hash and compression function. Each bucket of the array will have an object with the shape of {key, value}.

In code, it looks like this:

HashMap’s set method
link:../../../src/data-structures/maps/hash-maps/hash-map.js[role=include]
  1. Key doesn’t exist yet, so we create the new key/value pair.

  2. Key already exists, then we will replace the value.

  3. Key doesn’t exist, but the bucket already has other data, this is a collision! We push the new element to the bucket.

  4. To keep insertion order, we keep track of the order of the keys using keysTrackerArray and keysTrackerIndex.

Notice, that we are using a function called getEntry to check if the key already exists. It gets the index of the bucket corresponding to the key and then checks if the entry with the given key exists. We are going to implement this function in a bit.

Getting values out of a HashMap

For getting values out of the Map, we do something similar to inserting. We convert the key into an index using the hash function, then we that index we look for the value in the bucket.

HashMap’s getEntry method
link:../../../src/data-structures/maps/hash-maps/hash-map.js[role=include]
  1. Convert key to an array index.

  2. If the bucket is empty create a new linked list

  3. Use Linked list’s part02-linear-data-structures.asc method to find value on the bucket.

  4. Return bucket and entry if found.

With the help of the getEntry method, we can do the HashMap.get and HashMap.has methods:

HashMap’s get method
link:../../../src/data-structures/maps/hash-maps/hash-map.js[role=include]

and also,

HashMap’s has method
link:../../../src/data-structures/maps/hash-maps/hash-map.js[role=include]

For HashMap.has we only care if the value exists or not, while that for HashMap.get we want to return the value or undefined if it doesn’t exist.

Deleting from a HashMap

Removing items from a HashMap is not too different from what we did before.

HashMap’s delete method
link:../../../src/data-structures/maps/hash-maps/hash-map.js[role=include]

If the bucket doesn’t exist or is empty, we don’t have to do anything else. If the value exists, we use the LinkedList.remove method.

Rehashing a HashMap

Rehashing is a technique to minimize collisions when a hash map is getting full. It doubles the size of the map and recomputes all the hash codes and insert data in the new buckets.

When we increase the map size, we try to find the next prime. We explained that keeping the bucket size a prime number is beneficial for minimizing collisions.

HashMap’s rehash method
link:../../../src/data-structures/maps/hash-maps/hash-map.js[role=include]

In the prime.js file you can find the implementation for finding the next prime. Also, you can see the full HashMap implementation on this file: hashmap.js

HashMap time complexity

Hash Map it’s very optimal for searching values by key in constant time O(1). However, searching by value is not any better than an array since we have to visit every value O(n).

Table 1. Time complexity for a Hash Map

Data Structure

Searching By

Insert

Delete

Space Complexity

Index/Key

Value

Hash Map (naïve)

O(n)

O(n)

O(n)

O(n)

O(n)

Hash Map (optimized)

O(1)*

O(n)

O(1)*

O(1)*

O(1)*

* = Amortized run time. E.g. rehashing might affect run time.

As you can notice we have amortized times since, in the unfortunate case of a rehash, it will take O(n) while it resizes. After that, it will be O(1).