Advanced Algorithms: Polynomial Rolling Hash

July 17, 2019 • ☕️ 3 min read

Hash Function

Hash functions are used to map large data sets of elements of an arbitrary length (the keys) to smaller data sets of elements of a fixed length (the fingerprints).

The basic application of hashing is efficient testing of equality of keys by comparing their fingerprints.

A collision happens when two different keys have the same fingerprint. The way in which collisions are handled is crucial in most applications of hashing. Hashing is particularly useful in construction of efficient practical algorithms.

Rolling Hash

A rolling hash (also known as recursive hashing or rolling checksum) is a hash function where the input is hashed in a window that moves through the input.

A few hash functions allow a rolling hash to be computed very quickly — the new hash value is rapidly calculated given only the following data:

  • old hash value,
  • the old value removed from the window,
  • and the new value added to the window.

Polynomial String Hashing

An ideal hash function for strings should obviously depend both on the multiset of the symbols present in the key and on the order of the symbols. The most common family of such hash functions treats the symbols of a string as coefficients of a polynomial with an integer variable p and computes its value modulo an integer constant M:

The Rabin–Karp string search algorithm is often explained using a very simple rolling hash function that only uses multiplications and additions - polynomial rolling hash:

H(s0, s1, …, sk) = s0 _ pk-1 + s1 _ pk-2 + … + sk * p0

where p is a constant, and (s1, … , sk) are the input characters.

For example we can convert short strings to key numbers by multiplying digit codes by powers of a constant. The three letter word ace could turn into a number by calculating:

key = 1 _ 262 + 3 _ 261 + 5 * 260

In order to avoid manipulating huge H values, all math is done modulo M.

H(s0, s1, …, sk) = (s0 _ pk-1 + s1 _ pk-2 + … + sk * p0) mod M

A careful choice of the parameters M, p is important to obtain “good” properties of the hash function, i.e., low collision rate.

This approach has the desirable attribute of involving all the characters in the input string. The calculated key value can then be hashed into an array index in the usual way:

function hash(key, arraySize) {
  const base = 13

  let hash = 0
  for (let charIndex = 0; charIndex < key.length; charIndex += 1) {
    const charCode = key.charCodeAt(charIndex)
    hash += charCode * base ** (key.length - charIndex - 1)
  }

  return hash % arraySize
}

The hash() method is not as efficient as it might be. Other than the character conversion, there are two multiplications and an addition inside the loop. We can eliminate one multiplication by using *Horner’s method:

a4 _ x4 + a3 _ x3 + a2 _ x2 + a1 _ x1 + a0 = (((a4 _ x + a3) _ x + a2) _ x + a1) _ x + a0

In other words:

Hi = (P * Hi-1 + Si) mod M

The hash() cannot handle long strings because the hashVal exceeds the size of int. Notice that the key always ends up being less than the array size. In Horner’s method we can apply the modulo (%) operator at each step in the calculation. This gives the same result as applying the modulo operator once at the end, but avoids the overflow.

function hash(key, arraySize) {
  const base = 13

  let hash = 0
  for (let charIndex = 0; charIndex < key.length; charIndex += 1) {
    const charCode = key.charCodeAt(charIndex)
    hash = (hash * base + charCode) % arraySize
  }

  return hash
}

Polynomial hashing has a rolling property: the fingerprints can be updated efficiently when symbols are added or removed at the ends of the string (provided that an array of powers of p modulo M of sufficient length is stored). The popular Rabin–Karp pattern matching algorithm is based on this property

References