Karp-Rabin


2020-08-05 algorithms 💎

The Karp-Rabin algorithm is a string matching algorithm that uses a rolling hash function to determine if one string is a substring of another.

To compare the string s and t we can compare s with each substring in t. This would yield a time complexity of O(len(s) * len(t)).

t: [ | | | | | | | | | | | | | | | | ]
s: [ | | | ]
  s: [ | | | ]
    s: [ | | | ]
      s: [ | | | ]
        s: [ | | | ]
          s: [ | | | ]
            s: [ | | | ]
              s: [ | | | ]
                s: [ | | | ]
                  s: [ | | | ]
                    s: [ | | | ]
                      s: [ | | | ]
                        s: [ | | | ]
                          s: [ | | | ]

To do this a little more efficiently, we can convert the string s to a numeric value using a hash function. Using the same hash function we can append the next character and remove the previous character.

E.g.

To compare bce with abcdabce a brute force version might look like the following:

t: [a|b|c|d|a|b|c|e]
s: [b|c|e]
s:   [b|c|e]
s:     [b|c|e]
s:       [b|c|e]
s:         [b|c|e]
s:           [b|c|e]

Each comparison would require comparing each character in the substring (3 characters) with the corresponding character in the sliding window. This means as the size of the substring grows, so does the number of comparisons.

If we convert the substring to an integer value, then add a new character as we slide left and remove the previous character, then we can compare a single value with another single value. We incur the cost of computing the hash code for each sliding window but by using integers and addition/subtraction we make this a constant time operation.

E.g.

If we use the ascii character set then we can convert each character to the equivalent ascii code.

s:  [b|c|e]
sc: [97, 98, 99].sum == 298

t: [a|b|c|d|a|b|c|e]
s: [b|c|e]
  * (298 == [97, 98, 99])
  * (298 == 294)
  * (298 == 294)
s:   [b|c|e]
  * (298 == (294 - 'a' + 'd')
  * (298 == (294 - 97 + 100)
  * (298 == 297)
s:     [b|c|e]
  * (298 == (297 - 'b' + 'a'))
  * (298 == (297 - 97 + 96))
  * (298 == 296)
s:       [b|c|e]
  * (298 == (296 - 'c' + 'b'))
  * (298 == (296 - 99 + 98))
  * (298 == (295))
s:         [b|c|e]
  * (298 == (295 - 'd' + 'c'))
  * (298 == (295 - 100 + 99))
  * (298 == (294))
s:           [b|c|e]
  * (298 == (294 - 'a' + 'e'))
  * (298 == (294 - 97 + 101))
  * (298 == 298)
#!/usr/bin/env ruby

def verify?(s, t, i)
  s.bytes == t.bytes[i..(i+s.bytes.size-1)]
end

def match?(s, t)
  sc, tc = s.bytes.sum, t.bytes[0...s.bytes.size].sum
  return true if sc == tc && verify?(s, t, 0)

  for i in (1...(t.bytes.size))
    tc -= t.bytes[i-1]
    tc += (t.bytes[i+(s.bytes.size-1)] || 0)

    return true if sc == tc && verify?(s, t, i)
  end

  false
end

[
  ["bce", "abcdabce", true],
  ["bce", "abcdabcd", false],
  ["dba", "ccaccaae", false],
].each do |(s, t, expected)|
  result = match?(s, t)
  puts [s, t, result, expected == result ? "PASS" : "FAIL"].inspect
end