Sunday, November 13, 2011

Rabin-Karp Algorithm

A cool algorithm that can be a very good interview question. This algorithm deals with searching a set of strings in a large text. The typical use case is to detect plagiarism.

Here is the pseudo code from Wikipedia entry.

 1 function RabinKarp(string s[1..n], string sub[1..m])
 2     hsub := hash(sub[1..m]);  hs := hash(s[1..m])
 3     for i from 1 to n-m+1
 4         if hs = hsub
 5             if s[i..i+m-1] = sub
 6                 return i
 7         hs := hash(s[i+1..i+m])
 8     return not found
 
The trick here is not to let the line 7 run in O(m). This can be easily achieved by leveraging the fact that we already have the hash code computed for s[i] to s[i+m-1]. So, if we use a rolling-hash technique, we can get this done in constant time. For example,

s[i+1..i+m] = s[i..i+m-1] - s[i] + s[i+m]

The real benefit of this algorithm is seen when we search for multiple patterns within a text. We could use a bloom filter to store the hashes.

For the single pattern matches, a better algorithm is provided by Knuth-Morris-Pratt. The Wikipedia article on the same is really informative.

No comments:

Post a Comment