A cool algorithm that can be a very good interview question. This algorithm deals with searching a set of strings in a large text. The typical use case is to detect plagiarism.
Here is the pseudo code from Wikipedia entry.
s[i+1..i+m] = s[i..i+m-1] - s[i] + s[i+m]
The real benefit of this algorithm is seen when we search for multiple patterns within a text. We could use a bloom filter to store the hashes.
For the single pattern matches, a better algorithm is provided by Knuth-Morris-Pratt. The Wikipedia article on the same is really informative.
Here is the pseudo code from Wikipedia entry.
1 function RabinKarp(string s[1..n], string sub[1..m]) 2 hsub := hash(sub[1..m]); hs := hash(s[1..m]) 3 for i from 1 to n-m+1 4 if hs = hsub 5 if s[i..i+m-1] = sub 6 return i 7 hs := hash(s[i+1..i+m]) 8 return not found
The trick here is not to let the line 7 run in O(m). This can be easily achieved by leveraging the fact that we already have the hash code computed for s[i] to s[i+m-1]. So, if we use a rolling-hash technique, we can get this done in constant time. For example,
s[i+1..i+m] = s[i..i+m-1] - s[i] + s[i+m]
The real benefit of this algorithm is seen when we search for multiple patterns within a text. We could use a bloom filter to store the hashes.
For the single pattern matches, a better algorithm is provided by Knuth-Morris-Pratt. The Wikipedia article on the same is really informative.
No comments:
Post a Comment