The KMP Algorithm
Not difficult to understand. I learned it from Coursera.
Prefix Function
- A “Border” of a string, is a substring that both acts as a prefix and a suffix of that string
- The “Prefix Function” of a string records the length of the longest border of every prefix of that string
Symbol
s
is a strings[:i]
is a prefix ofs
made up bys[0], s[1], ..., s[i-1]
L[i]
is the length of the longest border ofs[:i]
- In other words,
L
is the “Prefix Function” ofs
- In other words,
Conclusion
L[i+1] <= L[i]+1
, and the equality holds if and only ifs[L] = s[i]
- The longest border of the longest border of a string, is the second longest border of that string.
- if some border of
s[:i]
has a length oflen
ands[len] = s[i]
, then there must be a border ins[:i+1]
with a length oflen+1
Method
L[0] = 0
L[i+1] = len + 1
, wherelen
is the length of the longest border ofs[:i]
that satisfiess[len] = s[i]
- To find
len
, try every border ofs[:i]
from the longest to the shortest - To find the next longest border, see Conclusion 2
- If no border is found, simply compare
s[0]
withs[i]
- If
s[0] = s[i]
, thenL[i+1] = 1
- Else,
L[i+1] = 0
- To find
Implementation
int *pf(string s) {
int len=s.size(),b=0;
int *p=new int[len]; p[0]=0;
for(int i=1;i<len;i++) {
while(b>0 && s[i]!=s[b]) b=p[b-1];
if(s[i]==s[b]) b++;
p[i] = b;
}
return p;
}
KMP
Symbol
p
is the pattern string,t
is the text stringL
is the “Prefix Function” ofp
Conclusion
- If
p
matchest
partially with a longest common prefixp[:i]
, then no matches will be found beforep
is shiftedlen(p) - L[i]
positions rightwards.- There are two “submatches” in this partial match, both of which are the longest border of
p[:i]
- The left one acts as a prefix, while the right one acts as a postfix
p
is shifted rightwards, so doesp[:i]
- No matches are possible before the left submatch arrives at the previous position of the right submatch
- The length of this “vacuum area” is
len(p) - L[i]
- There are two “submatches” in this partial match, both of which are the longest border of
Method
- Make
s = p + '$' + t
, where ‘$’ is absent from bothp
andt
- Calculate the “Prefix Function” of
s
, marked asL
- A match is found at position
i - 2 * len(p)
whenL[i] = len(p)
- Under such circumstances, the longest border of
s[:i]
happens to bep
itself, asp
is a prefix ofs
- This border is also a substring of
s
- Thanks to the ‘$’ sign, we can guarantee that this border never crosses the real “border” between
p
andt
, so it is also a substring oft
- Thus the starting position of the match is
i - (len(p) - 1)
ins
, andi - (len(p) - 1) - (len(p) + 1)
int
- Under such circumstances, the longest border of
Implementation
int main() {
string pattern, text, s;
int *pf;
while(cin >> pattern >> text && pattern != "") {
if(pattern.size() > text.size()) {
cout << "-1" << endl;
continue;
}
s = pattern + "$" + text;
pf = prefix_func(s);
for(int i = pattern.size()+1; i < s.size(); i++) {
if(pf[i] == pattern.size()) {
cout << i - 2 * pattern.size() << " ";
}
}
cout << endl;
delete pf;
}
return 0;
}