Understanding the KMP Algorithm for Efficient String Pattern Matching
Fundamentals of the KMP Algorithm
The Knuth-Morris-Pratt (KMP) algorithm optimizes string matching by avoiding unnecessary re-comparisons. When a mismatch occurs, the algorithm leverages previously matched characters to determine the next starting position, rather than restarting from the beginning.
Prefix Table (Next Array)
The core of KMP lies in constructing a prefix table, also known as the failure function or next array. This table stores the length of the longest proper prefix that is also a suffix for each position in the pattern string.
Definitions:
- Prefix: A substring that starts at the first character but excludes the last character.
- Suffix: A substring that ends at the last character but excludes the first character.
For example, consider the pattern aabaaf:
a: longest equal prefix-suffix length = 0aa: longest equal prefix-suffix length = 1 (prefix 'a', suffix 'a')aab: longest equal prefix-suffix length = 0aaba: longest equal prefix-suffix length = 1aabaa: longest equal prefix-suffix length = 2 (prefix 'aa', suffix 'aa')aabaaf: longest equal prefix-suffix length = 0
The resulting prefix table is: [0, 1, 0, 1, 2, 0]
Constructing the Next Array
The construction uses two pointers: prefixLen tracks the length of the current matching prefix, while i iterates through the string.
void buildNext(int next[], const string& pattern) {
int prefixLen = 0;
next[0] = 0;
for (int i = 1; i < pattern.size(); i++) {
// Backtrack when characters don't match
while (prefixLen > 0 && pattern[i] != pattern[prefixLen]) {
prefixLen = next[prefixLen - 1];
}
// Extend prefix length when characters match
if (pattern[i] == pattern[prefixLen]) {
prefixLen++;
}
next[i] = prefixLen;
}
}
Problem: Find First Occurrence of Substring
Given a text string haystack and a pattern string needle, return the index of the first occurrence of needle in haystack, or -1 if not found.
Example:
- Input: haystack = "sadbutsad", needle = "sad"
- Output: 0
Solution Using KMP
class Solution {
public:
void buildNext(int next[], const string& pattern) {
int prefixLen = 0;
next[0] = 0;
for (int i = 1; i < pattern.size(); i++) {
while (prefixLen > 0 && pattern[i] != pattern[prefixLen]) {
prefixLen = next[prefixLen - 1];
}
if (pattern[i] == pattern[prefixLen]) {
prefixLen++;
}
next[i] = prefixLen;
}
}
int strStr(string haystack, string needle) {
if (needle.empty()) return 0;
int next[needle.size()];
buildNext(next, needle);
int patternIdx = 0;
for (int textIdx = 0; textIdx < haystack.size(); textIdx++) {
while (patternIdx > 0 && haystack[textIdx] != needle[patternIdx]) {
patternIdx = next[patternIdx - 1];
}
if (haystack[textIdx] == needle[patternIdx]) {
patternIdx++;
}
if (patternIdx == needle.size()) {
return textIdx - needle.size() + 1;
}
}
return -1;
}
};
Problem: Repeated Substring Pattern
Given a non-empty string, determine if it can be constructed by repeating a substring multiple times.
Approach 1: String Concatenation
If a string s consists of repeated substrings, then concatenating s + s and removing the first and last characters, the original string s should appear within the result.
class Solution {
public:
bool repeatedSubstringPattern(string s) {
string doubled = s + s;
doubled.erase(doubled.begin());
doubled.erase(doubled.end() - 1);
return doubled.find(s) != string::npos;
}
};
Approach 2: KMP-Based Solution
Using the prefix table, if the last value in the next array is non-zero and the string length is divisible by the difference between the string length and this value, the string has a repeating pattern.
class Solution {
public:
void buildNext(int next[], const string& str) {
int prefixLen = 0;
next[0] = 0;
for (int i = 1; i < str.size(); i++) {
while (prefixLen > 0 && str[i] != str[prefixLen]) {
prefixLen = next[prefixLen - 1];
}
if (str[i] == str[prefixLen]) {
prefixLen++;
}
next[i] = prefixLen;
}
}
bool repeatedSubstringPattern(string s) {
if (s.empty()) return false;
int next[s.size()];
buildNext(next, s);
int len = s.size();
int lastPrefixLen = next[len - 1];
if (lastPrefixLen != 0 && len % (len - lastPrefixLen) == 0) {
return true;
}
return false;
}
};
String Operations and Techniques
Strings can be viewed as character arrays. In C, strings terminate with a null character '\0', while C++ provides the string class with a size() method for length determination.
The string class offers more functionality than vector<char>, including operator overloading for concatenation.
Two-Pointer Technique
Two pointers are frequently used in array, string, and linked list problems. When filling arrays, pre-expanding the array and working backwards is often efficient. Note that calling erase() inside a loop results in O(n²) complexity since each erase() operation is O(n).
String Reversal Patterns
When processing strings in fixed-length segments, increment the loop index accordingly. For example, to process every 2k characters, use i += 2*k. Problems like reversing words in a string often require reversing the entire string first, then reversing each word individually.