Home > Tech > Content

Understanding the KMP Algorithm for Efficient String Pattern Matching

Tech May 12 2

Fundamentals of the KMP Algorithm

The Knuth-Morris-Pratt (KMP) algorithm optimizes string matching by avoiding unnecessary re-comparisons. When a mismatch occurs, the algorithm leverages previously matched characters to determine the next starting position, rather than restarting from the beginning.

Prefix Table (Next Array)

The core of KMP lies in constructing a prefix table, also known as the failure function or next array. This table stores the length of the longest proper prefix that is also a suffix for each position in the pattern string.

Definitions:

Prefix: A substring that starts at the first character but excludes the last character.
Suffix: A substring that ends at the last character but excludes the first character.

For example, consider the pattern aabaaf:

a: longest equal prefix-suffix length = 0
aa: longest equal prefix-suffix length = 1 (prefix 'a', suffix 'a')
aab: longest equal prefix-suffix length = 0
aaba: longest equal prefix-suffix length = 1
aabaa: longest equal prefix-suffix length = 2 (prefix 'aa', suffix 'aa')
aabaaf: longest equal prefix-suffix length = 0

The resulting prefix table is: [0, 1, 0, 1, 2, 0]

Constructing the Next Array

The construction uses two pointers: prefixLen tracks the length of the current matching prefix, while i iterates through the string.

void buildNext(int next[], const string& pattern) {
    int prefixLen = 0;
    next[0] = 0;
    
    for (int i = 1; i < pattern.size(); i++) {
        // Backtrack when characters don't match
        while (prefixLen > 0 && pattern[i] != pattern[prefixLen]) {
            prefixLen = next[prefixLen - 1];
        }
        // Extend prefix length when characters match
        if (pattern[i] == pattern[prefixLen]) {
            prefixLen++;
        }
        next[i] = prefixLen;
    }
}

Problem: Find First Occurrence of Substring

Given a text string haystack and a pattern string needle, return the index of the first occurrence of needle in haystack, or -1 if not found.

Example:

Input: haystack = "sadbutsad", needle = "sad"
Output: 0

Solution Using KMP

class Solution {
public:
    void buildNext(int next[], const string& pattern) {
        int prefixLen = 0;
        next[0] = 0;
        
        for (int i = 1; i < pattern.size(); i++) {
            while (prefixLen > 0 && pattern[i] != pattern[prefixLen]) {
                prefixLen = next[prefixLen - 1];
            }
            if (pattern[i] == pattern[prefixLen]) {
                prefixLen++;
            }
            next[i] = prefixLen;
        }
    }
    
    int strStr(string haystack, string needle) {
        if (needle.empty()) return 0;
        
        int next[needle.size()];
        buildNext(next, needle);
        
        int patternIdx = 0;
        for (int textIdx = 0; textIdx < haystack.size(); textIdx++) {
            while (patternIdx > 0 && haystack[textIdx] != needle[patternIdx]) {
                patternIdx = next[patternIdx - 1];
            }
            if (haystack[textIdx] == needle[patternIdx]) {
                patternIdx++;
            }
            if (patternIdx == needle.size()) {
                return textIdx - needle.size() + 1;
            }
        }
        return -1;
    }
};

Problem: Repeated Substring Pattern

Given a non-empty string, determine if it can be constructed by repeating a substring multiple times.

Approach 1: String Concatenation

If a string s consists of repeated substrings, then concatenating s + s and removing the first and last characters, the original string s should appear within the result.

class Solution {
public:
    bool repeatedSubstringPattern(string s) {
        string doubled = s + s;
        doubled.erase(doubled.begin());
        doubled.erase(doubled.end() - 1);
        return doubled.find(s) != string::npos;
    }
};

Approach 2: KMP-Based Solution

Using the prefix table, if the last value in the next array is non-zero and the string length is divisible by the difference between the string length and this value, the string has a repeating pattern.

class Solution {
public:
    void buildNext(int next[], const string& str) {
        int prefixLen = 0;
        next[0] = 0;
        
        for (int i = 1; i < str.size(); i++) {
            while (prefixLen > 0 && str[i] != str[prefixLen]) {
                prefixLen = next[prefixLen - 1];
            }
            if (str[i] == str[prefixLen]) {
                prefixLen++;
            }
            next[i] = prefixLen;
        }
    }
    
    bool repeatedSubstringPattern(string s) {
        if (s.empty()) return false;
        
        int next[s.size()];
        buildNext(next, s);
        
        int len = s.size();
        int lastPrefixLen = next[len - 1];
        
        if (lastPrefixLen != 0 && len % (len - lastPrefixLen) == 0) {
            return true;
        }
        return false;
    }
};

String Operations and Techniques

Strings can be viewed as character arrays. In C, strings terminate with a null character '\0', while C++ provides the string class with a size() method for length determination.

The string class offers more functionality than vector<char>, including operator overloading for concatenation.

Two-Pointer Technique

Two pointers are frequently used in array, string, and linked list problems. When filling arrays, pre-expanding the array and working backwards is often efficient. Note that calling erase() inside a loop results in O(n²) complexity since each erase() operation is O(n).

String Reversal Patterns

When processing strings in fixed-length segments, increment the loop index accordingly. For example, to process every 2k characters, use i += 2*k. Problems like reversing words in a string often require reversing the entire string first, then reversing each word individually.

Tags: KMP Algorithm String Matching Prefix Table

Back to List

Prev: Interacting with MySQL Databases Using Python

Next: Integrating and Wrapping Monaco Editor in Vue 2

Fading Coder

Understanding the KMP Algorithm for Efficient String Pattern Matching

Fundamentals of the KMP Algorithm

Prefix Table (Next Array)

Constructing the Next Array

Problem: Find First Occurrence of Substring

Solution Using KMP

Problem: Repeated Substring Pattern

Approach 1: String Concatenation

Approach 2: KMP-Based Solution

String Operations and Techniques

Two-Pointer Technique

String Reversal Patterns

Related Articles

Understanding Strong and Weak References in Java

Comprehensive Guide to SSTI Explained with Payload Bypass Techniques

Implement Image Upload Functionality for Django Integrated TinyMCE Editor

Leave a Comment

Copyright © fadingcoder.top

Fading Coder

Understanding the KMP Algorithm for Efficient String Pattern Matching

Fundamentals of the KMP Algorithm

Prefix Table (Next Array)

Constructing the Next Array

Problem: Find First Occurrence of Substring

Solution Using KMP

Problem: Repeated Substring Pattern

Approach 1: String Concatenation

Approach 2: KMP-Based Solution

String Operations and Techniques

Two-Pointer Technique

String Reversal Patterns

Related Articles

Understanding Strong and Weak References in Java

Comprehensive Guide to SSTI Explained with Payload Bypass Techniques

Implement Image Upload Functionality for Django Integrated TinyMCE Editor

Leave a CommentCancel Reply

Copyright © fadingcoder.top

Leave a Comment