Fading Coder

One Final Commit for the Last Sprint

Home > Tech > Content

Understanding the KMP Algorithm for Efficient String Pattern Matching

Tech May 12 2

Fundamentals of the KMP Algorithm

The Knuth-Morris-Pratt (KMP) algorithm optimizes string matching by avoiding unnecessary re-comparisons. When a mismatch occurs, the algorithm leverages previously matched characters to determine the next starting position, rather than restarting from the beginning.

Prefix Table (Next Array)

The core of KMP lies in constructing a prefix table, also known as the failure function or next array. This table stores the length of the longest proper prefix that is also a suffix for each position in the pattern string.

Definitions:

  • Prefix: A substring that starts at the first character but excludes the last character.
  • Suffix: A substring that ends at the last character but excludes the first character.

For example, consider the pattern aabaaf:

  • a: longest equal prefix-suffix length = 0
  • aa: longest equal prefix-suffix length = 1 (prefix 'a', suffix 'a')
  • aab: longest equal prefix-suffix length = 0
  • aaba: longest equal prefix-suffix length = 1
  • aabaa: longest equal prefix-suffix length = 2 (prefix 'aa', suffix 'aa')
  • aabaaf: longest equal prefix-suffix length = 0

The resulting prefix table is: [0, 1, 0, 1, 2, 0]

Constructing the Next Array

The construction uses two pointers: prefixLen tracks the length of the current matching prefix, while i iterates through the string.

void buildNext(int next[], const string& pattern) {
    int prefixLen = 0;
    next[0] = 0;
    
    for (int i = 1; i < pattern.size(); i++) {
        // Backtrack when characters don't match
        while (prefixLen > 0 && pattern[i] != pattern[prefixLen]) {
            prefixLen = next[prefixLen - 1];
        }
        // Extend prefix length when characters match
        if (pattern[i] == pattern[prefixLen]) {
            prefixLen++;
        }
        next[i] = prefixLen;
    }
}

Problem: Find First Occurrence of Substring

Given a text string haystack and a pattern string needle, return the index of the first occurrence of needle in haystack, or -1 if not found.

Example:

  • Input: haystack = "sadbutsad", needle = "sad"
  • Output: 0

Solution Using KMP

class Solution {
public:
    void buildNext(int next[], const string& pattern) {
        int prefixLen = 0;
        next[0] = 0;
        
        for (int i = 1; i < pattern.size(); i++) {
            while (prefixLen > 0 && pattern[i] != pattern[prefixLen]) {
                prefixLen = next[prefixLen - 1];
            }
            if (pattern[i] == pattern[prefixLen]) {
                prefixLen++;
            }
            next[i] = prefixLen;
        }
    }
    
    int strStr(string haystack, string needle) {
        if (needle.empty()) return 0;
        
        int next[needle.size()];
        buildNext(next, needle);
        
        int patternIdx = 0;
        for (int textIdx = 0; textIdx < haystack.size(); textIdx++) {
            while (patternIdx > 0 && haystack[textIdx] != needle[patternIdx]) {
                patternIdx = next[patternIdx - 1];
            }
            if (haystack[textIdx] == needle[patternIdx]) {
                patternIdx++;
            }
            if (patternIdx == needle.size()) {
                return textIdx - needle.size() + 1;
            }
        }
        return -1;
    }
};

Problem: Repeated Substring Pattern

Given a non-empty string, determine if it can be constructed by repeating a substring multiple times.

Approach 1: String Concatenation

If a string s consists of repeated substrings, then concatenating s + s and removing the first and last characters, the original string s should appear within the result.

class Solution {
public:
    bool repeatedSubstringPattern(string s) {
        string doubled = s + s;
        doubled.erase(doubled.begin());
        doubled.erase(doubled.end() - 1);
        return doubled.find(s) != string::npos;
    }
};

Approach 2: KMP-Based Solution

Using the prefix table, if the last value in the next array is non-zero and the string length is divisible by the difference between the string length and this value, the string has a repeating pattern.

class Solution {
public:
    void buildNext(int next[], const string& str) {
        int prefixLen = 0;
        next[0] = 0;
        
        for (int i = 1; i < str.size(); i++) {
            while (prefixLen > 0 && str[i] != str[prefixLen]) {
                prefixLen = next[prefixLen - 1];
            }
            if (str[i] == str[prefixLen]) {
                prefixLen++;
            }
            next[i] = prefixLen;
        }
    }
    
    bool repeatedSubstringPattern(string s) {
        if (s.empty()) return false;
        
        int next[s.size()];
        buildNext(next, s);
        
        int len = s.size();
        int lastPrefixLen = next[len - 1];
        
        if (lastPrefixLen != 0 && len % (len - lastPrefixLen) == 0) {
            return true;
        }
        return false;
    }
};

String Operations and Techniques

Strings can be viewed as character arrays. In C, strings terminate with a null character '\0', while C++ provides the string class with a size() method for length determination.

The string class offers more functionality than vector<char>, including operator overloading for concatenation.

Two-Pointer Technique

Two pointers are frequently used in array, string, and linked list problems. When filling arrays, pre-expanding the array and working backwards is often efficient. Note that calling erase() inside a loop results in O(n²) complexity since each erase() operation is O(n).

String Reversal Patterns

When processing strings in fixed-length segments, increment the loop index accordingly. For example, to process every 2k characters, use i += 2*k. Problems like reversing words in a string often require reversing the entire string first, then reversing each word individually.

Related Articles

Understanding Strong and Weak References in Java

Strong References Strong reference are the most prevalent type of object referencing in Java. When an object has a strong reference pointing to it, the garbage collector will not reclaim its memory. F...

Comprehensive Guide to SSTI Explained with Payload Bypass Techniques

Introduction Server-Side Template Injection (SSTI) is a vulnerability in web applications where user input is improper handled within the template engine and executed on the server. This exploit can r...

Implement Image Upload Functionality for Django Integrated TinyMCE Editor

Django’s Admin panel is highly user-friendly, and pairing it with TinyMCE, an effective rich text editor, simplifies content management significantly. Combining the two is particular useful for bloggi...

Leave a Comment

Anonymous

◎Feel free to join the discussion and share your thoughts.