String Algorithm Fundamentals: Border Theory, Z-Function, and Palindrome Automata
Prefix Function (KMP)
The prefix function (also called failure function or pi function) measures the longest proper prefix that is also a suffix for each position. Here's an iterative computation:
int pi[1000005];
void computePrefix(char* s, int n) {
for (int i = 2; i <= n; i++) {
int j = pi[i-1];
while (j > 0 && s[i] != s[j+1])
j = pi[j];
if (s[i] == s[j+1]) j++;
pi[i] = j;
}
}
int main() {
char str[1000005];
scanf("%s", str + 1);
int n = strlen(str + 1);
computePrefix(str, n);
for (int i = 1; i <= n; i++)
printf("%d ", pi[i]);
return 0;
}
The array pi[i] represents the length of the longest border for prefix ending at position i.
Z-Function
The Z-function computes the longest common prefix (LCP) between the entire string and each of its suffixes. For a string S of length n, Z[i] equals the length of the longest subtsring starting at position i that matches a prefix of S.
Algorithm: Maintain a [L, R] interval representing the rightmost matching segment discovered so far. For each position i:
- If
i > R, start fresh with zero matching and extend manually - Otherwise, we can utilize information from the matching segment: initialize
Z[i] = min(Z[i-L], R-i+1), then extend
This guarantees R never decreases, yielding O(n) time complexity.
int Z[1000005];
void zFunction(char* s, int n) {
Z[1] = n;
int left = 0, right = 0;
for (int i = 2; i <= n; i++) {
if (i <= right)
Z[i] = min(Z[i-left+1], right - i + 1);
while (i + Z[i] <= n && s[i + Z[i]] == s[Z[i] + 1])
Z[i]++;
if (i + Z[i] - 1 > right) {
left = i;
right = i + Z[i] - 1;
}
}
}
Extended Z-Function: Given two strings S and T, compute Z[T] and then use it to find the LCP between S and every suffix of S. This is useful for pattern matching scenarios where we need to match a pattern against multiple text positions.
int extZ[1000005];
void extendedZ(char* pattern, int n, char* text, int m) {
zFunction(text, m);
int left = 0, right = 0;
for (int i = 1; i <= n; i++) {
if (i <= right)
extZ[i] = min(Z[i-left+1], right - i + 1);
while (i + extZ[i] <= n && pattern[i + extZ[i]] == text[extZ[i] + 1])
extZ[i]++;
if (i + extZ[i] - 1 > right) {
left = i;
right = i + extZ[i] - 1;
}
}
}
Border Tree Applications
A border is a nonempty string that is both a proper prefix and a suffix. The failure links form a tree structure where each node's ancestors represent all borders of that prefix.
Number of Short Borders: For each position i, the count of borders with length ≤ i/2 equals the number of ancestors in the failure tree whose depth satisfies this constraint. Using DFS with stack tracking achieves O(n) time.
Longest Common Border: For two prefix positions p and q, their longest common border is simply the LCA in the failure tree. Special care must be taken to exclude the trivial case where one is an ancestor of the other.
Covering Problem: Find the shortest border whose occurrences cover the entire string. For each candidate border length i, check whether all occurrence endpoints (sorted) have gaps not exceeding i. Implementation uses bidirectional linked lists for O(n) updates.
Periodicity Theory
Weak Periodicity Lemma: If p and q are periods of a string S with p + q ≤ |S|, then gcd(p, q) is also a period.
Proof: Let d = p - q (assuming p > q). For positions greater than q, we have s[i] = s[i-q] = s[i-q+p] = s[i+d]. For positions less than p, we have s[i] = s[i+p] = s[i+p-q] = s[i+d]. Hence d is a period, and repeated subtraction yields the gcd.
Periodicity Lemma: If p and q are periods and p + q - gcd(p, q) ≤ |S|, then gcd(p, q) is a period.
Divisor Period Theorem: If string T has period a, and substring S = T[:i] has period b where b | a and |S| ≥ a, then b is also a period of T.
Border Progression: All borders with length ≥ |S|/2 form an arithmetic progression. This follows from the periodicity lemmas - the minimal period divides all other periods.
Border Count Bound: The number of distinct border lengths is at most ⌈log₂|S|⌉. This results from the halving argument when considering borders ≥ |S|/2.
Arithmetic Matching: If 2|S| ≥ |T|, then all occurrences of S in T form an arithmetic progression. This is proven by showing the gaps between consecutive matches must be periods of S.
Palindrome Automaton (PAM)
The palindrome automaton is a suffix-tree-like structure for palindromes. Each node represents a unique palindromic substring.
Structure:
- Each node stores: length
len, failure linkfail, and transitionsnext[alphabet] - Two roots: odd root with
len = -1and even root withlen = 0 - Failure link points to the longest palindromic proper suffix
- The chain of failure links from any node gives all palindromic suffixes
Construction: Process the string incrementally. For each new character, traverse failure links to find the longest new palindromic suffix by checking if the character matches to the left of each candidate.
struct PalindromeAutomaton {
static const int ALPHA = 26;
static const int MAXN = 500005;
int next[MAXN][ALPHA];
int fail[MAXN];
int len[MAXN];
int last;
int sz;
char s[MAXN];
void init() {
fail[0] = fail[1] = 1;
len[1] = -1;
sz = 1;
last = 0;
}
int getFail(int v, int pos) {
while (pos - len[v] - 1 < 0 || s[pos - len[v] - 1] != s[pos])
v = fail[v];
return v;
}
void insert(char c, int pos) {
s[pos] = c;
int cur = getFail(last, pos);
int idx = c - 'a';
if (!next[cur][idx]) {
len[++sz] = len[cur] + 2;
int temp = getFail(fail[cur], pos);
fail[sz] = next[temp][idx];
next[cur][idx] = sz;
}
last = next[cur][idx];
}
};
Complexity: Adding one character creates at most one new node (the longest new palindromic suffix). Failure links strictly decrease in length, and each node is traversed at most twice during construction. Total complexity is O(n) time and O(n·|Σ|) space.
Applications:
- Counting distinct palindromic substrings
- Counting palindrmoes ending at each position
- Finding the longest palindromic substring at each position
Higher-Order Palindromes
A palindrome has order k if removing outer characters k-1 times still yields a palindrome. Let f[i][j] represent the order of substring s[i:j].
Transition: If s[i] ≠ s[j] or the inner substring is not a palindrome, then f[i][j] = 0. Otherwise, f[i][j] = f[i + (j-i+1)/2 - 1][j] + 1.
Efficient Computation: Using the PAM, maintain for each node v its "half" - the longest palindromic suffix with length ≤ len[v]/2. When creating a new node via transition p → x → np, we find half(p) and traverse ancestors until we can form another palindrome suffix, yielding half(np). The order is:
f(v) = f(half(v)) + 1 if len(half(v)) = floor(len(v)/2), otherwise 1.
Palindrome Periodicity
If a palindrome S has period p, its cyclic period can be split into two palindromes of lengths p - (|S| mod p) and |S| mod p. This follows from the palindrome property.
Let v be the longest proper palindromic prefix (or suffix) of palindrome u. If 2|v| ≥ |u|, then v appears exactly twice - as prefix and suffix. This constrains where long proper palindromes can occur within a larger palindrome.
Minimum Suffix
Consider suffixes sorted by lexicographic order. The significant suffixes are those prefixes of some minimal suffix when extended by any character. Key properties:
- For any two significant suffixes
UandVwhere|U| < |V|,Umust be a prefix ofV - Doubling Theorem: If
UandVare significant suffixes with|U| < |V|, then2|U| < |V|
This guarantees only O(log n) significant suffixes exist for any string.