Implementing a Full‑Text Search Engine Directly Inside Your Database
Creating a full‑text search system inside a database involves setting up a lightweight inverted index using metadata tables, word lookup tables, and row mapping structures. The core idea is to tokenize text, map tokens to indexed rows, and answer search queries through these prebuilt relationships, without relying on external search platforms.
Index Environment Setup
Before any document can be searched, the indexing subsystem needs a dedicated schema and collection of tables. An initialization routine prepares the environment:
```sql
CREATE SCHEMA IF NOT EXISTS FT;

CREATE TABLE FT.INDEX_REGISTRY (
    REGISTRY_ID INT AUTO_INCREMENT PRIMARY KEY,
    SCHEMA_NAME VARCHAR,
    TABLE_NAME  VARCHAR,
    COLUMN_LIST VARCHAR,
    UNIQUE(SCHEMA_NAME, TABLE_NAME)
);

CREATE TABLE FT.WORD_DICT (
    DICT_ID INT AUTO_INCREMENT PRIMARY KEY,
    TERM    VARCHAR,
    UNIQUE(TERM)
);

CREATE TABLE FT.ROW_ARCHIVE (
    ARCHIVE_ID  IDENTITY,
    HASH_CODE   INT,
    REGISTRY_ID INT,
    LOOKUP_KEY  VARCHAR,
    UNIQUE(HASH_CODE, REGISTRY_ID, LOOKUP_KEY)
);

CREATE TABLE FT.WORD_ROW_LINK (
    ARCHIVE_ID INT,
    DICT_ID    INT,
    PRIMARY KEY(DICT_ID, ARCHIVE_ID)
);

CREATE TABLE FT.STOP_WORDS (
    TERM VARCHAR
);
```
INDEX_REGISTRY tracks which table and columns have been indexed. WORD_DICT stores every unique token across all documents. ROW_ARCHIVE maps a hash, the index registry entry, and the original row’s key. WORD_ROW_LINK connects a dictionary term to every row archive entry that contains it. STOP_WORDS holds terms that should be skipped during tokenization.
Building a Full‑Text Index
A custom function FT_BUILD_INDEX accepts a schema, table, and comma‑separated column list. Internally it:
- Inserts a new entry into INDEX_REGISTRY.
- Scans every row of the target table and applies a tokenizer to each indexed column.
- Populates WORD_DICT, ROW_ARCHIVE, and WORD_ROW_LINK.
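The steps above can be sketched in Python against an in-memory SQLite database. This is an illustrative approximation, not the article's actual implementation: the toy schema, the `tokenize` helper, the CRC32 hash, and the hard-coded PRODUCT_ID key column are all assumptions.

```python
import re
import sqlite3
import zlib

# Toy environment: a source table plus the index tables (SQLite lacks
# schemas like FT, so the tables live unqualified here).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PRODUCT_CATALOG (PRODUCT_ID INTEGER PRIMARY KEY, TITLE TEXT, SUMMARY TEXT);
CREATE TABLE INDEX_REGISTRY (REGISTRY_ID INTEGER PRIMARY KEY, SCHEMA_NAME TEXT,
                             TABLE_NAME TEXT, COLUMN_LIST TEXT);
CREATE TABLE WORD_DICT (DICT_ID INTEGER PRIMARY KEY, TERM TEXT UNIQUE);
CREATE TABLE ROW_ARCHIVE (ARCHIVE_ID INTEGER PRIMARY KEY, HASH_CODE INTEGER,
                          REGISTRY_ID INTEGER, LOOKUP_KEY TEXT);
CREATE TABLE WORD_ROW_LINK (ARCHIVE_ID INTEGER, DICT_ID INTEGER,
                            PRIMARY KEY(DICT_ID, ARCHIVE_ID));
INSERT INTO PRODUCT_CATALOG VALUES (1, 'Wireless Headphones', 'Noise cancelling');
""")

def tokenize(text):
    """Lowercase and split on non-alphanumerics, deduplicating terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def ft_build_index(schema, table, columns):
    cols = [c.strip() for c in columns.split(",")]
    cur = conn.execute(
        "INSERT INTO INDEX_REGISTRY (SCHEMA_NAME, TABLE_NAME, COLUMN_LIST) VALUES (?,?,?)",
        (schema, table, columns))
    registry_id = cur.lastrowid
    # Scan every row; PRODUCT_ID as the lookup key is a simplification.
    rows = conn.execute(f"SELECT PRODUCT_ID, {', '.join(cols)} FROM {table}")
    for pk, *values in rows:
        text = " ".join(v or "" for v in values)
        archive = conn.execute(
            "INSERT INTO ROW_ARCHIVE (HASH_CODE, REGISTRY_ID, LOOKUP_KEY) VALUES (?,?,?)",
            (zlib.crc32(text.encode()), registry_id, str(pk)))
        for term in tokenize(text):
            conn.execute("INSERT OR IGNORE INTO WORD_DICT (TERM) VALUES (?)", (term,))
            dict_id = conn.execute("SELECT DICT_ID FROM WORD_DICT WHERE TERM=?",
                                   (term,)).fetchone()[0]
            conn.execute("INSERT INTO WORD_ROW_LINK VALUES (?,?)",
                         (archive.lastrowid, dict_id))

ft_build_index('ONLINE_STORE', 'PRODUCT_CATALOG', 'TITLE, SUMMARY')
print(sorted(t[0] for t in conn.execute("SELECT TERM FROM WORD_DICT")))
```

After the call, WORD_DICT holds one row per unique token and WORD_ROW_LINK ties each token back to the archived row.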
```sql
FT_BUILD_INDEX('ONLINE_STORE', 'PRODUCT_CATALOG', 'TITLE, SUMMARY');
```
After execution, the INDEX_REGISTRY might contain:
| REGISTRY_ID | SCHEMA_NAME | TABLE_NAME | COLUMN_LIST |
|---|---|---|---|
| 1 | ONLINE_STORE | PRODUCT_CATALOG | TITLE, SUMMARY |
Keeping the Index Up‑to‑Date
Triggers on the source table invoke a synchronization procedure whenever rows are inserted, updated, or deleted. The procedure recalculates the hash for the changed row, removes stale entries from WORD_ROW_LINK, and adds fresh mappings. This ensures the index never falls behind.
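The synchronization logic a trigger would invoke can be sketched at the application level. The `sync_row` helper below, the stripped-down schema, and the use of Python's `hash` as a stand-in hash code are assumptions for illustration:

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE WORD_DICT (DICT_ID INTEGER PRIMARY KEY, TERM TEXT UNIQUE);
CREATE TABLE ROW_ARCHIVE (ARCHIVE_ID INTEGER PRIMARY KEY, HASH_CODE INTEGER,
                          REGISTRY_ID INTEGER, LOOKUP_KEY TEXT);
CREATE TABLE WORD_ROW_LINK (ARCHIVE_ID INTEGER, DICT_ID INTEGER,
                            PRIMARY KEY(DICT_ID, ARCHIVE_ID));
""")

def sync_row(registry_id, lookup_key, new_text):
    """Refresh the index entries for one changed row; None means deleted."""
    # Remove stale mappings for this row, if it was indexed before.
    old = conn.execute(
        "SELECT ARCHIVE_ID FROM ROW_ARCHIVE WHERE REGISTRY_ID=? AND LOOKUP_KEY=?",
        (registry_id, lookup_key)).fetchone()
    if old:
        conn.execute("DELETE FROM WORD_ROW_LINK WHERE ARCHIVE_ID=?", (old[0],))
        conn.execute("DELETE FROM ROW_ARCHIVE WHERE ARCHIVE_ID=?", (old[0],))
    if new_text is None:   # row was deleted: nothing more to do
        return
    cur = conn.execute(
        "INSERT INTO ROW_ARCHIVE (HASH_CODE, REGISTRY_ID, LOOKUP_KEY) VALUES (?,?,?)",
        (hash(new_text) & 0x7FFFFFFF, registry_id, lookup_key))
    for term in set(re.findall(r"[a-z0-9]+", new_text.lower())):
        conn.execute("INSERT OR IGNORE INTO WORD_DICT (TERM) VALUES (?)", (term,))
        dict_id = conn.execute("SELECT DICT_ID FROM WORD_DICT WHERE TERM=?",
                               (term,)).fetchone()[0]
        conn.execute("INSERT INTO WORD_ROW_LINK VALUES (?,?)", (cur.lastrowid, dict_id))

sync_row(1, '42', 'leather organizer')        # insert: 2 term links
sync_row(1, '42', 'leather desk organizer')   # update replaces old mappings
print(conn.execute("SELECT COUNT(*) FROM WORD_ROW_LINK").fetchone()[0])
```

The second call first deletes the two links created by the first, so the link table never accumulates stale mappings for an updated row.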
Querying the Index
A search function FT_QUERY takes the user’s input, a limit, and an offset. The steps performed inside the database engine:
- Normalize and tokenize the input phrase into individual terms.
- Look up each term’s DICT_ID from WORD_DICT, ignoring terms found in STOP_WORDS.
- Collect ARCHIVE_ID values from WORD_ROW_LINK where the DICT_ID matches. Rankings can be determined by term frequency or proximity scoring.
- Join the resulting ARCHIVE_ID set with ROW_ARCHIVE and INDEX_REGISTRY to retrieve the original LOOKUP_KEY values (for example, primary key conditions).
- Use those conditions to fetch the final rows from the source table.
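These lookup steps can be sketched in Python over a pre-populated toy index. The seed data, the `ft_query` signature, and frequency-count ranking are assumptions; the article's function would build an equivalent SQL statement internally:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE WORD_DICT (DICT_ID INTEGER PRIMARY KEY, TERM TEXT UNIQUE);
CREATE TABLE STOP_WORDS (TERM TEXT);
CREATE TABLE ROW_ARCHIVE (ARCHIVE_ID INTEGER PRIMARY KEY, REGISTRY_ID INTEGER,
                          LOOKUP_KEY TEXT);
CREATE TABLE WORD_ROW_LINK (ARCHIVE_ID INTEGER, DICT_ID INTEGER);
INSERT INTO WORD_DICT VALUES (1,'wireless'),(2,'headphones'),(3,'the');
INSERT INTO STOP_WORDS VALUES ('the');
-- archive 10 (row '7') matches both terms; archive 11 (row '8') only one
INSERT INTO ROW_ARCHIVE VALUES (10, 1, '7'), (11, 1, '8');
INSERT INTO WORD_ROW_LINK VALUES (10,1),(10,2),(11,2);
""")

def ft_query(phrase, limit, offset):
    terms = phrase.lower().split()
    stop = {t for (t,) in conn.execute("SELECT TERM FROM STOP_WORDS")}
    terms = [t for t in terms if t not in stop]
    if not terms:
        return []
    placeholders = ",".join("?" * len(terms))
    # Rank each archived row by how many of the query terms it contains.
    rows = conn.execute(f"""
        SELECT ra.LOOKUP_KEY, COUNT(*) AS score
        FROM WORD_DICT wd
        JOIN WORD_ROW_LINK wrl ON wrl.DICT_ID = wd.DICT_ID
        JOIN ROW_ARCHIVE ra ON ra.ARCHIVE_ID = wrl.ARCHIVE_ID
        WHERE wd.TERM IN ({placeholders})
        GROUP BY ra.LOOKUP_KEY
        ORDER BY score DESC
        LIMIT ? OFFSET ?""", (*terms, limit, offset))
    return [key for key, _ in rows]

print(ft_query('the wireless headphones', 15, 0))
```

The returned keys are the LOOKUP_KEY values, ordered so that rows matching more query terms rank first; a final step would use them to fetch the source rows.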
```sql
FT_QUERY('wireless noise cancelling headphones', 15, 0);
```
The returned result set contains up to 15 matching product rows. Internally the function constructs a query that combines WORD_ROW_LINK and ROW_ARCHIVE, resolving something like:
```sql
SELECT pc.*
FROM ONLINE_STORE.PRODUCT_CATALOG pc
JOIN FT.ROW_ARCHIVE ra ON CAST(pc.PRODUCT_ID AS VARCHAR) = ra.LOOKUP_KEY
JOIN FT.WORD_ROW_LINK wrl ON ra.ARCHIVE_ID = wrl.ARCHIVE_ID
JOIN FT.WORD_DICT wd ON wd.DICT_ID = wrl.DICT_ID
WHERE wd.TERM IN ('wireless', 'noise', 'cancelling', 'headphones')
  AND ra.REGISTRY_ID = 1
GROUP BY pc.PRODUCT_ID
ORDER BY COUNT(*) DESC
LIMIT 15 OFFSET 0;
```
Handling Large Volumes of Text
To avoid oversized indexes, the tokenizer can apply additional rules: lowercasing all text, stripping HTML or markup, stemming words, and limiting the length of stored terms. The STOP_WORDS table can be prepopulated with common language-specific noise words. The combination of these techniques keeps the index compact and the search fast.
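A tokenizer applying these rules might look like the sketch below. The regex-based markup stripping, the crude suffix stemmer, and the 64-character term cap are all assumptions standing in for whatever rules a real deployment would choose:

```python
import re

# Hypothetical configuration: a seed stop-word set and a term-length cap.
STOP_WORDS = {"the", "a", "an", "and", "or", "of"}
MAX_TERM_LEN = 64

def tokenize(text):
    text = re.sub(r"<[^>]+>", " ", text)            # strip HTML/markup tags
    words = re.findall(r"[a-z0-9]+", text.lower())  # lowercase, split on non-alphanumerics
    terms = set()
    for w in words:
        if w in STOP_WORDS:
            continue
        # Very crude plural stemming: drop a trailing 's' (but not 'ss').
        if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
            w = w[:-1]
        terms.add(w[:MAX_TERM_LEN])                 # cap stored term length
    return terms

print(sorted(tokenize("<p>The Wireless Headphones and organizers</p>")))
```

A production stemmer (e.g. Porter) would be far more careful, but even this cheap normalization collapses plural and singular forms into one dictionary entry.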
Concrete Usage Scenario
Consider an e‑commerce catalog with the PRODUCT_CATALOG table. A user searches for "leather organizer". The database splits the phrase, finds the term IDs for "leather" and "organizer", fetches the intersecting ARCHIVE_ID values, retrieves the primary keys, and returns the matching product rows. All heavy lifting—tokenization, mapping, joining—happens inside the database engine through the custom tables and functions.
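The intersection step in this scenario can be expressed directly in SQL with a HAVING clause: a row qualifies only if it is linked to every query term. The toy data below is an assumption for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE WORD_DICT (DICT_ID INTEGER PRIMARY KEY, TERM TEXT UNIQUE);
CREATE TABLE WORD_ROW_LINK (ARCHIVE_ID INTEGER, DICT_ID INTEGER);
INSERT INTO WORD_DICT VALUES (1,'leather'),(2,'organizer'),(3,'wallet');
-- archive 100 contains both query terms, archive 101 only 'leather'
INSERT INTO WORD_ROW_LINK VALUES (100,1),(100,2),(101,1),(101,3);
""")

terms = ['leather', 'organizer']
rows = conn.execute("""
    SELECT wrl.ARCHIVE_ID
    FROM WORD_ROW_LINK wrl
    JOIN WORD_DICT wd ON wd.DICT_ID = wrl.DICT_ID
    WHERE wd.TERM IN (?, ?)
    GROUP BY wrl.ARCHIVE_ID
    HAVING COUNT(DISTINCT wd.DICT_ID) = 2   -- must match ALL query terms
""", terms).fetchall()
print([r[0] for r in rows])
```

Only archive 100 survives the HAVING filter, which is exactly the AND semantics the "leather organizer" search needs before the final join back to PRODUCT_CATALOG.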