Fading Coder

One Final Commit for the Last Sprint

Implementing a Full‑Text Search Engine Directly Inside Your Database

Creating a full‑text search system inside a database involves setting up a lightweight inverted index from metadata tables, word lookup tables, and row mapping structures. The core idea is to tokenize text, map tokens to indexed rows, and answer queries through the prebuilt relationships, without relying on an external search platform.

Index Environment Setup

Before any document can be searched, the indexing subsystem needs a dedicated schema and collection of tables. An initialization routine prepares the environment:

CREATE SCHEMA IF NOT EXISTS FT;

CREATE TABLE FT.INDEX_REGISTRY (
    REGISTRY_ID INT AUTO_INCREMENT PRIMARY KEY,
    SCHEMA_NAME VARCHAR,
    TABLE_NAME VARCHAR,
    COLUMN_LIST VARCHAR,
    UNIQUE(SCHEMA_NAME, TABLE_NAME)
);

CREATE TABLE FT.WORD_DICT (
    DICT_ID INT AUTO_INCREMENT PRIMARY KEY,
    TERM VARCHAR,
    UNIQUE(TERM)
);

CREATE TABLE FT.ROW_ARCHIVE (
    ARCHIVE_ID IDENTITY,
    HASH_CODE INT,
    REGISTRY_ID INT,
    LOOKUP_KEY VARCHAR,
    UNIQUE(HASH_CODE, REGISTRY_ID, LOOKUP_KEY)
);

CREATE TABLE FT.WORD_ROW_LINK (
    ARCHIVE_ID INT,
    DICT_ID INT,
    PRIMARY KEY(DICT_ID, ARCHIVE_ID)
);

CREATE TABLE FT.STOP_WORDS (
    TERM VARCHAR
);

INDEX_REGISTRY tracks which table and columns have been indexed. WORD_DICT stores every unique token across all documents. ROW_ARCHIVE maps a hash, the index registry entry, and the original row’s key. WORD_ROW_LINK connects a dictionary term to every row archive entry that contains it. STOP_WORDS holds terms that should be skipped during tokenization.
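To make the table relationships concrete, here is a sketch of the rows the indexer might produce for a single catalog entry whose TITLE is 'Wireless Headphones'. All IDs, the hash value, and the LOOKUP_KEY format are illustrative assumptions, not the output of a real run:

```sql
-- Two tokens enter the dictionary (illustrative IDs):
INSERT INTO FT.WORD_DICT (DICT_ID, TERM) VALUES (1, 'wireless'), (2, 'headphones');

-- One archive entry ties the hash, the registry entry, and the row's key:
INSERT INTO FT.ROW_ARCHIVE (ARCHIVE_ID, HASH_CODE, REGISTRY_ID, LOOKUP_KEY)
VALUES (1, 884213, 1, 'PRODUCT_ID=42');

-- Each token is linked to the archive entry whose row contains it:
INSERT INTO FT.WORD_ROW_LINK (DICT_ID, ARCHIVE_ID) VALUES (1, 1), (2, 1);
```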

Building a Full‑Text Index

A custom function FT_BUILD_INDEX accepts a schema, table, and comma‑separated column list. Internally it:

  1. Inserts a new entry into INDEX_REGISTRY.
  2. Scans every row of the target table and applies a tokenizer to each indexed column.
  3. Populates WORD_DICT, ROW_ARCHIVE, and WORD_ROW_LINK.

FT_BUILD_INDEX('ONLINE_STORE', 'PRODUCT_CATALOG', 'TITLE, SUMMARY');

After execution, the INDEX_REGISTRY might contain:

REGISTRY_ID | SCHEMA_NAME  | TABLE_NAME      | COLUMN_LIST
----------- | ------------ | --------------- | --------------
1           | ONLINE_STORE | PRODUCT_CATALOG | TITLE, SUMMARY

Keeping the Index Up‑to‑Date

Triggers on the source table invoke a synchronization procedure whenever rows are inserted, updated, or deleted. The procedure recalculates the hash for the changed row, removes stale entries from WORD_ROW_LINK, and adds fresh mappings. This ensures the index never falls behind.
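The synchronization hook can be sketched as a row‑level trigger. Exact trigger syntax varies between engines, and FT_SYNC_ROW here is a hypothetical procedure standing in for the recalculation logic described above:

```sql
-- Hypothetical: re-index the affected PRODUCT_CATALOG row on every change.
CREATE TRIGGER TRG_PRODUCT_CATALOG_FT_SYNC
AFTER INSERT, UPDATE, DELETE ON ONLINE_STORE.PRODUCT_CATALOG
FOR EACH ROW
CALL FT_SYNC_ROW('ONLINE_STORE', 'PRODUCT_CATALOG');
```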

Querying the Index

A search function FT_QUERY takes the user’s input, a limit, and an offset. The steps performed inside the database engine:

  1. Normalize and tokenize the input phrase into individual terms.
  2. Look up each term’s DICT_ID from WORD_DICT, ignoring terms found in STOP_WORDS.
  3. Collect ARCHIVE_ID values from WORD_ROW_LINK where the DICT_ID matches. Rankings can be determined by term frequency or proximity scoring.
  4. Join the resulting ARCHIVE_ID set with ROW_ARCHIVE and INDEX_REGISTRY to retrieve the original LOOKUP_KEY values (for example, primary key conditions).
  5. Use those conditions to fetch the final rows from the source table.

FT_QUERY('wireless noise cancelling headphones', 15, 0);

The returned result set contains up to 15 matching product rows. Internally the function constructs a query that combines WORD_ROW_LINK and ROW_ARCHIVE, resolving something like:

SELECT pc.*
FROM ONLINE_STORE.PRODUCT_CATALOG pc
JOIN FT.ROW_ARCHIVE ra ON pc.PRODUCT_ID = ra.LOOKUP_KEY
JOIN FT.WORD_ROW_LINK wrl ON ra.ARCHIVE_ID = wrl.ARCHIVE_ID
JOIN FT.WORD_DICT wd ON wd.DICT_ID = wrl.DICT_ID
WHERE wd.TERM IN ('wireless', 'noise', 'cancelling', 'headphones')
  AND ra.REGISTRY_ID = 1
GROUP BY pc.PRODUCT_ID
ORDER BY COUNT(*) DESC
LIMIT 15 OFFSET 0;
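As written, the query uses OR semantics: any row containing at least one search term qualifies, and rows are ranked by how many terms they contain. If every term must be present (AND semantics), a HAVING clause can discard partial matches; a possible variant:

```sql
SELECT pc.*
FROM ONLINE_STORE.PRODUCT_CATALOG pc
JOIN FT.ROW_ARCHIVE ra ON pc.PRODUCT_ID = ra.LOOKUP_KEY
JOIN FT.WORD_ROW_LINK wrl ON ra.ARCHIVE_ID = wrl.ARCHIVE_ID
JOIN FT.WORD_DICT wd ON wd.DICT_ID = wrl.DICT_ID
WHERE wd.TERM IN ('wireless', 'noise', 'cancelling', 'headphones')
  AND ra.REGISTRY_ID = 1
GROUP BY pc.PRODUCT_ID
HAVING COUNT(DISTINCT wd.DICT_ID) = 4  -- all four terms must match
LIMIT 15 OFFSET 0;
```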

Handling Large Volumes of Text

To avoid oversized indexes, the tokenizer can apply additional rules: lowercasing all text, stripping HTML or markup, stemming words, and limiting the length of stored terms. The STOP_WORDS table can be prepopulated with common language-specific noise words. The combination of these techniques keeps the index compact and the search fast.
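For instance, the STOP_WORDS table could be seeded with a small English noise-word list before indexing; a real deployment would typically load a larger, language-specific list:

```sql
INSERT INTO FT.STOP_WORDS (TERM) VALUES
    ('a'), ('an'), ('and'), ('are'), ('as'), ('at'), ('be'), ('by'),
    ('for'), ('from'), ('in'), ('is'), ('it'), ('of'), ('on'), ('or'),
    ('the'), ('to'), ('with');
```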

Concrete Usage Scenario

Consider an e‑commerce catalog with the PRODUCT_CATALOG table. A user searches for "leather organizer". The database splits the phrase, finds the term IDs for "leather" and "organizer", fetches the intersecting ARCHIVE_ID values, retrieves the primary keys, and returns the matching product rows. All heavy lifting—tokenization, mapping, joining—happens inside the database engine through the custom tables and functions.
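The intersection step can also be expressed directly in SQL with INTERSECT, using the same FT tables (a sketch of the internal lookup only, not the full FT_QUERY logic):

```sql
-- ARCHIVE_IDs whose rows contain both 'leather' and 'organizer'
SELECT wrl.ARCHIVE_ID
FROM FT.WORD_ROW_LINK wrl
JOIN FT.WORD_DICT wd ON wd.DICT_ID = wrl.DICT_ID
WHERE wd.TERM = 'leather'
INTERSECT
SELECT wrl.ARCHIVE_ID
FROM FT.WORD_ROW_LINK wrl
JOIN FT.WORD_DICT wd ON wd.DICT_ID = wrl.DICT_ID
WHERE wd.TERM = 'organizer';
```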

