Fading Coder

One Final Commit for the Last Sprint

Automated Oracle Bone Script Analysis: Noise Reduction, Segmentation, and Character Recognition

Let the original rubbing image be represented as $I_{raw}$. The preprocessing stage applies a transformation $T_{prep}$ to yield a cleaned image $I_{clean} = T_{prep}(I_{raw})$. Subsequently, a descriptor extractor $T_{feat}$ maps the cleaned image to a feature vector $\mathbf{v} = T_{feat}(I_{clean})$. A classification model $C$ then determines the presence of noise: $y = C(\mathbf{v})$, where $y=0$ indicates the presence of interference and $y=1$ signifies a clean region.
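This three-stage composition can be sketched with placeholder transforms; the preprocessing, descriptor, and classifier below are illustrative stand-ins (the thresholding rule in `C` is an assumption, not the paper's classifier):

```python
import numpy as np

def T_prep(I_raw):
    # illustrative preprocessing: normalize intensities into [0, 1]
    I = I_raw.astype(np.float64)
    return (I - I.min()) / (I.max() - I.min() + 1e-12)

def T_feat(I_clean):
    # illustrative descriptor: mean intensity, variance, and gradient (edge) energy
    gy, gx = np.gradient(I_clean)
    return np.array([I_clean.mean(), I_clean.var(), np.mean(gx**2 + gy**2)])

def C(v, threshold=0.5):
    # illustrative classifier stub: y = 1 (clean) when mean intensity is high
    return 1 if v[0] > threshold else 0

I_raw = np.random.default_rng(0).integers(0, 256, size=(64, 64))
y = C(T_feat(T_prep(I_raw)))  # y = C(T_feat(T_prep(I_raw))) as in the formulation above
```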

Interference in ancient script rubbings primarily manifests as point noise, artificial textures, and inherent surface textures. Point noise is effectively suppressed using adaptive median filtering, which dynamically adjusts the kernel size to preserve edges while eliminating isolated artifacts. For artificial and inherent textures, which predominantly occupy the high-frequency domain, frequency-domain filtering is applied. A Gaussian low-pass filter smooths the image to attenuate high-frequency textural patterns, while wavelet transformation allows for multi-scale texture separation by selecting appropriate basis functions.
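The frequency-domain step can be sketched with NumPy's FFT alone: build a Gaussian transfer function centered on the DC component and multiply it with the shifted spectrum. The cutoff `sigma` is an illustrative choice, not a value from the text:

```python
import numpy as np

def gaussian_lowpass(img, sigma=20.0):
    """Attenuate high-frequency texture with a frequency-domain Gaussian low-pass filter."""
    h, w = img.shape
    F = np.fft.fftshift(np.fft.fft2(img))            # centered spectrum
    yy, xx = np.mgrid[:h, :w]
    d2 = (yy - h / 2) ** 2 + (xx - w / 2) ** 2       # squared distance from the DC component
    H = np.exp(-d2 / (2.0 * sigma ** 2))             # Gaussian transfer function
    filtered = np.fft.ifft2(np.fft.ifftshift(F * H))
    return np.real(filtered)

rng = np.random.default_rng(1)
noisy = rng.normal(128, 40, size=(128, 128))         # synthetic noisy "rubbing"
smooth = gaussian_lowpass(noisy, sigma=15.0)         # high-frequency texture suppressed
```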

Feature extraction encompasses shape, texture, and intensity characteristics. For a given region $R$ with area $A$, the features are defined as:

Shape Feature: $$ \Phi_{shape}(R) = \frac{1}{A} \iint_R (x^2 + y^2) \,dx\,dy $$

Texture Feature: $$ \Phi_{tex}(R) = \frac{1}{A} \iint_R G(x,y) \cdot I(x,y) \,dx\,dy $$

Intensity Feature: $$ \Phi_{int}(R) = \frac{1}{A} \iint_R I(x,y) \,dx\,dy $$

where $G(x,y)$ represents a Gaussian kernel. The feature vector $\mathbf{v} = [\Phi_{shape}, \Phi_{tex}, \Phi_{int}]$ is fed into classifiers such as Support Vector Machines (SVM), Random Forests, or Convolutional Neural Networks (CNN) to categorize the region as script or noise.
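The three integrals have straightforward discrete approximations over a binary region mask. Centering the Gaussian kernel on the region centroid is an assumption here, since the text does not fix where $G$ is anchored:

```python
import numpy as np

def region_features(I, mask, sigma=3.0):
    """Discrete approximations of the shape, texture, and intensity descriptors over region R."""
    ys, xs = np.nonzero(mask)
    A = len(xs)                                      # region area in pixels
    # shape: mean squared distance from the image origin (discrete double integral / A)
    phi_shape = np.mean(xs.astype(float) ** 2 + ys.astype(float) ** 2)
    # texture: Gaussian-weighted mean intensity; kernel centered on the region centroid (assumption)
    cy, cx = ys.mean(), xs.mean()
    G = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    phi_tex = np.sum(G * I[ys, xs]) / A
    # intensity: mean intensity over the region
    phi_int = I[ys, xs].mean()
    return np.array([phi_shape, phi_tex, phi_int])
```

The resulting vector can then be passed to any of the classifiers mentioned above.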

import cv2
import numpy as np

def preprocess_rubbing(img_path, out_path):
    src_img = cv2.imread(img_path)
    gray_img = cv2.cvtColor(src_img, cv2.COLOR_BGR2GRAY)

    # Denoising using median filter
    denoised_img = cv2.medianBlur(gray_img, 3)

    # Otsu's thresholding
    _, binary_img = cv2.threshold(denoised_img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Morphological operations to clean up
    morph_kernel = np.ones((3, 3), np.uint8)
    cleaned_img = cv2.morphologyEx(binary_img, cv2.MORPH_OPEN, morph_kernel, iterations=2)

    # Contour detection
    contours, _ = cv2.findContours(cleaned_img, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    result_img = cv2.drawContours(src_img.copy(), contours, -1, (0, 255, 0), 2)

    cv2.imwrite(out_path, result_img)
    return result_img

The segmentation of individual characters from rubbings can be modeled as a composite function $S(I_{raw}) = N_{point}(F(I_{raw})) \cup N_{art}(F(I_{raw})) \cup N_{inher}(F(I_{raw}))$, where $F$ extracts the feature space, and $N_{point}$, $N_{art}$, and $N_{inher}$ denote the respective noise removal operators. To achieve robust single-character isolation, a U-Net architecture is employed for pixel-wise segmentation. This encoder-decoder structure captures contextual information while maintaining spatial localization, effectively separating characters from complex backgrounds.
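A minimal two-level encoder-decoder with one skip connection conveys the idea; this sketch is far smaller than a production U-Net, and all layer widths are illustrative choices:

```python
import torch
import torch.nn as nn

class MiniUNet(nn.Module):
    """Two-level encoder-decoder with a skip connection (illustrative, not the full U-Net)."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.mid = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 1, 1))   # per-pixel character/background logit

    def forward(self, x):
        e = self.enc(x)                          # encoder features at full resolution
        m = self.mid(self.down(e))               # bottleneck captures context at half resolution
        u = self.up(m)                           # upsample back to full resolution
        return self.dec(torch.cat([u, e], dim=1))  # skip connection preserves localization
```

The skip connection is what lets the decoder recover the spatial detail the pooling step discards.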

The model is trained using a categorical cross-entropy loss and optimized with Adam or SGD. Performance is quantified via $k$-fold cross-validation to ensure generalization to unseen data; reported metrics include precision, recall, and the F1-score.
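The fold split and the three metrics are simple enough to write out directly; this sketch covers the binary (script vs. noise) case, and the fold count and seed are arbitrary:

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Shuffle sample indices and split them into k folds for cross-validation."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall, and F1 for the positive class (label 1)."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Each fold serves once as the held-out set while the model trains on the remaining $k-1$ folds.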

import torch
import torch.nn as nn

class OracleSegNet(nn.Module):
    def __init__(self, num_classes=4):
        super(OracleSegNet, self).__init__()
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 16 * 16, 256),  # assumes 128x128 input: three 2x poolings yield 16x16 maps
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes)
        )

    def forward(self, x):
        x = self.feature_extractor(x)
        return self.classifier(x)

Applying the segmentation model to a batch of test images involves preprocessing, inference, and post-processing. The watershed algorithm is utilized for boundary refinement. For an image $I_t$, the foreground and background are separated using distance transforms, and connected components label isolated regions. Bounding boxes are extracted for components exceeding a minimum area threshold.

import pandas as pd

def extract_characters(src_img):
    gray = cv2.cvtColor(src_img, cv2.COLOR_BGR2GRAY)
    _, thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    
    kernel = np.ones((3, 3), np.uint8)
    opened = cv2.morphologyEx(thresh, cv2.MORPH_OPEN, kernel, iterations=2)
    
    background = cv2.dilate(opened, kernel, iterations=3)
    dist_transform = cv2.distanceTransform(opened, cv2.DIST_L2, 5)
    _, foreground = cv2.threshold(dist_transform, 0.5 * dist_transform.max(), 255, 0)
    
    foreground = np.uint8(foreground)
    unknown_region = cv2.subtract(background, foreground)
    
    _, markers = cv2.connectedComponents(foreground)
    markers = markers + 1
    markers[unknown_region == 255] = 0
    
    markers = cv2.watershed(src_img, markers)
    
    bounding_boxes = []
    for label in range(2, markers.max() + 1):  # label 1 is the background region, skip it
        mask = np.uint8(markers == label)
        x, y, w, h = cv2.boundingRect(mask)
        if w > 15 and h > 15:
            bounding_boxes.append([x, y, x + w, y + h])
            
    return bounding_boxes

def process_test_dataset(test_dir, output_excel):
    results = []
    for idx in range(1, 201):
        img_path = f"{test_dir}/img_{idx}.jpg"
        img = cv2.imread(img_path)
        if img is None:  # skip missing or unreadable images
            continue
        boxes = extract_characters(img)
        results.append({"image_id": idx, "bounding_boxes": str(boxes)})
        
    df = pd.DataFrame(results)
    df.to_excel(output_excel, index=False)

Character recognition is formulated as a multi-class classification task over $K$ distinct character categories. The output is a $K$-dimensional probability vector $\mathbf{p} = \text{Softmax}(\mathbf{W} \cdot \text{CNN}(I_{char}) + \mathbf{b})$. The pipeline consists of:

  1. Preprocessing: $I_{ref} = \Psi_{prep}(I_{char})$
  2. Feature Extraction: $\mathbf{f} = \Psi_{feat}(I_{ref})$
  3. Model Training: $\Theta = \Psi_{train}(\mathbf{f})$
  4. Data Augmentation: $D_{aug} = \Psi_{aug}(D_{orig})$ using elastic deformations, rotations, and scaling to mitigate class imbalance.
  5. Prediction: $\hat{y} = \Psi_{predict}(I_{test}) = \Theta(\Psi_{feat}(\Psi_{prep}(I_{test})))$
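The final prediction step amounts to an affine map followed by a softmax over the $K$ categories. In this sketch the feature vector stands in for $\text{CNN}(I_{char})$, and $K$, the feature width, and the random weights are all placeholders:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

K = 5                               # number of character categories (illustrative)
rng = np.random.default_rng(0)
f = rng.normal(size=8)              # stand-in for the CNN feature vector CNN(I_char)
W, b = rng.normal(size=(K, 8)), rng.normal(size=K)
p = softmax(W @ f + b)              # K-dimensional probability vector
y_hat = int(np.argmax(p))           # predicted character index
```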

To address the high variance in ancient script morphology and the presence of variant characters, Attention Mechanisms are integrated into the CNN backbone. This allows the network to focus on critical stroke regions while ignoring residual background interference. Additionally, transfer learning from a CRNN (Convolutional Recurrent Neural Network) pre-trained on modern or synthetic script datasets accelerates convergence and improves feature representation. The predicted labels are mapped back to their corresponding characters and exported in a structured format.
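One common way to add attention to a CNN backbone is a squeeze-and-excitation style channel-reweighting block; this is a generic sketch, not the specific mechanism the pipeline above prescribes, and the reduction ratio is an arbitrary choice:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention block (illustrative)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3)))   # squeeze: global average pool, then excitation MLP
        return x * w.view(b, c, 1, 1)     # reweight channels, emphasizing informative stroke features
```

Inserted after a convolutional stage, the block learns to amplify channels carrying stroke evidence and damp those dominated by background interference.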
