Fading Coder

One Final Commit for the Last Sprint

Home > Tech > Content

Text Similarity Detection System: Development Process and Technical Implementation

Tech May 9 4

CourseSoftware Engineering FundamentalsAssignment LinkCourse Assignment PortalObjectives1. Understand software development lifecycle2. Enhance coding proficiency3. Master code quality and performance analysis techniques4. Implement unit testing for projects5. Understand test coverage metrics

GitHub Repository

View Project on GitHub


  1. Personal Software Process Table

The following table documents time estimates versus actual time spent across all phases of the development lifecycle. This tracking helps identify areas for process improvement in future projects.

PSP2.1 StageDescriptionEstimated Time (minutes)Actual Time (minutes)PlanningProject planning and task estimation3035· EstimateTime estimation for task completion3035DevelopmentCore development work420450· AnalysisRequirements analysis and technology learning6070· Design SpecCreating design documentation6065· Design ReviewDesign review and feedback3025· Coding StandardEstablishing coding guidelines3025· DesignDetailed architectural design6065· CodingImplementation and coding120140· Code ReviewSelf code review3025· TestTesting and code refinement3035ReportingDocumentation and reporting6055· Test ReportFormal test documentation3025· Size MeasurementWorkload quantification105· PostmortemRetrospective and improvement planning2025Total510540


  1. Module Design and Implementation

2.1 Architecture Overview

The solution employs a modular architecture with clear separation of concerns. Each module performs a specific function, making the codebase maintainable and extensible.

Core Modules:

2.2 Function Specifications

Each function is designed with a single responsibility principle, making testing and maintenance straightforward.

def load_document(file_path: str) -> str:
    """
    Load text content from a file with comprehensive error handling.
    
    Args:
        file_path: Path to the text file to be read
        
    Returns:
        str: File contents as a single string
        
    Raises:
        FileNotFoundError: When the file does not exist
        PermissionError: When file access permissions are insufficient
        UnicodeError: When file encoding is incompatible
        IOError: For general input/output failures
    """
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            return file.read()
    except (FileNotFoundError, PermissionError, UnicodeError, IOError) as e:
        logging.error(f"File access error for {file_path}: {str(e)}")
        raise


def tokenize_text(raw_text: str) -> list[str]:
    """
    Clean and tokenize input text using jieba Chinese word segmentation.
    
    Args:
        raw_text: Unprocessed text containing punctuation and whitespace
        
    Returns:
        list[str]: List of tokenized words in precision mode
    """
    jieba.initialize()
    sanitized = re.sub(r'[^\w\s]', '', raw_text)
    tokens = jieba.lcut(sanitized, cut_all=False)
    return tokens


def calculate_cosine_similarity(corpus_a: list[str], corpus_b: list[str]) -> float:
    """
    Compute cosine similarity between two tokenized documents.
    
    Args:
        corpus_a: First document as list of tokens
        corpus_b: Second document as list of tokens
        
    Returns:
        float: Similarity score between 0.0 and 1.0
    """
    vocabulary = set(corpus_a) | set(corpus_b)
    vector_a = [corpus_a.count(term) for term in vocabulary]
    vector_b = [corpus_b.count(term) for term in vocabulary]
    
    dot_product = sum(a * b for a, b in zip(vector_a, vector_b))
    magnitude_a = math.sqrt(sum(x ** 2 for x in vector_a))
    magnitude_b = math.sqrt(sum(y ** 2 for y in vector_b))
    
    if magnitude_a == 0 or magnitude_b == 0:
        return 0.0
        
    return dot_product / (magnitude_a * magnitude_b)

2.3 Module Interaction Flow

The following diagram illustrates the control flow between modules, showing how data flows through the processing pipeline from input to output.

graph LR Entry[Application Start] --> ArgParser[Command Line Parser] ArgParser --> Loader1[File I/O: Source Document] ArgParser --> Loader2[File I/O: Suspect Document] Loader1 --> Preprocessor1[Text Preprocessor] Loader2 --> Preprocessor2[Text Preprocessor] Preprocessor1 --> Comparator[Similarity Engine] Preprocessor2 --> Comparator Comparator --> OutputHandler[Result Writer]

2.4 Algorithm Implementation Details

The cosine similarity algorithm forms the core computation engine. The approach involves constructing term frequency vectors from tokenized documents and computing the dot product relative to vector magnitudes.

Algorithm Steps:

  1. Vocabulary Construction: Create a unified set of all unique terms from both documents
  2. Vector Generation: Count occurrences of each vocabulary term in both documents
  3. Dot Product Calculation: Sum the products of corresponding vector components
  4. Magnitude Computation: Calculate the Euclidean norm for each vector
  5. Similarity Derivation: Divide dot product by the product of magnitudes

Edge Case Handling:

flowchart TD
    A[Start Computation] --> B[Construct TF Vectors]
    B --> C[Calculate Dot Product]
    C --> D[Compute Vector Magnitudes]
    D --> E{Magnitude Zero Check}
    E -->|Yes - Empty Document| F[Return 0.0]
    E -->|No| G[Calculate Cosine Value]
    G --> H[Return Similarity Score]


  1. Performence Analysis and Optimization

3.1 Code Quality Assessment

The codebase was analyzed using PyCharm's built-in static analysis tools, including PEP 8 compliance checks and potential bug detection. Key findings informed several refactoring decisions to improve code clarity and maintainability.

3.2 Performance Profiling Results

Performance analysis was conducted using line_profiler, which provides granular execution time metrics at the function level. This revealed the following time distribution across core functions:

3.3 Optimization Strategies

Optimization 1: Text Preprocessing Enhancement

The original preprocessing function performed separate regex operations for punctuation and whitespace removal. By consolidating these operations into a single regex pattern, the number of text scans was reduced.

Before:

def tokenize_text(text):
    # Separate operations - multiple passes through text
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = re.sub(r'\s+', ' ', text)     # Normalize whitespace
    words = jieba.lcut(text)
    return words

After:

def tokenize_text(text):
    # Single-pass sanitization
    jieba.initialize()  # Cache dictionary in advance
    sanitized = re.sub(r'[^\w\s]|s+', '', text)
    tokens = jieba.lcut(sanitized, cut_all=False)
    return tokens

Optimization 2: Dictionary Pre-initialization

By calling jieba.initialize() before text processing, the Chinese dictionary is loaded into memory once during module import, eliminating repeated initialization overhead for subsequent function calls.

Measured Improvement:


  1. Unit Testing Implementation

4.1 Test Framework Configuration

The project uses Python's unittest framework, which provides a robust foundation for defining and executing test cases. Mock objects from unittest.mock enable testing of edge cases without filesystem dependencies.

import unittest
from unittest.mock import patch, mock_open
from similarity_checker import (
    load_document,
    tokenize_text,
    calculate_cosine_similarity,
    run_analysis
)


class TestDocumentSimilaritySystem(unittest.TestCase):
    """Comprehensive test suite for the similarity detection system."""

    def test_document_loading_success(self):
        """Verify successful file reading with valid input."""
        test_content = "This is test document content for verification."
        with patch("builtins.open", mock_open(read_data=test_content)):
            result = load_document("dummy_path.txt")
            self.assertEqual(result, test_content)

    def test_document_loading_file_not_found(self):
        """Ensure appropriate exception for missing files."""
        with patch("builtins.open", side_effect=FileNotFoundError):
            with self.assertRaises(FileNotFoundError):
                load_document("/nonexistent/path/document.txt")

    def test_document_loading_permission_denied(self):
        """Verify permission-related exceptions are properly propagated."""
        with patch("builtins.open", side_effect=PermissionError("Access denied")):
            with self.assertRaises(PermissionError):
                load_document("/protected/file.txt")

    def test_document_loading_encoding_failure(self):
        """Test handling of incompatible file encodings."""
        with patch("builtins.open", side_effect=UnicodeDecodeError(
                "utf-8", b"", 0, 1, "invalid start byte")):
            with self.assertRaises(UnicodeDecodeError):
                load_document("corrupted_encoding.txt")

    def test_text_tokenization_accuracy(self):
        """Validate tokenization with various input patterns."""
        input_text = "Testing tokenization!\nWith punctuation, and newlines."
        expected_tokens = ["Testing", "tokenization", "With", 
                          "punctuation", "and", "newlines"]
        result = tokenize_text(input_text)
        self.assertEqual(result, expected_tokens)

    def test_similarity_identical_documents(self):
        """Similarity between identical documents should be 1.0."""
        doc_a = ["natural", "language", "processing"]
        doc_b = ["natural", "language", "processing"]
        score = calculate_cosine_similarity(doc_a, doc_b)
        self.assertAlmostEqual(score, 1.0, places=4)

    def test_similarity_completely_different(self):
        """Documents with no shared terms should score 0.0."""
        doc_a = ["artificial", "intelligence"]
        doc_b = ["quantum", "computing", "physics"]
        score = calculate_cosine_similarity(doc_a, doc_b)
        self.assertEqual(score, 0.0)

    def test_similarity_partial_overlap(self):
        """Verify accurate scoring for documents with partial term overlap."""
        doc_a = ["machine", "learning", "algorithms", "data"]
        doc_b = ["machine", "learning", "neural", "networks"]
        expected_similarity = 0.4082  # sqrt(2/3) * sqrt(2/4) / (|A|*|B| terms)
        result = calculate_cosine_similarity(doc_a, doc_b)
        self.assertAlmostEqual(result, expected_similarity, places=4)

4.2 Test Coverage Analysis


  1. Exception Handling Strategy

Robust error handling ensures the application fails gracefully and provides actionable feedback to users when problems occur.

5.1 Exception Hierarchy

5.2 Test Coverage for Exception Paths

Each exception type is explicitly tested to ensure proper propagation:

def test_io_error_handling(self):
    """Verify IOError exceptions are caught and re-raised appropriately."""
    with patch("builtins.open", side_effect=IOError("Disk read failure")):
        with self.assertRaises(IOError) as context:
            load_document("/dev/sda1")  # Simulated device error
        self.assertIn("Disk read failure", str(context.exception))

def test_unexpected_exception(self):
    """Ensure generic exception handling for unforeseen errors."""
    with patch("builtins.open", side_effect=RuntimeError("Unexpected system error")):
        with self.assertRaises(RuntimeError):
            load_document("any_path.txt")


Summary

This project demonstrates end-to-end implementation of a text similarity detection system using Python. Key achievements include:

  • Modular Architecture: Clean separation of concerns enabling independent testing and maintenance
  • Algorithm Implementation: Accurate cosine similarity computation using term frequency vectors
  • Performance Optimizaton: 19% improvement through regex consolidation and dictionary pre-initialization
  • Comprehensive Testing: 100% unit test coverage on core modules with robust exception handling
  • Quality Assurance: Static analysis integration ensuring code quality standards

Related Articles

Understanding Strong and Weak References in Java

Strong References Strong reference are the most prevalent type of object referencing in Java. When an object has a strong reference pointing to it, the garbage collector will not reclaim its memory. F...

Comprehensive Guide to SSTI Explained with Payload Bypass Techniques

Introduction Server-Side Template Injection (SSTI) is a vulnerability in web applications where user input is improper handled within the template engine and executed on the server. This exploit can r...

Implement Image Upload Functionality for Django Integrated TinyMCE Editor

Django’s Admin panel is highly user-friendly, and pairing it with TinyMCE, an effective rich text editor, simplifies content management significantly. Combining the two is particular useful for bloggi...

Leave a Comment

Anonymous

◎Feel free to join the discussion and share your thoughts.