Text Similarity Detection System: Development Process and Technical Implementation
CourseSoftware Engineering FundamentalsAssignment LinkCourse Assignment PortalObjectives1. Understand software development lifecycle2. Enhance coding proficiency3. Master code quality and performance analysis techniques4. Implement unit testing for projects5. Understand test coverage metrics
GitHub Repository
- Personal Software Process Table
The following table documents time estimates versus actual time spent across all phases of the development lifecycle. This tracking helps identify areas for process improvement in future projects.
PSP2.1 StageDescriptionEstimated Time (minutes)Actual Time (minutes)PlanningProject planning and task estimation3035· EstimateTime estimation for task completion3035DevelopmentCore development work420450· AnalysisRequirements analysis and technology learning6070· Design SpecCreating design documentation6065· Design ReviewDesign review and feedback3025· Coding StandardEstablishing coding guidelines3025· DesignDetailed architectural design6065· CodingImplementation and coding120140· Code ReviewSelf code review3025· TestTesting and code refinement3035ReportingDocumentation and reporting6055· Test ReportFormal test documentation3025· Size MeasurementWorkload quantification105· PostmortemRetrospective and improvement planning2025Total510540
- Module Design and Implementation
2.1 Architecture Overview
The solution employs a modular architecture with clear separation of concerns. Each module performs a specific function, making the codebase maintainable and extensible.
Core Modules:
2.2 Function Specifications
Each function is designed with a single responsibility principle, making testing and maintenance straightforward.
def load_document(file_path: str) -> str:
"""
Load text content from a file with comprehensive error handling.
Args:
file_path: Path to the text file to be read
Returns:
str: File contents as a single string
Raises:
FileNotFoundError: When the file does not exist
PermissionError: When file access permissions are insufficient
UnicodeError: When file encoding is incompatible
IOError: For general input/output failures
"""
try:
with open(file_path, 'r', encoding='utf-8') as file:
return file.read()
except (FileNotFoundError, PermissionError, UnicodeError, IOError) as e:
logging.error(f"File access error for {file_path}: {str(e)}")
raise
def tokenize_text(raw_text: str) -> list[str]:
"""
Clean and tokenize input text using jieba Chinese word segmentation.
Args:
raw_text: Unprocessed text containing punctuation and whitespace
Returns:
list[str]: List of tokenized words in precision mode
"""
jieba.initialize()
sanitized = re.sub(r'[^\w\s]', '', raw_text)
tokens = jieba.lcut(sanitized, cut_all=False)
return tokens
def calculate_cosine_similarity(corpus_a: list[str], corpus_b: list[str]) -> float:
"""
Compute cosine similarity between two tokenized documents.
Args:
corpus_a: First document as list of tokens
corpus_b: Second document as list of tokens
Returns:
float: Similarity score between 0.0 and 1.0
"""
vocabulary = set(corpus_a) | set(corpus_b)
vector_a = [corpus_a.count(term) for term in vocabulary]
vector_b = [corpus_b.count(term) for term in vocabulary]
dot_product = sum(a * b for a, b in zip(vector_a, vector_b))
magnitude_a = math.sqrt(sum(x ** 2 for x in vector_a))
magnitude_b = math.sqrt(sum(y ** 2 for y in vector_b))
if magnitude_a == 0 or magnitude_b == 0:
return 0.0
return dot_product / (magnitude_a * magnitude_b)
2.3 Module Interaction Flow
The following diagram illustrates the control flow between modules, showing how data flows through the processing pipeline from input to output.
graph LR Entry[Application Start] --> ArgParser[Command Line Parser] ArgParser --> Loader1[File I/O: Source Document] ArgParser --> Loader2[File I/O: Suspect Document] Loader1 --> Preprocessor1[Text Preprocessor] Loader2 --> Preprocessor2[Text Preprocessor] Preprocessor1 --> Comparator[Similarity Engine] Preprocessor2 --> Comparator Comparator --> OutputHandler[Result Writer]
2.4 Algorithm Implementation Details
The cosine similarity algorithm forms the core computation engine. The approach involves constructing term frequency vectors from tokenized documents and computing the dot product relative to vector magnitudes.
Algorithm Steps:
- Vocabulary Construction: Create a unified set of all unique terms from both documents
- Vector Generation: Count occurrences of each vocabulary term in both documents
- Dot Product Calculation: Sum the products of corresponding vector components
- Magnitude Computation: Calculate the Euclidean norm for each vector
- Similarity Derivation: Divide dot product by the product of magnitudes
Edge Case Handling:
flowchart TD
A[Start Computation] --> B[Construct TF Vectors]
B --> C[Calculate Dot Product]
C --> D[Compute Vector Magnitudes]
D --> E{Magnitude Zero Check}
E -->|Yes - Empty Document| F[Return 0.0]
E -->|No| G[Calculate Cosine Value]
G --> H[Return Similarity Score]
- Performence Analysis and Optimization
3.1 Code Quality Assessment
The codebase was analyzed using PyCharm's built-in static analysis tools, including PEP 8 compliance checks and potential bug detection. Key findings informed several refactoring decisions to improve code clarity and maintainability.
3.2 Performance Profiling Results
Performance analysis was conducted using line_profiler, which provides granular execution time metrics at the function level. This revealed the following time distribution across core functions:
3.3 Optimization Strategies
Optimization 1: Text Preprocessing Enhancement
The original preprocessing function performed separate regex operations for punctuation and whitespace removal. By consolidating these operations into a single regex pattern, the number of text scans was reduced.
Before:
def tokenize_text(text):
# Separate operations - multiple passes through text
text = re.sub(r'[^\w\s]', '', text) # Remove punctuation
text = re.sub(r'\s+', ' ', text) # Normalize whitespace
words = jieba.lcut(text)
return words
After:
def tokenize_text(text):
# Single-pass sanitization
jieba.initialize() # Cache dictionary in advance
sanitized = re.sub(r'[^\w\s]|s+', '', text)
tokens = jieba.lcut(sanitized, cut_all=False)
return tokens
Optimization 2: Dictionary Pre-initialization
By calling jieba.initialize() before text processing, the Chinese dictionary is loaded into memory once during module import, eliminating repeated initialization overhead for subsequent function calls.
Measured Improvement:
- Unit Testing Implementation
4.1 Test Framework Configuration
The project uses Python's unittest framework, which provides a robust foundation for defining and executing test cases. Mock objects from unittest.mock enable testing of edge cases without filesystem dependencies.
import unittest
from unittest.mock import patch, mock_open
from similarity_checker import (
load_document,
tokenize_text,
calculate_cosine_similarity,
run_analysis
)
class TestDocumentSimilaritySystem(unittest.TestCase):
"""Comprehensive test suite for the similarity detection system."""
def test_document_loading_success(self):
"""Verify successful file reading with valid input."""
test_content = "This is test document content for verification."
with patch("builtins.open", mock_open(read_data=test_content)):
result = load_document("dummy_path.txt")
self.assertEqual(result, test_content)
def test_document_loading_file_not_found(self):
"""Ensure appropriate exception for missing files."""
with patch("builtins.open", side_effect=FileNotFoundError):
with self.assertRaises(FileNotFoundError):
load_document("/nonexistent/path/document.txt")
def test_document_loading_permission_denied(self):
"""Verify permission-related exceptions are properly propagated."""
with patch("builtins.open", side_effect=PermissionError("Access denied")):
with self.assertRaises(PermissionError):
load_document("/protected/file.txt")
def test_document_loading_encoding_failure(self):
"""Test handling of incompatible file encodings."""
with patch("builtins.open", side_effect=UnicodeDecodeError(
"utf-8", b"", 0, 1, "invalid start byte")):
with self.assertRaises(UnicodeDecodeError):
load_document("corrupted_encoding.txt")
def test_text_tokenization_accuracy(self):
"""Validate tokenization with various input patterns."""
input_text = "Testing tokenization!\nWith punctuation, and newlines."
expected_tokens = ["Testing", "tokenization", "With",
"punctuation", "and", "newlines"]
result = tokenize_text(input_text)
self.assertEqual(result, expected_tokens)
def test_similarity_identical_documents(self):
"""Similarity between identical documents should be 1.0."""
doc_a = ["natural", "language", "processing"]
doc_b = ["natural", "language", "processing"]
score = calculate_cosine_similarity(doc_a, doc_b)
self.assertAlmostEqual(score, 1.0, places=4)
def test_similarity_completely_different(self):
"""Documents with no shared terms should score 0.0."""
doc_a = ["artificial", "intelligence"]
doc_b = ["quantum", "computing", "physics"]
score = calculate_cosine_similarity(doc_a, doc_b)
self.assertEqual(score, 0.0)
def test_similarity_partial_overlap(self):
"""Verify accurate scoring for documents with partial term overlap."""
doc_a = ["machine", "learning", "algorithms", "data"]
doc_b = ["machine", "learning", "neural", "networks"]
expected_similarity = 0.4082 # sqrt(2/3) * sqrt(2/4) / (|A|*|B| terms)
result = calculate_cosine_similarity(doc_a, doc_b)
self.assertAlmostEqual(result, expected_similarity, places=4)
4.2 Test Coverage Analysis
- Exception Handling Strategy
Robust error handling ensures the application fails gracefully and provides actionable feedback to users when problems occur.
5.1 Exception Hierarchy
5.2 Test Coverage for Exception Paths
Each exception type is explicitly tested to ensure proper propagation:
def test_io_error_handling(self):
"""Verify IOError exceptions are caught and re-raised appropriately."""
with patch("builtins.open", side_effect=IOError("Disk read failure")):
with self.assertRaises(IOError) as context:
load_document("/dev/sda1") # Simulated device error
self.assertIn("Disk read failure", str(context.exception))
def test_unexpected_exception(self):
"""Ensure generic exception handling for unforeseen errors."""
with patch("builtins.open", side_effect=RuntimeError("Unexpected system error")):
with self.assertRaises(RuntimeError):
load_document("any_path.txt")
Summary
This project demonstrates end-to-end implementation of a text similarity detection system using Python. Key achievements include:
- Modular Architecture: Clean separation of concerns enabling independent testing and maintenance
- Algorithm Implementation: Accurate cosine similarity computation using term frequency vectors
- Performance Optimizaton: 19% improvement through regex consolidation and dictionary pre-initialization
- Comprehensive Testing: 100% unit test coverage on core modules with robust exception handling
- Quality Assurance: Static analysis integration ensuring code quality standards