Fading Coder

One Final Commit for the Last Sprint


Personal Programming Assignment Overview


Second Assignment: Individual Project

GitHub Repository Link: https://github.com/Mark-Zhangbinghan/Mark-Zhangbinghan/tree/main/3123004723

PSP Table

Design and Implementation Process of Calculation Module

1. Code Organization Structure

This plagiarism detection system follows an object-oriented design, built around a single core class, PlagiarismDetector:

Class Diagram Structure

classDiagram
    class PlagiarismDetector {
        - similarity_threshold: float
        + __init__()
        + read_file(file_path: str) str
        + preprocess_text(text: str) List[str]
        + calculate_similarity(text1: str, text2: str) float
        + check_plagiarism(original_file: str, copied_file: str) float
        + save_result(result: float, output_file: str)
    }

  • PlagiarismDetector class: Core plagiarism detection class, encapsulating all plagiarism detection functions
  • main function: Program entry point, responsible for command-line argument parsing and process control
  • Unit test class: Independent test module, verifying the correctness of each function

2. Key Function Flowcharts

calculate_similarity function flow

graph TD
    A[Start] --> B[Input text1 and text2]
    B --> C[Text preprocessing]
    C --> D[Build vocabulary]
    D --> E[Generate word frequency vector]
    E --> F[Calculate cosine similarity]
    F --> G[Return similarity result]
    G --> H[End]

check_plagiarism function flow

graph TD
    A[Start] --> B[Read original file]
    B --> C[Read copied file]
    C --> D[Calculate similarity]
    D --> E[Return result]
    E --> F[End]

3. Algorithm Key Points

3.1 Cosine Similarity Algorithm

Core Formula: $$ similarity = \frac{A \cdot B}{|A| \times |B|} $$

Where A and B are the word frequency vectors of the texts.

Implementation Steps:

  1. Tokenization: Use jieba for Chinese tokenization
  2. Vectorization: Convert text into word frequency vectors
  3. Similarity Calculation: Calculate cosine value based on vector space model
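The three steps above can be sketched in plain Python (a minimal illustration: whitespace tokenization stands in for jieba, and the function name is ours, not the project's):

```python
import math
from collections import Counter

def cosine_similarity(words1, words2):
    """Cosine similarity between two token lists via word-frequency vectors."""
    vec1, vec2 = Counter(words1), Counter(words2)
    # Dot product over the shared vocabulary only; disjoint words contribute 0
    dot = sum(vec1[w] * vec2[w] for w in vec1.keys() & vec2.keys())
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # guard against zero vectors (empty text after filtering)
    return dot / (norm1 * norm2)
```

Identical token lists score 1.0 and disjoint ones 0.0, matching the boundary-value handling described below.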

3.2 Unique Features

  • Multi-encoding Support: Automatically detect and handle GBK and UTF-8 encoding
  • Short Word Filtering: Filter out tokens of length ≤ 1 to reduce noise and improve accuracy
  • Boundary Value Handling: Ensure similarity results are within [0,1]
  • Performance Optimization: Use numpy for vector operations to improve calculation efficiency

Performance Improvements in Calculation Module Interface

1. Performance Analysis Results

Performance was profiled with cProfile; the key figures are as follows:

Performance Analysis Statistics Table

Most time-consuming function: calculate_similarity (35% of total runtime)
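A profiling run of this kind can be reproduced with the standard cProfile/pstats pair (a generic sketch; the profiled call below is a stand-in for the actual detector run):

```python
import cProfile
import io
import pstats

def profile_call(func, *args):
    """Run func under cProfile and return (result, top-10 cumulative-time report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()
    buffer = io.StringIO()
    pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()
```

Sorting by cumulative time is what surfaces a hotspot like calculate_similarity, since it includes time spent in the functions it calls.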

2. Performance Bottleneck Identification

Main Issues Before Improvement:

  1. Vocabulary Rebuilding: The vocabulary was rebuilt from scratch on every call
  2. Low Loop Efficiency: Vectorization was done with pure Python loops
  3. Redundant Calculations: The same text was tokenized repeatedly

3. Performance Improvement Measures

Improvement Ideas:

  1. Vocabulary Cache Mechanism
# Before improvement: rebuild every time
vocab = list(set(words1 + words2))

# After improvement: use cache
if (text1, text2) in self.vocab_cache:
    vocab = self.vocab_cache[(text1, text2)]
else:
    vocab = list(set(words1 + words2))
    self.vocab_cache[(text1, text2)] = vocab

  2. Vectorization Optimization
# Before improvement: Python loop
for word in words1:
    if word in word_to_idx:
        vector1[word_to_idx[word]] += 1

# After improvement: numpy batch operation
indices = [word_to_idx[word] for word in words1 if word in word_to_idx]
np.add.at(vector1, indices, 1)  # accumulates duplicate indices correctly
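When building frequency vectors with NumPy, duplicate indices are a classic pitfall: in-place fancy indexing applies each index only once, whereas np.add.at accumulates repeats. A minimal demonstration:

```python
import numpy as np

idx = [0, 0, 1]       # word 0 appears twice

v = np.zeros(3)
v[idx] += 1           # buffered update: duplicates collapse, v is [1., 1., 0.]

w = np.zeros(3)
np.add.at(w, idx, 1)  # unbuffered add: duplicates accumulate, w is [2., 1., 0.]
```

For word counts, the accumulating variant is the correct one, otherwise repeated words are undercounted.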

  3. Preprocessing Result Cache
# Add preprocessing cache
self.preprocess_cache = {}

def preprocess_text(self, text: str) -> List[str]:
    if text in self.preprocess_cache:
        return self.preprocess_cache[text]
    # ... processing logic
    self.preprocess_cache[text] = result
    return result
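A similar effect is available from the standard functools.lru_cache, which additionally bounds memory use (a sketch with placeholder tokenization; the real method uses jieba):

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def preprocess_cached(text):
    # Placeholder tokenization (whitespace split plus short-word filtering);
    # returns a tuple so the cached value is hashable and immutable.
    return tuple(w for w in text.split() if len(w) > 1)
```

Unlike a hand-rolled dict cache, lru_cache evicts old entries once maxsize is reached, so memory cannot grow without bound across many inputs.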

4. Improvement Effect Comparison

Unit Testing Demonstration for Calculation Module

1. Test Framework Configuration

import unittest
import os
import tempfile
from main import PlagiarismDetector

class TestPlagiarismDetector(unittest.TestCase):
    """Plagiarism detection test class"""
    
    def setUp(self):
        """Preparation before test"""
        self.detector = PlagiarismDetector()
        self.test_dir = tempfile.mkdtemp()

2. Core Test Cases

2.1 Test for Identical Texts

def test_calculate_similarity_identical(self):
    """Test similarity for identical texts"""
    text1 = "Today is Sunday, weather is clear, I am going to watch a movie tonight."
    text2 = "Today is Sunday, weather is clear, I am going to watch a movie tonight."
    
    similarity = self.detector.calculate_similarity(text1, text2)
    self.assertAlmostEqual(similarity, 1.0, places=2)

Test Purpose: Verify that the algorithm can correctly identify 100% similarity for completely identical texts

2.2 Test for Partially Similar Texts

def test_calculate_similarity_partial(self):
    """Test similarity for partially similar texts"""
    text1 = "Today is Sunday, weather is clear, I am going to watch a movie tonight."
    text2 = "Today is Monday, weather is sunny, I will go to school tomorrow."
    
    similarity = self.detector.calculate_similarity(text1, text2)
    self.assertGreater(similarity, 0.3)
    self.assertLess(similarity, 0.9)

Test Purpose: Verify that the algorithm assigns an intermediate similarity to texts that share part of their wording but differ in content

2.3 Test for Completely Different Texts

def test_calculate_similarity_different(self):
    """Test similarity for completely different texts"""
    text1 = "Today is Sunday, weather is clear, I am going to watch a movie tonight."
    text2 = "Tomorrow is Monday, weather is cloudy, I will go to school tomorrow."
    
    similarity = self.detector.calculate_similarity(text1, text2)
    self.assertLess(similarity, 0.5)

Test Purpose: Verify that the similarity should be low for completely different content

3. Boundary Condition Tests

3.1 Empty Text Test

def test_calculate_similarity_empty(self):
    """Test similarity for empty text"""
    similarity = self.detector.calculate_similarity("", "test text")
    self.assertEqual(similarity, 0.0)

3.2 Result Format Test

def test_save_result_format(self):
    """Test result saving format"""
    output_file = os.path.join(self.test_dir, "result.txt")
    self.detector.save_result(0.756, output_file)
    
    with open(output_file, 'r', encoding='utf-8') as f:
        result = f.read()
    self.assertEqual(result, "0.76")  # Verify rounding

4. Test Coverage Report

Test coverage summary:
────────────────────────────────────────
Name                Stmts   Miss  Cover
────────────────────────────────────────
main.py                86      4    95%
test_main.py          105      0   100%
────────────────────────────────────────
TOTAL                 191      4    98%
────────────────────────────────────────

Coverage details:
- Statement coverage: 98%
- Branch coverage: 95% 
- Function coverage: 100%
- Line coverage: 97%

Exception Handling Description for Calculation Module

1. File Operation Exception Handling

1.1 File Not Found Exception

def test_read_file_not_exist(self):
    """Test case for file not found"""
    with self.assertRaises(FileNotFoundError):
        self.detector.read_file("nonexistent_file.txt")

  • Design Objective: Prevent the program from crashing due to an incorrect file path
  • Error Scenario: The user inputs a non-existent file path
  • Handling Method: Raise a clear FileNotFoundError

1.2 File Permission Exception

def test_save_result_permission_error(self):
    """Test case for permission error when saving result"""
    output_file = "/root/result.txt"  # No permission directory
    with self.assertRaises(IOError):
        self.detector.save_result(0.5, output_file)

  • Design Objective: Handle the case where the target directory is not writable
  • Error Scenario: The program has no permission to write to the specified directory
  • Handling Method: Catch the permission error and raise an IOError

2. Data Validation Exception Handling

2.1 Empty File Content Exception

def test_read_empty_file(self):
    """Test handling of empty files"""
    filepath = self.create_test_file("", "empty.txt")
    with self.assertRaises(ValueError):
        self.detector.read_file(filepath)

  • Design Objective: Ensure input data validity
  • Error Scenario: The user provides an empty file
  • Handling Method: Raise a ValueError prompting the user to check the file content

2.2 Encoding Format Exception

def test_file_encoding_error(self):
    """Test handling of file encoding errors"""
    # Create binary file to simulate encoding error
    filepath = os.path.join(self.test_dir, "binary.bin")
    with open(filepath, 'wb') as f:
        f.write(b'\xff\xfe\x00\x01')
    
    with self.assertRaises(IOError):
        self.detector.read_file(filepath)

  • Design Objective: Handle unsupported file encodings
  • Error Scenario: The file encoding does not match any format the program expects
  • Handling Method: Try multiple encodings and raise an IOError if all of them fail

3. Calculation Process Exception Handling

3.1 Zero Vector Exception

def test_zero_vector_similarity(self):
    """Test similarity calculation with zero vectors"""
    # Two texts are stop words, may produce zero vectors
    text1 = "of the then"
    text2 = "the way"
    
    similarity = self.detector.calculate_similarity(text1, text2)
    self.assertEqual(similarity, 0.0)  # Should return 0 instead of error

  • Design Objective: Prevent division-by-zero errors caused by zero vectors
  • Error Scenario: A text becomes an empty vector after filtering
  • Handling Method: Check the vector magnitudes before dividing and return 0 if either is zero

3.2 Memory Overflow Exception

def test_large_file_processing(self):
    """Test handling of large files"""
    # Generate large text to test memory management
    large_text = "Test text " * 1000000
    
    file1 = self.create_test_file(large_text, "large1.txt")
    file2 = self.create_test_file(large_text, "large2.txt")
    
    # Should handle normally without memory overflow
    similarity = self.detector.check_plagiarism(file1, file2)
    self.assertEqual(similarity, 1.0)

  • Design Objective: Ensure the program can handle large files without crashing
  • Error Scenario: Memory runs short when processing very large text files
  • Handling Method: Use generators and streaming processing to reduce memory usage

4. Summary of Exception Handling Strategies

Through this comprehensive exception handling mechanism, the system stays stable under a range of abnormal conditions and gives users clear error messages, which greatly improves its robustness and user experience.

Usage Instructions

  1. Install dependencies:
pip install -r requirements.txt

  2. Run the program:
python main.py /path/to/original.txt /path/to/copied.txt /path/to/output.txt

  3. Run tests:
python -m pytest test_main.py -v
