Personal Programming Assignment Overview
Second Assignment: Individual Project
GitHub Repository Link: https://github.com/Mark-Zhangbinghan/Mark-Zhangbinghan/tree/main/3123004723
PSP Table
Design and Implementation Process of Calculation Module
1. Code Organization Structure
This plagiarism detection system uses an object-oriented design approach, mainly containing a core class PlagiarismDetector:
Class Diagram Structure
    classDiagram
        class PlagiarismDetector {
            - similarity_threshold: float
            + __init__()
            + read_file(file_path: str) str
            + preprocess_text(text: str) List[str]
            + calculate_similarity(text1: str, text2: str) float
            + check_plagiarism(original_file: str, copied_file: str) float
            + save_result(result: float, output_file: str)
        }
- PlagiarismDetector class: Core plagiarism detection class, encapsulating all plagiarism detection functions
- main function: Program entry point, responsible for command-line argument parsing and process control
- Unit test class: Independent test module, verifying the correctness of each function
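The command-line parsing done by the entry point could look roughly like the sketch below. It uses the standard-library argparse module; the function name `parse_args` and the three argument names are assumptions made for illustration and are not taken from the repository.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the three positional paths (hypothetical argument names)."""
    parser = argparse.ArgumentParser(description="Text plagiarism checker")
    parser.add_argument("original_file", help="path to the original text")
    parser.add_argument("copied_file", help="path to the suspected copy")
    parser.add_argument("output_file", help="path where the result is written")
    # argv=None makes argparse fall back to sys.argv[1:]
    return parser.parse_args(argv)
```

Passing an explicit list, for example `parse_args(["a.txt", "b.txt", "out.txt"])`, makes the parser easy to exercise from unit tests without touching `sys.argv`.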
2. Key Function Flowcharts
calculate_similarity function flow
    graph TD
        A[Start] --> B[Input text1 and text2]
        B --> C[Text preprocessing]
        C --> D[Build vocabulary]
        D --> E[Generate word frequency vector]
        E --> F[Calculate cosine similarity]
        F --> G[Return similarity result]
        G --> H[End]
check_plagiarism function flow
    graph TD
        A[Start] --> B[Read original file]
        B --> C[Read copied file]
        C --> D[Calculate similarity]
        D --> E[Return result]
        E --> F[End]
3. Algorithm Key Points
3.1 Cosine Similarity Algorithm
Core Formula: $$ \text{similarity} = \frac{A \cdot B}{\|A\| \times \|B\|} $$
Where A and B are the word frequency vectors of the texts.
Implementation Steps:
- Tokenization: Use jieba for Chinese tokenization
- Vectorization: Convert text into word frequency vectors
- Similarity Calculation: Calculate cosine value based on vector space model
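The steps above can be sketched in plain Python. This is a minimal illustration, not the project's implementation: it assumes the texts are already tokenized (for example by jieba), uses `collections.Counter` instead of numpy vectors, and the function name `cosine_similarity` is hypothetical.

```python
import math
from collections import Counter
from typing import List


def cosine_similarity(words1: List[str], words2: List[str]) -> float:
    """Cosine similarity of two token lists based on word counts."""
    counts1, counts2 = Counter(words1), Counter(words2)
    # The dot product only needs the words that occur in both texts.
    dot = sum(counts1[w] * counts2[w] for w in counts1.keys() & counts2.keys())
    norm1 = math.sqrt(sum(c * c for c in counts1.values()))
    norm2 = math.sqrt(sum(c * c for c in counts2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # zero vector: define similarity as 0 to avoid dividing by zero
    # Clamp to [0, 1] to absorb floating-point rounding.
    return max(0.0, min(1.0, dot / (norm1 * norm2)))
```

For the token lists `["a", "a", "b"]` and `["a", "b", "b"]` the dot product is 4 and both norms are √5, so the similarity is 4/5 = 0.8.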
3.2 Unique Features
- Multi-encoding Support: Automatically detect and handle GBK and UTF-8 encoding
- Short Word Filtering: Discard tokens of length ≤ 1 to improve accuracy
- Boundary Value Handling: Ensure similarity results are within [0,1]
- Performance Optimization: Use numpy for vector operations to improve calculation efficiency
Performance Improvements in Calculation Module Interface
1. Performance Analysis Results
Profiling with cProfile produced the following key data:
Performance Analysis Statistics Table
Most Time-Consuming Function: calculate_similarity (35% of total time)
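A profile of this kind can be collected with the standard-library cProfile and pstats modules. The helper below is a sketch only; the name `profile_call` and the choice to sort by cumulative time are assumptions, not the exact procedure used for the numbers in this report.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top: int = 5) -> str:
    """Run func(*args) under cProfile and return a text report."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Sort by cumulative time and print only the `top` most expensive entries.
    stats.sort_stats("cumulative").print_stats(top)
    return stream.getvalue()
```

Calling it as `profile_call(detector.check_plagiarism, original_path, copied_path)` (with hypothetical variable names) would return a report listing the functions with the highest cumulative time.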
2. Performance Bottleneck Identification
Main Issues Before Improvement:
- Vocabulary Rebuilding: The vocabulary is rebuilt on every call
- Low Loop Efficiency: Vectorization uses pure Python loops
- Redundant Calculations: The same text is tokenized repeatedly
3. Performance Improvement Measures
Improvement Ideas:
- Vocabulary Cache Mechanism
    # Before improvement: rebuild the vocabulary on every call
    vocab = list(set(words1 + words2))
    # After improvement: reuse the cached vocabulary for a text pair seen before
    if (text1, text2) in self.vocab_cache:
        vocab = self.vocab_cache[(text1, text2)]
    else:
        vocab = list(set(words1 + words2))
        self.vocab_cache[(text1, text2)] = vocab
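- Vectorization Optimization
    # Before improvement: Python loop
    for word in words1:
        if word in word_to_idx:
            vector1[word_to_idx[word]] += 1
    # After improvement: numpy batch operation (requires `import numpy as np`)
    indices = np.array(
        [word_to_idx[word] for word in words1 if word in word_to_idx],
        dtype=np.intp,
    )
    # np.add.at accumulates repeated indices; `vector1[indices] += 1`
    # would count a word that occurs several times only once.
    np.add.at(vector1, indices, 1)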
- Preprocessing Result Cache
    # In __init__: add the preprocessing cache
    self.preprocess_cache = {}
    def preprocess_text(self, text: str) -> List[str]:
        # Return the cached token list if this text was processed before
        if text in self.preprocess_cache:
            return self.preprocess_cache[text]
        # ... processing logic
        self.preprocess_cache[text] = result
        return result
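A dictionary cache like the one above grows without bound. One possible alternative, shown here only as a sketch, is `functools.lru_cache` from the standard library, which caps the number of cached entries. The function name `preprocess_cached` is hypothetical, and a whitespace split stands in for the jieba tokenizer so the example has no external dependencies.

```python
from functools import lru_cache
from typing import Tuple


@lru_cache(maxsize=128)
def preprocess_cached(text: str) -> Tuple[str, ...]:
    """Tokenize text and cache the result, keyed by the text itself."""
    # Stand-in tokenizer: whitespace split (the project uses jieba instead).
    tokens = text.split()
    # Drop tokens of length <= 1, mirroring the short word filter.
    return tuple(tok for tok in tokens if len(tok) > 1)
```

Returning a tuple rather than a list keeps the cached value immutable, so one caller cannot accidentally modify the result another caller receives later.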
4. Improvement Effect Comparison
Unit Testing Demonstration for Calculation Module
1. Test Framework Configuration
    import unittest
    import os
    import tempfile

    from main import PlagiarismDetector


    class TestPlagiarismDetector(unittest.TestCase):
        """Plagiarism detection test class"""

        def setUp(self):
            """Preparation before test"""
            self.detector = PlagiarismDetector()
            self.test_dir = tempfile.mkdtemp()
2. Core Test Cases
2.1 Test for Identical Texts
    def test_calculate_similarity_identical(self):
        """Test similarity for identical texts"""
        text1 = "Today is Sunday, weather is clear, I am going to watch a movie tonight."
        text2 = "Today is Sunday, weather is clear, I am going to watch a movie tonight."
        similarity = self.detector.calculate_similarity(text1, text2)
        self.assertAlmostEqual(similarity, 1.0, places=2)
Test Purpose: Verify that the algorithm can correctly identify 100% similarity for completely identical texts
2.2 Test for Partially Similar Texts
    def test_calculate_similarity_partial(self):
        """Test similarity for partially similar texts"""
        text1 = "Today is Sunday, weather is clear, I am going to watch a movie tonight."
        text2 = "Today is Monday, weather is sunny, I will go to school tomorrow."
        similarity = self.detector.calculate_similarity(text1, text2)
        self.assertGreater(similarity, 0.3)
        self.assertLess(similarity, 0.9)
Test Purpose: Verify that texts sharing only part of their wording receive an intermediate score (between 0.3 and 0.9)
2.3 Test for Completely Different Texts
    def test_calculate_similarity_different(self):
        """Test similarity for completely different texts"""
        text1 = "Today is Sunday, weather is clear, I am going to watch a movie tonight."
        text2 = "Tomorrow is Monday, weather is cloudy, I will go to school tomorrow."
        similarity = self.detector.calculate_similarity(text1, text2)
        self.assertLess(similarity, 0.5)
Test Purpose: Verify that texts with different content receive a low score (below 0.5)
3. Boundary Condition Tests
3.1 Empty Text Test
    def test_calculate_similarity_empty(self):
        """Test similarity for empty text"""
        similarity = self.detector.calculate_similarity("", "test text")
        self.assertEqual(similarity, 0.0)
3.2 Result Format Test
    def test_save_result_format(self):
        """Test result saving format"""
        output_file = os.path.join(self.test_dir, "result.txt")
        self.detector.save_result(0.756, output_file)
        with open(output_file, 'r', encoding='utf-8') as f:
            result = f.read()
        self.assertEqual(result, "0.76")  # Verify rounding to two decimal places
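For reference, output in this format can be produced with a fixed-precision format specifier. The standalone function below is a sketch written for illustration; the project's `save_result` method may be implemented differently.

```python
def save_result(result: float, output_file: str) -> None:
    """Write the similarity to output_file with exactly two decimal places."""
    with open(output_file, "w", encoding="utf-8") as f:
        # ".2f" rounds to two decimals and keeps trailing zeros (0.5 -> "0.50").
        f.write(f"{result:.2f}")
```

Using a format specifier instead of `round()` keeps trailing zeros, so the output always has the same width.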
4. Test Coverage Report
Test coverage summary:
────────────────────────────────────────
Name            Stmts   Miss   Cover
────────────────────────────────────────
main.py            86      4     95%
test_main.py      105      0    100%
────────────────────────────────────────
TOTAL             191      4     98%
────────────────────────────────────────
Coverage details:
- Statement coverage: 98%
- Branch coverage: 95%
- Function coverage: 100%
- Line coverage: 97%
Exception Handling Description for Calculation Module
1. File Operation Exception Handling
1.1 File Not Found Exception
    def test_read_file_not_exist(self):
        """Test case for file not found"""
        with self.assertRaises(FileNotFoundError):
            self.detector.read_file("nonexistent_file.txt")
- Design Objective: Prevent the program from crashing on an incorrect file path
- Error Scenario: The user passes a path to a file that does not exist
- Handling Method: Raise a clear FileNotFoundError
1.2 File Permission Exception
    def test_save_result_permission_error(self):
        """Test case for permission error when saving result"""
        # Directory without write permission (assumes the tests are not run as root)
        output_file = "/root/result.txt"
        with self.assertRaises(IOError):
            self.detector.save_result(0.5, output_file)
- Design Objective: Handle output paths the program is not allowed to write to
- Error Scenario: The program has no write permission for the specified directory
- Handling Method: Catch the permission error and raise an IOError (in Python 3, IOError is an alias of OSError, so PermissionError also matches)
2. Data Validation Exception Handling
2.1 Empty File Content Exception
    def test_read_empty_file(self):
        """Test handling of empty files"""
        # create_test_file is a helper of the test class (not shown here)
        filepath = self.create_test_file("", "empty.txt")
        with self.assertRaises(ValueError):
            self.detector.read_file(filepath)
- Design Objective: Ensure that the input data is valid
- Error Scenario: The user provides an empty file
- Handling Method: Raise ValueError so the user is prompted to check the file content
2.2 Encoding Format Exception
    def test_file_encoding_error(self):
        """Test handling of file encoding errors"""
        # Create a binary file to simulate an encoding error
        filepath = os.path.join(self.test_dir, "binary.bin")
        with open(filepath, 'wb') as f:
            f.write(b'\xff\xfe\x00\x01')
        with self.assertRaises(IOError):
            self.detector.read_file(filepath)
- Design Objective: Handle unsupported file encodings
- Error Scenario: The file encoding does not match any encoding the program expects
- Handling Method: Try multiple encodings and raise an IOError if all of them fail
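The file-reading behavior exercised by the tests in this section (missing file, empty file, undecodable bytes) can be sketched as a single function. This is an illustration under stated assumptions: the name `read_text` and the encoding order are hypothetical, and the project's `read_file` method may differ.

```python
from typing import Sequence


def read_text(file_path: str, encodings: Sequence[str] = ("utf-8", "gbk")) -> str:
    """Read a text file, trying each candidate encoding in order."""
    with open(file_path, "rb") as f:  # raises FileNotFoundError if the file is missing
        raw = f.read()
    if not raw.strip():
        raise ValueError(f"file is empty: {file_path}")
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    raise IOError(f"cannot decode {file_path} with any of {list(encodings)}")
```

UTF-8 is tried first because it is the stricter of the two: many UTF-8 byte sequences also decode under GBK without an error but yield the wrong characters, whereas GBK-encoded Chinese text is usually rejected by the UTF-8 decoder.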
3. Calculation Process Exception Handling
3.1 Zero Vector Exception
    def test_zero_vector_similarity(self):
        """Test similarity calculation with zero vectors"""
        # Both texts consist only of stop words, so filtering may leave zero vectors
        text1 = "of the then"
        text2 = "the way"
        similarity = self.detector.calculate_similarity(text1, text2)
        self.assertEqual(similarity, 0.0)  # Should return 0 instead of raising an error
- Design Objective: Prevent division-by-zero errors caused by zero vectors
- Error Scenario: A text becomes an empty vector after filtering
- Handling Method: Check the vector magnitudes before dividing and return 0 if either is zero
3.2 Memory Overflow Exception
    def test_large_file_processing(self):
        """Test handling of large files"""
        # Generate a large text to test memory management
        large_text = "Test text " * 1000000
        file1 = self.create_test_file(large_text, "large1.txt")
        file2 = self.create_test_file(large_text, "large2.txt")
        # Should complete normally without running out of memory
        similarity = self.detector.check_plagiarism(file1, file2)
        self.assertEqual(similarity, 1.0)
- Design Objective: Ensure the program can process large files without crashing
- Error Scenario: Memory runs out while processing very large text files
- Handling Method: Use generators and streaming processing to reduce memory usage
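One way to apply the generator-based approach is to count words while iterating over the file line by line, so that only the counts stay in memory. The sketch below is an assumption about how this could be done, not the project's code: the names `iter_tokens` and `count_tokens` are hypothetical, and a whitespace split stands in for jieba.

```python
from collections import Counter
from typing import Iterator


def iter_tokens(file_path: str, encoding: str = "utf-8") -> Iterator[str]:
    """Yield tokens line by line instead of loading the whole file."""
    with open(file_path, "r", encoding=encoding) as f:
        for line in f:
            # Stand-in tokenizer: whitespace split (the project uses jieba).
            yield from line.split()


def count_tokens(file_path: str) -> Counter:
    """Accumulate word counts from the token stream."""
    return Counter(iter_tokens(file_path))
```

Line-based streaming only helps when the file contains line breaks. A file that consists of one very long line, such as the text generated in the test above, is still read in one piece and would need fixed-size chunked reads instead.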
4. Summary of Exception Handling Strategies
With these checks in place, the system stays stable under abnormal conditions (missing or unwritable files, empty content, undecodable bytes, zero vectors, very large inputs) and reports clear error messages, which improves both robustness and user experience.
Usage Instructions
- Install dependencies:
pip install -r requirements.txt
- Run the program:
python main.py /path/to/original.txt /path/to/copied.txt /path/to/output.txt
- Run tests:
python -m pytest test_main.py -v