Removing Duplicate Documents in MongoDB with PyMongo
When working with PyMongo, a common challenge is eliminating duplicate documents from a collection while preserving one instance of each unique entry. The distinct method returns only the unique values without affecting the original data, so it does not suffice for data-cleaning tasks that require permanent removal of duplicates.
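To illustrate the distinction, here is a minimal sketch using a hypothetical 'person' collection with duplicated names (the sample data is invented for this example; the server-connected part assumes a local mongod on the default port and is guarded accordingly):

```python
# Hypothetical sample data: three documents, two sharing the same name.
SAMPLE_DOCS = [{'name': 'Ada'}, {'name': 'Ada'}, {'name': 'Bob'}]

# distinct('name') reports each unique value once...
unique_names = sorted({d['name'] for d in SAMPLE_DOCS})   # ['Ada', 'Bob']

# ...but the collection itself would still hold all three documents.
total_docs = len(SAMPLE_DOCS)                             # 3

if __name__ == '__main__':
    import pymongo  # assumes a mongod running on localhost:27017
    coll = pymongo.MongoClient('localhost', 27017).local['person']
    coll.insert_many(SAMPLE_DOCS)
    print(coll.distinct('name'))      # duplicates hidden from the result
    print(coll.count_documents({}))   # yet nothing was removed
```

The pure-Python lines mirror what the server does: distinct collapses values for reporting, while the stored documents are untouched.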
To address this, a custom script can identify and delete duplicates based on a specified field. The process iterates through the field's unique values, counts occurrences of each, and removes the excess copies.
import pymongo

def deduplicate_collection(collection_name, field_name):
    client = pymongo.MongoClient('localhost', 27017)
    db = client.local
    coll = db[collection_name]
    for unique_val in coll.distinct(field_name):
        duplicate_count = coll.count_documents({field_name: unique_val})
        if duplicate_count > 1:
            # delete_one removes one matching document per call, so
            # issuing it duplicate_count - 1 times leaves a single copy.
            for _ in range(1, duplicate_count):
                coll.delete_one({field_name: unique_val})
        # Show what survived for this value.
        remaining = coll.find({field_name: unique_val})
        for doc in remaining:
            print(doc)
    print(coll.distinct(field_name))

# Example usage
deduplicate_collection('person', 'name')
For cleaning an entire database, the script can be extended to process all collections.
def clean_database_duplicates(db_name, field_name):
    client = pymongo.MongoClient('localhost', 27017)
    db = client[db_name]
    for coll_name in db.list_collection_names():
        print(f'Processing collection: {coll_name}')
        coll = db[coll_name]
        for unique_val in coll.distinct(field_name):
            duplicate_count = coll.count_documents({field_name: unique_val})
            if duplicate_count > 1:
                # Remove all but one copy of this value.
                for _ in range(1, duplicate_count):
                    coll.delete_one({field_name: unique_val})
            for doc in coll.find({field_name: unique_val}):
                print(doc)

# Example usage
clean_database_duplicates('GifDB', 'gif_url')
These scripts leave exactly one document per unique field value, effectively deduplicating the data. Note that which copy survives is arbitrary, since delete_one removes whichever matching document the server finds first.
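As an alternative sketch (not part of the scripts above), the per-value counting can be pushed server-side with an aggregation pipeline that groups documents by the field, collects the _ids of every copy, and deletes the surplus in a single delete_many call. The 'person'/'name' usage below is illustrative and assumes a local mongod:

```python
def duplicate_ids_pipeline(field_name):
    """Build a pipeline that groups documents by field_name and
    collects the _id of every document sharing that value."""
    return [
        {'$group': {
            '_id': f'${field_name}',        # group key: the field's value
            'ids': {'$push': '$_id'},       # every _id sharing that value
            'count': {'$sum': 1},
        }},
        {'$match': {'count': {'$gt': 1}}},  # keep only duplicated values
    ]

def deduplicate_with_pipeline(coll, field_name):
    """Keep the first _id in each group; delete the rest in one call."""
    surplus = []
    for group in coll.aggregate(duplicate_ids_pipeline(field_name)):
        surplus.extend(group['ids'][1:])    # all copies after the first
    if surplus:
        coll.delete_many({'_id': {'$in': surplus}})

if __name__ == '__main__':
    import pymongo  # assumes a mongod on localhost:27017, as above
    client = pymongo.MongoClient('localhost', 27017)
    deduplicate_with_pipeline(client.local['person'], 'name')
```

Compared with one count_documents call per distinct value, this makes a single aggregation round trip and one bulk delete, which matters on large collections.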