Removing Duplicate Documents in MongoDB with PyMongo

When working with PyMongo, a common challenge is eliminating duplicate documents from a collection while preserving one instance of each unique entry. The distinct method only returns the unique values of a field; it does not modify the collection, so it is not enough for data cleaning tasks that require permanent removal of duplicates.

To address this, a custom script can identify and delete duplicates based on a specified field. The process iterates through the unique values of that field, counts how many documents carry each value, and removes every copy beyond the first.

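import pymongo

def deduplicate_collection(collection_name, field_name):
    # Connect to a MongoDB server running locally on the default port.
    client = pymongo.MongoClient('localhost', 27017)
    # This example uses the database named 'local'; adjust as needed.
    db = client.local
    coll = db[collection_name]

    for unique_val in coll.distinct(field_name):
        # Count how many documents share this value.
        duplicate_count = coll.count_documents({field_name: unique_val})
        # Delete all but one of them (nothing happens when the count is 1).
        for _ in range(duplicate_count - 1):
            coll.delete_one({field_name: unique_val})
        # Print the surviving document for this value.
        for doc in coll.find({field_name: unique_val}):
            print(doc)
    # Print the list of unique values as a final check.
    print(coll.distinct(field_name))
    client.close()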

# Example usage
deduplicate_collection('person', 'name')
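
The loop above issues one count query and several delete calls per unique value. As an alternative, the duplicates can be located in a single aggregation pass and removed with one delete_many call per duplicated value. The following is a minimal sketch of that different technique, not the script above: the helper names duplicate_groups_pipeline and deduplicate_with_aggregation are hypothetical, coll is assumed to be a PyMongo collection object, and the field is assumed to hold simple scalar values such as strings or numbers.

```python
def duplicate_groups_pipeline(field_name):
    """Build an aggregation pipeline that groups documents by field_name
    and keeps only the groups containing more than one document."""
    return [
        # Skip documents that do not have the field at all.
        {"$match": {field_name: {"$exists": True}}},
        # Collect the _id of every document sharing the same value.
        {"$group": {
            "_id": f"${field_name}",
            "ids": {"$push": "$_id"},
            "count": {"$sum": 1},
        }},
        # Only values that occur more than once are duplicates.
        {"$match": {"count": {"$gt": 1}}},
    ]


def deduplicate_with_aggregation(coll, field_name):
    """Delete all but one document per duplicated value.

    Returns the total number of documents deleted.
    """
    deleted = 0
    # Materialize the groups first so deletions do not race with the cursor.
    groups = list(coll.aggregate(duplicate_groups_pipeline(field_name)))
    for group in groups:
        extra_ids = group["ids"][1:]  # keep the first _id, drop the rest
        result = coll.delete_many({"_id": {"$in": extra_ids}})
        deleted += result.deleted_count
    return deleted
```

Under those assumptions, calling deduplicate_with_aggregation(db['person'], 'name') would play the same role as the example usage above. Because every group stores the _id values of all its members, this approach fits best when no single value is duplicated an extreme number of times.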

For cleaning an entire database, the script can be extended to process all collections.

def clean_database_duplicates(db_name, field_name):
    client = pymongo.MongoClient('localhost', 27017)
    db = client[db_name]

    for coll_name in db.list_collection_names():
        print(f'Processing collection: {coll_name}')
        coll = db[coll_name]
        # Collections without the field yield no distinct values and are skipped.
        for unique_val in coll.distinct(field_name):
            duplicate_count = coll.count_documents({field_name: unique_val})
            # Delete all but one document carrying this value.
            for _ in range(duplicate_count - 1):
                coll.delete_one({field_name: unique_val})
            # Print the surviving document for this value.
            for doc in coll.find({field_name: unique_val}):
                print(doc)
    client.close()

# Example usage
clean_database_duplicates('GifDB', 'gif_url')
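
Since the deletions are permanent, it can help to preview what would be removed before running a script like this against a whole database. The sketch below works purely in memory on plain dictionaries and never touches MongoDB. The helper name plan_deletions and the sample data are hypothetical, and it assumes the field values are hashable (strings or numbers, for example) and that each document has an _id key.

```python
def plan_deletions(docs, field_name):
    """Return the _id of every document that repeats an already-seen value.

    The first document carrying each value is kept; documents that lack
    the field are ignored.
    """
    seen = set()
    to_delete = []
    for doc in docs:
        if field_name not in doc:
            continue
        value = doc[field_name]
        if value in seen:
            # A document with this value was already kept, so this one is extra.
            to_delete.append(doc["_id"])
        else:
            seen.add(value)
    return to_delete


# Hypothetical sample documents, shaped like the results of a find() call.
sample = [
    {"_id": 1, "gif_url": "a.gif"},
    {"_id": 2, "gif_url": "b.gif"},
    {"_id": 3, "gif_url": "a.gif"},
    {"_id": 4},
    {"_id": 5, "gif_url": "a.gif"},
]
print(plan_deletions(sample, "gif_url"))  # prints [3, 5]
```

Against a real collection, the docs argument could be a cursor such as coll.find({}, {field_name: 1}), which returns only the _id and the chosen field for each document.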

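These scripts leave exactly one document per unique value of the chosen field, effectively deduplicating the data. Because delete_one removes whichever matching document it finds first, they make no guarantee about which copy is kept.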
