Removing Duplicate Documents in MongoDB with PyMongo
When working with PyMongo, a common challenge is eliminating duplicate documents from a collection while preserving one instance of each unique entry. The distinct method returns only the unique values without affecting the original data, so it does not suffice for data-cleaning tasks that require permanent removal of duplicates.
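To illustrate the distinction, here is a minimal sketch using a hypothetical 'person' collection with duplicated names (the sample data is invented for this example; the server-connected part assumes a local mongod on the default port and is guarded accordingly):

```python
# Hypothetical sample data: three documents, two sharing the same name.
SAMPLE_DOCS = [{'name': 'Ada'}, {'name': 'Ada'}, {'name': 'Bob'}]

# distinct('name') reports each unique value once...
unique_names = sorted({d['name'] for d in SAMPLE_DOCS})   # ['Ada', 'Bob']

# ...but the collection itself would still hold all three documents.
total_docs = len(SAMPLE_DOCS)                             # 3

if __name__ == '__main__':
    import pymongo  # assumes a mongod running on localhost:27017
    coll = pymongo.MongoClient('localhost', 27017).local['person']
    coll.insert_many(SAMPLE_DOCS)
    print(coll.distinct('name'))      # duplicates hidden from the result
    print(coll.count_documents({}))   # yet nothing was removed
```

The pure-Python lines mirror what the server does: distinct collapses values for reporting, while the stored documents are untouched.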
To address this, a custom script can identify and delete duplicates based on a specified field. The process iterates through the field's unique values, counts occurrences of each, and removes the excess copies.
import pymongo

def deduplicate_collection(collection_name, field_name):
    client = pymongo.MongoClient('localhost', 27017)
    db = client.local
    coll = db[collection_name]
    for unique_val in coll.distinct(field_name):
        duplicate_count = coll.count_documents({field_name: unique_val})
        if duplicate_count > 1:
            # delete_one removes one matching document per call, so
            # issuing it duplicate_count - 1 times leaves a single copy.
            for _ in range(1, duplicate_count):
                coll.delete_one({field_name: unique_val})
        # Show what survived for this value.
        remaining = coll.find({field_name: unique_val})
        for doc in remaining:
            print(doc)
    print(coll.distinct(field_name))

# Example usage
deduplicate_collection('person', 'name')
For cleaning an entire database, the script can be extended to process all collections.
def clean_database_duplicates(db_name, field_name):
    client = pymongo.MongoClient('localhost', 27017)
    db = client[db_name]
    for coll_name in db.list_collection_names():
        print(f'Processing collection: {coll_name}')
        coll = db[coll_name]
        for unique_val in coll.distinct(field_name):
            duplicate_count = coll.count_documents({field_name: unique_val})
            if duplicate_count > 1:
                # Remove all but one copy of this value.
                for _ in range(1, duplicate_count):
                    coll.delete_one({field_name: unique_val})
            for doc in coll.find({field_name: unique_val}):
                print(doc)

# Example usage
clean_database_duplicates('GifDB', 'gif_url')
These scripts leave exactly one document per unique field value, effectively deduplicating the data. Note that which copy survives is arbitrary, since delete_one removes whichever matching document the server finds first.
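As an alternative sketch (not part of the scripts above), the per-value counting can be pushed server-side with an aggregation pipeline that groups documents by the field, collects the _ids of every copy, and deletes the surplus in a single delete_many call. The 'person'/'name' usage below is illustrative and assumes a local mongod:

```python
def duplicate_ids_pipeline(field_name):
    """Build a pipeline that groups documents by field_name and
    collects the _id of every document sharing that value."""
    return [
        {'$group': {
            '_id': f'${field_name}',        # group key: the field's value
            'ids': {'$push': '$_id'},       # every _id sharing that value
            'count': {'$sum': 1},
        }},
        {'$match': {'count': {'$gt': 1}}},  # keep only duplicated values
    ]

def deduplicate_with_pipeline(coll, field_name):
    """Keep the first _id in each group; delete the rest in one call."""
    surplus = []
    for group in coll.aggregate(duplicate_ids_pipeline(field_name)):
        surplus.extend(group['ids'][1:])    # all copies after the first
    if surplus:
        coll.delete_many({'_id': {'$in': surplus}})

if __name__ == '__main__':
    import pymongo  # assumes a mongod on localhost:27017, as above
    client = pymongo.MongoClient('localhost', 27017)
    deduplicate_with_pipeline(client.local['person'], 'name')
```

Compared with one count_documents call per distinct value, this makes a single aggregation round trip and one bulk delete, which matters on large collections.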