Preventing Duplicate Child Entity Insertion During Batch Imports in ABP and Entity Framework
During bulk data ingestion, multiple parent records frequently reference identical dependent entities. When processing these records sequentially, naive instantiation logic can inadvertently create redundant child objects. Consider an import routine where academic papers reference contributing authors and their affiliated institutions. Authors and institutions share a many-to-many relationship.
The initial implementation processes each paper individually. For every iteration, it maps affiliation strings to Institution objects. If two papers list the same university, the mapping function treats them as separate instances, generating distinct primary keys for each. Consequently, Entity Framework attempts to insert duplicate institution records into the database.
foreach (var paperDto in importBatch)
{
    paperDto.SourceType = DataOrigin.Standard;
    var orgs = ResolveAffiliations(paperDto.Affiliations, storedOrgs, pendingOrgs);
    pendingOrgs.AddRange(orgs);
    var authors = MapContributors(paperDto.Authors, orgs, newAuthors, existingAuthors);
    newAuthors.AddRange(authors);
    var paperRecord = Paper.Create(GuidGenerator.Create());
    paperRecord.AssignMetadata(paperDto, authors);
    batchRecords.Add(paperRecord);
    var platforms = paperDto.Affiliations
        .Where(a => a.Type == EntityPlatformType.Innovation)
        .ToList();
    var mappedPlatforms = ResolveAffiliations(platforms, storedPlatforms, pendingPlatforms);
    pendingPlatforms.AddRange(mappedPlatforms);
    batchMappings.AddRange(GeneratePlatformLinks(paperRecord.Id, orgs, mappedPlatforms));
}
Because both papers reference the exact same organization name (e.g., "University of Science"), the mapping routine generates a fresh Institution instance during each loop iteration. This bypasses the expected single-record constraint and results in multiple rows for the same logical entity.
private IList<Institution> ResolveAffiliations(
    IList<AffiliationDto> sourceData,
    IList<Institution> dbPersisted,
    IList<Institution> sessionPending) // never consulted below: this is the bug
{
    var resolved = new List<Institution>();
    var requestedNames = sourceData.Select(dto => dto.OrganizationName).ToHashSet();

    // Reuse institutions that are already stored in the database.
    var fromDatabase = dbPersisted
        .Where(org => requestedNames.Contains(org.Name))
        .ToList();
    resolved.AddRange(fromDatabase);

    // Only persisted names are checked, so an organization created for an
    // earlier paper in the same batch is instantiated again here.
    var newCandidates = sourceData
        .Where(dto => dto.Type == OrganizationType.Company && !dbPersisted.Select(e => e.Name).Contains(dto.OrganizationName))
        .Select(dto => new Institution(GuidGenerator.Create(), dto.OrganizationName, false, "-", "-", DataOrigin.Imported))
        .ToList();
    resolved.AddRange(newCandidates);
    return resolved.Distinct().ToList();
}
The core issue stems from the absence of session-level state tracking. The method checks the records already persisted in the database but ignores entities generated earlier in the current transaction scope. Each iteration therefore produces a fresh object with its own key, and EF Core's change tracker sees two unrelated entities rather than one shared record.
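The effect can be reproduced without a database. The following is a minimal sketch, not the article's actual entity: the two-argument Institution class, ResolveNaive, and ResolveForTwoPapers are hypothetical stand-ins introduced only for illustration, and Guid.NewGuid() replaces ABP's GuidGenerator.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified stand-in for the article's Institution entity.
public sealed class Institution
{
    public Guid Id { get; }
    public string Name { get; }

    public Institution(Guid id, string name)
    {
        Id = id;
        Name = name;
    }
}

public static class DuplicateDemo
{
    // Mirrors the flawed lookup: only persisted records are consulted.
    public static Institution ResolveNaive(string name, IReadOnlyList<Institution> dbPersisted)
    {
        var existing = dbPersisted.FirstOrDefault(org => org.Name == name);
        return existing ?? new Institution(Guid.NewGuid(), name);
    }

    // Simulates two papers citing the same organization that is not stored yet.
    public static (Institution First, Institution Second) ResolveForTwoPapers(string name)
    {
        var dbPersisted = new List<Institution>(); // nothing persisted so far
        var first = ResolveNaive(name, dbPersisted);
        var second = ResolveNaive(name, dbPersisted);
        return (first, second);
    }
}
```

Calling DuplicateDemo.ResolveForTwoPapers("University of Science") yields two objects with the same Name but different Id values. Handed to a change tracker, they would be treated as two separate rows to insert.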
To resolve this, maintain a transient collection of newly instantiated dependent entities throughout the import lifecycle. Before creating a new object, verify its uniqueness against both the persistent database records and the pending in-memory collection. Reuse the reference if a match exists.
private IList<Institution> ResolveAffiliations(
    IList<AffiliationDto> sourceData,
    IList<Institution> dbPersisted,
    IList<Institution> sessionPending)
{
    var resolved = new List<Institution>();
    var requestedNames = sourceData.Select(dto => dto.OrganizationName).ToHashSet();

    // Reuse institutions that are already stored in the database.
    var fromDatabase = dbPersisted
        .Where(org => requestedNames.Contains(org.Name))
        .ToList();
    resolved.AddRange(fromDatabase);

    // Reuse institutions created earlier in this import session.
    var fromSession = sessionPending
        .Where(org => requestedNames.Contains(org.Name))
        .ToList();
    resolved.AddRange(fromSession);

    // Create only the organizations that are unknown to both sources.
    var knownNames = resolved.Select(org => org.Name).ToHashSet();
    var pendingCreation = sourceData
        .Where(dto => dto.Type == OrganizationType.Company && !knownNames.Contains(dto.OrganizationName))
        .Select(dto => new Institution(GuidGenerator.Create(), dto.OrganizationName, false, "-", "-", DataOrigin.Imported))
        .ToList();
    resolved.AddRange(pendingCreation);

    // Register the new instances so that later iterations can find them.
    // IList<T> declares no AddRange, so the items are added one by one.
    foreach (var org in pendingCreation)
    {
        sessionPending.Add(org);
    }

    return resolved.Distinct().ToList();
}
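One case the list-based lookup may leave open is the same organization name appearing twice within a single sourceData list, because the creation step projects every unmatched DTO into its own object. A dictionary keyed by name is one possible way to cover that case as well. The sketch below is an assumption-laden illustration, not the article's code: InstitutionRegistry, GetOrCreate, and the simplified Institution class are hypothetical names, Guid.NewGuid() stands in for GuidGenerator, names are compared with ordinal (case-sensitive) equality, and the OrganizationType filter from the original method is omitted for brevity.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified stand-in for the article's Institution entity.
public sealed class Institution
{
    public Guid Id { get; }
    public string Name { get; }

    public Institution(Guid id, string name)
    {
        Id = id;
        Name = name;
    }
}

// Hypothetical session-scoped registry: one instance per organization name.
public sealed class InstitutionRegistry
{
    private readonly Dictionary<string, Institution> _byName =
        new Dictionary<string, Institution>(StringComparer.Ordinal);
    private readonly List<Institution> _created = new List<Institution>();

    public InstitutionRegistry(IEnumerable<Institution> dbPersisted)
    {
        // Seed the lookup with records that already exist in the database.
        foreach (var org in dbPersisted)
        {
            _byName.TryAdd(org.Name, org);
        }
    }

    // Institutions instantiated during this session, i.e. the ones to insert.
    public IReadOnlyList<Institution> Created => _created;

    public Institution GetOrCreate(string name)
    {
        // Persisted and pending entries share one lookup, so a repeated
        // name always resolves to the same reference.
        if (_byName.TryGetValue(name, out var known))
        {
            return known;
        }

        var fresh = new Institution(Guid.NewGuid(), name);
        _byName.Add(name, fresh);
        _created.Add(fresh);
        return fresh;
    }
}
```

With this shape, the import loop would create one registry before iterating and call GetOrCreate for each affiliation name. The Created list then holds exactly the instances that still need to be inserted.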
By validating against the transient collection, the routine ensures that a single Institution instance represents each organization within the current operation. When Entity Framework processes the unit of work, its change tracker sees multiple parent entities referencing the same tracked object, so it executes one INSERT for the child entity and writes one join-table row per relationship. The earlier implementation generated a new identifier for every occurrence, which forced redundant inserts. With a single in-memory object per organization, there is nothing left to deduplicate at save time.
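A cheap way to check this guarantee before saving is to scan the pending collection for names that map to more than one identifier. The helper below is a hypothetical sketch: ImportGuards, FindDuplicateNames, and the simplified Institution class are illustrative names and not part of ABP or EF Core.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified stand-in for the article's Institution entity.
public sealed class Institution
{
    public Guid Id { get; }
    public string Name { get; }

    public Institution(Guid id, string name)
    {
        Id = id;
        Name = name;
    }
}

public static class ImportGuards
{
    // Returns every name that is represented by more than one distinct Id.
    // The same instance listed twice is not reported, because its Id repeats.
    public static IReadOnlyList<string> FindDuplicateNames(IEnumerable<Institution> pending)
    {
        return pending
            .GroupBy(org => org.Name, StringComparer.Ordinal)
            .Where(group => group.Select(org => org.Id).Distinct().Count() > 1)
            .Select(group => group.Key)
            .ToList();
    }
}
```

An empty result suggests that every organization in the batch is backed by a single instance. A non-empty result names the organizations that would otherwise be inserted more than once.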