Merging Data from Multiple Sources and Resolving Conflicts
DataFocus provides a full data analytics solution. By using DataSpring (the data integration platform) and Data Warehouse, you can efficiently merge data from multiple sources and resolve conflicts. The following guide explains how to use these product features.
DataSpring Data Integration Platform: Multi-source Data Ingestion and Cleaning
Core functionality: Supports extracting data from heterogeneous sources such as databases, APIs, and files (Excel/CSV), and then cleaning and preprocessing the data.
Step 1: Connect to Multi-source Data
- Configure data sources
- Add data connections in DataSpring (e.g., MySQL, Oracle, third-party APIs, or local files).
- Example: Simultaneously connect to a CRM system's user table (MySQL) and an e-commerce platform's order log (API).
- Extract data
- Set up scheduled tasks or real-time synchronization to extract data into a temporary storage area.
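As a rough illustration of what Step 1 amounts to outside the product, the sketch below pulls records from two heterogeneous sources into one staging list. DataSpring does this through configured connections; the function names, the field names, and the in-memory CSV and JSON payloads here are hypothetical stand-ins, not DataSpring APIs.

```python
import csv
import io
import json

# Hypothetical stand-ins for two sources: a CSV file export and an API response.
CRM_CSV = "user_id,name\n1,Alice\n2,Bob\n"
ORDERS_JSON = '[{"user_id": 1, "order_amount": 120.5}, {"user_id": 2, "order_amount": 80}]'


def extract_csv(text, source):
    """Parse CSV text into dicts, tagging each row with its source name."""
    return [dict(row, _source=source) for row in csv.DictReader(io.StringIO(text))]


def extract_json(text, source):
    """Parse a JSON array of objects, tagging each record with its source name."""
    return [dict(rec, _source=source) for rec in json.loads(text)]


def extract_all():
    """Collect raw records from every source into one staging list."""
    staging = []
    staging.extend(extract_csv(CRM_CSV, "crm"))
    staging.extend(extract_json(ORDERS_JSON, "orders_api"))
    return staging


staging = extract_all()
```

Tagging each record with its source at extraction time is a design choice that pays off later, since source-priority conflict rules need to know where a value came from. Note that CSV values arrive as strings, so type conversion belongs to the cleaning step.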
Step 2: Data Cleaning and Standardization
- Handle missing values and anomalies
- Use built-in cleaning rules (e.g., fill default values, filter invalid records).
- Example: Mark records with negative order amounts as anomalies and isolate them.
- Standardize formats and map fields
- Define field transformation rules through a visual interface:
- Standardize date formats (e.g., YYYY-MM-DD).
- Map enumeration values (e.g., unify "Male" and "男" to "M").
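The cleaning and standardization rules in Step 2 can be sketched in plain Python as follows. This is only an illustration of the logic, not how DataSpring implements it: the field names, the list of accepted date formats, the default value of 0.0, and the "U" code for unknown values are all assumptions.

```python
from datetime import datetime

# Hypothetical rules; the field names and accepted formats are assumptions.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y%m%d")
GENDER_MAP = {"male": "M", "男": "M", "m": "M", "female": "F", "女": "F", "f": "F"}


def standardize_date(value):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None


def clean(records):
    """Split records into cleaned rows and quarantined anomalies."""
    cleaned, anomalies = [], []
    for rec in records:
        row = dict(rec)
        if row.get("order_amount") is None:
            row["order_amount"] = 0.0          # fill a default for missing amounts
        if row["order_amount"] < 0:
            anomalies.append(row)              # isolate negative amounts
            continue
        row["order_date"] = standardize_date(row["order_date"])
        gender = str(row.get("gender", "")).strip().lower()
        row["gender"] = GENDER_MAP.get(gender, "U")   # "U" = unknown (assumed)
        cleaned.append(row)
    return cleaned, anomalies


raw = [
    {"order_amount": 50.0, "order_date": "03/02/2023", "gender": "Male"},
    {"order_amount": None, "order_date": "20230115", "gender": "男"},
    {"order_amount": -10.0, "order_date": "2023-01-20", "gender": "F"},
]
cleaned, anomalies = clean(raw)
```

Anomalous rows are kept in a separate list rather than dropped, which mirrors the "mark and isolate" rule above. Whether a missing amount should default to zero or be quarantined as well is a business decision.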
Data Warehouse: Data Integration and Conflict Resolution
Core functionality: Provides a high-performance storage engine and SQL computing capabilities, supporting complex data merging logic.
Step 3: Data Merging Strategies
- Vertical merge (appending data)
- Merge tables with the same structure (e.g., sales data from multiple months) into a single combined table. UNION ALL keeps every row, including duplicates, and matches columns by position, so both tables need the same column order:
CREATE TABLE sales_combined AS
SELECT * FROM sales_2023q1
UNION ALL
SELECT * FROM sales_2023q2;
- Horizontal merge (relating data)
- Join different business tables using primary keys (e.g., user information + order records):
SELECT u.user_id, u.name, o.order_amount FROM user_info u LEFT JOIN orders o ON u.user_id = o.user_id;
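To try both merge strategies without access to the warehouse, the sketch below runs the same two statements against an in-memory SQLite database. SQLite is only a stand-in here, and the column definitions and sample rows are invented for the example; the table names follow the SQL above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_2023q1 (order_id INTEGER, amount REAL);
    CREATE TABLE sales_2023q2 (order_id INTEGER, amount REAL);
    INSERT INTO sales_2023q1 VALUES (1, 100.0), (2, 250.0);
    INSERT INTO sales_2023q2 VALUES (3, 75.0);

    CREATE TABLE user_info (user_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER, user_id INTEGER, order_amount REAL);
    INSERT INTO user_info VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (10, 1, 120.5);

    -- Vertical merge: append rows of identically structured tables.
    CREATE TABLE sales_combined AS
        SELECT * FROM sales_2023q1
        UNION ALL
        SELECT * FROM sales_2023q2;
""")

combined_count = conn.execute("SELECT COUNT(*) FROM sales_combined").fetchone()[0]

# Horizontal merge: LEFT JOIN keeps users that have no orders (amount is NULL).
joined = conn.execute("""
    SELECT u.user_id, u.name, o.order_amount
    FROM user_info u
    LEFT JOIN orders o ON u.user_id = o.user_id
    ORDER BY u.user_id
""").fetchall()
conn.close()
```

The LEFT JOIN returns one row per matching order and still returns users without orders, with a NULL amount. If only users with orders are wanted, an INNER JOIN would be the usual choice instead.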
Step 4: Conflict Resolution Methods
- Primary key conflict handling
- Timestamp priority: Keep the most recently updated record.
-- The explicit frame matters: the default frame ends at the current row,
-- so LAST_VALUE would not see later updates for the same user.
SELECT user_id,
       LAST_VALUE(address) OVER (
           PARTITION BY user_id ORDER BY update_time
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS final_address
FROM user_data;
- Data source priority: Define priorities based on business rules (e.g., CRM data has higher priority).
-- COALESCE returns the CRM email when present, otherwise the survey email.
SELECT COALESCE(crm_data.email, survey_data.email) AS email
FROM crm_data
FULL JOIN survey_data ON crm_data.user_id = survey_data.user_id;
- Field value conflict handling
- Dynamic weighted calculation: Fuse numerical fields from different sources with weights (e.g., score = 0.7 * App Score + 0.3 * Survey Score).
- Manual review flag: Export conflicting records as CSV so that business teams can confirm the correct values, then write the confirmed values back.
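The source-priority, weighted-fusion, and manual-review methods above can be sketched in a few lines of Python. The helper names, the sample emails, and the CSV column layout are hypothetical; the 0.7 and 0.3 weights and the "CRM wins" rule come from the examples above.

```python
import csv
import io

SOURCE_PRIORITY = ("crm", "survey")   # assumed business rule: CRM wins
WEIGHTS = {"app": 0.7, "survey": 0.3}


def pick_by_priority(values_by_source):
    """Return the first non-empty value in priority order, like SQL COALESCE."""
    for source in SOURCE_PRIORITY:
        value = values_by_source.get(source)
        if value not in (None, ""):
            return value
    return None


def weighted_score(scores_by_source):
    """Fuse numeric scores using fixed weights (0.7 app + 0.3 survey)."""
    return sum(WEIGHTS[src] * scores_by_source[src] for src in WEIGHTS)


def conflicts_to_csv(rows):
    """Write rows whose sources disagree to CSV text for manual review."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["user_id", "crm_email", "survey_email"])
    for user_id, emails in rows:
        crm, survey = emails.get("crm"), emails.get("survey")
        if crm and survey and crm != survey:
            writer.writerow([user_id, crm, survey])
    return buf.getvalue()


emails = [
    (1, {"crm": "a@crm.example", "survey": "a@survey.example"}),
    (2, {"crm": None, "survey": "b@survey.example"}),
]
resolved = {uid: pick_by_priority(vals) for uid, vals in emails}
score = weighted_score({"app": 80.0, "survey": 90.0})
review_csv = conflicts_to_csv(emails)
```

Only user 1 ends up in the review file, because user 2 has a value from a single source and therefore no conflict. Automatic resolution and manual review can be combined: resolve by priority first, and still export the disagreements so the rule can be audited.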
Unique Advantages of DataFocus
- Low-code operation: Configure cleaning rules and merging logic through a visual interface without writing complex code (suitable for non-technical users). Example: Drag and drop fields to create an ETL process, automatically handling date format conflicts.
- Automated monitoring: Built-in data quality monitoring module allows you to set rules (e.g., "User ID cannot be null") and trigger alerts on anomalies.
- High-performance computing: The data warehouse supports distributed computing, enabling fast merging even with billions of records.
- Security and permissions: Supports field-level permission control, automatically masking sensitive data (e.g., phone numbers) during merging.
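For readers who want a concrete picture of the monitoring and masking ideas in this list, here is a minimal sketch of a "cannot be null" rule and a phone-number mask. It illustrates the concepts only and says nothing about how DataFocus implements them; the function names and the choice to keep the last four digits visible are assumptions.

```python
def check_not_null(rows, field):
    """Return the indexes of rows that violate a 'field cannot be null' rule."""
    return [i for i, row in enumerate(rows) if row.get(field) in (None, "")]


def mask_phone(phone, visible=4):
    """Replace all but the last `visible` characters with '*'."""
    digits = str(phone)
    if len(digits) <= visible:
        return "*" * len(digits)
    return "*" * (len(digits) - visible) + digits[-visible:]


rows = [
    {"user_id": 1, "phone": "13800001234"},
    {"user_id": None, "phone": "13900005678"},
]
violations = check_not_null(rows, "user_id")   # rows that should trigger an alert
masked = [mask_phone(r["phone"]) for r in rows]
```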
Operation Example
Scenario: Merge CRM user table and survey data, resolving conflicts in "user status".
- DataSpring configuration:
- Connect MySQL (CRM) and Excel (survey data).
- Cleaning rule: Standardize phone number format (remove spaces/area codes).
- Data Warehouse SQL processing:
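-- Keep the latest row per user based on timestamp.
-- ROW_NUMBER ranks each user's rows from newest to oldest; rn = 1 is the latest,
-- so the result has exactly one row per user_id.
CREATE TABLE merged_users AS
SELECT user_id, status, phone
FROM (
    SELECT user_id, status, phone,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY update_time DESC) AS rn
    FROM (
        SELECT user_id, status, phone, update_time FROM crm_users
        UNION ALL
        SELECT user_id, status, phone, update_time FROM survey_users
    ) AS all_users
) AS ranked
WHERE rn = 1;
- Result output: Publish the merged table to a BI tool (e.g., DataFocus analysis module) to generate user segmentation reports.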
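The whole scenario can also be traced end to end in plain Python, which is handy for checking the expected result on a few rows before running it in the warehouse. This is a sketch under assumptions: the sample rows are invented, timestamps are ISO date strings (so string comparison orders them correctly), and "area code" is interpreted as a leading "86" prefix in front of an 11-digit number.

```python
import re


def normalize_phone(phone, prefix="86", local_len=11):
    """Keep digits only and drop an assumed leading country prefix."""
    digits = re.sub(r"\D", "", phone or "")
    if digits.startswith(prefix) and len(digits) == len(prefix) + local_len:
        digits = digits[len(prefix):]
    return digits


def merge_latest(*sources):
    """Union all rows, then keep the most recently updated row per user_id."""
    latest = {}
    for rows in sources:
        for row in rows:
            current = latest.get(row["user_id"])
            if current is None or row["update_time"] > current["update_time"]:
                latest[row["user_id"]] = dict(row, phone=normalize_phone(row["phone"]))
    return [latest[uid] for uid in sorted(latest)]


crm_users = [
    {"user_id": 1, "status": "active", "phone": "+86 138 0000 1234", "update_time": "2023-03-01"},
    {"user_id": 2, "status": "inactive", "phone": "139 0000 5678", "update_time": "2023-01-10"},
]
survey_users = [
    {"user_id": 1, "status": "churned", "phone": "13800001234", "update_time": "2023-02-01"},
    {"user_id": 2, "status": "active", "phone": "13900005678", "update_time": "2023-04-05"},
]
merged_users = merge_latest(crm_users, survey_users)
```

In this sample, user 1 keeps the CRM status because the CRM row is newer, while user 2 takes the survey status for the same reason. When two rows share the same timestamp, the first one encountered wins, so a tie-breaking rule such as source priority would be needed in practice.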
Best Practices
- Phased testing: Validate merging rules on a small sample of data first, then apply to the full dataset.
- Version control: Manage versions of ETL processes and data models to facilitate rollback and iteration.
- Collaboration mechanisms: Use DataFocus's team permission features to allow business stakeholders to participate in reviewing key field rules.
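Phased testing can be partly automated. The sketch below runs a merge function on a sample and checks a few invariants that most merges should satisfy: no duplicate keys, no keys lost or invented, and no null keys. The function names, the default sample size, and the choice of invariants are assumptions for illustration.

```python
import random


def validate_merge(merge_fn, rows, key, sample_size=100, seed=0):
    """Run merge_fn on a random sample and report violated invariants."""
    rng = random.Random(seed)
    sample = rows if len(rows) <= sample_size else rng.sample(rows, sample_size)
    merged = merge_fn(sample)
    keys = [row[key] for row in merged]
    problems = []
    if len(keys) != len(set(keys)):
        problems.append("duplicate keys after merge")
    if set(keys) != {row[key] for row in sample}:
        problems.append("keys lost or invented by merge")
    if any(k is None for k in keys):
        problems.append("null key in merged output")
    return problems


def keep_last(rows):
    """Example merge rule: the last row seen for each user_id wins."""
    return list({row["user_id"]: row for row in rows}.values())


rows = [{"user_id": i % 5, "value": i} for i in range(20)]
ok_report = validate_merge(keep_last, rows, "user_id")
bad_report = validate_merge(lambda sample: list(sample), rows, "user_id")
```

An empty report means the rule passed on the sample; the second call uses a merge that does nothing, so it reports duplicate keys. Once the sample run is clean, the same checks can be repeated on the full dataset.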
By combining DataSpring and Data Warehouse, you can complete the entire process, from data ingestion through cleaning and merging to analysis, within a single platform, which significantly reduces the complexity of integrating data from multiple sources.