Merging DataFrames in Pandas: Methods and Use Cases
Concatenation with pd.concat
Use pd.concat to stack multiple DataFrames vertically or horizontally. It suits cases where the frames only need to be placed end to end or side by side along one axis, with no matching on key values.
import pandas as pd
# Sample data sets
data_primary = pd.DataFrame({
    'identifier': ['X', 'Y', 'Z'],
    'metric_A': [10, 20, 30]
})
data_secondary = pd.DataFrame({
    'identifier': ['Y', 'Z', 'W'],
    'metric_B': [40, 50, 60]
})
# Stack vertically (row-wise)
stacked_rows = pd.concat([data_primary, data_secondary], axis=0)
# Combine horizontally (column-wise); rows are aligned by index label, not by 'identifier'
combined_cols = pd.concat([data_primary, data_secondary], axis=1)
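The two axes behave differently when columns or row labels do not line up. The following is a minimal sketch that reuses the sample frames; the names stacked and side_by_side are illustrative and not part of the original example.

```python
import pandas as pd

data_primary = pd.DataFrame({
    'identifier': ['X', 'Y', 'Z'],
    'metric_A': [10, 20, 30]
})
data_secondary = pd.DataFrame({
    'identifier': ['Y', 'Z', 'W'],
    'metric_B': [40, 50, 60]
})

# Row-wise: the column sets are unioned, and cells that a source frame
# does not provide become NaN. ignore_index=True replaces the repeated
# 0, 1, 2 labels with a fresh 0..5 index.
stacked = pd.concat([data_primary, data_secondary], axis=0, ignore_index=True)
print(stacked.shape)  # (6, 3)

# Column-wise: rows are matched on the index (0, 1, 2), not on the
# 'identifier' values, so the result carries two 'identifier' columns.
side_by_side = pd.concat([data_primary, data_secondary], axis=1)
print(list(side_by_side.columns))
```

If the goal is to line rows up by the values in identifier, a key-based merge is usually the better fit than a horizontal concat.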
Relational Merge with pd.merge
Perform SQL-like joins between DataFrames using pd.merge. This method combines datasets based on common columns or indices.
# Merge on a single common key (how defaults to 'inner')
merged_inner = pd.merge(data_primary, data_secondary, on='identifier')
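# Name the key column for each side explicitly; left_on/right_on are meant for
# key columns with different names (here both happen to be 'identifier')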
merged_diff_keys = pd.merge(data_primary, data_secondary, left_on='identifier', right_on='identifier')
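The left_on and right_on parameters matter most when the key columns are named differently in each frame. The sketch below assumes a hypothetical frame, data_renamed, whose key column is called code; neither name appears in the original example.

```python
import pandas as pd

data_primary = pd.DataFrame({
    'identifier': ['X', 'Y', 'Z'],
    'metric_A': [10, 20, 30]
})
# Hypothetical frame: same keys as data_secondary, but the column is named 'code'
data_renamed = pd.DataFrame({
    'code': ['Y', 'Z', 'W'],
    'metric_B': [40, 50, 60]
})

# Match 'identifier' on the left against 'code' on the right (inner join by default)
merged = pd.merge(data_primary, data_renamed,
                  left_on='identifier', right_on='code')

# Both key columns are kept in the result; drop the redundant one
merged = merged.drop(columns='code')
print(merged)
```

Dropping the second key column is a matter of taste; keeping it can be useful when auditing which values matched.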
Join Types
Control the merged dataset's composition with the how parameter.
- Inner Join: Keeps rows with matching keys in both DataFrames.
- Outer Join: Retains all rows from both DataFrames, filling gaps with NaN.
- Left Join: Preserves all rows from the left DataFrame, adding matching data from the right and NaN where no match exists.
- Right Join: Preserves all rows from the right DataFrame, adding matching data from the left and NaN where no match exists.
# Inner join
inner_result = pd.merge(data_primary, data_secondary, on='identifier', how='inner')
# Outer join
outer_result = pd.merge(data_primary, data_secondary, on='identifier', how='outer')
# Left join
left_result = pd.merge(data_primary, data_secondary, on='identifier', how='left')
# Right join
right_result = pd.merge(data_primary, data_secondary, on='identifier', how='right')
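To make the four join types concrete, the sketch below counts the rows each one returns for the sample frames, then uses the indicator=True option of pd.merge to label where each row came from. The names row_counts and tagged are illustrative.

```python
import pandas as pd

data_primary = pd.DataFrame({
    'identifier': ['X', 'Y', 'Z'],
    'metric_A': [10, 20, 30]
})
data_secondary = pd.DataFrame({
    'identifier': ['Y', 'Z', 'W'],
    'metric_B': [40, 50, 60]
})

# Number of rows produced by each join type
row_counts = {
    how: len(pd.merge(data_primary, data_secondary, on='identifier', how=how))
    for how in ('inner', 'outer', 'left', 'right')
}
print(row_counts)  # {'inner': 2, 'outer': 4, 'left': 3, 'right': 3}

# indicator=True adds a '_merge' column with the values
# 'left_only', 'right_only', or 'both'
tagged = pd.merge(data_primary, data_secondary, on='identifier',
                  how='outer', indicator=True)
print(tagged[['identifier', '_merge']])
```

Only Y and Z appear in both frames, which is why the inner join returns two rows while the outer join returns all four distinct identifiers.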
Index-based Combination with DataFrame.join
The join method combines DataFrames on their indices by default and performs a left join unless how says otherwise. It is a convenient way to add the columns of one DataFrame to another that shares the same index.
# Set a common key as the index
indexed_primary = data_primary.set_index('identifier')
indexed_secondary = data_secondary.set_index('identifier')
# Join on index
joined_index = indexed_primary.join(indexed_secondary, how='inner')
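For comparison, the sketch below shows join with its default how='left' next to what should be the equivalent pd.merge call using left_index and right_index. The names left_joined and via_merge are illustrative.

```python
import pandas as pd

data_primary = pd.DataFrame({
    'identifier': ['X', 'Y', 'Z'],
    'metric_A': [10, 20, 30]
})
data_secondary = pd.DataFrame({
    'identifier': ['Y', 'Z', 'W'],
    'metric_B': [40, 50, 60]
})

indexed_primary = data_primary.set_index('identifier')
indexed_secondary = data_secondary.set_index('identifier')

# With no 'how' argument, join keeps every row of the calling frame (left join);
# 'X' has no partner in indexed_secondary, so its metric_B is NaN
left_joined = indexed_primary.join(indexed_secondary)
print(left_joined)

# The same operation written as an index-on-index merge
via_merge = pd.merge(indexed_primary, indexed_secondary,
                     left_index=True, right_index=True, how='left')
print(left_joined.equals(via_merge))
```

Which form to use is largely a readability choice: join is shorter when both frames are already indexed by the key, while merge makes the matching rule explicit.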
Appending Rows with DataFrame.append
The append method added the rows of one DataFrame to the end of another. It was deprecated in pandas 1.4 and removed in pandas 2.0, so use pd.concat instead.
# Legacy call (pandas < 2.0 only): data_primary.append(data_secondary, ignore_index=True)
# Equivalent with pd.concat
appended_data = pd.concat([data_primary, data_secondary], ignore_index=True)