Conditional Column Creation with Pandas case_when()
The case_when() method in Pandas provides a SQL-like approach to creating new columns based on conditional logic. This method evaluates multiple conditions sequentially and assigns the corresponding values, offering a cleaner alternative to nested if-else statements when transforming data.
Method Overview
The case_when() method is available on Pandas Series objects (pandas 2.2 and later) and processes conditions in order. When a condition evaluates to True for a row, the associated value is assigned to that row. This behavior mirrors SQL's CASE WHEN construct and improves readability when handling multiple conditional branches.
Syntax and Parameters
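Series.case_when(caselist)
Parameters:
caselist: A list of (condition, replacement) tuples, evaluated in order
condition: A boolean Series or 1-D array-like with the same length as the calling Series, or a callable that receives the Series and returns one
replacement: A scalar, a 1-D array-like, or a callable that supplies the value to assign where the condition is True
The first matching condition determines the output value. There is no default parameter: rows where no condition matches keep their value from the Series on which case_when() is called. To supply a fallback value, call the method on a Series pre-filled with that value.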
Practical Examples
Example 1: Categorizing Numeric Ranges
Consider a dataset containing employee performance scores that need conversion into categorical grades:
import pandas as pd
# Employee performance data
performance_data = {
    'employee_id': [1001, 1002, 1003, 1004, 1005],
    'performance_score': [92, 78, 88, 55, 83]
}
df = pd.DataFrame(performance_data)
# Pair each evaluation criterion with its grade label; the first match wins
grade_cases = [
    (df['performance_score'] >= 85, 'Exceptional'),
    (df['performance_score'] >= 70, 'Satisfactory'),
    (df['performance_score'] >= 60, 'Needs Improvement')
]
# case_when() has no default argument: unmatched rows keep the caller's value,
# so start from a Series filled with the fallback label
fallback = pd.Series('At Risk', index=df.index)
df['performance_grade'] = fallback.case_when(grade_cases)
print(df)
Example 2: Handling Missing Values
Real-world datasets frequently contain null values requiring special treatment:
import pandas as pd
import numpy as np
# Product inventory data with missing quantities
inventory_data = {
    'product_code': ['A001', 'A002', 'A003', 'A004', 'A005'],
    'stock_quantity': [150, np.nan, 85, 200, np.nan]
}
df = pd.DataFrame(inventory_data)
# Pair each stock level condition with its label; the first match wins
stock_cases = [
    (df['stock_quantity'].notna() & (df['stock_quantity'] > 150), 'High'),
    (df['stock_quantity'].notna() & (df['stock_quantity'] >= 100), 'Normal'),
    (df['stock_quantity'].notna(), 'Low')
]
# Rows with a missing quantity match no case and keep the fallback label
fallback = pd.Series('Unknown', index=df.index)
df['stock_status'] = fallback.case_when(stock_cases)
print(df)
Example 3: Combining Multiple Columns
When conditions span multiple columns, build each condition directly from the relevant columns. case_when() is a Series method and cannot be called on the scalar values that a row function receives under apply(axis=1), so the conditions are written as vectorized comparisons instead:
import pandas as pd
# Customer analysis dataset
customer_data = {
    'customer_id': [100, 101, 102, 103, 104],
    'annual_spending': [5000, 2500, 8000, 1200, 4500],
    'purchase_frequency': [12, 4, 15, 2, 8]
}
df = pd.DataFrame(customer_data)
# Define multi-column evaluation criteria as boolean Series
segment_cases = [
    ((df['annual_spending'] >= 5000) & (df['purchase_frequency'] >= 10), 'VIP'),
    ((df['annual_spending'] >= 3000) & (df['purchase_frequency'] >= 5), 'Regular'),
    (df['annual_spending'] >= 1000, 'Occasional')
]
# Start from the fallback segment; rows that match no case keep it
fallback = pd.Series('New', index=df.index)
df['customer_segment'] = fallback.case_when(segment_cases)
print(df)
Example 4: Complex Business Rules
For intricate transformations involving multiple business rules:
import pandas as pd
# Sales representative performance data
sales_data = {
    'rep_name': ['Johnson', 'Williams', 'Brown', 'Jones', 'Garcia'],
    'quarterly_sales': [125000, 85000, 150000, 45000, 95000],
    'client_retention_rate': [0.92, 0.78, 0.88, 0.55, 0.83]
}
df = pd.DataFrame(sales_data)
# Pair each business rule with its performance tier; the first match wins
tier_cases = [
    ((df['quarterly_sales'] >= 100000) & (df['client_retention_rate'] >= 0.85), 'Top Performer'),
    ((df['quarterly_sales'] >= 75000) & (df['client_retention_rate'] >= 0.70), 'Solid Contributor'),
    (df['quarterly_sales'] >= 50000, 'Developing')
]
# Rows that satisfy none of the rules keep the fallback tier
fallback = pd.Series('Needs Support', index=df.index)
df['performance_category'] = fallback.case_when(tier_cases)
print(df)
Example 5: Aggregating Multiple Numeric Columns
Combining values from different columns into a single derived metric:
import pandas as pd
# Exam results dataset
exam_data = {
    'student_id': [2001, 2002, 2003, 2004, 2005],
    'midterm_score': [85, 70, 95, 60, 75],
    'final_score': [90, 80, 85, 70, 90]
}
df = pd.DataFrame(exam_data)
# Calculate weighted total (40% midterm, 60% final) with vectorized arithmetic
df['weighted_total'] = df['midterm_score'] * 0.4 + df['final_score'] * 0.6
print(df)
Example 6: Data Validation and Flagging
Using conditional logic to validate data quality and flag anomalies:
import pandas as pd
import numpy as np
# Transaction log with potential anomalies
transaction_data = {
    'transaction_id': ['T001', 'T002', 'T003', 'T004', 'T005'],
    'transaction_amount': [2500, np.nan, 150, 99999, 800]
}
df = pd.DataFrame(transaction_data)
# Pair each validation rule with its status label; the first match wins
validation_cases = [
    (df['transaction_amount'].notna() & (df['transaction_amount'] > 50000), 'Review Required'),
    (df['transaction_amount'].notna() & (df['transaction_amount'] > 0), 'Valid'),
    (df['transaction_amount'].isna(), 'Missing Data')
]
# Anything left over (zero or negative amounts) keeps the fallback flag
fallback = pd.Series('Invalid', index=df.index)
df['data_quality_flag'] = fallback.case_when(validation_cases)
print(df)
Important Notes
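The case_when() method processes conditions strictly in order. The first condition evaluating to True determines the output value, making condition ordering critical. Each condition must be a boolean Series or array matching the length of the calling Series, or a callable that returns one. Rows that match no condition keep their original value from the calling Series, so a fallback has to come from the Series the method is called on. The method requires pandas 2.2 or later. It integrates well with Pandas' method chaining patterns, enabling concise and maintainable data transformation pipelines.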