Fading Coder

Conditional Column Creation with Pandas case_when()

Tech May 17 2

The case_when() method in Pandas provides a SQL-like approach to creating new columns based on conditional logic. The method evaluates multiple conditions sequentially and assigns the corresponding values, offering a cleaner alternative to nested if-else statements when transforming data.

Method Overview

The case_when() method is available on Pandas Series objects (added in pandas 2.2.0) and processes conditions in order. For each row, the first condition that evaluates to True determines the assigned value. This behavior mirrors SQL's CASE WHEN construct and improves readability when handling multiple conditional branches.

Syntax and Parameters

Series.case_when(caselist)

Parameters:

  • caselist: A list of (condition, replacement) tuples. Each condition is a one-dimensional boolean Series or array, or a callable that returns one. Each replacement is a scalar, a one-dimensional array-like, or a callable.

The first matching condition determines the output value. There is no separate default parameter: rows that match no condition keep the value from the Series on which case_when() is called. To supply a fallback value, call case_when() on a Series that is already filled with that value.

Practical Examples

Example 1: Categorizing Numeric Ranges

Consider a dataset containing employee performance scores that need conversion into categorical grades:

import pandas as pd

# Employee performance data
performance_data = {
    'employee_id': [1001, 1002, 1003, 1004, 1005],
    'performance_score': [92, 78, 88, 55, 83]
}

df = pd.DataFrame(performance_data)

# Define evaluation criteria
criteria = [
    df['performance_score'] >= 85,
    df['performance_score'] >= 70,
    df['performance_score'] >= 60
]

# Define corresponding grade labels
grade_labels = ['Exceptional', 'Satisfactory', 'Needs Improvement']

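# Apply conditional logic. case_when() takes (condition, value) pairs and has
# no default argument, so start from a Series filled with the fallback label.
fallback = pd.Series('At Risk', index=df.index)
df['performance_grade'] = fallback.case_when(list(zip(criteria, grade_labels)))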

print(df)

Example 2: Handling Missing Values

Real-world datasets frequently contain null values requiring special treatment:

import pandas as pd
import numpy as np

# Product inventory data with missing quantities
inventory_data = {
    'product_code': ['A001', 'A002', 'A003', 'A004', 'A005'],
    'stock_quantity': [150, np.nan, 85, 200, np.nan]
}

df = pd.DataFrame(inventory_data)

# Define stock level conditions
stock_conditions = [
    df['stock_quantity'].notna() & (df['stock_quantity'] > 150),
    df['stock_quantity'].notna() & (df['stock_quantity'] >= 100),
    df['stock_quantity'].notna()
]

stock_levels = ['High', 'Normal', 'Low']

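# Rows with a missing quantity match no condition and keep the fallback label
fallback = pd.Series('Unknown', index=df.index)
df['stock_status'] = fallback.case_when(list(zip(stock_conditions, stock_levels)))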

print(df)

Example 3: Combining Multiple Columns

When conditions span multiple columns, build each condition directly from the DataFrame's columns. The conditions are evaluated over whole columns, so no row-wise apply() is needed:

import pandas as pd

# Customer analysis dataset
customer_data = {
    'customer_id': [100, 101, 102, 103, 104],
    'annual_spending': [5000, 2500, 8000, 1200, 4500],
    'purchase_frequency': [12, 4, 15, 2, 8]
}

df = pd.DataFrame(customer_data)

# Define multi-column evaluation criteria as (condition, label) pairs
segment_cases = [
    ((df['annual_spending'] >= 5000) & (df['purchase_frequency'] >= 10), 'VIP'),
    ((df['annual_spending'] >= 3000) & (df['purchase_frequency'] >= 5), 'Regular'),
    (df['annual_spending'] >= 1000, 'Occasional')
]

# Start from the fallback label; rows matching no condition stay 'New'
fallback = pd.Series('New', index=df.index)
df['customer_segment'] = fallback.case_when(segment_cases)

print(df)

Example 4: Complex Business Rules

For intricate transformations involving multiple business rules:

import pandas as pd

# Sales representative performance data
sales_data = {
    'rep_name': ['Johnson', 'Williams', 'Brown', 'Jones', 'Garcia'],
    'quarterly_sales': [125000, 85000, 150000, 45000, 95000],
    'client_retention_rate': [0.92, 0.78, 0.88, 0.55, 0.83]
}

df = pd.DataFrame(sales_data)

# Define complex evaluation criteria
sales_conditions = [
    (df['quarterly_sales'] >= 100000) & (df['client_retention_rate'] >= 0.85),
    (df['quarterly_sales'] >= 75000) & (df['client_retention_rate'] >= 0.70),
    (df['quarterly_sales'] >= 50000)
]

performance_tiers = ['Top Performer', 'Solid Contributor', 'Developing']

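# Pair each condition with its tier; unmatched rows keep the fallback label
fallback = pd.Series('Needs Support', index=df.index)
df['performance_category'] = fallback.case_when(list(zip(sales_conditions, performance_tiers)))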

print(df)

Example 5: Aggregating Multiple Numeric Columns

Combining values from different columns into a single derived metric:

import pandas as pd

# Exam results dataset
exam_data = {
    'student_id': [2001, 2002, 2003, 2004, 2005],
    'midterm_score': [85, 70, 95, 60, 75],
    'final_score': [90, 80, 85, 70, 90]
}

df = pd.DataFrame(exam_data)

# Calculate weighted total (40% midterm, 60% final) with column arithmetic
df['weighted_total'] = df['midterm_score'] * 0.4 + df['final_score'] * 0.6

print(df)

Example 6: Data Validation and Flagging

Using conditional logic to validate data quality and flag anomalies:

import pandas as pd
import numpy as np

# Transaction log with potential anomalies
transaction_data = {
    'transaction_id': ['T001', 'T002', 'T003', 'T004', 'T005'],
    'transaction_amount': [2500, np.nan, 150, 99999, 800]
}

df = pd.DataFrame(transaction_data)

# Define validation rules
validation_conditions = [
    df['transaction_amount'].notna() & (df['transaction_amount'] > 50000),
    df['transaction_amount'].notna() & (df['transaction_amount'] > 0),
    df['transaction_amount'].isna()
]

status_labels = ['Review Required', 'Valid', 'Missing Data']

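# Non-missing amounts that are zero or negative match no rule and stay 'Invalid'
fallback = pd.Series('Invalid', index=df.index)
df['data_quality_flag'] = fallback.case_when(list(zip(validation_conditions, status_labels)))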

print(df)

Important Notes

The case_when() method processes conditions strictly in order. The first condition evaluating to True determines the output value, so list the most specific conditions first. Each condition must be boolean and match the length of the Series. Rows that match no condition keep the value of the Series the method is called on. The method also fits Pandas' method chaining patterns, enabling concise and maintainable data transformation pipelines.
