Predicting Passenger Survival on the Titanic Using Ensemble Methods
Import the necessary libraries for data manipulation, visualization, and machine learning:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
Load the training and test datasets:
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')
target = df_train['Survived']
passenger_ids = df_test['PassengerId']
Combine datasets for consistent preprocessing:
df_train = df_train.drop(['Survived'], axis=1)
full_data = pd.concat([df_train, df_test], axis=0, ignore_index=True)
full_data = full_data.drop(['PassengerId'], axis=1)
Exploratory Data Analysis
Analyze the relationship between categorical variables and survival. First, examine gender distribution:
palette = {0: '#e74c3c', 1: '#2ecc71'}
sns.countplot(data=pd.concat([df_train, target], axis=1), x='Sex', hue='Survived', palette=palette)
plt.show()
Perform chi-square test to validate statistical significance:
contingency_table = pd.crosstab(df_train['Sex'], target)
chi2, p_value, dof, expected = chi2_contingency(contingency_table)
print(f"P-value: {p_value}")
The extremely low p-value indicates gender is a strong predictor of survival.
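Alongside significance, it helps to look at effect size: the per-group mean of a 0/1 target is the survival rate itself. A minimal sketch on toy data (the values below are illustrative, not the real Titanic rates):

```python
import pandas as pd

# Toy data: the chi-square test reports significance; the per-group
# mean of a 0/1 target reports the effect size directly.
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'female', 'male', 'male', 'male', 'male', 'male'],
    'Survived': [1, 1, 0, 0, 0, 1, 0, 0],
})
rates = toy.groupby('Sex')['Survived'].mean()
print(rates)
```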
For continuous variables like age, visualize distributions:
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
df_train[target == 1]['Age'].plot(kind='hist', color='green', ax=axes[0], bins=30, alpha=0.7)
axes[0].set_title('Survivors Age Distribution')
df_train[target == 0]['Age'].plot(kind='hist', color='red', ax=axes[1], bins=30, alpha=0.7)
axes[1].set_title('Non-Survivors Age Distribution')
plt.tight_layout()
plt.show()
Analyze fare distributions using a boxplot:
plt.figure(figsize=(10, 6))
sns.boxplot(x=target, y='Fare', data=df_train, palette=palette)
plt.show()
Passenger class shows clear survival patterns:
sns.countplot(x='Pclass', hue=target, data=df_train, palette=palette)
plt.show()
# Statistical validation
class_table = pd.crosstab(df_train['Pclass'], target)
print(f"Chi-square p-value: {chi2_contingency(class_table)[1]}")
Embarkation port also correlates with outcomes:
sns.countplot(x='Embarked', hue=target, data=df_train, palette=palette)
plt.show()
Feature Engineering
Extract titles from passenger names to capture social status:
full_data['Honorific'] = full_data['Name'].str.extract(r',\s([^\.]+)\.', expand=False).str.strip()
title_mapping = {
'Mr': 'Mr',
'Mrs': 'Mrs',
'Miss': 'Miss',
'Master': 'Master',
'Dr': 'Professional',
'Rev': 'Professional',
'Col': 'Professional',
'Major': 'Professional',
'Capt': 'Professional',
'Don': 'Nobility',
'Sir': 'Nobility',
'Lady': 'Nobility',
'the Countess': 'Nobility',
'Jonkheer': 'Nobility',
'Mme': 'Mrs',
'Ms': 'Mrs',
'Mlle': 'Miss',
'Dona': 'Mrs'
}
full_data['TitleCategory'] = full_data['Honorific'].map(title_mapping)
full_data = full_data.drop(['Name', 'Honorific'], axis=1)
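One caveat with map(): any title absent from the dictionary becomes NaN silently, so it is cheap insurance to check coverage before dropping the raw column. A self-contained sketch on toy names ('Baron' is a hypothetical unmapped title used only for illustration):

```python
import pandas as pd

# map() silently turns any title missing from the dict into NaN;
# listing the unmapped titles catches gaps in the mapping early.
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Unseen, Baron. Example',  # hypothetical title not in the mapping
])
honorific = names.str.extract(r',\s([^\.]+)\.', expand=False).str.strip()
mapping = {'Mr': 'Mr', 'Miss': 'Miss'}
mapped = honorific.map(mapping)
print(honorific[mapped.isna()].tolist())  # → ['Baron']
```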
Create family size features:
full_data['HouseholdSize'] = full_data['SibSp'] + full_data['Parch'] + 1
def categorize_family(size):
    if size == 1:
        return 'Solo'
    elif size <= 4:
        return 'Small'
    else:
        return 'Large'
full_data['FamilyType'] = full_data['HouseholdSize'].apply(categorize_family)
full_data = full_data.drop(['SibSp', 'Parch', 'HouseholdSize'], axis=1)
Handle missing values. For age, use median values grouped by sex, class, and title:
age_medians = full_data.groupby(['Sex', 'Pclass', 'TitleCategory'])['Age'].transform('median')
full_data['Age'] = full_data['Age'].fillna(age_medians)
full_data['Age'] = full_data['Age'].fillna(full_data['Age'].median())
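The two-stage imputation above can be sketched on toy data: fill with the group median via transform first, then fall back to the global median for any group that is entirely missing.

```python
import numpy as np
import pandas as pd

# Toy sketch of two-stage imputation: group medians first, then the
# global median as a fallback.
toy = pd.DataFrame({
    'Pclass': [1, 1, 3, 3, 3],
    'Age':    [40.0, np.nan, 20.0, 30.0, np.nan],
})
group_medians = toy.groupby('Pclass')['Age'].transform('median')
toy['Age'] = toy['Age'].fillna(group_medians)
toy['Age'] = toy['Age'].fillna(toy['Age'].median())  # fallback, no-op here
print(toy['Age'].tolist())  # → [40.0, 40.0, 20.0, 30.0, 25.0]
```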
Fill remaining missing values:
full_data['Fare'] = full_data['Fare'].fillna(full_data['Fare'].median())
full_data['Embarked'] = full_data['Embarked'].fillna(full_data['Embarked'].mode()[0])
Simplify cabin information to a binary availability flag:
full_data['HasCabin'] = full_data['Cabin'].notna().astype(int)
full_data = full_data.drop(['Cabin'], axis=1)
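The flag construction is easy to sanity-check on a toy series:

```python
import numpy as np
import pandas as pd

# Toy sketch: notna() collapses the sparse cabin strings into a 0/1
# availability flag, which is all the model sees afterwards.
cabins = pd.Series(['C85', np.nan, 'E46', np.nan])
has_cabin = cabins.notna().astype(int)
print(has_cabin.tolist())  # → [1, 0, 1, 0]
```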
Convert categorical variables to dummy variables:
categorical_cols = ['Sex', 'Embarked', 'Pclass', 'TitleCategory', 'FamilyType']
full_data = pd.get_dummies(full_data, columns=categorical_cols, drop_first=True)
full_data = full_data.drop(['Ticket'], axis=1)
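The effect of drop_first=True can be seen on a toy column: k categories yield k-1 indicator columns, with the dropped (alphabetically first) category acting as the implicit baseline.

```python
import pandas as pd

# Toy sketch of drop_first=True: the first category ('C') is dropped,
# avoiding a perfectly collinear redundant column.
toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})
dummies = pd.get_dummies(toy, columns=['Embarked'], drop_first=True)
print(list(dummies.columns))  # → ['Embarked_Q', 'Embarked_S']
```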
Split back into training and test sets:
X_train = full_data.iloc[:len(df_train)]
X_test = full_data.iloc[len(df_train):]
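Because the concat used ignore_index=True, positional slicing with iloc round-trips the original row counts exactly. A toy check of the same pattern:

```python
import pandas as pd

# Toy check that concat followed by iloc recovers the original halves,
# mirroring the split of the combined frame back into train and test.
a = pd.DataFrame({'x': [1, 2, 3]})
b = pd.DataFrame({'x': [4, 5]})
combined = pd.concat([a, b], axis=0, ignore_index=True)
back_a = combined.iloc[:len(a)]
back_b = combined.iloc[len(a):]
print(len(back_a), len(back_b))  # → 3 2
```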
Model Training and Evaluation
Train a Random Forest classifier with cross-validation:
classifier = RandomForestClassifier(n_estimators=200, max_features='sqrt', random_state=42)
cv_scores = cross_val_score(classifier, X_train, target, cv=10, scoring='accuracy')
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
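A cross-validated accuracy is easier to interpret against a majority-class baseline. A minimal sketch on synthetic data (the synthetic features and baseline comparison are illustrative, not part of the pipeline above):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: y is determined by the sign of the first
# feature, so the model should clearly beat the majority-class baseline.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=42)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
baseline = max(y.mean(), 1 - y.mean())  # accuracy of always guessing the majority class
print(scores.mean() > baseline)
```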
Fit the model and analyze feature importance:
classifier.fit(X_train, target)
importance_series = pd.Series(classifier.feature_importances_, index=X_train.columns)
importance_series = importance_series.sort_values(ascending=True)
importance_series.plot(kind='barh', figsize=(10, 8))
plt.title('Feature Importance Ranking')
plt.show()
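The walkthrough stops at feature importance; for a Kaggle-style submission, the fitted model would be applied to X_test and paired with the PassengerId values saved earlier. A minimal sketch with toy stand-ins for the train/test frames and ids (the 'submission.csv' filename is an assumption):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy stand-ins for X_train, target, X_test, and passenger_ids from the
# pipeline above; only the submission-building pattern is the point here.
X_tr = pd.DataFrame({'f1': [0, 0, 1, 1], 'f2': [1, 0, 1, 0]})
y_tr = pd.Series([0, 0, 1, 1])
X_te = pd.DataFrame({'f1': [0, 1], 'f2': [0, 1]})
ids = pd.Series([892, 893], name='PassengerId')

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_tr, y_tr)
submission = pd.DataFrame({'PassengerId': ids, 'Survived': model.predict(X_te)})
submission.to_csv('submission.csv', index=False)
print(submission)
```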