Who Survived the Titanic? A Data-Driven Look at Chance, Class, and Choice

Introduction
1. Key Results
Who survived the Titanic shipwreck — and why? Beneath this tragic event lies a fascinating data story of class, age, and gender.
Through Exploratory Data Analysis (EDA) and machine learning, I built models to predict passenger survival and uncover the hidden patterns behind the numbers. While the overall survival rate was only about 38%, survival wasn’t random — women, children, and first-class passengers had dramatically higher chances. This project became a hands-on introduction to classification modeling and key evaluation metrics like Precision, Recall, and F1-Score.


2. Why It Matters
Predicting survival in the Titanic disaster dataset isn’t just an academic challenge — it’s central to how we think about risk, fairness, and decision-making in real-world scenarios. Studying which factors (such as gender, class, and age) determined survival yields insights that help policy makers, emergency planners, and data scientists design protocols that save lives in future crises.
By applying Exploratory Data Analysis (EDA) and machine learning techniques to this classification problem, we learn not only how to build models that predict outcomes, but also why those outcomes occurred. That matters because in the fields of transportation safety, disaster response, and social equity, it’s not enough to know who survived — we want to understand why. These insights help organisations and analysts make informed decisions around resource allocation, training, and evacuation strategies — ultimately leading to safer systems and more equitable responses.
3. The Data
The dataset from the Kaggle competition “Titanic – Machine Learning from Disaster” consists of a training set of 891 passenger records and a test set of 418 records. Each row describes a passenger on the RMS Titanic and includes features such as:
- Ticket class: 1st, 2nd, 3rd
- Sex (male or female)
- Age (in years)
- Number of siblings/spouses aboard
- Number of parents/children aboard
- Ticket fare paid
- Port of embarkation: Cherbourg/Queenstown/Southampton
The target variable is Survived (0 = died, 1 = survived) for each passenger in the training set.
Link to the public dataset: https://www.kaggle.com/competitions/titanic/data


Methods & Results
4. What I Did
The workflow followed a straightforward pattern: EDA → Feature Engineering → Model Training → Evaluation.
During feature engineering, I created meaningful variables such as AgeGroup (Child, Teen, Young Adult, Adult, Senior), cabin_present (indicating whether a cabin number was available), and RelativesOnboard (the sum of SibSp and Parch).
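The three engineered features above can be sketched in a few lines of pandas. This is a minimal illustration on a made-up mini-frame standing in for the Kaggle training set; the exact age cut-offs are my assumption, not necessarily the ones used in the project:

```python
import pandas as pd

# Hypothetical mini-frame standing in for the Kaggle training data
df = pd.DataFrame({
    "Age": [4, 15, 22, 38, 70],
    "Cabin": ["C85", None, None, "B28", None],
    "SibSp": [1, 0, 1, 1, 0],
    "Parch": [1, 0, 0, 0, 0],
})

# AgeGroup: bin ages into the five labels (cut-offs are illustrative)
df["AgeGroup"] = pd.cut(
    df["Age"],
    bins=[0, 12, 18, 30, 60, 120],
    labels=["Child", "Teen", "Young Adult", "Adult", "Senior"],
)

# cabin_present: 1 if a cabin number was recorded, else 0
df["cabin_present"] = df["Cabin"].notna().astype(int)

# RelativesOnboard: siblings/spouses plus parents/children aboard
df["RelativesOnboard"] = df["SibSp"] + df["Parch"]
```

The cabin indicator is a common trick here: the raw Cabin column is mostly missing, but *whether* it is missing correlates with class and survival, so a simple presence flag preserves that signal.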
I trained several baseline models — SGD, SVM, Decision Tree, and Random Forest — and compared their performance using accuracy. The SVM model performed best, so I fine-tuned it with GridSearchCV to optimize hyperparameters.
Final performance was evaluated using Accuracy, Precision, Recall, and F1-Score to ensure balanced assessment across all prediction outcomes.
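The tuning-and-evaluation step can be sketched with scikit-learn. The code below uses synthetic data in place of the real engineered features, and the hyperparameter grid is an assumption chosen for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Synthetic stand-in for the engineered Titanic features (891 rows, like the train set)
X, y = make_classification(n_samples=891, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Small, illustrative hyperparameter grid for the SVM
param_grid = {"C": [0.1, 1, 10], "kernel": ["rbf", "linear"]}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Evaluate the best estimator with all four metrics
y_pred = search.best_estimator_.predict(X_test)
print("Best params:", search.best_params_)
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.2f}")
print(f"Precision: {precision_score(y_test, y_pred):.2f}")
print(f"Recall:    {recall_score(y_test, y_pred):.2f}")
print(f"F1-Score:  {f1_score(y_test, y_pred):.2f}")
```

Reporting all four metrics, rather than accuracy alone, matters here because the classes are imbalanced (only ~38% survived), so a model could score decent accuracy while missing many survivors.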
5. What I Found
Influence of Features on Survival Rates
Gender was the single strongest predictor of survival. Females had consistently higher survival probabilities (~0.59–0.94), while males saw a steep drop after childhood (~0.09–0.23).
Passenger Class clearly stratified outcomes — higher class meant better chances. The extremes: 1st-class females (~0.97 survival) vs 3rd-class males (~0.14).
Age mattered: children were prioritized during evacuation. Boys and girls under 12 had similar survival rates (~0.57–0.59), but after adolescence, the gender gap widened sharply.


Performance Metrics of the Best Model (SVM):
- Accuracy: 82%
- Precision: 80%
- Recall: 72%
- F1-Score: 76%


Conclusion
6. What I Learned
This project taught me how to build and evaluate Classification Models end to end — from preprocessing raw data to fine-tuning hyperparameters. I deepened my understanding of key evaluation metrics such as Accuracy, Precision, Recall, F1-Score, and the Confusion Matrix, as well as diagnostic tools like the Precision-Recall and ROC Curves.
Through EDA and thoughtful feature engineering, I saw how data visualization and preprocessing pipelines can transform raw numbers into insights that explain why models make certain predictions — not just how well they perform.
Credits
Thanks for Reading the Post!
Inspired by Chapter 3 from Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 3rd Edition by Aurélien Géron