# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
The most crucial part of building a model is preprocessing the data, which accounts for roughly 75% to 80% of the work. Data preprocessing includes several steps:
- Selecting relevant features
- Handling missing values
- Data type conversion
- Data normalization (if necessary)
This notebook provides a step-by-step guide to preprocessing the data.
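Normalization, the last step in the list, is not done by hand in this notebook; it happens later through StandardScaler inside the model pipeline. As a minimal sketch of what that standardization does to a single column (the ages here are made-up values, not taken from the dataset):

```python
import numpy as np

# Made-up ages, used only to illustrate the arithmetic.
ages = np.array([22.0, 38.0, 26.0, 35.0, 54.0])

# Z-score standardization: subtract the mean, divide by the
# population standard deviation (ddof=0, the NumPy default).
standardized = (ages - ages.mean()) / ages.std()

print(standardized)
print("mean:", standardized.mean(), "std:", standardized.std())
```

After this transformation the column has mean 0 and standard deviation 1, which matters for distance-based models such as the SVM used at the end.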
You need to inspect and visualize the data to understand the features and their relationship with the response variable.
train_df = pd.read_csv("/kaggle/input/titanic/train.csv")
test_df = pd.read_csv("/kaggle/input/titanic/test.csv")
train_df.head()

Removing Irrelevant Attributes
Context Information Provided
Removing irrelevant attributes is easier when context information is available. Looking at the data above, we can hypothesize that the explanatory attributes PassengerId, Name, Fare, and Ticket (the ticket number) have no relation to the Survived response variable. PassengerId and Name are identifiers, and, thinking logically, Ticket and Fare do not seem to affect survival chances either. None of these four attributes will therefore be used to develop the predictive model. We can make these hypotheses because we have contextual information about the Titanic. When contextual information is missing, we need to apply feature reduction, which is explained later in this notebook.
We can also guess that Age, Sex, and Pclass (the passenger's class, a proxy for financial status) influence survival chances.
# Removing Irrelevant attributes
irrelv_attrs = ['PassengerId','Name','Ticket','Fare']
relv_attrs=[x for x in train_df.columns.values if x not in irrelv_attrs]
train_df_tmp=train_df[relv_attrs]
# Removing irrelevant attributes from the test set
relv_attrs=[x for x in test_df.columns.values if x not in irrelv_attrs]
test_df_tmp=test_df[relv_attrs]
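The same column removal can also be written with `DataFrame.drop`. The sketch below uses a small made-up frame (`toy_df` is a hypothetical stand-in, not the Titanic data) so that it runs on its own:

```python
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up).
toy_df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Name": ["A", "B", "C"],
    "Ticket": ["T1", "T2", "T3"],
    "Fare": [7.25, 71.28, 7.92],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
})

irrelevant = ["PassengerId", "Name", "Ticket", "Fare"]

# drop() returns a new frame, so the original stays untouched.
toy_relevant = toy_df.drop(columns=irrelevant)
remaining_columns = list(toy_relevant.columns)
print(remaining_columns)
```

Because the same list is applied to both frames, the train and test sets cannot drift apart in which columns they drop.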
Handling Missing Values
First, let's check whether the data has any missing values. I have combined the training and test data to find every feature with missing values in the dataset. Keep in mind that as the number of observations grows, any feature might acquire missing values; in other words, you cannot know in advance which features will have missing values in the future.
pd.concat([train_df_tmp,test_df_tmp]).isnull().apply(lambda x: x.any(), axis=0)
Survived     True
Pclass      False
Sex         False
Age          True
SibSp       False
Parch       False
Cabin        True
Embarked     True
dtype: bool
Thus, we need to worry about the missing values in Age, Cabin, and Embarked. Survived is flagged only because the test set has no Survived column, so it needs no imputation.
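A True/False flag does not say how much of a column is missing, which matters when choosing between imputing and dropping. A minimal sketch that counts missing values per column, using a made-up frame (`toy_df` and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up frame with a few gaps, standing in for the combined data.
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
    "Pclass": [3, 1, 3, 2],
})

# isnull() gives a boolean frame; sum() counts the True values per
# column and mean() gives the fraction of missing rows.
missing_count = toy_df.isnull().sum()
missing_share = toy_df.isnull().mean()

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary.sort_values("missing", ascending=False))
```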
Age
To fill the missing values of Age, we need some analysis. First, we can guess that the average age of male passengers is higher than that of female passengers. We can also hypothesize that the average age differs between passenger classes: passengers in the higher class (Pclass 1) would be older than those in the lower class (Pclass 3), because most young passengers were laborers and older passengers were business people. I have also included the Survived attribute in the plot; however, it is not essential, and it cannot be used once the test data is included because the test set has no Survived values. The analysis is visualized in the bar graph below.
grouped_train_df=train_df[['Survived','Sex','Age','Pclass']].groupby(['Pclass','Survived','Sex'])['Age'].mean()
grouped_train_df.unstack().unstack().plot(kind='bar', figsize=(15,8),
                                          title="Fig 1: Age vs Pclass based upon gender & survived",
                                          xlabel="Passenger class",
                                          ylabel="Average age",
                                          colormap='nipy_spectral')
[Fig 1: bar chart of average age per passenger class, split by gender and survival]

# Filling out missing values for Age with the mean age of each (Pclass, Sex) group in the training set
calc_NA_vals=train_df_tmp[['Sex','Age','Pclass']].groupby(['Pclass','Sex'])['Age'].mean().reset_index().rename(columns={"Age":"Missing_Age"})
print("Values used to replace missing age")
print(calc_NA_vals)
train_df_tmp=train_df_tmp.merge(calc_NA_vals, how='left',on=['Sex','Pclass'])
test_df_tmp=test_df_tmp.merge(calc_NA_vals, how='left',on=['Sex','Pclass'])
# Assign the result back: fillna(inplace=True) on a selected column is not reliable in recent pandas
train_df_tmp['Age']=train_df_tmp['Age'].fillna(train_df_tmp['Missing_Age'])
test_df_tmp['Age']=test_df_tmp['Age'].fillna(test_df_tmp['Missing_Age'])
Values used to replace missing age
   Pclass     Sex  Missing_Age
0       1  female    34.611765
1       1    male    41.281386
2       2  female    28.722973
3       2    male    30.740707
4       3  female    21.750000
5       3    male    26.507589
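When only a single frame needs filling, the merge step can be skipped with `groupby(...).transform`. This is a sketch on made-up data (`toy_df` is hypothetical), not a replacement for the code above, which deliberately reuses the training means for the test set:

```python
import numpy as np
import pandas as pd

# Made-up passengers; two ages are missing.
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["male", "male", "male", "female", "female", "female"],
    "Age": [40.0, 50.0, np.nan, 20.0, np.nan, 24.0],
})

# transform("mean") returns one value per row: the mean age of that
# row's (Pclass, Sex) group, computed while ignoring missing ages.
group_mean_age = toy_df.groupby(["Pclass", "Sex"])["Age"].transform("mean")

# Only the missing entries are replaced; known ages stay as they are.
toy_df["Age"] = toy_df["Age"].fillna(group_mean_age)
filled_ages = toy_df["Age"].tolist()
print(filled_ages)
```

The two gaps are filled with their group means (45 for first-class males, 22 for third-class females in this toy data).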
Embarked
Filling the missing values for Embarked (the port where a passenger boarded the Titanic) does not, I think, require a deep analysis like that of Age. I will fill the missing values with the most frequent value in Embarked.
grouped_emb_train_df=train_df[['PassengerId','Embarked']].groupby(['Embarked'])['PassengerId'].count()
grouped_emb_train_df.plot(kind='bar', figsize=(15,8),
                          title="Fig 2: Count vs Embarked",
                          xlabel="Embarked location",
                          ylabel="Total count",
                          colormap='seismic')
[Fig 2: bar chart of passenger count per embarkation location]

From Figure 2, we can see that S is the most frequent value, so it is used to replace the null values.
# Filling the NA values for Embarked (assign back instead of using inplace=True on a selected column)
train_df_tmp['Embarked']=train_df_tmp['Embarked'].fillna('S')
test_df_tmp['Embarked']=test_df_tmp['Embarked'].fillna('S')
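Rather than reading the most frequent value off the chart and hard-coding it, it can be computed with `Series.mode`. A minimal sketch on made-up data (`toy_train` and `toy_test` are hypothetical names):

```python
import pandas as pd

# Made-up embarkation codes with one gap in each set.
toy_train = pd.DataFrame({"Embarked": ["S", "C", "S", None, "Q", "S"]})
toy_test = pd.DataFrame({"Embarked": ["C", None, "S"]})

# mode() ignores missing values and returns the most frequent value(s);
# iloc[0] picks the first one if there is a tie.
most_frequent_port = toy_train["Embarked"].mode().iloc[0]

# The value learned from the training set is reused for the test set.
toy_train["Embarked"] = toy_train["Embarked"].fillna(most_frequent_port)
toy_test["Embarked"] = toy_test["Embarked"].fillna(most_frequent_port)
print(most_frequent_port)
```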
Cabin
This attribute is tricky because it has many missing values, and I could not tell whether it contributes to a person's survival chance. Let's do the analysis.
# Fill all NA values with 'Others' and count the values
train_df['Cabin'].fillna('Others').value_counts()
Others         687
B96 B98          4
C23 C25 C27      4
G6               4
F33              3
              ...
E77              1
D9               1
B94              1
A7               1
C70              1
Name: Cabin, Length: 148, dtype: int64
As we can see, there is a large number of missing values. At this point, I could simply drop the column; however, for the sake of learning, I will transform it into two new columns. These variables may be used to fit the model if they pass the relevance test.
# Planned columns (named Cabin_fl and Cabin_cnt in the code below):
# df['Cabin_fl']  = first letter of the Cabin value
# df['Cabin_cnt'] = number of cabins listed in the Cabin value
# These are the unique first letters for the data points that have a cabin
# I will use a different letter, Z, to fill the null values
pd.concat([train_df,test_df])['Cabin'].dropna().str[0].unique()
array(['C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)
# Transforming Cabin to two separate attributes
train_df_tmp_ = train_df_tmp[~train_df_tmp["Cabin"].isnull()][["Cabin"]].copy() # rows with a cabin; copy so new columns can be added safely
train_df_tmp_["Cabin_fl"] = train_df_tmp_["Cabin"].str[0] # Get first letter
train_df_tmp_["Cabin_cnt"]= train_df_tmp_["Cabin"].str.strip().str.split(' ').str.len() # Get cabin count
train_df_tmp=train_df_tmp.join(train_df_tmp_[["Cabin_fl","Cabin_cnt"]], lsuffix='_left', rsuffix='_right') # merge back to original dataframe
# Same process with test dataframe
test_df_tmp_ = test_df_tmp[~test_df_tmp["Cabin"].isnull()][["Cabin"]].copy()
test_df_tmp_["Cabin_fl"] = test_df_tmp_["Cabin"].str[0] # Get first letter
test_df_tmp_["Cabin_cnt"]= test_df_tmp_["Cabin"].str.strip().str.split(' ').str.len() # Get cabin count
test_df_tmp=test_df_tmp.join(test_df_tmp_[["Cabin_fl","Cabin_cnt"]], lsuffix='_left', rsuffix='_right') # merge back to original dataframe
To fill the missing values for Cabin, the count is set to zero and the first letter to Z.
train_df_tmp['Cabin_cnt']=train_df_tmp['Cabin_cnt'].fillna(0)
train_df_tmp['Cabin_fl']=train_df_tmp['Cabin_fl'].fillna('Z')
test_df_tmp['Cabin_cnt']=test_df_tmp['Cabin_cnt'].fillna(0)
test_df_tmp['Cabin_fl']=test_df_tmp['Cabin_fl'].fillna('Z')
train_df_tmp.head()
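Since the train and test frames go through identical steps, the Cabin transformation could be wrapped in one function. The sketch below is one possible shape for it; `add_cabin_features` is a hypothetical helper, not part of the notebook, and it stores the count as an integer:

```python
import numpy as np
import pandas as pd

def add_cabin_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with Cabin_fl and Cabin_cnt columns added."""
    out = df.copy()
    cabin = out["Cabin"]
    # First letter of the cabin string; 'Z' marks a missing cabin.
    out["Cabin_fl"] = cabin.str[0].fillna("Z")
    # Number of whitespace-separated cabin codes; 0 for a missing cabin.
    out["Cabin_cnt"] = cabin.str.split().str.len().fillna(0).astype(int)
    return out

# Made-up cabin values, including one missing entry.
toy_df = pd.DataFrame({"Cabin": ["C85", np.nan, "B96 B98", "C23 C25 C27"]})
toy_features = add_cabin_features(toy_df)
print(toy_features)
```

Calling the same function on both frames avoids the duplicated block and keeps the two sets in step if the logic changes.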

Data Type Conversion
Here, we need to convert categorical data types to numerical ones so that the ML model can work with them. Four categorical attributes in the data need conversion: Pclass, Sex, Embarked, and Cabin_fl.
Sex is the easiest to convert, since we can map male=0 and female=1. The other three attributes are one-hot encoded with pd.get_dummies.
# Sex: categorical to numerical
train_df_tmp['Sex']=train_df_tmp['Sex'].map({'male':0,'female':1})
# One-hot encode Pclass, Embarked, and Cabin_fl
train_df_tmp=train_df_tmp.join(pd.get_dummies(train_df_tmp['Pclass'], prefix='Pclass'))
train_df_tmp=train_df_tmp.join(pd.get_dummies(train_df_tmp['Embarked'], prefix='Embarked'))
train_df_tmp=train_df_tmp.join(pd.get_dummies(train_df_tmp['Cabin_fl'], prefix='Cabin_fl'))
# For test dataframe
test_df_tmp['Sex']=test_df_tmp['Sex'].map({'male':0,'female':1})
test_df_tmp=test_df_tmp.join(pd.get_dummies(test_df_tmp['Pclass'], prefix='Pclass'))
test_df_tmp=test_df_tmp.join(pd.get_dummies(test_df_tmp['Embarked'], prefix='Embarked'))
test_df_tmp=test_df_tmp.join(pd.get_dummies(test_df_tmp['Cabin_fl'], prefix='Cabin_fl'))
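Encoding the two frames separately can leave them with different dummy columns when a category, such as a cabin letter, appears in only one of them. A minimal sketch of aligning the test columns to the training columns with `reindex` (the toy frames are made up for illustration):

```python
import pandas as pd

# Made-up data: the letter 'T' occurs only in the training frame.
toy_train = pd.DataFrame({"Cabin_fl": ["C", "T", "Z"]})
toy_test = pd.DataFrame({"Cabin_fl": ["C", "Z", "Z"]})

train_dummies = pd.get_dummies(toy_train["Cabin_fl"], prefix="Cabin_fl")
test_dummies = pd.get_dummies(toy_test["Cabin_fl"], prefix="Cabin_fl")

# Give the test frame exactly the training columns, in the same order;
# categories unseen in the test data become all-zero columns.
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(list(test_dummies.columns))
```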
Context Information Not Provided
There may be cases where context information is not readily available. For example, it is unclear to me whether Cabin, Embarked, SibSp, and Parch contribute anything to the Survived variable. We therefore use feature selection, a form of dimensionality reduction, to keep only the relevant attributes.
Here, I use the Embedded Method for feature selection. You can use the Filter or Wrapper Method as well.
Embedded Method
A LASSO model is used here. Its L1 penalty shrinks the weights of irrelevant attributes to exactly 0.
from sklearn.linear_model import LassoCV
model = LassoCV(random_state=0)
# Only numerical attributes are selected to fit into the model
cols_req=['Sex', 'Age', 'SibSp', 'Parch', 'Cabin_cnt', 'Pclass_1',
'Pclass_2', 'Pclass_3', 'Cabin_fl_A', 'Cabin_fl_B', 'Cabin_fl_C',
'Cabin_fl_D', 'Cabin_fl_E', 'Cabin_fl_F', 'Cabin_fl_G',
'Cabin_fl_T', 'Cabin_fl_Z','Embarked_C','Embarked_Q','Embarked_S']
model.fit(train_df_tmp[cols_req],train_df_tmp.Survived)
LassoCV(random_state=0)
coef = pd.Series(model.coef_, index = train_df_tmp[cols_req].columns)
imp_coef = coef.sort_values()
# filtering out all the attributes that have 0 coefficient
imp_coef[abs(imp_coef)>0].plot(kind = "barh", title="Fig 3: Non-zero LASSO coefficients", figsize=(15,8))
[Fig 3: horizontal bar chart of the non-zero LASSO coefficients]

From Figure 3, a horizontal bar plot, we can see that some features have no influence on the response variable Survived. These are the features derived from Cabin, so they are not used further in the process.
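To see the zeroing behaviour in isolation, here is a small sketch on synthetic data. It uses `Lasso` with a fixed `alpha` rather than `LassoCV`, and the data, seed, and penalty strength are all assumptions chosen for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_samples = 200

# Four synthetic features; only the first one drives the target,
# the other three are pure noise.
X = rng.normal(size=(n_samples, 4))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=n_samples)

# A fixed penalty strength keeps the example simple.
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

coefficients = lasso.coef_
# Indices of the features whose coefficient is not zero.
selected = np.flatnonzero(coefficients)
print(coefficients)
print(selected)
```

The noise features end up with a coefficient of exactly zero, while the informative one keeps a slightly shrunken weight; this is the property the bar plot relies on when it filters out zero coefficients.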
Another statistical tool, a correlation plot, can be used to observe relationships among explanatory and response variables.
Correlation Method
cols_req=['Survived','Sex', 'Age', 'SibSp', 'Parch', 'Cabin_cnt', 'Pclass_1',
'Pclass_2', 'Pclass_3', 'Cabin_fl_A', 'Cabin_fl_B', 'Cabin_fl_C',
'Cabin_fl_D', 'Cabin_fl_E', 'Cabin_fl_F', 'Cabin_fl_G',
'Cabin_fl_T', 'Cabin_fl_Z','Embarked_C','Embarked_Q','Embarked_S']
cor = train_df_tmp[cols_req].corr()
fig, ax = plt.subplots(figsize=(18, 12))
im = ax.imshow(cor, cmap='Wistia')
ax.set_title("Fig 4: Correlation among the required variables")
ax.set_xticks(np.arange(len(cols_req)))
ax.set_yticks(np.arange(len(cols_req)))
ax.set_xticklabels(cols_req)
ax.set_yticklabels(cols_req)
plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
rotation_mode="anchor")
for i in range(len(cols_req)):
    for j in range(len(cols_req)):
        text = ax.text(j, i, round(cor.iloc[i, j],2),
                       ha="center", va="center", color="w")
fig.tight_layout()
plt.show()
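With twenty-one variables the heatmap is dense, so it can help to pull out only the column for the response variable and rank it. A minimal sketch on made-up data (the values in `toy_df` are hypothetical and chosen only to keep the example small):

```python
import pandas as pd

# Made-up data; Sex is deliberately identical to Survived here.
toy_df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Sex": [0, 1, 1, 0, 1, 0],
    "Parch": [1, 0, 0, 1, 1, 0],
    "Age": [22.0, 38.0, 26.0, 35.0, 27.0, 54.0],
})

# Correlation of every feature with the target, excluding the target itself.
target_corr = toy_df.corr()["Survived"].drop("Survived")

# Rank by absolute value: strong negative correlations matter as much
# as strong positive ones.
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)
```

Note that correlation only captures linear, pairwise relationships, so it complements the LASSO result and does not replace it.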

Thus, after analyzing both graphs, we use the explanatory features Sex, Age, Parch, SibSp, Pclass_1, Pclass_3, Embarked_C, Embarked_Q, and Embarked_S.
Now that we have settled on the features to use, we train an SVM to do the classification.
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import cross_val_score
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
# All used features
features=['Sex','Age','Parch','SibSp','Pclass_1','Pclass_3', 'Embarked_C','Embarked_Q','Embarked_S']
# Populate test set for the missing columns
for x in features:
    if x not in test_df_tmp.columns.values:
        test_df_tmp[x]=0
X_train,y_train=train_df_tmp[features].to_numpy(),train_df_tmp["Survived"]
# Standardize the features, then fit the SVM
clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
clf.fit(X_train, y_train)
Pipeline(steps=[('standardscaler', StandardScaler()),
                ('svc', SVC(gamma='auto'))])
Model Validity
To check the validity of the model, I perform a cross-validation test. To test its prediction capability, we also create a confusion matrix.
Cross-Validation Scores
scores = cross_val_score(clf, X_train, y_train, cv=10)
scores
array([0.83333333, 0.83146067, 0.7752809 , 0.87640449, 0.86516854,
       0.82022472, 0.83146067, 0.76404494, 0.86516854, 0.86516854])
Confusion Matrix & Metrics
# Hold out part of the training data, refit, and evaluate on the held-out part
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, random_state=0)
clf.fit(X_train, y_train)
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)
[Figure: confusion matrix of the classifier on the held-out split]

func_vars = [('precision_score',precision_score),('recall_score',recall_score),('f1_score',f1_score),('accuracy_score',accuracy_score)]
y_pred = clf.predict(X_test)
for lab,fn in func_vars:
    print(f"The {lab} score for classifier is {round(fn(y_test,y_pred),4)}")
The precision_score score for classifier is 0.8143
The recall_score score for classifier is 0.6786
The f1_score score for classifier is 0.7403
The accuracy_score score for classifier is 0.8206
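The four scores are all simple ratios of the confusion-matrix cells, which the sketch below works through by hand. The counts are an assumption: the notebook does not list them, and they were picked because they reproduce the scores printed above for a held-out split of 223 rows.

```python
# Assumed confusion-matrix counts (not shown in the notebook); chosen
# because they reproduce the four scores reported above.
tp, fp, fn, tn = 57, 13, 27, 126

# Precision: of the passengers predicted to survive, how many did.
precision = tp / (tp + fp)
# Recall: of the passengers who survived, how many were found.
recall = tp / (tp + fn)
# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
# Accuracy: share of all predictions that were correct.
accuracy = (tp + tn) / (tp + fp + fn + tn)

for name, value in [("precision", precision), ("recall", recall),
                    ("f1", f1), ("accuracy", accuracy)]:
    print(f"{name}: {value:.4f}")
```

Seen this way, the lower recall comes from the false negatives: under these assumed counts, 27 survivors were predicted as non-survivors.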
Conclusion
Thus, the model is accurate enough to be used for prediction. It could be improved further by raising the recall score. We have also seen that model training is only about 20% of the work in machine learning; most of the effort goes into preprocessing. Many more advanced methods can be used to determine the useful features for a classifier. If you like the notebook, please do upvote.