Just as every programming language has its Hello World, Titanic: Machine Learning from Disaster has become the Hello World of machine learning. I am therefore trying my hand at this dataset to begin my journey into the field.
I intend to do exploratory data analysis and missing value imputation on the dataset, then build predictive models, validate them against the test data and make the submission. Let's begin the journey.
library(dplyr)
library(ggplot2)
library(ggthemes)
library(corrplot)
library(rpart)
library(randomForest)
library(pscl)
library(Deducer)
library(Amelia)
library(forcats)
library(Hmisc)
library(VIM)
library(rpart.plot)
train=read.csv("train.csv",header=TRUE,stringsAsFactors = FALSE)
test=read.csv("test.csv",header=TRUE,stringsAsFactors = FALSE)
We then use the summary function on the datasets to understand the variable types and missing values.
cat("There are ",nrow(train),"rows and",ncol(train),"columns in train dataset")
## There are 891 rows and 12 columns in train dataset
cat("There are",nrow(test),"rows and",ncol(train),"columns in test dataset")
## There are 418 rows and 12 columns in test dataset
summary(train)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
summary(test)
## PassengerId Pclass Name Sex
## Min. : 892.0 Min. :1.000 Length:418 Length:418
## 1st Qu.: 996.2 1st Qu.:1.000 Class :character Class :character
## Median :1100.5 Median :3.000 Mode :character Mode :character
## Mean :1100.5 Mean :2.266
## 3rd Qu.:1204.8 3rd Qu.:3.000
## Max. :1309.0 Max. :3.000
##
## Age SibSp Parch Ticket
## Min. : 0.17 Min. :0.0000 Min. :0.0000 Length:418
## 1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:0.0000 Class :character
## Median :27.00 Median :0.0000 Median :0.0000 Mode :character
## Mean :30.27 Mean :0.4474 Mean :0.3923
## 3rd Qu.:39.00 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :76.00 Max. :8.0000 Max. :9.0000
## NA's :86
## Fare Cabin Embarked
## Min. : 0.000 Length:418 Length:418
## 1st Qu.: 7.896 Class :character Class :character
## Median : 14.454 Mode :character Mode :character
## Mean : 35.627
## 3rd Qu.: 31.500
## Max. :512.329
## NA's :1
Train dataset: from the train summary it is clear that Age has 177 missing values. For a visually appealing view of the NA values, we use missmap from the Amelia package (see the package documentation for details).
missmap(train,main="Titanic Train Data-Missing Value Visualisation",col=c("red","green"),legend=FALSE)
missmap(test,main="Titanic Test Data-Missing Value Visualisation",col=c("red","green"),legend=FALSE)
From the missing value visualisation of the train data, we see that Age has the most missing values, whereas in the test dataset one Fare value is missing in addition to Age.
Now let us combine the train and test data to do some EDA.
titanic=full_join(train,test)
summary(titanic)
## PassengerId Survived Pclass Name
## Min. : 1 Min. :0.0000 Min. :1.000 Length:1309
## 1st Qu.: 328 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median : 655 Median :0.0000 Median :3.000 Mode :character
## Mean : 655 Mean :0.3838 Mean :2.295
## 3rd Qu.: 982 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :1309 Max. :1.0000 Max. :3.000
## NA's :418
## Sex Age SibSp Parch
## Length:1309 Min. : 0.17 Min. :0.0000 Min. :0.000
## Class :character 1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:0.000
## Mode :character Median :28.00 Median :0.0000 Median :0.000
## Mean :29.88 Mean :0.4989 Mean :0.385
## 3rd Qu.:39.00 3rd Qu.:1.0000 3rd Qu.:0.000
## Max. :80.00 Max. :8.0000 Max. :9.000
## NA's :263
## Ticket Fare Cabin
## Length:1309 Min. : 0.000 Length:1309
## Class :character 1st Qu.: 7.896 Class :character
## Mode :character Median : 14.454 Mode :character
## Mean : 33.295
## 3rd Qu.: 31.275
## Max. :512.329
## NA's :1
## Embarked
## Length:1309
## Class :character
## Mode :character
##
##
##
##
Let us first focus on missing value imputation and do some visual exploration after that. We first consider Age.
prop.table(table(is.na(titanic$Age)))
##
## FALSE TRUE
## 0.7990833 0.2009167
About 20% of the values in the Age column are missing, so we use rpart (recursive partitioning) to impute them from the other variables.
age=rpart(Age ~Pclass+Sex+SibSp+Parch+Fare+Embarked,data=titanic[!(is.na(titanic$Age)),],method="anova")
titanic$Age[is.na(titanic$Age)]=predict(age,titanic[is.na(titanic$Age),])
Let us check:
prop.table(table(is.na(titanic$Age)))
##
## FALSE
## 1
The column no longer has any missing values.
ggplot(titanic,aes(Age))+geom_density(fill="green",alpha=0.4)+labs(x="Age",y="Density",title="Distribution of Age after imputation")
cat("There is",sum(is.na(titanic$Fare)),"missing value in Fare")
## There is 1 missing value in Fare
We find the row and inspect its details.
which(is.na(titanic$Fare))
## [1] 1044
titanic[1044,]
## PassengerId Survived Pclass Name Sex Age SibSp Parch
## 1044 1044 NA 3 Storey, Mr. Thomas male 60.5 0 0
## Ticket Fare Cabin Embarked
## 1044 3701 NA S
The passenger is male, travelled in 3rd class and embarked at S. We again use rpart for the imputation.
fare=rpart(Fare ~Parch+SibSp+Sex+Pclass,data=titanic[!(is.na(titanic$Fare)),],method="anova")
titanic$Fare[is.na(titanic$Fare)]=predict(fare,newdata=titanic[is.na(titanic$Fare),])
rpart.plot(fare,shadow.col="pink",box.col="gray",split.col="magenta",main="Decision Tree for Imputation")
From the decision tree we interpret that passengers in 2nd or 3rd class paid less than those in 1st class, and those travelling with parents or children paid more than those who travelled alone.
prop.table(table(is.na(titanic$Fare)))
##
## FALSE
## 1
We plot a density plot to understand the fare distribution.
ggplot(titanic,aes(Fare))+geom_density(fill="green",alpha=0.4)+labs(x="Fare",y="Density",title="Distribution of Fare after imputation")
Clearly we see that the distribution is highly right-skewed; a log1p axis (sketched below) makes the bulk of the fares easier to inspect.
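Since Fare is heavily right-skewed, a log1p scale spreads the distribution out. This is a sketch only; log1p is used because a plain log scale would drop the zero fares.
# Sketch: fare density on a log1p axis; zero fares are retained
ggplot(titanic,aes(Fare))+geom_density(fill="green",alpha=0.4)+scale_x_continuous(trans="log1p")+labs(x="Fare (log1p scale)",y="Density",title="Fare distribution on a log1p axis")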
# Data Wrangling and Visualisation
We now focus on Name, Sex, SibSp, Parch and Pclass for some data wrangling and visualisation. We start with Name.
str(titanic$Name)
## chr [1:1309] "Braund, Mr. Owen Harris" ...
For a better understanding of the data we extract the title from Name: the regular expression below removes everything up to the comma and space, and the period and everything after it, leaving just the title.
titanic$Title=gsub('(.*, )|(\\..*)','',titanic$Name)
head(titanic$Title)
## [1] "Mr" "Mrs" "Miss" "Mrs" "Mr" "Mr"
table(titanic$Title,titanic$Sex)
##
## female male
## Capt 0 1
## Col 0 4
## Don 0 1
## Dona 1 0
## Dr 1 7
## Jonkheer 0 1
## Lady 1 0
## Major 0 2
## Master 0 61
## Miss 260 0
## Mlle 2 0
## Mme 1 0
## Mr 0 757
## Mrs 197 0
## Ms 2 0
## Rev 0 8
## Sir 0 1
## the Countess 1 0
We convert the variable into a factor and, using fct_collapse from the forcats package, collapse some of these levels. The following code is inspired by Andrew Kinsman.
titanic <- titanic %>% mutate(Title = factor(Title)) %>% mutate(Title = fct_collapse(Title, "Miss" = c("Mlle", "Ms"), "Mrs" = "Mme", "Ranked" = c( "Major", "Dr", "Capt", "Col", "Rev"),"Royalty" = c("Lady", "Dona", "the Countess", "Don", "Sir", "Jonkheer")))
str(titanic$Title)
## Factor w/ 6 levels "Ranked","Royalty",..: 6 5 4 5 6 6 6 3 5 5 ...
Next we create a column called Families to distinguish passengers who travelled alone from those who travelled with family. This is a simple ifelse: if a person travelled with parents/children or siblings/spouse (i.e. SibSp + Parch + 1 > 1), the value is "Yes", otherwise "No".
titanic$Families=factor(ifelse(titanic$SibSp+titanic$Parch+1 > 1,"Yes","No"))
prop.table(table(titanic$Families))
##
## No Yes
## 0.6035141 0.3964859
Almost 40% of the passengers travelled with family.
Now let us look at the passenger class.
prop.table(table(titanic$Pclass))
##
## 1 2 3
## 0.2467532 0.2116119 0.5416348
About 54% travelled in 3rd class whereas about 25% travelled in 1st class. Let us compare the survival rates.
titanic=titanic %>% mutate(Survived=factor(Survived)) %>% mutate(Survived=fct_recode(Survived,"No"="0","Yes"="1"))
# train=titanic[1:891,]
# test=titanic[892:1309,]
prop.table(table(train$Survived))
##
## 0 1
## 0.6161616 0.3838384
In the train data only 38% survived whereas 62% perished. Let's dig deeper and see the trend.
ggplot(titanic[1:891,],aes(Sex,fill=Survived))+geom_bar(position="fill")+theme_fivethirtyeight()+theme(legend.position="bottom",plot.title=element_text(size=15,hjust=0.5))+labs(x="Gender",y="Survival Rate",title="Survival by Gender")
On board the Titanic, the chance of survival for a male was very low whereas for a female it was substantially higher, indicating a gender disparity in how lives were saved. Let us look at passenger class.
str(titanic$Pclass)
## int [1:1309] 3 1 3 1 3 3 1 3 3 2 ...
ggplot(titanic[1:891,],aes(Pclass,fill=Survived))+geom_bar(position="fill")+facet_wrap(~Sex)+theme_fivethirtyeight()+theme(legend.position="bottom",plot.title=element_text(size=15,hjust=0.5))+labs(x="Passenger Class",y="Survival Rate",title="Survival by Passenger Class Vs Gender")
From the two plots we see that, irrespective of gender, the chance of survival was higher in 1st class. A female 1st class passenger had a substantially better chance of survival, while males in 2nd and 3rd class fared worst.
Let us now look at the survival rate by Title.
ggplot(titanic[1:891,],aes(Title,fill=Survived))+geom_bar(position="fill")+facet_wrap(Pclass~Sex)+theme_few()+theme(legend.position="bottom",plot.title=element_text(size=15,hjust=0.5),plot.subtitle = element_text(size=10),axis.text.x=element_text(angle=90))+labs(x="Title",y="Survival Rate",title="Survival by Title Vs Gender",subtitle="Visualizing by Passenger Class")
ggplot(titanic[1:891,],aes(Embarked,fill=Survived))+geom_bar(position="fill")+theme_few()+theme(legend.position="bottom",plot.title=element_text(size=15,hjust=0.5))+labs(x="Embarked",y="Survival Rate",title="Survival by Embarkment")
There are rows with an empty value for embarkment; let us assign them to the majority class (the most common port).
titanic =titanic %>% mutate(Embarked=ifelse(Embarked=="",names(which.max(table(titanic$Embarked))),Embarked))
ggplot(titanic[1:891,],aes(Embarked,fill=Survived))+geom_bar(position="fill")+theme_few()+theme(legend.position="bottom",plot.title=element_text(size=15,hjust=0.5))+labs(x="Embarked",y="Survival Rate",title="Survival by Embarkment")+facet_wrap(Pclass~Sex,scales="free")
Let us now focus on the survival rates with respect to family.
ggplot(titanic[1:891,],aes(Families,fill=Survived))+geom_bar(position="fill")+theme_few()+theme(legend.position="bottom",plot.title=element_text(size=15,hjust=0.5))+labs(x="With Family or Not",y="Survival Rate",title="Chance of Survival if travelled with Family")
Let us draw a boxplot to understand the median age of survival with respect to gender.
ggplot(titanic[1:891,],aes(Survived,Age,fill=Sex))+geom_boxplot()+theme(legend.position="bottom",plot.title=element_text(size=15,hjust=0.5))+labs(x="Survived",y="Age",title="Median age of Survival")
str(titanic$Cabin)
## chr [1:1309] "" "C85" "" "C123" "" "" "E46" "" "" "" "G6" "C103" "" ...
We extract the first character of the Cabin variable as the deck and visualise the survival rate.
titanic$Deck=factor(sapply(titanic$Cabin, function(x) strsplit(x, NULL)[[1]][1]))
str(titanic$Deck)
## Factor w/ 8 levels "A","B","C","D",..: NA 3 NA 3 NA NA 5 NA NA NA ...
table(is.na(titanic$Deck)) # 1014 missing values
##
## FALSE TRUE
## 295 1014
round(prop.table(table(titanic$Deck,titanic$Survived))*100,2)
##
## No Yes
## A 3.92 3.43
## B 5.88 17.16
## C 11.76 17.16
## D 3.92 12.25
## E 3.92 11.76
## F 2.45 3.92
## G 0.98 0.98
## T 0.49 0.00
Decks B, D and E (and to a lesser extent C) show higher survival, whereas decks F, G and T show lower survival.
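Note that the table above shows joint percentages over all passengers with a known deck; per-deck survival rates are easier to read and can be obtained with margin = 1 (a sketch):
# Sketch: row percentages, i.e. survival rate within each deck
round(prop.table(table(titanic$Deck,titanic$Survived),margin=1)*100,2)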
We impute the missing Deck values using the Hmisc package. Other packages, such as Amelia, missForest, mice and mi, are also available for handling missing values. Hmisc offers two powerful imputation functions: impute() and aregImpute().
The impute() function fills in missing values using a user-defined statistical method (mean, max, random draw, etc.); its default is the median. aregImpute(), on the other hand, performs multiple imputation using additive regression, bootstrapping and predictive mean matching.
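For comparison, here is a minimal sketch of aregImpute (not run in this analysis; the formula and variable choice are illustrative only):
# Sketch only: multiple imputation of Age with aregImpute (illustrative)
# areg=aregImpute(~ Age + Pclass + SibSp + Parch + Fare, data=titanic, n.impute=5)
# completed=impute.transcan(areg, imputation=1, data=titanic, list.out=TRUE, pr=FALSE)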
set.seed(100)
titanic$Deck=with(titanic,impute(Deck,'random'))
ggplot(titanic[1:891,],aes(Deck,fill=Survived))+geom_bar(position="fill")+theme(legend.position="bottom",plot.title=element_text(size=15,hjust=0.5))+labs(x="Deck",y="Survival Rate",title="Survival by Deck")
We create a separate column called FamilySize and group it into a family Type, to see whether the chances of survival differ by family size.
titanic=titanic %>% mutate(FamilySize=SibSp+Parch+1) %>% mutate(Type=ifelse(FamilySize==1,"Single",ifelse(FamilySize>=3,"Large","2 People")))
ggplot(titanic[1:891,],aes(Type,fill=Survived))+geom_bar(position="fill")+theme(legend.position="bottom",plot.title=element_text(size=15,hjust=0.5))+labs(x="FamilyType",y="Survival Rate",title="Survival by FamilySize")
The survival rate is highest for two-person families whereas for singletons the survival rate is low.
Now let us focus on the modelling part. I intend to use a random forest model. Let us first verify that all relevant missing values have been imputed.
aggr(titanic,prop=FALSE,combined=TRUE,sortVars=TRUE,sortCombs=TRUE,numbers=TRUE)
##
## Variables sorted by number of missings:
## Variable Count
## Survived 418
## PassengerId 0
## Pclass 0
## Name 0
## Sex 0
## Age 0
## SibSp 0
## Parch 0
## Ticket 0
## Fare 0
## Cabin 0
## Embarked 0
## Title 0
## Families 0
## Deck 0
## FamilySize 0
## Type 0
Thus we see there are no missing values left in the training data; the 418 missing Survived values belong to the test data and are what we need to predict. Let us split the combined data back into train and test datasets.
We convert the character variables into factors.
titanic = titanic %>% mutate(Type=factor(Type)) %>% mutate(Embarked=factor(Embarked)) %>% mutate(Sex=factor(Sex))
train=titanic[1:891,]
test=titanic[892:1309,]
names(train)
## [1] "PassengerId" "Survived" "Pclass" "Name" "Sex"
## [6] "Age" "SibSp" "Parch" "Ticket" "Fare"
## [11] "Cabin" "Embarked" "Title" "Families" "Deck"
## [16] "FamilySize" "Type"
str(train)
## 'data.frame': 891 obs. of 17 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
## $ Title : Factor w/ 6 levels "Ranked","Royalty",..: 6 5 4 5 6 6 6 3 5 5 ...
## $ Families : Factor w/ 2 levels "No","Yes": 2 2 1 2 1 1 1 2 2 2 ...
## $ Deck : Factor w/ 8 levels "A","B","C","D",..: 7 3 3 3 3 3 5 6 1 3 ...
## ..- attr(*, "imputed")= int 1 3 5 6 8 9 10 13 14 15 ...
## $ FamilySize : num 2 2 1 2 1 1 1 5 3 2 ...
## $ Type : Factor w/ 3 levels "2 People","Large",..: 1 1 3 1 3 3 3 2 2 1 ...
rfmodel=randomForest(factor(Survived) ~ Pclass+Sex+Age+Fare+Embarked+Title+Deck+FamilySize+Type+SibSp+Parch,data=train,importance=TRUE)
print(rfmodel)
##
## Call:
## randomForest(formula = factor(Survived) ~ Pclass + Sex + Age + Fare + Embarked + Title + Deck + FamilySize + Type + SibSp + Parch, data = train, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 17.4%
## Confusion matrix:
## No Yes class.error
## No 496 53 0.09653916
## Yes 102 240 0.29824561
The confusion matrix shows a class error of about 10% for non-survivors and about 30% for survivors, giving an OOB error estimate of 17.4%.
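As a quick check, the implied OOB accuracy follows directly from the confusion matrix:
# (496 + 240) correct out of 891 passengers
(496+240)/(496+53+102+240)
## [1] 0.8260382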
plot(rfmodel, main="")
legend("topright", c("OOB", "0", "1"), text.col=1:6, lty=1:3, col=1:3)
title(main="Error Rates Random Forest")
The plot shows that the OOB error flattens out somewhere between 0 and 100 trees; we can confirm this numerically from the per-tree error matrix that randomForest stores (sketched below). Let us then check the variable importance.
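A sketch of that numeric check, using the err.rate matrix kept in the fitted object:
# Per-tree OOB error; find where it is lowest
oob=rfmodel$err.rate[,"OOB"]
which.min(oob) # tree count with minimum OOB error
min(oob)       # the minimum OOB error itself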
varImpPlot(rfmodel)
The mean decrease in accuracy measures how much the model's accuracy drops when a variable's values are randomly permuted: the larger the decrease, the more important the variable.
Let us tune our random forest with tuneRF.
variable=c("Pclass","Sex","Age","Fare","Embarked","Title","Deck","FamilySize","Type","SibSp","Parch")
tunedrfmodel=tuneRF(x=train[,variable],y=as.factor(train$Survived),mtryStart = 3,ntreeTry = 100,stepFactor = 2,improve=0.001,trace=FALSE,plot=FALSE,doBest = TRUE,nodesize=200,importance=TRUE)
## -0.04597701 0.001
## 0.05172414 0.001
## -0.1818182 0.001
varImpPlot(tunedrfmodel)
From the tuned model's importance plot, we see that Title and Sex are the most important variables for our prediction.
Let us evaluate the model on the train data (using its out-of-bag predictions).
trainpredict=table(predict(tunedrfmodel),train$Survived)
caret::confusionMatrix(trainpredict)
## Confusion Matrix and Statistics
##
##
## No Yes
## No 486 92
## Yes 63 250
##
## Accuracy : 0.826
## 95% CI : (0.7995, 0.8504)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.6263
## Mcnemar's Test P-Value : 0.02451
##
## Sensitivity : 0.8852
## Specificity : 0.7310
## Pos Pred Value : 0.8408
## Neg Pred Value : 0.7987
## Prevalence : 0.6162
## Detection Rate : 0.5455
## Detection Prevalence : 0.6487
## Balanced Accuracy : 0.8081
##
## 'Positive' Class : No
##
The accuracy is about 83%. Let us now predict on the test data.
test$Survived=NULL
titanicpred=predict(tunedrfmodel,test,OOB=TRUE,type="response")
titanicpred=ifelse(titanicpred=="No",0,1)
solution=data.frame(PassengerID=test$PassengerId,Survived=titanicpred)
write.csv(solution,file="submission.csv",row.names=F)
library(lmtest)
logit=glm(factor(Survived) ~ Pclass+Sex+Age+Fare+Embarked+Title+Deck+FamilySize+Type+SibSp+Parch,data=train,family=binomial)
lrtest(logit)
## Likelihood ratio test
##
## Model 1: factor(Survived) ~ Pclass + Sex + Age + Fare + Embarked + Title +
## Deck + FamilySize + Type + SibSp + Parch
## Model 2: factor(Survived) ~ 1
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 23 -356.03
## 2 1 -593.33 -22 474.59 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#summary(logit)
pR2(logit)
## llh llhNull G2 McFadden r2ML
## -356.0312578 -593.3275684 474.5926212 0.3999415 0.4129537
## r2CU
## 0.5610749
The p-value is highly significant, so the null hypothesis is rejected: the likelihood of survival depends on the factors encoded in the formula.
McFadden's R2 value is 0.40, which means that about 40% of the uncertainty of the intercept-only model is explained by the full model.
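As a sanity check, both the likelihood-ratio statistic and McFadden's R2 can be recomputed by hand from the two log-likelihoods reported above:
# G2 = 2*(llh - llhNull); McFadden R2 = 1 - llh/llhNull
2*(-356.0313-(-593.3276)) # 474.5926, matches lrtest and pR2
1-(-356.0313)/(-593.3276) # 0.3999, matches McFadden in pR2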
predlogit=predict(logit,type="response")
gg1=floor(predlogit+0.50) # round predicted probabilities at the 0.5 threshold
table(Actual=train$Survived,Prediction=gg1)
## Prediction
## Actual 0 1
## No 489 60
## Yes 90 252
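From this table, the training accuracy of the logistic model is:
# (489 + 252) correctly classified out of 891
(489+252)/891
## [1] 0.8316498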
rocplot(logit)
Using the test data:
logittest=predict(logit,type="response",newdata=test)
titanicpred=ifelse(logittest>0.5,1,0) # classify at the 0.5 probability threshold
solution=data.frame(PassengerID=test$PassengerId,Survived=titanicpred)
write.csv(solution,file="submission1.csv",row.names=F)
This well-known dataset in the Kaggle community lets us look at data from many angles, teaching us how to visualise data, impute missing values, model with various packages and assess model accuracy.
Overall, this dataset is a good introduction to machine learning algorithms and their interpretation.