Just as every programming language has its Hello World, Titanic: Machine Learning from Disaster has become the Hello World of machine learning. I am therefore trying my hand at this dataset to begin my journey into the field.
I intend to do exploratory data analysis and missing value imputation on the dataset, then build predictive models, validate them against the test data and make the submission. Let's begin the journey.
library(dplyr)
library(ggplot2)
library(ggthemes)
library(corrplot)
library(rpart)
library(randomForest)
library(pscl)
library(Deducer)
library(Amelia)
library(forcats)
library(Hmisc)
library(VIM)
library(rpart.plot)
train=read.csv("train.csv",header=TRUE,stringsAsFactors = FALSE)
test=read.csv("test.csv",header=TRUE,stringsAsFactors = FALSE)
We then use the summary function on the datasets to understand the variable types and missing values.
cat("There are ",nrow(train),"rows and",ncol(train),"columns in train dataset")
## There are 891 rows and 12 columns in train dataset
cat("There are",nrow(test),"rows and",ncol(train),"columns in test dataset")
## There are 418 rows and 12 columns in test dataset
summary(train)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
summary(test)
## PassengerId Pclass Name Sex
## Min. : 892.0 Min. :1.000 Length:418 Length:418
## 1st Qu.: 996.2 1st Qu.:1.000 Class :character Class :character
## Median :1100.5 Median :3.000 Mode :character Mode :character
## Mean :1100.5 Mean :2.266
## 3rd Qu.:1204.8 3rd Qu.:3.000
## Max. :1309.0 Max. :3.000
##
## Age SibSp Parch Ticket
## Min. : 0.17 Min. :0.0000 Min. :0.0000 Length:418
## 1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:0.0000 Class :character
## Median :27.00 Median :0.0000 Median :0.0000 Mode :character
## Mean :30.27 Mean :0.4474 Mean :0.3923
## 3rd Qu.:39.00 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :76.00 Max. :8.0000 Max. :9.0000
## NA's :86
## Fare Cabin Embarked
## Min. : 0.000 Length:418 Length:418
## 1st Qu.: 7.896 Class :character Class :character
## Median : 14.454 Mode :character Mode :character
## Mean : 35.627
## 3rd Qu.: 31.500
## Max. :512.329
## NA's :1
Train dataset: from the train summary it is clear that Age has 177 missing values. For a visually appealing view of the NA values, we use missmap from the Amelia package (see the package documentation for details).
missmap(train,main="Titanic Train Data-Missing Value Visualisation",col=c("red","green"),legend=FALSE)
missmap(test,main="Titanic Test Data-Missing Value Visualisation",col=c("red","green"),legend=FALSE)
From the missing value visualisation of the train data, we see that Age has the most missing values, whereas in the test dataset one Fare value is missing in addition to Age.
Now let us combine the train and test data to do some EDA.
titanic=full_join(train,test)
summary(titanic)
## PassengerId Survived Pclass Name
## Min. : 1 Min. :0.0000 Min. :1.000 Length:1309
## 1st Qu.: 328 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median : 655 Median :0.0000 Median :3.000 Mode :character
## Mean : 655 Mean :0.3838 Mean :2.295
## 3rd Qu.: 982 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :1309 Max. :1.0000 Max. :3.000
## NA's :418
## Sex Age SibSp Parch
## Length:1309 Min. : 0.17 Min. :0.0000 Min. :0.000
## Class :character 1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:0.000
## Mode :character Median :28.00 Median :0.0000 Median :0.000
## Mean :29.88 Mean :0.4989 Mean :0.385
## 3rd Qu.:39.00 3rd Qu.:1.0000 3rd Qu.:0.000
## Max. :80.00 Max. :8.0000 Max. :9.000
## NA's :263
## Ticket Fare Cabin
## Length:1309 Min. : 0.000 Length:1309
## Class :character 1st Qu.: 7.896 Class :character
## Mode :character Median : 14.454 Mode :character
## Mean : 33.295
## 3rd Qu.: 31.275
## Max. :512.329
## NA's :1
## Embarked
## Length:1309
## Class :character
## Mode :character
##
##
##
##
Let us first focus on missing value imputation and do some visual exploration after that. We first consider Age.
prop.table(table(is.na(titanic$Age)))
##
## FALSE TRUE
## 0.7990833 0.2009167
About 20% of the values in the Age column are missing, so we use rpart (recursive partitioning) to impute them from the other variables.
age=rpart(Age ~Pclass+Sex+SibSp+Parch+Fare+Embarked,data=titanic[!(is.na(titanic$Age)),],method="anova")
titanic$Age[is.na(titanic$Age)]=predict(age,titanic[is.na(titanic$Age),])
Let us check:
prop.table(table(is.na(titanic$Age)))
##
## FALSE
## 1
The column no longer has any missing values.
ggplot(titanic,aes(Age))+geom_density(fill="green",alpha=0.4)+labs(x="Age",y="Density",title="Distribution of Age after imputation")
cat("There is",sum(is.na(titanic$Fare)),"missing value in Fare")
## There is 1 missing value in Fare
We find the row and inspect its details.
which(is.na(titanic$Fare))
## [1] 1044
titanic[1044,]
## PassengerId Survived Pclass Name Sex Age SibSp Parch
## 1044 1044 NA 3 Storey, Mr. Thomas male 60.5 0 0
## Ticket Fare Cabin Embarked
## 1044 3701 NA S
The passenger is male, travelled in 3rd class and embarked at S. We again use rpart for the imputation.
fare=rpart(Fare ~Parch+SibSp+Sex+Pclass,data=titanic[!(is.na(titanic$Fare)),],method="anova")
titanic$Fare[is.na(titanic$Fare)]=predict(fare,newdata=titanic[is.na(titanic$Fare),])
rpart.plot(fare,shadow.col="pink",box.col="gray",split.col="magenta",main="Decision Tree for Imputation")
From the decision tree we interpret that passengers in 2nd or 3rd class paid less than those in 1st class, and those travelling with parents or children paid more than those who travelled alone.
prop.table(table(is.na(titanic$Fare)))
##
## FALSE
## 1
We plot a density plot to understand the fare distribution.
ggplot(titanic,aes(Fare))+geom_density(fill="green",alpha=0.4)+labs(x="Fare",y="Density",title="Distribution of Fare after imputation")
Clearly we see that the distribution is highly right-skewed; a log1p axis (sketched below) makes the bulk of the fares easier to inspect.
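Since Fare is heavily right-skewed, a log1p scale spreads the distribution out. This is a sketch only; log1p is used because a plain log scale would drop the zero fares.
# Sketch: fare density on a log1p axis; zero fares are retained
ggplot(titanic,aes(Fare))+geom_density(fill="green",alpha=0.4)+scale_x_continuous(trans="log1p")+labs(x="Fare (log1p scale)",y="Density",title="Fare distribution on a log1p axis")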
# Data Wrangling and Visualisation
We now focus on Name, Sex, SibSp, Parch and Pclass for some data wrangling and visualisation. We start with Name.
str(titanic$Name)
## chr [1:1309] "Braund, Mr. Owen Harris" ...
For a better understanding of the data we extract the title from Name: the regular expression below removes everything up to the comma and space, and the period and everything after it, leaving just the title.
titanic$Title=gsub('(.*, )|(\\..*)','',titanic$Name)
head(titanic$Title)
## [1] "Mr" "Mrs" "Miss" "Mrs" "Mr" "Mr"
table(titanic$Title,titanic$Sex)
##
## female male
## Capt 0 1
## Col 0 4
## Don 0 1
## Dona 1 0
## Dr 1 7
## Jonkheer 0 1
## Lady 1 0
## Major 0 2
## Master 0 61
## Miss 260 0
## Mlle 2 0
## Mme 1 0
## Mr 0 757
## Mrs 197 0
## Ms 2 0
## Rev 0 8
## Sir 0 1
## the Countess 1 0
We convert the variable into a factor and, using fct_collapse from the forcats package, collapse some of these levels. The following code is inspired by Andrew Kinsman.
titanic <- titanic %>% mutate(Title = factor(Title)) %>% mutate(Title = fct_collapse(Title, "Miss" = c("Mlle", "Ms"), "Mrs" = "Mme", "Ranked" = c( "Major", "Dr", "Capt", "Col", "Rev"),"Royalty" = c("Lady", "Dona", "the Countess", "Don", "Sir", "Jonkheer")))
str(titanic$Title)
## Factor w/ 6 levels "Ranked","Royalty",..: 6 5 4 5 6 6 6 3 5 5 ...
Next we create a column called Families to distinguish passengers who travelled alone from those who travelled with family. This is a simple ifelse: if a person travelled with parents/children or siblings/spouse (i.e. SibSp + Parch + 1 > 1), the value is "Yes", otherwise "No".
titanic$Families=factor(ifelse(titanic$SibSp+titanic$Parch+1 > 1,"Yes","No"))
prop.table(table(titanic$Families))
##
## No Yes
## 0.6035141 0.3964859
Almost 40% of the passengers travelled with family.
Now let us look at the passenger class.
prop.table(table(titanic$Pclass))
##
## 1 2 3
## 0.2467532 0.2116119 0.5416348
About 54% travelled in 3rd class whereas about 25% travelled in 1st class. Let us compare the survival rates.
titanic=titanic %>% mutate(Survived=factor(Survived)) %>% mutate(Survived=fct_recode(Survived,"No"="0","Yes"="1"))
# train=titanic[1:891,]
# test=titanic[892:1309,]
prop.table(table(train$Survived))
##
## 0 1
## 0.6161616 0.3838384
In the train data only 38% survived whereas 62% perished. Let's dig deeper and see the trend.
ggplot(titanic[1:891,],aes(Sex,fill=Survived))+geom_bar(position="fill")+theme_fivethirtyeight()+theme(legend.position="bottom",plot.title=element_text(size=15,hjust=0.5))+labs(x="Gender",y="Survival Rate",title="Survival by Gender")
On board the Titanic, the chance of survival for a male was very low whereas for a female it was substantially higher, indicating a gender disparity in how lives were saved. Let us look at passenger class.
str(titanic$Pclass)
## int [1:1309] 3 1 3 1 3 3 1 3 3 2 ...
ggplot(titanic[1:891,],aes(Pclass,fill=Survived))+geom_bar(position="fill")+facet_wrap(~Sex)+theme_fivethirtyeight()+theme(legend.position="bottom",plot.title=element_text(size=15,hjust=0.5))+labs(x="Passenger Class",y="Survival Rate",title="Survival by Passenger Class Vs Gender")
From the two plots we see that, irrespective of gender, the chance of survival was higher in 1st class. A female 1st class passenger had a substantially better chance of survival, while males in 2nd and 3rd class fared worst.
Let us now look at the survival rate by Title.
ggplot(titanic[1:891,],aes(Title,fill=Survived))+geom_bar(position="fill")+facet_wrap(Pclass~Sex)+theme_few()+theme(legend.position="bottom",plot.title=element_text(size=15,hjust=0.5),plot.subtitle = element_text(size=10),axis.text.x=element_text(angle=90))+labs(x="Title",y="Survival Rate",title="Survival by Title Vs Gender",subtitle="Visualizing by Passenger Class")
ggplot(titanic[1:891,],aes(Embarked,fill=Survived))+geom_bar(position="fill")+theme_few()+theme(legend.position="bottom",plot.title=element_text(size=15,hjust=0.5))+labs(x="Embarked",y="Survival Rate",title="Survival by Embarkment")
There are rows with an empty value for embarkment; let us assign them to the majority class (the most common port).
titanic =titanic %>% mutate(Embarked=ifelse(Embarked=="",names(which.max(table(titanic$Embarked))),Embarked))
ggplot(titanic[1:891,],aes(Embarked,fill=Survived))+geom_bar(position="fill")+theme_few()+theme(legend.position="bottom",plot.title=element_text(size=15,hjust=0.5))+labs(x="Embarked",y="Survival Rate",title="Survival by Embarkment")+facet_wrap(Pclass~Sex,scales="free")
Let us now focus on the survival rates with respect to family.
ggplot(titanic[1:891,],aes(Families,fill=Survived))+geom_bar(position="fill")+theme_few()+theme(legend.position="bottom",plot.title=element_text(size=15,hjust=0.5))+labs(x="With Family or Not",y="Survival Rate",title="Chance of Survival if travelled with Family")
Let us draw a boxplot to understand the median age of survival with respect to gender.
ggplot(titanic[1:891,],aes(Survived,Age,fill=Sex))+geom_boxplot()+theme(legend.position="bottom",plot.title=element_text(size=15,hjust=0.5))+labs(x="Survived",y="Age",title="Median age of Survival")
str(titanic$Cabin)
## chr [1:1309] "" "C85" "" "C123" "" "" "E46" "" "" "" "G6" "C103" "" ...
We extract the first character of the Cabin variable as the deck and visualise the survival rate.
titanic$Deck=factor(sapply(titanic$Cabin, function(x) strsplit(x, NULL)[[1]][1]))
str(titanic$Deck)
## Factor w/ 8 levels "A","B","C","D",..: NA 3 NA 3 NA NA 5 NA NA NA ...
table(is.na(titanic$Deck)) # 1014 missing values
##
## FALSE TRUE
## 295 1014
round(prop.table(table(titanic$Deck,titanic$Survived))*100,2)
##
## No Yes
## A 3.92 3.43
## B 5.88 17.16
## C 11.76 17.16
## D 3.92 12.25
## E 3.92 11.76
## F 2.45 3.92
## G 0.98 0.98
## T 0.49 0.00
Decks B, D and E (and to a lesser extent C) show higher survival, whereas decks F, G and T show lower survival.
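Note that the table above shows joint percentages over all passengers with a known deck; per-deck survival rates are easier to read and can be obtained with margin = 1 (a sketch):
# Sketch: row percentages, i.e. survival rate within each deck
round(prop.table(table(titanic$Deck,titanic$Survived),margin=1)*100,2)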
We impute the missing Deck values using the Hmisc package. Other packages, such as Amelia, missForest, mice and mi, are also available for handling missing values. Hmisc offers two powerful imputation functions: impute() and aregImpute().
The impute() function fills in missing values using a user-defined statistical method (mean, max, random draw, etc.); its default is the median. aregImpute(), on the other hand, performs multiple imputation using additive regression, bootstrapping and predictive mean matching.
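For comparison, here is a minimal sketch of aregImpute (not run in this analysis; the formula and variable choice are illustrative only):
# Sketch only: multiple imputation of Age with aregImpute (illustrative)
# areg=aregImpute(~ Age + Pclass + SibSp + Parch + Fare, data=titanic, n.impute=5)
# completed=impute.transcan(areg, imputation=1, data=titanic, list.out=TRUE, pr=FALSE)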
set.seed(100)
titanic$Deck=with(titanic,impute(Deck,'random'))
ggplot(titanic[1:891,],aes(Deck,fill=Survived))+geom_bar(position="fill")+theme(legend.position="bottom",plot.title=element_text(size=15,hjust=0.5))+labs(x="Deck",y="Survival Rate",title="Survival by Deck")
We create a separate column called FamilySize and group it into a family Type, to see whether the chances of survival differ by family size.
titanic=titanic %>% mutate(FamilySize=SibSp+Parch+1) %>% mutate(Type=ifelse(FamilySize==1,"Single",ifelse(FamilySize>=3,"Large","2 People")))
ggplot(titanic[1:891,],aes(Type,fill=Survived))+geom_bar(position="fill")+theme(legend.position="bottom",plot.title=element_text(size=15,hjust=0.5))+labs(x="FamilyType",y="Survival Rate",title="Survival by FamilySize")
The survival rate is highest for two-person families whereas for singletons the survival rate is low.
Now let us focus on the modelling part. I intend to use a random forest model. Let us first verify that all relevant missing values have been imputed.
aggr(titanic,prop=FALSE,combined=TRUE,sortVars=TRUE,sortCombs=TRUE,numbers=TRUE)
##
## Variables sorted by number of missings:
## Variable Count
## Survived 418
## PassengerId 0
## Pclass 0
## Name 0
## Sex 0
## Age 0
## SibSp 0
## Parch 0
## Ticket 0
## Fare 0
## Cabin 0
## Embarked 0
## Title 0
## Families 0
## Deck 0
## FamilySize 0
## Type 0
Thus we see there are no missing values left in the training data; the 418 missing Survived values belong to the test data and are what we need to predict. Let us split the combined data back into train and test datasets.
We convert the character variables into factors.
titanic = titanic %>% mutate(Type=factor(Type)) %>% mutate(Embarked=factor(Embarked)) %>% mutate(Sex=factor(Sex))
train=titanic[1:891,]
test=titanic[892:1309,]
names(train)
## [1] "PassengerId" "Survived" "Pclass" "Name" "Sex"
## [6] "Age" "SibSp" "Parch" "Ticket" "Fare"
## [11] "Cabin" "Embarked" "Title" "Families" "Deck"
## [16] "FamilySize" "Type"
str(train)
## 'data.frame': 891 obs. of 17 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "No","Yes": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
## $ Title : Factor w/ 6 levels "Ranked","Royalty",..: 6 5 4 5 6 6 6 3 5 5 ...
## $ Families : Factor w/ 2 levels "No","Yes": 2 2 1 2 1 1 1 2 2 2 ...
## $ Deck : Factor w/ 8 levels "A","B","C","D",..: 7 3 3 3 3 3 5 6 1 3 ...
## ..- attr(*, "imputed")= int 1 3 5 6 8 9 10 13 14 15 ...
## $ FamilySize : num 2 2 1 2 1 1 1 5 3 2 ...
## $ Type : Factor w/ 3 levels "2 People","Large",..: 1 1 3 1 3 3 3 2 2 1 ...
rfmodel=randomForest(factor(Survived) ~ Pclass+Sex+Age+Fare+Embarked+Title+Deck+FamilySize+Type+SibSp+Parch,data=train,importance=TRUE)
print(rfmodel)
##
## Call:
## randomForest(formula = factor(Survived) ~ Pclass + Sex + Age + Fare + Embarked + Title + Deck + FamilySize + Type + SibSp + Parch, data = train, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 17.4%
## Confusion matrix:
## No Yes class.error
## No 496 53 0.09653916
## Yes 102 240 0.29824561
The confusion matrix shows a class error of about 10% for non-survivors and about 30% for survivors, giving an OOB error estimate of 17.4%.
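As a quick check, the implied OOB accuracy follows directly from the confusion matrix:
# (496 + 240) correct out of 891 passengers
(496+240)/(496+53+102+240)
## [1] 0.8260382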
plot(rfmodel, main="")
legend("topright", c("OOB", "0", "1"), text.col=1:6, lty=1:3, col=1:3)
title(main="Error Rates Random Forest")
The plot shows that the OOB error flattens out somewhere between 0 and 100 trees; we can confirm this numerically from the per-tree error matrix that randomForest stores (sketched below). Let us then check the variable importance.
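A sketch of that numeric check, using the err.rate matrix kept in the fitted object:
# Per-tree OOB error; find where it is lowest
oob=rfmodel$err.rate[,"OOB"]
which.min(oob) # tree count with minimum OOB error
min(oob)       # the minimum OOB error itself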
varImpPlot(rfmodel)
The mean decrease in accuracy measures how much the model's accuracy drops when a variable's values are randomly permuted: the larger the decrease, the more important the variable.
Let us tune our random forest with tuneRF.
variable=c("Pclass","Sex","Age","Fare","Embarked","Title","Deck","FamilySize","Type","SibSp","Parch")
tunedrfmodel=tuneRF(x=train[,variable],y=as.factor(train$Survived),mtryStart = 3,ntreeTry = 100,stepFactor = 2,improve=0.001,trace=FALSE,plot=FALSE,doBest = TRUE,nodesize=200,importance=TRUE)
## -0.04597701 0.001
## 0.05172414 0.001
## -0.1818182 0.001
varImpPlot(tunedrfmodel)
From the tuned model's importance plot, we see that Title and Sex are the most important variables for our prediction.
Let us evaluate the model on the train data (using its out-of-bag predictions).
trainpredict=table(predict(tunedrfmodel),train$Survived)
caret::confusionMatrix(trainpredict)
## Confusion Matrix and Statistics
##
##
## No Yes
## No 486 92
## Yes 63 250
##
## Accuracy : 0.826
## 95% CI : (0.7995, 0.8504)
## No Information Rate : 0.6162
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.6263
## Mcnemar's Test P-Value : 0.02451
##
## Sensitivity : 0.8852
## Specificity : 0.7310
## Pos Pred Value : 0.8408
## Neg Pred Value : 0.7987
## Prevalence : 0.6162
## Detection Rate : 0.5455
## Detection Prevalence : 0.6487
## Balanced Accuracy : 0.8081
##
## 'Positive' Class : No
##
The accuracy is about 83%. Let us now predict on the test data.
test$Survived=NULL
titanicpred=predict(tunedrfmodel,test,OOB=TRUE,type="response")
titanicpred=ifelse(titanicpred=="No",0,1)
solution=data.frame(PassengerID=test$PassengerId,Survived=titanicpred)
write.csv(solution,file="submission.csv",row.names=F)
library(lmtest)
logit=glm(factor(Survived) ~ Pclass+Sex+Age+Fare+Embarked+Title+Deck+FamilySize+Type+SibSp+Parch,data=train,family=binomial)
lrtest(logit)
## Likelihood ratio test
##
## Model 1: factor(Survived) ~ Pclass + Sex + Age + Fare + Embarked + Title +
## Deck + FamilySize + Type + SibSp + Parch
## Model 2: factor(Survived) ~ 1
## #Df LogLik Df Chisq Pr(>Chisq)
## 1 23 -356.03
## 2 1 -593.33 -22 474.59 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#summary(logit)
pR2(logit)
## llh llhNull G2 McFadden r2ML
## -356.0312578 -593.3275684 474.5926212 0.3999415 0.4129537
## r2CU
## 0.5610749
The p-value is highly significant, so the null hypothesis is rejected: the likelihood of survival depends on the factors encoded in the formula.
McFadden's R2 value is 0.40, which means that about 40% of the uncertainty of the intercept-only model is explained by the full model.
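As a sanity check, both the likelihood-ratio statistic and McFadden's R2 can be recomputed by hand from the two log-likelihoods reported above:
# G2 = 2*(llh - llhNull); McFadden R2 = 1 - llh/llhNull
2*(-356.0313-(-593.3276)) # 474.5926, matches lrtest and pR2
1-(-356.0313)/(-593.3276) # 0.3999, matches McFadden in pR2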
predlogit=predict(logit,type="response")
gg1=floor(predlogit+0.50) # round predicted probabilities at the 0.5 threshold
table(Actual=train$Survived,Prediction=gg1)
## Prediction
## Actual 0 1
## No 489 60
## Yes 90 252
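From this table, the training accuracy of the logistic model is:
# (489 + 252) correctly classified out of 891
(489+252)/891
## [1] 0.8316498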
rocplot(logit)
Using the test data:
logittest=predict(logit,type="response",newdata=test)
titanicpred=ifelse(logittest>0.5,1,0) # classify at the 0.5 probability threshold
solution=data.frame(PassengerID=test$PassengerId,Survived=titanicpred)
write.csv(solution,file="submission1.csv",row.names=F)
This well-known dataset in the Kaggle community lets us look at data from many angles, teaching us how to visualise data, impute missing values, model with various packages and assess model accuracy.
Overall, this dataset is a good introduction to machine learning algorithms and their interpretation.