Load graphlab


import graphlab


Load the data


passengers = graphlab.SFrame('train.csv')

PROGRESS: Finished parsing file /Users/vishnu/git/hadoop/ipython/train.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.020899 secs.
------------------------------------------------------
Inferred types from first line of file as
column_type_hints=[int,int,int,str,str,float,int,int,str,float,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Finished parsing file /Users/vishnu/git/hadoop/ipython/train.csv
PROGRESS: Parsing completed. Parsed 891 lines in 0.010159 secs.


Analyze


graphlab.canvas.set_target('ipynb')
passengers.show()


Preprocess


The Age column has null values; fill them with the mean age

passengers = passengers.fillna("Age",passengers["Age"].mean())
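
To make the imputation step concrete, here is a minimal plain-Python sketch of what filling with the mean amounts to. It does not use GraphLab: fill_missing_with_mean and the sample ages are hypothetical, and it assumes the mean is taken over the non-missing values only.

```python
def fill_missing_with_mean(values):
    """Return a copy of values with None entries replaced by the mean of the rest."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to average, so leave the column unchanged
    mean = sum(present) / float(len(present))
    return [mean if v is None else v for v in values]


# Hypothetical ages, not taken from train.csv
ages = [22.0, None, 38.0, None, 30.0]
filled = fill_missing_with_mean(ages)
```

Every missing age ends up with the same value, so the column's mean is unchanged but its spread shrinks.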


Feature engineering


Add a large-family flag: family = 1 if SibSp + Parch (siblings/spouses plus parents/children aboard) > 3, else 0

passengers['family'] = (passengers['SibSp'] + passengers['Parch']) > 3

Create a new feature Child that is true when the age is less than 15

passengers["Child"] = passengers["Age"]<15
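
The two flags can be checked by hand on a couple of rows. The sketch below is plain Python on dictionaries rather than an SFrame; add_flags and the sample rows are hypothetical and only mirror the two comparisons used here.

```python
def add_flags(row):
    """Return a copy of row with the 'family' and 'Child' flags added as 0/1."""
    out = dict(row)
    # Large family: more than 3 relatives aboard in total
    out["family"] = int(row["SibSp"] + row["Parch"] > 3)
    # Child: strictly younger than 15
    out["Child"] = int(row["Age"] < 15)
    return out


# Hypothetical passengers, not taken from train.csv
rows = [
    {"SibSp": 1, "Parch": 0, "Age": 22.0},
    {"SibSp": 3, "Parch": 2, "Age": 8.0},
]
flagged = [add_flags(r) for r in rows]
```

Both comparisons are strict, so a passenger with exactly 3 relatives or aged exactly 15 gets 0.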

Extract title from Name

import re

def findTitle(name):
    # Look for a known title followed by a period, e.g. "Mr." or "Miss."
    match = re.search(r"(Dr|Mrs?|Ms|Miss|Master|Rev|Capt|Mlle|Col|Major|Sir|Jonkheer|Lady|the Countess|Mme|Don)\.", name)
    if match:
        title = match.group(0)
        # Fold rare titles into a more common equivalent
        if title in ('Don.', 'Major.', 'Capt.'):
            title = 'Sir.'
        if title in ('Mlle.', 'Mme.'):
            title = 'Miss.'
        return title
    else:
        # No known title in the name
        return "Other"

passengers["Title"] = passengers["Name"].apply(findTitle)
passengers["Title"].show()
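
A quick way to sanity-check the title extraction is to run the same regular expression on a few name strings outside GraphLab. The sketch below restates the logic as a standalone function; find_title, TITLE_RE and the sample names are illustrative, written in the "Surname, Title. Given names" format.

```python
import re

# Same alternation as above: a known title followed by a literal period
TITLE_RE = re.compile(
    r"(Dr|Mrs?|Ms|Miss|Master|Rev|Capt|Mlle|Col|Major|Sir|Jonkheer|Lady"
    r"|the Countess|Mme|Don)\."
)


def find_title(name):
    """Return the normalised title found in name, or 'Other' if none matches."""
    match = TITLE_RE.search(name)
    if not match:
        return "Other"
    title = match.group(0)  # includes the trailing period
    if title in ("Don.", "Major.", "Capt."):
        return "Sir."
    if title in ("Mlle.", "Mme."):
        return "Miss."
    return title


samples = [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley",
    "Heikkinen, Miss. Laina",
    "Sagesser, Mlle. Emma",
    "Uruchurtu, Don. Manuel E",
    "Smith, John",
]
titles = [find_title(n) for n in samples]
```

Because match.group(0) keeps the period, the resulting categories are "Mr.", "Miss." and so on rather than "Mr" and "Miss".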

Feature binning

from graphlab.toolkits.feature_engineering import *

binner = graphlab.feature_engineering.create(passengers, FeatureBinner(features = ['Fare'],strategy='quantile',num_bins = 5))
fit_binner = binner.fit(passengers)
passengers_binned = fit_binner.transform(passengers)
passengers_binned["Fare"].show()
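
Quantile binning replaces each fare with the index of the equal-count bucket it falls into. The following is a minimal plain-Python sketch of that idea, not GraphLab's implementation: quantile_edges, assign_bin and the fares are hypothetical, and FeatureBinner may place its boundaries differently.

```python
from bisect import bisect_right


def quantile_edges(values, num_bins):
    """Return num_bins - 1 cut points splitting values into roughly equal-count bins."""
    ordered = sorted(values)
    return [ordered[len(ordered) * k // num_bins] for k in range(1, num_bins)]


def assign_bin(value, edges):
    """Return the bin index (0 .. len(edges)) that value falls into."""
    return bisect_right(edges, value)


# Hypothetical fares, not taken from train.csv
train_fares = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
edges = quantile_edges(train_fares, 5)
train_bins = [assign_bin(f, edges) for f in train_fares]

# New values are binned with the edges learned above
new_bins = [assign_bin(f, edges) for f in [0.5, 6.5, 50]]
```

Reusing the same edges for new data keeps a given bin index tied to the same fare range in both datasets.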


Feature selection


features = ["Pclass","Sex","Age","family","Child","Fare","Title"]


Model building


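Split data into train and test sets (roughly 80/20). Note that the model below is created from passengers_binned rather than train (the log reports 891 examples), so the test rows are also part of the training data and the validation accuracy is optimistic.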

train,test = passengers_binned.random_split(0.8,seed=0)
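
A seeded random split can be sketched in plain Python as follows. This is an illustration only: this random_split is a hypothetical stand-in, and it assumes each row is assigned independently with the given probability, which would explain why the test set here has 188 rows (75 + 113 in the ROC table) rather than exactly 20% of 891.

```python
import random


def random_split(rows, fraction, seed):
    """Send each row to the first list with probability fraction, else to the second."""
    rng = random.Random(seed)  # private generator so the split is reproducible
    first, second = [], []
    for row in rows:
        (first if rng.random() < fraction else second).append(row)
    return first, second


rows = list(range(100))  # stand-in for 100 data rows
train, test = random_split(rows, 0.8, seed=0)
train_again, test_again = random_split(rows, 0.8, seed=0)
```

Fixing the seed makes the split repeatable, so later runs evaluate on the same held-out rows.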


model = graphlab.logistic_classifier.create(passengers_binned,
                                            target="Survived",
                                            features=features,
                                            validation_set=test)

PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples : 891
PROGRESS: Number of classes : 2
PROGRESS: Number of feature columns : 7
PROGRESS: Number of unpacked features : 7
PROGRESS: Number of coefficients : 21
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | Iteration | Passes | Elapsed Time | Training-accuracy | Validation-accuracy |
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+
PROGRESS: | 1 | 2 | 0.002642 | 0.831650 | 0.781915 |
PROGRESS: | 2 | 3 | 0.004899 | 0.835017 | 0.781915 |
PROGRESS: | 3 | 4 | 0.007302 | 0.831650 | 0.776596 |
PROGRESS: | 4 | 5 | 0.009823 | 0.831650 | 0.776596 |
PROGRESS: | 5 | 6 | 0.012186 | 0.831650 | 0.776596 |
PROGRESS: | 6 | 7 | 0.014614 | 0.831650 | 0.776596 |
PROGRESS: +-----------+----------+--------------+-------------------+---------------------+


Evaluation


ROC curve

model.evaluate(test,metric='roc_curve')

{'roc_curve': Columns:
threshold float
fpr float
tpr float
p int
n int

Rows: 1001

Data:
+------------------+----------------+-----+----+-----+
| threshold | fpr | tpr | p | n |
+------------------+----------------+-----+----+-----+
| 0.0 | 0.0 | 0.0 | 75 | 113 |
| 0.0010000000475 | 1.0 | 1.0 | 75 | 113 |
| 0.00200000009499 | 1.0 | 1.0 | 75 | 113 |
| 0.00300000002608 | 1.0 | 1.0 | 75 | 113 |
| 0.00400000018999 | 1.0 | 1.0 | 75 | 113 |
| 0.00499999988824 | 1.0 | 1.0 | 75 | 113 |
| 0.00600000005215 | 0.982300884956 | 1.0 | 75 | 113 |
| 0.00700000021607 | 0.982300884956 | 1.0 | 75 | 113 |
| 0.00800000037998 | 0.982300884956 | 1.0 | 75 | 113 |
| 0.00899999961257 | 0.982300884956 | 1.0 | 75 | 113 |
+------------------+----------------+-----+----+-----+
[1001 rows x 5 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.}
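
Each row of the table above is one point on the ROC curve: for a given threshold, tpr is the share of the p positive examples predicted positive and fpr is the share of the n negative examples predicted positive (here p = 75 and n = 113). A minimal sketch of that calculation, using a hypothetical roc_point helper and made-up scores; GraphLab's handling of the boundary thresholds may differ from the >= convention assumed here.

```python
def roc_point(scores, labels, threshold):
    """Return (fpr, tpr) when every score >= threshold is predicted positive."""
    p = sum(1 for y in labels if y == 1)   # number of positive examples
    n = len(labels) - p                    # number of negative examples
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    return fp / float(n), tp / float(p)


# Hypothetical predicted probabilities and true labels
scores = [0.9, 0.8, 0.4, 0.2]
labels = [1, 0, 1, 0]
points = [roc_point(scores, labels, t) for t in (0.3, 0.5, 0.95)]
```

Raising the threshold can only lower fpr and tpr or leave them unchanged, which is why the curve is traced by sweeping the threshold from 0 to 1.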

model.show(view='Evaluation')


Build the model again using the entire input


model = graphlab.logistic_classifier.create(passengers_binned,
                                            target="Survived",
                                            features=features,
                                            validation_set=None)

PROGRESS: Logistic regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples : 891
PROGRESS: Number of classes : 2
PROGRESS: Number of feature columns : 7
PROGRESS: Number of unpacked features : 7
PROGRESS: Number of coefficients : 21
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+-------------------+
PROGRESS: | Iteration | Passes | Elapsed Time | Training-accuracy |
PROGRESS: +-----------+----------+--------------+-------------------+
PROGRESS: | 1 | 2 | 0.002700 | 0.831650 |
PROGRESS: | 2 | 3 | 0.004679 | 0.835017 |
PROGRESS: | 3 | 4 | 0.006863 | 0.831650 |
PROGRESS: | 4 | 5 | 0.008501 | 0.831650 |
PROGRESS: | 5 | 6 | 0.010505 | 0.831650 |
PROGRESS: | 6 | 7 | 0.012663 | 0.831650 |
PROGRESS: +-----------+----------+--------------+-------------------+


Predict


passengers_submission = graphlab.SFrame('test.csv')

PROGRESS: Finished parsing file /Users/vishnu/git/hadoop/ipython/test.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.021006 secs.
------------------------------------------------------
Inferred types from first line of file as
column_type_hints=[int,int,str,str,float,int,int,str,float,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Finished parsing file /Users/vishnu/git/hadoop/ipython/test.csv
PROGRESS: Parsing completed. Parsed 418 lines in 0.008928 secs.

passengers_submission.show()

passengers_submission['family'] = passengers_submission['SibSp']+passengers_submission['Parch'] >3
passengers_submission["Child"] = passengers_submission["Age"]<15
passengers_submission["Title"] = passengers_submission["Name"].apply(findTitle)
binner = graphlab.feature_engineering.create(passengers_submission, FeatureBinner(features = ['Fare'],strategy='quantile',num_bins = 5))
fit_binner = binner.fit(passengers_submission)
passengers_submission_binned = fit_binner.transform(passengers_submission)

passengers["Pclass","Sex","Age","family","Child","Fare","Title"].show()
prediction = model.predict(passengers_submission_binned,output_type='class')
passengers_submission["Survived"] = prediction
result = passengers_submission["PassengerId","Survived"]
result
PassengerId Survived
892 0
893 1
894 0
895 0
896 1
897 0
898 1
899 0
900 1
901 0
[418 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.


result.save('submission.csv')

This submission received a score of 0.78469