[Kaggle] Titanic Survivor Prediction | Titanic - Machine Learning from Disaster | Ensembling, XGBoost

Titanic Survivor Prediction

- Problem type: Binary classification

- Evaluation metric: Accuracy

- Model used for the submission: XGBoost

- Kaggle notebook: https://www.kaggle.com/code/jinkwonskk/notebook7ac86867dd

 


 

- Reference blog: https://chaesoong2.tistory.com/27

 


 

# Feature Descriptions

  • PassengerId: serial number of the passenger record
  • Survived: survival, 0 = died, 1 = survived
  • Pclass: ticket class, 1 = 1st class, 2 = 2nd class, 3 = 3rd class
  • Sex: passenger's sex
  • Name: passenger's name
  • Age: passenger's age
  • SibSp: number of siblings or spouses aboard with the passenger
  • Parch: number of parents or children aboard with the passenger
  • Ticket: ticket number
  • Fare: passenger fare
  • Cabin: cabin number
  • Embarked: port of embarkation, C = Cherbourg, Q = Queenstown, S = Southampton

0. Loading the Data

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import pandas as pd
import numpy as np
import re
import sklearn
import xgboost as xgb
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import plotly
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls

import warnings
warnings.filterwarnings('ignore')

from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier, 
                              GradientBoostingClassifier, ExtraTreesClassifier)
from sklearn.svm import SVC
from sklearn.model_selection import KFold

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

train=pd.read_csv('/kaggle/input/titanic/train.csv')
test=pd.read_csv('/kaggle/input/titanic/test.csv')
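
As a quick sanity check against the feature list above (this snippet is not in the original notebook), the shapes and missing-value counts of the loaded data can be inspected:

# Quick look at the loaded data: shapes and missing values (sanity check only)
print(train.shape, test.shape)   # the Titanic data has 891 training rows and 418 test rows
print(train.isnull().sum())      # Age, Cabin, and Embarked contain missing values in train
train.head()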

 

1. EDA (+ Data Preprocessing)

full_data=[train, test]

# Length of each passenger's name
train['Name_length']=train['Name'].apply(len)
test['Name_length']=test['Name'].apply(len)

# Has_Cabin: 1 if a cabin number is recorded, 0 if it is missing (NaN)
train['Has_Cabin'] = train["Cabin"].apply(lambda x: 0 if type(x) == float else 1)
test['Has_Cabin'] = test["Cabin"].apply(lambda x: 0 if type(x) == float else 1)

# Family size = siblings/spouses + parents/children + the passenger
for dataset in full_data:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1
    
# IsAlone: 1 if the passenger traveled alone, 0 otherwise
for dataset in full_data:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1
    
# Fill missing Embarked values with 'S'
for dataset in full_data:
    dataset['Embarked'] = dataset['Embarked'].fillna('S')
    
# Fill missing Fare values with the median fare of the training set
for dataset in full_data:
    dataset['Fare'] = dataset['Fare'].fillna(train['Fare'].median())

# Bin Fare into quartiles (qcut creates bins containing equal numbers of samples)
train['CategoricalFare'] = pd.qcut(train['Fare'], 4)

# Fill missing Age values and bin Age (cut creates bins of equal width)
for dataset in full_data:
    age_avg = dataset['Age'].mean()  # mean
    age_std = dataset['Age'].std()  # standard deviation
    age_null_count = dataset['Age'].isnull().sum()  # number of missing values
    # Fill the missing ages with random values drawn from [mean - std, mean + std]
    age_null_random_list = np.random.randint(age_avg - age_std, age_avg + age_std, size=age_null_count)
    dataset.loc[np.isnan(dataset['Age']), 'Age'] = age_null_random_list
    dataset['Age'] = dataset['Age'].astype(int)
train['CategoricalAge'] = pd.cut(train['Age'], 5)

# Extract the title (Mr, Miss, ...) from the Name string
def get_title(name):
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    if title_search:
        return title_search.group(1)
    return ""

for dataset in full_data:
    dataset['Title'] = dataset['Name'].apply(get_title)
    
# Normalize variant titles and group rare titles together
for dataset in full_data:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
    
# Convert categorical values to numbers
for dataset in full_data:
    # Mapping Sex
    dataset['Sex'] = dataset['Sex'].map( {'female': 0, 'male': 1} ).astype(int)
    # Map Title to integers

    title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)
    dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare']= 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare']   = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)
    dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 64, 'Age'] = 4
    
# Drop columns that are no longer needed
drop_elements = ['PassengerId', 'Name', 'Ticket', 'Cabin', 'SibSp']
train = train.drop(drop_elements, axis = 1)
train = train.drop(['CategoricalAge', 'CategoricalFare'], axis = 1)
test  = test.drop(drop_elements, axis = 1)
train.head()

Preprocessing result (output of train.head())

# Pearson correlation between features, visualized as a heatmap

colormap = plt.cm.RdBu
plt.figure(figsize=(14,12))
plt.title('Pearson Correlation of Features', y=1.05, size=15)
sns.heatmap(train.astype(float).corr(),linewidths=0.1,vmax=1.0, 
            square=True, cmap=colormap, linecolor='white', annot=True)

 

g = sns.pairplot(train[[u'Survived', u'Pclass', u'Sex', u'Age', u'Parch', u'Fare', u'Embarked',
       u'FamilySize', u'Title']], hue='Survived', palette='seismic', height=1.2, diag_kind='kde',
       diag_kws=dict(shade=True), plot_kws=dict(s=10))
g.set(xticklabels=[])

 

2. Base Models (OOF Prediction, Ensembling, Stacking)

from sklearn.model_selection import KFold

ntrain = train.shape[0]
ntest = test.shape[0]
SEED = 0
NFOLDS = 5


kf = KFold(n_splits = NFOLDS, shuffle=True, random_state=SEED)

display(ntrain, ntest)

 

class SklearnHelper(object):
    """
    사이킷런의 객체를 상속받고, 사이킷런에서 제공하는 메소드를 구현하여 
    학습, 예측, 피처 중요도 구현
    """
    def __init__(self, clf, seed, params):
        params["random_state"] = seed
        self.clf = clf(**params)
        
    def get_name(self):
        return self.clf.__class__.__name__
    
    def train(self, X_train, y_train):
        self.clf.fit(X_train, y_train)
        
    def predict(self, X):
        """
        인자 - 테스트세트
        """
        return self.clf.predict(X)
    
    def fit(self,x,y):
        return self.clf.fit(x,y)
    
    def feature_importances(self, x, y):
        # Return (instead of only printing) the importances so they can be stored below
        return self.clf.fit(x, y).feature_importances_

def loop_get_oof(clfs, X_train, y_train, X_test):
    """
    Out Of Fold Prediction를 모델의 갯수만큼 반복 수행
    """
    result = dict()
    
    for clf in clfs:
        display(clf.get_name())
        oof_train,oof_test = get_oof(clf, X_train, y_train, X_test)
        result[clf] = (oof_train, oof_test)
    return result
oof_train = np.zeros((891,))
display('oof_train  :', oof_train.shape)

oof_test  = np.zeros((418, ))
display('oof_test  :', oof_test.shape)

oof_test_skf = np.empty((5, 418))
display('oof_test_skf  :', oof_test_skf.shape, oof_test_skf)
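
The code above and below relies on a get_oof() helper that this post does not show. It is the standard out-of-fold routine used for stacking: each first-level model is trained on four of the five folds and predicts the held-out fold (filling oof_train), while its per-fold predictions on the test set are averaged into oof_test; these out-of-fold predictions later become the input features of the second-level model. A minimal sketch, assuming the kf, ntrain, ntest, and NFOLDS defined above:

def get_oof(clf, x_train, y_train, x_test):
    """Out-of-fold predictions for one first-level model (sketch)."""
    oof_train = np.zeros((ntrain,))
    oof_test = np.zeros((ntest,))
    oof_test_skf = np.empty((NFOLDS, ntest))

    for i, (train_index, test_index) in enumerate(kf.split(x_train)):
        x_tr, y_tr = x_train[train_index], y_train[train_index]
        x_te = x_train[test_index]

        clf.train(x_tr, y_tr)                      # fit on the 4 training folds
        oof_train[test_index] = clf.predict(x_te)  # predict the held-out fold
        oof_test_skf[i, :] = clf.predict(x_test)   # predict the full test set

    oof_test[:] = oof_test_skf.mean(axis=0)        # average the 5 test-set predictions
    return oof_train.reshape(-1, 1), oof_test.reshape(-1, 1)

Returning the arrays as column vectors (reshape(-1, 1)) makes it easy to concatenate the outputs of several first-level models side by side later.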

 

Generating our Base First-Level Models

As the first level of the ensemble, we use the following five models:

  1. Random Forest classifier
  2. Extra Trees classifier
  3. AdaBoost classifier
  4. Gradient Boosting classifier
  5. Support Vector Machine

Parameters

  • n_jobs : number of CPU cores to use for training; -1 means use all available cores.
  • n_estimators : number of estimators (trees) in the model (the default is 10).
  • max_depth : maximum depth of each tree (keep in mind that larger values increase the risk of overfitting).
  • verbose : whether to print progress during training (0 = silent, 1 = print progress).
#hyper parameter for RandomForestClassifier
rf_params = {
    "n_jobs": -1,
    "n_estimators":500,
    "warm_start": True,  # 객체 생성한 것을 재사용,
    #"max_feature":0.2,
    "max_depth":6,
    "min_samples_leaf": 2,
    "max_features": "sqrt",
    "verbose": 0
}

# hyper parameter for ExtraTreesClassifier
ext_params = {
    "n_jobs": -1,
    "n_estimators": 500,
    #"max_features": 0.2,
    "max_depth": 6,
    "min_samples_leaf": 2,
    "max_features": "sqrt",
    "verbose": True
}

#hyper parameter for AdaBoostClassifier
ada_params = {
    "n_estimators": 500,
    "learning_rate": 0.75
}
#hyper parameter for GradientBoost
gb_params = {
    'n_estimators': 500,
     #'max_features': 0.2,
    'max_depth': 5,
    'min_samples_leaf': 2,
    'verbose': 0
}

# hyper parameter for SVC
svc_params = {
    'kernel' : 'linear',
    'C' : 0.025
}
# Create 5 objects, one for each of our 5 first-level models
rf = SklearnHelper(clf = RandomForestClassifier, seed = SEED, params = rf_params)
et = SklearnHelper(clf = ExtraTreesClassifier, seed = SEED, params = ext_params)
ada = SklearnHelper(clf = AdaBoostClassifier, seed = SEED, params = ada_params)
gb  = SklearnHelper(clf = GradientBoostingClassifier, seed = SEED, params = gb_params)
svc = SklearnHelper(clf = SVC, seed = SEED, params = svc_params)

 

Creating NumPy arrays out of our train and test sets

Create the NumPy arrays that the models will be trained on from the train and test sets.

 

# Build NumPy arrays (ravel flattens the values into a 1-D array)

y_train = train['Survived'].ravel()
train = train.drop(['Survived'], axis=1)
x_train = train.values 
x_test = test.values
rf_oof_train, rf_oof_test = get_oof(rf, x_train, y_train, x_test) # Random Forest
et_oof_train, et_oof_test = get_oof(et, x_train, y_train, x_test) # Extra Trees
ada_oof_train, ada_oof_test = get_oof(ada, x_train, y_train, x_test) # AdaBoost 
gb_oof_train, gb_oof_test = get_oof(gb, x_train, y_train, x_test) # Gradient Boost
svc_oof_train, svc_oof_test = get_oof(svc, x_train, y_train, x_test) # Support Vector Classifier
rf_features = rf.feature_importances(x_train, y_train)
et_features = et.feature_importances(x_train, y_train)
ada_features = ada.feature_importances(x_train, y_train)
gb_features = gb.feature_importances(x_train, y_train)
cols = train.columns.values
feature_dataframe = pd.DataFrame( {'features': cols,
     'Random Forest feature importances': rf_features,
     'Extra Trees  feature importances': et_features,
      'AdaBoost feature importances': ada_features,
    'Gradient Boost feature importances': gb_features
    })
trace = go.Scatter(
    y = feature_dataframe['Random Forest feature importances'].values,
    x = feature_dataframe['features'].values,
    mode='markers',
    marker=dict(
        sizemode = 'diameter',
        sizeref = 1,
        size = 25,
        color = feature_dataframe['Random Forest feature importances'].values,
        colorscale='Portland',
        showscale=True
    ),
    text = feature_dataframe['features'].values
)
data = [trace]

layout= go.Layout(
    autosize= True,
    title= 'Random Forest Feature Importance',
    hovermode= 'closest',
    yaxis=dict(
        title= 'Feature Importance',
        ticklen= 5,
        gridwidth= 2
    ),
    showlegend= False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig,filename='scatter2010')

trace = go.Scatter(
    y = feature_dataframe['Extra Trees  feature importances'].values,
    x = feature_dataframe['features'].values,
    mode='markers',
    marker=dict(
        sizemode = 'diameter',
        sizeref = 1,
        size = 25,
        color = feature_dataframe['Extra Trees  feature importances'].values,
        colorscale='Portland',
        showscale=True
    ),
    text = feature_dataframe['features'].values
)
data = [trace]

layout= go.Layout(
    autosize= True,
    title= 'Extra Trees Feature Importance',
    hovermode= 'closest',
    yaxis=dict(
        title= 'Feature Importance',
        ticklen= 5,
        gridwidth= 2
    ),
    showlegend= False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig,filename='scatter2010')

trace = go.Scatter(
    y = feature_dataframe['AdaBoost feature importances'].values,
    x = feature_dataframe['features'].values,
    mode='markers',
    marker=dict(
        sizemode = 'diameter',
        sizeref = 1,
        size = 25,
        color = feature_dataframe['AdaBoost feature importances'].values,
        colorscale='Portland',
        showscale=True
    ),
    text = feature_dataframe['features'].values
)
data = [trace]
layout= go.Layout(
    autosize= True,
    title= 'AdaBoost Feature Importance',
    hovermode= 'closest',
    yaxis=dict(
        title= 'Feature Importance',
        ticklen= 5,
        gridwidth= 2
    ),
    showlegend= False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig,filename='scatter2010')
trace = go.Scatter(
    y = feature_dataframe['Gradient Boost feature importances'].values,
    x = feature_dataframe['features'].values,
    mode='markers',
    marker=dict(
        sizemode = 'diameter',
        sizeref = 1,
        size = 25,
        color = feature_dataframe['Gradient Boost feature importances'].values,
        colorscale='Portland',
        showscale=True
    ),
    text = feature_dataframe['features'].values
)
data = [trace]
layout= go.Layout(
    autosize= True,
    title= 'Gradient Boosting Feature Importance',
    hovermode= 'closest',
    yaxis=dict(
        title= 'Feature Importance',
        ticklen= 5,
        gridwidth= 2
    ),
    showlegend= False
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig,filename='scatter2010')
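
The post jumps straight to the submission below, but predictions and PassengerId are not defined at this point: the second-level XGBoost model (the model actually submitted, per the summary at the top) is not shown, and PassengerId was dropped from test during preprocessing. A minimal sketch of that second level, with illustrative hyperparameters taken from the reference stacking kernel:

# Second level: stack the first-level OOF predictions as new features
x_train_stack = np.concatenate((et_oof_train, rf_oof_train, ada_oof_train,
                                gb_oof_train, svc_oof_train), axis=1)
x_test_stack = np.concatenate((et_oof_test, rf_oof_test, ada_oof_test,
                               gb_oof_test, svc_oof_test), axis=1)

# Fit an XGBoost classifier on the stacked features (hyperparameters are illustrative)
gbm = xgb.XGBClassifier(
    n_estimators=2000,
    max_depth=4,
    min_child_weight=2,
    gamma=0.9,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    scale_pos_weight=1
).fit(x_train_stack, y_train)
predictions = gbm.predict(x_test_stack)

# PassengerId was dropped from `test` above, so read it again from the original CSV
PassengerId = pd.read_csv('/kaggle/input/titanic/test.csv')['PassengerId']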

 

StackingSubmission = pd.DataFrame({ 'PassengerId': PassengerId,
                            'Survived': predictions })
StackingSubmission.to_csv("StackingSubmission.csv", index=False)