Titanic Survivor Prediction with scikit-learn

This time, let's carry out the data analysis while also drawing charts and graphs with Matplotlib and Seaborn, Python's best-known visualization packages.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

titanic_df = pd.read_csv('titanic_train.csv')
titanic_df.head(3)

Let's check the column types and the number of missing values, using the info() method.

print('\n ### Training data info ### \n')
print(titanic_df.info())
'''
### Training data info ### 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
'''

# The pandas object dtype can hold any Python object; in this dataset the object columns contain strings.
# The Age, Cabin, and Embarked columns have missing values.
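The per-column null counts that info() implies can also be computed directly. The following is a minimal sketch on a small made-up frame; the name `toy_df` and its values are illustrative assumptions, not rows from the Titanic file.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for titanic_df (values are made up for illustration)
toy_df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C90', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 8.05],
})

# isnull() marks missing cells as True; sum() counts them per column
null_counts = toy_df.isnull().sum()
print(null_counts)
```

On the real data the same call would be `titanic_df.isnull().sum()`.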

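First, the missing values need to be handled. Age is filled with its mean, and missing values in the remaining columns are replaced with 'N'.

# Assign the result back instead of calling fillna(inplace=True) on a selected column;
# recent pandas versions warn about that pattern and it may not update the DataFrame.
titanic_df['Age'] = titanic_df['Age'].fillna(titanic_df['Age'].mean())
titanic_df['Cabin'] = titanic_df['Cabin'].fillna('N')
titanic_df['Embarked'] = titanic_df['Embarked'].fillna('N')
print('Number of null values in the dataset:', titanic_df.isnull().sum().sum()) # Number of null values in the dataset: 0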
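To see what this imputation does on a scale small enough to check by hand, here is a sketch on a made-up three-row frame (`toy_df` and its values are assumptions for illustration). It uses plain assignment rather than `inplace=True`.

```python
import numpy as np
import pandas as pd

# Made-up data: one missing Age and one missing Embarked
toy_df = pd.DataFrame({
    'Age': [20.0, np.nan, 40.0],
    'Embarked': ['S', None, 'C'],
})

# mean() skips NaN, so the fill value is (20 + 40) / 2 = 30
toy_df['Age'] = toy_df['Age'].fillna(toy_df['Age'].mean())
# Categorical gaps get a placeholder label instead of a statistic
toy_df['Embarked'] = toy_df['Embarked'].fillna('N')

remaining_nulls = int(toy_df.isnull().sum().sum())
print(toy_df)
print(remaining_nulls)
```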

Now let's look at the string features. First, check which values each one takes.

print(' Sex value distribution:\n', titanic_df['Sex'].value_counts())
print('\n Cabin value distribution:\n', titanic_df['Cabin'].value_counts())
print('\n Embarked value distribution:\n', titanic_df['Embarked'].value_counts())
'''
Sex value distribution:
 male      577
female    314
Name: Sex, dtype: int64

 Cabin value distribution:
 N              687
G6               4
B96 B98          4
C23 C25 C27      4
D                3
              ... 
C90              1
C46              1
A32              1
A14              1
E46              1
Name: Cabin, Length: 148, dtype: int64

 Embarked value distribution:
 S    644
C    168
Q     77
N      2
Name: Embarked, dtype: int64
'''

# The Cabin values need cleaning up: the entries are not consistently formatted
# (some list several cabins), and the leading letter presumably indicates the cabin class.

Cleaning up the Cabin feature

titanic_df['Cabin'] = titanic_df['Cabin'].str[:1] # keep only the leading letter
print(titanic_df['Cabin'].head(3))
'''
0    N
1    C
2    N
Name: Cabin, dtype: object
'''
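The effect of `.str[:1]` is easiest to see on a handful of values. This sketch uses a few Cabin strings that appear in the value counts earlier; the Series name `cabins` is an assumption for illustration.

```python
import pandas as pd

# A few Cabin values, including multi-cabin entries and the 'N' placeholder
cabins = pd.Series(['N', 'C90', 'B96 B98', 'C23 C25 C27'])

# .str[:1] slices each string down to its first character
first_letters = cabins.str[:1]
print(first_letters.tolist())
```

Multi-cabin entries such as 'B96 B98' collapse to a single letter, which is what reduces the 148 distinct values to a handful of categories.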

Let's explore the data. Sex probably affected the chance of survival, so let's first compare the number of survivors by sex.

titanic_df.groupby(['Sex','Survived'])['Survived'].count()
'''
Sex     Survived
female  0            81
        1           233
male    0           468
        1           109
Name: Survived, dtype: int64
'''
# The survival rate was roughly 74.2% for women (233 of 314) and roughly 18.9% for men (109 of 577).

# Let's confirm this with a plot, using the Seaborn package.
sns.barplot(x='Sex',y='Survived',data=titanic_df)
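The survival rates can be derived from the groupby counts themselves. The sketch below rebuilds those four counts in a small frame (the names `counts` and `n` are assumptions for illustration) and divides survivors by group totals.

```python
import pandas as pd

# Counts copied from the groupby output for Sex and Survived
counts = pd.DataFrame({
    'Sex': ['female', 'female', 'male', 'male'],
    'Survived': [0, 1, 0, 1],
    'n': [81, 233, 468, 109],
})

# Total passengers per sex, and survivors per sex
totals = counts.groupby('Sex')['n'].sum()
survivors = counts[counts['Survived'] == 1].set_index('Sex')['n']

# Division aligns on the Sex index
survival_rate = (survivors / totals).round(3)
print(survival_rate)
```

On the full data, `titanic_df.groupby('Sex')['Survived'].mean()` should give the same rates, and this mean is also what the bar heights in the barplot represent by default.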

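Next, let's analyze the data on the assumption that survival rates differed between rich and poor passengers. We look at the survival rate by sex within each passenger class (Pclass), again with a Seaborn barplot.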

sns.barplot(x='Pclass',y='Survived',hue='Sex',data=titanic_df)

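Now let's look at the survival rate by Age. Because Age is a continuous value rather than a category, we first bin it into categories to make the analysis easier. (0-5: Baby, 6-12: Child, 13-18: Teenager, 19-25: Student, 26-35: Young Adult, 36-60: Adult, 61 and over: Elderly, invalid values of -1 or below: Unknown)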

# Function that returns a category label for the given age; used with DataFrame apply.
# Because there are many categories, the binning logic gets its own function,
# which takes a single argument: the age value.

def get_category(age):
	cat=''
	if age <= -1: cat='Unknown'
	elif age <= 5: cat='Baby'
	elif age <= 12: cat='Child'
	elif age <= 18: cat='Teenager'
	elif age <= 25: cat='Student'
	elif age <= 35: cat='Young Adult'
	elif age <= 60: cat='Adult'
	else : cat='Elderly'

	return cat

# Make the figure larger for the bar plot
plt.figure(figsize=(10,6))

# Order in which the categories appear on the x-axis
group_names=['Unknown','Baby','Child','Teenager','Student','Young Adult','Adult','Elderly']

# Apply get_category() to every value in the 'Age' column;
# it returns the matching category label for each age
titanic_df['Age_cat'] = titanic_df['Age'].apply(lambda x : get_category(x))
sns.barplot(x='Age_cat', y='Survived', hue='Sex', data=titanic_df, order=group_names)
titanic_df.drop('Age_cat',axis=1, inplace=True) # drop the temporary column
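As an alternative to the hand-written if/elif chain, pandas offers `pd.cut` for binning. This is a sketch of a swapped-in technique, not the method used in this post; the bin edges are chosen to mirror the `<=` thresholds in get_category(), and the sample ages are made up.

```python
import numpy as np
import pandas as pd

# Right-closed bins: (-inf, -1], (-1, 5], (5, 12], ... mirror the <= checks
bins = [-np.inf, -1, 5, 12, 18, 25, 35, 60, np.inf]
labels = ['Unknown', 'Baby', 'Child', 'Teenager', 'Student',
          'Young Adult', 'Adult', 'Elderly']

# Made-up ages that sit on or near each boundary
ages = pd.Series([-2, 0, 5, 5.5, 12, 18, 25, 35, 60, 61])

# pd.cut assigns each age the label of the bin it falls into
age_cat = pd.cut(ages, bins=bins, labels=labels)
print(age_cat.tolist())
```

The result is an ordered categorical, so the category order is carried by the data itself rather than by a separate list.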

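Female babies and elderly women had high survival rates, while, oddly, girls in the Child group had a much lower survival rate than the other age groups. Among males, babies had the highest survival rate.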

Now let's convert the remaining string categorical features into numeric categorical features, that is, label-encode them with the LabelEncoder class.

To convert several string columns at once, let's write an encode_features() function.

from sklearn import preprocessing

def encode_features(dataDF):
    features = ['Cabin','Sex','Embarked'] # the string columns considered important
    for feature in features:
        le = preprocessing.LabelEncoder()
        dataDF[feature] = le.fit_transform(dataDF[feature])

    return dataDF

titanic_df = encode_features(titanic_df)
titanic_df.head()
# The string features have been converted to numeric values.
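A small sketch of what LabelEncoder does to one column may help. The input list is made up; `classes_` and `inverse_transform` are standard LabelEncoder members.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# fit_transform learns the sorted unique labels and maps each to its index
codes = le.fit_transform(['male', 'female', 'female', 'male'])

# classes_ holds the learned labels; position in the array is the code
classes = list(le.classes_)

# inverse_transform maps the integer codes back to the original strings
decoded = list(le.inverse_transform(codes))

print(codes, classes, decoded)
```

The codes follow alphabetical order and carry no real ranking. Tree-based models generally tolerate this, while linear models may read the integers as magnitudes, so one-hot encoding is often considered for them.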

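Now comes arguably the most important step: gathering all of the preprocessing above into functions so that it can be run in one go and reused later.

Calling the function should handle missing values, formatting, and encoding all at once.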

from sklearn.preprocessing import LabelEncoder

# Handle null values (assign back rather than using inplace=True on a selected column)
def fillna(df):
    df['Age'] = df['Age'].fillna(df['Age'].mean())
    df['Cabin'] = df['Cabin'].fillna('N')
    df['Embarked'] = df['Embarked'].fillna('N')
    df['Fare'] = df['Fare'].fillna(0)
    return df


# Drop features that the machine learning algorithms do not need
def drop_features(df):
    df.drop(['PassengerId','Name','Ticket'], axis=1, inplace=True)
    return df

# Format the features and label-encode them
def format_features(df):
    df['Cabin'] = df['Cabin'].str[:1]
    features = ['Cabin','Sex','Embarked']
    for feature in features:
        le = LabelEncoder()
        df[feature] = le.fit_transform(df[feature])
    return df

# Call the preprocessing functions defined above
def transform_features(df):
    df = fillna(df)
    df = drop_features(df)
    df = format_features(df)
    return df
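One thing to keep in mind with helpers like these is that `drop(..., inplace=True)` modifies the frame that was passed in. The sketch below contrasts that with a copy-returning variant; both helper names and the toy data are hypothetical and exist only for this illustration.

```python
import pandas as pd

def drop_id_inplace(df):
    # Hypothetical helper: mutates the caller's DataFrame
    df.drop(['PassengerId'], axis=1, inplace=True)
    return df

def drop_id_copy(df):
    # Hypothetical helper: returns a new DataFrame, leaving the input untouched
    return df.drop(columns=['PassengerId'])

raw = pd.DataFrame({'PassengerId': [1, 2], 'Fare': [7.25, 71.28]})

safe = drop_id_copy(raw)
still_there = 'PassengerId' in raw.columns   # the original keeps its column

drop_id_inplace(raw)
gone_now = 'PassengerId' not in raw.columns  # the original was changed

print(still_there, gone_now)
```

In this post the mutation is harmless because the CSV is reloaded before preprocessing, but a copy-returning style may be safer when the raw frame is needed again.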

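Now that the preprocessing functions are written, reload the original dataset and rebuild the label (target) dataset and the feature dataset.

# Reload the original data and build the feature and label datasets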
titanic_df = pd.read_csv('titanic_train.csv')
y_titanic_df = titanic_df['Survived']
X_titanic_df = titanic_df.drop('Survived',axis=1)


# Run the preprocessing on the feature dataset
X_titanic_df = transform_features(X_titanic_df)


# Split off the test dataset with train_test_split()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_titanic_df,y_titanic_df, test_size=0.2, random_state=11)
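To make the split sizes concrete, here is a sketch on ten made-up samples. The `stratify=y` argument is an addition not used in this post; it asks the split to keep the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each, and balanced binary labels
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# test_size=0.2 puts 2 of the 10 rows in the test set;
# stratify=y (an assumption, not in the original call) preserves the 50/50 ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=11, stratify=y
)

print(X_train.shape, X_test.shape, y_test)
```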

The algorithms used are a decision tree, a random forest, and logistic regression.

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Create the estimator objects
dt_clf = DecisionTreeClassifier(random_state=11)
rf_clf = RandomForestClassifier(random_state=11)
lr_clf = LogisticRegression()

# Decision tree: train, predict, evaluate
dt_clf.fit(X_train,y_train)
dt_pred = dt_clf.predict(X_test)
print('DecisionTreeClassifier accuracy: {0:.4f}'.format(accuracy_score(y_test,dt_pred)))

# Random forest: train, predict, evaluate
rf_clf.fit(X_train,y_train)
rf_pred = rf_clf.predict(X_test)
print('RandomForestClassifier accuracy: {0:.4f}'.format(accuracy_score(y_test,rf_pred)))

# Logistic regression: train, predict, evaluate
lr_clf.fit(X_train,y_train)
lr_pred = lr_clf.predict(X_test)
print('LogisticRegression accuracy: {0:.4f}'.format(accuracy_score(y_test,lr_pred)))

'''
DecisionTreeClassifier accuracy: 0.7877
RandomForestClassifier accuracy: 0.8547
LogisticRegression accuracy: 0.8492
'''
The random forest classifier shows the highest accuracy of the three algorithms on this test split.
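For reference, accuracy is simply the fraction of predictions that match the true labels. A minimal sketch with made-up labels (`y_true` and `y_pred` are illustrative) compares a hand computation with accuracy_score:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels and predictions: 3 of the 5 predictions are correct
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1])

# Element-wise comparison gives booleans; their mean is the accuracy
manual = float(np.mean(y_true == y_pred))
library = accuracy_score(y_true, y_pred)

print(manual, library)
```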

Now let's evaluate more rigorously with cross-validation, starting with the decision tree.

First, K-fold cross-validation with KFold.

from sklearn.model_selection import KFold

def exec_kfold(clf,folds=5):
    # Create a KFold object with `folds` splits and a list to hold the accuracy of each fold
    kfold = KFold(n_splits=folds)
    scores = []

    # Run the K-fold cross-validation.
    # enumerate() yields (index, element) tuples, so the loop gets the fold number
    # together with the train/test index arrays
    for iter_count, (train_index,test_index) in enumerate(kfold.split(X_titanic_df)):
        # Select the training and validation rows of X_titanic_df for this fold
        X_train,X_test = X_titanic_df.values[train_index], X_titanic_df.values[test_index]
        y_train,y_test = y_titanic_df.values[train_index], y_titanic_df.values[test_index]
        # Train the classifier, predict, and compute the accuracy
        clf.fit(X_train,y_train)
        predictions = clf.predict(X_test)
        accuracy = accuracy_score(y_test,predictions)
        scores.append(accuracy)
        print("Cross-validation {0} accuracy: {1:.4f}".format(iter_count, accuracy)) # iter_count is the fold index

    # Mean accuracy over all folds
    mean_score = np.mean(scores)
    print("Mean accuracy: {0:.4f}".format(mean_score))

# Call exec_kfold
exec_kfold(dt_clf,folds=5)
'''
Cross-validation 0 accuracy: 0.7542
Cross-validation 1 accuracy: 0.7809
Cross-validation 2 accuracy: 0.7865
Cross-validation 3 accuracy: 0.7697
Cross-validation 4 accuracy: 0.8202
Mean accuracy: 0.7823
'''
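It can help to look at the index arrays that KFold produces. This sketch uses ten made-up rows; without shuffling, each test fold is a contiguous block and every row is used for validation exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten made-up samples; only the number of rows matters for the split
X = np.arange(10).reshape(10, 1)

kfold = KFold(n_splits=5)

# Collect the test indices of each fold
test_folds = [test_idx.tolist() for _, test_idx in kfold.split(X)]
print(test_folds)
```

Because the default does not shuffle, any ordering in the data file carries over into the folds; passing `shuffle=True` with a `random_state` is one way to avoid that.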

Next, run cross-validation with cross_val_score. For a classifier with an integer cv, it splits the data with StratifiedKFold.

from sklearn.model_selection import cross_val_score

scores = cross_val_score(dt_clf,X_titanic_df, y_titanic_df,cv=5)
for iter_count, accuracy in enumerate(scores):
    print('Cross-validation {0} accuracy: {1:.4f}'.format(iter_count, accuracy))

print('Mean accuracy: {0:.4f}'.format(np.mean(scores)))
'''
Cross-validation 0 accuracy: 0.7430
Cross-validation 1 accuracy: 0.7753
Cross-validation 2 accuracy: 0.7921
Cross-validation 3 accuracy: 0.7865
Cross-validation 4 accuracy: 0.8427
Mean accuracy: 0.7879
'''
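The difference between plain and stratified folds shows up clearly on imbalanced, sorted labels. In this sketch the labels are made up (eight 0s followed by four 1s), and the lists count how many positive samples land in each test fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Made-up, sorted labels: 8 negatives followed by 4 positives
y = np.array([0] * 8 + [1] * 4)
X = np.zeros((12, 1))  # feature values are irrelevant for the split itself

# Number of positives in each test fold
plain = [int(y[test].sum()) for _, test in KFold(n_splits=4).split(X)]
strat = [int(y[test].sum()) for _, test in StratifiedKFold(n_splits=4).split(X, y)]

print(plain)
print(strat)
```

KFold leaves some folds with no positives at all, while StratifiedKFold keeps the 2:1 class ratio in every fold. That may explain part of the gap between the two sets of scores in this post.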

Finally, use GridSearchCV to find the best hyperparameters. The hyperparameters searched are max_depth, min_samples_split, and min_samples_leaf.

from sklearn.model_selection import GridSearchCV

parameters = {'max_depth':[2,3,5,10], 'min_samples_split':[2,3,5], 'min_samples_leaf':[1,5,8]}

grid_dclf = GridSearchCV(dt_clf, param_grid=parameters, scoring='accuracy', cv=5)
grid_dclf.fit(X_train,y_train)

print('GridSearchCV best hyperparameters:', grid_dclf.best_params_)
print('GridSearchCV best accuracy: {0:.4f}'.format(grid_dclf.best_score_))
best_dclf = grid_dclf.best_estimator_

# Predict and evaluate with the estimator refit on the best hyperparameters found by GridSearchCV
dpredictions = best_dclf.predict(X_test)
accuracy = accuracy_score(y_test,dpredictions)
print('DecisionTreeClassifier accuracy on the test set: {0:.4f}'.format(accuracy))
'''
GridSearchCV best hyperparameters: {'max_depth': 3, 'min_samples_leaf': 5, 'min_samples_split': 2}
GridSearchCV best accuracy: 0.7992
DecisionTreeClassifier accuracy on the test set: 0.8715
'''

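The cost of this grid search can be estimated before running it. The sketch below counts the parameter combinations with ParameterGrid, using the same grid as in the search; the variable names `n_candidates` and `n_cv_fits` are assumptions for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as passed to GridSearchCV
parameters = {
    'max_depth': [2, 3, 5, 10],
    'min_samples_split': [2, 3, 5],
    'min_samples_leaf': [1, 5, 8],
}

# Every combination of the listed values is one candidate: 4 * 3 * 3
n_candidates = len(ParameterGrid(parameters))

# Each candidate is trained once per fold (cv=5)
n_cv_fits = n_candidates * 5

print(n_candidates, n_cv_fits)
```

With the default `refit=True`, GridSearchCV also trains one more model on the whole training set with the best combination, which is the estimator exposed as `best_estimator_`.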