Multicollinearity

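When two variables are highly correlated, check whether they carry essentially the same information. (When they do, the variables are said to exhibit multicollinearity.)

When multicollinearity is present, the usual remedy is to keep only one of the two variables. Selecting variables this way lowers model complexity and can improve accuracy.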
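One common way to quantify multicollinearity is the variance inflation factor (VIF): regress each feature on the remaining ones and compute 1 / (1 - R²). The sketch below is only an illustration on synthetic data, not the competition data. The `vif` helper, the feature names, and the generated values are all assumptions made for this example.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of every column in a 2-D array (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Ordinary least squares of column j on the other columns plus an intercept
        design = np.column_stack([np.ones(len(target)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        r2 = 1.0 - (residual @ residual) / ((target - target.mean()) ** 2).sum()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic stand-ins: 'total' is built from 'free', 'alcohol' is unrelated
rng = np.random.default_rng(0)
free = rng.uniform(1, 60, size=500)
total = free + rng.uniform(5, 40, size=500)
alcohol = rng.uniform(8, 14, size=500)

scores = vif(np.column_stack([free, total, alcohol]))
for name, score in zip(['free', 'total', 'alcohol'], scores):
    print(name, round(score, 2))
```

A feature whose VIF is far above those of the others is largely explained by them, which makes it a candidate for removal or for replacement with a derived variable.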

Wine quality classification competition

Resolving multicollinearity between the variables total sulfur dioxide and free sulfur dioxide

# Count rows by how total sulfur dioxide compares with free sulfur dioxide
total_count = (train['total sulfur dioxide'] > train['free sulfur dioxide']).sum()
same_count = (train['total sulfur dioxide'] == train['free sulfur dioxide']).sum()
sulfur_count = (train['total sulfur dioxide'] < train['free sulfur dioxide']).sum()

print('Rows where total > free :', total_count)
print('Rows where total == free:', same_count)
print('Rows where total < free :', sulfur_count)

Output:

Rows where total > free : 5497
Rows where total == free: 0
Rows where total < free : 0

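The result above suggests that total sulfur dioxide = free sulfur dioxide + α, where α is positive in every row. We can therefore create a derived variable, total sulfur dioxide - free sulfur dioxide, and drop the total sulfur dioxide variable.

Derived variable (total sulfur dioxide - free sulfur dioxide = sulfur dioxide other than free sulfur dioxide)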

train['free et sulfur dioxid'] = train['total sulfur dioxide'] - train['free sulfur dioxide']   
test['free et sulfur dioxid'] = test['total sulfur dioxide'] - test['free sulfur dioxide'] 
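To see whether a derived variable like this actually reduces the overlap, one option is to compare the correlation of free sulfur dioxide with the original total and with the difference. The snippet below is a rough sketch on synthetic data. The `demo` frame and its values are assumptions, and the non-free part is generated independently of the free part, so the drop in correlation is by construction. On the real `train` data the difference may still correlate with the free part, so the same two `corr` calls are worth running there.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the two sulfur dioxide columns (not the competition data)
rng = np.random.default_rng(42)
free = rng.uniform(1, 60, size=1000)
non_free = rng.uniform(5, 40, size=1000)  # hypothetical non-free part

demo = pd.DataFrame({
    'free sulfur dioxide': free,
    'total sulfur dioxide': free + non_free,
})

# Same derived variable as in the notes: total minus free
demo['free et sulfur dioxid'] = demo['total sulfur dioxide'] - demo['free sulfur dioxide']

# Pearson correlation of the free part with the total and with the difference
corr_before = demo['free sulfur dioxide'].corr(demo['total sulfur dioxide'])
corr_after = demo['free sulfur dioxide'].corr(demo['free et sulfur dioxid'])

print('free vs total      :', round(corr_before, 3))
print('free vs difference :', round(corr_after, 3))
```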

Dropping total sulfur dioxide

train = train.drop(['total sulfur dioxide'], axis = 1)   
test = test.drop(['total sulfur dioxide'], axis = 1)

Encoding the categorical variable

train['type'] = train['type'].apply(lambda x : 0 if x == "white" else 1)   
test['type'] = test['type'].apply(lambda x : 0 if x == "white" else 1)

Normalization

from sklearn.preprocessing import MinMaxScaler

features = ['fixed acidity', 'volatile acidity', 'citric acid',       
       'residual sugar', 'chlorides', 'free sulfur dioxide',         
       'free et sulfur dioxid', 'density', 'pH', 'sulphates', 'alcohol']        

scaler = MinMaxScaler()          
scaler.fit(train[features])          
train[features] = scaler.transform(train[features])
test[features] = scaler.transform(test[features])
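With its default `feature_range=(0, 1)`, `MinMaxScaler` maps each column with (x - min) / (max - min), where min and max come from the data passed to `fit`. The following sketch reproduces that arithmetic by hand on a few made-up numbers (the values and the `min_max` helper are assumptions for illustration) to show why fitting on train only can leave test values outside [0, 1].

```python
import numpy as np

# Hypothetical values of one feature in train and test
train_col = np.array([8.0, 10.0, 12.0, 14.0])
test_col = np.array([9.0, 15.0])

# "fit": the minimum and maximum are learned from the training column only
col_min, col_max = train_col.min(), train_col.max()

def min_max(values):
    """Apply the min-max formula using the statistics learned from train."""
    return (values - col_min) / (col_max - col_min)

train_scaled = min_max(train_col)  # spans exactly 0 to 1
test_scaled = min_max(test_col)    # 15.0 exceeds the train maximum, so it maps above 1

print(train_scaled)
print(test_scaled)
```

Reusing the train statistics for test, as the notes do, keeps both sets on the same scale and avoids leaking test information into preprocessing.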

Setting the independent and dependent variables

# Skip the first two columns (assumed here to be the row index and 'quality')
features = train.columns[2:]

X = train[features]
y = train['quality']

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

##### Evaluation metric: ACCURACY #####
def ACC(y_true, pred):
    # Fraction of predictions that match the true labels
    score = np.mean(y_true == pred)
    return score

##### Validation plot #####
def make_plot(y_true, pred):

    acc = ACC(y_true, pred)
    df_validation = pd.DataFrame({'y_true': y_true, 'y_pred': pred})

    # Frequency of each label in the validation answers ('y_true').
    # rename() keeps the column name stable across pandas versions
    # (pandas 2.x names the value_counts() result 'count').
    true_count = df_validation['y_true'].value_counts().rename('y_true')
    # Frequency of each label in the validation predictions ('y_pred')
    pred_count = df_validation['y_pred'].value_counts().rename('y_pred')

    # Combine both counts; a label missing on one side gets a count of 0
    df_val_pred_count = pd.concat([true_count, pred_count], axis=1).fillna(0).sort_index()

    # Draw the grouped bar chart over every label seen in either column
    x = df_val_pred_count.index
    y_true_count = df_val_pred_count['y_true']
    y_pred_count = df_val_pred_count['y_pred']

    width = 0.35
    plt.figure(figsize=(5, 3), dpi=150)

    plt.title('ACC : ' + str(acc)[:6])
    plt.xlabel('quality')
    plt.ylabel('count')

    plt.bar([idx - width / 2 for idx in x], y_true_count, width, label='real')
    plt.bar([idx + width / 2 for idx in x], y_pred_count, width, label='pred')

    plt.legend()
    plt.show()

Model training (RandomForest model with stratified k-fold cross validation)
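The notes do not include the training code itself, so what follows is only a minimal sketch of how stratified k-fold cross validation with a random forest might look in scikit-learn. It runs on synthetic data from `make_classification`. The names `X_demo` and `y_demo`, the number of folds, and every hyperparameter are assumptions, not the settings used in the competition.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the wine features and quality labels (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# Stratified folds keep the class proportions roughly equal in every split
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, valid_idx in skf.split(X_demo, y_demo):
    # Train a fresh model on each fold's training part
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X_demo[train_idx], y_demo[train_idx])

    # Accuracy on the held-out part, the same metric as ACC in the notes
    pred = model.predict(X_demo[valid_idx])
    fold_scores.append(np.mean(y_demo[valid_idx] == pred))

print('fold accuracies:', np.round(fold_scores, 4))
print('mean accuracy  :', np.mean(fold_scores))
```

With the `X` DataFrame and `y` Series defined earlier, the fold indices would be applied through `X.iloc[train_idx]` and `y.iloc[train_idx]` rather than plain bracket indexing.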
