Hyperparameter Tuning

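Hyperparameters are external configuration values that are set when training a machine learning model.

They are not learned by the training process itself, yet they strongly affect the model's performance.

Examples include learning rate, depth, batch size, and number of epochs; the available set differs from one model type to another.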
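To make the distinction concrete, here is a minimal sketch using scikit-learn's `Ridge` on a tiny synthetic dataset. The data and the value `alpha=0.5` are illustrative assumptions, not part of these notes. `alpha` is chosen before fitting, while `coef_` and `intercept_` are estimated by `fit()`.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Tiny synthetic dataset (illustrative assumption): y = 3*x0 - 2*x1 + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=50)

# Hyperparameter: chosen by us before training, never updated by fit()
model = Ridge(alpha=0.5)
hyperparams = model.get_params()

# Learned parameters: estimated from the data during fit()
model.fit(X, y)
learned_coef = model.coef_
learned_intercept = model.intercept_

print("hyperparameter alpha:", hyperparams["alpha"])
print("learned coefficients:", learned_coef)
print("learned intercept:", learned_intercept)
```

Calling `fit()` again on different data would change the coefficients but leave `alpha` untouched.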

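Hyperparameter Tuning

Tuning is the process of finding the best combination of these hyperparameters. Common approaches include:

• Adjustment based on experience and intuition

• Adjustment based on an understanding of the model and the data

• Adjustment by trial and error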
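Trial and error can be made systematic by enumerating candidate combinations and scoring each one. The sketch below is only an illustration: the grid values and the `toy_score` function are hypothetical stand-ins for real training and validation.

```python
from itertools import product

# Hypothetical candidate values for two hyperparameters
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "depth": [2, 4, 8],
}


def toy_score(learning_rate, depth):
    """Hypothetical stand-in for 'train the model and return a validation score'.

    It peaks at learning_rate=0.01, depth=4 purely for illustration.
    """
    return -abs(learning_rate - 0.01) * 100 - abs(depth - 4)


# Try every combination and keep the best-scoring one
names = list(grid)
best_params, best_score = None, float("-inf")
for values in product(*(grid[name] for name in names)):
    params = dict(zip(names, values))
    score = toy_score(**params)
    if score > best_score:
        best_params, best_score = params, score

print(best_params, best_score)
```

In practice `toy_score` would be replaced by a cross-validated score, and the number of combinations grows multiplicatively with each added hyperparameter.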

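Practice: Tuning Hyperparameters of a Regression Model

• Ridge regression model

Ridge regression is a form of linear regression in which a regularization term is added to the MSE loss of the linear regression model.

\text{Loss} = \text{MSE} + \text{Regularization Term}

The regularization used in Ridge regression is also called L2 regularization; it uses the sum of the squared model coefficients as the penalty term.

Because this term is added to the cost function, it limits how large the coefficients can grow during training.

The penalty therefore helps the model generalize and lowers the risk of overfitting, but its strength is set by alpha, so choosing a suitable value matters: with an alpha that is too small the model stays close to ordinary linear regression, and with one that is too large the coefficients shrink so much that the model underfits.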
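The shrinking effect of the L2 penalty can be seen directly. The sketch below is an illustration under assumptions rather than scikit-learn's implementation: it uses the closed-form ridge solution without an intercept, w = (XᵀX + αI)⁻¹Xᵀy, on synthetic data, and prints the coefficient norm for a few alpha values.

```python
import numpy as np


def ridge_closed_form(X, y, alpha):
    """Minimize ||y - Xw||^2 + alpha * ||w||^2 (no intercept) in closed form."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


# Synthetic data (illustrative assumption)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([5.0, -3.0])
y = X @ true_w + rng.normal(scale=0.1, size=100)

alphas = [0.0, 1.0, 10.0, 100.0, 1000.0]
norms = [float(np.linalg.norm(ridge_closed_form(X, y, a))) for a in alphas]

for a, n in zip(alphas, norms):
    print(f"alpha={a:>7}: ||w|| = {n:.4f}")
```

The norm falls as alpha grows, which is the "limits how large the coefficients can grow" behavior described above.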

from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Generate practice data
X, y = make_regression(n_samples=100, n_features=2, noise=0.1, random_state=0)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print(X_train.shape)
print(y_train.shape)

print(X_test.shape)
print(y_test.shape)
Output:
(80, 2)
(80,)
(20, 2)
(20,)

from sklearn.linear_model import Ridge

ridge_model = Ridge()

# Run cross-validation (5 folds by default)
from sklearn.model_selection import cross_val_score

scores = cross_val_score(ridge_model, X_train, y_train, scoring='neg_mean_squared_error')
scores  # cross-validated negative-MSE scores, one per fold

Output:
array([-2.83839588, -2.27326453, -3.15435097, -2.3522752 , -5.7908437 ])
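scikit-learn's scorers follow a "higher is better" convention, which is why MSE is reported with its sign flipped. A short sketch of converting the fold scores printed above back into MSE and RMSE (the array is copied from that output):

```python
import numpy as np

# Fold scores copied from the cross_val_score output above
neg_mse_scores = np.array([-2.83839588, -2.27326453, -3.15435097, -2.3522752, -5.7908437])

mse_per_fold = -neg_mse_scores          # flip the sign to get ordinary MSE
mean_mse = mse_per_fold.mean()          # average over the 5 folds
rmse_per_fold = np.sqrt(mse_per_fold)   # RMSE is in the same units as y

print("MSE per fold: ", mse_per_fold)
print("Mean MSE:     ", round(float(mean_mse), 4))
print("RMSE per fold:", np.round(rmse_per_fold, 4))
```

When comparing models, the score closest to zero is the best one under this convention.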

• Let's observe how the model's performance changes as the Ridge hyperparameter alpha varies.

◦ alpha: controls the strength of the regularization

\text{Loss} = \text{MSE} + \alpha \times \text{Regularization Term}

import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=0)

alpha_list = [0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.0006, 0.0007, 0.0008]
scores_list = []

# Cross-validate each candidate alpha
for alpha in alpha_list:
    ridge_model = Ridge(alpha=alpha)
    scores = cross_val_score(ridge_model, X_train, y_train, scoring='neg_mean_squared_error', cv=kf)
    scores_list.append(np.mean(scores))

scores_list

Output:
[-0.009961574660232828,
 -0.00996131342479463,
 -0.009961114852547201,
 -0.009960978942643922,
 -0.00996090569423835,
 -0.009960895106483417,
 -0.009960947178533557,
 -0.009961061909540138]

#### Find the alpha with the best score ####

# Best score and the alpha that achieves it
best_score = max(scores_list)  # highest mean neg-MSE (closest to zero)
print(f"Best Score: {best_score}")

optimal_alpha = alpha_list[np.argmax(scores_list)]  # alpha at the best score
print(f"Optimal alpha: {optimal_alpha}")

Output:
Best Score: -0.009960895106483417
Optimal alpha: 0.0006
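The manual loop above can also be written with scikit-learn's `GridSearchCV`, which runs the same cross-validation for every candidate and records the best one. This is a sketch under assumptions: it regenerates its own data with `make_regression` so that it is self-contained, so the numbers it reports need not match the output above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Self-contained data (assumed settings)
X, y = make_regression(n_samples=100, n_features=2, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

alpha_list = [0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.0006, 0.0007, 0.0008]
kf = KFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": alpha_list},
    scoring="neg_mean_squared_error",
    cv=kf,
)
search.fit(X_train, y_train)

print("Best alpha:", search.best_params_["alpha"])
print("Best CV score (neg MSE):", search.best_score_)
# By default (refit=True) best_estimator_ is retrained on all of X_train
print("Test R^2:", search.best_estimator_.score(X_test, y_test))
```

The per-candidate results are also available in `search.cv_results_`, which is convenient for plotting.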

### Plot the performance metric against alpha ###
import matplotlib.pyplot as plt

# Plot the results
plt.figure(figsize=(10,6))
plt.plot(alpha_list, scores_list, marker='o', linestyle='--')
plt.xlabel('Alpha')
plt.ylabel('Cross-Validation Score (Neg Mean Squared Error)')
plt.title('Alpha vs. CV Score')
plt.xscale('log')
plt.show()

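Practice: Tuning Hyperparameters of a Classification Model

• Logistic regression

C is the hyperparameter that balances model complexity against regularization.

The smaller C is, the stronger the regularization. Stronger regularization can improve generalization by curbing overfitting, but a C that is too small causes underfitting.

\text{Loss} = \text{Cross Entropy} + \frac{1}{C} \times \text{Regularization Term}

penalty: selects the type of regularization term (for example 'l1' or 'l2').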
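To see the inverse relationship between C and regularization strength, this sketch fits `LogisticRegression` on a small synthetic dataset and prints the coefficient norm for each C. The dataset settings and the C values are illustrative assumptions. Smaller C should give smaller coefficients.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic dataset (illustrative assumption)
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, flip_y=0.05, random_state=0)

c_values = [0.001, 0.1, 10.0]
coef_norms = []
for c in c_values:
    # The default penalty is L2; C is the inverse of the regularization strength
    clf = LogisticRegression(C=c, max_iter=1000)
    clf.fit(X, y)
    coef_norms.append(float(np.linalg.norm(clf.coef_)))

for c, norm in zip(c_values, coef_norms):
    print(f"C={c:<6} ||coef|| = {norm:.4f}")
```

This mirrors the role of alpha in Ridge, with the direction reversed: large alpha and small C both mean a stronger penalty.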

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

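X, y = make_classification(n_samples=1000,          # number of samples
                           n_features=20,           # number of features
                           n_informative=15,        # number of informative features
                           n_redundant=5,           # number of redundant features
                           n_clusters_per_class=2,  # clusters per class
                           weights=[0.7, 0.3],      # class proportions
                           flip_y=0.05,             # fraction of labels flipped as noise
                           class_sep=1.5,           # how far apart the classes are
                           random_state=42,
                           n_classes=2)

# Split into training and test sets.
# This step is not shown in the notes; test_size and random_state are assumed values.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)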

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver='liblinear')

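# Check cross-validation results for different regularization strengths C
import numpy as np
from sklearn.model_selection import KFold, cross_val_score

# K-Fold cross-validation setup
kf = KFold(n_splits=5, shuffle=True, random_state=40)

# Candidate C values (note: 10e-1 == 1.0, so C=1 is evaluated twice)
c_list = [10e-7, 10e-6, 10e-5, 10e-4, 10e-3, 10e-2, 10e-1, 1, 10, 10**2]  # ** is exponentiation; ^ would be bitwise XOR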
scores_list = []

# Cross-validate each candidate C
for c_val in c_list:
    # Define the logistic regression model with this C
    logreg_model = LogisticRegression(C=c_val, solver='liblinear', random_state=42)
    # Cross-validate; this is a classification problem, so use 'accuracy'
    scores = cross_val_score(logreg_model, X_train, y_train, scoring='accuracy', cv=kf)
    # Store the mean score
    scores_list.append(np.mean(scores))

print('Model performance: ', scores_list)

# Best score and the C that achieves it
best_score = max(scores_list)  # highest score in scores_list
print(f"Best Score: {best_score}")

optimal_c = c_list[np.argmax(scores_list)]  # C at the best score
print(f"Optimal C: {optimal_c}")

# Plot the performance metric against C
import matplotlib.pyplot as plt

# Plot the results
plt.figure(figsize=(10,6))
plt.plot(c_list, scores_list, marker='o', linestyle='--')
plt.xlabel('C')
plt.ylabel('Cross-Validation Score (Accuracy)')
plt.title('C vs. Accuracy Score')
plt.xscale('log')
plt.show()

Output:
Model performance: [0.7749999999999999, 0.7825, 0.81125, 0.86, 0.8949999999999999, 0.9025000000000001, 0.89375, 0.89375, 0.8925000000000001, 0.8925000000000001]
Best Score: 0.9025000000000001
Optimal C: 0.1
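As a closing sketch, the candidate list can be generated with `np.logspace`, which avoids hand-typed powers of ten, and the selected C can then be checked on the held-out test set. The data settings follow the `make_classification` call above; the split parameters and the use of `GridSearchCV` are assumptions rather than part of these notes, so the numbers it prints need not match the output above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, n_clusters_per_class=2,
                           weights=[0.7, 0.3], flip_y=0.05, class_sep=1.5,
                           random_state=42, n_classes=2)
# Assumed split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Nine values evenly spaced on a log scale: 1e-6, 1e-5, ..., 1e2
c_list = np.logspace(-6, 2, num=9)

search = GridSearchCV(
    LogisticRegression(solver="liblinear", random_state=42),
    param_grid={"C": c_list},
    scoring="accuracy",
    cv=KFold(n_splits=5, shuffle=True, random_state=40),
)
search.fit(X_train, y_train)

print("Best C:", search.best_params_["C"])
print("Best CV accuracy:", search.best_score_)
# Accuracy on data that played no part in choosing C
print("Test accuracy:", search.score(X_test, y_test))
```

Because the classes are imbalanced (roughly 70/30), accuracy alone can look optimistic; a metric such as F1 could be passed to `scoring` instead.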