EDA

EDA methods can be classified along two axes: the number of variables (univariate / multivariate) and whether the method is graphic or non-graphic.

Univariate

The main goal is to examine the distribution of the data.

Non-Graphic

• For numeric data, a range of summary statistics can be used:
  ◦ Center (mean, median, mode)
  ◦ Spread (variance, SD, IQR, range)
  ◦ Modality (peaks)
  ◦ Shape (tail, skewness, kurtosis)
  ◦ Outliers, etc.

• For categorical data:
  ◦ Occurrence, frequency, tabulation, etc.
  ◦ Categorical data can also be encoded as numbers, but note that the resulting numbers carry no numerical meaning (e.g. one-hot encoding).
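The non-graphic univariate summaries listed above can be sketched with pandas roughly as follows. The toy data, the column names `score` and `grade`, and the 1.5 × IQR outlier rule are assumptions made for illustration; other outlier conventions are equally valid.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only.
df = pd.DataFrame({
    "score": [1, 2, 2, 3, 4, 5, 100],
    "grade": ["a", "b", "b", "c", "a", "b", "c"],
})
s = df["score"]

# Center
center = {"mean": s.mean(), "median": s.median(), "mode": s.mode().tolist()}

# Spread
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
spread = {"variance": s.var(), "sd": s.std(), "iqr": iqr, "range": s.max() - s.min()}

# Shape
shape = {"skewness": s.skew(), "kurtosis": s.kurt()}

# Outliers via the 1.5 * IQR rule (one common convention, assumed here)
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

# Categorical: frequency table and one-hot encoding
freq = df["grade"].value_counts()
one_hot = pd.get_dummies(df["grade"])  # indicator columns, not quantities

print(center, spread, shape, sep="\n")
print(outliers.tolist())
print(freq)
```

Each one-hot column only marks category membership, so arithmetic such as averaging the codes across categories has no meaning.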

Graphic

Common choices include histograms, pie charts, box plots, and QQ plots.
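A minimal sketch of these four plots, assuming matplotlib is installed alongside pandas. The data is randomly generated for illustration, and the QQ plot is built by hand from the standard library's `NormalDist` rather than a dedicated statistics package.

```python
from statistics import NormalDist

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data: one numeric and one categorical column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "value": rng.normal(loc=50, scale=10, size=200),
    "group": rng.choice(["a", "b", "c"], size=200),
})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

df["value"].plot.hist(bins=20, ax=axes[0, 0], title="Histogram")
df["value"].plot.box(ax=axes[0, 1], title="Box plot")
df["group"].value_counts().plot.pie(ax=axes[1, 0], title="Pie chart")

# QQ plot against a standard normal: sorted sample vs. theoretical quantiles
n = len(df)
sample = np.sort(df["value"].to_numpy())
theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
axes[1, 1].scatter(theoretical, sample, s=8)
axes[1, 1].set_title("QQ plot (normal)")
axes[1, 1].set_xlabel("theoretical quantiles")
axes[1, 1].set_ylabel("sample quantiles")

fig.tight_layout()
```

Calling `fig.savefig("univariate.png")` afterwards would write the figure to disk; points lying close to a straight line in the QQ panel suggest the sample is approximately normal.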

Multivariate

The main goal is to examine the relationships between variables.

Non-Graphic

Typical methods are cross-tabulation and cross-statistics (correlation, covariance, etc.).
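As a rough illustration, cross-tabulation applies to pairs of categorical variables, while correlation and covariance apply to numeric ones. The data and column names below are made up for this sketch.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
df = pd.DataFrame({
    "sex": ["F", "M", "F", "M", "F", "M"],
    "smoker": ["no", "no", "yes", "yes", "no", "yes"],
    "height": [160.0, 175.0, 158.0, 180.0, 165.0, 172.0],
    "weight": [55.0, 72.0, 52.0, 85.0, 60.0, 70.0],
})

# Cross-tabulation: counts for each combination of two categorical variables
table = pd.crosstab(df["sex"], df["smoker"])

# Cross-statistics on the numeric columns
numeric = df[["height", "weight"]]
corr = numeric.corr()  # Pearson correlation by default
cov = numeric.cov()    # sample covariance

print(table)
print(corr)
print(cov)
```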

Graphic

Common choices include box plots, stacked bar charts, parallel coordinates, and heatmaps.
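The sketch below draws one of each, assuming matplotlib is available. The randomly generated data and the column names (`group`, `flag`, `x`, `y`, `z`) are hypothetical, and the heatmap is drawn with plain `imshow` on a correlation matrix as one possible approach.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import parallel_coordinates

# Hypothetical data: two categorical and three numeric columns.
rng = np.random.default_rng(0)
n = 90
df = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], n // 3),
    "flag": rng.choice(["yes", "no"], size=n),
    "x": rng.normal(size=n),
    "y": rng.normal(size=n),
    "z": rng.normal(size=n),
})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Box plot: a numeric variable split by a categorical one
df.boxplot(column="x", by="group", ax=axes[0, 0])

# Stacked bar chart built from a cross-tabulation
pd.crosstab(df["group"], df["flag"]).plot.bar(stacked=True, ax=axes[0, 1])

# Parallel coordinates: one line per row, colored by group
parallel_coordinates(df[["group", "x", "y", "z"]], "group", ax=axes[1, 0])

# Heatmap of the correlation matrix
corr = df[["x", "y", "z"]].corr()
im = axes[1, 1].imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
axes[1, 1].set_xticks(range(len(corr.columns)))
axes[1, 1].set_xticklabels(corr.columns)
axes[1, 1].set_yticks(range(len(corr.index)))
axes[1, 1].set_yticklabels(corr.index)
fig.colorbar(im, ax=axes[1, 1])
```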

Basic EDA with pandas

Missing Values

• isna, isnull, notna, notnull, dropna, fillna

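• pandas uses NaN as its main missing-value marker; None, NaT (for datetimes), and pd.NA (for nullable dtypes) are also detected as missing by isna.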
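A short sketch of how these methods fit together, using a made-up two-column frame. Filling with the column mean and with the placeholder string `"unknown"` is an arbitrary choice for the example, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", None, "z"],
})

mask = df.isna()             # isnull is an alias; notna / notnull are the inverse
n_missing = df.isna().sum()  # missing count per column

dropped = df.dropna()        # drop rows containing any missing value
filled = df.fillna({"a": df["a"].mean(), "b": "unknown"})  # per-column fill values

print(n_missing)
print(dropped)
print(filled)
```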

Data Frame

• index, columns, dtypes, select_dtypes, loc, iloc, head, tail, shape, info, describe, apply, aggregate, drop, rename, replace, nsmallest, nlargest, sort_values, sort_index, value_counts, reset_index, …
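A few of these attributes and methods in use, on a small made-up frame (the names and values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
df = pd.DataFrame({
    "name": ["kim", "lee", "park", "choi"],
    "city": ["Seoul", "Busan", "Seoul", "Daegu"],
    "age": [31, 25, 47, 38],
    "score": [88.5, 92.0, 79.5, 85.0],
})

# Structure and summary
print(df.shape, df.dtypes, sep="\n")
df.info()
summary = df.describe()
numeric_cols = df.select_dtypes(include="number").columns.tolist()

# Selection: loc is label-based, iloc is position-based
first_by_label = df.loc[0, "name"]
first_by_position = df.iloc[0, 0]

# Ordering and counting
oldest = df.nlargest(2, "age")["name"].tolist()
by_score = df.sort_values("score", ascending=False).reset_index(drop=True)
city_counts = df["city"].value_counts()

# Reshaping columns and aggregating
renamed = df.rename(columns={"score": "points"}).drop(columns=["city"])
stats = df[["age", "score"]].aggregate(["min", "max"])
age_group = df["age"].apply(lambda a: "40+" if a >= 40 else "under 40")
```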

Visualization

• plot, plot.area, plot.bar, plot.barh, plot.box, plot.density, plot.hexbin, plot.hist, plot.kde, plot.line, plot.pie, plot.scatter

Preprocessing

Preprocessing can be broadly divided into the following steps.

• Data Cleansing
  A collective term for correcting noise in the data, such as missing values, incorrectly entered values, and inconsistent values.

• Data Integration
  The data you need usually does not sit in a single DataFrame, so the pieces have to be combined (merge, join, etc.).

• Transformation
  Converting the data into a different form, such as a different type, scale, or encoding.

• Reduction
  Extracting only the data that is needed, or reducing its dimensionality, so that the analysis runs efficiently.
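The four steps above can be sketched as one small pipeline. Everything here is an assumption made for illustration: the `orders` and `customers` frames, their columns, median imputation, the log transform, and the choice of which columns to keep.

```python
import numpy as np
import pandas as pd

# Hypothetical input tables, made up for illustration only.
orders = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "city": ["Seoul", "seoul ", "Busan", None, "Busan"],
    "amount": [100.0, np.nan, 250.0, 80.0, 1200.0],
    "note": ["", "", "", "", ""],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "signup": ["2023-01-05", "2023-02-10", "2023-03-15", "2023-04-20"],
})

# 1. Data cleansing: fix inconsistent labels and fill missing values
clean = orders.copy()
clean["city"] = clean["city"].str.strip().str.title().fillna("Unknown")
clean["amount"] = clean["amount"].fillna(clean["amount"].median())

# 2. Data integration: bring in columns from another table
merged = clean.merge(customers, on="customer_id", how="left")

# 3. Transformation: change types and scales
merged["signup"] = pd.to_datetime(merged["signup"])
merged["log_amount"] = np.log1p(merged["amount"])

# 4. Reduction: keep only the columns needed for the analysis
reduced = merged[["customer_id", "city", "signup", "log_amount"]]

print(reduced)
```

A left merge keeps every order even when no matching customer exists, which makes unmatched rows visible as missing values instead of dropping them silently.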