seaborn

1. Seaborn

1-1. Customizing styles

This section shows how to fine-tune the look of a visualization with seaborn's set_style and set_context.

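1-2. Styling with Seaborn

Styling can be thought of along two dimensions.

• set_style: controls the overall look of the figure by defining the background color, grid, spines, and ticks

• set_context: scales the figure for different media, such as presentations or reports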

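1-2-1. set_style: styling the overall look of the figure

• Using built-in themes

seaborn ships with five built-in themes: 'darkgrid', 'whitegrid', 'dark', 'white', and 'ticks'.

darkgrid is the default applied by sns.set_theme(), but you can switch to whichever you prefer.

To use a theme, pass its name to sns.set_style().

• sns.set_style('darkgrid')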

import seaborn as sns
tips = sns.load_dataset('tips')  # assumes seaborn's 'tips' example dataset
sns.set_style('darkgrid')
sns.stripplot(x='day', y='total_bill', data=tips)
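To compare the five themes without drawing anything, the settings each one would apply can be inspected. This is a minimal sketch that relies on sns.axes_style(), which returns a theme's settings as a dictionary without changing the global state; the variable names are only illustrative.

```python
import seaborn as sns

THEMES = ["darkgrid", "whitegrid", "dark", "white", "ticks"]

# axes_style(name) returns the matplotlib rc settings that the theme would apply.
# Here we only look at whether the theme turns the grid on.
grid_by_theme = {name: bool(sns.axes_style(name)["axes.grid"]) for name in THEMES}

for name, has_grid in grid_by_theme.items():
    print(f"{name:10s} grid={has_grid}")
```

Only the two "grid" themes enable the grid; the other three differ in background color and tick marks.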

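despine (removing axis spines/borders)

There are four spines: left, right, bottom, and top.

Calling sns.despine() after drawing a plot removes the top and right spines by default.

To remove another spine, pass its side as a keyword argument set to True (for example, left=True).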

sns.set_style("white")
sns.stripplot(x='day',y='total_bill', data=tips)
sns.despine()  # remove the right and top spines

To also remove the bottom and left spines, do the following:

sns.despine(left=True, bottom=True)
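The effect of despine can be checked by asking each spine whether it is still visible. The sketch below uses a throwaway matplotlib line plot rather than the tips data, so it runs on its own.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [3, 1, 2])

# top and right are removed by default; left and bottom are requested explicitly
sns.despine(ax=ax, left=True, bottom=True)

visible = {side: ax.spines[side].get_visible()
           for side in ("left", "right", "bottom", "top")}
print(visible)
```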

1-2-2. set_context: styling for different media

Styling a figure for a presentation with matplotlib alone takes some effort. With seaborn it is easy: a single call to sns.set_context() is enough.

There are three levels of complexity to consider.

Pass in one parameter that adjusts the scale of the plot

Pass in two parameters – one for the scale and the other for the font size

Pass in three parameters – including the previous two, as well as the rc with the style parameter that you want to override

1) Adjusting the overall scale

There are four scales (sizes) to choose from: paper, notebook, talk, and poster.

The default is notebook.

Drawing with paper, the smallest scale, looks like this.

sns.set_style('ticks')

sns.set_context('paper')  # smallest scale
sns.stripplot(x='day',y='total_bill', data=tips)

Drawing with poster, the largest scale, looks like this.

sns.set_style("ticks")
sns.set_context("poster")  # largest scale
sns.stripplot(x="day", y="total_bill", data=tips)

The text is noticeably larger.

2) Adjusting the font size

The scale also changes the font size, but there is a separate parameter that scales the fonts independently.

Pass font_scale to sns.set_context().

sns.set_context('poster',font_scale=.5)
sns.stripplot(x="day", y="total_bill", data=tips)
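How the context and font_scale interact can also be checked numerically. The sketch below uses sns.plotting_context(), which returns the settings for a context without applying them; reading "font.size" is just one convenient entry to compare.

```python
import seaborn as sns

CONTEXTS = ["paper", "notebook", "talk", "poster"]

# Base font size that each context would set, from smallest to largest scale.
font_sizes = {name: sns.plotting_context(name)["font.size"] for name in CONTEXTS}
print(font_sizes)

# font_scale multiplies the font-related entries on top of the chosen context.
half_poster = sns.plotting_context("poster", font_scale=0.5)["font.size"]
print(half_poster)
```

With font_scale=0.5 the poster context keeps its thick lines and large ticks, while the fonts shrink to half the poster size.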

3) Fine-tuning with the rc parameter

For finer control, pass the rc parameter as a dictionary. rc is said to be short for "run command".

For example, to change the width of the grid lines, pass rc={'grid.linewidth': 5}.

sns.set_context("poster", font_scale = 1, rc={"grid.linewidth": 5})
sns.stripplot(x="day", y="total_bill", data=tips)

Available rc parameter options
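One way to list the available options is to ask seaborn for the parameters currently in effect. This is a small sketch: called without arguments, sns.plotting_context() returns the size-related keys handled by set_context, and sns.axes_style() returns the appearance-related keys handled by set_style.

```python
import seaborn as sns

# Size-related keys (fonts, line widths, tick sizes): these belong to set_context.
context_keys = sorted(sns.plotting_context().keys())

# Appearance-related keys (colors, grid, spines, ticks): these belong to set_style.
style_keys = sorted(sns.axes_style().keys())

print(context_keys)
print(style_keys)
```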

1-2-3. Chart

Line Charts

• sns.lineplot(data)

import matplotlib.pyplot as plt
import seaborn as sns

# set the width and height of the figure
plt.figure(figsize=(14, 6))

# add a title
plt.title("title name")

# check the column names
list(dataset.columns)

# plot one column of the dataset as a line
sns.lineplot(data=dataset['column_name'], label='column_name')

# label the x axis
plt.xlabel('x_label_name')

Bar Charts

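A bar chart compares a value across categories. When the categories are time-ordered (for example, months used as the index), it also shows how the value changes over time.

• sns.barplot(x = index (for example, a time-ordered one), y = variable of interest)

The examples below use data from Kaggle Learn (ign_data).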

# What is the highest average score received by PC games?
high_score = ign_data.loc['PC'].max()

# Which genre received the lowest average score on the PlayStation Vita platform?
playstation_min_score = ign_data.loc['PlayStation Vita'].min()
worst_genre = ign_data.loc['PlayStation Vita'].idxmin()  # column label of the minimum

# creating bar chart that shows score for racing games, for each platform
sns.barplot(x=ign_data.index, y = ign_data['Racing'])

plt.xticks(rotation= 90) 
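Since ign_data itself is not reproduced here, the same steps can be sketched on a tiny made-up table with the same shape (platforms as the index, genres as columns). All numbers below are invented for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for ign_data: average score per platform (rows) and genre (columns).
ign_like = pd.DataFrame(
    {"Action": [7.1, 6.4, 7.8], "Racing": [6.9, 7.3, 5.8], "Puzzle": [7.5, 6.0, 8.1]},
    index=pd.Index(["PC", "PlayStation Vita", "Wii"], name="Platform"),
)

high_score = ign_like.loc["PC"].max()                    # highest average score on PC
worst_genre = ign_like.loc["PlayStation Vita"].idxmin()  # genre with the lowest score

# One bar per platform, showing the Racing score.
ax = sns.barplot(x=ign_like.index, y=ign_like["Racing"])
plt.xticks(rotation=90)

print(high_score, worst_genre)
```

idxmin() returns the column label directly, which avoids comparing the row against its own minimum.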

Heatmap

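• sns.heatmap(data, annot=True, fmt)

◦ annot=True: whether to print the value in each cell; usually turned on.

◦ fmt: the format used when printing the values

▪ 'd': show as integers

▪ 'f': show as floats

◦ cmap: the color palette to use

◦ linewidths: width of the lines that separate the cells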

# plot heatmap of average score by genre and platform
# the averages are floats, so use a float format; fmt='d' fails on non-integer values
sns.heatmap(data=ign_data, annot=True, fmt='.1f', cmap='YlGnBu')
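The choice of fmt depends on the data type of the cells. The sketch below draws two heatmaps from made-up tables, one with float averages and one with integer counts, to show a float format and 'd' side by side; the tables and their values are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical tables: float averages and integer counts.
scores = pd.DataFrame({"Action": [7.12, 6.43], "Racing": [6.91, 7.30]}, index=["PC", "Wii"])
counts = pd.DataFrame({"Action": [12, 7], "Racing": [5, 9]}, index=["PC", "Wii"])

fig, (ax_f, ax_d) = plt.subplots(1, 2, figsize=(10, 4))

# Float data: a float format such as '.1f' rounds each annotation to one decimal.
sns.heatmap(data=scores, annot=True, fmt=".1f", cmap="YlGnBu", linewidths=0.5, ax=ax_f)

# Integer data: 'd' prints each cell as a whole number.
sns.heatmap(data=counts, annot=True, fmt="d", cmap="YlGnBu", linewidths=0.5, ax=ax_d)

float_labels = sorted(t.get_text() for t in ax_f.texts)
int_labels = sorted(t.get_text() for t in ax_d.texts)
print(float_labels, int_labels)
```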

ScatterPlot

• sns.scatterplot(x=dataset['column1'], y=dataset['column2']): draws only the scatter plot

◦ Showing the relationship among three or more variables (through the color or size of the points)

sns.scatterplot(x=dataset['column1'], y=dataset['column2'], hue=dataset['column3'])

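• sns.swarmplot(x=dataset['categorical_column1'], y=dataset['continuous_column2'], color='0.5'): similar to a boxplot, but every point is drawn, so you can see how densely the data is concentrated

◦ color: controls how dark the points are (a grayscale string such as '0.5')

• sns.regplot(x=dataset['column1'], y=dataset['column2']): scatter plot + regression line

• sns.lmplot(x='column1', y='column2', hue='column3', data=dataset, order=1, height, ci): scatter plot + regression line

◦ hue: adds a categorical variable so that groups can be told apart by color.

◦ height: plays a role similar to figsize; it sets the height of the plot.

◦ ci: whether to draw the confidence interval. ci=None removes it.

◦ order: defaults to 1. 1 fits a straight line, 2 a quadratic curve, and 3 a cubic curve.

sns.lmplot(x="column1", y="column2", hue="column3", data=dataset)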

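Note: drawing an lmplot using only the rows that satisfy a condition

◦ data = df.query("condition on the fields")

Changing detailed options of the scatter points

◦ Use the scatter_kws option

Passing a dictionary through scatter_kws adjusts the marker size.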
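The two notes above, filtering with query and shrinking the markers with scatter_kws, can be combined in one call. This is a sketch on randomly generated data; the column names (x, y, group) and the condition "x > 2" are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: y depends roughly linearly on x, split into two groups.
rng = np.random.default_rng(0)
n = 60
df = pd.DataFrame({"x": rng.uniform(0, 10, n), "group": rng.choice(["a", "b"], n)})
df["y"] = 2.0 * df["x"] + rng.normal(0, 1, n)

# Keep only the rows that satisfy the condition.
subset = df.query("x > 2")

g = sns.lmplot(
    x="x", y="y", hue="group", data=subset,
    order=1,                 # straight-line fit
    height=4,                # height of the plot in inches
    ci=None,                 # no confidence band
    scatter_kws={"s": 20},   # marker size, passed through to the scatter points
)
```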

boxplot: box plot

• sns.boxplot(x, y, data, hue, palette)
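A minimal sketch of the boxplot call with all five arguments, using randomly generated data. The column names mimic the tips dataset (day, sex, total_bill), but the values are made up, and 'Set2' is just one of the palette names seaborn accepts.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data shaped like the tips dataset.
rng = np.random.default_rng(2)
n = 80
df = pd.DataFrame({
    "day": rng.choice(["Thur", "Fri", "Sat", "Sun"], n),
    "sex": rng.choice(["Male", "Female"], n),
    "total_bill": rng.uniform(5, 50, n),
})

# One box per day, split by sex through hue, colored with the Set2 palette.
ax = sns.boxplot(x="day", y="total_bill", hue="sex", data=df, palette="Set2")
```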

pairplot: pairwise relationship plot

• sns.pairplot(data, hue): pairplot over every numeric column

• sns.pairplot(data, x_vars=['x_field1', 'x_field2'], y_vars=['y_field1', 'y_field2']): pairplot for selected fields only

• sns.pairplot(data, vars=['field1', 'field2', 'field3'], kind='reg', height=3): for selected fields, draw regression lines in addition to the scatter plots

Distributions (histogram & density)

histogram

• sns.histplot(dataset['continuous_column'])

◦ Splitting the histogram by a categorical column: sns.histplot(data=dataset, x='continuous_column', hue='categorical_column')

density (KDE)

• sns.kdeplot(dataset['continuous_column'], fill=True)

◦ Splitting the KDE plot by a categorical column: sns.kdeplot(data=dataset, x='continuous_column', hue='categorical_column', fill=True)

• 2D density plot

◦ sns.jointplot(x=dataset['continuous_column1'], y=dataset['continuous_column2'], kind='kde')