Google BigQuery

What is BigQuery?

A cloud data warehouse provided by Google.

It is partly free, with caps on storage and query volume. New customers receive a $300 credit that can be used during the first 90 days, and every month 10 GB of storage and up to 1 TB of queries are free.
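To get a feel for what the monthly allowance means in practice, the arithmetic can be sketched as below. This is only an illustration: the constants copy the 10 GB and 1 TB figures above and treat them as decimal bytes, which is an assumption (the pricing page is the authority on the exact units), and the function names are made up for this note.

```python
# Assumed monthly free-tier limits, copied from the figures above (decimal bytes).
FREE_STORAGE_BYTES = 10 * 10**9  # 10 GB of storage
FREE_QUERY_BYTES = 1 * 10**12    # 1 TB of query processing


def remaining_free_query_bytes(bytes_processed_this_month: int) -> int:
    """Return how many query bytes are left in the monthly free allowance."""
    return max(FREE_QUERY_BYTES - bytes_processed_this_month, 0)


def queries_per_month(bytes_per_query: int) -> int:
    """Return how many times a query of the given size fits in the allowance."""
    if bytes_per_query <= 0:
        raise ValueError("bytes_per_query must be positive")
    return FREE_QUERY_BYTES // bytes_per_query


# A query that scans 25 GB fits 40 times into 1 TB.
print(queries_per_month(25 * 10**9))
```

Because billing is based on the bytes a query scans, selecting only the columns you need is the simplest way to stay under the cap.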

Companies that collect data with GA often link that data to BigQuery. GA reports show only aggregated values, which makes it hard to drill into individual records. Linking GA to BigQuery lets you extract and analyze the individual GA records with SQL.
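As a rough illustration of querying the individual records, here is a sketch that builds a SQL string for the GA4 export. It assumes the standard GA4 BigQuery export layout, where each day is written to a table named `events_YYYYMMDD`; the project ID, the dataset ID, and the function name `ga4_event_count_query` are placeholders invented for this example.

```python
from datetime import date


def ga4_event_count_query(project_id: str, dataset_id: str,
                          start: date, end: date) -> str:
    """Build a query that counts events by name over daily GA4 export tables."""
    if start > end:
        raise ValueError("start must not be after end")
    return (
        "SELECT event_name, COUNT(*) AS event_count\n"
        # The wildcard matches every daily table; _TABLE_SUFFIX narrows the range.
        f"FROM `{project_id}.{dataset_id}.events_*`\n"
        f"WHERE _TABLE_SUFFIX BETWEEN '{start:%Y%m%d}' AND '{end:%Y%m%d}'\n"
        "GROUP BY event_name\n"
        "ORDER BY event_count DESC"
    )


# "my-project" and "analytics_123456789" are placeholder names.
print(ga4_event_count_query("my-project", "analytics_123456789",
                            date(2024, 1, 1), date(2024, 1, 7)))
```

The resulting string can be passed to `client.query(...)` in the same way as the query further down in these notes.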

Of course, you can use BigQuery without GA. You can upload CSV data to BigQuery or connect a dataset, then query and analyze it.
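Besides the console upload, a CSV file can also be loaded from Python. The following is a minimal sketch, assuming the `google-cloud-bigquery` package is installed, credentials are configured, and the CSV has a header row; the helper names `qualified_table_id` and `load_csv` and all project, dataset, table, and file names are hypothetical.

```python
def qualified_table_id(project: str, dataset: str, table: str) -> str:
    """Join the three levels into the "project.dataset.table" form."""
    return f"{project}.{dataset}.{table}"


def load_csv(client, csv_path: str, table_id: str):
    """Load a local CSV file into `table_id` and return the finished load job.

    `client` is expected to be a google.cloud.bigquery.Client.
    """
    # Imported here so the sketch can be read and imported without the library.
    from google.cloud import bigquery

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # assumption: the first row is a header
        autodetect=True,      # let BigQuery infer the schema
    )
    with open(csv_path, "rb") as source_file:
        job = client.load_table_from_file(
            source_file, table_id, job_config=job_config
        )
    return job.result()  # block until the load job finishes


# Placeholder names, for illustration only.
print(qualified_table_id("project-data", "dataset", "my_csv_table"))
```

With a real client, the call would look like `load_csv(bigquery.Client(), "data.csv", qualified_table_id("project-data", "dataset", "my_csv_table"))`.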

Projects in BigQuery

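First, to view data in BigQuery you need to create a project.

A project acts as a folder that holds your datasets and the tables inside them.

After creating the project, click the 'Run a query in BigQuery' button to open the BigQuery console.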

Three ways to connect data to BigQuery

Public datasets provided by Google

Upload CSV data directly

Link GA4 data to BigQuery

With BigQuery, you can use SQL to work with large datasets.

import pandas as pd
from google.cloud import bigquery

# create a "Client" object
client = bigquery.Client()

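# Construct a reference to the dataset ("project-data" and "dataset" are placeholders).
# client.dataset() is deprecated, so build the DatasetReference directly.
dataset_ref = bigquery.DatasetReference("project-data", "dataset")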

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

# API request - list the tables in the dataset
tables = list(client.list_tables(dataset))

# See which tables the dataset contains
for table in tables:
    print(table.table_id)


# construct a reference to the 'sth' table
table_ref = dataset_ref.table("sth")

# API request - fetch the table
table = client.get_table(table_ref)


# preview the first five lines of the "full" table
client.list_rows(table, max_results=5).to_dataframe()

# Show only the fields you want; here, only the first column
client.list_rows(table, max_results=5, selected_fields=table.schema[:1]).to_dataframe()
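Slicing `table.schema` by position works, but it is often easier to look at the schema first and then pick fields by name. A small sketch of that idea follows, relying only on the `name`, `field_type`, and `mode` attributes of the schema fields; the helper names and the stand-in `FakeField` type are invented here so the example runs without a BigQuery connection.

```python
from collections import namedtuple


def describe_schema(schema):
    """Return (name, type, mode) tuples for each field in a table schema."""
    return [(field.name, field.field_type, field.mode) for field in schema]


def select_fields(schema, names):
    """Return the schema fields whose names are in `names`, in schema order."""
    wanted = set(names)
    return [field for field in schema if field.name in wanted]


# Stand-in for bigquery.SchemaField so the sketch runs without credentials.
FakeField = namedtuple("FakeField", ["name", "field_type", "mode"])
example_schema = [
    FakeField("id", "INTEGER", "REQUIRED"),
    FakeField("title", "STRING", "NULLABLE"),
    FakeField("score", "FLOAT", "NULLABLE"),
]

for row in describe_schema(example_schema):
    print(row)
print([f.name for f in select_fields(example_schema, ["score", "id"])])
```

With a real table, `select_fields(table.schema, ["some_column"])` could be passed as `selected_fields` to `client.list_rows`.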

Running a SQL query

# column1..column3, the table name, and the condition are placeholders
query = """
        SELECT column1, column2, column3
        FROM `bigquery-public-data.dataset.sth`
        WHERE condition
        """

# Optionally cap the bytes billed at 10 GB (10**10 bytes) via job_config,
# so that a query that would scan more than that fails instead of running.
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(query, job_config=safe_config)

# API request - run the query, and return a pandas DataFrame
results = query_job.to_dataframe()

# View top few rows of results
print(results.head())
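Before relying on the byte cap, it can be useful to check how much data a query would scan. One way to do that is a dry run, sketched below under the assumption that `google-cloud-bigquery` is installed and `client` is an authenticated `bigquery.Client`; the helper names `format_bytes` and `estimate_query_bytes` are made up for this note.

```python
def format_bytes(num_bytes: int) -> str:
    """Format a byte count using decimal units (1 GB = 10**9 bytes)."""
    units = ["B", "KB", "MB", "GB", "TB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit that keeps the value below 1000, or at TB.
        if value < 1000 or unit == units[-1]:
            return f"{value:.2f} {unit}"
        value /= 1000


def estimate_query_bytes(client, query: str) -> int:
    """Dry-run `query` and return the bytes it would process, without running it."""
    # Imported here so the sketch can be read and imported without the library.
    from google.cloud import bigquery

    dry_run_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    dry_run_job = client.query(query, job_config=dry_run_config)
    return dry_run_job.total_bytes_processed


# The 10**10 cap used above corresponds to:
print(format_bytes(10**10))
```

With a real client, `format_bytes(estimate_query_bytes(client, query))` gives a readable estimate to compare against `maximum_bytes_billed`.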