๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ 

 

 

< ์ด์ „ ๊ธ€ > 

https://silvercoding.tistory.com/57

 

[์‹œ๊ฐํ™” ๋ถ„์„ ํ”„๋กœ์ ํŠธ] 2-1 pandas๋กœ ์—ฌ๋Ÿฌ csv ํŒŒ์ผ ํ•ฉ์น˜๊ธฐ

๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ < ์ด์ „ ๊ธ€ > https://silvercoding.tistory.com/56 https://silvercoding.tistory.com/55 https://silvercoding.tistory.com/54 https://silvercoding.tistory.com/53 https://silvercoding...

silvercoding.tistory.com

 

 


 ๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ & ์‚ดํŽด๋ณด๊ธฐ 
import pandas as pd
raw = pd.read_excel('./data/subway_raw.xlsx')
raw.head()

์ด์ „ ๊ธ€์—์„œ ๋ณ‘ํ•ฉํ–ˆ๋˜ ํŒŒ์ผ์„ ๋ถˆ๋Ÿฌ์™€ ์ค€๋‹ค. 

raw.info()

์ด 99342๊ฐœ์˜ row๊ฐ€ ์กด์žฌํ•˜๊ณ  ,  null๊ฐ’์ด ์—†๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค. 

 

 

 

 

 

 


 ์Šน๊ฐ์ด ๋งŽ์•„์ง€๋Š” ๋‚ ์€ ์–ธ์ œ์ธ๊ฐ€ ? 

- ์ผ์ž , ์š”์ผ ๋ณ„ ์Šน์ฐจ์ด์Šน๊ฐ์ˆ˜ ํ™•์ธ ํ•ด ๋ณด๊ธฐ 

data_date = pd.pivot_table(raw, index = ['์‚ฌ์šฉ์ผ์ž', '์š”์ผ'], values= '์Šน์ฐจ์ด์Šน๊ฐ์ˆ˜', aggfunc= 'sum')
data_date.head()

 

- ์ผ์ž , ์š”์ผ ๋ณ„ ์Šน์ฐจ์ด์Šน๊ฐ ์ˆ˜ ๋‚ด๋ฆผ์ฐจ์ˆœ ์ •๋ ฌ 

data_date_sort = data_date.sort_values(by = '์Šน์ฐจ์ด์Šน๊ฐ์ˆ˜', ascending= False)
data_date_sort

 

์ผ์ž, ์š”์ผ ๋ณ„ ์Šน์ฐจ์Šน๊ฐ์ˆ˜๋ฅผ ๋‚ด๋ฆผ์ฐจ์ˆœ์œผ๋กœ ์ •๋ ฌ ํ•ด ๋ณด์•˜๋‹ค. ์ด ๊ฒฐ๊ณผ์— ๋Œ€ํ•œ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์„ ํ™•์ธํ•ด ๋ณด๋ฉด ๋‘๊ฐ€์ง€ ๊ฐ€์ •์„ ์„ธ์šธ ์ˆ˜ ์žˆ๋‹ค. 

 

1. 5์›”์— ์ง€ํ•˜์ฒ  ์Šน๊ฐ์ˆ˜๊ฐ€ ๊ฐ€์žฅ ๋งŽ๋‹ค. 

2. ๊ธˆ์š”์ผ์— ์ง€ํ•˜์ฒ  ์Šน๊ฐ์ˆ˜๊ฐ€ ๊ฐ€์žฅ ๋งŽ๋‹ค. 

 

 

(1) 5์›”์— ์ง€ํ•˜์ฒ  ์Šน๊ฐ์ˆ˜๊ฐ€ ๊ฐ€์žฅ ๋งŽ๋‹ค 

- '์—ฐ์›”' ์ปฌ๋Ÿผ , '์›”์ผ' ์ปฌ๋Ÿผ ์ถ”๊ฐ€ 

yearmonth_list = []
monthday_list = []
for date in raw['์‚ฌ์šฉ์ผ์ž']:
    yearmonth = str(date)[:6]   # ์™ผ์ชฝ๋ถ€ํ„ฐ 6์ž๋ฆฌ ๋ฌธ์ž ์„ ํƒ
    yearmonth_list.append(yearmonth)
    monthday = str(date)[4:]    # ์™ผ์ชฝ์—์„œ 5๋ฒˆ์งธ ๋ฌธ์ž๋ถ€ํ„ฐ ๋๊นŒ์ง€ ์„ ํƒ
    monthday_list.append(monthday)
    
# ์—ฐ์›”/ ์›”์ผ ์ปฌ๋Ÿผ ์ถ”๊ฐ€ํ•˜๊ธฐ
raw['์—ฐ์›”'] = yearmonth_list
raw['์›”์ผ'] = monthday_list
raw.head()

์›” ๋ณ„ ์Šน๊ฐ์ˆ˜๋ฅผ ๋ณด๊ธฐ ์œ„ํ•ด ์‚ฌ์šฉ์ผ์ž๋ฅผ ์—ฐ์›”, ์›”์ผ๋กœ ๋‚˜๋ˆ„์–ด ์ƒˆ๋กœ์šด ์ปฌ๋Ÿผ์„ ์ถ”๊ฐ€ํ•˜์˜€๋‹ค. 

data_month = pd.pivot_table(raw, index = '์—ฐ์›”', values = '์Šน์ฐจ์ด์Šน๊ฐ์ˆ˜', aggfunc='sum')
data_month = data_month.sort_values(by = '์Šน์ฐจ์ด์Šน๊ฐ์ˆ˜', ascending= False)
data_month

์›”๋ณ„ ์Šน์ฐจ์ด๊ฐ์ˆ˜ ํ”ผ๋ฒ—๋ฐ์ด๋ธ”์„ ์ƒ์„ฑํ•˜๊ณ  , ์Šน์ฐจ์ข…์Šน๊ฐ์ˆ˜๋ฅผ ๊ธฐ์ค€์œผ๋กœ ๋‚ด๋ฆผ์ฐจ์ˆœ ์ •๋ ฌ์„ ํ•ด๋ณด๋‹ˆ , 5์›”์— ๊ฐ€์žฅ ์Šน๊ฐ์ˆ˜๊ฐ€ ๋งŽ์€ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. 

๊ฒฐ๋ก  -> 5์›”์— ์ง€ํ•˜์ฒ  ์Šน๊ฐ์ˆ˜๊ฐ€ ๊ฐ€์žฅ ๋งŽ๋‹ค : True

 

 

 

(2) ๊ธˆ์š”์ผ์— ์ง€ํ•˜์ฒ  ์Šน๊ฐ์ˆ˜๊ฐ€ ๊ฐ€์žฅ ๋งŽ๋‹ค 

data_week = pd.pivot_table(raw, index = '์š”์ผ', values = '์Šน์ฐจ์ด์Šน๊ฐ์ˆ˜', aggfunc='sum')
data_week = data_week.sort_values(by = '์Šน์ฐจ์ด์Šน๊ฐ์ˆ˜', ascending= False)
data_week

์š”์ผ๋ณ„ ์Šน์ฐจ์ด๊ฐ์ˆ˜ ํ”ผ๋ฒ—ํ…Œ์ด๋ธ”์„ ์ƒ์„ฑํ•˜๊ณ  , ์Šน์ฐจ์ด์Šน๊ฐ์ˆ˜๋ฅผ ๊ธฐ์ค€์œผ๋กœ ๋‚ด๋ฆผ์ฐจ์ˆœ ์ •๋ ฌ์„ ํ•ด๋ณด๋‹ˆ ๊ธˆ์š”์ผ์— ์Šน๊ฐ์ˆ˜๊ฐ€ ๊ฐ€์žฅ ๋งŽ์€ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. 

 

๊ฒฐ๋ก  -> ๊ธˆ์š”์ผ์— ์ง€ํ•˜์ฒ  ์Šน๊ฐ์ˆ˜๊ฐ€ ๊ฐ€์žฅ ๋งŽ๋‹ค : True

 

 

 

 

 


* ์›”๋ณ„ / ์ผ์ž๋ณ„ ์Šน๊ฐ์ˆ˜ ๊ทธ๋ž˜ํ”„ ์‹œ๊ฐํ™” 

- 2019๋…„ 1์›” ๋ฐ์ดํ„ฐ๋กœ ํ…Œ์ŠคํŠธ ํ•ด๋ณด๊ธฐ 

df_selected = raw[ raw['์—ฐ์›”'] == '201901']
df_selected.head()

df_pivot = pd.pivot_table(df_selected, index = ['์›”์ผ','์š”์ผ'], values = '์Šน์ฐจ์ด์Šน๊ฐ์ˆ˜', aggfunc= 'sum')
df_pivot = df_pivot.reset_index()
df_pivot

2019๋…„ 1์›” ๋‚ ์งœ๋ณ„ ์Šน์ฐจ์ด์Šน๊ฐ์ˆ˜๋ฅผ ์ง‘๊ณ„ํ•˜๋Š” ํ”ผ๋ฒ—ํ…Œ์ด๋ธ”์„ ์ƒ์„ฑํ•œ๋‹ค. ์ด๋ฅผ pointplot์„ ์‚ฌ์šฉํ•˜์—ฌ ์‹œ๊ฐํ™” ํ•ด๋ณด์ž. 

import matplotlib.pyplot as plt
import seaborn as sns 
from matplotlib import font_manager, rc
import platform 

# ํ•œ๊ธ€ ํฐํŠธ ์‚ฌ์šฉ
if platform.system() == 'Windows': 
    path = 'c:/Windows/Fonts/malgun.ttf'
    font_name = font_manager.FontProperties(fname=path).get_name()
    rc('font', family=font_name)
elif platform.system() == 'Darwin':
    rc('font', family='AppleGothic')
fig, ax = plt.subplots( figsize=(20,6) )
# ๊ทธ๋ž˜ํ”„ ๊ทธ๋ฆฌ๊ธฐ
sns.pointplot(data = df_pivot, x = '์›”์ผ', y = '์Šน์ฐจ์ด์Šน๊ฐ์ˆ˜')

fig, ax = plt.subplots( figsize=(20,6) )
# ๊ทธ๋ž˜ํ”„ ๊ทธ๋ฆฌ๊ธฐ
sns.pointplot(data = df_pivot, x = '์š”์ผ', y = '์Šน์ฐจ์ด์Šน๊ฐ์ˆ˜')

1์›” ๊ทธ๋ž˜ํ”„๋งŒ ๊ทธ๋ ค๋ณด์•˜๋”๋‹ˆ , ๊ธˆ์š”์ผ์— ๊ฐ€์žฅ ๋งŽ์€ ์Šน๊ฐ์ˆ˜๋ฅผ ์ฐ๊ณ , ๊ทธ ์ดํ›„๋กœ ์ค„์–ด๋“ค์–ด ์ผ์š”์ผ์—๋Š” ์Šน๊ฐ์ˆ˜๊ฐ€ ๋Œ€ํญ ์ค„์–ด๋“œ๋Š” ๊ฒฝํ–ฅ์„ ๋ณด์ธ๋‹ค. ๋งค ๋‹ฌ๋งˆ๋‹ค ํŽธ์ฐจ๊ฐ€ ์žˆ๋Š”์ง€ ํ™•์ธํ•˜๊ธฐ ์œ„ํ•ด 1-6์›”์˜ ๋‚ ์งœ๋ณ„ ์Šน๊ฐ์ˆ˜ ์‹œ๊ฐํ™”๋ฅผ ํ•ด๋ณธ๋‹ค. 

 

 

 

 

- ์ƒ๋ฐ˜๊ธฐ ๋‚ ์งœ๋ณ„ ์Šน๊ฐ์ˆ˜ ์‹œ๊ฐํ™” 

raw['์—ฐ์›”'].unique()

for yearmonth in raw['์—ฐ์›”'].unique():
    df_selected = raw[ raw['์—ฐ์›”'] == yearmonth]  # ํ•ด๋‹น ์—ฐ์›” ๋ฐ์ดํ„ฐ ์„ ํƒํ•˜๊ธฐ
    df_pivot = pd.pivot_table(df_selected, index = ['์›”์ผ','์š”์ผ'], values = '์Šน์ฐจ์ด์Šน๊ฐ์ˆ˜', aggfunc= 'sum')# ์ผ์ž๋ณ„ ์Šน๊ฐ์ˆ˜ ๊ณ„
    df_pivot = df_pivot.reset_index()
    
    fig, ax = plt.subplots( figsize=(20,6) )
    
    ax.set_title(f'์ผ์ž๋ณ„ ์ง€ํ•˜์ฒ ์Šน๊ฐ์ˆ˜({yearmonth})')  # ๊ทธ๋ž˜ํ”„ ์ œ๋ชฉ ์ถ”๊ฐ€ํ•˜๊ธฐ
    sns.pointplot(data = df_pivot, x = '์›”์ผ', y = '์Šน์ฐจ์ด์Šน๊ฐ์ˆ˜')

 

ํ•ด์„ :  ์ „์ฒด์ ์œผ๋กœ ํ˜•ํƒœ๊ฐ€ ๋น„์Šทํ•œ ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ๊ธˆ์š”์ผ ์ดํ›„ ์ฃผ๋ง์—๋Š” ์Šน๊ฐ์ˆ˜๊ฐ€ ๋–จ์–ด์ง€๊ณ  , ์ผ์š”์ผ์— ๋Œ€ํญ ํ•˜๋ฝํ•˜๋Š” ๊ฒฝํ–ฅ์„ ๋ณด์ธ๋‹ค. 2์›”๋‹ฌ์„ ๋ณด๋ฉด ์„ค์—ฐํœด๋กœ ์ธํ•ด์„œ 2์›” ์ดˆ ์ฃผ์ค‘์—๋„ ์Šน๊ฐ์ˆ˜์˜ ์ˆ˜๊ฐ€ ๋‚ฎ์œผ๋ฉฐ , 6์›” 6์ผ ํ˜„์ถฉ์ผ์—๋„ ์Šน๊ฐ์ˆ˜๊ฐ€ ๋–จ์–ด์ง€๋Š” ํ˜„์ƒ ๋˜ํ•œ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ์ง€ํ•˜์ฒ  ์Šน๊ฐ์ˆ˜๋Š” ๊ณตํœด์ผ์˜ ์˜ํ–ฅ์„ ๋ฐ›๋Š”๋‹ค๊ณ  ํ•ด์„ํ•  ์ˆ˜ ์žˆ๋‹ค. 

 

 

 

 

+ Recent posts