๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ 

 

 

< ์ด์ „ ๊ธ€ > 

https://silvercoding.tistory.com/56

 

[์‹œ๊ฐํ™” ๋ถ„์„ ํ”„๋กœ์ ํŠธ] 1-3. ํŠน์ • ํŒ€์— ๊ฐ•ํ•œ ์•ผ๊ตฌ์„ ์ˆ˜ ๋ถ„์„ํ•˜๊ธฐ

๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ < ์ด์ „ ๊ธ€ > https://silvercoding.tistory.com/55 https://silvercoding.tistory.com/54 https://silvercoding.tistory.com/53 https://silvercoding.tistory.com/52 [python ์‹œ๊ฐํ™”] 1. se..

silvercoding.tistory.com

 

 


import pandas as pd

* pandas๋กœ ํŒŒ์ผ ์ฝ์–ด์˜ค๊ธฐ 

์ถœ์ฒ˜ - ๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ

< ๊ด€๋ จ ์ฐธ๊ณ  ๋ฌธ์„œ > 

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html

 

pandas.read_excel — pandas 1.3.1 documentation

Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IN

pandas.pydata.org

 

 

 

 


 pandas ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์—‘์…€ ํŒŒ์ผ ๋ณ‘ํ•ฉํ•˜๊ธฐ 

(1) ์—‘์…€ํŒŒ์ผ ํ•œ๊ฐœ ๋ถˆ๋Ÿฌ์™€์„œ ์‚ดํŽด๋ณด๊ธฐ 

* ํŒŒ์ผ : ์ผ์ž / ๋…ธ์„  / ์ง€ํ•˜์ฒ ์—ญ ๋ณ„ ์Šนํ•˜์ฐจ ๊ณ ๊ฐ ์ˆ˜ ( 2019๋…„ ์ƒ๋ฐ˜๊ธฐ - 1์›” ~ 6์›”, ๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ œ๊ณต ) 

file = './rawfiles/CARD_SUBWAY_MONTH_201901.csv'
raw = pd.read_csv(file)
raw.head()

raw.info()

์ด 18334 ๊ฐœ์˜ ๋ฐ์ดํ„ฐ๊ฐ€ ์žˆ๊ณ , ๊ฒฐ์ธก๊ฐ’์€ ์—†๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. 

 

 

 

(2) ๋‘๊ฐœ ์—‘์…€ํŒŒ์ผ ํ•ฉ์ณ๋ณด๊ธฐ  

raw = pd.DataFrame()  # ๋นˆ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„ ์ƒ์„ฑ 
raw.head()

# ์ฒซ ๋ฒˆ์งธ ํŒŒ์ผ
file = './rawfiles/CARD_SUBWAY_MONTH_201901.csv'
temp = pd.read_csv(file)
# temp.head()
raw = raw.append(temp)  # ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„ ์ถ”๊ฐ€ 
# raw.head()

# ๋‘ ๋ฒˆ์งธ ํŒŒ์ผ
file = './rawfiles/CARD_SUBWAY_MONTH_201902.csv'
temp = pd.read_csv(file)
raw = raw.append(temp)  # ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„ ์ถ”๊ฐ€ 

raw.info()

์šฐ์„  , ๋นˆ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์„ ์ƒ์„ฑํ•œ ํ›„, ํŒŒ์ผ์„ ์ฝ์–ด ์™€์„œ append๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋นˆ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์— ๋ฐ์ดํ„ฐ๋ฅผ ์ถ”๊ฐ€ํ•ด ์ค€๋‹ค. ์—ฌ๋Ÿฌ๊ฐœ์˜ ํŒŒ์ผ์„ ํ•ฉ์น˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ์ด๋Ÿฌํ•œ ๋ฐ˜๋ณต๋˜๋Š” ์ž‘์—…์„ for๋ฌธ์œผ๋กœ ์ž‘์„ฑํ•ด์ฃผ๋ฉด ๋œ๋‹ค. 

 

 

 

(3) ํด๋”์— ์žˆ๋Š” ๋ชจ๋“  ์—‘์…€ํŒŒ์ผ ๋ณ‘ํ•ฉํ•˜๊ธฐ

- ํด๋” ์•ˆ์˜ ํŒŒ์ผ ์ด๋ฆ„ ๊ฐ€์ ธ์˜ค๊ธฐ 

# ํด๋” , ํŒŒ์ผ์„ ๊ด€๋ฆฌํ•˜๋Š” os ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ 
import os
os.listdir()

os.listdir() ์„ ์‚ฌ์šฉํ•˜์—ฌ ํ˜„์žฌ ํด๋”์— ์žˆ๋Š” ํด๋”์™€ ํŒŒ์ผ๋“ค์˜ ์ด๋ฆ„์„ ๊ทธ๋Œ€๋กœ ๊ฐ€์ ธ์˜ฌ ์ˆ˜ ์žˆ๋‹ค. 

dirpath = './rawfiles/'
files = os.listdir(dirpath)
files

๊ฐ™์€ ๋ฐฉ๋ฒ•์œผ๋กœ ๋ณธ ํฌ์ŠคํŒ…์—์„œ ์‚ฌ์šฉํ•  ํ˜„์žฌ ํด๋” ์•ˆ์˜ rawfiles์•ˆ์˜ ํŒŒ์ผ๋“ค์„ ๋ถˆ๋Ÿฌ ์˜จ๋‹ค. 

 

์ด๋ฅผ ์ด์šฉํ•˜์—ฌ ์ผ์ผ์ด ํŒŒ์ผ ์ด๋ฆ„์„ ๋ณต๋ถ™ํ•˜์ง€ ์•Š์•„๋„ ํŒŒ์ผ์„ ๋ถˆ๋Ÿฌ์˜ฌ ์ˆ˜ ์žˆ๋‹ค. 

 

- ๋ณ‘ํ•ฉํ•˜๊ธฐ 

# ๋นˆ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„ ์ค€๋น„
raw = pd.DataFrame() 

# ๋ณ‘ํ•ฉ
for file in os.listdir('./rawfiles'):
#     print(file)
    fpath = './rawfiles/'+file
    print(fpath)
    temp = pd.read_csv(fpath)
    raw = raw.append(temp, ignore_index = True)   #ignore_index = True  --> ๊ธฐ์กด ์ธ๋ฑ์Šค๋Š” ๋ฌด์‹œํ•˜๋ผ.

์œ„์™€ ๊ฐ™์ด ๋นˆ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์— ํŒŒ์ผ์„ ํ•˜๋‚˜์”ฉ ์ถ”๊ฐ€ํ•œ๋‹ค. ignore_index=True๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ธฐ์กด ์ธ๋ฑ์Šค๋ฅผ ๋ฌด์‹œํ•˜์ง€ ์•Š์œผ๋ฉด ์›๋ž˜ ์ž์‹ ์˜ ์ธ๋ฑ์Šค๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ธ๋ฑ์Šค๊ฐ€ ๋’ค์ฃฝ๋ฐ•์ฃฝ์œผ๋กœ ์„ค์ •๋˜์–ด ์žˆ๊ฒŒ๋œ๋‹ค. ์ˆœ์„œ๋Œ€๋กœ ํ•ด์ฃผ๊ธฐ ์œ„ํ•จ์ด๋‹ค. 

 

 

 


 ๋ณ‘ํ•ฉํ•œ ๋ฐ์ดํ„ฐ ์‚ดํŽด๋ณด๊ธฐ 
raw.head()

raw.tail()

raw.info()

์ด 99342 ๊ฐœ์˜ row๋กœ , raw.tail()์„ ํ–ˆ์„ ๋•Œ index์˜ ๊ฒฐ๊ณผ์™€ ๊ฐ™์œผ๋ฏ€๋กœ ์ •์ƒ์ ์œผ๋กœ ๋ณ‘ํ•ฉ๋œ ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. 

 

 

 


 '์š”์ผ' ์ปฌ๋Ÿผ ์ถ”๊ฐ€ ํ•ด๋ณด๊ธฐ - datetime ์‚ฌ์šฉ 
from datetime import datetime

์ถœ์ฒ˜ - ๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ

 

- datetime.strptime()

date_str = str(20190601)     # ์ˆซ์ž๊ฐ€ ์•„๋‹Œ ๋ฌธ์ž๋กœ ์ž…๋ ฅ๋˜์–ด์•ผ ํ•จ
date = datetime.strptime(date_str, "%Y%m%d")
date

datetime.datetime(2019, 6, 1, 0, 0)

# ์›”์š”์ผ : 0 ~ ์ผ์š”์ผ : 6
weekday = date.weekday()
weekday

5

์ง์ ‘ ๋‹ฌ๋ ฅ์„ ์ฐพ์•„๋ณด์ง€ ์•Š๊ณ ๋„ , datetime ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ์ด๋ ‡๊ฒŒ ์š”์ผ์„ ๋ฐ˜ํ™˜ํ•ด ์ค€๋‹ค. 

 

- '์š”์ผ' ์ปฌ๋Ÿผ ์ถ”๊ฐ€ 

weekday_dict = [ '์›”','ํ™”','์ˆ˜','๋ชฉ','๊ธˆ','ํ† ','์ผ']
weekday_list = []

for date_str in raw['์‚ฌ์šฉ์ผ์ž']:
    date = datetime.strptime(str(date_str), "%Y%m%d")
    weekday_index  = date.weekday()
    weekday = weekday_dict[weekday_index]
    weekday_list.append(weekday)

weekday_list์— ๊ฐ row์˜ ์š”์ผ์„ ์ถ”๊ฐ€ํ•ด ์ค€๋‹ค. 

raw['์š”์ผ'] = weekday_list
raw.sample(5)

'์š”์ผ' ์ปฌ๋Ÿผ์„ ์ถ”๊ฐ€ํ•ด ์ฃผ๊ณ  , sample์„ ์ด์šฉํ•˜์—ฌ ๋žœ๋ค์œผ๋กœ row๋ฅผ ๋ถˆ๋Ÿฌ์™€์„œ ํ™•์ธํ•ด ๋ณธ๋‹ค. 

raw.columns

new_columns = ['์‚ฌ์šฉ์ผ์ž',  '์š”์ผ', '๋…ธ์„ ๋ช…', '์—ญID', '์—ญ๋ช…', '์Šน์ฐจ์ด์Šน๊ฐ์ˆ˜', 'ํ•˜์ฐจ์ด์Šน๊ฐ์ˆ˜', '๋“ฑ๋ก์ผ์ž']
raw = raw[ new_columns ]
raw.head()

์š”์ผ์„ ์•ž์— ๋‘๊ธฐ ์œ„ํ•ด ์ปฌ๋Ÿผ์˜ ์œ„์น˜๋ฅผ ๋ฐ”๊พธ์–ด ์„ ํƒํ•ด์ค€๋‹ค. 

 

 

 

 


 ๋ฐ์ดํ„ฐ ์ €์žฅ 
raw.to_excel('./data/subway_raw.xlsx', index = False)

to_excel() ์„ ์ด์šฉํ•˜์—ฌ ์—‘์…€ ํŒŒ์ผ๋กœ ์ €์žฅํ•ด ์ค€๋‹ค. 


 

 

๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ 

 

 

< ์ด์ „ ๊ธ€ > 

https://silvercoding.tistory.com/55

 

[์‹œ๊ฐํ™” ๋ถ„์„ ํ”„๋กœ์ ํŠธ] 1-2. ์•ผ๊ตฌ์„ ์ˆ˜๊ฐ€ ๊ฐ•ํ•ด์ง€๋Š” ๊ณ„์ ˆ์ด ์žˆ์„๊นŒ?

๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ < ์ด์ „ ๊ธ€ > https://silvercoding.tistory.com/54 https://silvercoding.tistory.com/53 https://silvercoding.tistory.com/52 [python ์‹œ๊ฐํ™”] 1. seaborn ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ (distplot, relplot,..

silvercoding.tistory.com

 

 


 ํŠน์ • ํŒ€์— ๊ฐ•ํ•œ ์„ ์ˆ˜๊ฐ€ ์žˆ์„๊นŒ? 

* idea : '์ƒ๋Œ€' ์ปฌ๋Ÿผ์œผ๋กœ ๊ฒจ๋ฃจ์—ˆ๋˜ ์ƒ๋Œ€ ํŒ€์„ ์ถ”์ถœํ•˜์—ฌ ๊ฐ๊ฐ์˜ ์ƒ๋Œ€ ํŒ€๊ณผ์˜ ๊ฒฝ๊ธฐ์—์„œ์˜ ์ถœ๋ฃจ์œจ์„ ๊ณ„์‚ฐํ•œ๋‹ค.  

 

- ๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ 

import pandas as pd

file  = './data/KBO_2019_player_gamestats.csv'
raw = pd.read_csv(file, encoding = 'cp949')
raw.head()

 

 

- ์ƒ๋Œ€ ํŒ€๋ณ„ ๊ธฐ๋ก ์ •๋ฆฌ 

raw['์ƒ๋Œ€'].unique()

์šฐ์„  unique() ๋ฅผ ์ด์šฉํ•˜์—ฌ '์ƒ๋Œ€' ์ปฌ๋Ÿผ์— ์žˆ๋Š” ๊ฐ’๋“ค์„ ํ™•์ธํ•ด ์ค€๋‹ค. ์•ž์— @๊ฐ€ ๋ถ™์€ ๊ฒฝ์šฐ๋Š” ์›์ •๊ฒฝ๊ธฐ, ๊ทธ๋ ‡์ง€ ์•Š์œผ๋ฉด ํ™ˆ ๊ฒฝ๊ธฐ๋ฅผ ์˜๋ฏธํ•œ๋‹ค. ์ด๋ฅผ 'ํ™ˆ์–ด์›จ์ด' ๋ผ๋Š” ์ปฌ๋Ÿผ์„ ์ถ”๊ฐ€ํ•˜์—ฌ ๊ตฌ๋ถ„ํ•˜๋„๋ก ํ•˜๊ณ  , '์ƒ๋Œ€ํŒ€' ์ปฌ๋Ÿผ์—๋Š” ํŒ€๋ช…๋งŒ์„ ๋„ฃ์–ด์ค€๋‹ค.  

opp_list = [ ]
home_away_list = [ ]

for opp in raw['์ƒ๋Œ€']:
    if "@" in opp:
        home_away = '์›์ •'
        opp = opp.replace('@', '')
    else:
        home_away = 'ํ™ˆ'
    home_away_list.append(home_away)
    opp_list.append(opp)

raw['ํ™ˆ์–ด์›จ์ด'] = home_away_list
raw['์ƒ๋Œ€ํŒ€'] = opp_list
raw.head()

for๋ฌธ์„ ์ด์šฉํ•˜์—ฌ ํ™ˆ์–ด์›จ์ด์™€ ์ƒ๋Œ€ํŒ€์„ ๊ตฌ๋ถ„ํ•˜๊ณ  ,  ๋‘๊ฐœ์˜ ์ปฌ๋Ÿผ์„ ์ถ”๊ฐ€ํ•ด ์ฃผ์—ˆ๋‹ค. 

factors = ['ํƒ€์ˆ˜','์•ˆํƒ€','ํ™ˆ๋Ÿฐ', '๋ฃจํƒ€', 'ํƒ€์ ','๋ณผ๋„ท', '์‚ฌ๊ตฌ', 'ํฌ๋น„']
data = raw.pivot_table(index = ['ํŒ€','์ด๋ฆ„','์ƒ์ผ', '์ƒ๋Œ€ํŒ€'],
                      values = factors,
                       aggfunc = 'sum')
data.head()

์„ ์ˆ˜๋“ค์˜ ์ƒํƒœํŒ€ ๋ณ„ ์‹ค์ ์„ ์ง‘๊ณ„ํ•˜๊ธฐ ์œ„ํ•ด ํ”ผ๋ฒ— ํ…Œ์ด๋ธ”์„ ์ƒ์„ฑํ•œ๋‹ค. 

cond = data['ํƒ€์ˆ˜'] > 0 
data = data[ cond ]
data.head()

ํƒ€์ˆ˜๊ฐ€ ์—†๋Š” ์„ ์ˆ˜๋“ค์€ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์—์„œ ์ œ์™ธ์‹œํ‚จ๋‹ค. 

data = data.reset_index()
data.head()

reset_index๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ index๋ฅผ ๋ชจ๋‘ ์ปฌ๋Ÿผ์œผ๋กœ ๋ณ€๊ฒฝํ•ด ์ค€๋‹ค. 

 

 

 

 

- ํƒ€์ž vs ์ƒ๋Œ€ํŒ€ ๋ณ„ ์‹ค์  ๊ณ„์‚ฐ 

def cal_hit(df):
    '''
    - ํƒ€์œจ : ๊ณต์„ ์ณ์„œ ๋‚˜๊ฐ€๋Š” ๋น„์œจ --> ์•ˆํƒ€ / ํƒ€์ˆ˜
    - ์ถœ๋ฃจ์œจ: ์ง„๋ฃจํ•ด์„œ ๋‚˜๊ฐ€๋Š” ๋น„์œจ -->  (์•ˆํƒ€+๋ณผ๋„ท+๋ชธ์—๋งž๋Š”๋ณผ)/(ํƒ€์ˆ˜+๋ณผ๋„ท+๋ชธ์—๋งž๋Š”๋ณผ+ํฌ์ƒํ”Œ๋ผ์ด)
    - ์žฅํƒ€์œจ : ํƒ€์œจ์— ์ง„๋ฃจํ•œ ๋ฒ ์ด์Šค ๊ฐ€์ค‘์น˜ ์ถ”๊ฐ€ -->   ๋ฃจํƒ€ / ํƒ€์ˆ˜
    '''
    
    df['ํƒ€์œจ'] = df['์•ˆํƒ€'] / df['ํƒ€์ˆ˜']
    df['์ถœ๋ฃจ์œจ'] = (df['์•ˆํƒ€'] + df['๋ณผ๋„ท'] + df['์‚ฌ๊ตฌ']) / (df['ํƒ€์ˆ˜'] + df['์‚ฌ๊ตฌ'] + df['ํฌ๋น„'])
    df['์žฅํƒ€์œจ'] = df['๋ฃจํƒ€'] / df['ํƒ€์ˆ˜']
    df['OPS'] = df['์ถœ๋ฃจ์œจ'] + df['์žฅํƒ€์œจ']
    return df

์ด์ „ ๊ธ€์—์„œ ์‚ฌ์šฉํ–ˆ๋˜ ํ•จ์ˆ˜ , ์‹ค์  ๊ณ„์‚ฐ์„ ํ•˜๊ณ  ์ปฌ๋Ÿผ์„ ์ถ”๊ฐ€ํ•ด ์ค€๋‹ค. 

player_stats_opp = cal_hit(data)
player_stats_opp

 

 

 

 

- ๊ฒฐ๊ณผ ๋ณด๊ธฐ : DataFrame 

(1) ๋‘์‚ฐ์— ๊ฐ•ํ•œ ์„ ์ˆ˜ ? - ์ƒ์œ„ 10๋ช…

team = '๋‘์‚ฐ'
cond = (player_stats_opp['์ƒ๋Œ€ํŒ€'] == team) & (player_stats_opp['ํƒ€์ˆ˜'] > 10)
player_stats_opp[cond].sort_values(by = '์ถœ๋ฃจ์œจ', ascending = False).head(10)

์ƒ๋Œ€ํŒ€์ด ๋‘์‚ฐ์ด๋ฉด์„œ , ํƒ€์ˆ˜๊ฐ€ 20๋ณด๋‹ค ํฐ ์„ ์ˆ˜๋“ค์„ ๋ฝ‘์•„ ๋‚ด๊ณ , ์ถœ๋ฃจ์œจ์„ ๊ธฐ์ค€์œผ๋กœ ์ •๋ ฌํ•˜์—ฌ 10๋ช…์˜ ์„ ์ˆ˜๋ฅผ ์ถœ๋ ฅํ•ด ๋ณธ๋‹ค. 

 

๋”ฐ๋ผ์„œ ์ƒ๋Œ€ํŒ€์ด '๋‘์‚ฐ' ์ผ ๋•Œ ์ถœ๋ฃจ์œจ ์ƒ์œ„ 10๋ช…์˜ ์ด๋ฆ„์„ ๋ฝ‘์•„๋ณด๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค

player_stats_opp[cond].sort_values(by = '์ถœ๋ฃจ์œจ', ascending = False)['์ด๋ฆ„'].head(10)

 

(2) ๋กฏ๋ฐ์— ๊ฐ•ํ•œ ์„ ์ˆ˜ ? - ์ƒ์œ„ 10๋ช…

team = '๋กฏ๋ฐ'
cond = (player_stats_opp['์ƒ๋Œ€ํŒ€'] == team) & (player_stats_opp['ํƒ€์ˆ˜'] > 20)
player_stats_opp[cond].sort_values(by = '์ถœ๋ฃจ์œจ', ascending = False).head(10)

๋™์ผํ•˜๊ฒŒ ์‹œํ–‰ ํ•ด ์ค€๋‹ค. 

 

์ƒ๋Œ€ํŒ€์ด '๋‘์‚ฐ' ์ผ ๋•Œ ์ถœ๋ฃจ์œจ ์ƒ์œ„ 10๋ช…์˜ ์ด๋ฆ„์„ ๋ฝ‘์•„๋ณด๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค

player_stats_opp[cond].sort_values(by = '์ถœ๋ฃจ์œจ', ascending = False)['์ด๋ฆ„'].head(10)

 

 

(3) KBO ์ „์ฒด ํŒ€์„ ์ƒ๋Œ€๋กœ ํŒ€ ๋ณ„ ์ถœ๋ฃจ์œจ ์ƒ์œ„ 5์ธ ํƒ€์ž๋“ค ํ™•์ธํ•ด ๋ณด๊ธฐ 

hitter_df = pd.DataFrame()

for team in player_stats_opp['์ƒ๋Œ€ํŒ€'].unique():
    print(team)
    cond = (player_stats_opp['์ƒ๋Œ€ํŒ€'] == team) & (player_stats_opp['ํƒ€์ˆ˜'] > 20)
    df = player_stats_opp[cond].sort_values(by = '์ถœ๋ฃจ์œจ', ascending = False).head(5)
    hitter_df = hitter_df.append(df)

hitter_df

ํŠน์ • ํŒ€ ์ƒ๋Œ€ ์ถœ๋ฃจ์œจ Top5 ์•ˆ์— ๋“ค์–ด ์žˆ๋Š” ํƒ€์ž ๋ฆฌ์ŠคํŠธ (์ค‘๋ณต ์ œ๊ฑฐ) 

hitter_df['์ด๋ฆ„'].unique()

 

 

 

- ๊ฒฐ๊ณผ๋ณด๊ธฐ : Heatmap (์‹œ๊ฐํ™”) 

cond = player_stats_opp['์ด๋ฆ„'].isin(hitter_df['์ด๋ฆ„'].unique())
top_df = player_stats_opp[cond]
top_pivot = top_df.pivot_table(index = ['ํŒ€','์ด๋ฆ„'], values = '์ถœ๋ฃจ์œจ', columns = '์ƒ๋Œ€ํŒ€', aggfunc = 'sum')
top_pivot

์œ„์—์„œ ๋งŒ๋“ค์–ด ๋†“์•˜๋˜ ํŠน์ • ํŒ€์„ ์ƒ๋Œ€๋กœ ์ถœ๋ฃจ์œจ top5 ์•ˆ์— ๋“ค์—ˆ๋˜ ์ด๋ฆ„๋“ค๋งŒ player_stats_opp์—์„œ ๋ฝ‘์•„์˜จ ํ›„ , ํ•ด๋‹น ์„ ์ˆ˜๋“ค์˜ ์ƒ๋Œ€ํŒ€ ๋ณ„ ์ถœ๋ฃจ์œจ pivot_table์„ ์ƒ์„ฑํ•œ๋‹ค.  

import matplotlib
from matplotlib import font_manager, rc
import platform
import matplotlib.pyplot as plt
import seaborn as sns

# ์ด๋ฏธ์ง€ ํ•œ๊ธ€ ํ‘œ์‹œ ์„ค์ •
if platform.system() == 'Windows':  # ์œˆ๋„์šฐ์ธ ๊ฒฝ์šฐ ๋ง‘์€๊ณ ๋”•
    font_name = font_manager.FontProperties(fname="c:/Windows/Fonts/malgun.ttf").get_name()
    rc('font', family=font_name)
else:    # Mac ์ธ ๊ฒฝ์šฐ ์• ํ”Œ๊ณ ๋”•
    rc('font', family='AppleGothic')

#๊ทธ๋ž˜ํ”„์—์„œ ๋งˆ์ด๋„ˆ์Šค ๊ธฐํ˜ธ๊ฐ€ ํ‘œ์‹œ๋˜๋„๋ก ํ•˜๋Š” ์„ค์ •์ž…๋‹ˆ๋‹ค.
matplotlib.rcParams['axes.unicode_minus'] = False
fig, ax = plt.subplots( figsize=(15,15) )

sns.heatmap(data = top_pivot, 
            annot = True, fmt = '.3f', 
            cmap = 'Reds',
            center= 0.4   # ์ปฌ๋Ÿฌ๋งต ์ค‘๊ฐ„๊ฐ’ ์ง€์ •
           )

์ƒ‰์ด ์ง„ํ• ์ˆ˜๋ก ์ถœ๋ฃจ์œจ์ด ๋†’์Œ์„ ์˜๋ฏธํ•œ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด , NC์˜ ์–‘์˜์ง€ ์„ ์ˆ˜๋Š” ํ•ด๋‹น ์‹œ์ฆŒ์—์„œ KIA๋ฅผ ์ƒ๋Œ€๋กœ ํ•œ ๊ฒฝ๊ธฐ์—์„œ ์ถœ๋ฃจ์œจ์ด ๋†’์•˜์œผ๋ฉฐ , ํ•œํ™”์˜ ์ •๊ทผ์šฐ ์„ ์ˆ˜๋Š” LG์™€์˜ ๊ฒฝ๊ธฐ์—์„œ ๊ฐ•ํ–ˆ๋‹ค๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.  

sns.heatmap(data = top_pivot, 
            annot = True, fmt = '.3f', 
            cmap = 'Reds',
            center= 0.6   # ์ปฌ๋Ÿฌ๋งต ์ค‘๊ฐ„๊ฐ’ ์ง€์ •
           )

์ƒ๋Œ€์ ์ธ ํฌ๊ธฐ๋ฅผ ์‚ดํŽด๋ณด๊ณ ์ž ํ•  ๋•Œ๋Š” center๋ฅผ ๋ณ€๊ฒฝํ•ด ๊ฐ€๋ฉฐ ํ™•์ธํ•ด๋ณผ ์ˆ˜ ์žˆ๋‹ค. 

 

 

 

 

 

๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ 

 

 

< ์ด์ „ ๊ธ€ > 

https://silvercoding.tistory.com/54

 

[์‹œ๊ฐํ™” ๋ถ„์„ ํ”„๋กœ์ ํŠธ] 1. best baseball player ๋ถ„์„

๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ < ์ด์ „ ๊ธ€ > https://silvercoding.tistory.com/53 https://silvercoding.tistory.com/52 [python ์‹œ๊ฐํ™”] 1. seaborn ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ (distplot, relplot, jointplot, pairplot, boxplot, swarm..

silvercoding.tistory.com

 

 

 


 ๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ 
import pandas as pd
file  = './data/KBO_2019_player_gamestats.csv'
raw = pd.read_csv(file, encoding = 'cp949')
raw.head()

์ด์ „ ํฌ์ŠคํŒ…์—์„œ ์‚ฌ์šฉํ–ˆ๋˜ ๋ฐ์ดํ„ฐ๋ฅผ ๊ทธ๋Œ€๋กœ ๋ถˆ๋Ÿฌ์˜จ๋‹ค. 

 

 


 ์•ผ๊ตฌ์„ ์ˆ˜๊ฐ€ ๊ฐ•ํ•ด์ง€๋Š” ๊ณ„์ ˆ์ด ์žˆ์„๊นŒ ? 

idea : '์ผ์ž' ์ปฌ๋Ÿผ์—์„œ ์›” ์ •๋ณด๋งŒ ์ถ”์ถœํ•ด ๋‚ด์–ด ์›” ๋ณ„ ์ถœ๋ฃจ์œจ์„ ์‹œ๊ฐํ™” ํ•ด๋ณธ๋‹ค. 

 

1. ์›” ๋ณ„ ๊ธฐ๋ก ์ •๋ฆฌํ•˜๊ธฐ 

- '์ผ์ž' ์ปฌ๋Ÿผ์—์„œ ์›” ์ถ”์ถœํ•˜๊ธฐ 

month_list = []
for monthdate in raw['์ผ์ž']:
    month, date = monthdate.split('-')
    month_list.append(month)
raw['์›”'] = month_list
raw.head()

 

- ๋ถ„์„์— ํ•„์š”ํ•œ ์ปฌ๋Ÿผ ์„ ํƒ 

columns_select = ['ํŒ€', '์ด๋ฆ„', '์ƒ์ผ','์ผ์ž', '์ƒ๋Œ€','ํƒ€์ˆ˜','์•ˆํƒ€','ํ™ˆ๋Ÿฐ', '๋ฃจํƒ€', 'ํƒ€์ ','๋ณผ๋„ท', '์‚ฌ๊ตฌ', 'ํฌ๋น„', '์›”']

data = raw[columns_select]
data.head()

 

- ์›”๋ณ„ ์‹ค์  ์ง‘๊ณ„ 

data_player_month = data.pivot_table(index = ['ํŒ€','์ด๋ฆ„','์ƒ์ผ', '์›”'], 
                               values = ['ํƒ€์ˆ˜','์•ˆํƒ€','ํ™ˆ๋Ÿฐ','๋ฃจํƒ€','ํƒ€์ ','๋ณผ๋„ท','์‚ฌ๊ตฌ','ํฌ๋น„'], 
                              aggfunc = 'sum', fill_value = 0
                                )
data_player_month

ํŒ€, ์ด๋ฆ„, ์ƒ์ผ๋กœ ์„ ์ˆ˜๋ฅผ ๊ตฌ๋ถ„ํ•˜๊ณ , ์›”๋ณ„ ์‹ค์ ์„ ์ง‘๊ณ„ํ•˜๋Š” pivot table์„ ์ƒ์„ฑํ•œ๋‹ค. ๋น„์–ด์žˆ๋Š” ๊ณณ์€ 0์œผ๋กœ ๊ฐ’์„ ์ฑ„์›Œ ์ค€๋‹ค. 

data_player_month = data_player_month.reset_index()
data_player_month

reset_index๋กœ ๋ฉ€ํ‹ฐ์ธ๋ฑ์Šค๋ฅผ ์ปฌ๋Ÿผ์œผ๋กœ ๋ณ€๊ฒฝํ•ด ์ค€๋‹ค. 

 

 

- ํƒ€์œจ, ์ถœ๋ฃจ์œจ, ์žฅํƒ€์œจ, OPS ( ์ฃผ์š” ์‹ค์  ๊ณ„์‚ฐ ) ์ปฌ๋Ÿผ ์ถ”๊ฐ€ 

def cal_hit(df):
    '''
    - ํƒ€์œจ : ๊ณต์„ ์ณ์„œ ๋‚˜๊ฐ€๋Š” ๋น„์œจ --> ์•ˆํƒ€ / ํƒ€์ˆ˜
    - ์ถœ๋ฃจ์œจ: ์ง„๋ฃจํ•ด์„œ ๋‚˜๊ฐ€๋Š” ๋น„์œจ -->  (์•ˆํƒ€+๋ณผ๋„ท+๋ชธ์—๋งž๋Š”๋ณผ)/(ํƒ€์ˆ˜+๋ณผ๋„ท+๋ชธ์—๋งž๋Š”๋ณผ+ํฌ์ƒํ”Œ๋ผ์ด)
    - ์žฅํƒ€์œจ : ํƒ€์œจ์— ์ง„๋ฃจํ•œ ๋ฒ ์ด์Šค ๊ฐ€์ค‘์น˜ ์ถ”๊ฐ€ -->   ๋ฃจํƒ€ / ํƒ€์ˆ˜
    '''
    
    df['ํƒ€์œจ'] = df['์•ˆํƒ€'] / df['ํƒ€์ˆ˜']
    df['์ถœ๋ฃจ์œจ'] = (df['์•ˆํƒ€'] + df['๋ณผ๋„ท'] + df['์‚ฌ๊ตฌ']) / (df['ํƒ€์ˆ˜'] + df['๋ณผ๋„ท'] + df['์‚ฌ๊ตฌ'] + df['ํฌ๋น„'])
    df['์žฅํƒ€์œจ'] = df['๋ฃจํƒ€'] / df['ํƒ€์ˆ˜']
    df['OPS'] = df['์ถœ๋ฃจ์œจ'] + df['์žฅํƒ€์œจ']
    return df
player_month_stat = cal_hit(data_player_month)
player_month_stat.head()

 

player_month_stat.info()

๊ฒฐ์ธก๊ฐ’์ด ์žˆ๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. 

 

player_month_stat = player_month_stat.dropna()
player_month_stat.head()

dropna()๋ฅผ ์ด์šฉํ•˜์—ฌ ๊ฒฐ์ธก๊ฐ’์ด ์žˆ๋Š” row๋ฅผ ์ œ๊ฑฐํ•ด ์ค€๋‹ค. 

 

player_month_stat.info()

๊ฒฐ์ธก๊ฐ’์ด ๋ชจ๋‘ ์ฑ„์›Œ์กŒ๋‹ค! 

 

- ์›”๋ณ„ ์ถœ๋ฃจ์œจ 

month_pivot = player_month_stat.pivot_table(index = ['ํŒ€','์ด๋ฆ„','์ƒ์ผ'],
                             columns = '์›”',
                             values = '์ถœ๋ฃจ์œจ')
month_pivot = month_pivot.reset_index()
month_pivot

์œ„์™€ ๊ฐ™์ด ์„ ์ˆ˜๋“ค์˜ ์›” ๋ณ„ ์ถœ๋ฃจ์œจ ํ”ผ๋ฒ—ํ…Œ์ด๋ธ”์„ ์ƒ์„ฑํ•˜์˜€๋‹ค. 

 

 

2. ๊ฒฐ๊ณผ : KBO ์ถœ๋ฃจ์œจ ์ตœ๊ณ ํƒ€์ž๋“ค์˜ ์›” ๋ณ„ ์ถœ๋ฃจ์œจ ํ™•์ธํ•ด ๋ณด๊ธฐ 

- KBO ์ถœ๋ฃจ์œจ ์ตœ๊ณ ํƒ€์ž ๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ 

file = './data/player_stat.csv'
player_stat = pd.read_csv(file, encoding = 'cp949')
player_stat.head(20)

์ €๋ฒˆ ๊ธ€์˜ ์ฃผ์ œ์˜€๋˜ ์ถœ๋ฃจ์œจ์„ ๊ธฐ์ค€์œผ๋กœ ์ •๋ ฌํ•ด ๋†“์€ ๋ฐ์ดํ„ฐ์ด๋‹ค. ์ด ๋ฐ์ดํ„ฐ์™€ ์œ„์—์„œ ๋งŒ๋“ค์–ด ๋†“์€ ์›” ๋ณ„ ์ถœ๋ฃจ์œจ ๋ฐ์ดํ„ฐ๋ฅผ ํ•ฉ๋ณ‘ํ•˜์—ฌ ์ถœ๋ฃจ์œจ ์ƒ์œ„ 50๋ช…์˜ ์„ ์ˆ˜๋“ค์˜ ์›” ๋ณ„ ์ถœ๋ฃจ์œจ ๊ธฐ๋ก์„ ํ™•์ธํ•ด ๋ณด์ž. 

 

 

- ๋ฐ์ดํ„ฐ ํ•ฉ๋ณ‘ 

df = pd.merge(player_stat, month_pivot, how = 'left', on = ['ํŒ€','์ด๋ฆ„','์ƒ์ผ'])
# left_on = ['ํŒ€','์ด๋ฆ„','์ƒ์ผ'], right_on = ['ํŒ€','์ด๋ฆ„','์ƒ์ผ']
df.head(10)

df_sort = df.sort_values(by = '์ถœ๋ฃจ์œจ', ascending = False).head(50)
df_sort

๋‹ค์‹œํ•œ๋ฒˆ ์ถœ๋ฃจ์œจ์„ ๊ธฐ์ค€์œผ๋กœ ์ •๋ ฌ ํ•˜๊ณ  , 50๊ฐœ์˜ ๋ฐ์ดํ„ฐ๋ฅผ df_sort ์— ๋„ฃ์–ด ์ค€๋‹ค. 

df_selected = df_sort[['ํŒ€', '์ด๋ฆ„', '์ถœ๋ฃจ์œจ', '03', '04', '05', '06', '07', '08', '09', '10']]
df_selected

๊ทธ๋ฆฌ๊ณ  df_sort์—์„œ ์ถœ๋ฃจ์œจ๊ณผ ๊ด€๋ จ๋œ ์ปฌ๋Ÿผ๋งŒ ๋ฝ‘์•„์„œ df_selectied ๋ณ€์ˆ˜๋ฅผ ์„ ์–ธํ•œ๋‹ค. 

์ˆซ์ž๋กœ ๋‚˜์™€์žˆ์œผ๋‹ˆ 50๋ช…์˜ ์„ ์ˆ˜๋“ค์˜ ์›” ๋ณ„ ๊ธฐ๋ก์ด ์–ด๋–ค์ง€ ๊ฐ€๋Š ์ด ์•ˆ ๊ฐ„๋‹ค. ๋”ฐ๋ผ์„œ ์‹œ๊ฐํ™”๋ฅผ ํ†ตํ•˜์—ฌ ํ™•์ธํ•˜๋„๋ก ํ•œ๋‹ค. 

 

 

 

- ์‹œ๊ฐํ™”๋กœ ํ•œ๋ˆˆ์— ๋ณด๊ธฐ 

df_selected = df_selected.set_index(['ํŒ€','์ด๋ฆ„'])
df_selected

ํžˆํŠธ๋งต์„ ์‚ฌ์šฉํ•˜์—ฌ ์‹œ๊ฐํ™”๋ฅผ ํ•  ๊ฒƒ์ด๋‹ค. ์ด ๋•Œ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์˜ values ๋ถ€๋ถ„์€ ๋ชจ๋‘ ์ˆซ์žํ˜• ๋ฐ์ดํ„ฐ๋กœ๋งŒ ๋˜์–ด ์žˆ์–ด์•ผ ํ•œ๋‹ค. ๋”ฐ๋ผ์„œ ํŒ€, ์ด๋ฆ„ ์ปฌ๋Ÿผ์€ index๋กœ ๋ณ€๊ฒฝํ•ด ์ค€๋‹ค. 

import matplotlib
from matplotlib import font_manager, rc
import platform
import matplotlib.pyplot as plt
import seaborn as sns

# ์ด๋ฏธ์ง€ ํ•œ๊ธ€ ํ‘œ์‹œ ์„ค์ •
if platform.system() == 'Windows':  # ์œˆ๋„์šฐ์ธ ๊ฒฝ์šฐ ๋ง‘์€๊ณ ๋”•
    font_name = font_manager.FontProperties(fname="c:/Windows/Fonts/malgun.ttf").get_name()
    rc('font', family=font_name)
else:    # Mac ์ธ ๊ฒฝ์šฐ ์• ํ”Œ๊ณ ๋”•
    rc('font', family='AppleGothic')

#๊ทธ๋ž˜ํ”„์—์„œ ๋งˆ์ด๋„ˆ์Šค ๊ธฐํ˜ธ๊ฐ€ ํ‘œ์‹œ๋˜๋„๋ก ํ•˜๋Š” ์„ค์ •์ž…๋‹ˆ๋‹ค.
matplotlib.rcParams['axes.unicode_minus'] = False
sns.heatmap(df_selected)

์กฐ๊ธˆ ๋” ์„ค์ •์„ ์ฃผ์–ด ๋ณด๊ธฐ ์‰ฝ๊ฒŒ ๋งŒ๋“ค์–ด ๋ณด์ž. 

fig, ax = plt.subplots( figsize=(15,15) )
sns.heatmap(data = df_selected, 
            annot = True, fmt = '.3f', 
            cmap = 'Reds'
           )

์ง„ํ•œ ๋นจ๊ฐ„์ƒ‰์ผ ์ˆ˜๋ก ๋ณด๋‹ค ๋†’์€ ์ถœ๋ฃจ์œจ์„ ๊ฐ€์ง„๋‹ค. ์•„์ง์€ ์›” ๋ณ„๋กœ ๋” ์ž˜ํ•œ ๊ฑด์ง€ ๋ชปํ•œ ๊ฑด์ง€ ํ•œ ๋ˆˆ์— ๋ณด๊ธฐ๋Š” ์–ด๋ ต๋‹ค. ๋”ฐ๋ผ์„œ ์‹œ์ฆŒ ์ „์ฒด ์ถœ๋ฃจ์œจ๊ณผ์˜ ์ฐจ์ด๋ฅผ ๊ตฌํ•˜์—ฌ  ํ™•์ธํ•ด ๋ณด์ž. 

for col in df_selected.columns[1:]:
    df_selected[col] = df_selected[col] - df_selected['์ถœ๋ฃจ์œจ'] 
df_selected['์ถœ๋ฃจ์œจ'] = 0.0

์‹œ์ฆŒ ์ถœ๋ฃจ์œจ๊ณผ ์›”๋ณ„ ์ถœ๋ฃจ์œจ์˜ ์ฐจ์ด๋ฅผ ๋ชจ๋‘ ๊ตฌํ•œ ํ›„, ์‹œ์ฆŒ ์ถœ๋ฃจ์œจ์˜ ์˜ํ–ฅ์„ ๋ฐ›์ง€ ์•Š๊ฒŒ ํ•˜๊ธฐ ์œ„ํ•ด ๋ชจ๋‘ 0์œผ๋กœ ๋ฐ”๊พธ์–ด ์ค€๋‹ค. 

fig, ax = plt.subplots( figsize=(10,10) )

sns.heatmap(data = df_selected.head(50), 
            annot = True, fmt = '.3f', 
            cmap = 'RdBu_r'
           )

๋นจ๊ฐ„์ƒ‰์ด ์ง™์œผ๋ฉด ์‹œ์ฆŒ ์ถœ๋ฃจ์œจ ๋ณด๋‹ค ๋†’์€ ์ถœ๋ฃจ์œจ์„ ๊ฐ–๊ณ  ์žˆ์œผ๋ฉฐ, ํŒŒ๋ž€์ƒ‰์ด ์ง™๋‹ค๋ฉด , ์‹œ์ฆŒ ์ถœ๋ฃจ์œจ ๋ณด๋‹ค ๋” ๋‚ฎ์€ ์ถœ๋ฃจ์œจ์„ ๊ฐ–๊ณ  ์žˆ์Œ์„ ์˜๋ฏธ ํ•œ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด ํ•œํ™” ์ตœ์žฌํ›ˆ ์„ ์ˆ˜๋Š” 3์›” ์ถœ๋ฃจ์œจ์ด ์šฐ์„ธํ–ˆ๊ณ , KIA ์œ ๋ฏผ์ƒ ์„ ์ˆ˜๋Š” 6์›”์— ๋น„ํ•ด 7์›”์— ์—„์ฒญ๋‚œ ์ถœ๋ฃจ์œจ ์ƒ์Šน์„ ๋ณด์ธ๋‹ค. ๊ณ„์ ˆ์— ๋”ฐ๋ฅธ ์˜ํ–ฅ์ด ์žˆ๋Š”์ง€ ๊ถ๊ธˆํ–ˆ๋Š”๋ฐ , (๋”์šด ์—ฌ๋ฆ„์ฒ ์—๋Š” ๊ธฐ๋ก์ด ์ค„์–ด๋“œ๋Š” ๋“ฑ ) ์„ ์ˆ˜ ๋งˆ๋‹ค ๋ชจ๋‘ ๋‹ค๋ฅด๊ณ , ํ™•์‹คํžˆ ํŠน์ • ๋‹ฌ์— ์ถœ๋ฃจ์œจ์ด ๋†’์•„์ง€๋Š” ์„ ์ˆ˜๋“ค์ด ์žˆ๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ์—ˆ๋‹ค. 

๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ 

 

 

< ์ด์ „ ๊ธ€ > 

https://silvercoding.tistory.com/53

 

[python ์‹œ๊ฐํ™”] 2. ์„œ์šธ์‹œ ๋Œ€ํ”ผ์†Œ ํ˜„ํ™ฉ ์ง€๋„ ๋งŒ๋“ค๊ธฐ , ์ง€๋„ ์‹œ๊ฐํ™” ( folium ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ )

๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ < ์ด์ „ ๊ธ€ > https://silvercoding.tistory.com/52 [python ์‹œ๊ฐํ™”] 1. seaborn ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ (distplot, relplot, jointplot, pairplot, boxplot, swarmplot, heatmap) ๋Ÿฌ๋‹์Šคํ‘ผ ์ˆ˜์—… ์ •๋ฆฌ..

silvercoding.tistory.com

 

 


 ์‚ฌ์ „ ์ง€์‹ 
  • ํƒ€์œจ : ํƒ€๊ฒฉ์— ์„ฑ๊ณตํ•ด ์‚ด์•„๋‚˜๊ฐ€๋Š” ์ •๋„ 

= ํƒ€๊ฒฉ ์„ฑ๊ณต ํšŸ์ˆ˜ / ํƒ€๊ฒฉ ๊ธฐํšŒ ์ˆ˜ 

= ์•ˆํƒ€ ์ˆ˜ / ํƒ€์ˆ˜ 

  • ์ถœ๋ฃจ์œจ : ์‚ด์•„์„œ ๋‚˜๊ฐ€๋Š” ์ •๋„ 

= ์ง„๋ฃจ ์„ฑ๊ณต ํšŸ์ˆ˜ / ์ง„๋ฃจ ๊ธฐํšŒ ์ˆ˜ 

= (์•ˆํƒ€+๋ณผ๋„ท+๋ชธ์— ๋งž๋Š” ๋ณผ) / (ํƒ€์ˆ˜+๋ณผ๋„ท+๋ชธ์— ๋งž๋Š” ๋ณผ+ํฌ์ƒ ํ”Œ๋ผ์ด) 

  • ์žฅํƒ€์œจ : ํƒ€๊ฒฉ์— ์„ฑ๊ณตํ•ด ๋ฉ€๋ฆฌ ์‚ด์•„๋‚˜๊ฐ€๋Š” ์ •๋„ 

= ์ง„๋ฃจํ•œ ๋ฒ ์ด์Šค ์ˆ˜ / ํƒ€๊ฒฉ ๊ธฐํšŒ ์ˆ˜ 

= ๋ฃจํƒ€ ์ˆ˜ / ํƒ€์ˆ˜ 

(ํƒ€์œจ์— ๊ฑฐ๋ฆฌ ๊ฐœ๋… ์ถ”๊ฐ€ : 2๋ฃจํƒ€ = 1๋ฃจํƒ€ x 2) 

  • OPS  : ์‚ด์•„์„œ ๋ฉ€๋ฆฌ ๋‚˜๊ฐ€๋Š” ์ •๋„   

= ์ถœ๋ฃจ์œจ + ์žฅํƒ€์œจ 

 

 

 

*** ๋ณธ ๊ธฐ์ดˆ ์‹œ๊ฐํ™” ํ”„๋กœ์ ํŠธ์—์„œ๋Š” ์ถœ๋ฃจ์œจ - ์žฅํƒ€์œจ - ops - ํƒ€์œจ ์„ ๊ธฐ์ค€์œผ๋กœ best player ๋ฅผ ๋ถ„์„ํ•œ๋‹ค. ***

 

 

 

 


 ๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ & ์‚ดํŽด๋ณด๊ธฐ 
import pandas as pd
file  = './data/KBO_2019_player_gamestats.csv'
raw = pd.read_csv(file, encoding = 'cp949')

๋ฐ์ดํ„ฐ : KBO 2019 ์‹œ์ฆŒ ํƒ€์ž ๊ธฐ๋ก์ง€ ๋ฐ์ดํ„ฐ (๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ œ๊ณต) 

raw.head()

ํƒ€์ž & ๊ฒŒ์ž„ ๋ณ„๋กœ ๊ธฐ๋ก๋˜์–ด ์žˆ๋‹ค. 

raw.info()

raw.columns

์‚ฌ์šฉํ•  ์ปฌ๋Ÿผ์„ ๋ฝ‘๊ธฐ ์œ„ํ•ด ์ „์ฒด ์ปฌ๋Ÿผ์„ ์‚ดํŽด ๋ณธ๋‹ค. 

columns_select = ['ํŒ€', '์ด๋ฆ„', '์ƒ์ผ','์ผ์ž', '์ƒ๋Œ€','ํƒ€์ˆ˜','์•ˆํƒ€','ํ™ˆ๋Ÿฐ', '๋ฃจํƒ€', 'ํƒ€์ ','๋ณผ๋„ท', '์‚ฌ๊ตฌ', 'ํฌ๋น„']
data = raw[columns_select]
data.head()

์›ํ•˜๋Š” ์ปฌ๋Ÿผ๋งŒ ๋ฝ‘์•„์„œ DataFrame์„ ์ƒ์„ฑํ•˜๊ณ , data์— ๋„ฃ์–ด์ฃผ์—ˆ๋‹ค. 

 

 

 

 

 


 KBO best player ๋ถ„์„ํ•˜๊ธฐ 

- ์„ ์ˆ˜๋ณ„ ๊ธฐ๋ก ์ง‘๊ณ„ 

data_player = data.pivot_table(index = ['ํŒ€','์ด๋ฆ„','์ƒ์ผ'], 
                               values = ['ํƒ€์ˆ˜','์•ˆํƒ€','ํ™ˆ๋Ÿฐ','๋ฃจํƒ€','ํƒ€์ ','๋ณผ๋„ท','์‚ฌ๊ตฌ','ํฌ๋น„'], 
                              aggfunc = 'sum')

data_player

pivot_table ์„ ์‚ฌ์šฉํ•˜์—ฌ ํŒ€, ์ด๋ฆ„ , ์ƒ์ผ์„ ๊ธฐ์ค€์œผ๋กœ ํƒ€์ˆ˜, ์•ˆํƒ€, ํ™ˆ๋Ÿฐ, ๋ฃจํƒ€, ํƒ€์ , ๋ณผ๋„ท, ์‚ฌ๊ตฌ, ํฌ๋น„์˜ ์ด ํ•ฉ๊ณ„๋ฅผ ์ง‘๊ณ„ํ•œ๋‹ค. 

data_player['ํƒ€์ˆ˜'].hist()

ํŒ๋‹ค์Šค์—์„œ ๊ธฐ๋ณธ์œผ๋กœ ๋‚ด์žฅ๋˜์–ด ์žˆ๋Š” ์‹œ๋ฆฌ์ฆˆ.hist()๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํƒ€์ˆ˜์˜ ๋ถ„ํฌ ์‚ดํŽด๋ณด๊ธฐ . ( ํƒ€์ˆ˜๊ฐ€ ์ ์€ ์„ ์ˆ˜๋ฅผ ์ œ์™ธํ•˜๊ธฐ ์œ„ํ•ด ์–ด๋Š ์ •๋„๊ฐ€ ์ ์€์ง€ ํŒ๋‹จํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ ) ๋ถ„ํฌ๋ฅผ ์‚ดํŽด๋ณด๋‹ˆ, ํƒ€์ˆ˜๊ฐ€ 50 ์ดํ•˜์ธ ์„ ์ˆ˜๋ฅผ ์ œ์™ธํ•˜๋ฉด ์ ์ ˆํ•  ๊ฒƒ ๊ฐ™๋‹ค. 

cond = data_player['ํƒ€์ˆ˜'] > 50
data_player = data_player[cond].reset_index()    # ๋‹ค์ค‘ ์ธ๋ฑ์Šค --> ์ปฌ๋Ÿผ์œผ๋กœ ๋ณ€๊ฒฝํ•˜๊ธฐ
data_player

ํƒ€์ˆ˜๊ฐ€ 50 ์ดˆ๊ณผ์ธ ์„ ์ˆ˜๋“ค๋งŒ ์ถ”๋ฆฌ๊ณ , ์ˆ˜์›”ํ•œ ์ปจํŠธ๋กค์„ ์œ„ํ•ด reset_index()๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋‹ค์ค‘ ์ธ๋ฑ์Šค๋ฅผ ์ปฌ๋Ÿผ์œผ๋กœ ๋ณ€๊ฒฝํ•ด์ค€๋‹ค. 

def cal_hit(df):
    '''
    - ํƒ€์œจ : ํƒ€๊ฒฉ์— ์„ฑ๊ณตํ•ด์„œ ์ง„๋ฃจํ•˜๋Š” ๋น„์œจ --> ์•ˆํƒ€ / ํƒ€์ˆ˜
    - ์ถœ๋ฃจ์œจ: ์‚ด์•„์„œ ์ง„๋ฃจํ•˜๋Š” ๋น„์œจ -->  (์•ˆํƒ€+๋ณผ๋„ท+๋ชธ์—๋งž๋Š”๋ณผ)/(ํƒ€์ˆ˜+๋ณผ๋„ท+๋ชธ์—๋งž๋Š”๋ณผ+ํฌ์ƒํ”Œ๋ผ์ด)
    - ์žฅํƒ€์œจ : ํƒ€์œจ์— ์ง„๋ฃจํ•œ ๋ฒ ์ด์Šค ๊ฐ€์ค‘์น˜ ์ถ”๊ฐ€ -->   ๋ฃจํƒ€ / ํƒ€์ˆ˜
    - OPS : ์ถœ๋ฃจ์œจ + ์žฅํƒ€์œจ 
    '''
    
    df['ํƒ€์œจ'] = df['์•ˆํƒ€'] / df['ํƒ€์ˆ˜']
    df['์ถœ๋ฃจ์œจ'] = (df['์•ˆํƒ€'] + df['๋ณผ๋„ท'] + df['์‚ฌ๊ตฌ']) / (df['ํƒ€์ˆ˜'] + df['๋ณผ๋„ท'] + df['์‚ฌ๊ตฌ'] + df['ํฌ๋น„'])
    df['์žฅํƒ€์œจ'] = df['๋ฃจํƒ€'] / df['ํƒ€์ˆ˜']
    df['OPS'] = df['์ถœ๋ฃจ์œจ'] + df['์žฅํƒ€์œจ']
    return df

๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์„ ๋„ฃ์–ด ์ฃผ๋ฉด best player ๋ฅผ ์„ ์ •ํ•˜๋Š” ๊ธฐ์ค€์ธ ํƒ€์œจ, ์ถœ๋ฃจ์œจ, ์žฅํƒ€์œจ, OPS ์ปฌ๋Ÿผ์„ ๊ณ„์‚ฐํ•˜์—ฌ ์ƒ์„ฑํ•˜๋Š” ํ•จ์ˆ˜์ด๋‹ค. 

player_stat = cal_hit(data_player)
player_stat

 

- ์ถœ๋ฃจ์œจ -> ์žฅํƒ€์œจ -> OPS -> ํƒ€์œจ ์ˆœ์„œ๋Œ€๋กœ ์ •๋ ฌํ•˜์—ฌ best 10 player ๋ฝ‘์•„๋ณด๊ธฐ 

player_stat = player_stat.sort_values(by = ['์ถœ๋ฃจ์œจ','์žฅํƒ€์œจ','OPS', 'ํƒ€์œจ'], ascending = False)
player_stat = player_stat.reset_index(drop = True)
player_stat.head(10)

์ธ๋ฑ์Šค๊ฐ€ ๋’ค์ฃฝ๋ฐ•์ฃฝ์œผ๋กœ ์ •๋ ฌ๋˜๋ฏ€๋กœ, reset_index()๋ฅผ ์ด์šฉํ•˜์—ฌ ์ธ๋ฑ์Šค๋ฅผ ์ •๋ ฌํ•ด ์ค€๋‹ค. 

 

๊ฒฐ๋ก ์ ์œผ๋กœ ์ถœ๋ฃจ์œจ์„ ๊ธฐ์ค€์œผ๋กœ ํ•œ KBO 2019 ์‹œ์ฆŒ best player๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

player_stat['์ด๋ฆ„'][:10]

 

 

 

 


 ํŒ€๋ณ„ ์„ ์ˆ˜๋“ค์˜ ์ถœ๋ฃจ์œจ ๋ถ„ํฌ ์‚ดํŽด๋ณด๊ธฐ 
import matplotlib
from matplotlib import font_manager, rc
import platform
import matplotlib.pyplot as plt
import seaborn as sns

# ์ด๋ฏธ์ง€ ํ•œ๊ธ€ ํ‘œ์‹œ ์„ค์ •
if platform.system() == 'Windows':  # ์œˆ๋„์šฐ์ธ ๊ฒฝ์šฐ ๋ง‘์€๊ณ ๋”•
    font_name = font_manager.FontProperties(fname="c:/Windows/Fonts/malgun.ttf").get_name()
    rc('font', family=font_name)
else:    # Mac ์ธ ๊ฒฝ์šฐ ์• ํ”Œ๊ณ ๋”•
    rc('font', family='AppleGothic')

#๊ทธ๋ž˜ํ”„์—์„œ ๋งˆ์ด๋„ˆ์Šค ๊ธฐํ˜ธ๊ฐ€ ํ‘œ์‹œ๋˜๋„๋ก ํ•˜๋Š” ์„ค์ •์ž…๋‹ˆ๋‹ค.
matplotlib.rcParams['axes.unicode_minus'] = False

์šฐ์„  ๊ทธ๋ž˜ํ”„๋ฅผ ๊ทธ๋ ธ์„ ๋•Œ ํ•œ๊ธ€์ด ๊นจ์ง€๋Š” ๊ฒƒ์„ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด ํฐํŠธ์„ค์ •์„ ํ•˜๊ณ  , ๋งˆ์ด๋„ˆ์Šค ๊ธฐํ˜ธ ํ‘œ์‹œ ์—ฌ๋ถ€๋ฅผ ์„ค์ •ํ•œ๋‹ค. 

 

 

- ํŒ€ ๋ณ„ ์ถœ๋ฃจ์œจ ๋ถ„ํฌ ( boxplot & swarmplot ) 

sns.boxplot(data = player_stat, x = 'ํŒ€', y = '์ถœ๋ฃจ์œจ')

๊ฐœ์ˆ˜๋„ ์ง๊ด€์ ์œผ๋กœ ๋ณด๊ธฐ ์œ„ํ•ด swarmplot์„ ์ถ”๊ฐ€ ํ•œ๋‹ค. 

sns.swarmplot(data = player_stat, x = 'ํŒ€', y = '์ถœ๋ฃจ์œจ')
sns.boxplot(data = player_stat, x = 'ํŒ€', y = '์ถœ๋ฃจ์œจ')

์ƒ‰๊น”์ด ๊ฒน์น˜๊ธฐ ๋•Œ๋ฌธ์— boxplot์— ์—ฌ๋Ÿฌ ์˜ต์…˜์„ ์ฃผ์–ด ๋ณด๊ธฐ ํŽธํ•˜๊ฒŒ ๋งŒ๋“ค์–ด ์ค€๋‹ค. 

sns.swarmplot(data = player_stat, x = 'ํŒ€', y = '์ถœ๋ฃจ์œจ')
sns.boxplot(data = player_stat, x = 'ํŒ€', y = '์ถœ๋ฃจ์œจ',
            showcaps=False,             # ๋ฐ•์Šค ์ƒ๋‹จ ๊ฐ€๋กœ๋ผ์ธ ๋ณด์ด์ง€ ์•Š๊ธฐ
            whiskerprops={'linewidth':0}, # ๋ฐ•์Šค ์ƒ๋‹จ ์„ธ๋กœ ๋ผ์ธ ๋ณด์ด์ง€ ์•Š๊ธฐ 
            showfliers=False,           # ๋ฐ•์Šค ๋ฒ”์œ„ ๋ฒ—์–ด๋‚œ ์•„์›ƒ๋ผ์ด์–ด ํ‘œ์‹œํ•˜์ง€ ์•Š๊ธฐ
            boxprops={'facecolor':'None'}, # ๋ฐ•์Šค ์ƒ‰์ƒ ์ง€์šฐ๊ธฐ
           )

์ด๋ ‡๊ฒŒ ํŒ€๋ณ„ ์„ ์ˆ˜๋“ค์˜ ์ถœ๋ฃจ์œจ ๋ถ„ํฌ๋ฅผ ํ™•์ธํ•ด๋ณผ ์ˆ˜ ์žˆ๋‹ค. NC ํŒ€์ด ๊ฐ€์žฅ ๋†’์€ ์ถœ๋ฃจ์œจ์„ ๊ฐ€์ง„ ์„ ์ˆ˜๋ฅผ ๋ณด์œ ํ•˜๊ณ  ์žˆ๊ณ , ์ถœ๋ฃจ์œจ์ด ๊ฐ€์žฅ ๋งŽ์ด ๋ชฐ๋ ค ์žˆ๋Š” ํŒ€์€ LGํŒ€์ธ ๊ฒƒ์œผ๋กœ ๋ณด์—ฌ์ง€๋ฉฐ, ์ถœ๋ฃจ์œจ์˜ ์ค‘์œ„์ˆ˜๊ฐ€ ๊ฐ€์žฅ ๋†’์€ ํŒ€์€ ๋‘์‚ฐ์ธ ๊ฒƒ ๋“ฑ์„ ํ•œ ๋ˆˆ์— ์•Œ ์ˆ˜ ์žˆ๋‹ค.    

 

 

 

 


 ๋งˆ๋ฌด๋ฆฌ : csv ํŒŒ์ผ๋กœ ์ €์žฅํ•˜๊ธฐ 
file = './data/player_stat.csv'
player_stat.to_csv(file, encoding = 'cp949', index = False)

๋ณธ ํฌ์ŠคํŒ…์—์„œ ๋ถ„์„์— ์‚ฌ์šฉํ•œ player_stat์„ csv ํŒŒ์ผ๋กœ ์ €์žฅํ•˜๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค. 


 

 

๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ 

 

 

< ์ด์ „ ๊ธ€ > 

https://silvercoding.tistory.com/52

 

[python ์‹œ๊ฐํ™”] 1. seaborn ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ (distplot, relplot, jointplot, pairplot, boxplot, swarmplot, heatmap)

๋Ÿฌ๋‹์Šคํ‘ผ ์ˆ˜์—… ์ •๋ฆฌ ์ˆ˜์น˜ํ˜• x ์ˆ˜์น˜ํ˜• : scatterplot, lmplot, jointplot ์ˆ˜์น˜ํ˜• x ์นดํ…Œ๊ณ ๋ฆฌํ˜• : boxplot, violinplot, barplot, heatmap ์ˆ˜์น˜ํ˜• x ์œ„์น˜ํ˜• : folium ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ํ™œ์šฉ ๋ณธ ํฌ์ŠคํŒ…์—์„œ๋Š” seaborn ๋ผ์ด..

silvercoding.tistory.com

 

 

 

 

1. folium ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ์‚ฌ์šฉํ•ด๋ณด๊ธฐ 


 folium ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ 
!pip install folium

folium ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ์„ค์น˜ํ•˜๊ธฐ 

import folium

folium ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ import ํ•ด์ค€๋‹ค. 


 folium ์ง€๋„ ์‹œ๊ฐํ™” ์‚ฌ์šฉ ๋ฒ• 

์ง€๋„ ์‹œ๊ฐํ™”

  • ์ง€๋„์ƒ์„ฑํ•˜๊ธฐ
    • m = folium.Map(location = [์œ„๋„, ๊ฒฝ๋„], zoom_start = ํ™•๋Œ€์ •๋„)
  • ์ •๋ณด ์ถ”๊ฐ€ํ•˜๊ธฐ
    • ๋งˆ์ปค ์ถ”๊ฐ€ํ•˜๊ธฐ
      • folium.Marker([์œ„๋„, ๊ฒฝ๋„]).add_to(m)
    • ์› ์ถ”๊ฐ€ํ•˜๊ธฐ
      • folium.CircleMarker([์œ„๋„, ๊ฒฝ๋„], radius= ์›ํฌ๊ธฐ).add_to(m)
    • ์ถ”๊ฐ€์˜ต์…˜:
      • tooltip="๋งˆ์šฐ์Šค ์˜ฌ๋ฆฌ๋ฉด ๋ณด์—ฌ์งˆ ์ •๋ณด"
      • popup="ํด๋ฆญํ•˜๋ฉด ๋ณด์—ฌ์งˆ ์ •๋ณด"
    • ๊ธฐํƒ€) ClickForMarker('์ฒดํฌ').add_to(m) ์ง€๋„์—์„œ ํด๋ฆญํ•  ๊ฒฝ์šฐ ๋งˆ์ปค ์ถ”๊ฐ€ํ•˜๊ธฐ

 ์„œ์šธ์—ญ ์ง€๋„ ์‹œ๊ฐํ™”
m = folium.Map(location=[37.5536067,126.9674308],  
               zoom_start=12)
m

 

- ๋งˆ์ปค ์ถ”๊ฐ€ 

folium.Marker([37.5536067,126.9674308],    # ์„œ์šธ์—ญ์œ„์น˜
              tooltip="์„œ์šธ์—ญ(๋งˆ์šฐ์Šค์˜ฌ๋ฆฌ๋ฉด๋ณด์—ฌ์ง)",
              popup="์„œ์šธ์—ญ(ํด๋ฆญํ•˜๋ฉด ๋ณด์—ฌ์ง)",
              ).add_to(m)
m

tooltip๊ณผ popup์„ ์‚ฌ์šฉํ•˜์—ฌ ๋งˆ์šฐ์Šค๋ฅผ ์˜ฌ๋ฆฌ๊ฑฐ๋‚˜ , ํด๋ฆญํ•˜๋ฉด ๋ฉ”์‹œ์ง€๋ฅผ ๋„์šธ ์ˆ˜ ์žˆ๋‹ค. 

 

- ์„œํด๋งˆ์ปค ์ถ”๊ฐ€ 

folium.CircleMarker([37.5536067,126.9674308],
                    radius=20,
                    tooltip = '๋งˆ์šฐ์Šค์˜ฌ๋ฆด๊ฒฝ์šฐ'
               ).add_to(m)
m

 

- ๋ฏธ๋‹ˆ๋งต ์ถ”๊ฐ€ MiniMap 

from folium.plugins import MiniMap

# ์ง€๋„ ์ƒ์„ฑํ•˜๊ธฐ
m = folium.Map(location=[37.5536067,126.9674308],   # ๊ธฐ์ค€์ขŒํ‘œ: ์„œ์šธ์—ญ
               zoom_start=12)

# ๋ฏธ๋‹ˆ๋งต ์ถ”๊ฐ€ํ•˜๊ธฐ
minimap = MiniMap() 
minimap.add_to(m)


# ๋งˆ์ปค ์ถ”๊ฐ€ํ•˜๊ธฐ
folium.Marker([37.5536067,126.9674308],    # ์„œ์šธ์—ญ์œ„์น˜
              tooltip="์„œ์šธ์—ญ(๋งˆ์šฐ์Šค์˜ฌ๋ฆฌ๋ฉด๋ณด์—ฌ์ง)",
              popup="์„œ์šธ์—ญ(ํด๋ฆญํ•˜๋ฉด ๋ณด์—ฌ์ง)",
              ).add_to(m)
m

๋ฏธ๋‹ˆ๋งต์„ ์ƒ์„ฑํ•˜๊ณ , add_to๋กœ ์ถ”๊ฐ€ํ•ด ์ค€๋‹ค. 

 

 

 - ํด๋ฆญํ•ด์„œ ๋งˆํฌ ์ถ”๊ฐ€ํ•˜๊ธฐ 

# ์ง€๋„ ์ƒ์„ฑํ•˜๊ธฐ
m = folium.Map(location=[37.5536067,126.9674308],   # ๊ธฐ์ค€์ขŒํ‘œ: ์„œ์šธ์—ญ
               zoom_start=12)

# ์จํด๋งˆ์ปค ์ถ”๊ฐ€ํ•˜๊ธฐ
folium.CircleMarker([37.5536067,126.9674308],
                    radius=20,
                    tooltip = '๋งˆ์šฐ์Šค์˜ฌ๋ฆด๊ฒฝ์šฐ'
               ).add_to(m)
folium.ClickForMarker('์ฒดํฌ์ถ”๊ฐ€').add_to(m)    
m

๋งˆ์šฐ์Šค๋กœ ํด๋ฆญํ•ด์„œ ์ง์ ‘ ๋งˆ์ปค๋ฅผ ์ถ”๊ฐ€ํ•ด์ค„ ์ˆ˜๋„ ์žˆ๋‹ค. (ClickForMarker ์‚ฌ์šฉ) 


 

 

 

 

2. ์‘์šฉ : ์„œ์šธ ๋Œ€ํ”ผ์†Œ ํ˜„ํ™ฉ ์ง€๋„ ๋งŒ๋“ค๊ธฐ 


< ๋ฐ์ดํ„ฐ ์ถœ์ฒ˜ > 

http://data.seoul.go.kr/dataList/OA-2189/S/1/datasetView.do

 

์„œ์šธ ์—ด๋ฆฐ๋ฐ์ดํ„ฐ๊ด‘์žฅ

๋ชจ๋“  ์„œ์šธ์‹œ๋ฏผ์„ ์œ„ํ•œ ๊ณต๊ณต๋ฐ์ดํ„ฐ ์—ด๋ฆฐ๋ฐ์ดํ„ฐ๊ด‘์žฅ์—์„œ ์„œ์šธ์‹œ์™€ ์—ฐ๊ณ„ ๊ธฐ๊ด€์ด ๊ณต๊ฐœํ•œ ๊ณต๊ณต๋ฐ์ดํ„ฐ๋ฅผ ํ™•์ธํ•˜์‹ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์„œ์šธ์‹œ์™€ ๊ด€๋ จ๋œ ๋‹ค์–‘ํ•œ ๊ณต๊ณต๋ฐ์ดํ„ฐ๋ฅผ ํ™•์ธํ•ด ๋ณด์„ธ์š”.

data.seoul.go.kr


import pandas as pd

DataFrame์„ ์‚ฌ์šฉํ•  ๊ฒƒ์ด๊ธฐ ๋•Œ๋ฌธ์— pandas๋ฅผ import ํ•ด์ค€๋‹ค. 

file = './data/์„œ์šธ์‹œ ๋Œ€ํ”ผ์†Œ ๋ฐฉ์žฌ์‹œ์„ค ํ˜„ํ™ฉ (์ขŒํ‘œ๊ณ„_ WGS1984).csv'
raw = pd.read_csv(file, encoding = 'cp949')   # encoding = 'cp949' : MS ํ”„๋กœ๊ทธ๋žจ ์‚ฌ์šฉ์‹œ, ๊ทธ์™ธ์˜ ๊ฒฝ์šฐ encoding = 'utf-8'  (๊ธฐ๋ณธ๊ฐ’)
raw.head()

raw.info()

 

์ˆ˜์—…์—์„œ๋Š” ์œ„๋„, ๊ฒฝ๋„๋ฅผ ์„ค์ •ํ•˜์—ฌ ๋งˆ์ปค๋ฅผ ์ถ”๊ฐ€ํ•˜๊ณ  , ๋งˆ์šฐ์Šค๋ฅผ ๋งˆ์ปค ์œ„์— ์˜ฌ๋ฆฌ๋ฉด ๋Œ€ํ”ผ์†Œ ๋ช…์นญ์„ ๋ณด์—ฌ์ฃผ๋Š” ๊ฒƒ ๊นŒ์ง€ ๋ฐฐ์› ๊ณ  , ๊ฑฐ๊ธฐ์— ๋”ํ•ด์„œ ํด๋ฆญํ•˜๋ฉด ์ตœ๋Œ€ ์ˆ˜์šฉ์ธ์›์ด ๋‚˜์˜ค๋„๋ก ํ•˜๋Š” ์ง€๋„๋ฅผ ๋งŒ๋“ค์–ด ๋ณธ๋‹ค. 

i = 7

lat = raw.loc[i, '์œ„๋„']
long = raw.loc[i, '๊ฒฝ๋„']
name = raw.loc[i,'๋Œ€ํ”ผ์†Œ๋ช…์นญ']
num = raw.loc[i, '์ตœ๋Œ€์ˆ˜์šฉ์ธ์›']

print(lat, long, name, num)

37.5365343 126.9650202 ๋‚จ์ •์ดˆ๋“ฑํ•™๊ต 300

์šฐ์„  ์ธ๋ฑ์Šค๊ฐ€ 7์ผ ๋•Œ์˜ ํ•„์š”ํ•œ ์ •๋ณด๋ฅผ ์–ป์–ด๋ณด๊ธฐ 

for i in range(len(raw)):
    lat = raw.loc[i, '์œ„๋„']
    long = raw.loc[i, '๊ฒฝ๋„']
    name = raw.loc[i,'๋Œ€ํ”ผ์†Œ๋ช…์นญ']
    num = raw.loc[i, '์ตœ๋Œ€์ˆ˜์šฉ์ธ์›']


    print(name, lat, long, num)

์ด 694๊ฐœ์˜ ๋ฐ์ดํ„ฐ๊ฐ€ ๋‚˜์™”์„ ๊ฒƒ์ด๋‹ค. 

# ์ง€๋„ ์ƒ์„ฑํ•˜๊ธฐ
m = folium.Map(location=[37.5536067,126.9674308],
               zoom_start=12)
m
# ๋Œ€ํ”ผ์†Œ ๋งˆ์ปค ์ถ”๊ฐ€ํ•˜๊ธฐ

for i in range(len(raw)):
    lat = raw.loc[i, '์œ„๋„']
    long = raw.loc[i, '๊ฒฝ๋„']
    name = raw.loc[i,'๋Œ€ํ”ผ์†Œ๋ช…์นญ']
    num = raw.loc[i, '์ตœ๋Œ€์ˆ˜์šฉ์ธ์›']


    folium.Marker([lat, long],tooltip= name, popup='%d๋ช…'%num).add_to(m)    
m

for๋ฌธ์„ ์ด์šฉํ•˜์—ฌ ๋Œ€ํ”ผ์†Œ ๋ชจ๋“  ๊ณณ์— ๋งˆ์ปค๋ฅผ ์ฐ์–ด ์ค€๋‹ค. 

 


* ๋ฌธ์ œ์  : ๋งˆ์ปค๊ฐ€ ๋„ˆ๋ฌด ๋งŽ์•„์„œ ์ง€๋„๋ฅผ ์‚ดํŽด๋ณด๊ธฐ์— ์–ด๋ ค์›€์ด ์ƒ๊ธด๋‹ค 

* ํ•ด๊ฒฐ : MarkerCluster ๋ฅผ ์ด์šฉํ•˜์—ฌ ๊ทผ์ฒ˜์— ์žˆ๋Š” ๋งˆ์ปค๋“ค ๋ผ๋ฆฌ ๊ทธ๋ฃน์œผ๋กœ ํ‘œํ˜„ํ•œ๋‹ค. 


 MarkerCluster ์‚ฌ์šฉํ•˜์—ฌ ๊ทธ๋ฃน ์ƒ์„ฑํ•˜๊ธฐ 
# ์ง€๋„ ์ƒ์„ฑํ•˜๊ธฐ
m = folium.Map(location=[37.5536067,126.9674308],
               zoom_start=12)
marker_cluster = MarkerCluster().add_to(m)  # ํด๋Ÿฌ์Šคํ„ฐ ์ถ”๊ฐ€ํ•˜๊ธฐ

# ๋Œ€ํ”ผ์†Œ ๋งˆ์ปค ์ถ”๊ฐ€ํ•˜๊ธฐ

for i in range(len(raw)):
    lat = raw.loc[i, '์œ„๋„']
    long = raw.loc[i, '๊ฒฝ๋„']
    name = raw.loc[i,'๋Œ€ํ”ผ์†Œ๋ช…์นญ']
    num = raw.loc[i, '์ตœ๋Œ€์ˆ˜์šฉ์ธ์›']


    folium.Marker([lat, long],tooltip= name, popup='%d๋ช…'%num).add_to(marker_cluster)    
m

๊ทธ๋ฃน์„ ํด๋ฆญํ•˜์—ฌ ์ž์„ธํžˆ ํ™•๋Œ€๊ฐ€ ๋˜๋ฉด, ๋งˆ์ปค๋ฅผ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ๋ณด๋‹ค ์ง€๋„๋ฅผ ํšจ์œจ์ ์œผ๋กœ ํ™•์ธํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋˜์—ˆ๋‹ค. 

 

 

- ๋ฏธ๋‹ˆ๋งต ์ถ”๊ฐ€ 

from folium.plugins import MiniMap

# ์ง€๋„ ์ƒ์„ฑํ•˜๊ธฐ
m = folium.Map(location=[37.5536067,126.9674308],
               zoom_start=12)
marker_cluster = MarkerCluster().add_to(m)  # ํด๋Ÿฌ์Šคํ„ฐ ์ถ”๊ฐ€ํ•˜๊ธฐ


# ๋ฏธ๋‹ˆ๋งต ์ถ”๊ฐ€ 
minimap = MiniMap()
m.add_child(minimap) 


# ๋Œ€ํ”ผ์†Œ ๋งˆ์ปค ์ถ”๊ฐ€ํ•˜๊ธฐ
for i in range(len(raw)):
    lat = raw.loc[i, '์œ„๋„']
    long = raw.loc[i, '๊ฒฝ๋„']
    name = raw.loc[i,'๋Œ€ํ”ผ์†Œ๋ช…์นญ']
    num = raw.loc[i, '์ตœ๋Œ€์ˆ˜์šฉ์ธ์›']


    folium.Marker([lat, long],tooltip= name, popup='%d๋ช…'%num).add_to(marker_cluster)    
m


 ๋งˆ๋ฌด๋ฆฌ : ์ง€๋„ ์ €์žฅ (html) 
m.save('./data/Sheltermap.html')

html๋กœ ์ €์žฅํ•˜์—ฌ ํฐ ํ™”๋ฉด์œผ๋กœ ์–ธ์ œ๋“  ์ง€๋„๋ฅผ ๊บผ๋‚ด๋ณผ ์ˆ˜ ์žˆ๋‹ค. 

 


 

๋Ÿฌ๋‹์Šคํ‘ผ ์ˆ˜์—… ์ •๋ฆฌ 

 

 

์ˆ˜์น˜ํ˜• x ์ˆ˜์น˜ํ˜• : scatterplot, lmplot, jointplot 

์ˆ˜์น˜ํ˜• x ์นดํ…Œ๊ณ ๋ฆฌํ˜• : boxplot, violinplot, barplot, heatmap 

์ˆ˜์น˜ํ˜• x ์œ„์น˜ํ˜• : folium ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ํ™œ์šฉ 

 

 

๋ณธ ํฌ์ŠคํŒ…์—์„œ๋Š” seaborn ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์˜ distplot, relplot, jointplot, pairplot, boxplot, swarmplot, heatmap ์„ ์‚ฌ์šฉํ•œ๋‹ค. 

 

 


 ๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ & ์‚ดํŽด๋ณด๊ธฐ 
import seaborn as sns

seaborn ์„ import ํ•ด์ฃผ๊ณ , ์•ฝ์–ด๋Š” sns๋ฅผ ์“ด๋‹ค. 

raw = sns.load_dataset('tips')

seaborn ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์— ์ €์žฅ๋œ ๋ฐ์ดํ„ฐ์…‹์„ ๋ถˆ๋Ÿฌ์˜ฌ ์ˆ˜ ์žˆ๋‹ค. ๋ฐ์ดํ„ฐ๋Š” ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„ ํ˜•์‹์œผ๋กœ, pandas๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ถ„์„ํ•  ์ˆ˜ ์žˆ๋‹ค. 

raw.head()

total_bill ( ์ด ๊ธˆ์•ก ) , tip ( ํŒ ) , sex ( ์„ฑ๋ณ„ ) , smoker ( ํก์—ฐ ์—ฌ๋ถ€ ) , day ( ์š”์ผ ) , time ( ์‹์‚ฌ ์‹œ๊ฐ„ ) , size ( ์ธ์› ์ˆ˜ ) 

raw.info()

info()๋ฅผ ์ด์šฉํ•˜์—ฌ row, column์˜ ๊ฐœ์ˆ˜, data type, ๊ฒฐ์ธก๊ฐ’ ํ™•์ธ์„ ํ•ด ์ค€๋‹ค. 

 


 ๋ฐ์ดํ„ฐ ๋ถ„ํฌ ์‚ดํŽด๋ณด๊ธฐ (์ˆ˜์น˜ํ˜•) 
  • sns.distplot( df[ '์ปฌ๋Ÿผ๋ช…' ] )
raw['total_bill']

์ˆ˜์น˜ํ˜• ์ž๋ฃŒ์ธ total_bill ์ปฌ๋Ÿผ์˜ ๋ถ„ํฌ๋ฅผ ์‚ดํŽด๋ณธ๋‹ค. 

sns.distplot(raw['total_bill'])

 

 

 

 ๋ฐ์ดํ„ฐ ๋ถ„ํฌ ์‚ดํŽด๋ณด๊ธฐ ( ์ˆ˜์น˜ํ˜• vs ์ˆ˜์น˜ํ˜• ) 
  • relplot( data=df, x=, y=, hue=, kind='scatter' ) 

: ๋‘ ๊ฐœ์˜ ์ˆ˜์น˜ํ˜• ๋ณ€์ˆ˜ ๋ถ„ํฌ ํ™•์ธ 

- kind = 'scatter' (default

- kind = 'line' 

sns.relplot(x = 'tip', y = 'total_bill', data = raw)  # kind ์˜ต์…˜ ๊ฐ’ ๋ฏธ์ง€์ •์‹œ "scatter"

sns.relplot(x = 'tip', y = 'total_bill', data = raw, kind = 'line')

sns.relplot(x = 'tip', y = 'total_bill', data = raw, hue = 'sex')

 

hue๋ฅผ ์„ฑ๋ณ„๋กœ ์ง€์ •ํ•˜์—ฌ ์„ฑ๋ณ„์„ ๊ธฐ์ค€์œผ๋กœ ๋‚˜๋ˆ„์–ด ํ™•์ธํ•ด๋ณผ ์ˆ˜ ์žˆ๋‹ค. 

 


 ๊ด€๊ณ„ ์‚ดํŽด๋ณด๊ธฐ (์ˆ˜์น˜ํ˜• vs ์ˆ˜์น˜ํ˜• ) 
  • jointplot ( data=df, x=, y=, kind='scatter' ) 

- kind = 'scatter' : (default / point) 

- kind = 'reg' : (point + regression)

- kind = 'kde' : ๋ˆ„์ ๋ถ„ํฌ์ฐจํŠธ like ์ง€๋„

 

sns.jointplot(data = raw, x = 'tip', y = 'total_bill')  # kind ๊ฐ’ ๋ฏธ ์ง€์ •์‹œ ๊ธฐ๋ณธ ๊ฐ’์€ kind = 'scatter'

sns.jointplot(data = raw, x = 'tip', y = 'total_bill', kind = 'kde')

sns.jointplot(data = raw, x = 'tip', y = 'total_bill', kind = 'reg')

sns.jointplot(data = raw, x = 'tip', y = 'total_bill', kind = 'hex')

kind='hex'๋Š” kde์™€ ๋น„์Šทํ•œ๋ฐ, ๊ฒน์น˜์ง€ ์•Š๋Š” ์ •์œก๊ฐํ˜•์œผ๋กœ ํ‘œ์‹œ๋œ๋‹ค. 

 

 

 

  • pairplot( data=df ) 

: df ์˜ ๋ชจ๋“  ์ˆ˜์น˜ํ˜•๋ฐ์ดํ„ฐ ์ปฌ๋Ÿผ์—์„œ ๋‘ ์ปฌ๋Ÿผ์”ฉ ๊ด€๊ณ„๋ฅผ ์‹œ๊ฐํ™” ํ•จ 

 

sns.pairplot(data = raw)

์„œ๋กœ ๊ฐ™์€ ์ปฌ๋Ÿผ์ด๋ฉด ํžˆ์Šคํ† ๊ทธ๋žจ์„ ๊ทธ๋ฆฌ๊ณ , ๋‹ค๋ฅธ ์ปฌ๋Ÿผ์€ ๋‘ ์ˆ˜์น˜ํ˜• ๋ฐ์ดํ„ฐ์˜ ๊ด€๊ณ„๋ฅผ ์‚ฐ์ ๋„๋กœ ํ‘œ์‹œํ•ด ์ค€๋‹ค. 

sns.pairplot(data = raw, hue = 'sex')

hue๋ฅผ ์„ฑ๋ณ„๋กœ ์ง€์ •ํ•˜์—ฌ ์„ฑ๋ณ„ ๊ฐ„ ๋ถ„ํฌ๋ฅผ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. 

 

 


 ๋ฐ์ดํ„ฐ ๋ถ„ํฌ ์‚ดํŽด๋ณด๊ธฐ (์ˆ˜์น˜ํ˜• vs ์นดํŽ˜๊ณ ๋ฆฌํ˜•) 
  • boxplot( data = df, x = , y = , hue = )
sns.boxplot(data = raw, x = 'day', y = 'tip')

์š”์ผ ๋ณ„ ํŒ์˜ ๋ถ„ํฌ๋ฅผ ํ™•์ธํ•œ๋‹ค. 

sns.boxplot(data = raw, x = 'day', y = 'tip', hue = 'smoker')

hue๋ฅผ smoker๋กœ ์ง€์ •ํ•˜์—ฌ ํก์—ฐ์ž ์—ฌ๋ถ€๋ฅผ ๊ธฐ์ค€์œผ๋กœ ๋ถ„ํฌ๋ฅผ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. 

 

 

 

  • swarmplot( data=, x=, y=, hue=, dodge= )

boxplot๊ณผ ๋น„์Šทํ•˜๋‹ค. boxplot์€ ํ•˜๋‚˜์˜ ๊ธฐ์ค€์— ๋Œ€ํ•œ ๋ฐ์ดํ„ฐ ๋ถ„ํฌ ๋ฒ”์œ„๋ฅผ ํŒŒ์•…ํ•˜๋Š” ๋ฐ ์šฉ์ดํ•˜์ง€๋งŒ , ๋ฐ์ดํ„ฐ์˜ ๊ฐœ์ˆ˜๋ฅผ ํ‘œํ˜„ํ•˜์ง€๋Š” ์•Š๊ธฐ ๋•Œ๋ฌธ์— , ๋ฐ์ดํ„ฐ ๊ฐœ์ˆ˜๊ฐ€ ๋‹ค๋ฅธ ๊ฐ’์„ ๋น„๊ตํ•˜๊ธฐ์—๋Š” ๋ฌธ์ œ๊ฐ€ ์žˆ์„ ์ˆ˜ ์žˆ๋‹ค. ์ด ๋•Œ swarmplot์„ ์ด์šฉํ•œ๋‹ค. 

sns.swarmplot(data = raw, x = 'day', y = 'tip', hue = 'smoker', dodge=True)

 

dodge=False๋กœ ํ•˜๋ฉด ๋‘ ์ƒ‰์ด ๊ฒน์ณ ๋‚˜์˜จ๋‹ค. 

sns.boxplot(data = raw, x = 'day', y = 'tip', hue = 'smoker')
sns.swarmplot(data = raw, x = 'day', y = 'tip', hue = 'smoker', dodge=True)

 

boxplot๊ณผ swarplot์„ ๊ฒน์ณ์…” ๊ทธ๋ ค๋ณผ ์ˆ˜๋„ ์žˆ๋‹ค. 

sns.boxplot(data = raw, x = 'size', y = 'tip', hue = 'sex')

 

 

 

 

  • barplot( data=df, x=, y=, hue= ) 
sns.barplot(data = raw, x = 'size', y = 'tip', hue = 'sex')

 

 

 

 ๋ฐ์ดํ„ฐ ์‚ดํŽด๋ณด๊ธฐ (์ˆ˜์น˜ํ˜• vs ์นดํ…Œ๊ณ ๋ฆฌํ˜• vs ์นดํ…Œ๊ณ ๋ฆฌํ˜•)
  • heatmap( data = df )
df = raw.pivot_table(index = 'day', columns = 'size', values = 'tip', aggfunc='mean')
df

์šฐ์„  ์ธ๋ฑ์Šค๋ฅผ day, column์„ size, value๋ฅผ tip์œผ๋กœ ์„ค์ •ํ•˜์—ฌ ์š”์ผ ๋ณ„ ์ธ์›์ˆ˜์— ๋”ฐ๋ฅธ tip ํ…Œ์ด๋ธ”์„ ์ƒ์„ฑํ•œ๋‹ค. 

sns.heatmap(data = df)

์นดํ…Œ๊ณ ๋ฆฌํ˜• - ์นดํ…Œ๊ณ ๋ฆฌํ˜•์˜ ๊ด€๊ณ„๋ฅผ ์ˆ˜์น˜ํ˜•์„ ์ด์šฉํ•˜์—ฌ ํ‘œํ˜„ํ•˜๋Š” heatmap์ด๋‹ค. 

 

๋‹ค์Œ๊ณผ ๊ฐ™์€ ์—ฌ๋Ÿฌ ์„ค์ •์„ ํ•  ์ˆ˜ ์žˆ๋‹ค. 

sns.heatmap(data = df, 
           annot = True, fmt = '.2f')

 

sns.heatmap(data = df, 
           annot = True, fmt = '0.2f',
           cmap = 'Pastel1')

* annot : ์ˆซ์ž ํ‘œ์‹œ, fmt : ์ˆซ์ž ํฌ๋งคํŒ… , cmap : ์ƒ‰๊น” 

* cmap ์ถ”์ฒœ ์ƒ‰ : Reds , Blues, Vlag, Pastel1, RdBu_r


 

 

๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ 

 

 

< ์ด์ „ ๊ธ€ > 

https://silvercoding.tistory.com/50

 

[python pandas] 3. pandas ๊ธฐ์ดˆ ์‚ฌ์šฉ (3) - ์ง‘๊ณ„, ๊ฒฐ์ธก๊ฐ’, ์ •๋ ฌ

๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ < ์ด์ „ ๊ธ€ > https://silvercoding.tistory.com/49 https://silvercoding.tistory.com/48 [python pandas] pandas ๊ธฐ์ดˆ ์‚ฌ์šฉ (1) ๋Ÿฌ๋‹์Šคํ‘ผ ์ˆ˜์—… ์ •๋ฆฌ * ํŒ๋‹ค์Šค ๊ธฐ๋ณธ ํ•จ์ˆ˜ ๋ฐ์ดํ„ฐ ํŒŒ์ผ ์ฝ๊ธฐ :..

silvercoding.tistory.com

 

 


 ๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ & ์‚ดํŽด๋ณด๊ธฐ 
import pandas as pd
file = './data/babyNamesUS.csv'
raw = pd.read_csv(file)
raw.head()

raw.info()

 


 

 ๋‚จ๋…€ ๊ตฌ๋ถ„์—†์ด '๋งŽ์ด' ์‚ฌ์šฉ๋˜๋Š” ๊ณตํ†ต ์ด๋ฆ„ ? 

idea : ๋‚จ๋…€ ์ด๋ฆ„ ๊ฐœ์ˆ˜์˜ ๋น„์œจ ์ฐจ์ด๊ฐ€ ์ž‘์„์ˆ˜๋ก ์„ฑ๋ณ„ ๊ตฌ๋ถ„์ด ์—†๋Š” ์ด๋ฆ„์ผ ๊ฒƒ์ด๋‹ค !  

# ์„ฑ๋ณ„์— ๋”ฐ๋ฅธ ์ด๋ฆ„ ๊ฐœ์ˆ˜ ์ง‘๊ณ„
name_df = raw.pivot_table(index = 'Name', columns = 'Sex', values = 'Number', aggfunc='sum')

# ๊ฒฐ์ธก๊ฐ’ ์ฑ„์šฐ๊ธฐ (0) 
name_df = name_df.fillna(0)

# float -> int 
name_df = name_df.astype(int)
name_df.head()

์—ฌ๊ธฐ๊นŒ์ง€ ์ €๋ฒˆ ํฌ์ŠคํŒ…์—์„œ ํ–ˆ๋˜ ๋‚ด์šฉ์ด๋‹ค. 

name_df['Sum'] = name_df['M'] + name_df['F']
name_df.head()

๋‚จ๋…€ ์ด๋ฆ„ ๊ฐœ์ˆ˜๋ฅผ ๋ชจ๋‘ ๋”ํ•ด์„œ sum ์ด๋ผ๋Š” ์ปฌ๋Ÿผ์„ ์ƒ์„ฑํ•œ๋‹ค. 

# ๋‚จ, ๋…€ ๋น„์œจ ๊ณ„์‚ฐ 
name_df['F_ratio'] = name_df['F'] / name_df['Sum']
name_df['M_ratio'] = name_df['M'] / name_df['Sum']

# ๋‚จ, ๋…€ ๋น„์œจ ๊ฐ„ ์ฐจ์ด
name_df['M_F_Gap'] = abs(name_df['F_ratio'] - name_df['M_ratio'])
name_df.head()

-1 ~ 1 ์˜ ๋ฒ”์œ„๋ฅผ abs() (์ ˆ๋Œ“๊ฐ’) ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ 0 ~ 1 ๋ฒ”์œ„๋กœ ๋ฐ”๊พธ์–ด ์ค€๋‹ค. 

# ์ด๋ฆ„ ์ด ๊ฐœ์ˆ˜๋ฅผ ๊ธฐ์ค€์œผ๋กœ ๋‚ด๋ฆผ์ฐจ์ˆœ ์ •๋ ฌ 
name_df = name_df.sort_values(by = 'Sum', ascending=False)
name_df.head(20)

๋งŽ์ด ์‚ฌ์šฉ๋œ ์ด๋ฆ„์„ ๋ฝ‘๋Š” ๊ฒƒ์ด๊ธฐ ๋•Œ๋ฌธ์— ์šฐ์„  ์ด ํ•ฉ๊ณ„ ์ปฌ๋Ÿผ์„ ๊ธฐ์ค€์œผ๋กœ ์ •๋ ฌํ•ด์ค€๋‹ค. 

cond = name_df['M_F_Gap'] < 0.1
name_df[cond].head(10)

์ด ๋•Œ ๋น„์œจ์ฐจ์ด๊ฐ€ ์ ์€ ๊ฒƒ์„ 0.1 ๋ฏธ๋งŒ์œผ๋กœ ๊ธฐ์ค€ ์žก๊ณ ,  M_F_Gap ์ปฌ๋Ÿผ์ด 0.1 ๋ณด๋‹ค ์ž‘์€ ํ–‰๋“ค์„ ์ถœ๋ ฅ์‹œํ‚จ๋‹ค.  

# ์„ฑ๋ณ„ ๊ตฌ๋ถ„์—†์ด ๋งŽ์ด ์‚ฌ์šฉ๋˜๋Š” ์ด๋ฆ„ Top 10 
name_df[cond].head(10).index

 

 

 

 

 

 ๊ฐ€์žฅ ๋Œ€ํ‘œ์ ์ธ ๋ฏธ๊ตญ์˜ ์ด๋ฆ„ ? ( ์ตœ๊ทผ ํŠธ๋ Œ๋“œ ) 

idea : ์„ธ๋Œ€๋ฅผ ๊ธฐ์ค€์œผ๋กœ ์ตœ๊ทผ ์„ธ๋Œ€(2020, 1990) ์ด๋ฆ„ ๊ฐœ์ˆ˜์˜ ๋น„์œจ์ด ํฐ ์ด๋ฆ„์ด ์ตœ๊ทผ ํŠธ๋ Œ๋“œ์— ๋งž๋Š” ๋Œ€ํ‘œ์ ์ธ ๋ฏธ๊ตญ ์ด๋ฆ„์ผ ๊ฒƒ์ด๋‹ค ! 

raw.head()

# unique() ๋ฅผ ํ†ตํ•ด, ๊ธฐ๊ฐ„์— ๋“ค์–ด๊ฐ€๋Š” ๊ฐ’๋“ค์„ ์‚ดํŽด๋ด…๋‹ˆ๋‹ค. 
raw['YearOfBirth'].unique()

array([1910, 1911, 1912, 1913, 1914, 1915, 1916, 1917, 1918, 1919, 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015], dtype=int64)

 

 

* ์„ธ๋Œ€ ๋‚˜๋ˆ„๊ธฐ 

ํ•œ ์„ธ๋Œ€ ๋‚˜๋ˆ„๋Š” ๊ธฐ์ค€ 30๋…„ : 2020๋…„ ๊ธฐ์ค€ 30๋…„์”ฉ ๊ตฌ๋ถ„

  • 1930๋…„๋Œ€ ์ด์ „
  • 1960๋…„๋Œ€ ์ด์ „
  • 1990๋…„๋Œ€ ์ด์ „
  • 2020๋…„ ์ด์ „
year_class_list = [ ]

for year in raw['YearOfBirth']:
    if year <= 1930: 
        year_class = '1930๋…„์ด์ „'
    elif year<= 1960: 
        year_class = '1960๋…„์ด์ „'
    elif year <= 1990:
        year_class = '1990๋…„์ด์ „'
    else:
        year_class = '2020๋…„์ด์ „'
    year_class_list.append(year_class)

์œ„์™€๊ฐ™์ด ๋ฐ˜๋ณต๋ฌธ๊ณผ if๋ฌธ์„ ์‚ฌ์šฉํ•˜์—ฌ ์ถœ์ƒ๋…„๋„๋ฅผ 4๊ฐœ์˜ ์„ธ๋Œ€ ๊ทธ๋ฃน์œผ๋กœ ๋‚˜๋ˆ„์–ด ์ค€๋‹ค. 

raw['year_class'] = year_class_list
raw.head()

์„ธ๋Œ€ ๊ทธ๋ฃน์„ ์ €์žฅํ•œ ๋ฆฌ์ŠคํŠธ๋ฅผ ์ด์šฉํ•˜์—ฌ year_class ์ปฌ๋Ÿผ์„ ์ƒ์„ฑํ•œ๋‹ค. 

name_period = raw.pivot_table(index = ['Name', 'Sex'], columns = 'year_class', values = 'Number', aggfunc='sum')
name_period = name_period.fillna(0)
name_period = name_period.astype(int)
name_period.head()

์ด๋ฆ„๊ณผ ์„ฑ๋ณ„์„ ์ธ๋ฑ์Šค๋กœ ์„ค์ •ํ•˜๊ณ , year_class์— ๋”ฐ๋ฅธ number์˜ ํ•ฉ๊ณ„๋ฅผ ์ถœ๋ ฅํ•œ๋‹ค. 

name_period['sum'] = name_period.sum(axis = 1)
name_period.head()

์ด๋ฆ„ ์ด ๊ฐœ์ˆ˜๋ฅผ ๊ตฌํ•˜๊ธฐ ์œ„ํ•ด sum(axis=1) ์„ ์‚ฌ์šฉํ•œ๋‹ค.  axis=1์ด๋ฉด ๊ฐ€๋กœ๋ฐฉํ–ฅ์œผ๋กœ ๊ณ„์‚ฐ์„ ํ•˜๊ฒŒ ๋œ๋‹ค. 

# ์„ธ๋Œ€ ๋ณ„ ๋น„์œจ ๊ณ„์‚ฐ 
for col in name_period.columns:
    col_new = col+"๋น„์œจ"
    name_period[col_new] = name_period[col] / name_period['sum']
    
name_period.head()

์„ธ๋Œ€ ๋ณ„ ๋น„์œจ์„ ๊ณ„์‚ฐํ•˜์—ฌ ๊ฐ ์ปฌ๋Ÿผ์„ ๋งŒ๋“ค์–ด ์ค€๋‹ค. 

# ์ด๋ฆ„ ์‚ฌ์šฉ์ˆ˜ ํ•ฉ๊ณ„, 2020๋…„ ์ด์ „ ๋น„์œจ, 1990๋…„์ด์ „ ๋น„์œจ ๊ธฐ์ค€ ๋‚ด๋ฆผ์ฐจ์ˆœ ์ •๋ ฌ 
name_period = name_period.sort_values(by = ['sum', '2020๋…„์ด์ „๋น„์œจ','1990๋…„์ด์ „๋น„์œจ'], ascending=False)
name_period

1์ˆœ์œ„ ์ด๋ฆ„ ๊ฐœ์ˆ˜ ์ด ํ•ฉ , 2์ˆœ์œ„ 2020๋…„ ์ด์ „ ๋น„์œจ , 3์ˆœ์œ„ 1990๋…„ ์ด์ „ ๋น„์œจ ๋กœ ์ •๋ ฌ์„ ํ•˜์—ฌ ์ตœ์‹  ํŠธ๋ Œ๋“œ์— ๋งž๋Š” ๋ฏธ๊ตญ ๋Œ€ํ‘œ์ด๋ฆ„์„ ์•Œ์•„๋ณธ๋‹ค. 

# ์ธ๋ฑ์Šค๊ฐ€ ์—ฌ๋Ÿฌ ๋ ˆ๋ฒจ๋กœ ๋˜์–ด์žˆ์„ ๊ฒฝ์šฐ, ์ธ๋ฑ์Šค๋ฅผ ํ™œ์šฉํ•ด ์ปจํŠธ๋กค ํ•˜๋Š” ๊ฒƒ์€ ๋ณต์žก
# reset_index()๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์ธ๋ฑ์Šค๋กœ ์„ค์ •๋œ ์ด๋ฆ„๊ณผ ์„ฑ๋ณ„์„ ์ปฌ๋Ÿผ์œผ๋กœ ๋ณ€๊ฒฝ
name_period = name_period.reset_index()
name_period.head()

์ธ๋ฑ์Šค๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์ปจํŠธ๋กค์ด ์–ด๋ ค์šฐ๋ฏ€๋กœ ํ•„์š”ํ•œ ์ง‘๊ณ„, ์—ฐ์‚ฐ์ด ๋๋‚œ ๋’ค์—๋Š” reset_index๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ column์œผ๋กœ ๋ณ€๊ฒฝํ•ด ์ค€๋‹ค. 

# ๋‚จ์ž ์ด๋ฆ„๋งŒ ์„ ํƒ
cond = name_period['Sex'] =='M'
name_period[cond].head(10)

์„ฑ๋ณ„์ด ๋‚จ์„ฑ์ธ ์ด๋ฆ„๋“ค ์ค‘์—์„œ ์ƒ์œ„ 10๊ฐœ๋ฅผ ๋ฝ‘์•„๋ณด๋ฉด ์œ„์™€ ๊ฐ™๋‹ค. 

# ์ด๋ฒˆ์—๋Š” ์—ฌ์ž์ด๋ฆ„
cond = name_period['Sex'] =='F'
name_period[cond].head(10)

์ด๋ฒˆ์—” ์„ฑ๋ณ„์ด ์—ฌ์„ฑ์ธ ์ด๋ฆ„์˜ ์ƒ์œ„ 10๊ฐœ๋ฅผ ๋ฝ‘์€ ๊ฒƒ์ด๋‹ค. 

 

๊ทธ๋Ÿฐ๋ฐ ์•„์ง์€ ์ด์ƒํ•˜๋‹ค. ํŠนํžˆ ์„ฑ๋ณ„์ด ์—ฌ์ž์ธ ์ด๋ฆ„์˜ ํ‘œ์—์„œ ์ฒซ๋ฒˆ์งธ row๋Š” 1960๋…„ ์ด์ „ ์„ธ๋Œ€์—์„œ ์•ฝ 50%์˜ ๋น„์œจ์„ ์ฐจ์ง€ํ•˜๊ณ  ์žˆ๋‹ค. ๋”ฐ๋ผ์„œ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์กฐ๊ฑด์„ ๊ฑด๋‹ค. 

cond_age = name_period['2020๋…„์ด์ „๋น„์œจ'] > 0.3
cond_sex = name_period['Sex'] == 'M'
cond = cond_age & cond_sex
name_period[cond].head(5)

 

2020๋…„ ์ด์ „๋น„์œจ์ด 0.3 ์ด์ƒ์ด๋ฉด์„œ ์„ฑ๋ณ„์ด ๋‚จ์„ฑ์ธ ์กฐ๊ฑด์ธ row๋ฅผ ์„ ํƒํ•œ๋‹ค. 

๊ฒฐ๊ณผ : < ๋‚จ์„ฑ Top 5 ์ด๋ฆ„ Christopher, Daniel, Matthew, Anthony, Andrew >

cond_age = name_period['2020๋…„์ด์ „๋น„์œจ'] > 0.3
cond_sex = name_period['Sex'] == 'F'
cond = cond_age & cond_sex
name_period[cond].head(5)

์—ฌ์„ฑ์˜ ๊ฒฝ์šฐ๋„ ๋™์ผํ•œ ์กฐ๊ฑด์œผ๋กœ ์ง„ํ–‰ํ•œ๋‹ค. 

๊ฒฐ๊ณผ : < ์—ฌ์„ฑ Top 5 ์ด๋ฆ„ Jessica, Sarah, Ashley, Stephanie, Emily >

 


 

 

 

๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ 

 

 

< ์ด์ „ ๊ธ€ > 

https://silvercoding.tistory.com/49

 

[python pandas] 2. pandas ๊ธฐ์ดˆ ์‚ฌ์šฉ (2) - ์ถ”๊ฐ€, ๋ณ‘ํ•ฉ, ์ €์žฅ

๋Ÿฌ๋‹์Šคํ‘ผ ์ˆ˜์—… ์ •๋ฆฌ < ์ด์ „ ๊ธ€ > https://silvercoding.tistory.com/48 [python pandas] pandas ๊ธฐ์ดˆ ์‚ฌ์šฉ (1) ๋Ÿฌ๋‹์Šคํ‘ผ ์ˆ˜์—… ์ •๋ฆฌ * ํŒ๋‹ค์Šค ๊ธฐ๋ณธ ํ•จ์ˆ˜ ๋ฐ์ดํ„ฐ ํŒŒ์ผ ์ฝ๊ธฐ : read_excel(), read_csv() ๋ฐ์ดํ„ฐ ์„ ํƒ..

silvercoding.tistory.com

 

 

 


 1. ๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ & ์‚ดํŽด๋ณด๊ธฐ 
import pandas as pd

pandas import ํ•ด์ฃผ๊ธฐ 

file = './data/babyNamesUS.csv'
raw = pd.read_csv(file)

์˜ค๋Š˜ ํ•™์Šตํ•  csv ํŒŒ์ผ์„ pandas๋กœ ๋ถˆ๋Ÿฌ์™€ ์ค€๋‹ค. 

raw.head()

raw.info()

1048575 ๊ฐœ์˜ row์™€ 5๊ฐœ์˜ column ์ด ์กด์žฌํ•˜๋ฉฐ, ๊ฒฐ์ธก๊ฐ’์„ ์—†๋‹ค. 

[ ์ปฌ๋Ÿผ ์ •๋ณด : ์ฃผ, ์„ฑ๋ณ„, ์ถœ์ƒ๋…„๋„, ์ด๋ฆ„, ์ด๋ฆ„ ๊ฐœ์ˆ˜ ]

 


 

 2. ์ง‘๊ณ„ํ•˜๊ธฐ ( pivot_table ) 

pd.pivot_table(index = '์ปฌ๋Ÿผ๋ช…', columns = '์ปฌ๋Ÿผ๋ช…', values = '์ปฌ๋Ÿผ๋ช…', aggfunc = 'sum')

raw.pivot_table(index = 'Name', values = 'Number', aggfunc='sum')

์ด๋ฆ„ ๋ณ„ ๋นˆ๋„์ˆ˜ ์ง‘๊ณ„ํ•ด์„œ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. columns๋ฅผ ๋”ฐ๋กœ ์„ค์ •ํ•˜์ง€ ์•Š์œผ๋ฉด values๊ฐ€ column์ด ๋œ๋‹ค. 

name_df = raw.pivot_table(index = 'Name', values = 'Number', columns = 'Sex', aggfunc='sum')
name_df.head()

์ด๋ ‡๊ฒŒ ์„ฑ๋ณ„์„ ๊ธฐ์ค€์œผ๋กœ ์ด๋ฆ„์˜ ๋นˆ๋„์ˆ˜๋ฅผ ์ง‘๊ณ„ํ•  ์ˆ˜ ์žˆ๋‹ค. 

name_df.info()

์œ„์˜ ํ”ผ๋ฒ—ํ…Œ์ด๋ธ” ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์˜ info๋ฅผ ๋ณด๋ฉด F, M ๋ชจ๋‘ ๊ฒฐ์ธก๊ฐ’์ด ๊ฝค ์žˆ๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค. 

 

 

 

 

 3. ๊ฒฐ์ธก๊ฐ’ ์ฑ„์šฐ๊ธฐ 
  • ๊ณตํ†ต๋œ ๊ฐ’์„ ์ž…๋ ฅ(ex 0)
  • ์ž„์˜์˜ ์ˆ˜๋ฅผ ์ž…๋ ฅ(ex ํ‰๊ท , ์ตœ๋Œ€๊ฐ’, ์ตœ์†Œ๊ฐ’, ๋น„์–ด์žˆ๋Š” ์ž๋ฆฌ ์ฃผ๋ณ€์˜ ๊ฐ’ ๋“ฑ)
  • ๋น„์–ด์žˆ๋Š” ๋ฐ์ดํ„ฐ๋Š” ๋ถ„์„์—์„œ ์ œ์™ธ

- fillna()

name_df = name_df.fillna(0)
name_df.head()

์ด ๋ฐ์ดํ„ฐ์˜ ๊ฒฝ์šฐ ๊ฐœ์ˆ˜๊ฐ€ ์ฑ„์›Œ์ ธ ์žˆ์ง€ ์•Š์€ ๊ฒƒ์€ ์‚ฌ์šฉํ•˜์ง€ ์•Š๋Š”๋‹ค๊ณ  ๊ฐ€์ •ํ•  ์ˆ˜ ์žˆ์œผ๋ฏ€๋กœ 0์œผ๋กœ ๋ชจ๋“  ๊ฒฐ์ธก๊ฐ’์„ ์ฑ„์šด๋‹ค. 

name_df.info()

๊ฒฐ์ธก๊ฐ’์ด ๋ชจ๋‘ ์ฑ„์›Œ์ง„ ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค 

 

 

 

 

 4. ์ •๋ ฌํ•˜๊ธฐ : sort_values(by='์ปฌ๋Ÿผ๋ช…', ascending=True)

- ๋‚จ์ž, ์—ฌ์ž ๊ฐ๊ฐ ๊ฐ€์žฅ ๋งŽ์ด ์‚ฌ์šฉ๋˜๋Š” ์ด๋ฆ„ ์•Œ์•„๋ณด๊ธฐ 

name_df.sort_values(by = 'M')

๋‚จ์„ฑ์„ ๊ธฐ์ค€์œผ๋กœ ์ •๋ ฌํ•˜์˜€๋‹ค. ๊ทธ๋Ÿฐ๋ฐ ๊ธฐ๋ณธ์ด ascending=True ์ด๋ฏ€๋กœ ์˜ค๋ฆ„์ฐจ์ˆœ์œผ๋กœ ๋˜์–ด์žˆ๋‹ค. ์ƒ์œ„ 5๊ฐœ๋ฅผ ์•Œ์•„๋ณผ ๊ฒƒ์ด๊ธฐ ๋•Œ๋ฌธ์— ๋‚ด๋ฆผ์ฐจ์ˆœ์œผ๋กœ ๋ฐ”๊พธ์–ด ์ค€๋‹ค. 

name_df.sort_values(by = 'M', ascending = False)

name_df.sort_values(by = 'M', ascending = False).head().index

Index(['Michael', 'James', 'Robert', 'John', 'David'], dtype='object', name='Name')

index๋งŒ ์ถ”์ถœํ•ด์„œ ์ƒ์œ„ 5๊ฐœ์˜ ์ด๋ฆ„๋งŒ ๋ฝ‘์€ ๊ฒƒ์ด๋‹ค. 

name_df.sort_values(by = 'F', ascending = False).head().index

Index(['Mary', 'Jennifer', 'Elizabeth', 'Patricia', 'Linda'], dtype='object', name='Name')

์—ฌ์„ฑ์˜ ์ƒ์œ„ 5๊ฐœ ์ด๋ฆ„๋„ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ๋ฝ‘์•„์ค€๋‹ค. 

 

 

 

 

 5. ์ปฌ๋Ÿผ๋ณ„ ๋ฐ์ดํ„ฐ ์ข…๋ฅ˜ ํ™•์ธํ•˜๊ธฐ 

- unique : ์ข…๋ฅ˜ ์•Œ์•„๋ณด๊ธฐ 

raw['StateCode'].unique()

- value_counts() : ์ข…๋ฅ˜ + ๊ฐœ์ˆ˜

raw['StateCode'].value_counts()

raw['YearOfBirth'].value_counts()

2007๋…„์— ๊ธฐ๋ก๋œ ์ด๋ฆ„์ด ๊ฐ€์žฅ ๋งŽ๋‹ค! 


 

 

 

 

 

 

 

 

 

 

+ Recent posts