Learning Spoons class notes

 

 

<Previous post>

https://silvercoding.tistory.com/71

 

[rossmann data] Store sales prediction / reduced Kaggle data


 

 


1. Data introduction & loading the data

[ Home Credit Data ]

Original data: Kaggle

Training data: provided by Learning Spoons

  • Predicting a customer's ability to repay a loan: based on the customer's personal information and transaction data, predict whether the customer will repay money lent to them

train.csv - training data
test.csv - test data to predict
loan_before.csv - details of the loans each person took out previously

 

import pandas as pd
import os
os.chdir('../data')
lb = pd.read_csv("loan_before.csv")
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
train.head()

 

lb.head()

 

- loan before column information

  • SK_ID_CURR: unique ID
  • DAYS_CREDIT: how many days before the Home Credit loan this earlier loan was taken
  • CNT_CREDIT_PROLONG: how many times the loan was extended
  • AMT_CREDIT_SUM: loan amount
  • CREDIT_TYPE: loan type

 

- train, test column information

  • SK_ID_CURR: unique ID
  • TARGET: target value (0: repaid normally, 1: overdue or otherwise problematic)
  • CODE_GENDER: gender (0: female, 1: male)
  • FLAG_OWN_CAR: owns a car (0: no, 1: yes)
  • FLAG_OWN_REALTY: owns a house or apartment (0: no, 1: yes)
  • CNT_CHILDREN: number of children
  • AMT_INCOME_TOTAL: income
  • AMT_CREDIT: loan amount
  • AMT_ANNUITY: amount to repay each month
  • NAME_TYPE_SUITE: who accompanied the customer when applying for the loan
  • NAME_INCOME_TYPE: type of occupation
  • NAME_EDUCATION_TYPE: education level
  • NAME_HOUSING_TYPE: housing situation
  • REGION_POPULATION_RELATIVE: population of the region
  • DAYS_BIRTH: age
  • DAYS_EMPLOYED: when the customer started their job (365243 marks a missing value)
  • DAYS_ID_PUBLISH: date the customer changed the ID document used for the loan application
  • OWN_CAR_AGE: age of the customer's car
  • CNT_FAM_MEMBERS: number of family members
  • HOUR_APPR_PROCESS_START: hour of day at which the loan application was made
  • ORGANIZATION_TYPE: type of organization the customer works for
  • EXT_SOURCE_1: credit score from external data source 1
  • EXT_SOURCE_2: credit score from external data source 2
  • EXT_SOURCE_3: credit score from external data source 3
  • DAYS_LAST_PHONE_CHANGE: when the customer last changed phones
  • AMT_REQ_CREDIT_BUREAU_YEAR: number of credit bureau inquiries about the person in the year before the application

1. Problem definition

Question 1 - Which factors have a large effect on whether a loan is repaid?

Question 2 - How do those factors affect repayment?

2. Methodology

- Analysis process

Use interpretable machine learning (xAI) to answer the questions

(1) Feature Engineering

- Create the AMT_CREDIT_TO_ANNUITY_RATIO variable: over how many months the person has to repay the loan (loan amount divided by the monthly payment)

train['AMT_CREDIT_TO_ANNUITY_RATIO'] = train['AMT_CREDIT']/train['AMT_ANNUITY']
test['AMT_CREDIT_TO_ANNUITY_RATIO'] = test['AMT_CREDIT']/test['AMT_ANNUITY']

- lb data: mean after groupby

  • AMT_CREDIT_SUM (amount of the previous loans)
  • DAYS_CREDIT (how many days before the train/test loan the previous loan was taken)
  • CNT_CREDIT_PROLONG (how many times the loan was extended)
train = pd.merge( train,lb.groupby(['SK_ID_CURR'])['AMT_CREDIT_SUM'].mean().reset_index(),on='SK_ID_CURR',how='left' )
test = pd.merge( test,lb.groupby(['SK_ID_CURR'])['AMT_CREDIT_SUM'].mean().reset_index(),on='SK_ID_CURR',how='left' )

train = pd.merge( train,lb.groupby(['SK_ID_CURR'])['DAYS_CREDIT'].mean().reset_index(),on='SK_ID_CURR',how='left' )
test = pd.merge( test,lb.groupby(['SK_ID_CURR'])['DAYS_CREDIT'].mean().reset_index(),on='SK_ID_CURR',how='left' )

train = pd.merge( train,lb.groupby(['SK_ID_CURR'])['CNT_CREDIT_PROLONG'].mean().reset_index(),on='SK_ID_CURR',how='left' )
test = pd.merge( test,lb.groupby(['SK_ID_CURR'])['CNT_CREDIT_PROLONG'].mean().reset_index(),on='SK_ID_CURR',how='left' )

- lb data: count after groupby

  • Create the count column: how many loans the person took out previously
train = pd.merge(train , lb.groupby(['SK_ID_CURR']).size().reset_index().rename(columns={0:'count'}),on='SK_ID_CURR', how='left')
test = pd.merge(test , lb.groupby(['SK_ID_CURR']).size().reset_index().rename(columns={0:'count'}),on='SK_ID_CURR', how='left')
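
The merges above can also be written as one aggregation followed by one merge per table. A minimal sketch, assuming tiny made-up frames in place of `loan_before.csv` and `train.csv`; the helper name `add_loan_history` is hypothetical and not part of the class material.

```python
import pandas as pd

def add_loan_history(df, lb):
    """Attach per-customer loan-history aggregates to df (hypothetical helper)."""
    agg = (
        lb.groupby('SK_ID_CURR')
          .agg(AMT_CREDIT_SUM=('AMT_CREDIT_SUM', 'mean'),
               DAYS_CREDIT=('DAYS_CREDIT', 'mean'),
               CNT_CREDIT_PROLONG=('CNT_CREDIT_PROLONG', 'mean'),
               count=('DAYS_CREDIT', 'size'))  # number of previous loans
          .reset_index()
    )
    # A left merge keeps customers with no loan history; their aggregates are NaN.
    return pd.merge(df, agg, on='SK_ID_CURR', how='left')

# Made-up data, only to show the shape of the result.
lb_toy = pd.DataFrame({'SK_ID_CURR': [1, 1, 2],
                       'AMT_CREDIT_SUM': [100.0, 300.0, 50.0],
                       'DAYS_CREDIT': [-10, -30, -5],
                       'CNT_CREDIT_PROLONG': [0, 2, 0]})
train_toy = pd.DataFrame({'SK_ID_CURR': [1, 2, 3]})
out = add_loan_history(train_toy, lb_toy)
print(out)
```

The same call would be made once for train and once for test, so both tables get identical columns.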

 

- Removing variables

The goal of this project is model interpretation, so every variable that gets in the way of interpretation is removed

Variables removed

  • CODE_GENDER : categorical variable
  • FLAG_OWN_CAR : categorical variable
  • NAME_TYPE_SUITE : categorical variable
  • NAME_INCOME_TYPE : categorical variable
  • NAME_EDUCATION_TYPE : categorical variable
  • NAME_HOUSING_TYPE : categorical variable
  • ORGANIZATION_TYPE : categorical variable
  • EXT_SOURCE_1 : exact meaning of the variable is unknown
  • EXT_SOURCE_2 : exact meaning of the variable is unknown
  • EXT_SOURCE_3 : exact meaning of the variable is unknown
del_list = ['CODE_GENDER','FLAG_OWN_CAR','NAME_TYPE_SUITE','NAME_INCOME_TYPE','NAME_EDUCATION_TYPE','NAME_HOUSING_TYPE','ORGANIZATION_TYPE',
'EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3']
train = train.drop(del_list,axis=1)
test = test.drop(del_list,axis=1)
train.columns

 

(2) Modeling

- Drop input variables that are highly correlated with each other.

: When input variables are highly correlated, SHAP values are hard to interpret, because the attribution can be split between the correlated variables.

input_var = ['FLAG_OWN_REALTY', 'CNT_CHILDREN',
       'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY',
       'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED',
       'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'CNT_FAM_MEMBERS',
       'HOUR_APPR_PROCESS_START', 'DAYS_LAST_PHONE_CHANGE',
       'AMT_REQ_CREDIT_BUREAU_YEAR', 'AMT_CREDIT_TO_ANNUITY_RATIO',
       'AMT_CREDIT_SUM', 'DAYS_CREDIT', 'CNT_CREDIT_PROLONG', 'count']

Store every variable except the target variable TARGET in input_var.

 

corr = train[input_var].corr()
corr.style.background_gradient(cmap='coolwarm')

This draws the correlation matrix as a color-graded table. The highly correlated variable pairs are listed below.

[ Highly correlated variable pairs ]

  • CNT_FAM_MEMBERS & CNT_CHILDREN 0.883051
  • AMT_CREDIT_TO_ANNUITY_RATIO & AMT_CREDIT 0.656337
  • AMT_ANNUITY & AMT_CREDIT 0.770938

cf) Interpreting the Pearson correlation coefficient

r between -1.0 and -0.7: strong negative linear relationship,

r between -0.7 and -0.3: clear negative linear relationship,

r between -0.3 and -0.1: weak negative linear relationship,

r between -0.1 and +0.1: negligible linear relationship,

r between +0.1 and +0.3: weak positive linear relationship,

r between +0.3 and +0.7: clear positive linear relationship,

r between +0.7 and +1.0: strong positive linear relationship
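
The scale above can be turned into a small lookup function. This is only a sketch: the function name `describe_pearson` is made up, and because the scale does not say which bucket a boundary value such as 0.3 belongs to, the code assumes each boundary goes to the stronger bucket.

```python
def describe_pearson(r):
    """Map a Pearson coefficient to the verbal scale listed above."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie in [-1, 1]")
    strength = abs(r)
    if strength < 0.1:
        return "negligible linear relationship"
    sign = "positive" if r > 0 else "negative"
    if strength < 0.3:
        label = "weak"
    elif strength < 0.7:
        label = "clear"
    else:
        label = "strong"
    return f"{label} {sign} linear relationship"

# The three pairs listed above:
for pair, r in [("CNT_FAM_MEMBERS & CNT_CHILDREN", 0.883051),
                ("AMT_CREDIT_TO_ANNUITY_RATIO & AMT_CREDIT", 0.656337),
                ("AMT_ANNUITY & AMT_CREDIT", 0.770938)]:
    print(pair, "->", describe_pearson(r))
```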


From each pair, remove the variable whose correlation with the target variable is weaker.

print(train['CNT_FAM_MEMBERS'].corr(train['TARGET']))
print(train['CNT_CHILDREN'].corr(train['TARGET']))

0.018876651698723705

0.025357359317615676

del train['CNT_FAM_MEMBERS']
del test['CNT_FAM_MEMBERS']

CNT_FAM_MEMBERS has the lower correlation with TARGET, so it is removed.

print(train['AMT_CREDIT_TO_ANNUITY_RATIO'].corr(train['TARGET']))
print(train['AMT_CREDIT'].corr(train['TARGET']))

-0.024740288335190132

-0.02255843084934759

del train['AMT_CREDIT']
del test['AMT_CREDIT']

AMT_CREDIT has the weaker correlation with TARGET (smaller absolute value), so it is removed. This also resolves the AMT_ANNUITY & AMT_CREDIT pair.

input_var = ['FLAG_OWN_REALTY', 'CNT_CHILDREN',
       'AMT_INCOME_TOTAL', 'AMT_ANNUITY', 'REGION_POPULATION_RELATIVE',
       'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE',
       'HOUR_APPR_PROCESS_START', 'DAYS_LAST_PHONE_CHANGE',
       'AMT_REQ_CREDIT_BUREAU_YEAR', 'AMT_CREDIT_TO_ANNUITY_RATIO',
       'AMT_CREDIT_SUM', 'DAYS_CREDIT', 'CNT_CREDIT_PROLONG', 'count']

Store the remaining variables, without the removed ones, in input_var again.
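
The rule applied above (from each highly correlated pair, drop the variable that is less correlated with the target) can be sketched as a helper. The name `weaker_of_pair` and the toy data are assumptions for illustration; the comparison uses absolute values so that negative correlations are ranked by strength.

```python
import pandas as pd

def weaker_of_pair(df, a, b, target='TARGET'):
    """Return whichever of a, b has the smaller absolute correlation with target."""
    corr_a = abs(df[a].corr(df[target]))
    corr_b = abs(df[b].corr(df[target]))
    return a if corr_a < corr_b else b

# Made-up data: x1 tracks the target, x2 is unrelated to it.
toy = pd.DataFrame({'x1': [1, 2, 3, 4, 5],
                    'x2': [5, 1, 4, 2, 3],
                    'TARGET': [0, 0, 1, 1, 1]})
to_drop = weaker_of_pair(toy, 'x1', 'x2')
print(to_drop)
```

With the real data the call would look like `weaker_of_pair(train, 'CNT_FAM_MEMBERS', 'CNT_CHILDREN')`.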

 

- xgboost modeling

: The TreeExplainer used below to compute SHAP values requires a tree-based model (for example a random forest or gradient-boosted trees). Among these, xgboost was chosen because it is fast while keeping high performance.

from xgboost import XGBClassifier
model = XGBClassifier(n_estimators=100, learning_rate=0.1)
model.fit(train[input_var],train['TARGET'])

 

 

(3) shap value 

import shap
shap_values = shap.TreeExplainer(model).shap_values(train[input_var])
shap.summary_plot(shap_values, train[input_var], plot_type='bar')

 

Top 5 variables with the largest effect on the target

  • AMT_CREDIT_TO_ANNUITY_RATIO
  • DAYS_EMPLOYED
  • DAYS_CREDIT
  • DAYS_BIRTH
  • DAYS_LAST_PHONE_CHANGE
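
The bar-type summary plot ranks features by their mean absolute SHAP value. A minimal sketch of that ranking with numpy only, assuming a made-up SHAP matrix (rows are samples, columns are features); `rank_by_mean_abs_shap` is a hypothetical helper, not a shap function.

```python
import numpy as np

def rank_by_mean_abs_shap(shap_values, feature_names):
    """Return (feature, mean |SHAP|) pairs sorted from most to least influential."""
    importance = np.abs(np.asarray(shap_values)).mean(axis=0)
    order = np.argsort(importance)[::-1]
    return [(feature_names[i], float(importance[i])) for i in order]

# Made-up SHAP values for two features and three samples.
toy_shap = np.array([[0.5, -0.1],
                     [-0.3, 0.1],
                     [0.4, 0.0]])
ranking = rank_by_mean_abs_shap(toy_shap, ['AMT_CREDIT_TO_ANNUITY_RATIO', 'CNT_CHILDREN'])
print(ranking)
```

With the real model, `shap_values` and `input_var` from the code above would be passed in instead of the toy values.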

 

(4) Relationship between the 5 predictors and the target variable (loan repayment)

- 1. AMT_CREDIT_TO_ANNUITY_RATIO: loan repayment term

shap.dependence_plot('AMT_CREDIT_TO_ANNUITY_RATIO', shap_values, train[input_var])

In this plot, the lower the value on the vertical axis (the SHAP value), the better the loan is repaid (higher probability that TARGET is 0). Repayment is poor when the term is 12-20 months and comparatively good at 12 months or less and at 20 months or more.

 

 

- 2. DAYS_EMPLOYED: when the customer started their job

shap.dependence_plot('DAYS_EMPLOYED', shap_values, train[input_var])

Repayment ability rises sharply for customers who started their job more than 9000 days before the loan date.

 

 

- 3. DAYS_CREDIT: how many days before the Home Credit loan the earlier loan was taken

shap.dependence_plot('DAYS_CREDIT', shap_values, train[input_var])

Repayment ability rises from -3000 days to -2000 days and falls after that. In other words, repayment ability is lower when the earlier loan was taken either very long ago or recently.

 

 

- 4. DAYS_BIRTH: age

shap.dependence_plot('DAYS_BIRTH', shap_values, train[input_var])

The longer ago the customer was born (the older they are), the better they tend to repay.

 

 

- 5. DAYS_LAST_PHONE_CHANGE: when the customer last changed phones

shap.dependence_plot('DAYS_LAST_PHONE_CHANGE', shap_values, train[input_var])

The longer ago the customer changed phones, the better they tend to repay.

 

 


3. Conclusion

  • The loan repayment term has the largest effect on repayment. The relationship is non-linear. (A large effect does not by itself establish causation.)
  • Home ownership and the number of children have almost no effect on repayment ability.
  • The more recently the customer started a job, took a previous loan, or changed phones, and the younger they are, the less likely the loan is to be repaid.
train['DAYS_EMPLOYED'].quantile(0.75)

-748.0

This gives the 75th percentile, the cutoff for the top 25%. Using it, split the data into a group above the top-25% cutoff and a group below the bottom-25% cutoff on all 4 variables, and check the result visually.

 

- Top 25%

group1 = train.loc[ (train['DAYS_EMPLOYED'].quantile(0.75)< train['DAYS_EMPLOYED']) &
           (train['DAYS_CREDIT'].quantile(0.75)< train['DAYS_CREDIT']) &
           (train['DAYS_LAST_PHONE_CHANGE'].quantile(0.75)< train['DAYS_LAST_PHONE_CHANGE']) &
           (train['DAYS_BIRTH'].quantile(0.75)< train['DAYS_BIRTH']) ]

- Bottom 25%

group2 = train.loc[ (train['DAYS_EMPLOYED'].quantile(0.25)> train['DAYS_EMPLOYED']) &
           (train['DAYS_CREDIT'].quantile(0.25)> train['DAYS_CREDIT']) &
           (train['DAYS_LAST_PHONE_CHANGE'].quantile(0.25)> train['DAYS_LAST_PHONE_CHANGE']) &
           (train['DAYS_BIRTH'].quantile(0.25)> train['DAYS_BIRTH']) ]
group1 = group1.assign(group=1)  # assign returns a new frame, so the slice of train is not modified in place
group2 = group2.assign(group=0)

group1 gets 1 in the group variable and group2 gets 0.

full = pd.concat([group1,group2],axis=0)

Concatenate group1 and group2.

import seaborn as sns
sns.barplot(x='group', y='TARGET', data=full)  # bar height is the mean TARGET per group

group2 (group=0, bottom 25%) has the lower mean TARGET (more 0s = normal repayment). This agrees with the conclusion that the smaller the value of each variable, the more likely the loan is to be repaid.
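
The two filters above repeat the same condition for each variable, so they can be built in a loop. A sketch under assumptions: `extreme_groups` is a hypothetical helper and the frame below is made up; only the quantile logic mirrors the code above.

```python
import pandas as pd

def extreme_groups(df, cols, low=0.25, high=0.75):
    """Rows above the high quantile on every column (group 1) stacked with
    rows below the low quantile on every column (group 0)."""
    top = pd.Series(True, index=df.index)
    bottom = pd.Series(True, index=df.index)
    for col in cols:
        top &= df[col] > df[col].quantile(high)
        bottom &= df[col] < df[col].quantile(low)
    return pd.concat([df[top].assign(group=1), df[bottom].assign(group=0)], axis=0)

# Made-up data: larger a and b go with TARGET = 1.
toy = pd.DataFrame({'a': range(8), 'b': range(8),
                    'TARGET': [0, 0, 0, 0, 1, 0, 1, 1]})
full_toy = extreme_groups(toy, ['a', 'b'])
rates = full_toy.groupby('group')['TARGET'].mean()
print(rates)
```

The per-group mean printed here is the same quantity the bar plot shows as bar height.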

 

Learning Spoons class notes

 

<Previous post>

https://silvercoding.tistory.com/70

 

[FIFA DATA] Which players should Manchester United sign for the 2019/2020 season?, EDA process


 

 


1. Data introduction & loading the data

<Rossmann Store Sales> 

https://www.kaggle.com/c/rossmann-store-sales/data?select=test.csv 

 


This is the Rossmann data used in the Kaggle competition at the link above.

  • train.csv - historical data including Sales
  • test.csv - historical data excluding Sales
  • sample_submission.csv - a sample submission file in the correct format
  • store.csv - supplemental information about the stores

 

This post uses a reduced version of the data to predict store sales.

(Data: provided by Learning Spoons)

 

import os
import pandas as pd
os.chdir('../data')
train = pd.read_csv("lspoons_train.csv")
test = pd.read_csv("lspoons_test.csv")
store = pd.read_csv("store.csv")

lspoons_train.csv - training data
lspoons_test.csv - test data to predict

store.csv - supplementary data with information about the stores

 

 

train.head()


Column information

  • id
  • Store: id of each store
  • Date: date
  • Sales: sales on that date
  • Promo: whether a sales promotion was running
  • StateHoliday: public holiday flag; 0 for no holiday, otherwise the type of holiday (a, b, c)
  • SchoolHoliday: whether it is a school holiday

We build a model that predicts Sales from the columns above.

 

 

 

 

 


- Analysis plan

1. Base modeling ( feature engineering - variable selection - modeling )

2. Second-round modeling ( merge store data - feature engineering - variable selection - modeling )

3. Parameter tuning

... repeat modeling ( further modeling is left as self-study, to be organized on GitHub )

 


1. Base modeling

: Build the most basic model. (missing-value handling, one-hot encoding)


What is feature engineering?

  • Creating new input variables from the existing input variables for prediction
  • A way to improve the predictive performance of a machine learning model

train.info()

There are no missing values. The Date and StateHoliday columns are of type object, so they are preprocessed next.

 

- StateHoliday column one-hot encoding 

train = pd.get_dummies(columns=['StateHoliday'],data=train)
test = pd.get_dummies(columns=['StateHoliday'],data=test)

One-hot encode the StateHoliday column with the get_dummies function.

print("train_columns: ", train.columns, end="\n\n\n")
print("test_columns: ", test.columns)

Looking at the new columns, train has StateHoliday_b and StateHoliday_c but test does not. A model trained on columns that are missing from test cannot predict on test, so this has to be fixed.

test['StateHoliday_b'] = 0
test['StateHoliday_c'] = 0

So the same columns are created in the test dataset, filled with 0.
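
When more dummy columns are involved, adding them by hand gets error-prone. One possible alternative, sketched on made-up frames, is to reindex the test dummies to the train columns. In the real data train also contains Sales, so the reindex would have to be restricted to the feature columns; that restriction is left out here.

```python
import pandas as pd

train_toy = pd.DataFrame({'StateHoliday': ['0', 'a', 'b', 'c']})
test_toy = pd.DataFrame({'StateHoliday': ['0', 'a']})

train_d = pd.get_dummies(train_toy, columns=['StateHoliday'])
test_d = pd.get_dummies(test_toy, columns=['StateHoliday'])

# Add every dummy column seen in train but absent from test, filled with 0,
# and put the columns in the same order as train.
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
print(test_d.columns.tolist())
```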

 

- feature engineering using Date column

train['Date']

The Date column looks like dates, but its dtype is object, so pandas does not treat the values as dates.

train['Date'] = pd.to_datetime( train['Date'] )
test['Date'] = pd.to_datetime( test['Date'] )

So convert it to a datetime variable with to_datetime, which makes date calculations in pandas convenient.

 

 

# Create the weekday column

train['weekday'] = train['Date'].dt.weekday
test['weekday'] = test['Date'].dt.weekday

# Create the year column

train['year'] = train['Date'].dt.year
test['year'] = test['Date'].dt.year

# Create the month column

train['month'] = train['Date'].dt.month
test['month'] = test['Date'].dt.month

 

 

- Baseline modeling

from xgboost import XGBRegressor
train.columns

xgb = XGBRegressor( n_estimators= 300 , learning_rate=0.1 , random_state=2020 )
xgb.fit(train[['Promo','SchoolHoliday','StateHoliday_0','StateHoliday_a','StateHoliday_b','StateHoliday_c','weekday','year','month']],
        train['Sales'])

 

Train the XGB model.

 

from sklearn.model_selection import cross_val_score
cross_val_score(xgb, train[['Promo','SchoolHoliday','StateHoliday_0','StateHoliday_a','StateHoliday_b','StateHoliday_c','weekday','year','month']], train['Sales'], scoring="neg_mean_squared_error", cv=3)

Cross validation gives the error shown above (the scores are negative mean squared errors, one per fold). Let's reduce the error with further work!
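
With `scoring="neg_mean_squared_error"` the scores are negated, so values closer to zero are better. Taking the square root of the negated scores gives an RMSE in the same unit as Sales, which is easier to read. A small sketch; the fold scores below are hypothetical numbers, not results from this post.

```python
import numpy as np

def rmse_from_neg_mse(scores):
    """Convert neg_mean_squared_error fold scores into one RMSE per fold."""
    return np.sqrt(-np.asarray(scores, dtype=float))

fold_scores = [-4.0, -9.0, -16.0]  # hypothetical values for illustration
rmse = rmse_from_neg_mse(fold_scores)
print(rmse, rmse.mean())
```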

 

 

cf.  Creating the Kaggle submission file

test['Sales'] = xgb.predict(test[['Promo','SchoolHoliday','StateHoliday_0','StateHoliday_a','StateHoliday_b','StateHoliday_c','weekday','year','month']])

Feed the test dataset to the trained model to get predictions.

test[['id','Sales']].to_csv("submission.csv",index=False)

 

- Variable selection

xgb.feature_importances_

feature_importances_ gives the importance of each variable.

input_var = ['Promo','SchoolHoliday','StateHoliday_0','StateHoliday_a','StateHoliday_b','StateHoliday_c','weekday','year','month']

Store the input variables, without Sales, in input_var.

imp_df = pd.DataFrame({"var": input_var,
                       "imp": xgb.feature_importances_})
imp_df = imp_df.sort_values(['imp'],ascending=False)
imp_df

Build a dataframe of variable importances and sort it in descending order. Promo has by far the highest importance. The StateHoliday columns are generally low.

import matplotlib.pyplot as plt
plt.bar(imp_df['var'],imp_df['imp'])
plt.xticks(rotation=90)
plt.show()

Plotted as a bar chart for an overview, the columns after SchoolHoliday look like they carry little information.

cross_val_score(xgb, train[['Promo', 'weekday', 'month','year', 'SchoolHoliday']], train['Sales'], scoring="neg_mean_squared_error", cv=3)

The error is lower than when all columns were used. Next, test how many columns give the lowest error.

import numpy as np
score_list=[]
selected_varnum=[]
for i in range(1,10):
    selected_var = imp_df['var'].iloc[:i].to_list()
    scores = cross_val_score(xgb, 
                             train[selected_var], 
                             train['Sales'], 
                             scoring="neg_mean_squared_error", cv=3)
    score_list.append(-np.mean(scores))
    selected_varnum.append(i)
    print(i)
plt.plot(selected_varnum, score_list)

 

Running cross validation for each number of variables shows that the error is lowest with 2 variables.
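
The loop above can be separated from the model so that the selection logic is easy to check on its own. A minimal sketch: `best_top_k` is a hypothetical helper, and the error values are made up. In the post, `score_fn` would wrap the same `cross_val_score` call as the loop above, as shown in the comment.

```python
def best_top_k(ranked_vars, score_fn):
    """Score the top 1..n ranked variables and return (best_k, errors)."""
    errors = [score_fn(ranked_vars[:k]) for k in range(1, len(ranked_vars) + 1)]
    best_k = min(range(len(errors)), key=errors.__getitem__) + 1
    return best_k, errors

# In the post this could be (assuming xgb, train and np from the code above):
# score_fn = lambda cols: -np.mean(cross_val_score(
#     xgb, train[cols], train['Sales'], scoring="neg_mean_squared_error", cv=3))

# Made-up errors keyed by the number of variables, for illustration only.
fake_errors = {1: 10.0, 2: 7.5, 3: 8.0}
best_k, errors = best_top_k(['Promo', 'weekday', 'month'],
                            lambda cols: fake_errors[len(cols)])
print(best_k, errors)
```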

 

Run cross validation with the 2 predictors.

cross_val_score(xgb, train[['Promo', 'weekday']], train['Sales'], scoring="neg_mean_squared_error", cv=3)

The error went down in every fold except the second. After training the model on the 2 predictors, the Kaggle score for the test-data submission also went down. (Repeated work, so omitted from this post)

 

 

 

 

 


2. Second-round modeling

- Merging the store data

store


store dataset: a summary of the characteristics of each store

Column meanings

  • Store: unique id of the store
  • StoreType: type of store
  • Assortment: assortment level of the store
  • CompetitionDistance: distance to the nearest competitor store
  • CompetitionOpenSinceMonth: month the nearest competitor opened
  • CompetitionOpenSinceYear: year the nearest competitor opened
  • Promo2: whether the store takes part in a continuing (recurring) promotion
  • Promo2SinceWeek / Promo2SinceYear: when the store started Promo2, if it takes part
  • PromoInterval: the interval at which Promo2 recurs

train = pd.merge(train, store, on=['Store'], how='left')
test = pd.merge(test, store, on=['Store'], how='left')

Merge the store dataset into the train and test datasets on the Store column.

 

 

- Create the CompetitionOpen column

: How long the competitor has been open, in months (opened before the date of the record: positive, opened after: negative)

train['CompetitionOpen'] = 12*( train['year'] - train['CompetitionOpenSinceYear'] ) + \
                             (train['month'] - train['CompetitionOpenSinceMonth'])

test['CompetitionOpen'] = 12*( test['year'] - test['CompetitionOpenSinceYear'] ) + \
                             (test['month'] - test['CompetitionOpenSinceMonth'])

Subtracting the competitor's opening year from the year of the record (the year column derived from Date) and multiplying by 12 converts the difference to months. Adding the difference between the record's month and the competitor's opening month gives the number of months the competitor had been open as of that record.
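
The arithmetic behind CompetitionOpen can be checked on single values. A sketch with a hypothetical helper name `months_since`; the dates are made up.

```python
def months_since(year, month, since_year, since_month):
    """Months elapsed from (since_year, since_month) to (year, month).

    Negative when the reference month lies in the future."""
    return 12 * (year - since_year) + (month - since_month)

# A record from July 2015 and a competitor that opened in September 2013:
print(months_since(2015, 7, 2013, 9))   # 12 * 2 + (7 - 9) = 22
# A competitor that opens five months after the record:
print(months_since(2014, 1, 2014, 6))   # -5
```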

 

 

- Create the PromoOpen column

: How many months Promo2 had been running as of the record date

train['WeekOfYear'] = train['Date'].dt.isocalendar().week.astype(int)  # ISO week number of the record date
test['WeekOfYear'] = test['Date'].dt.isocalendar().week.astype(int)

The Promo2 start is given as a year (Year) and a week (Week), so the week number of each date in the Date column is computed and stored in the WeekOfYear column.

train['PromoOpen'] = 12* ( train['year'] - train['Promo2SinceYear'] ) + \
                        (train['WeekOfYear'] - train['Promo2SinceWeek']) / 4

test['PromoOpen'] = 12* ( test['year'] - test['Promo2SinceYear'] ) + \
                        (test['WeekOfYear'] - test['Promo2SinceWeek']) / 4

As before, the difference in years is converted to months, and the difference in weeks is divided by 4 to approximate months. Adding the two gives the number of months since Promo2 started.
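
A quick check of the week number and the PromoOpen formula on made-up dates. The helper `promo_open_months` is hypothetical and simply repeats the formula above. Note that dividing by 4 is an approximation, since a month is a little over 4 weeks long.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['2015-01-05', '2015-07-31']))
# ISO week number of each date, as a plain integer column.
week = dates.dt.isocalendar().week.astype(int)
print(week.tolist())

def promo_open_months(year, week_of_year, since_year, since_week):
    """Approximate months since Promo2 started (4 weeks counted as one month)."""
    return 12 * (year - since_year) + (week_of_year - since_week) / 4

# Record in ISO week 31 of 2015, Promo2 started in week 10 of 2014.
print(promo_open_months(2015, 31, 2014, 10))
```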

 

 

- One-hot encoding ( get_dummies() )

train.dtypes

Checking the data types shows 3 columns of type object. One-hot encode these 3 columns with get_dummies.

train = pd.get_dummies(columns=['StoreType'],data=train)
test = pd.get_dummies(columns=['StoreType'],data=test)
train = pd.get_dummies(columns=['Assortment'],data=train)
test = pd.get_dummies(columns=['Assortment'],data=test)
train = pd.get_dummies(columns=['PromoInterval'],data=train)
test = pd.get_dummies(columns=['PromoInterval'],data=test)
train.columns

test.columns

Confirmed that the train columns and the test columns match.

 

 

 

- Modeling

input_var = ['Promo', 'SchoolHoliday',
       'StateHoliday_0', 'StateHoliday_a', 'StateHoliday_b', 'StateHoliday_c',
       'weekday', 'year', 'month', 'CompetitionDistance',
       'Promo2',
       'CompetitionOpen', 'WeekOfYear',
       'PromoOpen', 'StoreType_a', 'StoreType_b', 'StoreType_c', 'StoreType_d',
       'Assortment_a', 'Assortment_b', 'Assortment_c',
       'PromoInterval_Feb,May,Aug,Nov', 'PromoInterval_Jan,Apr,Jul,Oct',
       'PromoInterval_Mar,Jun,Sept,Dec']

Leave out the columns that are not needed and store the rest in input_var.

set(train) - set(input_var)

(Note) These are the columns that did not go into input_var.

xgb = XGBRegressor( n_estimators=300, learning_rate= 0.1, random_state=2020)
xgb.fit(train[input_var],train['Sales'])

Use the same xgb model as before.

cross_val_score(xgb, train[input_var], train['Sales'], scoring="neg_mean_squared_error", cv=3)

After merging the store dataset, preprocessing, and modeling, the error dropped substantially.

 

 

- Variable importance

imp_df = pd.DataFrame({'var':input_var,
                       'imp':xgb.feature_importances_})
imp_df = imp_df.sort_values(['imp'],ascending=False)
plt.bar(imp_df['var'],
        imp_df['imp'])
plt.xticks(rotation=90)
plt.show()

The variable importance plot suggests that training on a selected subset of variables would be better than using all of them.

score_list=[]
selected_varnum=[]
for i in range(1,25):
    selected_var = imp_df['var'].iloc[:i].to_list()
    scores = cross_val_score(xgb, 
                             train[selected_var], 
                             train['Sales'], 
                             scoring="neg_mean_squared_error", cv=3)
    score_list.append(-np.mean(scores))
    selected_varnum.append(i)
    print(i)
plt.plot(selected_varnum, score_list)

The error keeps trending down but looks roughly flat after 17 variables, so the top 17 are selected for training.

input_var = imp_df['var'].iloc[:17].tolist()
xgb.fit(train[input_var],
        train['Sales'])
cross_val_score(xgb, train[input_var], train['Sales'], scoring="neg_mean_squared_error", cv=3)

The error went down overall.

 

 

 

 

 

 


3. Parameter tuning

estim_list = [100,200,300,400,500,600,700,800,900]
score_list = []
for i in estim_list:
    xgb = XGBRegressor( n_estimators=i, learning_rate= 0.1, random_state=2020)
    scores = cross_val_score(xgb, train[input_var], train['Sales'], scoring="neg_mean_squared_error", cv=3)
    score_list.append(-np.mean(scores))
    print(i)
plt.plot(estim_list,score_list)
plt.xticks(rotation=90)
plt.show()

The plot shows the error for each value of n_estimators, and n_estimators=400 looks like a reasonable choice.

xgb = XGBRegressor( n_estimators=400, learning_rate= 0.1, random_state=2020)
xgb.fit(train[input_var],
        train['Sales'])
cross_val_score(xgb, train[input_var], train['Sales'], scoring="neg_mean_squared_error", cv=3)

With n_estimators changed to 400, the error went down.

 

Unfortunately, after parameter tuning the error on the Kaggle test dataset came out higher. Further feature engineering, including handling of missing values and outliers, still needs to be tried. (To be uploaded to GitHub later)


 

Learning Spoons class notes

 

 

< Previous post >

https://silvercoding.tistory.com/69

 

[Machine learning] Variable importance, shap value


 

 


After manager Alex Ferguson retired from Manchester United in 2013, the team went into decline. When manager Solskjaer took over, the club signed two players in the winter transfer window of the 2019/2020 season and, as of March 2020, was able to reverse the decline.

If the releases and signings were instead decided by analyzing player data, what would the result be?


 

 

Data: FIFA data (provided in the Learning Spoons course)


1. Loading the data

import pandas as pd
import warnings 

warnings.filterwarnings(action='ignore')  # suppress warnings
data = pd.read_csv("./data/FIFA_data.csv")
pd.set_option('display.max_columns', 80)

When there are many columns the output is abbreviated with ..., so the display limit is set to 80, the number of columns in the data.

data.head()

Now every column can be inspected.

 

 

2. Checking the data, analysis plan

Meaning of each column

  • ID: unique number
  • Name: name
  • Age: age
  • Overall: current rating
  • Potential: potential rating
  • Club: current team
  • Value: estimated transfer fee (euros)
  • Wage: weekly wage (euros)
  • Preferred Foot: stronger foot
  • Weak Foot: weaker foot
  • Skill Moves: individual skill
  • Position: position
  • Jersey Number: shirt number
  • Joined: date the player joined the current team
  • Contract Valid Until: contract period
  • Height: height (feet)
  • Weight: weight (pounds)
  • LS ~ RB: rating per position
  • Crossing ~ GKReflexes: detailed ability ratings
  • Release Clause: buyout clause

 

Analysis plan

1. Analyze the Manchester United players (which players are there?)

2. Compare with the players of local rival Manchester City

3. Choose the 2 weakest positions

4. Choose 2 players to sign from other teams (considering finances, feasibility, and transfer policy)

 

 

 

 

 


3. Analyzing the Manchester United players

(1) EDA

- Extracting the Manchester United players

mu = data[data['Club'] == 'Manchester United']
mu.head()

Select only the rows whose Club is Manchester United and store them in mu.

mu['Club'].unique()

Checking with unique() confirms that only Manchester United was selected.

 

 

 

- Printing a short summary of the Manchester United players

print(f"Players: {mu.shape[0]}")
print(f"Manchester United positions: {mu['Position'].unique()}")
print(f"Mean Overall: {mu['Overall'].mean()}")
print(f"Mean Potential: {mu['Potential'].mean()}")

 

 

- Visualization

import seaborn as sns 
sns.countplot(x=mu['Age'])

This is the age distribution of the players. Age 19 is the most common, followed by 25, 28 and 22.

sns.countplot(x=mu['Position'])

The most common positions are CM and CB.

sns.boxplot(data=mu, x='Position', y='Overall')

The boxplot of Overall by Position reveals an outlier in the CB position.

 

 

* Handling outliers & missing values


Outliers

  • Values far outside the normal range
  • Including outliers in the analysis can distort the results

Missing values

  • Omitted or empty values
  • Values that were not recorded, or were lost, when the data was collected

Ways to handle outliers and missing values

  • Removal: drop the rows or columns that contain outliers or missing values. (A last resort, because every data point is valuable)
  • Replacement: replace outliers and missing values with the column's maximum, mean, median, etc. (Not a recommended method.)
  • Prediction: fill in a predicted value that takes the characteristics of the column into account (recommended)
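
A boxplot by default marks points that lie more than 1.5 times the interquartile range beyond the quartiles. The same rule can be applied directly to find candidate outliers. A sketch on made-up ratings; `iqr_outliers` is a hypothetical helper and the numbers are not taken from the FIFA data.

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

overall = pd.Series([75, 77, 78, 80, 750])  # made-up ratings with one entry error
mask = iqr_outliers(overall)
print(overall[mask])
```

The flagged rows are only candidates. Whether to remove, replace, or predict them is still decided as described in the list above.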

mu[mu['Overall']>100]

Check the rows whose Overall is above 100.

 

 

Outlier handling - using prediction

mu[mu['Position'] == 'CB'][['Position', 'Overall', 'CB']]

Compare players in the same position. Players with similar CB ratings have the same Overall. The player with the outlier has the same CB rating as the player at index 11081, so the value can be predicted as 75.

mu.loc[11422, 'Overall'] = 75

Set the Overall of the player at index 11422 to 75.

sns.boxplot(data=mu, x='Position', y='Overall')

Drawing the boxplot again shows no outlier.

sns.boxplot(data=mu, x='Position', y='Potential')

Draw the boxplot for Potential as well. Potential shows no outliers.

 

 

 

mu.info()

mu has 33 rows in total, and columns 19 to 44 each turn out to have 3 missing values.

mu[mu.isnull()['LS']]

Only the players whose position is GK seem to have missing values. GK is the goalkeeper, and since goalkeepers do not need ratings for other positions, the values were presumably left missing.

mu = mu.fillna(-1)

Fill the missing values with -1. (An arbitrary value meaning the rating cannot be measured; another value would also work.)

mu.info()

All missing values have been filled.

 

 

 

 

 


4. Manchester United vs Manchester City 

(1) Preprocessing

df = data[(data['Club'] == 'Manchester United') | (data['Club']=='Manchester City')].copy()

Select only Manchester United and Manchester City and store them in df (as a copy, so the column edits below do not touch data).

df['Club'].unique()

df['Value'].head()

The transfer value column Value is written with symbols, so remove the symbols and the decimal point.

df['Value'] = df['Value'].str.replace('M', '000000')
df['Value'] = df['Value'].str.replace('K', '000')

Where an M appears, append six zeros; where a K appears, append three zeros.

df['Value']

df['Value'] = df['Value'].str.slice(1,)

Next, use str.slice to remove the leading symbol.

df['Value'].iloc[3]

'64.5000000'

Some values contain a decimal point like this, so remove the point and delete one trailing 0.

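# Only touch the values that actually contain a decimal point;
# the original loop trimmed a digit from every row, once per decimal value found
has_dot = df['Value'].str.contains('.', regex=False)
df.loc[has_dot, 'Value'] = df.loc[has_dot, 'Value'].str.replace('.', '', regex=False).str.slice(0, -1)
df['Value']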

The change has been applied correctly.

df['Value'] = df['Value'].astype('int')

Now convert the data type from object to int.
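
As an alternative to the string replacements above, the conversion can be sketched as a small parsing function. This is a hedged sketch, not the method used in the class: the function name `parse_value` is hypothetical, and it assumes the leading symbol is "€" and the suffix is either M or K.

```python
def parse_value(text):
    """Convert strings such as '€64.5M' or '€600K' to an integer amount."""
    multipliers = {"M": 1_000_000, "K": 1_000}
    number = text.lstrip("€")          # assumption: the currency symbol is the euro sign
    factor = 1
    if number and number[-1] in multipliers:
        factor = multipliers[number[-1]]
        number = number[:-1]
    return int(round(float(number) * factor))

print(parse_value("€64.5M"))
print(parse_value("€600K"))
print(parse_value("€0"))
```

With pandas this could be applied as df['Value'].map(parse_value), which also handles values with more than one decimal digit.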

df.head()

 

 

 

- Separating the mu and mc players

mu = df[df['Club'] == "Manchester United"].copy()  # copies, because Position is overwritten later
mc = df[df['Club'] == "Manchester City"].copy()

Split the Manchester United and Manchester City players out of df.

mc.head()

df['Position'].unique()

The positions above are grouped into four categories for the analysis: goalkeeper, defender, midfielder, and forward. The grouping is as follows.


  • Goalkeeper list GK = GK (goalkeeper)
  • Defender list CB = CB (center back), LB (left back), RB (right back), RCB (right center back), LCB (left center back)
  • Midfielder list MF = RCM (right center midfielder), LCM (left center midfielder), RDM (right defensive midfielder), CDM (central defensive midfielder), CM (central midfielder), RM (right midfielder), CAM (central attacking midfielder)
  • Forward list ST = ST (striker), LW (left winger), RW (right winger)

* Starting lineup: GK (goalkeeper): 1, CB (defenders): 4, MF (midfielders): 4, ST (forwards): 2

-> Players are selected by current ability (the Overall column).

 

gk_list = ['GK']
cb_list = ['CB', 'LCB', 'RCB', 'RB', 'LB']
mf_list = ['RCM', 'LCM', 'RDM', 'CDM', 'CM', 'RM', 'CAM']
st_list = ['ST', 'LW', 'RW']

Write the lists according to the position grouping.

 

gk_count = 1
cb_count = 4
mf_count = 4
st_count = 2



mu_id = []

for index in mu.index:
    position = mu.loc[index, 'Position']
    if position in gk_list:
        if gk_count != 0:
            mu_id.append(mu.loc[index, 'ID'])
            gk_count -= 1
    elif position in cb_list:
        if cb_count != 0:
            mu.loc[index, 'Position'] = 'CB'   # relabel with the group name
            mu_id.append(mu.loc[index, 'ID'])
            cb_count -= 1
    elif position in mf_list:
        if mf_count != 0:
            mu.loc[index, 'Position'] = 'MF'
            mu_id.append(mu.loc[index, 'ID'])
            mf_count -= 1
    else:
        # any position outside the three lists above is treated as a forward
        if st_count != 0:
            mu.loc[index, 'Position'] = 'ST'
            mu_id.append(mu.loc[index, 'ID'])
            st_count -= 1

Because the data is sorted by Overall in descending order, walking through it in order puts the IDs of the best players for each position group into the list.

mu[mu['ID'].isin(mu_id)]

The expected 11 players are returned.

mu = mu[mu['ID'].isin(mu_id)]

Keep only the 11 selected players in the mu variable.
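
The same "top N per position group" selection can also be sketched without an explicit loop. This is a hedged alternative under stated assumptions: the `squad` table, the `group_of` mapping, and the `quota` numbers are made up for illustration, and the data is assumed to be sorted by Overall in descending order as in this post.

```python
import pandas as pd

# Hypothetical squad, already sorted by Overall (descending).
squad = pd.DataFrame({
    "ID":       [1, 2, 3, 4, 5, 6],
    "Position": ["GK", "RCB", "ST", "GK", "CB", "LB"],
    "Overall":  [90, 88, 87, 80, 79, 78],
})

group_of = {"GK": "GK", "CB": "CB", "RCB": "CB", "LB": "CB", "ST": "ST"}
quota = {"GK": 1, "CB": 2, "ST": 1}   # made-up quotas for the toy data

squad["Group"] = squad["Position"].map(group_of)
# cumcount() numbers the rows inside each group in their current order,
# so keeping rank < quota keeps the best-rated players of each group.
rank = squad.groupby("Group").cumcount()
starters = squad[rank < squad["Group"].map(quota)]
print(starters)
```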

 

 

 

Repeat the same procedure for Manchester City.

gk_count = 1
cb_count = 4
mf_count = 4
st_count = 2


mc_id = []

for index in mc.index:
    position = mc.loc[index, 'Position']
    if position in gk_list:
        if gk_count != 0:
            mc_id.append(mc.loc[index, 'ID'])
            gk_count -= 1
    elif position in cb_list:
        if cb_count != 0:
            mc.loc[index, 'Position'] = 'CB'   # relabel with the group name
            mc_id.append(mc.loc[index, 'ID'])
            cb_count -= 1
    elif position in mf_list:
        if mf_count != 0:
            mc.loc[index, 'Position'] = 'MF'
            mc_id.append(mc.loc[index, 'ID'])
            mf_count -= 1
    else:
        # any position outside the three lists above is treated as a forward
        if st_count != 0:
            mc.loc[index, 'Position'] = 'ST'
            mc_id.append(mc.loc[index, 'ID'])
            st_count -= 1
mc = mc[mc['ID'].isin(mc_id)]

 


concat vs merge

merge: combines tables side by side (on a key), concat: stacks tables on top of each other
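
A minimal sketch of the difference, using made-up tables (`left`, `right`, and `more` are hypothetical names):

```python
import pandas as pd

left = pd.DataFrame({"ID": [1, 2], "Name": ["A", "B"]})
right = pd.DataFrame({"ID": [1, 2], "Overall": [85, 80]})
more = pd.DataFrame({"ID": [3], "Name": ["C"]})

stacked = pd.concat([left, more], ignore_index=True)  # rows appended: 3 rows, 2 columns
joined = pd.merge(left, right, on="ID")               # columns added: 2 rows, 3 columns

print(stacked.shape, joined.shape)
```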


df = pd.concat([mu, mc])

Combine the selected mu and mc players and store them in df.

 

 

(2) EDA 

- mu vs mc: comparing the current ability (Overall) of the starting players by position

sns.boxplot(data=df, x='Position', y='Overall', hue='Club')

Apart from the goalkeeper, Manchester United is lower in every position.

 

 

- mu vs mc: comparing the estimated transfer value (Value) of the starting players by position

sns.boxplot(data=df, x='Position', y='Value', hue='Club')

For transfer value, apart from the goalkeeper, there is almost no difference, or Manchester United is even higher.

 

 

Comparing the two teams with the boxplots above, MF and CB are judged to be the positions where ability is low relative to transfer value, so the next step analyzes which players to sign for those two positions.

 

 

 


5. Which players should Manchester United sign?

(1) EDA

* Choosing the players to release

Build a formula based on signing date, ability, potential, and age

 Point = (Overall * 2 + Potential) / Age 

The higher the ability (which is given extra weight) and the potential, and the lower the age, the better.

mu['Point'] = (mu['Overall'] * 2 + mu['Potential']) / mu['Age']
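
To see how the formula behaves, here is a small worked sketch with made-up numbers: two hypothetical players with identical ratings but different ages.

```python
def point(overall, potential, age):
    """Point = (Overall * 2 + Potential) / Age, as defined above."""
    return (overall * 2 + potential) / age

young = point(80, 85, 24)   # hypothetical 24-year-old
older = point(80, 85, 31)   # hypothetical 31-year-old with the same ratings
print(round(young, 2), round(older, 2))
```

The younger player scores higher, so under this formula older players are the first candidates for release.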

 

- MF position

mu[mu['Position'] == 'MF'][['Name', 'Overall', 'Potential', 'Age', 'Joined', 'Point']]

The lowest Point belongs to player no. 211.

 

- CB position

mu[mu['Position'] == 'CB'][['Name', 'Overall', 'Potential', 'Age', 'Joined', 'Point']]

The lowest Point belongs to player no. 377.

 

Release the two players, Mata and Smalling, and sign one player each for the MF and CB positions.

 

 

(2) Visualization

Visualizing all players - deciding whom to sign according to the recruitment policy


Manchester United recruitment policy (manager Solskjaer)

- The younger the player, the better

- Players who can start right now are preferred over players with potential


market = data[(data['Position']=='RM') | (data['Position']=='CB')]

For the positions, choose RM and CB, the detailed positions of the two players selected for release.

market.head()

import matplotlib.pyplot as plt
f, ax = plt.subplots(2, 4, figsize=(20, 10))

vs_list = ['Age', 'Overall', 'Potential', 'Weak Foot']

for i in range(8):
    if i < 4:
        colors = ['firebrick' if x > market[market['Position']=='CB'][:13][vs_list[i]].mean() else 'gray' for x in market[market['Position']=='CB'][:13][vs_list[i]]]
        sns.barplot(x=vs_list[i], y='Name', data=market[market['Position']=='CB'][:13], ax=ax[i//4, i%4], palette=colors)
        ax[i//4, i%4].axvline(market[market['Position']=='CB'][:13][vs_list[i]].mean(), ls = '--', color='k')
   
    else:
        colors = ['firebrick' if x > market[market['Position']=='RM'][:13][vs_list[i%4]].mean() else 'gray' for x in market[market['Position']=='RM'][:13][vs_list[i%4]]]        
        sns.barplot(x=vs_list[i%4], y='Name', data=market[market['Position']=='RM'][:13], ax=ax[i//4, i%4], palette=colors)        
        ax[i//4, i%4].axvline(market[market['Position']=='RM'][:13][vs_list[i%4]].mean(), ls='--', color='k')

Setting everything else aside and judging purely from the data on age, current ability, and potential, a signing decision that follows the recruitment policy would point to S. Umtiti and K. Mbappé.

Learning Spoons class notes

 

< Previous post >

https://silvercoding.tistory.com/67

[Bank Marketing data analysis] 2. Python Boosting, using XGBoost


 

 


 

Explaining "what the conclusion is" is an important part of a data scientist's job.

From the prediction alone, you cannot explain which patterns the model used or why it predicted what it did. When that happens, collaborators from other fields will lose trust.

Take an example from a business point of view. Suppose a machine learning project that predicts box office results predicts that a movie will flop. The next question may well be how to prevent the flop. If the existing weak points cannot be addressed, the prediction has no value from a business point of view.

 

Being able to explain the result is therefore very important. Variable importance can be used for this. It lets you examine in detail which variables had a large effect on the prediction and how a particular variable affected it.


Variable importance

- Among the input variables used in the model, which one affected the target value the most?
- A numerical score for that importance
- Can be computed for tree-based models (decision tree, random forest)

 

In the previous posts, variable importance was computed for the tree-based models random forest and xgboost.

(See also) Bagging, Boosting


์˜์‚ฌ๊ฒฐ์ •๋‚˜๋ฌด์—์„œ์˜ ๋ณ€์ˆ˜์ค‘์š”๋„

- ํ•ด๋‹น input ๋ณ€์ˆ˜๊ฐ€ ์˜์‚ฌ๊ฒฐ์ •๋‚˜๋ฌด์˜ ๊ตฌ์ถ•์—์„œ ์–ผ๋งˆ๋‚˜ ๋งŽ์ด ์“ฐ์ด๋‚˜ 
- ํ•ด๋‹น ๋ณ€์ˆ˜๋ฅผ ๊ธฐ์ค€์œผ๋กœ ๋ถ„๊ธฐ๋ฅผ ํ–ˆ์„ ๋•Œ ๊ฐ ๊ตฌ๊ฐ„์˜ ๋ณต์žก๋„๊ฐ€ ์–ผ๋งˆ๋‚˜ ์ค„์–ด๋“œ๋Š”๊ฐ€? 
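
The second point (how much a split reduces impurity) can be sketched with Gini impurity. This is a minimal illustration with made-up labels; the function names are hypothetical. Impurity-based importances in tree libraries accumulate decreases of this kind over all splits that use a variable.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def impurity_decrease(parent, left, right):
    """Weighted drop in Gini impurity produced by one split."""
    n = len(parent)
    children = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - children

parent = [0, 0, 0, 1, 1, 1]
good_split = impurity_decrease(parent, [0, 0, 0], [1, 1, 1])   # separates the classes perfectly
weak_split = impurity_decrease(parent, [0, 0, 1], [0, 1, 1])   # barely helps
print(good_split, weak_split)
```

A variable that produces splits like `good_split` would end up with a much higher importance than one that only produces splits like `weak_split`.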



Shapley value

: the size of each variable's influence on the prediction

: what kind of influence the variable has

 

(Example) Soccer player A, who belongs to team B

- The size of each player's influence on the team's results

- What kind of influence the player has

- (win rate of team B with player A) - (win rate of team B without player A) = 7%
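
The "with player minus without player" idea can be made concrete with a tiny exact Shapley computation. This is a sketch under stated assumptions: the win rates below are made up, and the function `shapley_values` is a hypothetical helper that averages each player's marginal contribution over every order in which the players could join the team.

```python
from itertools import permutations

def shapley_values(players, value):
    """Average each player's marginal contribution over all join orders."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orders) for p, t in totals.items()}

# Hypothetical win rates for every subset of two players (made-up numbers).
win_rate = {
    frozenset(): 0.40,
    frozenset({"A"}): 0.47,
    frozenset({"B"}): 0.45,
    frozenset({"A", "B"}): 0.55,
}
phi = shapley_values(["A", "B"], lambda s: win_rate[frozenset(s)])
print(phi)
```

The two values add up to the gap between the full team (0.55) and the empty team (0.40). The shap library uses far more efficient algorithms for real models, since enumerating all orders is only feasible for a handful of players.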


 SHAP value practice

To keep the focus on the SHAP value practice, everything up to the XGBoost training is run exactly as before.

Loading the data

import os
import pandas as pd
import numpy as np
os.chdir('./data') # your own path
data = pd.read_csv("bank-additional-full.csv", sep = ";")

This is the deposit subscription dataset used in the previous post.

data = pd.get_dummies(data, columns = ['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome'])

One-hot encode the categorical variables with get_dummies.

data['y'].value_counts()

Because this is a classification problem, the target variable is naturally categorical.

data['y'] = np.where( data['y'] == 'no', 0, 1)

However, the shap package works best when the target variable is numeric, so convert it to numbers.

 

 

 

XGBoost training

input_var = ['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate',
       'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed',
       'job_admin.', 'job_blue-collar', 'job_entrepreneur', 'job_housemaid',
       'job_management', 'job_retired', 'job_self-employed', 'job_services',
       'job_student', 'job_technician', 'job_unemployed', 'job_unknown',
       'marital_divorced', 'marital_married', 'marital_single',
       'marital_unknown', 'education_basic.4y', 'education_basic.6y',
       'education_basic.9y', 'education_high.school', 'education_illiterate',
       'education_professional.course', 'education_university.degree',
       'education_unknown', 'default_no', 'default_unknown', 'default_yes',
       'housing_no', 'housing_unknown', 'housing_yes', 'loan_no',
       'loan_unknown', 'loan_yes', 'contact_cellular', 'contact_telephone',
       'month_apr', 'month_aug', 'month_dec', 'month_jul', 'month_jun',
       'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep',
       'day_of_week_fri', 'day_of_week_mon', 'day_of_week_thu',
       'day_of_week_tue', 'day_of_week_wed', 'poutcome_failure',
       'poutcome_nonexistent', 'poutcome_success']

Put every input variable except the y column into a list.

from xgboost import XGBRegressor

To predict a numeric value, import the XGBRegressor regression model.

xgb = XGBRegressor( n_estimators = 300, learning_rate=0.1 )
xgb.fit(data[input_var], data['y'])

Train the XGBoost model.

 

 

SHAP value example

import shap

Import the shap library.

 

(1) Variable importance

explainer = shap.TreeExplainer(xgb)
shap_values = explainer.shap_values( data[input_var] )

Pass the trained model xgb to shap.TreeExplainer and store the resulting object. Then pass the input columns of the dataset to explainer.shap_values.

shap.summary_plot( shap_values , data[input_var] , plot_type="bar" )

Use shap.summary_plot to draw the variable importance chart. The most important variable is duration, the length of the phone call. In other words, call duration has the largest effect on this model's predictions.
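
The bar chart ranks variables by the mean absolute SHAP value per column. A minimal sketch of that calculation, using a made-up 3x3 SHAP matrix (the numbers and the shortened feature list are hypothetical):

```python
# Hypothetical SHAP matrix: one row per sample, one column per feature (made-up numbers).
feature_names = ["duration", "euribor3m", "age"]
shap_matrix = [
    [ 0.30, -0.10,  0.01],
    [-0.20,  0.05, -0.02],
    [ 0.40, -0.15,  0.00],
]

n_rows = len(shap_matrix)
# Mean of |SHAP| per column: sign is ignored, only the size of the effect counts.
mean_abs = {
    name: sum(abs(row[j]) for row in shap_matrix) / n_rows
    for j, name in enumerate(feature_names)
}
ranking = sorted(mean_abs, key=mean_abs.get, reverse=True)
print(ranking)
```

With the real NumPy array returned by the explainer, the same quantity would be np.abs(shap_values).mean(axis=0).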

 

 

(2) dependence plot

: shows the relationship between a specific input variable and the target variable

: each dot is one row (one data point); the y-axis is that row's effect on the target

: lets you see in detail how the variable influenced the prediction

shap.dependence_plot( 'duration' , shap_values , data[input_var] )

In the duration plot, most duration values lie below 3000, and within that range a duration of roughly 50 or more has a positive effect, which is read as raising the likelihood of 1. (Many points have a SHAP value for duration greater than 0.)

shap.dependence_plot( 'nr.employed' , shap_values , data[input_var] )

Around 5020 the effect turns negative, and beyond 5100 only negative effects remain (-> more likely 0). Before that point the SHAP values are high, so the effect is positive (-> more likely 1).

shap.dependence_plot( 'euribor3m' , shap_values , data[input_var] )

Negative and positive values look roughly evenly distributed. The ranges with few negative and many positive values are about 1.3~1.4 to 2, and 4 to 5. In those ranges the likelihood of 1 can be read as higher.

shap.dependence_plot( 'cons.conf.idx' , shap_values , data[input_var] )

Overall the values are negative. At -45 or below, the likelihood of 1 can be read as increasing.

shap.dependence_plot( 'pdays' , shap_values , data[input_var] )

When pdays is 0, most data points can be expected to have a higher likelihood of 1.

 

 

 

(3) force plot

: visualizes how a specific row's prediction was formed

prediction = xgb.predict(data[input_var])
data['pred'] = prediction

 

shap.initjs()
shap.force_plot( explainer.expected_value , shap_values[41187] , data[input_var].iloc[41187] )

Row 41187 was predicted as 0.09, with the variables that lower the prediction and the variables that raise it fairly evenly mixed.

 

shap.force_plot( explainer.expected_value , shap_values[0] , data[input_var].iloc[0] )

Row 0, predicted very close to 0, shows that almost every variable had a negative effect.

 

In row 41183 the positive effects are much larger. The result is 0.88, and since the true answer is 1, the prediction is close.
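
A force plot relies on the additive property of SHAP values: base value + sum of the row's SHAP values = the model output for that row. The sketch below checks this on a hypothetical linear model, where (assuming independent features) each contribution is simply weight * (value - feature mean). The weights, bias, and data are made up and unrelated to the xgb model above.

```python
weights = [2.0, -1.0]                    # hypothetical linear model f(x) = 2*x0 - 1*x1 + 0.5
bias = 0.5
background = [[1.0, 2.0], [3.0, 4.0]]    # made-up reference data

def predict(x):
    return sum(w * v for w, v in zip(weights, x)) + bias

# Per-feature means and the average prediction over the reference data (the "base value").
means = [sum(col) / len(col) for col in zip(*background)]
base_value = sum(predict(row) for row in background) / len(background)

x = [3.0, 1.0]
contributions = [w * (v - m) for w, v, m in zip(weights, x, means)]
reconstructed = base_value + sum(contributions)
print(base_value, contributions, reconstructed, predict(x))
```

In the force plot, positive contributions push the output above the base value and negative ones push it below, which is what the plots for rows 41187, 0, and 41183 show.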

 

 

In this way, the shap library made it possible to examine in detail how each variable affected the prediction.


 

 

 

 

 

 

 

 

 

 

 

 

 

 

Learning Spoons class notes

 

< Previous post >

https://silvercoding.tistory.com/66

[Bank Marketing data analysis] 1. Python bagging, random forest


 

 


Boosting

Securing diversity among the models (the boosting procedure)

  • Raise the weights of the samples the previous model misclassified, and train the next model on the re-weighted data
  • Build a model from each such dataset
  • Because every model trains on a differently weighted dataset, the models become diverse

Combining the final output

  • Take a weighted average of the predictions from the individual models
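
The re-weighting step can be sketched with an AdaBoost-style update. This is an illustration of the procedure described above, not of XGBoost itself, which is a gradient boosting method and fits each new tree to the errors of the current ensemble instead of re-weighting samples. The function name and the toy inputs are hypothetical, and the error is assumed to lie strictly between 0 and 1.

```python
import math

def reweight(weights, correct):
    """One AdaBoost-style update: up-weight the samples the last model got wrong."""
    error = sum(w for w, ok in zip(weights, correct) if not ok) / sum(weights)
    alpha = 0.5 * math.log((1 - error) / error)     # this model's weight in the final vote
    updated = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(updated)
    return [w / total for w in updated], alpha

weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]     # hypothetical: the 4th sample was misclassified
new_weights, alpha = reweight(weights, correct)
print(new_weights, alpha)
```

After one update the misclassified sample carries half of the total weight, so the next model concentrates on it. The `alpha` values play the role of the weights in the final weighted average.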

 

Setting n_estimators

(n_estimators : how many decision trees to build)

  • If n_estimators is too high, there is a risk of overfitting that is sensitive to noise
  • If n_estimators is too low, there is a risk of underfitting
  • The key is to find an appropriate n_estimators

 

 


 

 


Loading the data

import os
import pandas as pd
os.chdir('../data')   # path of the folder that contains your file
data = pd.read_csv("bank-additional-full.csv", sep = ';')
data.head()

data.info()

 

 

 

Preprocessing - one-hot encoding the categorical variables

data = pd.get_dummies(data,columns=['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome'])

One-hot encode the categorical variables (dtype object) with get_dummies.

 

 

 

Splitting the train & test datasets

data['id']=range(len(data))

Give each row an id so the data can be told apart.

train = data.sample(30000,replace=False,random_state=2020).reset_index().drop(['index'],axis=1)
test = data.loc[ ~data['id'].isin(train['id']) ].reset_index().drop(['index'],axis=1)

Split the train and test datasets in the same way as in the previous post.

 

 

 

Storing the input variables

data.columns

input_var = ['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate',
       'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed',
       'job_admin.', 'job_blue-collar', 'job_entrepreneur', 'job_housemaid',
       'job_management', 'job_retired', 'job_self-employed', 'job_services',
       'job_student', 'job_technician', 'job_unemployed', 'job_unknown',
       'marital_divorced', 'marital_married', 'marital_single',
       'marital_unknown', 'education_basic.4y', 'education_basic.6y',
       'education_basic.9y', 'education_high.school', 'education_illiterate',
       'education_professional.course', 'education_university.degree',
       'education_unknown', 'default_no', 'default_unknown', 'default_yes',
       'housing_no', 'housing_unknown', 'housing_yes', 'loan_no',
       'loan_unknown', 'loan_yes', 'contact_cellular', 'contact_telephone',
       'month_apr', 'month_aug', 'month_dec', 'month_jul', 'month_jun',
       'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep',
       'day_of_week_fri', 'day_of_week_mon', 'day_of_week_thu',
       'day_of_week_tue', 'day_of_week_wed', 'poutcome_failure',
       'poutcome_nonexistent', 'poutcome_success']

Store the columns of data, except y, in input_var.

 

 

 

 

 


XGBoost model training


XGBoost

- Characteristics

  • Hard to interpret
  • Generally faster and more accurate than random forest

- xgb = XGBClassifier( n_estimators = 300, learning_rate = 0.1 )

  • n_estimators : how many decision trees to build
  • learning_rate : how quickly the model learns (the step size applied to each new tree's contribution)

- Installation

!pip install xgboost

If xgboost is not installed yet, install it first.

from xgboost import XGBClassifier
xgb = XGBClassifier( n_estimators = 300, learning_rate = 0.1 )
xgb.fit(train[input_var], train['y'])

Create the object and train it on the train dataset.

predictions = xgb.predict(test[input_var])

Predict on the test dataset and store the result in predictions.

(pd.Series(predictions)==test['y']).mean()

The accuracy is about 91%. The current model uses n_estimators = 300. As covered earlier, choosing n_estimators appropriately is the key to avoiding both overfitting and underfitting in boosting, so the next step searches for the best n_estimators.

 

 

 

Finding the best number of decision trees ( n_estimators )

for n in [100,200,300,400,500,600,700,800,900]:
    xgb = XGBClassifier( n_estimators = n, learning_rate = 0.05, eval_metric='logloss' )
    xgb.fit(train[input_var], train['y'])
    predictions = xgb.predict(test[input_var])
    print((pd.Series(predictions)==test['y']).mean())

Result: the best n_estimators is 400.

 

 

Variable importance

feature_imp = xgb.feature_importances_

Variable importance can be obtained with feature_importances_.

imp_df = pd.DataFrame({'var':input_var,
                       'imp':feature_imp})

imp_df.sort_values(['imp'],ascending=False)

Sorting the variable importances in descending order shows that the nr.employed column comes out as the most important variable.


 

 

 

 

 

Learning Spoons class notes

 

< Previous post >

https://silvercoding.tistory.com/65

[IRIS data analysis] 2. Python Decision Tree


 

 

 


Bagging

- The philosophy of bagging

1. The more, the better.

2. The more diverse, the better.

(ex) 1 man < 10 men (larger group) < 5 men and 5 women (larger and more diverse group)

 

- How is diversity among the models secured? (the bagging process)

1. Randomly sample from the full dataset (sampling with replacement: some rows may appear more than once and some may never be drawn) -> several datasets are created

2. Build a model from each dataset

3. Because each model trains on a different dataset, the models become diverse

 

- How are the final outputs combined?

: Take the simple average of the predictions from the individual models.
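
The sampling and combining steps above can be sketched in a few lines. This is a minimal illustration with hypothetical helper names and made-up model outputs; no actual models are trained here.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) items with replacement."""
    return [rng.choice(rows) for _ in rows]

def bagged_prediction(predictions):
    """Combine the individual model outputs with a simple average."""
    return sum(predictions) / len(predictions)

rng = random.Random(0)
rows = list(range(10))
samples = [bootstrap_sample(rows, rng) for _ in range(3)]   # one dataset per model
combined = bagged_prediction([0.2, 0.4, 0.9])               # made-up model outputs
print(samples, combined)
```

Each bootstrap sample typically repeats some rows and omits others, which is where the diversity between the models comes from.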

 

 

- Random forest (the model used in this post)

: an algorithm that follows the bagging process and uses decision trees as the base models


 

 

 


 Exploring the data

The data used here can be downloaded from Kaggle Datasets.

< Bank Marketing dataset > 

https://www.kaggle.com/volodymyrgavrysh/bank-marketing-campaigns-dataset

 


import os
import pandas as pd
os.chdir('../data')  # path of your own data folder
data = pd.read_csv("bank-additional-full.csv", sep = ';')

One thing to watch when loading the data is that sep=';' must be set. Although the file is a csv file, its fields are separated by semicolons (;) and not commas (,).

data.head()

Using predictor variables such as age, job, marital status, and loan status, the model is trained to predict whether the customer subscribes to a deposit.

data.info()

Variables whose dtype is object are categorical and need to be one-hot encoded.

 

 

 

 


 Using random forest

Preprocessing - one-hot encoding the categorical variables

- Extracting the columns whose dtype is object

obj_column = []
for column in data.columns[:-1]:
    if data[column].dtype == 'object':
        obj_column.append(column)
        
obj_column

data = pd.get_dummies(data,columns=obj_column)

Run one-hot encoding with get_dummies.

data

The number of columns has grown considerably.

data['id']=range(len(data))

Assign an id value to each row so the data can be told apart.

 

 

- Splitting the train & test datasets

train = data.sample(30000,replace=False,random_state=2020).reset_index().drop(['index'],axis=1)

Build the train dataset from 30000 rows sampled without replacement.

test = data.loc[ ~data['id'].isin(train['id']) ].reset_index().drop(['index'],axis=1)

The test dataset consists of the ids that are not in train, 11188 rows in total.
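
The id-based split can be sketched without pandas to make the logic explicit. This is a hypothetical helper, and because it uses Python's random module, the ids it selects will differ from the ones chosen by data.sample with random_state=2020; only the sizes match.

```python
import random

def split_ids(n_rows, n_train, seed):
    """Pick n_train row ids without replacement; the remaining ids form the test set."""
    rng = random.Random(seed)
    train_ids = set(rng.sample(range(n_rows), n_train))
    test_ids = [i for i in range(n_rows) if i not in train_ids]
    return sorted(train_ids), test_ids

train_ids, test_ids = split_ids(41188, 30000, seed=2020)
print(len(train_ids), len(test_ids))
```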

 

 

 

 

Random forest model training


Random forest

- Characteristics

  • Hard to interpret
  • Very slow
  • Yields more objective variable importances than a single decision tree

 

- RandomForestClassifier(n_estimators=m, min_samples_split=n)

  • n_estimators : how many decision trees to build
  • max_depth : maximum depth of each decision tree
  • min_samples_split : minimum number of samples a node must contain before it can be split

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=500, min_samples_split=10)

Create the random forest object.

data.columns

input_var = ['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate',
       'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed',
       'job_admin.', 'job_blue-collar', 'job_entrepreneur', 'job_housemaid',
       'job_management', 'job_retired', 'job_self-employed', 'job_services',
       'job_student', 'job_technician', 'job_unemployed', 'job_unknown',
       'marital_divorced', 'marital_married', 'marital_single',
       'marital_unknown', 'education_basic.4y', 'education_basic.6y',
       'education_basic.9y', 'education_high.school', 'education_illiterate',
       'education_professional.course', 'education_university.degree',
       'education_unknown', 'default_no', 'default_unknown', 'default_yes',
       'housing_no', 'housing_unknown', 'housing_yes', 'loan_no',
       'loan_unknown', 'loan_yes', 'contact_cellular', 'contact_telephone',
       'month_apr', 'month_aug', 'month_dec', 'month_jul', 'month_jun',
       'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep',
       'day_of_week_fri', 'day_of_week_mon', 'day_of_week_thu',
       'day_of_week_tue', 'day_of_week_wed', 'poutcome_failure',
       'poutcome_nonexistent', 'poutcome_success']

From the returned columns of data, store every column except y in the input_var variable.

rf.fit(train[input_var],train['y'])

Train the random forest classifier on the train dataset.

predictions = rf.predict(test[input_var])

Predict on the test dataset and store the result in the predictions variable.

(pd.Series(predictions)==test['y']).mean()

Comparing predictions with the true values (y) and taking the mean gives an accuracy of about 91%.

 

 

 

* Comparison with a decision tree

from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(min_samples_split=10)

์˜์‚ฌ๊ฒฐ์ •๋‚˜๋ฌด ๊ฐ์ฒด๋ฅผ ์ƒ์„ฑํ•œ๋‹ค. 

dt.fit(train[input_var], train['y'])

predictions = dt.predict(test[input_var])

Train on the training data and predict on the test data.

(pd.Series(predictions) == test['y']).mean()

Comparing the accuracies shows that the random forest model is slightly more accurate than the decision tree.

 

 

 

Variable importance

feature_imp = rf.feature_importances_
imp_df = pd.DataFrame({'var':input_var,
                       'imp':feature_imp})

imp_df.sort_values(['imp'],ascending=False)

Variable importance can be inspected with feature_importances_. Sorted in descending order, duration is the highest and the default_yes column is the lowest. (The concept of variable importance is covered in detail two sessions from now.)


 

 

 

Learning Spoons class notes

 

< Previous post >

https://silvercoding.tistory.com/64

[IRIS data analysis] 1. Python KNN classification


 

 

 

 


 Loading the data

The practice uses the same Iris Flower Dataset as the previous post.

< Iris Flower Dataset >

https://www.kaggle.com/arshid/iris-flower-dataset

 


 

import pandas as pd
import os
os.chdir('../data')  # path of the folder that contains the dataset
iris = pd.read_csv("IRIS.csv")
iris.head()

iris['species'].value_counts()

There are 50 rows for each species.

 

 

 

 

 


 

์˜์‚ฌ๊ฒฐ์ •๋‚˜๋ฌด ์‚ฌ์šฉ 

train & Test ๋ฐ์ดํ„ฐ์…‹ ๋ถ„๋ฆฌ 

iris['id'] = range(len(iris))

First, create an id column with sequential values so the rows can be told apart.

iris = iris[['id','sepal_length','sepal_width','petal_length','petal_width','species']]

Reorder the columns so that id comes first.

train = iris.sample(100,replace=False,random_state=7).reset_index().drop(['index'],axis=1)

Randomly draw 100 samples and store them in train.

test = iris.loc[ ~iris['id'].isin(train['id']) ]
test = test.reset_index().drop(['index'],axis=1)

Put the iris rows whose id is not in train into test.

 

 

 

์˜์‚ฌ๊ฒฐ์ •๋‚˜๋ฌด ํ•™์Šต 

DecisionTreeClassifier(min_samples_split = n)

---> ํŠน์ง• : ํ•ด์„์ด ์‰ฝ๊ณ  ๋น ๋ฅด๋‹ค. 

---> min_samples_split : ์˜์‚ฌ๊ฒฐ์ •๋‚˜๋ฌด์—์„œ ์ตœ์ข… ๋…ธ๋“œ์˜ ์ตœ์†Œ ์ƒ˜ํ”Œ ์ˆ˜ 

 

from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(min_samples_split = 10)

Setting min_samples_split to 10 means that a node with fewer than 10 samples is not split any further.

dt.fit(train[['sepal_length','sepal_width','petal_length','petal_width']],train['species'])

Train the dt object created above.

predictions = dt.predict(test[['sepal_length','sepal_width','petal_length','petal_width']])

Store the predicted values in predictions.

test['pred'] = predictions

Save the predicted values in the pred column of test.

test.head()

(pd.Series(predictions)==test['species']).mean()

Comparing the predictions with the true labels gives an accuracy of 0.98.

 

 

 


The accuracy measured this way can be unreliable, because the result may change considerably depending on how the data is divided into train and test. Cross validation can be used to measure accuracy instead.


from sklearn.model_selection import cross_val_score
import numpy as np
dt = DecisionTreeClassifier(min_samples_split = 10)
scores = cross_val_score(dt, iris[['sepal_length','sepal_width','petal_length','petal_width']], iris['species'], cv=5, scoring="accuracy")
np.mean(scores)

 

When the dataset is small, as in this example, running cross validation on the full data as above is more reliable. The 5-fold cross validation gives an accuracy of about 0.97.
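
What a 5-fold split does can be sketched by generating the index sets by hand. This is a simplified illustration with a hypothetical helper: it does not shuffle or stratify, whereas scikit-learn's cross_val_score uses stratified folds for classifiers when cv is an integer, which matters for IRIS.csv because its rows are grouped by species.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(150, 5))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

The model is trained and scored once per fold, and the reported accuracy is the mean of the five scores.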

 

 

 


์˜์‚ฌ๊ฒฐ์ •๋‚˜๋ฌด ์‹œ๊ฐํ™”

from sklearn import tree
import matplotlib.pyplot as plt
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 16,10
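# cross_val_score fits clones, so dt itself is still unfitted here; fit it before plotting (here on the train split)
dt.fit(train[['sepal_length','sepal_width','petal_length','petal_width']], train['species'])
tree.plot_tree(dt,feature_names = ['sepal_length','sepal_width','petal_length','petal_width'],impurity=False, max_depth=2, fontsize=10, proportion=True)
plt.show()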

max_depth controls how deep the drawing goes. Beyond depth 2, the nodes are abbreviated as (...). Using a decision tree and visualizing it like this makes the model easy to interpret.

Learning Spoons class notes

 

< Previous post >

https://silvercoding.tistory.com/63?category=967543 

[boston data analysis] 2. House price analysis using PCA and clustering


 

 

 

 


KNN concept summary

* KNN classification process (group 1 vs. group 2)

1. Set k: select the k points closest to the new point

2. Check whether group 1 or group 2 has more points among those k

3. Classify the new point into the group with more points.

 

 

* Process for finding K

1. Train a KNN model on the training data for each candidate K

2. Use each trained model to measure the error rate on the validation data (test data)

3. Select the k with the smallest error rate

 

 

* A suitable k has to be found!

- If k is very small, the model is sensitive to noise and may overfit

- If k is very large, the model loses the ability to capture local structure
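
Both failure modes can be seen on a tiny one-dimensional example. The data below is invented purely for illustration (one mislabeled "b" sits inside the "a" region), and `knn_1d` is a hypothetical helper, not a library function.

```python
from collections import Counter

def knn_1d(train_x, train_y, query, k):
    """Majority vote among the k training values closest to `query` (1-D)."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 1-D data: "a" on the left, "b" on the right,
# plus one noisy "b" (at x=2.0) sitting inside the "a" region.
xs = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 7.0, 8.0, 9.0]
ys = ["a", "a", "b", "a", "a", "a", "b", "b", "b"]

# k=1 copies the single noisy neighbor
pred_k1 = knn_1d(xs, ys, query=2.1, k=1)
# k=5 lets the surrounding "a" points outvote the noise
pred_k5 = knn_1d(xs, ys, query=2.1, k=5)
# k = all points ignores location: every query gets the overall majority
pred_all_left = knn_1d(xs, ys, query=2.1, k=len(xs))
pred_all_right = knn_1d(xs, ys, query=8.0, k=len(xs))
print(pred_k1, pred_k5, pred_all_left, pred_all_right)
```

In this sketch k=1 follows the noisy point, while k equal to the dataset size labels even a query at x=8.0 as "a", although all of its close neighbors are "b".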

 

 

 

 


 ๋ฐ์ดํ„ฐ ์‚ดํŽด๋ณด๊ธฐ 

๋ณธ ํฌ์ŠคํŒ…์—์„œ ์‚ฌ์šฉํ•  ๋ฐ์ดํ„ฐ์…‹์€ ์บ๊ธ€์˜ ๋‹ค์Œ๋งํฌ์—์„œ ๋‹ค์šด๋ฐ›์„ ์ˆ˜ ์žˆ๋‹ค. 

< Iris Flower Dataset >

https://www.kaggle.com/arshid/iris-flower-dataset

 


 

import pandas as pd
import os
os.chdir('../data')   # path to the folder that contains your copy of the dataset
iris = pd.read_csv("IRIS.csv")
iris.head()

(Note) sepal: one of the small leaf-like parts that enclose the flower bud / petal: one of the colored leaves of the flower

Based on the sizes of the sepals and petals, we will build a classification model that distinguishes three species: setosa, versicolor, and virginica.

iris.info()

์ด 150๊ฐœ์˜ ๋ฐ์ดํ„ฐ๊ฐ€ ๋“ค์–ด ๊ฐ€ ์žˆ๊ณ  , ๊ฒฐ์ธก๊ฐ’์€ ์กด์žฌํ•˜์ง€ ์•Š๋Š”๋‹ค. 

iris['species'].value_counts()

value_counts() shows how many rows each species has. Each species has exactly 50 rows.

 

 

 

 

 


 KNN practice - classification 

(ex) KNeighborsClassifier(n_neighbors=n)

---> ๋ฐ์ดํ„ฐ๊ฐ€ ๋งŽ์œผ๋ฉด ๋Š๋ฆผ 

---> n_neighbors=n : k์˜ ๊ฐœ์ˆ˜ ์ง€์ • (๊ฐ€์žฅ ๊ฐ€๊นŒ์šด K๊ฐœ๋ฅผ ๋ณผ๊ฒƒ์ด๋ผ๋Š” ์˜๋ฏธ) 

 

 

iris['id'] = range(len(iris))

๋ฐ์ดํ„ฐ๋ฅผ ์‹๋ณ„ํ•˜๊ธฐ ์œ„ํ•˜์—ฌ ์ˆœ์„œ๋Œ€๋กœ ๊ฐ’์„ ๋ถ€์—ฌํ•˜์—ฌ id ์ปฌ๋Ÿผ์— ๋„ฃ์–ด์ค€๋‹ค. 

iris = iris[['id','sepal_length','sepal_width','petal_length','petal_width','species']]

Reorder the columns so that id comes first. 

iris.head()

 

 

Splitting train & test data 

train = iris.sample(100, replace=False, random_state=7).reset_index(drop=True)
train

Draw 100 random rows for the training set. The sampling is without replacement, and the shuffled index is reset afterwards. 

test = iris.loc[ ~iris['id'].isin(train['id']) ]
# test = test.reset_index().drop(['index'],axis=1)  # equivalent to the line below
test = test.reset_index(drop=True)

The test set is built from only those rows whose id does not appear in the training set. Its index is reset in the same way. 
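
The same split logic, sampling 100 ids without replacement and taking the complement as the test set, can be checked in plain Python. This is only a sketch of the idea: the seed value here does not reproduce the rows that pandas selects with random_state=7.

```python
import random

ids = list(range(150))                    # one id per row, as in iris['id']
rng = random.Random(7)                    # fixed seed so the split is repeatable
train_ids = rng.sample(ids, 100)          # sampling without replacement
train_id_set = set(train_ids)
test_ids = [i for i in ids if i not in train_id_set]  # everything not in train

print(len(train_ids), len(test_ids))
print(train_id_set.isdisjoint(test_ids))  # no row belongs to both sets
```

Checking that the two sets are disjoint and together cover every id is a cheap way to confirm that no test row leaked into training.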

 

 

 

 

KNN training (trying k=3) 

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3) # define the model

Create a KNN classifier object with k=3. 

knn.fit( train[['sepal_length','sepal_width','petal_length','petal_width']] , train['species'] )

The call has the form knn.fit(train_X, train_y).  

predictions = knn.predict( test[['sepal_length','sepal_width','petal_length','petal_width']] )

The call has the form knn.predict(test_X). Predict on the test data and store the result in predictions. 

test['pred'] = predictions
test.head()

The predictions were added as a pred column. The first 5 rows shown above are all predicted correctly.  

(test['pred'] == test['species']).mean()

Comparing the true labels with the predictions gives an accuracy of 0.94. Next, compute the accuracy for several values of k to decide on the best one. 

 

 

 

 

Finding the optimal K 

- Using train & test data 

for k in range(1,30):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit( train[['sepal_length','sepal_width','petal_length','petal_width']] , train['species'] )
    predictions = knn.predict( test[['sepal_length','sepal_width','petal_length','petal_width']] )
    print((pd.Series(predictions) == test['species']).mean())

These are the accuracies obtained by training with k from 1 to 29. Taking the first k that reaches the highest value gives k=5 (accuracy 0.98). 

 

---> Optimal K : 5
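
Rather than reading the printed list by eye, the "first k with the highest accuracy" rule can be written down as a small helper. The function name `first_best_k` is hypothetical, and the scores in the example are made-up numbers for illustration only, not the results obtained in this post.

```python
def first_best_k(accuracy_by_k):
    """Return the smallest k whose accuracy equals the maximum."""
    best_accuracy = max(accuracy_by_k.values())
    return min(k for k, acc in accuracy_by_k.items() if acc == best_accuracy)

# Made-up accuracies for illustration only (not the results from this post)
example_scores = {1: 0.90, 2: 0.92, 3: 0.96, 4: 0.96, 5: 0.94}
chosen_k = first_best_k(example_scores)
print(chosen_k)
```

In the loop above, the same helper could be applied by storing each accuracy in a dict keyed by k instead of printing it.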

 


However, this approach can be unreliable, because the result may change considerably depending on how the data happens to be split into train and test. Cross validation gives a more stable estimate of accuracy. 


 

- Using cross validation

from sklearn.model_selection import cross_val_score
import numpy as np
for k in range(1,30):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, iris[['sepal_length','sepal_width','petal_length','petal_width']], iris['species'], cv=5)
    print(f"{k} : " ,np.mean(scores))

5-fold cross validation was run for each k. k=6 is the first value that reaches the highest accuracy. 

 

 

---> Optimal K : 6

 

 

 

 


 KNN practice - regression 

KNN can also be used for regression problems. To practice KNN regression, build a model that predicts petal_width from sepal_length, sepal_width, and petal_length. 

del train['species']
del test['species']

To keep the exercise simple, drop the categorical variable species. Training then proceeds exactly as in the classification problem. 

from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=3)
knn.fit( train[['sepal_length','sepal_width','petal_length']] , train['petal_width'] )

predictions = knn.predict( test[['sepal_length','sepal_width','petal_length']] )
test['pred'] = predictions
test.head()

Training and prediction work the same way as before.
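
What changes in regression is how the neighbors are combined: with its default uniform weights, KNeighborsRegressor predicts the mean target of the k nearest training points instead of taking a vote. A minimal sketch of that idea, assuming Euclidean distance; `knn_regress` and the toy data are made up for this example.

```python
import math

def knn_regress(train_points, train_targets, query, k):
    """Predict the mean target of the k training points closest to `query`."""
    order = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(train_points[i], query),
    )
    nearest_targets = [train_targets[i] for i in order[:k]]
    return sum(nearest_targets) / k

# Hypothetical toy data: 2 features -> 1 numeric target
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0)]
targets = [0.2, 0.4, 0.6, 2.0]

# The three nearest targets are 0.2, 0.4 and 0.6, so the prediction is their mean
prediction = knn_regress(points, targets, query=(1.0, 1.0), k=3)
print(prediction)
```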

 

 

* Mean absolute error ( MAE ) : one of the ways to evaluate model performance in regression problems. 

MAE can be computed as follows. 

abs(test['petal_width'] - pd.Series(predictions)).mean()

Subtract the prediction from the true value, take the absolute value, and average these absolute errors. Because the metric measures error, a smaller value means a better prediction. 
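
Written out, MAE = (1/n) * sum(|y_i - pred_i|). The same calculation as a standalone function, as a sketch with invented numbers so that the arithmetic can be followed by hand:

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |actual - predicted| over all samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical values for illustration only
actual = [1.0, 2.0, 3.0]
predicted = [1.5, 2.0, 2.0]

# Absolute errors are 0.5, 0.0 and 1.0, so the mean is 1.5 / 3
mae = mean_absolute_error(actual, predicted)
print(mae)
```

MAE is expressed in the same unit as the target, so a value here would be read in the unit of petal_width.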

for k in range(1,30):
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit( train[['sepal_length','sepal_width','petal_length']] , train['petal_width'] )
    predictions = knn.predict( test[['sepal_length','sepal_width','petal_length']] )    
    print(str(k)+' :'+str(abs(test['petal_width'] - pd.Series(predictions)).mean()))

The k with the smallest MAE turns out to be 7. 

 

---> Optimal K : 7

 

 

 

 

 

 

 
