Learning Spoons class notes

 

 

< Previous post >

https://silvercoding.tistory.com/66

 

[Bank Marketing data analysis] 1. Python bagging and random forest

๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ < ์ด์ „ ๊ธ€ > https://silvercoding.tistory.com/65 https://silvercoding.tistory.com/64 https://silvercoding.tistory.com/63?category=967543 https://silvercoding.tistory.com/62 [bost..

silvercoding.tistory.com

 

 


Boosting

Securing diversity among the models (the boosting procedure)

  • Raise the weights of the observations that the previous model misclassified, then train the next model on the reweighted data
  • Build one model per reweighted dataset
  • Because each model trains on a differently weighted dataset, the models themselves end up diverse

Combining the final outputs

  • Take a weighted average of the predictions from the individual models

 

Setting n_estimators

(n_estimators: the number of decision trees to build)

  • If n_estimators is too high, the model may overfit and become sensitive to noise
  • If n_estimators is too low, the model may underfit
  • The key is to find an appropriate n_estimators

 

 


 

 


๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ 

import os
import pandas as pd
os.chdir('../data')   # path to the folder that contains your file
data = pd.read_csv("bank-additional-full.csv", sep = ';')
data.head()

data.info()

 

 

 

Preprocessing - one-hot encoding the categorical variables

data = pd.get_dummies(data,columns=['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome'])

The categorical variables (those with dtype object) are one-hot encoded with get_dummies.
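To see what get_dummies does to a single column, here is a small sketch on a made-up frame (the `toy` data is an assumption for illustration, not part of the bank dataset):

```python
import pandas as pd

# Tiny stand-in frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "age": [30, 41, 25],
    "marital": ["single", "married", "single"],
})

# The listed column is replaced by one indicator column per category.
encoded = pd.get_dummies(toy, columns=["marital"])
print(list(encoded.columns))
```

The original `marital` column disappears and is replaced by `marital_married` and `marital_single`, which is why names such as `job_admin.` show up in the column list later on.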

 

 

 

Splitting into train & test datasets

data['id']=range(len(data))

๋ฐ์ดํ„ฐ๋ฅผ ๊ตฌ๋ถ„ํ•˜๊ธฐ ์œ„ํ•˜์—ฌ ๊ฐ row์— id๋ฅผ ๋ถ€์—ฌํ•œ๋‹ค. 

train = data.sample(30000,replace=False,random_state=2020).reset_index().drop(['index'],axis=1)
test = data.loc[ ~data['id'].isin(train['id']) ].reset_index().drop(['index'],axis=1)

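As in the previous post, the data is split into train and test sets: 30,000 rows are sampled without replacement for train, and every row whose id is not in train goes to test.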
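The same id-based split can be checked on a small made-up frame. The names `toy`, `train_toy`, and `test_toy` are assumptions for this sketch; `reset_index(drop=True)` is used as a shorter equivalent of `reset_index().drop(['index'], axis=1)`.

```python
import pandas as pd

# Ten stand-in rows, each tagged with an id.
toy = pd.DataFrame({"x": range(10)})
toy["id"] = range(len(toy))

# Sample 7 rows for train; everything not sampled becomes test.
train_toy = toy.sample(7, replace=False, random_state=2020).reset_index(drop=True)
test_toy = toy.loc[~toy["id"].isin(train_toy["id"])].reset_index(drop=True)

# The two sets should share no ids and together cover every row.
overlap = set(train_toy["id"]) & set(test_toy["id"])
print(len(train_toy), len(test_toy), overlap)
```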

 

 

 

Storing the input variables

data.columns

input_var = ['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate',
       'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed',
       'job_admin.', 'job_blue-collar', 'job_entrepreneur', 'job_housemaid',
       'job_management', 'job_retired', 'job_self-employed', 'job_services',
       'job_student', 'job_technician', 'job_unemployed', 'job_unknown',
       'marital_divorced', 'marital_married', 'marital_single',
       'marital_unknown', 'education_basic.4y', 'education_basic.6y',
       'education_basic.9y', 'education_high.school', 'education_illiterate',
       'education_professional.course', 'education_university.degree',
       'education_unknown', 'default_no', 'default_unknown', 'default_yes',
       'housing_no', 'housing_unknown', 'housing_yes', 'loan_no',
       'loan_unknown', 'loan_yes', 'contact_cellular', 'contact_telephone',
       'month_apr', 'month_aug', 'month_dec', 'month_jul', 'month_jun',
       'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep',
       'day_of_week_fri', 'day_of_week_mon', 'day_of_week_thu',
       'day_of_week_tue', 'day_of_week_wed', 'poutcome_failure',
       'poutcome_nonexistent', 'poutcome_success']

data์˜ ์ปฌ๋Ÿผ์—์„œ y๋ฅผ ์ œ์™ธํ•œ ์ปฌ๋Ÿผ์„ input_var์— ์ €์žฅํ•ด ์ค€๋‹ค. 

 

 

 

 

 


Training the XGBoost model


XGBoost

- Characteristics

  • Hard to interpret
  • Generally faster and better-performing than random forest

- xgb = XGBClassifier( n_estimators = 300, learning_rate = 0.1 )

  • n_estimators : the number of decision trees to build
  • learning_rate : how fast to learn, i.e. how much each new tree's contribution is scaled (smaller values learn more slowly and usually need more trees)

- Installation

!pip install xgboost

If xgboost is not installed yet, install it first.

from xgboost import XGBClassifier
xgb = XGBClassifier( n_estimators = 300, learning_rate = 0.1 )
xgb.fit(train[input_var], train['y'])

๊ฐ์ฒด ์ƒ์„ฑ์„ ํ•˜๊ณ , train ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ํ•™์Šต๊นŒ์ง€ ์ง„ํ–‰ํ•œ๋‹ค. 

predictions = xgb.predict(test[input_var])

test ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ์˜ˆ์ธก์„ ์ˆ˜ํ–‰ํ•œ ํ›„ predictions์— ์ €์žฅํ•œ๋‹ค. 

(pd.Series(predictions)==test['y']).mean()

์ •ํ™•๋„๊ฐ€ ์•ฝ 91 % ๊ฐ€ ๋‚˜์™”๋‹ค. ํ˜„์žฌ ๋ชจ๋ธ์€ n_estimators๋ฅผ 300์œผ๋กœ ์ง€์ •ํ•˜์˜€๋‹ค. ์•ž์—์„œ ํ•™์Šตํ•˜์˜€๋“ฏ์ด, ์˜ค๋ฒ„ํ”ผํŒ…๊ณผ ์–ธ๋”ํ”ผํŒ…์„ ํ”ผํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ๋ถ€์ŠคํŒ…์—์„œ n_estimators๋ฅผ ์ ์ ˆํ•˜๊ฒŒ ์„ค์ •ํ•˜๋Š” ๊ฒƒ์ด ๊ด€๊ฑด์ด๋ผ๊ณ  ํ•˜์˜€๋‹ค. ๋”ฐ๋ผ์„œ ์ตœ์ ์˜ n_estimators๋ฅผ ์ฐพ์•„๋ณด๋„๋ก ํ•œ๋‹ค. 

 

 

 

Finding the best number of decision trees ( n_estimators )

for n in [100,200,300,400,500,600,700,800,900]:
    xgb = XGBClassifier( n_estimators = n, learning_rate = 0.05, eval_metric='logloss' )
    xgb.fit(train[input_var], train['y'])
    predictions = xgb.predict(test[input_var])
    print((pd.Series(predictions)==test['y']).mean())

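Result: among the values tried, the best n_estimators is 400, the one with the highest test accuracy. Note that this loop uses learning_rate = 0.05, not the 0.1 used for the first model.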
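Rather than reading the best value off the printed accuracies, the scores can be collected and the maximum picked in code. This is a sketch only: `best_n_estimators` and `toy_score` are hypothetical names, and `toy_score` is a synthetic stand-in, not a real model score. In practice the scorer would fit an XGBClassifier with the given n and return its accuracy, ideally on a validation split held out from train rather than on the test set.

```python
def best_n_estimators(candidates, evaluate):
    """Return (best_n, scores), where scores maps each candidate to its score."""
    scores = {n: evaluate(n) for n in candidates}
    # max() keeps the first candidate on ties, i.e. the smallest n
    # when the candidates are given in ascending order.
    best = max(scores, key=scores.get)
    return best, scores

def toy_score(n):
    # Synthetic curve that peaks at 400, used only to show the mechanics.
    return -(n - 400) ** 2

best, scores = best_n_estimators([100, 200, 300, 400, 500], toy_score)
print(best)
```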

 

 

๋ณ€์ˆ˜์ค‘์š”๋„ 

feature_imp = xgb.feature_importances_

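The feature_importances_ attribute gives the importance of each feature. Note that xgb at this point is the last model fitted in the loop above (n_estimators = 900), not the 400-tree model.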

imp_df = pd.DataFrame({'var':input_var,
                       'imp':feature_imp})

imp_df.sort_values(['imp'],ascending=False)

๋ณ€์ˆ˜์ค‘์š”๋„๋ฅผ ๋‚ด๋ฆผ์ฐจ์ˆœ์œผ๋กœ ์ •๋ ฌํ•ด๋ณด๋‹ˆ nr.emplyed ์ปฌ๋Ÿผ์ด ๊ฐ€์žฅ ์ค‘์š”ํ•œ ๋ณ€์ˆ˜๋กœ ๋‚˜์˜จ ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. 


 

 

 

 

 
