Learning Spoons class notes

 

 

< Previous post >

https://silvercoding.tistory.com/65

 

[IRIS Data Analysis] 2. Python Decision Tree


 

 

 


Bagging

- The philosophy of bagging

1. The more, the better.

2. The more diverse, the better.

(ex) 1 man < 10 men (a larger group) < 5 men and 5 women (a larger and more diverse group)

 

 

- How is diversity among the models secured? (The bagging process)

1. Draw random samples from the full dataset (sampling with replacement, so some rows may appear more than once and others may not be drawn at all) -> create multiple datasets

2. Build a model on each dataset

3. Because each model is trained on a different dataset, the models end up diverse
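
The sampling step above can be sketched in a few lines. This is only an illustration on a toy dataset of 10 rows; the row count, the number of datasets, and the seed are assumptions chosen for the example, not values from the class.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
n_rows = 10                     # hypothetical toy dataset size
n_models = 3                    # hypothetical number of bootstrap datasets

bootstrap_indices = []
for _ in range(n_models):
    # Sample row positions WITH replacement: duplicates are allowed,
    # and some rows may never be drawn for a given dataset.
    idx = rng.integers(0, n_rows, size=n_rows)
    bootstrap_indices.append(idx)

for i, idx in enumerate(bootstrap_indices):
    never_drawn = sorted(set(range(n_rows)) - set(idx.tolist()))
    print(f"dataset {i}: rows={sorted(idx.tolist())}, never drawn={never_drawn}")
```

Each printed dataset has the same size as the original but a different mix of rows, which is what gives each model a different view of the data.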

 

 

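- How are the final results combined?

: Take the simple average of the predictions from each model (for classification, this amounts to averaging the predicted class probabilities or taking a majority vote).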
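
As a small worked example of the averaging step, assume three models have each produced a probability of the positive class for four samples. The numbers are made up for illustration, and the 0.5 threshold is an assumption.

```python
import numpy as np

# Hypothetical positive-class probabilities: one row per model, one column per sample.
probs = np.array([
    [0.9, 0.2, 0.6, 0.4],
    [0.8, 0.1, 0.4, 0.3],
    [0.7, 0.3, 0.7, 0.6],
])

avg = probs.mean(axis=0)            # simple average across the three models
labels = (avg >= 0.5).astype(int)   # turn the averaged probability into a class

print(avg)
print(labels)
```

Note how the third sample is classified as positive even though one model voted against it, and the fourth as negative even though one model voted for it.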

 

 

- Random forest (the model used in this post)

: An algorithm that follows the bagging process and uses decision trees as the base models


 

 

 


 Exploring the data

The data can be downloaded from Kaggle Datasets.

< Bank Marketing dataset > 

https://www.kaggle.com/volodymyrgavrysh/bank-marketing-campaigns-dataset

 


import os
import pandas as pd
os.chdir('../data')  # path to your own data folder
data = pd.read_csv("bank-additional-full.csv", sep = ';')

One thing to watch when loading the data: sep=';' must be set. The file has a .csv extension, but its fields are separated by semicolons (;), not commas (,).
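
To see why the separator matters, here is a minimal sketch that reads a tiny semicolon-delimited string with and without sep=';'. The three rows are invented for the example; only the column names age, job, and y are taken from the dataset.

```python
import io
import pandas as pd

# A tiny stand-in for the real file (values are hypothetical).
raw = "age;job;y\n56;housemaid;no\n37;services;yes\n"

wrong = pd.read_csv(io.StringIO(raw))           # default sep=',' -> everything lands in one column
right = pd.read_csv(io.StringIO(raw), sep=";")  # semicolon-aware parsing

print(wrong.shape)   # (2, 1)
print(right.shape)   # (2, 3)
```

With the default separator the load does not fail, it just silently produces a single wide column, so the mistake is easy to miss without checking the shape.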

data.head()

We train a model that uses predictor variables such as age, job, marital status, and loan status to predict whether a given customer subscribes to a deposit.

data.info()

Variables whose dtype is object are categorical and need to be one-hot encoded.

 

 

 

 


 Using random forest

Preprocessing - one-hot encoding of categorical variables

- Extract the columns whose dtype is object

obj_column = []
for column in data.columns[:-1]:
    if data[column].dtype == 'object':
        obj_column.append(column)
        
obj_column

data = pd.get_dummies(data,columns=obj_column)

One-hot encoding is done with get_dummies.
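
A toy illustration of what get_dummies does to a categorical column is shown here. The frame is invented for the example. As a deliberate swap, the column filter uses pd.api.types.is_numeric_dtype instead of the dtype == 'object' check, on the assumption that this is less sensitive to how a given pandas version labels string columns.

```python
import pandas as pd

# Hypothetical miniature of the bank data: one numeric, one categorical, one target column.
toy = pd.DataFrame({
    "age": [30, 45, 52],
    "marital": ["single", "married", "single"],
    "y": ["no", "yes", "no"],
})

# Pick non-numeric predictor columns, skipping the last column (the target y).
cat_cols = [c for c in toy.columns[:-1]
            if not pd.api.types.is_numeric_dtype(toy[c])]

encoded = pd.get_dummies(toy, columns=cat_cols)
print(encoded.columns.tolist())
```

The single marital column is replaced by one indicator column per category (marital_married, marital_single), which is why the column count grows on the real data.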

data

You can see that the number of columns has grown considerably.

data['id']=range(len(data))

An id value is assigned to each row so that rows can be told apart.

 

 

- Splitting into train & test datasets

train = data.sample(30000, replace=False, random_state=2020).reset_index(drop=True)

The train dataset consists of 30000 rows sampled without replacement.

test = data.loc[~data['id'].isin(train['id'])].reset_index(drop=True)

The test dataset consists of the rows whose id is not in train, 11188 rows in total.
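
The same split logic can be checked on a toy frame. The 10-row frame and the 7/3 split are assumptions for the sketch; the point is that the two parts never share an id and together cover every row.

```python
import pandas as pd

toy = pd.DataFrame({"x": range(10)})   # hypothetical 10-row dataset
toy["id"] = range(len(toy))            # row identifier, as in the post

# Sample 7 rows without replacement for training; the rest become the test set.
train_toy = toy.sample(7, replace=False, random_state=2020).reset_index(drop=True)
test_toy = toy.loc[~toy["id"].isin(train_toy["id"])].reset_index(drop=True)

print(len(train_toy), len(test_toy))   # 7 3
print(set(train_toy["id"]).isdisjoint(test_toy["id"]))
```

scikit-learn's train_test_split would be another way to get a disjoint split; the id-based approach is kept here because it mirrors the code in this post.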

 

 

 

 

Training the random forest model


Random forest

- Characteristics

  • Hard to interpret
  • Very slow, since many trees have to be trained
  • Gives more objective variable importances than a single decision tree

 

- RandomForestClassifier(n_estimators=m, min_samples_split=n)

  • n_estimators : how many decision trees to build
  • max_depth : the maximum depth of each decision tree
  • min_samples_split : the minimum number of samples a node must contain before it can be split
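
To make the three parameters concrete, the sketch below fits a small forest on synthetic data and inspects the result. The data comes from make_classification, and all parameter values (20 trees, depth 3, and so on) are assumptions picked to keep the example fast, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the bank dataset).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

rf_small = RandomForestClassifier(
    n_estimators=20,       # build 20 trees
    max_depth=3,           # no tree may grow deeper than 3 levels
    min_samples_split=10,  # a node needs at least 10 samples to be split
    random_state=0,
)
rf_small.fit(X, y)

print(len(rf_small.estimators_))                         # number of fitted trees
print(max(t.get_depth() for t in rf_small.estimators_))  # deepest tree in the forest
```

The fitted trees are available in estimators_, so the effect of each setting can be verified directly rather than taken on trust.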

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=500, min_samples_split=10)

Create the random forest object.

data.columns

input_var = ['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate',
       'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed',
       'job_admin.', 'job_blue-collar', 'job_entrepreneur', 'job_housemaid',
       'job_management', 'job_retired', 'job_self-employed', 'job_services',
       'job_student', 'job_technician', 'job_unemployed', 'job_unknown',
       'marital_divorced', 'marital_married', 'marital_single',
       'marital_unknown', 'education_basic.4y', 'education_basic.6y',
       'education_basic.9y', 'education_high.school', 'education_illiterate',
       'education_professional.course', 'education_university.degree',
       'education_unknown', 'default_no', 'default_unknown', 'default_yes',
       'housing_no', 'housing_unknown', 'housing_yes', 'loan_no',
       'loan_unknown', 'loan_yes', 'contact_cellular', 'contact_telephone',
       'month_apr', 'month_aug', 'month_dec', 'month_jul', 'month_jun',
       'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep',
       'day_of_week_fri', 'day_of_week_mon', 'day_of_week_thu',
       'day_of_week_tue', 'day_of_week_wed', 'poutcome_failure',
       'poutcome_nonexistent', 'poutcome_success']

From the columns returned above, every column except y (and the id helper column) is stored in input_var.

rf.fit(train[input_var],train['y'])

Train the random forest classifier on the train dataset.

predictions = rf.predict(test[input_var])

Predict on the test dataset and store the result in predictions.

(pd.Series(predictions)==test['y']).mean()

Comparing predictions with the true labels (y) and taking the mean gives an accuracy of about 91%.
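
The "compare and take the mean" computation is the same thing as accuracy. A minimal check on made-up labels, with scikit-learn's accuracy_score used as a cross-check, might look like the following. Note that the element-wise comparison relies on both Series having the same index labels, which is why the test set's index was reset earlier.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions (4 of 5 match).
y_true = pd.Series(["no", "no", "yes", "no", "yes"])
y_pred = ["no", "yes", "yes", "no", "yes"]

manual = (pd.Series(y_pred) == y_true).mean()  # fraction of matching positions
library = accuracy_score(y_true, y_pred)       # same quantity from scikit-learn

print(manual, library)
```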

 

 

 

* Comparison with a decision tree

from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(min_samples_split=10)

Create the decision tree object.

dt.fit(train[input_var], train['y'])

predictions = dt.predict(test[input_var])

Fit on the training data and predict on the test data.

(pd.Series(predictions) == test['y']).mean()

Comparing the accuracies shows that the random forest model is slightly more accurate than the decision tree.

 

 

 

Variable importance

feature_imp = rf.feature_importances_
imp_df = pd.DataFrame({'var':input_var,
                       'imp':feature_imp})

imp_df.sort_values(['imp'],ascending=False)

Variable importances are available through feature_importances_. Sorted in descending order, duration has the highest importance and the default_yes column has the lowest. (The concept of variable importance is covered in detail two sessions from now.)
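
For a self-contained look at how feature_importances_ behaves, the sketch below fits a forest on synthetic data and ranks the features. The feature names f0 to f4 and all parameter values are hypothetical; the importances are relative scores that add up to 1 across features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 5 features, of which only 2 carry signal.
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)
names = [f"f{i}" for i in range(5)]  # hypothetical feature names

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# One importance score per feature, sorted from most to least important.
imp = pd.Series(model.feature_importances_, index=names).sort_values(ascending=False)
print(imp)
print(imp.sum())
```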


 

 

 
