๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ 

 

 

< Previous post >

https://silvercoding.tistory.com/67

 

[Bank Marketing๋ฐ์ดํ„ฐ ๋ถ„์„] 2. python ๋ถ€์ŠคํŒ… Boosting, XGBoost ์‚ฌ์šฉ

๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ < ์ด์ „ ๊ธ€ > https://silvercoding.tistory.com/66 https://silvercoding.tistory.com/65 https://silvercoding.tistory.com/64 https://silvercoding.tistory.com/63?category=967543 https..

silvercoding.tistory.com

 

 


 

'๊ฒฐ๋ก ์ด ๋ฌด์—‡์ธ์ง€' ๋ฅผ ์„ค๋ช…ํ•˜๋Š” ๊ฒƒ์€ ๋ฐ์ดํ„ฐ์‚ฌ์ด์–ธํ‹ฐ์ŠคํŠธ๋กœ์„œ์˜ ์ค‘์š”ํ•œ ์—…๋ฌด์ด๋‹ค. 

์˜ˆ์ธก ๊ฒฐ๊ณผ๋งŒ ๋ณด๊ณ ๋Š” ๋ชจ๋ธ์ด ์–ด๋–ค ํŒจํ„ด์„ ์ด์šฉํ•˜์—ฌ ์˜ˆ์ธก์„ ์‹คํ–‰ํ•˜๊ฒŒ ๋˜์—ˆ๋Š”์ง€, ์™œ ๊ทธ๋ ‡๊ฒŒ ์˜ˆ์ธกํ–ˆ๋Š”์ง€ ์„ค๋ช…ํ•  ์ˆ˜ ์—†๋‹ค. ๊ทธ๋ ‡๊ฒŒ ๋˜๋ฉด ๋‹ค๋ฅธ ๋ถ„์•ผ์˜ ํ˜‘์—…์ž๋“ค์€ ์‹ ๋ขฐ๋ฅผ ์žƒ๊ฒŒ๋  ๊ฒƒ์ด๋‹ค. 

๋น„์ฆˆ๋‹ˆ์Šค์˜ ๊ด€์ ์—์„œ ์˜ˆ๋ฅผ ๋“ค์–ด๋ณธ๋‹ค. ๋จธ์‹ ๋Ÿฌ๋‹์„ ํ†ตํ•˜์—ฌ ์˜ํ™” ํฅํ–‰์„ฑ์ ์„ ์˜ˆ์ธกํ•˜๋Š” ํ”„๋กœ์ ํŠธ์—์„œ ํฅํ–‰ ์‹คํŒจ๋ผ๋Š” ์˜ˆ์ธก์ด ๋‚˜์™”๋‹ค๊ณ  ํ–ˆ์„ ๋•Œ, ์–ด๋–ป๊ฒŒ ํฅํ–‰์‹คํŒจ๋ฅผ ๋ง‰์„ ๊ฒƒ์ด๋ƒ๊ณ  ์งˆ๋ฌธ์ด ๋“ค์–ด์˜ฌ ์ˆ˜๋„ ์žˆ๋‹ค. ๊ธฐ์กด์˜ ์ทจ์•ฝ์ ์„ ๋ณด์™„ํ•˜์ง€ ๋ชปํ•œ๋‹ค๋ฉด ๋น„์ฆˆ๋‹ˆ์Šค์˜ ๊ด€์ ์—์„œ ์˜๋ฏธ๊ฐ€ ์—†๋‹ค. 

 

๋”ฐ๋ผ์„œ ๊ฒฐ๊ณผ๋ฅผ ์„ค๋ช…ํ•  ์ˆ˜ ์žˆ๋Š” ๊ฒƒ์€ ์•„์ฃผ ์ค‘์š”ํ•˜๋‹ค. ์ด ๋•Œ ๋ณ€์ˆ˜์ค‘์š”๋„๋ฅผ ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค. ์˜ˆ์ธก์— ํฐ ์˜ํ–ฅ์„ ๋ฏธ์นœ ๋ณ€์ˆ˜์™€, ํŠน์ • ๋ณ€์ˆ˜๊ฐ€ ์–ด๋–ป๊ฒŒ ์˜ํ–ฅ์„ ๋ฏธ์ณค๋Š”์ง€ ์„ฌ์„ธํ•˜๊ฒŒ ํ™•์ธํ•ด๋ณผ ์ˆ˜ ์žˆ๋‹ค. 


๋ณ€์ˆ˜์ค‘์š”๋„

- ๋ชจ๋ธ์— ํ™œ์šฉํ•œ input ๋ณ€์ˆ˜ ์ค‘์—์„œ ์–ด๋–ค ๊ฒƒ์ด target ๊ฐ’์— ๊ฐ€์žฅ ํฐ ์˜ํ–ฅ์„ ๋ฏธ์ณค๋‚˜? 
- ํ•ด๋‹น ์ค‘์š”๋„๋ฅผ ์ˆ˜์น˜ํ™”์‹œํ‚จ ๊ฒƒ
- treeํ˜• ๋ชจ๋ธ (์˜์‚ฌ๊ฒฐ์ •๋‚˜๋ฌด, ๋žœ๋คํฌ๋ ˆ์ŠคํŠธ) ์—์„œ ๊ณ„์‚ฐ ๊ฐ€๋Šฅ 

 

์ด์ „ ๊ธ€์˜ treeํ˜• ๋ชจ๋ธ์ธ random forest์™€ xgboost์—์„œ ๋ณ€์ˆ˜์ค‘์š”๋„ ๊ณ„์‚ฐ์„ ์‹คํ–‰ํ–ˆ์—ˆ๋‹ค.  

(์ฐธ๊ณ )  ๋ฐฐ๊น…  ๋ถ€์ŠคํŒ…


์˜์‚ฌ๊ฒฐ์ •๋‚˜๋ฌด์—์„œ์˜ ๋ณ€์ˆ˜์ค‘์š”๋„

- ํ•ด๋‹น input ๋ณ€์ˆ˜๊ฐ€ ์˜์‚ฌ๊ฒฐ์ •๋‚˜๋ฌด์˜ ๊ตฌ์ถ•์—์„œ ์–ผ๋งˆ๋‚˜ ๋งŽ์ด ์“ฐ์ด๋‚˜ 
- ํ•ด๋‹น ๋ณ€์ˆ˜๋ฅผ ๊ธฐ์ค€์œผ๋กœ ๋ถ„๊ธฐ๋ฅผ ํ–ˆ์„ ๋•Œ ๊ฐ ๊ตฌ๊ฐ„์˜ ๋ณต์žก๋„๊ฐ€ ์–ผ๋งˆ๋‚˜ ์ค„์–ด๋“œ๋Š”๊ฐ€? 
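
As a rough sketch of how this is read off in code (sklearn's impurity-based importances; X and y here are hypothetical placeholders, not variables from the class code):

from sklearn.tree import DecisionTreeClassifier
import pandas as pd

# X: a DataFrame of input variables, y: the target (hypothetical names)
dt = DecisionTreeClassifier(min_samples_split=10)
dt.fit(X, y)
# impurity-based importance: how much each variable reduced impurity across all splits
print(pd.Series(dt.feature_importances_, index=X.columns).sort_values(ascending=False))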



Shapley value

: The size of each variable's influence on the prediction output

: What kind of influence the variable has

 

(Example) Soccer player A on team B

- The size of each player's influence on the team's performance

- What kind of influence the player has

- (Win rate of team B with player A) - (win rate of team B without player A) = 7% (a small numeric sketch follows below)
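
A minimal numeric sketch of that marginal-contribution idea (the win rates are made up purely for illustration):

# hypothetical win rates, only to illustrate the marginal contribution
win_rate_with_a = 0.62     # team B's win rate when player A plays
win_rate_without_a = 0.55  # team B's win rate when player A does not play
print(round(win_rate_with_a - win_rate_without_a, 2))   # 0.07 -> player A contributes about 7%p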


SHAP value practice

To focus on the SHAP value practice, everything up to the XGBoost training is run exactly as in the previous post.

Loading the data

import os
import pandas as pd
import numpy as np
os.chdir('./data') # your own data folder path
data = pd.read_csv("bank-additional-full.csv", sep = ";")

์ด์ „ ๊ธ€์—์„œ ์‚ฌ์šฉํ•˜์˜€๋˜ ์˜ˆ๊ธˆ ๊ฐ€์ž… ์—ฌ๋ถ€ ๋ฐ์ดํ„ฐ์…‹์ด๋‹ค. 

data = pd.get_dummies(data, columns = ['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome'])

The categorical variables are one-hot encoded with get_dummies.

data['y'].value_counts()

๋ถ„๋ฅ˜ ๋ชจ๋ธ์ด๊ธฐ ๋•Œ๋ฌธ์— ๋ชฉํ‘œ๋ณ€์ˆ˜๋„ ๋‹น์—ฐํžˆ ๋ฒ”์ฃผํ˜• ๋ณ€์ˆ˜๋กœ ๋˜์–ด์žˆ๋‹ค. 

data['y'] = np.where( data['y'] == 'no', 0, 1)

However, the shap package works better when the target variable is numeric, so it is converted to 0/1.

 

 

 

XGBoost training

input_var = ['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate',
       'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed',
       'job_admin.', 'job_blue-collar', 'job_entrepreneur', 'job_housemaid',
       'job_management', 'job_retired', 'job_self-employed', 'job_services',
       'job_student', 'job_technician', 'job_unemployed', 'job_unknown',
       'marital_divorced', 'marital_married', 'marital_single',
       'marital_unknown', 'education_basic.4y', 'education_basic.6y',
       'education_basic.9y', 'education_high.school', 'education_illiterate',
       'education_professional.course', 'education_university.degree',
       'education_unknown', 'default_no', 'default_unknown', 'default_yes',
       'housing_no', 'housing_unknown', 'housing_yes', 'loan_no',
       'loan_unknown', 'loan_yes', 'contact_cellular', 'contact_telephone',
       'month_apr', 'month_aug', 'month_dec', 'month_jul', 'month_jun',
       'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep',
       'day_of_week_fri', 'day_of_week_mon', 'day_of_week_thu',
       'day_of_week_tue', 'day_of_week_wed', 'poutcome_failure',
       'poutcome_nonexistent', 'poutcome_success']

y ์ปฌ๋Ÿผ์„ ์ œ์™ธํ•œ ์ธํ’‹๋ณ€์ˆ˜๋ฅผ ๋ฆฌ์ŠคํŠธ์— ๋ชจ๋‘ ๋‹ด์•„์ค€๋‹ค. 

from xgboost import XGBRegressor

์ˆ˜์น˜ํ˜•์œผ๋กœ ์˜ˆ์ธก์„ ์ง„ํ–‰ํ•˜๊ธฐ ์œ„ํ•ด XBGRegressor ํšŒ๊ท€๋ชจ๋ธ์„ ์ž„ํฌํŠธ ํ•ด์ค€๋‹ค. 

xgb = XGBRegressor( n_estimators = 300, learning_rate=0.1 )
xgb.fit(data[input_var], data['y'])

Xgboost ํ•™์Šต์„ ์ง„ํ–‰ํ•œ๋‹ค. 

 

 

SHAP Value Examples

import shap

Import the shap library.

 

(1) ๋ณ€์ˆ˜์ค‘์š”๋„

explainer = shap.TreeExplainer(xgb)
shap_values = explainer.shap_values( data[input_var] )

shap.TreeExplainer์˜ ์ธ์ž์— ํ•™์Šตํ•œ ๋ชจ๋ธ xgb๋ฅผ ๋„ฃ์–ด ๊ฐ์ฒด๋ฅผ ์ €์žฅํ•ด์ค€๋‹ค. ๊ทธ๋‹ค์Œ explainer.shap_values์˜ ์ธ์ž์— ๋ฐ์ดํ„ฐ์…‹์˜ ์ธํ’‹๊ฐ’์„ ๋„ฃ์–ด์ค€๋‹ค. 

shap.summary_plot( shap_values , data[input_var] , plot_type="bar" )

shap.summary_plot์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ณ€์ˆ˜์ค‘์š”๋„ ๊ทธ๋ž˜ํ”„๋ฅผ ๊ทธ๋ ค์ค€๋‹ค. ๊ฐ€์žฅ ๋†’์€ ๋ณ€์ˆ˜๋Š” duration์ด๋‹ค. duration์€ ์ „ํ™”์‹œ๊ฐ„์ด๋‹ค. ์ „ํ™”์‹œ๊ฐ„์˜ ๊ธธ์ด๊ฐ€ ์ด ๋ชจ๋ธ์˜ ์˜ˆ์ธก์— ๊ฐ€์žฅ ์˜ํ–ฅ์„ ๋งŽ์ด ๋ฏธ์นœ๋‹ค๋Š” ์˜๋ฏธ์ด๋‹ค. 

 

 

(2) dependence plot

: Shows the relationship between a specific input variable and the target variable

: Each dot is one row (one observation); the y-axis is its impact on the target

: Lets you examine in detail how the variable influenced the prediction

shap.dependence_plot( 'duration' , shap_values , data[input_var] )

duration์˜ ๊ทธ๋ž˜ํ”„๋ฅผ ๋ณด๋ฉด duration์˜ ๋Œ€๋ถ€๋ถ„์ด 3000 ๋ฏธ๋งŒ์— ์กด์žฌํ•˜๊ณ , ๊ทธ ์ค‘์—์„œ๋Š” duration์ด 50์ด์ƒ์ฏค ๋˜๋ฉด ์ข‹์€ ์˜ํ–ฅ๋ ฅ์„ ๋ผ์ณ 1์ผ ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์•„์ง„๋‹ค๊ณ  ํ•ด์„๋œ๋‹ค. (shpa value for duration์ด 0๋ณด๋‹ค ํฐ ๋ฐ์ดํ„ฐ๊ฐ€ ๋งŽ์Œ) 

shap.dependence_plot( 'nr.employed' , shap_values , data[input_var] )

5020์ฏค ๋˜๋Š” ์ง€์ ์—์„œ ์˜ํ–ฅ๋ ฅ์ด ์Œ์ˆ˜๊ฐ€ ๋œ๋‹ค. ๊ทธ๋ฆฌ๊ณ  5100์ด ๋„˜์–ด๊ฐ€๊ณ ๋Š” ์Œ์ˆ˜์˜ ์˜ํ–ฅ๋ ฅ๋ฐ–์— ์—†๋‹ค. (-> 0์ผ ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์Œ) ๊ทธ ์ด์ „์—๋Š” ์˜ํ–ฅ๋ ฅ์ด ๋†’์œผ๋ฏ€๋กœ ์ข‹์€ ์˜ํ–ฅ๋ ฅ์„ ๋ผ์นœ๋‹ค. (-> 1์ผ ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์Œ) 

shap.dependence_plot( 'euribor3m' , shap_values , data[input_var] )

Positive and negative values look roughly evenly distributed. Looking for ranges with few negative and many positive values, roughly 1.3-1.4 up to 2, and 4 to 5, stand out. In those ranges the prediction is more likely to be 1.

shap.dependence_plot( 'cons.conf.idx' , shap_values , data[input_var] )

์ „์ฒด์ ์œผ๋กœ ์Œ์ˆ˜๋ฅผ ์ด๋ฃจ๊ณ  ์žˆ์Œ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค. -45์ดํ•˜์ผ ๋•Œ๋Š” 1์ผ ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์•„์ง„๋‹ค๊ณ  ํ•ด์„ํ•  ์ˆ˜ ์žˆ๋‹ค. 

shap.dependence_plot( 'pdays' , shap_values , data[input_var] )

pdays๊ฐ€ 0์ผ๋•Œ ๋Œ€๋‹ค์ˆ˜์˜ ๋ฐ์ดํ„ฐ๊ฐ€ 1์ผ ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์•„์งˆ ๊ฒƒ์ด๋ผ ์˜ˆ์ƒํ•  ์ˆ˜ ์žˆ๋‹ค. 

 

 

 

(3) force plot

: Visualizes how a specific prediction was made

prediction = xgb.predict(data[input_var])
data['pred'] = prediction

 

shap.initjs()
shap.force_plot( explainer.expected_value , shap_values[41187] , data[input_var].iloc[41187] )

The 41187th row was predicted at 0.09; the variables pushing the prediction down and those pushing it up are fairly evenly balanced.

 

shap.force_plot( explainer.expected_value , shap_values[0] , data[input_var].iloc[0] )

0์— ๊ฑฐ์˜ ๊ฐ€๊น๊ฒŒ ์˜ˆ์ธก๋œ 0๋ฒˆ์งธ ๋ฐ์ดํ„ฐ๋Š” ๊ฑฐ์˜ ๋ชจ๋“  ๋ณ€์ˆ˜๊ฐ€ ์Œ์ˆ˜์˜ ์˜ํ–ฅ๋ ฅ์„ ๋ผ์นœ ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. 

 

The 41183rd row shows a much stronger positive contribution. Its prediction came out as 0.88, and the true label is 1, so the model got very close.
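
The call that produced this plot is not shown in the post; presumably it is the same pattern with index 41183:

shap.force_plot( explainer.expected_value , shap_values[41183] , data[input_var].iloc[41183] )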

 

 

Using the shap library this way, we could see in detail how each variable influenced the predictions.


 

 

 

 

 

 

 

 

 

 

 

 

 

 

๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ 

 

 

< Previous post >

https://silvercoding.tistory.com/66

 

[Bank Marketing๋ฐ์ดํ„ฐ ๋ถ„์„] 1. python ๋ฐฐ๊น… , ๋žœ๋คํฌ๋ ˆ์ŠคํŠธ bagging, randomforest

๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ < ์ด์ „ ๊ธ€ > https://silvercoding.tistory.com/65 https://silvercoding.tistory.com/64 https://silvercoding.tistory.com/63?category=967543 https://silvercoding.tistory.com/62 [bost..

silvercoding.tistory.com

 

 


Boosting

Securing diversity across models (the boosting procedure)

  • Raise the weights of the observations the previous model misclassified, and train the next model on this re-weighted data
  • Build a model from each (re-weighted) dataset
  • Because every model trains on a different dataset, diversity across models is secured

Combining the final outputs

  • Take a weighted average of the predictions from each model (a rough sketch of the whole procedure follows below)

 

Setting n_estimators

(n_estimators: how many decision trees to build)

  • If n_estimators is too high, the model becomes sensitive to noise and may overfit
  • If n_estimators is too low, the model may underfit
  • The key is to find an appropriate n_estimators

 

 


 

 


๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ 

import os
import pandas as pd
os.chdir('../data')   # path to the folder containing your data file
data = pd.read_csv("bank-additional-full.csv", sep = ';')
data.head()

data.info()

 

 

 

Preprocessing - one-hot encoding the categorical variables

data = pd.get_dummies(data,columns=['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome'])

dtype์ด object์ธ ๋ฒ”์ฃผํ˜• ๋ณ€์ˆ˜๋ฅผ get_dummies๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์›ํ•ซ์ธ์ฝ”๋”ฉ ํ•ด์ค€๋‹ค. 

 

 

 

train & test ๋ฐ์ดํ„ฐ์…‹ ๋ถ„๋ฆฌ 

data['id']=range(len(data))

๋ฐ์ดํ„ฐ๋ฅผ ๊ตฌ๋ถ„ํ•˜๊ธฐ ์œ„ํ•˜์—ฌ ๊ฐ row์— id๋ฅผ ๋ถ€์—ฌํ•œ๋‹ค. 

train = data.sample(30000,replace=False,random_state=2020).reset_index().drop(['index'],axis=1)
test = data.loc[ ~data['id'].isin(train['id']) ].reset_index().drop(['index'],axis=1)

์ด์ „๊ธ€๊ณผ ๋™์ผํ•˜๊ฒŒ train, test ๋ฐ์ดํ„ฐ์…‹์„ ๋ถ„๋ฆฌํ•ด ์ค€๋‹ค.

 

 

 

Storing the input variables

data.columns

input_var = ['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate',
       'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed',
       'job_admin.', 'job_blue-collar', 'job_entrepreneur', 'job_housemaid',
       'job_management', 'job_retired', 'job_self-employed', 'job_services',
       'job_student', 'job_technician', 'job_unemployed', 'job_unknown',
       'marital_divorced', 'marital_married', 'marital_single',
       'marital_unknown', 'education_basic.4y', 'education_basic.6y',
       'education_basic.9y', 'education_high.school', 'education_illiterate',
       'education_professional.course', 'education_university.degree',
       'education_unknown', 'default_no', 'default_unknown', 'default_yes',
       'housing_no', 'housing_unknown', 'housing_yes', 'loan_no',
       'loan_unknown', 'loan_yes', 'contact_cellular', 'contact_telephone',
       'month_apr', 'month_aug', 'month_dec', 'month_jul', 'month_jun',
       'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep',
       'day_of_week_fri', 'day_of_week_mon', 'day_of_week_thu',
       'day_of_week_tue', 'day_of_week_wed', 'poutcome_failure',
       'poutcome_nonexistent', 'poutcome_success']

data์˜ ์ปฌ๋Ÿผ์—์„œ y๋ฅผ ์ œ์™ธํ•œ ์ปฌ๋Ÿผ์„ input_var์— ์ €์žฅํ•ด ์ค€๋‹ค. 

 

 

 

 

 


XGBoost ๋ชจ๋ธํ•™์Šต 


XGBoost 

- Characteristics

  • Hard to interpret
  • Generally faster and better-performing than random forest

- xgb = XGBClassifier( n_estimators = 300, learning_rate = 0.1 )

  • n_estimators: how many decision trees to build
  • learning_rate: how fast the model learns

- Installation

!pip install xgboost

First, install xgboost if it is not installed yet.

from xgboost import XGBClassifier
xgb = XGBClassifier( n_estimators = 300, learning_rate = 0.1 )
xgb.fit(train[input_var], train['y'])

๊ฐ์ฒด ์ƒ์„ฑ์„ ํ•˜๊ณ , train ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ํ•™์Šต๊นŒ์ง€ ์ง„ํ–‰ํ•œ๋‹ค. 

predictions = xgb.predict(test[input_var])

test ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ์˜ˆ์ธก์„ ์ˆ˜ํ–‰ํ•œ ํ›„ predictions์— ์ €์žฅํ•œ๋‹ค. 

(pd.Series(predictions)==test['y']).mean()

์ •ํ™•๋„๊ฐ€ ์•ฝ 91 % ๊ฐ€ ๋‚˜์™”๋‹ค. ํ˜„์žฌ ๋ชจ๋ธ์€ n_estimators๋ฅผ 300์œผ๋กœ ์ง€์ •ํ•˜์˜€๋‹ค. ์•ž์—์„œ ํ•™์Šตํ•˜์˜€๋“ฏ์ด, ์˜ค๋ฒ„ํ”ผํŒ…๊ณผ ์–ธ๋”ํ”ผํŒ…์„ ํ”ผํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ๋ถ€์ŠคํŒ…์—์„œ n_estimators๋ฅผ ์ ์ ˆํ•˜๊ฒŒ ์„ค์ •ํ•˜๋Š” ๊ฒƒ์ด ๊ด€๊ฑด์ด๋ผ๊ณ  ํ•˜์˜€๋‹ค. ๋”ฐ๋ผ์„œ ์ตœ์ ์˜ n_estimators๋ฅผ ์ฐพ์•„๋ณด๋„๋ก ํ•œ๋‹ค. 

 

 

 

์ตœ์  ์˜์‚ฌ๊ฒฐ์ •๋‚˜๋ฌด ์ˆ˜ ( n_estimators ) ์ฐพ๊ธฐ 

for n in [100,200,300,400,500,600,700,800,900]:
    xgb = XGBClassifier( n_estimators = n, learning_rate = 0.05, eval_metric='logloss' )
    xgb.fit(train[input_var], train['y'])
    predictions = xgb.predict(test[input_var])
    print((pd.Series(predictions)==test['y']).mean())

๊ฒฐ๊ณผ : ์ตœ์ ์˜ n_estimators ๋Š” 400์ด๋‹ค. 

 

 

๋ณ€์ˆ˜์ค‘์š”๋„ 

feature_imp = xgb.feature_importances_

Variable importance can be computed with feature_importances_.

imp_df = pd.DataFrame({'var':input_var,
                       'imp':feature_imp})

imp_df.sort_values(['imp'],ascending=False)

๋ณ€์ˆ˜์ค‘์š”๋„๋ฅผ ๋‚ด๋ฆผ์ฐจ์ˆœ์œผ๋กœ ์ •๋ ฌํ•ด๋ณด๋‹ˆ nr.emplyed ์ปฌ๋Ÿผ์ด ๊ฐ€์žฅ ์ค‘์š”ํ•œ ๋ณ€์ˆ˜๋กœ ๋‚˜์˜จ ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. 


 

 

 

 

 

๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ 

 

 

< Previous post >

https://silvercoding.tistory.com/65

 

[IRIS ๋ฐ์ดํ„ฐ ๋ถ„์„] 2. Python Decision Tree ( ์˜์‚ฌ ๊ฒฐ์ • ๋‚˜๋ฌด )

๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ < ์ด์ „ ๊ธ€ > https://silvercoding.tistory.com/64 https://silvercoding.tistory.com/63?category=967543 https://silvercoding.tistory.com/62 [boston ๋ฐ์ดํ„ฐ ๋ถ„์„] 1. ์ฐจ์›์ถ•์†Œ (PCA) ํŒŒ..

silvercoding.tistory.com

 

 

 


Bagging

- The philosophy of bagging

1. The more, the better.

2. The more diverse, the better.

(ex) 1 man < 10 men (more of them) < 5 men + 5 women (more of them and more diverse)

 

 

- ๊ฐ ๋ชจ๋ธ์˜ ๋‹ค์–‘์„ฑ ํ™•๋ณด๋ฅผ ์–ด๋–ป๊ฒŒ ํ•˜๋Š”๊ฐ€? (๋ฐฐ๊น… ํ”„๋กœ์„ธ์Šค)  

1. ์ „์ฒด ๋ฐ์ดํ„ฐ์…‹์—์„œ ๋žœ๋ค ์ƒ˜ํ”Œ๋ง ( ๋ณต์› ์ถ”์ถœ / ์ค‘๋ณต ๋ฐ์ดํ„ฐ๊ฐ€ ๋‚˜์˜ฌ์ˆ˜๋„, ์•„์˜ˆ ๋ฝ‘ํžˆ์ง€ ์•Š์€ ๋ฐ์ดํ„ฐ๊ฐ€ ์žˆ์„์ˆ˜๋„. ) -> ์—ฌ๋Ÿฌ ๋ฐ์ดํ„ฐ์…‹ ์ƒ์„ฑ 

2. ๊ฐ ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ๋ชจ๋ธ ์ƒ์„ฑ 

3. ๋ชจ๋ธ๋ณ„๋กœ ํ•™์Šตํ•˜๋Š” ๋ฐ์ดํ„ฐ์…‹์ด ๋‹ค๋ฅด๋ฏ€๋กœ ๋ชจ๋ธ์˜ ๋‹ค์–‘์„ฑ ํ™•๋ณด 

 

 

- ์ตœ์ข… ๊ฒฐ๊ณผ๋ฌผ์˜ ๊ฒฐํ•ฉ? 

: ๊ฐ ๋ชจ๋ธ๋กœ๋ถ€ํ„ฐ ๋‚˜์˜จ ์˜ˆ์ธก์น˜์˜ ๋‹จ์ˆœ ํ‰๊ท ์„ ๊ตฌํ•œ๋‹ค. 

 

 

- ๋žœ๋คํฌ๋ ˆ์ŠคํŠธ (๋ณธ ํฌ์ŠคํŒ…์—์„œ ์‚ฌ์šฉํ•  ๋ชจ๋ธ) 

: ๋ฐฐ๊น…์˜ ํ”„๋กœ์„ธ์Šค๋ฅผ ๋”ฐ๋ฅด๋ฉด์„œ ์˜์‚ฌ๊ฒฐ์ •๋‚˜๋ฌด๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜ 


 

 

 


 ๋ฐ์ดํ„ฐ ์‚ดํŽด๋ณด๊ธฐ 

์‚ฌ์šฉํ•  ๋ฐ์ดํ„ฐ๋Š” ์บ๊ธ€์˜ Dataset ์—์„œ ๋ฐ›์„ ์ˆ˜ ์žˆ๋‹ค. 

< Bank Marketing dataset > 

https://www.kaggle.com/volodymyrgavrysh/bank-marketing-campaigns-dataset

 


import os
import pandas as pd
os.chdir('../data')  # path to your own data folder
data = pd.read_csv("bank-additional-full.csv", sep = ';')

๋ฐ์ดํ„ฐ๋ฅผ ๋ถˆ๋Ÿฌ์˜ฌ ๋•Œ ์ฃผ์˜ํ•  ์ ์€ sep=';' ์„ ์„ค์ •ํ•ด ์ฃผ์–ด์•ผ ํ•œ๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. ์ด ํŒŒ์ผ์€ csv ํŒŒ์ผ์ด์ง€๋งŒ ์ฝค๋งˆ(,) ๊ฐ€ ์•„๋‹Œ ์„ธ๋ฏธ์ฝœ๋ก (;) ์œผ๋กœ ๊ตฌ๋ถ„์ด ๋˜์–ด์žˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. 

data.head()

๋‚˜์ด, ์ง์—…, ๊ฒฐํ˜ผ์—ฌ๋ถ€, ๋Œ€์ถœ์—ฌ๋ถ€ ๋“ฑ์˜ ์˜ˆ์ธก๋ณ€์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ•ด๋‹น ๊ณ ๊ฐ์˜ ์˜ˆ๊ธˆ ๊ฐ€์ž…์—ฌ๋ถ€๋ฅผ ๋งžํžˆ๋Š” ํ•™์Šต์„ ์ง„ํ–‰ํ•œ๋‹ค. 

data.info()

dtype์ด object์ธ ๋ณ€์ˆ˜๋Š” ๋ฒ”์ฃผํ˜• ๋ณ€์ˆ˜๋กœ ,  ์›ํ•ซ์ธ์ฝ”๋”ฉ์„ ํ•ด์ฃผ์–ด์•ผ ํ•œ๋‹ค. 

 

 

 

 


 ๋žœ๋คํฌ๋ ˆ์ŠคํŠธ ์‚ฌ์šฉ 

์ „์ฒ˜๋ฆฌ - ๋ฒ”์ฃผํ˜• ๋ณ€์ˆ˜ ์›ํ•ซ์ธ์ฝ”๋”ฉ

- dtype์ด object์ธ ์ปฌ๋Ÿผ ์ถ”์ถœ 

obj_column = []
for column in data.columns[:-1]:
    if data[column].dtype == 'object':
        obj_column.append(column)
        
obj_column

data = pd.get_dummies(data,columns=obj_column)

get_dummies๋ฅผ ์ด์šฉํ•˜์—ฌ ์›ํ•ซ์ธ์ฝ”๋”ฉ์„ ์ง„ํ–‰ํ•œ๋‹ค. 

data

์ปฌ๋Ÿผ์ˆ˜๊ฐ€ ๋งŽ์ด ๋Š˜์–ด๋‚œ ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

data['id']=range(len(data))

๋ฐ์ดํ„ฐ ๊ตฌ๋ถ„์„ ์œ„ํ•˜์—ฌ id๊ฐ’์„ ๋ถ€์—ฌํ•œ๋‹ค. 

 

 

- Splitting into train & test datasets

train = data.sample(30000,replace=False,random_state=2020).reset_index().drop(['index'],axis=1)

train ๋ฐ์ดํ„ฐ์…‹์„ ๋น„๋ณต์›์ถ”์ถœ๋กœ 30000๊ฐœ๋ฅผ ๊ตฌ์„ฑํ•œ๋‹ค. 

test = data.loc[ ~data['id'].isin(train['id']) ].reset_index().drop(['index'],axis=1)

test๋ฐ์ดํ„ฐ์…‹์€ train์— ์—†๋Š” id๊ฐ’์œผ๋กœ ์ด 11188๊ฐœ์˜ ๋ฐ์ดํ„ฐ๋กœ ๊ตฌ์„ฑ๋œ๋‹ค. 

 

 

 

 

๋žœ๋คํฌ๋ ˆ์ŠคํŠธ ๋ชจ๋ธ ํ•™์Šต 


๋žœ๋คํฌ๋ ˆ์ŠคํŠธ

- ํŠน์ง• 

  • ํ•ด์„์ด ์–ด๋ ค์›€
  • ๋งค์šฐ ๋Š๋ฆผ
  • ์˜์‚ฌ๊ฒฐ์ •๋‚˜๋ฌด๋ณด๋‹ค ๋” ๊ฐ๊ด€์ ์ธ ๋ณ€์ˆ˜ ์ค‘์š”๋„๋ฅผ ๋ฝ‘์•„๋‚ผ ์ˆ˜ ์žˆ์Œ 

 

- RandomForestClassifier(n_estimators=m, min_samples_split=n)

  • n_estimators: how many decision trees to build
  • max_depth: maximum depth of each decision tree
  • min_samples_split: minimum number of samples a node needs before it can be split

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=500, min_samples_split=10)

๋žœ๋คํฌ๋ ˆ์ŠคํŠธ ๊ฐ์ฒด๋ฅผ ์ƒ์„ฑํ•œ๋‹ค. 

data.columns

input_var = ['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate',
       'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed',
       'job_admin.', 'job_blue-collar', 'job_entrepreneur', 'job_housemaid',
       'job_management', 'job_retired', 'job_self-employed', 'job_services',
       'job_student', 'job_technician', 'job_unemployed', 'job_unknown',
       'marital_divorced', 'marital_married', 'marital_single',
       'marital_unknown', 'education_basic.4y', 'education_basic.6y',
       'education_basic.9y', 'education_high.school', 'education_illiterate',
       'education_professional.course', 'education_university.degree',
       'education_unknown', 'default_no', 'default_unknown', 'default_yes',
       'housing_no', 'housing_unknown', 'housing_yes', 'loan_no',
       'loan_unknown', 'loan_yes', 'contact_cellular', 'contact_telephone',
       'month_apr', 'month_aug', 'month_dec', 'month_jul', 'month_jun',
       'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep',
       'day_of_week_fri', 'day_of_week_mon', 'day_of_week_thu',
       'day_of_week_tue', 'day_of_week_wed', 'poutcome_failure',
       'poutcome_nonexistent', 'poutcome_success']

๋ฐ˜ํ™˜๋œ data์˜ ์ปฌ๋Ÿผ์—์„œ y๋ฅผ ๋บ€ ์ปฌ๋Ÿผ๋“ค์„ input_var ๋ณ€์ˆ˜์— ์ €์žฅํ•ด ์ค€๋‹ค. 

rf.fit(train[input_var],train['y'])

train ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ๋žœ๋คํฌ๋ ˆ์ŠคํŠธ๋ถ„๋ฅ˜๊ธฐ ๋ชจ๋ธ ํ•™์Šต์„ ์ง„ํ–‰ํ•œ๋‹ค. 

predictions = rf.predict(test[input_var])

test๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ์˜ˆ์ธก์„ ์ง„ํ–‰ํ•˜๊ณ , predictions ๋ณ€์ˆ˜์— ์ €์žฅํ•ด ์ค€๋‹ค. 

(pd.Series(predictions)==test['y']).mean()

Comparing predictions with the true labels (y) and taking the mean gives an accuracy of about 91%.

 

 

 

* ์˜์‚ฌ๊ฒฐ์ •๋‚˜๋ฌด์™€์˜ ๋น„๊ต 

from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(min_samples_split=10)

์˜์‚ฌ๊ฒฐ์ •๋‚˜๋ฌด ๊ฐ์ฒด๋ฅผ ์ƒ์„ฑํ•œ๋‹ค. 

dt.fit(train[input_var], train['y'])

predictions = dt.predict(test[input_var])

ํ•™์Šต๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•œ ํ•™์Šต๊ณผ ํ…Œ์ŠคํŠธ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•œ ์˜ˆ์ธก์„ ์ง„ํ–‰ํ•œ๋‹ค. 

(pd.Series(predictions) == test['y']).mean()

์ •ํ™•๋„๋ฅผ ๋น„๊ตํ•ด๋ณด๋‹ˆ ์˜์‚ฌ๊ฒฐ์ •๋‚˜๋ฌด๋ณด๋‹ค ๋žœ๋คํฌ๋ ˆ์ŠคํŠธ ๋ชจ๋ธ์˜ ์ •ํ™•๋„๊ฐ€ ์กฐ๊ธˆ ๋” ๋†’์€ ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค. 

 

 

 

๋ณ€์ˆ˜์ค‘์š”๋„ 

feature_imp = rf.feature_importances_
imp_df = pd.DataFrame({'var':input_var,
                       'imp':feature_imp})

imp_df.sort_values(['imp'],ascending=False)

Variable importance is obtained with feature_importances_. Sorting in descending order shows that duration is the highest and default_yes the lowest. (The concept of variable importance is covered in detail in a later post.)


 

 

 

๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ 

 

 

< Previous post >

https://silvercoding.tistory.com/64

 

[IRIS ๋ฐ์ดํ„ฐ ๋ถ„์„] 1. Python KNN ๋ถ„๋ฅ˜

๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ < ์ด์ „ ๊ธ€ > https://silvercoding.tistory.com/63?category=967543 https://silvercoding.tistory.com/62 [boston ๋ฐ์ดํ„ฐ ๋ถ„์„] 1. ์ฐจ์›์ถ•์†Œ (PCA) ํŒŒ์ด์ฌ ์˜ˆ์ œ ๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ  ๋ผ..

silvercoding.tistory.com

 

 

 

 


 ๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ 

์ด์ „ ๊ธ€๊ณผ ๋™์ผํ•œ Iris Flower Dataset ์„ ์ด์šฉํ•˜์—ฌ ์‹ค์Šต์„ ์ง„ํ–‰ํ•œ๋‹ค. 

< Iris Flower Dataset >

https://www.kaggle.com/arshid/iris-flower-dataset

 


 

import pandas as pd
import os
os.chdir('../data')  # path to the folder containing your dataset
iris = pd.read_csv("IRIS.csv")
iris.head()

iris['species'].value_counts()

๊ฐ ์ข…๋ฅ˜๋งˆ๋‹ค 50๊ฐœ์˜ ๋ฐ์ดํ„ฐ๊ฐ€ ์กด์žฌํ•œ๋‹ค. 

 

 

 

 

 


 

Using a decision tree

Splitting into train & test datasets

iris['id'] = range(len(iris))

์šฐ์„  ๋ฐ์ดํ„ฐ๋ฅผ ๊ตฌ๋ถ„ํ•˜๊ธฐ ์œ„ํ•ด ์ˆœ์„œ๋Œ€๋กœ ๊ฐ’์„ ๋„ฃ์–ด์ค€ id ์ปฌ๋Ÿผ์„ ์ƒ์„ฑํ•œ๋‹ค. 

iris = iris[['id','sepal_length','sepal_width','petal_length','petal_width','species']]

id ์ปฌ๋Ÿผ์ด ๊ฐ€์žฅ ์•ž์— ์˜ค๋„๋ก ์ •๋ ฌํ•ด์ค€๋‹ค. 

train = iris.sample(100,replace=False,random_state=7).reset_index().drop(['index'],axis=1)

๋žœ๋ค์œผ๋กœ 100๊ฐœ์˜ ์ƒ˜ํ”Œ์„ ์ถ”์ถœํ•˜์—ฌ train ์— ์ €์žฅํ•ด ์ค€๋‹ค. 

test = iris.loc[ ~iris['id'].isin(train['id']) ]
test = test.reset_index().drop(['index'],axis=1)

train์˜ id๊ฐ’์ด ์กด์žฌํ•˜์ง€ ์•Š๋Š” iris ๋ฐ์ดํ„ฐ๋“ค์„ test์— ๋„ฃ์–ด์ค€๋‹ค. 

 

 

 

Training the decision tree

DecisionTreeClassifier(min_samples_split = n)

---> Characteristics: easy to interpret and fast.

---> min_samples_split: the minimum number of samples a node must contain before it can be split further

 

from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(min_samples_split = 10)

min_samples_split ์„ 10์œผ๋กœ ์„ค์ •ํ•ด์ฃผ์–ด ์ตœ์ข… ๋…ธ๋“œ์˜ ์ƒ˜ํ”Œ์ˆ˜๊ฐ€ 10๋ฏธ๋งŒ์ด ๋˜์ง€ ์•Š๋„๋ก ์กฐ์ •ํ•œ๋‹ค. 

dt.fit(train[['sepal_length','sepal_width','petal_length','petal_width']],train['species'])

์ƒ์„ฑํ•ด ๋†“์€ dt ๊ฐ์ฒด๋กœ ํ•™์Šต์„ ์‹œ์ผœ์ค€๋‹ค. 

predictions = dt.predict(test[['sepal_length','sepal_width','petal_length','petal_width']])

์˜ˆ์ธก๊ฐ’์„ prediction์— ๋„ฃ์–ด์ค€๋‹ค. 

test['pred'] = predictions

์˜ˆ์ธก๊ฐ’ prediction์„ test์˜ pred ์ปฌ๋Ÿผ์— ์ €์žฅํ•œ๋‹ค. 

test.head()

(pd.Series(predictions)==test['species']).mean()

์˜ˆ์ธก๊ฐ’๊ณผ ์ •๋‹ต์„ ๋น„๊ตํ•˜์—ฌ ์ •ํ™•๋„๋ฅผ ๊ตฌํ•ด๋ณด๋‹ˆ 0.98์ด ๋‚˜์™”๋‹ค. 

 

 

 


์œ„์˜ ์ •ํ™•๋„ ์ธก์ • ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•˜๋ฉด ์‹ ๋ขฐ์„ฑ์ด ํ•˜๋ฝํ•  ์ˆ˜ ์žˆ๋‹ค. train, test ๋ฐ์ดํ„ฐ๋ฅผ ์–ด๋–ป๊ฒŒ ๋‚˜๋ˆ„๋Š”์ง€์— ๋”ฐ๋ผ ๊ฒฐ๊ณผ๊ฐ€ ํฌ๊ฒŒ ๋‹ฌ๋ผ์งˆ ์ˆ˜๋„ ์žˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ๋”ฐ๋ผ์„œ cross validation์„ ์ด์šฉํ•˜์—ฌ ์ •ํ™•๋„๋ฅผ ๊ตฌํ•ด๋ณผ ์ˆ˜ ์žˆ๋‹ค. 


from sklearn.model_selection import cross_val_score
import numpy as np
dt = DecisionTreeClassifier(min_samples_split = 10)
scores = cross_val_score(dt, iris[['sepal_length','sepal_width','petal_length','petal_width']], iris['species'], cv=5, scoring="accuracy")
np.mean(scores)

 

When the dataset is small, as in this example, running cross validation on the full data like this is more reliable. 5-fold cross validation gives an accuracy of about 0.97.

 

 

 


์˜์‚ฌ๊ฒฐ์ •๋‚˜๋ฌด ์‹œ๊ฐํ™”

from sklearn import tree
import matplotlib.pyplot as plt
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 16,10
a=tree.plot_tree(dt,feature_names = ['sepal_length','sepal_width','petal_length','petal_width'],impurity=False, max_depth=2, fontsize=10, proportion=True)
plt.show()

max_depth๋ฅผ ์ด์šฉํ•˜์—ฌ ๊นŠ์ด๋ฅผ ์กฐ์ ˆํ•  ์ˆ˜ ์žˆ๋‹ค. 2๊ฐœ ์ดํ›„๋กœ๋Š” (...) ์œผ๋กœ ์ƒ๋žต๋œ ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ์œ„์™€ ๊ฐ™์ด ์˜์‚ฌ๊ฒฐ์ • ๋‚˜๋ฌด๋ฅผ ์‚ฌ์šฉํ•˜๊ณ , ์‹œ๊ฐํ™” ํ•ด๋ณด๋ฉด ํ•ด์„์„ ์‰ฝ๊ณ  ๊ฐ„ํŽธํ•˜๊ฒŒ ํ•ด๋‚ผ ์ˆ˜ ์žˆ๋‹ค. 

๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ 

 

 

< Previous post >

https://silvercoding.tistory.com/63?category=967543 

 

[boston ๋ฐ์ดํ„ฐ ๋ถ„์„] 2. PCA, ๊ตฐ์ง‘ํ™”๋ฅผ ์‚ฌ์šฉํ•œ ์ง‘๊ฐ’ ๋ถ„์„

๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ < ์ด์ „ ๊ธ€ > https://silvercoding.tistory.com/62 [boston ๋ฐ์ดํ„ฐ ๋ถ„์„] 1. ์ฐจ์›์ถ•์†Œ (PCA) ํŒŒ์ด์ฌ ์˜ˆ์ œ ๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ  ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ & ๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ - ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ impo..

silvercoding.tistory.com

 

 

 

 


KNN ๊ฐœ๋… ์ •๋ฆฌ

* 1๊ทธ๋ฃน vs 2๊ทธ๋ฃน KNN ๋ถ„๋ฅ˜ ๊ณผ์ •

1. k ์„ค์ • : ๊ฐ€์žฅ ๊ฐ€๊นŒ์šด k๊ฐœ์˜ ์ ์„ ์„ ํƒ 

2. k ๊ฐœ์˜ ์  ์ค‘ 1๊ทธ๋ฃน์ด ๋งŽ์€์ง€ 2๊ทธ๋ฃน์ด ๋งŽ์€์ง€ ํ™•์ธ 

3. ๋” ๋งŽ์€ ๊ทธ๋ฃน์˜ ๋ฒ”์ฃผ๋กœ ๋ถ„๋ฅ˜ํ•œ๋‹ค.

 

 

* Finding K

1. Train a KNN model for each candidate K on the training data

2. Measure each model's error rate on the validation (test) data

3. Choose the k with the smallest error rate

 

 

* An appropriate k has to be found!

- If k is very small, the model is sensitive to noise and may overfit

- If k is very large, the model loses the ability to capture local structure

 

 

 

 


 ๋ฐ์ดํ„ฐ ์‚ดํŽด๋ณด๊ธฐ 

๋ณธ ํฌ์ŠคํŒ…์—์„œ ์‚ฌ์šฉํ•  ๋ฐ์ดํ„ฐ์…‹์€ ์บ๊ธ€์˜ ๋‹ค์Œ๋งํฌ์—์„œ ๋‹ค์šด๋ฐ›์„ ์ˆ˜ ์žˆ๋‹ค. 

< Iris Flower Dataset >

https://www.kaggle.com/arshid/iris-flower-dataset

 


 

import pandas as pd
import os
os.chdir('../data')   # path to the folder containing your dataset
iris = pd.read_csv("IRIS.csv")
iris.head()

(Note) sepal: the outer leaf of the flower (calyx) / petal: the flower petal

We will build a classification model that distinguishes the three species setosa, versicolor, and virginica based on sepal and petal measurements.

iris.info()

์ด 150๊ฐœ์˜ ๋ฐ์ดํ„ฐ๊ฐ€ ๋“ค์–ด ๊ฐ€ ์žˆ๊ณ  , ๊ฒฐ์ธก๊ฐ’์€ ์กด์žฌํ•˜์ง€ ์•Š๋Š”๋‹ค. 

iris['species'].value_counts()

value_counts() shows how many rows there are for each species: exactly 50 per species.

 

 

 

 

 


KNN practice - classification

(ex) KNeighborsClassifier(n_neighbors=n)

---> Slow when there is a lot of data

---> n_neighbors=n: sets k (how many nearest neighbors to look at)

 

 

iris['id'] = range(len(iris))

๋ฐ์ดํ„ฐ๋ฅผ ์‹๋ณ„ํ•˜๊ธฐ ์œ„ํ•˜์—ฌ ์ˆœ์„œ๋Œ€๋กœ ๊ฐ’์„ ๋ถ€์—ฌํ•˜์—ฌ id ์ปฌ๋Ÿผ์— ๋„ฃ์–ด์ค€๋‹ค. 

iris = iris[['id','sepal_length','sepal_width','petal_length','petal_width','species']]

id ์ปฌ๋Ÿผ์ด ๊ฐ€์žฅ ์ฒซ๋ฒˆ์งธ์— ์˜ค๋„๋ก ์ •๋ ฌ ํ•ด ์ค€๋‹ค. 

iris.head()

 

 

Splitting into train & test data

train = iris.sample(100, replace=False, random_state=7).reset_index(drop=True)
train

ํ•™์Šต ๋ฐ์ดํ„ฐ์…‹์— ๋žœ๋ค์œผ๋กœ 100๊ฐœ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ์ถ”์ถœํ•œ๋‹ค. ๋น„๋ณต์›์ถ”์ถœ์ด๊ณ , ๋’ค์ฃฝ๋ฐ•์ฃฝ๋œ ์ธ๋ฑ์Šค๋ฅผ ์ดˆ๊ธฐํ™” ์‹œ์ผœ์ค€๋‹ค. 

test = iris.loc[ ~iris['id'].isin(train['id']) ]
# test = test.reset_index().drop(['index'],axis=1)  # same as the line below
test = test.reset_index(drop=True)

ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์…‹์—๋Š” ํ•™์Šต๋ฐ์ดํ„ฐ์…‹์— ์—†๋Š” id๊ฐ’์ด ์กด์žฌํ•˜๋Š” row๋งŒ ์ถ”์ถœํ•˜์—ฌ ๊ตฌ์„ฑํ•œ๋‹ค. ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ ์ธ๋ฑ์Šค๋ฅผ ์ดˆ๊ธฐํ™” ํ•ด์ค€๋‹ค. 

 

 

 

 

Training KNN (with k=3)

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3) # define the model

Create a KNN classifier with k set to 3.

knn.fit( train[['sepal_length','sepal_width','petal_length','petal_width']] , train['species'] )

It is used as knn.fit(train_X, train_y).

predictions = knn.predict( test[['sepal_length','sepal_width','petal_length','petal_width']] )

It is used as knn.predict(test_X). Predictions are made on the test data and stored in predictions.

test['pred'] = predictions
test.head()

The predictions are added as the pred column. Looking at the first 5 rows, all of them are correct.

(test['pred'] == test['species']).mean()

์ •๋‹ต๊ณผ ์˜ˆ์ธก์„ ๋น„๊ตํ•˜์—ฌ ์ •ํ™•๋„๋ฅผ ๊ตฌํ•ด๋ณด๋‹ˆ 0.94๊ฐ€ ๋‚˜์™”๋‹ค. ์ด์ œ ์—ฌ๋Ÿฌ k๊ฐ’์˜ ์ •ํ™•๋„๋ฅผ ๊ตฌํ•˜์—ฌ ์ตœ์ ์˜ k๋ฅผ ๊ฒฐ์ •ํ•ด ๋ณธ๋‹ค. 

 

 

 

 

์ตœ์  K ์ฐพ๊ธฐ 

- train & test ๋ฐ์ดํ„ฐ ์‚ฌ์šฉ 

for k in range(1,30):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit( train[['sepal_length','sepal_width','petal_length','petal_width']] , train['species'] )
    predictions = knn.predict( test[['sepal_length','sepal_width','petal_length','petal_width']] )
    print((pd.Series(predictions) == test['species']).mean())

1๋ถ€ํ„ฐ 29๊นŒ์ง€์˜ k ์˜ ํ•™์Šต์„ ์ง„ํ–‰ํ•˜์—ฌ ์–ป์€ ์ •ํ™•๋„์ด๋‹ค. ๋†’์€ ๊ฐ’ ์ค‘์—์„œ ๊ฐ€์žฅ ์ฒซ๋ฒˆ์งธ๋ฅผ ๊ณ ๋ฅด๋ฉด k=5 (์ •ํ™•๋„ 0.98) ์ด๋‹ค. 

 

---> ์ตœ์ ์˜ K : 5

 


ํ•˜์ง€๋งŒ ์œ„์˜ ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ•˜๋ฉด ์‹ ๋ขฐ์„ฑ์ด ํ•˜๋ฝํ•  ์ˆ˜ ์žˆ๋‹ค. train, test ๋ฐ์ดํ„ฐ๋ฅผ ์–ด๋–ป๊ฒŒ ๋‚˜๋ˆ„๋Š”์ง€์— ๋”ฐ๋ผ ๊ฒฐ๊ณผ๊ฐ€ ํฌ๊ฒŒ ๋‹ฌ๋ผ์งˆ ์ˆ˜๋„ ์žˆ๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ๋”ฐ๋ผ์„œ cross validation์„ ์ด์šฉํ•˜์—ฌ ์ •ํ™•๋„๋ฅผ ๊ตฌํ•ด๋ณผ ์ˆ˜ ์žˆ๋‹ค. 


 

- Using cross validation

from sklearn.model_selection import cross_val_score
import numpy as np
for k in range(1,30):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, iris[['sepal_length','sepal_width','petal_length','petal_width']], iris['species'], cv=5)
    print(f"{k} : " ,np.mean(scores))

5-fold cross validation was run. k=6 is the first value of k with the highest accuracy.



---> Optimal K: 6

 

 

 

 


KNN practice - regression

KNN can also be used for regression problems. To practice this, we build a model that predicts petal_width from sepal_length, sepal_width, and petal_length.

del train['species']
del test['species']

๊ฐ„๋‹จํ•œ ์‹ค์Šต์„ ์œ„ํ•˜์—ฌ ๋ฒ”์ฃผํ˜• ๋ณ€์ˆ˜์ธ species ๋Š” ์‚ญ์ œํ•ด ์ค€๋‹ค. ๊ทธ๋‹ค์Œ ๋ถ„๋ฅ˜๋ฌธ์ œ์™€ ๋˜‘๊ฐ™์ด ํ•™์Šต์„ ์ง„ํ–‰ํ•œ๋‹ค. 

from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=3)
knn.fit( train[['sepal_length','sepal_width','petal_length']] , train['petal_width'] )

predictions = knn.predict( test[['sepal_length','sepal_width','petal_length']] )
test['pred'] = predictions
test.head()

ํ•™์Šต๊ณผ ์˜ˆ์ธก์€ ๋™์ผํ•˜๊ฒŒ ์ง„ํ–‰ํ•œ๋‹ค.

 

 

* Mean absolute error (MAE): one way of evaluating a model's performance on a regression problem.

MAE can be computed as follows.

abs(test['petal_width'] - pd.Series(predictions)).mean()

Subtract the predictions from the true values, take the absolute value, and average the individual errors. Since this metric measures error, smaller values mean better predictions.

for k in range(1,30):
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit( train[['sepal_length','sepal_width','petal_length']] , train['petal_width'] )
    predictions = knn.predict( test[['sepal_length','sepal_width','petal_length']] )    
    print(str(k)+' :'+str(abs(test['petal_width'] - pd.Series(predictions)).mean()))

์˜ค๋ฅ˜์œจ์ด ๊ฐ€์žฅ ์ž‘์€ k๋Š” 7์ž„์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค. 

 

---> ์ตœ์ ์˜ K : 7

 

 

 

 

 

 

 

๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ 

 

 

< Previous post >

https://silvercoding.tistory.com/62

 

[boston ๋ฐ์ดํ„ฐ ๋ถ„์„] 1. ์ฐจ์›์ถ•์†Œ (PCA) ํŒŒ์ด์ฌ ์˜ˆ์ œ

๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ  ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ & ๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ - ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ import pandas as pd import matplotlib.pyplot as plt import seaborn as sns - ๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ data = pd.read_csv('./data/bosto..

silvercoding.tistory.com

 

 

 


 ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ & ๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv('./data/boston.csv')
data.head()

 

 

 


Clustering
del data['chas']

The categorical variable is removed so that only continuous variables remain.

medv = data['medv']
del data['medv']

The target variable is copied and then dropped from the data (in preparation for PCA).

 

 

์ฐจ์› ์ถ•์†Œ (PCA) : 12์ฐจ์› -> 2์ฐจ์› 

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

์ฐจ์› ์ถ•์†Œ์— ํ•„์š”ํ•œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ import ํ•ด์ค€๋‹ค. 

 

 

- Standardization

scaler = StandardScaler()

Create the scaler object.

# ๋ฐ์ดํ„ฐ ํ•™์Šต
scaler.fit(data)
# ๋ณ€ํ™˜
scaler_data = scaler.transform(data)

data ์ „์ฒด๋ฅผ ์ •๊ทœํ™”ํ•˜์—ฌ scaler_data์— ์ €์žฅํ•ด ์ค€๋‹ค. 

 

 

- PCA

pca = PCA(n_components = 2)

PCA ๊ฐ์ฒด๋ฅผ ์ƒ์„ฑํ•œ๋‹ค. 2์ฐจ์› ์‹œ๊ฐํ™”๋ฅผ ์œ„ํ•˜์—ฌ ๋ณ€์ˆ˜๋Š” 2๊ฐœ๋กœ ์„ค์ •ํ•œ๋‹ค. 

pca.fit(scaler_data)

Fit the PCA on scaler_data.

data2 = pd.DataFrame(data = pca.transform(scaler_data), columns=['pc1', 'pc2'])

The PCA-transformed data is stored as a DataFrame in data2.

data2.head()

 

 

๊ตฐ์ง‘์˜ ๊ฐœ์ˆ˜ ์ •ํ•˜๊ธฐ - Elbow Point ์ง€์ • 

from sklearn.cluster import KMeans

KMeans(n_clusters = k)

  • Creates an object that will build k clusters

KMeans.fit()

  • Fits the model

KMeans.inertia_

  • The inertia of the fitted KMeans
  • Inertia is the sum of (squared) distances from each point to the center of its own cluster
  • In other words, lower inertia means tighter clusters

KMeans.predict(data)

  • Assigns each data point to a cluster based on the fitted model

x = []   # number of clusters k
y = []   # inertia for that k

for k in range(1, 30):
    kmeans = KMeans(n_clusters = k)
    kmeans.fit(data2)
    
    x.append(k)
    y.append(kmeans.inertia_)

Clustering is run for k from 1 to 29, and the inertia is plotted to choose an appropriate number of clusters.

plt.plot(x, y)

๊ตฐ์ง‘์˜ ๊ฐœ์ˆ˜ ๋ณ„ ์‘์ง‘๋„ ๊ทธ๋ž˜ํ”„๋ฅผ ๊ทธ๋ ค๋ณด๋‹ˆ 3~5๊ฐœ ์ •๋„๊ฐ€ ์ ๋‹นํ•  ๊ฒƒ ๊ฐ™๋‹ค. Elbow Point๋ฅผ 4๋กœ ์ง€์ •ํ•˜๊ณ  ๊ตฐ์ง‘ํ™”๋ฅผ ํ•ด๋ณด๋„๋ก ํ•œ๋‹ค. 

 

 

 

Clustering

kmeans = KMeans(n_clusters=4)

Create a KMeans object with the number of clusters set to 4.

kmeans.fit(data2)

์œ„์—์„œ ์ƒ์„ฑํ•ด ๋†“์€ data2๋ฅผ ํ•™์Šตํ•œ๋‹ค. 

data2['labels'] = kmeans.predict(data2)

๊ฐ๊ฐ์˜ ์˜ˆ์ธก๋œ ๊ตฐ์ง‘ ์ข…๋ฅ˜๋ฅผ labels ์ปฌ๋Ÿผ์— ๋„ฃ์–ด์ค€๋‹ค. 

data2.head()

lebels๊ฐ€ 1์ด๋ผ๋Š” ๊ฒƒ์€ ํ•ด๋‹น ๋ฐ์ดํ„ฐ๊ฐ€ 1๋ฒˆ ๊ตฐ์ง‘์— ํฌํ•จ๋˜์—ˆ๋‹ค๋Š” ์˜๋ฏธ์ด๋‹ค. 

sns.scatterplot(x='pc1', y='pc2', hue='labels', data=data2)

์œ„์™€ ๊ฐ™์ด ๊ตฐ์ง‘์ด ํ˜•์„ฑ๋˜์—ˆ์Œ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค. 

 

 

 

 

Interpreting the results

- Which group has the highest house prices? Compare the group means.

data2['medv'] = medv

To compute each group's average house price, the medv column saved at the beginning is added back to data2.

data2.head()

data2[data2['labels']==0]['medv'].mean()

0๋ฒˆ ๊ตฐ์ง‘์˜ ์ง‘๊ฐ’์˜ ํ‰๊ท ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๊ตฌํ•œ๋‹ค. ์ด๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๋ชจ๋“  ๊ตฐ์ง‘์˜ ์ง‘๊ฐ’์˜ ํ‰๊ท ์„ ๊ทธ๋ž˜ํ”„๋กœ ๊ทธ๋ ค๋ณด๋„๋ก ํ•œ๋‹ค. 

medv_list = []

for i in range(4):
    medv_avg = data2[data2['labels']==i]['medv'].mean()
    medv_list.append(medv_avg)
sns.barplot(x=['group_0', 'group_1', 'group_2', 'group_3'], y=medv_list)

 


์ง‘๊ฐ’์˜ ํ‰๊ท  ์ตœ์ƒ์œ„ ๊ทธ๋ฃน : group_2

์ง‘๊ฐ’์˜ ํ‰๊ท  ์ตœํ•˜์œ„ ๊ทธ๋ฃน : group_0

 

---> ์ตœ์ƒ์œ„ ๊ทธ๋ฃน๊ณผ ์ตœํ•˜์œ„ ๊ทธ๋ฃน์„ ๋น„๊ตํ•˜์—ฌ ์ง‘๊ฐ’์˜ ํ‰๊ท ์ด ๋†’๊ฑฐ๋‚˜ ๋‚ฎ์€ ์ด์œ ์— ๋Œ€ํ•˜์—ฌ ํ™•์ธํ•ด ๋ณธ๋‹ค. 


* ์›๋ณธ ๋ฐ์ดํ„ฐ ์‚ฌ์šฉํ•˜์—ฌ ์›์ธ ๋ถ„์„ํ•ด๋ณด๊ธฐ 

data['labels'] = data2['labels']

์›๋ณธ๋ฐ์ดํ„ฐ์— ๊ทธ๋ฃน labels๋ฅผ ์ถ”๊ฐ€ํ•ด ์ค€๋‹ค. 

group = data[(data['labels']==0) | (data['labels']==2)]

Select only groups 0 and 2 and store them in group.

group = group.groupby('labels').mean().reset_index()

groupby computes the mean of every column for each labels value, and reset_index() turns labels, which became the index after groupby, back into a column.

group

๊ฐ ๊ทธ๋ฃน๋ณ„ ํ‰๊ท ์ด ๊ตฌํ•ด์ง„ ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ์ด๋ฅผ ์‹œ๊ฐํ™” ํ•˜์—ฌ ๋น„๊ตํ•ด ๋ณด๋„๋ก ํ•œ๋‹ค. 

 

 

๊ฒฐ๊ณผ ํ•ด์„ - ์‹œ๊ฐํ™” 

column = group.columns
fig, ax = plt.subplots(2, 6, figsize=(30, 13))

for i in range(12):
    sns.barplot('labels', column[i+1], data=group, ax=ax[i//6, i%6])

๋‘๊ฐœ์˜ ๋ง‰๋Œ€๊ฐ€ ๊ทธ๋ ค์ ธ ์žˆ๋Š” ๋ง‰๋Œ€๊ทธ๋ž˜ํ”„๋กœ, ์™ผ์ชฝ์ด ๋†’์œผ๋ฉด ์ง‘๊ฐ’์ด ๋‚ฎ์€ ์ด์œ ์˜ ๊ทผ๊ฑฐ๊ฐ€, ์˜ค๋ฅธ์ชฝ์ด ๋†’์œผ๋ฉด ์ง‘๊ฐ’์ด ๋†’์€ ์ด์œ ์˜ ๊ทผ๊ฑฐ๊ฐ€ ๋œ๋‹ค๊ณ  ํ•ด์„ํ•œ๋‹ค. 

 

 

Conclusions

- The plot at position (0,0) shows that crim (crime rate) is far higher in group 0: higher crime rates go with lower house prices.

- The plot at position (0,1) shows that a higher zn (proportion of residential land zoned for lots over 25,000 sq. ft.) goes with higher house prices.

- The next, somewhat smaller, difference is in the plot at position (1,1): a higher rad (accessibility to radial highways) goes with lower house prices.

 

 

์ด์™€ ๊ฐ™์ด ์—ฌ๋Ÿฌ ํ•ด์„์„ ํ•ด๋ณผ ์ˆ˜ ์žˆ๋‹ค. 


 

 

 

 

๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ 

 

 


 ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ & ๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ 

- ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ 

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

 

- ๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ 

data = pd.read_csv('./data/boston.csv')
data.head()

 

- ๋ฐ์ดํ„ฐ ์‚ดํŽด๋ณด๊ธฐ 

data.info()

506๊ฐœ์˜ row๊ฐ€ ์กด์žฌํ•˜๊ณ , ๊ฒฐ์ธก๊ฐ’์€ ์—†๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค. 

data.columns

์ด 14๊ฐœ์˜ ์ปฌ๋Ÿผ์ด ์žˆ๋‹ค. ๊ฐ ์ปฌ๋Ÿผ์˜ ์˜๋ฏธ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค. 

  • crim: ๋ฒ”์ฃ„์œจ
  • zn: 25,000 ํ‰๋ฐฉํ”ผํŠธ๋ฅผ ์ดˆ๊ณผ ๊ฑฐ์ฃผ์ง€์—ญ ๋น„์œจ
  • indus: ๋น„์†Œ๋งค์ƒ์—…์ง€์—ญ ๋ฉด์  ๋น„์œจ
  • chas: ์ฐฐ์Šค๊ฐ•์˜ ๊ฒฝ๊ณ„์— ์œ„์น˜ํ•œ ๊ฒฝ์šฐ๋Š” 1, ์•„๋‹ˆ๋ฉด 0
  • nox: ์ผ์‚ฐํ™”์งˆ์†Œ ๋†๋„
  • rm: ์ฃผํƒ๋‹น ๋ฐฉ ์ˆ˜
  • age: 1940๋…„ ์ด์ „์— ๊ฑด์ถ•๋œ ์ฃผํƒ์˜ ๋น„์œจ
  • dis: ์ง์—…์„ผํ„ฐ์˜ ๊ฑฐ๋ฆฌ
  • rad: ๋ฐฉ์‚ฌํ˜• ๊ณ ์†๋„๋กœ๊นŒ์ง€์˜ ๊ฑฐ๋ฆฌ
  • tax: ์žฌ์‚ฐ์„ธ์œจ
  • ptratio: ํ•™์ƒ/๊ต์‚ฌ ๋น„์œจ
  • b: ์ธ๊ตฌ ์ค‘ ํ‘์ธ ๋น„์œจ
  • lstat: ์ธ๊ตฌ ์ค‘ ํ•˜์œ„ ๊ณ„์ธต ๋น„์œจ
  • medv : ๋ณด์Šคํ„ด 506๊ฐœ ํƒ€์šด์˜ 1978๋…„ ์ฃผํƒ ๊ฐ€๊ฒฉ ์ค‘์•™๊ฐ’ (๋‹จ์œ„ 1,000 ๋‹ฌ๋Ÿฌ)

Feature Selection: correlation and covariance

- Removing the categorical variable

del data['chas']

Correlation and covariance are computed on continuous data, so the categorical variable is removed. In practice, dropping variables should be done carefully; here it is removed for the sake of the exercise.

 

Setting up hypotheses

1. Do places with a higher crime rate have lower house prices?

2. Do places with more rooms have higher house prices?

3. Do higher nitric oxide concentrations go with lower house prices?

4. Do higher property tax rates go with higher house prices?

 

1. ๋ฒ”์ฃ„์œจ์ด ๋†’์€ ๊ณณ์˜ ์ง‘๊ฐ’์€ ๋‚ฎ์„๊นŒ? 

sns.jointplot(data=data, x='crim', y='medv', kind='reg')

๊ทน๋‹จ์ ์ธ ์Œ์˜ ๊ด€๊ณ„๋Š” ์•„๋‹ˆ์ง€๋งŒ , ๋ฒ”์ฃ„์œจ์ด ๋†’์•„์งˆ ์ˆ˜๋ก ์ง‘๊ฐ’์ด ๋‚ฎ์•„์ง€๋Š” ์ถ”์„ธ๋ฅผ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. 

 

 

- Covariance

data['crim'].cov(data['medv'])

The covariance shows a negative relationship.

 

 

- ์ƒ๊ด€๊ณ„์ˆ˜ 

data['crim'].corr(data['medv'])    # ํ”ผ์–ด์Šจ์ƒ๊ด€๊ณ„์ˆ˜ 0.3 ~ 0.6 ๊ฐ•ํ•œ ์ƒ๊ด€๊ณ„์ˆ˜


r์ด -1.0๊ณผ -0.7 ์‚ฌ์ด์ด๋ฉด, ๊ฐ•ํ•œ ์Œ์  ์„ ํ˜•๊ด€๊ณ„,

r์ด -0.7๊ณผ -0.3 ์‚ฌ์ด์ด๋ฉด, ๋šœ๋ ทํ•œ ์Œ์  ์„ ํ˜•๊ด€๊ณ„,

r์ด -0.3๊ณผ -0.1 ์‚ฌ์ด์ด๋ฉด, ์•ฝํ•œ ์Œ์  ์„ ํ˜•๊ด€๊ณ„,

r์ด -0.1๊ณผ +0.1 ์‚ฌ์ด์ด๋ฉด, ๊ฑฐ์˜ ๋ฌด์‹œ๋  ์ˆ˜ ์žˆ๋Š” ์„ ํ˜•๊ด€๊ณ„,

r์ด +0.1๊ณผ +0.3 ์‚ฌ์ด์ด๋ฉด, ์•ฝํ•œ ์–‘์  ์„ ํ˜•๊ด€๊ณ„,

r์ด +0.3๊ณผ +0.7 ์‚ฌ์ด์ด๋ฉด, ๋šœ๋ ทํ•œ ์–‘์  ์„ ํ˜•๊ด€๊ณ„,

r์ด +0.7๊ณผ +1.0 ์‚ฌ์ด์ด๋ฉด, ๊ฐ•ํ•œ ์–‘์  ์„ ํ˜•๊ด€๊ณ„

 

Source: https://ko.wikipedia.org/wiki/%EC%83%81%EA%B4%80_%EB%B6%84%EC%84%9D


Going by this interpretation of the Pearson correlation coefficient, crime rate and house price have a clear negative linear relationship.

 

[ Hypothesis 1: True ]

 

 

2. ๋ฐฉ์˜ ๊ฐœ์ˆ˜๊ฐ€ ๋งŽ์€ ๊ณณ์˜ ์ง‘๊ฐ’์€ ๋†’์„๊นŒ? 

sns.jointplot(data=data, x='rm', y='medv', kind='reg')

๊ทธ๋ž˜ํ”„๋ฅผ ๋ณด๋ฉด, ๋ฐฉ์˜ ๊ฐœ์ˆ˜๊ฐ€ ๋งŽ์„์ˆ˜๋ก ์ง‘๊ฐ’์ด ๋†’์•„์ง€๋Š” ๊ฒฝํ–ฅ์ด ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. 

 

- Covariance

data['rm'].cov(data['medv'])

The covariance shows a positive relationship.

 

 

- ์ƒ๊ด€๊ณ„์ˆ˜ 

data['rm'].corr(data['medv'])

์ƒ๊ด€๊ณ„์ˆ˜๋ฅผ ๋ณด๋‹ˆ ๊ฐ•ํ•œ ์–‘์  ์„ ํ˜•๊ด€๊ณ„์— ๊ฐ€๊นŒ์šด ๋šœ๋ ทํ•œ ์–‘์  ์„ ํ˜•๊ด€๊ณ„๋ผ๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค. ์—ฌ๊ธฐ์„œ ๊ณต๋ถ„์‚ฐ์˜ ํ—ˆ์ ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ๊ฐ€์„ค2์˜ ์ƒ๊ด€๊ณ„์ˆ˜๋Š” ๊ฐ€์„ค1์—์„œ ์ƒ๊ด€๊ณ„์ˆ˜๋ณด๋‹ค ๋” ๋šœ๋ ทํ•œ ๊ด€๊ณ„์ด์ง€๋งŒ, ๊ณต๋ถ„์‚ฐ์€ ๊ฐ€์„ค2๊ฐ€ ๋” ๋†’๋‹ค. 

 

[ Hypothesis 2: True ]

 

 

3. Do higher nitric oxide concentrations go with lower house prices?

sns.jointplot(data=data, x='nox', y='medv', kind='reg')

data['nox'].corr(data['medv'])

-0.4273207723732824

๋šœ๋ ทํ•œ ์Œ์  ์ƒ๊ด€๊ด€๊ณ„์ธ ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค. ์ผ์‚ฐํ™”์งˆ์†Œ ๋†๋„๊ฐ€ ๋†’์„ ์ˆ˜๋ก ์ง‘๊ฐ’์€ ๋‚ฎ์•„์ง€๋Š” ๊ฒฝํ–ฅ์„ ๋ณด์ธ๋‹ค.

 

[ Hypothesis 3: True ]

 

 

4. ์žฌ์‚ฐ์„ธ์œจ์ด ๋†’์„์ˆ˜๋ก ์ง‘๊ฐ’์„ ๋†’์„๊นŒ?

sns.jointplot(data=data, x='tax', y='medv', kind='reg')

data['tax'].corr(data['medv'])

-0.46853593356776696

๋šœ๋ ทํ•œ ์Œ์ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ๊ฐ–๊ณ  ์žˆ๊ณ , ์žฌ์‚ฐ์„ธ์œจ์ด ๋†’์„์ˆ˜๋ก ์ง‘๊ฐ’์ด ๋‚ฎ์•„์ง€๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. 

 

[ Hypothesis 4: False ]

 

 

- ๋ชจ๋“ ๋ฐ์ดํ„ฐ์˜ ์ƒ๊ด€๊ณ„์ˆ˜ ์•Œ๊ธฐ - heatmap 

plt.figure(figsize=(10, 7))
sns.heatmap(data.corr(), cmap='RdBu_r', annot=True, fmt='0.1f')

lstat์™€ rm ์˜ ์ง‘๊ฐ’๊ณผ์˜ ์ƒ๊ด€๊ณ„์ˆ˜๊ฐ€ ๊ฐ€์žฅ ๋†’์€ ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.  ์ธ๊ตฌ ์ค‘ ํ•˜์œ„ ๊ณ„์ธต ๋น„์œจ(lstat)์™€๋Š” ์Œ์˜ ์ƒ๊ด€๊ด€๊ณ„, ๋ฐฉ์˜ ๊ฐœ์ˆ˜(rm) ๊ณผ๋Š” ์–‘์˜ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ๋ˆ๋‹ค. ๋ฐ˜๋ฉด dis(์ง์—…์„ผํ„ฐ์˜ ๊ฑฐ๋ฆฌ), b(์ธ๊ตฌ ์ค‘ ํ‘์ธ ๋น„์œจ) ์™€ ์ง‘๊ฐ’์€ ๋‚ฎ์€ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ๋ณด์ธ๋‹ค. 

 

 

 

 

 


 Feature Extraction 

- ์ƒ๊ด€๊ด€๊ณ„ ๋น„๊ตํ•˜์—ฌ ๋ช‡๊ฐœ์˜ ๋ณ€์ˆ˜๋ฅผ ๋ช‡๊ฐœ๋กœ ์ค„์ผ ๊ฒƒ์ธ์ง€ ๊ฒฐ์ • 

corr_bar = []

for column in data.columns:
    print(f"correlation of {column} with house price: {data[column].corr(data['medv'])}")
    corr_bar.append(abs(data[column].corr(data['medv'])))

๊ฐ ์ปฌ๋Ÿผ๋ณ„ ์ง‘๊ฐ‘๊ณผ์˜ ์ƒ๊ด€๊ณ„์ˆ˜๋ฅผ ์ถœ๋ ฅํ•ด๋ณด๊ณ , corr_bar ๋ฆฌ์ŠคํŠธ์—๋Š” ์ ˆ๋Œ“๊ฐ’์„ ์ทจํ•˜์—ฌ ์ถ”๊ฐ€ํ•ด์ค€๋‹ค. 

corr_bar

sns.barplot(data.columns, corr_bar)

The bar chart shows at a glance that dis and b have smaller correlations than the other columns.

 

x = data[['dis', 'b']]

data์—์„œ ๋‘ ๋ณ€์ˆ˜๋ฅผ ์„ ํƒํ•˜์—ฌ x์— ์ €์žฅํ•œ๋‹ค. 

x.head()

 

Using PCA

from sklearn.decomposition import PCA

PCA(n_components)

  • n_components: how many new variables (components) to create
  • This creates the PCA object

PCA.fit(x)

  • Fits the object on the data in x (the object "studies" the data)

PCA.components_

  • The weight (loading) of each original variable inside the new component, i.e. how much of that variable it carries

PCA.explained_variance_ratio_

  • The proportion of the variance that the new variable explains

PCA.transform

  • Transforms the data in x using the fitted learner

 

- 2๊ฐœ์˜ ๋ณ€์ˆ˜ -> 1๊ฐœ์˜ ๋ณ€์ˆ˜ 

pca = PCA(n_components=1)

n_components ์— ์ƒ์„ฑํ•  ๋ณ€์ˆ˜๋ฅผ ์ž‘์„ฑํ•ด์ค€๋‹ค. 

pca.fit(x)

๋ฐ์ดํ„ฐ๋ฅผ ํ•™์Šตํ‚จ๋‹ค. 

pca.components_

์ƒˆ๋กœ์šด ๋ณ€์ˆ˜์— ๋‹ด๊ธด ๊ฐ ๋ณ€์ˆ˜์˜ ๋ถ„์‚ฐ์„ ํ™•์ธํ•ด ๋ณธ๋‹ค. ๊ทธ๋Ÿฐ๋ฐ ์˜ค๋ฅธ์ชฝ (b) ์˜ ๋ถ„์‚ฐ์ด ๋„ˆ๋ฌด ํฌ๋‹ค. ์ด๋Š” ์˜ค๋ฅธ์ชฝ ๋ณ€์ˆ˜์˜ ์ •๋ณด๋งŒ ๋งŽ์ด ๋‹ด๊ฒผ๋‹ค๊ณ  ํ•ด์„ํ•  ์ˆ˜ ์žˆ๋‹ค. ( -> ์ •๊ทœํ™”๊ฐ€ ํ•„์š”ํ•œ ์ด์œ  / ๋’ค์—์„œ ํ•™์Šต ) 

pca.explained_variance_ratio_

The proportion of variance explained by the new variable.

data['pc1'] = pca.transform(x)

The data in x is transformed with the fitted pca and added to data as a column named pc1.

data

์ถ”๊ฐ€ ์™„๋ฃŒ๋œ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„. 

sns.jointplot(data=data, x='pc1', y='medv', kind='reg')

data['pc1'].corr(data['medv'])

์ƒˆ๋กœ์šด ์ปฌ๋Ÿผ๊ณผ ์ง‘๊ฐ’์˜ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ๋ณด๋ฉด ์ „์— b๋ณ€์ˆ˜์˜ ์ƒ๊ด€๊ณ„์ˆ˜์™€ ๋ณ„ ์ฐจ์ด๊ฐ€ ์—†๋‹ค. ์ •๊ทœํ™”๋ฅผ ์ง„ํ–‰ํ•˜์—ฌ ๋‹ค์‹œ ํ•™์Šตํ•ด ๋ณด์ž. 

 

Standardization

from sklearn.preprocessing import StandardScaler

StandardScaler()

  • Creates the scaler object

scaler.fit(x)

  • Fits the scaler

scaler.transform(x)

  • Transforms the data in x with the fitted scaler

 

scaler = StandardScaler()

์ •๊ทœํ™” ๊ฐ์ฒด๋ฅผ ์ƒ์„ฑํ•œ๋‹ค. 

scaler.fit(x)
scaler_x = scaler.transform(x)

x๋ฐ์ดํ„ฐ๋ฅผ ํ•™์Šต์‹œํ‚จ ํ›„, ์ •๊ทœํ™” ๋œ x๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ๋ณ€ํ™˜ํ•˜์—ฌ scaler_x ์— ์ €์žฅํ•ด์ค€๋‹ค. 

scaler_x

 

 

- ์ •๊ทœํ™”๋œ ๋ฐ์ดํ„ฐ๋กœ pca ์‹คํ–‰ 

# use 1 component
pca = PCA(n_components=1)
# fit on the standardized data
pca.fit(scaler_x)
# check the weight of each variable in the new component
# and note how it differs from before
pca.components_

์ •๊ทœํ™”๋œ ๋ฐ์ดํ„ฐ๋กœ pca๋ฅผ ์ง„ํ–‰ํ•˜๋‹ˆ ์ƒˆ๋กœ์šด ๋ณ€์ˆ˜์— ๋‘๊ฐ€์ง€์˜ ๋ณ€์ˆ˜์˜ ๋ถ„์‚ฐ์ด ๋™์ผํ•˜๊ฒŒ ๋‹ด๊ธด ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. 

pca.explained_variance_ratio_

The proportion of variance explained by the new variable has decreased, but it is more important that the two variables contribute equally.

data['pc1'] = pca.transform(scaler_x)
data.head()

์ •๊ทœํ™”๋œ ๋ฐ์ดํ„ฐ๋ฅผ pca ๋ณ€ํ™˜ํ•˜์—ฌ pc1 ์ปฌ๋Ÿผ์— ๋„ฃ์–ด์ฃผ์—ˆ๋‹ค. 

 

 

- ์ƒ๊ด€๊ณ„์ˆ˜ ๋น„๊ต

sns.jointplot(data=data, x=data['pc1'], y=data['medv'], kind='reg')

data['pc1'].corr(data['medv'])

data['b'].corr(data['medv'])

The correlation between the new variable pc1 and house price is now higher than the correlation of either of the two original variables.

 

 

 


์ƒ๊ด€์„ฑ์ด ์—†๋Š” ๋‘ ๊ฐ€์ง€์˜ ๋ณ€์ˆ˜๋ฅผ ์ƒ๊ด€์„ฑ์ด ๋” ๋†’์•„์ง€๋„๋ก ํ•˜๋Š” ๋ณ€์ˆ˜๋ฅผ ์ƒ์„ฑํ•˜๋Š” pca๋ฅผ ํ•™์Šตํ•˜์˜€๋‹ค. ๋‹ค์Œ์‹œ๊ฐ„์—๋Š” ์˜ค๋Š˜ ํ•™์Šตํ•œ pca๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ฐจ์›์„ ์ถ•์†Œํ•˜๊ณ , ๊ตฐ์ง‘ํ™”, ์‹œ๊ฐํ™”๋ฅผ ํ•˜๋Š” ์‹ค์Šต์„ ํ•œ๋‹ค. 


 

 

 

๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ 

 

 

< Previous post >

https://silvercoding.tistory.com/60?category=965020 

 

[์‹œ๊ฐํ™” ๋ถ„์„ ํ”„๋กœ์ ํŠธ] 3-1 open API ์‹ ์ฒญ & ํ™œ์šฉ (์„œ์šธ ์—ด๋ฆฐ๋ฐ์ดํ„ฐ ๊ด‘์žฅ)

๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ < ์ด์ „ ๊ธ€ > https://silvercoding.tistory.com/59 https://silvercoding.tistory.com/58 https://silvercoding.tistory.com/57 https://silvercoding.tistory.com/56 https://silvercoding...

silvercoding.tistory.com

 

 


์ด๋ฒˆ ํฌ์ŠคํŒ…์—์„œ๋Š” ์ด์ „ ๊ธ€์—์„œ ์ƒ์„ฑํ•œ ํŒŒ์ผ๊ณผ folium ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์ด์šฉํ•˜์—ฌ ์„œ์šธ์‹œ ๋”ฐ๋ฆ‰์ด ์ง€๋„๋ฅผ ์ƒ์„ฑํ•œ๋‹ค. 

 

 

<folium ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ๊ด€๋ จ ํฌ์ŠคํŒ…> 

https://silvercoding.tistory.com/53?category=965020 

 

[python ์‹œ๊ฐํ™”] 2. ์„œ์šธ์‹œ ๋Œ€ํ”ผ์†Œ ํ˜„ํ™ฉ ์ง€๋„ ๋งŒ๋“ค๊ธฐ , ์ง€๋„ ์‹œ๊ฐํ™” ( folium ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ )

๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ < ์ด์ „ ๊ธ€ > https://silvercoding.tistory.com/52 [python ์‹œ๊ฐํ™”] 1. seaborn ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ (distplot, relplot, jointplot, pairplot, boxplot, swarmplot, heatmap) ๋Ÿฌ๋‹์Šคํ‘ผ ์ˆ˜์—… ์ •๋ฆฌ..

silvercoding.tistory.com


๋ฐ์ดํ„ฐ์…‹ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ 

import pandas as pd
data = pd.read_excel('./data/bicycle.xlsx')
data.head()

 

 

 

 

 

 


์„œ์šธ์‹œ ๋”ฐ๋ฆ‰์ด ์ง€๋„ ์‹œ๊ฐํ™” 

import folium

 

- ์ง€๋„ ์ƒ์„ฑ 

m = folium.Map(location = ['37.5536067','126.9674308'], zoom_start = 13)   # ์„œ์šธ์—ญ ์ค‘์‹ฌ
m

์„œ์šธ์—ญ์˜ ์œ„๋„, ๊ฒฝ๋„๋ฅผ ์ด์šฉํ•˜์—ฌ ์ง€๋„๋ฅผ ์ƒ์„ฑํ•œ๋‹ค. zoom_start ์ธ์ž๋ฅผ ์ด์šฉํ•˜์—ฌ ํ™•๋Œ€ ์ •๋„๋ฅผ ์„ค์ •ํ•  ์ˆ˜ ์žˆ๋‹ค. 

 

 

- ์ง€๋„์— ํ‘œ์‹œํ•  ๋ฐ์ดํ„ฐ ํ™•์ธ 

for i in range(len(data)):
    name = data.loc[i, 'stationName']
    available = data.loc[i, 'parkingBikeTotCnt']
    total = data.loc[i, 'rackTotCnt']
    lat = data.loc[i, 'stationLatitude']
    long = data.loc[i, 'stationLongitude']
    print(name, available, total, lat, long)

์ง€๋„์ƒ์„ฑ์— ํ•„์š”ํ•œ ์ปฌ๋Ÿผ์˜ ๋ฐ์ดํ„ฐ๋งŒ ์ถ”์ถœํ•˜์—ฌ ์ถœ๋ ฅํ•œ ๊ฒƒ์ด๋‹ค.  ์ด๋ฅผ ์ด์šฉํ•˜์—ฌ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋งˆ์ปค๋ฅผ ์ถ”๊ฐ€ํ•˜์—ฌ ์ง€๋„๋ฅผ ์ƒ์„ฑํ•œ๋‹ค. 

# ์ง€๋„ ์ƒ์„ฑํ•˜๊ธฐ
m = folium.Map(location = ['37.5536067','126.9674308'], zoom_start = 13)

# add markers
for i in range(len(data)):
    lat = data.loc[i, 'stationLatitude']
    long = data.loc[i, 'stationLongitude']
    name = data.loc[i, 'stationName']
    available = int(data.loc[i, 'parkingBikeTotCnt'])
    total = int(data.loc[i, 'rackTotCnt'])
    
    # ์ž์ „๊ฑฐ ์ˆ˜๋Ÿ‰์— ๋Œ€ํ•ด ์ƒ‰์ƒ์œผ๋กœ ํ‘œ์‹œ
    ##  ์ž์ „๊ฑฐ ๋ณด์œ ์œจ์ด 50% ์ดˆ๊ณผ์ผ ๊ฒฝ์šฐ --> ํŒŒ๋ž€์ƒ‰
    ##  ํ˜„์žฌ ์ž์ „๊ฑฐ๊ฐ€ 2๋Œ€ ๋ณด๋‹ค ์ ์„ ๊ฒฝ์šฐ --> ๋นจ๊ฐ„์ƒ‰
    ##  ๊ทธ ์™ธ์˜ ๊ฒฝ์šฐ(์ž์ „๊ฑฐ 2๋Œ€ ์ด์ƒ ์ด๋ฉด์„œ, ์ž์ „๊ฑฐ ๋ณด์œ ์œจ 50% ๋ฏธ๋งŒ) --> ์ดˆ๋ก์ƒ‰
    if available/total > 0.5:
        color = 'blue'
    elif available < 2 :
        color = 'red'
    else:
        color = 'green'
    icon=folium.Icon(color=color, icon='info-sign')
    folium.Marker(location = [lat, long],
                 tooltip = f"{name} : {available}", 
                  icon = icon
             ).add_to(m)
m

ํ˜„์žฌ ์ž์ „๊ฑฐ ์ด์šฉ ํ˜„ํ™ฉ์„ ๋” ์ง๊ด€์ ์œผ๋กœ ๋ณด๊ธฐ ์œ„ํ•ด ์ƒ‰๊น” ์„ค์ •์„ ํ•ด์ค€๋‹ค. ํŒŒ๋ž€์ƒ‰์€ ๋Œ€์—ฌ ๊ฐ€๋Šฅ ์ž์ „๊ฑฐ 50% ์ด์ƒ, ๋นจ๊ฐ„์ƒ‰์€ 2๊ฐœ ๋ฏธ๋งŒ์ผ ๋•Œ, ์ดˆ๋ก์ƒ‰์€ ๊ทธ ์ด์™ธ์˜ ์ค‘๊ฐ„ ์ƒํƒœ๋ฅผ ๋‚˜ํƒ€๋‚ธ๋‹ค. ๋˜ํ•œ, tooltip์„ ์‚ฌ์šฉํ•˜์—ฌ ๋งˆ์šฐ์Šค๋ฅผ ๊ฐ–๋‹ค๋Œ€๋ฉด ์ž์ „๊ฑฐ ๋Œ€์—ฌ์†Œ ์ด๋ฆ„๊ณผ ์ด์šฉ๊ฐ€๋Šฅํ•œ ์ž์ „๊ฑฐ ์ˆ˜๋ฅผ ๋ณด์—ฌ์ค€๋‹ค. 

 

 

 

 


๋ฌธ์ œ์  : ๋ฐ์ดํ„ฐ๊ฐ€ ๋„ˆ๋ฌด ๋งŽ์•„ ์ง€๋„๋ฅผ ๋ณด๋Š” ๋ฐ ์–ด๋ ค์›€์ด ์žˆ๋‹ค. 

ํ•ด๊ฒฐ๋ฐฉ์•ˆ : ํด๋Ÿฌ์Šคํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ทผ์ ‘ํ•œ ๋งˆ์ปค๋“ค๋ผ๋ฆฌ ์„œ๋กœ ๋ฌถ์–ด์ค€๋‹ค. 


Map with a marker cluster & minimap

from folium.plugins import MiniMap, MarkerCluster
# ์ง€๋„ ์ƒ์„ฑํ•˜๊ธฐ
m_ver2 = folium.Map(location = ['37.5536067','126.9674308'], zoom_start = 13)

# add a minimap
minimap = MiniMap() 
m_ver2.add_child(minimap)

# ๋งˆ์ปค ํด๋Ÿฌ์Šคํ„ฐ ๋งŒ๋“ค๊ธฐ
marker_cluster_ver2 = MarkerCluster().add_to(m_ver2)  # ํด๋Ÿฌ์Šคํ„ฐ ์ถ”๊ฐ€ํ•˜๊ธฐ

# add markers
for i in range(len(data)):
    lat = data.loc[i, 'stationLatitude']
    long = data.loc[i, 'stationLongitude']
    name = data.loc[i, 'stationName']
    available = int(data.loc[i, 'parkingBikeTotCnt'])
    total = int(data.loc[i, 'rackTotCnt'])
    
    # ์ž์ „๊ฑฐ ์ˆ˜๋Ÿ‰์— ๋Œ€ํ•ด ์ƒ‰์ƒ์œผ๋กœ ํ‘œ์‹œ
    ##  ์ž์ „๊ฑฐ ๋ณด์œ ์œจ์ด 50% ์ดˆ๊ณผ์ผ ๊ฒฝ์šฐ --> ํŒŒ๋ž€์ƒ‰
    ##  ํ˜„์žฌ ์ž์ „๊ฑฐ๊ฐ€ 2๋Œ€ ๋ณด๋‹ค ์ ์„ ๊ฒฝ์šฐ --> ๋นจ๊ฐ„์ƒ‰
    ##  ๊ทธ ์™ธ์˜ ๊ฒฝ์šฐ(์ž์ „๊ฑฐ 2๋Œ€ ์ด์ƒ ์ด๋ฉด์„œ, ์ž์ „๊ฑฐ ๋ณด์œ ์œจ 50% ๋ฏธ๋งŒ) --> ์ดˆ๋ก์ƒ‰
    if available/total > 0.5:
        color = 'blue'
    elif available < 2 :
        color = 'red'
    else:
        color = 'green'
    icon=folium.Icon(color=color, icon='info-sign')
#     print(name, available, total, lat, long)
    folium.Marker(location = [lat, long],
                 tooltip = f"{name} : {available}", 
                  icon = icon
             ).add_to(marker_cluster_ver2)
m_ver2

ํด๋Ÿฌ์Šคํ„ฐ์™€ ๋ฏธ๋‹ˆ๋งต์„ ์ถ”๊ฐ€ํ•œ ์ง€๋„์ด๋‹ค. ์ˆซ์ž๋ฅผ ํด๋ฆญํ•˜๋ฉด ํ•ด๋‹น ์ง€์—ญ์œผ๋กœ ํ™•๋Œ€๋˜์–ด ๋” ํŽธ๋ฆฌํ•˜๊ณ  ์ง๊ด€์ ์ธ ์ง€๋„๋ฅผ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. 

m_ver2.save('./map/bicycle_clustermap.html')

์ง€๋„๋Š” html๋กœ ์ €์žฅํ•˜์—ฌ ์–ธ์ œ๋“  ๊บผ๋‚ด๋ณผ ์ˆ˜ ์žˆ๋‹ค. 

 

 

 

 
