Learning Spoons class notes

 

 

< Previous post >

https://silvercoding.tistory.com/64

 

[IRIS Data Analysis] 1. Python KNN Classification


 

 

 

 


Loading the data

This exercise uses the same Iris Flower Dataset as the previous post.

< Iris Flower Dataset >

https://www.kaggle.com/arshid/iris-flower-dataset

 


 

import pandas as pd
import os
os.chdir('../data')  # path to your own folder containing the dataset
iris = pd.read_csv("IRIS.csv")
iris.head()

iris['species'].value_counts()

Each species has 50 rows.

 

 

 

 

 


 

Using a decision tree

Splitting the train & test datasets

iris['id'] = range(len(iris))

First, create an id column filled with sequential values so that each row can be identified.

iris = iris[['id','sepal_length','sepal_width','petal_length','petal_width','species']]

Reorder the columns so that the id column comes first.

train = iris.sample(100,replace=False,random_state=7).reset_index().drop(['index'],axis=1)

Draw 100 rows at random, without replacement, and store them in train.

test = iris.loc[ ~iris['id'].isin(train['id']) ]
test = test.reset_index().drop(['index'],axis=1)

The rows of iris whose id does not appear in train go into test.
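As a point of comparison, the same 100/50 split can be sketched with scikit-learn's train_test_split. This is not the method used in the class; it is an alternative, and it assumes scikit-learn's bundled copy of the iris data (load_iris) in place of IRIS.csv, so the column names differ from the ones above. The stratify argument keeps the three species roughly balanced in both parts.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled iris data stands in for IRIS.csv.
iris_df = load_iris(as_frame=True).frame  # 150 rows: 4 feature columns + 'target'

# Hold out 50 rows as in the post; stratify keeps the class ratio in both parts.
train_df, test_df = train_test_split(
    iris_df, test_size=50, stratify=iris_df["target"], random_state=7
)

print(len(train_df), len(test_df))
print(train_df["target"].value_counts().sort_index().tolist())
```

Because the split is done by index, no extra id column is needed to tell the two parts apart.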

 

 

 

Training the decision tree

DecisionTreeClassifier(min_samples_split = n)

---> Characteristics : easy to interpret and fast.

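---> min_samples_split : the minimum number of samples a node must contain before it is allowed to split. (The minimum size of a final node is controlled by a separate parameter, min_samples_leaf.)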

 

from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(min_samples_split = 10)

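Setting min_samples_split to 10 means that a node holding fewer than 10 samples is not split any further. A final node can still contain fewer than 10 samples, since a node of 10 or more may be split into smaller children.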
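The difference between min_samples_split and min_samples_leaf can be checked by looking at the node sizes of a fitted tree. The sketch below is an illustration under assumptions: it uses scikit-learn's bundled iris data instead of IRIS.csv, fits on all 150 rows, and node_sizes is a hypothetical helper written for this example.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled iris data stands in for IRIS.csv.
X, y = load_iris(return_X_y=True)


def node_sizes(model):
    """Hypothetical helper: sample counts of internal nodes and of leaves."""
    t = model.tree_
    is_leaf = t.children_left == -1  # leaves have no children in the tree arrays
    return t.n_node_samples[~is_leaf].tolist(), t.n_node_samples[is_leaf].tolist()


split_tree = DecisionTreeClassifier(min_samples_split=10, random_state=0).fit(X, y)
leaf_tree = DecisionTreeClassifier(min_samples_leaf=10, random_state=0).fit(X, y)

split_internal, split_leaves = node_sizes(split_tree)
leaf_internal, leaf_leaves = node_sizes(leaf_tree)

# min_samples_split bounds the nodes that get split, not the leaves.
print("min_samples_split=10: smallest split node =", min(split_internal),
      ", smallest leaf =", min(split_leaves))
# min_samples_leaf bounds the leaves themselves.
print("min_samples_leaf=10: smallest leaf =", min(leaf_leaves))
```

If the goal is to guarantee a minimum size for final nodes, min_samples_leaf is the parameter to set.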

dt.fit(train[['sepal_length','sepal_width','petal_length','petal_width']],train['species'])

Fit the dt object created above on the training data.

predictions = dt.predict(test[['sepal_length','sepal_width','petal_length','petal_width']])

Store the predicted values in predictions.

test['pred'] = predictions

Save predictions in the pred column of test.

test.head()

(pd.Series(predictions)==test['species']).mean()

Comparing the predictions with the true labels gives an accuracy of 0.98.
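The same accuracy can also be computed with sklearn.metrics, and a confusion matrix shows which species get mixed up. A minimal sketch, assuming scikit-learn's bundled iris data and a train_test_split hold-out of 50 rows rather than the sample-based split above, so the exact numbers may differ from the 0.98 reported here.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled iris data stands in for IRIS.csv.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=50, random_state=7
)

model = DecisionTreeClassifier(min_samples_split=10, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)

manual_acc = (pred == y_test).mean()    # same idea as the comparison above
sk_acc = accuracy_score(y_test, pred)   # library equivalent
cm = confusion_matrix(y_test, pred)     # rows: true class, columns: predicted

print(manual_acc, sk_acc)
print(cm)
```

The diagonal of the confusion matrix counts the correct predictions, so its sum divided by the number of test rows equals the accuracy.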

 

 

 


Measuring accuracy on a single split, as above, can be unreliable, because the result may change considerably depending on how the data happens to be divided into train and test. Cross validation can be used to get a more stable accuracy estimate.


from sklearn.model_selection import cross_val_score
import numpy as np
dt = DecisionTreeClassifier(min_samples_split = 10)
scores = cross_val_score(dt, iris[['sepal_length','sepal_width','petal_length','petal_width']], iris['species'], cv=5, scoring="accuracy")
np.mean(scores)

 

When the dataset is small, as in this example, running cross validation on the entire dataset as above gives a more reliable estimate. With 5-fold cross validation the accuracy comes out at about 0.97.
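To see how much a single split can move the result, the hold-out accuracy can be repeated over several random seeds and set next to the per-fold cross validation scores. This is a sketch under assumptions: it uses scikit-learn's bundled iris data, ten arbitrary seeds, and a shuffled StratifiedKFold, none of which come from the class material.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled iris data stands in for IRIS.csv.
X, y = load_iris(return_X_y=True)

# Hold-out accuracy for ten different 100/50 splits.
holdout = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=50, random_state=seed)
    model = DecisionTreeClassifier(min_samples_split=10, random_state=0)
    holdout.append(model.fit(X_tr, y_tr).score(X_te, y_te))

# 5-fold cross validation; every row is used for testing exactly once.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(
    DecisionTreeClassifier(min_samples_split=10, random_state=0),
    X, y, cv=cv, scoring="accuracy",
)

print("hold-out min/max:", min(holdout), max(holdout))
print("per-fold scores:", scores)
print("cv mean/std:", np.mean(scores), np.std(scores))
```

Reporting the standard deviation alongside the mean gives a rough sense of how much the estimate depends on the split.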

 

 

 


Visualizing the decision tree

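from sklearn import tree
import matplotlib.pyplot as plt
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 16,10
# cross_val_score fits clones of dt, so dt itself must be fitted before plotting (here on the full dataset)
dt.fit(iris[['sepal_length','sepal_width','petal_length','petal_width']], iris['species'])
tree.plot_tree(dt,feature_names = ['sepal_length','sepal_width','petal_length','petal_width'],impurity=False, max_depth=2, fontsize=10, proportion=True)
plt.show()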

max_depth controls how deep the plot goes. Everything below depth 2 is abbreviated as (...). Fitting a decision tree and visualizing it like this makes the model quick and easy to interpret.
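When a figure is inconvenient, the same rules can be printed as text with sklearn.tree.export_text. A minimal sketch, assuming scikit-learn's bundled iris data and a tree fitted on all 150 rows; the feature names are passed by hand to match the column names used in this post.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumption: scikit-learn's bundled iris data stands in for IRIS.csv.
X, y = load_iris(return_X_y=True)
feature_names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

model = DecisionTreeClassifier(min_samples_split=10, random_state=0).fit(X, y)

# max_depth limits how many levels are printed, like max_depth in plot_tree.
rules = export_text(model, feature_names=feature_names, max_depth=2)
print(rules)
```

Each indented line is one split condition, and the lines starting with "class:" are the final nodes.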
