[IRIS Data Analysis] 2. Python Decision Tree
LearningSpoons class notes
< Previous post >
https://silvercoding.tistory.com/64
[IRIS Data Analysis] 1. Python KNN Classification
Loading the data
This exercise uses the same Iris Flower Dataset as the previous post.
< Iris Flower Dataset >
https://www.kaggle.com/arshid/iris-flower-dataset
import pandas as pd
import os
os.chdir('../data')  # path to your own folder containing the dataset
iris = pd.read_csv("IRIS.csv")
iris.head()
iris['species'].value_counts()
Each species has 50 samples.
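If the Kaggle CSV is not at hand, the same class balance can be checked with the copy of Iris bundled with scikit-learn. This is only a sketch: `load_iris` is a stand-in for `IRIS.csv`, the name `iris_alt` is made up for illustration, and the species label strings may differ from those in the CSV.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled Iris data stands in for the Kaggle IRIS.csv.
bunch = load_iris(as_frame=True)
iris_alt = bunch.data.copy()

# Rename scikit-learn's columns to the names used in this post.
iris_alt.columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

# Map the integer targets (0, 1, 2) to species names.
iris_alt['species'] = bunch.target.map(dict(enumerate(bunch.target_names)))

# Count the rows per species.
counts = iris_alt['species'].value_counts()
print(counts)
```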
Using a decision tree
Splitting into train & test datasets
iris['id'] = range(len(iris))
First, create an id column filled with sequential values so that each row can be identified.
iris = iris[['id','sepal_length','sepal_width','petal_length','petal_width','species']]
Reorder the columns so that id comes first.
train = iris.sample(100,replace=False,random_state=7).reset_index().drop(['index'],axis=1)
Randomly draw 100 samples without replacement and store them in train.
test = iris.loc[ ~iris['id'].isin(train['id']) ]
test = test.reset_index().drop(['index'],axis=1)
The iris rows whose id does not appear in train go into test.
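The id-based split above can also be written with scikit-learn's `train_test_split`. The sketch below is an alternative, not the method used in this post: it loads the bundled Iris data instead of `IRIS.csv`, adds `stratify=y` to keep the species proportions equal in both parts, and will not select the same 100 rows as `iris.sample(..., random_state=7)`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the bundled Iris data stands in for IRIS.csv.
X, y = load_iris(return_X_y=True, as_frame=True)

# 100 rows for training, the remaining 50 for testing;
# stratify=y keeps the class proportions the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=100, random_state=7, stratify=y
)

# The two parts never share a row.
overlap = set(X_train.index) & set(X_test.index)
print(len(X_train), len(X_test), len(overlap))
```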
Training the decision tree
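DecisionTreeClassifier(min_samples_split = n)
---> Characteristics : easy to interpret and fast.
---> min_samples_split : the minimum number of samples a node must hold before it may be split further (the minimum size of a final node is set separately, with min_samples_leaf)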
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(min_samples_split = 10)
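Setting min_samples_split to 10 means a node holding fewer than 10 samples is not split any further. This does not guarantee that every final node holds at least 10 samples; that constraint is min_samples_leaf.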
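The difference between the two parameters can be checked by inspecting node sizes of a fitted tree. This is a minimal sketch under some assumptions: it uses the bundled Iris data rather than `IRIS.csv`, fixes `random_state=0` for repeatability, and `node_sizes` is a helper written for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)


def node_sizes(model):
    """Return (sizes of internal nodes, sizes of leaves) for a fitted tree."""
    t = model.tree_
    is_leaf = t.children_left == -1  # leaves have no children
    return t.n_node_samples[~is_leaf], t.n_node_samples[is_leaf]


split_tree = DecisionTreeClassifier(min_samples_split=10, random_state=0).fit(X, y)
leaf_tree = DecisionTreeClassifier(min_samples_leaf=10, random_state=0).fit(X, y)

split_internal, split_leaves = node_sizes(split_tree)
leaf_internal, leaf_leaves = node_sizes(leaf_tree)

# min_samples_split bounds the size of nodes that get split, not the leaves.
print("min_samples_split=10: smallest split node =", split_internal.min(),
      ", smallest leaf =", split_leaves.min())
# min_samples_leaf bounds the size of every leaf.
print("min_samples_leaf=10: smallest leaf =", leaf_leaves.min())
```

With `min_samples_split`, a leaf may still end up smaller than the threshold, because a node of 10 or more samples can be split into one large and one very small child.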
dt.fit(train[['sepal_length','sepal_width','petal_length','petal_width']],train['species'])
Fit the dt object created above on the training data.
predictions = dt.predict(test[['sepal_length','sepal_width','petal_length','petal_width']])
Store the predicted values in predictions.
test['pred'] = predictions
Save the predictions in the pred column of test.
test.head()
(pd.Series(predictions)==test['species']).mean()
Comparing the predictions with the true labels gives an accuracy of 0.98.
An accuracy measured this way can be unreliable, because the result may change considerably depending on how the data happens to be divided into train and test. Cross validation can therefore be used to estimate the accuracy instead.
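How much a single split can matter is easy to see by repeating the hold-out evaluation with different random splits. A rough sketch, assuming the bundled Iris data in place of `IRIS.csv` and an arbitrary choice of 20 seeds:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

accuracies = []
for seed in range(20):  # 20 different random 100/50 splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=100, random_state=seed
    )
    model = DecisionTreeClassifier(min_samples_split=10, random_state=0)
    model.fit(X_tr, y_tr)
    # score() returns the accuracy on the held-out part.
    accuracies.append(model.score(X_te, y_te))

print("lowest:", min(accuracies), "highest:", max(accuracies))
```

The gap between the lowest and highest value shows how far a single hold-out estimate can move purely because of the split.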
from sklearn.model_selection import cross_val_score
import numpy as np
dt = DecisionTreeClassifier(min_samples_split = 10)
scores = cross_val_score(dt, iris[['sepal_length','sepal_width','petal_length','petal_width']], iris['species'], cv=5, scoring="accuracy")
np.mean(scores)
When the dataset is small, as in this example, running cross validation on the whole dataset as above gives a more reliable estimate. 5-fold cross validation yields an accuracy of about 0.97.
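To make clear what `cross_val_score` does with `cv=5`, the loop can be written out by hand. This sketch assumes the bundled Iris data in place of `IRIS.csv` and fixes `random_state=0` so both computations are repeatable; for a classifier, an integer `cv` corresponds to a `StratifiedKFold` without shuffling.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(min_samples_split=10, random_state=0)
cv = StratifiedKFold(n_splits=5)

# Manual 5-fold loop: train on four folds, score on the remaining one.
manual_scores = []
for train_idx, test_idx in cv.split(X, y):
    fold_model = clone(model).fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[test_idx], y[test_idx]))

# The library call performs the same procedure.
library_scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(np.mean(manual_scores), np.mean(library_scores))
```

Note that `cross_val_score` fits clones of the estimator, so `model` itself is still unfitted afterwards and needs its own `fit` call before it can predict or be plotted.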
Visualizing the decision tree
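from sklearn import tree
import matplotlib.pyplot as plt

features = ['sepal_length','sepal_width','petal_length','petal_width']
# dt was re-created for cross validation, and cross_val_score fits only clones,
# so fit it again (on the training set, as before) before plotting
dt.fit(train[features], train['species'])
plt.rcParams['figure.figsize'] = 16,10
tree.plot_tree(dt, feature_names=features, impurity=False, max_depth=2, fontsize=10, proportion=True)
plt.show()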
max_depth controls how many levels are drawn. Levels deeper than 2 are abbreviated as (...). Using a decision tree and visualizing it like this makes the model easy and convenient to interpret.
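When a figure is inconvenient, the same rules can be printed as text with `export_text`. A small sketch, assuming the bundled Iris data in place of `IRIS.csv` and a tree fitted on all 150 rows, so the thresholds may differ from the plot above:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
feature_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

model = DecisionTreeClassifier(min_samples_split=10, random_state=0).fit(X, y)

# max_depth limits how many levels are printed; deeper branches are truncated.
rules = export_text(model, feature_names=feature_names, max_depth=2)
print(rules)
```

Each indented line is one decision rule, which makes the tree easy to paste into notes or compare between runs.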