Notes from a Learning Spoons class

 

 

< Previous post >

https://silvercoding.tistory.com/63?category=967543 

 

[boston data analysis] 2. House price analysis using PCA and clustering


 

 

 

 


KNN concepts

* KNN classification: group 1 vs. group 2

1. Set k: select the k points nearest to the new point

2. Check whether group 1 or group 2 has more of those k points

3. Classify the new point into the group that has more.
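The three steps above can be sketched from scratch in a few lines. This is a minimal illustration, not how scikit-learn implements it: the function name `knn_classify`, the Euclidean distance, and the toy points are all assumptions made for the example.

```python
from collections import Counter
import math

def knn_classify(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Step 1: keep the k closest points
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Steps 2-3: count the labels among them and return the most common one
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two well-separated groups
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["group1", "group1", "group1", "group2", "group2", "group2"]

print(knn_classify(points, labels, query=(1.1, 1.0), k=3))  # group1
print(knn_classify(points, labels, query=(4.9, 5.0), k=3))  # group2
```

With an even k a tie is possible; this sketch then returns whichever tied label was seen first among the nearest points.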

 

 

* How to find K

1. Train a KNN model on the training data for each candidate K

2. Use each trained model to measure the error rate on the validation data (test data)

3. Choose the k with the smallest error rate

 

 

* An appropriate k has to be found!

- If k is very small, the model is sensitive to noise and may overfit

- If k is very large, the model loses its ability to capture local structure
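Both failure modes can be seen on a tiny one-dimensional example. The data below is made up: one point is deliberately mislabeled to act as noise, and the helper `knn_predict_1d` is a hypothetical name used only for this sketch.

```python
from collections import Counter

def knn_predict_1d(xs, ys, query, k):
    """Majority vote among the k training values closest to `query` (1-D)."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Made-up data: "A" lives around 0-5, "B" around 10-11.
# The point at 2.0 is a deliberately mislabeled "B" (noise).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 11.0]
ys = ["A", "A", "B", "A", "A", "A", "B", "B"]

# Very small k: the prediction follows the single noisy neighbor
small_k = knn_predict_1d(xs, ys, query=2.1, k=1)
# Moderate k: the surrounding "A" points outvote the noise
medium_k = knn_predict_1d(xs, ys, query=2.1, k=3)
# Very large k: every point votes, so the global majority wins
# even for a query that sits right inside the "B" region
large_k = knn_predict_1d(xs, ys, query=10.5, k=len(xs))

print(small_k, medium_k, large_k)  # B A A
```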

 

 

 

 


 Exploring the data

The dataset used in this post can be downloaded from Kaggle at the following link.

< Iris Flower Dataset >

https://www.kaggle.com/arshid/iris-flower-dataset

 


 

import pandas as pd
import os
os.chdir('../data')   # path to the folder that contains your dataset
iris = pd.read_csv("IRIS.csv")
iris.head()

(Note) sepal: the leaf-like part that sits beneath the petals / petal: the flower petal itself

We will build a classification model that distinguishes the three species, setosa, versicolor, and virginica, based on the sizes of the sepals and petals.

iris.info()

The dataset contains 150 rows in total, and there are no missing values.

iris['species'].value_counts()

The value_counts() function shows how many rows each species has. Every species has exactly 50 rows.

 

 

 

 

 


 KNN practice - classification

(ex) KNeighborsClassifier(n_neighbors=n)

---> Slow when there is a lot of data, because the neighbor search happens at prediction time

---> n_neighbors=n : sets k (the model looks at the k nearest points)

 

 

iris['id'] = range(len(iris))

To identify each row, assign sequential values and store them in an id column.

iris = iris[['id','sepal_length','sepal_width','petal_length','petal_width','species']]

Reorder the columns so that the id column comes first.

iris.head()

 

 

Splitting train & test data

train = iris.sample(100, replace=False, random_state=7).reset_index(drop=True)
train

Randomly draw 100 rows for the training set. Sampling is done without replacement, and the shuffled index is reset afterwards.

test = iris.loc[ ~iris['id'].isin(train['id']) ]
# test = test.reset_index().drop(['index'],axis=1)  # equivalent to the line below
test = test.reset_index(drop=True)

The test set is built from only the rows whose id does not appear in the training set. Its index is reset in the same way.
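The logic of this split, sampling ids without replacement and then taking the complement, can be checked with the standard library alone. This is only a sketch of the idea: `random.Random(7)` does not reproduce the rows that pandas picks with `random_state=7`, and the variable names are assumptions for the example.

```python
import random

ids = list(range(150))            # stands in for the id column (0..149)
rng = random.Random(7)            # fixed seed so the split is repeatable

# Sample 100 ids without replacement for training
train_ids = rng.sample(ids, 100)

# Everything not drawn for training becomes the test set
train_set = set(train_ids)
test_ids = [i for i in ids if i not in train_set]

print(len(train_ids), len(test_ids))  # 100 50
```

Because the test ids are defined as the complement of the training ids, no row can appear in both sets.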

 

 

 

 

KNN training (trying k=3)

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3) # define the model

Create a KNN classifier object with k set to 3.

knn.fit( train[['sepal_length','sepal_width','petal_length','petal_width']] , train['species'] )

The call takes the form knn.fit(train_X, train_y).

predictions = knn.predict( test[['sepal_length','sepal_width','petal_length','petal_width']] )

The call takes the form knn.predict(test_X). Predict on the test data and store the result in predictions.

test['pred'] = predictions
test.head()

The predictions were added as the pred column. In the first 5 rows, every prediction matches the true species.

(test['pred'] == test['species']).mean()

Comparing the predictions with the true labels gives an accuracy of 0.94. Next, compute the accuracy for several values of k to decide on the best one.

 

 

 

 

Finding the best K

- Using the train & test data

for k in range(1,30):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit( train[['sepal_length','sepal_width','petal_length','petal_width']] , train['species'] )
    predictions = knn.predict( test[['sepal_length','sepal_width','petal_length','petal_width']] )
    print((pd.Series(predictions) == test['species']).mean())

These are the accuracies obtained by training with k from 1 to 29. Picking the first of the highest values gives k=5 (accuracy 0.98).

 

---> Best K : 5
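The rule "take the first of the highest accuracies" can be written down explicitly instead of reading it off the printed list. The helper name `first_best_k` and the accuracy values below are hypothetical; they are NOT the output of the loop above.

```python
def first_best_k(accuracy_by_k):
    """Return the smallest k whose accuracy equals the maximum accuracy."""
    best = max(accuracy_by_k.values())
    return min(k for k, acc in accuracy_by_k.items() if acc == best)

# Hypothetical accuracies for illustration only (not the post's results)
example = {1: 0.90, 2: 0.92, 3: 0.96, 4: 0.96, 5: 0.94}
print(first_best_k(example))  # 3
```

Preferring the smallest k among ties is a convention chosen here to match the text, not the only reasonable choice.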

 


However, this approach can be unreliable, because the result may change considerably depending on how the data happens to be split into train and test sets. Cross validation can be used to estimate the accuracy instead.


 

- Using cross validation

from sklearn.model_selection import cross_val_score
import numpy as np
for k in range(1,30):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, iris[['sepal_length','sepal_width','petal_length','petal_width']], iris['species'], cv=5)
    print(f"{k} : " ,np.mean(scores))

5-fold cross validation was run. The highest accuracy first appears at k=6.

 

 

---> Best K : 6
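To make the idea of 5-fold cross validation concrete, here is a minimal sketch of how 150 rows can be divided into folds, each taking one turn as the validation set. The generator `k_fold_indices` is a hypothetical helper that makes plain contiguous folds. It is not what `cross_val_score` does internally: for a classifier with an integer `cv`, scikit-learn uses stratified folds that keep the class proportions in each fold.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        # Training indices are everything outside the validation block
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=150, n_folds=5))
for train_idx, val_idx in folds:
    print(len(train_idx), len(val_idx))  # 120 30 for every fold
```

Every row is used for validation exactly once, so the averaged score depends less on one particular split.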

 

 

 

 


 KNN practice - regression

KNN can also be used for regression problems. To practice KNN regression, we build a model that predicts petal_width from sepal_length, sepal_width, and petal_length.

del train['species']
del test['species']

To keep the exercise simple, the categorical variable species is dropped. Training then proceeds exactly as in the classification problem.

from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=3)
knn.fit( train[['sepal_length','sepal_width','petal_length']] , train['petal_width'] )

predictions = knn.predict( test[['sepal_length','sepal_width','petal_length']] )
test['pred'] = predictions
test.head()

Training and prediction work the same way as before.
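What changes in regression is the last step: instead of a majority vote, the prediction is the average target value of the k nearest neighbors. This matches the uniform weighting that KNeighborsRegressor uses by default. The from-scratch function `knn_regress` and the toy data are assumptions made for this sketch.

```python
import math

def knn_regress(train_points, train_targets, query, k):
    """Predict the mean target of the k training points nearest to `query`."""
    # Rank training points by Euclidean distance to the query
    ranked = sorted(
        zip(train_points, train_targets),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_targets = [target for _, target in ranked[:k]]
    # Uniform weighting: a plain average of the neighbors' targets
    return sum(nearest_targets) / k

# Made-up toy data: one feature, target equal to the feature value
points = [(1.0,), (2.0,), (3.0,), (10.0,)]
targets = [1.0, 2.0, 3.0, 10.0]

# Nearest three points to 2.1 are 2.0, 3.0, and 1.0, so the mean is 2.0
print(knn_regress(points, targets, query=(2.1,), k=3))  # 2.0
```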

 

 

* Mean absolute error (MAE) : one of the ways to evaluate model performance in regression problems.

MAE can be computed as follows.

abs(test['petal_width'] - pd.Series(predictions)).mean()

Subtract the prediction from the true value, take the absolute value, and average these absolute errors. Because this metric measures error, a smaller value means a better prediction.
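The same calculation written without pandas, as a small worked example. The function name mirrors `sklearn.metrics.mean_absolute_error`, but this is a plain-Python sketch, and the three value pairs are made up.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted| over all samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up values: absolute errors are 0.5, 0.0, and 1.0
y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 2.0]

mae = mean_absolute_error(y_true, y_pred)
print(mae)  # (0.5 + 0.0 + 1.0) / 3 = 0.5
```

MAE is expressed in the same unit as the target, so for petal_width it can be read directly as an average miss in the dataset's width unit.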

for k in range(1,30):
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit( train[['sepal_length','sepal_width','petal_length']] , train['petal_width'] )
    predictions = knn.predict( test[['sepal_length','sepal_width','petal_length']] )    
    print(str(k)+' :'+str(abs(test['petal_width'] - pd.Series(predictions)).mean()))

The k with the smallest error is 7.

 

---> Best K : 7

 

 

 

 

 

 

 
