Learning Spoons class notes

< Previous post >

https://silvercoding.tistory.com/64

[IRIS data analysis] 1. Python KNN classification

Loading the data

This exercise uses the same Iris Flower Dataset as the previous post.

< Iris Flower Dataset >

https://www.kaggle.com/arshid/iris-flower-dataset

 

import pandas as pd
import os
os.chdir('../data')  # path to your own folder that contains the dataset
iris = pd.read_csv("IRIS.csv")
iris.head()

iris['species'].value_counts()

Each species has 50 samples.

Using a decision tree

Splitting into train & test datasets

iris['id'] = range(len(iris))

First, an id column holding sequential values is created so that individual rows can be told apart.

iris = iris[['id','sepal_length','sepal_width','petal_length','petal_width','species']]

The columns are reordered so that id comes first.

train = iris.sample(100,replace=False,random_state=7).reset_index().drop(['index'],axis=1)

100 samples are drawn at random and stored in train.

test = iris.loc[ ~iris['id'].isin(train['id']) ]
test = test.reset_index().drop(['index'],axis=1)

The iris rows whose id does not appear in train are placed in test.
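Because the 100 training rows are drawn purely at random, the three species are not guaranteed to appear in equal numbers in train and test. The sketch below shows one way to keep the class shares equal (a stratified split). It uses only the standard library; the function name stratified_split and the toy labels are assumptions made for illustration, not part of the original exercise.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction, seed=7):
    """Return (train_idx, test_idx) so that each class keeps the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # random order within one class
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels shaped like the iris data: 50 rows per species.
labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
train_idx, test_idx = stratified_split(labels, train_fraction=2 / 3)
print(len(train_idx), len(test_idx))
```

With scikit-learn, train_test_split offers the same idea through its stratify argument.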

 

 

 

Training the decision tree

DecisionTreeClassifier(min_samples_split = n)

---> Characteristics : easy to interpret and fast.

---> min_samples_split : the minimum number of samples a node must contain before it may be split further (the minimum number of samples in a final node is controlled by min_samples_leaf instead)

from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(min_samples_split = 10)

min_samples_split is set to 10, so any node holding fewer than 10 samples is not split again and becomes a final node.
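To make the effect of min_samples_split concrete, the following sketch inspects a fitted tree. It assumes scikit-learn's bundled copy of the iris data (load_iris) rather than the IRIS.csv file used in this post, and random_state=0 is an arbitrary choice added for repeatability.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
dt = DecisionTreeClassifier(min_samples_split=10, random_state=0).fit(X, y)

tree = dt.tree_
is_leaf = tree.children_left == -1            # leaves have no child nodes
internal_sizes = tree.n_node_samples[~is_leaf]
leaf_sizes = tree.n_node_samples[is_leaf]

# Every node that was split holds at least 10 samples;
# leaves are allowed to be smaller than that.
print("smallest node that was split:", internal_sizes.min())
print("smallest leaf:", leaf_sizes.min())
```

If the goal is to bound the size of the final nodes themselves, min_samples_leaf is the parameter to set.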

dt.fit(train[['sepal_length','sepal_width','petal_length','petal_width']],train['species'])

The dt object created above is fitted to the training data.

predictions = dt.predict(test[['sepal_length','sepal_width','petal_length','petal_width']])

The predicted values are stored in predictions.

test['pred'] = predictions

The predictions are saved in the pred column of test.

test.head()

(pd.Series(predictions)==test['species']).mean()

Comparing the predictions with the true labels gives an accuracy of 0.98.

 

 

 


Accuracy measured this way can be unreliable, because the result may change considerably depending on how the data happens to be divided into train and test sets. Cross validation can be used to obtain a more stable accuracy estimate.


from sklearn.model_selection import cross_val_score
import numpy as np
dt = DecisionTreeClassifier(min_samples_split = 10)
scores = cross_val_score(dt, iris[['sepal_length','sepal_width','petal_length','petal_width']], iris['species'], cv=5, scoring="accuracy")
np.mean(scores)

When the dataset is small, as in this example, running cross validation on the whole dataset as above gives a more reliable estimate. The 5-fold cross validation yields an accuracy of about 0.97.
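As a rough picture of what 5-fold cross validation does with 150 rows, the sketch below generates the train/test index sets for each fold. The helper k_fold_indices is hypothetical and deliberately simplified: it cuts contiguous blocks, whereas cross_val_score with a classifier and an integer cv uses stratified folds, which matters for data sorted by class such as the iris file.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs; each sample lands in a test fold exactly once."""
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=150, n_folds=5))
for train_idx, test_idx in folds:
    # each round trains on 120 rows and evaluates on the remaining 30
    print(len(train_idx), len(test_idx))
```

The reported score is the mean of the five per-fold accuracies, so every row contributes to the evaluation once.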

 

 

 


Visualizing the decision tree

from sklearn import tree
import matplotlib.pyplot as plt
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 16,10
feature_names = ['sepal_length','sepal_width','petal_length','petal_width']
# cross_val_score only fits internal copies, so dt must be fitted before it can be plotted
# (fitting on the full iris data here is an assumption)
dt.fit(iris[feature_names], iris['species'])
tree.plot_tree(dt, feature_names=feature_names, impurity=False, max_depth=2, fontsize=10, proportion=True)
plt.show()

max_depth controls how many levels are drawn; everything below depth 2 is abbreviated as (...). Using a decision tree and visualizing it like this makes the model simple to interpret.

< Previous post >

https://silvercoding.tistory.com/63?category=967543

[boston data analysis] 2. House price analysis using PCA and clustering

 

 

 

 


KNN concept summary

* KNN classification process: group 1 vs group 2

1. Set k : select the k nearest points

2. Check whether group 1 or group 2 is more common among those k points

3. Classify the new point into the more common group.

* How to find K

1. Train a KNN model on the training data for each candidate K

2. Measure the error rate of each model on the validation data (test data)

3. Choose the k with the smallest error rate

* A suitable k has to be found!

- If k is very small, the model is sensitive to noise and may overfit

- If k is very large, the model loses its ability to capture local structure
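The classification steps above can be sketched in a few lines of plain Python. This is a minimal illustration on made-up 2-D points, not the scikit-learn implementation used later in this post; the function name knn_predict is hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])          # nearest first
    k_nearest_labels = [label for _, label in distances[:k]]
    return Counter(k_nearest_labels).most_common(1)[0][0]

# Made-up points for two groups.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]
train_labels = ["group 1", "group 1", "group 1", "group 2", "group 2", "group 2"]

print(knn_predict(train_points, train_labels, query=(1.1, 1.0), k=3))  # group 1
print(knn_predict(train_points, train_labels, query=(2.9, 3.1), k=3))  # group 2
```

Every prediction scans all training points, which is why KNN becomes slow as the data grows.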

 

 

 

 


Exploring the data

The dataset used in this post can be downloaded from the following Kaggle link.

< Iris Flower Dataset >

https://www.kaggle.com/arshid/iris-flower-dataset

 

import pandas as pd
import os
os.chdir('../data')   # path to the folder where your dataset is stored
iris = pd.read_csv("IRIS.csv")
iris.head()

(Note) sepal : the leaf-like part that supports the flower / petal : flower petal

The goal is to build a classification model that distinguishes the 3 species setosa, versicolor and virginica from the sizes of the sepal and the petal.

iris.info()

There are 150 rows in total and no missing values.

iris['species'].value_counts()

value_counts() shows how many rows each species has. Each species has exactly 50 rows.

 

 

 

 

 


KNN practice - classification

(ex) KNeighborsClassifier(n_neighbors=n)

---> Slow when there is a lot of data

---> n_neighbors=n : sets k (the model looks at the k nearest points)

 

 

iris['id'] = range(len(iris))

To identify each row, sequential values are assigned and stored in an id column.

iris = iris[['id','sepal_length','sepal_width','petal_length','petal_width','species']]

The columns are reordered so that id comes first.

iris.head()

Splitting train & test data

train = iris.sample(100, replace=False, random_state=7).reset_index(drop=True)
train

100 rows are drawn at random for the training dataset. Sampling is without replacement, and the shuffled index is reset.

test = iris.loc[ ~iris['id'].isin(train['id']) ]
# test = test.reset_index().drop(['index'],axis=1)  # equivalent to the line below
test = test.reset_index(drop=True)

The test dataset consists only of the rows whose id is not in the training dataset. Its index is reset in the same way.

 

 

 

 

KNN training (trying k=3)

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3) # define the model

A KNN classifier object with k=3 is created.

knn.fit( train[['sepal_length','sepal_width','petal_length','petal_width']] , train['species'] )

The call has the form knn.fit(train_X, train_y).

predictions = knn.predict( test[['sepal_length','sepal_width','petal_length','petal_width']] )

The call has the form knn.predict(test_X). Predictions are made on the test data and stored in predictions.

test['pred'] = predictions
test.head()

The predictions are added as the pred column. All of the first 5 rows shown are predicted correctly.

(test['pred'] == test['species']).mean()

Comparing the predictions with the true labels gives an accuracy of 0.94. Next, the accuracy is computed for several values of k to decide on the best k.

 

 

 

 

Finding the best K

- Using the train & test data

for k in range(1,30):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit( train[['sepal_length','sepal_width','petal_length','petal_width']] , train['species'] )
    predictions = knn.predict( test[['sepal_length','sepal_width','petal_length','petal_width']] )
    print((pd.Series(predictions) == test['species']).mean())

These are the accuracies obtained by training with k from 1 to 29. Taking the first of the highest values gives k=5 (accuracy 0.98).

---> Best K : 5

 


However, this approach can be unreliable, because the result may change considerably depending on how the data happens to be divided into train and test sets. Cross validation can be used to obtain a more stable accuracy estimate.

- Using cross validation

from sklearn.model_selection import cross_val_score
import numpy as np
for k in range(1,30):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, iris[['sepal_length','sepal_width','petal_length','petal_width']], iris['species'], cv=5)
    print(f"{k} : " ,np.mean(scores))

5-fold cross validation was run. The highest accuracy first appears at k=6.

---> Best K : 6
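Reading the best k off a printed list works, but the same rule ("the first of the highest values") can also be applied in code. The sketch below assumes the scores have been collected into a dictionary; the helper best_k and the example accuracies are hypothetical and are not the numbers produced in this post.

```python
def best_k(scores_by_k):
    """Return the smallest k among those that reach the highest score."""
    top = max(scores_by_k.values())
    return min(k for k, score in scores_by_k.items() if score == top)

# Hypothetical accuracies, for illustration only.
example_scores = {1: 0.90, 2: 0.92, 3: 0.95, 4: 0.95, 5: 0.93}
print(best_k(example_scores))  # 3
```

Preferring the smallest k among ties is a design choice that mirrors the rule used in the text above.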

 

 

 

 


KNN practice - regression

KNN can also be used for regression problems. To practice KNN regression, a model is built that predicts petal_width from sepal_length, sepal_width and petal_length.

del train['species']
del test['species']

To keep the exercise simple, the categorical variable species is deleted. Training then proceeds just as in the classification problem.

from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=3)
knn.fit( train[['sepal_length','sepal_width','petal_length']] , train['petal_width'] )

predictions = knn.predict( test[['sepal_length','sepal_width','petal_length']] )
test['pred'] = predictions
test.head()

Fitting and prediction work the same way as before.

* Mean absolute error ( MAE ) : one of the ways to evaluate model performance in regression problems.

The MAE can be computed as follows.

abs(test['petal_width'] - pd.Series(predictions)).mean()

Subtract the prediction from the true value, take the absolute value, and average these absolute errors. Because this metric measures error, a smaller value means a better prediction.

for k in range(1,30):
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit( train[['sepal_length','sepal_width','petal_length']] , train['petal_width'] )
    predictions = knn.predict( test[['sepal_length','sepal_width','petal_length']] )    
    print(str(k)+' :'+str(abs(test['petal_width'] - pd.Series(predictions)).mean()))

The k with the smallest error is 7.

---> Best K : 7
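The MAE used in this section is simple enough to write out by hand. The sketch below is a plain-Python version with made-up numbers; the function is written only for illustration.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted| over all samples."""
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(errors) / len(errors)

# Made-up petal widths and predictions.
y_true = [0.2, 1.3, 2.0, 1.5]
y_pred = [0.3, 1.0, 2.0, 1.9]
print(mean_absolute_error(y_true, y_pred))  # about 0.2
```

The result is in the same unit as the target (here, the unit of petal_width), which makes it easy to interpret.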

 

 

 

 

 

 

 

< Previous post >

https://silvercoding.tistory.com/62

[boston data analysis] 1. Dimensionality reduction (PCA) Python example

 

 

 


Loading libraries & data
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv('./data/boston.csv')
data.head()

Clustering
del data['chas']

The categorical variable is removed so that only continuous data is handled.

medv = data['medv']
del data['medv']

The target variable is copied aside and its column is then dropped (in preparation for PCA).

Dimensionality reduction (PCA) : 12 dimensions -> 2 dimensions

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

The libraries needed for dimensionality reduction are imported.

- Standardization

scaler = StandardScaler()

The scaler object is created.

# fit the scaler to the data
scaler.fit(data)
# transform
scaler_data = scaler.transform(data)

The whole of data is standardized and stored in scaler_data.
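For reference, the transformation that StandardScaler applies to each column is (value - mean) / standard deviation, using the population standard deviation. The sketch below reproduces that for one column with the standard library; the numbers are made up.

```python
import statistics

def standardize(values):
    """Shift to mean 0 and scale to standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)   # population std (divides by n)
    return [(v - mean) / std for v in values]

tax_like = [200.0, 300.0, 400.0, 700.0]   # made-up values on a large scale
scaled = standardize(tax_like)
print(scaled)
print(statistics.fmean(scaled), statistics.pstdev(scaled))
```

After this step all columns are on a comparable scale, so no single column dominates the PCA just because of its units.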

 

 

- PCA

pca = PCA(n_components = 2)

The PCA object is created. The number of components is set to 2 so the result can be visualized in two dimensions.

pca.fit(scaler_data)

The pca object is fitted to scaler_data.

data2 = pd.DataFrame(data = pca.transform(scaler_data), columns=['pc1', 'pc2'])

The data transformed by pca is stored in data2 as a DataFrame.

data2.head()

 

 

Choosing the number of clusters - setting the Elbow Point

from sklearn.cluster import KMeans

KMeans(n_clusters = k)

  • Creates an object that will form k clusters

KMeans.fit()

  • Fits the model

KMeans.inertia_

  • The inertia of the fitted KMeans
  • Inertia is the sum of squared distances from each sample to the center of the cluster it belongs to
  • For a given number of clusters, a lower value means tighter clusters. Inertia always decreases as k grows, which is why the elbow of the curve is used rather than the minimum.

KMeans.predict(data)

  • Assigns each sample to the nearest learned cluster center and returns its cluster label

x = []   # number of clusters k
y = []   # inertia for that k

for k in range(1, 30):
    kmeans = KMeans(n_clusters = k)
    kmeans.fit(data2)
    
    x.append(k)
    y.append(kmeans.inertia_)

Clustering is run for k from 1 to 29, and the inertia is plotted to choose a suitable number of clusters.

plt.plot(x, y)

Judging from the plot of inertia against the number of clusters, around 3 to 5 clusters looks appropriate. The Elbow Point is set to 4 and clustering proceeds with that value.
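To see why inertia drops as clusters are added, the quantity can be computed by hand. The sketch below is a plain-Python illustration on four made-up points with hand-picked centers; it does not run the K-means algorithm itself.

```python
def inertia(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    total = 0.0
    for p in points:
        total += min(
            sum((a - b) ** 2 for a, b in zip(p, c))
            for c in centers
        )
    return total

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
print(inertia(points, centers=[(5.0, 1.0)]))                # 104.0 with one cluster
print(inertia(points, centers=[(0.0, 1.0), (10.0, 1.0)]))   # 4.0 with two clusters
```

Going from one to two centers removes most of the inertia here; adding further centers would lower it only slightly, and that bend is what the elbow plot looks for.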

 

 

 

Clustering

kmeans = KMeans(n_clusters=4)

The object is created with the number of clusters set to 4.

kmeans.fit(data2)

It is fitted to the data2 created above.

data2['labels'] = kmeans.predict(data2)

The predicted cluster of each row is stored in the labels column.

data2.head()

A labels value of 1 means that the row belongs to cluster 1.

sns.scatterplot(x='pc1', y='pc2', hue='labels', data=data2)

The plot confirms that the clusters have formed as shown.

 

 

 

 

Interpreting the results

- Which group has the highest house prices? : compare by mean

data2['medv'] = medv

To compute the mean house price of each group, the medv column saved at the beginning is added to data2 as the medv column.

data2.head()

data2[data2['labels']==0]['medv'].mean()

The mean house price of cluster 0 is computed as above. The same expression is used to plot the mean house price of every cluster.

medv_list = []

for i in range(4):
    medv_avg = data2[data2['labels']==i]['medv'].mean()
    medv_list.append(medv_avg)
sns.barplot(x=['group_0', 'group_1', 'group_2', 'group_3'], y=medv_list)

Group with the highest mean house price : group_2

Group with the lowest mean house price : group_0

(K-means cluster numbers are arbitrary, so the group numbers can differ from run to run.)

---> The highest and lowest groups are compared to see why their mean house prices are high or low.


* Analyzing the causes with the original data

data['labels'] = data2['labels']

The group labels are added to the original data.

group = data[(data['labels']==0) | (data['labels']==2)]

Only group 0 and group 2 are selected and stored in the group variable.

group = group.groupby('labels').mean().reset_index()

groupby computes the mean of every column per labels value. Since groupby turns labels into the index, reset_index() is used to make it a column again.

group

The mean of each group has been computed. These are visualized next for comparison.

 

 

Interpreting the results - visualization

column = group.columns
fig, ax = plt.subplots(2, 6, figsize=(30, 13))

for i in range(12):
    # column[0] is 'labels', so the 12 features start at column[1]
    sns.barplot(x='labels', y=column[i+1], data=group, ax=ax[i//6, i%6])

Each chart has two bars. A taller left bar (group 0) is read as evidence for why house prices are low, and a taller right bar (group 2) as evidence for why they are high.

 

 

Conclusion

- In the graph at position (0, 0), crim (crime rate) is far higher in group 0. This can be read as: the higher the crime rate, the lower the house price.

- In the graph at position (0, 1), a higher zn (proportion of residential land zoned for lots over 25,000 square feet) goes with higher house prices.

- Next, look at the graph at position (1, 1). rad (index of accessibility to radial highways) is higher in the group with lower house prices.

Many such interpretations can be made in this way.


 

 

 

 

Loading libraries & data

- Loading libraries

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

- Loading the data

data = pd.read_csv('./data/boston.csv')
data.head()

- Exploring the data

data.info()

There are 506 rows and no missing values.

data.columns

There are 14 columns in total. The meaning of each column is as follows.

  • crim: crime rate
  • zn: proportion of residential land zoned for lots over 25,000 square feet
  • indus: proportion of non-retail business area
  • chas: 1 if the tract borders the Charles River, otherwise 0
  • nox: nitric oxide concentration
  • rm: number of rooms per dwelling
  • age: proportion of homes built before 1940
  • dis: distance to employment centers
  • rad: index of accessibility to radial highways
  • tax: property tax rate
  • ptratio: pupil/teacher ratio
  • b: proportion of Black residents in the population
  • lstat: proportion of lower-status population
  • medv : median 1978 house price of the 506 Boston towns (in units of 1,000 dollars)

Feature Selection : correlation coefficient and covariance

- Removing the categorical variable

del data['chas']

Correlation and covariance are computed on continuous data, so the categorical variable is removed. In real work a variable should only be dropped after careful thought; here it is dropped for the sake of the exercise.

Formulating hypotheses

1. Are house prices lower where the crime rate is high?

2. Are house prices higher where there are more rooms?

3. Are house prices lower where the nitric oxide concentration is higher?

4. Are house prices higher where the property tax rate is higher?

 

1. Are house prices lower where the crime rate is high?

sns.jointplot(data=data, x='crim', y='medv', kind='reg')

The negative relationship is not extreme, but house prices tend to fall as the crime rate rises.

- Covariance

data['crim'].cov(data['medv'])

The covariance indicates a negative relationship.

- Correlation coefficient

data['crim'].corr(data['medv'])    # Pearson correlation coefficient

If r is between -1.0 and -0.7, a strong negative linear relationship,

if r is between -0.7 and -0.3, a distinct negative linear relationship,

if r is between -0.3 and -0.1, a weak negative linear relationship,

if r is between -0.1 and +0.1, a linear relationship that can mostly be ignored,

if r is between +0.1 and +0.3, a weak positive linear relationship,

if r is between +0.3 and +0.7, a distinct positive linear relationship,

if r is between +0.7 and +1.0, a strong positive linear relationship

Source - https://ko.wikipedia.org/wiki/%EC%83%81%EA%B4%80_%EB%B6%84%EC%84%9D

According to this interpretation scale for the Pearson correlation coefficient, crime rate and house prices have a distinct negative linear relationship.

[ Hypothesis 1 : True ]

 

 

2. Are house prices higher where there are more rooms?

sns.jointplot(data=data, x='rm', y='medv', kind='reg')

The plot shows that house prices tend to rise as the number of rooms increases.

- Covariance

data['rm'].cov(data['medv'])

The covariance indicates a positive relationship.

- Correlation coefficient

data['rm'].corr(data['medv'])

The correlation coefficient indicates a distinct positive linear relationship that is close to a strong one. This also exposes a weakness of covariance: its magnitude depends on the units of the variables, so the covariances of hypothesis 1 and hypothesis 2 cannot be compared directly to judge which relationship is stronger, whereas the correlation coefficients can.
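The difference between the two measures can be checked with a small experiment: rescaling one variable changes the covariance but leaves the correlation untouched. The sketch below uses made-up numbers and hand-written helpers (sample covariance with n - 1 in the denominator, as pandas uses by default).

```python
import math

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

rooms = [4.0, 5.0, 6.0, 7.0, 8.0]                 # made-up values
price = [12.0, 18.0, 21.0, 30.0, 34.0]            # made-up values, in $1,000
price_in_dollars = [p * 1000 for p in price]      # same prices, different unit

print(covariance(rooms, price), covariance(rooms, price_in_dollars))
print(correlation(rooms, price), correlation(rooms, price_in_dollars))
```

The covariance grows by a factor of 1,000 when the unit changes, while the correlation stays the same, which is why the correlation is the value to compare across variable pairs.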

 

[ Hypothesis 2 : True ]

3. Are house prices lower where the nitric oxide concentration is higher?

sns.jointplot(data=data, x='nox', y='medv', kind='reg')

data['nox'].corr(data['medv'])

-0.4273207723732824

This is a distinct negative correlation. House prices tend to be lower where the nitric oxide concentration is higher.

[ Hypothesis 3 : True ]

4. Are house prices higher where the property tax rate is higher?

sns.jointplot(data=data, x='tax', y='medv', kind='reg')

data['tax'].corr(data['medv'])

-0.46853593356776696

There is a distinct negative correlation: house prices are lower where the property tax rate is higher.

[ Hypothesis 4 : False ]

 

 

- Correlation coefficients of all the data - heatmap

plt.figure(figsize=(10, 7))
sns.heatmap(data.corr(), cmap='RdBu_r', annot=True, fmt='0.1f')

lstat and rm have the strongest correlations with house prices. The proportion of lower-status population (lstat) is negatively correlated, and the number of rooms (rm) is positively correlated. In contrast, dis (distance to employment centers) and b (proportion of Black residents) show only a low correlation with house prices.

 

 

 

 

 


Feature Extraction

- Comparing correlations to decide how many variables to reduce to how many

corr_bar = []

for column in data.columns:
    print(f"Correlation between {column} and house price: {data[column].corr(data['medv'])}")
    corr_bar.append(abs(data[column].corr(data['medv'])))

The correlation of each column with the house price is printed, and its absolute value is appended to the corr_bar list.

corr_bar

sns.barplot(x=data.columns, y=corr_bar)

The bar chart makes it clear at a glance that dis and b have smaller correlation coefficients than the other columns.

x = data[['dis', 'b']]

These two variables are selected from data and stored in x.

x.head()

 

Using PCA

from sklearn.decomposition import PCA

PCA(n_components)

  • n_components : the number of new variables (components) to create.
  • This step creates the object

PCA.fit(x)

  • Fits the object created above to the data in x

PCA.components_

  • The weight each original variable receives in each new variable, as learned by fit

PCA.explained_variance_ratio_

  • The proportion of variance explained by each new variable

PCA.transform

  • Transforms the data in x using the fitted object

- 2 variables -> 1 variable

pca = PCA(n_components=1)

n_components is set to the number of variables to create.

pca.fit(x)

The object is fitted to the data.

pca.components_

This shows how much of each original variable goes into the new variable. The value on the right (b) is far too large, which means the new variable carries almost only the information of that variable. ( -> this is why standardization is needed / covered below )

pca.explained_variance_ratio_

This is the proportion of variance explained by the new variable.

data['pc1'] = pca.transform(x)

The data in x is transformed with the fitted pca and added to data as a column named pc1.

data

The DataFrame with the new column added.

sns.jointplot(data=data, x='pc1', y='medv', kind='reg')

data['pc1'].corr(data['medv'])

The correlation between the new column and the house price is hardly different from that of the original b variable. Let's standardize the data and fit again.

Standardization

from sklearn.preprocessing import StandardScaler

StandardScaler()

  • Creates the scaler object

scaler.fit(x)

  • Fits the scaler to the data

scaler.transform(x)

  • Transforms the data in x with the fitted scaler

scaler = StandardScaler()

The scaler object is created.

scaler.fit(x)
scaler_x = scaler.transform(x)

After fitting on x, the data is transformed into its standardized form and stored in scaler_x.

scaler_x

 

 

- Running pca on the standardized data

# set the number of components to 1
pca = PCA(n_components=1)
# fit to the data
pca.fit(scaler_x)
# check how much of each original variable goes into the new variable
# and compare with the values obtained before standardization
pca.components_

With standardized data, the two original variables enter the new variable with equal weight.

pca.explained_variance_ratio_

The proportion of variance explained by the new variable has decreased, but it matters more that both variables are now represented equally.
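The contrast between the two pca.components_ results can be reproduced without scikit-learn. For two variables, the direction of the first principal component follows directly from the 2x2 covariance matrix. The sketch below uses that closed form on made-up columns (dis_like on a small scale, b_like on a large scale); the helper names are hypothetical.

```python
import math

def standardize(values):
    """Shift to mean 0 and scale to (population) standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def first_principal_axis(xs, ys):
    """Unit vector of the first principal component for two variables."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mx) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - my) ** 2 for y in ys) / (n - 1)
    cov_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # orientation of the largest-variance direction of a 2x2 covariance matrix
    angle = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)
    return math.cos(angle), math.sin(angle)

dis_like = [1.0, 2.0, 3.0, 4.0, 5.0]               # made-up, small scale
b_like = [390.0, 350.0, 396.0, 300.0, 380.0]       # made-up, large scale

raw_axis = first_principal_axis(dis_like, b_like)
scaled_axis = first_principal_axis(standardize(dis_like), standardize(b_like))
print("raw:   ", raw_axis)      # second weight close to 1 in absolute value
print("scaled:", scaled_axis)   # both weights equal in absolute value
```

With exactly two standardized variables the weights always come out equal in absolute value (about 0.707 each), so the equal split seen above is a property of the setup rather than of this particular dataset. The sign of the weights is arbitrary.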

data['pc1'] = pca.transform(scaler_x)
data.head()

The standardized data is transformed with pca and stored in the pc1 column.

- Comparing correlation coefficients

sns.jointplot(data=data, x=data['pc1'], y=data['medv'], kind='reg')

data['pc1'].corr(data['medv'])

data['b'].corr(data['medv'])

The correlation between the new variable pc1 and the house price is higher than the correlation of either of the two original variables.

This post used PCA to combine two variables that are only weakly correlated with house prices into a new variable with a stronger correlation. The next post uses the PCA learned here to reduce dimensions and then practices clustering and visualization.


 

 

 

< Previous post >

https://silvercoding.tistory.com/60?category=965020

[Visualization analysis project] 3-1 Applying for & using an open API (Seoul Open Data Plaza)

This post uses the file created in the previous post and the folium library to build a map of Seoul's Ttareungi public bicycles.

<Post about the folium library>

https://silvercoding.tistory.com/53?category=965020

[python visualization] 2. Building a map of Seoul's shelters, map visualization (folium library)

Loading the dataset

import pandas as pd
data = pd.read_excel('./data/bicycle.xlsx')
data.head()

 

 

 

 

 

 


Visualizing Seoul's Ttareungi on a map

import folium

- Creating the map

m = folium.Map(location = [37.5536067, 126.9674308], zoom_start = 13)   # centered on Seoul Station
m

The map is created from the latitude and longitude of Seoul Station. The zoom_start argument sets the initial zoom level.

- Checking the data to show on the map

for i in range(len(data)):
    name = data.loc[i, 'stationName']
    available = data.loc[i, 'parkingBikeTotCnt']
    total = data.loc[i, 'rackTotCnt']
    lat = data.loc[i, 'stationLatitude']
    long = data.loc[i, 'stationLongitude']
    print(name, available, total, lat, long)

Only the columns needed for the map are extracted and printed. These values are then used to add markers to the map as follows.

# create the map
m = folium.Map(location = [37.5536067, 126.9674308], zoom_start = 13)

# add the markers
for i in range(len(data)):
    lat = data.loc[i, 'stationLatitude']
    long = data.loc[i, 'stationLongitude']
    name = data.loc[i, 'stationName']
    available = int(data.loc[i, 'parkingBikeTotCnt'])
    total = int(data.loc[i, 'rackTotCnt'])
    
    # show the number of bicycles by color
    ##  more than 50% of the racks hold a bicycle --> blue
    ##  fewer than 2 bicycles currently available --> red
    ##  otherwise (2 or more bicycles, at most 50% of racks filled) --> green
    ##  (total > 0 guards against division by zero for stations without racks)
    if total > 0 and available/total > 0.5:
        color = 'blue'
    elif available < 2 :
        color = 'red'
    else:
        color = 'green'
    icon=folium.Icon(color=color, icon='info-sign')
    folium.Marker(location = [lat, long],
                 tooltip = f"{name} : {available}", 
                  icon = icon
             ).add_to(m)
m

Colors are set so that the current bicycle availability can be read more intuitively. Blue means more than 50% of the racks hold an available bicycle, red means fewer than 2 bicycles, and green is the state in between. In addition, the tooltip shows the station name and the number of available bicycles when the mouse hovers over a marker.
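The color rule is easier to check when it lives in a small function of its own. The sketch below restates the rule described above; the function name marker_color and the total > 0 guard (to avoid dividing by zero when a station reports no racks) are assumptions added here.

```python
def marker_color(available, total):
    """Pick a marker color from the number of available bicycles and racks."""
    if total > 0 and available / total > 0.5:
        return "blue"     # more than half of the racks hold a bicycle
    if available < 2:
        return "red"      # almost nothing to rent
    return "green"        # everything in between

print(marker_color(8, 10))   # blue
print(marker_color(1, 10))   # red
print(marker_color(3, 10))   # green
```

Note the order of the checks: a station with 1 bicycle on 1 rack is blue, because the ratio test comes first, just as in the loop above.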

 

 

 

 


Problem : there are so many data points that the map is hard to read.

Solution : use a cluster to group nearby markers together.

Map visualization with a cluster & minimap added

from folium.plugins import MiniMap, MarkerCluster
# create the map
m_ver2 = folium.Map(location = [37.5536067, 126.9674308], zoom_start = 13)

# add the minimap
minimap = MiniMap() 
m_ver2.add_child(minimap)

# create the marker cluster
marker_cluster_ver2 = MarkerCluster().add_to(m_ver2)  # add the cluster to the map

# add the markers
for i in range(len(data)):
    lat = data.loc[i, 'stationLatitude']
    long = data.loc[i, 'stationLongitude']
    name = data.loc[i, 'stationName']
    available = int(data.loc[i, 'parkingBikeTotCnt'])
    total = int(data.loc[i, 'rackTotCnt'])
    
    # show the number of bicycles by color
    ##  more than 50% of the racks hold a bicycle --> blue
    ##  fewer than 2 bicycles currently available --> red
    ##  otherwise (2 or more bicycles, at most 50% of racks filled) --> green
    ##  (total > 0 guards against division by zero for stations without racks)
    if total > 0 and available/total > 0.5:
        color = 'blue'
    elif available < 2 :
        color = 'red'
    else:
        color = 'green'
    icon=folium.Icon(color=color, icon='info-sign')
#     print(name, available, total, lat, long)
    folium.Marker(location = [lat, long],
                 tooltip = f"{name} : {available}", 
                  icon = icon
             ).add_to(marker_cluster_ver2)
m_ver2

This is the map with the cluster and minimap added. Clicking a number zooms into that area, which makes the map more convenient and intuitive to use.

m_ver2.save('./map/bicycle_clustermap.html')

The map is saved as html so it can be opened again at any time.

 

 

 

 

๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ 

 

 

< ์ด์ „ ๊ธ€ > 

https://silvercoding.tistory.com/59

 

[์‹œ๊ฐํ™” ๋ถ„์„ ํ”„๋กœ์ ํŠธ] 2-3 ์Šน์ฐจ์ˆ˜๊ฐ€ ๊ฐ€์žฅ ๋งŽ์€ ์ง€ํ•˜์ฒ  ์—ญ ๋ถ„์„

๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ < ์ด์ „ ๊ธ€ > https://silvercoding.tistory.com/58 https://silvercoding.tistory.com/57 https://silvercoding.tistory.com/56 https://silvercoding.tistory.com/55 https://silvercoding...

silvercoding.tistory.com

 

 


 API ์‹ ์ฒญํ•˜๊ธฐ 

1. ํšŒ์›๊ฐ€์ž…, ๋กœ๊ทธ์ธ ํ•˜๊ธฐ 

http://data.seoul.go.kr

 

์„œ์šธ ์—ด๋ฆฐ๋ฐ์ดํ„ฐ๊ด‘์žฅ

๋ชจ๋“  ์„œ์šธ์‹œ๋ฏผ์„ ์œ„ํ•œ ๊ณต๊ณต๋ฐ์ดํ„ฐ ์—ด๋ฆฐ๋ฐ์ดํ„ฐ๊ด‘์žฅ์—์„œ ์„œ์šธ์‹œ์™€ ์—ฐ๊ณ„ ๊ธฐ๊ด€์ด ๊ณต๊ฐœํ•œ ๊ณต๊ณต๋ฐ์ดํ„ฐ๋ฅผ ํ™•์ธํ•˜์‹ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์„œ์šธ์‹œ์™€ ๊ด€๋ จ๋œ ๋‹ค์–‘ํ•œ ๊ณต๊ณต๋ฐ์ดํ„ฐ๋ฅผ ํ™•์ธํ•ด ๋ณด์„ธ์š”.

data.seoul.go.kr

 

 

2. Click "Request authentication key" for the dataset you want

This post uses the Public Bicycle Real-Time Rental Information API from Seoul Open Data Plaza.

The API key can be requested at the following link.

http://data.seoul.go.kr/dataList/OA-15493/A/1/datasetView.do

Click the "Request authentication key" button.

 

 

3. Fill out the application form

Fill out the form according to how you intend to use the data. Because the key is only used here for learning data analysis, there is no service URL. When you have no URL to give, enter localhost.

 

4. Top menu [My Page - Manage Authentication Keys]

Go to My Page - Manage Authentication Keys.

 

5. Copy the authentication key

Copy the key. The API application is now complete, so let's load the data.

 

 

 

 

 

 

 


 Loading the dataset

1. Import libraries

import requests  # fetch data through the API
import pandas as pd   # store and manipulate the data

 

 

2. A look at the Seoul public bicycle real-time rental data

- Request

Source: Seoul Open Data Plaza

A single request can return at most 1,000 records, so the data has to be fetched in several requests and then combined.
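Because each request is capped at 1,000 records, the start and end positions for every request can be computed ahead of time. The helper below is only a sketch: `page_ranges` is a hypothetical name, and the total of 2,500 records is an example value rather than a figure from the API.

```python
def page_ranges(total_count: int, page_size: int = 1000) -> list:
    """Return inclusive (start, end) positions that cover total_count records."""
    ranges = []
    start = 1  # positions are 1-based, as in the request URL
    while start <= total_count:
        end = min(start + page_size - 1, total_count)
        ranges.append((start, end))
        start = end + 1
    return ranges


print(page_ranges(2500))
```

Each pair maps directly onto the request start position and request end position in the URL.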

 

 

- Example

Source: Seoul Open Data Plaza

The sample URL above can be used to call the API. The values from (authentication key) onward mean the following.

Source: Seoul Open Data Plaza

They are, in order, the authentication key, the response file type, the service name, the request start position, and the request end position. Adjust the start and end positions to select which records to fetch.
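The five path segments described above can be assembled with a small helper. This sketch reuses the base address and the `bikeList` service name that appear in this post's request code; `build_url` and the placeholder key `MY_KEY` are made up for illustration.

```python
BASE_URL = "http://openapi.seoul.go.kr:8088"


def build_url(apikey: str, start: int, end: int,
              file_type: str = "json", service: str = "bikeList") -> str:
    """Join the key, file type, service name, and start/end positions into a URL."""
    return f"{BASE_URL}/{apikey}/{file_type}/{service}/{start}/{end}/"


# MY_KEY is a placeholder, not a working authentication key.
print(build_url("MY_KEY", 1, 1000))
```

Keeping the URL in one function means only the start and end positions change between requests.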

 

 

- Output fields

Source: Seoul Open Data Plaza

These fields become the columns of the DataFrame. There are seven in total: rack count, station name, number of parked bikes, occupancy rate, latitude, longitude, and station ID.

 

 

3. Loading the data

- Enter the authentication key

apikey = ' '  # paste the key you received

Put your own API key here.

 

- Send the API request

startnum = 1
endnum = 1000
url1 = f'http://openapi.seoul.go.kr:8088/{apikey}/json/bikeList/{startnum}/{endnum}/'
requests.get(url1)
# requests.get(url1).text
# requests.get(url1).content

requests.get() returns a response object. Its text and content attributes can also be used to inspect the body of the response.

 

 

- Requesting the data

json1 = requests.get(url1).json()
json1

This returns the 1,000 requested records in JSON form.

 

 

- Selecting the needed information (keys)

json1['rentBikeStatus'].keys()

The only data needed for this analysis is under the row key inside rentBikeStatus, but let's look at every key anyway.

json1['rentBikeStatus']['list_total_count']  # number of records

json1['rentBikeStatus']['RESULT']  # whether an error occurred

json1['rentBikeStatus']['row']  # bike status for each rental station
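Before building a DataFrame it can help to confirm that the response actually contains the expected keys. The sketch below runs on a hand-made dictionary, not on real API output: the contents of `RESULT` and the station names are invented, and the assumption that a failed request comes back without a `rentBikeStatus` key should be checked against the API documentation.

```python
def extract_rows(payload: dict) -> list:
    """Return the list under rentBikeStatus -> row, or raise if it is missing."""
    status = payload.get("rentBikeStatus")
    if status is None:
        # Assumption: an error response has no rentBikeStatus key.
        raise ValueError(f"unexpected response: {payload}")
    return status.get("row", [])


# Hand-made sample shaped like the keys shown above (values are invented).
sample = {
    "rentBikeStatus": {
        "list_total_count": 2,
        "RESULT": {"CODE": "example", "MESSAGE": "example"},
        "row": [{"stationName": "A"}, {"stationName": "B"}],
    }
}
print(len(extract_rows(sample)))
```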

 

 

 

- Convert to a DataFrame

raw1 = pd.DataFrame(json1['rentBikeStatus']['row'])
raw1.head()

 

 

- Fetching records 1001-2000 and 2001-3000, then merging

startnum = 1001
endnum = 2000
url2 = f'http://openapi.seoul.go.kr:8088/{apikey}/json/bikeList/{startnum}/{endnum}/'
json2 = requests.get(url2).json()
raw2 = pd.DataFrame(json2['rentBikeStatus']['row'])
raw2.tail()

data_mid = pd.concat([raw1, raw2])  # DataFrame.append was removed in pandas 2.0
data_mid

 

There appear to be more than 2,000 records, so records 2001-3000 are fetched and appended as well.

 

 

startnum = 2001
endnum = 3000
url3 = f'http://openapi.seoul.go.kr:8088/{apikey}/json/bikeList/{startnum}/{endnum}/'
json3 = requests.get(url3).json()
raw3 = pd.DataFrame(json3['rentBikeStatus']['row'])
raw3.tail()

data = pd.concat([data_mid, raw3])  # DataFrame.append was removed in pandas 2.0
data

All of the data has now been combined, for a total of 2540 records. Because the pieces were stacked on top of each other, the index labels repeat, so the index is reset before the data is saved to an Excel file.
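The repeated index labels come from stacking frames that each start at 0. As a small illustration with made-up data (the frames `part1` and `part2` are not from the post), `pd.concat` can renumber the rows while combining when `ignore_index=True` is passed:

```python
import pandas as pd

# Two tiny stand-ins for the per-request DataFrames (values are invented).
part1 = pd.DataFrame({"stationName": ["A", "B"]})
part2 = pd.DataFrame({"stationName": ["C"]})

stacked = pd.concat([part1, part2])
print(stacked.index.tolist())     # labels from each part are kept as they were

combined = pd.concat([part1, part2], ignore_index=True)
print(combined.index.tolist())    # rows are renumbered from 0
```

Either approach works: resetting the index afterwards, as this post does, or renumbering during the concat.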

 

 

 

- Final data

data = data.reset_index(drop=True)  # assign the result; reset_index does not modify data in place

data.info()

There are 2450 rows in total, and none of the columns contain null values.

 

 

 

- Save as an Excel file

data.to_excel('./data/bicycle.xlsx', index = False)

The data is saved as an Excel file. In the next post, the bike dataset loaded this way will be used with the folium library to build a map of the current status of Seoul's Ttareungi bikes.

 

 

 

 

 

 

Learning Spoons class notes

 

 

< Previous post >

https://silvercoding.tistory.com/58

 

[Visualization Analysis Project] 2-2 On which days is subway ridership high?


 


 Loading & exploring the data
import pandas as pd
raw = pd.read_excel('./data/subway_raw.xlsx')

The subway data for January-June 2019, merged in post 2-1, is loaded.

raw.info()

There are 99342 rows in total, and there are no null values.

 

 

 


 At which station, and when, do people ride the subway the most?

1. The station with the most boarding passengers

data_station = raw.pivot_table(index = '역명', values = '승차총승객수', aggfunc='sum')
data_station = data_station.sort_values(by = '승차총승객수', ascending = False)
data_station.head(10)  # top 10 stations by boarding passengers

This pivot table sorts the total boarding passengers by station in descending order. The top 10 stations are shown, and Jamsil was the station with the most passengers in the first half of 2019.

 

 

 

 

2. Visualization: comparing passengers by station and day of the week for each line (Lines 1-9)


* Heatmap

  • sns.heatmap(data, annot = True, fmt = '.0f', cmap = "RdBu_r")
    • annot : when True, the values are written on the plot
    • fmt : display format of the values.
      • ex) 'f' : fixed-point number (six decimal places by default)
      • ex) '.0f' : fixed-point number with 0 decimal places (integer part only)
      • ex) '.1f' : fixed-point number with 1 decimal place
      • ex) '.1%' : percentage with 1 decimal place
    • cmap : color map. Names ending in _r are the reversed versions of the same map (see the list below)

* Available cmap names

Accent, Accent_r, Blues, Blues_r, BrBG, BrBG_r, BuGn, BuGn_r, BuPu, BuPu_r, CMRmap, CMRmap_r, Dark2, Dark2_r, GnBu, GnBu_r, Greens, Greens_r, Greys, Greys_r, OrRd, OrRd_r, Oranges, Oranges_r, PRGn, PRGn_r, Paired, Paired_r, Pastel1, Pastel1_r, Pastel2, Pastel2_r, PiYG, PiYG_r, PuBu, PuBuGn, PuBuGn_r, PuBu_r, PuOr, PuOr_r, PuRd, PuRd_r, Purples, Purples_r, RdBu, RdBu_r, RdGy, RdGy_r, RdPu, RdPu_r, RdYlBu, RdYlBu_r, RdYlGn, RdYlGn_r, Reds, Reds_r, Set1, Set1_r, Set2, Set2_r, Set3, Set3_r, Spectral, Spectral_r, Wistia, Wistia_r, YlGn, YlGnBu, YlGnBu_r, YlGn_r, YlOrBr, YlOrBr_r, YlOrRd, YlOrRd_r, afmhot, afmhot_r, autumn, autumn_r, binary, binary_r, bone, bone_r, brg, brg_r, bwr, bwr_r, cividis, cividis_r, cool, cool_r, coolwarm, coolwarm_r, copper, copper_r, cubehelix, cubehelix_r, flag, flag_r, gist_earth, gist_earth_r, gist_gray, gist_gray_r, gist_heat, gist_heat_r, gist_ncar, gist_ncar_r, gist_rainbow, gist_rainbow_r, gist_stern, gist_stern_r, gist_yarg, gist_yarg_r, gnuplot, gnuplot2, gnuplot2_r, gnuplot_r, gray, gray_r, hot, hot_r, hsv, hsv_r, icefire, icefire_r, inferno, inferno_r, jet, jet_r, magma, magma_r, mako, mako_r, nipy_spectral, nipy_spectral_r, ocean, ocean_r, pink, pink_r, plasma, plasma_r, prism, prism_r, rainbow, rainbow_r, rocket, rocket_r, seismic, seismic_r, spring, spring_r, summer, summer_r, tab10, tab10_r, tab20, tab20_r, tab20b, tab20b_r, tab20c, tab20c_r, terrain, terrain_r, twilight, twilight_r, twilight_shifted, twilight_shifted_r, viridis, viridis_r, vlag, vlag_r, winter, winter_r


- 1ํ˜ธ์„ ๋งŒ ์‹œ๊ฐํ™” ํ•ด๋ณด๊ธฐ 

line = '1ํ˜ธ์„ '
data_line = raw[raw['๋…ธ์„ ๋ช…'] == line]

# ํ”ผ๋ฒ—ํ…Œ์ด๋ธ”: ๋…ธ์„ ์˜ ์—ญ ์ˆœ์„œ์— ๋งž์ถฐ ์ •๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด ์—ญID๋„ ์ธ๋ฑ์Šค์— ํฌํ•จ
df_pivot = data_line.pivot_table(index = ['์—ญID', '์—ญ๋ช…'], columns = '์š”์ผ', values = '์Šน์ฐจ์ด์Šน๊ฐ์ˆ˜',aggfunc = 'sum') 
df_pivot = df_pivot[['์›”','ํ™”','์ˆ˜','๋ชฉ','๊ธˆ','ํ† ','์ผ']]   # ์ปฌ๋Ÿผ ์ˆœ์„œ๋ฅผ ์š”์ผ์— ๋งž๊ฒŒ ์ •๋ฆฌ
df_pivot = df_pivot / 10000  # ๋งŒ๋ช…๋‹จ์œ„๋กœ ํ‘œํ˜„ํ•˜๊ธฐ ์œ„ํ•ด ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„ ์ „์ฒด๋ฅผ 1๋งŒ์œผ๋กœ ๋‚˜๋ˆ„๊ธฐ
df_pivot

์—ญ๋ณ„ ์š”์ผ๋ณ„ ์Šน์ฐจ์ด๊ฐ์ˆ˜๋ฅผ ์ง‘๊ณ„ํ•œ ํ”ผ๋ฒ—ํ…Œ์ด๋ธ”์ด๋‹ค. ์š”์ผ์ด ๋’ค์ฃฝ๋ฐ•์ฃฝ ๋‚˜์˜ค๊ธฐ ๋•Œ๋ฌธ์— ๋‹ค์‹œ ์„ ํƒํ•˜์—ฌ ์ˆœ์„œ๋Œ€๋กœ ์ •๋ฆฌํ•ด ์ค€๋‹ค. 
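Reordering the weekday columns can also be done with `reindex`, which behaves differently from list selection when a weekday is missing from the data. The example below uses a made-up three-column table (the station names and counts are invented) and is meant only as a sketch of that alternative.

```python
import pandas as pd

WEEKDAYS = ['월', '화', '수', '목', '금', '토', '일']  # Mon..Sun

# Toy data with the same column names as the subway dataset (values are invented).
toy = pd.DataFrame({
    '역명': ['A', 'A', 'A', 'B', 'B'],
    '요일': ['화', '월', '일', '월', '금'],
    '승차총승객수': [10, 20, 5, 7, 9],
})

pivot = toy.pivot_table(index='역명', columns='요일',
                        values='승차총승객수', aggfunc='sum')
# reindex puts the columns in calendar order and fills absent weekdays with NaN,
# whereas selecting with a list of labels raises KeyError if one is absent.
pivot = pivot.reindex(columns=WEEKDAYS)
print(pivot.columns.tolist())
```

With the full six months of data every weekday is present, so both approaches give the same table.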

import matplotlib.pyplot as plt
import seaborn as sns 
from matplotlib import font_manager, rc
import platform 

# Use a Korean font
if platform.system() == 'Windows': 
    path = 'c:/Windows/Fonts/malgun.ttf'
    font_name = font_manager.FontProperties(fname=path).get_name()
    rc('font', family=font_name)
elif platform.system() == 'Darwin':
    rc('font', family='AppleGothic')
fig, ax = plt.subplots( figsize=(6,5) )   # set the figure size
plt.title(f"{line} 역별/요일별 승객수", fontsize = 20) # title: passengers by station and weekday
sns.heatmap(df_pivot, cmap = "Reds", 
           annot = True, fmt = '.0f')

This is the heatmap of passengers by station and day of the week for Line 1. Seoul Station and Jonggak appear to have the most passengers, and Seoul Station on Fridays has the highest count of all.

 

Let's visualize every line in the same way and compare them.

 

- Visualizing and comparing Lines 1-9

raw['노선명'].unique()

There are this many lines, but this post visualizes only Lines 1 through 9.

line_seoul_list = [ ]
for line in raw['노선명'].unique():
    if line[1:] == '호선':    # keep names of the form x호선 (Line x)
        line_seoul_list.append(line)
line_seoul_list
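The slice test above accepts any single character in front of '호선'. If only digits should match, a regular expression states that more explicitly. This is a sketch of an alternative, not the method used in the post; `is_numbered_line` and the sample names are made up for illustration.

```python
import re


def is_numbered_line(name: str) -> bool:
    """True for names made of exactly one digit followed by '호선' (Line x)."""
    return re.fullmatch(r"\d호선", name) is not None


# Sample names invented for the example; the real list comes from raw['노선명'].unique().
names = ['1호선', '2호선', '경의선', '9호선2~3단계', '분당선']
print([n for n in names if is_numbered_line(n)])
```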

for line in sorted(line_seoul_list):
    
    # ๋ฐ์ดํ„ฐ ์ •๋ฆฌํ•˜๊ธฐ
    data_line = raw[raw['๋…ธ์„ ๋ช…'] == line]
    df_pivot = data_line.pivot_table(index = ['์—ญID', '์—ญ๋ช…'], columns = '์š”์ผ', values = '์Šน์ฐจ์ด์Šน๊ฐ์ˆ˜',aggfunc = 'sum')
    df_pivot = df_pivot[['์›”','ํ™”','์ˆ˜','๋ชฉ','๊ธˆ','ํ† ','์ผ']]
    df_pivot = df_pivot / 10000  # ๋งŒ๋ช…๋‹จ์œ„๋กœ ์ˆ˜์ •
    
    
    # ๊ทธ๋ž˜ํ”„ ๊ทธ๋ฆฌ๊ธฐ
    fig, ax = plt.subplots( figsize=(6,len(df_pivot)/3 ) )   # ๊ทธ๋ž˜ํ”„ ์‚ฌ์ด์ฆˆ๋ฅผ ์กฐ์ •ํ•˜์—ฌ, ์—ญ ์ˆ˜๊ฐ€ ๋งŽ์€ ๊ฒฝ์šฐ๋Š” ์„ธ๋กœ๋ฅผ ๊ธธ๊ฒŒ ํ‘œํ˜„
    plt.title(f"{line} ์—ญ๋ณ„/์š”์ผ๋ณ„ ์Šน๊ฐ์ˆ˜", fontsize = 20) # for title
    sns.heatmap(df_pivot, cmap = "Reds", 
               annot = True, fmt = '.0f')

 

The darker the red, the more passengers. Stations with many passengers are generally dark on every day of the week. The previous post showed that, overall, ridership peaks on Friday and tends to fall on the weekend, but looking at individual stations reveals some where the weekend is busier (e.g., Itaewon, Express Bus Terminal, Hongik Univ.).

Learning Spoons class notes

 

 

< Previous post >

https://silvercoding.tistory.com/57

 

[Visualization Analysis Project] 2-1 Merging multiple csv files with pandas


 

 


 Loading & exploring the data
import pandas as pd
raw = pd.read_excel('./data/subway_raw.xlsx')
raw.head()

The file merged in the previous post is loaded.

raw.info()

There are 99342 rows in total, and there are no null values.

 

 

 

 

 

 


 When do passenger numbers rise?

- Total boarding passengers by date and day of the week

data_date = pd.pivot_table(raw, index = ['사용일자', '요일'], values= '승차총승객수', aggfunc= 'sum')
data_date.head()

 

- Sorting total boarding passengers by date and weekday in descending order

data_date_sort = data_date.sort_values(by = '승차총승객수', ascending= False)
data_date_sort

 

The boarding totals by date and weekday were sorted in descending order. Looking at the resulting DataFrame suggests two hypotheses.

 

1. Subway ridership is highest in May.

2. Subway ridership is highest on Fridays.

 

 

(1) Subway ridership is highest in May

- Adding the '연월' (year-month) and '월일' (month-day) columns

yearmonth_list = []
monthday_list = []
for date in raw['사용일자']:
    yearmonth = str(date)[:6]   # first 6 characters (YYYYMM)
    yearmonth_list.append(yearmonth)
    monthday = str(date)[4:]    # from the 5th character to the end (MMDD)
    monthday_list.append(monthday)
    
# add the 연월 / 월일 columns
raw['연월'] = yearmonth_list
raw['월일'] = monthday_list
raw.head()

To look at ridership by month, the usage date was split into year-month and month-day, and each was added as a new column.
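The two slices can be checked on a single value before running them over the whole column. The sketch assumes the usage date is stored as an 8-digit YYYYMMDD value, which is what the slice positions imply; `split_date` is a made-up helper name.

```python
def split_date(date) -> tuple:
    """Split a YYYYMMDD value into (YYYYMM, MMDD) strings."""
    text = str(date)
    return text[:6], text[4:]


print(split_date(20190105))   # ('201901', '0105')
```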

data_month = pd.pivot_table(raw, index = '연월', values = '승차총승객수', aggfunc='sum')
data_month = data_month.sort_values(by = '승차총승객수', ascending= False)
data_month

A pivot table of total boarding passengers by month was created and sorted in descending order, which confirms that May had the most passengers.

Conclusion -> Subway ridership is highest in May : True

 

 

 

(2) Subway ridership is highest on Fridays

data_week = pd.pivot_table(raw, index = '요일', values = '승차총승객수', aggfunc='sum')
data_week = data_week.sort_values(by = '승차총승객수', ascending= False)
data_week

A pivot table of total boarding passengers by day of the week was created and sorted in descending order, which confirms that Friday had the most passengers.

 

Conclusion -> Subway ridership is highest on Fridays : True

 

 

 

 

 


* Plotting ridership by month and by date

- Testing with the January 2019 data

df_selected = raw[ raw['연월'] == '201901']
df_selected.head()

df_pivot = pd.pivot_table(df_selected, index = ['월일','요일'], values = '승차총승객수', aggfunc= 'sum')
df_pivot = df_pivot.reset_index()
df_pivot

A pivot table aggregating total boarding passengers for each date in January 2019 is created. Let's visualize it with pointplot.

import matplotlib.pyplot as plt
import seaborn as sns 
from matplotlib import font_manager, rc
import platform 

# Use a Korean font
if platform.system() == 'Windows': 
    path = 'c:/Windows/Fonts/malgun.ttf'
    font_name = font_manager.FontProperties(fname=path).get_name()
    rc('font', family=font_name)
elif platform.system() == 'Darwin':
    rc('font', family='AppleGothic')
fig, ax = plt.subplots( figsize=(20,6) )
# draw the plot by date
sns.pointplot(data = df_pivot, x = '월일', y = '승차총승객수')

fig, ax = plt.subplots( figsize=(20,6) )
# draw the plot by day of the week
sns.pointplot(data = df_pivot, x = '요일', y = '승차총승객수')

Plotting January alone shows ridership peaking on Friday, falling afterwards, and dropping sharply on Sunday. To check whether this varies from month to month, ridership by date is plotted for each of January through June.

 

 

 

 

- Plotting ridership by date for the first half of the year

raw['연월'].unique()

for yearmonth in raw['연월'].unique():
    df_selected = raw[ raw['연월'] == yearmonth]  # select the rows for this year-month
    df_pivot = pd.pivot_table(df_selected, index = ['월일','요일'], values = '승차총승객수', aggfunc= 'sum')# total passengers per day
    df_pivot = df_pivot.reset_index()
    
    fig, ax = plt.subplots( figsize=(20,6) )
    
    ax.set_title(f'일자별 지하철승객수({yearmonth})')  # add a title (subway passengers by date)
    sns.pointplot(data = df_pivot, x = '월일', y = '승차총승객수')

 

Interpretation: the overall shape is similar in every month. Ridership falls on the weekend after Friday and drops sharply on Sunday. In February, ridership is low even on weekdays early in the month because of the Lunar New Year holiday, and it also dips on June 6, Memorial Day. This suggests that subway ridership is affected by public holidays.

 

 

 

 
