Learning Spoons class notes

 

 

< Previous post >

https://silvercoding.tistory.com/62

 

[boston data analysis] 1. Dimensionality reduction (PCA) Python example


 

 

 


Loading libraries & data
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv('./data/boston.csv')
data.head()

 

 

 


Clustering
del data['chas']

The categorical column is removed so that only continuous data is used.

medv = data['medv']
del data['medv']

The target variable is copied aside, and its column is then dropped (in preparation for PCA).

 

 

Dimensionality reduction (PCA): 12 dimensions -> 2 dimensions

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

Import the libraries needed for dimensionality reduction.

 

 

- Standardization

scaler = StandardScaler()

Create the scaler object.

# Fit the scaler on the data
scaler.fit(data)
# Transform the data
scaler_data = scaler.transform(data)

All of data is standardized (each column rescaled to zero mean and unit variance) and stored in scaler_data.
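As a quick check of what the scaler does, here is a minimal sketch on a made-up array (the `toy` values are an assumption for illustration, not the boston data). It shows that `fit_transform` combines the two calls above and that every column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two columns on very different scales
toy = np.array([[1.0, 1000.0],
                [2.0, 2000.0],
                [3.0, 3000.0],
                [4.0, 4000.0]])

scaler = StandardScaler()
# fit_transform() is equivalent to calling fit() and then transform()
scaled = scaler.fit_transform(toy)

col_means = scaled.mean(axis=0)  # per-column mean after scaling
col_stds = scaled.std(axis=0)    # per-column standard deviation after scaling
print(col_means, col_stds)
```

Scaling matters here because PCA is driven by variance, so a column measured in large units would otherwise dominate the components.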

 

 

- PCA

pca = PCA(n_components = 2)

Create the PCA object. The number of components is set to 2 so that the result can be plotted in 2D.

pca.fit(scaler_data)

Fit pca on scaler_data.

data2 = pd.DataFrame(data = pca.transform(scaler_data), columns=['pc1', 'pc2'])

The data transformed by pca is stored in data2 as a DataFrame.

data2.head()
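Going from 12 columns to 2 discards some information, and `explained_variance_ratio_` is one way to see how much is kept. The sketch below uses randomly generated features as a stand-in for the 12 boston columns (an assumption, since the CSV is not available here), so the printed numbers say nothing about the real data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical stand-in for 12 correlated feature columns
base = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 12))
features = base @ mixing + 0.1 * rng.normal(size=(200, 12))

scaled = StandardScaler().fit_transform(features)
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)  # fit() then transform() in one call

ratio = pca.explained_variance_ratio_   # share of total variance per component
retained = ratio.sum()                  # share kept by the 2 components together
print(components.shape, ratio, retained)
```

Running the same two lines on the fitted `pca` from this post would show how faithful the 2D picture of the boston data is.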

 

 

Choosing the number of clusters - finding the Elbow Point

from sklearn.cluster import KMeans

KMeans(n_clusters = k)

  • Creates an object that will group the data into k clusters

KMeans.fit()

  • Fits the model

KMeans.inertia_

  • Returns the inertia of the fitted KMeans
  • Inertia is the sum of squared distances from each data point to the center of the cluster it belongs to
  • For a given k, a lower value means tighter clusters. Inertia tends to keep shrinking as k grows, which is why the elbow is used rather than the minimum.

KMeans.predict(data)

  • Uses the fitted model to assign each data point to its nearest cluster and returns the cluster labels
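To make the definition of inertia concrete, the following sketch recomputes it by hand and compares it with `inertia_`. The two synthetic blobs and the `n_init` and `random_state` settings are assumptions chosen only to keep the example reproducible.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical data: two well separated 2D blobs of 50 points each
points = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(points)
labels = kmeans.predict(points)  # index of the nearest center for each point

# Center assigned to each point, then the sum of squared distances to it
assigned_centers = kmeans.cluster_centers_[labels]
manual_inertia = ((points - assigned_centers) ** 2).sum()
print(manual_inertia, kmeans.inertia_)
```

The two printed values should agree up to floating point error, which is all that `inertia_` measures.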

x = []   # number of clusters k
y = []   # inertia for that k

for k in range(1, 30):
    kmeans = KMeans(n_clusters = k)
    kmeans.fit(data2)
    
    x.append(k)
    y.append(kmeans.inertia_)

KMeans is run for k from 1 to 29 (range(1, 30) stops before 30), and the inertia is plotted against k in order to pick a suitable number of clusters.

plt.plot(x, y)

Judging from the plot of inertia by number of clusters, about 3 to 5 looks right. The Elbow Point is set to 4, and clustering proceeds with that value.
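The elbow can also be read from the numbers instead of the plot by looking at how much inertia drops at each step. This is a rough sketch on synthetic blobs from `make_blobs` (the centers, `n_init`, and `random_state` are assumptions, not values from the class), where the true number of clusters is known to be 4.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical data: four well separated blobs
points, _ = make_blobs(
    n_samples=300,
    centers=[(-6, -6), (-6, 6), (6, -6), (6, 6)],
    cluster_std=0.6,
    random_state=0,
)

ks = list(range(1, 9))
inertias = []
for k in ks:
    # Fixing random_state makes the curve reproducible between runs
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias.append(model.inertia_)

# drops[i] is the inertia saved by going from ks[i] to ks[i + 1] clusters
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
for k, drop in zip(ks[1:], drops):
    print(k, round(drop, 1))
```

On this toy data the drop is large up to k = 4 and small afterwards, which is the elbow. On real data the bend is usually less sharp, so a range such as 3 to 5 is a reasonable reading.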

 

 

 

Clustering

kmeans = KMeans(n_clusters=4)

Create the object with the number of clusters set to 4.

kmeans.fit(data2)

Fit it on the data2 created above.

data2['labels'] = kmeans.predict(data2)

The predicted cluster of each row is stored in the labels column.

data2.head()

A labels value of 1 means that the row was assigned to cluster 1.

sns.scatterplot(x='pc1', y='pc2', hue='labels', data=data2)

The scatter plot confirms that the clusters were formed.

 

 

 

 

Interpreting the results

- Which group has the highest house prices? : compare by mean

data2['medv'] = medv

To compute the mean house price of each group, the medv values saved at the start are added to data2 as a medv column.

data2.head()

data2[data2['labels']==0]['medv'].mean()

The mean house price of cluster 0 is computed with the expression above. The same expression is then used to plot the mean house price of every cluster.

medv_list = []

for i in range(4):
    medv_avg = data2[data2['labels']==i]['medv'].mean()
    medv_list.append(medv_avg)
sns.barplot(x=['group_0', 'group_1', 'group_2', 'group_3'], y=medv_list)
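The loop above can also be written as a single `groupby`. The sketch below uses a made-up miniature of data2 (the `toy` values are assumptions for illustration only) to show the shape of the result.

```python
import pandas as pd

# Hypothetical miniature of data2 with only the columns needed here
toy = pd.DataFrame({
    'labels': [0, 0, 1, 1, 2, 2],
    'medv':   [10.0, 14.0, 20.0, 22.0, 30.0, 34.0],
})

# One mean per cluster label, indexed by the label
medv_by_group = toy.groupby('labels')['medv'].mean()
print(medv_by_group)
```

The resulting Series can be passed to a bar plot directly, and it does not need the number of clusters to be hard coded.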

 


Group with the highest mean house price : group_2

Group with the lowest mean house price : group_0

 

---> The highest and lowest groups are compared to see why their mean house prices are high or low.


* Analyzing the causes using the original data

data['labels'] = data2['labels']

Add the group labels to the original data.

group = data[(data['labels']==0) | (data['labels']==2)]

Only group 0 and group 2 are selected and stored in the group variable.
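KMeans numbers its clusters arbitrarily, so the labels of the cheapest and most expensive groups can change between runs. As a hedged alternative to hard coding 0 and 2, the sketch below looks them up with `idxmin` and `idxmax` and selects the rows with `isin`. The `toy` frame and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical miniature of the labeled data
toy = pd.DataFrame({
    'crim':   [5.0, 7.0, 0.5, 0.7, 0.1, 0.3],
    'labels': [0, 0, 1, 1, 2, 2],
    'medv':   [10.0, 14.0, 20.0, 22.0, 30.0, 34.0],
})

medv_by_group = toy.groupby('labels')['medv'].mean()
# Labels of the groups with the lowest and highest mean medv
extremes = [medv_by_group.idxmin(), medv_by_group.idxmax()]

# isin() keeps rows whose label is in the list, like chaining == with |
selected = toy[toy['labels'].isin(extremes)]
print(extremes)
print(selected)
```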

group = group.groupby('labels').mean().reset_index()

groupby computes the mean of every column for each labels value, and reset_index() turns labels, which groupby had made the index, back into a column.

group

The mean of each group has now been computed. These means are visualized for comparison.

 

 

Interpreting the results - visualization

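column = group.columns
fig, ax = plt.subplots(2, 6, figsize=(30, 13))

# column[0] is 'labels', so the 12 feature columns start at index 1
for i in range(12):
    # x and y are passed by keyword; recent seaborn versions reject positional x, y
    sns.barplot(x='labels', y=column[i+1], data=group, ax=ax[i//6, i%6])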

Each chart has two bars. A taller left bar (group 0) is read as evidence for why house prices are low, and a taller right bar (group 2) as evidence for why they are high.
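The grid of bar charts can also be built without index arithmetic by pairing each feature name with an axis. This is a small sketch with invented group means (the `group` values here are placeholders, not results from the boston data) and a 2 x 2 grid instead of 2 x 6.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical per-group means: one row per cluster label (placeholder values)
group = pd.DataFrame({
    'labels': [0, 2],
    'crim':   [10.0, 1.0],
    'zn':     [1.0, 10.0],
    'rad':    [8.0, 4.0],
    'tax':    [6.0, 3.0],
})

features = [c for c in group.columns if c != 'labels']
fig, axes = plt.subplots(2, 2, figsize=(8, 6))

# axes.flat walks the grid row by row, so no i//cols or i%cols is needed
for ax, name in zip(axes.flat, features):
    sns.barplot(x='labels', y=name, data=group, ax=ax)
    ax.set_title(name)

fig.tight_layout()
```

Adding a title to each panel makes it easier to refer to a chart by feature name rather than by its (row, column) position.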

 

 

Conclusion

- In the graph at position (0, 0), crim (crime rate) is far higher in group 0. This can be read as: the higher the crime rate, the lower the house price.

- In the graph at position (0, 1), a higher zn (proportion of residential land zoned for lots over 25,000 square feet) goes with higher house prices.

- Next is the graph at position (1, 1), where the difference looks smaller. It suggests that the higher rad (index of accessibility to radial highways) is, the lower the house price.

 

 

Various interpretations like these are possible.


 

 

 

 
