๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ 

 

<Previous posts>

https://silvercoding.tistory.com/70

 

[FIFA DATA] Which players should Manchester United sign for the 2019/2020 season?, the EDA process

๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ < ์ด์ „ ๊ธ€ > https://silvercoding.tistory.com/69 https://silvercoding.tistory.com/67 https://silvercoding.tistory.com/66 https://silvercoding.tistory.com/65 https://silvercoding...

silvercoding.tistory.com

 

 


1. ๋ฐ์ดํ„ฐ ์†Œ๊ฐœ & ๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ 

<Rossmann Store Sales> 

https://www.kaggle.com/c/rossmann-store-sales/data?select=test.csv 

 


ํ•ด๋‹น ๋งํฌ์˜ ์บ๊ธ€ ๋Œ€ํšŒ์—์„œ ์‚ฌ์šฉ๋˜์—ˆ๋˜ ๋กœ์Šค๋งŒ ๋ฐ์ดํ„ฐ์ด๋‹ค. 

  • train.csv - historical data including Sales
  • test.csv - historical data excluding Sales
  • sample_submission.csv - a sample submission file in the correct format
  • store.csv - supplemental information about the stores

 

๋ณธ ํฌ์ŠคํŒ…์—์„œ๋Š” ์ถ•์†Œ๋œ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ƒ์ ์˜ ๋งค์ถœ ์˜ˆ์ธก์„ ์ง„ํ–‰ํ•œ๋‹ค.  

(๋ฐ์ดํ„ฐ: ๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ œ๊ณต)

 

import os
import pandas as pd
os.chdir('../data')
train = pd.read_csv("lspoons_train.csv")
test = pd.read_csv("lspoons_test.csv")
store = pd.read_csv("store.csv")

lspoons_train.csv - training data
lspoons_test.csv - test data to make predictions on

store.csv - supplemental data containing information about each store

 

 

train.head()


Column descriptions

  • id
  • Store: id of each store
  • Date: date of the record
  • Sales: sales on that date
  • Promo: whether a promotion was running that day
  • StateHoliday: state holiday indicator / 0 if not a holiday, otherwise the type of holiday (a, b, c)
  • SchoolHoliday: whether it was a school holiday

์œ„์˜ ์ปฌ๋Ÿผ๋“ค์„ ์‚ฌ์šฉํ•˜์—ฌ Sales(๋งค์ถœ) ์„ ์˜ˆ์ธกํ•˜๋Š” ๋ชจ๋ธ์„ ์ƒ์„ฑํ•œ๋‹ค. 

 

 

 

 

 


- Establishing the analysis procedure

1. ๋ฒ ์ด์Šค ๋ชจ๋ธ๋ง ( feature engineering - ๋ณ€์ˆ˜์„ ํƒ - ๋ชจ๋ธ๋ง ) 

2. Second round of modeling ( merge the store data - feature engineering - variable selection - modeling )

3. ํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹ 

... ๋ชจ๋ธ๋ง ๋ฐ˜๋ณต ( ์ด ํ›„ ๋ชจ๋ธ๋ง์€ ์ž์œจ, ๊นƒํ—™ ์ •๋ฆฌ ) 

 


1. ๋ฒ ์ด์Šค ๋ชจ๋ธ๋ง 

: Build the most basic model. (missing-value handling, one-hot encoding)


ํ”ผ์ณ ์—”์ง€๋‹ˆ์–ด๋ง์ด๋ž€? 

  • ์˜ˆ์ธก์„ ์œ„ํ•ด ๊ธฐ์กด์˜ input ๋ณ€์ˆ˜๋ฅผ ์ด์šฉํ•˜์—ฌ ์ƒˆ๋กœ์šด input ๋ณ€์ˆ˜ ์ƒ์„ฑ
  • ๋จธ์‹ ๋Ÿฌ๋‹ ์˜ˆ์ธก ์„ฑ๋Šฅ ์˜ฌ๋ฆด ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ•

train.info()

๊ฒฐ์ธก๊ฐ’์€ ์—†๋Š” ๊ฒƒ์„ ์•Œ ์ˆ˜ ์žˆ๊ณ , object ํƒ€์ž…์ธ Date, StateHoliday ์ปฌ๋Ÿผ์„ ์ „์ฒ˜๋ฆฌ ํ•ด์ค€๋‹ค. 

 

- StateHoliday column one-hot encoding 

train = pd.get_dummies(columns=['StateHoliday'],data=train)
test = pd.get_dummies(columns=['StateHoliday'],data=test)

The get_dummies function is used to one-hot encode the StateHoliday column.

print("train_columns: ", train.columns, end="\n\n\n")
print("test_columns: ", test.columns)

์ƒˆ๋กœ ์ƒ์„ฑ๋œ ์นผ๋Ÿผ์„ ๋ณด๋ฉด train์—๋Š” b, c ๊ฐ€ ์žˆ์ง€๋งŒ test์—๋Š” b, c ๊ฐ€ ์กด์žฌํ•˜์ง€ ์•Š๋Š”๋‹ค. ์ด ๊ฒฝ์šฐ ํ•™์Šต ๊ณผ์ •์—์„œ ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•  ์ˆ˜ ์žˆ๋‹ค. 

test['StateHoliday_b'] = 0
test['StateHoliday_c'] = 0

๋”ฐ๋ผ์„œ ๊ฐ™์€ ์นผ๋Ÿผ์„ test ๋ฐ์ดํ„ฐ์…‹์— ์ƒ์„ฑํ•ด ์ค€๋‹ค.

 

- feature engineering using Date column

train['Date']

Date ์นผ๋Ÿผ์€ ๋‚ ์งœํ˜• ํ˜•ํƒœ๋กœ ๋˜์–ด ์žˆ์ง€๋งŒ dtype์ด object์ด๋ฏ€๋กœ ๋‚ ์งœ๋กœ์„œ์˜ ์˜๋ฏธ๊ฐ€ ์—†๋‹ค. 

train['Date'] = pd.to_datetime( train['Date'] )
test['Date'] = pd.to_datetime( test['Date'] )

๋”ฐ๋ผ์„œ pandas์—์„œ ๋‚ ์งœ ๊ณ„์‚ฐ์„ ํŽธ๋ฆฌํ•˜๊ฒŒ ํ•ด์ฃผ๋Š” to_datetime ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋‚ ์งœํ˜• ๋ณ€์ˆ˜๋กœ ๋ณ€ํ™˜ํ•ด ์ค€๋‹ค. 

 

 

# ์š”์ผ ์ปฌ๋Ÿผ weekday ์ƒ์„ฑ 

train['weekday'] = train['Date'].dt.weekday
test['weekday'] = test['Date'].dt.weekday

# ๋…„๋„ ์ปฌ๋Ÿผ year ์ƒ์„ฑ 

train['year'] = train['Date'].dt.year
test['year'] = test['Date'].dt.year

# ์›” ์ปฌ๋Ÿผ month ์ƒ์„ฑ 

train['year'] = train['Date'].dt.year
test['year'] = test['Date'].dt.year

 

 

- ๋ฒ ์ด์Šค๋ผ์ธ ๋ชจ๋ธ๋ง 

from xgboost import XGBRegressor
train.columns

xgb = XGBRegressor( n_estimators= 300 , learning_rate=0.1 , random_state=2020 )
xgb.fit(train[['Promo','SchoolHoliday','StateHoliday_0','StateHoliday_a','StateHoliday_b','StateHoliday_c','weekday','year','month']],
        train['Sales'])

 

XGB ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•˜์—ฌ ํ•™์Šต์„ ์‹œ์ผœ ์ค€๋‹ค. 

 

from sklearn.model_selection import cross_val_score
cross_val_score(xgb, train[['Promo', 'weekday', 'month','year', 'SchoolHoliday']], train['Sales'], scoring="neg_mean_squared_error", cv=3)

Computing the error with cross-validation gives the scores shown above. Let's reduce the error with some additional work!
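Since scoring="neg_mean_squared_error" returns negative MSE values per fold, they can be converted to RMSE to make the numbers easier to read (a minimal sketch):

import numpy as np

# average RMSE over the 3 folds
scores = cross_val_score(xgb, train[['Promo', 'weekday', 'month', 'year', 'SchoolHoliday']],
                         train['Sales'], scoring="neg_mean_squared_error", cv=3)
print(np.sqrt(-scores).mean())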

 

 

cf. Creating a Kaggle submission file

test['Sales'] = xgb.predict(test[['Promo','SchoolHoliday','StateHoliday_0','StateHoliday_a','StateHoliday_b','StateHoliday_c','weekday','year','month']])

test ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ํ•™์Šต๋œ ๋ชจ๋ธ์— ๋„ฃ์–ด ์˜ˆ์ธก์„ ์ง„ํ–‰ํ•œ๋‹ค. 

test[['id','Sales']].to_csv("submission.csv",index=False)

 

- Variable selection

xgb.feature_importances_

feature_importances_ shows the importance of each variable.

input_var = ['Promo','SchoolHoliday','StateHoliday_0','StateHoliday_a','StateHoliday_b','StateHoliday_c','weekday','year','month']

The input variables, i.e. everything except Sales, are stored in input_var.

imp_df = pd.DataFrame({"var": input_var,
                       "imp": xgb.feature_importances_})
imp_df = imp_df.sort_values(['imp'],ascending=False)
imp_df

๋ณ€์ˆ˜ ์ค‘์š”๋„ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์„ ์ƒ์„ฑํ•œ ํ›„ ๋†’์€ ์ˆœ์„œ๋Œ€๋กœ ์ •๋ ฌ์„ ํ•ด ์ค€๋‹ค. Promo๊ฐ€ ์••๋„์ ์œผ๋กœ ๋ณ€์ˆ˜์ค‘์š”๋„๊ฐ€ ๋†’์€ ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. State_Holiday๋Š” ๋Œ€์ฒด์ ์œผ๋กœ ๋‚ฎ์€ ๊ฒƒ์œผ๋กœ ๋ณด์ธ๋‹ค. 

import matplotlib.pyplot as plt
plt.bar(imp_df['var'],imp_df['imp'])
plt.xticks(rotation=90)
plt.show()

Plotting the importances to see them at a glance, the columns after SchoolHoliday appear to add little.

cross_val_score(xgb, train[['Promo', 'weekday', 'month','year', 'SchoolHoliday']], train['Sales'], scoring="neg_mean_squared_error", cv=3)

๋ชจ๋“  ์ปฌ๋Ÿผ์„ ์‚ฌ์šฉํ–ˆ์„ ๋•Œ ๋ณด๋‹ค ์˜ค๋ฅ˜์œจ์ด ์ค„์–ด๋“ค์—ˆ๋‹ค. ๊ทธ๋ ‡๋‹ค๋ฉด ์ปฌ๋Ÿผ์„ ๋ช‡๊ฐœ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด ๊ฐ€์žฅ ์˜ค๋ฅ˜์œจ์„ ์ค„๊ฒŒ ํ•˜๋Š”์ง€ ์‹คํ—˜ํ•ด ๋ณธ๋‹ค. 

import numpy as np
score_list=[]
selected_varnum=[]
for i in range(1,10):
    selected_var = imp_df['var'].iloc[:i].to_list()
    scores = cross_val_score(xgb, 
                             train[selected_var], 
                             train['Sales'], 
                             scoring="neg_mean_squared_error", cv=3)
    score_list.append(-np.mean(scores))
    selected_varnum.append(i)
    print(i)
plt.plot(selected_varnum, score_list)

 

๋ณ€์ˆ˜์˜ ๊ฐœ์ˆ˜ ๋ณ„๋กœ cross validation์„ ์ˆ˜ํ–‰ํ•œ ๊ฒฐ๊ณผ 2๊ฐœ์ผ ๋•Œ ๊ฐ€์žฅ ๋‚ฎ์€ ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. 

 

์˜ˆ์ธก๋ณ€์ˆ˜๊ฐ€ 2๊ฐœ์ผ ๋•Œ cross validation์„ ์ˆ˜ํ–‰ํ•œ๋‹ค. 

cross_val_score(xgb, train[['Promo', 'weekday']], train['Sales'], scoring="neg_mean_squared_error", cv=3)

Every fold except the second decreased. After training the model on the 2 predictor variables and submitting predictions on the test data, the Kaggle score also improved. (This is a repetition of the steps above, so it is omitted from the post; a sketch follows below.)
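For reference, a sketch of the omitted repetition (refitting on the two selected variables and regenerating the submission file, following the same steps as above):

# refit on the two selected variables and rewrite the submission file
xgb.fit(train[['Promo', 'weekday']], train['Sales'])
test['Sales'] = xgb.predict(test[['Promo', 'weekday']])
test[['id', 'Sales']].to_csv("submission.csv", index=False)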

 

 

 

 

 


2. 2์ฐจ ๋ชจ๋ธ๋ง 

- store ๋ฐ์ดํ„ฐ ํ•ฉ๋ณ‘ 

store


store ๋ฐ์ดํ„ฐ์…‹: ๊ฐ ์ƒ์ ์— ๋Œ€ํ•œ ํŠน์ง•์„ ์ •๋ฆฌํ•œ ๊ฒƒ 

์ปฌ๋Ÿผ ์˜๋ฏธ

  • Store: unique id of the store
  • StoreType: type of store
  • Assortment: assortment (product range) level of the store
  • CompetitionDistance: distance to the nearest competitor's store
  • CompetitionOpenSinceMonth: month in which the nearest competitor opened
  • CompetitionOpenSinceYear: year in which the nearest competitor opened
  • Promo2: whether the store runs a continuing (periodic) promotion
  • Promo2SinceWeek / Promo2SinceYear: if the store participates in Promo2, when it started
  • PromoInterval: the intervals at which Promo2 runs

train = pd.merge(train, store, on=['Store'], how='left')
test = pd.merge(test, store, on=['Store'], how='left')

Store ์ปฌ๋Ÿผ์„ ๊ธฐ์ค€์œผ๋กœ train, test ๋ฐ์ดํ„ฐ์…‹๊ณผ store ๋ฐ์ดํ„ฐ์…‹์„ ํ•ฉ๋ณ‘ํ•ด ์ค€๋‹ค. 

 

 

- Creating the CompetitionOpen column

: ๊ฒฝ์Ÿ์—…์ฒด๊ฐ€ ์–ธ์ œ ๊ฐœ์žฅํ–ˆ๋Š”์ง€ (ํ•ด๋‹น ๊ฐ€๊ฒŒ ์ด์ „ ๊ฐœ์žฅ: ์–‘์ˆ˜, ์ดํ›„ ๊ฐœ์žฅ: ์Œ์ˆ˜

train['CompetitionOpen'] = 12*( train['year'] - train['CompetitionOpenSinceYear'] ) + \
                             (train['month'] - train['CompetitionOpenSinceMonth'])

test['CompetitionOpen'] = 12*( test['year'] - test['CompetitionOpenSinceYear'] ) + \
                             (test['month'] - test['CompetitionOpenSinceMonth'])

Subtracting the competitor's opening year from the record's year and multiplying by 12 converts the difference into months, and adding the month difference on top gives how many months the competitor has been open as of that record's date. For example, for a record dated July 2015 and a competitor that opened in September 2012, CompetitionOpen = 12*(2015 - 2012) + (7 - 9) = 34 months.

 

 

- Creating the PromoOpen column

: how many months Promo2 has been running as of each record date

train['WeekOfYear'] = train['Date'].dt.weekofyear # week number of the record date
test['WeekOfYear'] = test['Date'].dt.weekofyear

Because the Promo2 date information is given as a year (Promo2SinceYear) and a week of the year (Promo2SinceWeek), the week number of each Date is computed and stored in the WeekOfYear column.
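Note that dt.weekofyear is deprecated and was removed in newer pandas versions; if the block above raises an error, the ISO week from dt.isocalendar() can be used instead (a minimal sketch):

# equivalent week-of-year using the ISO calendar (newer pandas)
train['WeekOfYear'] = train['Date'].dt.isocalendar().week.astype(int)
test['WeekOfYear'] = test['Date'].dt.isocalendar().week.astype(int)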

train['PromoOpen'] = 12* ( train['year'] - train['Promo2SinceYear'] ) + \
                        (train['WeekOfYear'] - train['Promo2SinceWeek']) / 4

test['PromoOpen'] = 12* ( test['year'] - test['Promo2SinceYear'] ) + \
                        (test['WeekOfYear'] - test['Promo2SinceWeek']) / 4

์ด์ „๊ณผ ๊ฐ™์ด ๋…„๋„๋ฅผ ๊ฐœ์›”์ˆ˜๋กœ ๋ฐ”๊ฟ”์ฃผ๊ณ , ์ฃผ๋ฅผ 4๋กœ ๋‚˜๋ˆ„์–ด ๊ฐœ์›”์ˆ˜๋กœ ๋ณ€ํ™˜ํ•ด ์ค€๊ฒƒ์„ ๋”ํ•˜์—ฌ ๊ฐœ์žฅ ํ›„ ๋ช‡๊ฐœ์›” ๋’ค์— ํ”„๋กœ๋ชจ์…˜2๊ฐ€ ์ง„ํ–‰๋˜์—ˆ๋Š”์ง€์— ๋Œ€ํ•œ ๊ฐœ์›” ์ˆ˜๊ฐ€ ๋‚˜์˜ค๊ฒŒ ๋œ๋‹ค. 

 

 

- ์›ํ•ซ์ธ์ฝ”๋”ฉ ( get_dummies() ) 

train.dtypes

๋ฐ์ดํ„ฐํƒ€์ž…์„ ํ™•์ธ ํ•ด ๋ณด๋ฉด object์ธ ์ปฌ๋Ÿผ์ด 3๊ฐ€์ง€ ์žˆ๋‹ค. 3๊ฐœ์˜ ์ปฌ๋Ÿผ์„ get_dummies๋ฅผ ์ด์šฉํ•˜์—ฌ ์›ํ•ซ์ธ์ฝ”๋”ฉ ํ•ด์ค€๋‹ค. 

train = pd.get_dummies(columns=['StoreType'],data=train)
test = pd.get_dummies(columns=['StoreType'],data=test)
train = pd.get_dummies(columns=['Assortment'],data=train)
test = pd.get_dummies(columns=['Assortment'],data=test)
train = pd.get_dummies(columns=['PromoInterval'],data=train)
test = pd.get_dummies(columns=['PromoInterval'],data=test)
train.columns

test.columns

The train and test columns were confirmed to be identical.
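For reference, the three columns could equally be encoded in a single get_dummies call (a minimal sketch, as an alternative to, not in addition to, the block above):

# one-hot encode the three object columns in one call
train = pd.get_dummies(columns=['StoreType', 'Assortment', 'PromoInterval'], data=train)
test = pd.get_dummies(columns=['StoreType', 'Assortment', 'PromoInterval'], data=test)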

 

 

 

- ๋ชจ๋ธ๋ง 

input_var = ['Promo', 'SchoolHoliday',
       'StateHoliday_0', 'StateHoliday_a', 'StateHoliday_b', 'StateHoliday_c',
       'weekday', 'year', 'month', 'CompetitionDistance',
       'Promo2',
       'CompetitionOpen', 'WeekOfYear',
       'PromoOpen', 'StoreType_a', 'StoreType_b', 'StoreType_c', 'StoreType_d',
       'Assortment_a', 'Assortment_b', 'Assortment_c',
       'PromoInterval_Feb,May,Aug,Nov', 'PromoInterval_Jan,Apr,Jul,Oct',
       'PromoInterval_Mar,Jun,Sept,Dec']

The unnecessary columns are dropped and the remaining ones are stored in input_var.

set(train) - set(input_var)

(For reference) This is the list of columns not included in input_var.

xgb = XGBRegressor( n_estimators=300, learning_rate= 0.1, random_state=2020)
xgb.fit(train[input_var],train['Sales'])

์•ž๊ณผ ๋™์ผํ•˜๊ฒŒ xgb ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•œ๋‹ค.  

cross_val_score(xgb, train[input_var], train['Sales'], scoring="neg_mean_squared_error", cv=3)

store ๋ฐ์ดํ„ฐ์…‹์„ ํ•ฉ๋ณ‘ํ•˜์—ฌ ์ „์ฒ˜๋ฆฌ ํ›„ ๋ชจ๋ธ๋ง์„ ํ–ˆ๋”๋‹ˆ ์˜ค๋ฅ˜์œจ์ด ๋Œ€ํญ ํ•˜๋ฝํ•˜์˜€๋‹ค. 

 

 

- ๋ณ€์ˆ˜์ค‘์š”๋„ 

imp_df = pd.DataFrame({'var':input_var,
                       'imp':xgb.feature_importances_})
imp_df = imp_df.sort_values(['imp'],ascending=False)
plt.bar(imp_df['var'],
        imp_df['imp'])
plt.xticks(rotation=90)
plt.show()

๋ณ€์ˆ˜์ค‘์š”๋„๋ฅผ ์‹œ๊ฐํ™” ํ•ด๋ณด์•˜๋”๋‹ˆ, ๋ชจ๋“  ๋ณ€์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ๋ณด๋‹ค ์„ ํƒํ•ด์„œ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์ด ์ข‹์„ ๊ฒƒ ๊ฐ™๋‹ค๊ณ  ํŒ๋‹จ๋œ๋‹ค. 

score_list=[]
selected_varnum=[]
for i in range(1,25):
    selected_var = imp_df['var'].iloc[:i].to_list()
    scores = cross_val_score(xgb, 
                             train[selected_var], 
                             train['Sales'], 
                             scoring="neg_mean_squared_error", cv=3)
    score_list.append(-np.mean(scores))
    selected_varnum.append(i)
    print(i)
plt.plot(selected_varnum, score_list)

์ง€์†์ ์œผ๋กœ ํ•˜๋ฝํ•˜๋Š” ๊ฒฝํ–ฅ์„ ๋ณด์ด์ง€๋งŒ 17๊ฐœ ์ดํ›„๋กœ ๋น„์Šทํ•œ ๊ฒƒ ๊ฐ™์ด ๋ณด์ธ๋‹ค. ๋”ฐ๋ผ์„œ 17๊ฐœ๊นŒ์ง€ ์„ ํƒํ•˜์—ฌ ํ•™์Šต์„ ์ง„ํ–‰ํ•ด ๋ณธ๋‹ค. 

input_var = imp_df['var'].iloc[:17].tolist()
xgb.fit(train[input_var],
        train['Sales'])
cross_val_score(xgb, train[input_var], train['Sales'], scoring="neg_mean_squared_error", cv=3)

์ „์ฒด์ ์œผ๋กœ ์˜ค๋ฅ˜์œจ์ด ์ค„์–ด๋“ค์—ˆ๋‹ค. 

 

 

 

 

 

 


3. ํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹ 

estim_list = [100,200,300,400,500,600,700,800,900]
score_list = []
for i in estim_list:
    xgb = XGBRegressor( n_estimators=i, learning_rate= 0.1, random_state=2020)
    scores = cross_val_score(xgb, train[input_var], train['Sales'], scoring="neg_mean_squared_error", cv=3)
    score_list.append(-np.mean(scores))
    print(i)
plt.plot(estim_list,score_list)
plt.xticks(rotation=90)
plt.show()

The errors computed while varying n_estimators were visualized, and n_estimators=400 looks like a reasonable choice.

xgb = XGBRegressor( n_estimators=400, learning_rate= 0.1, random_state=2020)
xgb.fit(train[input_var],
        train['Sales'])
cross_val_score(xgb, train[input_var], train['Sales'], scoring="neg_mean_squared_error", cv=3)

Changing it to 400 lowered the error.
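Only n_estimators was tuned here; as one possible extension (a minimal sketch, not part of the original lecture), scikit-learn's GridSearchCV could search n_estimators and learning_rate together with the same cross-validation setup:

from sklearn.model_selection import GridSearchCV

# hypothetical grid; the values are examples, not tuned results
param_grid = {'n_estimators': [300, 400, 500],
              'learning_rate': [0.05, 0.1, 0.2]}
grid = GridSearchCV(XGBRegressor(random_state=2020),
                    param_grid,
                    scoring="neg_mean_squared_error",
                    cv=3)
grid.fit(train[input_var], train['Sales'])
print(grid.best_params_, -grid.best_score_)  # best parameters and their mean MSE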

 

์•„์‰ฝ๊ฒŒ๋„ ํŒŒ๋ผ๋ฏธํ„ฐ ํŠœ๋‹์„ ํ•œ ์ดํ›„๋กœ ์บ๊ธ€์—์„œ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ์…‹์€ ์˜ค๋ฅ˜์œจ์ด ๋” ๋†’๊ฒŒ ๋‚˜์™”๋‹ค. ์ด์™ธ์— ๊ฒฐ์ธก๊ฐ’, ์ด์ƒ์น˜ ๋“ฑ feature engineering์„ ์ง€์†์ ์œผ๋กœ ์‹œ๋„ํ•ด ๋ณด์•„์•ผ๊ฒ ๋‹ค. (์ถ”ํ›„ github ์—…๋กœ๋“œ ์˜ˆ์ •) 


 
