Learning Spoons class notes

 

 

<Previous post>

https://silvercoding.tistory.com/71

 


 

 


1. Data Overview & Loading the Data

[ Home Credit Data ]

Original data: Kaggle

Training data: provided by Learning Spoons

  • Predicting loan repayment ability: using each customer's personal information and transaction data, predict whether the customer will repay a loan once it is granted

train.csv - training data
test.csv - test data to predict on
loan_before.csv - details of the loans each person took out previously

 

import pandas as pd
import os
os.chdir('../data')
lb = pd.read_csv("loan_before.csv")
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
train.head()

 

lb.head()

 

- loan before column information

SK_ID_CURR - unique ID
DAYS_CREDIT - how many days before the Home Credit loan this loan took place
CNT_CREDIT_PROLONG - how many times the loan was extended
AMT_CREDIT_SUM - loan amount
CREDIT_TYPE - loan type

 

- train, test column information

SK_ID_CURR - unique ID
TARGET - target value (0: repaid normally, 1: overdue or otherwise problematic)
CODE_GENDER - gender (0: female, 1: male)
FLAG_OWN_CAR - owns a car (0: no, 1: yes)
FLAG_OWN_REALTY - owns a house or apartment (0: no, 1: yes)
CNT_CHILDREN - number of children
AMT_INCOME_TOTAL - income
AMT_CREDIT - loan amount
AMT_ANNUITY - amount to repay each month
NAME_TYPE_SUITE - who accompanied the customer when applying for the loan
NAME_INCOME_TYPE - type of occupation
NAME_EDUCATION_TYPE - education level
NAME_HOUSING_TYPE - housing situation
REGION_POPULATION_RELATIVE - population of the region
DAYS_BIRTH - age
DAYS_EMPLOYED - when the person started their job (365243 marks a missing value)
DAYS_ID_PUBLISH - date on which the customer changed the ID document used for the loan application
OWN_CAR_AGE - age of the car owned
CNT_FAM_MEMBERS - number of family members
HOUR_APPR_PROCESS_START - hour at which the loan application was made
ORGANIZATION_TYPE - type of organization the person works for
EXT_SOURCE_1 - credit score from external data source 1
EXT_SOURCE_2 - credit score from external data source 2
EXT_SOURCE_3 - credit score from external data source 3
DAYS_LAST_PHONE_CHANGE - when the person last changed phones
AMT_REQ_CREDIT_BUREAU_YEAR - number of credit-bureau inquiries about the person in the year before the application
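The column table notes that 365243 in DAYS_EMPLOYED marks a missing value, but the post does not show any handling of it. As a minimal sketch, assuming one wants pandas to treat the sentinel as missing, it can be replaced with NaN. The toy frame and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); 365243 is the missing-value sentinel
toy = pd.DataFrame({"DAYS_EMPLOYED": [-748, 365243, -3200, 365243]})

# Replace the sentinel with NaN so it is not read as a real day count
toy["DAYS_EMPLOYED"] = toy["DAYS_EMPLOYED"].replace(365243, np.nan)

n_missing = int(toy["DAYS_EMPLOYED"].isna().sum())
print(n_missing)
```

XGBoost treats NaN as missing by default, so a step like this would presumably fit in before model training; whether it changes the results reported in this post has not been checked.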

1. Problem Definition

Question 1 - Which factors have a large influence on whether a loan is repaid?

Question 2 - How do those factors influence repayment?

 

2. Methodology

- Analysis process

Interpretable machine learning (xAI) is used to answer the questions.

(1) Feature Engineering

- Create the AMT_CREDIT_TO_ANNUITY_RATIO variable: the number of months over which the person has to repay the loan

train['AMT_CREDIT_TO_ANNUITY_RATIO'] = train['AMT_CREDIT']/train['AMT_ANNUITY']
test['AMT_CREDIT_TO_ANNUITY_RATIO'] = test['AMT_CREDIT']/test['AMT_ANNUITY']
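A small worked example may make the ratio concrete. It assumes, as the column table states, that AMT_ANNUITY is the amount repaid per month, and it ignores interest; the numbers are invented.

```python
import pandas as pd

# Hypothetical loans: total amount and monthly payment
toy = pd.DataFrame({
    "AMT_CREDIT": [120000.0, 300000.0],
    "AMT_ANNUITY": [10000.0, 12500.0],
})

# Loan amount divided by the monthly payment = months needed to repay
toy["AMT_CREDIT_TO_ANNUITY_RATIO"] = toy["AMT_CREDIT"] / toy["AMT_ANNUITY"]

ratios = toy["AMT_CREDIT_TO_ANNUITY_RATIO"].tolist()
print(ratios)  # [12.0, 24.0]
```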

- lb data: group by SK_ID_CURR, then take the mean

  • AMT_CREDIT_SUM (amount of the previous loans)
  • DAYS_CREDIT (how many days before the train/test loan the previous loan was taken out)
  • CNT_CREDIT_PROLONG (how many times the loan was extended)
train = pd.merge( train,lb.groupby(['SK_ID_CURR'])['AMT_CREDIT_SUM'].mean().reset_index(),on='SK_ID_CURR',how='left' )
test = pd.merge( test,lb.groupby(['SK_ID_CURR'])['AMT_CREDIT_SUM'].mean().reset_index(),on='SK_ID_CURR',how='left' )

train = pd.merge( train,lb.groupby(['SK_ID_CURR'])['DAYS_CREDIT'].mean().reset_index(),on='SK_ID_CURR',how='left' )
test = pd.merge( test,lb.groupby(['SK_ID_CURR'])['DAYS_CREDIT'].mean().reset_index(),on='SK_ID_CURR',how='left' )

train = pd.merge( train,lb.groupby(['SK_ID_CURR'])['CNT_CREDIT_PROLONG'].mean().reset_index(),on='SK_ID_CURR',how='left' )
test = pd.merge( test,lb.groupby(['SK_ID_CURR'])['CNT_CREDIT_PROLONG'].mean().reset_index(),on='SK_ID_CURR',how='left' )

- lb data: group by SK_ID_CURR, then count

  • Create the count column: how many loans the person took out previously
train = pd.merge(train , lb.groupby(['SK_ID_CURR']).size().reset_index().rename(columns={0:'count'}),on='SK_ID_CURR', how='left')
test = pd.merge(test , lb.groupby(['SK_ID_CURR']).size().reset_index().rename(columns={0:'count'}),on='SK_ID_CURR', how='left')
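The merges above run a separate groupby for every feature. As an alternative sketch (not the approach used in the class), pandas named aggregation can build the three means and the count in one pass, followed by a single left merge. The toy frames stand in for lb and train and contain invented values.

```python
import pandas as pd

# Toy stand-ins for lb and train (hypothetical values)
lb = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_CREDIT_SUM": [100.0, 300.0, 50.0],
    "DAYS_CREDIT": [-400, -200, -1000],
    "CNT_CREDIT_PROLONG": [0, 2, 0],
})
train = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})

# One groupby: three per-customer means plus the number of previous loans
lb_agg = lb.groupby("SK_ID_CURR").agg(
    AMT_CREDIT_SUM=("AMT_CREDIT_SUM", "mean"),
    DAYS_CREDIT=("DAYS_CREDIT", "mean"),
    CNT_CREDIT_PROLONG=("CNT_CREDIT_PROLONG", "mean"),
    count=("AMT_CREDIT_SUM", "size"),  # size counts rows, including NaN
).reset_index()

# Left merge keeps customers with no previous loans (their features are NaN)
merged = train.merge(lb_agg, on="SK_ID_CURR", how="left")
print(merged)
```

Customer 3 has no rows in the toy lb, so all four new columns are NaN for that row, which is also what the `how='left'` merges in the post produce.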

 

- Removing variables

Because the goal of this project is model interpretation, every variable that gets in the way of interpretation is removed.

Removed variables

  • CODE_GENDER : categorical variable
  • FLAG_OWN_CAR : categorical variable
  • NAME_TYPE_SUITE : categorical variable
  • NAME_INCOME_TYPE : categorical variable
  • NAME_EDUCATION_TYPE : categorical variable
  • NAME_HOUSING_TYPE : categorical variable
  • ORGANIZATION_TYPE : categorical variable
  • EXT_SOURCE_1 : exact meaning of the variable is unknown
  • EXT_SOURCE_2 : exact meaning of the variable is unknown
  • EXT_SOURCE_3 : exact meaning of the variable is unknown
del_list = ['CODE_GENDER','FLAG_OWN_CAR','NAME_TYPE_SUITE','NAME_INCOME_TYPE','NAME_EDUCATION_TYPE','NAME_HOUSING_TYPE','ORGANIZATION_TYPE',
'EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3']
train = train.drop(del_list,axis=1)
test = test.drop(del_list,axis=1)
train.columns

 

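(2) Modeling

- Input variables that are highly correlated with each other are removed.

: When input variables are highly correlated, SHAP values lose much of their explanatory power, because the attribution for a shared effect can end up split between the correlated variables.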

input_var = ['FLAG_OWN_REALTY', 'CNT_CHILDREN',
       'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY',
       'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED',
       'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'CNT_FAM_MEMBERS',
       'HOUR_APPR_PROCESS_START', 'DAYS_LAST_PHONE_CHANGE',
       'AMT_REQ_CREDIT_BUREAU_YEAR', 'AMT_CREDIT_TO_ANNUITY_RATIO',
       'AMT_CREDIT_SUM', 'DAYS_CREDIT', 'CNT_CREDIT_PROLONG', 'count']

All variables except the target variable TARGET and the ID column SK_ID_CURR are stored in input_var.

 

corr = train[input_var].corr()
corr.style.background_gradient(cmap='coolwarm')

The code above renders the correlation matrix as a color-coded table. The highly correlated pairs are listed below.

[ Highly correlated variable pairs ]

  • CNT_FAM_MEMBERS & CNT_CHILDREN 0.883051
  • AMT_CREDIT_TO_ANNUITY_RATIO & AMT_CREDIT 0.656337
  • AMT_ANNUITY & AMT_CREDIT 0.770938
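Pairs like these were read off the colored table by eye. A hedged sketch of doing the same programmatically is shown below; the helper name `high_corr_pairs` and the 0.6 threshold are assumptions for illustration (the post does not state a cutoff), and the toy data is invented.

```python
import pandas as pd

def high_corr_pairs(df, threshold=0.6):
    """Return (var_a, var_b, r) for column pairs with |r| >= threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, no duplicates
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    # Strongest pairs first
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Hypothetical data: b is almost a multiple of a, c is only weakly related
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 1, 4, 2, 3],
})
pairs = high_corr_pairs(toy, threshold=0.6)
print(pairs)
```

With the real data, the call would presumably be `high_corr_pairs(train[input_var])`.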

cf) Interpreting the Pearson correlation coefficient

r between -1.0 and -0.7: strong negative linear relationship,

r between -0.7 and -0.3: clear negative linear relationship,

r between -0.3 and -0.1: weak negative linear relationship,

r between -0.1 and +0.1: negligible linear relationship,

r between +0.1 and +0.3: weak positive linear relationship,

r between +0.3 and +0.7: clear positive linear relationship,

r between +0.7 and +1.0: strong positive linear relationship
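The scale above can be written as a small lookup function. This is only a sketch: the function name is hypothetical, and because the scale does not say which bucket a boundary value such as 0.3 belongs to, the choice made here (boundaries go to the stronger bucket) is an assumption.

```python
def describe_pearson(r):
    """Map a Pearson r to the verbal scale listed above."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie in [-1, 1]")
    strength = abs(r)
    if strength < 0.1:
        return "negligible linear relationship"
    sign = "positive" if r > 0 else "negative"
    if strength < 0.3:
        level = "weak"
    elif strength < 0.7:
        level = "clear"
    else:
        level = "strong"
    return f"{level} {sign} linear relationship"

# Values taken from the pairs and target correlations reported in this post
print(describe_pearson(0.883051))
print(describe_pearson(0.656337))
print(describe_pearson(-0.024740288335190132))
```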


For each pair, the variable that is less correlated with the target variable is removed.

print(train['CNT_FAM_MEMBERS'].corr(train['TARGET']))
print(train['CNT_CHILDREN'].corr(train['TARGET']))

0.018876651698723705

0.025357359317615676

del train['CNT_FAM_MEMBERS']
del test['CNT_FAM_MEMBERS']

CNT_FAM_MEMBERS has the lower correlation coefficient with TARGET, so it is removed.

print(train['AMT_CREDIT_TO_ANNUITY_RATIO'].corr(train['TARGET']))
print(train['AMT_CREDIT'].corr(train['TARGET']))

-0.024740288335190132

-0.02255843084934759

del train['AMT_CREDIT']
del test['AMT_CREDIT']

AMT_CREDIT has the lower correlation with TARGET in absolute value, so it is removed. Dropping it also takes care of the AMT_ANNUITY & AMT_CREDIT pair.
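The rule applied twice above (within a correlated pair, drop the variable with the smaller absolute correlation to TARGET) can be sketched as a helper. The name `weaker_of_pair` and the toy data are hypothetical.

```python
import pandas as pd

def weaker_of_pair(df, var_a, var_b, target="TARGET"):
    """Return the variable whose |correlation with target| is smaller."""
    r_a = abs(df[var_a].corr(df[target]))
    r_b = abs(df[var_b].corr(df[target]))
    return var_a if r_a < r_b else var_b

# Hypothetical data: x tracks TARGET closely, y only loosely
toy = pd.DataFrame({
    "TARGET": [0, 0, 1, 1, 0, 1],
    "x": [1, 2, 5, 6, 2, 7],
    "y": [3, 1, 2, 3, 1, 2],
})
to_drop = weaker_of_pair(toy, "x", "y")
print(to_drop)  # y
```

Comparing absolute values matters for the second pair in this post, where both correlations with TARGET are negative.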

input_var = ['FLAG_OWN_REALTY', 'CNT_CHILDREN',
       'AMT_INCOME_TOTAL', 'AMT_ANNUITY', 'REGION_POPULATION_RELATIVE',
       'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE',
       'HOUR_APPR_PROCESS_START', 'DAYS_LAST_PHONE_CHANGE',
       'AMT_REQ_CREDIT_BUREAU_YEAR', 'AMT_CREDIT_TO_ANNUITY_RATIO',
       'AMT_CREDIT_SUM', 'DAYS_CREDIT', 'CNT_CREDIT_PROLONG', 'count']

input_var is redefined with the removed variables left out.

 

- xgboost modeling

: shap's TreeExplainer, which is used below, requires a tree-based model such as a random forest or gradient-boosted trees. Among these, xgboost is chosen because it is fast while keeping high performance.

from xgboost import XGBClassifier
model = XGBClassifier(n_estimators=100, learning_rate=0.1)
model.fit(train[input_var],train['TARGET'])

 

 

(3) shap value 

import shap
shap_values = shap.TreeExplainer(model).shap_values(train[input_var])
shap.summary_plot(shap_values, train[input_var], plot_type='bar')

 

Top 5 variables with the largest influence on the target value

  • AMT_CREDIT_TO_ANNUITY_RATIO
  • DAYS_EMPLOYED
  • DAYS_CREDIT
  • DAYS_BIRTH
  • DAYS_LAST_PHONE_CHANGE
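The bar-type summary plot ranks features by their mean absolute SHAP value. As a sketch of how the same ranking could be obtained as a list rather than read off the figure, the snippet below uses a made-up SHAP matrix; with the model from this post, `shap_values` and `input_var` would presumably take the place of `toy_shap` and `feature_names`, assuming `shap_values` is a 2-D array of shape (samples, features).

```python
import numpy as np

# Hypothetical SHAP matrix: rows = samples, columns = features
feature_names = ["f_a", "f_b", "f_c"]
toy_shap = np.array([
    [ 0.2, -0.5, 0.0],
    [-0.4,  0.6, 0.1],
    [ 0.3, -0.7, 0.0],
])

# Mean absolute SHAP value per feature, largest first
mean_abs = np.abs(toy_shap).mean(axis=0)
order = np.argsort(mean_abs)[::-1]
ranking = [feature_names[i] for i in order]
print(ranking)  # ['f_b', 'f_a', 'f_c']
```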

 

(4) Relationship between the 5 predictors and the target variable (loan repayment)

- 1. AMT_CREDIT_TO_ANNUITY_RATIO: loan repayment period

shap.dependence_plot('AMT_CREDIT_TO_ANNUITY_RATIO', shap_values, train[input_var])

In this plot, a lower value on the vertical axis (the SHAP value) can be read as better repayment (a higher probability that TARGET is 0). Customers with a repayment period of 12-20 months repay poorly, while those at 12 months or less, or 20 months or more, appear to repay comparatively well.

 

 

- 2. DAYS_EMPLOYED: when the person started their job

shap.dependence_plot('DAYS_EMPLOYED', shap_values, train[input_var])

Repayment ability rises sharply for people who started their job more than 9000 days before the loan date.

 

 

- 3. DAYS_CREDIT: how many days before the Home Credit loan the previous loan took place

shap.dependence_plot('DAYS_CREDIT', shap_values, train[input_var])

From -3000 days to -2000 days repayment ability rises, and after that it falls. In other words, repayment ability is lower when the previous loan was taken out either very long ago or recently.

 

 

- 4. DAYS_BIRTH: age

shap.dependence_plot('DAYS_BIRTH', shap_values, train[input_var])

The longer ago the person was born (the older they are), the better they tend to repay.

 

 

- 5. DAYS_LAST_PHONE_CHANGE: when the person last changed phones

shap.dependence_plot('DAYS_LAST_PHONE_CHANGE', shap_values, train[input_var])

The longer ago the phone was changed, the better the person tends to repay.

 

 


3. Conclusion

  • The loan repayment period has the largest influence on repayment, and the relationship is non-linear. (A large influence does not by itself establish a causal relationship.)
  • Home ownership and the number of children have almost no influence on repayment ability.
  • The more recently a person started their job, took out a previous loan, or changed phones, and the younger they are, the lower the likelihood of repayment.
train['DAYS_EMPLOYED'].quantile(0.75)

-748.0

The 75th-percentile cutoff can be computed as above. Using such cutoffs, the rows above the 75th percentile on all four variables and the rows below the 25th percentile on all four variables are split into two groups, and the result is checked visually.

 

- Top 25%

group1 = train.loc[ (train['DAYS_EMPLOYED'].quantile(0.75)< train['DAYS_EMPLOYED']) &
           (train['DAYS_CREDIT'].quantile(0.75)< train['DAYS_CREDIT']) &
           (train['DAYS_LAST_PHONE_CHANGE'].quantile(0.75)< train['DAYS_LAST_PHONE_CHANGE']) &
           (train['DAYS_BIRTH'].quantile(0.75)< train['DAYS_BIRTH']) ]

- Bottom 25%

group2 = train.loc[ (train['DAYS_EMPLOYED'].quantile(0.25)> train['DAYS_EMPLOYED']) &
           (train['DAYS_CREDIT'].quantile(0.25)> train['DAYS_CREDIT']) &
           (train['DAYS_LAST_PHONE_CHANGE'].quantile(0.25)> train['DAYS_LAST_PHONE_CHANGE']) &
           (train['DAYS_BIRTH'].quantile(0.25)> train['DAYS_BIRTH']) ]
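# assign returns a new DataFrame, which avoids SettingWithCopyWarning on the filtered slices
group1 = group1.assign(group=1)
group2 = group2.assign(group=0)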

group1 gets 1 in the group variable, and group2 gets 0.

full = pd.concat([group1,group2],axis=0)

group1 and group2 are concatenated.
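The two filters above repeat the same quantile comparison for each of the four variables. One possible way to express that pattern once is sketched below; the helper `extreme_group` and the toy columns DAYS_A and DAYS_B are hypothetical and not part of the class material.

```python
import pandas as pd

def extreme_group(df, cols, q, upper):
    """Rows lying beyond the q-quantile on every column in cols."""
    mask = pd.Series(True, index=df.index)
    for col in cols:
        cut = df[col].quantile(q)
        # upper=True keeps values above the cutoff, otherwise values below it
        mask &= (df[col] > cut) if upper else (df[col] < cut)
    # copy so that adding a column later does not touch a view of df
    return df.loc[mask].copy()

# Hypothetical negative day counts, as in the DAYS_* columns
toy = pd.DataFrame({
    "DAYS_A": list(range(-8, 0)),
    "DAYS_B": list(range(-8, 0)),
})
cols = ["DAYS_A", "DAYS_B"]

top = extreme_group(toy, cols, 0.75, upper=True).assign(group=1)
bottom = extreme_group(toy, cols, 0.25, upper=False).assign(group=0)
combined = pd.concat([top, bottom], axis=0)
print(combined)
```

Because the day counts are negative, the "top" group holds the most recent events and the "bottom" group the oldest ones.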

import seaborn as sns
sns.barplot(x='group', y='TARGET', data=full)  # keyword arguments; recent seaborn versions no longer accept x and y positionally

group2 (group=0, bottom 25%) has the lower mean TARGET (more 0s = normal repayment). This matches the conclusion that the smaller these variables are, that is, the further back in time the events lie, the higher the likelihood of repayment.
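The bar heights in the plot are the mean TARGET value of each group, so the same comparison can be checked numerically. A minimal sketch with an invented stand-in for `full`:

```python
import pandas as pd

# Hypothetical stand-in for the concatenated frame `full`
full = pd.DataFrame({
    "group":  [1, 1, 1, 1, 0, 0, 0, 0],
    "TARGET": [1, 0, 1, 0, 0, 0, 0, 1],
})

# Mean TARGET = share of problem loans in each group; size = rows per group
summary = full.groupby("group")["TARGET"].agg(["mean", "size"])
print(summary)
```

Reporting the group sizes alongside the means helps judge how much weight the comparison can bear, since requiring all four conditions at once may leave few rows in each group.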

 
