This post summarizes the Fast Campus course "Deep Learning Image Recognition Bible, Starting from Python Basics" (패스트캠퍼스, 파이썬 기초부터 시작하는 딥러닝 영상인식 바이블).

 

Before the first half of this year ends, I wanted to study deep learning in depth. I picked up a fair amount of theory on CNN, RNN, LSTM and so on as an undergraduate, but I had never built a model with a deep learning framework, so my goal became to use at least one framework fluently. Following the course, I will use keras. The keras documentation explains how to use it, and it looks like basic modeling can be learned from the documentation alone.

 

 

 


1. Import the required libraries

# TensorFlow and tf.keras
import tensorflow as tf 
from tensorflow import keras 
#  Helper libraries 
import numpy as np 
import matplotlib.pyplot as plt 
import math

print(tf.__version__) # check the TensorFlow version

Import tensorflow and keras. I ran the exercises in Google Colab, and at the time of writing the TensorFlow version is 2.8.0. The other libraries needed are numpy, matplotlib, and math.

 

 


2. Define batch size, epochs, and num_classes

# Define Constants 
batch_size = 128 
epochs = 100 
num_classes = 10

batch_size: how many samples are grouped into one training step? -> train on groups of 128

epochs: how many times training passes over the whole dataset -> 100 passes

num_classes: the number of classes -> MNIST has the 10 digits 0 to 9, so 10

 

 

  • Why set a batch size instead of training on all 60000 images at once

When training is split into batches, the model does not run through all the data in one straight pass. The loss is computed on each batch of batch_size samples and the weights are updated after every batch, so the weights are adjusted many times within a single epoch, and better performance can be expected.

(In my own experiment on MNIST, setting the batch size to 60000 gave an accuracy about 0.02 lower. Also, the smaller the batch size, the slower training becomes. I am not yet at the level of tuning batch size, but seeing first-hand that performance changes with it, choosing an appropriate batch size looks like an important part of the setup.)
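To make the "weights are updated after every batch" point concrete, here is a small arithmetic sketch. The helper name updates_per_epoch is my own and not part of keras; it only counts how many batches one pass over the data produces.

```python
import math

def updates_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of weight updates in one epoch (the last batch may be smaller)."""
    return math.ceil(num_samples / batch_size)

# MNIST training set with the batch size used in this post
print(updates_per_epoch(60000, 128))    # 469 updates per epoch
# Full-batch training: the weights change only once per epoch
print(updates_per_epoch(60000, 60000))  # 1 update per epoch
```

With batch_size = 128 the weights are adjusted 469 times per epoch, versus once per epoch for a single batch of 60000.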

 

 

 


3. Load the MNIST dataset

# Download MNIST dataset 
mnist = keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

MNIST is so well known that keras ships a loader for it, so there is no need to download it separately; the code above is enough.

len(train_images), len(test_images)

 (60000, 10000) 

There are 60000 training images and 10000 test images.

 

 


4. Train the deep learning model

(1) Normalize (so that values fall between 0.0 and 1.0)

# Normalize the input image so that each pixel value is between 0 to 1 
train_images = train_images / 255.0 
test_images = test_images / 255.0

Dividing by 255.0 converts the data to float and normalizes it to the range 0.0 to 1.0.

 

 

(2) Define the deep learning model

# Define the model architecture 
model = keras.Sequential([
                          keras.layers.Flatten(input_shape=(28, 28)),
                          keras.layers.Dense(128, activation=tf.nn.relu),
                          keras.layers.Dense(num_classes, activation='softmax')
])
# The last layer already applies softmax, so the loss receives probabilities, not logits
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])

๋ชจ๋ธ์€ keras.Sequential์— ์ธต์„ ํ•˜๋‚˜ํ•˜๋‚˜ ์ถ”๊ฐ€ํ•ด์ฃผ๋Š” ๋ฐฉ์‹์ด๋‹ค. ์ง๊ด€์ ์œผ๋กœ ๋ชจ๋ธ๋ง์„ ํ•  ์ˆ˜ ์žˆ๋‹ค๋Š” ์žฅ์ ์ด ์žˆ๋‹ค. flatten์œผ๋กœ ํ•œ ์žฅ๋‹น 2์ฐจ์› ๋ฐฐ์—ด 28x28์ธ ์ด๋ฏธ์ง€๋ฅผ 1์ฐจ์›์œผ๋กœ ๋งŒ๋“ค์–ด ์ค€๋‹ค. ๊ทธ๋‹ค์Œ Dense layer๋ฅผ ์‚ฌ์šฉํ•˜๊ณ , activation ํ•จ์ˆ˜๋Š” relu๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค. ๋งˆ์ง€๋ง‰ ์ธต์—๋Š” ํด๋ž˜์Šค์˜ ๊ฐœ์ˆ˜์™€ softmax ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•จ์œผ๋กœ์จ ์˜ˆ์ธก ๊ฒฐ๊ณผ๋ฅผ ํด๋ž˜์Šค ๋ณ„ ํ™•๋ฅ ๋กœ ๋‚˜์˜ค๊ฒŒ๋” ๋งŒ๋“ค์–ด์ค€๋‹ค. 

 

model.compile sets the optimizer, the loss function, and the metrics (evaluation measures).

Now everything is ready to train the model.

 

 

(3) Train the deep learning model

history = model.fit(train_images, train_labels, epochs=epochs, batch_size=batch_size)

Pass the training dataset together with the epochs and batch_size defined earlier.

 


5. Evaluate the deep learning model

(1) Check loss and accuracy

test_loss, test_acc = model.evaluate(test_images, test_labels)
print("Test Loss: ", test_loss)
print("Test Accuracy: ", test_acc)

 Test Loss: 0.12909765541553497 

 Test Accuracy: 0.98089998960495 

Even though this is a very basic deep learning model, it reached a high accuracy of 0.98. I wish every training run turned out like this.

 

 

 

(2) Define helper functions

# 1. Show the requested number of images
def show_sample(images, labels, sample_count=25):
  # Never ask for more samples than are available
  sample_count = min(sample_count, len(images), len(labels))
  # Side length of the smallest square grid that can fit {sample_count} images
  grid_count = math.ceil(math.sqrt(sample_count))

  plt.figure(figsize=(2*grid_count, 2*grid_count))
  for i in range(sample_count):
    plt.subplot(grid_count, grid_count, i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(images[i], cmap=plt.cm.gray)
    plt.xlabel(labels[i])
  plt.show()

###################################################################
# 2. Show images of one specific digit
# Helper function to display specific digit images 
def show_sample_digit(images, labels, digit, sample_count=25):
  # Side length of the smallest square grid that can fit {sample_count} images
  grid_count = math.ceil(math.sqrt(sample_count))

  plt.figure(figsize=(2*grid_count, 2*grid_count))
  digit_count = 0 
  # Scan from index 0 so the first image is not skipped, and stop at the end
  # of the data so the loop cannot run past the last label
  for i in range(len(labels)):
    if digit_count >= sample_count:
      break
    if digit == labels[i]: 
      plt.subplot(grid_count, grid_count, digit_count+1)
      plt.xticks([])
      plt.yticks([])
      plt.grid(False)
      plt.imshow(images[i], cmap=plt.cm.gray)
      plt.xlabel(labels[i])
      digit_count += 1 
  plt.show()


###################################################################
# 3. Show a single image enlarged
def show_digit_image(image):
  # Draw digit image 
  fig = plt.figure()
  ax = fig.add_subplot(1, 1, 1)
  # Major ticks every 5 pixels, minor ticks every 1 pixel
  major_ticks = np.arange(0, 29, 5)
  minor_ticks = np.arange(0, 29, 1)
  ax.set_xticks(major_ticks)
  ax.set_xticks(minor_ticks, minor=True)
  ax.set_yticks(major_ticks)
  ax.set_yticks(minor_ticks, minor=True)
  # And a corresponding grid 
  ax.grid(which='both')
  # Or if you want different settings for the grids:
  ax.grid(which='minor', alpha=0.2)
  ax.grid(which='major', alpha=0.5)
  ax.imshow(image, cmap=plt.cm.binary)

  plt.show()

These functions let us look at the 28x28 image arrays visually.

Let's use them to take a quick look at some images.

  • Using show_sample (prints the requested number of images)
show_sample(train_images, ['Label: %s' % label for label in train_labels])

This shows as many images as requested.

  • Using show_sample_digit (prints the requested number of images of one specific digit)
show_sample_digit(train_images, train_labels, 7)

This shows as many images of the chosen digit as requested.

 

 

 

(3) Visualize loss and accuracy per epoch while training on the train dataset

# Plot the training loss and accuracy recorded for each epoch
loss_fig, loss_ax = plt.subplots()
acc_fig, acc_ax = plt.subplots()

loss_ax.plot(history.history['loss'], 'ro')
loss_ax.set_xlabel('epoch')
loss_ax.set_ylabel('loss')

acc_ax.plot(history.history['accuracy'], 'bo')
acc_ax.set_xlabel('epoch')
acc_ax.set_ylabel('accuracy')

 

 

 

 

 

(4) Compare predictions with the correct answers on the test data

  • Actual value: the image
  • Predicted value: the x label
# Predict the labels of digit images in our test datasets.
predictions = model.predict(test_images)

# Then plot the first 25 test images and their predicted labels.
show_sample(test_images, ['predicted: %s' % np.argmax(result) for result in predictions])

 

 

 

(5) Using show_digit_image

  • Compare the image at a given index with the prediction for it
Digit = 2005 #@param {type:'slider', min:1, max:10000, step:1}
selected_digit = Digit - 1 

result = predictions[selected_digit]
result_number = np.argmax(result)
print('Number is %2d' % result_number)

show_digit_image(test_images[selected_digit])

In Colab, #@param creates a slider for the value. Move the slider to pick any index, and the output is:

Number is 7

Here "Number is 7" is the predicted value and the displayed picture is the test image (the correct answer), so the two can be compared directly.
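The step from the model's probability vector to a single predicted digit is np.argmax. Here is a minimal sketch with made-up probabilities for a 3-class problem (fake_predictions and true_labels are hypothetical stand-ins for model.predict output and test_labels).

```python
import numpy as np

# Each row is one sample, each column the probability of one class (made-up values)
fake_predictions = np.array([
    [0.05, 0.05, 0.90],
    [0.70, 0.20, 0.10],
])
true_labels = np.array([2, 1])

# The index of the largest probability in each row is the predicted class
predicted_labels = np.argmax(fake_predictions, axis=1)
accuracy = np.mean(predicted_labels == true_labels)

print(predicted_labels)  # [2 0]
print(accuracy)          # 0.5
```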

 

 

 

 

In this post, even a very simple deep learning model performed well on the MNIST dataset.

In the next post I will study CNN modeling, which is well suited to learning from images, to push the MNIST performance even higher.

Dataset used

https://www.data.go.kr/dataset/3035522/fileData.do

The site currently reports this dataset as discontinued.

Using the public data above, I preprocessed it following the Inflearn course "공공데이터로 파이썬 데이터 분석" (Python data analysis with public data) (https://bit.ly/3sISk6Z), and the visualization notes below use that preprocessed data.

Data to be used

cf1) Creating the figure and axes

fig=plt.figure(figsize=(10,3), dpi=100)
ax1=fig.subplots()

 

cf2) Showing every x tick

_=plt.xticks(ticks=np.arange(len(df)), labels=df.index)

cf3) Removing decimals from the x axis

from matplotlib.ticker import MaxNLocator
ax1.xaxis.set_major_locator(MaxNLocator(integer=True))

cf4) Placing the legend outside the plot

plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

 

 

 

 lineplot 


1. pandas plot

(1) The default pandas plot: lineplot

- Drawn from the df index or column values

df.plot(figsize=(10,3))

Sale price per pyeong by region, using groupby

cf) Showing every x tick

_=plt.xticks(ticks=np.arange(len(g)), labels=g.index)

- When df has several columns (the df columns play the role of seaborn's hue)

Sale price per pyeong by year and region

 

2. seaborn plot 

# columns: 연도 = year, 평당분양가격 = sale price per pyeong, 지역명 = region name
sns.lineplot(data=df, x="연도", y="평당분양가격", hue="지역명", ci=None, ax=ax1)
ax1.legend(bbox_to_anchor=(1.02, 1), loc=2)

Sale price per pyeong by year and region

 

 

 

 pointplot 

sns.pointplot(data=df, x="연도", y="평당분양가격", hue="지역명", ci=None, ax=ax2)
ax2.legend(bbox_to_anchor=(1.02, 1), loc=2)

Sale price per pyeong by year and region

 

 

 

 

 barplot 


1. pandas plot 

(1) df.plot(kind='bar')

- Drawn from the df index or column values

df.plot.bar(rot=0, figsize=(10, 3))
# or
df.plot(kind='bar',rot=0, ax=ax1)

Sale price per pyeong by region, using groupby

(2) df.plot.bar()

df.plot.bar(color='g',rot=0, figsize=(10,3)) # cmap='Pastel1' also works

Sale price per pyeong by region

- With several columns (the df columns play the same role as seaborn's hue)

ax=df2.plot.bar(figsize=(10,3), rot=0)
ax.set_ylabel('평당분양가격')

Sale price per pyeong by region and year

 

 

2. seaborn plot 

sns.barplot(data=df, x="지역명", y="평당분양가격")
# estimator default: mean
# color can be changed
# palette (https://seaborn.pydata.org/tutorial/color_palettes.html)
# ci: bootstrap resampling (with replacement), sorted means

Sale price per pyeong by region (seaborn definitely looks nicer)

- Specifying hue

sns.barplot(data=df, x="지역명", y="평당분양가격", hue='연도', ci=None)

 

 

 histplot 


1. pandas plot

(1) df.plot(kind='hist') or df.plot.hist()

df.plot(kind='hist', figsize=(10, 3), title='평당분양가격')
# or
ax=df.plot(kind='hist', figsize=(10, 3))
ax.set_title('평당분양가격')

Distribution of sale price per pyeong

df["평당분양가격"].plot.hist(bins=50)

(2) df.hist(bins=)

df["평당분양가격"].hist(bins=50)

axs=df.hist(bins=50, figsize=(10,10))
ax1,ax2,ax3,ax4=axs.flatten()
ax2.set_title('a title can be set per ax')

 

 

 

 

 

2. seaborn plot 

sns.histplot(df["평당분양가격"], kde=True)

 

 

 

 kdeplot 


1. seaborn plot 

sns.kdeplot(data=df['평당분양가격'])

sns.kdeplot(data=df[['평당분양가격','분양가격']])

 

 

 

 

 

 boxplot 


1. pandas plot

(1) df.plot(kind='box')

df.plot(kind='box', figsize=(5, 5))

 

(2) df.plot.box()

- The df columns become the x axis

df.plot.box(fontsize=15)

Sale price per pyeong by month and year

- With two-level columns

df.plot.box(figsize=(15, 3), rot=30)

Sale price per pyeong by month, year, and floor area

(3) df.boxplot(column='', by='')

- by: the x axis

# columns: 평당분양가격 = sale price per pyeong, 연도 = year, 전용면적 = floor area
df.boxplot(column='평당분양가격',by='연도', figsize=(5,3), rot=30)

Sale price per pyeong by year

- When by is a list

df.boxplot(column='평당분양가격',by=['연도','전용면적'], figsize=(20,3), rot=30)

Sale price per pyeong by year and floor area

 

 

 

2. seaborn plot 

sns.boxplot(data=df, x="연도", y="평당분양가격")

Sale price per pyeong by year

- Specifying hue

plt.figure(figsize=(12, 3))
sns.boxplot(data=df_last, x="연도", y="평당분양가격", hue="전용면적")

 

 

 violinplot 

1. seaborn plot 

sns.violinplot(data=df, x="연도", y="평당분양가격")

Sale price per pyeong by year

- Specifying hue

plt.figure(figsize=(12, 3))
sns.violinplot(data=df, x="연도", y="평당분양가격", hue="전용면적")

Sale price per pyeong by year and floor area

 

 

 

 heatmap 

1. seaborn plot 

plt.figure(figsize=(15, 7), dpi=100)
ax=sns.heatmap(df, cmap="Blues", annot=True, fmt=".0f")

์—ฐ๋„๋ณ„ ์ง€์—ญ๋ณ„ ํ‰๋‹น๋ถ„์–‘๊ฐ€๊ฒฉ, pivot_table๋กœ ์ „์ฒ˜๋ฆฌ ํ•ด์ค€ df์— ์ ์šฉํ•ด์•ผ ํ•จ

 

 

2. matplotlib pcolor  

fig=plt.figure(figsize=(15,5), dpi=100)
ax=fig.subplots()

t2=t.iloc[::-1]
t2
hm1=ax.pcolor(t2, cmap="Blues")
_=fig.colorbar(hm1, ax=ax)

col_len=len(t2.columns)
row_len=len(t2.index)
for r in range(row_len):
    for c in range(col_len):
        _=ax.text(c+0.5, r+0.5, int(t2.iloc[r, c]),ha="center", va="center", color="k", fontsize=11)

_=ax.set_xticks(np.arange(col_len)+0.5)
_=ax.set_xticklabels(t2.columns)

_=ax.set_yticks(np.arange(row_len)+0.5)
_=ax.set_yticklabels(t2.index)

 

 

 

 

 

 

Learning Spoons (러닝스푼즈) class notes

<Previous post>

https://silvercoding.tistory.com/71

[rossmann data] Store sales prediction / reduced Kaggle data

 

 


1. Data introduction & loading the data

[ Home Credit Data ]

Original data: Kaggle

Training data: provided by Learning Spoons

  • Predicting a customer's ability to repay a loan: based on the customer's personal information and transaction data, predict whether the customer will repay money lent to them

train.csv - training data
test.csv - test data to predict
loan_before.csv - details of the loans each person took out previously

 

import pandas as pd
import os
os.chdir('../data')
lb = pd.read_csv("loan_before.csv")
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
train.head()

 

lb.head()

 

- loan before column information

SK_ID_CURR: unique ID
DAYS_CREDIT: how many days before the Home Credit loan this earlier loan took place
CNT_CREDIT_PROLONG: how many times the loan was extended
AMT_CREDIT_SUM: loan amount
CREDIT_TYPE: loan type

 

- train, test column information

SK_ID_CURR: unique ID
TARGET: target value (0: repaid normally, 1: overdue or some other problem)
CODE_GENDER: gender (0: female, 1: male)
FLAG_OWN_CAR: owns a car (0: no, 1: yes)
FLAG_OWN_REALTY: owns a house or apartment (0: no, 1: yes)
CNT_CHILDREN: number of children
AMT_INCOME_TOTAL: income
AMT_CREDIT: loan amount
AMT_ANNUITY: amount to repay each month
NAME_TYPE_SUITE: who accompanied the applicant when applying for the loan
NAME_INCOME_TYPE: type of occupation
NAME_EDUCATION_TYPE: academic degree
NAME_HOUSING_TYPE: housing situation
REGION_POPULATION_RELATIVE: population of the region
DAYS_BIRTH: age
DAYS_EMPLOYED: when the person started their job (365243 marks a missing value)
DAYS_ID_PUBLISH: date the customer changed the ID document used to apply for the loan
OWN_CAR_AGE: age of the car owned
CNT_FAM_MEMBERS: number of family members
HOUR_APPR_PROCESS_START: hour at which the loan application was made
ORGANIZATION_TYPE: type of organization the person works for
EXT_SOURCE_1: credit score from external data source 1
EXT_SOURCE_2: credit score from external data source 2
EXT_SOURCE_3: credit score from external data source 3
DAYS_LAST_PHONE_CHANGE: when the phone was last changed
AMT_REQ_CREDIT_BUREAU_YEAR: number of credit-information inquiries made to the credit bureau about this person in the year before the application

1. Problem definition

Question 1 - Which factors have a large effect on whether a loan is repaid?

Question 2 - How do those factors affect repayment?

2. Methodology

- Analysis process

To answer the questions, use interpretable machine learning (xAI).

(1) Feature Engineering

- Create the AMT_CREDIT_TO_ANNUITY_RATIO variable: over how many months the person has to repay the money

train['AMT_CREDIT_TO_ANNUITY_RATIO'] = train['AMT_CREDIT']/train['AMT_ANNUITY']
test['AMT_CREDIT_TO_ANNUITY_RATIO'] = test['AMT_CREDIT']/test['AMT_ANNUITY']
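A worked example of why this ratio reads as "number of months", using two made-up loans (the numbers are hypothetical, and interest is ignored for simplicity):

```python
import pandas as pd

# Two hypothetical loans: total amount borrowed and the monthly payment
loans = pd.DataFrame({
    "AMT_CREDIT":  [120000.0, 300000.0],
    "AMT_ANNUITY": [10000.0, 12500.0],
})

# total amount / monthly payment = number of monthly payments
loans["AMT_CREDIT_TO_ANNUITY_RATIO"] = loans["AMT_CREDIT"] / loans["AMT_ANNUITY"]
print(loans["AMT_CREDIT_TO_ANNUITY_RATIO"].tolist())  # [12.0, 24.0]
```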

- lb data: groupby, then mean

  • AMT_CREDIT_SUM (amount of the previous loans)
  • DAYS_CREDIT (how many days before the train/test loan the previous loan was taken out)
  • CNT_CREDIT_PROLONG (how many times the loan was extended)
train = pd.merge( train,lb.groupby(['SK_ID_CURR'])['AMT_CREDIT_SUM'].mean().reset_index(),on='SK_ID_CURR',how='left' )
test = pd.merge( test,lb.groupby(['SK_ID_CURR'])['AMT_CREDIT_SUM'].mean().reset_index(),on='SK_ID_CURR',how='left' )

train = pd.merge( train,lb.groupby(['SK_ID_CURR'])['DAYS_CREDIT'].mean().reset_index(),on='SK_ID_CURR',how='left' )
test = pd.merge( test,lb.groupby(['SK_ID_CURR'])['DAYS_CREDIT'].mean().reset_index(),on='SK_ID_CURR',how='left' )

train = pd.merge( train,lb.groupby(['SK_ID_CURR'])['CNT_CREDIT_PROLONG'].mean().reset_index(),on='SK_ID_CURR',how='left' )
test = pd.merge( test,lb.groupby(['SK_ID_CURR'])['CNT_CREDIT_PROLONG'].mean().reset_index(),on='SK_ID_CURR',how='left' )

- lb data: groupby, then count

  • Create the count column: how many loans the person took out previously
train = pd.merge(train , lb.groupby(['SK_ID_CURR']).size().reset_index().rename(columns={0:'count'}),on='SK_ID_CURR', how='left')
test = pd.merge(test , lb.groupby(['SK_ID_CURR']).size().reset_index().rename(columns={0:'count'}),on='SK_ID_CURR', how='left')
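The same per-customer features can also be built in a single groupby and a single merge. This is only an alternative sketch on tiny made-up frames, not the method used in the class; it assumes pandas named aggregation is available (pandas 0.25 or later).

```python
import pandas as pd

# Tiny made-up stand-ins for loan_before.csv and train.csv
lb = pd.DataFrame({
    "SK_ID_CURR":         [1, 1, 2],
    "AMT_CREDIT_SUM":     [100.0, 300.0, 50.0],
    "DAYS_CREDIT":        [-100, -300, -50],
    "CNT_CREDIT_PROLONG": [0, 2, 1],
})
train = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})

# One pass over lb: three means plus the number of previous loans per customer
lb_features = (
    lb.groupby("SK_ID_CURR")
      .agg(
          AMT_CREDIT_SUM=("AMT_CREDIT_SUM", "mean"),
          DAYS_CREDIT=("DAYS_CREDIT", "mean"),
          CNT_CREDIT_PROLONG=("CNT_CREDIT_PROLONG", "mean"),
          count=("AMT_CREDIT_SUM", "size"),
      )
      .reset_index()
)

# Left merge keeps every customer; those without previous loans get NaN
merged = train.merge(lb_features, on="SK_ID_CURR", how="left")
print(merged)
```

Customer 3 has no rows in lb, so all four new columns are NaN for that row, which matches what the left merges above produce.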

 

- Removing variables

The goal of this project is to interpret the model, so every variable that gets in the way of interpretation is removed.

Variables removed

  • CODE_GENDER : categorical variable
  • FLAG_OWN_CAR : categorical variable
  • NAME_TYPE_SUITE : categorical variable
  • NAME_INCOME_TYPE : categorical variable
  • NAME_EDUCATION_TYPE : categorical variable
  • NAME_HOUSING_TYPE : categorical variable
  • ORGANIZATION_TYPE : categorical variable
  • EXT_SOURCE_1 : exact meaning of the variable is unknown
  • EXT_SOURCE_2 : exact meaning of the variable is unknown
  • EXT_SOURCE_3 : exact meaning of the variable is unknown
del_list = ['CODE_GENDER','FLAG_OWN_CAR','NAME_TYPE_SUITE','NAME_INCOME_TYPE','NAME_EDUCATION_TYPE','NAME_HOUSING_TYPE','ORGANIZATION_TYPE',
'EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3']
train = train.drop(del_list,axis=1)
test = test.drop(del_list,axis=1)
train.columns

 

(2) Modeling

- Remove input variables that are highly correlated with each other.

: When input variables are highly correlated, SHAP values lose much of their explanatory power.

input_var = ['FLAG_OWN_REALTY', 'CNT_CHILDREN',
       'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY',
       'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED',
       'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'CNT_FAM_MEMBERS',
       'HOUR_APPR_PROCESS_START', 'DAYS_LAST_PHONE_CHANGE',
       'AMT_REQ_CREDIT_BUREAU_YEAR', 'AMT_CREDIT_TO_ANNUITY_RATIO',
       'AMT_CREDIT_SUM', 'DAYS_CREDIT', 'CNT_CREDIT_PROLONG', 'count']

Store every variable except the target variable TARGET in input_var.

 

corr = train[input_var].corr()
corr.style.background_gradient(cmap='coolwarm')

This draws a color-coded correlation table, and the highly correlated variable pairs are listed below.

[ Highly correlated variable pairs ]

  • CNT_FAM_MEMBERS & CNT_CHILDREN 0.883051
  • AMT_CREDIT_TO_ANNUITY_RATIO & AMT_CREDIT 0.656337
  • AMT_ANNUITY & AMT_CREDIT 0.770938

cf) Interpreting the Pearson correlation coefficient

r between -1.0 and -0.7: strong negative linear relationship,

r between -0.7 and -0.3: clear negative linear relationship,

r between -0.3 and -0.1: weak negative linear relationship,

r between -0.1 and +0.1: negligible linear relationship,

r between +0.1 and +0.3: weak positive linear relationship,

r between +0.3 and +0.7: clear positive linear relationship,

r between +0.7 and +1.0: strong positive linear relationship


From each highly correlated pair, remove the variable whose correlation with the target is lower.

print(train['CNT_FAM_MEMBERS'].corr(train['TARGET']))
print(train['CNT_CHILDREN'].corr(train['TARGET']))

0.018876651698723705

0.025357359317615676

del train['CNT_FAM_MEMBERS']
del test['CNT_FAM_MEMBERS']

CNT_FAM_MEMBERS has the lower correlation with TARGET, so it is removed.

print(train['AMT_CREDIT_TO_ANNUITY_RATIO'].corr(train['TARGET']))
print(train['AMT_CREDIT'].corr(train['TARGET']))

-0.024740288335190132

-0.02255843084934759

del train['AMT_CREDIT']
del test['AMT_CREDIT']

AMT_CREDIT has the lower correlation with TARGET (in absolute value), so it is removed.

input_var = ['FLAG_OWN_REALTY', 'CNT_CHILDREN',
       'AMT_INCOME_TOTAL', 'AMT_ANNUITY', 'REGION_POPULATION_RELATIVE',
       'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE',
       'HOUR_APPR_PROCESS_START', 'DAYS_LAST_PHONE_CHANGE',
       'AMT_REQ_CREDIT_BUREAU_YEAR', 'AMT_CREDIT_TO_ANNUITY_RATIO',
       'AMT_CREDIT_SUM', 'DAYS_CREDIT', 'CNT_CREDIT_PROLONG', 'count']

Store the remaining variables, without the removed ones, in input_var again.

- xgboost modeling

: shap.TreeExplainer, which is used below, requires a tree-based model such as a random forest. Among tree-based models, xgboost was chosen because it is fast while keeping high performance.

from xgboost import XGBClassifier
model = XGBClassifier(n_estimators=100, learning_rate=0.1)
model.fit(train[input_var],train['TARGET'])

 

 

(3) shap value 

import shap
shap_values = shap.TreeExplainer(model).shap_values(train[input_var])
shap.summary_plot(shap_values, train[input_var], plot_type='bar')

 

The five variables with the largest effect on the target value

  • AMT_CREDIT_TO_ANNUITY_RATIO
  • DAYS_EMPLOYED
  • DAYS_CREDIT
  • DAYS_BIRTH
  • DAYS_LAST_PHONE_CHANGE

 

(4) Relationship between the five predictors and the target variable (loan repayment)

- 1. AMT_CREDIT_TO_ANNUITY_RATIO: loan repayment period

shap.dependence_plot('AMT_CREDIT_TO_ANNUITY_RATIO', shap_values, train[input_var])

In this plot, a lower value on the vertical axis can be read as better repayment (a higher probability that TARGET is 0). Repayment is poor when the period is 12 to 20 months, and comparatively good when it is 12 months or less or 20 months or more.

 

 

- 2. DAYS_EMPLOYED: when the person started their job

shap.dependence_plot('DAYS_EMPLOYED', shap_values, train[input_var])

For people who started their job more than 9000 days before the loan date, repayment ability rises sharply.

 

 

- 3. DAYS_CREDIT: how many days before the Home Credit loan the earlier loan took place

shap.dependence_plot('DAYS_CREDIT', shap_values, train[input_var])

Repayment ability rises from -3000 days to -2000 days and falls after that. In other words, repayment ability is lower when the earlier loan was taken out either very long ago or recently.

 

 

- 4. DAYS_BIRTH: age

shap.dependence_plot('DAYS_BIRTH', shap_values, train[input_var])

The longer ago the person was born (the older they are), the better they tend to repay.

 

 

- 5. DAYS_LAST_PHONE_CHANGE: when the phone was last changed

shap.dependence_plot('DAYS_LAST_PHONE_CHANGE', shap_values, train[input_var])

The longer ago the phone was changed, the better the person tends to repay.

 

 


3. Conclusion

  • The loan repayment period has the largest effect on whether the loan is repaid, and the relationship is nonlinear. (A large effect does not by itself establish a causal relationship.)
  • Home ownership and the number of children have almost no effect on repayment ability.
  • The more recently a person started their job, took out a previous loan, or changed their phone, and the younger they are, the lower the likelihood of repayment.
train['DAYS_EMPLOYED'].quantile(0.75)

-748.0

quantile(0.75) gives the boundary of the top 25% of values. Using this, split the data on the four variables into a group above the top-25% boundary and a group below the bottom-25% boundary, and check the result visually.
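A minimal sketch of what the two quantile cut-offs select, on a made-up series of five values (days is a hypothetical stand-in for a column such as DAYS_EMPLOYED):

```python
import pandas as pd

# Made-up values; like DAYS_EMPLOYED, more negative means longer ago
days = pd.Series([-4000, -3000, -2000, -1000, 0])

q75 = days.quantile(0.75)  # boundary of the top 25%
q25 = days.quantile(0.25)  # boundary of the bottom 25%

upper = days[days > q75]   # most recent values
lower = days[days < q25]   # oldest values

print(q75, q25)        # -1000.0 -3000.0
print(upper.tolist())  # [0]
print(lower.tolist())  # [-4000]
```

Because the comparisons are strict, rows that sit exactly on a boundary fall into neither group.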

 

- Top 25%

group1 = train.loc[ (train['DAYS_EMPLOYED'].quantile(0.75)< train['DAYS_EMPLOYED']) &
           (train['DAYS_CREDIT'].quantile(0.75)< train['DAYS_CREDIT']) &
           (train['DAYS_LAST_PHONE_CHANGE'].quantile(0.75)< train['DAYS_LAST_PHONE_CHANGE']) &
           (train['DAYS_BIRTH'].quantile(0.75)< train['DAYS_BIRTH']) ]

- Bottom 25%

group2 = train.loc[ (train['DAYS_EMPLOYED'].quantile(0.25)> train['DAYS_EMPLOYED']) &
           (train['DAYS_CREDIT'].quantile(0.25)> train['DAYS_CREDIT']) &
           (train['DAYS_LAST_PHONE_CHANGE'].quantile(0.25)> train['DAYS_LAST_PHONE_CHANGE']) &
           (train['DAYS_BIRTH'].quantile(0.25)> train['DAYS_BIRTH']) ]
# assign returns new frames, which avoids pandas' SettingWithCopyWarning on the slices
group1 = group1.assign(group=1)
group2 = group2.assign(group=0)

group1 gets 1 in the group column and group2 gets 0.

full = pd.concat([group1,group2],axis=0)

group1 and group2 are concatenated.

import seaborn as sns
# x and y are passed as keywords; recent seaborn versions no longer accept them positionally
sns.barplot(x='group', y='TARGET', data=full)

group2 (group=0, bottom 25%) has the lower TARGET value (more zeros = normal repayment). This agrees with the conclusion that the smaller the value of each variable, the higher the likelihood of repayment.

 

Learning Spoons (러닝스푼즈) class notes

<Previous post>

https://silvercoding.tistory.com/70

[FIFA DATA] Which players should Manchester United sign for the 2019/2020 season?, EDA process

 

 


1. Data introduction & loading the data

<Rossmann Store Sales> 

https://www.kaggle.com/c/rossmann-store-sales/data?select=test.csv 

This is the Rossmann data used in the Kaggle competition at the link above.

  • train.csv - historical data including Sales
  • test.csv - historical data excluding Sales
  • sample_submission.csv - a sample submission file in the correct format
  • store.csv - supplemental information about the stores

 

This post uses a reduced version of the data to predict store sales.

(Data: provided by Learning Spoons)

 

import os
import pandas as pd
os.chdir('../data')
train = pd.read_csv("lspoons_train.csv")
test = pd.read_csv("lspoons_test.csv")
store = pd.read_csv("store.csv")

lspoons_train.csv - training data
lspoons_test.csv - test data to predict
store.csv - supplementary data with information about the stores

 

 

train.head()


Column information

  • id
  • Store: id of each store
  • Date: date
  • Sales: sales on that date
  • Promo: whether a sales promotion was running
  • StateHoliday: public holiday indicator / not a holiday -> 0, holiday -> type of holiday (a, b, c)
  • SchoolHoliday: whether it was a school holiday

Using the columns above, build a model that predicts Sales.

 

 

 

 

 


- Analysis plan

1. Base modeling ( feature engineering - variable selection - modeling )

2. Second-round modeling ( merge store data - feature engineering - variable selection - modeling )

3. Parameter tuning

... repeat modeling ( further modeling is left as self-directed work, organized on GitHub )


1. Base modeling

: Build the most basic model. (missing-value handling, one-hot encoding)


What is feature engineering?

  • Creating new input variables from the existing input variables for prediction
  • A way to raise the predictive performance of machine learning

train.info()

There are no missing values, and the object-typed columns Date and StateHoliday need preprocessing.

- StateHoliday column one-hot encoding 

train = pd.get_dummies(columns=['StateHoliday'],data=train)
test = pd.get_dummies(columns=['StateHoliday'],data=test)

The get_dummies function one-hot encodes the StateHoliday column.

print("train_columns: ", train.columns, end="\n\n\n")
print("test_columns: ", test.columns)

Looking at the newly created columns, train has b and c but test does not. A model trained on train would then see different columns at prediction time, which causes errors.

test['StateHoliday_b'] = 0
test['StateHoliday_c'] = 0

So the same columns are created in the test dataset.

 

- feature engineering using Date column

train['Date']

The Date column looks like a date, but its dtype is object, so pandas does not treat it as a date.

train['Date'] = pd.to_datetime( train['Date'] )
test['Date'] = pd.to_datetime( test['Date'] )

So it is converted to a datetime variable with to_datetime, the pandas function that makes date calculations convenient.

 

 

# Create the weekday column

train['weekday'] = train['Date'].dt.weekday
test['weekday'] = test['Date'].dt.weekday

# Create the year column

train['year'] = train['Date'].dt.year
test['year'] = test['Date'].dt.year

# Create the month column

train['month'] = train['Date'].dt.month
test['month'] = test['Date'].dt.month
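A small sketch of what the .dt accessors return, using two made-up date strings:

```python
import pandas as pd

# Strings as they would appear in an object-dtype Date column (made-up dates)
dates = pd.to_datetime(pd.Series(["2015-07-31", "2014-01-01"]))

parts = pd.DataFrame({
    "weekday": dates.dt.weekday,  # Monday=0 ... Sunday=6
    "year": dates.dt.year,
    "month": dates.dt.month,
})
print(parts)
# 2015-07-31 is a Friday (weekday 4); 2014-01-01 is a Wednesday (weekday 2)
```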

 

 

- Baseline modeling

from xgboost import XGBRegressor
train.columns

xgb = XGBRegressor( n_estimators= 300 , learning_rate=0.1 , random_state=2020 )
xgb.fit(train[['Promo','SchoolHoliday','StateHoliday_0','StateHoliday_a','StateHoliday_b','StateHoliday_c','weekday','year','month']],
        train['Sales'])

 

Train an XGB model.

 

from sklearn.model_selection import cross_val_score
cross_val_score(xgb, train[['Promo','SchoolHoliday','StateHoliday_0','StateHoliday_a','StateHoliday_b','StateHoliday_c','weekday','year','month']], train['Sales'], scoring="neg_mean_squared_error", cv=3)

Cross validation with all input columns gave the error above (negative mean squared error, so values closer to 0 are better). Let's reduce the error with further work!
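Scores from scoring="neg_mean_squared_error" are negated MSE values, which are hard to read directly. A minimal sketch of turning them into RMSE, in the units of Sales; the three fold scores here are made up, not the output of the model above.

```python
import numpy as np

# Made-up fold scores in the format cross_val_score returns for neg_mean_squared_error
neg_mse_scores = np.array([-4.0, -9.0, -16.0])

mse = -neg_mse_scores  # flip the sign to get ordinary MSE
rmse = np.sqrt(mse)    # the square root brings the error back to the target's units

print(rmse)         # [2. 3. 4.]
print(rmse.mean())  # 3.0
```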

 

 

cf. Creating the Kaggle submission file

test['Sales'] = xgb.predict(test[['Promo','SchoolHoliday','StateHoliday_0','StateHoliday_a','StateHoliday_b','StateHoliday_c','weekday','year','month']])

Feed the test dataset to the trained model to make predictions.

test[['id','Sales']].to_csv("submission.csv",index=False)

 

- Variable selection

xgb.feature_importances_

feature_importances_ gives the importance of each variable.

input_var = ['Promo','SchoolHoliday','StateHoliday_0','StateHoliday_a','StateHoliday_b','StateHoliday_c','weekday','year','month']

Store the input variables, everything except Sales, in input_var.

imp_df = pd.DataFrame({"var": input_var,
                       "imp": xgb.feature_importances_})
imp_df = imp_df.sort_values(['imp'],ascending=False)
imp_df

After building the variable importance dataframe, sort it in descending order. Promo has by far the highest importance, while the StateHoliday columns are low overall.

import matplotlib.pyplot as plt
plt.bar(imp_df['var'],imp_df['imp'])
plt.xticks(rotation=90)
plt.show()

Plotting it to see everything at a glance, the columns after SchoolHoliday look almost meaningless.

cross_val_score(xgb, train[['Promo', 'weekday', 'month','year', 'SchoolHoliday']], train['Sales'], scoring="neg_mean_squared_error", cv=3)

The error is lower than when all columns were used. So next we test how many columns give the lowest error.

import numpy as np
score_list=[]
selected_varnum=[]
for i in range(1,10):
    selected_var = imp_df['var'].iloc[:i].to_list()
    scores = cross_val_score(xgb, 
                             train[selected_var], 
                             train['Sales'], 
                             scoring="neg_mean_squared_error", cv=3)
    score_list.append(-np.mean(scores))
    selected_varnum.append(i)
    print(i)
plt.plot(selected_varnum, score_list)

 

Running cross validation for each number of variables shows that the error is lowest with 2 variables.

Run cross validation with 2 predictors.

cross_val_score(xgb, train[['Promo', 'weekday']], train['Sales'], scoring="neg_mean_squared_error", cv=3)

Every fold except the second improved. After training the model with 2 predictors, the Kaggle score of the test submission also went down. (Omitted from the post because it repeats the same steps.)

 

 

 

 

 


2. Second round of modeling

- Merging the store data

store

store dataset: a summary of the characteristics of each store

Column meanings

  • Store: unique id of the store
  • StoreType: type of store
  • Assortment: level of product assortment the store carries
  • CompetitionDistance: distance to the nearest competitor store
  • CompetitionOpenSinceMonth: month in which the nearest competitor opened
  • CompetitionOpenSinceYear: year in which the nearest competitor opened
  • Promo2: whether the store takes part in a continuing (periodic) sales promotion
  • Promo2SinceWeek / Promo2SinceYear: when the store started Promo2, if it takes part
  • PromoInterval: the months in which each Promo2 round starts

train = pd.merge(train, store, on=['Store'], how='left')
test = pd.merge(test, store, on=['Store'], how='left')

Merge the store dataset into the train and test datasets on the Store column.

 

 

- Creating the CompetitionOpen column

: how many months the nearest competitor has been open as of each record's date (positive: the competitor was already open on that date; negative: it opened later)

train['CompetitionOpen'] = 12*( train['year'] - train['CompetitionOpenSinceYear'] ) + \
                             (train['month'] - train['CompetitionOpenSinceMonth'])

test['CompetitionOpen'] = 12*( test['year'] - test['CompetitionOpenSinceYear'] ) + \
                             (test['month'] - test['CompetitionOpenSinceMonth'])

Subtracting the competitor's opening year from the year of the record and multiplying by 12 converts the difference to months. Adding the difference between the record's month and the competitor's opening month gives the number of months the competitor had been open on that date.
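The arithmetic is easier to check on single values. A small sketch of the same formula as a plain function; the function name and the dates are made up for illustration:

```python
def months_between(year, month, since_year, since_month):
    """Whole months from (since_year, since_month) to (year, month)."""
    return 12 * (year - since_year) + (month - since_month)

# A record from July 2015 for a store whose competitor opened in September 2008.
print(months_between(2015, 7, 2008, 9))   # 82

# A negative value means the competitor had not opened yet on that date.
print(months_between(2013, 1, 2014, 4))   # -15
```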

 

 

- Creating the PromoOpen column

: how many months Promo2 has been running as of each record's date

train['WeekOfYear'] = train['Date'].dt.isocalendar().week.astype(int) # ISO week number of the record's date
test['WeekOfYear'] = test['Date'].dt.isocalendar().week.astype(int)

The Promo2 start is given as a year (Year) and a week (Week), so we compute the week number of each date from the Date column and store it in the WeekOfYear column. (Series.dt.weekofyear was removed in pandas 2.0; dt.isocalendar().week replaces it.)

train['PromoOpen'] = 12* ( train['year'] - train['Promo2SinceYear'] ) + \
                        (train['WeekOfYear'] - train['Promo2SinceWeek']) / 4

test['PromoOpen'] = 12* ( test['year'] - test['Promo2SinceYear'] ) + \
                        (test['WeekOfYear'] - test['Promo2SinceWeek']) / 4

As before, the year difference is converted to months, and the week difference is divided by 4 as a rough conversion to months. Their sum is the number of months since Promo2 started.
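The same calculation for a single date, sketched with the standard library only. The function name and the sample values are assumptions for illustration; dividing by 4 is the rough weeks-per-month figure used above, not an exact conversion.

```python
from datetime import date

def promo_open_months(sale_date, promo2_since_year, promo2_since_week):
    """Approximate months between the Promo2 start and the sale date."""
    _, iso_week, _ = sale_date.isocalendar()
    # Years become months; weeks are divided by 4 as a rough weeks-per-month figure.
    return 12 * (sale_date.year - promo2_since_year) + (iso_week - promo2_since_week) / 4

# 31 July 2015 falls in ISO week 31; Promo2 started in week 13 of 2010.
print(promo_open_months(date(2015, 7, 31), 2010, 13))
```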

 

 

- One-hot encoding ( get_dummies() )

train.dtypes

Checking the data types shows three columns of dtype object. We one-hot encode these three columns with get_dummies.

train = pd.get_dummies(columns=['StoreType'],data=train)
test = pd.get_dummies(columns=['StoreType'],data=test)
train = pd.get_dummies(columns=['Assortment'],data=train)
test = pd.get_dummies(columns=['Assortment'],data=test)
train = pd.get_dummies(columns=['PromoInterval'],data=train)
test = pd.get_dummies(columns=['PromoInterval'],data=test)
train.columns

test.columns

We confirmed that the train columns and the test columns are identical.

 

 

 

- Modeling

input_var = ['Promo', 'SchoolHoliday',
       'StateHoliday_0', 'StateHoliday_a', 'StateHoliday_b', 'StateHoliday_c',
       'weekday', 'year', 'month', 'CompetitionDistance',
       'Promo2',
       'CompetitionOpen', 'WeekOfYear',
       'PromoOpen', 'StoreType_a', 'StoreType_b', 'StoreType_c', 'StoreType_d',
       'Assortment_a', 'Assortment_b', 'Assortment_c',
       'PromoInterval_Feb,May,Aug,Nov', 'PromoInterval_Jan,Apr,Jul,Oct',
       'PromoInterval_Mar,Jun,Sept,Dec']

Leave out the columns we do not need and store the rest in input_var.

set(train) - set(input_var)

(Note) These are the columns that were not included in input_var.

xgb = XGBRegressor( n_estimators=300, learning_rate= 0.1, random_state=2020)
xgb.fit(train[input_var],train['Sales'])

We use the same xgb model as before.

cross_val_score(xgb, train[input_var], train['Sales'], scoring="neg_mean_squared_error", cv=3)

After merging the store dataset, preprocessing, and modeling again, the error dropped sharply.

 

 

- Variable importance

imp_df = pd.DataFrame({'var':input_var,
                       'imp':xgb.feature_importances_})
imp_df = imp_df.sort_values(['imp'],ascending=False)
plt.bar(imp_df['var'],
        imp_df['imp'])
plt.xticks(rotation=90)
plt.show()

Visualizing the variable importance suggests that training on a selected subset of variables would work better than using all of them.

score_list=[]
selected_varnum=[]
for i in range(1,25):
    selected_var = imp_df['var'].iloc[:i].to_list()
    scores = cross_val_score(xgb, 
                             train[selected_var], 
                             train['Sales'], 
                             scoring="neg_mean_squared_error", cv=3)
    score_list.append(-np.mean(scores))
    selected_varnum.append(i)
    print(i)
plt.plot(selected_varnum, score_list)

The error keeps trending downward but looks roughly flat after 17 variables, so we select the top 17 and train with them.

input_var = imp_df['var'].iloc[:17].tolist()
xgb.fit(train[input_var],
        train['Sales'])
cross_val_score(xgb, train[input_var], train['Sales'], scoring="neg_mean_squared_error", cv=3)

The error decreased overall.

 

 

 

 

 

 


3. Parameter tuning

estim_list = [100,200,300,400,500,600,700,800,900]
score_list = []
for i in estim_list:
    xgb = XGBRegressor( n_estimators=i, learning_rate= 0.1, random_state=2020)
    scores = cross_val_score(xgb, train[input_var], train['Sales'], scoring="neg_mean_squared_error", cv=3)
    score_list.append(-np.mean(scores))
    print(i)
plt.plot(estim_list,score_list)
plt.xticks(rotation=90)
plt.show()

We plotted the error for different values of n_estimators, and n_estimators=400 looks like a reasonable choice.

xgb = XGBRegressor( n_estimators=400, learning_rate= 0.1, random_state=2020)
xgb.fit(train[input_var],
        train['Sales'])
cross_val_score(xgb, train[input_var], train['Sales'], scoring="neg_mean_squared_error", cv=3)

Changing it to 400 lowered the error.

 

Unfortunately, after the parameter tuning the error on the Kaggle test dataset came out higher. I should keep trying further feature engineering, such as handling missing values and outliers. (To be uploaded to github later)


 

Learning Spoons class notes

< Previous post >

https://silvercoding.tistory.com/69

 


 

 


After manager Alex Ferguson retired from Manchester United in 2013, the team went into decline. Under manager Solskjær, as of March 2020, the team signed two players in the winter transfer window of the 2019/2020 season and was able to reverse that decline.

What would the result be if the decisions on releasing and signing players were made through analysis of the players' data?


 

 

Data: FIFA data (provided by the Learning Spoons course)

1. Loading the data

import pandas as pd
import warnings 

warnings.filterwarnings(action='ignore')  # suppress warnings
data = pd.read_csv("./data/FIFA_data.csv")
pd.set_option('display.max_columns', 80)

When there are many columns, some of them are abbreviated as ..., so we set the display limit to 80, the number of columns in the data.

data.head()

Now every column can be checked.

 

 

2. Checking the data and planning the analysis

Meaning of each column

ID: unique number
Name: name
Age: age
Overall: current rating
Potential: potential rating
Club: current team
Value: estimated transfer fee (euros)
Wage: weekly wage (euros)
Preferred Foot: the foot the player prefers
Weak Foot: the foot the player uses less
Skill Moves: individual skill
Position: position
Jersey Number: shirt number
Joined: date the player joined the current team
Contract Valid Until: contract period
Height: height (feet)
Weight: weight (pounds)
LS ~ RB: rating at each position
Crossing ~ GKReflexes: detailed ratings
Release Clause: buyout clause

Analysis plan

1. Analyze the Manchester United players (which players are there?)

2. Compare them with the players of the local rival Manchester City

3. Choose the 2 weakest positions

4. Choose 2 players to sign from other teams (considering finances, feasibility, and transfer policy)

 

 

 

 

 


3. Analyzing the Manchester United players

(1) EDA

- Extracting the Manchester United players

mu = data[data['Club'] == 'Manchester United']
mu.head()

Select only the rows whose Club is Manchester United and store them in mu.

mu['Club'].unique()

Checking with unique() confirms that only Manchester United was selected.

 

 

 

- Printing a brief summary of the Manchester United players

print(f"Squad size: {mu.shape[0]}")
print(f"Positions of the Manchester United players: {mu['Position'].unique()}")
print(f"Mean Overall: {mu['Overall'].mean()}")
print(f"Mean Potential: {mu['Potential'].mean()}")

 

 

- Visualization

import seaborn as sns 
sns.countplot(x=mu['Age'])

This is the age distribution of the players. 19-year-olds are the most common, followed by ages 25, 28, and 22.

sns.countplot(x=mu['Position'])

The most common positions are CM and CB.

sns.boxplot(data=mu, x='Position', y='Overall')

Drawing a boxplot of Overall by Position reveals an outlier at the CB position.

 

 

* Handling outliers & missing values

Outliers

  • Values far outside the normal range
  • Including outliers in the analysis can distort the results

Missing values

  • Omitted or empty values
  • Values that were not recorded, or were dropped, when the data was collected

How to handle outliers and missing values

  • Removal: delete the rows or columns that contain outliers or missing values. (A last resort, because every data point is valuable.)
  • Replacement: replace outliers and missing values with the maximum, mean, median, etc. of the column (not a recommended method.)
  • Prediction: fill in a predicted value that takes the characteristics of the affected column into account (recommended)
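A boxplot marks a point as an outlier when it falls outside the whiskers, which by default extend 1.5 times the interquartile range beyond the quartiles. A minimal sketch of that rule with the standard library; the ratings are made up, and the quartile method used here may differ slightly from the one seaborn uses internally:

```python
import statistics

def iqr_outliers(values):
    """Return the values outside the 1.5 * IQR whiskers of a standard boxplot."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

# Made-up Overall ratings for one position, with one impossible entry.
overall = [72, 75, 75, 78, 80, 82, 750]
print(iqr_outliers(overall))
```

For a rating with a known valid range, a plain range check such as the Overall > 100 filter is just as effective.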

mu[mu['Overall']>100]

Check the rows whose Overall is above 100.

 

 

Handling the outlier - by prediction

mu[mu['Position'] == 'CB'][['Position', 'Overall', 'CB']]

Compare the players at the same position. Players with similar CB ratings have the same Overall. The player with the outlier has the same CB rating as the player at index 11081, so the Overall can be predicted as 75.

mu.loc[11422, 'Overall'] = 75

Set the Overall of the player at index 11422 to 75.

sns.boxplot(data=mu, x='Position', y='Overall')

Drawing the boxplot again shows no outliers.

sns.boxplot(data=mu, x='Position', y='Potential')

We also draw the boxplot for Potential. Potential has no outliers.

 

 

 

mu.info()

mu has 33 rows in total, and columns 19 to 44 each turn out to have 3 missing values.

mu[mu.isnull()['LS']]

Only the players whose position is GK have missing values. GK is the goalkeeper, and a goalkeeper does not need ratings for the other positions, so these were presumably left missing.

mu = mu.fillna(-1)

Fill the missing values with -1. (An arbitrary value meaning the rating cannot be measured; any other placeholder would also work.)

mu.info()

All missing values have been filled.

 

 

 

 

 


4. Manchester United vs Manchester City 

(1) Preprocessing

df = data[(data['Club'] == 'Manchester United') | (data['Club']=='Manchester City')]

Select only Manchester United and Manchester City and store them in df.

df['Club'].unique()

df['Value'].head()

The transfer fee Value is written with symbols, so we remove the symbols and the decimal point.

df['Value'] = df['Value'].str.replace('M', '000000')
df['Value'] = df['Value'].str.replace('K', '000')

Where there is an M, append six zeros; where there is a K, append three zeros.

df['Value']

df['Value'] = df['Value'].str.slice(1,)

Then remove the currency symbol with str.slice.

df['Value'].iloc[3]

'64.5000000'

Some values like this one contain a decimal point, so we remove the point and drop one trailing zero.

# Fix only the values that contain a decimal point: remove the point, then drop one trailing zero
has_point = df['Value'].str.contains('.', regex=False)
df.loc[has_point, 'Value'] = df.loc[has_point, 'Value'].str.replace('.', '', regex=False).str.slice(0, -1)
df['Value']

The fix was applied correctly.

df['Value'] = df['Value'].astype('int')

Now convert the dtype from object to int.
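The string manipulation above depends on every value having at most one decimal digit. A sketch of a more direct conversion, assuming the raw values look like '€64.5M' or '€500K' (the function name and the sample strings are assumptions, not taken from the dataset):

```python
def parse_value(text):
    """Convert strings such as '€64.5M' or '€500K' to an integer number of euros."""
    multipliers = {"M": 1_000_000, "K": 1_000}
    number = text.lstrip("€")
    if number and number[-1] in multipliers:
        return round(float(number[:-1]) * multipliers[number[-1]])
    return round(float(number))

print(parse_value("€64.5M"))   # 64500000
print(parse_value("€500K"))    # 500000
print(parse_value("€0"))       # 0
```

On the original, unmodified column this could be applied with df['Value'].map(parse_value).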

df.head()

 

 

 

- Separating the mu and mc players

mu = df[df['Club'] == "Manchester United"]
mc = df[df['Club'] == "Manchester City"]

Split df into the Manchester United and Manchester City players.

mc.head()

df['Position'].unique()

We group the positions above into four categories (goalkeeper, defender, midfielder, attacker) for the analysis. The grouping is as follows.

  • Goalkeeper list GK = GK (goalkeeper)
  • Defender list CB = CB (center back), LB (left back), RB (right back), RCB (right center back), LCB (left center back)
  • Midfielder list MF = RCM (right center midfielder), LCM (left center midfielder), RDM (right defensive midfielder), CDM (center defensive midfielder), CM (center midfielder), RM (right midfielder), CAM (center attacking midfielder)
  • Attacker list ST = ST (striker), LW (left winger), RW (right winger)

* Selection: GK (goalkeeper): 1, CB (defenders): 4, MF (midfielders): 4, ST (attackers): 2

-> The selection criterion is the current rating (Overall column)

gk_list = ['GK']
cb_list = ['CB', 'LCB', 'RCB', 'RB', 'LB']
mf_list = ['RCM', 'LCM', 'RDM', 'CDM', 'CM', 'RM', 'CAM']
st_list = ['ST', 'LW', 'RW']

Write the lists according to this grouping.

 

gk_count = 1
cb_count = 4
mf_count = 4
st_count = 2



mu_id = []

for index in mu.index:
    position = mu.loc[index, 'Position']
    if position in gk_list:
        if gk_count != 0:
            mu_id.append(mu.loc[index, 'ID'])
            gk_count -= 1
    elif position in cb_list:
        if cb_count != 0:
            mu.loc[index, 'Position'] = 'CB'   # relabel with the group name
            mu_id.append(mu.loc[index, 'ID'])
            cb_count -= 1
    elif position in mf_list:
        if mf_count != 0:
            mu.loc[index, 'Position'] = 'MF'
            mu_id.append(mu.loc[index, 'ID'])
            mf_count -= 1
    else:
        if st_count != 0:
            mu.loc[index, 'Position'] = 'ST'
            mu_id.append(mu.loc[index, 'ID'])
            st_count -= 1

The data is sorted by current rating in descending order, so going through the rows in order puts the IDs of the top players of each position group into the list.
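The same selection logic can be sketched as a reusable function, so that it does not have to be written out once per team. This is only an illustration in plain Python: the function name, the tuple layout, and the player IDs are made up.

```python
def pick_starters(players, groups, quota):
    """Pick the first quota[g] players of each group from a list sorted by Overall (descending).

    players: list of (player_id, position) tuples
    groups:  dict mapping a group name to the detailed positions it contains
    quota:   dict mapping a group name to how many players to pick
    """
    remaining = dict(quota)
    picked = []
    for player_id, position in players:
        # Positions not listed in any group fall back to 'ST', like the else-branch of the loop above.
        group = next((g for g, members in groups.items() if position in members), "ST")
        if remaining.get(group, 0) > 0:
            picked.append((player_id, group))
            remaining[group] -= 1
    return picked

groups = {"GK": ["GK"], "CB": ["CB", "LCB", "RCB", "RB", "LB"],
          "MF": ["RCM", "LCM", "RDM", "CDM", "CM", "RM", "CAM"], "ST": ["ST", "LW", "RW"]}
quota = {"GK": 1, "CB": 1, "MF": 1, "ST": 1}
players = [(1, "GK"), (2, "RCM"), (3, "GK"), (4, "LCB"), (5, "CM"), (6, "LW")]  # made-up IDs
print(pick_starters(players, groups, quota))
```

Because the quota is copied inside the function, the counters do not need to be reset before the second team.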

mu[mu['ID'].isin(mu_id)]

Exactly 11 players come out, as intended.

mu = mu[mu['ID'].isin(mu_id)]

Keep only the 11 selected players in the mu variable.

 

 

 

Repeat the same procedure for Manchester City.

gk_count = 1
cb_count = 4
mf_count = 4
st_count = 2


mc_id = []

for index in mc.index:
    position = mc.loc[index, 'Position']
    if position in gk_list:
        if gk_count != 0:
            mc_id.append(mc.loc[index, 'ID'])
            gk_count -= 1
    elif position in cb_list:
        if cb_count != 0:
            mc.loc[index, 'Position'] = 'CB'   # relabel with the group name
            mc_id.append(mc.loc[index, 'ID'])
            cb_count -= 1
    elif position in mf_list:
        if mf_count != 0:
            mc.loc[index, 'Position'] = 'MF'
            mc_id.append(mc.loc[index, 'ID'])
            mf_count -= 1
    else:
        if st_count != 0:
            mc.loc[index, 'Position'] = 'ST'
            mc_id.append(mc.loc[index, 'ID'])
            st_count -= 1
mc = mc[mc['ID'].isin(mc_id)]

 


concat vs merge

merge: joins side by side (adds columns); concat: stacks top to bottom (adds rows)
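The difference is easiest to see on two tiny frames. A short sketch with made-up data (the column values are not from the FIFA or Rossmann datasets):

```python
import pandas as pd

left = pd.DataFrame({"Store": [1, 2], "Sales": [100, 200]})
right = pd.DataFrame({"Store": [1, 2], "StoreType": ["a", "c"]})

# merge: match rows on a key and place the columns side by side.
merged = pd.merge(left, right, on="Store", how="left")

# concat: stack frames that share the same columns on top of each other.
stacked = pd.concat([left, left], ignore_index=True)

print(merged.shape, stacked.shape)
```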


df = pd.concat([mu, mc])

Combine the selected mu and mc players and store them in df.

 

 

(2) EDA 

- mu vs mc: comparing the current rating (Overall) of the starters by position

sns.boxplot(data=df, x='Position', y='Overall', hue='Club')

Except for the goalkeeper, Manchester United is lower at every position.

 

 

- mu vs mc: comparing the estimated transfer fee (Value) of the starters by position

sns.boxplot(data=df, x='Position', y='Value', hue='Club')

Apart from the goalkeeper, Manchester United's transfer fees are about the same or higher.

Comparing the two teams with the boxplots above, MF and CB are the positions where the rating falls short relative to the transfer fee, so we analyze which players to sign for these two positions.

 

 

 


5. Which players should Manchester United sign?

(1) EDA

* Selecting the players to release

Build a formula based on joining date, rating, potential, and age

 Point = (Overall * 2 + Potential) / Age 

The higher the rating (weighted double) and the potential, and the lower the age, the better.

mu['Point'] = (mu['Overall'] * 2 + mu['Potential']) / mu['Age']
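To get a feel for how the formula behaves, here is the same calculation for single players. The ratings and ages are made up for illustration:

```python
def point(overall, potential, age):
    """Point = (Overall * 2 + Potential) / Age, as defined above."""
    return (overall * 2 + potential) / age

# Made-up players with identical ratings but different ages.
print(point(80, 84, 22))
print(point(80, 84, 31))
```

With identical ratings the older player always scores lower, so the formula favors releasing older players.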

 

- MF position

mu[mu['Position'] == 'MF'][['Name', 'Overall', 'Potential', 'Age', 'Joined', 'Point']]

The lowest Point belongs to the player at index 211.

- CB position

mu[mu['Position'] == 'CB'][['Name', 'Overall', 'Potential', 'Age', 'Joined', 'Point']]

The lowest Point belongs to the player at index 377.

Release the two players, Mata and Smalling, and sign one player each for the MF and CB positions.

 

 

(2) Visualization

Visualizing all players - choosing the signings according to the transfer policy

Manchester United transfer policy (manager Solskjær)

- The younger the player, the better

- Players who can start right away rather than players with potential

market = data[(data['Position']=='RM') | (data['Position']=='CB')]

For the positions we select RM and CB, the detailed positions of the two players chosen for release.

market.head()

import matplotlib.pyplot as plt
f, ax = plt.subplots(2, 4, figsize=(20, 10))

vs_list = ['Age', 'Overall', 'Potential', 'Weak Foot']

for row, pos in enumerate(['CB', 'RM']):
    top = market[market['Position'] == pos][:13]   # first 13 listed players at this position
    for col, var in enumerate(vs_list):
        mean = top[var].mean()
        # highlight the players above the mean of these 13
        colors = ['firebrick' if x > mean else 'gray' for x in top[var]]
        sns.barplot(x=var, y='Name', data=top, ax=ax[row, col], palette=colors)
        ax[row, col].axvline(mean, ls='--', color='k')

Setting everything else aside and judging only by age, current rating, and potential, I concluded that following the transfer policy the signings would be S. Umtiti and K. Mbappé.

Learning Spoons class notes

< Previous post >

https://silvercoding.tistory.com/67

 


 

 


 

Explaining 'what the conclusion is' is an important part of a data scientist's job.

From the prediction results alone, you cannot explain which patterns the model used or why it predicted the way it did. Collaborators from other fields will then lose trust in the model.

Take an example from a business point of view. Suppose a project that predicts box-office performance with machine learning predicts a flop; the next question may well be how to prevent that flop. If the prediction does not help fix the underlying weaknesses, it has no value from a business point of view.

So being able to explain the results is very important. Variable importance helps here: it lets you examine in detail which variables had a large effect on the prediction and how a particular variable affected it.


Variable importance

- Which of the input variables used in the model had the largest effect on the target value?
- A numerical measure of that importance
- Can be computed for tree-based models (decision trees, random forests)

In the previous posts we computed variable importance for the tree-based models random forest and xgboost.

(See also) bagging, boosting

Variable importance in a decision tree

- How often the input variable is used in building the decision tree
- How much the complexity (impurity) of each region decreases when the tree splits on that variable



Shapley value

: the size of each variable's influence on the prediction

: what kind of influence the variable has

(Example) Soccer player A, who plays for team B

- The size of each player's influence on the team's results

- What kind of influence the player has

- (Win rate of team B with player A) - (win rate of team B without player A) = 7%
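The with/without difference in the example is a marginal contribution. The Shapley value averages that contribution over every order in which the players could join the team. A minimal exact computation for three players, with made-up win rates (this is a toy illustration, not how the shap library computes values for tree models):

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all join orders."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            totals[p] += value(coalition | {p}) - value(coalition)
            coalition = coalition | {p}
    return {p: t / len(orders) for p, t in totals.items()}

# Made-up win rates (in %) for every subset of three players.
win_rate = {
    frozenset(): 0, frozenset("A"): 10, frozenset("B"): 20, frozenset("C"): 30,
    frozenset("AB"): 40, frozenset("AC"): 50, frozenset("BC"): 60, frozenset("ABC"): 90,
}
phi = shapley_values("ABC", win_rate.__getitem__)
print(phi)
```

The three values add up to the win rate of the full team, which is the property that lets each one be read as a player's share of the result.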


 shap value practice

To focus on the shap value practice, we run everything up to the Xgboost training exactly as before.

Loading the data

import os
import pandas as pd
import numpy as np
os.chdir('./data') # your own path
data = pd.read_csv("bank-additional-full.csv", sep = ";")

This is the term-deposit subscription dataset used in the previous post.

data = pd.get_dummies(data, columns = ['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome'])

One-hot encode the categorical variables with get_dummies.

data['y'].value_counts()

Since this is a classification problem, the target variable is naturally categorical as well.

data['y'] = np.where( data['y'] == 'no', 0, 1)

However, the shap package works best when the target variable is numeric, so we convert it to numbers.

 

 

 

Xgboost training

input_var = ['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate',
       'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed',
       'job_admin.', 'job_blue-collar', 'job_entrepreneur', 'job_housemaid',
       'job_management', 'job_retired', 'job_self-employed', 'job_services',
       'job_student', 'job_technician', 'job_unemployed', 'job_unknown',
       'marital_divorced', 'marital_married', 'marital_single',
       'marital_unknown', 'education_basic.4y', 'education_basic.6y',
       'education_basic.9y', 'education_high.school', 'education_illiterate',
       'education_professional.course', 'education_university.degree',
       'education_unknown', 'default_no', 'default_unknown', 'default_yes',
       'housing_no', 'housing_unknown', 'housing_yes', 'loan_no',
       'loan_unknown', 'loan_yes', 'contact_cellular', 'contact_telephone',
       'month_apr', 'month_aug', 'month_dec', 'month_jul', 'month_jun',
       'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep',
       'day_of_week_fri', 'day_of_week_mon', 'day_of_week_thu',
       'day_of_week_tue', 'day_of_week_wed', 'poutcome_failure',
       'poutcome_nonexistent', 'poutcome_success']

Put all input variables, everything except the y column, into a list.

from xgboost import XGBRegressor

To predict a numeric value, import the XGBRegressor regression model.

xgb = XGBRegressor( n_estimators = 300, learning_rate=0.1 )
xgb.fit(data[input_var], data['y'])

Train the Xgboost model.

 

 

Shap Value examples

import shap

Import the shap library.

(1) Variable importance

explainer = shap.TreeExplainer(xgb)
shap_values = explainer.shap_values( data[input_var] )

Pass the trained model xgb to shap.TreeExplainer and store the resulting object. Then pass the input values of the dataset to explainer.shap_values.

shap.summary_plot( shap_values , data[input_var] , plot_type="bar" )

Draw the variable importance plot with shap.summary_plot. The most important variable is duration, the length of the phone call. This means that call length has the largest effect on this model's predictions.

 

 

(2) dependence plot 

: shows the relationship between a particular input variable and the target variable

: each dot is one row (one data point); the y axis is its effect on the target variable

: lets you see in detail how the variable affected the prediction

shap.dependence_plot( 'duration' , shap_values , data[input_var] )

In the duration plot, most duration values lie below 3000, and within that range a duration of roughly 50 or more has a positive effect, making 1 more likely. (Many points have a shap value for duration above 0.)

shap.dependence_plot( 'nr.employed' , shap_values , data[input_var] )

At around 5020 the effect turns negative, and beyond 5100 the effect is only negative. (-> 0 is more likely) Below that the shap values are high, so the effect is positive. (-> 1 is more likely)

shap.dependence_plot( 'euribor3m' , shap_values , data[input_var] )

Negative and positive values look about evenly distributed. The ranges with few negative and many positive values are from about 1.3~1.4 to 2, and from 4 to 5. In those ranges 1 can be read as more likely.

shap.dependence_plot( 'cons.conf.idx' , shap_values , data[input_var] )

Overall the values are negative. At -45 or below, 1 can be read as more likely.

shap.dependence_plot( 'pdays' , shap_values , data[input_var] )

When pdays is 0, we can expect 1 to be more likely for most of the data.

 

 

 

(3) force plot

: visualizes how a particular value was predicted
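A force plot starts from a base value (the expected model output) and adds each variable's shap value to arrive at the prediction for one row. The arithmetic behind it can be sketched with made-up numbers; the base value and the contributions below are hypothetical, not taken from the trained model:

```python
# Hypothetical base value: the average model output over the data.
expected_value = 0.11

# Hypothetical shap values of one row: positive pushes the prediction up, negative pulls it down.
shap_row = {"duration": -0.06, "nr.employed": 0.05, "euribor3m": -0.01}

# The prediction for the row is the base value plus the sum of all contributions.
prediction = expected_value + sum(shap_row.values())
print(round(prediction, 2))
```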

prediction = xgb.predict(data[input_var])
data['pred'] = prediction

 

shap.initjs()
shap.force_plot( explainer.expected_value , shap_values[41187] , data[input_var].iloc[41187] )

The prediction for row 41187 is 0.09; the variables pulling it down and the variables pushing it up are fairly evenly balanced.

shap.force_plot( explainer.expected_value , shap_values[0] , data[input_var].iloc[0] )

Row 0 was predicted very close to 0, and almost every variable had a negative effect.

For row 41183, the positive effects are much larger. The prediction was therefore 0.88, close to the correct answer of 1.

 

 

In this way, the shap library let us examine in detail how each variable affected the predictions.


 

 

 

 

 

 

 

 

 

 

 

 

 

 

Learning Spoons class notes

< Previous post >

https://silvercoding.tistory.com/66

 


 

 


Boosting

Securing diversity among the models (boosting procedure)

  • Raise the weights of the objects the previous model misclassified, and train a model on the new (reweighted) data
  • Build a model from each dataset
  • The diversity of the datasets each model learns from gives diversity among the models

Combining the final outputs

  • Weighted average of the predictions from each model
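The reweighting step can be sketched with one round of an AdaBoost-style update. This is an assumption chosen for illustration: it matches the procedure described above, whereas XGBoost itself fits each new tree to gradients of the loss rather than reweighting samples. The function name and the sample data are made up.

```python
import math

def reweight(weights, misclassified):
    """One AdaBoost-style round: returns (model_weight, new_sample_weights).

    Assumes the weighted error is strictly between 0 and 1.
    """
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong) / sum(weights)
    alpha = 0.5 * math.log((1 - error) / error)   # weight of this model in the final vote
    raised = [w * math.exp(alpha if wrong else -alpha)
              for w, wrong in zip(weights, misclassified)]
    total = sum(raised)
    return alpha, [w / total for w in raised]

weights = [0.2] * 5                                 # five samples, equal weight at the start
misclassified = [False, False, True, False, False]  # the previous model got the third sample wrong
alpha, new_weights = reweight(weights, misclassified)
print(alpha, new_weights)
```

The misclassified sample ends up with a larger weight than the others, so the next model pays more attention to it, and alpha is the weight this model would get in the final weighted average.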

 

Setting n_estimators

(n_estimators: how many decision trees to build)

  • If n_estimators is too high, there is a risk of overfitting that is sensitive to noise
  • If n_estimators is too low, there is a risk of underfitting
  • The key is finding an appropriate n_estimators

 

 


 

 


Loading the data

import os
import pandas as pd
os.chdir('../data')   # path of the folder that contains your file
data = pd.read_csv("bank-additional-full.csv", sep = ';')
data.head()

data.info()

 

 

 

Preprocessing - one-hot encoding the categorical variables

data = pd.get_dummies(data,columns=['job','marital','education','default','housing','loan','contact','month','day_of_week','poutcome'])

One-hot encode the categorical variables, whose dtype is object, with get_dummies.

 

 

 

train & test ๋ฐ์ดํ„ฐ์…‹ ๋ถ„๋ฆฌ 

data['id']=range(len(data))

๋ฐ์ดํ„ฐ๋ฅผ ๊ตฌ๋ถ„ํ•˜๊ธฐ ์œ„ํ•˜์—ฌ ๊ฐ row์— id๋ฅผ ๋ถ€์—ฌํ•œ๋‹ค. 

train = data.sample(30000,replace=False,random_state=2020).reset_index().drop(['index'],axis=1)
test = data.loc[ ~data['id'].isin(train['id']) ].reset_index().drop(['index'],axis=1)

As in the previous post, the data is split into train and test datasets.

 

 

 

Storing the input variables

data.columns

input_var = ['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate',
       'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed',
       'job_admin.', 'job_blue-collar', 'job_entrepreneur', 'job_housemaid',
       'job_management', 'job_retired', 'job_self-employed', 'job_services',
       'job_student', 'job_technician', 'job_unemployed', 'job_unknown',
       'marital_divorced', 'marital_married', 'marital_single',
       'marital_unknown', 'education_basic.4y', 'education_basic.6y',
       'education_basic.9y', 'education_high.school', 'education_illiterate',
       'education_professional.course', 'education_university.degree',
       'education_unknown', 'default_no', 'default_unknown', 'default_yes',
       'housing_no', 'housing_unknown', 'housing_yes', 'loan_no',
       'loan_unknown', 'loan_yes', 'contact_cellular', 'contact_telephone',
       'month_apr', 'month_aug', 'month_dec', 'month_jul', 'month_jun',
       'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep',
       'day_of_week_fri', 'day_of_week_mon', 'day_of_week_thu',
       'day_of_week_tue', 'day_of_week_wed', 'poutcome_failure',
       'poutcome_nonexistent', 'poutcome_success']

All columns of data except the target y (and the helper id column) are stored in input_var.
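
Instead of copying the column names by hand, the same list can probably be built programmatically. A minimal sketch, assuming the only columns to leave out are y and id; the toy frame below is hypothetical and just stands in for data.

```python
import pandas as pd

# hypothetical stand-in for the encoded bank data
toy = pd.DataFrame({
    "age": [30, 45],
    "job_admin.": [1, 0],
    "y": ["no", "yes"],
    "id": [0, 1],
})

excluded = {"y", "id"}  # target and helper column, not model inputs
input_var = [col for col in toy.columns if col not in excluded]
print(input_var)
```

This keeps input_var in sync if the set of dummy columns changes.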

 

 

 

 

 


Training the XGBoost model


XGBoost 

- Characteristics

  • Hard to interpret
  • Generally faster than random forest, with better performance

- xgb = XGBClassifier( n_estimators = 300, learning_rate = 0.1 )

  • n_estimators : how many decision trees to build
  • learning_rate : how fast to learn (the step size applied to each new tree's contribution)

- Installation

!pip install xgboost

If xgboost is not installed yet, install it first.

from xgboost import XGBClassifier
xgb = XGBClassifier( n_estimators = 300, learning_rate = 0.1 )
xgb.fit(train[input_var], train['y'])

Create the classifier object and train it on the train dataset.
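
One caveat, stated as an assumption: if y still holds the raw 'yes' / 'no' strings, recent xgboost releases may refuse to fit and ask for integer class labels instead. A minimal sketch of mapping the target to 0 / 1 beforehand (train_toy and y_numeric are hypothetical names for this example):

```python
import pandas as pd

# hypothetical stand-in for train['y']
train_toy = pd.DataFrame({"y": ["no", "yes", "no"]})

# map the string labels to integers before passing them to the classifier
y_numeric = train_toy["y"].map({"no": 0, "yes": 1})
print(y_numeric.tolist())
```

If the target was already converted to 0 / 1 earlier, this step is not needed.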

predictions = xgb.predict(test[input_var])

Run predictions on the test dataset and store them in predictions.

(pd.Series(predictions)==test['y']).mean()

The accuracy came out at about 91%. This model uses n_estimators = 300. As noted earlier, choosing n_estimators appropriately is the key to avoiding both overfitting and underfitting in boosting, so next we look for the best n_estimators.

 

 

 

Finding the best number of decision trees ( n_estimators )

for n in [100,200,300,400,500,600,700,800,900]:
    xgb = XGBClassifier( n_estimators = n, learning_rate = 0.05, eval_metric='logloss' )
    xgb.fit(train[input_var], train['y'])
    predictions = xgb.predict(test[input_var])
    print((pd.Series(predictions)==test['y']).mean())

Result: the best n_estimators is 400.
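
The loop above picks n_estimators by looking at test accuracy, which tends to make the reported score a bit optimistic. A common alternative is to choose it on a separate validation split. The sketch below is an illustration under stated assumptions, not the method used in this post: it swaps in scikit-learn's GradientBoostingClassifier for XGBClassifier and uses synthetic data from make_classification, so that one fitted model can be scored after every added tree with staged_predict.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# synthetic stand-in data (not the bank dataset)
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# fit once with the largest number of trees we want to consider
gbm = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.05, random_state=0
)
gbm.fit(X_fit, y_fit)

# validation accuracy after 1, 2, ..., 200 trees
val_acc = [np.mean(pred == y_val) for pred in gbm.staged_predict(X_val)]
best_n = int(np.argmax(val_acc)) + 1  # +1 because stages start at one tree
print(best_n, max(val_acc))
```

The test set is then used only once, at the end, to report the accuracy of the chosen model.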

 

 

Feature importance

feature_imp = xgb.feature_importances_

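Feature importances can be read from feature_importances_. Note that, as the code is written, xgb at this point is the model from the last loop iteration (n_estimators = 900) rather than the 400-tree model reported as best.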

imp_df = pd.DataFrame({'var':input_var,
                       'imp':feature_imp})

imp_df.sort_values(['imp'],ascending=False)

Sorting the feature importances in descending order shows that the nr.employed column comes out as the most important variable.


 

 

 

 

 

Learning Spoons class notes

< Previous post >

https://silvercoding.tistory.com/65

 

[IRIS data analysis] 2. Python Decision Tree


 

 

 


Bagging

- The philosophy of bagging

1. The more, the better.

2. The more diverse, the better.

(ex) 1 man < 10 men (more people) < 5 men and 5 women (more people and more diverse)

 

 

- How is diversity among the models secured? (the bagging process)

1. Randomly sample from the full dataset (sampling with replacement: some rows may appear more than once, and some may never be drawn) -> several datasets are created

2. Build a model from each dataset

3. Since each model learns from a different dataset, the models are diverse
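
Step 1 of the process above can be sketched with NumPy. This is a toy illustration with ten made-up row ids; the variable names are my own.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows = 10
row_ids = np.arange(n_rows)

# sampling with replacement: same size as the data, duplicates allowed
bootstrap_ids = rng.choice(row_ids, size=n_rows, replace=True)

# rows that were never drawn for this bootstrap sample
out_of_bag_ids = np.setdiff1d(row_ids, bootstrap_ids)

print(np.sort(bootstrap_ids))
print(out_of_bag_ids)
```

Repeating this with different random draws gives every model its own slightly different dataset.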

 

 

- How are the final outputs combined?

: Take the simple average of the predictions from each model.
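
As a small worked example of the simple average, assume three models each output a probability for three customers. The numbers below are invented purely for illustration.

```python
import numpy as np

# rows = models, columns = customers (made-up probabilities)
model_probs = np.array([
    [0.9, 0.2, 0.6],
    [0.7, 0.4, 0.4],
    [0.8, 0.3, 0.8],
])

avg_prob = model_probs.mean(axis=0)       # simple average over the models
labels = (avg_prob >= 0.5).astype(int)    # threshold the averaged probability
print(avg_prob, labels)
```

Every model gets the same say here, in contrast to the weighted average used in boosting.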

 

 

- Random forest (the model used in this post)

: An algorithm that follows the bagging process and uses decision trees as its models


 

 

 


Exploring the data

The data used here can be downloaded from Kaggle Datasets.

< Bank Marketing dataset > 

https://www.kaggle.com/volodymyrgavrysh/bank-marketing-campaigns-dataset

 


import os
import pandas as pd
os.chdir('../data')  # path to the folder that contains your data file
data = pd.read_csv("bank-additional-full.csv", sep = ';')

One thing to watch out for when loading the data is that sep=';' must be set. Although this is a csv file, its fields are separated by semicolons (;) rather than commas (,).
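
The effect of sep can be checked without the real file. The sketch below parses a small in-memory string (made-up rows in the same semicolon-separated layout) once with sep=';' and once with the default comma.

```python
import io
import pandas as pd

# hypothetical rows in a semicolon-separated layout
raw = "age;job;y\n56;housemaid;no\n57;services;no\n"

df = pd.read_csv(io.StringIO(raw), sep=";")   # three proper columns
wrong = pd.read_csv(io.StringIO(raw))         # default sep=",": one merged column

print(df.shape, wrong.shape)
```

Without sep=';' every row ends up squeezed into a single column, which is easy to spot with head().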

data.head()

Using predictors such as age, job, marital status, and loan status, the model learns to predict whether the customer subscribes to a deposit.

data.info()

Variables whose dtype is object are categorical and need to be one-hot encoded.

 

 

 

 


Using random forest

Preprocessing - one-hot encoding the categorical variables

- Extracting the columns whose dtype is object

obj_column = []
for column in data.columns[:-1]:
    if data[column].dtype == 'object':
        obj_column.append(column)
        
obj_column

data = pd.get_dummies(data,columns=obj_column)

One-hot encoding is carried out with get_dummies.
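
The loop that collects the object columns can also be written with select_dtypes. A minimal sketch on a hypothetical toy frame; the columns are cast to object explicitly so that the example behaves the same regardless of how the installed pandas version infers string columns.

```python
import pandas as pd

# hypothetical stand-in for the raw bank data
toy = pd.DataFrame({
    "age": [30, 45],
    "job": ["admin.", "student"],
    "marital": ["single", "married"],
    "y": ["no", "yes"],
}).astype({"job": "object", "marital": "object", "y": "object"})

# drop the target first, then keep only the object-typed columns
obj_column = (
    toy.drop(columns=["y"]).select_dtypes(include="object").columns.tolist()
)
print(obj_column)
```

The result plays the same role as the obj_column list built by the for loop.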

data

We can see that the number of columns has grown considerably.

data['id']=range(len(data))

An id value is assigned to tell the rows apart.

 

 

- Splitting into train & test datasets

train = data.sample(30000,replace=False,random_state=2020).reset_index().drop(['index'],axis=1)

The train dataset is built from 30000 rows sampled without replacement.

test = data.loc[ ~data['id'].isin(train['id']) ].reset_index().drop(['index'],axis=1)

The test dataset consists of the rows whose id is not in train, 11188 rows in total.
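
The two sizes add up to 30000 + 11188 = 41188 rows. The same split logic can be checked on a small scale; the sketch below uses a hypothetical 10-row frame and reset_index(drop=True), which has the same effect as reset_index() followed by dropping the index column.

```python
import pandas as pd

# hypothetical 10-row stand-in for data
toy = pd.DataFrame({"value": range(10)})
toy["id"] = range(len(toy))

# sample without replacement for train, keep the remaining ids for test
train = toy.sample(7, replace=False, random_state=2020).reset_index(drop=True)
test = toy.loc[~toy["id"].isin(train["id"])].reset_index(drop=True)

print(len(train), len(test))
```

Because test is defined by the ids missing from train, the two sets never share a row.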

 

 

 

 

Training the random forest model


Random forest

- Characteristics

  • Hard to interpret
  • Very slow
  • Can extract more objective feature importances than a single decision tree

 

- RandomForestClassifier(n_estimators=m, min_samples_split=n)

  • n_estimators : how many decision trees to build
  • max_depth : the maximum depth of each decision tree
  • min_samples_split : the minimum number of samples a node must contain before it can be split

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=500, min_samples_split=10)

Create the random forest object.

data.columns

input_var = ['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate',
       'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed',
       'job_admin.', 'job_blue-collar', 'job_entrepreneur', 'job_housemaid',
       'job_management', 'job_retired', 'job_self-employed', 'job_services',
       'job_student', 'job_technician', 'job_unemployed', 'job_unknown',
       'marital_divorced', 'marital_married', 'marital_single',
       'marital_unknown', 'education_basic.4y', 'education_basic.6y',
       'education_basic.9y', 'education_high.school', 'education_illiterate',
       'education_professional.course', 'education_university.degree',
       'education_unknown', 'default_no', 'default_unknown', 'default_yes',
       'housing_no', 'housing_unknown', 'housing_yes', 'loan_no',
       'loan_unknown', 'loan_yes', 'contact_cellular', 'contact_telephone',
       'month_apr', 'month_aug', 'month_dec', 'month_jul', 'month_jun',
       'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep',
       'day_of_week_fri', 'day_of_week_mon', 'day_of_week_thu',
       'day_of_week_tue', 'day_of_week_wed', 'poutcome_failure',
       'poutcome_nonexistent', 'poutcome_success']

From the columns returned above, all columns except y are stored in the input_var variable.

rf.fit(train[input_var],train['y'])

Train the random forest classifier on the train dataset.

predictions = rf.predict(test[input_var])

Run predictions on the test dataset and store them in the predictions variable.

(pd.Series(predictions)==test['y']).mean()

Comparing predictions with the true values (y) and taking the mean gives an accuracy of about 91%.
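
Taking the mean of the element-wise comparison is the same thing as accuracy. A small check on made-up labels, comparing the manual expression with scikit-learn's accuracy_score:

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# made-up labels: three of the four predictions are correct
y_true = ["no", "yes", "no", "no"]
y_pred = ["no", "yes", "yes", "no"]

manual = float((pd.Series(y_pred) == pd.Series(y_true)).mean())
library = float(accuracy_score(y_true, y_pred))
print(manual, library)
```

Either form works; accuracy_score just makes the intent explicit.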

 

 

 

* Comparison with a decision tree

from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(min_samples_split=10)

Create the decision tree object.

dt.fit(train[input_var], train['y'])

predictions = dt.predict(test[input_var])

Train on the training data and predict on the test data.

(pd.Series(predictions) == test['y']).mean()

Comparing the accuracies shows that the random forest model is slightly more accurate than the decision tree.

 

 

 

Feature importance

feature_imp = rf.feature_importances_
imp_df = pd.DataFrame({'var':input_var,
                       'imp':feature_imp})

imp_df.sort_values(['imp'],ascending=False)

Feature importances can be obtained from feature_importances_. Sorted in descending order, duration is the highest and the default_yes column is the lowest. (The concept of feature importance is covered in detail in the post after next.)
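
One detail that helps when reading these numbers: for scikit-learn forests the importances are normalized, so they sum to 1 and each value can be read as a share. A minimal sketch on synthetic data (make_classification and the feature names x0 to x4 are assumptions for the example, not the bank data):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in data with five features
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, random_state=0
)
names = [f"x{i}" for i in range(5)]

rf = RandomForestClassifier(
    n_estimators=50, min_samples_split=10, random_state=0
).fit(X, y)

# a Series keeps the names attached while sorting
imp = pd.Series(rf.feature_importances_, index=names).sort_values(
    ascending=False
)
print(imp)
print(imp.sum())
```

Using a Series indexed by feature name is a compact alternative to building the var / imp DataFrame by hand.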


 

 

 
