This post summarizes the Fast Campus course "Deep Learning Image Recognition Bible, Starting from Python Basics".

 

Before the first half of this year is over, I wanted to study deep learning in depth. I picked up a fair amount of theory (CNN, RNN, LSTM and so on) as an undergraduate, but I had never built models with a deep learning framework, so my goal is to become fluent in at least one. Following the course, I will use keras. The keras documentation explains its usage, and basic modeling can probably be learned from the documentation alone.

 

 

 


1. Import the required libraries

# TensorFlow and tf.keras
import tensorflow as tf 
from tensorflow import keras 
#  Helper libraries 
import numpy as np 
import matplotlib.pyplot as plt 
import math

print(tf.__version__) # check the TensorFlow version

Import tensorflow and keras. I ran this exercise on Google Colab, and at the time of writing the tf version was 2.8.0. The other libraries needed (numpy, matplotlib, math) are imported as well.

 

 


2. Define batch size, epochs, num_classes

# Define Constants 
batch_size = 128 
epochs = 100 
num_classes = 10

batch_size: how many samples are grouped together for one training step? -> train in groups of 128

epochs: how many times training is repeated over the whole dataset -> train 100 times

num_classes: the number of classes -> 10, since MNIST has the ten digits 0 to 9

  • Why set a batch size instead of training on all 60000 images at once

When the data is split into batches, the weights are updated after every batch based on how right or wrong the predictions in that batch were, instead of once after a single pass over all the data. The weights are therefore adjusted many times within one epoch, so better performance can be expected.

(When I actually tried it, setting the batch size to 60000 gave an accuracy about 0.02 lower, on MNIST. Also, the smaller the batch size, the slower training becomes. I am not yet at the level of tuning the batch size, but having seen directly that performance changes with it, choosing an appropriate batch size seems to matter.)
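To make the point about per-batch updates concrete, here is a minimal sketch that counts how many weight updates happen in one epoch. The helper name `updates_per_epoch` is made up for illustration; the numbers are the ones used in this post.

```python
import math

def updates_per_epoch(num_samples, batch_size):
    """Number of weight updates during one pass over the data."""
    return math.ceil(num_samples / batch_size)

print(updates_per_epoch(60000, 128))    # 469 updates per epoch
print(updates_per_epoch(60000, 60000))  # 1 update per epoch
```

With batch_size = 128 the weights are adjusted 469 times per epoch, while a batch of 60000 allows only one adjustment per epoch.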

 

 

 


3. Load the MNIST dataset

# Download MNIST dataset 
mnist = keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

MNIST is so well known that keras provides it, so it can be loaded with the code above without downloading it separately.

len(train_images), len(test_images)

 (60000, 10000) 

There are 60000 training images and 10000 test images.

 

 


4. Train the deep learning model

(1) Normalize (so that values fall between 0.0 and 1.0)

# Normalize the input image so that each pixel value is between 0 to 1 
train_images = train_images / 255.0 
test_images = test_images / 255.0

Dividing by 255.0 converts the data to float and normalizes it to the range 0.0 to 1.0.
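As a quick check of what the division does, the sketch below applies it to a tiny made-up array. The 2x2 "image" is hypothetical and only stands in for a 28x28 MNIST image.

```python
import numpy as np

# Hypothetical 2x2 image with uint8 pixel values (0 to 255)
image = np.array([[0, 51], [102, 255]], dtype=np.uint8)
normalized = image / 255.0

print(image.dtype, normalized.dtype)       # uint8 float64
print(normalized.min(), normalized.max())  # 0.0 1.0
```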

 

 

(2) Define the deep learning model

# Define the model architecture 
model = keras.Sequential([
                          keras.layers.Flatten(input_shape=(28, 28)),
                          keras.layers.Dense(128, activation=tf.nn.relu),
                          keras.layers.Dense(num_classes, activation='softmax')
])
# The last layer already applies softmax, so the loss receives probabilities, not logits
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])

The model is built by adding layers one by one to keras.Sequential, which makes modeling intuitive. Flatten turns each 28x28 two-dimensional image into a one-dimensional array. Next comes a Dense layer with relu as its activation function. The last layer uses the number of classes together with the softmax function, so that the prediction comes out as a probability for each class.
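The role of softmax in the last layer can be sketched in plain Python. This is an illustration of the formula only, not how keras implements it, and the scores below are made-up values.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]  # hypothetical raw outputs for 3 classes
probs = softmax(scores)
print(probs)       # the largest score gets the largest probability
print(sum(probs))  # the probabilities sum to 1 (up to rounding)
```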

 

model.compile sets the optimizer, the loss function, and the metrics (evaluation metrics).

Everything is now ready for training the model.

 

 

(3) Train the deep learning model

history = model.fit(train_images, train_labels, epochs=epochs, batch_size=batch_size)

Pass the training dataset together with the epochs and batch_size defined earlier.

 


5. Evaluate the deep learning model

(1) Check loss and accuracy

test_loss, test_acc = model.evaluate(test_images, test_labels)
print("Test Loss: ", test_loss)
print("Test Accuracy: ", test_acc)

 Test Loss: 0.12909765541553497 

 Test Accuracy: 0.98089998960495 

Even though this is a very basic deep learning model, it reached a high accuracy of 0.98. I wish every training run turned out like this.

 

 

 

(2) Define helper functions

# 1. Show the requested number of images
def show_sample(images, labels, sample_count=25):
  # Never ask for more images than are available
  sample_count = min(sample_count, len(images), len(labels))
  # Create a square grid which can fit {sample_count} images
  grid_count = math.ceil(math.sqrt(sample_count))

  plt.figure(figsize=(2*grid_count, 2*grid_count))
  for i in range(sample_count):
    plt.subplot(grid_count, grid_count, i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(images[i], cmap=plt.cm.gray)
    plt.xlabel(labels[i])
  plt.show()

###################################################################
# 2. Show images of one specific digit
# Helper function to display specific digit images 
def show_sample_digit(images, labels, digit, sample_count=25):
  # Create a square grid which can fit {sample_count} images
  grid_count = math.ceil(math.sqrt(sample_count))

  plt.figure(figsize=(2*grid_count, 2*grid_count))
  i = 0 
  digit_count = 0 
  # Stop at the end of the data even if fewer than sample_count matches exist
  while digit_count < sample_count and i < len(labels):
    if digit == labels[i]: 
      plt.subplot(grid_count, grid_count, digit_count+1)
      plt.xticks([])
      plt.yticks([])
      plt.grid(False)
      plt.imshow(images[i], cmap=plt.cm.gray)
      plt.xlabel(labels[i])
      digit_count += 1 
    i += 1  # advance after the check so that index 0 is not skipped
  plt.show()


###################################################################
# 3. Show a single image enlarged
def show_digit_image(image):
  # Draw digit image 
  fig = plt.figure()
  ax = fig.add_subplot(1, 1, 1)
  # Major ticks every 5, minor ticks every 1 
  major_ticks = np.arange(0, 29, 5)
  minor_ticks = np.arange(0, 29, 1)
  ax.set_xticks(major_ticks)
  ax.set_xticks(minor_ticks, minor=True)
  ax.set_yticks(major_ticks)
  ax.set_yticks(minor_ticks, minor=True)
  # And a corresponding grid 
  ax.grid(which='both')
  # Or if you want different settings for the grids:
  ax.grid(which='minor', alpha=0.2)
  ax.grid(which='major', alpha=0.5)
  ax.imshow(image, cmap=plt.cm.binary)

  plt.show()

These functions make it possible to check the 28x28 image arrays visually.

 

 

Let's take a quick look at some images using the functions above.

  • Using show_sample (print as many pictures as requested)
show_sample(train_images, ['Label: %s' % label for label in train_labels])

This shows as many images as requested.

  • Using show_sample_digit (print as many pictures of one specific digit as requested)
show_sample_digit(train_images, train_labels, 7)

This shows as many images of a specific digit as requested.

 

 

 

(3) Visualize loss and accuracy per epoch while training on the train dataset

# Plot the training loss and accuracy recorded for each epoch
loss_fig, loss_ax = plt.subplots()
acc_fig, acc_ax = plt.subplots()

loss_ax.plot(history.history['loss'], 'ro')
loss_ax.set_xlabel('epoch')
loss_ax.set_ylabel('loss')

acc_ax.plot(history.history['accuracy'], 'bo')
acc_ax.set_xlabel('epoch')
acc_ax.set_ylabel('accuracy')

 

 

 

 

 

(4) Compare the predictions on the test data with the correct answers

  • Actual value: the picture
  • Predicted value: the x label
# Predict the labels of digit images in our test datasets.
predictions = model.predict(test_images)

# Then plot the first 25 test images and their predicted labels.
show_sample(test_images, ['predicted: %s' % np.argmax(result) for result in predictions])

 

 

 

(5) Using show_digit_image

  • Compare the picture at a specific index with the prediction for it
Digit = 2005 #@param {type:'slider', min:1, max:10000, step:1}
selected_digit = Digit - 1 

result = predictions[selected_digit]
result_number = np.argmax(result)
print('Number is %2d' % result_number)

show_digit_image(test_images[selected_digit])

In Colab, #@param creates a slider for the value. Move the slider to pick an arbitrary index, and the output is:

Number is 7

Here "Number is 7" is the predicted value and the picture is the test image (the correct answer), so the two can be compared.
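The step from softmax output to a predicted digit is just np.argmax per row. A minimal sketch with made-up probabilities (3 images, 4 classes instead of 10) also shows how an accuracy figure follows from comparing predictions with labels; the arrays here are hypothetical, not model output.

```python
import numpy as np

# Hypothetical softmax outputs for 3 images over 4 classes (made-up values)
predictions = np.array([
    [0.1, 0.7, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.2, 0.2, 0.5, 0.1],
])
labels = np.array([1, 0, 3])  # hypothetical true labels

predicted = np.argmax(predictions, axis=1)  # index of the largest probability per row
accuracy = np.mean(predicted == labels)     # fraction of correct predictions
print(predicted)  # [1 0 2]
print(accuracy)   # 2 of 3 predictions are correct
```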

 

 

 

 

On the MNIST dataset used in this post, even a very simple deep learning model performed well.

In the next post, I will try to push MNIST performance further with a CNN, an architecture well suited to learning from images.

Dataset used

https://www.data.go.kr/dataset/3035522/fileData.do

This dataset is currently listed as discontinued.

The visualizations below use the public data above, preprocessed as in the Inflearn course "Python Data Analysis with Public Data" (https://bit.ly/3sISk6Z).

Data to be used

 

cf1) Create the figure and axes

fig=plt.figure(figsize=(10,3), dpi=100)
ax1=fig.subplots()

cf2) Show every x tick

_=plt.xticks(ticks=np.arange(len(df)), labels=df.index)

cf3) Remove decimals from the x axis

from matplotlib.ticker import MaxNLocator
ax1.xaxis.set_major_locator(MaxNLocator(integer=True))

cf4) Place the legend outside the plot

plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

 

 

 

lineplot


1. pandas plot

(1) The default plot of pandas plot: lineplot

- Drawn based on the index or column values of df

df.plot(figsize=(10,3))

Sale price per pyeong by region, using groupby

cf) Show every x tick

_=plt.xticks(ticks=np.arange(len(g)), labels=g.index)

- When df has several columns (the columns of df play the role of hue in seaborn)

Sale price per pyeong by year and region

 

2. seaborn plot

# Column names: 연도 = year, 평당분양가격 = sale price per pyeong, 지역명 = region name
# In seaborn 0.12 and later, errorbar=None replaces the deprecated ci=None
sns.lineplot(data=df, x="연도", y="평당분양가격", hue="지역명", ci=None, ax=ax1)
ax1.legend(bbox_to_anchor=(1.02, 1), loc=2)

Sale price per pyeong by year and region

pointplot 

sns.pointplot(data=df, x="연도", y="평당분양가격", hue="지역명", ci=None, ax=ax2)
ax2.legend(bbox_to_anchor=(1.02, 1), loc=2)

Sale price per pyeong by year and region

 

 

 

 

barplot


1. pandas plot 

(1) df.plot(kind='bar')

- Drawn based on the index or column values of df

df.plot.bar(rot=0, figsize=(10, 3))
# or
df.plot(kind='bar',rot=0, ax=ax1)

Sale price per pyeong by region, using groupby

(2) df.plot.bar()

df.plot.bar(color='g',rot=0, figsize=(10,3)) # cmap='Pastel1' also works

Sale price per pyeong by region

- With several columns (the columns of df play the same role as hue in seaborn)

ax=df2.plot.bar(figsize=(10,3), rot=0)
ax.set_ylabel('평당분양가격')

Sale price per pyeong by region and year

2. seaborn plot 

sns.barplot(data=df, x="지역명", y="평당분양가격")
# estimator default: mean
# color changeable
# palette (https://seaborn.pydata.org/tutorial/color_palettes.html)
# ci: bootstrap resampling (with replacement), sorted means

Sale price per pyeong by region (seaborn definitely looks nicer)

- Specifying hue 

sns.barplot(data=df, x="지역명", y="평당분양가격", hue='연도', ci=None)

 

 

histplot


1. pandas plot

(1) df.plot(kind='hist') or df.plot.hist()

df.plot(kind='hist', figsize=(10, 3), title='평당분양가격')
# or
ax=df.plot(kind='hist', figsize=(10, 3))
ax.set_title('평당분양가격')

Distribution of the sale price per pyeong

df["평당분양가격"].plot.hist(bins=50)

(2) df.hist(bins=)

df["평당분양가격"].hist(bins=50)

axs=df.hist(bins=50, figsize=(10,10))
ax1,ax2,ax3,ax4=axs.flatten()
ax2.set_title('a title can be set per ax')

2. seaborn plot 

sns.histplot(df["평당분양가격"], kde=True)

 

 

 

kdeplot


1. seaborn plot 

sns.kdeplot(data=df['평당분양가격'])

sns.kdeplot(data=df[['평당분양가격','분양가격']])

 

 

 

 

 

boxplot


1. pandas plot

(1) df.plot(kind='box')

df.plot(kind='box', figsize=(5, 5))

(2) df.plot.box()

- The columns of df become the x axis 

df.plot.box(fontsize=15)

Sale price per pyeong by month and year

- With two-level columns 

df.plot.box(figsize=(15, 3), rot=30)

Sale price per pyeong by month, year, and exclusive area

(3) df.boxplot(column='', by='')

- by: the x axis 

df.boxplot(column='평당분양가격',by='연도', figsize=(5,3), rot=30)

Sale price per pyeong by year

- When by is a list 

df.boxplot(column='평당분양가격',by=['연도','전용면적'], figsize=(20,3), rot=30)

Sale price per pyeong by year and exclusive area

2. seaborn plot 

sns.boxplot(data=df, x="연도", y="평당분양가격")

Sale price per pyeong by year

- Specifying hue

plt.figure(figsize=(12, 3))
sns.boxplot(data=df_last, x="연도", y="평당분양가격", hue="전용면적")

 

 

violinplot 

1. seaborn plot 

sns.violinplot(data=df, x="연도", y="평당분양가격")

Sale price per pyeong by year

- Specifying hue

plt.figure(figsize=(12, 3))
sns.violinplot(data=df, x="연도", y="평당분양가격", hue="전용면적")

Sale price per pyeong by year and exclusive area

heatmap 

1. seaborn plot 

plt.figure(figsize=(15, 7), dpi=100)
ax=sns.heatmap(df, cmap="Blues", annot=True, fmt=".0f")

Sale price per pyeong by year and region. The heatmap must be applied to a df that has been preprocessed with pivot_table.
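The pivot_table preprocessing mentioned above might look like the sketch below. The rows are made-up values and the region names are placeholders; only the column names follow the ones used in this post.

```python
import pandas as pd

# Hypothetical long-format data (values are made up)
df_long = pd.DataFrame({
    "연도": [2019, 2019, 2020, 2020],
    "지역명": ["서울", "부산", "서울", "부산"],
    "평당분양가격": [100.0, 50.0, 110.0, 55.0],
})

# Rows = year, columns = region, cells = mean price (mean is pivot_table's default aggfunc)
table = df_long.pivot_table(index="연도", columns="지역명", values="평당분양가격")
print(table)
```

A wide table of this shape (one numeric cell per row/column pair) is what sns.heatmap expects as input.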

 

 

2. matplotlib pcolor  

fig=plt.figure(figsize=(15,5), dpi=100)
ax=fig.subplots()

t2=t.iloc[::-1]
t2
hm1=ax.pcolor(t2, cmap="Blues")
_=fig.colorbar(hm1, ax=ax)

col_len=len(t2.columns)
row_len=len(t2.index)
for r in range(row_len):
    for c in range(col_len):
        _=ax.text(c+0.5, r+0.5, int(t2.iloc[r, c]),ha="center", va="center", color="k", fontsize=11)

_=ax.set_xticks(np.arange(col_len)+0.5)
_=ax.set_xticklabels(t2.columns)

_=ax.set_yticks(np.arange(row_len)+0.5)
_=ax.set_yticklabels(t2.index)

 

 

 

 

 

 

Concept notes from "This Is Coding Test with Python" (이것이 코딩테스트다 with 파이썬)

Implementation

  • The process of turning an algorithm in your head into source code 
  • Refers to problems whose solution is easy to think of but hard to write as source code 

Problems that are hard to implement 

  1. The algorithm is simple, but the code becomes excessively long 
  2. The output must be printed to a specific number of decimal places 
  3. A string is given as input and must be split character by character into a list (parsing is required) 
  4. There are many small conditions to handle

Brute force (exhaustive search) 

  • A solution method that computes every possible case without hesitation 
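As a small illustration of exhaustive search, the sketch below counts every time of day up to a given hour that contains the digit 3. The problem is chosen here as an example and is not taken from the notes above; the function name is made up.

```python
def count_times_with_three(max_hour):
    """Count every h:m:s from 0:0:0 to max_hour:59:59 that contains the digit 3."""
    count = 0
    for h in range(max_hour + 1):
        for m in range(60):
            for s in range(60):
                if "3" in f"{h}{m}{s}":
                    count += 1
    return count

print(count_times_with_three(5))  # 11475
```

Only 6 * 60 * 60 = 21,600 cases are checked here, so trying all of them is cheap.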

 

Simulation 

  • Carry out the algorithm given in the problem directly, one step at a time 
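A simulation problem typically hands you the rules and asks for the final state. The following sketch assumes a hypothetical problem: start at (1, 1) on an n x n grid, follow the moves L, R, U, D in order, and ignore any move that would leave the grid.

```python
def simulate(n, moves):
    """Follow the moves on an n x n grid from (1, 1); skip moves that leave the grid."""
    steps = {"L": (0, -1), "R": (0, 1), "U": (-1, 0), "D": (1, 0)}
    row, col = 1, 1
    for move in moves:
        d_row, d_col = steps[move]
        new_row, new_col = row + d_row, col + d_col
        if 1 <= new_row <= n and 1 <= new_col <= n:
            row, col = new_row, new_col
    return row, col

print(simulate(5, ["R", "R", "R", "U", "D", "D"]))  # (3, 4)
```

The "U" move is skipped because it would leave the grid from row 1.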

 

Approaching implementation problems

- The problem statement spells out small input conditions, and the statement tends to be fairly long 

cf) Memory and time limit considerations 

List size in Python 

Data count (list length)    Memory usage
1,000                       about 4KB
1,000,000                   about 4MB
10,000,000                  about 40MB

  • Assuming that Python performs 20 million operations per second keeps you safely within the time limit
  • (ex) Time limit 1 second, 1 million data items -> the algorithm must have time complexity within O(NlogN) (because NlogN is about 20,000,000 when N=1,000,000) 
  • You should be able to check the time limit and the data count and predict what time complexity the algorithm needs in order to pass. 
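The rule of thumb above can be turned into a rough calculator. This is only a back-of-the-envelope sketch; the constant is the assumption from the notes, and the function name is made up.

```python
import math

OPS_PER_SECOND = 20_000_000  # rule-of-thumb assumption from the notes above

def estimated_ops(n):
    """Rough operation counts for common complexities at input size n."""
    return {
        "O(N)": n,
        "O(N log N)": round(n * math.log2(n)),
        "O(N^2)": n * n,
    }

for name, ops in estimated_ops(1_000_000).items():
    verdict = "fits" if ops <= OPS_PER_SECOND else "too slow"
    print(f"{name}: about {ops:,} operations -> {verdict} for a 1 second limit")
```

For N = 1,000,000 the O(N log N) estimate lands just under 20 million, while O(N^2) is far beyond it.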

 

 

 

 

<Example source> 

https://wikidocs.net/22803

 


 

 

I never quite remember how to use map and filter and always end up looking them up, so here is a firm summary. 


map(function to apply, elements to apply it to) 

: takes an iterable object and applies the function to each element 

#1. Using a for loop

def add_1(n): 
    return n + 1

target = [1, 2, 3, 4, 5]
result = []

for value in target: 
    result.append(add_1(value))
    
print(result)

[2, 3, 4, 5, 6]

#2. Using map

# Using the map function 
def add_1(n): 
    return n + 1

target = [1, 2, 3, 4, 5]

result = map(add_1, target)

print(result)  # prints an iterator object -> its values can be read with next()
print(list(result))  # or convert it to a list to see the values

<map object at 0x000001B7457EF3C8>

[2, 3, 4, 5, 6]

 

#3. Using map + lambda

# map + lambda: use a lambda when a function like add_1 will not be reused
target = [1, 2, 3, 4, 5]

result = map(lambda x: x + 1, target)

print(list(result))

[2, 3, 4, 5, 6]

# Extra example: convert every element to str 
target = [1, 2, 3, 4, 5]
list(map(str, target))

['1', '2', '3', '4', '5']
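Because map returns an iterator rather than a list, its values are produced lazily and can be consumed only once. A short sketch of that behavior:

```python
target = [1, 2, 3]
result = map(lambda x: x + 1, target)

print(next(result))  # 2 (values are produced one at a time)
print(list(result))  # [3, 4] (only the values not yet consumed)
print(list(result))  # [] (the iterator is now exhausted)
```

If the values are needed more than once, convert the result to a list first and reuse the list.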


filter(function to apply, elements to apply it to) 

: filters by a condition and returns an iterator object over the elements that pass 

#1. Using a for loop

target = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
result = []

def is_even(n):
    return n % 2 == 0

for value in target: 
    if is_even(value): 
        result.append(value)
        
print(result)

[2, 4, 6, 8, 10]

#2. Using filter

# Using the filter function 
target = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

def is_even(n): 
    return n % 2 == 0

result = filter(is_even, target)

print(list(result))

[2, 4, 6, 8, 10]

#3. Using filter + lambda

# filter + lambda
target = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
result = filter(lambda x: x%2==0, target)

print(list(result))

[2, 4, 6, 8, 10]

 

 

Application: Map + Filter example

## Add 1 to every element of the target list and return only the odd numbers
target = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
list(filter(lambda x: x%2!=0, map(lambda x: x+1, target)))

[3, 5, 7, 9, 11]
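For comparison, the same result can be written as a list comprehension, which many Python programmers find easier to read than nested map and filter calls. This is an alternative phrasing, not part of the original example.

```python
target = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

with_map_filter = list(filter(lambda x: x % 2 != 0, map(lambda x: x + 1, target)))
with_comprehension = [x + 1 for x in target if (x + 1) % 2 != 0]

print(with_comprehension)                     # [3, 5, 7, 9, 11]
print(with_map_filter == with_comprehension)  # True
```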

 

Learning Spoons class notes 

<Previous post>

https://silvercoding.tistory.com/71

 


 

 


1. Data introduction & loading the data 

[ Home Credit Data ]

Original data: Kaggle 

Training data: provided by Learning Spoons 

  • Predicting a customer's ability to repay a loan: based on the customer's personal information and transaction data, predict whether the customer will repay money lent to them

train.csv - training data
test.csv - test data to predict
loan_before.csv - details of the loans each person took out previously

 

import pandas as pd
import os
os.chdir('../data')
lb = pd.read_csv("loan_before.csv")
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
train.head()

 

lb.head()

 

- loan_before column information 

SK_ID_CURR: unique ID
DAYS_CREDIT: how many days before the Home Credit loan this loan took place
CNT_CREDIT_PROLONG: how many times the loan was extended
AMT_CREDIT_SUM: loan amount
CREDIT_TYPE: loan type

- train, test column information 

SK_ID_CURR: unique ID
TARGET: target value (0: repaid normally, 1: overdue or otherwise problematic)
CODE_GENDER: gender (0: female, 1: male)
FLAG_OWN_CAR: owns a car (0: no, 1: yes)
FLAG_OWN_REALTY: owns a house or apartment (0: no, 1: yes)
CNT_CHILDREN: number of children
AMT_INCOME_TOTAL: income
AMT_CREDIT: loan amount
AMT_ANNUITY: amount to be repaid each month
NAME_TYPE_SUITE: who accompanied the applicant when applying for the loan
NAME_INCOME_TYPE: type of occupation
NAME_EDUCATION_TYPE: degree
NAME_HOUSING_TYPE: housing situation
REGION_POPULATION_RELATIVE: population of the region
DAYS_BIRTH: age
DAYS_EMPLOYED: when the person was employed (365243 means missing)
DAYS_ID_PUBLISH: date the customer changed the ID document used for the loan application
OWN_CAR_AGE: age of the car owned
CNT_FAM_MEMBERS: number of family members
HOUR_APPR_PROCESS_START: hour at which the loan application was made
ORGANIZATION_TYPE: type of organization the person works for
EXT_SOURCE_1: credit score from external data source 1
EXT_SOURCE_2: credit score from external data source 2
EXT_SOURCE_3: credit score from external data source 3
DAYS_LAST_PHONE_CHANGE: when the phone was last changed
AMT_REQ_CREDIT_BUREAU_YEAR: number of credit enquiries about the person made to the credit bureau in the year before the application

1. Problem definition 

Question 1 - Which factors have a large influence on whether a loan is repaid? 

Question 2 - How do those factors influence repayment? 

2. Methodology 

- Analysis process 

Use interpretable machine learning (xAI) to answer the questions 

(1) Feature Engineering

- Create the AMT_CREDIT_TO_ANNUITY_RATIO variable: over how many months the person has to repay the money 

train['AMT_CREDIT_TO_ANNUITY_RATIO'] = train['AMT_CREDIT']/train['AMT_ANNUITY']
test['AMT_CREDIT_TO_ANNUITY_RATIO'] = test['AMT_CREDIT']/test['AMT_ANNUITY']

- lb data: mean after groupby 

  • AMT_CREDIT_SUM (amount of the previous loans) 
  • DAYS_CREDIT (how many days before the train/test loan the previous loan was taken) 
  • CNT_CREDIT_PROLONG (how many times the loan was extended) 
train = pd.merge( train,lb.groupby(['SK_ID_CURR'])['AMT_CREDIT_SUM'].mean().reset_index(),on='SK_ID_CURR',how='left' )
test = pd.merge( test,lb.groupby(['SK_ID_CURR'])['AMT_CREDIT_SUM'].mean().reset_index(),on='SK_ID_CURR',how='left' )

train = pd.merge( train,lb.groupby(['SK_ID_CURR'])['DAYS_CREDIT'].mean().reset_index(),on='SK_ID_CURR',how='left' )
test = pd.merge( test,lb.groupby(['SK_ID_CURR'])['DAYS_CREDIT'].mean().reset_index(),on='SK_ID_CURR',how='left' )

train = pd.merge( train,lb.groupby(['SK_ID_CURR'])['CNT_CREDIT_PROLONG'].mean().reset_index(),on='SK_ID_CURR',how='left' )
test = pd.merge( test,lb.groupby(['SK_ID_CURR'])['CNT_CREDIT_PROLONG'].mean().reset_index(),on='SK_ID_CURR',how='left' )

- lb data: count after groupby 

  • Create a count column: how many loans the person took out previously
train = pd.merge(train , lb.groupby(['SK_ID_CURR']).size().reset_index().rename(columns={0:'count'}),on='SK_ID_CURR', how='left')
test = pd.merge(test , lb.groupby(['SK_ID_CURR']).size().reset_index().rename(columns={0:'count'}),on='SK_ID_CURR', how='left')
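The repeated groupby and merge calls above could also be condensed into one groupby with named aggregation followed by a single merge. The sketch below uses tiny made-up stand-ins for train and lb; `lb_features` is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical miniature versions of train and loan_before (values are made up)
train = pd.DataFrame({"SK_ID_CURR": [1, 2, 3]})
lb = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],
    "AMT_CREDIT_SUM": [100.0, 300.0, 50.0],
    "DAYS_CREDIT": [-10, -30, -5],
    "CNT_CREDIT_PROLONG": [0, 2, 1],
})

# One groupby builds the three means and the row count per customer
lb_features = lb.groupby("SK_ID_CURR").agg(
    AMT_CREDIT_SUM=("AMT_CREDIT_SUM", "mean"),
    DAYS_CREDIT=("DAYS_CREDIT", "mean"),
    CNT_CREDIT_PROLONG=("CNT_CREDIT_PROLONG", "mean"),
    count=("AMT_CREDIT_SUM", "size"),
).reset_index()

# Customers without previous loans (SK_ID_CURR 3 here) get NaN after the left merge
train = pd.merge(train, lb_features, on="SK_ID_CURR", how="left")
print(train)
```

The same `lb_features` table can then be merged into test, so the aggregation runs only once.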

 

- Removing variables 

The goal of this project is model interpretation, so every variable that gets in the way of it is removed

List of removed variables

  • CODE_GENDER : categorical variable
  • FLAG_OWN_CAR : categorical variable
  • NAME_TYPE_SUITE : categorical variable
  • NAME_INCOME_TYPE : categorical variable
  • NAME_EDUCATION_TYPE : categorical variable
  • NAME_HOUSING_TYPE : categorical variable
  • ORGANIZATION_TYPE : categorical variable
  • EXT_SOURCE_1 : exact meaning of the variable is unknown
  • EXT_SOURCE_2 : exact meaning of the variable is unknown
  • EXT_SOURCE_3 : exact meaning of the variable is unknown
del_list = ['CODE_GENDER','FLAG_OWN_CAR','NAME_TYPE_SUITE','NAME_INCOME_TYPE','NAME_EDUCATION_TYPE','NAME_HOUSING_TYPE','ORGANIZATION_TYPE',
'EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3']
train = train.drop(del_list,axis=1)
test = test.drop(del_list,axis=1)
train.columns

 

(2) Modeling 

- Remove input variables that are highly correlated with each other. 

: When input variables are highly correlated, SHAP values cannot provide a reliable explanation. 

input_var = ['FLAG_OWN_REALTY', 'CNT_CHILDREN',
       'AMT_INCOME_TOTAL', 'AMT_CREDIT', 'AMT_ANNUITY',
       'REGION_POPULATION_RELATIVE', 'DAYS_BIRTH', 'DAYS_EMPLOYED',
       'DAYS_ID_PUBLISH', 'OWN_CAR_AGE', 'CNT_FAM_MEMBERS',
       'HOUR_APPR_PROCESS_START', 'DAYS_LAST_PHONE_CHANGE',
       'AMT_REQ_CREDIT_BUREAU_YEAR', 'AMT_CREDIT_TO_ANNUITY_RATIO',
       'AMT_CREDIT_SUM', 'DAYS_CREDIT', 'CNT_CREDIT_PROLONG', 'count']

Store every variable except the target variable TARGET in input_var. 

 

corr = train[input_var].corr()
corr.style.background_gradient(cmap='coolwarm')

This produces a color-coded correlation table, and the highly correlated variable pairs are listed below. 

[ Highly correlated variable pairs ]  

  • CNT_FAM_MEMBERS & CNT_CHILDREN 0.883051
  • AMT_CREDIT_TO_ANNUITY_RATIO & AMT_CREDIT 0.656337
  • AMT_ANNUITY & AMT_CREDIT 0.770938

cf) Interpreting the Pearson correlation coefficient 

r between -1.0 and -0.7: strong negative linear relationship
r between -0.7 and -0.3: moderate negative linear relationship
r between -0.3 and -0.1: weak negative linear relationship
r between -0.1 and +0.1: negligible linear relationship
r between +0.1 and +0.3: weak positive linear relationship
r between +0.3 and +0.7: moderate positive linear relationship
r between +0.7 and +1.0: strong positive linear relationship
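For reference, the coefficient itself and the interpretation bands above can be sketched in plain Python. The function names are made up for illustration, and pandas' corr() is what the post actually uses.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def describe(r):
    """Map r to the interpretation bands listed above."""
    size = abs(r)
    if size < 0.1:
        return "negligible linear relationship"
    strength = "strong" if size >= 0.7 else "moderate" if size >= 0.3 else "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} linear relationship"

print(describe(pearson_r([1, 2, 3, 4], [2, 4, 6, 8])))  # strong positive linear relationship
print(describe(0.770938))  # strong positive linear relationship
print(describe(0.656337))  # moderate positive linear relationship
```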


Within each highly correlated pair, remove the variable that has the weaker correlation with the target variable. 

print(train['CNT_FAM_MEMBERS'].corr(train['TARGET']))
print(train['CNT_CHILDREN'].corr(train['TARGET']))

0.018876651698723705

0.025357359317615676

del train['CNT_FAM_MEMBERS']
del test['CNT_FAM_MEMBERS']

CNT_FAM_MEMBERS has the lower correlation coefficient with TARGET, so it is removed. 

print(train['AMT_CREDIT_TO_ANNUITY_RATIO'].corr(train['TARGET']))
print(train['AMT_CREDIT'].corr(train['TARGET']))

-0.024740288335190132

-0.02255843084934759

del train['AMT_CREDIT']
del test['AMT_CREDIT']

AMT_CREDIT has the weaker correlation with TARGET (smaller absolute value), so it is removed. 

input_var = ['FLAG_OWN_REALTY', 'CNT_CHILDREN',
       'AMT_INCOME_TOTAL', 'AMT_ANNUITY', 'REGION_POPULATION_RELATIVE',
       'DAYS_BIRTH', 'DAYS_EMPLOYED', 'DAYS_ID_PUBLISH', 'OWN_CAR_AGE',
       'HOUR_APPR_PROCESS_START', 'DAYS_LAST_PHONE_CHANGE',
       'AMT_REQ_CREDIT_BUREAU_YEAR', 'AMT_CREDIT_TO_ANNUITY_RATIO',
       'AMT_CREDIT_SUM', 'DAYS_CREDIT', 'CNT_CREDIT_PROLONG', 'count']

Store the remaining variables, without the removed ones, in input_var again. 

 

- xgboost modeling 

: To use shap.TreeExplainer, the model must be a tree-based model such as a random forest. Among these, xgboost was chosen because it is fast while maintaining high performance. 

from xgboost import XGBClassifier
model = XGBClassifier(n_estimators=100, learning_rate=0.1)
model.fit(train[input_var],train['TARGET'])

 

 

(3) shap value 

import shap
shap_values = shap.TreeExplainer(model).shap_values(train[input_var])
shap.summary_plot(shap_values, train[input_var], plot_type='bar')

 

The top 5 variables with the largest influence on the target value

  • AMT_CREDIT_TO_ANNUITY_RATIO
  • DAYS_EMPLOYED
  • DAYS_CREDIT
  • DAYS_BIRTH
  • DAYS_LAST_PHONE_CHANGE

 

(4) Relationship between the 5 predictor variables and the target variable (loan repayment) 

- 1. AMT_CREDIT_TO_ANNUITY_RATIO: loan repayment period

shap.dependence_plot('AMT_CREDIT_TO_ANNUITY_RATIO', shap_values, train[input_var])

In this plot, a lower value on the vertical axis can be read as better repayment (a higher probability that TARGET is 0). Repayment is poor when the period is 12-20 months, and comparatively good when it is 12 months or less or 20 months or more. 

 

 

- 2. DAYS_EMPLOYED: when the person was employed

shap.dependence_plot('DAYS_EMPLOYED', shap_values, train[input_var])

Repayment ability rises sharply for people who were employed more than 9000 days before the loan date. 

 

 

- 3. DAYS_CREDIT: how many days before the Home Credit loan this loan took place

shap.dependence_plot('DAYS_CREDIT', shap_values, train[input_var])

Repayment ability rises from -3000 days to -2000 days and falls after that. In other words, repayment ability is lower when the previous loan was taken either very long ago or recently. 

 

 

- 4. DAYS_BIRTH: age

shap.dependence_plot('DAYS_BIRTH', shap_values, train[input_var])

The longer ago a person was born (the older they are), the better they tend to repay. 

 

 

- 5. DAYS_LAST_PHONE_CHANGE: when the phone was last changed

shap.dependence_plot('DAYS_LAST_PHONE_CHANGE', shap_values, train[input_var])

The longer ago the phone was changed, the better the person tends to repay. 

 

 


3. Conclusion 

  • The loan repayment period has the largest influence on whether the loan is repaid. The relationship is nonlinear. (A large influence does not by itself establish a causal relationship.)
  • Home ownership and the number of children have almost no influence on repayment ability.
  • The more recently a person was employed, took a previous loan, or changed phones, and the younger they are, the lower the likelihood of repayment.  
train['DAYS_EMPLOYED'].quantile(0.75)

-748.0

The threshold for the top 25% can be obtained as shown above. Using it, the data is split into a group above the top-25% threshold and a group below the bottom-25% threshold on all four variables, and the result is checked visually. 
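How quantile thresholds split a column can be seen on a small made-up series. The values below are hypothetical and only stand in for a column such as DAYS_EMPLOYED.

```python
import pandas as pd

# Hypothetical values standing in for one column
s = pd.Series([-4000, -3000, -2000, -1000, -500, -100, -50, -10])

q25, q75 = s.quantile(0.25), s.quantile(0.75)
top = s[s > q75]     # values above the 75th percentile (top 25%)
bottom = s[s < q25]  # values below the 25th percentile (bottom 25%)

print(q25, q75)
print(top.tolist(), bottom.tolist())
```

Because the DAYS_* columns are negative day offsets, the "top" group holds the values closest to zero, that is, the most recent events.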

 

- Top 25%

# .copy() avoids a SettingWithCopyWarning when the group column is added below
group1 = train.loc[ (train['DAYS_EMPLOYED'].quantile(0.75)< train['DAYS_EMPLOYED']) &
           (train['DAYS_CREDIT'].quantile(0.75)< train['DAYS_CREDIT']) &
           (train['DAYS_LAST_PHONE_CHANGE'].quantile(0.75)< train['DAYS_LAST_PHONE_CHANGE']) &
           (train['DAYS_BIRTH'].quantile(0.75)< train['DAYS_BIRTH']) ].copy()

- Bottom 25%

group2 = train.loc[ (train['DAYS_EMPLOYED'].quantile(0.25)> train['DAYS_EMPLOYED']) &
           (train['DAYS_CREDIT'].quantile(0.25)> train['DAYS_CREDIT']) &
           (train['DAYS_LAST_PHONE_CHANGE'].quantile(0.25)> train['DAYS_LAST_PHONE_CHANGE']) &
           (train['DAYS_BIRTH'].quantile(0.25)> train['DAYS_BIRTH']) ].copy()
group1['group'] = 1
group2['group'] = 0

group1 gets 1 in the group variable, and group2 gets 0. 

full = pd.concat([group1,group2],axis=0)

group1 and group2 are concatenated. 

import seaborn as sns
# Keyword arguments are required in recent seaborn versions
sns.barplot(x='group', y='TARGET', data=full)

group2 (group=0, bottom 25%) has the lower mean TARGET (more zeros, i.e. more normal repayment). This agrees with the conclusion that the smaller the values of these variables, the higher the likelihood of repayment. 

 

Problem

temp.info()

I used pandas' info function for a quick look at missing values, but sometimes the Non-Null Count part is not shown. 

Solution 

temp.info(show_counts=True)

Passing show_counts=True makes the Non-Null Count column appear. (Older pandas versions used null_counts=True for this; that argument was deprecated in pandas 1.2 and removed in 2.0.)

Learning Spoons class notes 

<Previous post>

https://silvercoding.tistory.com/70

 


 

 


1. Introducing and loading the data

<Rossmann Store Sales> 

https://www.kaggle.com/c/rossmann-store-sales/data?select=test.csv 

 


This is the Rossmann data used in the Kaggle competition at the link above.

  • train.csv - historical data including Sales
  • test.csv - historical data excluding Sales
  • sample_submission.csv - a sample submission file in the correct format
  • store.csv - supplemental information about the stores

 

This post uses a reduced version of the data to predict store sales.

(Data: provided by Learning Spoons)

 

import os
import pandas as pd
os.chdir('../data')
train = pd.read_csv("lspoons_train.csv")
test = pd.read_csv("lspoons_test.csv")
store = pd.read_csv("store.csv")

lspoons_train.csv - training data
lspoons_test.csv - test data to predict

store.csv - supplementary data with information about the stores

 

 

train.head()


Column information

  • id
  • Store: id of each store
  • Date: date
  • Sales: sales on that date
  • Promo: whether a sales promotion was running
  • StateHoliday: state holiday flag; 0 if not a holiday, otherwise the type of holiday (a, b, c)
  • SchoolHoliday: whether it was a school holiday

Using the columns above, we build a model that predicts Sales.

 

 

 

 

 


- Analysis plan

1. Baseline modeling (feature engineering - variable selection - modeling)

2. Second-round modeling (merge the store data - feature engineering - variable selection - modeling)

3. Parameter tuning

... repeat modeling (further modeling is left open; to be organized on GitHub)

 


1. Baseline modeling

: Build the most basic model. (missing-value handling, one-hot encoding)


What is feature engineering?

  • Creating new input variables from the existing input variables for prediction
  • A way to improve the predictive performance of machine learning models

train.info()

There are no missing values, so we preprocess the two object-type columns, Date and StateHoliday.

 

- StateHoliday column one-hot encoding 

train = pd.get_dummies(columns=['StateHoliday'],data=train)
test = pd.get_dummies(columns=['StateHoliday'],data=test)

Use the get_dummies function to one-hot encode the StateHoliday column.

print("train_columns: ", train.columns, end="\n\n\n")
print("test_columns: ", test.columns)

Looking at the newly created columns, train has b and c but test does not. A model trained on train expects those columns, so this mismatch causes problems at prediction time.

test['StateHoliday_b'] = 0
test['StateHoliday_c'] = 0

So we create the same columns in the test dataset.
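Adding the missing columns by hand works when you already know which ones are missing. A more general sketch, assuming toy frames with made-up values, is to reindex the test dummies against the train dummies. With the real data this should be applied to the feature columns only, since train also holds the target.

```python
import pandas as pd

# Toy frames: test lacks the 'b' and 'c' categories (made-up data).
toy_train = pd.DataFrame({"StateHoliday": ["0", "a", "b", "c"]})
toy_test = pd.DataFrame({"StateHoliday": ["0", "a", "0"]})

train_d = pd.get_dummies(toy_train, columns=["StateHoliday"])
test_d = pd.get_dummies(toy_test, columns=["StateHoliday"])

# Give test exactly the columns of train; columns absent from test are filled with 0.
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
print(test_d.columns.tolist())
```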

 

- feature engineering using Date column

train['Date']

The Date column looks like a date, but its dtype is object, so it carries no meaning as a date.

train['Date'] = pd.to_datetime( train['Date'] )
test['Date'] = pd.to_datetime( test['Date'] )

So we convert it to a datetime variable with to_datetime, the pandas function that makes date calculations convenient.

 

 

# Create the day-of-week column weekday

train['weekday'] = train['Date'].dt.weekday
test['weekday'] = test['Date'].dt.weekday

# Create the year column year

train['year'] = train['Date'].dt.year
test['year'] = test['Date'].dt.year

# Create the month column month

train['month'] = train['Date'].dt.month
test['month'] = test['Date'].dt.month
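As a quick check of what the dt accessors return, here is a small sketch on two hand-picked dates (chosen only for illustration). Note that dt.weekday counts from Monday=0 to Sunday=6.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2015-07-31", "2014-01-01"]))

features = pd.DataFrame({
    "weekday": dates.dt.weekday,  # Monday=0 ... Sunday=6
    "year": dates.dt.year,
    "month": dates.dt.month,
})
print(features)
```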

 

 

- Baseline modeling

from xgboost import XGBRegressor
train.columns

xgb = XGBRegressor( n_estimators= 300 , learning_rate=0.1 , random_state=2020 )
xgb.fit(train[['Promo','SchoolHoliday','StateHoliday_0','StateHoliday_a','StateHoliday_b','StateHoliday_c','weekday','year','month']],
        train['Sales'])

 

Train an XGB model.

 

from sklearn.model_selection import cross_val_score
cross_val_score(xgb, train[['Promo','SchoolHoliday','StateHoliday_0','StateHoliday_a','StateHoliday_b','StateHoliday_c','weekday','year','month']], train['Sales'], scoring="neg_mean_squared_error", cv=3)

Cross-validation with all columns gave the errors shown above (the scores are negative mean squared errors, so values closer to 0 are better). Let's reduce the error with further work!
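The scores returned with scoring="neg_mean_squared_error" are negative and in squared units of Sales, which makes them hard to read. A small sketch of converting them to RMSE, using hypothetical fold scores rather than the values from this post:

```python
import numpy as np

# Hypothetical fold scores, shaped like the output of cross_val_score
# with scoring="neg_mean_squared_error" and cv=3.
neg_mse = np.array([-4.0e6, -6.25e6, -9.0e6])

# Flip the sign to get MSE, then take the square root to get RMSE per fold.
rmse_per_fold = np.sqrt(-neg_mse)
print(rmse_per_fold, rmse_per_fold.mean())
```

RMSE is in the same unit as Sales, so it can be compared directly with typical daily sales figures.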

 

 

cf. Creating the Kaggle submission file

test['Sales'] = xgb.predict(test[['Promo','SchoolHoliday','StateHoliday_0','StateHoliday_a','StateHoliday_b','StateHoliday_c','weekday','year','month']])

Feed the test dataset to the trained model to make predictions.

test[['id','Sales']].to_csv("submission.csv",index=False)

 

- Variable selection

xgb.feature_importances_

feature_importances_ gives the importance of each variable.

input_var = ['Promo','SchoolHoliday','StateHoliday_0','StateHoliday_a','StateHoliday_b','StateHoliday_c','weekday','year','month']

Store the input variables, everything except Sales, in input_var.

imp_df = pd.DataFrame({"var": input_var,
                       "imp": xgb.feature_importances_})
imp_df = imp_df.sort_values(['imp'],ascending=False)
imp_df

Build a DataFrame of variable importances and sort it in descending order. Promo has by far the highest importance. The StateHoliday columns are low overall.

import matplotlib.pyplot as plt
plt.bar(imp_df['var'],imp_df['imp'])
plt.xticks(rotation=90)
plt.show()

Plotting the importances for a quick look, the columns after SchoolHoliday do not seem to matter much.

cross_val_score(xgb, train[['Promo', 'weekday', 'month','year', 'SchoolHoliday']], train['Sales'], scoring="neg_mean_squared_error", cv=3)

The error is lower than when all columns were used. Next, we test how many columns give the lowest error.

import numpy as np
score_list=[]
selected_varnum=[]
for i in range(1,10):
    selected_var = imp_df['var'].iloc[:i].to_list()
    scores = cross_val_score(xgb, 
                             train[selected_var], 
                             train['Sales'], 
                             scoring="neg_mean_squared_error", cv=3)
    score_list.append(-np.mean(scores))
    selected_varnum.append(i)
    print(i)
plt.plot(selected_varnum, score_list)

 

Running cross-validation for each number of variables shows that the error is lowest with 2 variables.
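Instead of reading the minimum off the plot, it can be picked programmatically. A minimal sketch, assuming score_list holds the mean MSE for each number of variables; the numbers below are hypothetical, not the results of this post.

```python
import numpy as np

# Hypothetical mean MSE for 1..9 selected variables.
selected_varnum = list(range(1, 10))
score_list = [9.1e6, 8.4e6, 8.6e6, 8.7e6, 8.7e6, 8.8e6, 8.8e6, 8.9e6, 8.9e6]

best_idx = int(np.argmin(score_list))   # position of the lowest error
best_k = selected_varnum[best_idx]      # number of variables at that position
print(best_k, score_list[best_idx])
```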

 

Run cross-validation with 2 predictors.

cross_val_score(xgb, train[['Promo', 'weekday']], train['Sales'], scoring="neg_mean_squared_error", cv=3)

The error decreased in every fold except the second. After training the model with 2 predictors, the Kaggle score of the submission made from the test data also decreased. (The steps are repetitive, so they are omitted from this post.)

 

 

 

 

 


2. Second-round modeling

- Merging the store data

store


store dataset: a summary of the characteristics of each store

Column meanings

  • Store: unique id of the store
  • StoreType: type of store
  • Assortment: assortment level of the store
  • CompetitionDistance: distance to the nearest competitor store
  • CompetitionOpenSinceMonth: month in which the nearest competitor opened
  • CompetitionOpenSinceYear: year in which the nearest competitor opened
  • Promo2: whether the store takes part in a continuing (recurring) promotion
  • Promo2SinceWeek / Promo2SinceYear: when the store started Promo2, if it takes part
  • PromoInterval: the months in which Promo2 restarts

train = pd.merge(train, store, on=['Store'], how='left')
test = pd.merge(test, store, on=['Store'], how='left')

Merge the store dataset into the train and test datasets on the Store column.
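A left merge should leave the number of rows unchanged as long as each Store id appears only once in the store table. The sketch below, on made-up toy frames, uses the validate argument of pd.merge to enforce that and then counts the rows that found no match.

```python
import pandas as pd

# Toy frames with made-up values; store 3 has no entry in `stores`.
sales = pd.DataFrame({"Store": [1, 1, 2, 3], "Sales": [5000, 5200, 6100, 4300]})
stores = pd.DataFrame({"Store": [1, 2], "StoreType": ["a", "c"]})

# validate raises MergeError if a Store id appears more than once in `stores`.
merged = pd.merge(sales, stores, on="Store", how="left", validate="many_to_one")

same_length = len(merged) == len(sales)          # a left join on a unique key keeps the row count
unmatched = int(merged["StoreType"].isna().sum())  # rows whose Store has no match
print(same_length, unmatched)
```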

 

 

- Creating the CompetitionOpen column

: how many months the competitor has been open as of each record's date (positive if the competitor opened before that date, negative if after)

train['CompetitionOpen'] = 12*( train['year'] - train['CompetitionOpenSinceYear'] ) + \
                             (train['month'] - train['CompetitionOpenSinceMonth'])

test['CompetitionOpen'] = 12*( test['year'] - test['CompetitionOpenSinceYear'] ) + \
                             (test['month'] - test['CompetitionOpenSinceMonth'])

Subtracting the competitor's opening year from the year of each record and multiplying by 12 converts the difference to months. Adding the difference between the record's month and the competitor's opening month gives the number of months the competitor has been open relative to that date.
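A worked example of the same formula on two made-up rows: one where the competitor opened before the record's date and one where it opened after.

```python
import pandas as pd

# Made-up rows: records from July 2015, competitors opened in Sept 2013 and Jan 2016.
toy = pd.DataFrame({
    "year": [2015, 2015],
    "month": [7, 7],
    "CompetitionOpenSinceYear": [2013.0, 2016.0],
    "CompetitionOpenSinceMonth": [9.0, 1.0],
})

toy["CompetitionOpen"] = 12 * (toy["year"] - toy["CompetitionOpenSinceYear"]) + \
                         (toy["month"] - toy["CompetitionOpenSinceMonth"])
print(toy["CompetitionOpen"].tolist())
```

The first row gives 12 * 2 + (7 - 9) = 22 months, and the second gives 12 * (-1) + (7 - 1) = -6 months.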

 

 

- Creating the PromoOpen column

: how many months Promo2 has been running as of each record's date

train['WeekOfYear'] = train['Date'].dt.isocalendar().week.astype(int) # ISO week number of the date (dt.weekofyear was removed in pandas 2.0)
test['WeekOfYear'] = test['Date'].dt.isocalendar().week.astype(int)

The date information for Promo2 is given as a year and a week, so we compute the week number of each date from the Date column and store it in the WeekOfYear column.

train['PromoOpen'] = 12* ( train['year'] - train['Promo2SinceYear'] ) + \
                        (train['WeekOfYear'] - train['Promo2SinceWeek']) / 4

test['PromoOpen'] = 12* ( test['year'] - test['Promo2SinceYear'] ) + \
                        (test['WeekOfYear'] - test['Promo2SinceWeek']) / 4

As before, the year difference is converted to months, and the week difference is divided by 4 to approximate months. Their sum is the number of months since Promo2 began, relative to the record's date.
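A worked example of the PromoOpen formula on one made-up row. It uses dt.isocalendar().week, which is available in pandas 1.1 and later, to obtain the week number. Dividing weeks by 4 is the post's approximation of a month.

```python
import pandas as pd

# Made-up row: record dated 2015-07-31, Promo2 started in week 27 of 2014.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2015-07-31"]),
    "Promo2SinceYear": [2014.0],
    "Promo2SinceWeek": [27.0],
})

toy["year"] = toy["Date"].dt.year
toy["WeekOfYear"] = toy["Date"].dt.isocalendar().week.astype(int)  # ISO week number

toy["PromoOpen"] = 12 * (toy["year"] - toy["Promo2SinceYear"]) + \
                   (toy["WeekOfYear"] - toy["Promo2SinceWeek"]) / 4
print(toy[["WeekOfYear", "PromoOpen"]])
```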

 

 

- One-hot encoding ( get_dummies() )

train.dtypes

Checking the data types shows three object columns. We one-hot encode these three columns with get_dummies.

train = pd.get_dummies(columns=['StoreType'],data=train)
test = pd.get_dummies(columns=['StoreType'],data=test)
train = pd.get_dummies(columns=['Assortment'],data=train)
test = pd.get_dummies(columns=['Assortment'],data=test)
train = pd.get_dummies(columns=['PromoInterval'],data=train)
test = pd.get_dummies(columns=['PromoInterval'],data=test)
train.columns

test.columns

We confirmed that the train and test columns are identical.

 

 

 

- Modeling

input_var = ['Promo', 'SchoolHoliday',
       'StateHoliday_0', 'StateHoliday_a', 'StateHoliday_b', 'StateHoliday_c',
       'weekday', 'year', 'month', 'CompetitionDistance',
       'Promo2',
       'CompetitionOpen', 'WeekOfYear',
       'PromoOpen', 'StoreType_a', 'StoreType_b', 'StoreType_c', 'StoreType_d',
       'Assortment_a', 'Assortment_b', 'Assortment_c',
       'PromoInterval_Feb,May,Aug,Nov', 'PromoInterval_Jan,Apr,Jul,Oct',
       'PromoInterval_Mar,Jun,Sept,Dec']

Leave out the unneeded columns and store the rest in input_var.

set(train) - set(input_var)

(Note) This lists the columns that were not included in input_var.

xgb = XGBRegressor( n_estimators=300, learning_rate= 0.1, random_state=2020)
xgb.fit(train[input_var],train['Sales'])

Use the xgb model, as before.

cross_val_score(xgb, train[input_var], train['Sales'], scoring="neg_mean_squared_error", cv=3)

After merging the store dataset, preprocessing, and modeling again, the error dropped substantially.

 

 

- Variable importance

imp_df = pd.DataFrame({'var':input_var,
                       'imp':xgb.feature_importances_})
imp_df = imp_df.sort_values(['imp'],ascending=False)
plt.bar(imp_df['var'],
        imp_df['imp'])
plt.xticks(rotation=90)
plt.show()

Visualizing the variable importances suggests that training on a selected subset is better than using every variable.

score_list=[]
selected_varnum=[]
for i in range(1,25):
    selected_var = imp_df['var'].iloc[:i].to_list()
    scores = cross_val_score(xgb, 
                             train[selected_var], 
                             train['Sales'], 
                             scoring="neg_mean_squared_error", cv=3)
    score_list.append(-np.mean(scores))
    selected_varnum.append(i)
    print(i)
plt.plot(selected_varnum, score_list)

The error keeps trending downward but looks roughly flat after 17 variables, so we select the top 17 and train on them.

input_var = imp_df['var'].iloc[:17].tolist()
xgb.fit(train[input_var],
        train['Sales'])
cross_val_score(xgb, train[input_var], train['Sales'], scoring="neg_mean_squared_error", cv=3)

The error decreased overall.

 

 

 

 

 

 


3. Parameter tuning

estim_list = [100,200,300,400,500,600,700,800,900]
score_list = []
for i in estim_list:
    xgb = XGBRegressor( n_estimators=i, learning_rate= 0.1, random_state=2020)
    scores = cross_val_score(xgb, train[input_var], train['Sales'], scoring="neg_mean_squared_error", cv=3)
    score_list.append(-np.mean(scores))
    print(i)
plt.plot(estim_list,score_list)
plt.xticks(rotation=90)
plt.show()

We plotted the error while varying n_estimators, and n_estimators=400 looks appropriate.

xgb = XGBRegressor( n_estimators=400, learning_rate= 0.1, random_state=2020)
xgb.fit(train[input_var],
        train['Sales'])
cross_val_score(xgb, train[input_var], train['Sales'], scoring="neg_mean_squared_error", cv=3)

Changing it to 400 lowered the error.

 

Unfortunately, after parameter tuning the error on the Kaggle test dataset came out higher. I should keep trying other feature engineering, such as handling missing values and outliers. (To be uploaded to GitHub later.)


 

Learning Spoons class notes

 

 

< Previous post >

https://silvercoding.tistory.com/69

 


 

 


After manager Alex Ferguson retired from Manchester United in 2013, the team went into decline. After Solskjær took charge, the club signed two players in the winter transfer window of the 2019/2020 season and, as of March 2020, had managed to reverse the decline.

What would the result be if releases and signings were decided through analysis of player data instead?


 

 

Data: FIFA data (provided by the Learning Spoons course)


1. Loading the data

import pandas as pd
import warnings 

warnings.filterwarnings(action='ignore')  # suppress warnings
data = pd.read_csv("./data/FIFA_data.csv")
pd.set_option('display.max_columns', 80)

When there are many columns, some are abbreviated as ..., so we set the display limit to 80, the number of columns in the data.

data.head()

Now every column can be inspected.

 

 

2. Inspecting the data and planning the analysis

Meaning of each column

  • ID: unique number
  • Name: name
  • Age: age
  • Overall: current ability
  • Potential: potential ability
  • Club: current team
  • Value: estimated transfer value (euros)
  • Wage: weekly wage (euros)
  • Preferred Foot: stronger foot
  • Weak Foot: weaker foot
  • Skill Moves: individual skill
  • Position: position
  • Jersey Number: shirt number
  • Joined: date the player joined the current team
  • Contract Valid Until: contract period
  • Height: height (feet)
  • Weight: weight (pounds)
  • LS ~ RB: ability by position
  • Crossing ~ GKReflexes: detailed abilities
  • Release Clause: buyout clause

 

Analysis plan

1. Analyze the Manchester United players (which players are in the squad?)

2. Compare them with the players of Manchester City, Manchester United's local rival

3. Choose the 2 weakest positions

4. Choose 2 players to sign from other teams (considering finances, feasibility, and the transfer policy)

 

 

 

 

 


3. Analyzing the Manchester United players

(1) EDA

- Extracting the Manchester United players

mu = data[data['Club'] == 'Manchester United']
mu.head()

Select only the rows whose Club is Manchester United and store them in mu.

mu['Club'].unique()

Checking with unique() confirms that only Manchester United was selected.

 

 

 

- Printing a brief summary of the Manchester United players

print(f"Number of players: {mu.shape[0]}")
print(f"Positions of the Manchester United players: {mu['Position'].unique()}")
print(f"Mean Overall: {mu['Overall'].mean()}")
print(f"Mean Potential: {mu['Potential'].mean()}")

 

 

- Visualization

import seaborn as sns 
sns.countplot(x=mu['Age'])

This is the age distribution of the players. Age 19 is the most common, followed by 25, 28, and 22.

sns.countplot(x=mu['Position'])

The most common positions among the players are CM and CB.

sns.boxplot(data=mu, x='Position', y='Overall')

Drawing a boxplot of Overall by Position revealed an outlier in the CB position.
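A boxplot marks points beyond 1.5 times the interquartile range from the box as outliers (the default whisker rule in matplotlib and seaborn). The same rule can be applied numerically. The sketch below uses made-up Overall values, not the values from the FIFA data.

```python
import pandas as pd

# Made-up ability values with one obviously wrong entry.
overall = pd.Series([75, 76, 78, 80, 82, 750], name="Overall")

q1, q3 = overall.quantile(0.25), overall.quantile(0.75)
iqr = q3 - q1

# Flag values outside [q1 - 1.5*IQR, q3 + 1.5*IQR].
is_outlier = (overall < q1 - 1.5 * iqr) | (overall > q3 + 1.5 * iqr)
print(overall[is_outlier].tolist())
```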

 

 

* Handling outliers & missing values


Outliers

  • Values far outside the normal range
  • Including outliers in the analysis can distort the results

Missing values

  • Omitted or empty values
  • Values that were not recorded, or were lost, when the data was collected

How to handle outliers and missing values

  • Removal: drop the rows or columns that contain outliers or missing values. (A last resort, because every data point is valuable.)
  • Replacement: replace outliers and missing values with the column's maximum, mean, median, and so on. (Not a recommended method.)
  • Prediction: fill in a predicted value that takes the characteristics of the column into account. (Recommended.)

mu[mu['Overall']>100]

Check the rows whose Overall is greater than 100.

 

 

Outlier handling - using prediction

mu[mu['Position'] == 'CB'][['Position', 'Overall', 'CB']]

Compare players in the same position. Players with similar CB values have the same Overall. The player with the outlier has the same CB value as the player at index 11081, so their Overall can be estimated as 75.

mu.loc[11422, 'Overall'] = 75

Set the Overall of the player at index 11422 to 75.

sns.boxplot(data=mu, x='Position', y='Overall')

Drawing the boxplot again shows no outliers.

sns.boxplot(data=mu, x='Position', y='Potential')

We also draw a boxplot for Potential. No outliers appear in Potential.

 

 

 

mu.info()

mu has 33 rows in total, and columns 19 through 44 each have 3 missing values.

mu[mu.isnull()['LS']]

Only players whose position is GK seem to have missing values. GK means goalkeeper, and goalkeepers do not need ability ratings for other positions, which is presumably why those values were left missing.

mu = mu.fillna(-1)

Fill the missing values with -1. (An arbitrary value meaning that the rating cannot be measured; another value would also do.)

mu.info()

All missing values have been filled.

 

 

 

 

 


4. Manchester United vs Manchester City 

(1) Preprocessing

df = data[(data['Club'] == 'Manchester United') | (data['Club']=='Manchester City')]

Select only Manchester United and Manchester City and store them in df.

df['Club'].unique()

df['Value'].head()

The transfer value, Value, is written with symbols, so we remove the symbols and the decimal point.

df['Value'] = df['Value'].str.replace('M', '000000')
df['Value'] = df['Value'].str.replace('K', '000')

If the value contains M, append six 0s. If it contains K, append three 0s.

df['Value']

df['Value'] = df['Value'].str.slice(1,)

Then remove the currency symbol with str.slice.

df['Value'].iloc[3]

'64.5000000'

Some values contain a decimal point like this, so we remove the point and drop one trailing 0, but only in the values that actually contain a point.

has_point = df['Value'].str.contains('.', regex=False)  # rows with a literal decimal point
df.loc[has_point, 'Value'] = df.loc[has_point, 'Value'].str.replace('.', '', regex=False).str.slice(0, -1)
df['Value']

The conversion has been applied correctly.

df['Value'] = df['Value'].astype('int')

Now change the data type from object to int.
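The string manipulation above can also be replaced by a numeric conversion. The following is a minimal sketch under the assumption that every Value looks like '€64.5M', '€500K', or '€0'. The helper name parse_euro is hypothetical and not part of the original post.

```python
import pandas as pd

def parse_euro(text):
    """Convert strings such as '€64.5M' or '€500K' to an integer number of euros."""
    multipliers = {"M": 1_000_000, "K": 1_000}
    body = text.lstrip("€")
    if body and body[-1] in multipliers:
        # Numeric part times the multiplier named by the suffix.
        return int(round(float(body[:-1]) * multipliers[body[-1]]))
    return int(round(float(body)))  # no suffix: already a plain number

values = pd.Series(["€64.5M", "€500K", "€0", "€110M"])
parsed = values.map(parse_euro)
print(parsed.tolist())
```

With the post's frame this would be used as df['Value'] = df['Value'].map(parse_euro), applied to the raw column before any of the string replacements.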

df.head()

 

 

 

- Separating the mu and mc players

mu = df[df['Club'] == "Manchester United"]
mc = df[df['Club'] == "Manchester City"]

Separate the Manchester United and Manchester City players from df.

mc.head()

df['Position'].unique()

We group the positions above into four categories (goalkeeper, defender, midfielder, forward) for the analysis. The grouping is as follows.


  • Goalkeeper list GK = GK (goalkeeper)
  • Defender list CB = CB (center back), LB (left back), RB (right back), RCB (right center back), LCB (left center back)
  • Midfielder list MF = RCM (right center midfielder), LCM (left center midfielder), RDM (right defensive midfielder), CDM (central defensive midfielder), CM (central midfielder), RM (right midfielder), CAM (central attacking midfielder)
  • Forward list ST = ST (striker), LW (left winger), RW (right winger)

* Starting lineup: GK (goalkeeper): 1, CB (defenders): 4, MF (midfielders): 4, ST (forwards): 2

-> Selection is based on current ability (the Overall column)

 

gk_list = ['GK']
cb_list = ['CB', 'LCB', 'RCB', 'RB', 'LB']
mf_list = ['RCM', 'LCM', 'RDM', 'CDM', 'CM', 'RM', 'CAM']
st_list = ['ST', 'LW', 'RW']

Write the lists according to the position grouping.

 

gk_count = 1
cb_count = 4
mf_count = 4
st_count = 2



mu_id = []

for index in mu.index:
    if mu['Position'][index] in gk_list: 
        if gk_count != 0:
            mu_id.append(mu['ID'][index])
            gk_count -= 1 
    elif mu['Position'][index] in cb_list:
        if cb_count != 0:
            mu.loc[index, 'Position'] = 'CB'  # .loc avoids chained assignment
            mu_id.append(mu['ID'][index])
            cb_count -= 1 
    elif mu['Position'][index] in mf_list:
        if mf_count != 0:
            mu.loc[index, 'Position'] = 'MF'
            mu_id.append(mu['ID'][index])
            mf_count -= 1 
    else:
        if st_count != 0:
            mu.loc[index, 'Position'] = 'ST'
            mu_id.append(mu['ID'][index])
            st_count -= 1

The data is sorted by current ability in descending order, so walking through it in order and appending IDs to the list picks the top players of each position group. Note that the else branch treats every position outside the first three lists as ST.

mu[mu['ID'].isin(mu_id)]

The 11 players come out as expected.

mu = mu[mu['ID'].isin(mu_id)]

Keep only the 11 selected players in mu.
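The counter-based loop can also be written as a mapping followed by a per-group head. This is a sketch on a made-up squad with smaller quotas than in the post. The names group_of and quota are hypothetical, and the explicit sort removes the reliance on the file already being ordered by Overall.

```python
import pandas as pd

# Hypothetical mapping from detailed position to position group (subset only).
group_of = {"GK": "GK", "CB": "CB", "LCB": "CB", "RB": "CB",
            "RCM": "MF", "CM": "MF", "ST": "ST", "LW": "ST"}
quota = {"GK": 1, "CB": 2, "MF": 1, "ST": 1}  # smaller quotas than the post, for brevity

squad = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8],
    "Position": ["GK", "CB", "ST", "LCB", "RB", "CM", "GK", "LW"],
    "Overall": [91, 88, 87, 85, 84, 83, 80, 79],
})
squad["Group"] = squad["Position"].map(group_of)

# Sort explicitly, then keep the first n rows of each group.
ranked = squad.sort_values("Overall", ascending=False)
starters = pd.concat([ranked[ranked["Group"] == g].head(n) for g, n in quota.items()])
print(starters[["ID", "Group", "Overall"]])
```

Positions missing from the mapping become NaN in Group and are simply not selected, rather than being counted as forwards.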

 

 

 

Apply the same procedure to Manchester City.

gk_count = 1
cb_count = 4
mf_count = 4
st_count = 2


mc_id = []

for index in mc.index:
    if mc['Position'][index] in gk_list: 
        if gk_count != 0:
            mc_id.append(mc['ID'][index])
            gk_count -= 1 
    elif mc['Position'][index] in cb_list:
        if cb_count != 0:
            mc.loc[index, 'Position'] = 'CB'  # .loc avoids chained assignment
            mc_id.append(mc['ID'][index])
            cb_count -= 1 
    elif mc['Position'][index] in mf_list:
        if mf_count != 0:
            mc.loc[index, 'Position'] = 'MF'
            mc_id.append(mc['ID'][index])
            mf_count -= 1 
    else:
        if st_count != 0:
            mc.loc[index, 'Position'] = 'ST'
            mc_id.append(mc['ID'][index])
            st_count -= 1
mc = mc[mc['ID'].isin(mc_id)]

 


concat vs merge

merge: combines side by side (joins columns on a key), concat: combines top to bottom (stacks rows)
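The difference is easiest to see in the resulting shapes. A small sketch with made-up frames (the wages table and its numbers are invented for illustration):

```python
import pandas as pd

a = pd.DataFrame({"ID": [1, 2], "Club": ["Manchester United", "Manchester United"]})
b = pd.DataFrame({"ID": [3], "Club": ["Manchester City"]})
wages = pd.DataFrame({"ID": [1, 2, 3], "Wage": [100, 200, 300]})  # made-up numbers

stacked = pd.concat([a, b], ignore_index=True)  # rows of b go under the rows of a
joined = pd.merge(stacked, wages, on="ID")      # the Wage column is attached by matching ID
print(stacked.shape, joined.shape)
```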


df = pd.concat([mu, mc])

Combine the selected mu and mc players and store them in df.

 

 

(2) EDA 

- mu vs mc: comparing the current ability (Overall) of the starters by position

sns.boxplot(data=df, x='Position', y='Overall', hue='Club')

Except for goalkeepers, Manchester United is lower in every position.

 

 

- mu vs mc: comparing the estimated transfer value (Value) of the starters by position

sns.boxplot(data=df, x='Position', y='Value', hue='Club')

Except for goalkeepers, the transfer values of the Manchester United players are about the same as or higher than those of Manchester City.

 

 

Comparing the two teams with the boxplots above, MF and CB are judged to be the positions where ability is low relative to transfer value, so we analyze which players to sign for these two positions.

 

 

 


5. Which players should Manchester United sign?

(1) EDA

* Selecting the players to release

Set up a formula based on joining date, ability, potential, and age. (The formula itself uses only Overall, Potential, and Age; Joined is shown in the tables below for reference.)

 Point = (Overall * 2 + Potential) / Age 

The higher the ability (which is given double weight) and the potential, and the lower the age, the better.

mu['Point'] = (mu['Overall'] * 2 + mu['Potential']) / mu['Age']
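A worked example of the Point formula on two made-up players with the same current ability, to show how age and potential move the score. The names and numbers are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["Player A", "Player B"],
    "Overall": [80, 80],
    "Potential": [85, 80],
    "Age": [22, 31],
})

toy["Point"] = (toy["Overall"] * 2 + toy["Potential"]) / toy["Age"]

# The lowest Point comes first, which marks the release candidate.
ordered = toy.sort_values("Point")
print(ordered)
```

Player A scores (160 + 85) / 22, about 11.1, and Player B scores (160 + 80) / 31, about 7.7, so the older player with no remaining potential is the release candidate.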

 

- MF position

mu[mu['Position'] == 'MF'][['Name', 'Overall', 'Potential', 'Age', 'Joined', 'Point']]

The lowest Point belongs to the player at index 211.

 

- CB position

mu[mu['Position'] == 'CB'][['Name', 'Overall', 'Potential', 'Age', 'Joined', 'Point']]

The lowest Point belongs to the player at index 377.

 

Release the two players, Mata and Smalling, and sign one player each for the MF and CB positions.

 

 

(2) Visualization

Visualizing all players - deciding whom to sign according to the transfer policy


Manchester United transfer policy (manager Solskjær)

- The younger the player, the better

- Players who can start right away, rather than players with potential


market = data[(data['Position']=='RM') | (data['Position']=='CB')]

For positions, we choose RM and CB, the detailed positions of the two players selected for release.

market.head()

import matplotlib.pyplot as plt
f, ax = plt.subplots(2, 4, figsize=(20, 10))

vs_list = ['Age', 'Overall', 'Potential', 'Weak Foot']

for i in range(8):
    if i < 4:
        colors = ['firebrick' if x > market[market['Position']=='CB'][:13][vs_list[i]].mean() else 'gray' for x in market[market['Position']=='CB'][:13][vs_list[i]]]
        sns.barplot(x=vs_list[i], y='Name', data=market[market['Position']=='CB'][:13], ax=ax[i//4, i%4], palette=colors)
        ax[i//4, i%4].axvline(market[market['Position']=='CB'][:13][vs_list[i]].mean(), ls = '--', color='k')
   
    else:
        colors = ['firebrick' if x > market[market['Position']=='RM'][:13][vs_list[i%4]].mean() else 'gray' for x in market[market['Position']=='RM'][:13][vs_list[i%4]]]        
        sns.barplot(x=vs_list[i%4], y='Name', data=market[market['Position']=='RM'][:13], ax=ax[i//4, i%4], palette=colors)        
        ax[i//4, i%4].axvline(market[market['Position']=='RM'][:13][vs_list[i%4]].mean(), ls='--', color='k')

Setting everything else aside and judging only by age, current ability, and potential, the analysis suggests that signings made under this transfer policy would be S. Umtiti and K. Mbappé.
