๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ 

 

 

< ์ด์ „ ๊ธ€ > 

https://silvercoding.tistory.com/56

 

[์‹œ๊ฐํ™” ๋ถ„์„ ํ”„๋กœ์ ํŠธ] 1-3. ํŠน์ • ํŒ€์— ๊ฐ•ํ•œ ์•ผ๊ตฌ์„ ์ˆ˜ ๋ถ„์„ํ•˜๊ธฐ

๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ < ์ด์ „ ๊ธ€ > https://silvercoding.tistory.com/55 https://silvercoding.tistory.com/54 https://silvercoding.tistory.com/53 https://silvercoding.tistory.com/52 [python ์‹œ๊ฐํ™”] 1. se..

silvercoding.tistory.com

 

 


import pandas as pd

* pandas๋กœ ํŒŒ์ผ ์ฝ์–ด์˜ค๊ธฐ 

์ถœ์ฒ˜ - ๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ

< ๊ด€๋ จ ์ฐธ๊ณ  ๋ฌธ์„œ > 

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html

 

pandas.read_excel — pandas 1.3.1 documentation

Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IN

pandas.pydata.org

 

 

 

 


 pandas ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์—‘์…€ ํŒŒ์ผ ๋ณ‘ํ•ฉํ•˜๊ธฐ 

(1) ์—‘์…€ํŒŒ์ผ ํ•œ๊ฐœ ๋ถˆ๋Ÿฌ์™€์„œ ์‚ดํŽด๋ณด๊ธฐ 

* ํŒŒ์ผ : ์ผ์ž / ๋…ธ์„  / ์ง€ํ•˜์ฒ ์—ญ ๋ณ„ ์Šนํ•˜์ฐจ ๊ณ ๊ฐ ์ˆ˜ ( 2019๋…„ ์ƒ๋ฐ˜๊ธฐ - 1์›” ~ 6์›”, ๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ œ๊ณต ) 

file = './rawfiles/CARD_SUBWAY_MONTH_201901.csv'
raw = pd.read_csv(file)
raw.head()

raw.info()

์ด 18334 ๊ฐœ์˜ ๋ฐ์ดํ„ฐ๊ฐ€ ์žˆ๊ณ , ๊ฒฐ์ธก๊ฐ’์€ ์—†๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. 

 

 

 

(2) ๋‘๊ฐœ ์—‘์…€ํŒŒ์ผ ํ•ฉ์ณ๋ณด๊ธฐ  

raw = pd.DataFrame()  # ๋นˆ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„ ์ƒ์„ฑ 
raw.head()

# ์ฒซ ๋ฒˆ์งธ ํŒŒ์ผ
file = './rawfiles/CARD_SUBWAY_MONTH_201901.csv'
temp = pd.read_csv(file)
# temp.head()
raw = raw.append(temp)  # ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„ ์ถ”๊ฐ€ 
# raw.head()

# ๋‘ ๋ฒˆ์งธ ํŒŒ์ผ
file = './rawfiles/CARD_SUBWAY_MONTH_201902.csv'
temp = pd.read_csv(file)
raw = raw.append(temp)  # ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„ ์ถ”๊ฐ€ 

raw.info()

์šฐ์„  , ๋นˆ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์„ ์ƒ์„ฑํ•œ ํ›„, ํŒŒ์ผ์„ ์ฝ์–ด ์™€์„œ append๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋นˆ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์— ๋ฐ์ดํ„ฐ๋ฅผ ์ถ”๊ฐ€ํ•ด ์ค€๋‹ค. ์—ฌ๋Ÿฌ๊ฐœ์˜ ํŒŒ์ผ์„ ํ•ฉ์น˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ์ด๋Ÿฌํ•œ ๋ฐ˜๋ณต๋˜๋Š” ์ž‘์—…์„ for๋ฌธ์œผ๋กœ ์ž‘์„ฑํ•ด์ฃผ๋ฉด ๋œ๋‹ค. 

 

 

 

(3) ํด๋”์— ์žˆ๋Š” ๋ชจ๋“  ์—‘์…€ํŒŒ์ผ ๋ณ‘ํ•ฉํ•˜๊ธฐ

- ํด๋” ์•ˆ์˜ ํŒŒ์ผ ์ด๋ฆ„ ๊ฐ€์ ธ์˜ค๊ธฐ 

# ํด๋” , ํŒŒ์ผ์„ ๊ด€๋ฆฌํ•˜๋Š” os ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ 
import os
os.listdir()

os.listdir() ์„ ์‚ฌ์šฉํ•˜์—ฌ ํ˜„์žฌ ํด๋”์— ์žˆ๋Š” ํด๋”์™€ ํŒŒ์ผ๋“ค์˜ ์ด๋ฆ„์„ ๊ทธ๋Œ€๋กœ ๊ฐ€์ ธ์˜ฌ ์ˆ˜ ์žˆ๋‹ค. 

dirpath = './rawfiles/'
files = os.listdir(dirpath)
files

๊ฐ™์€ ๋ฐฉ๋ฒ•์œผ๋กœ ๋ณธ ํฌ์ŠคํŒ…์—์„œ ์‚ฌ์šฉํ•  ํ˜„์žฌ ํด๋” ์•ˆ์˜ rawfiles์•ˆ์˜ ํŒŒ์ผ๋“ค์„ ๋ถˆ๋Ÿฌ ์˜จ๋‹ค. 

 

์ด๋ฅผ ์ด์šฉํ•˜์—ฌ ์ผ์ผ์ด ํŒŒ์ผ ์ด๋ฆ„์„ ๋ณต๋ถ™ํ•˜์ง€ ์•Š์•„๋„ ํŒŒ์ผ์„ ๋ถˆ๋Ÿฌ์˜ฌ ์ˆ˜ ์žˆ๋‹ค. 

 

- ๋ณ‘ํ•ฉํ•˜๊ธฐ 

# ๋นˆ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„ ์ค€๋น„
raw = pd.DataFrame() 

# ๋ณ‘ํ•ฉ
for file in os.listdir('./rawfiles'):
#     print(file)
    fpath = './rawfiles/'+file
    print(fpath)
    temp = pd.read_csv(fpath)
    raw = raw.append(temp, ignore_index = True)   #ignore_index = True  --> ๊ธฐ์กด ์ธ๋ฑ์Šค๋Š” ๋ฌด์‹œํ•˜๋ผ.

์œ„์™€ ๊ฐ™์ด ๋นˆ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์— ํŒŒ์ผ์„ ํ•˜๋‚˜์”ฉ ์ถ”๊ฐ€ํ•œ๋‹ค. ignore_index=True๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ธฐ์กด ์ธ๋ฑ์Šค๋ฅผ ๋ฌด์‹œํ•˜์ง€ ์•Š์œผ๋ฉด ์›๋ž˜ ์ž์‹ ์˜ ์ธ๋ฑ์Šค๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์ธ๋ฑ์Šค๊ฐ€ ๋’ค์ฃฝ๋ฐ•์ฃฝ์œผ๋กœ ์„ค์ •๋˜์–ด ์žˆ๊ฒŒ๋œ๋‹ค. ์ˆœ์„œ๋Œ€๋กœ ํ•ด์ฃผ๊ธฐ ์œ„ํ•จ์ด๋‹ค. 

 

 

 


 ๋ณ‘ํ•ฉํ•œ ๋ฐ์ดํ„ฐ ์‚ดํŽด๋ณด๊ธฐ 
raw.head()

raw.tail()

raw.info()

์ด 99342 ๊ฐœ์˜ row๋กœ , raw.tail()์„ ํ–ˆ์„ ๋•Œ index์˜ ๊ฒฐ๊ณผ์™€ ๊ฐ™์œผ๋ฏ€๋กœ ์ •์ƒ์ ์œผ๋กœ ๋ณ‘ํ•ฉ๋œ ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๋‹ค. 

 

 

 


 '์š”์ผ' ์ปฌ๋Ÿผ ์ถ”๊ฐ€ ํ•ด๋ณด๊ธฐ - datetime ์‚ฌ์šฉ 
from datetime import datetime

์ถœ์ฒ˜ - ๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ

 

- datetime.strptime()

date_str = str(20190601)     # ์ˆซ์ž๊ฐ€ ์•„๋‹Œ ๋ฌธ์ž๋กœ ์ž…๋ ฅ๋˜์–ด์•ผ ํ•จ
date = datetime.strptime(date_str, "%Y%m%d")
date

datetime.datetime(2019, 6, 1, 0, 0)

# ์›”์š”์ผ : 0 ~ ์ผ์š”์ผ : 6
weekday = date.weekday()
weekday

5

์ง์ ‘ ๋‹ฌ๋ ฅ์„ ์ฐพ์•„๋ณด์ง€ ์•Š๊ณ ๋„ , datetime ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์‚ฌ์šฉํ•˜๋ฉด ์ด๋ ‡๊ฒŒ ์š”์ผ์„ ๋ฐ˜ํ™˜ํ•ด ์ค€๋‹ค. 

 

- '์š”์ผ' ์ปฌ๋Ÿผ ์ถ”๊ฐ€ 

weekday_dict = [ '์›”','ํ™”','์ˆ˜','๋ชฉ','๊ธˆ','ํ† ','์ผ']
weekday_list = []

for date_str in raw['์‚ฌ์šฉ์ผ์ž']:
    date = datetime.strptime(str(date_str), "%Y%m%d")
    weekday_index  = date.weekday()
    weekday = weekday_dict[weekday_index]
    weekday_list.append(weekday)

weekday_list์— ๊ฐ row์˜ ์š”์ผ์„ ์ถ”๊ฐ€ํ•ด ์ค€๋‹ค. 

raw['์š”์ผ'] = weekday_list
raw.sample(5)

'์š”์ผ' ์ปฌ๋Ÿผ์„ ์ถ”๊ฐ€ํ•ด ์ฃผ๊ณ  , sample์„ ์ด์šฉํ•˜์—ฌ ๋žœ๋ค์œผ๋กœ row๋ฅผ ๋ถˆ๋Ÿฌ์™€์„œ ํ™•์ธํ•ด ๋ณธ๋‹ค. 

raw.columns

new_columns = ['์‚ฌ์šฉ์ผ์ž',  '์š”์ผ', '๋…ธ์„ ๋ช…', '์—ญID', '์—ญ๋ช…', '์Šน์ฐจ์ด์Šน๊ฐ์ˆ˜', 'ํ•˜์ฐจ์ด์Šน๊ฐ์ˆ˜', '๋“ฑ๋ก์ผ์ž']
raw = raw[ new_columns ]
raw.head()

์š”์ผ์„ ์•ž์— ๋‘๊ธฐ ์œ„ํ•ด ์ปฌ๋Ÿผ์˜ ์œ„์น˜๋ฅผ ๋ฐ”๊พธ์–ด ์„ ํƒํ•ด์ค€๋‹ค. 

 

 

 

 


 ๋ฐ์ดํ„ฐ ์ €์žฅ 
raw.to_excel('./data/subway_raw.xlsx', index = False)

to_excel() ์„ ์ด์šฉํ•˜์—ฌ ์—‘์…€ ํŒŒ์ผ๋กœ ์ €์žฅํ•ด ์ค€๋‹ค. 


 

 

+ Recent posts