[Visualization Analysis Project] 2-1 Merging Multiple CSV Files with pandas

silversu 2021. 8. 11. 01:12

Notes from a Learning Spoons class

< Previous post >

https://silvercoding.tistory.com/56

import pandas as pd

* Reading files with pandas

pandas.read_excel — pandas 1.3.1 documentation

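Merging CSV files with pandas

(1) Loading one CSV file and looking at it

* File: boarding and alighting passenger counts by date / line / subway station (first half of 2019, January to June; provided by Learning Spoons)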

file = './rawfiles/CARD_SUBWAY_MONTH_201901.csv'
raw = pd.read_csv(file)

raw.head()

raw.info()

There are 18,334 rows in total, and none of the columns has missing values.
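
As a cross-check of what raw.info() reports, missing values can also be counted per column directly. The sketch below uses a small made-up frame (the column names and values are hypothetical, not the subway data) so that it runs on its own.

```python
import pandas as pd

# Hypothetical toy data standing in for one monthly file
toy = pd.DataFrame({
    "station": ["A", "B", "C"],
    "boardings": [120, None, 95],   # one value deliberately missing
})

# isna() marks missing cells; sum() counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Number of rows, the figure info() reports as "entries"
n_rows = len(toy)
print(n_rows)
```

On the real data the same two lines would be raw.isna().sum() and len(raw).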

(2) Merging two CSV files

raw = pd.DataFrame()  # create an empty DataFrame
raw.head()

# First file
file = './rawfiles/CARD_SUBWAY_MONTH_201901.csv'
temp = pd.read_csv(file)
# temp.head()
raw = pd.concat([raw, temp])  # add the rows of temp to raw
# raw.head()

# Second file
file = './rawfiles/CARD_SUBWAY_MONTH_201902.csv'
temp = pd.read_csv(file)
raw = pd.concat([raw, temp])  # add the rows of temp to raw

raw.info()

First, create an empty DataFrame, then read each file and add its rows to that DataFrame with pd.concat. (DataFrame.append did the same job in pandas 1.x, but it was removed in pandas 2.0.) To merge many files, write this repeated step as a for loop.

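(3) Merging all CSV files in a folder

- Getting the file names in a folder

# os: standard-library module for working with folders and files
import os

os.listdir()

Called without arguments, os.listdir() returns the names of the folders and files in the current folder.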

dirpath = './rawfiles/'
files = os.listdir(dirpath)
files

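In the same way, list the files inside rawfiles, the folder under the current directory that this post uses.

With this list, the files can be loaded without copying and pasting each file name by hand.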
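
os.listdir() returns every entry in the folder, in arbitrary order, so it can help to keep only the .csv files and sort them before reading. A minimal sketch, assuming the monthly files all end in .csv; list_csv_files is a hypothetical helper name, not part of the class code.

```python
import os
import tempfile

def list_csv_files(dirpath):
    """Return the sorted paths of the .csv files directly inside dirpath."""
    names = [
        name for name in os.listdir(dirpath)
        if name.lower().endswith(".csv")
        and os.path.isfile(os.path.join(dirpath, name))
    ]
    # Sorting makes the order reproducible (os.listdir order is arbitrary)
    return [os.path.join(dirpath, name) for name in sorted(names)]

# Usage on a throwaway folder so the example runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    for name in ["b.csv", "a.csv", "notes.txt"]:
        open(os.path.join(tmp, name), "w").close()
    found = [os.path.basename(p) for p in list_csv_files(tmp)]

print(found)
```

For the folder in this post the call would be list_csv_files('./rawfiles/').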

- Merging

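# Prepare an empty DataFrame
raw = pd.DataFrame()

# Merge
for file in os.listdir('./rawfiles'):
#     print(file)
    fpath = './rawfiles/'+file
    print(fpath)
    temp = pd.read_csv(fpath)
    raw = pd.concat([raw, temp], ignore_index = True)   # ignore_index = True --> discard the existing index labels

As above, the files are added to the empty DataFrame one at a time. Without ignore_index=True, every file keeps its own index, so the labels restart at 0 for each file and repeat across the merged data. With ignore_index=True the rows are numbered consecutively from 0. (pd.concat is used because DataFrame.append, its pandas 1.x equivalent, was removed in pandas 2.0.)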
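
The effect of ignore_index is easiest to see on two tiny frames. This is an illustrative sketch with made-up values, not the subway data.

```python
import pandas as pd

# Two toy frames standing in for two monthly files (hypothetical values)
jan = pd.DataFrame({"riders": [10, 20]})
feb = pd.DataFrame({"riders": [30, 40]})

kept = pd.concat([jan, feb])                            # each frame keeps its own labels
renumbered = pd.concat([jan, feb], ignore_index=True)   # labels rebuilt from 0

print(list(kept.index))        # [0, 1, 0, 1]
print(list(renumbered.index))  # [0, 1, 2, 3]
```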

Inspecting the merged data

raw.head()

raw.tail()

raw.info()

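There are 99,342 rows in total. This agrees with the index shown by raw.tail() (the last label is one less than the row count because the index starts at 0), so the merge worked as expected.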
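
The folder merge and the row-count check can also be wrapped in one helper. The sketch below is one possible arrangement, not the class code: merge_csv_folder is a hypothetical name, and it collects the frames in a list and concatenates once at the end, which avoids copying the growing DataFrame on every pass through the loop. It assumes the folder contains at least one .csv file (pd.concat raises an error on an empty list).

```python
import os
import tempfile
import pandas as pd

def merge_csv_folder(dirpath):
    """Read every .csv file in dirpath and stack them into one DataFrame."""
    frames = []
    for name in sorted(os.listdir(dirpath)):
        if name.lower().endswith(".csv"):
            frames.append(pd.read_csv(os.path.join(dirpath, name)))
    # One concat at the end instead of growing the frame inside the loop
    merged = pd.concat(frames, ignore_index=True)
    # Sanity check: no rows lost or duplicated during the merge
    assert len(merged) == sum(len(f) for f in frames)
    return merged

# Demo on two tiny throwaway files (made-up numbers)
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({"riders": [1, 2]}).to_csv(os.path.join(tmp, "m01.csv"), index=False)
    pd.DataFrame({"riders": [3]}).to_csv(os.path.join(tmp, "m02.csv"), index=False)
    demo = merge_csv_folder(tmp)

print(len(demo), demo.index[-1])
```

In the demo the frame has 3 rows and its last index label is 2, the same "row count minus one" relationship seen in raw.tail().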

Adding a '요일' (day of week) column with datetime

from datetime import datetime

- datetime.strptime()

date_str = str(20190601)     # must be passed as a string, not a number
date = datetime.strptime(date_str, "%Y%m%d")
date

datetime.datetime(2019, 6, 1, 0, 0)

# Monday: 0 ~ Sunday: 6
weekday = date.weekday()
weekday

Without checking a calendar, the datetime library returns the day of the week directly. Here weekday is 5, so 2019-06-01 was a Saturday.
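
The two steps above (parse the string, then ask for the weekday) fit naturally into a small function. A sketch, with weekday_index as a hypothetical helper name:

```python
from datetime import datetime

def weekday_index(yyyymmdd):
    """Return 0 (Monday) through 6 (Sunday) for a date written as YYYYMMDD."""
    # str() lets the caller pass either 20190601 or "20190601"
    return datetime.strptime(str(yyyymmdd), "%Y%m%d").weekday()

print(weekday_index(20190601))    # 5, a Saturday
print(weekday_index("20190603"))  # 0, a Monday
```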

- Adding the '요일' column

# Weekday labels in Korean, Monday ('월') through Sunday ('일'); a list indexed by weekday()
weekday_dict = [ '월','화','수','목','금','토','일']
weekday_list = []

for date_str in raw['사용일자']:
    date = datetime.strptime(str(date_str), "%Y%m%d")
    weekday_index  = date.weekday()
    weekday = weekday_dict[weekday_index]
    weekday_list.append(weekday)

weekday_list collects the weekday label of each row, in row order.

raw['요일'] = weekday_list
raw.sample(5)

Add the '요일' column, then use sample to pull a few random rows and check the result.
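
The row-by-row loop works, but the same column can also be built by parsing the whole date column at once. This is an alternative sketch, not the method used in the class; it assumes '사용일자' holds dates written as YYYYMMDD numbers, and the toy values are made up.

```python
import pandas as pd

# Toy frame using the same column name as the subway data; values are made up
toy = pd.DataFrame({"사용일자": [20190601, 20190603, 20190609]})

weekday_names = ["월", "화", "수", "목", "금", "토", "일"]  # Monday through Sunday

# Parse the whole column at once instead of looping row by row
dates = pd.to_datetime(toy["사용일자"].astype(str), format="%Y%m%d")
# dt.dayofweek uses the same 0 = Monday convention as datetime.weekday()
toy["요일"] = dates.dt.dayofweek.map(lambda i: weekday_names[i])

print(toy)
```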

raw.columns

new_columns = ['사용일자',  '요일', '노선명', '역ID', '역명', '승차총승객수', '하차총승객수', '등록일자']
raw = raw[ new_columns ]
raw.head()

To put the weekday next to the date, select the columns in the new order.
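
When a frame has many columns, the new order can be derived from the existing one instead of typing every name. A small sketch on a toy frame that reuses a few of the column names from this post (the values are made up):

```python
import pandas as pd

# Toy frame with a subset of the post's column names; values are made up
toy = pd.DataFrame({
    "사용일자": [20190601],
    "노선명": ["2호선"],
    "역명": ["강남"],
    "요일": ["토"],
})

# Build the new order from the existing columns instead of retyping them all
cols = [c for c in toy.columns if c != "요일"]
cols.insert(cols.index("사용일자") + 1, "요일")   # place it right after the date
toy = toy[cols]

print(list(toy.columns))
```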

Saving the data

raw.to_excel('./data/subway_raw.xlsx', index = False)

Save the result as an Excel file with to_excel().
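
Two practical points about saving: the target folder has to exist before writing, and to_excel() needs an Excel writer engine such as openpyxl to be installed. The sketch below creates the folder first and picks the writer from the file extension; save_table is a hypothetical helper, and the demo uses the CSV branch so that it runs without an Excel engine.

```python
import os
import tempfile
import pandas as pd

def save_table(df, path):
    """Write df to path, creating the parent folder if it is missing."""
    folder = os.path.dirname(path)
    if folder:
        os.makedirs(folder, exist_ok=True)
    if path.lower().endswith(".xlsx"):
        # Needs an Excel writer engine such as openpyxl to be installed
        df.to_excel(path, index=False)
    else:
        df.to_csv(path, index=False)

# Demo with the CSV branch, written into a throwaway folder
with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "data", "subway_raw.csv")
    save_table(pd.DataFrame({"riders": [1, 2]}), out)
    reloaded = pd.read_csv(out)

print(reloaded)
```

For the file in this post the call would be save_table(raw, './data/subway_raw.xlsx').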
