๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ 

 

 

< ์ด์ „ ๊ธ€ > 

https://silvercoding.tistory.com/50

 

[python pandas] 3. pandas ๊ธฐ์ดˆ ์‚ฌ์šฉ (3) - ์ง‘๊ณ„, ๊ฒฐ์ธก๊ฐ’, ์ •๋ ฌ

๋Ÿฌ๋‹์Šคํ‘ผ์ฆˆ ์ˆ˜์—… ์ •๋ฆฌ < ์ด์ „ ๊ธ€ > https://silvercoding.tistory.com/49 https://silvercoding.tistory.com/48 [python pandas] pandas ๊ธฐ์ดˆ ์‚ฌ์šฉ (1) ๋Ÿฌ๋‹์Šคํ‘ผ ์ˆ˜์—… ์ •๋ฆฌ * ํŒ๋‹ค์Šค ๊ธฐ๋ณธ ํ•จ์ˆ˜ ๋ฐ์ดํ„ฐ ํŒŒ์ผ ์ฝ๊ธฐ :..

silvercoding.tistory.com

 

 


 ๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ & ์‚ดํŽด๋ณด๊ธฐ 
import pandas as pd
file = './data/babyNamesUS.csv'
raw = pd.read_csv(file)
raw.head()

raw.info()

 


 

 ๋‚จ๋…€ ๊ตฌ๋ถ„์—†์ด '๋งŽ์ด' ์‚ฌ์šฉ๋˜๋Š” ๊ณตํ†ต ์ด๋ฆ„ ? 

idea : ๋‚จ๋…€ ์ด๋ฆ„ ๊ฐœ์ˆ˜์˜ ๋น„์œจ ์ฐจ์ด๊ฐ€ ์ž‘์„์ˆ˜๋ก ์„ฑ๋ณ„ ๊ตฌ๋ถ„์ด ์—†๋Š” ์ด๋ฆ„์ผ ๊ฒƒ์ด๋‹ค !  

# ์„ฑ๋ณ„์— ๋”ฐ๋ฅธ ์ด๋ฆ„ ๊ฐœ์ˆ˜ ์ง‘๊ณ„
name_df = raw.pivot_table(index = 'Name', columns = 'Sex', values = 'Number', aggfunc='sum')

# ๊ฒฐ์ธก๊ฐ’ ์ฑ„์šฐ๊ธฐ (0) 
name_df = name_df.fillna(0)

# float -> int 
name_df = name_df.astype(int)
name_df.head()

์—ฌ๊ธฐ๊นŒ์ง€ ์ €๋ฒˆ ํฌ์ŠคํŒ…์—์„œ ํ–ˆ๋˜ ๋‚ด์šฉ์ด๋‹ค. 

name_df['Sum'] = name_df['M'] + name_df['F']
name_df.head()

๋‚จ๋…€ ์ด๋ฆ„ ๊ฐœ์ˆ˜๋ฅผ ๋ชจ๋‘ ๋”ํ•ด์„œ sum ์ด๋ผ๋Š” ์ปฌ๋Ÿผ์„ ์ƒ์„ฑํ•œ๋‹ค. 

# ๋‚จ, ๋…€ ๋น„์œจ ๊ณ„์‚ฐ 
name_df['F_ratio'] = name_df['F'] / name_df['Sum']
name_df['M_ratio'] = name_df['M'] / name_df['Sum']

# ๋‚จ, ๋…€ ๋น„์œจ ๊ฐ„ ์ฐจ์ด
name_df['M_F_Gap'] = abs(name_df['F_ratio'] - name_df['M_ratio'])
name_df.head()

-1 ~ 1 ์˜ ๋ฒ”์œ„๋ฅผ abs() (์ ˆ๋Œ“๊ฐ’) ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ 0 ~ 1 ๋ฒ”์œ„๋กœ ๋ฐ”๊พธ์–ด ์ค€๋‹ค. 

# ์ด๋ฆ„ ์ด ๊ฐœ์ˆ˜๋ฅผ ๊ธฐ์ค€์œผ๋กœ ๋‚ด๋ฆผ์ฐจ์ˆœ ์ •๋ ฌ 
name_df = name_df.sort_values(by = 'Sum', ascending=False)
name_df.head(20)

๋งŽ์ด ์‚ฌ์šฉ๋œ ์ด๋ฆ„์„ ๋ฝ‘๋Š” ๊ฒƒ์ด๊ธฐ ๋•Œ๋ฌธ์— ์šฐ์„  ์ด ํ•ฉ๊ณ„ ์ปฌ๋Ÿผ์„ ๊ธฐ์ค€์œผ๋กœ ์ •๋ ฌํ•ด์ค€๋‹ค. 

cond = name_df['M_F_Gap'] < 0.1
name_df[cond].head(10)

์ด ๋•Œ ๋น„์œจ์ฐจ์ด๊ฐ€ ์ ์€ ๊ฒƒ์„ 0.1 ๋ฏธ๋งŒ์œผ๋กœ ๊ธฐ์ค€ ์žก๊ณ ,  M_F_Gap ์ปฌ๋Ÿผ์ด 0.1 ๋ณด๋‹ค ์ž‘์€ ํ–‰๋“ค์„ ์ถœ๋ ฅ์‹œํ‚จ๋‹ค.  

# ์„ฑ๋ณ„ ๊ตฌ๋ถ„์—†์ด ๋งŽ์ด ์‚ฌ์šฉ๋˜๋Š” ์ด๋ฆ„ Top 10 
name_df[cond].head(10).index

 

 

 

 

 

 ๊ฐ€์žฅ ๋Œ€ํ‘œ์ ์ธ ๋ฏธ๊ตญ์˜ ์ด๋ฆ„ ? ( ์ตœ๊ทผ ํŠธ๋ Œ๋“œ ) 

idea : ์„ธ๋Œ€๋ฅผ ๊ธฐ์ค€์œผ๋กœ ์ตœ๊ทผ ์„ธ๋Œ€(2020, 1990) ์ด๋ฆ„ ๊ฐœ์ˆ˜์˜ ๋น„์œจ์ด ํฐ ์ด๋ฆ„์ด ์ตœ๊ทผ ํŠธ๋ Œ๋“œ์— ๋งž๋Š” ๋Œ€ํ‘œ์ ์ธ ๋ฏธ๊ตญ ์ด๋ฆ„์ผ ๊ฒƒ์ด๋‹ค ! 

raw.head()

# unique() ๋ฅผ ํ†ตํ•ด, ๊ธฐ๊ฐ„์— ๋“ค์–ด๊ฐ€๋Š” ๊ฐ’๋“ค์„ ์‚ดํŽด๋ด…๋‹ˆ๋‹ค. 
raw['YearOfBirth'].unique()

array([1910, 1911, 1912, 1913, 1914, 1915, 1916, 1917, 1918, 1919, 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015], dtype=int64)

 

 

* ์„ธ๋Œ€ ๋‚˜๋ˆ„๊ธฐ 

ํ•œ ์„ธ๋Œ€ ๋‚˜๋ˆ„๋Š” ๊ธฐ์ค€ 30๋…„ : 2020๋…„ ๊ธฐ์ค€ 30๋…„์”ฉ ๊ตฌ๋ถ„

  • 1930๋…„๋Œ€ ์ด์ „
  • 1960๋…„๋Œ€ ์ด์ „
  • 1990๋…„๋Œ€ ์ด์ „
  • 2020๋…„ ์ด์ „
year_class_list = [ ]

for year in raw['YearOfBirth']:
    if year <= 1930: 
        year_class = '1930๋…„์ด์ „'
    elif year<= 1960: 
        year_class = '1960๋…„์ด์ „'
    elif year <= 1990:
        year_class = '1990๋…„์ด์ „'
    else:
        year_class = '2020๋…„์ด์ „'
    year_class_list.append(year_class)

์œ„์™€๊ฐ™์ด ๋ฐ˜๋ณต๋ฌธ๊ณผ if๋ฌธ์„ ์‚ฌ์šฉํ•˜์—ฌ ์ถœ์ƒ๋…„๋„๋ฅผ 4๊ฐœ์˜ ์„ธ๋Œ€ ๊ทธ๋ฃน์œผ๋กœ ๋‚˜๋ˆ„์–ด ์ค€๋‹ค. 

raw['year_class'] = year_class_list
raw.head()

์„ธ๋Œ€ ๊ทธ๋ฃน์„ ์ €์žฅํ•œ ๋ฆฌ์ŠคํŠธ๋ฅผ ์ด์šฉํ•˜์—ฌ year_class ์ปฌ๋Ÿผ์„ ์ƒ์„ฑํ•œ๋‹ค. 

name_period = raw.pivot_table(index = ['Name', 'Sex'], columns = 'year_class', values = 'Number', aggfunc='sum')
name_period = name_period.fillna(0)
name_period = name_period.astype(int)
name_period.head()

์ด๋ฆ„๊ณผ ์„ฑ๋ณ„์„ ์ธ๋ฑ์Šค๋กœ ์„ค์ •ํ•˜๊ณ , year_class์— ๋”ฐ๋ฅธ number์˜ ํ•ฉ๊ณ„๋ฅผ ์ถœ๋ ฅํ•œ๋‹ค. 

name_period['sum'] = name_period.sum(axis = 1)
name_period.head()

์ด๋ฆ„ ์ด ๊ฐœ์ˆ˜๋ฅผ ๊ตฌํ•˜๊ธฐ ์œ„ํ•ด sum(axis=1) ์„ ์‚ฌ์šฉํ•œ๋‹ค.  axis=1์ด๋ฉด ๊ฐ€๋กœ๋ฐฉํ–ฅ์œผ๋กœ ๊ณ„์‚ฐ์„ ํ•˜๊ฒŒ ๋œ๋‹ค. 

# ์„ธ๋Œ€ ๋ณ„ ๋น„์œจ ๊ณ„์‚ฐ 
for col in name_period.columns:
    col_new = col+"๋น„์œจ"
    name_period[col_new] = name_period[col] / name_period['sum']
    
name_period.head()

์„ธ๋Œ€ ๋ณ„ ๋น„์œจ์„ ๊ณ„์‚ฐํ•˜์—ฌ ๊ฐ ์ปฌ๋Ÿผ์„ ๋งŒ๋“ค์–ด ์ค€๋‹ค. 

# ์ด๋ฆ„ ์‚ฌ์šฉ์ˆ˜ ํ•ฉ๊ณ„, 2020๋…„ ์ด์ „ ๋น„์œจ, 1990๋…„์ด์ „ ๋น„์œจ ๊ธฐ์ค€ ๋‚ด๋ฆผ์ฐจ์ˆœ ์ •๋ ฌ 
name_period = name_period.sort_values(by = ['sum', '2020๋…„์ด์ „๋น„์œจ','1990๋…„์ด์ „๋น„์œจ'], ascending=False)
name_period

1์ˆœ์œ„ ์ด๋ฆ„ ๊ฐœ์ˆ˜ ์ด ํ•ฉ , 2์ˆœ์œ„ 2020๋…„ ์ด์ „ ๋น„์œจ , 3์ˆœ์œ„ 1990๋…„ ์ด์ „ ๋น„์œจ ๋กœ ์ •๋ ฌ์„ ํ•˜์—ฌ ์ตœ์‹  ํŠธ๋ Œ๋“œ์— ๋งž๋Š” ๋ฏธ๊ตญ ๋Œ€ํ‘œ์ด๋ฆ„์„ ์•Œ์•„๋ณธ๋‹ค. 

# ์ธ๋ฑ์Šค๊ฐ€ ์—ฌ๋Ÿฌ ๋ ˆ๋ฒจ๋กœ ๋˜์–ด์žˆ์„ ๊ฒฝ์šฐ, ์ธ๋ฑ์Šค๋ฅผ ํ™œ์šฉํ•ด ์ปจํŠธ๋กค ํ•˜๋Š” ๊ฒƒ์€ ๋ณต์žก
# reset_index()๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์ธ๋ฑ์Šค๋กœ ์„ค์ •๋œ ์ด๋ฆ„๊ณผ ์„ฑ๋ณ„์„ ์ปฌ๋Ÿผ์œผ๋กœ ๋ณ€๊ฒฝ
name_period = name_period.reset_index()
name_period.head()

์ธ๋ฑ์Šค๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์ปจํŠธ๋กค์ด ์–ด๋ ค์šฐ๋ฏ€๋กœ ํ•„์š”ํ•œ ์ง‘๊ณ„, ์—ฐ์‚ฐ์ด ๋๋‚œ ๋’ค์—๋Š” reset_index๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ column์œผ๋กœ ๋ณ€๊ฒฝํ•ด ์ค€๋‹ค. 

# ๋‚จ์ž ์ด๋ฆ„๋งŒ ์„ ํƒ
cond = name_period['Sex'] =='M'
name_period[cond].head(10)

์„ฑ๋ณ„์ด ๋‚จ์„ฑ์ธ ์ด๋ฆ„๋“ค ์ค‘์—์„œ ์ƒ์œ„ 10๊ฐœ๋ฅผ ๋ฝ‘์•„๋ณด๋ฉด ์œ„์™€ ๊ฐ™๋‹ค. 

# ์ด๋ฒˆ์—๋Š” ์—ฌ์ž์ด๋ฆ„
cond = name_period['Sex'] =='F'
name_period[cond].head(10)

์ด๋ฒˆ์—” ์„ฑ๋ณ„์ด ์—ฌ์„ฑ์ธ ์ด๋ฆ„์˜ ์ƒ์œ„ 10๊ฐœ๋ฅผ ๋ฝ‘์€ ๊ฒƒ์ด๋‹ค. 

 

๊ทธ๋Ÿฐ๋ฐ ์•„์ง์€ ์ด์ƒํ•˜๋‹ค. ํŠนํžˆ ์„ฑ๋ณ„์ด ์—ฌ์ž์ธ ์ด๋ฆ„์˜ ํ‘œ์—์„œ ์ฒซ๋ฒˆ์งธ row๋Š” 1960๋…„ ์ด์ „ ์„ธ๋Œ€์—์„œ ์•ฝ 50%์˜ ๋น„์œจ์„ ์ฐจ์ง€ํ•˜๊ณ  ์žˆ๋‹ค. ๋”ฐ๋ผ์„œ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์กฐ๊ฑด์„ ๊ฑด๋‹ค. 

cond_age = name_period['2020๋…„์ด์ „๋น„์œจ'] > 0.3
cond_sex = name_period['Sex'] == 'M'
cond = cond_age & cond_sex
name_period[cond].head(5)

 

2020๋…„ ์ด์ „๋น„์œจ์ด 0.3 ์ด์ƒ์ด๋ฉด์„œ ์„ฑ๋ณ„์ด ๋‚จ์„ฑ์ธ ์กฐ๊ฑด์ธ row๋ฅผ ์„ ํƒํ•œ๋‹ค. 

๊ฒฐ๊ณผ : < ๋‚จ์„ฑ Top 5 ์ด๋ฆ„ Christopher, Daniel, Matthew, Anthony, Andrew >

cond_age = name_period['2020๋…„์ด์ „๋น„์œจ'] > 0.3
cond_sex = name_period['Sex'] == 'F'
cond = cond_age & cond_sex
name_period[cond].head(5)

์—ฌ์„ฑ์˜ ๊ฒฝ์šฐ๋„ ๋™์ผํ•œ ์กฐ๊ฑด์œผ๋กœ ์ง„ํ–‰ํ•œ๋‹ค. 

๊ฒฐ๊ณผ : < ์—ฌ์„ฑ Top 5 ์ด๋ฆ„ Jessica, Sarah, Ashley, Stephanie, Emily >

 


 

 

 

+ Recent posts