Knowledge๐Ÿฆข/๋ฐ์ดํ„ฐ ๋ถ„์„

[๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ&์‹œ๊ฐํ™” 3] ๋ถˆ๋ฆฌ์–ธ ์ธ๋ฑ์‹ฑ, ๋ฐ์ดํ„ฐ ๋ณ‘ํ•ฉ

ํŒŒ์นดํŒŒ์˜ค 2024. 5. 10. 12:13

 

< ๋ถˆ๋ฆฌ์–ธ ์ธ๋ฑ์‹ฑ >

 

# ๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ

df = pd.read_csv("tips_data.csv")
df

 

 

# ๋ถˆ๋ฆฌ์–ธ ์ธ๋ฑ์‹ฑ

 

๊ฐ€๋กœ ์ถ•์œผ๋กœ ๋ถˆ๋ฆฌ์–ธ ์ธ๋ฑ์‹ฑ์„ ์ ์šฉํ•œ๋‹ค

 

[ ์˜ˆ์ œ 1 ]

selected = df['sex'] == 'Male'
df[selected]

 

[ ์˜ˆ์ œ 2 ]

์—ฌ๋Ÿฌ๊ฐ€์ง€ ์กฐ๊ฑด์„ & ์—ฐ์‚ฐ์ž๋กœ ์ด์–ด์ฃผ๊ธฐ

ํ˜•์‹์€ ( ) & ( )

๋ฐ˜๋“œ์‹œ ๊ด„ํ˜ธ๋ฅผ ๋„ฃ์–ด์ค„๊ฒƒ!!

 

selected = (df['sex'] == 'Male') & (df['smoker'] == 'Yes')
df[selected]

 

 

[ ์˜ˆ์ œ 3 ]

์กฐ๊ธˆ ๋” ๊ฐ„ํŽธํ•˜๊ฒŒ ๋‚˜ํƒ€๋‚ด๊ธฐ

๋ถˆ๋ฆฌ์–ธ ์กฐ๊ฑด์„ ๋‚˜ํƒ€๋‚ด๋Š” ๋ณ€์ˆ˜์— ๋‹ด์•„์„œ ๋‚˜ํƒ€๋‚ด๊ธฐ

selected = (df['sex'] == 'Male') | (df['smoker'] == 'Yes')
df[selected]

 

 

[ ์˜ˆ์ œ 4 ]

์กฐ๊ธˆ ๋” ๊ฐ„ํŽธํ•˜๊ฒŒ ๋‚˜ํƒ€๋‚ด๊ธฐ

์กฐ๊ฑด๋ผ๋ฆฌ ๋ฌถ์„์ˆ˜๋„ ์žˆ๋‹ค

cond1 = df['size'] >= 3
cond2 = df['tip'] < 2
df[cond1 & cond2]

 

 

[ ์˜ˆ์ œ 5 ]

์—ฌ๋Ÿฌ๊ฐ€์ง€ ์กฐ๊ฑด์„ \๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๋ฌถ์–ด์ฃผ๊ธฐ

cond = (df['sex'] == 'Male') \
     & (df['tip'] > 3) \
     & (df['smoker'] == 'Yes') \
     & (df['total_bill'] < 20) \
     & (df['size'] == 4)
df[cond]

 

 

# isin()์„ ์ด์šฉํ•œ ํ•„ํ„ฐ๋ง

 

Series(์‹œ๋ฆฌ์ฆˆ)๋‚˜ DataFrame(๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„)์˜ ๊ฐ’๋“ค ์ค‘์—์„œ ํŠน์ • ๊ฐ’์ด๋‚˜ ๋ฆฌ์ŠคํŠธ ์•ˆ์— ํฌํ•จ๋œ ๊ฐ’๋“ค์„ ์ฐพ์•„๋‚ด๋Š” ๋ฉ”์†Œ๋“œ. 

์›ํ•˜๋Š” ์กฐ๊ฑด์— ํ•ด๋‹นํ•˜๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ๋น ๋ฅด๊ฒŒ ํ•„ํ„ฐ๋งํ•˜๊ฑฐ๋‚˜ ์„ ํƒํ•  ์ˆ˜ ์žˆ์Œ

 

isin() ์˜ [ ] ๋ฆฌ์ŠคํŠธ ์•ˆ์— ๊ฐ’์„ ๋„ฃ๋Š”๋‹ค

 

1) ๋‹จ์ผ ๊ฐ’ ํฌํ•จ ์—ฌ๋ถ€ ํ™•์ธ

# Series๋‚˜ DataFrame์˜ ํŠน์ • ์—ด์—์„œ ๋‹จ์ผ ๊ฐ’์ด ํฌํ•จ๋˜์–ด ์žˆ๋Š”์ง€ ์—ฌ๋ถ€๋ฅผ ํ™•์ธํ•ฉ๋‹ˆ๋‹ค.

import pandas as pd

data = {'A': [1, 2, 3, 4, 5],
        'B': ['apple', 'banana', 'orange', 'grape', 'melon']}

df = pd.DataFrame(data)
df

# 'B' ์—ด์—์„œ 'banana' ๊ฐ’์ด ์žˆ๋Š”์ง€ ํ™•์ธ
result = df['B'].isin(['banana'])
print(result)

 

 

2) ์—ฌ๋Ÿฌ ๊ฐ’ ํฌํ•จ ์—ฌ๋ถ€ ํ™•์ธ

df[df['day'].isin(['Sun', 'Thur'])]

 

 

3) ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„ ์ „์ฒด์—์„œ์˜ ์‚ฌ์šฉ

 

๋”•์…”๋„ˆ๋ฆฌ ํ˜•ํƒœ๋กœ ์‚ฌ์šฉํ•œ๋‹ค

result = df.isin({'A': [1, 3], 'B': ['apple', 'orange']})
print(result)

 

 

# ์ปฌ๋Ÿผ ์ถ”๊ฐ€ํ•˜๊ธฐ - ๋‹จ์ˆœ๊ฐ’

df['created_at'] = '2024-01-01'

 

# ๋ฐ์ดํ„ฐํ˜• ํ™•์ธํ•˜๊ธฐ

df['created_at'].dtype

# ๊ฒฐ๊ณผ
dtype('O')

 

โ€ป dtype('O')์˜ ๋œป์ด๋ž€

Object ํ˜•์ด๋ผ๋Š” ๋œป์ด๋‹ค!

 

df.info()

 

# ๋‚ ์งœ ํ˜•์‹์œผ๋กœ ๋ฐ”๊พธ๊ธฐ

df['created_at'] = pd.to_datetime(df['created_at'])
df.info()

 

 

# ์ปฌ๋Ÿผ ์ถ”๊ฐ€ํ•˜๊ธฐ - ์—ฐ์‚ฐ์œผ๋กœ ์ถ”๊ฐ€ํ•˜๊ธฐ

df['revenue'] = df['total_bill'] + df['tip']

 

 

# ๋ฐฑ๋ถ„์œจ ํผ์„ผํŠธ ์ถ”๊ฐ€ํ•˜๊ธฐ

df['tip_percentage'] = df['tip'] / df['revenue'] * 100

 

 

< ๋ฐ์ดํ„ฐ ๋ณ‘ํ•ฉ >

 

# ๋ฐ์ดํ„ฐ ๋ณ‘ํ•ฉ ์˜ˆ์ œ ๋ฐ์ดํ„ฐ

# ๋ฐ์ดํ„ฐ ๋ณ‘ํ•ฉ
df1 = pd.DataFrame({
    'A' : ['A0', 'A1', 'A2', 'A3'],
    'B' : ['B0', 'B1', 'B2', 'B3'],
    'C' : ['C0', 'C1', 'C2', 'C3'],
    'D' : ['D0', 'D1', 'D2', 'D3'],
})

df2 = pd.DataFrame({
    'A' : ['A4', 'A5', 'A6', 'A7'],
    'B' : ['B4', 'B5', 'B6', 'B7'],
    'C' : ['C4', 'C5', 'C6', 'C7'],
    'D' : ['D4', 'D5', 'D6', 'D7'],
})

df3 = pd.DataFrame({
    'A' : ['A8', 'A9', 'A10', 'A11'],
    'B' : ['B8', 'B9', 'B10', 'B11'],
    'C' : ['C8', 'C9', 'C10', 'C11'],
    'D' : ['D8', 'D9', 'D10', 'D11'],
})

 

 

< 1. concat >

 

concat ํ•จ์ˆ˜๋Š” ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์„ ์œ„์•„๋ž˜ ํ˜น์€ ์ขŒ์šฐ๋กœ ํ•ฉ์น  ์ˆ˜ ์žˆ๋‹ค

 

- axis

์—ฐ๊ฒฐํ•˜๊ณ ์ž ํ•˜๋Š” ์ถ•(๋ฐฉํ–ฅ)์„ ์ง€์ •ํ•ฉ๋‹ˆ๋‹ค. 
๊ธฐ๋ณธ๊ฐ’์€ 0์œผ๋กœ, ์œ„์•„๋ž˜๋กœ ์—ฐ๊ฒฐํ•˜๋Š” ๊ฒฝ์šฐ์— ํ•ด๋‹นํ•ฉ๋‹ˆ๋‹ค. 1๋กœ ์„ค์ •ํ•˜๋ฉด ์ขŒ์šฐ๋กœ ์—ฐ๊ฒฐํ•ฉ๋‹ˆ๋‹ค.

 

- ignore_index

๊ธฐ๋ณธ๊ฐ’์€ False์ด๋ฉฐ, ์—ฐ๊ฒฐ๋œ ๊ฒฐ๊ณผ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์˜ ์ธ๋ฑ์Šค๋ฅผ ์œ ์ง€ํ•ฉ๋‹ˆ๋‹ค. True๋กœ ์„ค์ •ํ•˜๋ฉด ์ƒˆ๋กœ์šด ์ธ๋ฑ์Šค๋ฅผ ์ƒ์„ฑํ•ฉ๋‹ˆ๋‹ค. (True → ๊ธฐ์กด ์ธ๋ฑ์Šค๋ฅผ ๋ฌด์‹œํ•˜๊ณ  ์ƒˆ๋กญ๊ฒŒ ์ธ๋ฑ์Šค๋ฅผ ์„ค์ •)

 

 

 

 

# ์ƒ˜ํ”Œ ๋ฐ์ดํ„ฐ

# ๋ฐ์ดํ„ฐ ๋ณ‘ํ•ฉ
df1 = pd.DataFrame({
    'A' : ['A0', 'A1', 'A2', 'A3'],
    'B' : ['B0', 'B1', 'B2', 'B3'],
    'C' : ['C0', 'C1', 'C2', 'C3'],
    'D' : ['D0', 'D1', 'D2', 'D3'],
})

df2 = pd.DataFrame({
    'A' : ['A4', 'A5', 'A6', 'A7'],
    'B' : ['B4', 'B5', 'B6', 'B7'],
    'C' : ['C4', 'C5', 'C6', 'C7'],
    'D' : ['D4', 'D5', 'D6', 'D7'],
})

df3 = pd.DataFrame({
    'A' : ['A8', 'A9', 'A10', 'A11'],
    'B' : ['B8', 'B9', 'B10', 'B11'],
    'C' : ['C8', 'C9', 'C10', 'C11'],
    'D' : ['D8', 'D9', 'D10', 'D11'],
})

 

 

# concat์œผ๋กœ ํ•ฉ์น˜๊ธฐ

pd.concat([df1, df2, df3])

 

 

pd.concat([df1, df2, df3]).reset_index(drop=True)

 

# axis = 0 ์ธ ์œ„์•„๋ž˜๊ฐ€ ๊ธฐ๋ณธ๊ฐ’์ด๋‹ค

# ๊ธฐ๋ณธ๊ฐ’์€ ์œ„์•„๋ž˜์ธ axis = 0
pd.concat([df1, df2, df3], axis=0).reset_index(drop=True)

 

 

 

# ๋ฐ์ดํ„ฐ ์ƒ˜ํ”Œ

# ๋ฐ์ดํ„ฐ ๊ฐ€์ ธ์˜ค๊ธฐ
df1 = pd.DataFrame({
    'A' : ['A0', 'A1', 'A2', 'A3'],
    'B' : ['B0', 'B1', 'B2', 'B3'],
    'C' : ['C0', 'C1', 'C2', 'C3'],
    'D' : ['D0', 'D1', 'D2', 'D3'],
})

df2 = pd.DataFrame({
    'A' : ['A4', 'A5', 'A6', 'A7'],
    'B' : ['B4', 'B5', 'B6', 'B7'],
    'C' : ['C4', 'C5', 'C6', 'C7'],
    'D' : ['D4', 'D5', 'D6', 'D7'],
})

 

 

# concat์—์„œ axis = 1 ๊ฐ€๋กœ์„ธ๋กœ ํ•ด๋ณด๊ธฐ

 

df = pd.concat([df1, df2], axis=1)
df

 

 

# ๋งŒ์•ฝ ์œ„์•„๋ž˜ ๋ถ™์ผ๋•Œ ๋‹ค๋ฅด๋‹ค๋ฉด

# ๋งŒ์•ฝ ์œ„์•„๋ž˜ ๋ถ™์ผ๋•Œ ๋‹ค๋ฅด๋‹ค๋ฉด
# ๋ฐ์ดํ„ฐ ๋ณ‘ํ•ฉ
df1 = pd.DataFrame({
    'A' : ['A0', 'A1', 'A2', 'A3'],
    'B' : ['B0', 'B1', 'B2', 'B3'],
    'C' : ['C0', 'C1', 'C2', 'C3'],
    'D' : ['D0', 'D1', 'D2', 'D3'],
})

df2 = pd.DataFrame({
    'A' : ['A4', 'A5', 'A6', 'A7'],
    'B' : ['B4', 'B5', 'B6', 'B7'],
    'C' : ['C4', 'C5', 'C6', 'C7'],
    'D' : ['D4', 'D5', 'D6', 'D7'],
})

df3 = pd.DataFrame({
    'A' : ['A8', 'A9', 'A10', 'A11'],
    'B' : ['B8', 'B9', 'B10', 'B11'],
    'C' : ['C8', 'C9', 'C10', 'C11'],
    'D' : ['D8', 'D9', 'D10', 'D11'],
    'E' : ['E8', 'E9', 'E10', 'E11']
})
df = pd.concat([df1, df2, df3], axis=0).reset_index(drop=True)
df

 

 

# ๋งŒ์•ฝ ๊ฐ€๋กœ์„ธ๋กœ ๋ถ™์ผ๋•Œ ๋‹ค๋ฅด๋‹ค๋ฉด

 

# ๋งŒ์•ฝ ๊ฐ€๋กœ์„ธ๋กœ ๋ถ™์ผ๋•Œ ๋‹ค๋ฅด๋‹ค๋ฉด
# ๋ฐ์ดํ„ฐ ๋ณ‘ํ•ฉ
df1 = pd.DataFrame({
    'A' : ['A0', 'A1', 'A2', 'A3'],
    'B' : ['B0', 'B1', 'B2', 'B3'],
    'C' : ['C0', 'C1', 'C2', 'C3'],
    'D' : ['D0', 'D1', 'D2', 'D3'],
})

df2 = pd.DataFrame({
    'A' : ['A4', 'A5', 'A6', 'A7','A8'],
    'B' : ['B4', 'B5', 'B6', 'B7','B8'],
    'C' : ['C4', 'C5', 'C6', 'C7','C8'],
    'D' : ['D4', 'D5', 'D6', 'D7','D8'],
})
df = pd.concat([df1, df2], axis=1)
df

 

 

< 2. merge >

 

๋ฐ์ดํ„ฐ๋ฅผ ์ขŒ์šฐ๋กœ ํ•ฉ์น˜๊ธฐ

SQL์˜ JOIN ์—ฐ์‚ฐ๊ณผ ์œ ์‚ฌํ•œ ๋ฐฉ์‹์œผ๋กœ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์„ ํ•ฉ์น  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ฃผ๋กœ ๋‘ ๊ฐœ ์ด์ƒ์˜ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์—์„œ ๊ณตํ†ต๋œ ์—ด์ด๋‚˜ ์ธ๋ฑ์Šค๋ฅผ ๊ธฐ์ค€์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ๋ณ‘ํ•ฉํ•  ๋•Œ ํ™œ์šฉ๋ฉ๋‹ˆ๋‹ค.

 

# how

๋ณ‘ํ•ฉ ๋ฐฉ๋ฒ•์„ ๋‚˜ํƒ€๋‚ด๋Š” ๋งค๊ฐœ๋ณ€์ˆ˜๋กœ, 'inner', 'outer', 'left', 'right' ๋“ฑ์˜ ์˜ต์…˜์ด ์žˆ์Šต๋‹ˆ๋‹ค.

 

1) inner

๊ณตํ†ต๋œ ํ‚ค(์—ด)๋ฅผ ๊ธฐ์ค€์œผ๋กœ ๊ต์ง‘ํ•ฉ์„ ๋งŒ๋“ญ๋‹ˆ๋‹ค.

 

2) outer

๊ณตํ†ต๋œ ํ‚ค๋ฅผ ๊ธฐ์ค€์œผ๋กœ ํ•ฉ์ง‘ํ•ฉ์„ ๋งŒ๋“ญ๋‹ˆ๋‹ค.

 

3) left

์™ผ์ชฝ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์˜ ๋ชจ๋“  ํ–‰์„ ํฌํ•จํ•˜๊ณ  ์˜ค๋ฅธ์ชฝ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์€ ๊ณตํ†ต๋œ ํ‚ค์— ํ•ด๋‹นํ•˜๋Š” ํ–‰๋งŒ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค.

 

4) right

์˜ค๋ฅธ์ชฝ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์˜ ๋ชจ๋“  ํ–‰์„ ํฌํ•จํ•˜๊ณ  ์™ผ์ชฝ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์€ ๊ณตํ†ต๋œ ํ‚ค์— ํ•ด๋‹นํ•˜๋Š” ํ–‰๋งŒ ํฌํ•จํ•ฉ๋‹ˆ๋‹ค.

 

 

# ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„ ์˜ˆ์ œ

# 5-2. merge
df1 = pd.DataFrame({
    'key' : ['A', 'B', 'C', 'D'],
    'value' : [1,2,3,4]
})

df2 = pd.DataFrame({
    'key' : ['B', 'D', 'D', 'E'],
    'value' : [5,6,7,8]
})

 

 

# inner

# ๊ธฐ๋ณธ๊ฐ’์€ inner
pd.merge(df1, df2, on='key', how='inner')

 

# outer

pd.merge(df1, df2, on='key', how='outer')

 

# left

pd.merge(df1, df2, on='key', how='left')