
[Let's Play with Statistics 2] A/B Test, Significance Level, p-value

파카파오 2024. 6. 10. 16:11

< Analysis Method Selection Diagram >

 

 

 

< A/B Test >

 

 

# Process

 

 

< Significance Level >

= the allowable range of error

Significance level: the maximum probability of error we are willing to accept when the null hypothesis is actually true, that is, the probability of rejecting a true null hypothesis.

As we learned earlier, the moment a sample is drawn it can no longer match the population 100%, so there is always some possibility of error.

To interpret the conclusion of a hypothesis test, we have to set a criterion and then check whether the result satisfies it. The significance level is that criterion.
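To make the idea of an "allowable error" concrete, here is a minimal simulation sketch. It is not part of the original exercise: the population parameters, sample size, and number of trials are made-up assumptions. Both samples are drawn from the same population, so the null hypothesis is true by construction, and the share of tests with p < 0.05 should land close to the significance level.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable
alpha = 0.05                     # significance level
n_trials = 2000                  # number of simulated experiments (assumed)
n = 100                          # sample size per group (assumed)

# Both groups come from the same normal population, so H0 is true.
a = rng.normal(loc=50, scale=10, size=(n_trials, n))
b = rng.normal(loc=50, scale=10, size=(n_trials, n))

# One independent two-sample t-test per row.
_, p_values = stats.ttest_ind(a, b, axis=1)

# Fraction of experiments that wrongly reject a true null hypothesis.
false_positive_rate = float(np.mean(p_values < alpha))
print(f"false positive rate: {false_positive_rate:.3f} (alpha = {alpha})")
```

In other words, choosing 0.05 means accepting that roughly 1 in 20 tests will report a difference even when none exists.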

 

 

 

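< Test Statistic and p-value >

The test statistic is a random variable, computed from the sample, that is used to decide whether to reject the null hypothesis.

p-value: the probability, assuming the null hypothesis is true, of obtaining by chance alone a result at least as extreme as the one observed.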
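The relationship between the two can be sketched by computing a t statistic and its two-sided p-value by hand and comparing them with SciPy. The data below are randomly generated for illustration only, and the equal-variance (pooled) form is assumed because that is the default of `stats.ttest_ind`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(loc=60, scale=20, size=200)  # made-up sample A
group_b = rng.normal(loc=58, scale=20, size=150)  # made-up sample B

n_a, n_b = len(group_a), len(group_b)
dof = n_a + n_b - 2

# Pooled variance of the two samples (equal-variance Student's t-test).
pooled_var = ((n_a - 1) * group_a.var(ddof=1)
              + (n_b - 1) * group_b.var(ddof=1)) / dof

# Test statistic: difference in means divided by its standard error.
std_err = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
t_manual = (group_a.mean() - group_b.mean()) / std_err

# Two-sided p-value: probability of |T| >= |t_manual| under H0.
p_manual = 2 * stats.t.sf(abs(t_manual), dof)

t_scipy, p_scipy = stats.ttest_ind(group_a, group_b)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```

The test statistic summarizes the sample in one number, and the p-value is the tail area of the reference distribution beyond that number.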

 

 

< Source Code Practice >

# Loading the libraries

# Import libraries
import pandas as pd
import numpy as np
# Python library for scientific computing
import scipy.stats as stats
from PIL import Image

df = pd.read_csv("users1.csv")

 

 

# t-test

# t-test
# Set up the hypotheses
# Null hypothesis: there is no difference in purchase amount between men and women
# Alternative hypothesis: there is a difference in purchase amount between men and women
# Compare the actual data
df.groupby(['Gender'])['Purchase Amount (USD)'].mean().reset_index()

# Split the data
# using boolean masks
mask = (df['Gender'] == 'Male')
mask1 = (df['Gender'] == 'Female')

m_df = df[mask]
f_df = df[mask1]

# Keep only the purchase amount column (as a Series, so ttest_ind returns scalars)
m_df = m_df['Purchase Amount (USD)']
f_df = f_df['Purchase Amount (USD)']

# The group means look different
# Set the significance level to the commonly used 0.05
# The scipy library gives us the t-score and the p-value.
# A t-test is used to analyze the difference between sample means, mainly when the population variance is unknown.
t, pvalue = stats.ttest_ind(m_df, f_df)

# The t-score measures how far apart the group means are, relative to the standard error of the difference.
# A t-score with a large absolute value means a large difference between the groups.

# The p-value is the probability of seeing a difference at least this large by chance if the null hypothesis is true.
# p-value greater than 0.05 = the difference could plausibly be due to chance = no statistically significant difference
# Here the p-value is greater than 0.05, so we cannot conclude that purchase amount differs between men and women.
# Fail to reject the null hypothesis (the alternative hypothesis is not supported)
t, pvalue
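As a variation on the cell above, the sketch below runs Welch's t-test (`equal_var=False`), which does not assume the two groups share the same variance, and turns the p-value into an explicit decision. The DataFrame `demo` is randomly generated stand-in data that only borrows the column names; it is not the `users1.csv` file, so its result says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical stand-in data with the same column names as the exercise.
demo = pd.DataFrame({
    "Gender": ["Male"] * 300 + ["Female"] * 200,
    "Purchase Amount (USD)": rng.integers(20, 101, size=500),
})

# Select each group's purchase amounts as a Series.
male = demo.loc[demo["Gender"] == "Male", "Purchase Amount (USD)"]
female = demo.loc[demo["Gender"] == "Female", "Purchase Amount (USD)"]

alpha = 0.05  # significance level

# Welch's t-test: does not assume equal population variances.
t_stat, p_value = stats.ttest_ind(male, female, equal_var=False)

# Reject H0 only when the p-value falls below the significance level.
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"t = {t_stat:.3f}, p = {p_value:.3f} -> {decision}")
```

Whether to use the pooled or the Welch form is a judgment call; Welch's version is often preferred when group sizes or spreads differ.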

 

 

# Chi-square test

# Chi-square test
# Set up the hypotheses
# Null hypothesis: there is no association between gender and purchased Size
# Alternative hypothesis: there is an association between gender and purchased Size
# Compare the actual data
df.groupby(['Gender','Size'])['Customer ID'].count().reset_index()

# Build a frequency table of the two categorical variables with the pandas crosstab function.

result = pd.crosstab(df['Gender'], df['Size'])

# Run the chi-square test with scipy.stats
# chi2_contingency returns the chi-square statistic, the p-value, the degrees of freedom, and the expected frequencies.
stats.chi2_contingency(observed=result)

# Look at each value separately
# Index 0 is the chi-square test statistic, index 1 the p-value, and index 2 the degrees of freedom.
stats.chi2_contingency(observed=result)[0]

# The p-value is the probability of seeing an association at least this strong by chance if the null hypothesis is true.
# p-value greater than 0.05 = the pattern could plausibly be due to chance = no statistically significant association
# Here the p-value is greater than 0.05, so we cannot conclude that gender and purchased Size are associated.
# Fail to reject the null hypothesis (the alternative hypothesis is not supported)
stats.chi2_contingency(observed=result)[1]

# The decision can also be made by comparing the statistic with the critical value determined by the degrees of freedom and the significance level.
# For a contingency table, the degrees of freedom are (number of groups in variable 1 - 1) * (number of groups in variable 2 - 1).
# Here that gives 1 * 3 = 3.
stats.chi2_contingency(observed=result)[2]
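The degrees-of-freedom rule and the critical-value approach mentioned above can be sketched by computing the chi-square statistic by hand. The counts in `observed` are invented for illustration (a 2 x 4 table shaped like a Gender by Size crosstab); they are not taken from `users1.csv`.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical 2 x 4 frequency table (made-up counts).
observed = pd.DataFrame(
    [[120, 300, 180, 60],
     [250, 640, 390, 140]],
    index=["Female", "Male"],
    columns=["L", "M", "S", "XL"],
)
obs = observed.to_numpy()

# Expected counts under independence: row total * column total / grand total.
expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()

# Chi-square statistic: sum of (observed - expected)^2 / expected.
chi2_manual = ((obs - expected) ** 2 / expected).sum()

# Degrees of freedom: (rows - 1) * (columns - 1) = 1 * 3 = 3.
dof = (obs.shape[0] - 1) * (obs.shape[1] - 1)
p_manual = stats.chi2.sf(chi2_manual, dof)

# Critical value for alpha = 0.05: reject H0 if the statistic exceeds it.
alpha = 0.05
critical_value = stats.chi2.ppf(1 - alpha, dof)

chi2_scipy, p_scipy, dof_scipy, expected_scipy = stats.chi2_contingency(observed)
print(chi2_manual, p_manual, dof, critical_value)
```

Comparing the statistic with the critical value and comparing the p-value with alpha are two views of the same decision, so they always agree.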