
[Python] Classifying Items into Categories ⚡️

파카파오 2024. 7. 10. 20:25

 

# Project topic: Analysis of Amazon customer food purchase data

 

 

 

# Column descriptions

 

✔️Discount Amount : discount amount

✔️List Amount : list price (amount before discount)

✔️Sales Amount : actual sales amount

-> Sales Price * Sales Quantity



✔️Sales Amount Based on List Price : total sales amount with no discount applied

✔️Sales Cost Amount : cost incurred to sell the product

✔️Sales Margin Amount : sales margin amount

-> Sales Amount - Sales Cost Amount



✔️Sales Price : actual selling price

✔️Sales Quantity : quantity of items sold
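
The two derived columns above can be sanity-checked with a short sketch. The tiny DataFrame here is made-up illustration data, not rows from base_df.csv; only the column names come from the descriptions above.

```python
import pandas as pd

# Hypothetical toy rows; the numbers are invented for illustration only.
toy = pd.DataFrame({
    'Sales Price': [2.0, 3.5],
    'Sales Quantity': [3, 2],
    'Sales Cost Amount': [4.0, 5.0],
})

# Sales Amount = Sales Price * Sales Quantity
toy['Sales Amount'] = toy['Sales Price'] * toy['Sales Quantity']

# Sales Margin Amount = Sales Amount - Sales Cost Amount
toy['Sales Margin Amount'] = toy['Sales Amount'] - toy['Sales Cost Amount']

print(toy)
```

If the real data follows these definitions, recomputing the columns this way and comparing them with the stored values is a quick way to spot inconsistent rows.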

 

 

< Classifying categories >



1. Load the data



Import the packages

import pandas as pd
import sklearn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import re

 

Read the CSV file

df = pd.read_csv('base_df.csv')
df = df.dropna()

 

 

⚡️Drop rows that contain NaN

df = df.dropna()





⚡️Drop rows that contain NaN in specific columns

df = df.dropna(subset=['column1', 'column2'])





⚡️Drop a column

df = df.drop('column1', axis=1)
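
To see how the three calls differ, here is a minimal sketch on a made-up DataFrame. The column names 'Memo' and the values are assumptions for illustration, not part of base_df.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing Item and one missing price.
toy = pd.DataFrame({
    'Item': ['Apple', None, 'Milk'],
    'Sales Price': [1.0, 2.0, np.nan],
    'Memo': ['a', 'b', 'c'],
})

all_clean = toy.dropna()                   # drops every row that has any NaN
item_clean = toy.dropna(subset=['Item'])   # drops only rows where Item is NaN
no_memo = toy.drop('Memo', axis=1)         # removes the Memo column, keeps all rows

print(len(all_clean), len(item_clean), list(no_memo.columns))
```

None of these calls modify toy in place, which is why the post assigns the result back to df each time.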

 

 

 

Check df

df

 

 

 

Check the Item column

df['Item']

 

 

 

2. Create a Product column from the last two words

 

df['Product'] = df['Item'].apply(lambda x : ' '.join(x.split()[-2:]) if len(x.split()) >= 2 else x)

 

 

⚡️Important syntax

column.apply(lambda x: expression)

groupby(column).apply(lambda x: expression)
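
The difference between the two forms is what the lambda receives. A small sketch, using invented rows and assuming a 'Category' column like the one created later in this post:

```python
import pandas as pd

# Hypothetical toy rows for illustration only.
toy = pd.DataFrame({
    'Category': ['Fruits', 'Fruits', 'Dairy'],
    'Sales Amount': [6.0, 4.0, 5.0],
})

# column.apply: the lambda receives one value at a time.
doubled = toy['Sales Amount'].apply(lambda x: x * 2)

# groupby(column).apply: the lambda receives one group (a Series here) at a time.
totals = toy.groupby('Category')['Sales Amount'].apply(lambda g: g.sum())

print(doubled.tolist())
print(totals)
```

For a plain sum, groupby('Category')['Sales Amount'].sum() does the same job; apply is mainly useful when the per-group logic is custom.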

 

 

⚡️Get the first two words of a string

x.split()[:2]

The slice returns a list of strings, so wrap it in ' '.join() to combine the words back into a single string.





⚡️Get the last two words of a string

x.split()[-2:]



As above, wrap the slice in ' '.join() to turn the list of words back into a single string.
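
A minimal sketch of both slices on plain strings. The item names are hypothetical and chosen only to illustrate the slicing.

```python
# Hypothetical item name, used only to illustrate the slicing.
item = 'Better Large Canned Shrimp'
words = item.split()   # ['Better', 'Large', 'Canned', 'Shrimp']

first_two = ' '.join(words[:2])    # first two words as one string
last_two = ' '.join(words[-2:])    # last two words as one string

# Slicing never raises on short input, so a one-word item comes back unchanged.
single = ' '.join('Milk'.split()[-2:])

print(first_two, '|', last_two, '|', single)
```

Because a slice on a short list simply returns what is there, the len(x.split()) >= 2 check in the lambda is a safeguard rather than a strict requirement.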

 

 

Check df

df

 

 

Check the unique values of df['Product']

df['Product'].unique()



If you click "text editor" here, you can view the entire output!

 

 

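⚡️Kinds of counting functions



1) count

Counts non-null values without removing duplicates



2) nunique

Counts the values after removing duplicates (the number of distinct values)



3) unique

Returns the distinct values themselves



4) value_counts

Counts the occurrences of each distinct value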
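
The four functions side by side, as a small sketch on an invented Series (the values are assumptions, not taken from the dataset):

```python
import pandas as pd

# Hypothetical values: one duplicate and one missing entry.
s = pd.Series(['Apple', 'Milk', 'Apple', None])

n_values = s.count()         # non-null values, duplicates included
n_distinct = s.nunique()     # distinct non-null values
distinct = s.unique()        # array of the distinct values (missing value included)
per_value = s.value_counts() # occurrences of each distinct non-null value

print(n_values, n_distinct)
print(distinct)
print(per_value)
```

Note that unique() keeps the missing value in its result while count(), nunique(), and value_counts() skip it by default.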

 

 

3. Define categories and classify the items

 

df_product = df[['Product']]
df_product

 

 

After running df['Product'].unique(), I opened the output in the text editor and asked GPT to sort the values into categories.



Create a dictionary keyed by category



# Category definitions
categories = {
    'Food': ['Butter', 'Rice', 'Soup', 'Pasta', 'Pizza', 'Sandwich', 'Cake', 'Cookie', 'Brownie', 'Candy', 'Chocolate', 'Noodle'],
    'Vegetables': ['Vegetable', 'Potato', 'Onion', 'Carrot', 'Tomato', 'Lettuce', 'Broccoli', 'Asparagus', 'Mushroom', 'Pepper', 'Garlic'],
    'Fruits': ['Apple', 'Orange', 'Banana', 'Lemon', 'Lime', 'Grape', 'Peach', 'Plum', 'Cherry', 'Berry', 'Melon'],
    'Dairy': ['Milk', 'Cheese', 'Yogurt', 'Butter', 'Cream'],
    'Beverages': ['Juice', 'Soda', 'Cola', 'Wine', 'Beer', 'Drink'],
    'Snacks': ['Chips', 'Pretzels', 'Popcorn', 'Crackers', 'Jerky', 'Nuts', 'Mints', 'Waffles'],
    'Bread': ['Bread', 'Bagel', 'Muffin', 'Donut', 'Roll'],
    'Canned Food': ['Canned', 'Tuna', 'Sardines', 'Tomatos', 'Peaches', 'Beans', 'Corn', 'Soup'],
    'Other': []
}

 

 

Check the key-value pairs with the dictionary's items()

categories.items()
dict_items([('Food', ['Butter', 'Rice', 'Soup', 'Pasta', 'Pizza', 'Sandwich', 'Cake', 'Cookie', 'Brownie', 'Candy', 'Chocolate', 'Noodle']), 
            ('Vegetables', ['Vegetable', 'Potato', 'Onion', 'Carrot', 'Tomato', 'Lettuce', 'Broccoli', 'Asparagus', 'Mushroom', 'Pepper', 'Garlic']), 
            ('Fruits', ['Apple', 'Orange', 'Banana', 'Lemon', 'Lime', 'Grape', 'Peach', 'Plum', 'Cherry', 'Berry', 'Melon']), 
            ('Dairy', ['Milk', 'Cheese', 'Yogurt', 'Butter', 'Cream']), 
            ('Beverages', ['Juice', 'Soda', 'Cola', 'Wine', 'Beer', 'Drink']), 
            ('Snacks', ['Chips', 'Pretzels', 'Popcorn', 'Crackers', 'Jerky', 'Nuts', 'Mints', 'Waffles']), 
            ('Bread', ['Bread', 'Bagel', 'Muffin', 'Donut', 'Roll']), 
            ('Canned Food', ['Canned', 'Tuna', 'Sardines', 'Tomatos', 'Peaches', 'Beans', 'Corn', 'Soup']), 
            ('Other', [])])

 

 

 

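⚡️ Category mapping function

# Category mapping function: return the first category whose keyword appears in item
def categorize_item(item):
    for category, keywords in categories.items():
        # \b restricts the match to whole words; re.escape keeps the keyword literal
        if any(re.search(r'\b' + re.escape(keyword) + r'\b', item, re.IGNORECASE) for keyword in keywords):
            return category
    return 'Other'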

 

 

re.search(pattern, item, re.IGNORECASE) searches item for the keyword without distinguishing upper and lower case.



If a keyword is found in item, the function returns that category. Categories are checked in dictionary order, so the first matching category wins.



If no category matches, it returns 'Other'.
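
A minimal sketch of how the function behaves, using a trimmed-down copy of the dictionary and made-up item strings (both are assumptions for illustration). It highlights two things worth checking on the real data: a keyword listed under two categories always resolves to the earlier one, and the word boundary means a singular keyword does not match a plural item name.

```python
import re

# Trimmed-down, hypothetical copy of the categories dictionary.
categories = {
    'Food': ['Butter', 'Soup'],
    'Fruits': ['Apple'],
    'Dairy': ['Milk', 'Butter'],
    'Other': [],
}

def categorize_item(item):
    for category, keywords in categories.items():
        if any(re.search(r'\b' + re.escape(k) + r'\b', item, re.IGNORECASE) for k in keywords):
            return category
    return 'Other'

butter = categorize_item('Fresh butter')   # 'Butter' is in Food and Dairy; Food comes first
milk = categorize_item('Whole Milk')       # only Dairy lists 'Milk'
apples = categorize_item('Red Apples')     # 'Apple' is not a whole word inside 'Apples'

print(butter, milk, apples)
```

If plural item names should be caught, one option is to list both forms as keywords (for example 'Apple' and 'Apples'); whether that is needed depends on the actual values in df['Item'].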

 

 

# The role of the word boundary \b



\b matches the position between a word character (a letter, digit, or underscore) and a non-word character such as whitespace or punctuation, or the start or end of the string.



For example, \bword\b matches exactly the word 'word' and does not match strings such as 'sword' or 'wording'.



Without \b, a search for word would also match inside sword.
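
The effect of \b can be checked directly with re.search. The sample strings below are chosen only to mirror the 'word', 'sword', 'wording' example above.

```python
import re

samples = ['a word here', 'sword', 'wording']

# With word boundaries: only the standalone word matches.
with_boundary = [bool(re.search(r'\bword\b', s)) for s in samples]

# Without word boundaries: any string containing the letters matches.
without_boundary = [bool(re.search(r'word', s)) for s in samples]

print(with_boundary)
print(without_boundary)
```

The r prefix matters here: in a normal Python string '\b' is a backspace character, so the pattern has to be written as a raw string for the regex engine to see a word boundary.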

 

 

 

Apply the category mapping function to df['Item'] to create a new Category column

df['Category'] = df['Item'].apply(categorize_item)

 

 

 

Verify the new Category column

df['Category']

 

 

 

df