
[Pandas Individual Assignment] A Pandas practice assignment using the Iris (iris flower) dataset!

파카파오 2024. 5. 16. 16:09

< Basic: 10 Problems >

 

Problem 1: Loading the data

  • Problem) Import pandas, then load the data and inspect it.
import pandas as pd

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
columns = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Species']
iris = pd.read_csv(url, header=None, names=columns)
iris
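
The UCI file has no header row, which is why `header=None` and `names=columns` are passed. A minimal offline sketch of the same call, assuming three sample rows pasted into a string (the `sample_csv` text and the `sample` name are illustrative, not the full dataset):

```python
import io

import pandas as pd

# Three rows in the same headerless layout as iris.data (illustrative sample only).
sample_csv = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)
columns = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width', 'Species']

# header=None tells pandas the first line is data; names supplies the column labels.
sample = pd.read_csv(io.StringIO(sample_csv), header=None, names=columns)
print(sample.shape)
```

Without `header=None`, pandas would treat the first data row as column names and silently drop one sample.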

 

 

 

Problem 2: Understanding the data structure

  • Problem) Print the first 5 rows of the dataset and get a sense of its structure.
iris.head(5)

 

 

 

Problem 3: Checking the summary information

  • Problem) Check the summary information of the dataset.
iris.info()

 

 

Problem 4: Checking descriptive statistics

  • Problem) Print the descriptive statistics of each column (mean, standard deviation, minimum, maximum, and so on).
iris.describe()

 

 

Problem 5: Selecting a specific column

  • Problem) Select and print only the 'Sepal Length' column.
iris['Sepal Length']

 

 

 

Problem 6: Filtering rows that match a condition

  • Problem) Select only the rows where 'Species' is 'Iris-setosa' and create a new DataFrame from them.
cond1 = iris['Species'] == 'Iris-setosa'
iris.loc[cond1, :]
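
The task asks for a new DataFrame, while the filter above only displays the selection. A small sketch, assuming a toy frame `toy` and a hypothetical result name `setosa`, that stores an independent copy:

```python
import pandas as pd

toy = pd.DataFrame({
    'Sepal Length': [5.1, 7.0, 4.9],
    'Species': ['Iris-setosa', 'Iris-versicolor', 'Iris-setosa'],
})

# Boolean mask: True where the row is Iris-setosa.
mask = toy['Species'] == 'Iris-setosa'

# .copy() makes the result independent of toy; reset_index renumbers rows from 0.
setosa = toy.loc[mask].copy().reset_index(drop=True)
print(setosa)
```

Working on a copy means later column assignments on `setosa` do not touch the original frame.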

 

 

 

Problem 7: Computing statistics per group

  • Problem) Compute the mean of 'Sepal Length' for each species ('Species').
iris.groupby('Species')['Sepal Length'].mean()

 

 

Problem 8: Adding a new column

  • Problem) Compute the sum of 'Sepal Length' and 'Sepal Width' for each row and add it as a new column, 'Sepal Sum'.
iris['Sepal Sum'] = iris['Sepal Length'] + iris['Sepal Width']
iris

 

 

Problem 9: Sorting the data

  • Problem) Sort the DataFrame in descending order by 'Petal Length'.
iris.sort_values('Petal Length', ascending=False)

 

 

 

Problem 10: Counting specific values

  • Problem) Count how many samples there are for each species ('Species').
iris['Species'].value_counts()
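
As a small extension, `value_counts` can also report each species as a share of the total. A sketch on a toy Series (the `species` values here are made up for illustration):

```python
import pandas as pd

species = pd.Series(['Iris-setosa', 'Iris-setosa', 'Iris-setosa', 'Iris-virginica'])

counts = species.value_counts()                # absolute counts per value
shares = species.value_counts(normalize=True)  # same counts as fractions of the total
print(counts)
print(shares)
```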

 

 

 

< Challenge: 10 Problems >

 

Problem 11: Combining data

  • Problem) Compute the means of 'Sepal Length' and 'Petal Length', then combine them into a new DataFrame.

 

sepal_length_mean = iris['Sepal Length'].mean()
petal_length_mean = iris['Petal Length'].mean()

df1 = pd.DataFrame({
    'sepal_length_mean' : [sepal_length_mean]
})

df2 = pd.DataFrame({
    'petal_length_mean' : [petal_length_mean]
})

result_horizontal = pd.concat([df1, df2], axis=1)
result_horizontal
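
The same one-row result can probably be reached without building two frames first. A sketch on a toy frame (the data and the `means` name are illustrative) that takes both means in one call and reshapes the result:

```python
import pandas as pd

toy = pd.DataFrame({
    'Sepal Length': [5.0, 6.0, 7.0],
    'Petal Length': [1.0, 4.0, 7.0],
})

# mean() on two columns returns a Series indexed by column name;
# to_frame().T turns that Series into a single-row DataFrame.
means = toy[['Sepal Length', 'Petal Length']].mean().to_frame().T
means.columns = ['sepal_length_mean', 'petal_length_mean']
print(means)
```

The `pd.concat(..., axis=1)` version above is still the more general tool when the pieces come from different sources.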

 

 

 

Problem 12: Building a pivot table

  • Problem) Build a pivot table of the mean 'Sepal Length' and 'Petal Length' for each species.
# Compute the mean of 'Sepal Length' and 'Petal Length' for each species
pivot_table = iris.pivot_table(index='Species', values=['Sepal Length', 'Petal Length'], aggfunc='mean')

# Rename the columns by name; pivot_table sorts its columns, so assigning
# a list by position could attach the labels to the wrong columns
pivot_table = pivot_table.rename(columns={'Sepal Length': 'Sepal Length Mean',
                                          'Petal Length': 'Petal Length Mean'})

pivot_table
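
`pivot_table` sorts its output columns by default, so the order of the result need not match the order of the `values` list. A sketch on a toy frame (the data and the `table` name are illustrative) that checks the returned order and renames by name rather than by position:

```python
import pandas as pd

toy = pd.DataFrame({
    'Species': ['a', 'a', 'b', 'b'],
    'Sepal Length': [5.0, 6.0, 7.0, 8.0],
    'Petal Length': [1.0, 2.0, 3.0, 4.0],
})

table = toy.pivot_table(index='Species',
                        values=['Sepal Length', 'Petal Length'],
                        aggfunc='mean')
print(list(table.columns))  # inspect the order pandas actually returned

# Renaming by name stays correct regardless of that order.
table = table.rename(columns={'Sepal Length': 'Sepal Length Mean',
                              'Petal Length': 'Petal Length Mean'})
print(table)
```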

 

 

 

 

Problem 13: Handling missing values

  • Problem) Add 10 missing values at random to the 'Sepal Width' column, then apply at least two ways of handling them.
  • Note) The current data has no missing values, so use a DataFrame with artificial missing values created by code like the one below. (Because it uses numpy's random functions, the index positions of the missing values may differ from person to person.)
import numpy as np

# Add 10 missing values at random
iris_with_nan = iris.copy()
# replace=False picks 10 distinct rows; with the default, a row can be drawn twice and fewer than 10 NaN appear
iris_with_nan.loc[np.random.choice(iris_with_nan.index, 10, replace=False), 'Sepal Width'] = np.nan
iris_with_nan
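
If everyone should get the same missing positions, a seeded generator is one option. A sketch on a toy column, assuming an arbitrary seed of 42 and illustrative names (`toy`, `nan_idx`, `toy_with_nan`):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Sepal Width': np.linspace(2.0, 4.0, 50)})

rng = np.random.default_rng(42)  # fixed seed (arbitrary value) for repeatable picks
# replace=False draws 10 distinct index labels, so exactly 10 cells become NaN.
nan_idx = rng.choice(toy.index, size=10, replace=False)

toy_with_nan = toy.copy()
toy_with_nan.loc[nan_idx, 'Sepal Width'] = np.nan
print(toy_with_nan['Sepal Width'].isna().sum())
```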

 

 

 

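[ Answer 1 ]

# Problem 13 (1)
iris_clean = iris_with_nan.copy()
selected = iris_clean['Sepal Width'].isna()
# Cast to object first so a text label can be stored in the numeric column
iris_clean['Sepal Width'] = iris_clean['Sepal Width'].astype(object)
iris_clean.loc[selected, 'Sepal Width'] = 'Missing 1'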
iris_clean

 

 

[ Answer 2 ]

# Problem 13 (2)
iris_clean2 = iris_with_nan.copy()
iris_clean2 = iris_clean2.fillna('Missing 2')
iris_clean2
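
Both answers above fill the gaps with a text label, which leaves 'Sepal Width' as a mixed-type column. As a hedged alternative, here are two approaches that keep the column numeric, sketched on a toy frame (the data and the names `filled` and `dropped` are illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Sepal Width': [3.0, np.nan, 3.4, np.nan, 2.6]})

# Option A: replace missing values with the column mean (NaN is skipped by mean()).
filled = toy.copy()
filled['Sepal Width'] = filled['Sepal Width'].fillna(filled['Sepal Width'].mean())

# Option B: drop the rows whose 'Sepal Width' is missing.
dropped = toy.dropna(subset=['Sepal Width'])

print(filled)
print(dropped)
```

Which option fits depends on the analysis: mean filling keeps every row, while dropping avoids inventing values.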

 

 

 

Problem 14: Transforming data

  • Problem) Compute the ratio of 'Sepal Length' to 'Sepal Width' and add it as a new column, 'Sepal Ratio'.
iris['Sepal Ratio'] = (iris['Sepal Length'] / iris['Sepal Width']).round(2)
iris

 

 

 

 

Problem 15: Creating a new column based on a condition

  • Problem) Create a new column, 'Sepal Size', whose value is 'Large' when 'Sepal Length' is 5.0 or greater and 'Small' when it is less than 5.0.
def big_small(data):
  if data >= 5.0:
    return 'Large'
  else:
    return 'Small'

iris2 = iris.copy()
iris2['Sepal Size'] = iris2['Sepal Length'].apply(big_small)
iris2
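
For a simple two-way split like this, a vectorized form is a common alternative to `apply`. A sketch on a toy column, assuming numpy is available:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'Sepal Length': [4.9, 5.0, 6.3]})

# np.where evaluates the condition for the whole column at once.
toy['Sepal Size'] = np.where(toy['Sepal Length'] >= 5.0, 'Large', 'Small')
print(toy)
```

The `apply` version above remains easier to extend when the rule grows beyond two branches.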

 

 

 

 

Problem 16: Computing several statistics

  • Problem) Compute the sum, mean, and standard deviation of 'Sepal Length' and 'Sepal Width' for each species (Species).
iris.groupby('Species')[['Sepal Length','Sepal Width']].agg(['sum', 'mean', 'std'])
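
Passing a list of functions to `agg` yields two-level column labels, which can be awkward to select from. A sketch on a toy frame (the data and the `stats` name are illustrative) showing tuple selection and an optional flattening step:

```python
import pandas as pd

toy = pd.DataFrame({
    'Species': ['a', 'a', 'b', 'b'],
    'Sepal Length': [5.0, 7.0, 6.0, 6.0],
    'Sepal Width': [3.0, 3.0, 2.0, 4.0],
})

stats = toy.groupby('Species')[['Sepal Length', 'Sepal Width']].agg(['sum', 'mean', 'std'])

# Columns are (original column, statistic) pairs; select one with a tuple.
print(stats[('Sepal Length', 'mean')])

# Optional: flatten the two levels into single names such as 'Sepal Length_mean'.
stats.columns = ['_'.join(pair) for pair in stats.columns]
print(stats.columns.tolist())
```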

 

 

 

 

Problem 17: Filtering with a compound condition

  • Problem) Select only the rows where 'Sepal Length' is 5.0 or greater and 'Sepal Width' is 3.5 or less, then add the sum of 'Petal Length' and 'Petal Width' for those rows as a new column, 'Petal Sum'.
selected = (iris['Sepal Length'] >= 5.0) & (iris['Sepal Width'] <= 3.5)
# Keep only the matching rows, then add the column to that subset
iris_filtered = iris.loc[selected].copy()
iris_filtered['Petal Sum'] = iris_filtered['Petal Length'] + iris_filtered['Petal Width']
iris_filtered

 

 

 

 

Problem 18: Drawing a scatter plot

  • Problem) Draw a scatter plot showing the relationship between 'Sepal Length' and 'Sepal Width'. Use a different color for each species.
import seaborn as sns
import matplotlib.pyplot as plt

# Draw the scatter plot
sns.scatterplot(data=iris, x='Sepal Length', y='Sepal Width', hue='Species')

# Label and display the plot
plt.title('Sepal Length vs. Sepal Width')
plt.xlabel('Sepal Length')
plt.ylabel('Sepal Width')
plt.legend(title='Species')
plt.show()

 

 

 

 

Problem 19: Drawing a histogram

  • Problem) Draw a histogram showing the distribution of 'Sepal Length'. Use a different color for each species.
import seaborn as sns
import matplotlib.pyplot as plt

# Draw the histogram
sns.histplot(data=iris, x='Sepal Length', hue='Species')

# Label and display the plot
plt.title('Distribution of Sepal Length')
plt.xlabel('Sepal Length')
plt.ylabel('Total')
plt.show()

 

 

 

Problem 20: Drawing a box plot

  • Problem) Draw a box plot showing the distribution of 'Petal Length' for each species.
import seaborn as sns
import matplotlib.pyplot as plt

# One box per species, with petal length on the y-axis
ax = sns.boxplot(data=iris, x='Species', y='Petal Length')

# Label and display the plot
plt.title('Distribution of Petal Length')
plt.xlabel('Species')
plt.ylabel('Petal Length')
plt.show()