
04. Pandas DataFrame Basics

December 27, 2023 · About 7 min · Python · crashcourse, python, py, google, google-colab, jupyter-notebook, numpy, pandas, ipython


Python for Financial Data Analysis - WikiDocs

01. Displaying a DataFrame

pandas can read data directly from a URL: you only need to supply the address where the data lives. Let's load a CSV file and store it in a DataFrame. CSV files are read into a DataFrame with pd.read_csv(), so we pass it the URL of the data.

Loading the data

import pandas as pd

url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv'
drink_df = pd.read_csv(url, sep=',')
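The same call can be tried without network access. The sketch below is only an illustration: `csv_text` is a hypothetical three-row excerpt laid out like drinks.csv, and `sample_df` is a name introduced here. It passes the delimiter by keyword, which recent pandas versions require.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt in the same layout as drinks.csv.
csv_text = (
    "country,beer_servings,wine_servings\n"
    "Albania,89,54\n"
    "Algeria,25,14\n"
    "Andorra,245,312\n"
)

# Pass the delimiter by keyword; a comma is already the default.
sample_df = pd.read_csv(io.StringIO(csv_text), sep=",")

print(type(sample_df).__name__)  # DataFrame
print(sample_df.shape)           # (3, 3)
```

Because a comma is the default separator, `pd.read_csv(url)` alone should behave the same way for this file.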

Let's check the type of drink_df.

# Check the type
type(drink_df)

The output is a DataFrame. Now let's display drink_df itself.

drink_df

Displaying the DataFrame

head() prints the first 5 rows.

# Print the first 5 rows
drink_df.head()

The DataFrame index

index shows the DataFrame's index.

# Check the index range
drink_df.index
# 
# RangeIndex(start=0, stop=193, step=1)

The index starts at 0 and increases by 1 (step) up to 192. Note that a stop of 193 means 193 itself is not included. Rows are therefore numbered 0 through 192, for a total of 193 samples.
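The half-open range can be checked directly. This is a minimal sketch that builds a standalone `pd.RangeIndex` with the same start, stop, and step as the output above; `idx` is a name introduced here.

```python
import pandas as pd

# Same parameters as the index reported for drink_df.
idx = pd.RangeIndex(start=0, stop=193, step=1)

print(len(idx))    # 193
print(idx[0])      # 0
print(idx[-1])     # 192
print(193 in idx)  # False
```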

DataFrame data types

dtypes shows the data type of each column in the DataFrame.

# Print the type of each column
drink_df.dtypes
# 
# country                          object
# beer_servings                     int64
# spirit_servings                   int64
# wine_servings                     int64
# total_litres_of_pure_alcohol    float64
# continent                        object
# dtype: object

Note that object in a DataFrame's dtypes generally means the column holds strings. int64 means integer data and float64 means floating-point data.

Reading the output above, the country and continent columns hold strings, the beer_servings, spirit_servings, and wine_servings columns hold integers, and the total_litres_of_pure_alcohol column holds floating-point numbers. This tells us the data type of every column.

DataFrame size

shape gives the number of rows and columns of the DataFrame.

# Print the number of rows and columns
drink_df.shape

The result shows that the DataFrame has 193 rows and 6 columns.
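As a small sketch of how `shape` relates to `len()`, the example below uses a hypothetical three-row frame, `small_df`, rather than the full dataset.

```python
import pandas as pd

# Hypothetical miniature frame for illustration only.
small_df = pd.DataFrame({
    "country": ["Albania", "Algeria", "Andorra"],
    "beer_servings": [89, 25, 245],
})

# shape is a (rows, columns) tuple and can be unpacked.
n_rows, n_cols = small_df.shape

print(n_rows, n_cols)         # 3 2
print(len(small_df))          # 3
print(len(small_df.columns))  # 2
```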

Converting a DataFrame to an array

values returns the contents of the DataFrame as a NumPy array.

# Output as a NumPy array
drink_df.values
#
# array([['Afghanistan', 0, 0, 0, 0.0, 'AS'],
#        ['Albania', 89, 132, 54, 4.9, 'EU'],
#        ['Algeria', 25, 0, 14, 0.7, 'AF'],
#        ...,
#        ['Yemen', 6, 0, 0, 0.1, 'AS'],
#        ['Zambia', 32, 19, 4, 2.5, 'AF'],
#        ['Zimbabwe', 64, 18, 4, 4.7, 'AF']], dtype=object)
type(drink_df.values)
#
# numpy.ndarray

NumPy is not covered in this course, but in very brief terms, each DataFrame row is converted into an array row like the ones shown above.

Let's print the first row of the converted data. After the conversion it is available at index 0.

drink_df.values[0]
#
# array(['Afghanistan', 0, 0, 0, 0.0, 'AS'], dtype=object)

Because the data has been converted to this form, you can also iterate over it with a for loop:

for element in drink_df.values[0]:
  print(element)
#   
# Afghanistan
# 0
# 0
# 0
# 0.0
# AS
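The pandas documentation recommends `DataFrame.to_numpy()` over `.values` for this conversion. The sketch below repeats the steps above on a hypothetical two-row frame, `small_df`, so it runs without downloading anything.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame for illustration only.
small_df = pd.DataFrame({
    "country": ["Afghanistan", "Albania"],
    "beer_servings": [0, 89],
})

# to_numpy() returns the same kind of array as .values.
arr = small_df.to_numpy()

print(type(arr).__name__)  # ndarray
print(arr.shape)           # (2, 2)

# Each row is itself an array, so it can be iterated.
for element in arr[0]:
    print(element)
```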

DataFrame info

info() shows an overall summary of the DataFrame. Let's call info() and list as much as we can learn from it.

drink_df.info()
#
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 193 entries, 0 to 192
# Data columns (total 6 columns):
#  #   Column                        Non-Null Count  Dtype  
# ---  ------                        --------------  -----  
#  0   country                       193 non-null    object 
#  1   beer_servings                 193 non-null    int64  
#  2   spirit_servings               193 non-null    int64  
#  3   wine_servings                 193 non-null    int64  
#  4   total_litres_of_pure_alcohol  193 non-null    float64
#  5   continent                     170 non-null    object 
# dtypes: float64(1), int64(3), object(2)
# memory usage: 9.2+ KB

From info() we can learn the following.

The data has the columns listed above. In data analysis, columns like these, which are used to understand the data, are called features. There are 6 features in total.

The info() output shows that there are 193 entries in total and that the country and continent columns have dtype object, which here means strings. The remaining columns hold integer data (int64) or floating-point data (float64).

One more thing to notice: there are 193 entries in total, yet the Non-Null Count for continent alone is 170.

This means that continent has 23 Null values, in other words 23 missing values.

Missing values (Null) in a DataFrame

A missing value is a value that is not actually present. It is also called a Null value. isnull().sum() prints how many Null values each column of the DataFrame contains.

print(drink_df.isnull().sum())
# 
# country                          0
# beer_servings                    0
# spirit_servings                  0
# wine_servings                    0
# total_litres_of_pure_alcohol     0
# continent                       23
# dtype: int64

The continent column contains 23 Null (missing) values.
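One plausible explanation for these 23 values, offered here as an assumption rather than something the original text states, is that the file abbreviates North America as NA, and read_csv treats the string "NA" as missing by default. The sketch below shows that default on a hypothetical three-row CSV (`csv_text`) and how `keep_default_na=False` changes it.

```python
import io

import pandas as pd

# Hypothetical excerpt; "NA" is meant as a continent code here.
csv_text = "country,continent\nAlbania,EU\nCanada,NA\nMexico,NA\n"

# Default behaviour: the string "NA" is parsed as a missing value.
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df["continent"].isnull().sum())  # 2

# keep_default_na=False keeps "NA" as an ordinary string.
kept_df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(kept_df["continent"].isnull().sum())     # 0
print(kept_df["continent"].tolist())           # ['EU', 'NA', 'NA']
```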


02. Accessing DataFrame Columns

We use the same drink_df DataFrame as before.

Loading the data

import pandas as pd

url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv'
drink_df = pd.read_csv(url, sep=',')
drink_df

Accessing DataFrame columns

The easiest way to access a particular column of a DataFrame is the form

dataframe_name.column_name

That is, write the DataFrame's name, then a period, then the column name, and only that column is returned.

drink_df.beer_servings
#
# 0        0
# 1       89
# 2       25
# 3      245
# 4      217
#       ... 
# 188    333
# 189    111
# 190      6
# 191     32
# 192     64
# Name: beer_servings, Length: 193, dtype: int64

Another way is dataframe_name['column_name']. It produces the same result as above.

# Another way
drink_df['beer_servings']
#
# 0        0
# 1       89
# 2       25
# 3      245
# 4      217
#       ... 
# 188    333
# 189    111
# 190      6
# 191     32
# 192     64
# Name: beer_servings, Length: 193, dtype: int64

A two-dimensional table is called a DataFrame, but when you pull out a single column like this, the result is not a DataFrame; it is another data type provided by pandas, the Series. Let's retrieve one column and check its type.

# Check the column's type
type(drink_df.beer_servings)
# 
# pandas.core.series.Series

What if you want to access several selected columns rather than just one? Use the form

dataframe_name[['column_name1', 'column_name2']]

For example, suppose we want only the two columns beer_servings and wine_servings.

drink_df[['beer_servings','wine_servings']]

A DataFrame containing only those two columns is displayed.

['column_name1', 'column_name2'] is a Python list. In other words, you can first put the column names in a Python list and then use

dataframe_name[list_of_column_names]

to get the same result.

cols = ['beer_servings','wine_servings']
drink_df[cols]
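The difference between single and double brackets is easy to verify. The following sketch uses a hypothetical three-row frame, `small_df`, and only checks the types that come back.

```python
import pandas as pd

# Hypothetical miniature frame for illustration only.
small_df = pd.DataFrame({
    "country": ["Albania", "Algeria", "Andorra"],
    "beer_servings": [89, 25, 245],
    "wine_servings": [54, 14, 312],
})

one_col = small_df["beer_servings"]       # single brackets -> Series
one_col_df = small_df[["beer_servings"]]  # list of one name -> DataFrame
two_cols = small_df[["beer_servings", "wine_servings"]]

print(type(one_col).__name__)     # Series
print(type(one_col_df).__name__)  # DataFrame
print(two_cols.shape)             # (3, 2)
```

Bracket access also works for column names that contain spaces or that clash with a DataFrame method name, where the dot form does not.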

03. Understanding the Numeric Summary of Features

We use the same drink_df DataFrame as before.

Loading the data

import pandas as pd

url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv'
drink_df = pd.read_csv(url, sep=',')
drink_df

Features

We described the DataFrame's info() earlier. info() shows an overall summary of the DataFrame.

drink_df.info()
# 
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 193 entries, 0 to 192
# Data columns (total 6 columns):
#  #   Column                        Non-Null Count  Dtype  
# ---  ------                        --------------  -----  
#  0   country                       193 non-null    object 
#  1   beer_servings                 193 non-null    int64  
#  2   spirit_servings               193 non-null    int64  
#  3   wine_servings                 193 non-null    int64  
#  4   total_litres_of_pure_alcohol  193 non-null    float64
#  5   continent                     170 non-null    object 
# dtypes: float64(1), int64(3), object(2)
# memory usage: 9.2+ KB

From info() we can learn the following.

The data has the columns listed above. In data analysis, columns like these, which are used to understand the data, are called features. There are 6 features in total.

Understanding the numeric summary of features

Now let's start getting to know the data through the DataFrame. When you work with numeric data, finding its minimum, maximum, mean, and so on is the very first step.

When the data is stored in a DataFrame, the quickest way to get these is describe().

drink_df.describe()

describe() reports the number of values (count), the mean, the standard deviation (std), the minimum and maximum, and the quartiles (25%, 50%, 75%). It only summarizes numeric data, so the string columns country and continent are left out. You can also call it on a single column.

drink_df.beer_servings.describe()
#
# count    193.000000
# mean     106.160622
# std      101.143103
# min        0.000000
# 25%       20.000000
# 50%       76.000000
# 75%      188.000000
# max      376.000000
# Name: beer_servings, dtype: float64
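If a summary of the string columns is wanted too, `describe()` accepts `include="all"`. The sketch below shows the difference on a hypothetical three-row frame, `small_df`.

```python
import pandas as pd

# Hypothetical miniature frame for illustration only.
small_df = pd.DataFrame({
    "country": ["Albania", "Algeria", "Andorra"],
    "beer_servings": [89, 25, 245],
})

# Default: only numeric columns are summarized.
numeric_summary = small_df.describe()
print(list(numeric_summary.columns))  # ['beer_servings']

# include="all" adds non-numeric columns (with rows such as unique, top, freq).
full_summary = small_df.describe(include="all")
print(list(full_summary.columns))     # ['country', 'beer_servings']
```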

You can also compute a column's maximum, minimum, mean, sum, and count individually.

Mean
drink_df.beer_servings.mean()
#
# 106.16062176165804

Since these values can be pulled out directly, they can also be used in further calculations. mean() gives the mean in one step, but let's compute it from sum() and count() instead.

float(drink_df.beer_servings.sum())/drink_df.beer_servings.count()
#
# 106.16062176165804
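One detail worth knowing is that `count()` counts only non-missing values, so `sum() / count()` matches `mean()` even when the column has gaps. The sketch below uses a hypothetical Series, `servings`, with one missing entry.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one missing entry.
servings = pd.Series([10, 20, np.nan, 30])

print(servings.sum())    # 60.0
print(servings.count())  # 3 (missing values are not counted)
print(len(servings))     # 4 (all rows, including the missing one)

manual_mean = servings.sum() / servings.count()
print(manual_mean)       # 20.0
print(servings.mean())   # 20.0
```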

04. Conditional Logic

We use the same drink_df DataFrame as before.

Loading the data

import pandas as pd

url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv'
drink_df = pd.read_csv(url, sep=',')
drink_df

Conditional logic

You can apply a condition to a DataFrame and pull out only the values that satisfy it. First, let's see what is returned when a condition is applied to a single column.

dataframe_name.column_name == 'some_value'

This expression checks the condition for each row and returns a Series holding True where the condition is met and False where it is not.

drink_df.continent=='EU'
#
# 0      False
# 1       True
# 2      False
# 3       True
# 4      False
#        ...  
# 188    False
# 189    False
# 190    False
# 191    False
# 192    False
# Name: continent, Length: 193, dtype: bool

Usually, though, what we want from such a condition is not a Series of True and False values but the rows of the DataFrame whose continent is 'EU'.

In that case, use the True/False Series like this:

dataframe_name[series_of_True_and_False]

First, let's confirm that drink_df.continent=='EU' is a Series.

type(drink_df.continent=='EU')
#
# pandas.core.series.Series

Now put it inside the square brackets of dataframe_name[ ]. We use head(20) to show only the first 20 rows.

drink_df[drink_df.continent=='EU'].head(20)
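The two steps, building the mask and then indexing with it, can be separated to make the mechanism visible. This is a sketch on a hypothetical four-row frame, `small_df`.

```python
import pandas as pd

# Hypothetical miniature frame for illustration only.
small_df = pd.DataFrame({
    "country": ["Albania", "Algeria", "Andorra", "Angola"],
    "continent": ["EU", "AF", "EU", "AF"],
    "beer_servings": [89, 25, 245, 217],
})

# Step 1: a boolean Series, one value per row.
mask = small_df["continent"] == "EU"
print(mask.tolist())              # [True, False, True, False]

# Step 2: keep only the rows where the mask is True.
eu_df = small_df[mask]
print(eu_df["country"].tolist())  # ['Albania', 'Andorra']
print(eu_df.index.tolist())       # [0, 2] (original row labels are kept)
```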

What if we want to keep only the rows where beer_servings is greater than 158?

Earlier we introduced two ways to refer to a column: drink_df.beer_servings and drink_df['beer_servings'].

So there are two equivalent ways to write a filter for beer_servings greater than 158.

# The two forms below are equivalent
# drink_df[drink_df.beer_servings > 158]
# drink_df[drink_df['beer_servings'] > 158]
drink_df[drink_df['beer_servings'] > 158].head(20)

What if we want to apply a condition but show only a few particular columns?

Earlier we saw that several columns can be selected with dataframe_name[['column_name1', 'column_name2']].

Building on that, here is how to keep the rows where beer_servings is 10 or less and show only the two columns country and beer_servings.

drink_df[drink_df.beer_servings <= 10][['country','beer_servings']]

First, drink_df[drink_df.beer_servings <= 10] produces a DataFrame of the rows that meet the condition; appending [['country','beer_servings']] to that DataFrame then selects just the two columns.

# drink_df[drink_df.beer_servings <= 10] is a DataFrame.
drink_df[drink_df.beer_servings <= 10][['country','beer_servings']].head(20)
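An alternative to chaining two bracket operations is `.loc`, which takes the row condition and the column list in a single step. The sketch below, on a hypothetical four-row frame `small_df`, checks that both forms give the same result.

```python
import pandas as pd

# Hypothetical miniature frame for illustration only.
small_df = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Algeria", "Yemen"],
    "beer_servings": [0, 89, 25, 6],
    "wine_servings": [0, 54, 14, 0],
})

mask = small_df["beer_servings"] <= 10
cols = ["country", "beer_servings"]

chained = small_df[mask][cols]     # filter rows, then pick columns
with_loc = small_df.loc[mask, cols]  # rows and columns in one step

print(with_loc["country"].tolist())  # ['Afghanistan', 'Yemen']
print(chained.equals(with_loc))      # True
```

For reading data the two are interchangeable; `.loc` is generally the safer choice when the selection is later assigned to.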

This time, let's filter by a condition, take a Series from the result, and compute its mean.

drink_df[drink_df.continent=='EU'].beer_servings.mean()
#
# 193.77777777777777

The code above first selects the rows whose continent is 'EU', then takes the single column beer_servings from them. As mentioned earlier, selecting one column yields a Series rather than a DataFrame, so mean() is applied to that Series. The result is the mean of beer_servings over the rows where continent is EU.

Next, let's use the mean of the beer_servings column inside the condition itself.

drink_df[drink_df.beer_servings > drink_df.beer_servings.mean()]

To get a True/False Series that says whether each value in a column is null (missing), call isnull() on the column. isnull() returns True for rows that hold a Null value and False otherwise.

drink_df.continent.isnull()
#
# 0      False
# 1      False
# 2      False
# 3      False
# 4      False
#        ...  
# 188    False
# 189    False
# 190    False
# 191    False
# 192    False
# Name: continent, Length: 193, dtype: bool
#
drink_df.continent.isnull().sum()
#
# 23

Using this, you can select only the rows where a particular column holds a Null value.

drink_df[drink_df.continent.isnull()]
len(drink_df[drink_df.continent.isnull()])
#
# 23

Only the rows whose continent is NaN (meaning a missing value) are shown. These rows have values in every other column, but effectively no value in the continent column.


05. Using the AND, OR, and NOT Operators

We use the same drink_df DataFrame as before.

Loading the data

import pandas as pd

url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv'
drink_df = pd.read_csv(url, sep=',')
drink_df

Using the AND, OR, and NOT operators

Operators such as AND, OR, and NOT let you apply several conditions at once or invert a condition.

With DataFrames, AND, OR, and NOT are written &, |, and ~ respectively.

In other words, they are used as follows:

(condition1) & (condition2): AND
(condition1) | (condition2): OR
~(condition): NOT

In practice there may be more than two conditions, and the operators can be mixed, so many more combinations are possible. Here we cover only the three basic cases above.

Let's look at NOT first.

NOT condition
drink_df[~(drink_df.continent=='EU')]

The code above keeps only the rows whose continent is not 'EU'.

With AND, a row must satisfy both conditions, for example (drink_df.continent=='EU') & (drink_df.wine_servings > 300). If you instead want the rows that satisfy at least one of the conditions, simply change AND to OR. Let's keep the same two conditions, switch AND to OR, and see how many rows are selected.

len(input)

returns the length of its input. Given a DataFrame, it returns the number of rows.

# OR condition
len(drink_df[(drink_df.continent=='EU') | (drink_df.wine_servings > 300)])
#
# 45

With the AND condition there were only 3 rows, but this time there are 45, far more.
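All three operators can be compared side by side on a small example. The sketch below uses a hypothetical four-row frame, `small_df`, with one missing continent. Each condition is wrapped in parentheses (or stored in a variable first) because `&` and `|` bind more tightly than comparisons such as `==` and `>`.

```python
import pandas as pd

# Hypothetical miniature frame; Canada's continent is missing on purpose.
small_df = pd.DataFrame({
    "country": ["Albania", "Andorra", "Algeria", "Canada"],
    "continent": ["EU", "EU", "AF", None],
    "wine_servings": [54, 312, 14, 100],
})

is_eu = small_df["continent"] == "EU"
much_wine = small_df["wine_servings"] > 300

print(len(small_df[is_eu & much_wine]))  # 1 (both conditions hold)
print(len(small_df[is_eu | much_wine]))  # 2 (at least one holds)
print(len(small_df[~is_eu]))             # 2 (not EU, including the missing one)
```

Note that a missing continent compares as False to 'EU', so the NOT filter keeps that row as well.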


06. Combining Logic with Numeric Summaries & Sorting

We use the same drink_df DataFrame as before.

Loading the data

import pandas as pd

url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv'
drink_df = pd.read_csv(url, sep=',')
drink_df

Combining logic with numeric summaries

How would we print the country column for the row where total_litres_of_pure_alcohol is at its maximum?

  1. Recall how to find the maximum of a particular column.
  2. Recall how to select only a particular column.
# Which country has the largest total litres of pure alcohol?
drink_df[drink_df.total_litres_of_pure_alcohol == drink_df.total_litres_of_pure_alcohol.max()]['country']
#
# 15    Belarus
# Name: country, dtype: object

First, drink_df.total_litres_of_pure_alcohol.max() gives the maximum of the total_litres_of_pure_alcohol column; that value is then used in the condition to produce the output.
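The same lookup can also be written with `idxmax()`, which returns the row label of the first maximum. The sketch below compares the two approaches on a hypothetical three-row frame, `small_df`.

```python
import pandas as pd

# Hypothetical miniature frame for illustration only.
small_df = pd.DataFrame({
    "country": ["Albania", "Belarus", "Algeria"],
    "total_litres_of_pure_alcohol": [4.9, 14.4, 0.7],
})

col = small_df["total_litres_of_pure_alcohol"]

# Approach from the text: compare every row with the maximum.
by_mask = small_df[col == col.max()]["country"]
print(by_mask.tolist())  # ['Belarus']

# Alternative: look up the label of the maximum directly.
by_idxmax = small_df.loc[col.idxmax(), "country"]
print(by_idxmax)         # Belarus
```

The mask form returns every row tied for the maximum, while `idxmax()` returns only the first one.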

This time, let's count the entries in the country column for rows where wine_servings is greater than 300, or beer_servings is greater than 300, or spirit_servings is greater than 300. Two things need to be considered to implement this.

  1. Recall how to combine several conditions with 'or' in one expression.
  2. Recall how to 'count' values.
# Count the rows of drink_df where wine_servings is greater than 300,
# or beer_servings is greater than 300,
# or spirit_servings is greater than 300.
drink_df[(drink_df.wine_servings > 300) | (drink_df.beer_servings > 300) | (drink_df.spirit_servings > 300)].country.count()
#
# 18

Sorting

You can also view the data sorted by a given key, as follows.

dataframe_name.sort_values('name_of_sort_column')

drink_df.sort_values('beer_servings') # sort by a particular column

The beer_servings column shows that the rows are sorted in ascending order. Ascending order is the default; to sort in descending order, pass ascending=False to sort_values:

dataframe_name.sort_values('name_of_sort_column', ascending=False)

# Sort in descending order
drink_df.sort_values('beer_servings', ascending=False)

The sort key does not have to be a single column; it can be several. Pass a list of column names to sort_values and the rows are sorted by those columns in order.

# Sort by two columns
drink_df.sort_values(['beer_servings', 'wine_servings'])
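When sorting by several columns, `ascending` may also be a list with one flag per column. The sketch below illustrates this on a hypothetical four-row frame, `small_df`, where two rows tie on beer_servings.

```python
import pandas as pd

# Hypothetical miniature frame; A and B tie on beer_servings.
small_df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "beer_servings": [25, 25, 89, 0],
    "wine_servings": [14, 54, 5, 0],
})

keys = ["beer_servings", "wine_servings"]

# Both keys ascending: ties on beer are broken by the smaller wine value.
asc = small_df.sort_values(keys)
print(asc["country"].tolist())    # ['D', 'A', 'B', 'C']

# beer ascending, wine descending: the tie is broken the other way.
mixed = small_df.sort_values(keys, ascending=[True, False])
print(mixed["country"].tolist())  # ['D', 'B', 'A', 'C']
```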

07. Correlation Analysis

Correlation analysis expresses the linear relationship between two variables as a correlation coefficient. The coefficient builds on covariance, which quantifies how two variables vary together, that is, whether one tends to rise when the other rises. Covariance alone, however, depends on the units and scale of the two variables. It is therefore normalized to a value between -1 and 1, and that normalized value is the correlation coefficient.

A coefficient close to 1 indicates a strong positive correlation, and one close to -1 indicates a strong negative correlation. A coefficient of 0 indicates no linear correlation.

Matplotlib is a Python package for visualizing data as charts and plots. Seaborn is a visualization package built on Matplotlib that adds a variety of themes and features. To visualize data you can use Matplotlib alone, Seaborn alone, or the two together.

Correlation analysis

For practice, let's create a small DataFrame by hand.

import pandas as pd # import pandas
import matplotlib.pyplot as plt # import Matplotlib
import seaborn as sns # import Seaborn

A DataFrame can be built from several Python lists like this:

pd.DataFrame({"column_name1": list1, "column_name2": list2, "column_name3": list3})

test_df = pd.DataFrame({"v1":[100,200,300,400], "v2":[400,200,100,250], "v3":[40,60,60,100]})
test_df

| . | v1 | v2 | v3 |
| :--- | ---: | ---: | ---: |
| 0 | 100 | 400 | 40 |
| 1 | 200 | 200 | 60 |
| 2 | 300 | 100 | 60 |
| 3 | 400 | 250 | 100 |

The correlation coefficients of the data are computed as follows.

dataframe_name.corr(method='pearson')

corr = test_df.corr(method = 'pearson')
corr

| . | v1 | v2 | v3 |
| :--- | ---: | ---: | ---: |
| v1 | 1.000000 | -0.568038 | 0.923381 |
| v2 | -0.568038 | 1.000000 | -0.291397 |
| v3 | 0.923381 | -0.291397 | 1.000000 |
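To connect the table with the definition given earlier (covariance scaled into the range -1 to 1), the v1/v3 entry can be recomputed by hand. This is a sketch only; `cov_xy`, `r_manual`, and `r_pandas` are names introduced here.

```python
import numpy as np
import pandas as pd

test_df = pd.DataFrame({"v1": [100, 200, 300, 400],
                        "v2": [400, 200, 100, 250],
                        "v3": [40, 60, 60, 100]})

x = test_df["v1"].to_numpy(dtype=float)
y = test_df["v3"].to_numpy(dtype=float)

# Sample covariance: average product of deviations from the means.
cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

# Dividing by both sample standard deviations removes the units.
r_manual = cov_xy / (x.std(ddof=1) * y.std(ddof=1))

# pandas computes the same Pearson coefficient.
r_pandas = test_df["v1"].corr(test_df["v3"])

print(round(r_manual, 6))  # 0.923381
print(round(r_pandas, 6))  # 0.923381
```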

Drawing a heatmap

Applying .values to a DataFrame converts each of its rows into a list-like array. We will use this as the input when drawing the chart.

corr.values
#
# array([[ 1.        , -0.56803756,  0.92338052],
#        [-0.56803756,  1.        , -0.29139712],
#        [ 0.92338052, -0.29139712,  1.        ]])

To label the chart, create a list called column_names as follows.

column_names = ['ver1', 'ver2', 'ver3']

Correlation charts are usually drawn with the heatmap() function that seaborn provides. If seaborn is imported as sns, the basic usage is:

sns.heatmap(correlation_data)

The code below has comments on the optional settings that heatmap accepts. None of them is required; they are there to make the chart look nicer.

# Adjust the label font size
sns.set(font_scale=2.0)

test_heatmap = sns.heatmap(corr.values, # data
                           cbar = False, # whether to draw the color bar on the right
                           annot = True, # whether to show the numbers on the chart
                           annot_kws={'size' : 20}, # font size of the numbers
                           fmt = '.2f', # number of decimal places shown
                           square = True, # whether to draw square cells
                           yticklabels=column_names, # column names on the y axis
                           xticklabels=column_names) # column names on the x axis
plt.tight_layout()
plt.show()

This time, let's set cbar to True.

# Adjust the label font size
sns.set(font_scale=2.0)

test_heatmap = sns.heatmap(corr.values,
                           cbar = True,
                           annot = True,
                           annot_kws={'size' : 20},
                           fmt = '.2f',
                           square = True,
                          yticklabels=column_names,
                          xticklabels=column_names)
plt.tight_layout()
plt.show()

The easiest way to see what each setting does is to remove or change it and redraw the chart. For example, change fmt = '.2f' to fmt = '.3f' and run the code again; the numbers on the chart will then be shown to three decimal places.

Feel free to experiment with the other settings as well.

Another way to visualize correlation is to draw a scatter plot. A scatter plot marks points on a coordinate plane to show the relationship between two variables. It lets you see whether variable B tends to increase as variable A increases, or whether the two are unrelated.

If seaborn is imported as sns, the basic way to draw scatter plots is shown below. pairplot draws a scatter plot for each pair of columns, and on the diagonal, where a column meets itself, it draws a histogram of that column.

sns.pairplot(dataframe)

# whitegrid: white background with grid lines
sns.set(style='whitegrid')
sns.pairplot(test_df)
plt.show()

There is too little data here to interpret the scatter plots meaningfully. We will interpret scatter plots using the drinks data instead.


08. Practice: Correlation Analysis and Scatter Plots

We use the same drink_df DataFrame as before.

Loading the data

import pandas as pd

url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv'
drink_df = pd.read_csv(url, sep=',')
drink_df

Correlation coefficients for the drinks data

Let's compute the correlation coefficients among the features of the drinks data and visualize them with seaborn's heatmap and pairplot. First, we compute the correlation between the two features 'beer_servings' and 'wine_servings'.

# Compute the correlation between the two features 'beer_servings' and 'wine_servings'.
# pearson is one of the methods for computing a correlation coefficient, and the most widely used.
corr = drink_df[['beer_servings', 'wine_servings']].corr(method = 'pearson')
corr

Next, let's compute the correlation matrix for all four numeric features.

# Compute the correlation matrix among the features.
cols = ['beer_servings', 'spirit_servings', 'wine_servings', 'total_litres_of_pure_alcohol']
corr = drink_df[cols].corr(method = 'pearson')
corr

The highest correlation is between beer_servings and total_litres_of_pure_alcohol, at 0.835839. Let's visualize this with seaborn's heatmap. First, apply .values to the correlation result to create the input for heatmap.

corr.values
#
# array([[1.        , 0.45881887, 0.52717169, 0.83583863],
#        [0.45881887, 1.        , 0.19479705, 0.65496818],
#        [0.52717169, 0.19479705, 1.        , 0.66759834],
#        [0.83583863, 0.65496818, 0.66759834, 1.        ]])

To label the x and y axes of the heatmap, create the following list.

column_names = ['beer', 'spirit', 'wine', 'alcohol']

Now let's draw the seaborn heatmap.

import seaborn as sns
import matplotlib.pyplot as plt

# Adjust the label font size
sns.set(font_scale=1.5)
hm = sns.heatmap(corr.values,
            cbar=True,
            annot=True, 
            square=True,
            fmt='.2f',
            annot_kws={'size': 15},
            yticklabels=column_names,
            xticklabels=column_names)

plt.tight_layout()
plt.show()

alcohol์€ ๋Œ€์ฒด์ ์œผ๋กœ ๋‹ค๋ฅธ ํŠน์„ฑ๋“ค๊ณผ ๋ชจ๋‘ ์ƒ๊ด€ ๊ณ„์ˆ˜๊ฐ€ ๋†’์Šต๋‹ˆ๋‹ค. ์ด๋ฏธ ์‹œ๊ฐํ™”๋ฅผ ์ง„ํ–‰ํ•˜๊ธฐ ์ „์—๋„ ํ™•์ธํ•œ ๋‚ด์šฉ์ด๊ธฐ๋Š” ํ•˜์ง€๋งŒ, ๊ทธ ์ค‘์—์„œ๋„ 0.84๋กœ beer์™€ ์ƒ๊ด€ ๊ณ„์ˆ˜๊ฐ’์ด ๊ฐ€์žฅ ๋†’์Šต๋‹ˆ๋‹ค. ์•ž์„œ ํ™•์ธํ•˜์˜€์„ ๋•Œ๋Š” 0.835839์˜€์ง€๋งŒ, ์—ฌ๊ธฐ์„œ๋Š” ์†Œ์ˆ˜์  ๋‘ ์ž๋ฆฌ๊นŒ์ง€๋งŒ ์ถœ๋ ฅํ•˜๋ฉด์„œ ๋ฐ˜์˜ฌ๋ฆผ๋˜์–ด 0.84๋กœ ์ถœ๋ ฅ๋œ ๊ฒƒ์ด๋ผ ๋ณด๋ฉด ๋ฉ๋‹ˆ๋‹ค.

ํ”ผ์–ด์Šจ์˜ ์ƒ๊ด€๊ณ„์ˆ˜๋Š” ์ผ๋ฐ˜์ ์œผ๋กœ,

์‚ฐ์ ๋„

์ฃผ๋ฅ˜ ๋ฐ์ดํ„ฐ์˜ ์‚ฐ์ ๋„. ์ฆ‰, pairplot์„ ํ™•์ธํ•ด๋ด…์‹œ๋‹ค. pairplot์€ ๋ฐ์ดํ„ฐํ”„๋ ˆ์ž„์„ ์ธ์ˆ˜๋กœ ๋ฐ›์•„ ๊ทธ๋ฆฌ๋“œ(grid) ํ˜•ํƒœ๋กœ ๊ฐ ๋ฐ์ดํ„ฐ ์—ด์˜ ์กฐํ•ฉ์— ๋Œ€ํ•ด ์‚ฐ์ ๋„๋ฅผ ๊ทธ๋ฆฝ๋‹ˆ๋‹ค. ๊ฐ™์€ ๋ฐ์ดํ„ฐ๊ฐ€ ๋งŒ๋‚˜๋Š” ๋Œ€๊ฐ์„  ์˜์—ญ์—๋Š” ํ•ด๋‹น ๋ฐ์ดํ„ฐ์˜ ํžˆ์Šคํ† ๊ทธ๋žจ์„ ๊ทธ๋ฆฝ๋‹ˆ๋‹ค.

# Draw scatter plots between the features using the visualization library.
sns.set(style='whitegrid')
sns.pairplot(drink_df[['beer_servings', 'spirit_servings', 
                     'wine_servings', 'total_litres_of_pure_alcohol']])
plt.show()
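The grid layout described above can be checked without seaborn. The sketch below swaps in pandas' own scatter_matrix (a different function from sns.pairplot, used here only for illustration) on hypothetical random data, and shows that n columns produce an n-by-n grid of axes with histograms on the diagonal.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display

import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Hypothetical toy data with three numeric columns
rng = np.random.default_rng(1)
toy_df = pd.DataFrame(rng.normal(size=(30, 3)), columns=['a', 'b', 'c'])

# One scatter plot per pair of columns, histograms on the diagonal
axes = scatter_matrix(toy_df, diagonal='hist')
print(axes.shape)
```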

09. Exploring the data (removing missing values, visualization, checking statistics)

We use the same dataframe as before, drink_df.

Loading the data

import pandas as pd

url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv'
drink_df = pd.read_csv(url, sep=',')
drink_df

Removing missing values

Missing values can affect results both in exploratory data analysis and, further on, when analyzing with machine learning algorithms. There are broadly two options for handling them: remove them entirely, or fill them with a specific value.

To remove them we mostly use dropna(), whereas to fill them we mostly use fillna(). We saw earlier that the continent column has 23 missing values. Here we will use fillna() to fill them with a specific value.

# .fillna() fills missing data with a specific value.
# Here we fill in 'ETC', meaning "other".
drink_df['continent'] = drink_df['continent'].fillna('ETC')
drink_df.sample(10)

Any missing value in the continent column is now filled with 'ETC'. After filling, we printed 10 random rows, and the value ETC shows up among them. Now let's check whether the missing values are gone using isnull().sum(), which we learned earlier.

drink_df.isnull().sum()
#
# country                         0
# beer_servings                   0
# spirit_servings                 0
# wine_servings                   0
# total_litres_of_pure_alcohol    0
# continent                       0
# dtype: int64

The continent column no longer has any missing values.
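For comparison, the two options mentioned earlier (removing versus filling) can be put side by side. This is a minimal sketch on a hypothetical four-row dataframe, not the drinks data; toy_df and its values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing continent values
toy_df = pd.DataFrame({'country': ['A', 'B', 'C', 'D'],
                       'continent': ['EU', np.nan, 'AS', np.nan]})

# Option 1: drop every row whose continent is missing
dropped = toy_df.dropna(subset=['continent'])

# Option 2: keep every row and fill the gaps with a placeholder
filled = toy_df.assign(continent=toy_df['continent'].fillna('ETC'))

print(len(dropped))
print(filled['continent'].tolist())
print(filled.isnull().sum().sum())
```

Dropping shrinks the dataset, while filling keeps every row at the cost of introducing an artificial category, which is why the filled value later appears as its own slice and its own group.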

Drawing a pie chart

To draw a pie chart, two things are needed: the label of each slice and the value of each slice. Here we will draw a pie chart for continent, using the number of occurrences of each value in continent as the pie chart values. Let's print the result of calling value_counts() on the continent column and check its type.

drink_df['continent'].value_counts()
#
# AF     53
# EU     45
# AS     44
# ETC    23
# OC     16
# SA     12
# Name: continent, dtype: int64

type(drink_df['continent'].value_counts())
#
# pandas.core.series.Series

The index holds the continent names, and the values hold the number of rows counted for each continent. The index is accessed with index and the values with values, so after accessing each one we use tolist() to convert it to a list.

pie_labels = drink_df['continent'].value_counts().index.tolist()
pie_values = drink_df['continent'].value_counts().values.tolist()
print(pie_labels)
print(pie_values)
#
# ['AF', 'EU', 'AS', 'ETC', 'OC', 'SA']
# [53, 45, 44, 23, 16, 12]

We now have the two things needed to draw a pie chart. A pie chart is drawn with plt.pie(), which is used as follows.

plt.pie(data values, labels=list of data labels)

autopct is a format string for the percentage printed on each slice; among other things, it determines how many decimal places are shown.

plt.pie(pie_values, labels=pie_labels, autopct='%.02f%%')
plt.title('Percentage of each continent')
plt.show()
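To see what the format string '%.02f%%' produces, the sketch below applies it by hand to the counts printed earlier. This only imitates the text that autopct renders on each slice; it is not how matplotlib is implemented internally.

```python
# Counts taken from the value_counts() output above
pie_labels = ['AF', 'EU', 'AS', 'ETC', 'OC', 'SA']
pie_values = [53, 45, 44, 23, 16, 12]

total = sum(pie_values)

# '%.02f' prints two decimal places, '%%' prints a literal percent sign
pct_texts = ['%.02f%%' % (100 * v / total) for v in pie_values]

for label, text in zip(pie_labels, pct_texts):
    print(label, text)
```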

Checking statistics with GroupBy

GroupBy groups rows by a particular value and then lets you compute statistics for each group. It is used as follows.

dataframe_name.groupby('column to group by')['column of interest'].statistic_function()
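The pattern above can be tried on a small example first. The following sketch uses a hypothetical dataframe (the names toy_df, team, and score are assumptions) and shows that the grouped mean matches what you would get by filtering each group by hand.

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({'team': ['a', 'a', 'b', 'b', 'b'],
                       'score': [1, 3, 2, 4, 6]})

# dataframe.groupby(grouping column)[column of interest].statistic()
means = toy_df.groupby('team')['score'].mean()
print(means)

# The same numbers, computed one group at a time with a boolean filter
manual_a = toy_df[toy_df['team'] == 'a']['score'].mean()
manual_b = toy_df[toy_df['team'] == 'b']['score'].mean()
print(manual_a, manual_b)
```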

Let's compute some figures with GroupBy.

Which continent drinks more beer on average?

The continent is in the continent column, and beer consumption is in the beer_servings column.

The function that computes the mean is mean(), so we use it as follows.

drink_df.groupby('continent')['beer_servings'].mean()
#
# continent
# AF      61.471698
# AS      37.045455
# ETC    145.434783
# EU     193.777778
# OC      89.687500
# SA     175.083333
# Name: beer_servings, dtype: float64

type(drink_df.groupby('continent')['beer_servings'].mean())
#
# pandas.core.series.Series

Shall we print summary statistics on wine consumption for each continent?

drink_df.groupby('continent')['wine_servings'].describe()
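describe() returns a fixed set of statistics per group. When only some of them are needed, agg() with a list of function names is one alternative. The sketch below compares the two on hypothetical data; toy_df, group, and value are assumed names.

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({'group': ['x', 'x', 'x', 'y', 'y'],
                       'value': [1, 2, 3, 10, 20]})

# Full summary: count, mean, std, min, quartiles, max for each group
summary = toy_df.groupby('group')['value'].describe()
print(summary.columns.tolist())

# Only the statistics we ask for
picked = toy_df.groupby('group')['value'].agg(['count', 'mean', 'max'])
print(picked)
```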

Shall we print the per-continent mean of every column?

# numeric_only=True skips the non-numeric country column,
# which recent pandas versions otherwise refuse to average
drink_df.groupby('continent').mean(numeric_only=True)

Let's find the continents that consume more alcohol than the overall average.

total_mean = drink_df.total_litres_of_pure_alcohol.mean()
continent_mean = drink_df.groupby('continent')['total_litres_of_pure_alcohol'].mean()
continent_over_mean = continent_mean[continent_mean >= total_mean]
print(continent_over_mean)
# 
# continent
# ETC    5.995652
# EU     8.617778
# SA     6.308333
# Name: total_litres_of_pure_alcohol, dtype: float64

Let's find the continent with the highest mean wine_servings.

wine_continent = drink_df.groupby('continent').wine_servings.mean().idxmax()
print(wine_continent)
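idxmax() returns the index label of the largest value rather than the value itself, which is why it yields a continent code here. A minimal sketch with a hypothetical Series of group means (the numbers are made up for illustration):

```python
import pandas as pd

# Hypothetical group means indexed by continent code
toy_means = pd.Series({'AF': 1.5, 'EU': 9.0, 'AS': 3.0})

top_label = toy_means.idxmax()   # label of the maximum
top_value = toy_means.max()      # the maximum itself

# Sorting in descending order and taking the first label gives the same answer
top_by_sort = toy_means.sort_values(ascending=False).index[0]

print(top_label, top_value, top_by_sort)
```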

10. Quiz

Quiz 1

sklearn is a machine learning package that also provides various machine learning datasets. The dataset used in this quiz is the iris data.

# The two packages below are not all you will need for the quiz.
# Feel free to add more packages as you need them.
import pandas as pd
from sklearn import datasets

# Load the iris data
iris_data = datasets.load_iris()

The columns of the iris dataset are the sepal length, sepal width, petal length, and petal width. Given these 4 features, the machine learning problem posed by the iris dataset is to predict the species of the flower.

First, load the data for the 4 features into df_data.

df_data = pd.DataFrame(iris_data['data'], columns=iris_data['feature_names'])
df_data.sample(5)

Four columns are printed: sepal length, sepal width, petal length, and petal width. The iris data is a dataset for the machine learning problem of predicting which species of iris a flower is from these 4 features alone. Load the labels, which are the correct answers, into df_target.

df_target = pd.DataFrame(iris_data['target'], columns=['species'])
df_target.sample(5)

Let's print which values this column contains.

# Print every distinct value
df_target['species'].unique()
#
# array([0, 1, 2])

The numbers 0, 1, and 2 appear, and they stand for the species setosa, versicolor, and virginica, respectively.
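If readable names are preferred over numeric codes, one option is Series.map() with a dictionary. The sketch below assumes the usual scikit-learn ordering of iris_data['target_names'] (setosa, versicolor, virginica) and uses a short hypothetical Series of codes instead of the full dataset.

```python
import pandas as pd

# Assumed mapping, following the order of iris_data['target_names']
species_names = {0: 'setosa', 1: 'versicolor', 2: 'virginica'}

# Hypothetical sample of label codes
codes = pd.Series([0, 2, 1, 0], name='species')

# Replace each code with its name
named = codes.map(species_names)
print(named.tolist())
```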

With pd.concat(), the feature dataframe and the label dataframe above can be combined into a single dataframe.

df = pd.concat([df_data, df_target], axis=1)
df
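With axis=1, pd.concat() places the inputs side by side and matches rows by index label, not by position. The sketch below uses hypothetical three-row frames (features, labels, and shifted_labels are assumed names) to show the aligned case and what happens when the indexes differ.

```python
import pandas as pd

# Hypothetical feature and label frames sharing the default index 0, 1, 2
features = pd.DataFrame({'f1': [1.0, 2.0, 3.0], 'f2': [4.0, 5.0, 6.0]})
labels = pd.DataFrame({'species': [0, 1, 2]})

combined = pd.concat([features, labels], axis=1)
print(combined.shape)

# Same labels, but with an index shifted to 1, 2, 3
shifted_labels = pd.DataFrame({'species': [0, 1, 2]}, index=[1, 2, 3])
misaligned = pd.concat([features, shifted_labels], axis=1)
print(misaligned.shape)
print(misaligned.isnull().sum().sum())
```

In the iris case both frames are built from arrays of the same length with the default index, so the rows line up.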

The iris data, with 150 rows and 5 columns, is now loaded into a dataframe.

There are 150 irises in total across 3 species, and the dataset records the sepal length and width and the petal length and width of each flower.

Your quiz is to take this iris data and use dataframe features (conditional filtering, correlation analysis, and so on), Matplotlib, and Seaborn to compute various statistics and visualize them. There is no single correct answer. Review what you have learned and work through the quiz as thoroughly as you can.
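As one possible starting point, and by no means the expected answer, the sketch below applies three techniques from this chapter to a hypothetical six-row frame that borrows the iris column names. The rows are made up for illustration; in the quiz you would use df instead of mini.

```python
import pandas as pd

# Hypothetical rows with iris-style column names (not real measurements)
mini = pd.DataFrame({
    'petal length (cm)': [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    'petal width (cm)': [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    'species': [0, 0, 1, 1, 2, 2],
})

# Conditional filtering: flowers with long petals
long_petals = mini[mini['petal length (cm)'] > 4.0]

# GroupBy: mean petal length per species
mean_by_species = mini.groupby('species')['petal length (cm)'].mean()

# Correlation between two features
r = mini['petal length (cm)'].corr(mini['petal width (cm)'])

print(len(long_petals))
print(mean_by_species)
print(r)
```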