Exploratory Data Analysis (EDA): Finding Insights in Data

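What Is EDA?

Exploratory data analysis (EDA) is the step where you get to know a dataset in depth before analyzing it formally: its size, its types, its gaps, and how its variables relate to one another.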

Practice Dataset: Titanic

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

Step 1: Data Overview

# Basic information
print("=== Basic info ===")
print(f"Rows: {df.shape[0]}, columns: {df.shape[1]}")
print(df.dtypes)

# Missing values: count and percentage, only for columns that have any
print("\n=== Missing values ===")
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(1)
print(pd.DataFrame({"missing_count": missing, "missing_pct": missing_pct})[missing > 0])

# Summary statistics
print("\n=== Numeric variable summary ===")
print(df.describe())

print("\n=== Categorical variable summary ===")
print(df.describe(include="object"))
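The missing-value report above is something you will likely rerun on every new dataset, so it can help to wrap it in a function. The following is a minimal sketch, not part of the original walkthrough: the helper name `missing_summary` and the toy frame are hypothetical, and the toy values are made up so the example runs without downloading anything.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    counts = frame.isnull().sum()
    pct = (counts / len(frame) * 100).round(1)
    summary = pd.DataFrame({"missing_count": counts, "missing_pct": pct})
    # Keep only columns with at least one missing value, worst first.
    return summary[summary["missing_count"] > 0].sort_values(
        "missing_count", ascending=False
    )


# Toy data standing in for the Titanic frame (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})
report = missing_summary(toy)
print(report)
```

On the real data you would call `missing_summary(df)` instead of passing the toy frame.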

Step 2: Univariate Analysis

fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# Survival
survival_rate = df["Survived"].mean()
axes[0, 0].bar(["Died", "Survived"], df["Survived"].value_counts().sort_index(),
               color=["salmon", "steelblue"])
axes[0, 0].set_title(f"Survival (survival rate: {survival_rate:.1%})")

# Age distribution, with the mean marked
df["Age"].hist(bins=30, ax=axes[0, 1], edgecolor="white", color="steelblue")
axes[0, 1].axvline(df["Age"].mean(), color="red", linestyle="--",
                   label=f"Mean: {df['Age'].mean():.1f}")
axes[0, 1].set_title("Age distribution")
axes[0, 1].legend()

# Passenger class
pclass_counts = df["Pclass"].value_counts().sort_index()
axes[0, 2].bar(["1st class", "2nd class", "3rd class"], pclass_counts.values)
axes[0, 2].set_title("Passengers by class")

# Fare distribution (log1p-transformed values, not a log-scaled axis)
np.log1p(df["Fare"]).hist(bins=30, ax=axes[1, 0], color="steelblue", edgecolor="white")
axes[1, 0].set_title("Fare distribution (log1p-transformed)")

# Sex
df["Sex"].value_counts().plot(kind="bar", ax=axes[1, 1], rot=0)
axes[1, 1].set_title("Sex distribution")

# Port of embarkation
df["Embarked"].value_counts().plot(kind="bar", ax=axes[1, 2], rot=0)
axes[1, 2].set_title("Port of embarkation")

plt.tight_layout()
plt.savefig("univariate.png", dpi=150, bbox_inches="tight")
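The plots above show each distribution, but they do not flag outliers explicitly. One common way to do that is the 1.5 × IQR rule, sketched below. This is an addition to the original walkthrough, not part of it: the helper name `iqr_outliers` is hypothetical, the multiplier `k=1.5` is a convention rather than a requirement, and the fare values are made up for illustration.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    clean = values.dropna()
    q1, q3 = clean.quantile(0.25), clean.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return clean[(clean < lower) | (clean > upper)]


# Made-up fares with one extreme value.
fares = pd.Series([7.25, 8.05, 7.92, 13.0, 26.0, 10.5, 512.33], name="Fare")
outliers = iqr_outliers(fares)
print(outliers)
```

On the real data, `iqr_outliers(df["Fare"])` would list the fares that fall outside the fences. For a heavily skewed variable like fare, a flagged value is not necessarily an error, so treat the result as a list of rows to inspect.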

Step 3: Bivariate Analysis

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Sex x survival
survival_by_sex = df.groupby("Sex")["Survived"].mean()
survival_by_sex.plot(kind="bar", ax=axes[0, 0], rot=0, color=["salmon", "steelblue"])
axes[0, 0].set_title(f"Survival rate by sex\nFemale: {survival_by_sex['female']:.1%}, male: {survival_by_sex['male']:.1%}")
axes[0, 0].set_ylabel("Survival rate")

# Passenger class x survival
survival_by_pclass = df.groupby("Pclass")["Survived"].mean()
axes[0, 1].bar(["1st class", "2nd class", "3rd class"], survival_by_pclass.values)
axes[0, 1].set_title("Survival rate by class")

# Age x survival
df[df["Survived"] == 1]["Age"].hist(bins=20, ax=axes[1, 0], alpha=0.7, label="Survived", color="steelblue")
df[df["Survived"] == 0]["Age"].hist(bins=20, ax=axes[1, 0], alpha=0.7, label="Died", color="salmon")
axes[1, 0].set_title("Age distribution (survived vs. died)")
axes[1, 0].legend()

# Fare x survival
sns.boxplot(data=df, x="Survived", y="Fare", ax=axes[1, 1])
# Fix the tick positions before relabeling them to avoid a matplotlib warning
axes[1, 1].set_xticks([0, 1])
axes[1, 1].set_xticklabels(["Died", "Survived"])
axes[1, 1].set_title("Fare vs. survival")

plt.tight_layout()
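Each chart above relates survival to a single variable. If you also want to see how two variables combine, for example sex and passenger class, a pivot table of survival rates is one option. The sketch below is an addition to the original walkthrough and uses a small made-up frame with the same column names, so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd

# Made-up rows that reuse the Titanic column names.
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 1, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0],
})

# Mean of the 0/1 Survived column per (Sex, Pclass) cell = survival rate.
rate_table = toy.pivot_table(
    index="Sex", columns="Pclass", values="Survived", aggfunc="mean"
)

# Cell sizes, so that a rate based on one or two passengers is not over-read.
count_table = pd.crosstab(toy["Sex"], toy["Pclass"])

print(rate_table)
print(count_table)
```

Reading the two tables side by side is the point of the design: a rate is only as trustworthy as the number of rows behind it.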

Step 4: Correlation Analysis

# Correlation between numeric variables
numeric_df = df[["Survived", "Pclass", "Age", "SibSp", "Parch", "Fare"]]

plt.figure(figsize=(8, 6))
sns.heatmap(numeric_df.corr(), annot=True, fmt=".2f",
            cmap="RdYlGn", vmin=-1, vmax=1, center=0)
plt.title("Correlation between variables")
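A heatmap shows every pair at once, but often the column of interest is the target. The sketch below, which is an addition and not part of the original walkthrough, pulls out the correlations with `Survived` and ranks them by absolute value. It assumes Spearman rank correlation as an alternative to the pandas default (Pearson), since `Pclass` is ordinal and `Fare` is skewed. The toy values are made up.

```python
import pandas as pd

# Made-up rows that reuse the Titanic column names.
toy = pd.DataFrame({
    "Survived": [0, 0, 1, 1, 0, 1],
    "Pclass": [3, 3, 1, 2, 3, 1],
    "Fare": [7.3, 8.1, 71.3, 13.0, 7.9, 53.1],
})

target_corr = (
    toy.corr(method="spearman")["Survived"]  # correlations with the target
    .drop("Survived")                        # drop the trivial self-correlation
    .sort_values(key=lambda s: s.abs(), ascending=False)  # strongest first
)
print(target_corr)
```

On the real data the same chain would start from `numeric_df` instead of the toy frame. Correlation only captures monotonic association, so a weak coefficient does not rule out a relationship.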

Step 5: Summarizing Insights

print("=== Key EDA insights ===")
print(f"1. Overall survival rate: {df['Survived'].mean():.1%}")
print(f"2. Female survival rate: {df[df['Sex']=='female']['Survived'].mean():.1%} (male: {df[df['Sex']=='male']['Survived'].mean():.1%})")
print(f"3. 1st-class survival rate: {df[df['Pclass']==1]['Survived'].mean():.1%} (3rd class: {df[df['Pclass']==3]['Survived'].mean():.1%})")
print(f"4. Missing ages: {df['Age'].isnull().sum()} passengers ({df['Age'].isnull().mean():.1%})")
print(f"5. Mean fare of survivors: {df[df['Survived']==1]['Fare'].mean():.1f} (died: {df[df['Survived']==0]['Fare'].mean():.1f})")

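Summary

EDA checklist:

Step	Questions
Overview	How many rows? What types? How much is missing?
Univariate	How is each variable distributed? Are there outliers?
Bivariate	How does each variable relate to the target?
Multivariate	What patterns show up in the correlations?
Insights	What are the key findings?

The next post covers machine learning basics: how to build predictive models from data with scikit-learn.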
