Fine-Tuning Evaluation and Deployment: Measuring Performance and Serving the Model

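Why Evaluation Is Hard

LLM output is hard to measure with a single simple metric such as classification accuracy, because many differently worded answers can be equally correct.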

flowchart TB
    subgraph EVAL["Evaluation methods"]
        Q1["Quantitative evaluation\nBLEU, ROUGE\n(automatic)"]
        Q2["LLM-as-Judge\nScored by GPT-4\n(semi-automatic)"]
        Q3["Human evaluation\nExpert review\n(high cost)"]
    end
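To make the difficulty concrete, here is a minimal sketch that scores a correct but reworded answer two ways: classification-style exact match and a unigram-overlap F1. The function names and the whitespace tokenization are assumptions chosen for illustration, not part of any library.

```python
from collections import Counter


def exact_match(reference: str, hypothesis: str) -> float:
    """Classification-style scoring: 1.0 only if the strings are identical."""
    return 1.0 if reference.strip() == hypothesis.strip() else 0.0


def token_f1(reference: str, hypothesis: str) -> float:
    """Unigram-overlap F1 over plain whitespace tokens (an illustrative choice)."""
    ref_tokens = reference.split()
    hyp_tokens = hypothesis.split()
    if not ref_tokens or not hyp_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(ref_tokens) & Counter(hyp_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "Python is a programming language with concise syntax."
hypothesis = "Python is a concise programming language."

print(exact_match(reference, hypothesis))         # 0.0
print(round(token_f1(reference, hypothesis), 3))  # 0.714
```

Exact match rejects a perfectly acceptable answer, while overlap gives partial credit but still says nothing about factual correctness. That gap is what the judge-based and human methods in the diagram are meant to cover.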

Automatic Evaluation Metrics

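from rouge_score import rouge_scorer
from sacrebleu.metrics import BLEU

# Note: rouge_score's default tokenizer keeps only ASCII letters and digits,
# so non-English text (e.g. Korean) needs a custom tokenizer to get nonzero scores.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=False)

reference = "Python is a programming language with concise syntax."
hypothesis = "Python is a programming language whose syntax is concise."

# score(target, prediction): the reference goes first
scores = scorer.score(reference, hypothesis)
print(f"ROUGE-1: {scores['rouge1'].fmeasure:.3f}")
print(f"ROUGE-L: {scores['rougeL'].fmeasure:.3f}")

# Sentence-level BLEU (0-100 scale); effective_order avoids zero scores on short sentences
bleu = BLEU(effective_order=True)
print(f"BLEU: {bleu.sentence_score(hypothesis, [reference]).score:.1f}")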
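To show what ROUGE-L actually measures, the following is a minimal from-scratch sketch: an F1 score built on the longest common subsequence (LCS) of tokens. It assumes plain whitespace tokenization, so its numbers will not match `rouge_score` exactly; the helper names are hypothetical. A Korean sentence pair is used because whitespace tokens work for non-ASCII text, which the library's default tokenizer does not handle.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, hypothesis: str) -> float:
    """ROUGE-L style F1 over whitespace tokens (tokenization is an assumption)."""
    ref_tokens = reference.split()
    hyp_tokens = hypothesis.split()
    if not ref_tokens or not hyp_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, hyp_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# "Python is a programming language with concise syntax." (two Korean phrasings)
reference = "파이썬은 간결한 문법을 가진 프로그래밍 언어입니다."
hypothesis = "파이썬은 문법이 간결한 프로그래밍 언어입니다."

print(f"ROUGE-L (whitespace tokens): {rouge_l_f1(reference, hypothesis):.3f}")  # 0.727
```

Four of the hypothesis tokens appear in the reference in the same order, giving precision 4/5 and recall 4/6. Word order matters for ROUGE-L in a way it does not for ROUGE-1.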

LLM-as-Judge Evaluation (Recommended)

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class EvaluationResult(BaseModel):
    score: int        # 1-5
    reasoning: str
    strengths: list[str]
    weaknesses: list[str]

def evaluate_response(
    question: str,
    response: str,
    reference: str | None = None
) -> EvaluationResult:
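    prompt = f"""Rate the following response on a scale of 1 to 5.

Question: {question}
Response: {response}
{"Reference answer: " + reference if reference else ""}

Evaluation criteria:
- Accuracy (no factual errors)
- Completeness (fully answers the question)
- Clarity (easy to understand)
- Usefulness (actually helpful)"""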

    result = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format=EvaluationResult,
    )
    return result.choices[0].message.parsed

# Automatically evaluate a test set
def evaluate_model(test_cases: list[dict], model_fn) -> dict:
    if not test_cases:
        raise ValueError("test_cases must not be empty")

    scores = []
    for case in test_cases:
        response = model_fn(case["question"])
        result = evaluate_response(case["question"], response, case.get("reference"))
        scores.append(result.score)

    return {
        "mean": sum(scores) / len(scores),
        "min": min(scores),
        "max": max(scores),
    }
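Mean, min, and max can hide how the judge scores are spread, so a slightly richer summary may be useful. The sketch below is one possible shape for it, assuming 1-5 integer scores. The `summarize_scores` helper and the pass threshold of 4 are hypothetical choices, not an established convention.

```python
from collections import Counter
from statistics import mean, pstdev


def summarize_scores(scores: list[int], pass_threshold: int = 4) -> dict:
    """Summarize 1-5 judge scores: mean, spread, pass rate, and a histogram."""
    if not scores:
        raise ValueError("scores must not be empty")
    distribution = Counter(scores)
    return {
        "mean": mean(scores),
        "stdev": pstdev(scores),
        # Share of responses at or above the (assumed) acceptance threshold.
        "pass_rate": sum(s >= pass_threshold for s in scores) / len(scores),
        # Always list every score from 1 to 5 so gaps are visible.
        "distribution": {s: distribution.get(s, 0) for s in range(1, 6)},
    }


summary = summarize_scores([5, 4, 4, 2, 5, 3])
print(summary)
```

Cases that fall below the threshold are natural candidates for the more expensive human review step.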

Local Deployment with Ollama

# Install Ollama (this script targets Linux; on macOS, install the desktop app or run `brew install ollama`)
curl -fsSL https://ollama.ai/install.sh | sh

# Create a Modelfile for the merged model
cat > Modelfile << EOF
FROM ./my-merged-model
SYSTEM "You are a professional legal advisor."
PARAMETER temperature 0.3
PARAMETER top_p 0.9
EOF

# Register the model
ollama create legal-advisor -f Modelfile

# Run it
ollama run legal-advisor

# Use it through the REST API
curl http://localhost:11434/api/generate -d '{
  "model": "legal-advisor",
  "prompt": "What should I watch out for in a lease agreement?",
  "stream": false
}'
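The same REST call can be made from Python. This is a minimal sketch using only the standard library. It assumes the Ollama server is running on the default port and that the non-streaming reply carries the generated text in a `response` field; the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build the POST request for a single, non-streaming generation."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as http_response:
        body = json.loads(http_response.read().decode("utf-8"))
    return body["response"]


# Requires a running Ollama server with the model registered:
# print(generate("legal-advisor", "What should I watch out for in a lease agreement?"))
```

Splitting request construction from the network call keeps the payload easy to inspect and test without a server.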

High-Performance Serving with vLLM

pip install vllm

# Start the server
python -m vllm.entrypoints.openai.api_server \
    --model ./my-merged-model \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9

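# Use it through the OpenAI-compatible API
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"  # vLLM does not check the key unless the server is started with --api-key
)

response = client.chat.completions.create(
    model="./my-merged-model",  # by default, the served model name is the --model value
    messages=[{"role": "user", "content": "Your question"}],
)
print(response.choices[0].message.content)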
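vLLM is designed to batch requests that arrive concurrently, so a client that sends prompts in parallel usually gets more throughput than one that sends them one by one. The sketch below shows one way to do that with a thread pool. `ask_all` is a hypothetical helper, and `fake_ask` is a stand-in so the example runs without a server.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def ask_all(
    ask_fn: Callable[[str], str],
    prompts: list[str],
    max_workers: int = 8,
) -> tuple[list[str], float]:
    """Send all prompts concurrently; return answers in input order and elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        answers = list(pool.map(ask_fn, prompts))  # map preserves input order
    elapsed = time.perf_counter() - start
    return answers, elapsed


def fake_ask(prompt: str) -> str:
    """Stand-in for a real request; sleeps briefly to simulate latency."""
    time.sleep(0.05)
    return f"answer to: {prompt}"


answers, elapsed = ask_all(fake_ask, ["q1", "q2", "q3", "q4"], max_workers=4)
print(answers, f"{elapsed:.2f}s")
```

In real use, `ask_fn` would wrap a `client.chat.completions.create(...)` call against the vLLM server and return the message content.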

Sharing on the HuggingFace Hub

from huggingface_hub import HfApi

api = HfApi()

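# Create the repository if it does not exist yet, then upload the adapter
api.create_repo(repo_id="username/my-legal-advisor-lora", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="./my-lora-adapter",
    repo_id="username/my-legal-advisor-lora",
    repo_type="model",
)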

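# Model card, uploaded as README.md (the YAML front matter must start on the first line)
model_card = """\
---
base_model: mistralai/Mistral-7B-Instruct-v0.2
tags: [fine-tuned, LoRA, legal, Korean]
---
# Korean Legal Advisor Model (LoRA)
A Mistral-7B model fine-tuned on 1,000 Korean legal Q&A pairs.
"""

api.upload_file(
    path_or_fileobj=model_card.encode("utf-8"),
    path_in_repo="README.md",
    repo_id="username/my-legal-advisor-lora",
    repo_type="model",
)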

LLM Fine-Tuning Series Recap

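Part	Topic
Part 1	Fine-tuning concepts, how LoRA works
Part 2	Data collection, cleaning, and formatting
Part 3	Fine-tuning with the OpenAI API
Part 4	Open-source fine-tuning with LoRA/QLoRA
Part 5	Evaluation and deployment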

Next up is multimodal AI: we will learn how to work with modalities beyond text, such as images and audio.
