Introduction

While the MNIST dataset has a perfectly balanced class distribution, real-world data frequently has certain classes with very few examples. This is especially common in tabular data, the data type most widely used in predictive analytics. Take churn prediction, a binary classification problem for preventing customer attrition: the proportion of customers who actually churn is very small relative to those who do not (roughly 1:10 to 1:100).

μ΄λŸ¬ν•œ 데이터λ₯Ό κ·ΈλŒ€λ‘œ ν›ˆλ ¨ μ‹œμ—λŠ” λ‹€μˆ˜ ν΄λž˜μŠ€μ— μ†ν•œ λ°μ΄ν„°λ“€μ˜ 뢄포λ₯Ό μœ„μ£Όλ‘œ κ³ λ €ν•˜κΈ°μ— λ‹€μˆ˜ ν΄λž˜μŠ€μ— μ†ν•œ 데이터에 과적합이 λ°œμƒν•˜κ²Œ 되며, μ†Œμˆ˜ ν΄λž˜μŠ€μ— μ†ν•œ λ°μ΄ν„°λŠ” 잘 λΆ„λ₯˜ν•˜μ§€ λͺ»ν•  κ°€λŠ₯성이 λ†’μ•„μ§‘λ‹ˆλ‹€.

μ΄λŸ¬ν•œ 문제λ₯Ό ν•΄κ²°ν•˜κΈ° μœ„ν•΄ λ‹€μˆ˜ ν΄λž˜μŠ€μ— μ†ν•œ 데이터듀을 μƒ˜ν”Œλ§ κΈ°λ²•μœΌλ‘œ 적게 μΆ”μΆœν•˜λŠ” undersampling κΈ°λ²•μ΄λ‚˜ μ†Œμˆ˜ ν΄λž˜μŠ€μ— μ†ν•œ λ°μ΄ν„°λ“€μ˜ νŒ¨ν„΄μ„ νŒŒμ•…ν•˜μ—¬ 데이터λ₯Ό λŠ˜λ¦¬λŠ” oversampling 기법듀을 생각해볼 수 μžˆμŠ΅λ‹ˆλ‹€.

Beyond sampling, we can also use weighting techniques that assign larger weights to minority-class examples, cost-sensitive learning techniques that impose a heavier penalty when a minority-class example is misclassified, and ensemble sampling techniques that repeatedly sample (with replacement) subsets of the majority class comparable in size to the minority class and ensemble the models trained on them.
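
As a rough sketch of the weighting idea, scikit-learn's `class_weight` option is one common way to realize cost-sensitive training; the weights below are illustrative, not prescriptive:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=10_000, n_features=10,
                           weights=[0.99, 0.01], random_state=42)

# 'balanced' reweights classes inversely proportional to their frequency:
# w_c = n_samples / (n_classes * n_samples_of_class_c)
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Explicit costs: a misclassified minority (class 1) example costs ~99x more
clf_manual = LogisticRegression(class_weight={0: 1.0, 1: 99.0},
                                max_iter=1000).fit(X, y)
```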

Let's briefly go over the metrics commonly used with imbalanced datasets. This is quite basic material, so feel free to skip it if you already know it.

Metrics

ROC Curve

Because the odd-sounding term Receiver Operating Characteristic can be confusing, let me briefly mention its origin. The term was first used in Britain during World War II as part of the "Chain Home" radar system, where radar was used to distinguish enemy aircraft from signal noise (e.g., birds).

λ ˆμ΄λ” λ²”μœ„μ— 적ꡰ μ „νˆ¬κΈ°λΏλ§Œ μ•„λ‹ˆλΌ μƒˆλ„ λ“€μ–΄μ˜€λŠ” κ²½μš°λ“€μ΄ μ’…μ’… μžˆλŠ”λ°, 이 λ•Œ λ ˆμ΄λ” 정찰병이 경보λ₯Ό λͺ¨λ‘ μ „νˆ¬κΈ°λ‘œ νŒλ‹¨ν•˜λ©΄ 였보일 ν™•λ₯ μ΄ μ˜¬λΌκ°€κ³  경보λ₯Ό λŒ€μˆ˜λ‘­μ§€ μ•Šκ²Œ μƒκ°ν•΄μ„œ λ¬΄μ‹œν•˜λ©΄ μ •μž‘ μ€‘μš”ν•œ λ•Œλ₯Ό λ†“μΉ˜κ²Œ λ©λ‹ˆλ‹€. 이에 λŒ€ν•œ trade-offλ₯Ό 2차원 μ’Œν‘œ(y좕은 TPR; True Positive Ratio, x좕은 FPR; False Positive Ratio)둜 λ‚˜νƒ€λ‚Έ 것이 ROC κ³‘μ„ μž…λ‹ˆλ‹€. νŒλ³„ 기쀀이 μ •μ°°λ³‘λ§ˆλ‹€ λ‹€λ₯΄κΈ° λ•Œλ¬Έμ— 각 μ •μ°°λ³‘μ˜ νŒλ³„ κ²°κ³Όκ°€ λ‹¬λžμ§€λ§Œ, μ •μ°°λ³‘λ“€μ˜ 데이터λ₯Ό μ’…ν•©ν•˜λ‹ˆ 곑선 ν˜•νƒœκ°€ 크게 λ°”λ€Œμ§€ μ•Šλ‹€λŠ” 것을 μ•Œ 수 있게 λ˜μ—ˆκ³  μ΄λŠ” μ•ˆμ •μ μœΌλ‘œ λͺ¨λΈμ˜ μ„±λŠ₯을 νŒλ³„ν•˜λŠ” μ§€ν‘œ 쀑 ν•˜λ‚˜κ°€ λ˜μ—ˆμŠ΅λ‹ˆλ‹€.

PR(Precision-Recall) Curve

μ „λ°˜μ μΈ λͺ¨λΈμ˜ μ„±λŠ₯을 νŒλ³„ν•˜λŠ” μ§€ν‘œλ‘œ ROC 곑선이 ν˜„μž¬λ„ 널리 μ“°μ΄μ§€λ§Œ, λΆˆκ· ν˜•λ„κ°€ 맀우 큰 λ°μ΄ν„°μ…‹μ΄λ‚˜ νŠΉμ • ν…ŒμŠ€νŠΈ μ…‹μ—μ„œμ˜ κ²°κ³Όκ°€ μ€‘μš”ν•˜λ‹€λ©΄ PR 곑선도 같이 κ³ λ €ν•΄μ•Ό ν•©λ‹ˆλ‹€. μ—μ„œ PR κ³‘μ„ μ˜ ν•„μš”μ„±μ— λŒ€ν•œ 이유λ₯Ό μ•„λž˜μ™€ 같이 κΈ°μˆ ν•˜κ³  μžˆμŠ΅λ‹ˆλ‹€.

Consequently, a large change in the number of false positives can lead to a small change in the false positive rate used in ROC analysis. Precision, on the other hand, by comparing false positives to true positives rather than true negatives, captures the effect of the large number of negative examples on the algorithm’s performance.

In other words, when there are many TNs (True Negatives), that is, when the majority class is large, the change in FPR is negligible compared to the change in the number of FPs (False Positives).

As a simple example, suppose we trained two models that classify cancer patients on a dataset containing 1 million healthy people and 100 cancer patients.

  • 1번 λͺ¨λΈ: 100λͺ…μ˜ μ•”ν™˜μžλ‘œ κ²€μΆœν–ˆλŠ”λ° μ‹€μ œλ‘œ μ•”ν™˜μžκ°€ 90λͺ…인 경우

  • 2번 λͺ¨λΈ: 2,000λͺ…을 μ•”ν™˜μžλ‘œ κ²€μΆœν–ˆλŠ”λ° μ‹€μ œλ‘œ μ•”ν™˜μžκ°€ 90λͺ…인 경우

Even without doing the math, Model 1 is obviously the better model, right? Now let's actually compute the ROC and PR metrics.

  • ROC κΈ°μ€€μœΌλ‘œ 평가 μ‹œ,

    • 1번 λͺ¨λΈ: TPR=0.9,β€…β€ŠFPR=(100βˆ’90)/1000000=0.00001\text{TPR} = 0.9, \; \text{FPR} = (100 - 90) / 1000000 = 0.00001TPR=0.9,FPR=(100βˆ’90)/1000000=0.00001

    • 2번 λͺ¨λΈ: TPR=0.9,β€…β€ŠFPR=(2000βˆ’90)/1000000β‰ˆ0.00191\text{TPR} = 0.9, \;\text{FPR} = (2000 - 90) / 1000000 \approx 0.00191TPR=0.9,FPR=(2000βˆ’90)/1000000β‰ˆ0.00191

    • 두 λͺ¨λΈμ˜ FPR μ°¨μ΄λŠ” 0.00191βˆ’0.00001=0.00190.00191 - 0.00001 = 0.00190.00191βˆ’0.00001=0.0019μž…λ‹ˆλ‹€.

  • PR κΈ°μ€€μœΌλ‘œ 평가 μ‹œ,

    • 1번 λͺ¨λΈ: Recall=0.9,β€…β€ŠPrecision=90/100=0.9\text{Recall} = 0.9, \; \text{Precision} = 90/100 = 0.9Recall=0.9,Precision=90/100=0.9

    • 2번 λͺ¨λΈ: Recall=0.9,β€…β€ŠPrecision=90/100=0.9\text{Recall} = 0.9, \; \text{Precision} = 90/100 = 0.9Recall=0.9,Precision=90/100=0.9

    • 두 λͺ¨λΈμ˜ Precision μ°¨μ΄λŠ” 0.9βˆ’0.0045=0.8550.9 - 0.0045 = 0.8550.9βˆ’0.0045=0.855μž…λ‹ˆλ‹€.

  • This shows that to clearly capture the performance gap between two models on an imbalanced dataset, the PR curve is needed as well; a quick numeric check follows below.
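
The counts follow directly from the example above; `rates` is just a hypothetical helper for this check:

```python
def rates(tp, fp, fn, tn):
    """Return (TPR, FPR, Precision) from confusion-matrix counts."""
    tpr = tp / (tp + fn)        # recall / true positive rate
    fpr = fp / (fp + tn)        # false positive rate
    precision = tp / (tp + fp)
    return tpr, fpr, precision

N_NEG = 1_000_000  # healthy people; there are 100 cancer patients in total

# Model 1: flags 100 people, 90 of them true patients
print(rates(tp=90, fp=10, fn=10, tn=N_NEG - 10))      # (0.9, 1e-05, 0.9)
# Model 2: flags 2,000 people, 90 of them true patients
print(rates(tp=90, fp=1910, fn=10, tn=N_NEG - 1910))  # (0.9, 0.00191, 0.045)
```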

AUROC (Area Under a ROC Curve, aka ROC AUC, AUC)

The area under the ROC curve, i.e., the area with respect to TPR and FPR; its value ranges from 0 to 1. Because it quantifies a model's predictive performance regardless of the decision threshold, it is widely used as a metric for classification problems.
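
A minimal sketch with scikit-learn's `roc_auc_score` (toy labels and scores for illustration):

```python
from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 0, 0, 1, 1]               # ground-truth labels
y_score = [0.1, 0.3, 0.2, 0.8, 0.7, 0.9]   # predicted positive-class scores

# AUROC equals the probability that a random positive is ranked above a
# random negative; 0.5 is random guessing, 1.0 is a perfect ranking.
print(roc_auc_score(y_true, y_score))  # 0.875
```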

AUPRC (Area Under a PR Curve, aka PR AUC)

The area under the PR curve, i.e., the area with respect to Precision and Recall; its value ranges from 0 to 1.
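
A minimal sketch, reusing the same toy scores; `auc` gives the trapezoidal area under the PR curve, while `average_precision_score` is the more common step-wise summary:

```python
from sklearn.metrics import (auc, average_precision_score,
                             precision_recall_curve)

y_true  = [0, 0, 0, 0, 1, 1]
y_score = [0.1, 0.3, 0.2, 0.8, 0.7, 0.9]

# PR-curve points over all thresholds, then trapezoidal area under them
precision, recall, _ = precision_recall_curve(y_true, y_score)
print(auc(recall, precision))

# Average precision: a step-wise (non-interpolated) PR-curve summary
print(average_precision_score(y_true, y_score))
```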

MCC (Matthews correlation coefficient)

F1 μ μˆ˜λŠ” TN을 λ¬΄μ‹œν•˜μ§€λ§Œ, MCCλŠ” confusion matrix의 4개 κ°’ λͺ¨λ‘λ₯Ό κ³ λ €ν•˜λ―€λ‘œ 4개 κ°’ λͺ¨λ‘ λͺ¨λ‘ 쒋은 예츑 결ㅁ과λ₯Ό μ–»λŠ” κ²½μš°μ—λ§Œ 높은 점수λ₯Ό 받을 수 μžˆμŠ΅λ‹ˆλ‹€.

$$\text{MCC} = \frac{\text{TP}\cdot\text{TN} - \text{FP}\cdot\text{FN}}{\sqrt{(\text{TP}+\text{FP})\,(\text{TP}+\text{FN})\,(\text{TN}+\text{FP})\,(\text{TN}+\text{FN})}}$$

MCCλŠ” -1μ—μ„œ 1μ‚¬μ΄μ˜ κ°’μœΌλ‘œ 1은 Perfect Prediction, 0은 Random Prediction, -1은 Worst Prediction을 μ˜λ―Έν•©λ‹ˆλ‹€. Accuracy, F1 점수, MCC의 κ²°κ³Ό 비ꡐ에 λŒ€ν•œ μžμ„Έν•œ λ‚΄μš©μ€ μ•„λž˜ 링크λ₯Ό μ°Έμ‘°ν•˜μ„Έμš”.

The Relationship Between Precision-Recall and ROC Curves (paper)
https://github.com/davidechicco/MCC