Oversampling Basic (SMOTE variants)

Notation

$$
\begin{aligned}
S &= \{(x_i, y_i)\},\; i = 1, \dots, m \text{ (m examples)} \\
S_{min} &= \text{the set of minority class examples} \\
S_{maj} &= \text{the set of majority class examples} \\
m_s &= |S_{min}| = \text{the number of minority class examples} \\
m_l &= |S_{maj}| = \text{the number of majority class examples}
\end{aligned}
$$

SMOTE (Synthetic Minority Oversampling Technique)

SMOTE finds the data samples of the minority class that are closest to a given minority-class sample using K-nearest neighbors (K-NN), then generates new synthetic samples by interpolation.

Algorithm

  • SminS_{min}Smin​에 μ†ν•˜λŠ” 데이터듀 쀑 μž„μ˜μ˜ μƒ˜ν”Œ 포인트 xix_ixi​에 λŒ€ν•΄ K-NN 을 μˆ˜ν–‰ν•©λ‹ˆλ‹€.

  • K개의 μ΅œκ·Όμ ‘ 이웃 μƒ˜ν”Œλ“€ 쀑 μž„μ˜μ˜ μƒ˜ν”Œμ„ λžœλ€ν•˜κ²Œ μ„ νƒν•©λ‹ˆλ‹€. (xzix_{zi}xzi​)

  • N개의 ν•©μ„± μƒ˜ν”Œλ“€μ„ μƒ˜ν”Œ 포인트 xix_{i}xi​와 xzix_{zi}xzi​ μ‚¬μ΄μ—μ„œ λžœλ€ν•˜κ²Œ μƒμ„±ν•©λ‹ˆλ‹€. 맀우 κ°„λ‹¨ν•œ μˆ˜μ‹μ΄λΌ κ·Έλ¦Ό ν•œ μž₯으둜 μ‰½κ²Œ 이해할 수 μžˆμŠ΅λ‹ˆλ‹€.

xnew=xi+λ×(xziβˆ’xi),β€…β€ŠΞ»βˆˆ[0,1]x_{new} = x_i + \lambda \times (x_{zi} - x_i), \; \lambda \in [0,1]xnew​=xi​+λ×(xziβ€‹βˆ’xi​),λ∈[0,1]
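
A minimal NumPy sketch of this interpolation step (the helper name `smote_sample` and the use of scikit-learn's `NearestNeighbors` are my own choices for illustration, not from the SMOTE paper):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, k=5, n_new=100, seed=0):
    """Generate n_new synthetic samples from minority data X_min via SMOTE."""
    rng = np.random.default_rng(seed)
    # k + 1 neighbors because each point's nearest neighbor is itself
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)            # idx[:, 0] is the point itself
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))         # random minority point x_i
        zi = rng.choice(idx[i, 1:])          # random neighbor x_zi among its k-NN
        lam = rng.random()                   # lambda ~ U[0, 1]
        # x_new = x_i + lambda * (x_zi - x_i)
        synthetic.append(X_min[i] + lam * (X_min[zi] - X_min[i]))
    return np.vstack(synthetic)
```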

SMOTE is a simple oversampling baseline, but because it interpolates only among minority-class samples without considering the distribution of the majority class at all, the synthetic samples can end up overlapping and interfering with majority-class data.

To address this, you can use improved variants of SMOTE such as Borderline SMOTE and ADASYN, or combine oversampling with undersampling techniques (aka hybrid sampling).
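
In practice these variants are available off the shelf. A quick sketch with the imbalanced-learn package (the toy data set and parameter values here are illustrative only):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, ADASYN

# Toy imbalanced data set: roughly 5% minority class
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=42)
print("before:", Counter(y))

X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print("after SMOTE:", Counter(y_res))

# The variants discussed below share the same fit_resample interface, e.g.:
# BorderlineSMOTE(random_state=42), ADASYN(random_state=42)
```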

Borderline SMOTE

A method that oversamples only the minority-class samples lying near the decision boundary.

Algorithm

  • k개의 μ΅œκ·Όμ ‘ 이웃 μƒ˜ν”Œλ“€ 쀑 μž„μ˜μ˜ μƒ˜ν”Œμ„ λžœλ€ν•˜κ²Œ μ„ νƒν•©λ‹ˆλ‹€. SMOTE와 달리, k개의 μ΅œκ·Όμ ‘ 이웃 μƒ˜ν”Œλ“€μ€ Smin,SmajS_{min}, S_{maj}Smin​,Smaj​에 μƒκ΄€ν•˜μ§€ μ•ŠμŠ΅λ‹ˆλ‹€.

  • K-NN μ€‘μ—μ„œ SmajS_{maj}Smaj​에 μ†ν•˜λŠ” μƒ˜ν”Œλ“€μ˜ 비쀑에 따라 DANGER instance에 μ†ν•˜λŠ” xix_ixiβ€‹λ§Œ μ„ νƒν•©λ‹ˆλ‹€. k개의 μ΅œκ·Όμ ‘ 이웃듀 사이에 μ‘΄μž¬ν•˜λŠ” majority μƒ˜ν”Œλ“€μ˜ 개수λ₯Ό k^\hat{k}k^라고 ν•  λ•Œ, Borderline SMOTEλŠ” μ•„λž˜ 쑰건에 따라 μ„Έ κ°€μ§€ μΈμŠ€ν„΄μŠ€λ₯Ό κ΅¬λ³„ν•©λ‹ˆλ‹€.

    • DANGER instance

      • λŒ€λΆ€λΆ„μ˜ μ΅œκ·Όμ ‘ 이웃듀이 λ‹€μˆ˜ λ²”μ£Ό; k/2≀k^<kk/2 \leq \hat{k} < kk/2≀k^<k

    • SAFE instance

      • λŒ€λΆ€λΆ„μ˜ μ΅œκ·Όμ ‘ 이웃듀이 μ†Œμˆ˜ λ²”μ£Ό; 0≀k^<k/20 \leq \hat{k} < k/20≀k^<k/2

    • Noise instance

      • λͺ¨λ“  μ΅œκ·Όμ ‘ 이웃듀이 λ‹€μˆ˜ λ²”μ£Ό; k^=k\hat{k} = kk^=k

  • λ³΄κ°„ν•˜λŠ” 방법은 SMOTE와 λ™μΌν•©λ‹ˆλ‹€.

ADASYN (Adaptive Synthetic Sampling)

With SMOTE and Borderline SMOTE, the number of synthetic samples generated per sample is identical for every selected sample. In other words, these methods still do not take the data distribution into account, so applying them as-is to real-world problems carries considerable risk. To resolve this, ADASYN computes the density distribution of the data and adjusts the number of synthetic samples dynamically.

Algorithm

  • SminS_{min}Smin​에 μ†ν•˜λŠ” 총 ν•©μ„± λ°μ΄ν„°μ˜ 수λ₯Ό κ³„μ‚°ν•©λ‹ˆλ‹€.

    • G=(∣Smajβˆ£βˆ’βˆ£Smin∣)Γ—Ξ²,β€…β€ŠΞ²βˆˆ[0,1]G = (|S_{maj}| - |S_{min}|) \times \beta, \; \beta \in [0,1]G=(∣Smajβ€‹βˆ£βˆ’βˆ£Sminβ€‹βˆ£)Γ—Ξ²,β∈[0,1] ; betaλŠ” ν•©μ„± 데이터 생성 ν›„μ˜ balance level을 μ‘°μ ˆν•˜λŠ” ν•˜μ΄νΌλΌλΌλ©”ν„°λ‘œ beta = 1인 경우 κ· ν˜• 데이터λ₯Ό μ˜λ―Έν•©λ‹ˆλ‹€.

  • SminS_{min}Smin​에 μ†ν•˜λŠ” 각 μƒ˜ν”Œ xix_ixi​에 λŒ€ν•œ μ΅œκ·Όμ ‘ 이웃듀을 μ°Ύκ³  κ·Έ λΉ„μœ¨μ„ κ³„μ‚°ν•©λ‹ˆλ‹€.

    • Ξ“i=Ξ”i/kZ,β€…β€Ši=1,…,∣Smin∣\Gamma_i = \dfrac{\Delta_i/k}{Z}, \; i = 1, \dots , |S_{min}|Ξ“i​=ZΞ”i​/k​,i=1,…,∣Sminβ€‹βˆ£

    • Ξ”i\Delta_iΞ”i​: xix_ixiβ€‹μ˜ μ΅œκ·Όμ ‘ 이웃 μ€‘μ—μ„œ SmajS_{maj}Smaj​에 μ†ν•˜λŠ” μƒ˜ν”Œλ“€μ˜ 개수, ZZZ: Ξ“i\Gamma_iΞ“i​가 ν™•λ₯  밀도 ν•¨μˆ˜(PDF)κ°€ λ˜λ„λ‘ ν•˜λŠ” μ •κ·œν™” μƒμˆ˜

  • SminS_{min}Smin​에 μ†ν•˜λŠ” μƒ˜ν”Œλ“€ 쀑 ν•œ μƒ˜ν”Œ 포인트 xix_ixi​에 λŒ€ν•΄ ν•©μ„± 데이터 μƒ˜ν”Œμ˜ 개수λ₯Ό λ™μ μœΌλ‘œ κ³„μ‚°ν•©λ‹ˆλ‹€; gi=Ξ“iΓ—Gg_i = \Gamma_i \times Ggi​=Ξ“i​×G

  • 보간 방법은 SMOTE와 λ™μΌν•©λ‹ˆλ‹€. 단, xix_ixi​에 λŒ€ν•΄ NNN개의 ν•©μ„± 데이터가 μ•„λ‹Œ gig_igiβ€‹κ°œμ˜ ν•©μ„± 데이터λ₯Ό λ³΄κ°„ν•©λ‹ˆλ‹€.

K-Means SMOTE

K-Means ν΄λŸ¬μŠ€ν„°λ§ μˆ˜ν–‰ ν›„, λ‹€μˆ˜ λ²”μ£Όμ˜ 데이터가 ν¬ν•¨λœ ν΄λŸ¬μŠ€ν„°λŠ” ν•©μ„± 데이터λ₯Ό μƒμ„±ν•˜μ§€ μ•ŠλŠ” λ°©λ²•μž…λ‹ˆλ‹€. μ•Œκ³ λ¦¬μ¦˜ λ˜ν•œ 맀우 κ°„λ‹¨ν•©λ‹ˆλ‹€.

Algorithm

  • 전체 데이터에 λŒ€ν•΄ K-Means ν΄λŸ¬μŠ€ν„°λ§μ„ μˆ˜ν–‰ν•©λ‹ˆλ‹€.

  • λ‹€μˆ˜ λ²”μ£Όμ˜ 데이터가 ν¬ν•¨λœ ν΄λŸ¬μŠ€ν„°λŠ” oversampling λŒ€μƒμ—μ„œ μ œμ™Έν•©λ‹ˆλ‹€.

  • μ†Œμˆ˜ λ²”μ£Όμ˜ λ°μ΄ν„°λ§Œ μ‘΄μž¬ν•˜λŠ” ν΄λŸ¬μŠ€ν„°μ— λŒ€ν•΄ SMOTE둜 ν•©μ„± 데이터λ₯Ό μƒμ„±ν•©λ‹ˆλ‹€.

Source: https://ieeexplore.ieee.org/document/8260753