XGBoost Algorithm Overview

์•„๋ž˜ ๋‚ด์šฉ์€ Decision Tree, Bagging, AdaBoost, GBDT(Gradient Boosting Decision Tree)์— ๋Œ€ํ•ด ์•Œ๊ณ  ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ์ „์ œ๋กœ ํ•œ๋‹ค. ์ˆ˜์‹๋“ค๋กœ ์ธํ•ด ์•„๋ž˜ ๊ณผ์ •๋“ค์ด ๋ณต์žกํ•ด ๋ณด์ผ ์ˆ˜ ์žˆ์ง€๋งŒ, ํŽœ์„ ์žก๊ณ  ๋…ผ๋ฌธ๊ณผ ๊ณต์‹ ์‚ฌ์ดํŠธ์˜ ์„ค๋ช…์„ ๊ฐ™์ด ์ฐธ์กฐํ•˜๋ฉด ์‰ฝ๊ฒŒ ์ดํ•ด ๊ฐ€๋Šฅํ•˜๋‹ค.

1. Key Strengths of XGBoost (Extreme Gradient Boosting)

  • A gradient boosting algorithm that, along with LightGBM and CatBoost, is widely popular on Kaggle

  • Parallel Processing: Based on GBDT, but unlike plain GBDT it parallelizes the work, so training is much faster (though still slower than LightGBM)

  • Robust to Overfitting: Prunes branches with a greedy algorithm and supports early stopping, so it does not overfit easily

  • Cost Function: Uses both the first- and second-order derivative information of the cost function

  • Regularization: Adds a regularization term when building each decision tree to prevent overfitting (see the usage sketch after this list)

  • Flexible Loss function: Any user-defined loss function can be plugged in

  • Sparsity aware: Handles missing values natively
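
The sketch below is not from the original post; it is a minimal illustration, assuming the Python `xgboost` package, of how the points above show up as training options: `lambda` and `gamma` for regularization, `nthread` for parallel processing, and `early_stopping_rounds` to stop boosting when the validation metric stops improving. All parameter values are illustrative only.

```python
import numpy as np
import xgboost as xgb

# Toy binary classification data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

dtrain = xgb.DMatrix(X[:800], label=y[:800])
dvalid = xgb.DMatrix(X[800:], label=y[800:])

params = {
    "objective": "binary:logistic",
    "eta": 0.1,        # learning rate (shrinkage)
    "max_depth": 4,
    "lambda": 1.0,     # L2 penalty on the leaf scores
    "gamma": 0.1,      # minimum gain required to make a split (used for pruning)
    "nthread": 4,      # parallel split finding
}

# early_stopping_rounds halts boosting when the validation metric stops improving.
bst = xgb.train(
    params,
    dtrain,
    num_boost_round=500,
    evals=[(dvalid, "valid")],
    early_stopping_rounds=20,
)
```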

2. How Boosting Works: Objective Function

Objective Function: Training Loss + Regularization

KKK๊ฐœ์˜ ํŠธ๋ฆฌ๊ฐ€ ์žˆ๊ณ  ๋ชจ๋ธ fkf_kfkโ€‹๋ฅผ ๋”ํ•ด y^\hat{y}y^โ€‹๋ฅผ ์˜ˆ์ธกํ•  ๋•Œ, ์ด ํŠธ๋ฆฌ๋“ค์ด ๋ชจ์—ฌ์„œ ๊ตฌ์„ฑ๋˜๋Š” ๋ชจ๋ธ FFF(CART; Classification And Regression Trees์˜ ๋ชจ๋“  tree ์„ธํŠธ)๋Š” ์•„๋ž˜์™€ ๊ฐ™์ด ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๋‹ค.

$$\hat{y}_i = \sum_{k=1}^K f_k(x_i), \quad f_k \in F \tag{1}$$

๋ชฉ์  ํ•จ์ˆ˜(objective function)๋Š” ์•„๋ž˜์™€ ๊ฐ™์ด ์ •์˜ํ•˜์—ฌ ์ตœ์ ํ™”ํ•  ์ˆ˜ ์žˆ๋‹ค.

$$\text{obj} = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^t \Omega(f_i) \tag{2}$$

์—ฌ๊ธฐ์—์„œ ์ขŒ์ธก ํ•ญ์€ ์‹ค์ œ๊ฐ’๊ณผ ์˜ˆ์ธก๊ฐ’๊ณผ์˜ ์ฐจ์ด๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š” training loss์ด๊ณ , ์šฐ์ธก ํ•ญ์€ ํŠธ๋ฆฌ ๋ชจ๋ธ์˜ ๋ณต์žก๋„๋ฅผ ์กฐ์ ˆํ•˜์—ฌ ๊ณผ์ ํ•ฉ(overfitting)์„ ๋ฐฉ์ง€ํ•˜๋Š” ์ •๊ทœํ™”(regularization) ํ•ญ์ด๋‹ค. ํ•™์Šตํ•  ํŒŒ๋ผ๋ฉ”ํ„ฐ๋“ค์€ ๊ฐ ํŠธ๋ฆฌ์˜ ๊ตฌ์กฐ์™€ leaf ๋…ธ๋“œ์˜ score๋กœ ฮธ=f1,f2,โ‹ฏโ€‰,fK\theta= { {f1, f_2, \cdots, f_{K}} }ฮธ=f1,f2โ€‹,โ‹ฏ,fKโ€‹์ด๋‹ค.

Additive Training

Unlike conventional GBM, XGBoost learns the tree structures directly; in other words, it searches for the optimal parameters of the trees themselves.

Conventional GBM relies on stochastic gradient descent and must compute the residuals sequentially, so all trees cannot be optimized at once. To get around this, XGBoost uses an additive strategy that adds one tree at a time.

$$\begin{aligned}\hat{y}_i^{(0)} &= 0 \\ \hat{y}_i^{(1)} &= f_1(x_i) = \hat{y}_i^{(0)} + f_1(x_i) \\ \hat{y}_i^{(2)} &= f_1(x_i) + f_2(x_i) = \hat{y}_i^{(1)} + f_2(x_i) \\ &\dots \\ \hat{y}_i^{(t)} &= \sum_{k=1}^t f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i)\end{aligned} \tag{3}$$

$\hat{y}_i^{(t)}$ is the prediction up to step $t$, and $f_t(x_i)$ is the tree (classifier) added at step $t$.
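
As a toy illustration (not from the original post), the additive prediction in (3) is nothing more than an accumulation of per-stage outputs; the lambdas below are hypothetical stand-ins for fitted CART trees.

```python
def boosted_predict(trees, x):
    """Accumulate y_hat^(t) = y_hat^(t-1) + f_t(x) over the added trees, as in Eq. (3)."""
    y_hat = 0.0
    for f_t in trees:      # one tree is added per boosting round
        y_hat += f_t(x)
    return y_hat

# Hypothetical stage functions standing in for fitted trees f_1, f_2, f_3.
trees = [lambda x: 0.5 * x, lambda x: 0.3 * (x - 1), lambda x: 0.1]
print(boosted_predict(trees, 2.0))   # 0.5*2.0 + 0.3*1.0 + 0.1 = 1.4
```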

(3)์˜ ์ˆ˜์‹์„ ๊ทธ๋Œ€๋กœ ์น˜ํ™˜ํ•˜๋ฉด ๋ชฉ์ ํ•จ์ˆ˜๋Š” ์•„๋ž˜์™€ ๊ฐ™์ด ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๋‹ค.

$$\begin{aligned}\text{obj}^{(t)} &= \sum_{i=1}^n l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^t \Omega(f_i) \\ &= \sum_{i=1}^n l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t) + \mathrm{constant}\end{aligned} \tag{5}$$

If the loss function is chosen to be MSE (Mean Squared Error), the objective becomes:

$$\begin{aligned}\text{obj}^{(t)} &= \sum_{i=1}^n \left(y_i - (\hat{y}_i^{(t-1)} + f_t(x_i))\right)^2 + \sum_{i=1}^t \Omega(f_i) \\ &= \sum_{i=1}^n \left[2(\hat{y}_i^{(t-1)} - y_i) f_t(x_i) + f_t(x_i)^2\right] + \Omega(f_t) + \mathrm{constant}\end{aligned} \tag{6}$$

(6)์˜ ์ˆ˜์‹์„ ์ž˜ ์‚ดํŽด๋ณด์ž. ๋ถ„๋ช… ttt ์— ๋Œ€ํ•œ ๋ฌธ์ œ์ธ๋ฐ tโˆ’1t-1tโˆ’1 ์‹œ์ ๊นŒ์ง€ ๊ณ ๋ คํ•ด์•ผ ํ•˜๊ธฐ์— loss function ์ตœ์ ํ™”๊ฐ€ ๋งค์šฐ ๋ณต์žกํ•ด์ง„๋‹ค. ์ด๋ฅผ ์–ด๋–ป๊ฒŒ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ์„๊นŒ?

Taylor ๊ธ‰์ˆ˜๋ฅผ ํ™œ์šฉํ•œ objective function ๊ฐ„์†Œํ™”

์ƒ๊ธฐ์™€ ๊ฐ™์ด ๋ชฉ์  ํ•จ์ˆ˜๋ฅผ ๊ทธ๋Œ€๋กœ ์ „๊ฐœํ•˜๋ฉด, ์‹๋„ ๋ณต์žกํ•ด์ง€๊ณ  ๋‹ค์–‘ํ•œ loss function์„ ์ ์šฉํ•˜๋Š”๋ฐ ์–ด๋ ค์›€์ด ์žˆ๋‹ค. ํ•˜์ง€๋งŒ, ํ…Œ์ผ๋Ÿฌ ๊ธ‰์ˆ˜๋ฅผ ์‚ฌ์šฉํ•ด ๊ฐ„์†Œํ™”ํ•˜๋ฉด ์ด๋Ÿฌํ•œ ์–ด๋ ค์›€๋“ค์„ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ๋‹ค.

ํ…Œ์ผ๋Ÿฌ ๊ธ‰์ˆ˜ ์ „๊ฐœ๋กœ ๋ชฉ์  ํ•จ์ˆ˜๋ฅผ ์น˜ํ™˜ํ•˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค. gig_igiโ€‹๋Š” 1์ฐจ ํŽธ๋ฏธ๋ถ„ ํ•จ์ˆ˜๋กœ gradient์ด๊ณ , hih_ihiโ€‹๋Š” 2์ฐจ ํŽธ๋ฏธ๋ถ„ ํ•จ์ˆ˜๋กœ Hessian ํ–‰๋ ฌ์ด๋‹ค.

$$\text{obj}^{(t)} = \sum_{i=1}^n \left[l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t) + \mathrm{constant} \tag{7}$$

$$\begin{aligned}g_i &= \partial_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)}) \\ h_i &= \partial^2_{\hat{y}_i^{(t-1)}} l(y_i, \hat{y}_i^{(t-1)})\end{aligned}$$

The loss $l(y_i, \hat{y}_i^{(t-1)})$ is a constant at step $t$, so it can be absorbed into the constant term. Equation (7) therefore reduces to:

$$\text{obj}^{(t)} = \sum_{i=1}^n \left[g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i)\right] + \Omega(f_t) + \mathrm{constant} \tag{8}$$

The objective now no longer depends on the specific form of the loss function: by supplying user-defined $g$ and $h$, any loss can be used, and the computation is easy to distribute.
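
As a minimal sketch of this flexibility (assuming `xgboost`'s custom-objective interface, where the callback receives the raw predictions and the training `DMatrix` and returns per-instance $g_i$ and $h_i$), the gradient and Hessian of the logistic loss could be supplied as follows; the function name is ours, not part of the library.

```python
import numpy as np
import xgboost as xgb

def logistic_obj(preds, dtrain):
    """Return (grad, hess) of the logistic loss w.r.t. the raw margin scores."""
    labels = dtrain.get_label()
    p = 1.0 / (1.0 + np.exp(-preds))   # sigmoid of the margin
    grad = p - labels                  # g_i: first derivative of the loss
    hess = p * (1.0 - p)               # h_i: second derivative of the loss
    return grad, hess

# Hypothetical usage: pass the callback as the training objective.
# bst = xgb.train({"max_depth": 4, "eta": 0.1}, dtrain, num_boost_round=100, obj=logistic_obj)
```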

์ด์ œ ์ˆ˜์‹ ์šฐ์ธก์˜ ฮฉ(ft)\Omega(f_t)ฮฉ(ftโ€‹) ํ•ญ์— ๋Œ€ํ•ด ์•Œ์•„๋ณด์ž.

Regularization

Unlike a plain Decision Tree, XGBoost makes active use of a regularization function that controls the tree's weights.

First, define the tree function $f_t(x)$ as below ($w$: the leaf scores, $q$: the function that assigns each data point to a leaf, $T$: the number of leaves).

$$f_t(x) = w_{q(x)}, \quad w \in \mathbb{R}^T, \quad q: \mathbb{R}^d \rightarrow \{1,2,\cdots,T\} \tag{9}$$

๋‹ค์‹œ ๋งํ•ด, ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ xxx๋ฅผ qqq๋ฒˆ์งธ leaf์— ํ• ๋‹นํ•˜๊ณ , ๊ทธ์— ๋Œ€์‘ํ•˜๋Š” q(x)q(x)q(x)๋ฒˆ์งธ leaf์˜ score wq(x)w_{q(x)}wq(x)โ€‹๋ฅผ ๋ฐ์ดํ„ฐ ํฌ์ธํŠธ์— ํ• ๋‹นํ•œ๋‹ค.

XGBoost์—์„œ๋Š” ํŠธ๋ฆฌ์˜ ๋น„์ค‘(weights)์„ ์กฐ์ ˆํ•˜๋Š” ์ •๊ทœํ™” ํ•จ์ˆ˜ ฮฉ(ft)\Omega(f_t)ฮฉ(ftโ€‹)๋ฅผ ์•„๋ž˜์™€ ๊ฐ™์ด ์ •์˜ํ•œ๋‹ค.

$$\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2 \tag{10}$$

The $\gamma T$ term penalizes the number of leaves, controlling how complex the tree is allowed to become, and $\frac{1}{2}\lambda \sum_{j=1}^T w_j^2$ is the L2 norm of the leaf scores. This function, too, can be optimized.
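
As a tiny numeric sketch (not from the original post) of (10): the penalty needs only the leaf scores, with $\gamma$ charging for each additional leaf and $\lambda$ charging for large scores.

```python
import numpy as np

def omega(leaf_scores, gamma=0.1, lam=1.0):
    """Omega(f_t) = gamma * T + 0.5 * lambda * sum_j w_j^2, as in Eq. (10)."""
    w = np.asarray(leaf_scores, dtype=float)
    return gamma * len(w) + 0.5 * lam * np.sum(w ** 2)

print(omega([0.3, -0.2, 0.5]))   # 0.1*3 + 0.5*1.0*(0.09 + 0.04 + 0.25) = 0.49
```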

์ด์ œ ์ˆ˜์‹ (8)์€ (9)์™€ (10)์„ ๊ทธ๋Œ€๋กœ ์น˜ํ™˜ํ•˜์—ฌ ์•„๋ž˜์™€ ๊ฐ™์ด ํ’€์–ด์“ธ ์ˆ˜ ์žˆ๋‹ค.

$$\begin{aligned}\text{obj}^{(t)} &\approx \sum_{i=1}^n \left[g_i w_{q(x_i)} + \frac{1}{2} h_i w_{q(x_i)}^2\right] + \gamma T + \frac{1}{2}\lambda \sum_{j=1}^T w_j^2 \\ &= \sum^T_{j=1} \left[\left(\sum_{i\in I_j} g_i\right) w_j + \frac{1}{2} \left(\sum_{i\in I_j} h_i + \lambda\right) w_j^2\right] + \gamma T\end{aligned} \tag{11}$$

In the second line, $I_j = \{i \mid q(x_i) = j\}$ is the set of indices of the data points assigned to leaf $j$; since every point in the same leaf receives the same score, the first line can be regrouped by leaf. Looking at the summation index on the sigma, the objective decomposes into a sum of $T$ quadratic expressions in the leaf weights. Now let us carry out the optimization and find the parameters that minimize the objective given $g$ and $h$.

3. How Boosting Works: Optimization

Formulation

Defining $G_j = \sum_{i\in I_j} g_i$ and $H_j = \sum_{i\in I_j} h_i$, equation (11) simplifies to the quadratic form below.

$$\text{obj}^{(t)} = \sum^T_{j=1} \left[G_j w_j + \frac{1}{2} (H_j+\lambda) w_j^2\right] + \gamma T \tag{12}$$

๋‹ค์‹œ ๋งํ•ด, ggg์™€ hhh์— ๋Œ€ํ•ด ์ตœ์†Ÿ์ ์„ ๋งŒ์กฑํ•˜๋Š” ํŠธ๋ฆฌ jjj๋Š” ์•„๋ž˜์™€ ๊ฐ™์ด ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๋‹ค.

$$\begin{aligned}\operatorname{argmin}_{w_j} \left[G_j w_j + \frac{1}{2}(H_j+\lambda) w_j^2\right] &= -\frac{G_j}{H_j+\lambda}, \quad H_j+\lambda > 0 \\ \min_{w_j} \left[G_j w_j + \frac{1}{2}(H_j+\lambda) w_j^2\right] &= -\frac{1}{2}\frac{G_j^2}{H_j+\lambda}\end{aligned} \tag{13}$$

loss function์„ ์ตœ์†Œํ™”ํ•˜๊ธฐ ์œ„ํ•ด wjw_jwjโ€‹์— ๋Œ€ํ•œ loss function์˜ ๋„ํ•จ์ˆ˜๋ฅผ 0์œผ๋กœ ์„ค์ •ํ•˜, 2์ฐจ์‹์— ๋Œ€ํ•œ ์ตœ์ ํ™” ๋ฌธ์ œ๋Š” ์•„๋ž˜ ํ˜•ํƒœ๋กœ ์ตœ์ ํ™”๊ฐ€ ๊ฐ€๋Šฅํ•˜๋‹ค.

$$\begin{aligned}w_j^\ast &= -\frac{G_j}{H_j+\lambda} \\ \text{obj}^\ast &= -\frac{1}{2} \sum_{j=1}^T \frac{G_j^2}{H_j+\lambda} + \gamma T\end{aligned} \tag{14}$$
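
A small numeric sketch (not from the original post) of (14): given per-instance gradients and Hessians and a fixed leaf assignment $q(x_i)$, the sums $G_j$ and $H_j$, the optimal weights $w_j^\ast$, and the structure score follow directly. All numbers below are made up for illustration.

```python
import numpy as np

def structure_score(g, h, leaf_idx, lam=1.0, gamma=0.1):
    """Compute w_j* = -G_j / (H_j + lambda) and obj* of Eq. (14) for a fixed tree structure."""
    g, h = np.asarray(g, dtype=float), np.asarray(h, dtype=float)
    leaf_idx = np.asarray(leaf_idx)
    T = leaf_idx.max() + 1                                 # number of leaves
    G = np.bincount(leaf_idx, weights=g, minlength=T)      # G_j: sum of g_i in leaf j
    H = np.bincount(leaf_idx, weights=h, minlength=T)      # H_j: sum of h_i in leaf j
    w_opt = -G / (H + lam)
    obj = -0.5 * np.sum(G ** 2 / (H + lam)) + gamma * T
    return w_opt, obj

g = [0.4, -0.3, 0.2, -0.8]       # made-up gradients
h = [0.24, 0.21, 0.16, 0.15]     # made-up Hessians
leaf = [0, 0, 1, 1]              # q(x_i): the leaf each data point falls into
print(structure_score(g, h, leaf))
```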

One step still remains, however. Let us look at it below.

Tree Splitting

Decision Tree based algorithms require pruning to prevent overfitting. Ideally one would enumerate every possible tree and pick the best one, but doing so is an NP-hard problem, so XGBoost instead optimizes the tree one level at a time with a level-wise tree growth strategy. (For reference, LightGBM uses a leaf-wise tree growth strategy that produces more complex trees, yet it boasts faster training than XGBoost thanks to optimizations such as GOSS (Gradient-based One-Side Sampling) and EFB (Exclusive Feature Bundling).)

์ข€ ๋” ๊ตฌ์ฒด์ ์œผ๋กœ ์„ค๋ช…ํ•˜๋ฉด, ํŠธ๋ฆฌ๋ฅผ greedyํ•˜๊ฒŒ ํ‚ค์šฐ๋ฉด์„œ ํŠธ๋ฆฌ์˜ ๊ฐ€์ง€๋ฅผ ๋‚˜๋ˆ„๋Š” ์‹œ์ ์—์„œ ์™ผ์ชฝ ๊ฐ€์ง€์™€ ์˜ค๋ฅธ์ชฝ ๊ฐ€์ง€์— ๋Œ€ํ•œ score๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ Information Gain์„ ๊ณ„์‚ฐํ•˜๊ณ  Gain์ด ์Œ์ˆ˜์ผ ๋•Œ์—๋Š” ๊ฐ€์ง€์น˜๊ธฐ๋ฅผ ์ˆ˜ํ–‰ํ•œ๋‹ค. ํŠน์ • depth์—์„œ ๊ฐ€์ง€์น˜๊ธฐ ์ˆ˜ํ–‰ ์‹œ์˜ Information Gain์€ ์•„๋ž˜์™€ ๊ฐ™์ด ํ‘œํ˜„ ๊ฐ€๋Šฅํ•˜๋‹ค.

$$\text{Gain} = \frac{1}{2} \left[\frac{G_L^2}{H_L+\lambda}+\frac{G_R^2}{H_R+\lambda}-\frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma \tag{15}$$

Inside the brackets, the first term is the left leaf score (i.e., the score of the left children), the second is the right leaf score, and the last is the score of the original leaf before the split; the trailing $\gamma$ is the regularization cost of the additional leaf. The branches $L$ and $R$ are disjoint sets with no overlap. If the value inside the brackets is smaller than $\gamma$, the information gain is negative, which means it is better not to split any further.

์ด์— ๋Œ€ํ•œ psuedo ์ฝ”๋“œ๋Š” ์•„๋ž˜์™€ ๊ฐ™๋‹ค.(Exact Greedy Algorithm for Split Finding) ๋‹ค๋งŒ, ์‹ค์ œ ๊ตฌํ˜„์€ missing value๊นŒ์ง€ ๊ณ ๋ คํ•œ Sparsity-aware Split Finding์„ ์‚ฌ์šฉํ•œ๋‹ค.

Finally, combining the high-gain trees built this way and performing boosting over them produces the final model.

References

  • Paper: XGBoost: A Scalable Tree Boosting System, https://arxiv.org/pdf/1603.02754.pdf

  • Official website: https://xgboost.readthedocs.io/en/latest/tutorials/model.html

