MAB Algorithm Benchmarking

0. Overview

Objective

  • Measure the performance (convergence speed and latency) of the basic MAB algorithms

Method

  • Convergence: plot the average reward across experiments and the probability of selecting the optimal arm

  • Latency: measure only the speed of the core MAB algorithm loop, excluding agent initialization

%matplotlib inline
import os
import sys

# Make the parent directory importable so the local bandits package is found
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

import bandits as bd
from bandits import kullback

1. ํ•˜์ดํผํŒŒ๋ผ๋ฉ”ํ„ฐ ์„ค์ •

  • Number of Arms: arm์˜ ๊ฐฏ์ˆ˜, ์ฆ‰ k

  • Number of Trials: Number of Iterations (Time steps)

  • Number of Experiments: Converge ์ฒ™๋„ ์‹คํ—˜; ๊ฐ์ž ๋‹ค๋ฅธ ์„ธํŒ…์˜(default:random) 500๊ฐœ์˜ k arms๋“ค์„ ๊ตฌ๋™์‹œ์ผœ ํ‰๊ท ์„ ์ทจํ•˜๊ธฐ ์œ„ํ•œ ๋ชฉ์ ์ž„
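The three hyperparameters above might be wired together roughly as follows (a minimal numpy sketch; the variable names are illustrative and not part of the `bandits` package API):

```python
import numpy as np

# Illustrative names only -- not the bandits-package API
n_arms = 10          # Number of Arms (k)
n_trials = 1000      # Number of Trials (time steps per experiment)
n_experiments = 500  # Number of Experiments (independent runs to average)

rewards = np.zeros((n_experiments, n_trials))
for e in range(n_experiments):
    # each experiment draws its own random set of k true arm means
    true_means = np.random.normal(0.0, 1.0, n_arms)
    # ... run one algorithm for n_trials steps, filling rewards[e] ...

avg_reward = rewards.mean(axis=0)  # the per-algorithm convergence curve
```

Averaging over many experiments smooths out per-run noise, which is why the convergence plots use 500 runs rather than one.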

2. 𝜖-greedy (epsilon-greedy)

  • Overview: an algorithm that explores randomly with probability 𝜖 and greedily selects the best-looking arm with probability 1 − 𝜖; a larger 𝜖 puts more weight on exploration, a smaller 𝜖 more weight on exploitation

  • Compared algorithms

    • Random: the 𝜖 = 1 case

    • Greedy: the 𝜖 = 0 case

    • EpsilonGreedy (w/ 𝜖=0.01): the 𝜖 = 0.01 case

    • EpsilonGreedy (w/ 𝜖=0.1): the 𝜖 = 0.1 case (a commonly used setting, though it should be tuned per environment)

  • Experimental results (convergence):

    • Random: both the average reward and the probability of selecting the optimal action are very low

    • Greedy: the average reward is higher than Random's, but the probability of selecting the optimal action stops improving after a certain point

    • EpsilonGreedy (w/ 𝜖=0.01): the probability of selecting the optimal action increases gradually, but convergence is slow

    • EpsilonGreedy (w/ 𝜖=0.1): the probability of selecting the optimal action is the best of the four, and it rises the fastest

  • Experimental results (elapsed time):

    • All four policies perform similarly, though Random is slightly slower

3. UCB (Upper Confidence Bound)

  • Overview: sets an Upper Confidence Bound (UCB) on each arm's estimated reward to account for uncertainty

    • As-is: rolling a die many times gives an expected value of 3.5, but after only two rolls showing 1 and 3 the sample mean is 2, far from the true expectation

    • To-be: after only two rolls, assign a confidence interval such as [2, 5.2], and shrink it as the number of rolls grows, reducing the uncertainty

  • Compared algorithms

    • EpsilonGreedy (w/ 𝜖=0.1)

    • EpsilonGreedy (w/ 𝜖=0.1, Initial Value=5): setting the initial value high (optimistic initialization) raises the share of exploration in the early iterations, so the optimal action is selected more often

    • UCB (w/ c): c is a hyperparameter controlling the confidence level; the higher it is, the more exploration

      • Intuition 1: if one particular arm keeps being selected, its UCB term shrinks (since 𝑁𝑡(𝑎) grows), so its confidence interval narrows and it is explored less

      • Intuition 2: the log term grows ever more slowly, acting as a decay that gradually tightens the confidence intervals over time

      • Intuition 3: an arm that is rarely selected keeps a wide confidence interval, so it is explored more

3.1. Gaussian Bandits

  • Experimental results (convergence):

    • EpsilonGreedy (w/ 𝜖=0.1, Initial Value=5): the fastest growth in % Optimal Action in the very early iterations among the three

    • UCB: the average reward and the probability of selecting the optimal action are the highest of the three, but with intermittent performance dips

  • Experimental results (elapsed time):

    • UCB is roughly 1.5× slower than 𝜖-greedy

3.2. Bernoulli Bandits

4. UCB Variants

  • Overview: performance evaluation of various UCB algorithm variants

4.1. Bernoulli Bandits

4.2. Binomial Bandits

5. TS (Thompson Sampling)

  • Overview: models each arm's reward distribution by assuming a Beta prior and a Bernoulli or Binomial likelihood

  • Compared algorithms

    • EpsilonGreedy (w/ ๐œ–=0.1, Initial Value=5)

    • UCB

    • TS
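For Bernoulli bandits, the Beta prior is conjugate to the likelihood, so the posterior update is just counting. A minimal Thompson Sampling sketch (illustrative; the `true_probs` values are made up and this is not the pymc3-based implementation):

```python
import numpy as np

def thompson_bernoulli(true_probs, n_trials=1000, seed=0):
    """Thompson Sampling for Bernoulli bandits with Beta(1, 1) priors."""
    rng = np.random.default_rng(seed)
    k = len(true_probs)
    alpha = np.ones(k)   # 1 + successes per arm
    beta = np.ones(k)    # 1 + failures per arm
    for _ in range(n_trials):
        theta = rng.beta(alpha, beta)        # one posterior sample per arm
        a = int(np.argmax(theta))            # pull the arm whose sample is best
        r = rng.random() < true_probs[a]     # Bernoulli reward
        alpha[a] += r                        # conjugate Beta update:
        beta[a] += 1 - r                     # just count the outcomes
    return alpha, beta

alpha, beta = thompson_bernoulli([0.2, 0.5, 0.8])
print(np.argmax(alpha + beta - 2))  # arm with the most pulls
```

Because arms with uncertain posteriors occasionally produce the largest sample, exploration happens automatically, with no 𝜖 or c hyperparameter to tune.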

5.1. pymc3 ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ

  • ์›๋ณธ ์†Œ์Šค์ฝ”๋“œ๋กœ theano์‚ฌ์šฉ์œผ๋กœ ์ธํ•ด non-GPU ํ™˜๊ฒฝ์—์„œ ์†๋„๊ฐ€ ๋А๋ฆผ

  • ์‹คํ—˜ ๊ฒฐ๊ณผ (Converge Ratio):

    • EpsilonGreedy (w/ ๐œ–=0.1, Initial Value=5): ์•ˆ์ •์ ์œผ๋กœ ์ˆ˜๋ ดํ•˜์ง€๋งŒ ์ผ์ • Iteration ์ด์ƒ์—์„œ ์ˆ˜๋ ด์†๋„๊ฐ€ ๋–จ์–ด์ง

    • UCB: Average Reward์™€ Optimal Action์˜ ์„ ํƒ ํ™•๋ฅ ์ด Iteration์ด ๋ฐ˜๋ณต๋ ์ˆ˜๋ก ๊พธ์ค€ํžˆ ์ฆ๊ฐ€ํ•˜์ง€๋งŒ ์ค‘๊ฐ„์ค‘๊ฐ„ ์„ฑ๋Šฅ์ด ์ €ํ•˜๋˜๋Š” ๊ตฌ๊ฐ„ ๋ฐœ์ƒ

    • TS: 1000๋ฒˆ ์ด๋‚ด์˜ time step์—์„œ๋Š” ๊ฐ€์žฅ ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ์„ ๋ณด์ด๋‚˜ 5000ํšŒ, 10000ํšŒ ๋“ฑ long run์œผ๋กœ ๊ฐˆ์ˆ˜๋ก UCB์— ๋น„ํ•ด ๋ถˆ๋ฆฌํ•  ์ˆ˜ ์žˆ์Œ

  • ์‹คํ—˜ ๊ฒฐ๊ณผ (Elapsed Time):

    • TS์˜ ์•Œ๊ณ ๋ฆฌ์ฆ˜ ๋กœ์ง์€ ๊ฐ„๋‹จํ•˜๋‚˜ pymc3 ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์˜ Beta ํ™•๋ฅ  ๋ถ„ํฌ ๋ชจ๋ธ๋ง์—์„œ ๋กœ๋“œ๊ฐ€ ๋งŽ์ด ๊ฑธ๋ฆผ

5.2. numpy ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ

  • pymc3 ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ํ™œ์šฉํ•˜์ง€ ์•Š๊ณ  numpy๋กœ๋งŒ ๊ตฌํ˜„ํ•œ ์ฝ”๋“œ

  • Non-caching๊ณผ caching์œผ๋กœ ๋‚˜๋ˆ„์–ด์„œ running time ๋น„๊ต (pymc3 ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ๊ธฐ๋ฐ˜ ๊ตฌํ˜„์€ caching ์‚ฌ์šฉ)

    • Non-caching์‹œ, reward๊ฐ’์˜ prior ๋ถ„ํฌ๋Š” ๋งค time step๋งˆ๋‹ค [num_arms x 1] ๋ฒกํ„ฐ๋กœ ๊ณ„์‚ฐ

    • Caching์‹œ, reward๊ฐ’์˜ prior ๋ถ„ํฌ๋Š” [num_trials x num_arms] ํ–‰๋ ฌ๋กœ ๋ฏธ๋ฆฌ ๊ณ„์‚ฐํ•˜์—ฌ ๋งค time step๋งˆ๋‹ค [num_arms x 1] ๋ฒกํ„ฐ ๋ฆฌํ„ด

  • ์‹คํ—˜ ๊ฒฐ๊ณผ, Caching ์‚ฌ์šฉ ์‹œ์˜ running time์ด non-caching ๋Œ€๋น„ ์•ฝ 3๋ฐฐ ์ด์ƒ ๋” ๋น ๋ฆ„
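The two sampling strategies can be sketched as follows, using fixed, made-up Beta parameters for illustration (the actual implementation caches the pre-drawn prior samples as described above):

```python
import time
import numpy as np

n_arms, n_trials = 5, 10_000
alpha = np.full(n_arms, 2.0)   # illustrative fixed Beta parameters
beta = np.full(n_arms, 3.0)
rng = np.random.default_rng(0)

def sample_per_step():
    """Non-caching: one [num_arms]-sized Beta draw at every time step."""
    for _ in range(n_trials):
        theta = rng.beta(alpha, beta)
    return theta

def sample_cached():
    """Caching: pre-draw a [num_trials x num_arms] matrix, slice per step."""
    cache = rng.beta(alpha, beta, size=(n_trials, n_arms))
    for t in range(n_trials):
        theta = cache[t]       # cheap row view instead of a fresh draw
    return theta

for fn in (sample_per_step, sample_cached):
    t0 = time.perf_counter()
    fn()
    print(f"{fn.__name__}: {time.perf_counter() - t0:.3f}s")
```

The speedup comes from replacing `n_trials` small sampling calls with a single vectorized call, which is exactly where numpy's per-call overhead dominates.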
