OpenAI GPT-2

1. Background

  • ๋ชจ๋ธ์ด ํ•™์Šตํ•œ ๋ฐ์ดํ„ฐ์—๋งŒ ์ž˜ ์ž‘๋™ํ•˜๋Š” narrow expert๋ณด๋‹ค generalist๋ฅผ ์›ํ•จ โ†’ ์ตœ๊ทผ ๋ณด๋‹ค ๋„“์€ ๋ฒ”์œ„์˜ dataset๊ณผ ์—ฌ๋Ÿฌ ๊ณผ์ œ๋“ค์— ๋Œ€ํ•œ GLUE benchmark ๋“ฑ์ด ์ œ์•ˆ๋˜๊ธฐ ์‹œ์ž‘

  • ๊ธฐ์กด์˜ Language Model(LM)์€ ํŠน์ • ๋„๋ฉ”์ธ์— ์น˜์šฐ์นœ ํ…์ŠคํŠธ๋งŒ ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ํ™œ์šฉ (์˜ˆ: BERT๋Š” BookCorpous(800M words) + ์œ„ํ‚คํ”ผ๋””์•„(2500M words))

  • Common Crawl๋„ ๊ณ ๋ คํ•ด ๋ดค์ง€๋งŒ, ๋ฐ์ดํ„ฐ ํ€„๋ฆฌํ‹ฐ ์ด์Šˆ๊ฐ€ ์žˆ๊ณ  ๋ฐ์ดํ„ฐ๋ฅผ ์ดํ•ดํ•  ์ˆ˜ ์—†์–ด์„œ ๋ถ€์ ํ•ฉ

  • ๋”ฐ๋ผ์„œ, ๋ณธ ๋…ผ๋ฌธ์—์„œ ์›น์Šคํฌ๋ž˜ํ•‘์œผ๋กœ WebText ๋ฐ์ดํ„ฐ์…‹์„ ์ƒˆ๋กœ ์ƒ์„ฑ

    • Reddit์—์„œ ์™ธ๋ถ€(outbound) ๋งํฌ ์ค‘์— karma(ํŽ˜์ด์Šค๋ถ์˜ like์™€ ์œ ์‚ฌ)๋ž€ ๊ฒŒ ์žˆ๊ณ  karma 3๊ฐœ ์ด์ƒ ๋ฐ›์€ ๊ธ€๋“ค๋งŒ ์‚ฌ์šฉ

    • Text subset์œผ๋กœ 4500๋งŒ๊ฐœ ๋งํฌ๊ฐ€ ์žˆ๋Š”๋ฐ ์—ฌ๊ธฐ์—์„œ html ํŒŒ์‹ฑ, ์œ„ํ‚คํ”ผ๋””์•„ ๋ฌธ์„œ ์ œ๊ฑฐ, ์ค‘๋ณต ์ฒ˜๋ฆฌ ๋“ฑ์˜ ์ „์ฒ˜๋ฆฌ ํ›„ 8๋ฐฑ๋งŒ ๊ฐœ ๋ฌธ์„œ, 40GB corpus๋กœ ๊ฐ„์†Œํ™”

  • ๋˜ํ•œ, BERT์™€ ๋‹ฌ๋ฆฌ pre-training+fine-tuning์˜ ์กฐํ•ฉ์ด ์•„๋‹ˆ๋ผ ํ•™์Šต ์™„๋ฃŒ ํ›„ ๋” ์ด์ƒ์˜ task-specificํ•œ ๋ฐ์ดํ„ฐ๋ฅผ fine-tuningํ•˜์ง€ ์•Š์Œ (๋ฌผ๋ก  fine-tuning๋„ ๊ฐ€๋Šฅํ•˜๋ฉฐ BERT์™€ ํฐ ์„ฑ๋Šฅ ์ฐจ์ด๋Š” ์—†์Œ)

2. Model

๊ฐœ์š”

  • Transformer ๋””์ฝ”๋”๋งŒ ์‚ฌ์šฉ

    • BERT์˜ ์…€ํ”„ ์–ดํ…์…˜์ด ์•„๋‹Œ Masked ์…€ํ”„ ์–ดํ…์…˜ ์‚ฌ์šฉ

  • ์ด 4๊ฐœ์˜ ๋ชจ๋ธ ์ œ์‹œ (GPT-2 small, GPT-2 medium, GPT-2 large, GPT-2 extra large)

    • GPT-2 small์€ GPT-1๊ณผ ํŒŒ๋ผ๋ฉ”ํ„ฐ ๊ฐœ์ˆ˜ ๋™์ผ

    • GPT-2 medium์€ BERT์™€ ํŒŒ๋ผ๋ฉ”ํ„ฐ ๊ฐœ์ˆ˜ ๋™์ผ

    • ์•ฝ 15์–ต๊ฐœ์˜ ํŒŒ๋ผ๋ฉ”ํ„ฐ ๊ฐœ์ˆ˜๋กœ GPT ๋Œ€๋น„ 10๋ฐฐ ์ด์ƒ ๋งŽ์Œ

  • ์ฃผ์š” ํŒŒ๋ผ๋ฉ”ํ„ฐ ๋ณ€๊ฒฝ

    • ์–ดํœ˜ ๊ฐœ์ˆ˜: 50,527๊ฐœ

    • Context size: 512 โ†’ 1024 ํ† ํฐ

    • Batch size: 512

  • ๊ฐ residual ๊ณ„์ธต์˜ ๊ฐ€์ค‘์น˜ ์ดˆ๊ธฐ๊ฐ’์— 1/sqrt(N)์œผ๋กœ ์Šค์ผ€์ผ๋ง (N์€ residual ๊ณ„์ธต๋“ค์˜ ๊ฐœ์ˆ˜)

  • Layer Normalization์ด ์›๋ž˜๋Š” attention ๋‹ค์Œ์ด์—ˆ๋Š”๋ฐ ๊ฐ sub-block์˜ input์œผ๋กœ ์˜ฎ๊ฒจ์ง

  • ๋งˆ์ง€๋ง‰ ์…€ํ”„ ์–ดํ…์…˜ ๋ธ”๋ก์— ์ถ”๊ฐ€ layer normalization ์ ์šฉ

  • ๊ฐ ๋ชจ๋ธ์˜ learning rate๋Š” WebText์˜ 5%๋ฅผ ๋–ผ์„œ ๋งŒ๋“  held-out ์ƒ˜ํ”Œ์„ ์‚ฌ์šฉํ•˜์—ฌ ์ˆ˜๋™ ์กฐ์ • โ†’ ์—ฌ์ „ํžˆ WebText์— ๊ณผ์†Œ์ ํ•ฉ(underfitted)๋˜์—ˆ๊ธฐ์— ๋” ์˜ค๋ž˜ ํ•™์Šต์‹œํ‚ค๋ฉด ๋” ๋†’์€ ์„ฑ๋Šฅ์„ ์–ป์„ ์ˆ˜ ์žˆ์„ ๊ฑฐ๋ผ ๊ธฐ๋Œ€

A Deeper Look Inside

  • ์‹œ์ž‘ ํ† ํฐ์œผ๋กœ <|endoftext|> ์‚ฌ์šฉ. ์ด์ œ๋ถ€ํ„ด ํŽธ์˜์ƒ <s>๋ผ๊ณ  ์นญํ•จ

  • ํ† ํฐ ์ž„๋ฒ ๋”ฉ ํ–‰๋ ฌ์—์„œ ํ•ด๋‹น vocab ๊ฒ€์ƒ‰ ํ›„ Positional encoding ๊ฒฐ๊ณผ ๊ฐ€์‚ฐ

  • Decoder๋“ค์„ ๊ฑฐ์ณ ๋‚˜์˜จ ์ถœ๋ ฅ ๋ฒกํ„ฐ์— ํ† ํฐ ์ž„๋ฒ ๋”ฉ ํ–‰๋ ฌ์„ ๊ณฑํ•ด ์ถœ๋ ฅ ํ† ํฐ์˜ logit(ํ™•๋ฅ ) ๊ณ„์‚ฐ

  • ์ž…๋ ฅ ํ† ํฐ์ด Decoder ๋ ˆ์ด์–ด๋ฅผ ํ†ตํ•ด ์—ฐ์†์ ์œผ๋กœ ์ฒ˜๋ฆฌ๋œ ๋‹ค์Œ ์ตœ์ข… ๋ฒกํ„ฐ(vocab ์ด ๊ฐœ์ˆ˜)๊ฐ€ ์ƒ์„ฑ๋จ. ์ตœ์ข… ๋ฒกํ„ฐ๋Š” top_1์ด๋‚˜ top_k๋ฅผ ํ†ตํ•ด ๊ฐ€์žฅ ํ™•๋ฅ ์ด ๋†’์€ ๋‹จ์–ด๋ฅผ ๋‹ค์Œ ๋‹จ์–ด๋กœ ์„ ํƒ

    • top_1: vocab ์ค‘์—์„œ ๊ฐ€์žฅ ํ™•๋ฅ ์ด ๋†’์€ vocab๋ฅผ ์„ ํƒ (top_k = 1)

    • top_k: ์ƒ์œ„ k๊ฐœ์˜ vocab๋ฅผ ์„ ํƒ ํ›„ ์ƒ˜ํ”Œ๋ง
