Transformer is All You Need

1. Background

  • Overcomes the limitations of encoder-decoder seq2seq models with attention alone, without CNNs/RNNs

  • ģžė™ ķšŒź·€(auto-regressive) ėŖØėøė”œ, ķ•œ ė²ˆģ— ķ•œ 부분씩 ģ˜ˆģø”ķ•˜ź³  ź·ø 결과넼 ģ‚¬ģš©ķ•˜ģ—¬ ė‹¤ģŒģ— ģˆ˜ķ–‰ķ•  ģž‘ģ—… ź²°ģ •

  • Advantages

    • ė°ģ“ķ„° ģ „ģ²“ģ˜ ģ‹œź°„ģ /공간적 ꓀계에 ėŒ€ķ•œ 가정 ģ—†ģŒ

    • Parallelizable

    • Robust to long-term dependencies

  • NLPģ˜ 계볓

    • 2001 - Neural language models

    • 2008 - Multi-task learning

    • 2013 - Word embeddings

    • 2014 - Sequence-to-sequence models

    • 2015 - Attention (Attention Mechanism)

    • 2015 - Memory-based networks

    • 2017 - Transformer

    • 2018 - Pre-trained language models (an impact comparable to AlexNet in 2012)

2. Model Architecture

Overview

  • Encoder ėø”ė”ź³¼ Decoder ėø”ė”ģ€ 각 6ź°œģ”© ģ”“ģž¬

  • Every encoder/decoder block has the same structure, but weights are not shared between them

  • Encoder

    • 2ź°œģ˜ sub-layerė“¤ė”œ 구성: Self-Attention → Feed Forward

        Stage1_out = Embedding512 + TokenPositionEncoding512  # token embedding (e.g. w2v result) + positional information
        Stage2_out = layer_normalization(multihead_attention(Stage1_out) + Stage1_out)
        Stage3_out = layer_normalization(FFN(Stage2_out) + Stage2_out)

        out_enc = Stage3_out
    • Accepts inputs up to a maximum sequence length (e.g., 512 tokens); if the input sequence is shorter, the remainder is filled with padding (a sketch of the corresponding padding mask appears with the masking notes in section 4)

  • Decoder

    • 3ź°œģ˜ sub-layerė“¤ė”œ 구성 Self-Attention → Encoder-Decoder Attention → Feed Forward

        Stage1_out = OutputEmbedding512 + TokenPositionEncoding512
        # attention masked so that position i can only attend to positions earlier than i
        Stage2_Mask = masked_multihead_attention(Stage1_out)
        Stage2_Norm1 = layer_normalization(Stage2_Mask + Stage1_out)
        # encoder-decoder attention: queries from the decoder, keys/values from the encoder output
        Stage2_Multi = multihead_attention(q=Stage2_Norm1, k=out_enc, v=out_enc)
        Stage2_Norm2 = layer_normalization(Stage2_Multi + Stage2_Norm1)

        Stage3_FFN = FFN(Stage2_Norm2)
        Stage3_Norm = layer_normalization(Stage3_FFN + Stage2_Norm2)

        out_dec = Stage3_Norm
  • Each sub-layer is wrapped with a residual connection followed by layer normalization (layer normalization computes the mean and variance over the features of each example, independently of other examples)

  • See https://arxiv.org/pdf/1607.06450.pdf (reported to work well for RNNs and Transformers)

y = \text{LayerNorm}(x + \text{sublayer}(x))
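
A minimal TensorFlow sketch of this Add & Norm step, in the style of the code snippets below; the function name add_and_norm and the epsilon value are illustrative assumptions, not from the paper:

import tensorflow as tf

layernorm = tf.keras.layers.LayerNormalization(epsilon=1e-6)

def add_and_norm(x, sublayer_output):
    # y = LayerNorm(x + sublayer(x)): residual connection, then layer normalization
    return layernorm(x + sublayer_output)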

3. Positional Encoding

  • Injects position information; adding sine and cosine values to the embedding vectors gives the model information about each token's relative or absolute position

  • d차원 ź³µź°„ģ—ģ„œ ė¬øģž„ģ˜ ģ˜ėÆøģ™€ ė¬øģž„ģ—ģ„œģ˜ ģœ„ģ¹˜ģ˜ ģœ ģ‚¬ģ„±ģ— ė”°ė¼ ķ† ķ°ģ“ ģ„œė”œ ė” ź°€ź¹Œģ›Œģ§€ź²Œ 함

  • 가변 źøøģ“ ģ‹œķ€€ģŠ¤ģ— ėŒ€ķ•“ģ„œ positional encoding ģ„ ģƒģ„±ķ•  수 ģžˆźø° ė•Œė¬øģ— scalabilityģ—ģ„œ 큰 ģ“ģ ģ„ 가짐 (예넼 들얓, ģ“ėÆø ķ•™ģŠµėœ ėŖØėøģ“ ķ•™ģŠµ ė°ģ“ķ„°ė³“ė‹¤ ė” źø“ ė¬øģž„ģ— ėŒ€ķ•“ģ„œ ė²ˆģ—­ģ„ 핓야 ķ•  ė•Œģ—ė„ positional encoding ģƒģ„± ź°€ėŠ„)

  • PE_{(pos,\ 2i)} = \sin(pos/10000^{2i/d_{model}}) \\ PE_{(pos,\ 2i+1)} = \cos(pos/10000^{2i/d_{model}})
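
A short sketch of how these sinusoids can be generated, in the same NumPy/TensorFlow style as the code snippets below; the helper name positional_encoding is an assumption:

import numpy as np
import tensorflow as tf

def positional_encoding(max_position, d_model):
    # angle term pos / 10000^(2i / d_model), with i paired over (sin, cos) dimensions
    pos = np.arange(max_position)[:, np.newaxis]        # (max_position, 1)
    i = np.arange(d_model)[np.newaxis, :]               # (1, d_model)
    angle_rads = pos / np.power(10000, (2 * (i // 2)) / np.float32(d_model))

    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])   # even indices: sine
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])   # odd indices: cosine

    return tf.cast(angle_rads[np.newaxis, ...], tf.float32)  # (1, max_position, d_model)

# usage (hypothetical names): x = token_embeddings + positional_encoding(512, 512)[:, :seq_len, :]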

4. Scaled Dot-product Attention

Self-Attention

ģžźø° ģžģ‹ (Query)ģ„ ģž˜ ķ‘œķ˜„ķ•  수 ģžˆėŠ” (key, value) pair넼 ģ°¾ėŠ” 것

  • Conventional seq2seq

    • Q = Query : the hidden state of the decoder cell at time step t-1

      K = Keys : the hidden states of the encoder cells at all time steps

      V = Values : the hidden states of the encoder cells at all time steps

  • Self-attention

    • Q : all token vectors of the input sentence

      K : all token vectors of the input sentence

      V : all token vectors of the input sentence

  • Q, K, V are computed from each token vector (embedding + positional encoding); the weight matrices Wq, Wk, Wv are model parameters

  • Because multi-head attention is applied, the dimension of the Q, K, V vectors is the input vector dimension / number of heads (512/8 = 64 in the paper)

Scaled Dot-product Attention

  • Each Q vector computes an attention score against every K vector, forming an attention distribution, which is then used to take a weighted sum of all V vectors, yielding the attention value or context vector → repeated for every Q vector

    • The dot product of a Q vector with the K vectors finds the key vectors most similar to that query, and softmax converts the scores into probabilities between 0 and 1.

      \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

  • d_kģ˜ 제곱근으딜 ė‚˜ėˆ„ėŠ” ģ“ģœ ėŠ” 쿼리-키 낓적 ķ–‰ė ¬ģ˜ ģ°Øģ›ģ“ ėŠ˜ģ–“ė‚  ģˆ˜ė” softmax ķ•Øģˆ˜ģ—ģ„œ ģž‘ģ€ 기울기(gradient)넼 가지기 ė•Œė¬øģ— ģ“ė„¼ 완화핓 주기 ģœ„ķ•Ø.

  • The weighted sum produces a context vector for each token; in this process, relevant tokens receive higher weight and irrelevant tokens receive lower weight.

Decoder

  • Since the decoder must not reference future information, only positions before the current one may be attended to

  • ģ“ė„¼ ģœ„ķ•“ ķ˜„ģž¬ time-step ģ“ķ›„ģ˜ ģœ„ģ¹˜ė“¤ģ— ėŒ€ķ•“ģ„œ masking 처리 (softmax 적용 전에 -inf딜 변경; ģ‹¤ģ œ źµ¬ķ˜„ģ€ -1e^9; ģ™œėƒķ•˜ė©“ ģ†Œķ”„ķŠøė§„ģŠ¤ ķ•Øģˆ˜ģ—ģ„œ 큰 ģŒģˆ˜ ģž…ė „ģ€ 0에 ź°€ź¹źø° ė•Œė¬ø)

Code snippet

def scaled_dot_product_attention(q, k, v, mask):
    """Calculate the attention weights.
    q, k, v must have matching leading dimensions.
    k, v must have matching penultimate dimension, i.e.: seq_len_k = seq_len_v.
    The mask has different shapes depending on its type(padding or look ahead) 
    but it must be broadcastable for addition.

    Args:
        q: query shape == (..., seq_len_q, depth)
        k: key shape == (..., seq_len_k, depth)
        v: value shape == (..., seq_len_v, depth_v)
    mask: Float tensor with shape broadcastable 
          to (..., seq_len_q, seq_len_k). Defaults to None.

    Returns:
    output, attention_weights
    """

    matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)

    # scale matmul_qk
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    # add the mask to the scaled tensor.
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)  

    # softmax is normalized on the last axis (seq_len_k) so that the scores
    # add up to 1.
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)

    output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)

    return output, attention_weights
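
A quick shape sanity check for the function above, using random tensors (illustrative only):

temp_q = tf.random.uniform((1, 3, 64))   # (batch, seq_len_q, depth)
temp_k = tf.random.uniform((1, 4, 64))   # (batch, seq_len_k, depth)
temp_v = tf.random.uniform((1, 4, 64))   # (batch, seq_len_v, depth_v)

out, attn = scaled_dot_product_attention(temp_q, temp_k, temp_v, mask=None)
print(out.shape)   # (1, 3, 64)
print(attn.shape)  # (1, 3, 4)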

5. Multi-Head Attention

  • ėŖØėøģ“ 다넸 ģœ„ģ¹˜ģ— ģ§‘ģ¤‘ķ•˜ėŠ” 늄렄 ķ™•ģž„

  • Gives the attention layer multiple "representation spaces" (an ensemble-like effect)

    • ģ„ ķ˜• ė³€ķ™˜ėœ Q, K, V넼 h개딜 분리 → ģœ„ģ¹˜ź°€ ģƒģ“ķ•œ 각기 다넸 ķ‘œķ˜„ 부분공간(representation subspaces) ėø”ė”ė“¤ģ“ ź³µė™ģœ¼ė”œ(jointly) 정볓넼 ģ–»ź²Œ 됨

    • h ź°œģ˜ Attention Matrixź°€ ģƒźø°ė©“ģ„œ Q와 Kź°„ģ˜ ķ† ķ°ė“¤ģ„ ė” ė‹¤ģ–‘ķ•œ ź“€ģ ģœ¼ė”œ ė³¼ 수 ģžˆģŒ.

  • Scaled dot-product attention is computed h times and the results are concatenated (h = 8 in the paper)

  • hź°œģ˜ ķ–‰ė ¬ģ„ ė°”ė”œ Feed Forward Layer딜 볓낼 수 없기 ė•Œė¬øģ— ė˜ė‹¤ė„ø ź°€ģ¤‘ģ¹˜ 행렬 Wo넼 곱함

\text{head}_i = \text{Attention}(QW^Q_i, KW^K_i, VW^V_i), \quad i = 1, \dots, h \\ \text{where the projections are parameter matrices } W^Q_i, W^K_i \in \mathbb{R}^{d_{model} \times d_k},\ W^V_i \in \mathbb{R}^{d_{model} \times d_v} \text{ for } d_k = d_v = d_{model}/h = 64
\text{MultiHead}(Q, K, V) = [\text{head}_1; \dots; \text{head}_h]W^O \quad \text{where } W^O \in \mathbb{R}^{hd_v \times d_{model}}

Code snippet

class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads # 8
        self.d_model = d_model # 512

        assert d_model % self.num_heads == 0

        self.depth = d_model // self.num_heads

        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)

        self.dense = tf.keras.layers.Dense(d_model)

    def split_heads(self, x, batch_size):
        """Split the last dimension into (num_heads, depth).
        Transpose the result such that the shape is (batch_size, num_heads, seq_len, depth)
        """
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, v, k, q, mask):
        batch_size = tf.shape(q)[0]

        q = self.wq(q)  # (batch_size, seq_len, d_model)
        k = self.wk(k)  # (batch_size, seq_len, d_model)
        v = self.wv(v)  # (batch_size, seq_len, d_model)

        q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
        k = self.split_heads(k, batch_size)  # (batch_size, num_heads, seq_len_k, depth)
        v = self.split_heads(v, batch_size)  # (batch_size, num_heads, seq_len_v, depth)

        # scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth)
        # attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)
        scaled_attention, attention_weights = scaled_dot_product_attention(
            q, k, v, mask)

        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])  # (batch_size, seq_len_q, num_heads, depth)

        concat_attention = tf.reshape(scaled_attention, 
                                      (batch_size, -1, self.d_model))  # (batch_size, seq_len_q, d_model)

        output = self.dense(concat_attention)  # (batch_size, seq_len_q, d_model)

        return output, attention_weights
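
A minimal usage sketch with the dimensions from the paper (d_model = 512, h = 8); self-attention is simulated by passing the same random tensor as q, k, and v:

mha = MultiHeadAttention(d_model=512, num_heads=8)
x = tf.random.uniform((1, 60, 512))      # (batch_size, seq_len, d_model)

out, attn = mha(x, k=x, q=x, mask=None)  # self-attention: Q = K = V = x
print(out.shape)   # (1, 60, 512)
print(attn.shape)  # (1, 8, 60, 60)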

6. Position-wise Feed Forward Network

  • The FFN (Feed Forward Network) is applied at every position, i.e., to each individual token vector

  • 두 ź°œģ˜ ģ„ ķ˜• ė³€ķ™˜(linear transformation); x에 ģ„ ķ˜• ė³€ķ™˜ 적용 후, ReLU(max(0,z))넼 거쳐 ė‹¤ģ‹œ ķ•œė²ˆ ģ„ ķ˜• ė³€ķ™˜ 적용;FFN(x)=max(0,xW1+b1)W2+b2FFN(x)=max(0, xW_1+b_1)W_2+b_2

  • ģ“ė•Œ ź°ź°ģ˜ positionė§ˆė‹¤ ź°™ģ€ parameter W,b넼 ģ‚¬ģš©ķ•˜ģ§€ė§Œ, layerź°€ ė‹¬ė¼ģ§€ė©“ 다넸 parameter ģ‚¬ģš©

  • Equivalent to performing two convolutions with kernel size 1, with the layer dimension as channels (used in the initial implementation)

    While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1.

  • Increasing the kernel size to 3 is said to capture context information better, but the current official Google BERT implementation does not use convolution operations

  • The input/output dimension is d_model = 512, and the hidden layer dimension is d_ff = 2048
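
A minimal sketch of this position-wise FFN in the same TensorFlow style as the earlier snippets; the helper name point_wise_feed_forward_network is an assumption:

import tensorflow as tf

def point_wise_feed_forward_network(d_model, d_ff):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied identically at every position
    return tf.keras.Sequential([
        tf.keras.layers.Dense(d_ff, activation='relu'),  # (batch_size, seq_len, d_ff)
        tf.keras.layers.Dense(d_model)                   # (batch_size, seq_len, d_model)
    ])

sample_ffn = point_wise_feed_forward_network(512, 2048)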

7. Other Techniques

Label Smoothing Regularization

  • ģ›ėž˜ ėŖ©ķ‘œģø ķƒ€ź²Ÿ ė¼ė²Øģ„ ė§žģ¶”ėŠ” 목적과 정답 ė¼ė²Øģ˜ ė”œģ§“(logit) ź°’ģ“ ķ•™ģŠµ ź³¼ģ •ģ—ģ„œ ź³¼ė„ķ•˜ź²Œ 다넸 logit 값볓다 ģ»¤ģ§€ėŠ” ķ˜„ģƒ ė°©ģ§€

  • See https://arxiv.org/abs/1512.00567
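
A hedged sketch of label smoothing with Keras; epsilon = 0.1 is the value reported in the paper, and the loss-object setup is illustrative:

import tensorflow as tf

# labels must be one-hot (not sparse) for label_smoothing to apply
loss_object = tf.keras.losses.CategoricalCrossentropy(
    from_logits=True, label_smoothing=0.1)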

Optimizer

  • Adam optimizer with a warmup schedule (the learning rate rises sharply during the initial steps and, once the warmup steps are reached, decays slowly afterwards)
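
A sketch of this warmup schedule following the paper's formula lrate = d_model^{-0.5} * min(step^{-0.5}, step * warmup_steps^{-1.5}) and its Adam settings (beta_1 = 0.9, beta_2 = 0.98, epsilon = 1e-9, warmup_steps = 4000); the class name CustomSchedule is an assumption:

import tensorflow as tf

class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model, warmup_steps=4000):
        super(CustomSchedule, self).__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        # lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
        step = tf.cast(step, tf.float32)
        arg1 = tf.math.rsqrt(step)
        arg2 = step * (self.warmup_steps ** -1.5)
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

learning_rate = CustomSchedule(512)
optimizer = tf.keras.optimizers.Adam(learning_rate, beta_1=0.9, beta_2=0.98, epsilon=1e-9)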
