Transformer is All You Need
1. Background
Overcomes the limitations of encoder-decoder (seq2seq) models using attention alone, without CNNs or RNNs
An auto-regressive model: it predicts one piece at a time and uses that result to decide what to generate next
Advantages
No assumptions about temporal/spatial relationships across the data
Allows parallel processing
Robust to long-term dependencies
A brief history of NLP
2001 - Neural language models
2008 - Multi-task learning
2013 - Word embeddings
2014 - Sequence-to-sequence models
2015 - Attention mechanism
2015 - Memory-based networks
2017 - Transformer
2018 - Pre-trained language models (an impact comparable to AlexNet in 2012)
2. Model Architecture
Overview
The encoder and decoder stacks each consist of 6 blocks
Every encoder/decoder block has the same structure, but the blocks do not share weights
Encoder
Composed of 2 sub-layers: Self-Attention → Feed Forward
Stage1_out = Embedding512 + TokenPositionEncoding512  # w2v result + position information
Stage2_out = layer_normalization(multihead_attention(Stage1_out) + Stage1_out)
Stage3_out = layer_normalization(FFN(Stage2_out) + Stage2_out)
out_enc = Stage3_out
Accepts input up to the maximum sequence length (e.g., 512 tokens); when the input sequence is shorter, the remainder is filled with padding (a padding-mask sketch follows below)
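A minimal sketch of how such padded positions can be excluded from attention, following the convention used in the TensorFlow tutorial code later in this note (1.0 marks a padded token; the mask is later added to the attention logits as a large negative number):
import tensorflow as tf

def create_padding_mask(seq):
    # 1.0 where the token id is 0 (padding); these positions later receive
    # a -1e9 bias before softmax, so they get ~zero attention weight
    mask = tf.cast(tf.math.equal(seq, 0), tf.float32)
    # extra axes so the mask broadcasts over (num_heads, seq_len_q)
    return mask[:, tf.newaxis, tf.newaxis, :]  # (batch_size, 1, 1, seq_len_k)

print(create_padding_mask(tf.constant([[7, 6, 0, 0, 1]])))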
Decoder
Composed of 3 sub-layers: Self-Attention → Encoder-Decoder Attention → Feed Forward
Stage1_out = OutputEmbedding512 + TokenPositionEncoding512
# masked attention: position i may only attend to positions earlier than i
Stage2_Mask = masked_multihead_attention(Stage1_out)
Stage2_Norm1 = layer_normalization(Stage2_Mask + Stage1_out)
# encoder-decoder attention: queries from the decoder, keys/values from out_enc
Stage2_Multi = multihead_attention(Stage2_Norm1, out_enc)
Stage2_Norm2 = layer_normalization(Stage2_Multi + Stage2_Norm1)
Stage3_FFN = FFN(Stage2_Norm2)
Stage3_Norm = layer_normalization(Stage3_FFN + Stage2_Norm2)
out_dec = Stage3_Norm
Each sub-layer is wrapped in a residual connection followed by layer normalization (layer normalization computes the mean and variance over the features of each example, independently of the other examples; sketched below)
See https://arxiv.org/pdf/1607.06450.pdf (known to work well for RNNs and Transformers)
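A minimal sketch of this Add & Norm step, assuming a Keras LayerNormalization layer; the helper name add_and_norm is only for illustration, not from the paper:
import tensorflow as tf

layer_norm = tf.keras.layers.LayerNormalization(epsilon=1e-6)

def add_and_norm(x, sublayer_out):
    # residual connection, then layer normalization over the last (feature) axis
    return layer_norm(x + sublayer_out)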
3. Positional Encoding
Injects position information: sine and cosine function values are added to the embedding vectors, giving each token relative/absolute position information
In the d-dimensional space, tokens end up closer together the more similar their meaning and their position in the sentence are
Because positional encodings can be generated for sequences of arbitrary length, this is a big win for scalability (e.g., an already-trained model can still produce positional encodings when asked to translate sentences longer than any seen during training)
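The paper defines PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A sketch of this computation, closely following the TensorFlow tutorial helper:
import numpy as np
import tensorflow as tf

def positional_encoding(max_position, d_model):
    pos = np.arange(max_position)[:, np.newaxis]               # (max_position, 1)
    i = np.arange(d_model)[np.newaxis, :]                      # (1, d_model)
    angle_rads = pos / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])          # even indices: sine
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])          # odd indices: cosine
    return tf.cast(angle_rads[np.newaxis, ...], tf.float32)    # (1, max_position, d_model)

print(positional_encoding(50, 512).shape)                      # (1, 50, 512)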
4. Scaled Dot-product Attention
Self-Attention
Finding the (key, value) pairs that best represent the query (the token itself)
Conventional seq2seq with attention
Q = Query: the hidden state of the decoder cell at time step t-1
K = Keys: the hidden states of the encoder cell at all time steps
V = Values: the hidden states of the encoder cell at all time steps
Self-attention
Q: all token vectors of the input sentence
K: all token vectors of the input sentence
V: all token vectors of the input sentence
Q, K, V are computed from each token vector (embedding + positional encoding); the weight matrices Wq, Wk, Wv are model parameters
Because multi-head attention is applied, the Q, K, V vector dimension is the input vector dimension divided by the number of heads (512/8 = 64 in the paper)
Scaled Dot-product Attention
For each Q vector, compute attention scores against every K vector, turn them into an attention distribution, and use it to take a weighted sum of all V vectors, yielding the attention value (context vector) → repeat this for every Q vector
The dot product of a Q vector with the K vectors finds the key vectors most similar to that query, and softmax turns the scores into probabilities between 0 and 1
The scores are divided by the square root of d_k because, as the query-key dot products grow with dimension, the softmax is pushed into regions with very small gradients; scaling mitigates this
The weighted sum gives a context vector for each token vector; in this process, relevant tokens receive higher weight and irrelevant tokens receive lower weight
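Putting the steps above together, the attention function from the paper is:
Attention(Q, K, V) = softmax(QKįµ / ād_k) V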
Decoder
Since the decoder must not look at future information, only the current and earlier positions may be used
To enforce this, positions after the current time step are masked (set to -inf before applying softmax; -1e9 in actual implementations, because large negative inputs to softmax come out close to 0; a usage sketch with such a look-ahead mask follows the code snippet below)
Code snippet
import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask):
    """Calculate the attention weights.
    q, k, v must have matching leading dimensions.
    k, v must have matching penultimate dimension, i.e.: seq_len_k = seq_len_v.
    The mask has different shapes depending on its type (padding or look ahead)
    but it must be broadcastable for addition.
    Args:
        q: query shape == (..., seq_len_q, depth)
        k: key shape == (..., seq_len_k, depth)
        v: value shape == (..., seq_len_v, depth_v)
        mask: Float tensor with shape broadcastable
              to (..., seq_len_q, seq_len_k). Defaults to None.
    Returns:
        output, attention_weights
    """
    matmul_qk = tf.matmul(q, k, transpose_b=True)  # (..., seq_len_q, seq_len_k)

    # scale matmul_qk
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)

    # add the mask to the scaled tensor.
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)

    # softmax is normalized on the last axis (seq_len_k) so that the scores
    # add up to 1.
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # (..., seq_len_q, seq_len_k)

    output = tf.matmul(attention_weights, v)  # (..., seq_len_q, depth_v)

    return output, attention_weights
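A small usage sketch of the function above with a look-ahead mask built in the TensorFlow tutorial style (1 marks positions that must be hidden); the toy shapes are only for illustration:
def create_look_ahead_mask(size):
    # strictly upper-triangular 1s: position i cannot attend to positions j > i
    return 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)  # (size, size)

# toy shapes: batch of 1, 4 positions, depth 8
q = tf.random.uniform((1, 4, 8))
k = tf.random.uniform((1, 4, 8))
v = tf.random.uniform((1, 4, 8))
out, weights = scaled_dot_product_attention(q, k, v, create_look_ahead_mask(4))
print(out.shape, weights.shape)  # (1, 4, 8) (1, 4, 4)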
5. Multi-Head Attention
Expands the model's ability to focus on different positions
Gives the attention layer multiple "representation spaces" (an ensemble effect)
The linearly projected Q, K, V are split into h heads → the model jointly attends to information from different representation subspaces at different positions
With h attention matrices, the tokens in Q and K can be viewed from more diverse perspectives
Scaled dot-product attention is computed h times and the results are concatenated (h = 8 in the paper)
Because the h concatenated matrices cannot be fed directly into the feed-forward layer, they are multiplied by one more weight matrix Wo
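In the paper's notation, this whole procedure is:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)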
Code snippet
class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads  # 8
        self.d_model = d_model      # 512

        assert d_model % self.num_heads == 0
        self.depth = d_model // self.num_heads

        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)
        self.dense = tf.keras.layers.Dense(d_model)

    def split_heads(self, x, batch_size):
        """Split the last dimension into (num_heads, depth).
        Transpose the result such that the shape is (batch_size, num_heads, seq_len, depth)
        """
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, v, k, q, mask):
        batch_size = tf.shape(q)[0]

        q = self.wq(q)  # (batch_size, seq_len, d_model)
        k = self.wk(k)  # (batch_size, seq_len, d_model)
        v = self.wv(v)  # (batch_size, seq_len, d_model)

        q = self.split_heads(q, batch_size)  # (batch_size, num_heads, seq_len_q, depth)
        k = self.split_heads(k, batch_size)  # (batch_size, num_heads, seq_len_k, depth)
        v = self.split_heads(v, batch_size)  # (batch_size, num_heads, seq_len_v, depth)

        # scaled_attention.shape == (batch_size, num_heads, seq_len_q, depth)
        # attention_weights.shape == (batch_size, num_heads, seq_len_q, seq_len_k)
        scaled_attention, attention_weights = scaled_dot_product_attention(
            q, k, v, mask)

        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])  # (batch_size, seq_len_q, num_heads, depth)
        concat_attention = tf.reshape(scaled_attention,
                                      (batch_size, -1, self.d_model))  # (batch_size, seq_len_q, d_model)

        output = self.dense(concat_attention)  # (batch_size, seq_len_q, d_model)

        return output, attention_weights
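A quick shape check for the layer above (a usage sketch with the dimensions from the paper):
temp_mha = MultiHeadAttention(d_model=512, num_heads=8)
y = tf.random.uniform((1, 60, 512))  # (batch_size, seq_len, d_model)
out, attn = temp_mha(y, k=y, q=y, mask=None)
print(out.shape, attn.shape)         # (1, 60, 512) (1, 8, 60, 60)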
6. Position-wise Feed Forward Network
The FFN (Feed Forward Network) is applied per position, i.e., to each individual word vector
Two linear transformations: a linear transformation is applied to x, passed through ReLU (max(0, z)), then a second linear transformation is applied
Each position uses the same parameters W, b, but different layers use different parameters
Equivalent to performing two convolutions with kernel size 1 over the channels (used in early implementations)
While the linear transformations are the same across different positions, they use different parameters from layer to layer. Another way of describing this is as two convolutions with kernel size 1.
Increasing the kernel size to 3 reportedly captures context better, but the current official Google BERT implementation does not use convolutions
The input/output dimension is d_model = 512 and the hidden layer dimension is d_ff = 2048 (see the sketch below)
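A sketch of this network with the paper's dimensions, mirroring the TensorFlow tutorial's point_wise_feed_forward_network helper (the d_ff parameter name follows the paper):
import tensorflow as tf

def point_wise_feed_forward_network(d_model, d_ff):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied independently at every position
    return tf.keras.Sequential([
        tf.keras.layers.Dense(d_ff, activation='relu'),  # (batch_size, seq_len, d_ff)
        tf.keras.layers.Dense(d_model)                   # (batch_size, seq_len, d_model)
    ])

ffn = point_wise_feed_forward_network(d_model=512, d_ff=2048)
print(ffn(tf.random.uniform((1, 60, 512))).shape)        # (1, 60, 512)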
7. Other Techniques
Label Smoothing Regularization
Prevents the logit of the correct label from growing excessively larger than the other logits during training, beyond what is needed to predict the target label (sketched below)
See https://arxiv.org/abs/1512.00567
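A minimal sketch of label smoothing with the value used in the paper (epsilon = 0.1), here via Keras' built-in label_smoothing argument rather than the paper's own implementation:
import tensorflow as tf

# the one-hot target is mixed with a uniform distribution:
# y_smooth = (1 - 0.1) * y_onehot + 0.1 / num_classes
loss_fn = tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.1)
y_true = tf.one_hot([2], depth=5)
y_pred = tf.constant([[0.05, 0.05, 0.80, 0.05, 0.05]])
print(loss_fn(y_true, y_pred).numpy())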
Optimizer
Adam optimizer combined with a warmup schedule (the learning rate rises steeply during the initial steps and then decays slowly once the warmup step count is reached)
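A sketch of this schedule, following the formula in the paper, lrate = d_model^(-0.5) * min(step^(-0.5), step * warmup_steps^(-1.5)), and the TensorFlow tutorial's CustomSchedule class:
import tensorflow as tf

class CustomSchedule(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, d_model, warmup_steps=4000):
        super().__init__()
        self.d_model = tf.cast(d_model, tf.float32)
        self.warmup_steps = warmup_steps

    def __call__(self, step):
        step = tf.cast(step, tf.float32)
        arg1 = tf.math.rsqrt(step)                 # decay phase: step^-0.5
        arg2 = step * (self.warmup_steps ** -1.5)  # warmup phase: linear increase
        return tf.math.rsqrt(self.d_model) * tf.math.minimum(arg1, arg2)

# Adam with the hyperparameters reported in the paper
optimizer = tf.keras.optimizers.Adam(CustomSchedule(d_model=512),
                                     beta_1=0.9, beta_2=0.98, epsilon=1e-9)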
References
Implementation
TensorFlow example version