Attention Is All You Need: Close-Reading Notes

Paper Information

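Authors: Vaswani et al., 2017
Key contribution: proposes the Transformer, an architecture based entirely on attention mechanisms that dispenses with recurrence (RNNs) and convolutions (CNNs).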

1. Scaled Dot-Product Attention

Given queries $Q$, keys $K$, and values $V$, the attention function is defined as:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

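where $d_k$ is the dimension of the keys. Dividing by $\sqrt{d_k}$ keeps the dot products from growing large in magnitude, which would push the softmax into regions where its gradients are extremely small.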
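
The effect of the scaling factor can be illustrated numerically. The sketch below is only an illustration: the three raw scores are hypothetical values chosen by hand, and $d_k = 64$ is the per-head key dimension of the paper's base configuration. It compares the softmax output and the diagonal gradient term $p_i(1 - p_i)$ with and without the $1/\sqrt{d_k}$ scaling.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


d_k = 64                       # per-head key dimension in the paper's base model
raw_scores = [8.0, 0.0, -8.0]  # hypothetical unscaled dot products q . k
scaled_scores = [s / math.sqrt(d_k) for s in raw_scores]

p_raw = softmax(raw_scores)
p_scaled = softmax(scaled_scores)

# d softmax_i / d logit_i = p_i * (1 - p_i); it shrinks toward 0 as p_i
# approaches 0 or 1, i.e. when the softmax saturates.
grad_raw = p_raw[0] * (1.0 - p_raw[0])
grad_scaled = p_scaled[0] * (1.0 - p_scaled[0])

print("unscaled:", p_raw, grad_raw)
print("scaled:  ", p_scaled, grad_scaled)
```

Without scaling, almost all probability mass lands on one key and the gradient term is close to zero; with scaling, the distribution stays soft and the gradient term is orders of magnitude larger.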

2. Multi-Head Attention

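$Q$, $K$, and $V$ are each linearly projected $h$ times with different learned projections, and attention is computed on each projected version in parallel: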

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O$$

where each head is:

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
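
The two formulas above can be sketched in dependency-free Python. This is a minimal illustration, not the paper's implementation: matrices are plain lists of rows, the helper names (`matmul`, `random_matrix`, `params`) are hypothetical, the weights are random rather than learned, and the toy sizes $d_{\text{model}} = 8$, $h = 2$ are assumptions. The choice $d_k = d_v = d_{\text{model}} / h$ follows the paper.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Apply a numerically stable softmax to every row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V"""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)


def random_matrix(rows, cols, rng):
    """Stand-in for a learned weight matrix."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def multi_head(q, k, v, params):
    """Concat(head_1, ..., head_h) W^O with head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)."""
    heads = [
        attention(matmul(q, w_q), matmul(k, w_k), matmul(v, w_v))
        for w_q, w_k, w_v in params["heads"]
    ]
    # Concatenate the h head outputs along the feature axis, position by position.
    concat = [sum((head[i] for head in heads), []) for i in range(len(q))]
    return matmul(concat, params["w_o"])


rng = random.Random(0)
d_model, h = 8, 2          # toy sizes (assumption); the paper's base model uses 512 and 8
d_k = d_v = d_model // h
params = {
    "heads": [
        (
            random_matrix(d_model, d_k, rng),  # W_i^Q
            random_matrix(d_model, d_k, rng),  # W_i^K
            random_matrix(d_model, d_v, rng),  # W_i^V
        )
        for _ in range(h)
    ],
    "w_o": random_matrix(h * d_v, d_model, rng),  # W^O
}

x = random_matrix(3, d_model, rng)   # 3 positions, self-attention: Q = K = V = x
out = multi_head(x, x, x, params)
print(len(out), len(out[0]))         # 3 8
```

Because each head works in a $d_{\text{model}} / h$-dimensional subspace, the total cost stays close to that of a single full-dimensional attention.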

3. Position-wise Feed-Forward Network

A two-layer fully connected network is applied to each position separately and identically:

$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$

This is equivalent to two convolutions with kernel size 1.
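
A small sketch of what "position-wise" means in practice: the same weights are applied to each position's vector, with no interaction between positions. The weights and inputs below are hand-picked toy values (an assumption for illustration; the paper's base model uses $d_{\text{model}} = 512$ and an inner dimension of 2048).

```python
def ffn(x, w1, b1, w2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single position vector x."""
    hidden = [
        max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
        for col, b in zip(zip(*w1), b1)
    ]
    return [
        sum(hj * w for hj, w in zip(hidden, col)) + b
        for col, b in zip(zip(*w2), b2)
    ]


# Toy weights: d_model = 2, inner dimension = 3.
w1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -0.5]]
b1 = [0.0, 0.1, -0.2]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.0, 0.5]

sequence = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]   # 3 positions

# The same function, with the same weights, is applied to every position.
outputs = [ffn(x, w1, b1, w2, b2) for x in sequence]

# Reordering the positions only reorders the outputs.
reversed_outputs = [ffn(x, w1, b1, w2, b2) for x in reversed(sequence)]
print(outputs)
print(reversed_outputs == list(reversed(outputs)))
```

The last check is exactly the property a kernel-size-1 convolution has: each output position depends only on the input at that position.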

4. Positional Encoding

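Because the model contains no recurrence or convolution, position information must be injected explicitly. The paper uses sinusoids of different frequencies, where $pos$ is the token position and $i$ indexes the dimension pair: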

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
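
The two formulas translate directly into a table of shape (sequence length, $d_{\text{model}}$). The following is a minimal sketch; the function name `positional_encoding` and the toy sizes are assumptions made for illustration.

```python
import math


def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        # `even` plays the role of 2i in the formulas, so the exponent
        # 2i / d_model becomes even / d_model.
        for even in range(0, d_model, 2):
            angle = pos / (10000 ** (even / d_model))
            pe[pos][even] = math.sin(angle)          # dimension 2i
            if even + 1 < d_model:
                pe[pos][even + 1] = math.cos(angle)  # dimension 2i + 1
    return pe


pe = positional_encoding(max_len=4, d_model=4)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0]
```

In the paper, this table is added to the input embeddings, so both must share the dimension $d_{\text{model}}$.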

5. Pseudocode

def transformer_encoder(x, mask):
    # Self-attention sub-layer
    attn_output = multi_head_attention(x, x, x, mask)
    x = layer_norm(x + attn_output)  # residual + norm

    # Feed-forward sub-layer
    ff_output = feed_forward(x)
    x = layer_norm(x + ff_output)    # residual + norm
    return x
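
The `layer_norm(x + sublayer_output)` step used twice above can be sketched on its own. This is a simplified illustration under stated assumptions: the learnable gain and bias of layer normalization are omitted, the `eps` value is an arbitrary choice, and `add_and_norm` is a hypothetical helper name.

```python
import math


def layer_norm(row, eps=1e-6):
    """Normalize one position's feature vector to zero mean and unit variance.

    Simplification: no learnable gain/bias; eps is an assumed constant.
    """
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]


def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization, per position."""
    return [
        layer_norm([a + b for a, b in zip(x_row, s_row)])
        for x_row, s_row in zip(x, sublayer_out)
    ]


x = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 1.0, -1.0]]              # 2 positions, 4 features
sublayer_out = [[0.5, -0.5, 0.0, 1.0], [1.0, 1.0, 1.0, 1.0]]   # toy sub-layer output
y = add_and_norm(x, sublayer_out)
print(y)
```

Normalization runs over the feature axis of each position independently, so it does not mix information across positions.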

Algorithm: Scaled Dot-Product Attention
Input: Q (n×d_k), K (m×d_k), V (m×d_v)
Output: Attention(Q, K, V)

1. scores ← Q · K^T / sqrt(d_k)
2. weights ← softmax(scores, axis=-1)
3. output ← weights · V
4. return output
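
The four steps of the algorithm map onto plain Python as follows. This is a minimal, unmasked sketch using lists of rows; the example matrices are hand-picked toy values, and returning the weights alongside the output is a choice made here for inspection, not part of the algorithm as stated.

```python
import math


def scaled_dot_product_attention(q, k, v):
    """q: n x d_k, k: m x d_k, v: m x d_v. Returns (output n x d_v, weights n x m)."""
    d_k = len(k[0])

    # Step 1: scores = Q K^T / sqrt(d_k)
    scores = [
        [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in k]
        for q_row in q
    ]

    # Step 2: softmax over the last axis (one distribution per query)
    weights = []
    for row in scores:
        peak = max(row)
        exps = [math.exp(s - peak) for s in row]
        total = sum(exps)
        weights.append([e / total for e in exps])

    # Step 3: output = weights V
    d_v = len(v[0])
    output = [
        [sum(w * v_row[j] for w, v_row in zip(w_row, v)) for j in range(d_v)]
        for w_row in weights
    ]

    # Step 4: return the result
    return output, weights


q = [[1.0, 0.0], [0.0, 1.0]]               # n = 2 queries
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # m = 3 keys
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # m = 3 values, d_v = 2

output, weights = scaled_dot_product_attention(q, k, v)
print(weights)
print(output)
```

Each output row is a convex combination of the value rows, weighted by how strongly the query matches each key.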

6. Key Results

Model	BLEU (EN-DE)	BLEU (EN-FR)
Transformer (base)	27.3	38.1
Transformer (big)	28.4	41.0

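On EN-DE translation, the Transformer outperformed all previously reported models, including ensembles; on EN-FR, the big model outperformed all previously published single models. In both cases it did so at a fraction of the training cost of the RNN- and CNN-based baselines.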