Attention is All You Need

Modules

Attention and Multi-Head Attention

An input vector $x = (x_1,\cdots,x_n)\in\mathbb{R}^n$ is passed through the trainable matrices $W_Q, W_K, W_V\in\mathbb{R}^{n\times d}$ to produce three vectors $Q,K,V\in\mathbb{R}^d$:

$$\mathrm{Attention}(Q,K,V) = \mathrm{Softmax}\left(\frac{Q^TK}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the keys (here $d_k = d$).
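The formula can be sketched in plain Python. This is a minimal illustration, assuming the queries, keys and values are stored as rows of matrices, so that the score between one query and one key is their dot product (written $QK^T$ in that row convention). The names `softmax` and `attention` and the sample values are illustrative and do not come from the source.

```python
import math

def softmax(row):
    """Numerically stable softmax of a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention; Q, K, V are lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Dot product of this query with every key, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)  # non-negative and sums to 1
        # The output row is the weighted average of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Illustrative inputs: two queries, two keys, two values.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
result = attention(Q, K, V)
print(result)
```

Each output row is a convex combination of the rows of `V`: a query that matches a key more closely pulls the output toward the corresponding value.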

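In fact, for the vectors $Q, K$, writing $W_{Qi}$ and $W_{Kj}$ for the $i$-th row of $W_Q$ and the $j$-th row of $W_K$ (as column vectors in $\mathbb{R}^d$), expanding the product gives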

$$\begin{aligned} Q^TK &= \left(\sum_i W_{Qi} x_i\right)^T\left(\sum_j W_{Kj}x_j\right)\\ & = \sum_{ij} x_i^T W_{Qi}^T W_{Kj} x_j\\ & = \sum_{ij} x_i R_{ij} x_j \end{aligned}$$

which is a bilinear form in $x$, with coefficients $R_{ij} = W_{Qi}^T W_{Kj}$.
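The expansion can be checked numerically. The sketch below assumes $W_{Qi}$ denotes the $i$-th row of $W_Q$, so that $R_{ij}$ is the dot product of row $i$ of $W_Q$ with row $j$ of $W_K$. The helper `project` and all numeric values are made up for illustration.

```python
n, d = 3, 2
x = [1.0, -2.0, 0.5]
W_Q = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]   # n x d
W_K = [[2.0, 0.0], [1.0, 1.0], [-1.0, 4.0]]   # n x d

def project(W, x):
    """Return sum_i W[i] * x[i], a vector in R^d."""
    return [sum(W[i][c] * x[i] for i in range(len(x)))
            for c in range(len(W[0]))]

# Left-hand side: the inner product Q^T K.
Q = project(W_Q, x)
K = project(W_K, x)
lhs = sum(q * k for q, k in zip(Q, K))

# Right-hand side: sum_ij x_i R_ij x_j with R_ij = W_Qi . W_Kj.
R = [[sum(a * b for a, b in zip(W_Q[i], W_K[j])) for j in range(n)]
     for i in range(n)]
rhs = sum(x[i] * R[i][j] * x[j] for i in range(n) for j in range(n))

print(lhs, rhs)
```

Both sides agree up to floating-point rounding, and `R` depends only on the trained matrices, not on `x`.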

Multi-Head Attention

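Multi-head attention is essentially the process of combining and merging the outputs of several attention heads. Compared with a single attention head, it requires training one additional output matrix $W_O$, and satisfies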

$$\begin{dcases} h_i =\mathrm{Attention}(XW_{Q,i},XW_{K,i},XW_{V,i}) =\mathrm{Softmax}\left(\frac{Q_iK_i^T}{\sqrt{d_i}}\right)\cdot V_i\\ \mathrm{MHA}(X) = \mathrm{Concat}_{i\leq n}(h_i)\cdot W_O \end{dcases}$$

where the combining function Concat is a simple embedding that places the head outputs side by side:

$$\mathrm{Concat}(h_1,\cdots,h_n) = \left(h_1,\cdots,h_n\right)$$
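The multi-head formula can be sketched in plain Python as follows. This is an illustration under stated assumptions, not a reference implementation: $X$ holds one token per row, each head is given as a hypothetical triple `(W_Q, W_K, W_V)`, and the number of heads (written $n$ in the formula) is simply the length of the `heads` list. The helper names and sample matrices are invented for the example.

```python
import math

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def softmax(row):
    """Numerically stable softmax of a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Softmax(Q K^T / sqrt(d)) V, with one query/key/value per row."""
    d = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head_attention(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples, one per head."""
    outputs = [attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
               for W_Q, W_K, W_V in heads]
    # Concat: join the head outputs along the feature axis, token by token.
    concat = []
    for t in range(len(X)):
        row = []
        for h in outputs:
            row.extend(h[t])
        concat.append(row)
    return matmul(concat, W_O)

# Illustrative inputs: 2 tokens of dimension 2, 2 heads of width 1.
X = [[1.0, 0.0], [0.0, 1.0]]
heads = [
    ([[1.0], [0.0]], [[1.0], [0.0]], [[1.0], [2.0]]),
    ([[0.0], [1.0]], [[0.0], [1.0]], [[3.0], [4.0]]),
]
W_O = [[1.0, 0.0], [0.0, 1.0]]
out = multi_head_attention(X, heads, W_O)
print(out)
```

Because each head has its own projection matrices, the heads can weight the same tokens differently, and $W_O$ mixes their concatenated outputs back into a single representation.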