[[Transformer]] is undoubtedly one of the most important neural architectures of recent years; it is considered the core of the foundation models behind many complex AI tasks. It is therefore desirable for every AI researcher to understand it in detail and know how to implement it. In this blog I use the [annotated-transformer](https://nlp.seas.harvard.edu/annotated-transformer/) as the reference, walk through the model with PyTorch code, and add some comments and notes that hopefully offer a bit more on top of the original article, which is already an amazing piece of writing about the Transformer anyway. This is part I of the series, and it focuses on the Transformer architecture.
# A high-level encoder-decoder architecture
Below we first show a high-level diagram of the Transformer as an encoder-decoder architecture, although it is not necessary to have both the encoder and the decoder at the same time; the key part of this architecture lies in the [[attention]] map. We deliberately show only the key parts of the architecture here, and in the next sections we will roll out the details of each block.
![[transformer-highlevel_1.excalidraw.light.svg]]
%%[[transformer-highlevel_1.excalidraw.md|🖋 Edit in Excalidraw]], and the [[transformer-highlevel_1.excalidraw.dark.svg|dark exported image]]%%
Given training pairs of $(inputs, outputs)$, the encoder on the left encodes the inputs as the `memory`, which goes to the decoder on the right-hand side. Encoding is relatively straightforward, as the memory learns to attend to all positions of the inputs. Decoding, however, follows an autoregressive style and therefore has two attentions. The first is a masked self-attention that learns to attend to every position up to the *"to-be"* generated position. Its output goes to the second attention (the source attention, or cross-attention), which combines it with the memory of all the input positions so that the decoder can generate the output. Generally speaking, to generate the output at position $i$, the decoder can leverage the input memory $h(\mathbf{x})$ and all the outputs up to $i$, i.e., $y_{i}=g(h(\mathbf{x}), y_{j<i})$. Another interesting part is the [[positional encoder]] (*PE*); what it does is simple: it injects position information into the input and output sequence embeddings. We will come back to this later.
# Model decomposition and deep dive
Now let us start by decomposing the model and diving deep into the details of the architecture. First we take a look at the inputs and outputs.
## Input and output embeddings
The inputs and outputs are normally sequences with variable lengths, and each item in a sequence is a token index. In batch mode, the shape of the inputs is $(b, s)$, where $b$ is the batch size and $s$ is the sequence length. Note that we normally keep the same sequence length $s$ within each batch, therefore padding is required, for instance appending the token `<pad>` to the end of a sequence if its length is less than $s$.
![[transformer_input_output.excalidraw.light.svg]]
%%[[transformer_input_output.excalidraw.md|🖋 Edit in Excalidraw]], and the [[transformer_input_output.excalidraw.light.svg|light exported image]]%%
Now we have a batch of input sequences with token Ids and padding Ids, and it goes to the Embedding layer, which is a lookup table that maps Ids to vectors in the embedding space. The output tensor's shape is $b\times s \times d$. Compared to the input's shape, a last dimension of size $d$ is added. Thus we convert the batch of input sequences into a tensor of embeddings. Note that the embedding layer can be loaded with a pre-trained tensor, e.g., [[word2vec]], or we can simply initialise the weights randomly. The same applies to the output sequences as well.
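As a small sketch of this step, the reference implementation wraps `nn.Embedding` and additionally scales the looked-up vectors by $\sqrt{d_{model}}$ (a detail from the paper); the class below follows that code.
```python
import math
import torch.nn as nn

class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        super().__init__()
        self.lut = nn.Embedding(vocab, d_model)  # lookup table: token id -> d_model vector
        self.d_model = d_model

    def forward(self, x):
        # x: (b, s) token ids -> (b, s, d_model) embeddings, scaled by sqrt(d_model)
        return self.lut(x) * math.sqrt(self.d_model)
```
If you want the `<pad>` rows to stay at zero, `nn.Embedding` also accepts a `padding_idx` argument, though the reference code does not use it.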
## Positional encoding
As mentioned before, we need to inject position information into the input embeddings, as the architecture has no recurrence or convolution by nature; on the other hand, this is exactly how the Transformer decouples the sequence dependencies so that parallelisation becomes possible. In the [paper](https://arxiv.org/abs/1706.03762), a positional encoding based on sine and cosine functions is used. In the following, you can see how the positional encoding is added along the embedding dimensions.
![[transformer_pos_encoding.excalidraw.light.svg]]
%%[[transformer_pos_encoding.excalidraw.md|🖋 Edit in Excalidraw]], and the [[transformer_pos_encoding.excalidraw.light.svg|light exported image]]%%
Positions are encoded with a sinusoid at each embedding dimension, and the wavelengths form a geometric progression from $2\pi$ ($i=0$) to $10000\cdot 2\pi$ ($i=d_{model}/2$). The full source code can be found in the [original post](https://nlp.seas.harvard.edu/annotated-transformer/); below is a lightly reorganised version with a few extra comments.
```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout, max_len=5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)
        # precompute the encodings once for all positions up to max_len
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        # make the 1st dimension the batch dimension and store as a buffer, not a parameter
        pe = pe.unsqueeze(0)
        self.register_buffer("pe", pe)

    def forward(self, x):
        # slice the precomputed table down to the actual sequence length of x;
        # positional encodings are fixed, therefore requires_grad_ is False
        x = x + self.pe[:, : x.size(1)].requires_grad_(False)
        return self.dropout(x)
```
## Self-Attention
Once positional encodings are added on top of the input embeddings, we are ready to push them through the encoder layers. We now show the detailed blocks of an encoder layer, which contains a self-attention layer, a feed-forward layer, and residual sublayer connections. Since we stack $N$ encoder layers, the output of the previous encoder layer is the input of the next one. The last layer's output is normalised and regarded as the `memory`.
### Attention map
Following the original paper, attention is defined as mapping a query and a set of key-value pairs to an output. With scaled dot-product attention, this is $\text{Attention}(Q,K,V)=\text{softmax}(\frac{QK^T}{\sqrt{d_k}})V$
The *softmax* part computes an $s\times s$ attention map, which tells us which positions to attend to on the value matrix $V$. To learn the attention function, we need to estimate the linear projections that produce $(Q,K,V)$. For instance, given a source embedding $x$ with shape $(s,d_{model})$, the self-attention function is
$\text{softmax}\left(\frac{xW_{1}W_{2}^{T}x^{T}}{\sqrt{d_k}}\right)xW_{3}$ where $W_{1},W_{2}\in \mathbb{R}^{d_{model}\times d_{k}}$ project the input $x$ from the embedding space to $d_k$ dimensions, and $W_{3}\in \mathbb{R}^{d_{model}\times d_{v}}$ projects it to $d_v$ dimensions. $d_k$ and $d_v$ are not necessarily the same; however, in the paper and implementation $d_k=d_v$.
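A minimal sketch of scaled dot-product attention, following the reference implementation (the `mask` and `dropout` arguments are optional and will be used later):
```python
import math
import torch

def attention(query, key, value, mask=None, dropout=None):
    # query, key, value: (..., s, d_k); scores: (..., s, s) attention map
    d_k = query.size(-1)
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # masked positions get -1e9 so that they vanish after the softmax
        scores = scores.masked_fill(mask == 0, -1e9)
    p_attn = scores.softmax(dim=-1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    return torch.matmul(p_attn, value), p_attn
```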
### Multi-head attention
If we want to jointly attend to information from different representation subspaces, we can employ the multi-head attention structure, which splits a single attention head into multiple heads running in parallel and concatenates their outputs via another linear projection. The paper sets $d_k=d_v=d_{model}/h$, and the last linear projection is $W_{4}\in \mathbb{R}^{hd_{v}\times d_{model}}$. We depict the details of the multi-head attention in the following diagram, which hopefully helps you understand how it works exactly; a code sketch follows the diagram.
![[transformer_multihead_attention_details.excalidraw.light.svg]]
%%[[transformer_multihead_attention_details.excalidraw.md|🖋 Edit in Excalidraw]], and the [[transformer_multihead_attention_details.excalidraw.dark.svg|dark exported image]]%%
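Below is a condensed version of the reference's multi-head attention, reusing the `attention` function sketched above; the reshapes correspond to the split and concatenate steps in the diagram.
```python
class MultiHeadedAttention(nn.Module):
    def __init__(self, h, d_model, dropout=0.1):
        super().__init__()
        assert d_model % h == 0
        self.d_k = d_model // h  # d_k = d_v = d_model / h
        self.h = h
        # three projections for (Q, K, V) plus the final output projection W_4
        self.linears = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(4)])
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        if mask is not None:
            mask = mask.unsqueeze(1)  # the same mask is applied to all h heads
        nbatches = query.size(0)
        # 1) project and split into heads: (b, s, d_model) -> (b, h, s, d_k)
        query, key, value = [
            lin(x).view(nbatches, -1, self.h, self.d_k).transpose(1, 2)
            for lin, x in zip(self.linears, (query, key, value))
        ]
        # 2) scaled dot-product attention on all heads in parallel
        x, _ = attention(query, key, value, mask=mask, dropout=self.dropout)
        # 3) concatenate heads back to (b, s, d_model) and apply the final projection
        x = x.transpose(1, 2).contiguous().view(nbatches, -1, self.h * self.d_k)
        return self.linears[-1](x)
```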
## Position-wise Feed-Forward network
Another building block of the encoder layer is a fully-connected feed-forward network, defined as a 2-layer neural network applied to every position with the same weights: $FFN(x)=\text{max}(0, xW_{5}+b_5)W_{6}+b_6$
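This maps directly to a small module like the one in the reference (with an inner dimension `d_ff`, which the paper sets to 2048):
```python
class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        super().__init__()
        self.w_1 = nn.Linear(d_model, d_ff)  # W_5, b_5
        self.w_2 = nn.Linear(d_ff, d_model)  # W_6, b_6
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # max(0, x W_5 + b_5) W_6 + b_6, applied independently at every position
        return self.w_2(self.dropout(self.w_1(x).relu()))
```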
## Sublayer connection
To connect the residual blocks (self-attention and FFN), the paper proposes an *add & norm* sublayer connection. Depending on the sublayer, the connection in the reference implementation can be written as $x+\text{sublayer}(\text{LayerNorm}(x))$. We found this interesting: since the layer normalisation happens before the sublayer (either the self-attention or the FFN), this is called a pre-LN residual connection, in contrast to the post-LN variant $\text{LayerNorm}(x+\text{sublayer}(x))$ described in the original paper. In practice, pre-LN is widely used as a best practice, as research shows it improves training stability and reduces the need for learning-rate warm-up. Some insights can be found in this [paper: Ruibin Xiong et al.](https://arxiv.org/abs/2002.04745)
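A sketch of the pre-LN sublayer connection, using `nn.LayerNorm` here whereas the reference rolls its own LayerNorm module:
```python
class SublayerConnection(nn.Module):
    """Pre-LN residual connection: x + dropout(sublayer(LayerNorm(x)))."""
    def __init__(self, size, dropout):
        super().__init__()
        self.norm = nn.LayerNorm(size)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, sublayer):
        # sublayer is a callable, e.g. the self-attention block or the FFN
        return x + self.dropout(sublayer(self.norm(x)))
```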
## Put things together for encoder
Now let us put the blocks together to show the detailed structure of the encoder; a code sketch of one encoder layer follows the diagram.
![[transformer_encoder_details.excalidraw.light.svg]]
%%[[transformer_encoder_details.excalidraw.md|🖋 Edit in Excalidraw]], and the [[transformer_encoder_details.excalidraw.dark.svg|dark exported image]]%%
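Reusing the `MultiHeadedAttention`, `PositionwiseFeedForward`, and `SublayerConnection` sketches above, one encoder layer can be assembled roughly as in the reference:
```python
class EncoderLayer(nn.Module):
    def __init__(self, size, self_attn, feed_forward, dropout):
        super().__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        self.sublayer = nn.ModuleList([SublayerConnection(size, dropout) for _ in range(2)])
        self.size = size

    def forward(self, x, mask):
        # self-attention sublayer followed by the position-wise FFN sublayer
        x = self.sublayer[0](x, lambda y: self.self_attn(y, y, y, mask))
        return self.sublayer[1](x, self.feed_forward)
```
The encoder then stacks $N$ such layers and applies a final layer normalisation to produce the `memory`.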
## Assembling the decoder
Now that we have introduced almost all the building blocks, the decoder consists of the same components we defined for the encoder, i.e., multi-head attention, feed-forward network, and sublayer connections for the residual blocks. The difference lies in its two attention blocks: the first is a masked self-attention over the target, which masks out leftward information so that each output position can only attend to positions prior to it. The second is the source attention (cross-attention); it works like the attention in the encoder, except that the query comes from the decoder while the key and value come from the encoder memory. A sketch of the decoder layer is shown below.
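Following the reference, a decoder layer has three sublayers; note how the source attention takes the query from the decoder and the key/value from the memory `m`:
```python
class DecoderLayer(nn.Module):
    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        super().__init__()
        self.self_attn = self_attn        # masked self-attention over the target
        self.src_attn = src_attn          # source attention (cross-attention) over the memory
        self.feed_forward = feed_forward
        self.sublayer = nn.ModuleList([SublayerConnection(size, dropout) for _ in range(3)])

    def forward(self, x, memory, src_mask, tgt_mask):
        m = memory
        x = self.sublayer[0](x, lambda y: self.self_attn(y, y, y, tgt_mask))
        # query comes from the decoder, key and value come from the encoder memory
        x = self.sublayer[1](x, lambda y: self.src_attn(y, m, m, src_mask))
        return self.sublayer[2](x, self.feed_forward)
```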
### Subsequent mask
To implement the target mask, we exploit the softmax function: masked positions are filled with $-10^9$, so that they do not contribute to the softmax scores in the attention map. This is easy to see: $\text{softmax}([1,2,-10^9])=[\frac{e^1}{e^1+e^2+e^{-10^9}},\frac{e^2}{e^1+e^{2}+e^{-10^9}},\frac{e^{-10^9}}{e^1+e^2+e^{-10^9}}]\approx[\frac{e^1}{e^1+e^2},\frac{e^2}{e^1+e^2},0]$ The last entry contributes essentially nothing. A minimal masking helper is sketched below.
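The mask itself is just a lower-triangular boolean matrix, as in the reference's `subsequent_mask`; the `attention` function above then fills the `False` positions with `-1e9` before the softmax:
```python
def subsequent_mask(size):
    # positions above the diagonal are the "future" positions that must be hidden
    attn_shape = (1, size, size)
    mask = torch.triu(torch.ones(attn_shape), diagonal=1).type(torch.uint8)
    return mask == 0  # True where a position is allowed to attend

# e.g. subsequent_mask(3)[0] ->
# [[ True, False, False],
#  [ True,  True, False],
#  [ True,  True,  True]]
```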
### Put things together with both encoder and decoder
Now, let us put things together.
![[transformer_decoder_details.excalidraw.light.svg]]
%%[[transformer_decoder_details.excalidraw.md|🖋 Edit in Excalidraw]], and the [[transformer_decoder_details.excalidraw.dark.svg|dark exported image]]%%
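For completeness, here is the top-level wrapper roughly as it appears in the reference; the `generator` is the final linear projection plus softmax that turns decoder outputs into token probabilities.
```python
class EncoderDecoder(nn.Module):
    def __init__(self, encoder, decoder, src_embed, tgt_embed, generator):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed    # input embedding + positional encoding
        self.tgt_embed = tgt_embed    # output embedding + positional encoding
        self.generator = generator    # final linear projection + softmax over the vocabulary

    def encode(self, src, src_mask):
        return self.encoder(self.src_embed(src), src_mask)

    def decode(self, memory, src_mask, tgt, tgt_mask):
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)

    def forward(self, src, tgt, src_mask, tgt_mask):
        # encode the inputs into the memory, then decode with both source and target masks
        return self.decode(self.encode(src, src_mask), src_mask, tgt, tgt_mask)
```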
This completes our detailed introduction of the Transformer architecture. In the next blog, we will talk about the implementation and training of the Transformer.
> Check out part II: [[Cracking the Annotated Transformer - Part II]]