Original article
https://jalammar.github.io/illustrated-transformer/
Source
https://github.com/tensorflow/tensor2tensor

Prereading

https://jalammar.github.io/illustrated-word2vec/

Helpful glossary

Multilayer perceptron
Feed-Forward Neural Network
FFNN
    Basically, these are multi-level logistic
    regression classifiers.

    Many layers of weights separated by
    non-linearities.

    Can be used to train a classifier, or to
    extract features as an autoencoder.
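
    A minimal sketch (assuming NumPy; the layer
    sizes and random weights are illustrative
    only): two layers of weights separated by a
    non-linearity.

    #+BEGIN_SRC python
    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(512, 2048)), np.zeros(2048)  # first layer of weights
    W2, b2 = rng.normal(size=(2048, 512)), np.zeros(512)   # second layer of weights

    def ffnn(x):
        """One 512-D vector in, one 512-D vector out."""
        hidden = np.maximum(0, x @ W1 + b1)  # non-linearity between the layers
        return hidden @ W2 + b2

    y = ffnn(rng.normal(size=512))
    #+END_SRC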

self-attention
intra-attention
    [attention mechanism]

    Intuition:
        Reflects on its own position/context
        within a greater whole.

    Relates different positions of a single
    sequence in order to compute a
    representation of the sequence.

    An attention operation applied to a single
    sequence in order to calculate a
    representation of that very same sequence.

    This concept has been very useful in NLP
    tasks such as text summarization, machine
    translation and image description
    generation.

    The method the Transformer uses to bake
    the “understanding” of other relevant
    words into the one we’re currently
    processing.

dot product
    [#math]
    [algebraic operation]

    What is dot product used for?

    The original motivation is a geometric
    one:
        The dot product can be used for
        computing the angle \(\alpha\) between two
        vectors \(a\) and \(b\): \(a\cdot b = |a|\cdot|b|\cdot\cos(\alpha)\).
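
        A quick worked example: for \(a = (1, 0)\) and
        \(b = (1, 1)\), \(a\cdot b = 1\cdot 1 + 0\cdot 1 = 1\),
        \(|a| = 1\), \(|b| = \sqrt{2}\), so
        \(\cos(\alpha) = 1/\sqrt{2}\) and \(\alpha = 45^{\circ}\).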

Optional reading

Topic URL
Self-attention Why do Transformers yield Superior Sequence to Sequence (Seq2Seq) Results?

High level overview

In a machine translation application, it would take a sentence in one language, and output its translation in another.

input [label="foreign language"]
output [label="your language"]

subgraph transformer {
    decoders [label="decoders: stack of N decoders"]
    encoders [label="encoders: stack of N encoders"] -> decoders
}

input -> encoders [label="input (embed the training set)"]
decoders -> output [label=output]
  βˆ˜β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”βˆ˜
  ┃        foreign language         ┃
  βˆ˜β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”βˆ˜
    ┃
    ┃ input (embed the training set)
    v
βˆ˜β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„βˆ˜
┆             transformer             ┆
┆                                     ┆
┆ βˆ˜β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”βˆ˜ ┆
┆ ┃  encoders: stack of N encoders  ┃ ┆
┆ βˆ˜β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”βˆ˜ ┆
┆   ┃                                 ┆
┆   ┃                                 ┆
┆   v                                 ┆
┆ βˆ˜β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”βˆ˜ ┆
┆ ┃  decoders: stack of N decoders  ┃ ┆
┆ βˆ˜β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”βˆ˜ ┆
┆                                     ┆
βˆ˜β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„βˆ˜
    ┃
    ┃ output
    v
  βˆ˜β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”βˆ˜
  ┃          your language          ┃
  βˆ˜β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”βˆ˜

Flowchart

A look inside the encoder stack

Each stack contains N encoders.

list of 512D vectors
This is a list of 512-dimensional word vectors (initially, the embedded words). It refers to either the embedded input sentence or the output of an encoder.

Each encoder contains a self-attention layer and an FFNN:

subgraph clusterEncoders {
    style = filled
    fillcolor = lightgrey
    node [style=filled,fillcolor=lightgrey,shape=circle];

    label = "Set of encoders"
    subgraph clusterEncoder1 {
        fillcolor = white
        label = "Encoder 1";
        f1[label="FFNN"]
        a1[label="Self-Attention layer"]
        a1 -> f1
    }
    subgraph clusterEncoder2 {
        fillcolor = white
        label = "Encoder 2";
        f2[label="FFNN"]
        a2[label="Self-Attention layer"]
        a2 -> f2
    }
    subgraph clusterEncoderN {
        fillcolor = white
        label = "Encoder N";
        etc [label="..."]
    }

    f1 -> a2 [label="list of 512D vectors"]
    f2 -> etc [label="list of 512D vectors"]
}

Explanation of the encoder above

The first encoder receives one sentence [at a time] as input.

Subsequent encoders receive the output of the previous encoder (which is of the same dimensionality).

Each word of the sentence is embedded into a vector of size 512, so the embedded sentence has size (number of words in the longest sentence of our training dataset) × 512, the size of a word vector.
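
A rough sketch of that shape (assuming NumPy; the toy vocabulary, padding token, and random embedding table are made up for illustration):

#+BEGIN_SRC python
import numpy as np

d_model, max_len = 512, 10                 # 512-D embeddings; 10 = longest training sentence (made up)
vocab = {"<pad>": 0, "je": 1, "suis": 2, "etudiant": 3}   # toy vocabulary
embedding_table = np.random.default_rng(0).normal(size=(len(vocab), d_model))

def embed(sentence):
    """Embed a sentence as a (max_len, 512) matrix, padded to the longest sentence."""
    ids = [vocab[w] for w in sentence] + [vocab["<pad>"]] * (max_len - len(sentence))
    return embedding_table[ids]

X = embed(["je", "suis", "etudiant"])
print(X.shape)                             # (10, 512)
#+END_SRC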

A self-attention layer helps the encoder look at other words in the input sentence as it encodes a specific word.

The exact same FFNN is independently applied to each position (i.e. each vector flows through it separately).

My thoughts on this…
If you imagine an encoder as a CNN, you can think of the self-attention layer as the sliding window, except that it has context of the entire input text (the sentence), not merely the few words around the current one. The FFNN takes the tensor of attention values as its input.
A key property of the transformer
The word in each position flows through its own path in the encoder. There are dependencies between these paths in the self-attention layer. The feed-forward layer does not have those dependencies, however, and thus the various paths can be executed in parallel while flowing through the feed-forward layer.
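
A minimal sketch of that property (assuming NumPy; the random weights and the 2048-wide hidden layer are stand-ins, not stated in these notes): the same weights are applied to every position's vector independently, so the positions could be processed in parallel.

#+BEGIN_SRC python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(512, 2048)), rng.normal(size=(2048, 512))  # stand-ins for the trained weights

def position_wise_ffnn(Z):
    """Z: (seq_len, 512), one row per position; the same weights hit every row,
    and no row depends on any other row inside this layer."""
    return np.maximum(0, Z @ W1) @ W2

Z = rng.normal(size=(10, 512))             # self-attention output for 10 positions
out = position_wise_ffnn(Z)                # (10, 512)
#+END_SRC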

A look inside the decoder stack

Each decoder is the same as an encoder, except with an encoder-decoder attention layer in between its self-attention layer and its FFNN.

#+BEGIN_SRC graphviz-dot -n :filter dot-digraph-ascii-lr :async :results verbatim code
  subgraph Decoder1 {
      f1[label="FFNN"]
      eda1[label="Encoder-Decoder-Attention layer"] -> f1
      a1[label="Self-Attention layer"] -> eda1
  }
  subgraph Decoder2 {
      f2[label="FFNN"]
      eda2[label="Encoder-Decoder-Attention layer"] -> f2
      a2[label="Self-Attention layer"] -> eda2
  }
  subgraph DecoderN {
      etc [label="..."]
  }

  f1 -> a2
  f2 -> etc [label="..."]
#+END_SRC
βˆ˜β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„βˆ˜     ∘ β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„βˆ˜        βˆ˜β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„βˆ˜
┆                                           Decoder1                            ┆     ┆                                           Decoder2                            ┆        ┆  DecoderN  ┆
┆                                                                               ┆     ┆                                                                               ┆        ┆            ┆
┆ βˆ˜β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”βˆ˜     βˆ˜β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”βˆ˜     βˆ˜β”β”β”β”β”β”βˆ˜ ┆     ┆ βˆ˜β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”βˆ˜     βˆ˜β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”βˆ˜     βˆ˜β”β”β”β”β”β”βˆ˜ ┆  ...   ┆ βˆ˜β”β”β”β”β”β”β”β”βˆ˜ ┆
┆ ┃ Self-Attention layer ┃ ━━> ┃ Encoder-Decoder-Attention layer ┃ ━━> ┃ FFNN ┃ ┆ ━━> ┆ ┃ Self-Attention layer ┃ ━━> ┃ Encoder-Decoder-Attention layer ┃ ━━> ┃ FFNN ┃ ┆ ━━━━-> ┆ ┃  ...   ┃ ┆
┆ βˆ˜β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”βˆ˜     βˆ˜β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”βˆ˜     βˆ˜β”β”β”β”β”β”βˆ˜ ┆     ┆ βˆ˜β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”βˆ˜     βˆ˜β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”βˆ˜     βˆ˜β”β”β”β”β”β”βˆ˜ ┆        ┆ βˆ˜β”β”β”β”β”β”β”β”βˆ˜ ┆
┆                                                                               ┆     ┆                                                                               ┆        ┆            ┆
βˆ˜β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„βˆ˜     ∘ β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„βˆ˜        βˆ˜β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„β”„βˆ˜

Explanation of the decoder above

The encoder-decoder attention layer helps the decoder focus on relevant parts of the input sentence (similar to what attention does in seq2seq models).

My thoughts on this…
It seems as though the decoder’s extra attention layer learns a more macroscopic attention than the self-attention layer.
The intuition of self-attention…
Reflects on its own position/context within a greater whole. Each \(512D\) row reflects on its own position/context within the entire matrix. If you’re familiar with RNNs, think of how maintaining a hidden state allows an RNN to incorporate its representation of previous words/vectors it has processed with the current one it’s processing. Self-attention is the method the Transformer uses to bake the “understanding” of other relevant words into the one we’re currently processing.

Self-attention in detail

The query, key, and value vectors are abstractions that are useful for calculating and thinking about attention.

The self-attention must be calculated for each word.

Step 1: Create a \(q\), \(k\) and \(v\) vector for each word / \(512D\) input vector.

"the first word of the input sentence" -> "the first vector of the input list" [label="is analogous to"]
βˆ˜β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”βˆ˜
┃ the first word of the input sentence ┃
βˆ˜β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”βˆ˜
  ┃
  ┃ is analogous to
  v
βˆ˜β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”βˆ˜
┃  the first vector of the input list  ┃
βˆ˜β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”β”βˆ˜
relations
word ↦ vector ↦ word embedding
sentence ↦ input of encoder ↦ list of vectors ↦ list of embeddings

This is the path of one word of the input sentence through an encoder.

Firstly we need the weight matrices.

These are learned during the training process. When was that? This is explained later, maybe.

matrix long name
\(W^Q\) WQ weight matrix
\(W^K\) WK weight matrix
\(W^V\) WV weight matrix

Then we compute \(q\), \(k\) and \(v\) from the weight matrices.

e.g. make \(q_1\), \(k_1\) and \(v_1\) from \(X_1\) (the first word of the input sentence)

vector long name size created by equation
\(X_1\) first word embedding 512 (input word) embed a single word from input sentence
\(q_1\) Query vector 64 (< the input word) multiply embedding by weight \(X_1\cdotp W^Q\)
\(k_1\) Key vector 64 multiply embedding by weight \(X_1\cdotp W^K\)
\(v_1\) Value vector 64 multiply embedding by weight \(X_1\cdotp W^V\)
64
The q, k, v vectors don’t have to be smaller. This is an architecture choice to make the computation of multiheaded attention (mostly) constant.
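
A minimal sketch of step 1 (assuming NumPy; the random matrices merely stand in for the trained \(W^Q\), \(W^K\), \(W^V\)):

#+BEGIN_SRC python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 512, 64
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))  # stand-ins for trained weights

X_1 = rng.normal(size=d_model)             # embedding of the first word
q_1, k_1, v_1 = X_1 @ W_Q, X_1 @ W_K, X_1 @ W_V
print(q_1.shape, k_1.shape, v_1.shape)     # (64,) (64,) (64,)
#+END_SRC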

Step 2: calculate \(n\times n\) scores, \(n\) for each word, as we compare each word with every other word

The score
determines how much focus to place on other parts of the input sentence as we encode a word at a certain position.

Take the dot product of the query vector with the key vector of the respective word we’re scoring.

\(\mathit{score}_\mathit{i,j} = q_i \cdot k_j\)

e.g. calculate the self-attention for the 1st word in the sentence

Score each word of the input sentence against the 1st word.

If we’re processing the self-attention for the word in position #1, the 1st score would be the dot product of \(q_1\) and \(k_1\).

The 2nd score would be the dot product of \(q_1\) and \(k_2\).

Step 3: divide the scores by 8

8 is the square root of the dimension of the key vectors used in the paper: \(\sqrt{64} = 8\).

See 64 earlier in this document.

Dividing the scores leads to more stable gradients.

Step 4: pass the scores through softmax

Softmax normalizes the scores so they’re all positive and add up to 1.

softmax score
determines how much each word will be expressed at this position.
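
For reference, the softmax applied to a vector of scores \(s\) is \(\mathrm{softmax}(s)_i = e^{s_i} / \sum_j e^{s_j}\), which is what makes the results positive and sum to 1.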

Step 5: multiply each value vector by the softmax score

This is in preparation to sum them up.

the intuition
To keep intact the values of the word or words we want to focus on, and to drown out irrelevant words (by multiplying them by tiny numbers like 0.001, for example).

Step 6: sum up the weighted value vectors

This produces the output of the self-attention layer at this position (for the 1st word).
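
A minimal sketch of steps 2 through 6 for one position (assuming NumPy; \(q_1\), \(K\), and \(V\) are random stand-ins for the vectors computed in step 1):

#+BEGIN_SRC python
import numpy as np

rng = np.random.default_rng(0)
n, d_k = 5, 64                              # 5 words in the sentence, 64-D q/k/v vectors
q_1 = rng.normal(size=d_k)                  # query vector of word #1
K = rng.normal(size=(n, d_k))               # key vectors of every word, one per row
V = rng.normal(size=(n, d_k))               # value vectors of every word, one per row

scores = K @ q_1 / np.sqrt(d_k)             # steps 2-3: dot products, divided by 8
weights = np.exp(scores) / np.exp(scores).sum()   # step 4: softmax
z_1 = weights @ V                           # steps 5-6: weight the values and sum them
print(z_1.shape)                            # (64,)
#+END_SRC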

To conclude the explanation of the self-attention calculation, here it is as pseudocode:

\Function{Self-attention}{${word}_i$}
\State $X_i\gets embed({word}_i)$
\State $q_i\gets X_i\cdotp W^Q$
\State $k_i\gets X_i\cdotp W^K$
\State $v_i\gets X_i\cdotp W^V$
\For{$\textrm{every word }j\textrm{ in the sentence}$} \Comment{$k_j$ and $v_j$ come from word $j$}
\State $\mathit{score}_\mathit{i,j}\gets (q_i \cdot k_j) / \sqrt{d_k}$ \Comment{steps 2 and 3, $\sqrt{d_k} = 8$}
\EndFor
\State $\mathit{weight}_\mathit{i}\gets \mathit{softmax}(\mathit{score}_\mathit{i})$ \Comment{step 4}
\State $z_i\gets \sum_j \mathit{weight}_\mathit{i,j}\, v_j$ \Comment{steps 5 and 6}
\State \Return $z_i$
\EndFunction
\newline
\Function{Self-attention-layer}{$sentence$}
\For{$\textrm{each } word_i \textrm{ in } sentence$}
\State $\mathit{self\char`_attention}_\mathit{i}\gets \verb|Self-attention|({word}_i)$
\EndFor
\EndFunction

Calculating self-attention with matrices

First step - calculate \(Q\), \(K\) and \(V\)

The 1st step is to calculate the Query, Key, and Value matrices.

We do that by packing our embeddings into a matrix \(X\), and multiplying it by the weight matrices we’ve trained (\(W^Q\), \(W^K\), \(W^V\)).
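
A minimal sketch of this first step (assuming NumPy; \(X\) and the weight matrices are random stand-ins for the embeddings and the trained weights):

#+BEGIN_SRC python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 512, 64
X = rng.normal(size=(n, d_model))           # every word embedding packed into one matrix, one row per word
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V         # one query/key/value row per word
print(Q.shape, K.shape, V.shape)            # (5, 64) (5, 64) (5, 64)
#+END_SRC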