Original article
The Illustrated GPT-2 (Visualizing Transformer Language Models)
by Jay Alammar (Visualizing machine learning one concept at a time)

Prereading

Overview of The Illustrated Transformer // Bodacious Blog

Parameters

When an article talks about the number of parameters, this is what it’s referring to.

                                                                          Parameters
Single Transformer block   Conv1D   attn/c_attn   w    768 × 2304          1769472
                                                  b          2304             2304
                                    attn/c_proj   w     768 × 768           589824
                                                  b           768              768
                                    mlp/c_fc      w    768 × 3072          2359296
                                                  b          3072             3072
                                    mlp/c_proj    w    3072 × 768          2359296
                                                  b           768              768
                           Norm     ln_1          g           768              768
                                                  b           768              768
                                    ln_2          g           768              768
                                                  b           768              768
                                                       Total               7087872    per block
                                                       × 12 blocks        85054464    in all blocks

                                    Token embeddings        50257 × 768   38597376
                                    Positional embeddings    1024 × 768     786432

                                                       Grand total       124438272
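
These counts can be reproduced with a few lines of arithmetic. Below is a minimal sketch (mine, not from the original article); the names mirror the variable names in the table above, and, like the table, it leaves out the model's final layer norm.

```python
# Minimal sketch: recompute the GPT-2 small parameter counts from the table above.
# d_model = 768, feed-forward width 4*d_model = 3072, vocab = 50257, context = 1024.
d_model, d_ff, vocab, n_ctx, n_blocks = 768, 3072, 50257, 1024, 12

block = {
    "attn/c_attn/w": d_model * 3 * d_model,  # 768 × 2304 = 1769472
    "attn/c_attn/b": 3 * d_model,            #                 2304
    "attn/c_proj/w": d_model * d_model,      #  768 × 768 =  589824
    "attn/c_proj/b": d_model,                #                  768
    "mlp/c_fc/w":    d_model * d_ff,         # 768 × 3072 = 2359296
    "mlp/c_fc/b":    d_ff,                   #                 3072
    "mlp/c_proj/w":  d_ff * d_model,         # 3072 × 768 = 2359296
    "mlp/c_proj/b":  d_model,                #                  768
    "ln_1/g": d_model, "ln_1/b": d_model,    # layer-norm scale and bias
    "ln_2/g": d_model, "ln_2/b": d_model,
}

per_block  = sum(block.values())             # 7087872
embeddings = vocab * d_model                 # 38597376 (token embeddings)
positional = n_ctx * d_model                 # 786432   (positional embeddings)

# The final layer norm (ln_f, another 1536 parameters) is not counted here,
# matching the table above.
total = n_blocks * per_block + embeddings + positional
print(per_block, total)                      # 7087872 124438272
```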

Goal

Supplement The Illustrated Transformer with more visuals explaining the inner-workings of transformers, and how they’ve evolved since the original paper.

Contents

Part 1: GPT2 And Language Modeling
    What is a Language Model
    Transformers for Language Modeling
    One Difference From BERT
    The Evolution of The Transformer Block
    Crash Course in Brain Surgery: Looking Inside GPT-2
    A Deeper Look Inside
    End of part #1: The GPT-2, Ladies and Gentlemen

Part 2: The Illustrated Self-Attention
    Self-Attention (without masking)
    1- Create Query, Key, and Value Vectors
    2- Score
    3- Sum
    The Illustrated Masked Self-Attention
    GPT-2 Masked Self-Attention
    Beyond Language Modeling
    You’ve Made it!

Part 3: Beyond Language Modeling
    Machine Translation
    Summarization
    Transfer Learning
    Music Generation

Glossary

WebText
    [dataset]

    The OpenAI researchers crawled it from the
    internet as part of research into GPT-2.

Model variants

Here, M means millions; note that elsewhere M is sometimes used to mean thousands, so check the context.

https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf

GPT-2 variant   Parameters (M = million)   Layers   \(\mathit{d}_\mathit{model}\) (model dimension)
SMALL                117M                    12       768
MEDIUM               345M                    24      1024
LARGE                762M                    36      1280
EXTRA LARGE         1542M                    48      1600
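
To see how the Layers and \(\mathit{d}_\mathit{model}\) columns translate into parameter counts, here is a small illustrative sketch (the variable names are mine, not from the GPT-2 code). It applies the same per-block bookkeeping as the Parameters table (12·d_model² weights plus 13·d_model biases and norm parameters per block); its estimates land slightly above the rounded figures quoted above.

```python
# Illustrative sketch: estimate each GPT-2 variant's parameter count from its
# depth (layers) and width (d_model), using the per-block accounting from the
# Parameters table. Names and structure are illustrative, not OpenAI's code.
VOCAB, N_CTX = 50257, 1024   # vocabulary and context length, shared by all variants

variants = {
    "SMALL":       {"layers": 12, "d_model": 768},
    "MEDIUM":      {"layers": 24, "d_model": 1024},
    "LARGE":       {"layers": 36, "d_model": 1280},
    "EXTRA LARGE": {"layers": 48, "d_model": 1600},
}

def estimate_params(layers: int, d_model: int) -> int:
    per_block = 12 * d_model**2 + 13 * d_model   # weights + biases/norms per block
    embeddings = (VOCAB + N_CTX) * d_model       # token + positional embeddings
    return layers * per_block + embeddings

for name, cfg in variants.items():
    print(f"{name:12} ~{estimate_params(**cfg) / 1e6:.0f}M")
# SMALL ~124M, MEDIUM ~355M, LARGE ~774M, EXTRA LARGE ~1558M
```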

Factoids

GPT-2 was trained on WebText

GPT-2 was trained on a massive 40GB dataset called WebText, which the OpenAI researchers crawled from the internet as part of the research effort.