Original video
A Short Introduction to Entropy, Cross-Entropy and KL-Divergence - YouTube
Related reading
https://blog.floydhub.com/knowledge-distillation/

Glossary

marginalized
    Treated as insignificant or peripheral.
    In statistics, a variable is marginalized
    when it is summed or integrated out of a
    joint distribution.

marginal likelihood function
integrated likelihood
model evidence
evidence
    [#statistics]
    [#bayesian statistics]

    A likelihood function in which some
    parameter variables have been
    marginalized.

Predicted distribution vs true distribution

Predicted distribution

When designing a code to represent weather predictions, you try to assign fewer bits to outcomes that are expected to be more common.

You can compute the assumed probability of an outcome from the number of bits that have been assigned to represent a prediction for that outcome.

True distribution

The true distribution contains the actual probabilities of the events you are trying to represent with a code.

The code will generally not match the true distribution exactly, but you can optimise it so that it represents the true distribution more accurately.
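
As a rough sketch of that idea (the outcomes and code-word lengths below are made up for illustration, not taken from the video), a code word of length L implies a probability of 2^-L:

    # Hypothetical code-word lengths (in bits) for each weather outcome.
    code_lengths = {"sunny": 1, "rainy": 2, "cloudy": 3, "snowy": 3}

    # A code word of length L implicitly predicts a probability of 2 ** -L
    # for its outcome (a shorter code word assumes a more common outcome).
    implied = {outcome: 2 ** -bits for outcome, bits in code_lengths.items()}

    print(implied)  # {'sunny': 0.5, 'rainy': 0.25, 'cloudy': 0.125, 'snowy': 0.125}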

Entropy

A function of the true distribution \(p\).

Related to the number of ways you can rearrange a set.

Shannon Entropy and Information Gain - YouTube

If the elements of a set can be arranged in only a few ways, the entropy is low; if the number of ways they can be arranged is maximal, the entropy is highest.

Knowledge and entropy are opposites.

Entropy
Measures the average amount of information that you get from one sample drawn from a given probability distribution p.

It tells you how unpredictable that probability distribution is.

\begin{equation} H(p) = -\sum_{i} p_i \log_2(p_i) \end{equation}
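
A minimal Python sketch of this formula (the example distributions are made up):

    import math

    def entropy(p):
        """Shannon entropy, in bits, of a discrete distribution p (a list of probabilities)."""
        return -sum(p_i * math.log2(p_i) for p_i in p if p_i > 0)

    print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits: uniform, very unpredictable
    print(entropy([0.97, 0.01, 0.01, 0.01]))  # ~0.24 bits: almost always sunny, close to zero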

Glossary

expected information
information-theoretic entropy
entropy
H
    Measures the average amount of information
    that you get when you learn the weather
    each day, or more generally the average
    amount of information that you get from
    one sample drawn from a given probability
    distribution p.

    It tells you how unpredictable that
    probability distribution is.

    If you live in the middle of a desert
    where it’s sunny every day, on average you
    won’t get much information from the
    weather station.

    The entropy will be close to zero.

    Conversely, if the weather varies a lot,
    the entropy will be much larger.

    Optimising codes:
        For example, when we use a 2-bit
        message for sunny weather, we’re
        implicitly assuming that it will be
        sunny every 4 days (2 to the power of
        2), at least on average.

        In other words, by using this code,
        we’re implicitly predicting a
        probability of 25% for sunny weather,
        or else our code will not be optimal.

        https://youtu.be/ErfnhcEV1O8?t=425

The surprisal of each event (the amount of information conveyed) becomes a random variable whose expected value is the information entropy.

Surprisal
When the data source produces a low-probability value (i.e., when a low-probability event occurs), the event carries more “information” (“surprisal”) than when the data source produces a high-probability value.
Information gain
The information gain is based on the decrease in entropy after a dataset is split on an attribute.

Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).

https://medium.com/@rishabhjain%5F22692/decision-trees-it-begins-here-93ff54ef134

information gain
    [data mining]

    The amount of information that's gained by
    knowing the value of the attribute, which
    is the entropy of the distribution before
    the split minus the entropy of the
    distribution after it.

    Choosing the attribute with the largest
    information gain is equivalent to choosing
    the one with the smallest entropy after
    the split.

    vim +/"mutual information" "$NOTES/ws/glossaries/information-theory.txt"

information gain ratio
    [#decision tree learning]

    Ratio of information gain to the intrinsic
    information.

    It was proposed by Ross Quinlan, to reduce
    a bias towards multi-valued attributes by
    taking the number and size of branches
    into account when choosing an attribute.
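
A minimal sketch of the gain ratio, assuming the usual C4.5 definition of the intrinsic (split) information as the entropy of the branch-size proportions; the numbers are made up:

    import math

    def intrinsic_information(branch_sizes):
        """Entropy, in bits, of the proportions of examples sent down each branch."""
        total = sum(branch_sizes)
        return -sum((n / total) * math.log2(n / total) for n in branch_sizes if n > 0)

    # Hypothetical split: information gain of 0.67 bits over branches of sizes 2, 2 and 2.
    gain = 0.67
    gain_ratio = gain / intrinsic_information([2, 2, 2])
    print(gain_ratio)  # ~0.42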

Step 1

Calculate entropy of the target.

Step 2

The dataset is then split on the different attributes.

The entropy for each branch is calculated.

Then it is added proportionally, to get total entropy for the split.

The resulting entropy is subtracted from the entropy before the split.

The result is the Information Gain, or decrease in entropy.

Step 3

Choose attribute with the largest information gain as the decision node, divide the dataset by its branches and repeat the same process on every branch.

Step 4a

A branch with entropy of 0 is a leaf node.

Step 4b

A branch with entropy more than 0 needs further splitting.

Step 5

The ID3 algorithm is run recursively on the non-leaf branches, until all data is classified.
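
A minimal Python sketch of steps 1 to 3 (the toy dataset, attribute names and helper functions are hypothetical, not taken from the article):

    import math
    from collections import Counter

    def entropy(labels):
        """Entropy, in bits, of the class labels in a branch."""
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

    def information_gain(rows, attribute, target):
        """Entropy of the target before the split minus the weighted
        entropy of each branch after splitting on `attribute`."""
        before = entropy([row[target] for row in rows])
        after = 0.0
        for value in {row[attribute] for row in rows}:
            branch = [row[target] for row in rows if row[attribute] == value]
            after += (len(branch) / len(rows)) * entropy(branch)
        return before - after

    # Hypothetical toy dataset: decide whether to play based on the weather.
    data = [
        {"outlook": "sunny",  "windy": False, "play": "no"},
        {"outlook": "sunny",  "windy": True,  "play": "no"},
        {"outlook": "rainy",  "windy": True,  "play": "no"},
        {"outlook": "rainy",  "windy": False, "play": "yes"},
        {"outlook": "cloudy", "windy": False, "play": "yes"},
        {"outlook": "cloudy", "windy": True,  "play": "yes"},
    ]

    # Step 3: the attribute with the largest information gain becomes the decision node.
    for attr in ("outlook", "windy"):
        print(attr, information_gain(data, attr, "play"))  # outlook ~0.67, windy ~0.08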

Glossary

ID3
    [algorithm]

    The core algorithm for building decision trees.

    Employs a top-down, greedy search through
    the space of possible branches with no
    backtracking.

    Uses Entropy and Information Gain to
    construct a decision tree.

Cross entropy

A function of the predicted distribution \(q\) and the true distribution \(p\).

\begin{equation} H(p, q) = -\sum_{i} p_i \log_2(q_i) \end{equation}

As you can see, it looks pretty similar to the equation for the entropy, but instead of taking the log of the true probability we take the log of the predicted probability, whose negative is the length of the message assigned to that outcome.
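
A small Python sketch comparing the two quantities (the distributions are made up):

    import math

    def cross_entropy(p, q):
        """Average message length, in bits, when events drawn from the true
        distribution p are encoded with a code optimised for the prediction q."""
        return -sum(p_i * math.log2(q_i) for p_i, q_i in zip(p, q) if p_i > 0)

    p = [0.5, 0.25, 0.125, 0.125]  # hypothetical true weather distribution
    q = [0.25, 0.25, 0.25, 0.25]   # prediction implied by a fixed 2-bit code

    print(cross_entropy(p, p))  # 1.75 bits: equals the entropy of p
    print(cross_entropy(p, q))  # 2.0 bits: larger, because q differs from p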

If our predictions are perfect

If the predicted distribution is equal to the true distribution then the cross-entropy is simply equal to the entropy.

If the distributions differ

The cross-entropy will be greater than the entropy by some number of bits.

This amount by which the cross-entropy exceeds the entropy is called the relative entropy, or more commonly the KL Divergence.

Glossary

cross entropy
cross-entropy
    [#information theory]

    [[https://www.youtube.com/watch?v=ErfnhcEV1O8][A Short Introduction to Entropy, Cross-Entropy and KL-Divergence - YouTube]]

    readsubs "https://www.youtube.com/watch?v=ErfnhcEV1O8" | v +/"average message length"

    The average message length.

    For example, if the weather station
    encodes each of the 8 possible options
    using a fixed 3-bit code, then every
    message will have 3 bits, so the average
    message length will of course be 3 bits,
    and that’s the cross-entropy.

Bayes’ Rule

\begin{equation} \underbrace{p(\mathbf{z} \mid \mathbf{x})}_{\text{Posterior}} = \underbrace{p(\mathbf{z})}_{\text{Prior}} \times \frac{\overbrace{p(\mathbf{x} \mid \mathbf{z})}^{\text{Likelihood}}}{\underbrace{\int p(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) \, \mathrm{d}\mathbf{z}}_{\text{Marginal Likelihood}}} \enspace , \end{equation}

where \(\mathbf{z}\) denotes latent parameters we want to infer and \(\mathbf{x}\) denotes data.
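
For a discrete latent variable, the integral in the denominator becomes a sum over z. A minimal sketch under that assumption (all names and numbers are made up):

    # Minimal sketch of Bayes' rule for a discrete latent variable z,
    # where the marginal likelihood integral reduces to a sum over z.

    prior = {"rainy": 0.3, "sunny": 0.7}       # p(z), hypothetical numbers
    likelihood = {"rainy": 0.9, "sunny": 0.2}  # p(x = "wet grass" | z)

    # Marginal likelihood (evidence): p(x) = sum_z p(x | z) p(z)
    evidence = sum(likelihood[z] * prior[z] for z in prior)

    # Posterior: p(z | x) = p(z) * p(x | z) / p(x)
    posterior = {z: prior[z] * likelihood[z] / evidence for z in prior}

    print(evidence)   # 0.41
    print(posterior)  # {'rainy': ~0.66, 'sunny': ~0.34}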

KL Divergence

The amount by which the cross-entropy exceeds the entropy.

\begin{equation} \text{KL}\left(q(\mathbf{z}) \,\lvert\lvert\, p(\mathbf{z} \mid \mathbf{x})\right) = \int q(\mathbf{z}) \, \log \frac{q(\mathbf{z})}{p(\mathbf{z} \mid \mathbf{x})} \, \mathrm{d}\mathbf{z} \enspace . \end{equation}

where \(\mathbf{z}\) denotes latent parameters we want to infer and \(\mathbf{x}\) denotes data.

Another equation for KL divergence:

\begin{equation} D_{KL}(p_{\phi} \| q_{\theta}) = \sum_{x \in C} p_{\phi}(x) \log_2\frac{p_{\phi}(x)}{q_{\theta}(x)} = H(p_{\phi}, q_{\theta}) - H(p_{\phi}) \end{equation}
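
A short Python sketch checking this identity numerically (the distributions are made up):

    import math

    def entropy(p):
        return -sum(p_i * math.log2(p_i) for p_i in p if p_i > 0)

    def cross_entropy(p, q):
        return -sum(p_i * math.log2(q_i) for p_i, q_i in zip(p, q) if p_i > 0)

    def kl_divergence(p, q):
        """D_KL(p || q), in bits: the extra cost per message of using a code
        optimised for q when the data actually follows p."""
        return sum(p_i * math.log2(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

    p = [0.5, 0.25, 0.125, 0.125]  # hypothetical true distribution
    q = [0.25, 0.25, 0.25, 0.25]   # hypothetical predicted distribution

    print(kl_divergence(p, q))               # 0.25 bits
    print(cross_entropy(p, q) - entropy(p))  # 0.25 bits: the same number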

Glossary

Kullback-Leibler Divergence
KL Divergence
relative entropy
    The amount by which the cross-entropy
    exceeds the entropy.

    Cross-entropy is equal to the entropy plus
    the KL divergence.

    D_KL(p||q) = H(p,q) - H(p)
        KL divergence D_KL(p||q) is equal to
        the cross-entropy H(p,q) minus the
        entropy H(p).

    Example:
        cross-entropy = 4.58 bits,
        entropy = 2.23 bits,
        KL Divergence = 2.35 bits.

    A measure of how one probability
    distribution is different from a second,
    reference probability distribution.

    Applications include characterizing the
    relative (Shannon) entropy in information
    systems, randomness in continuous time-
    series, and information gain when
    comparing statistical models of inference.