Original video: A Short Introduction to Entropy, Cross-Entropy and KL-Divergence - YouTube

See also: https://blog.floydhub.com/knowledge-distillation/

## Glossary

marginalized
Treated as insignificant or peripheral.

marginal likelihood function (also: integrated likelihood, model evidence)
[#statistics] [#bayesian statistics] A likelihood function in which some parameter variables have been marginalized.

## Predicted distribution vs true distribution

### Predicted distribution

When designing a code to represent weather predictions, you try to assign fewer bits to the outcomes you expect to be more common.

You can compute the assumed probability of an outcome from the number of bits that have been assigned to represent a prediction for that outcome.
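A code that assigns $$n$$ bits to an outcome implicitly predicts a probability of $$2^{-n}$$ for it. A minimal sketch, with hypothetical weather outcomes and code lengths chosen so the implied probabilities sum to 1:

```python
# Hypothetical weather code: each outcome's implied probability is 2^(-bits).
code_lengths = {"sunny": 1, "cloudy": 2, "rainy": 3, "snowy": 3}  # assumed example

implied_prob = {outcome: 2 ** -bits for outcome, bits in code_lengths.items()}
print(implied_prob)  # {'sunny': 0.5, 'cloudy': 0.25, 'rainy': 0.125, 'snowy': 0.125}
```

Because the implied probabilities sum to exactly 1, no outcome's codeword can be shortened without lengthening another's.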

### True distribution

The true distribution contains the actual probabilities of the events you are trying to represent with a code.

The code cannot exactly match the true distribution, but you can optimise the code to represent it more accurately.

## Entropy

A function of the true distribution $$p$$.

Related to the number of ways you can rearrange a set.

Shannon Entropy and Information Gain - YouTube

If the number of ways the elements of the set can be arranged is maximal, the entropy is highest.

Knowledge and entropy are opposites.

Entropy
Measures the average amount of information that you get from one sample drawn from a given probability distribution p.

It tells you how unpredictable that probability distribution is.

$$S = -\sum_{i} p_i \log_2(p_i)$$
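The formula can be checked directly: a uniform distribution (maximally unpredictable) gives the highest entropy, while a heavily skewed one (the desert where it is almost always sunny) gives entropy near zero. A minimal sketch with assumed example distributions:

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

uniform = [0.25] * 4              # four equally likely outcomes
skewed = [0.97, 0.01, 0.01, 0.01] # almost always the same outcome

print(entropy(uniform))  # 2.0 bits
print(entropy(skewed))   # ≈ 0.24 bits
```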

### Glossary

expected information / information-theoretic entropy / entropy / H
Measures the average amount of information that you get when you learn the weather each day, or more generally the average amount of information that you get from one sample drawn from a given probability distribution p. It tells you how unpredictable that probability distribution is. If you live in the middle of a desert where it’s sunny every day, on average you won’t get much information from the weather station; the entropy will be close to zero. Conversely, if the weather varies a lot, the entropy will be much larger.

Optimising codes: for example, when we use a 2-bit message for sunny weather, we’re implicitly assuming that it will be sunny every 4 days (2 to the power of 2), at least on average. In other words, by using this code we’re implicitly predicting a probability of 25% for sunny weather, or else our code will not be optimal.

https://youtu.be/ErfnhcEV1O8?t=425

The surprisal of each event (the amount of information conveyed) becomes a random variable whose expected value is the information entropy.

Surprisal
When the data source produces a low-probability value (i.e., when a low-probability event occurs), the event carries more “information” (“surprisal”) than when the source data produces a high-probability value.
Information gain
The information gain is based on the decrease in entropy after a dataset is split on an attribute.

Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).

https://medium.com/@rishabhjain%5F22692/decision-trees-it-begins-here-93ff54ef134

information gain
[data mining] The amount of information that’s gained by knowing the value of the attribute: the entropy of the distribution before the split minus the entropy of the distribution after it. The largest information gain is equivalent to the smallest entropy.
vim +/"mutual information" "\$NOTES/ws/glossaries/information-theory.txt"

information gain ratio
[#decision tree learning] Ratio of information gain to the intrinsic information. Proposed by Ross Quinlan to reduce the bias towards multi-valued attributes by taking the number and size of branches into account when choosing an attribute.

### Step 1

Calculate entropy of the target.

### Step 2

The dataset is then split on the different attributes.

The entropy for each branch is calculated.

The branch entropies are then added, weighted proportionally to branch size, to get the total entropy for the split.

The resulting entropy is subtracted from the entropy before the split.

The result is the Information Gain, or decrease in entropy.
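Steps 1 and 2 can be sketched as a single function: entropy of the target before the split, minus the size-weighted entropy of each branch. The toy labels and attribute values below are assumed for illustration:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attribute_values):
    """Entropy before the split minus the weighted entropy after it."""
    n = len(labels)
    branches = {}
    for label, value in zip(labels, attribute_values):
        branches.setdefault(value, []).append(label)
    after = sum(len(b) / n * entropy(b) for b in branches.values())
    return entropy(labels) - after

# Toy data (assumed): does "windy" predict "play"?
play  = ["yes", "yes", "no", "no", "yes", "no"]
windy = [False, False, True, True, False, True]
print(information_gain(play, windy))  # 1.0: the split is perfectly informative
```

Here the split produces two pure branches (entropy 0), so the information gain equals the full entropy of the target, 1 bit.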

### Step 3

Choose attribute with the largest information gain as the decision node, divide the dataset by its branches and repeat the same process on every branch.

### Step 4a

A branch with entropy of 0 is a leaf node.

### Step 4b

A branch with entropy more than 0 needs further splitting.

### Step 5

The ID3 algorithm is run recursively on the non-leaf branches, until all data is classified.
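Steps 1–5 can be combined into a compact recursive sketch of ID3. This is a minimal illustration, not a production implementation; the attribute names and toy rows are assumed:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Step 3: pick the attribute with the largest information gain."""
    def gain(attr):
        n = len(labels)
        branches = {}
        for row, label in zip(rows, labels):
            branches.setdefault(row[attr], []).append(label)
        return entropy(labels) - sum(len(b) / n * entropy(b) for b in branches.values())
    return max(attributes, key=gain)

def id3(rows, labels, attributes):
    """Returns a nested dict tree; a leaf is just a class label."""
    if entropy(labels) == 0:            # step 4a: pure branch -> leaf node
        return labels[0]
    if not attributes:                  # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)
    branches = {}
    for row, label in zip(rows, labels):
        branches.setdefault(row[attr], []).append((row, label))
    tree = {attr: {}}
    remaining = [a for a in attributes if a != attr]
    for value, items in branches.items():   # steps 4b/5: recurse on impure branches
        sub_rows = [r for r, _ in items]
        sub_labels = [l for _, l in items]
        tree[attr][value] = id3(sub_rows, sub_labels, remaining)
    return tree

# Toy data (assumed attribute names)
rows = [{"outlook": "sunny", "windy": False},
        {"outlook": "sunny", "windy": True},
        {"outlook": "rainy", "windy": False}]
labels = ["yes", "no", "yes"]
print(id3(rows, labels, ["outlook", "windy"]))  # {'windy': {False: 'yes', True: 'no'}}
```

On this toy data, splitting on "windy" yields two pure branches, so ID3 chooses it at the root and both branches become leaves.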

#### Glossary

ID3
[algorithm] The core algorithm for building decision trees. Employs a top-down, greedy search through the space of possible branches with no backtracking. Uses entropy and information gain to construct a decision tree.

## Cross entropy

A function of the predicted distribution $$q$$ and the true distribution $$p$$.

$$H(p, q) = -\sum_{i} p_i \log_2(q_i)$$

As you can see, it looks pretty similar to the equation for the entropy, but instead of the log of the true probability we take the log of the predicted probability, whose negative is the message length assigned to that outcome.
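The formula can be sketched directly. With an assumed true weather distribution p, the flat 2-bit code of the video corresponds to a uniform predicted distribution q:

```python
import math

def cross_entropy(p, q):
    """Average message length in bits when events drawn from p use a code built for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]   # true distribution (assumed example)
q = [0.25, 0.25, 0.25, 0.25]    # predicted distribution behind a flat 2-bit code

print(cross_entropy(p, p))  # 1.75 bits: perfect predictions, equals the entropy
print(cross_entropy(p, q))  # 2.0 bits: every message costs 2 bits
```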

### If our predictions are perfect

If the predicted distribution is equal to the true distribution then the cross-entropy is simply equal to the entropy.

### If the distributions differ

The cross-entropy will be greater than the entropy by some number of bits.

This amount by which the cross-entropy exceeds the entropy is called the relative entropy, or more commonly the KL Divergence.

### Glossary

cross entropy / cross-entropy
[#information theory] The average message length. For example, if the weather station encodes each of the 8 possible options using a 3-bit code, then every message will have 3 bits, so the average message length will of course be 3 bits, and that’s the cross-entropy.
[[https://www.youtube.com/watch?v=ErfnhcEV1O8][A Short Introduction to Entropy, Cross-Entropy and KL-Divergence - YouTube]]
readsubs "https://www.youtube.com/watch?v=ErfnhcEV1O8" | v +/"average message length"

## Bayes’ Rule

$$\underbrace{p(\mathbf{z} \mid \mathbf{x})}_{\text{Posterior}} = \underbrace{p(\mathbf{z})}_{\text{Prior}} \times \frac{\overbrace{p(\mathbf{x} \mid \mathbf{z})}^{\text{Likelihood}}}{\underbrace{\int p(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) \, \mathrm{d}\mathbf{z}}_{\text{Marginal Likelihood}}} \enspace ,$$

where $$\mathbf{z}$$ denotes latent parameters we want to infer and $$\mathbf{x}$$ denotes data.
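For a discrete latent variable the integral in the denominator becomes a sum, and the update can be sketched in a few lines. The prior and likelihood numbers below are assumed for illustration:

```python
# Discrete Bayes update (assumed toy numbers): two latent states z,
# observed data x = "wet grass".
prior = {"rain": 0.3, "no_rain": 0.7}          # p(z)
likelihood = {"rain": 0.9, "no_rain": 0.2}     # p(x | z)

# Marginal likelihood: sum over z of p(x | z) * p(z)
evidence = sum(prior[z] * likelihood[z] for z in prior)

# Posterior: prior times likelihood, normalized by the evidence
posterior = {z: prior[z] * likelihood[z] / evidence for z in prior}
print(posterior)  # rain ≈ 0.66, no_rain ≈ 0.34
```

The marginal likelihood here plays the role of the "model evidence" from the glossary above: the likelihood with the latent variable z marginalized out.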

## KL Divergence

The amount by which the cross-entropy exceeds the entropy.

$$\text{KL}\left(q(\mathbf{z}) \,\|\, p(\mathbf{z} \mid \mathbf{x})\right) = \int q(\mathbf{z}) \, \log \frac{q(\mathbf{z})}{p(\mathbf{z} \mid \mathbf{x})} \, \mathrm{d}\mathbf{z} \enspace ,$$

where $$\mathbf{z}$$ denotes latent parameters we want to infer and $$\mathbf{x}$$ denotes data.

Another equation for KL divergence:

$$D_{KL}(p_{\phi} \,\|\, q_{\theta}) = \sum_{x \in C} p_{\phi}(x)\log\left(\frac{p_{\phi}(x)}{q_{\theta}(x)}\right) = H(p_{\phi}, q_{\theta}) - H(p_{\phi})$$

(Note the ratio inside the log is $$p/q$$, not $$q/p$$; otherwise the sign would be flipped and the identity with cross-entropy minus entropy would not hold.)
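The identity $$D_{KL}(p \,\|\, q) = H(p, q) - H(p)$$ can be verified numerically. A minimal sketch, reusing the assumed weather distributions from the cross-entropy example:

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]   # true distribution (assumed example)
q = [0.25, 0.25, 0.25, 0.25]    # predicted distribution

print(kl_divergence(p, q))              # 0.25 bits
print(cross_entropy(p, q) - entropy(p)) # same: 2.0 - 1.75
print(kl_divergence(p, p))              # 0.0: identical distributions
```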

### Glossary

Kullback-Leibler divergence / KL divergence / relative entropy
The amount by which the cross-entropy exceeds the entropy: cross-entropy is equal to the entropy plus the KL divergence, so D_KL(p||q) = H(p,q) − H(p). Example: cross-entropy = 4.58 bits, entropy = 2.23 bits, KL divergence = 2.35 bits.
A measure of how one probability distribution differs from a second, reference probability distribution. Applications include characterizing the relative (Shannon) entropy in information systems, randomness in continuous time-series, and information gain when comparing statistical models of inference.