Entropy, Cross-Entropy and KL-Divergence
 Original video
 A Short Introduction to Entropy, Cross-Entropy and KL-Divergence (YouTube)
 Related reading
 https://blog.floydhub.com/knowledge-distillation/
Glossary


Predicted distribution vs true distribution
Predicted distribution
When designing a code to represent weather predictions, you try to assign fewer bits to outcomes that are likely to be more common.
You can compute the assumed probability of an outcome from the number of bits that have been assigned to represent a prediction for that outcome.
True distribution
The true distribution contains the actual probabilities of the events you are trying to represent with a code.
The code cannot exactly match the true distribution but you can optimise the code to more accurately represent it.
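A minimal sketch of this correspondence between code length and assumed probability (the outcomes and code lengths below are made-up examples): a code that assigns \(n\) bits to an outcome implicitly predicts that outcome with probability \(2^{-n}\).

```python
# Hypothetical weather code: shorter codewords for outcomes
# we expect to be more common (outcomes and lengths assumed).
code_lengths = {"sunny": 1, "cloudy": 2, "rainy": 3, "snowy": 3}

# A codeword of n bits implies an assumed probability of 2^-n.
implied_q = {outcome: 2 ** -bits for outcome, bits in code_lengths.items()}
print(implied_q)  # {'sunny': 0.5, 'cloudy': 0.25, 'rainy': 0.125, 'snowy': 0.125}

# For a complete prefix code the implied probabilities sum to 1.
print(sum(implied_q.values()))  # 1.0
```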
Entropy
A function of the true distribution \(p\).
Related to the number of ways you can rearrange a set.
Shannon Entropy and Information Gain (YouTube)
If the number of ways the elements can be arranged is maximal, then the entropy is highest.
Knowledge and entropy are opposites.
Entropy
 Measures the average amount of information that you get from one sample drawn from a given probability distribution \(p\).
It tells you how unpredictable that probability distribution is.
\begin{equation} S = -\sum_{i} p_i \log_2(p_i) \end{equation}
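A minimal sketch of the entropy computation in Python (the distributions are made-up examples):

```python
import math

def entropy(p):
    """Shannon entropy in bits: S = -sum_i p_i * log2(p_i)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# Uniform distribution over 4 outcomes: maximally unpredictable.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0

# Skewed distribution: more predictable, so lower entropy.
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75
```

The `if pi > 0` guard follows the usual convention that \(0 \log 0 = 0\).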
Glossary


The surprisal of each event (the amount of information conveyed) becomes a random variable whose expected value is the information entropy.
Surprisal
 When the data source produces a low-probability value (i.e., when a low-probability event occurs), the event carries more “information” (“surprisal”) than when the source data produces a high-probability value.
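A short sketch of surprisal as a function of event probability (probabilities are made-up examples):

```python
import math

def surprisal(p):
    """Information content of an event with probability p, in bits."""
    return -math.log2(p)

# A rare event carries much more information than a common one.
print(surprisal(0.01))  # ~6.64 bits
print(surprisal(0.99))  # ~0.014 bits
```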
Information gain
 The information gain is based on the decrease in entropy after a dataset is split on an attribute.
Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).
https://medium.com/@rishabhjain_22692/decision-trees-it-begins-here-93ff54ef134
information gain [data mining]
 The amount of information that's gained by knowing the value of the attribute, which is the entropy of the distribution before the split minus the entropy of the distribution after it. The largest information gain is equivalent to the smallest entropy.
 vim +/"mutual information" "$NOTES/ws/glossaries/informationtheory.txt"
information gain ratio [#decision tree learning]
 Ratio of information gain to the intrinsic information. It was proposed by Ross Quinlan to reduce a bias towards multi-valued attributes by taking the number and size of branches into account when choosing an attribute.
Step 1
Calculate entropy of the target.
Step 2
The dataset is then split on the different attributes.
The entropy for each branch is calculated.
The branch entropies are then added, weighted proportionally by branch size, to get the total entropy for the split.
The resulting entropy is subtracted from the entropy before the split.
The result is the Information Gain, or decrease in entropy.
Step 3
Choose the attribute with the largest information gain as the decision node, divide the dataset by its branches and repeat the same process on every branch.
Step 4a
A branch with entropy of 0 is a leaf node.
Step 4b
A branch with entropy more than 0 needs further splitting.
Step 5
The ID3 algorithm is run recursively on the non-leaf branches, until all data is classified.
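Steps 1 and 2 can be sketched in Python (the toy labels and split below are made up for illustration; this computes a single information-gain evaluation, not the full ID3 recursion):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy before the split minus the size-weighted entropy after it."""
    n = len(labels)
    after = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - after

# Toy target: 8 "yes" / 6 "no" (made-up data).
target = ["yes"] * 8 + ["no"] * 6

# Splitting on a hypothetical attribute partitions the labels
# into two fairly homogeneous branches:
groups = [["yes"] * 6 + ["no"] * 1, ["yes"] * 2 + ["no"] * 5]
print(information_gain(target, groups))  # positive: the split reduced entropy
```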
Glossary


Cross entropy
A function of the predicted distribution \(q\) and the true distribution \(p\).
\begin{equation} H(p, q) = -\sum_{i} p_i \log_2(q_i) \end{equation}
As you can see, it looks pretty similar to the equation for the entropy, but instead of the log of the true probability we use the log of the predicted probability, whose negative is the message length assigned to that outcome.
If our predictions are perfect
If the predicted distribution is equal to the true distribution then the cross-entropy is simply equal to the entropy.
If the distributions differ
The cross-entropy will be greater than the entropy by some number of bits.
This amount by which the cross-entropy exceeds the entropy is called the relative entropy, or more commonly the KL divergence.
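A minimal sketch of these relationships in Python (the true and predicted distributions are made-up examples):

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log2(q_i): average message length when
    events drawn from p are encoded with a code optimised for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]   # true distribution (assumed example)
q = [0.25, 0.25, 0.25, 0.25]    # predicted distribution

print(entropy(p))           # 1.75
print(cross_entropy(p, q))  # 2.0 -- always >= entropy(p)
print(cross_entropy(p, p))  # 1.75 -- equals the entropy when q == p

# The excess, H(p, q) - H(p), is the KL divergence.
print(cross_entropy(p, q) - entropy(p))  # 0.25
```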
Glossary


Bayes' Rule
\begin{equation} \underbrace{p(\mathbf{z} \mid \mathbf{x})}_{\text{Posterior}} = \underbrace{p(\mathbf{z})}_{\text{Prior}} \times \frac{\overbrace{p(\mathbf{x} \mid \mathbf{z})}^{\text{Likelihood}}}{\underbrace{\int p(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) \, \mathrm{d}\mathbf{z}}_{\text{Marginal Likelihood}}} \enspace , \end{equation}
where \(\mathbf{z}\) denotes latent parameters we want to infer and \(\mathbf{x}\) denotes data.
KL Divergence
The amount by which the crossentropy exceeds the entropy.
\begin{equation} \text{KL}\left(q(\mathbf{z}) \,\lvert\lvert\, p(\mathbf{z} \mid \mathbf{x}) \right) = \int q(\mathbf{z}) \, \log \frac{q(\mathbf{z})}{p(\mathbf{z} \mid \mathbf{x})} \, \mathrm{d}\mathbf{z} \enspace . \end{equation}
where \(\mathbf{z}\) denotes latent parameters we want to infer and \(\mathbf{x}\) denotes data.
Another equation for KL divergence:
\begin{equation} D_{KL}(p_{\phi}, q_{\theta}) = -\sum_{x \in C} p_{\phi}(x) \log\left(\frac{q_{\theta}(x)}{p_{\phi}(x)}\right) = H(p_{\phi}, q_{\theta}) - H(p_{\phi}) \end{equation}
Glossary

