Entropy, Cross-Entropy and KL-Divergence
 Original video
 A Short Introduction to Entropy, Cross-Entropy and KL-Divergence (YouTube)
 Related reading
 https://blog.floydhub.com/knowledge-distillation/
Glossary


Predicted distribution vs true distribution
Predicted distribution
When designing a code to represent weather predictions, you try to assign fewer bits to outcomes that are likely to be more common.
You can compute the assumed probability of an outcome from the number of bits that have been assigned to represent a prediction for that outcome.
True distribution
The true distribution contains the actual probabilities of the events you are trying to represent with a code.
The code cannot exactly match the true distribution but you can optimise the code to more accurately represent it.
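A minimal sketch of this correspondence between code length and assumed probability (the outcomes and code lengths below are made-up examples): a code that assigns \(n\) bits to an outcome implicitly predicts that outcome with probability \(2^{-n}\).

```python
# Hypothetical weather code: shorter codewords for outcomes
# we expect to be more common (outcomes and lengths assumed).
code_lengths = {"sunny": 1, "cloudy": 2, "rainy": 3, "snowy": 3}

# A codeword of n bits implies an assumed probability of 2^-n.
implied_q = {outcome: 2 ** -bits for outcome, bits in code_lengths.items()}
print(implied_q)  # {'sunny': 0.5, 'cloudy': 0.25, 'rainy': 0.125, 'snowy': 0.125}

# For a complete prefix code the implied probabilities sum to 1.
print(sum(implied_q.values()))  # 1.0
```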
Entropy
A function of the true distribution \(p\).
Related to the number of ways you can rearrange a set.
Shannon Entropy and Information Gain (YouTube)
If the number of ways the elements can be arranged is maximal, then the entropy is highest.
Knowledge and entropy are opposites.
Entropy
 Measures the average amount of information that you get from one sample drawn from a given probability distribution \(p\).
It tells you how unpredictable that probability distribution is.
\begin{equation} S = -\sum_{i} p_i \log_2(p_i) \end{equation}
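A minimal sketch of the entropy computation in Python (the distributions are made-up examples):

```python
import math

def entropy(p):
    """Shannon entropy in bits: S = -sum_i p_i * log2(p_i)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# Uniform distribution over 4 outcomes: maximally unpredictable.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0

# Skewed distribution: more predictable, so lower entropy.
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75
```

The `if pi > 0` guard follows the usual convention that \(0 \log 0 = 0\).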
Glossary


The surprisal of each event (the amount of information conveyed) becomes a random variable whose expected value is the information entropy.
Surprisal
 When the data source produces a low-probability value (i.e., when a low-probability event occurs), the event carries more “information” (“surprisal”) than when the source data produces a high-probability value.
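A short sketch of surprisal as a function of event probability (probabilities are made-up examples):

```python
import math

def surprisal(p):
    """Information content of an event with probability p, in bits."""
    return -math.log2(p)

# A rare event carries much more information than a common one.
print(surprisal(0.01))  # ~6.64 bits
print(surprisal(0.99))  # ~0.014 bits
```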
Information gain
 The information gain is based on the decrease in entropy after a dataset is split on an attribute.
Constructing a decision tree is all about finding the attribute that returns the highest information gain (i.e., the most homogeneous branches).
https://medium.com/@rishabhjain_22692/decision-trees-it-begins-here-93ff54ef134
information gain [data mining]
 The amount of information that's gained by knowing the value of the attribute, which is the entropy of the distribution before the split minus the entropy of the distribution after it. The largest information gain is equivalent to the smallest entropy.
 vim +/"mutual information" "$NOTES/ws/glossaries/informationtheory.txt"
information gain ratio [#decision tree learning]
 Ratio of information gain to the intrinsic information. It was proposed by Ross Quinlan to reduce a bias towards multi-valued attributes by taking the number and size of branches into account when choosing an attribute.
Step 1
Calculate entropy of the target.
Step 2
The dataset is then split on the different attributes.
The entropy for each branch is calculated.
The branch entropies are then added, weighted proportionally by branch size, to get the total entropy for the split.
The resulting entropy is subtracted from the entropy before the split.
The result is the Information Gain, or decrease in entropy.
Step 3
Choose the attribute with the largest information gain as the decision node, divide the dataset by its branches and repeat the same process on every branch.
Step 4a
A branch with entropy of 0 is a leaf node.
Step 4b
A branch with entropy more than 0 needs further splitting.
Step 5
The ID3 algorithm is run recursively on the non-leaf branches, until all data is classified.
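Steps 1 and 2 can be sketched in Python (the toy labels and split below are made up for illustration; this computes a single information-gain evaluation, not the full ID3 recursion):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy before the split minus the size-weighted entropy after it."""
    n = len(labels)
    after = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - after

# Toy target: 8 "yes" / 6 "no" (made-up data).
target = ["yes"] * 8 + ["no"] * 6

# Splitting on a hypothetical attribute partitions the labels
# into two fairly homogeneous branches:
groups = [["yes"] * 6 + ["no"] * 1, ["yes"] * 2 + ["no"] * 5]
print(information_gain(target, groups))  # positive: the split reduced entropy
```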
Glossary


Cross entropy
A function of the predicted distribution \(q\) and the true distribution \(p\).
\begin{equation} H(p, q) = -\sum_{i} p_i \log_2(q_i) \end{equation}
As you can see, it looks pretty similar to the equation for the entropy, but instead of the log of the true probability we use the log of the predicted probability, whose negative is the message length assigned to that outcome.
If our predictions are perfect
If the predicted distribution is equal to the true distribution then the cross-entropy is simply equal to the entropy.
If the distributions differ
The cross-entropy will be greater than the entropy by some number of bits.
This amount by which the cross-entropy exceeds the entropy is called the relative entropy, or more commonly the KL divergence.
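A minimal sketch of these relationships in Python (the true and predicted distributions are made-up examples):

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log2(q_i): average message length when
    events drawn from p are encoded with a code optimised for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.125, 0.125]   # true distribution (assumed example)
q = [0.25, 0.25, 0.25, 0.25]    # predicted distribution

print(entropy(p))           # 1.75
print(cross_entropy(p, q))  # 2.0 -- always >= entropy(p)
print(cross_entropy(p, p))  # 1.75 -- equals the entropy when q == p

# The excess, H(p, q) - H(p), is the KL divergence.
print(cross_entropy(p, q) - entropy(p))  # 0.25
```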
Glossary


Bayes' Rule
\begin{equation} \underbrace{p(\mathbf{z} \mid \mathbf{x})}_{\text{Posterior}} = \underbrace{p(\mathbf{z})}_{\text{Prior}} \times \frac{\overbrace{p(\mathbf{x} \mid \mathbf{z})}^{\text{Likelihood}}}{\underbrace{\int p(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) \, \mathrm{d}\mathbf{z}}_{\text{Marginal Likelihood}}} \enspace , \end{equation}
where \(\mathbf{z}\) denotes latent parameters we want to infer and \(\mathbf{x}\) denotes data.
KL Divergence
The amount by which the crossentropy exceeds the entropy.
\begin{equation} \text{KL}\left(q(\mathbf{z}) \,\lvert\lvert\, p(\mathbf{z} \mid \mathbf{x}) \right) = \int q(\mathbf{z}) \, \log \frac{q(\mathbf{z})}{p(\mathbf{z} \mid \mathbf{x})} \, \mathrm{d}\mathbf{z} \enspace . \end{equation}
where \(\mathbf{z}\) denotes latent parameters we want to infer and \(\mathbf{x}\) denotes data.
Another equation for KL divergence:
\begin{equation} D_{KL}(p_{\phi}, q_{\theta}) = -\sum_{x \in C} p_{\phi}(x) \log\left(\frac{q_{\theta}(x)}{p_{\phi}(x)}\right) = H(p_{\phi}, q_{\theta}) - H(p_{\phi}) \end{equation}
Glossary

