Original article
https://fabiandablander.com/r/Variational-Inference.html

https://mullikine.github.io/posts/entropy-cross-entropy-and-kl-divergence/

## Bayes' Theorem


$$\underbrace{p(\mathbf{z} \mid \mathbf{x})}_{\text{Posterior}} = \underbrace{p(\mathbf{z})}_{\text{Prior}} \times \frac{\overbrace{p(\mathbf{x} \mid \mathbf{z})}^{\text{Likelihood}}}{\underbrace{\int p(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) \, \mathrm{d}\mathbf{z}}_{\text{Marginal Likelihood}}} \enspace ,$$

where $$\mathbf{z}$$ denotes latent parameters we want to infer and $$\mathbf{x}$$ denotes data.
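To make the role of the marginal likelihood concrete, here is a minimal sketch that approximates a posterior on a grid. The model is an assumption chosen for illustration and is not from the original article: a binomial likelihood with 7 successes in 10 trials and a uniform prior on a single parameter $$z \in [0, 1]$$. The helper names `trapezoid` and `grid_posterior` are hypothetical.

```python
from math import comb

import numpy as np


def trapezoid(y, x):
    """Trapezoidal rule for samples y evaluated on the grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)


def grid_posterior(successes, trials, grid_size=2001):
    """Approximate p(z | x) on a grid (assumed model: binomial likelihood, uniform prior)."""
    z = np.linspace(0.0, 1.0, grid_size)
    prior = np.ones_like(z)  # uniform prior density on [0, 1]
    likelihood = comb(trials, successes) * z**successes * (1.0 - z) ** (trials - successes)
    joint = likelihood * prior  # numerator of Bayes' theorem
    marginal = trapezoid(joint, z)  # denominator: integral of the joint over z
    posterior = joint / marginal
    return z, posterior, marginal


z, posterior, marginal = grid_posterior(successes=7, trials=10)
posterior_mean = trapezoid(z * posterior, z)

print(f"marginal likelihood: {marginal:.6f}")
print(f"posterior mean:      {posterior_mean:.6f}")
```

For this toy model the answers are known analytically: the posterior is Beta(8, 4) with mean 2/3, and the marginal likelihood is 1/11, so the grid values can be checked against them. The grid works here only because $$z$$ is one-dimensional; the number of grid points grows exponentially with the number of latent parameters, which is why the integral becomes the bottleneck.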

## Integration vs derivative

My thoughts
Is it that by integrating we marginalise some of the parameter variables?

Bayes' Rule involves integration.

Optimization, which involves taking derivatives rather than integrating, is generally much easier and faster than integration, so our goal will be to reframe this integration problem as one of optimization.

## Remainder of article

### KL Divergence

We want to get at the posterior distribution, but instead of sampling we simply try to find a density $$q^\star(\mathbf{z})$$ from a family of densities $$Q$$ that best approximates the posterior distribution:

$$q^\star(\mathbf{z}) = \underbrace{\text{argmin}}_{q(\mathbf{z}) \in \mathrm{Q}} \text{ KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) \enspace ,$$

where $$\text{KL}(. \lvert \lvert.)$$ denotes the KL divergence:

$$\text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) = \int q(\mathbf{z}) \, \text{log } \frac{q(\mathbf{z})}{p(\mathbf{z} \mid \mathbf{x})} \, \mathrm{d}\mathbf{z} \enspace .$$
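As a sanity check on the definition, the sketch below evaluates the integral numerically for two densities that are both fully known, which is exactly what we do not have when the second argument is the posterior. The two univariate Gaussians and the function names are assumptions chosen for illustration. The closed-form expression for the KL divergence between two Gaussians is used only as a reference value.

```python
import numpy as np


def normal_logpdf(z, mean, sd):
    """Log density of a univariate Gaussian."""
    return -0.5 * ((z - mean) / sd) ** 2 - np.log(sd) - 0.5 * np.log(2.0 * np.pi)


def kl_numeric(q_mean, q_sd, p_mean, p_sd, grid_size=20001):
    """KL(q || p) by the trapezoidal rule on a grid covering q's mass."""
    z = np.linspace(q_mean - 12.0 * q_sd, q_mean + 12.0 * q_sd, grid_size)
    log_q = normal_logpdf(z, q_mean, q_sd)
    log_p = normal_logpdf(z, p_mean, p_sd)
    integrand = np.exp(log_q) * (log_q - log_p)  # q(z) * log(q(z) / p(z))
    return float(np.sum((integrand[1:] + integrand[:-1]) * np.diff(z)) / 2.0)


def kl_closed_form(q_mean, q_sd, p_mean, p_sd):
    """Analytic KL(q || p) for two univariate Gaussians."""
    return float(
        np.log(p_sd / q_sd)
        + (q_sd**2 + (q_mean - p_mean) ** 2) / (2.0 * p_sd**2)
        - 0.5
    )


# Illustrative choice: q = N(0, 1), p = N(1, 2^2)
forward = kl_numeric(0.0, 1.0, 1.0, 2.0)  # KL(q || p)
reverse = kl_numeric(1.0, 2.0, 0.0, 1.0)  # KL(p || q)

print(f"KL(q || p): numeric {forward:.6f}, closed form {kl_closed_form(0.0, 1.0, 1.0, 2.0):.6f}")
print(f"KL(p || q): numeric {reverse:.6f}, closed form {kl_closed_form(1.0, 2.0, 0.0, 1.0):.6f}")
```

The two directions give different values, a reminder that the KL divergence is not symmetric. Variational inference uses the direction with $$q$$ as the first argument, so the expectation is taken under the approximating density.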

We cannot compute this KL divergence because it still depends on the nasty integral $$p(\mathbf{x}) = \int p(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) \, \mathrm{d}\mathbf{z}$$. To see this dependency, observe that:

$$
\begin{aligned}
\text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) &= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } \frac{q(\mathbf{z})}{p(\mathbf{z} \mid \mathbf{x})}\right] \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z} \mid \mathbf{x})\right] \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } \frac{p(\mathbf{z}, \mathbf{x})}{p(\mathbf{x})}\right] \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] + \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x})\right] \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] + \int q(\mathbf{z}) \, \text{log } p(\mathbf{x}) \, \mathrm{d}\mathbf{z} \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] + \underbrace{\text{log } p(\mathbf{x})}_{\text{Nemesis}} \enspace ,
\end{aligned}
$$