Entropy properties
In this chapter, we'll go over the fundamental properties of KL-divergence, entropy, and cross-entropy. We have already encountered most of the properties before, but it's good to understand them in more depth.
This chapter contains a few exercises. I encourage you to try them to check your understanding.
KL divergence is always nonnegative. Equivalently, $H(p, q) \ge H(p)$.
KL divergence can blow up
Recall, KL divergence is algebraically defined like this:

$$D(p, q) = \sum_i p_i \log_2 \frac{p_i}{q_i}$$
Here's the biggest difference between KL and some more standard, geometrical ways of measuring distance, like the $L_1$ norm ($\sum_i |p_i - q_i|$) or the $L_2$ norm ($\sqrt{\sum_i (p_i - q_i)^2}$). Consider these two possibilities:

- the truth is $p_i = 50\%$, but the model says $q_i = 49\%$;
- the truth is $p_i = 1\%$, but the model says $q_i \approx 0\%$.
Regular norms ($L_1$, $L_2$) treat these errors as roughly equivalent. But KL knows better: the first situation is basically fine, while the second is catastrophic! For example, the letters "God" are rarely followed by "zilla", but any model of language should understand that this may sometimes happen. If $q(\text{zilla} \mid \text{God}) = 0$, the model will be infinitely surprised when 'Godzilla' appears!
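If you want to see this numerically, here's a minimal Python sketch (the probabilities are made up for illustration) comparing the $L_1$ and $L_2$ norms with KL divergence in the two situations above:

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p, q) in bits; terms with p_i = 0 contribute nothing."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    with np.errstate(divide="ignore"):
        return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

# Situation 1: small error on a large probability.
# Situation 2: the model assigns zero to a rare-but-possible event ('zilla' after 'God').
situations = [
    ([0.50, 0.50], [0.49, 0.51]),
    ([0.99, 0.01], [1.00, 0.00]),
]

for p, q in situations:
    diff = np.subtract(p, q)
    print(f"L1 = {np.abs(diff).sum():.4f}, "
          f"L2 = {np.sqrt((diff**2).sum()):.4f}, "
          f"KL = {kl(p, q):.4f} bits")
```

In both situations the norms are tiny (around 0.02), but the KL divergence jumps from a fraction of a millibit to infinity.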
Try making KL divergence infinite in the widget below. Next level: try to make it infinite while keeping the $L_1$ and $L_2$ norms close to zero.
[Interactive widget: KL Divergence Explorer]
KL divergence is asymmetrical
The KL formula isn't symmetrical—in general, $D(p, q) \neq D(q, p)$. Some describe this as a disadvantage, especially when comparing KL to simple symmetric distance functions like $L_1$ or $L_2$. But I want to stress that the asymmetry is a feature, not a bug! KL measures how well a model $q$ fits the true distribution $p$. That's inherently asymmetrical, so we need an asymmetrical formula—and that's perfectly fine.
In fact, that's why people call it a divergence instead of a distance. Divergences are kind of wonky distance measures that are not necessarily symmetric.
Imagine the true probability is 50%/50% (fair coin), but our model says 100%/0%. KL divergence is ...
... infinite:

$$D(p, q) = \frac{1}{2} \log_2 \frac{1/2}{1} + \frac{1}{2} \log_2 \frac{1/2}{0} = \infty.$$

That's because there's a 50% chance (whenever the coin lands tails) that we gain infinitely many bits of evidence toward $p$ (our posterior jumps to 100% fair, 0% biased).
Now flip it around: truth is 100%/0%, model is 50%/50%. Then

$$D(p, q) = 1 \cdot \log_2 \frac{1}{1/2} + 0 \cdot \log_2 \frac{0}{1/2} = 1 \text{ bit}$$

(using the convention that $0 \cdot \log_2 0 = 0$).
Every flip gives us heads, so we gain one bit of evidence that the coin is biased. As we keep flipping, our belief in fairness drops exponentially fast, but it never hits zero. We've gotta account for the (exponentially unlikely) possibility that a fair coin just coincidentally came up heads in all our past flips.
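To see the "exponentially fast, but never zero" behavior concretely, here's a small Bayes-update simulation in plain Python (the 50/50 prior and the 20-flip horizon are arbitrary choices): each observed heads adds exactly one bit of evidence for the biased coin, and the probability of "fair" roughly halves with every flip without ever reaching zero.

```python
import math

# Two hypotheses for the coin, starting from a 50/50 prior.
p_heads = {"fair": 0.5, "biased": 1.0}   # "biased" = always heads
posterior = {"fair": 0.5, "biased": 0.5}

for n in range(1, 21):
    # Bayes' rule after observing one more heads.
    unnorm = {h: posterior[h] * p_heads[h] for h in posterior}
    total = sum(unnorm.values())
    posterior = {h: u / total for h, u in unnorm.items()}

    if n in (1, 5, 10, 20):
        bits = math.log2(posterior["biased"] / posterior["fair"])
        print(f"after {n:2d} heads: P(fair) = {posterior['fair']:.2e}, "
              f"evidence for biased = {bits:.0f} bits")
```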
Here's a question: the following widget contains two distributions—one peaky and one broad. Which of the two KL divergences is larger?
[Interactive widget: KL Divergence Asymmetry]
KL is nonnegative
If you plug the same distribution into KL twice, you get

$$D(p, p) = \sum_i p_i \log_2 \frac{p_i}{p_i} = 0,$$

because $\log_2 \frac{p_i}{p_i} = \log_2 1 = 0$. Makes sense—you can't tell the truth apart from the truth. 🤷
This is the only case in which KL can be equal to zero. Otherwise, KL divergence is always positive. This fact is sometimes called Gibbs' inequality. I think we built up a pretty good intuition for this in the first chapter: imagine sampling from $p$ while Bayes' rule increasingly convinces you that you're sampling from some other distribution $q$. That would be really messed up!
This is not a proof though, just an argument that the world with possibly negative KL is not worth living in. Check out the formal proof if you're curious.
Since KL can be written as the difference between cross-entropy and entropy, we can equivalently rewrite $D(p, q) \ge 0$ as

$$H(p, q) \ge H(p).$$
That is, the best model of $p$ that accumulates surprisal at the least possible rate is ... 🥁 🥁 🥁 ... $p$ itself.
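If you want a quick numerical sanity check, here's a sketch using numpy with random distributions (the seed and the 5-outcome support are arbitrary): the cross-entropy never dips below the entropy, and the gap (which is exactly the KL divergence) vanishes only when the model equals the truth.

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(p):
    """Entropy H(p) in bits (assumes all p_i > 0)."""
    return float(-np.sum(p * np.log2(p)))

def cross_entropy(p, q):
    """Cross-entropy H(p, q) in bits (assumes all q_i > 0)."""
    return float(-np.sum(p * np.log2(q)))

# A few random distributions over 5 outcomes.
for _ in range(3):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    print(f"H(p) = {entropy(p):.3f} <= H(p, q) = {cross_entropy(p, q):.3f}, "
          f"so D(p, q) = {cross_entropy(p, q) - entropy(p):.3f} >= 0")

# Equality holds exactly when the model is the truth:
p = rng.dirichlet(np.ones(5))
print(f"D(p, p) = {cross_entropy(p, p) - entropy(p):.3f}")
```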
Additivity
Whenever we keep flipping coins, the total entropy/cross-entropy/relative entropy just keeps adding up. This property is called additivity, and it's so natural that it's easy to forget how important it is. We've used it implicitly in earlier chapters, whenever we talked about repeating the flipping experiment and summing surprisals.
More formally: say you've got a distribution pair $p_1$ and $q_1$ - think of $q_1$ as a model of $p_1$ - and another pair $p_2$ and $q_2$. Let's use $p_1 \otimes p_2$ for the product distribution -- a joint distribution with marginals $p_1, p_2$ under which they are independent. In this setup, we have this:

$$H(p_1 \otimes p_2) = H(p_1) + H(p_2),$$
$$H(p_1 \otimes p_2, q_1 \otimes q_2) = H(p_1, q_1) + H(p_2, q_2),$$
$$D(p_1 \otimes p_2, q_1 \otimes q_2) = D(p_1, q_1) + D(p_2, q_2).$$
Entropy has an even stronger property called subadditivity: take any joint distribution $r$ with marginals $p_1, p_2$. Then,

$$H(r) \le H(p_1) + H(p_2).$$
For example, imagine you flip a coin and record the same outcome twice. Then the entropy of each record is 1 bit, and subadditivity says that the total entropy is at most $1 + 1 = 2$ bits. In this case, it's actually still just 1 bit.
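Here's a tiny numpy sketch of both facts, using the doubled-record coin above: for two independent fair coins the joint entropy is exactly the sum of the marginal entropies, while for the copied record it is strictly smaller.

```python
import numpy as np

def entropy(joint):
    """Entropy of a (possibly joint) distribution in bits; zero cells contribute nothing."""
    p = np.asarray(joint, float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

coin = np.array([0.5, 0.5])

# Additivity: two independent fair coins form the product distribution.
independent = np.outer(coin, coin)              # a 2x2 table of 0.25s
print(entropy(independent), "=", entropy(coin) + entropy(coin))    # 2.0 = 2.0

# Subadditivity: flip one coin and record the outcome twice.
# Both marginals are still fair coins, but the joint has only two possible outcomes.
copied = np.array([[0.5, 0.0],
                   [0.0, 0.5]])
print(entropy(copied), "<=", entropy(coin) + entropy(coin))        # 1.0 <= 2.0
```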
Anthem battle
I collected the national anthems of the USA, the UK, and Australia and put them into one text file. The other text file contains the anthems of a bunch of random-ish countries. For both files, I computed the frequencies of the 26 letters 'a' to 'z'. So there are two distributions: $p$ (English-speaking) and $q$ (others). The question is: which one has larger entropy? And which of the two KL divergences, $D(p, q)$ or $D(q, p)$, is larger?
Make your guess before revealing the answer.
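If you'd like to run this battle yourself, a minimal sketch in plain Python might look like the following. The file names are placeholders for the two anthem collections described above; substitute your own.

```python
from collections import Counter
import math

def letter_distribution(path):
    """Relative frequencies of the letters 'a'-'z' in a text file (case-insensitive)."""
    text = open(path, encoding="utf-8").read().lower()
    counts = Counter(c for c in text if "a" <= c <= "z")
    total = sum(counts.values())
    return [counts[chr(ord("a") + i)] / total for i in range(26)]

def entropy(p):
    return -sum(x * math.log2(x) for x in p if x > 0)

def kl(p, q):
    return sum(math.inf if y == 0 else x * math.log2(x / y)
               for x, y in zip(p, q) if x > 0)

# Placeholder file names -- point these at your own anthem collections.
p = letter_distribution("anthems_english_speaking.txt")
q = letter_distribution("anthems_other.txt")

print(f"H(p) = {entropy(p):.3f} bits, H(q) = {entropy(q):.3f} bits")
print(f"D(p, q) = {kl(p, q):.3f} bits, D(q, p) = {kl(q, p):.3f} bits")
```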
Next steps
We now understand pretty well what KL divergence and cross-entropy stand for.
The mini-course has two more parts. We will:
- Ponder what happens if we try to make KL divergence small. This will explain a lot about ML loss functions and include some fun applications of probability to several of our riddles. Continue with the next chapter (Maximum likelihood) in the menu.
- Play with codes and see what they can tell us about neural nets. That's the Coding theory chapter in the menu.
The chapters are mostly independent, so I suggest you jump to whatever sounds more interesting. See you there!