Problem 9 : Kullback-Leibler (KL) Divergence

I had never heard of the KL Divergence until the summer of 2019, when I happened to work on a summer project on modeling the ATP Synthase enzyme. I was new to the subject and came across the expression while reading about the parallels between statistical mechanics and information theory. For reasons that will become clear later, it is also known as relative entropy. Let us consider discrete probability distributions here and prove the following inequality -

\[D_{\mathrm{KL}}(P \| Q)=\sum_{x \in \mathcal{X}} P(x) \ln \left(\frac{P(x)}{Q(x)}\right) \geq 0\]

for probability distributions \(P\) and \(Q\) over an ensemble of outcomes \(x \in \mathcal{X}\).
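Here is a minimal numerical sketch of this definition, using made-up three-symbol distributions and a helper function name (`kl_divergence`) of my own choosing:

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(P || Q) = sum_x P(x) ln(P(x)/Q(x)).

    Assumes p and q are probability vectors over the same finite alphabet,
    with q(x) > 0 wherever p(x) > 0.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0                      # terms with p(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Two made-up distributions over a 3-letter alphabet
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

print(kl_divergence(p, q))  # positive
print(kl_divergence(p, p))  # 0.0 when P = Q
```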

An Intuitive Approach

Let us assume that we have a message transmitter sending out discrete signals that take on values \(x_1, x_2, \ldots, x_M\). Each of these signals is sent out with a probability given by \(p_1, p_2, \ldots, p_M\) respectively. Before we proceed any further, we need to realize that the minimum number of bits needed to encode \(x_i\), which occurs with probability \(p_i\), is given by \(\log\left(\frac{1}{p_i}\right)\).

For this, we can just resort to a trick - let us encode the \(x_i\) that occur most frequently with the fewest possible bits. Then, to represent the entire sequence of outputs from the transmitter, we would end up using fewer bits than if we chose to encode every \(x_i\) with the same number of bits (\(\log(M)\)). The next natural question is how many bits should be allotted to encode each \(x_i\). We already know that the more frequent the occurrence of \(x_i\), the fewer the bits that should be allotted to it.
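As a quick sanity check with a hypothetical four-symbol source (the probabilities are chosen purely for illustration), assigning \(\log(1/p_i)\) bits per symbol does beat the flat \(\log(M)\)-bit scheme on average:

```python
import numpy as np

# Hypothetical 4-symbol source; probabilities chosen only for illustration.
probs = np.array([0.5, 0.25, 0.125, 0.125])

fixed_bits = np.log2(len(probs))       # log(M) = 2 bits for every symbol
variable_bits = np.log2(1.0 / probs)   # log(1/p_i) = 1, 2, 3, 3 bits

print(fixed_bits)                      # 2.0
print(np.sum(probs * variable_bits))   # 1.75 expected bits per symbol
```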

Now let us consider the following way of encoding \(x_i\) -

  1. If \(p_i \geq 0.5\), then we use 1 bit
  2. If \(0.5 > p_i \geq 0.25\), then we use 2 bits
  3. If \(0.25 > p_i \geq 0.125\), then we use 3 bits, and so on

This method of division comes naturally as we keep bisecting the domain of \(p_i\) and subsequently setting the number of bits. In fact, this is the same as saying that we use \(\lceil \log(1/p_i) \rceil\) bits for each \(p_i\). A continuous extension of this scheme would simply say that \(\log\left(\frac{1}{p_i}\right)\) is the number of bits needed to encode \(x_i\).
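A small sketch, assuming base-2 logarithms, that checks the bucket rule above really is the same as \(\lceil \log(1/p_i) \rceil\) (the function name and the sample probabilities are mine):

```python
import math

def bits_by_bisection(p):
    """Bits from the bucket scheme: 1 bit if p >= 1/2, 2 bits if
    1/4 <= p < 1/2, 3 bits if 1/8 <= p < 1/4, and so on."""
    bits = 1
    threshold = 0.5
    while p < threshold:
        bits += 1
        threshold /= 2
    return bits

for p in [0.6, 0.5, 0.3, 0.25, 0.15, 0.01]:
    assert bits_by_bisection(p) == math.ceil(math.log2(1.0 / p))
    print(p, bits_by_bisection(p))
```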

So, the choice of \(\log\left(\frac{1}{p_i}\right)\) as the number of bits to encode \(x_i\) seems legitimate. I hope that adds some intuition to the problem. Now, it just so happens that this is also the minimum number of bits needed; I couldn't come up with an intuitive explanation for why this is so.

Finally, for the data stream coming from the transmitter, the expected number of bits needed per symbol would simply be given by \(-\sum_{x} p(x) \log\left(p(x)\right)\). Since the current choice of encoding minimizes the expected number of bits, any other encoding based on a probability distribution \(q(x)\) would lead to an expected value of \(-\sum_{x} p(x) \log\left(q(x)\right)\), which is bound to be greater than or equal to the minimum.
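A small numerical illustration of that claim, again with made-up distributions \(p\) and \(q\) and base-2 logarithms so the quantities are in bits:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # true source distribution (made up)
q = np.array([0.2, 0.5, 0.3])   # mismatched distribution used for the code (made up)

entropy = -np.sum(p * np.log2(p))        # expected bits with the code matched to p
cross_entropy = -np.sum(p * np.log2(q))  # expected bits with the code matched to q

print(entropy, cross_entropy)            # cross_entropy >= entropy
print(cross_entropy - entropy)           # the gap is exactly D_KL(p || q), in bits
```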

The difference between these two expected values is exactly the KL divergence, so formulating the above paragraph mathematically leads us to the final result -

\[D_{\mathrm{KL}}(p \| q)=\sum_{x \in \mathcal{X}} p(x) \log \left(\frac{p(x)}{q(x)}\right) \geq 0\]

Mathematical Proof

The proof below can be traced backwards to show that the minimum number of bits needed to encode \(x_i\) having probability \(p_i\) is indeed given by \(\log\left(\frac{1}{p_i}\right)\).

Proof of the KL Divergence Inequality using Jensen’s Inequality : Proof
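For completeness, here is my own sketch of that standard argument (Jensen's inequality applied to the convex function \(-\ln\)), not the linked write-up itself:

\[
\begin{aligned}
D_{\mathrm{KL}}(P \| Q) &= \sum_{x \in \mathcal{X}} P(x)\left[-\ln \left(\frac{Q(x)}{P(x)}\right)\right] \\
&\geq -\ln \left(\sum_{x \in \mathcal{X}} P(x)\,\frac{Q(x)}{P(x)}\right) = -\ln \left(\sum_{x \in \mathcal{X}} Q(x)\right) = -\ln(1) = 0 .
\end{aligned}
\]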