Shannon's noiseless coding theorem

Shannon's noiseless coding theorem places an upper and a lower bound on the minimal possible expected length of codewords as a function of the entropy of the input word (which is viewed as a random variable) and of the size of the target alphabet.

Shannon's statement

Let X be a random variable taking values in some finite alphabet Σ1 and let f be a decipherable code from Σ1 to Σ2 where |\Sigma_2^*|=a. Let S denote the resulting wordlength of f(X).

If f is optimal in the sense that it has the minimal expected wordlength for X, then

 \frac{H(X)}{\log_2 a} \leq \mathbb{E}S < \frac{H(X)}{\log_2 a} +1

Proof

Let si denote the wordlength of each possible xi (i=1,\ldots,n). Define q_i = a^{-s_i}/C, where C is chosen so that  \sum q_i = 1.

Then


\begin{matrix}
H(X) &=& - \sum_{i=1}^n p_i \log_2 p_i \\
&& \\
&\leq& - \sum_{i=1}^n p_i \log_2 q_i \\
&& \\
&=& - \sum_{i=1}^n p_i \log_2 a^{-s_i} + \sum_{i=1}^n \log_2 C \\
&& \\
&\leq&  - \sum_{i=1}^n - s_i p_i \log_2 a \\
&& \\
&\leq&  - \mathbb{E}S \log_2 a \\
\end{matrix}

where the second line follows from Gibbs' inequality and the third line follows from Kraft's inequality: C = \sum_{i=1}^n a^{-s_i} \leq 1 so \log C \leq 0.

For the second inequality we may set

s_i = \lceil - \log_a p_i \rceil

so that

 - \log_a p_i \leq s_i < -\log_a p_i + 1

and so

 a^{-s_i} \leq p_i

and

 \sum  a^{-s_i} \leq \sum p_i = 1

and so by Kraft's inequality there exists a prefix-free code having those wordlengths. Thus the minimal S satisfies


\begin{matrix}
\mathbb{E}S & = & \sum p_i s_i \\
&& \\
& < & \sum p_i \left( -\log_a p_i +1 \right) \\
&& \\
& = & \sum - p_i \frac{\log_2 p_i}{\log_2 a} +1 \\
&& \\
& = & \frac{H(X)}{\log_2 a} +1 \\
\end{matrix}