Symbol Code
Other than block code where symbols are encoded in chunks, symbol code will assign each symbol a unique codeword. Among the codeword schemes, we prefer those where no codeword is a prefix of any other codeword. These are called prefix code.
The problem is how to give such scheme, or does it really exist?
Kraft-McMillan Inequality
Kraft-McMillan inequality reveals the relation
Denote the length of each symbol code as \(l_i\) and suppose there are \(I\) symbols. If \(\sum_{i=1}^I 2^{-l_i} \le 1\), then there exists a set of uniquely-decodable prefix coding with these lengths as their symbol codes’ lengths.
Proof: The proof is done by construction. The number of codes of length \(l\) should be less than \(2^{l}\), or else the inequality will be violated. Therefore \(\forall l = 0,1,\dots\), we can loosely arrange all the codes of length \(l\) to be unique (since there are \(2^l\) many distinct bit strings of length \(l\)). Then the uniqueness condition is checked.
Denote the number of codes of length \(l\) by \(C_l\). For any two consecutive lengths \(l\) and \((l+1)\), we have \[ \begin{aligned} 2^{-l} C_l + 2^{-(l+1)} C_{l+1} &\le\sum_{i=1}^I 2^{-l_i}\le 1 \\ C_{l+1} &\le 2^{l+1} - 2 C_l \\ C_{l+1} &\le 2(2^l - C_l) \\ \end{aligned} \] This means we can append these unused \((2^l - C_l)\) codes of length \(l\) with \(0\) and \(1\) to suit the number of codes of length \(l+1\). Construction completes.
Suppose we have a set of uniquely-decodable prefix coding. Denote the length of each symbol code as \(l_i\) and there are \(I\) symbols. Then, \[ \sum_{i=1}^I 2^{-l_i} \le 1 \]
- Proof: Let \(S = \sum_{i=1}^I 2^{-l_i} \le 1\), then, \[ \begin{aligned} S^N &= (\sum_{i=1}^I 2^{-l_i})^N \\ &= \sum_{i_1=1}^I \dots \sum_{i_N=1}^I 2^{-(l_{i_1} + \dots + l_{i_N})} \end{aligned} \] The \((l_{i_1} + \dots + l_{i_N})\) term can be treated as the length of encoding of \(a_{i_1} \dots a_{i_N}\) of arbitrary length \(N\). Let \(l_\min = \min_i l_i, l_\max = \max_i l_i\), the above can be re-written as \[ S^N = \sum_{l = Nl_\min}^{Nl_\max} 2^{-l}C_l \] where \(C_l\) represents the number of symbol codes of length \(l\). Since the coding is uniquely-decodable, \(C_l \le 2^l\). Therefore, \[ S^N \le \sum_{l = Nl_\min}^{Nl_\max} 2^{-l} 2^l \le Nl_\max \] If \(S > 1\), the above cannot hold for arbitrary \(N\). Therefore \(S \le 1\).
Source Coding Theorem for Symbol Code
For an ensemble \(X\), there exists a prefix code \(C\) with expected length satisfying \[ H(X) \le L(C,X) < H(X) + 1 \]
Proof: We define the implicit probabilities \(q_i = 2^{-l_i} / z\) where \(z = \sum_i 2^{-l_i}\). Then, \[ \begin{aligned} L(C,X) &= \sum_i p_i l_i = -\sum_i [p_i \log (q_iz)] \\ &=\sum_i [p_i \log 1/q_i] - \log z \\ &\ge H(X) \end{aligned} \] The equality holds when \(z = 1\) (the code is complete) and \(q = p\) (\(l_i = \log 1/p_i\)).
From another perspective, suppose the coding is complete but not optimal, \[ \begin{aligned} L(C,X) &= -\sum_i [p_i \log (q_iz)] = -\sum_i [p_i \log (q_i)] \\ &= -\sum_i [p_i \log (p_i)] + \sum_i [p_i \log (p_i)] -\sum_i [p_i \log (q_i)] \\ &= H(X) + D_{KL}(p || q) \end{aligned} \] where the cost is the extra \(D_{KL}(p || q)\) bits, which is brought by instead treating \(q\) as the real distribution. \(D_{KL(p||q)}\) is termed as relative entropy or the KL-divergence.