Source Coding Theorem
Notations and Concepts
An ensemble \(X\) (we specifically use the random variable symbol to denote the ensemble) is a triplet \((X, \newcommand{A}{\mathcal A} \newcommand{P}{\mathcal P} \A_X, \P_X)\), where \(X\) denotes the random variable, which takes on values from \(\A_X = \{a_1, a_1, \dots \}\), that has probability \(\P_X = \{p_1, p_2, \dots \}\).
The Shannon information content of an outcome \(x\) is: \[ h(x) = \log_2 \frac{1}{P(x)} \] The raw bit content of \(X\) is \[ H_0(X) = \log |\mathcal A_X| \] The smallest \(\delta\)-sufficient subset \(S_\delta\) of \(\mathcal A_X\) is the smallest subset satisfying \[ P(X \in S_\delta) \ge 1 - \delta \] The essential bit content of \(X\) is \[ H_\delta(X) = \log |S_\delta| \]
Shannon’s Source Coding Theorem
Let \(X\) be an ensemble with entropy \(H(X) = H\) bits, and \(X^N\) be the ensemble composed of \(N\) such i.i.d. random variables. Given \(\epsilon > 0\) and \(0 < \delta < 1\), there exists a positive integer \(N_0\) such that for \(N > N_0\), \[ |\frac{1}{N} H_\delta (X^N) - H| < \epsilon \]
Or put it in a verbal way,
\(N\) i.i.d. random variables each with entropy \(H(X) = H\) can be compressed into more than \(NH\) bits with negligible information loss as \(N \to \infty\); conversely if they are compressed fewer than \(NH\) bits, it is virtually certain that there is information loss.
Proof: Let the random variable \(\frac 1 N \log \frac 1 {P(Y)}\) be defined for the ensemble \(Y = X^N\), which composes of \(N\) i.i.d. ensembles \(X_1 \dots X_N\). \(\frac 1 N \log \frac 1 {P(Y)}\) can be re-written as the average of \(N\) information contents \(h_i = \log \frac 1 {P(X_i)}, i \in 1,2,\dots,N\): \[ \frac 1 N \log \frac 1 {P(Y)} = \frac 1 N \log \frac 1 {P(X_1) \dots P(X_N)} = \frac 1 N (\log \frac{1}{P(X_1)} + \dots + \log \frac{1}{P(X_N)}) \] Each of these information contents is in turn a random variable with mean \(\bar h_i = H(X) = H\) and variance \(\sigma_{h_i}^2 = \sigma^2\).
A long string of \(N\) symbols would usually contain roughly \(p_1 N\) occurrences of symbol \(a_1\), \(p_2 N\) occurrences of symbol \(a_2\)… The probability of those elements is roughly \((p_1)^{p_1 N} (p_2)^{p_2 N} \dots\) The information content of each such element is thus roughly \(N \sum_i p_i \log \frac{1}{p_i} = N H\).
We define the typical elements of \(\mathcal{A}_X^N\) to be just those element that have probability close to \(2^{-NH}\). Note that interestingly, the most probable string is not usually typical because its probability is far away from \(2^{-NH}\).
We introduce another parameter \(\beta\) to control how close the probability has to be to \(2^{-NH}\) for an element to be typical. The set of typical elements is called typical set \(T_{N \beta}\) and is defined as \[ T_{N \beta} = \left\{y \in \mathcal A_X^N: [\frac 1 N \log \frac{1}{P(y)} - H]^2 < \beta^2 \right\} \] By the Weak Law of Large Numbers, \[ P \left( \left( \frac 1 N \log \frac{1}{P(Y)} - H \right)^2 \ge \beta^2 \right) \le \frac{\sigma^2}{\beta^2 N} \] and thus \[ P(Y \in T_{N \beta}) \ge 1 - \frac{\sigma^2}{\beta^2 N} \] As \(N\) increases, the probability that \(y\) falls in \(T_{N \beta}\) draws near to \(1\). We need to relate this to the theorem that for any given \(\epsilon, \delta\), there is a sufficiently-large \(N\) such that \(H_\delta(X^N) \simeq NH\).
\(H_\delta(X^N) < N(H + \epsilon)\)
The set \(T_{N \beta}\) is not the best sufficient subset for compression (because it doesn’t include those most probable). Therefore, \(\log |T_{N \beta}|\) upper-bounds the \(H_\delta(X^N)\) for any \(\delta\). On the other hand, for all \(y \in T_{N \beta}\), \(P(y) > 2^{-N(H + \beta)}\), thus \[ \begin{align*} |T_{N \beta}| 2^{-N(H + \beta)} &< \sum_{y \in T_{N \beta}} P(y) < 1 \\ &\Downarrow \\ |T_{N \beta}| &< 2^{N(H + \beta)} \\ \end{align*} \] If we set \(\beta = \epsilon\) and set \(N \ge N_0\) in a way such that \(\frac{\sigma^2}{\epsilon^2 N_0} \le \delta\) and thus \(P(Y \in T_{N \epsilon}) \ge 1 - \delta\), \(T_{N \epsilon}\) is a \(\delta\)-sufficient subset. Then, \(H_\delta(X^N) \le \log |T_{N \epsilon}| \le N(H + \epsilon)\) holds for any \(N \ge N_0\).
\(H_\delta(X^N) > N(H - \epsilon)\)
This part is reached by contradiction. Suppose instead there exists a \(\delta'\) such that there exists a sufficiently large \(N_0\) which results in \(H_\delta(X^{N}) \le N(H - \epsilon)\) for arbitrary \(\epsilon > 0\) and \(N > N_0\). Let \(\beta = \epsilon/2\), we now have \[ H_\delta(X^N) \le N_0(H - 2\beta) \] Denote the associated subset by \(S'\). We are to disprove \(S'\) with \(|S'| \le 2^{N(H - 2\beta)}\) can achieve \(P(Y \in S') \ge 1 - \delta\). We break this probability into two parts: \[ P(Y \in S') = P(Y \in S' \cap T_{N \beta}) + P(Y \in S' \cap \overline{T_{N \beta}}) \] For the first part, we have \[ |S' \cap T_{N \beta}| \le |S'| \le 2^{N(H - 2\beta)} \] Thus, the maximum value of the first part is obtained when \(S' \cap T_{N \beta}\) contains \(2^{N(H - 2\beta)}\) outcomes all with probability \(2^{-N(H - \beta)}\) (property of elements in \(T_{N \beta}\)).
As for the second part, since \(S' \cap \overline{T_{N \beta}} \subseteq \overline{T_{N \beta}}\), \[ P(Y \in S' \cap \overline{T_{N \beta}}) \le P(Y \in \overline{T_{N \beta}}) < \frac{\sigma^2}{\beta^2 N} \] Therefore, \[ P(Y \in S') \le 2^{N(H - 2\beta)} 2^{-N(H - \beta)} + \frac{\sigma^2}{\beta^2 N} = 2^{-N \beta} + \frac{\sigma^2}{\beta^2 N} \] For arbitrary \(\delta\) and a sufficiently-large \(N\), we can have \(P(Y \in S') \le 1 - \delta\) instead of \(P(Y \in S') \ge 1 - \delta\). Then we shall conclude that \(S'\) with \(|S'| \le 2^{N(H - 2\beta)}\) cannot achieve \(P(Y \in S') \ge 1 - \delta\) and thus \(H_\delta(X^N) > N(H - \epsilon)\).
The intuition behind Shannon’s source coding theorem is that, as \(N\) grows, the deviation of the random variable may grow in a lower order than that of the number of values the random variable can take on. Therefore, the resulting outcomes will fall in a narrower range, making the typical set smaller. Encoding the elements inside the typical set is almost enough for the purpose of communication.