1-exp-family
Exponential Family Basics
[!note]
Though the results below are mainly with regards to the finite support case, they also apply to infinite support case after switching to differential entropy.
[!important]
In fact, we are restricting the discussion to the natural exponential family.
Example and Definition
Recall the PDF of univariate Gaussian distribution: \[ p_{\mathcal{N}(\mu, \sigma^2)}(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{-\frac{(x-\mu)^2}{2 \sigma^2}} \] It can be rewritten as \[ p_{\mathcal{N}(\mu, \sigma^2)}(x) = \exp(-\frac{1}{2 \sigma^2} x^2 + \frac{\mu}{\sigma^2} x - \frac{\mu}{2 \sigma^2} - \frac{1}{2} \log(2 \pi \sigma^2)) \] Let the mapping \(\phi(x) = [x^2, x]\) and the parameter vector \(\theta = [-\frac{1}{2 \sigma^2}, \frac{\mu}{\sigma^2}]\). Then for a properly-chosen function \(A: \R^2 \mapsto \R\) we have \[ p_{\mathcal{N}(\mu, \sigma^2)}(x) = \exp(\theta^T \phi(x) - A(\theta)) \] That is \[ p_{\mathcal{N}(\mu, \sigma^2)}(x) \propto \exp(\theta^T \phi(x)) \]
Definition: exponential family. Given a feature function \(\phi: \mathcal{X} \mapsto \R^m\) and an \(m\)-dimensional parameter vector \(\theta \in \R^m\), an exponential family is defined as the set \(\mathcal{P} \triangleq \{ p_\theta: \theta \in \R^m \}\) where for a function \(A: \R^m \mapsto \R\) the mass (or the density) function \(p_\theta\) is defined as: \[ p_\theta(x) = \exp(\theta^T \phi(x) - A(\theta)) \] In the above definition,
- we call \(\phi: \R^d \mapsto \R^m\) the feature function;
- we call \(\theta \in \R^m\) the canonical parameters;
- we call \(A: \R^m \mapsto \R\) the log-partition function.
The underlying max-entropy assumption of exponential family seems to be the reason why we usually use softmax in the last layer in the neural network model.
Properties
Convexity
Proposition: derivation of log-partition function in discrete case. Given an exponential family with feature function \(\phi: \mathcal{X} \mapsto \R^m\) over a finite support set \(\mathcal{X}\), canonical parameters \(\theta \in \R^m\), the log partition function \(A: \R^m \mapsto \R\) can be determined as \[ A(\theta) = \log \left( \sum_{x \in \mathcal{X}} e^{\theta^T \phi(x)} \right) \]
Proposition: derivatives of log-partition function. Consider an exponential family with feature function \(\phi: \mathcal{X} \mapsto \R^m\) over a finite support set \(\mathcal{X}\), canonical parameters \(\theta \in \R^m\), and the log partition function \(A: \R^m \mapsto \R\) . Then, the first- and second-order derivatives of \(A\) satisfy:
The gradient of \(A\) will be the mean of random vector \(\phi(X)\): \[ \nabla A(\theta) = \E_{X \sim p_\theta} [\phi(x)] \]
The Hessian of \(A\) will be the covariance matrix of random vector \(\phi(X)\):?? \[ \nabla^2 A(\theta) = \Cov_{X \sim p_\theta}(\phi(X)) \]
Corollary: convexity of log-partition function. The log-partition function of an exponential family distribution is a convex function.
Theorem: monotone gradient of convex function. A differentiable function \(f: \R^d \mapsto \R\) is convex if and only if its gradient \(\nabla f: \R^d \mapsto \R^d\) is monotone, i.e. \[ \forall x, y \in \R^d, (x-y)^T(\nabla f(x) - \nabla f(y)) \ge 0 \] Corollary: monotone gradient of mean vector. The mean vector \(\mu_\theta = \E_{X \sim p_\theta} [\phi(X)]\) of an exponential family model \(p_\theta\) is a monotone function of the canonical parameter \(\theta\), i.e. \[ \forall \theta_1, \theta_2 \in \R^d, (\theta_1 - \theta_2)^T (\nabla A(\theta_1) - \nabla A(\theta_2)) \ge 0 \]
Strong Convexity
Convexity is not enough for fruitful derivation. Below we introduce the idea of strong convexity.
Definition: strongly convex function. We call a twice-differentiable \(f: \R^d \mapsto \R\) \(\mu\)-strongly-convex if the eigenvalues of its Hessian are always greater than or equal to \(\mu > 0\).
Proposition: invertibility of the gradient of strongly convex function. The gradient of a \(\mu\)-strongly-convex function is an invertible mapping and satisfies the following: \[ \forall x, y \in \R^d, (\nabla f(x) - \nabla f(y))^T (x-y) \ge \mu \|x-y\|_2^2 \] The implication is that, it is impossible for \(x \ne y\) to have the same gradient, verifying that \(\nabla f\) is invertible.
Estimation of Exponential Family
Maximum Likelihood Estimation
In this section, we assume a strong convexity on the log-partition function. That is, we can relate the canonical parameter and the mean vector: \[ \mu = \nabla A(\theta), \theta = (\nabla A)^{-1}(\mu) \] Suppose we observe independent and identically distributed samples \(x_1, \dots, x_n \sim p_\theta\). How do we estimate \(\theta\)? The most natural estimation is the maximum (log-)likelihood estimation. We try to derive it for exponential family in the following. \[ \theta^\text{MLE} = \arg \max_\theta l(\theta) = \log \left[ \prod_{i=1}^n p_\theta(x_i) \right] \] MLE is a good fit for the exponential family because of the \(\exp\) term in the density function. \[ \begin{aligned} &\theta^\text{MLE} = \arg \max_\theta \sum_{i=1}^n \log p_\theta(x_i) \\ &= \arg \max_\theta \sum_{i=1}^n [\theta^T \phi(x_i) - A(\theta)] \\ &= \arg \max_\theta \frac{1}{n} \sum_{i=1}^n [\theta^T \phi(x_i) - A(\theta)] \\ &= \arg \max_\theta \theta^T \underbrace{\left[ \frac{1}{n} \sum_{i=1}^n \phi(x_i) \right]}_{\hat \mu} - A(\theta) \\ &= \arg \min_\theta A(\theta) - \theta^T \hat \mu \end{aligned} \] It turns out to be a standard unconstrained convex optimization problem. The optimal solution would be to zero out the gradient: \[ \begin{aligned} \nabla A(\theta^\text{MLE}) - \hat \mu &= 0 \\ \theta^\text{MLE} &= (\nabla A)^{-1} (\hat \mu) \end{aligned} \] As a result, the resulting \(\mu^\text{MLE} \triangleq \E_{\theta^\text{MLE}} [\phi(X)] = \nabla A(\theta^\text{MLE})\) will be exactly the empirical mean \(\hat \mu\). Suppose \(\phi(X)\) has the covariance matrix \(\Sigma\). By the central limit theorem, \[ \sqrt{n} (\hat u - \E[\phi(X)]) \stackrel{d}{\to} N(0, \Sigma) \] But what we are interested in is the asymptotic performance of the estimation of \(\theta\), which is \(\theta^\text{MLE}\). By the delta method, for \(\theta^\text{MLE}\) which is a function of \(\hat \mu\), we have \[ \sqrt{n} (\theta^\text{MLE} - \theta) \stackrel{d}{\to} N(0, \Sigma^{-1}) \]
Method of Moments
Definition: method of moments. Given a parameterized family of distributions \(\{ p_\theta: \theta \in \R^d \}\), the method of moments estimator finds the parameter \(\hat \theta\) that matches the population moments with the empirical moments of i.i.d. samples \(x_1, \dots, x_n\). Let \(\phi\) be the moment function. Hence, \(\hat \theta\) must satisfy \[ \E_{\hat \theta} [\phi(X)] = \frac{1}{n} \sum_{i=1}^n \phi(x_i) \]
Proposition: equivalence of method of moments and maximum likelihood estimation. Given an exponential family \(\{ p_\theta: \theta \in \R^d \}\) with feature function \(\phi\), the method of moments estimator with \(\phi\)-based moments results in the same estimator as maximum likelihood estimator \(\theta^\text{MLE}\).
Proof. Recall that we have shown \(\mu^\text{MLE} = \hat \mu\) in the previous note. Additionally, due to the invertibility of \(\nabla A(\hat \theta) = \E_{\hat \theta} [\phi(X)]\) (thanks to Yiyao :D), the solution to method of moments is unique. As a result, the two methods are equivalent.
Principle of Maximum Entropy
The above develops the method of moments for the exponential family, making it a model-based approach. What if the distribution is not coming from an exponential family? In such case, we follow the principle of maximum entropy.
Definition: principle of maximum entropy. Given a set of probability distributions \(M\) (e.g. using the method of moments), conduct the inference and base the decision on the distribution maximizing the entropy function: \[ \arg \max_{q \in M} H_q(X) \triangleq \sum_{x \in \mathcal{X}} q(x) \log \frac{1}{q(x)} \]
Recall in the method of moments we are restricting ourselves to \[ M_\phi \triangleq \{ q \in \mathcal{P}_\mathcal{X}: \E_q [\phi(X)] = \frac{1}{n} \sum_{i=1}^n \phi(x_i) \} \] We stick to the method of moments and the principle of maximum entropy. Therefore, the optimization problem becomes \[ \begin{aligned} \min_q \quad & \sum_{x \in \mathcal{X}} q_x \log q_x \\ \text{s.t.} \quad & \sum_{x \in \mathcal{X}} q_x \phi(x) = \underbrace{\frac{1}{n} \sum_{i=1}^n \phi(x_i)}_{\hat \mu} \\ & \sum_{x \in \mathcal{X}} q_x = 1 \\ & q_x \ge 0, \forall x \in \mathcal{X} \end{aligned} \] Note that we use \(q_x\) to indicate that \(q\) is a probability vector on a finite support. We proceed with ignoring the non-negativity constraint and show that relaxed solution is still the optimal to the original. Suppose \(\phi(x)\) is \(k\)-dimensional. Denote as \(\gamma \in \R^k, \eta \in \R\) the Lagrangian multipliers. The Lagrangian function is as follows: \[ \mathcal{L}(q, \gamma, \eta) = \left[ \sum_{x \in \mathcal{X}} q_x (\log (q_x) + \gamma^\top \phi(x) + \eta) \right] - \gamma^\top \hat \mu - \eta \] We apply the KKT conditions (verify that the regularity condition holds) to give \[ \begin{gathered} 1 + \log (q_x) + \gamma^\top \phi(x) + \eta = 0, \forall x \in \mathcal{X} \\ \Downarrow \\ q_x^* = \exp[-(1 + \gamma^\top \phi(x) + \eta)] \ge 0 \end{gathered} \] From above we know \(q_x^* \propto \exp(-\gamma^\top \phi(x))\) and as a result \[ q_x^* = \frac{\exp(-\gamma^\top \phi(x))}{\sum_{x' \in \mathcal{X}} \exp(-\gamma^\top \phi(x'))} \] Now we substitute the primal optima back to the Lagrangian function to give \[ \mathcal{L}(q^*, \gamma, \eta) = -\log \left( \sum_{x \in \mathcal{X}} \exp(-\gamma^\top \phi(x)) \right) - \gamma^\top \hat \mu \] Hence, we can formulate the dual problem as \[ \begin{aligned} \max_{\gamma \in \R^k} \quad & -\log \left( \sum_{x \in \mathcal{X}} \exp(-\gamma^\top \phi(x)) \right) - \gamma^\top \hat \mu \end{aligned} \] After rewriting \(-\gamma\) as \(\theta\), this exactly recovers the maximum likelihood estimation problem for an exponential family with the feature function \(\phi\).
Corollary: maximum entropy as the dual to maximum likelihood. In a set of distribution \(M_\phi \triangleq \{ q: \E_q [\phi(X)] = \mu \}\), the distribution \(q^*\) that maximizes the entropy will be from an exponential family with feature function \(\phi\). That is, for some \(\theta\) we will have \[ q^*(x) = \exp(\theta^\top \phi(x) - A(\theta)) \] In addition, if \(\mu\) is the empirical mean of sample, the maximum entropy problem over \(M_\phi\) is the dual optimization problem to the maximum likelihood estimation for the exponential family with feature function \(\phi\).
We conclude the magic powers of exponential family here. With appropriate conditions,
- exponential family + method of moments = maximum likelihood estimation
- principle of max entropy + method of moments = exponential family
Examples
What distribution in the set \(M_\phi \triangleq \{ q_x: \E_q[X] = \mu \}\) will maximize the entropy of an integer-valued random variable \(X \in \N\) with a fixed mean value?
The problem is \[ \begin{aligned} \min_q \quad & \sum_{x \in \mathcal{X}} q_x \log q_x \\ \text{s.t.} \quad & \sum_{x \in \mathcal{X}} q_x x = \mu \\ & \sum_{x \in \mathcal{X}} q_x = 1 \\ & q_x \ge 0, \forall x \in \mathcal{X} \end{aligned} \] We write down the Lagrangian function: \[ \begin{aligned} \mathcal{L}(\theta, \lambda) &= \sum_{x \in \mathcal{X}} q_x \log q_x + \lambda_1 (\sum_{x \in \mathcal{X}} q_x - 1) + \lambda_2 (\sum_{x \in \mathcal{X}} x q_x - \mu) \\ &= \sum_{x \in \mathcal{X}} [(\log q_x + \lambda_1 + \lambda_2 x) q_x] - \lambda_1 - \lambda_2 \mu \end{aligned} \]
Take its derivative w.r.t. \(p\) and make it zero to give \[ \begin{aligned} 0 &= 1 + \log q_x + \lambda_1 + \lambda_2 x \\ q_x &= \exp[-(1 + \lambda_1 + \lambda_2 x)] \end{aligned} \] Consider the normalization constraint to give \[ \begin{gathered} q_x = \frac{\exp(-\lambda_2 x)}{\sum_{x'=1}^\infty \exp(-\lambda_2 x')} \\ \exp(-1 - \lambda_1) = \sum_{x'=1}^\infty \exp(-\lambda_2 x') \end{gathered} \]
For \(\lambda_1\) to converge, \(\lambda_2\) has to be positive. As a result, \[ \begin{aligned} &\exp(-1 - \lambda_1) = \sum_{x'=1}^\infty \exp(-\lambda_2 x') \\ &= \exp(-\lambda_2) \frac{1}{1 - \exp(-\lambda_2)} \\ &= \frac{1}{\exp(\lambda_2) - 1} \\ \end{aligned} \] Consider the expectation constraint to give \[ \begin{aligned} &\E_p[X] = \frac{\sum_{x=1}^\infty x \exp(-\lambda_2 x)}{\sum_{x'=1}^\infty \exp(-\lambda_2 x')} \\ &= \frac{\exp(-\lambda_2) / [1-\exp(-\lambda_2)]^2}{\exp(-\lambda_2) / [1-\exp(-\lambda_2)]} \\ &= \frac{1}{1-\exp(-\lambda_2)} = \mu \end{aligned} \]
Hence, \[ \begin{aligned} & \lambda_2 = -\log(1-\frac{1}{\mu}) \\ & \lambda_1 = -\log(\mu-1) - 1 \\ & q_x = \frac{(1-\frac{1}{\mu})^x}{\mu-1} = \frac{1}{\mu} (1-\frac{1}{\mu})^{x-1} \end{aligned} \]
What distribution in the set \(M_\phi \triangleq \{ q: \E_q[X] = \mu, \Cov_q[X] = \Sigma \}\) will maximize the entropy of a random vector \(X \in \R^d\) with a fixed mean vector and covariance matrix??
This is the first attempt to extend the discussion to a continuous case. We know that \(q^*\) is in the form of \[ q^*(x) = \exp(\theta^\top \phi(x) - A(\theta)) \] Note that \[ \begin{aligned} \Cov[X] &= \E[(X-\mu)(X-\mu)^T] \\ \Cov[X] &= \E[X X^T] - \E[X] \E[X]^T \\ \E[X X^T] &= \Cov[X] + \E[X] \E[X]^T \\ \end{aligned} \] Therefore, we can choose our feature function as \[ \phi(x) = [x_1, \dots, x_d, x_1 x_1, \dots, x_1 x_d, x_2 x_2, \dots, x_2 x_d, \dots, x_d x_d] \]
In a simpler way, we know \(q^*(x) = \exp(\theta^T \phi(x) - A(\theta))\) where \(\theta \in \R^{d^2+d}\). This form of exponential family is exactly the Gaussian distribution: \[ q^*(x) = \frac{\exp[-\frac{1}{2}{(x-\mu)}^\top \Sigma^{-1} (x-\mu)]}{\sqrt{|2\pi \Sigma|}} \]
Consider the normalization constraint to give \[ \int \exp(A(\theta)) = 1 \]
Maximizing Conditional Entropy
Consider the prediction problem where we want to predict the label variable from the feature variable. Denote as \(P_{X,Y}\) the joint distribution of \((X,Y)\). The question is how to determine the prediction rule \(f: \mathcal{X} \mapsto \mathcal{Y}\)?
If \(P_{X,Y}\) is known…
This would depend on the loss function. If the loss function is the squared error, the MMSE estimator gives \[ f^*(x) = \E[Y|X=x] \]
If \(P_{X,Y}\) is unknown and we know that \(P_{X, Y}\) belongs to the family \(M_\phi \triangleq \{ Q_{X,Y}: \E_Q[\phi(X, Y)] = \mu \}\)…
In this case, we can further apply the principle of maximum conditional entropy. But why not maximize the joint entropy \(H(X, Y)\) as done in previous discussion? In fact, in this case, the two approaches are equivalent due to the following relationship: \[ H(X,Y) = H(Y|X) + H(X) \] Since \(H(X)\) is fixed/deterministic given the samples (because we can directly learn \(p_X\) at least in a non-parametric way from samples), we can resort to maximizing \(H(Y|X)\) to slim down the formula. Note that even though we have been given the samples from \(p_{X,Y}\), we can still adjust \(p_{Y|X}\) to maximize \(H(Y|X)\) (because can’t learn \(p_{Y|X=x}\) for every \(x \in \mathcal{X}\) from samples). From another perspective, in prediction task, what we are interested in is the \(p_{Y|X}\) instead of \(p_X\).
Definition: principle of maximum conditional entropy. Given the set of probability distributions \(M\) on \((X,Y)\), base the prediction of \(Y\) from \(X\) on the distribution maximizing the conditional entropy \(H(Y|X)\): \[ \arg \max_{q \in M} H_q(Y|X) \triangleq \sum_{x \in \mathcal{X}} q(x, y) \frac{1}{q(x|y)} \]
Theorem: maximum conditional entropy and logistic regression. The conditional distribution \(q_{Y|X}^*\) chosen from \(M_\phi \triangleq \{ Q_{X,Y}: \E_Q[Y \phi(X)] = \mu \}\) (i.e., “\(\phi(X,Y)\)” is chosen to be in the form of \(Y \phi(X)\)) that results in the maximum conditional entropy will follow a logistic regression model for some vector \(\theta\): \[ q_{Y|X}^*(y|x) = \frac{\exp(y \theta^\top \phi(x))}{\sum_{y' \in \mathcal{Y}} \exp(y' \theta^\top \phi(x))} \]