Noise Contrastive Estimation

Three Perspectives on NCE

Non-parametric estimation

The traditional log-likelihood objective is \(\ell = \sum_x \ln p_\theta(x)\), which requires \(p_\theta\) to be normalized. In NCE, we instead train a binary classifier to distinguish data from noise: \[ p_\theta(1|x) = \sigma(G(x;\theta) - \gamma) = \frac{1}{1 + e^{-G(x;\theta) + \gamma}} \] where \(G(x;\theta)\) is an unnormalized score and \(\gamma\) is a learnable scalar. With \(\tilde p(x)\) the data distribution and \(q(x)\) a known noise distribution, the corresponding loss is \[ \begin{aligned} \mathcal{L} &= -\E_{x \sim \tilde p(x)} \ln p_\theta(1|x) - \E_{x \sim q(x)} \ln p_\theta(0|x) \\ &= -\int \tilde p(x) \ln p_\theta(1|x) dx - \int q(x) \ln p_\theta(0|x) dx \\ &= - \int [\tilde p(x) + q(x)] \Big[\frac{\tilde p(x)}{\tilde p(x) + q(x)} \ln p_\theta(1|x) + \frac{q(x)}{\tilde p(x) + q(x)} \ln p_\theta(0|x)\Big]dx \end{aligned} \]
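Before continuing the derivation, here is a literal translation of this classifier and loss into Python (a minimal sketch; `G_data`, `G_noise`, and `gamma` are assumed names for precomputed scores and the learnt offset, not from any library):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def nce_posterior(G_x, gamma):
    # p_theta(1|x) = sigma(G(x; theta) - gamma)
    return sigmoid(G_x - gamma)

def nce_loss(G_data, G_noise, gamma):
    # Monte-Carlo estimate of
    # -E_{x ~ p_tilde} ln p_theta(1|x) - E_{x ~ q} ln p_theta(0|x)
    loss_data = -np.mean(np.log(nce_posterior(G_data, gamma)))
    loss_noise = -np.mean(np.log(1.0 - nce_posterior(G_noise, gamma)))
    return loss_data + loss_noise
```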

Let \(P(y|x) = \begin{cases}\frac{\tilde p(x)}{\tilde p(x) + q(x)}, & y=1 \\ \frac{q(x)}{\tilde p(x) + q(x)}, & y=0\end{cases}\). Then \[ \label{loss} \begin{aligned} \arg \min_{\theta, \gamma} \mathcal{L} &= \arg \min_{\theta, \gamma} -\int [\tilde p(x) + q(x)]\Big[\frac{\tilde p(x)}{\tilde p(x) + q(x)} \ln p_\theta(1|x) + \frac{q(x)}{\tilde p(x) + q(x)} \ln p_\theta(0|x)\Big]dx \\ &= \arg \min_{\theta, \gamma} -\int [\tilde p(x) + q(x)][P(1|x) \ln p_\theta(1|x) + P(0|x)\ln p_\theta(0|x)]dx \\ &= \arg \min_{\theta, \gamma} \int [\tilde p(x) + q(x)]\Big[P(1|x) \ln \frac{1}{p_\theta(1|x)} + P(0|x)\ln \frac{1}{p_\theta(0|x)}\Big]dx \\ &= \arg \min_{\theta, \gamma} \int [\tilde p(x) + q(x)] H[P(y|x)||p_\theta(y|x)]dx \end{aligned} \]

Since cross entropy satisfies \(H(p||q) \ge H(p)\), with equality only when \(p = q\), the global minimum of equation \(\eqref{loss}\) is attained when \(p_\theta(y|x) = P(y|x)\). Therefore, \[ \begin{aligned} p_\theta(1|x) = \frac{1}{1 + e^{-G(x;\theta) + \gamma}} &= \frac{\tilde p(x)}{\tilde p(x) + q(x)} = P(1|x) \\ \frac{q(x)}{\tilde p(x)} &= e^{-G(x;\theta) + \gamma} \\ \tilde p(x) &= \frac{q(x) e^{G(x;\theta)}}{e^\gamma} \end{aligned} \] That is, \(\theta\) and \(\gamma\) are learnt so that \(q(x) e^{G(x;\theta) - \gamma}\) fits the data distribution. This is most intuitive when \(q\) is a uniform distribution with constant density \(U\): then \(U e^{G(x;\theta)}\) is an unnormalized model of \(\tilde p\), and \(e^\gamma\) is the learnt normalizing factor.
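To make this concrete, here is a runnable toy sketch (all names and hyperparameters are illustrative assumptions, not from the linked post): it fits an unnormalized Gaussian score \(G\) and the offset \(\gamma\) against uniform noise, after which \(U e^{G(x)-\gamma}\) approximates the true density and \(e^\gamma \approx U \sigma \sqrt{2\pi}\) recovers the normalizer.

```python
import torch

torch.manual_seed(0)

# Data: N(1, 0.5^2); noise: uniform on [-4, 6], so q(x) = U = 0.1.
a, b = -4.0, 6.0
U = 1.0 / (b - a)
x_data = 1.0 + 0.5 * torch.randn(20000)
x_noise = a + (b - a) * torch.rand(20000)  # equal data/noise counts

# Unnormalized score G(x; theta) with theta = (mu, log_sigma), plus gamma.
mu = torch.zeros(1, requires_grad=True)
log_sigma = torch.zeros(1, requires_grad=True)
gamma = torch.zeros(1, requires_grad=True)
G = lambda x: -0.5 * ((x - mu) / log_sigma.exp()) ** 2  # no normalizer

bce = torch.nn.functional.binary_cross_entropy_with_logits
opt = torch.optim.Adam([mu, log_sigma, gamma], lr=0.05)
for _ in range(2000):
    opt.zero_grad()
    logit_d, logit_n = G(x_data) - gamma, G(x_noise) - gamma
    loss = bce(logit_d, torch.ones_like(logit_d)) + \
           bce(logit_n, torch.zeros_like(logit_n))
    loss.backward()
    opt.step()

# exp(gamma) should approach U * sigma * sqrt(2*pi) ~= 0.1253
print(mu.item(), log_sigma.exp().item(), gamma.exp().item())
```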

Reference: https://kexue.fm/archives/5617/comment-page-1

Maximum likelihood estimation (the original paper’s view)

The model is defined as \(\ln p_\theta(x) = \ln p^0_\alpha(x) + c\), where \(p^0_\alpha\) is an unnormalized model and \(c\) estimates the negative log normalizing constant, so \(\theta = (\alpha, c)\). NCE maximizes the classification log-likelihood \[ \begin{aligned} J_T(\theta) &= \frac{1}{T_d} \Big[\sum_{i=1}^{T_d} \ln h(x_i;\theta) + \sum_{i=1}^{T_n} \ln \big(1 - h(y_i;\theta)\big)\Big] \text{, ($x_i$'s are data samples, $y_i$'s are noise samples, $T_n = \nu T_d$)} \\ &\stackrel{P}{\to} J(\theta) \triangleq \E_{x \sim \tilde p} \ln h(x;\theta) + \nu \E_{x \sim q} \ln (1 - h(x;\theta)) \quad \text{as } T_d \to \infty \end{aligned} \] where \[ \begin{gather} r_\nu(u) = \frac{1}{1 + \nu e^{-u}} \\ G(x; \theta) = \ln p_\theta(x) - \ln q(x) \\ h(x;\theta) = r_\nu(G(x;\theta)) = \frac{1}{1 + \nu e^{-G(x;\theta)}} \end{gather} \] Denote by \(\tilde J\) the objective \(J\) seen as a functional of \(f(\cdot) = \ln p_\theta(\cdot)\): \[ \tilde J(f) = \E_{x \sim \tilde p} \ln r_\nu\big(f(x) - \ln q(x)\big) + \nu \E_{x \sim q} \ln \big(1 - r_\nu[f(x) - \ln q(x)]\big) \] For small \(\epsilon > 0\) and \(\phi(x)\) a perturbation of \(f\), \[ \tilde J(f + \epsilon \phi) = \E_{x \sim \tilde p} \ln r_\nu\big(f(x) + \epsilon \phi(x) - \ln q(x)\big) + \nu \E_{x \sim q} \ln \big(1 - r_\nu[f(x) + \epsilon \phi(x) - \ln q(x)]\big) \]
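A direct transcription of these definitions (a sketch; `log_p_theta` and `log_q` are hypothetical callables returning log-densities on arrays):

```python
import numpy as np

def r_nu(u, nu):
    # r_nu(u) = 1 / (1 + nu * exp(-u))
    return 1.0 / (1.0 + nu * np.exp(-u))

def J_T(x_data, y_noise, log_p_theta, log_q, nu):
    # Empirical objective with T_d data points and T_n = nu * T_d noise points;
    # G(x; theta) = ln p_theta(x) - ln q(x) and h = r_nu(G).
    G_data = log_p_theta(x_data) - log_q(x_data)
    G_noise = log_p_theta(y_noise) - log_q(y_noise)
    T_d = len(x_data)
    return (np.sum(np.log(r_nu(G_data, nu)))
            + np.sum(np.log1p(-r_nu(G_noise, nu)))) / T_d
```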

By Taylor’s expansion,

\[ \begin{aligned} \ln r_\nu(u + \epsilon u_1 + \epsilon^2 u_2) &\approx \ln r_\nu (u) + r_{\frac{1}{\nu}}(-u)(\epsilon u_1 + \epsilon^2 u_2) - \frac{r_\nu (u)r_\frac{1}{\nu}(-u)}{2}(\epsilon u_1 + \epsilon^2 u_2)^2 + \Omicron \big((\epsilon u_1 + \epsilon^2 u_2)^3 \big) \\ &= \ln r_\nu (u) + \epsilon u_1r_{\frac{1}{\nu}}(-u) + \epsilon^2(u_2r_{\frac{1}{\nu}}(-u) - \frac{1}{2}u_1^2 r_{\frac{1}{\nu}}(-u)r_\nu (u)) + \Omicron \big(\epsilon^3 \big) \\ \end{aligned} \]

\[ \begin{aligned} \ln \big(1 - r_\nu(u + \epsilon u_1 + \epsilon^2 u_2) \big) &\approx \ln \big(1 - r_\nu(u) \big) - r_\nu(u)(\epsilon u_1 + \epsilon^2 u_2) - \frac{r_{\frac{1}{\nu}}(-u) r_\nu(u)}{2} (\epsilon u_1 + \epsilon^2 u_2)^2 + \Omicron \big((\epsilon u_1 + \epsilon^2 u_2)^3 \big) \\ &= \ln(1 - r_\nu(u)) - \epsilon u_1 r_\nu(u) - \epsilon^2 \big( u_2 r_\nu(u) + \frac{1}{2} u_1^2 r_{\frac{1}{\nu}}(-u) r_\nu(u) \big) + \Omicron(\epsilon^3) \end{aligned} \]
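These expansions are easy to sanity-check numerically; a small sketch (the values of \(\nu, u, u_1, u_2\) are arbitrary):

```python
import numpy as np

def r_nu(u, nu):
    return 1.0 / (1.0 + nu * np.exp(-u))

nu, u, u1, u2 = 2.0, 0.3, 0.7, -0.4
for eps in (1e-1, 1e-2, 1e-3):
    delta = eps * u1 + eps ** 2 * u2
    exact = np.log(r_nu(u + delta, nu))
    # ln r_nu(u) + r_{1/nu}(-u) delta - (1/2) r_nu(u) r_{1/nu}(-u) delta^2
    approx = (np.log(r_nu(u, nu))
              + r_nu(-u, 1.0 / nu) * delta
              - 0.5 * r_nu(u, nu) * r_nu(-u, 1.0 / nu) * delta ** 2)
    print(eps, abs(exact - approx))  # error shrinks like O(eps^3)
```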

Substituting \(u = f(x) - \ln q(x)\), \(u_1 = \phi(x)\), and \(u_2 = 0\) gives \[ \begin{aligned} \tilde J(f + \epsilon \phi) = &\E_{x \sim \tilde p} \ln r_\nu \big(f(x) + \epsilon \phi(x) - \ln q(x) \big) + \nu \E_{x \sim q} \ln \big(1 - r_\nu[f(x) + \epsilon \phi(x) - \ln q(x)] \big) \\ = &\E_{x \sim \tilde p} \big\{\ln r_\nu \big(f(x) - \ln q(x) \big) + \epsilon \phi(x) r_{\frac{1}{\nu}} \big(\ln q(x) - f(x) \big) \big\} \\ &+ \nu \E_{x \sim q} \big\{ \ln \big(1 - r_\nu[f(x) -\ln q(x)] \big) - \epsilon \phi(x) r_\nu \big( f(x) - \ln q(x) \big) \big\} + \Omicron(\epsilon^2) \\ = &\tilde J(f) + \epsilon \int \phi(x) \big(r_{\frac{1}{\nu}} [\ln q(x) - f(x)] \tilde p(x) - \nu r_\nu[f(x) - \ln q(x)] q(x) \big) dx + \Omicron(\epsilon^2) \end{aligned} \] \(\tilde J\) attains a local maximum at \(f\) only if the term of order \(\epsilon\) vanishes for every perturbation \(\phi\); since \(\phi\) is arbitrary, the integrand must vanish pointwise. In this case, \[ \begin{aligned} r_{\frac{1}{\nu}} [\ln q(x) - f(x)]\, \tilde p(x) &= \nu r_\nu[f(x) - \ln q(x)]\, q(x) \\ \frac{\nu \tilde p(x)}{\nu + e^{f(x) - \ln q(x)}} &= \frac{\nu q(x)}{1 + \nu e^{\ln q(x) - f(x)}} \\ \frac{\tilde p(x)q(x)}{\nu q(x) + p_\theta(x)} &= \frac{q(x) p_\theta(x)}{p_\theta(x) + \nu q(x)} \end{aligned} \] (using \(e^{f(x)} = p_\theta(x)\) and dividing both sides by \(\nu\)). If \(q(x)\) is nonzero, \[ \frac{\tilde p(x)}{\nu q(x) + p_\theta(x)} = \frac{p_\theta(x)}{p_\theta(x) + \nu q(x)} \iff p_\theta(x) = \tilde p(x) \]
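A quick numerical confirmation that \(f = \ln \tilde p\) makes the first-order term vanish pointwise (the distributions here are arbitrary illustrative choices):

```python
import numpy as np

def r_nu(u, nu):
    return 1.0 / (1.0 + nu * np.exp(-u))

nu = 3.0
x = np.linspace(-2.0, 2.0, 5)
p_tilde = np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)  # N(0, 1) density
q = np.full_like(x, 0.25)                             # uniform on [-2, 2]
f = np.log(p_tilde)                                   # set f = ln p_tilde

lhs = r_nu(np.log(q) - f, 1.0 / nu) * p_tilde  # r_{1/nu}(ln q - f) p_tilde
rhs = nu * r_nu(f - np.log(q), nu) * q         # nu r_nu(f - ln q) q
print(np.allclose(lhs, rhs))  # True
```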

Gradient descent

This perspective analyzes the NCE gradient directly; as the noise ratio \(\nu \to \infty\), it approaches the maximum-likelihood gradient. See: https://leimao.github.io/article/Noise-Contrastive-Estimation/

Implementation

PyTorch-NCE
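PyTorch-NCE is a full NCE implementation for language modeling. A minimal, self-contained sketch of the core binary loss in PyTorch (not the repository's actual API; the log-probability arguments are placeholders) could look like:

```python
import math
import torch

def nce_logits(log_p_model, log_q_noise, nu):
    # h(x; theta) = r_nu(G) = sigmoid(G - ln nu), with G = ln p_theta - ln q
    return log_p_model - log_q_noise - math.log(nu)

def nce_loss(log_p_data, log_q_data, log_p_noise, log_q_noise, nu):
    # Minimizes -J(theta): data samples get label 1, noise samples label 0;
    # the noise term is weighted by nu to match nu * E_q ln(1 - h).
    bce = torch.nn.functional.binary_cross_entropy_with_logits
    logit_d = nce_logits(log_p_data, log_q_data, nu)
    logit_n = nce_logits(log_p_noise, log_q_noise, nu)
    return bce(logit_d, torch.ones_like(logit_d)) \
        + nu * bce(logit_n, torch.zeros_like(logit_n))
```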

Other References

NCE and Language Models (NCE与语言模型)
