Logistic Regression

Two-class Logistic Regression

Logistic regression is a binary linear classifier. Suppose the feature space is \(\R^N\), then it processes the feature vector \(x\) with a linear function \(f(x) = w^Tx + b\), where \(w \in \R^N\) and \(b \in \R\). It takes a probabilistic approach and maps \(f(x)\) to a probability value between \(0\) and \(1\) by applying a sigmoid function \(\sigma(z) = \frac{1}{1 + e^{-z}}\). That is \[ \begin{aligned} p(y = +1|x) &= \sigma(f(x)) \\ p(y = -1|x) &= 1 - \sigma(f(x)) = \sigma(-f(x)) \\ \end{aligned} \] Or equivalently, \(p(y|x) = \sigma(yf(x))\). Then, \(f_{LR}(x) = \arg \max\limits_{y \in \{-1, 1\}} \sigma(y(f(x)))\).

Logistic regression is a discriminative classifier because we are directly modelling \(p(y|x)\), with no intermediate step on \(p(x|y)\).

Given dataset \(\mathcal{D} = \{ (x^{(i)}, y^{(i)}): i=1, \dots, M \}\), logistic regression is learning by maximizing the conditional log-likelihood of \(\mathcal{D}\). \[ l(w, b) = \log L(w, b) = \log \prod_{i=1}^Mp(y^{(i)}|x^{(i)};w,b) = \sum_{i=1}^M\log p(y^{(i)}|x^{(i)};w,b) \]

$$ \[\begin{aligned} &(w^\star, b^\star) = \arg \max\limits_{w, b}l(w, b) \\ &= \arg \max\limits_{w, b}\sum_{i=1}^M\log p(y^{(i)}|x^{(i)};w,b) \\ &= \arg \max\limits_{w, b}\sum_{i=1}^M\log \sigma(y^{(i)}(w^Tx^{(i)}+b)) \\ &= \arg \min\limits_{w, b}-\sum_{i=1}^M\log \sigma(y^{(i)}(w^Tx^{(i)}+b)) \\ \end{aligned}\] \[ Let $J(w, b) = -\sum_{i=1}^M\log \sigma(y^{(i)}(w^Tx^{(i)}+b))$ be the target function. There is no closed-form solution to this optimization problem. Rather, it is to be solved by some iterative algorithm, e.g. gradient descent. For each iteration, parameters are updated by \] \[\begin{gather} w \leftarrow w - \eta\frac{\partial J}{\partial w} \\ b \leftarrow b - \eta\frac{\partial J}{\partial b} \end{gather}\] \[ Specifically, \] \[\begin{aligned} \frac{\partial J}{\partial w} &= -\sum_{i=1}^M\frac{\partial\log \sigma(y^{(i)}f(x^{(i)}))}{\partial w} \\ &= -\sum_{i=1}^M \frac{\partial\log \sigma(y^{(i)}f(x^{(i)}))}{\partial \sigma(y^{(i)}f(x^{(i)}))} \frac{\partial \sigma(y^{(i)}f(x^{(i)}))}{\partial (y^{(i)}f(x^{(i)}))} \frac{\partial (y^{(i)}f(x^{(i)}))}{\partial w} \\ &= -\sum_{i=1}^M \frac{1}{\sigma(y^{(i)}f(x^{(i)}))} \sigma(y^{(i)}f(x^{(i)}))(1 - \sigma(y^{(i)}f(x^{(i)}))) y^{(i)}x^{(i)} \\ &= -\sum_{i=1}^M (1 - \sigma(y^{(i)}f(x^{(i)}))) y^{(i)}x^{(i)} \\ \end{aligned}\]

$$

\[ \begin{aligned} \frac{\partial J}{\partial b} &= -\sum_{i=1}^M\frac{\partial\log \sigma(y^{(i)}f(x^{(i)}))}{\partial b} \\ &= -\sum_{i=1}^M \frac{\partial\log \sigma(y^{(i)}f(x^{(i)}))}{\partial \sigma(y^{(i)}f(x^{(i)}))} \frac{\partial \sigma(y^{(i)}f(x^{(i)}))}{\partial (y^{(i)}f(x^{(i)}))} \frac{\partial (y^{(i)}f(x^{(i)}))}{\partial b} \\ &= -\sum_{i=1}^M \frac{1}{\sigma(y^{(i)}f(x^{(i)}))} \sigma(y^{(i)}f(x^{(i)}))(1 - \sigma(y^{(i)}f(x^{(i)}))) y^{(i)} \\ &= -\sum_{i=1}^M (1 - \sigma(y^{(i)}f(x^{(i)}))) y^{(i)} \\ \end{aligned} \]

Note that the posterior obtained by binary Linear Discriminant Analysis also has the form of \[ p(y|x) = \frac{1}{1 + e^{-(w^Tx + b)}} \] This is not to say LDA and LR are the same in binary case. LDA is a generative model which learns \(p(x|y)\) and \(p(y)\), LR is a discriminative model which only learns \(p(y|x)\). One can discriminate with a generative model, but not reversely.

Logistic Loss Perspective

Logistic regression is equivalent to \[ \arg \min\limits_{w, b} \sum_{i=1}^M \log [1 + e^{-y^{(i)}(w^T x^{(i)}+b)}] \\ \] where \(\ell(z) = \log(1 + e^{-y z})\) is the logistic loss and \(z = w^T x + b = \frac{P(\hat y=1|x)}{P(\hat y=0|x)}\) is the logit. Square loss \(\ell(\hat y) = (y - \hat y)^2\) is not used because this will make the problem non-convex, forfeiting the facility of theoretical guarantee.

Some History

Historically, there has been efforts on adapting linear regression to the classification task where the output is a probability value between \(0\) and \(1\), instead of between \(-\infty\) and \(+\infty\). Many attempts have been given to mapping \((0, 1)\) to \((-\infty, +\infty)\) first and then apply linear regression on transformed values. An early work uses the quantile function of standard normal distribution. The model was named as “probabilistic unit” (probit). However, this model is too computationally-expensive at that time. Later on a work that uses the quantile function of logistic distribution followed on, naming its model as “logistic unit” (logit). Essentially, the logit function is the log of the odds: \[ \mathrm{logit} (p) = \ln \frac{p}{1-p} \] It can be verified that in logistic regression, \(\mathrm{logit}(p) = w^T x + b\). Therefore, the solution of logistic regression is in essence “linear-regressing” the logit term, explaining why it is called “logistic regression”.

Also note that in machine learning, un-normalized scores for different classes are usually called logits too. It makes some sense since these unbounded values are to be mapped to \((0, 1)\), which means they are “logitted” values.

Multi-class Logistic Regression

One way to train use Logistic Regression in multi-class classification in to for each class \(i = {1, \dots, C}\), assign a weight vector \(w^{(i)}\), the probability is define by the softmax function: \[ p(y = c|x) =\frac{\exp(w^{(c)}\cdot x + b^{(c)})}{\sum_{i=1}^C \exp(w^{(i)} \cdot x + b^{(i)})} \] Then \(f_\text{Multiclass LR}(x) = \arg \max\limits_{y \in \{1,\dots,C\}}p(y|x)\)

To prevent overfitting, a prior distribution is usually added to \(w\). Suppose each dimension of \(w\) is independently sampled from a Gaussian distribution \(\mathcal N(0, \frac{C}{2})\), then, \[ p(w) \propto \exp(-\frac{1}{C}w^Tw) \]

Previous
Next