Machine Learning

  • Bullet Points

    This post lists various topics under the machine learning subject. Data: data modalities (numbers, texts, images, videos, audios); data cleaning; imbalanced data; data normalization (one pitfall); standardization

  • Linear Discriminant Analysis

    Linear discriminant analysis (LDA) is another approximation to the Bayes optimal classifier. Instead of assuming that the input dimensions are pairwise independent given the label, LDA assumes a single covariance matrix over the input dimensions that is shared by all labels.

  • Logistic Regression

    Two-class Logistic Regression: Logistic regression is a binary linear classifier. Suppose the feature space is \(\R^N\); it then processes the feature vector \(x\) with a linear function \(f(x) = w^Tx + b\), where \(w \in \R^N\) and \(b \in \R\).
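
    The linear function above can be sketched in a few lines of plain Python. The excerpt stops at \(f(x)\), so the sigmoid and the 0.5 threshold below are the standard textbook choices and are assumptions here, not taken from the post; the weights are made up for illustration.

```python
import math

def linear_score(w, b, x):
    """Compute f(x) = w^T x + b for plain Python lists."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(w, b, x, threshold=0.5):
    """Return (estimated probability of class 1, predicted label).

    The threshold of 0.5 is an assumed default, equivalent to f(x) >= 0.
    """
    p = sigmoid(linear_score(w, b, x))
    return p, int(p >= threshold)

# Illustrative, made-up parameters for N = 2
w = [2.0, -1.0]
b = 0.5
p_pos, label_pos = predict(w, b, [1.0, 0.5])   # f(x) = 2.0 - 0.5 + 0.5 = 2.0
p_neg, label_neg = predict(w, b, [-1.0, 1.0])  # f(x) = -2.0 - 1.0 + 0.5 = -2.5
```

    The decision boundary is the hyperplane \(f(x) = 0\), which is why the classifier is called linear even though the sigmoid is not.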

  • Support Vector Machine

    A support vector machine is used for binary classification tasks. It aims to find a hyperplane that separates the data with different labels with the maximum margin. By symmetry, there should be at least one margin point on each side of the decision boundary.

  • Linear Regression

    Given a dataset \(\mathcal D = \{(x^{(i)}, y^{(i)}),i=1,\dots,M\}\) where \(x^{(i)} \in \R^N\) is the feature vector and \(y^{(i)} \in \R\) is the output, learn a linear function \(f = w^Tx

  • Non-linear Regression

    The basic idea of non-linear regression is to perform a feature mapping \(\phi(x)\), followed by linear regression in the new feature space. Polynomial Regression: When \(x\) is a scalar, the feature mapping in \(p\)-th order polynomial regression is \[ \phi(x) = \begin{bmatrix} 1 \\ x \\ x^2 \\ \vdots \\ x^p \end{bmatrix} \] The regression function and the parameters are then like those in linear regression: \[ f(x) = [w_0,w_1,w_2,\dots,w_p]\begin{bmatrix} 1 \\ x \\ x^2 \\ \vdots \\ x^p \end{bmatrix} = w^T\phi(x) \] \(\phi(x)\) captures the features of every degree from \(1\) up to \(p\), plus the constant term.
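
    As a minimal sketch of the mapping above, the following NumPy code builds \(\phi(x)\) for scalar inputs and fits \(w\) by ordinary least squares. The function names and the synthetic, noise-free data are illustrative assumptions, not part of the original post.

```python
import numpy as np

def poly_features(x, p):
    """Map scalars x (shape (M,)) to rows [1, x, x^2, ..., x^p] (shape (M, p+1))."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, N=p + 1, increasing=True)

def fit_poly(x, y, p):
    """Least-squares estimate of w in f(x) = w^T phi(x)."""
    Phi = poly_features(x, p)
    w, *_ = np.linalg.lstsq(Phi, np.asarray(y, dtype=float), rcond=None)
    return w

def predict_poly(w, x):
    """Evaluate f(x) = w^T phi(x); the order p is inferred from len(w)."""
    return poly_features(x, len(w) - 1) @ w

# Synthetic noise-free data generated from y = 1 - 2x + 0.5x^2
x_train = np.linspace(-2.0, 2.0, 9)
y_train = 1.0 - 2.0 * x_train + 0.5 * x_train**2
w_hat = fit_poly(x_train, y_train, p=2)  # should recover [w_0, w_1, w_2]
```

    Because the model is linear in \(w\), the fit reuses the linear regression machinery unchanged; only the design matrix differs.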

  • Clustering

    Big Picture: Clustering algorithms can be broadly split into two categories, depending on whether the number of clusters is given in advance or left for the user to determine afterwards. Partitional algorithms fix the number of clusters beforehand, while hierarchical ones output a dendrogram that illustrates how clusters are built level by level.

  • Dimension Reduction

    Unsupervised Dimension Reduction: Dimensionality reduction reduces computational cost; de-noises the data by projecting onto a lower-dimensional space and back to the original space; and makes results easier to understand by reducing collinearity. Compared to feature selection,

  • Principal Component Analysis

    Given a data matrix \(X = [x^{(1)}, \dots, x^{(M)}] \in \mathbb R^{N \times M}\), the goal of PCA is to identify the directions of maximum variance contained in the data

  • Eckart-Young-Mirsky Theorem

    Given an \(M \times N\) matrix \(X\) of rank \(R \le \min\{M,N\}\) and its SVD \(X = U\Sigma V^T\), where \(\Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \dots, \sigma_R)\), among all the \(M \times N\) matrices of rank \(K \le R\), the best approximation to \(X\) is \(Y^\star = U\Sigma_K V^T\), where \(\Sigma_K = \operatorname{diag}(\sigma_1, \sigma_2, .
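
    The statement can be checked numerically with a truncated SVD. The sketch below assumes that "best" is measured in the Frobenius norm (the excerpt is cut off before naming the norm), in which case the approximation error equals the square root of the sum of the squared discarded singular values. The helper name, the seed, and the random matrix are illustrative.

```python
import numpy as np

def best_rank_k(X, K):
    """Rank-K truncated SVD of X: keep only the K largest singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Y = (U[:, :K] * s[:K]) @ Vt[:K, :]  # U_K diag(s_K) V_K^T
    return Y, s

rng = np.random.default_rng(0)  # fixed seed; the data are synthetic
X = rng.standard_normal((6, 4))
K = 2
Y_star, s = best_rank_k(X, K)

# Frobenius error of the truncated SVD versus the discarded singular values
err = np.linalg.norm(X - Y_star, "fro")
expected_err = np.sqrt(np.sum(s[K:] ** 2))

# Spot check against one arbitrary rank-K competitor (not a proof)
B = rng.standard_normal((6, K)) @ rng.standard_normal((K, 4))
competitor_err = np.linalg.norm(X - B, "fro")
```

    A single random competitor only illustrates the claim; the theorem itself covers every matrix of rank \(K\).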

  • Independent Component Analysis

    Assumption and Derivation: Suppose the observations are linear combinations of the signals, and that the number of signal sources equals the number of linearly mixed channels. Given observations (along the time axis \(i\)) \(\mathcal{D} = \{x^{(i)}\}_{i=1}^M, x^{(i)} \in \R^{N \times 1}\), independent component analysis finds the \(N \times N\) mixing matrix \(A\) such that \(X = AS\).

  • RANSAC

    Outliers (noise) in the data can pull the regression model toward reducing the prediction errors on them, rather than on the majority of real data points. RANdom SAmple Consensus is a methodology to

  • Fisher's Linear Discriminant

    Two-class: Assume we have a set of \(N\)-dimensional samples \(D = \{x^{(i)}\}^M_{i=1}\), \(M_1\) of which belong to Class \(1\) and \(M_2\) to Class \(2\) (\(M_1 + M_2 = M\)).

  • Bias-variance Decomposition

    The notation used is as follows:

    | Symbol | Meaning |
    | --- | --- |
    | \(\mathcal D\) | the dataset |
    | \(x\) | the sample |
    | \(y_\mathcal D\) | the observation of \(x\) in \(\mathcal D\), affected by noise |
    | \(y\) | the real value of \(x\) |
    | \(\bar y\) | the mean of the real values |
    | \(f\) | the model learned with \(\mathcal D\) |
    | \(f(x)\) | the prediction of \(f\) with \(x\) |
    | \(\bar f(x)\) | the expectation of the prediction of \(f\) with \(x\) |
    | \(l(f(x), y_\mathcal D)\) | the loss function, chosen to be the squared error |

    By assuming that the observation errors average to \(0\), the expectation of the error will be

    \[
    \begin{aligned}
    E_{x \sim \mathcal D}&[l(f(x), y_\mathcal D)] = E[(f(x) - y_\mathcal D)^2] = E\{[(f(x) - \bar f(x)) + (\bar f(x) - y_\mathcal D)]^2\} \\
    &= E[(f(x) - \bar f(x))^2] + E[(\bar f(x) - y_\mathcal D)^2] + 2E[(f(x) - \bar f(x))(\bar f(x) - y_\mathcal D)] \\
    &= E[(f(x) - \bar f(x))^2] + E\{[(\bar f(x) - y) + (y - y_\mathcal D)]^2\} \\
    &\quad + 2\underbrace{E[f(x) - \bar f(x)]}_0 E[\bar f(x) - y_\mathcal D] \\
    &= E[(f(x) - \bar f(x))^2] + E[(\bar f(x) - y)^2] + E[(y - y_\mathcal D)^2] + 2E[(\bar f(x) - y)(y - y_\mathcal D)] \\
    &= E[(f(x) - \bar f(x))^2] + E[(y - y_\mathcal D)^2] + E\{[(\bar f(x) - \bar y) + (\bar y - y)]^2\} \\
    &\quad + 2E[\bar f(x) - y] \underbrace{E[y - y_\mathcal D]}_0 \\
    &= E[(f(x) - \bar f(x))^2] + E[(y - y_\mathcal D)^2] + E\{[(\bar f(x) - \bar y) + (\bar y - y)]^2\} \\
    &= E[(f(x) - \bar f(x))^2] + E[(y - y_\mathcal D)^2] + E[(\bar f(x) - \bar y)^2] + E[(\bar y - y)^2] + 2E[(\bar f(x) - \bar y)(\bar y - y)] \\
    &= E[(f(x) - \bar f(x))^2] + E[(y - y_\mathcal D)^2] + E[(\bar f(x) - \bar y)^2] + E[(\bar y - y)^2] \\
    &\quad + 2E[\bar f(x) - \bar y]\underbrace{E[\bar y - y]}_0 \\
    &= \underbrace{E[(f(x) - \bar f(x))^2]}_{\text{variance}} + \underbrace{E[(\bar f(x) - \bar y)^2]}_{\text{bias}^2} + \underbrace{E[(y - y_\mathcal D)^2]}_{\text{noise}} + \underbrace{E[(\bar y - y)^2]}_{\text{scatter}}
    \end{aligned}
    \]

    Further reading: "5 ways to achieve right balance of Bias and Variance in ML model" by Niwratti Kasture, Analytics Vidhya, Medium.

  • Receiver Operator Characteristic

    The receiver operator characteristic (ROC) curve connects the consecutive TPR-FPR 2-D points, which are obtained by ranking the test cases according to their probability of being positive, from high to low;
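
    The ranking procedure can be sketched in plain Python as follows. This is a minimal illustration, assuming labels are coded as 1 (positive) and 0 (negative), that both classes are present, and that tied scores should yield a single point; the function names and the example scores are hypothetical.

```python
def roc_points(scores, labels):
    """Return the (FPR, TPR) points obtained by sweeping the threshold
    from the highest score to the lowest.

    Assumes labels are 1 (positive) or 0 (negative) and that both
    classes occur at least once (otherwise a rate is undefined).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last test case in a group of tied scores
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the piecewise-linear curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up scores: three positives and three negatives
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
pts = roc_points(scores, labels)
area = auc(pts)
```

    Each positive moves the curve up by \(1/\#\text{positives}\) and each negative moves it right by \(1/\#\text{negatives}\), so a perfect ranking reaches the top-left corner before moving right.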

  • Bayes Classifier

    The Bayes classifier

  • Hidden Markov Model

    The hidden Markov model

  • Mean Average Precision

    The Precision: The typical definitions of precision and recall in binary classification are \[ \text{precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}, \quad \text{recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} \] Precision measures, among all the samples that are identified as positive, how many are really positive.
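
    The two definitions translate directly into counting. The sketch below assumes labels coded as 1 (positive) and 0 (negative); returning 0.0 when a denominator is zero is one possible convention chosen here for illustration, not something the post specifies.

```python
def precision_recall(y_true, y_pred):
    """Compute (precision, recall) for binary labels coded as 1 / 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    # Assumed convention: report 0.0 when the ratio is undefined
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up example: TP = 2, FP = 1, FN = 2
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
```

    In this example two of the three predicted positives are correct, while only two of the four actual positives are found, so precision exceeds recall.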