RANSAC
Outliers (noise) in the data can pull a regression model toward them: the fit reduces the prediction error on the outliers at the expense of the majority of genuine data points. RANdom SAmple Consensus (RANSAC) is a method for fitting a model robustly in the presence of outliers.
RANSAC does the following:
- randomly sample a subset of the data that is large enough to fit the model;
- fit a model to this subset;
- classify every point in the whole data set as an inlier or an outlier by comparing its residual (prediction error) to a threshold. The set of inliers is called the consensus set;
- repeat the steps above for a number of iterations, then retrain the final model on the largest consensus set (since inliers should be the majority).
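The steps above can be sketched in a few lines of Python. This is only an illustrative sketch for fitting a straight line \(y = ax + b\), not a reference implementation: the helper names `fit_line` and `ransac_line`, the default values, and the choice of ordinary least squares as the underlying model are all assumptions made for the example.

```python
import random

def fit_line(points):
    """Ordinary least-squares fit of y = a*x + b to a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x

def ransac_line(points, s=2, t=1.0, T=100, seed=0):
    """Hypothetical helper: s points per sample, residual threshold t, T iterations."""
    rng = random.Random(seed)
    best = []  # largest consensus set found so far
    for _ in range(T):
        # Step 1: randomly sample s points.
        sample = rng.sample(points, s)
        if len({x for x, _ in sample}) < 2:
            continue  # degenerate sample: all x equal, no line can be fit
        # Step 2: fit a model to the sample.
        a, b = fit_line(sample)
        # Step 3: points whose residual is below t form the consensus set.
        consensus = [(x, y) for x, y in points if abs(y - (a * x + b)) < t]
        # Step 4: remember the largest consensus set.
        if len(consensus) > len(best):
            best = consensus
    if not best:
        raise ValueError("no usable sample was drawn")
    # Retrain the final model on the largest consensus set.
    return fit_line(best), best
```

For example, `ransac_line(points, s=2, t=0.5, T=50)` returns the refitted `(a, b)` together with the list of inliers it was fitted on.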
Parameters of RANSAC include:
\(s\): number of points to fit the model;
\(t\): threshold of the residual;
\(e\): proportion of outliers in the data;
\(\delta\): probability of success (at least one iteration draws a training subset with no outliers);
\(T\): number of iterations to be determined.
Then,
- \(p\text{(training subset has no outliers)} = (1 - e)^s\)
- \(p\text{(training subset has at least one outlier)} = 1 - (1 - e)^s\)
- \(p\text{(all T subsets have outliers)} = (1 - (1 - e)^s)^T\)
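We want \[ \begin{gather} p\text{(all T subsets have outliers)} = (1 - (1 - e)^s)^T < 1 - \delta \\ T > \frac{\log(1 - \delta)}{\log\left(1 - (1 - e)^s\right)} \end{gather} \] The second line follows by taking logarithms of both sides; the inequality flips because \(\log\left(1 - (1 - e)^s\right)\) is negative. The threshold \(t\) is usually set to the median absolute deviation of \(y\).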
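As a quick numerical check of the bound on \(T\), the sketch below computes the smallest integer number of iterations that satisfies the strict inequality, together with a median absolute deviation that could serve as the threshold \(t\). The function names `ransac_iterations` and `median_abs_deviation` are assumptions made for this example, and the code assumes \(0 < e < 1\) and \(0 < \delta < 1\).

```python
import math
import statistics

def ransac_iterations(e, s, delta):
    """Smallest integer T with (1 - (1 - e)**s)**T < 1 - delta."""
    if not (0.0 < e < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("e and delta must lie strictly between 0 and 1")
    p_clean = (1.0 - e) ** s  # probability that one subset has no outliers
    bound = math.log(1.0 - delta) / math.log(1.0 - p_clean)
    return math.floor(bound) + 1  # strictly greater than the bound

def median_abs_deviation(y):
    """Median of |y_i - median(y)|, one candidate for the threshold t."""
    m = statistics.median(y)
    return statistics.median(abs(v - m) for v in y)
```

With \(e = 0.3\), \(s = 2\) and \(\delta = 0.99\), the bound is about 6.84, so `ransac_iterations(0.3, 2, 0.99)` gives \(T = 7\). With half the data being outliers and \(s = 4\), the same success probability already needs 72 iterations, which shows how quickly \(T\) grows with \(e\) and \(s\).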
External Material
Random sample consensus algorithm (RANSAC) - 桂。 - 博客园 (cnblogs.com)