What exactly does regplot()'s robust option do?

Question

Related to this question, I would like to know what exactly the robust option in seaborn's regplot() does.

The documentation describes it as follows:

If True, use statsmodels to estimate a robust regression. This will de-weight outliers. Note that this is substantially more computationally intensive than standard linear regression, so you may wish to decrease the number of bootstrap resamples (n_boot) or set ci to None.

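Does this mean it works more like Kendall or Spearman correlation, since those are known to be robust to outliers? Or do the two have nothing to do with each other? In other words, if I compute Kendall's correlation for some data and draw the scatter plot with regplot(), does it make sense to use robust=True?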

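Answer 1

"Classical" vs. robust linear regression

Typically, linear regression uses Ordinary Least Squares (OLS) to find the regression coefficients, with the goal of minimizing the sum of squared residuals (the squared differences between the fitted line and the actual data). This is quite sensitive to outliers: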

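import numpy as np
import seaborn as sns

x = np.arange(0, 10, 0.2)
y = (x * 0.25) + np.random.normal(0, .1, 50)
# push four points far below the trend so they act as outliers
y[[12, 14, 18, 24]] -= 4

sns.regplot(x=x, y=y, robust=False)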

Notice how the line is dragged down by the outliers. In many cases, this is the behavior you want.
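To put a number on that pull, here is a minimal sketch that fits the same kind of data with plain least squares, once without and once with the four outliers. It uses np.polyfit instead of seaborn, and the fixed seed and variable names are my own assumptions, not part of the original example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed (an assumption, for reproducibility)
x = np.arange(0, 10, 0.2)
y_clean = 0.25 * x + rng.normal(0, 0.1, x.size)

# same data, with four points pushed far below the trend
y_out = y_clean.copy()
y_out[[12, 14, 18, 24]] -= 4

# a degree-1 polyfit is an ordinary least-squares line fit; it returns (slope, intercept)
slope_clean, icpt_clean = np.polyfit(x, y_clean, 1)
slope_out, icpt_out = np.polyfit(x, y_out, 1)

print(f"clean:    slope={slope_clean:.3f}, intercept={icpt_clean:.3f}")
print(f"outliers: slope={slope_out:.3f}, intercept={icpt_out:.3f}")
```

Because OLS is linear in y, the shift caused by the outliers does not depend on the noise: on this x grid the intercept drops by about 0.6 and the slope rises by about 0.06.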

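Robust regression methods, on the other hand, use a different criterion than OLS to find the regression coefficients. One example is least trimmed squares, which essentially minimizes the sum of squares over a subset of your data (in this sense, it is loosely similar to bootstrapping). Typically, the fit is computed iteratively, re-weighting the observations at each step so that a given outlier ends up with little influence on your coefficients. This iterative re-weighting is what statsmodels.robust.robust_linear_model.RLM does, and it is what seaborn calls when you use robust=True. The result, on the same data as before: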

sns.regplot(x=x, y=y, robust=True)

Notice that the line is no longer pulled down by the outliers. Often this is not the behavior you want, but it depends on what you are doing...
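To make the "iterative re-weighting" idea concrete, here is a rough NumPy-only sketch of iteratively reweighted least squares with Huber weights. It is not the statsmodels code: the function name huber_irls, the scale estimate, and the stopping rule are my own assumptions, and RLM differs in its details. The tuning constant 1.345 is the usual Huber default.

```python
import numpy as np

def huber_irls(x, y, t=1.345, n_iter=50, tol=1e-8):
    """Hypothetical helper: fit y = a + b*x by IRLS with Huber weights."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # start from the OLS fit
    for _ in range(n_iter):
        resid = y - X @ beta
        # robust scale estimate: normalized median absolute deviation
        scale = np.median(np.abs(resid - np.median(resid))) / 0.6745
        u = np.abs(resid) / max(scale, 1e-12)
        # Huber weights: 1 for small residuals, t/u for large ones
        w = np.minimum(1.0, t / np.maximum(u, 1e-12))
        sw = np.sqrt(w)
        # weighted least squares via row scaling by sqrt(w)
        new_beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta  # (intercept, slope)

rng = np.random.default_rng(0)  # fixed seed (an assumption)
x = np.arange(0, 10, 0.2)
y = 0.25 * x + rng.normal(0, 0.1, x.size)
y[[12, 14, 18, 24]] -= 4

slope_ols, icpt_ols = np.polyfit(x, y, 1)
icpt_rob, slope_rob = huber_irls(x, y)
print(f"OLS:    slope={slope_ols:.3f}, intercept={icpt_ols:.3f}")
print(f"robust: slope={slope_rob:.3f}, intercept={icpt_rob:.3f}")
```

The outliers are never dropped. They just end up with weights close to zero, which is why the robust line stays near the true slope of 0.25.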

Note: this really is computationally expensive (for just those 50 data points, it took about 5 seconds to run on my machine).
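As the docstring quoted in the question suggests, most of that cost comes from refitting the robust model for every bootstrap resample, so you can cut it with ci or n_boot. A minimal sketch, assuming seaborn, statsmodels, and matplotlib are installed (the Agg backend and the fixed seed are my own choices so it runs without a display):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so no display is needed
import numpy as np
import seaborn as sns

rng = np.random.default_rng(0)  # fixed seed (an assumption)
x = np.arange(0, 10, 0.2)
y = 0.25 * x + rng.normal(0, 0.1, x.size)
y[[12, 14, 18, 24]] -= 4

# ci=None skips the bootstrap, so no confidence band is computed or drawn
ax = sns.regplot(x=x, y=y, robust=True, ci=None)

# alternative: keep the confidence band but use fewer resamples
# ax = sns.regplot(x=x, y=y, robust=True, n_boot=100)
```

With ci=None you lose the shaded confidence band, so this is a trade-off, not a free speedup.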

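Which correlation coefficient to use?

If you want to keep reporting your Kendall correlation coefficient, don't use the robust argument when visualizing your data. That would be misleading, because Kendall's sensitivity to errors cannot be compared with that of your robust linear regression (to illustrate how much this can vary: on my data above, the Kendall correlation coefficient is 0.85 and the Spearman correlation coefficient is 0.93). With robust=True, sns.regplot() calls statsmodels.robust.robust_linear_model.RLM, which uses the HuberT() criterion by default. Because of this, if you want to report something like a correlation coefficient, my intuition is that you'll have to use some measure of the Huber loss (you'll probably find more info about that here). Or, you can read this paper, which seems to have some insight into robust alternatives to the correlation coefficient.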
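For reference, this is roughly how those rank correlations can be computed with scipy.stats, with Pearson's r added for comparison. The data is regenerated with a fixed seed of my own choosing, so the numbers will not match the 0.85 and 0.93 quoted above exactly.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed (an assumption)
x = np.arange(0, 10, 0.2)
y = 0.25 * x + rng.normal(0, 0.1, x.size)
y[[12, 14, 18, 24]] -= 4

# each function returns (statistic, p-value)
tau, _ = stats.kendalltau(x, y)
rho, _ = stats.spearmanr(x, y)
r, _ = stats.pearsonr(x, y)

print(f"Kendall tau:  {tau:.2f}")
print(f"Spearman rho: {rho:.2f}")
print(f"Pearson r:    {r:.2f}")
```

The two rank-based coefficients are hit much less by the four outliers than Pearson's r is, but neither is derived from the Huber criterion that the robust fit minimizes, which is why mixing them with robust=True can mislead.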
