How to isolate data that deviate by 2 and 3 sigma from the mean, and then mark them in a plot, in Python?

I am reading in a dataset which, when plotted in matplotlib, looks like the one below, and I then use linear regression to get a best-fit curve. A sample of the data looks like this:

# ID X Y px py pz M R
1.04826492772e-05 1.04828050287e-05 1.048233088e-05 0.000107002791008 0.000106552433081 0.000108704469007 387.02 4.81947797625e+13
1.87380963036e-05 1.87370588085e-05 1.87372620448e-05 0.000121616280029 0.000151924707761 0.00012371156585 428.77 6.54636174067e+13
3.95579877816e-05 3.95603773653e-05 3.95610756809e-05 0.000163470663023 0.000265203868883 0.000228031803626 470.74 8.66961875758e+13
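For completeness, a file with this layout (a `#`-commented header plus whitespace-separated numeric columns) can be read with `np.loadtxt`; a self-contained sketch, using the sample rows above in place of the real file:

```python
import io
import numpy as np

# the three sample rows, exactly as they appear in the file;
# in practice, pass the real filename to np.loadtxt instead
raw = """\
# ID X Y px py pz M R
1.04826492772e-05 1.04828050287e-05 1.048233088e-05 0.000107002791008 0.000106552433081 0.000108704469007 387.02 4.81947797625e+13
1.87380963036e-05 1.87370588085e-05 1.87372620448e-05 0.000121616280029 0.000151924707761 0.00012371156585 428.77 6.54636174067e+13
3.95579877816e-05 3.95603773653e-05 3.95610756809e-05 0.000163470663023 0.000265203868883 0.000228031803626 470.74 8.66961875758e+13
"""

# loadtxt skips "#" comment lines by default
data = np.loadtxt(io.StringIO(raw))
ID, X, Y, px, py, pz, M, R = data.T  # unpack the eight columns
```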

My code looks like this:

import numpy as np
from scipy import stats

# Regression Function
def regress(x, y):
    """Return a tuple of predicted y values and parameters for linear regression."""
    p = stats.linregress(x, y)
    b1, b0, r, p_val, stderr = p
    y_pred = np.polyval([b1, b0], x)
    return y_pred, p

# plotting z
xz, yz = M, Y_z                          # data, non-transformed
y_pred, _ = regress(xz, np.log(yz))      # fit in transformed (log) space

plt.semilogy(xz, yz, marker='o', color='b', markersize=4,
             linestyle='None', label="l.o.s within R500")
plt.semilogy(xz, np.exp(y_pred), "b", label='best fit')  # transformed back

But I can see a lot of upward scatter in the data, and the best-fit curve is pulled by it. So first I would like to isolate the data points that are 2 and 3 sigma away from my mean, and mark them on the plot with a circle around them. Then I would like to take the best-fit curve considering only the points that fall within 1 sigma of my mean.

Is there a good function in Python that can do this for me?
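What I have in mind for the marking is something like matplotlib's hollow scatter markers drawn over the flagged points; a minimal sketch with made-up data, purely to illustrate:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this for interactive use
import numpy as np
import matplotlib.pyplot as plt

# made-up points purely to illustrate the circling
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([10.0, 500.0, 15.0, 20.0])
outlier = np.array([False, True, False, False])  # pretend the 2nd point is >2 sigma

plt.semilogy(x, y, 'bo', markersize=4)
# hollow red circles drawn on top of the flagged points
circles = plt.scatter(x[outlier], y[outlier], s=150,
                      facecolors='none', edgecolors='r')
plt.show()
```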

Besides that, can I also isolate those rows from my actual dataset? For example, if the third row in the sample input above turns out to be 2-sigma deviated, can I get that row as output as well, so that I can save it and investigate it further later?

Many thanks in advance for your help.

Here is some code that steps through the data in a given number of windows, computes statistics within each window, and separates the data into well-behaved and outlier lists. Hope this helps.

from scipy import stats
import numpy as np
import matplotlib.pyplot as plt

num_data = 10000
fake_data_x = np.sort(12.8 + np.random.random(num_data))
fake_data_y = np.exp(fake_data_x) + np.random.normal(0, scale=50000, size=num_data)

# Regression Function
def regress(x, y):
    """Return a tuple of predicted y values and parameters for linear regression."""
    p = stats.linregress(x, y)
    b1, b0, r, p_val, stderr = p
    y_pred = np.polyval([b1, b0], x)
    return y_pred, p

# plotting z
xz, yz = fake_data_x, fake_data_y        # data, non-transformed
y_pred, _ = regress(xz, np.log(yz))      # fit in transformed (log) space

plt.figure()
plt.semilogy(xz, yz, marker='o', color='b', markersize=4,
             linestyle='None', label="l.o.s within R500")
plt.semilogy(xz, np.exp(y_pred), "b", label='best fit')  # transformed back
plt.show()

num_bin_intervals = 10  # number of averaging windows
# n windows need n + 1 edges
window_boundaries = np.linspace(min(fake_data_x), max(fake_data_x), num_bin_intervals + 1)
y_good = [] # list to collect the "well-behaved" y-axis data
x_good = [] # list to collect the "well-behaved" x-axis data
y_outlier = []
x_outlier = []

for i in range(len(window_boundaries)-1):

    # boolean mask selecting the data inside the averaging window;
    # the last window closes on the right so the maximum point is not dropped
    lo, hi = window_boundaries[i], window_boundaries[i+1]
    if i == len(window_boundaries) - 2:
        window_indices = (fake_data_x >= lo) & (fake_data_x <= hi)
    else:
        window_indices = (fake_data_x >= lo) & (fake_data_x < hi)
    # separate the pieces of data in the window
    fake_data_x_slice = fake_data_x[window_indices]
    fake_data_y_slice = fake_data_y[window_indices]

    # calculate the mean y_value in the window
    y_mean = np.mean(fake_data_y_slice)
    y_std = np.std(fake_data_y_slice)

    # flag points that deviate from the window mean by 2 sigma or more
    outlier_mask = np.abs(fake_data_y_slice - y_mean) >= 2*y_std

    # select the outliers
    y_outliers = fake_data_y_slice[outlier_mask]
    x_outliers = fake_data_x_slice[outlier_mask]

    # select the well-behaved points
    y_goodies = fake_data_y_slice[~outlier_mask]
    x_goodies = fake_data_x_slice[~outlier_mask]

    # extend the lists with all the good and the bad
    y_good.extend(y_goodies)
    y_outlier.extend(y_outliers)
    x_good.extend(x_goodies)
    x_outlier.extend(x_outliers)

plt.figure()
plt.semilogy(x_good,y_good,'o')
plt.semilogy(x_outlier,y_outlier,'r*')
plt.show()
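From here, to do what the question asks — refit using only the points within 1 sigma, and keep the flagged row indices so the corresponding rows of the real dataset can be saved — something like the following sketch could work. The regenerated fake data, the 1-sigma cut, `np.digitize` for the window assignment, and the `outliers.txt` filename are all illustrative choices, not part of the code above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
num_data = 10000
x = np.sort(12.8 + rng.random(num_data))
y = np.exp(x) + rng.normal(0, scale=50000, size=num_data)

# one boolean mask over the whole array, so original row indices survive
num_windows = 10
edges = np.linspace(x.min(), x.max(), num_windows + 1)
bins = np.clip(np.digitize(x, edges) - 1, 0, num_windows - 1)
outlier = np.zeros(num_data, dtype=bool)
for b in range(num_windows):
    m = bins == b
    outlier[m] = np.abs(y[m] - y[m].mean()) >= 1 * y[m].std()  # 1-sigma cut

# refit in log space on the well-behaved points only
b1, b0, r, p_val, stderr = stats.linregress(x[~outlier], np.log(y[~outlier]))
y_refit = np.exp(b1 * x + b0)

# np.flatnonzero(outlier) gives the row indices of the flagged points;
# with a full dataset you would save those rows rather than just x/y pairs
np.savetxt("outliers.txt", np.column_stack([x[outlier], y[outlier]]))
```

With the real data, the same boolean mask can be applied to the full 8-column array, so the flagged rows keep all their columns when saved.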