Grouping based on and plotting error statistics in Python

Question

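I have implemented a regression model and retrieved the results. To evaluate them, I would like to create a plot in which the MAE and its standard deviation are shown together. However, I want to group the data by interval and evaluate the statistics per group. I can use sklearn metrics to calculate the mean absolute error, but that gives a single value for the entire data range. Can someone suggest how to group the data based on intervals?

The data is too big to share here, so I attach random data and the code I use to calculate the bias below.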

import pandas as pd
import random
import matplotlib.pyplot as plt
yact = random.sample(range(1, 100), 50)
ypred=random.sample(range(1, 100), 50)
df = pd.DataFrame(yact,columns=['yact'])
df['ypred']=ypred
df['bias']=df['yact']-df['ypred']
#groups=[20,40,60,80,100]

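I would like to group ypred based on yact (similar to the groups given above). The reference plot I am trying to reproduce is shown in the first quadrant of the figure below.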

Answer 1

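We could use only pandas/matplotlib, but seaborn makes this kind of plot much easier. First, we categorize the data with pd.cut based on the bins provided, then we plot the categories with seaborn's pointplot. The estimator mean is the default, but I include it explicitly to point out that you can pass other functions to the plot here.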

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#random data generation
rng = np.random.default_rng(123)
n=500
yact = rng.choice(range(1, 100), n)
ypred = rng.choice(range(1, 100), n)
df = pd.DataFrame({"yact": yact, "ypred": ypred})
df['bias']=df['yact']-df['ypred']

#binning of data
bins = [0, 30, 50, 80, 100]
labels = [f"({first}; {second}]" for first, second in zip(bins[:-1], bins[1:])]
df["cats"] = pd.cut(x=df['yact'], bins=bins, labels=labels, include_lowest=True)

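#plotting with seaborn
#note: seaborn >= 0.12 deprecates ci="sd" in favor of errorbar="sd",
#and seaborn >= 0.13 deprecates join=False in favor of linestyle="none"
sns.pointplot(x="cats", y="ypred", data=df, order=labels, estimator=np.mean, ci="sd", join=False)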

plt.show()

Sample output (unsurprisingly uniform):
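The question asks for the MAE and its spread per interval, while the plot above shows the mean of ypred, so the same pd.cut binning can also be combined with a plain pandas groupby on the absolute error. The following is only a minimal sketch under stated assumptions: the helper names binned_error_stats and plot_binned_mae are hypothetical, and it assumes that the standard deviation of the absolute error within each bin is the spread you want to draw.

```python
import numpy as np
import pandas as pd


def binned_error_stats(df, bins, actual="yact", predicted="ypred"):
    """Return MAE, SD of the absolute error, and sample count per bin of `actual`."""
    abs_err = (df[actual] - df[predicted]).abs()
    # Right-closed intervals; include_lowest also captures the left edge of the first bin
    groups = pd.cut(df[actual], bins=bins, include_lowest=True)
    # observed=False keeps empty bins in the result (their MAE is NaN)
    stats = abs_err.groupby(groups, observed=False).agg(["mean", "std", "count"])
    # pandas computes the sample standard deviation (ddof=1) here
    return stats.rename(columns={"mean": "mae", "std": "sd", "count": "n"})


def plot_binned_mae(stats):
    """Draw one point per bin at the MAE with error bars of one SD (hypothetical helper)."""
    # Imported lazily so the statistics can be computed without a plotting backend
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    x = np.arange(len(stats))
    ax.errorbar(x, stats["mae"], yerr=stats["sd"], fmt="o", capsize=4)
    ax.set_xticks(x)
    ax.set_xticklabels([str(interval) for interval in stats.index])
    ax.set_xlabel("yact bin")
    ax.set_ylabel("MAE with SD")
    return ax


# Random data as in the answer above (assumed shape: 500 integer pairs in 1..99)
rng = np.random.default_rng(123)
df = pd.DataFrame({"yact": rng.integers(1, 100, 500),
                   "ypred": rng.integers(1, 100, 500)})
stats = binned_error_stats(df, bins=[0, 30, 50, 80, 100])
print(stats)
```

Calling plot_binned_mae(stats) followed by plt.show() should then draw the per-bin MAE with its standard deviation as error bars. Keeping the statistics in a DataFrame also lets you inspect the numbers or swap in another measure of spread before plotting.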
