Pandas variance and standard deviation results differ from manual calculation

Question

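I am trying to calculate the mean, variance and standard deviation using pandas. However, my manual calculation differs from the pandas output. Is there something I am missing when using pandas? An Excel screenshot is attached for reference.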

import pandas as pd

dg_df = pd.DataFrame(
            data=[600,470,170,430,300],
            index=['a','b','c','d','e'])

print(dg_df.mean(axis=0)) # 394.0 matches with manual calculation
print(dg_df.var())        # 27130.0 not matching with manual calculation 21704
print(dg_df.std(axis=0))  # 164.71187 not matching with manual calculation 147.32

Answer 1

Change the default parameter ddof=1 (Delta Degrees of Freedom) to 0 in both DataFrame.var and DataFrame.std. The parameter axis=0 is already the default, so it can be omitted:

print(dg_df.mean())
0    394.0
dtype: float64

print(dg_df.var(ddof=0))  
0    21704.0
dtype: float64

print(dg_df.std(ddof=0))
0    147.322775
dtype: float64
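To see where the two numbers come from, here is a minimal sketch that redoes the calculation by hand. It only assumes the five values from the question; the variable names (ss, population_var, sample_var) are illustrative and not part of pandas. Dividing the sum of squared deviations by n reproduces the manual result, while dividing by n - 1 reproduces the pandas default.

```python
import pandas as pd

values = [600, 470, 170, 430, 300]
n = len(values)

mean = sum(values) / n                        # 394.0
ss = sum((x - mean) ** 2 for x in values)     # sum of squared deviations

population_var = ss / n          # divide by n      (same as ddof=0)
sample_var = ss / (n - 1)        # divide by n - 1  (same as ddof=1, the pandas default)

s = pd.Series(values)
print(population_var, s.var(ddof=0))   # both match the manual value 21704
print(sample_var, s.var())             # both match the pandas default 27130
```

In general the divisor is n - ddof, so ddof=0 matches the manual spreadsheet calculation in the question.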

Answer 2

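There is more than one definition of standard deviation. You are calculating the equivalent of Excel's STDEV.P, which is described as: "Calculates standard deviation based on the entire population ...". If you need the sample standard deviation in Excel, use STDEV.S.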

pd.DataFrame.std defaults to ddof=1 (one delta degree of freedom), which gives the sample standard deviation.

numpy.std defaults to ddof=0 (zero delta degrees of freedom), which gives the population standard deviation.

See Bessel's correction to understand the difference between the sample and population versions.

You can also specify ddof=0 with the pandas std / var methods:

dg_df.std(ddof=0)
dg_df.var(ddof=0)
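The differing defaults described in this answer can be checked side by side. This is a small sketch using the values from the question; a Series is used instead of the original DataFrame only to keep the output scalar.

```python
import numpy as np
import pandas as pd

values = [600, 470, 170, 430, 300]
s = pd.Series(values)

np_default = np.std(values)   # numpy default: ddof=0 (population)
pd_default = s.std()          # pandas default: ddof=1 (sample)

print(np_default)                # population standard deviation
print(pd_default)                # sample standard deviation
print(np.std(values, ddof=1))    # numpy asked for the sample version
print(s.std(ddof=0))             # pandas asked for the population version
```

Passing ddof explicitly in both libraries avoids depending on either default.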

Answer 3

You can also use dg_df.describe(), which returns the following DataFrame. It may be more intuitive.

count     5.00000
mean    394.00000
std     164.71187
min     170.00000
25%     300.00000
50%     430.00000
75%     470.00000
max     600.00000

You can then pull out the specific row you need, for example dg_df.describe().loc['count'].
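Note that the std row in the describe() output above is 164.71187, i.e. the sample standard deviation (ddof=1). A minimal sketch, assuming the DataFrame from the question, of reading that row and computing the population value separately:

```python
import pandas as pd

dg_df = pd.DataFrame(
    data=[600, 470, 170, 430, 300],
    index=['a', 'b', 'c', 'd', 'e'])

summary = dg_df.describe()

# The single column is labelled 0 because no column name was given.
sample_std = summary.loc['std', 0]        # sample standard deviation (ddof=1)
population_std = dg_df[0].std(ddof=0)     # population standard deviation

print(sample_std)
print(population_std)
```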
