What will be the mean of a column if there are missing values in the column?

Question

Sample data

ID	Name	Phone
1	x	+212
2	y	NaN
3	xy	NaN

df is the name of the dataset. The code below gives the names of the columns that have no missing values.

no_nulls = set(df.columns[df.isnull().mean()==0])

isnull() converts the dataset into this:

ID	Name	Phone
False	False	False
False	False	True
False	False	True

Can anyone explain what mean does on non-integer values?

I have used the following and it works, but I am curious about mean:

no_nulls = set(df.columns[df.notnull().all()])

Answer 1

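In your case, .mean() is operating on a boolean DataFrame that contains only True and False values. In this situation, .mean() treats False as 0 and True as 1. So if you look at the result of df.isnull().mean(), you will see: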

df.isnull().mean()

ID       0.000000
Name     0.000000
Phone    0.666667
dtype: float64

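Here, because the ID and Name columns contain only False values, .mean() treats every value as zero and the mean is zero. The Phone column has one False and two True values, so its mean is the mean of 0, 1, 1, which is 2/3 ≈ 0.666667.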

Therefore, when you check df.isnull().mean()==0, only the first two columns are True, so no_nulls is {'ID', 'Name'}.
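To check this yourself, here is a minimal sketch that rebuilds the sample data from the question and repeats the steps above. The variable names null_mask, null_fraction and no_nulls_alt are only illustrative, and the sketch assumes the missing Phone entries are stored as np.nan.

```python
import numpy as np
import pandas as pd

# Rebuild the sample data from the question (NaN assumed to be np.nan).
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "Name": ["x", "y", "xy"],
    "Phone": ["+212", np.nan, np.nan],
})

# Boolean DataFrame: True where a value is missing.
null_mask = df.isnull()

# Column-wise mean of booleans: True counts as 1, False as 0,
# so each entry is the fraction of missing values in that column.
null_fraction = null_mask.mean()

# Columns whose missing fraction is exactly zero.
no_nulls = set(df.columns[null_fraction == 0])

# The alternative from the question: columns where every value is non-null.
no_nulls_alt = set(df.columns[df.notnull().all()])

print(null_fraction)
print(no_nulls, no_nulls_alt)
```

Both approaches select the same columns here. The mean() version also tells you how much of each column is missing, while notnull().all() only answers yes or no.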

Refer to the official documentation of DataFrame.mean: the numeric_only= parameter gives some hints, and note the default behavior under its default setting:

Parameters

numeric_only bool, default None

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data.
