How to properly filter multiple columns in Pandas?

Question

I am working with this dataset: https://www.kaggle.com/uciml/pima-indians-diabetes-database.

I want to filter the data frame based on whether a row contains any zeros (ignoring the Outcome column).

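When I filter the data frame on a single column, everything works fine:

However, when I filter on two or more columns, I get a different number of rows depending on whether I do this:

or this:

I get 429 rows and 652 rows, respectively.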

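So I tried filtering with iloc:

But this only fills the columns with NaN and does not remove any rows. It also changes the Outcome column, which I want to leave untouched. The iloc approach seems to work only when filtering one column at a time.

Is there a way to filter all 8 columns at once instead of one column at a time?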

Answer 1

You can do it like this:

df[df.iloc[:, 0:5] < 10].dropna(how='all', axis=1).dropna()

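This first creates a mask that selects all values in the first 5 columns that are less than 10. Indexing the data frame with that mask then keeps only the values the mask selected.

Because the mask does not cover all columns, indexing the data frame with it returns the columns the mask did not consider (from the 6th column onward) as columns of pure NaN values. .dropna(how='all', axis=1) then drops every column that is entirely NaN.

Finally, .dropna() drops every row that contains any NaN, leaving only the rows in which all values meet the condition (less than 10).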
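To make the three steps concrete, here is a minimal sketch on a small made-up frame. The column names and values are assumptions chosen for illustration, not the Pima data, and the mask covers only the first two columns so the effect is easy to see.

```python
import pandas as pd

# Toy data, invented for illustration only
df = pd.DataFrame({
    "a": [1, 20, 3],
    "b": [4, 5, 60],
    "c": [7, 8, 9],
})

# Step 1: boolean mask over the first two columns only (positional slice)
mask = df.iloc[:, 0:2] < 10

# Step 2: indexing with the mask keeps matching cells and puts NaN elsewhere;
# column "c" is not covered by the mask, so it becomes all NaN
masked = df[mask]

# Step 3: drop the all-NaN columns, then drop rows containing any NaN
filtered = masked.dropna(how="all", axis=1).dropna()
print(filtered)
```

Note that the result holds only the masked columns, and their values are cast to float because NaN was introduced along the way. If every original column should survive, one option is to reuse the surviving index, for example df.loc[filtered.index].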

Answer 2

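You can use apply to test all columns at once: check whether each value is nonzero, then keep only the rows in which every column passes the check.

result = df.drop(["Outcome"], axis=1).apply(lambda x: x != 0, axis=0).all(axis=1)
df[result]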

An alternative solution that does not use apply:

# determine for each cell whether its value is zero
matches = df.drop(["Outcome"], axis=1) == 0

# build rowsums. It counts the number of zero values.
# if there are no zero values in a row, the rowsum is 0
# find all rows with a rowsum of 0
relevant_rows = matches.sum(axis=1) == 0

# subset just those rows with rowsum == 0
df.loc[relevant_rows, :]
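As a quick check that this row filter behaves as described, the following sketch runs it on a tiny invented frame. Only the column names mimic the dataset; the values are assumptions. It also shows a shorter formulation with .all(axis=1), which should select the same rows as the zero-count approach.

```python
import pandas as pd

# Toy frame; values are invented, only the column names mimic the dataset
df = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BMI": [33.6, 26.6, 0.0, 28.1],
    "Outcome": [1, 0, 1, 0],
})

# Leave Outcome out of the test so its zeros do not remove rows
features = df.drop(columns=["Outcome"])

# True for rows where every feature value is nonzero
no_zeros = (features != 0).all(axis=1)

# Equivalent formulation: count zeros per row and require a count of 0
zero_count = (features == 0).sum(axis=1)

# Index the full frame, so Outcome is kept unchanged
kept = df[no_zeros]
print(kept)
```

Because the boolean Series indexes the original frame, the Outcome column is carried along untouched, including rows where Outcome itself is 0.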

Answer 3

Your first attempt at filtering on multiple columns:

data[(data.Pregnancies & data.Glucose) != 0]

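is wrong: & is applied to the integer values themselves (a bitwise AND), and only that result is compared with 0.

The second one: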

data[(data.Pregnancies != 0) & (data.Glucose != 0)]

is correct.

That is why the results differ.
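A small sketch of the difference, using invented values rather than rows from the dataset: with integer columns, & combines the bits of the two numbers, so two nonzero values can still produce 0.

```python
import pandas as pd

# Toy values chosen to expose the difference; not taken from the dataset
data = pd.DataFrame({"Pregnancies": [1, 2, 3, 0], "Glucose": [2, 2, 5, 7]})

# Bitwise AND of the integers themselves: 1 & 2 == 0 although neither is zero
bitwise = (data.Pregnancies & data.Glucose) != 0

# Element-wise AND of two boolean conditions: the intended test
logical = (data.Pregnancies != 0) & (data.Glucose != 0)

print(bitwise.tolist())  # [False, True, True, False]
print(logical.tolist())  # [True, True, True, False]
```

The first row is dropped by the bitwise version even though both values are nonzero, which is the kind of row that accounts for the smaller count.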


python

filtering

pandas