How do I remove columns in Pandas that contain non-zero values in fewer than 1% of the rows?

Question

I have the following dataset:

        Col1    Col2    Col3    Col4    Col5    Col6    Col7    Col8    Col9    Col10   ... Col991  Col992  Col993  Col994  Col995  Col996  Col997  Col998  Col999  Col1000
rows                                                                                    
Row1    0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
Row2    0   0   0   0   0   23  0   0   0   0   ... 0   0   0   0   7   0   0   0   0   0
Row3    97  0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
Row4    0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
Row5    0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Row496  182 0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   116 0   0   0
Row497  0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
Row498  0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
Row499  0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   0   0   0   0
Row500  0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   0   125 0   0   0

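I am trying to drop the columns whose number of non-zero entries is less than 1% of the number of rows.

I can compute the percentage of non-zero entries per column with: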

(df[df > 0.0].count()/df.shape[0])*100

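How should I use this to get a df that contains only the columns with non-zero values in more than 1% of the rows? Also, how should I change the code to drop the rows whose non-zero values cover less than 1% of the columns?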

Answer 1

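Use mean on a boolean mask to get the fraction of non-zero entries per column, then select the matching columns with loc:

df.loc[:, df.ne(0).mean() >= 0.01]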
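
To make the column filter concrete, here is a minimal sketch on toy data. The frame, the column names `a`, `b`, `c` and the variable names are made up for illustration, and "non-zero" is assumed to mean `!= 0`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 200 rows, three columns with 1, 3 and 50 non-zero entries.
data = np.zeros((200, 3), dtype=int)
data[0, 0] = 5      # column "a": 1/200 = 0.5% non-zero
data[:3, 1] = 7     # column "b": 3/200 = 1.5% non-zero
data[:50, 2] = 9    # column "c": 50/200 = 25% non-zero
df = pd.DataFrame(data, columns=["a", "b", "c"])

# The mean of a boolean mask is the fraction of True values per column.
nonzero_share = df.ne(0).mean()

# Keep only the columns that are non-zero in at least 1% of the rows.
kept = df.loc[:, nonzero_share >= 0.01]
print(list(kept.columns))
```

Column `a` falls below the 1% threshold and is dropped; `b` and `c` are kept.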

Answer 2

You can use loc to pick the rows or columns you want for the new df, as shown in the answer. Basically you can do:

df.loc[rows, cols]  # accepts boolean lists/arrays

So a df with the unwanted columns removed can be obtained like this:

col_condition = df[df > 0].count() / df.shape[0] >= .01
df_ = df.loc[:, col_condition]  # boolean mask on the column axis needs .loc

If you need to switch between columns and rows, just transpose the DataFrame:

df.T

The same approach works for rows whose non-zero count is less than 1% of the number of columns:

row_condition = df.T[df.T > 0].count() / df.shape[1] >= .01
df_ = df[row_condition]
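
As an alternative to transposing, the row-wise share can probably be computed directly with `axis=1`. This is only a sketch on made-up data (the index labels `r1` to `r3` and the variable names are hypothetical), again assuming "non-zero" means `!= 0`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 rows, 200 columns.
data = np.zeros((3, 200), dtype=int)
data[0, 0] = 5      # row "r1": 1/200 = 0.5% non-zero
data[1, :3] = 7     # row "r2": 3/200 = 1.5% non-zero
                    # row "r3": all zeros
df = pd.DataFrame(data, index=["r1", "r2", "r3"])

# axis=1 averages across the columns, giving one fraction per row.
row_share = df.ne(0).mean(axis=1)

# Keep only the rows that are non-zero in at least 1% of the columns.
kept_rows = df.loc[row_share >= 0.01]
print(list(kept_rows.index))
```

This avoids building the transposed frame, which may matter for wide data.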

There is also a more concise form:

df_ = df.loc[:, df.gt(0).mean() >= .01]  # keep columns
df_ = df[df.T.gt(0).mean() >= .01]  # keep rows
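
One thing to watch: `gt(0)` counts only positive values, so it matches "non-zero" only when the data has no negative entries, as in the sample shown in the question. If negatives can occur, `ne(0)` is likely the safer mask. A small sketch with made-up values:

```python
import pandas as pd

# Hypothetical frame: column "x" holds a negative value, column "y" a positive one.
df = pd.DataFrame({"x": [0, -3, 0, 0], "y": [0, 2, 0, 0]})

share_gt = df.gt(0).mean()  # share of strictly positive entries per column
share_ne = df.ne(0).mean()  # share of entries different from zero per column

print(share_gt["x"], share_ne["x"])
```

Here the negative entry in `x` is ignored by `gt(0)` but counted by `ne(0)`. Note also that `ne(0)` treats NaN as non-zero, so missing values may need to be handled first.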

