How to get the most frequent row in a table

Question

How can I get the most frequent row in a DataFrame? For example, if I have the following table:

   col_1  col_2 col_3
0      1      1     A
1      1      0     A
2      0      1     A
3      1      1     A
4      1      0     B
5      1      0     C

Expected result:

   col_1  col_2 col_3
0      1      1     A

Edit: I need the most frequent row (as one unit), not the most frequent value in each column, which can be computed with the mode() method.
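
To reproduce the example, the table above can be built as in the following minimal sketch (the variable name `col_modes` is only illustrative). It also shows why `mode()` is not enough here: it works column by column, and because `col_2` has a tie between 0 and 1, the result has two rows padded with NaN, and its first row is (1, 0, A) rather than the most frequent row (1, 1, A).

```python
import pandas as pd

# Sample table from the question.
df = pd.DataFrame({
    "col_1": [1, 1, 0, 1, 1, 1],
    "col_2": [1, 0, 1, 1, 0, 0],
    "col_3": ["A", "A", "A", "A", "B", "C"],
})

# mode() computes the most frequent value of each column independently,
# so its rows are combinations of column modes, not rows counted as a unit.
col_modes = df.mode()
print(col_modes)
```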

Answer 1

Take a look at groupby:

df.groupby(df.columns.tolist()).size().sort_values().tail(1).reset_index().drop(columns=0)
   col_1  col_2 col_3  
0      1      1     A
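
The one-liner is easier to follow when split into steps. The sketch below is one possible breakdown, not the answerer's own code; the names `counts`, `top` and the column name `"count"` are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "col_1": [1, 1, 0, 1, 1, 1],
    "col_2": [1, 0, 1, 1, 0, 0],
    "col_3": ["A", "A", "A", "A", "B", "C"],
})

# 1. Group on every column, so each distinct row becomes one group,
#    and count how many rows fall into each group.
counts = df.groupby(df.columns.tolist()).size()

# 2. Sort the counts in ascending order; the largest count ends up last.
counts = counts.sort_values()

# 3. Keep the last entry and turn its MultiIndex back into regular columns.
top = counts.tail(1).reset_index(name="count")

# 4. Drop the helper count column to get the row itself.
result = top.drop(columns="count")
print(result)
```

Two caveats: with default settings groupby leaves out rows that contain NaN in any grouping column, and when several rows share the highest count, which of them ends up last after sorting is not something to rely on.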

Answer 2

You can do this with groupby and size:

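counts = df.groupby(df.columns.tolist(), as_index=False).size()  # one row per distinct row plus a "size" column (pandas >= 1.1)
result = counts.loc[[counts["size"].idxmax()]].drop(columns="size")  # keep the row with the largest count
result = result.reset_index(drop=True)  # this is just to reset the index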
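
Note that idxmax returns only one label, so ties are silently reduced to a single row. If all rows tied for the highest count are wanted, something like the following sketch could be used. The helper `most_frequent_rows` is hypothetical, and it assumes the frame has no column already named "size".

```python
import pandas as pd

def most_frequent_rows(frame: pd.DataFrame) -> pd.DataFrame:
    """Return every distinct row whose count equals the maximum count."""
    # Count each distinct row; reset_index(name=...) also works on older pandas.
    counts = frame.groupby(frame.columns.tolist()).size().reset_index(name="size")
    # Keep all rows that reach the maximum count, not just the first one.
    top = counts[counts["size"] == counts["size"].max()]
    return top.drop(columns="size").reset_index(drop=True)

# Example with a tie: (1, "a") and (2, "b") both occur twice.
tied = pd.DataFrame({"x": [1, 1, 2, 2, 3], "y": ["a", "a", "b", "b", "c"]})
tied_result = most_frequent_rows(tied)
print(tied_result)
```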

Answer 3

With NumPy's np.unique:

In [92]: u,idx,c = np.unique(df.values.astype(str), axis=0, return_index=True, return_counts=True)

In [99]: df.iloc[[idx[c.argmax()]]]
Out[99]: 
   col_1  col_2 col_3
0      1      1     A

If performance matters, convert the string column to numeric codes first and then use np.unique:

a = np.c_[df.col_1, df.col_2, pd.factorize(df.col_3)[0]]
u,idx,c = np.unique(a, axis=0, return_index=True, return_counts=True)
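
The snippet above stops before selecting the row. A complete version might look like the sketch below. As a variation on the answer, it factorizes every column instead of naming them one by one; the name `codes` is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col_1": [1, 1, 0, 1, 1, 1],
    "col_2": [1, 0, 1, 1, 0, 0],
    "col_3": ["A", "A", "A", "A", "B", "C"],
})

# Encode each column as integer codes so np.unique compares numbers, not strings.
codes = np.column_stack([pd.factorize(df[col])[0] for col in df.columns])

# idx holds the position of the first occurrence of each unique row,
# c holds how often each unique row occurs.
u, idx, c = np.unique(codes, axis=0, return_index=True, return_counts=True)

# Select the original row that corresponds to the largest count.
result = df.iloc[[idx[c.argmax()]]]
print(result)
```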

Answer 4

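The numpy_indexed library (usually imported as npi) helps with groupby-type problems, with less code and performance similar to NumPy. So here is an alternative that closely resembles @Divakar's np.unique()-based solution: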

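import numpy_indexed as npi

arr = df.values.astype(str)
mult = npi.multiplicity(arr)  # for each row, how many times that row occurs
output = df.iloc[[mult.argmax()]]  # first row with the highest multiplicity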

Answer 5

As of pandas 1.1.0, the value_counts() method can be used to count the unique rows of a DataFrame:

df.value_counts()

Output:

col_1  col_2  col_3
1      1      A        2
       0      C        1
              B        1
              A        1
0      1      A        1

This method can be used to find the most frequent row:

df.value_counts().head(1).index.to_frame(index=False)

Output:

   col_1  col_2 col_3
0      1      1     A
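
If the count is needed along with the row, the same Series can be queried directly. This is a minimal sketch assuming pandas 1.1.0 or later; the names `top_row` and `top_count` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "col_1": [1, 1, 0, 1, 1, 1],
    "col_2": [1, 0, 1, 1, 0, 0],
    "col_3": ["A", "A", "A", "A", "B", "C"],
})

# value_counts() returns a Series whose MultiIndex holds the distinct rows.
counts = df.value_counts()

# idxmax() gives the index entry (a tuple of row values) with the largest count.
top_row = counts.idxmax()
top_count = counts.max()
print(top_row, top_count)
```

As with idxmax elsewhere, only one row is reported when several rows share the highest count.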
