How to get the index of the mode value of a specific column in a pandas data frame

I have a sorted data frame as follows:

            x_test         test_label     x_train             train_label  \
37  [[6.3, 3.3, 4.7, 1.6]]        [1]  [[6.4, 3.2, 4.5, 1.5]]         [1]   
63  [[6.3, 3.3, 4.7, 1.6]]        [1]  [[6.0, 3.4, 4.5, 1.6]]         [1]   
67  [[6.3, 3.3, 4.7, 1.6]]        [1]  [[6.1, 3.0, 4.6, 1.4]]         [1]   
96  [[6.3, 3.3, 4.7, 1.6]]        [1]  [[6.1, 3.0, 4.9, 1.8]]         [2]   
51  [[6.3, 3.3, 4.7, 1.6]]        [1]  [[5.9, 3.2, 4.8, 1.8]]         [1]   

    dist  
37  0.26  
63  0.37  
67  0.42  
96  0.46  
51  0.47  

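I want to find the mode of the 'train_label' column (any one of them, if there are several) and get its index. Then I want to look up the value of 'test_label' at that index. How can I do this?

I tried using df.mode() but it did not work.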

You first need to flatten the data so that each label is a scalar rather than a one-element list, for example:

>>> df["train_label"]=df["train_label"].apply(lambda x: x[0])
>>> df
    dist  test_label  train_label                  x_test                 x_train
37  0.26           1            1  [[6.3, 3.3, 4.7, 1.6]]  [[6.4, 3.2, 4.5, 1.5]]
63  0.37           1            1  [[6.3, 3.3, 4.7, 1.6]]  [[6.0, 3.4, 4.5, 1.6]]
67  0.42           1            1  [[6.3, 3.3, 4.7, 1.6]]  [[6.1, 3.0, 4.6, 1.4]]
96  0.46           1            2  [[6.3, 3.3, 4.7, 1.6]]  [[6.1, 3.0, 4.9, 1.8]]
51  0.47           1            1  [[6.3, 3.3, 4.7, 1.6]]  [[5.9, 3.2, 4.8, 1.8]]

Then run df.mode():

>>> df.mode(numeric_only=True)
   dist  test_label  train_label
0  0.26         1.0          1.0
1  0.37         NaN          NaN
2  0.42         NaN          NaN
3  0.46         NaN          NaN
4  0.47         NaN          NaN
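Note that `mode()` returns the modal values themselves, labeled 0, 1, ..., and not the row labels where those values occur, which is probably why calling it directly did not give the index. A minimal sketch of the difference, assuming a hypothetical Series `s` that mirrors the flattened `train_label` column above:

```python
import pandas as pd

# Hypothetical Series mirroring the flattened train_label column above.
s = pd.Series([1, 1, 1, 2, 1], index=[37, 63, 67, 96, 51], name="train_label")

modes = s.mode()
print(modes.tolist())        # the modal value(s): [1]
print(modes.index.tolist())  # positions within the result, not row labels: [0]

# Row labels where the first modal value occurs.
mode_rows = s.index[s == modes.iloc[0]]
print(mode_rows.tolist())    # [37, 63, 67, 51]
```

The comparison `s == modes.iloc[0]` builds a boolean mask over the original rows, so the original labels are preserved.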

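First, find the index labels of the rows where the train column equals its mode:

 # Unwrap the single-element lists so the labels are plain scalars.
 df.loc[:, 'train_label'] = df['train_label'].apply(lambda x: x[0])
 df.loc[:, 'test_label'] = df['test_label'].apply(lambda x: x[0])

 # mode() returns the modal values, not row labels, so compare against
 # the first mode and take the labels of the matching rows.
 tr_mode_idx = df.index[df['train_label'] == df['train_label'].mode()[0]]

Then use those labels to look up the test_label values:

 df.loc[tr_mode_idx, 'test_label']

Alternatively, select with isin, which keeps the rows for every modal value if there is a tie:

 df.test_label[df.train_label.isin(df.train_label.mode())]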

Result:

37    [1]
63    [1]
67    [1]
51    [1]
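The steps above can be put together in a self-contained sketch. The frame here is rebuilt from the label and dist columns shown in the question (the x_test and x_train columns are left out for brevity), and the name `flat` is only an illustration:

```python
import pandas as pd

# Rebuild the label and dist columns from the question, keeping the row labels.
df = pd.DataFrame(
    {
        "test_label": [[1], [1], [1], [1], [1]],
        "train_label": [[1], [1], [1], [2], [1]],
        "dist": [0.26, 0.37, 0.42, 0.46, 0.47],
    },
    index=[37, 63, 67, 96, 51],
)

# Unwrap the single-element lists into scalars (lists are not hashable,
# so mode() cannot count them directly).
flat = df.assign(
    train_label=df["train_label"].apply(lambda x: x[0]),
    test_label=df["test_label"].apply(lambda x: x[0]),
)

# Take the first modal value and select the matching rows with a boolean mask.
mode_value = flat["train_label"].mode().iloc[0]
mask = flat["train_label"] == mode_value
result = flat.loc[mask, "test_label"]
print(result)
```

Here `result` keeps the original row labels 37, 63, 67 and 51, so the same mask can be reused on any other column of the frame.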

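I don't think either of the answers above is the best approach. I suggest using boolean indexing to select the subset of the column that equals the mode. Doing so also gives you the index labels of those rows. You can then pass those labels to any other column to get its values at the same rows.

This reduces to a single line of code: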

df['test_label'].loc[df['train_label'][df['train_label'] == df['train_label'].mode()[0]].index]
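One caveat with `mode()[0]`: if several values tie for most frequent, it keeps only the first one returned. A hedged sketch of the same boolean-indexing idea wrapped in a hypothetical helper, `values_at_mode` (the name, its parameters, and the `demo` frame are illustrative, not part of pandas or the question):

```python
import pandas as pd


def values_at_mode(df, mode_col, value_col, all_modes=False):
    """Return df[value_col] for the rows where df[mode_col] equals its mode.

    With all_modes=False only the first modal value is used; with
    all_modes=True rows matching any tied modal value are kept.
    """
    modes = df[mode_col].mode()
    if not all_modes:
        modes = modes.iloc[:1]
    mask = df[mode_col].isin(modes)
    return df.loc[mask, value_col]


# Illustrative data in which 1 and 2 are tied as modes of train_label.
demo = pd.DataFrame(
    {"train_label": [1, 2, 2, 1, 3], "test_label": [10, 20, 30, 40, 50]}
)

first_only = values_at_mode(demo, "train_label", "test_label")
every_mode = values_at_mode(demo, "train_label", "test_label", all_modes=True)
print(first_only.index.tolist())  # [0, 3]
print(every_mode.index.tolist())  # [0, 1, 2, 3]
```

Using `isin` instead of `==` is what lets the same mask handle one mode or several without changing the selection code.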

So I created a data frame and selected the column:

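import pandas as pd

df = pd.DataFrame({"A": [14, 4, 5, 4, 1],
                   "B": [5, 2, 54, 3, 2],
                   "C": [20, 20, 7, 3, 8],
                   "train_label": [14, 3, 6, 2, 6]})
# The frame looks like this:
#     A   B   C  train_label
# 0  14   5  20           14
# 1   4   2  20            3
# 2   5  54   7            6
# 3   4   3   3            2
# 4   1   2   8            6

# mode() returns a Series of the most frequent value(s); here only 6.
X = df['train_label'].mode()

# For each modal value, print the index labels of the rows where it occurs.
for i in X:
    print(df['train_label'].loc[df['train_label'] == i].index)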

Output

Int64Index([2, 4], dtype='int64')