Index returned by np.argmax of a series within a dataframe slice points to wrong value when used as index into same dataframe

Question

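I have a dataframe created from collected sample data. I then manipulate the dataframe to remove duplicates, sort it, and remove saturated values: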

df = pd.read_csv(path+ newfilename, header=0, usecols=[0,1,2,3,5,7,10],
                names=['ch1_real', 'ch1_imag', 'ch2_real', 'ch2_imag', 'ch1_log_mag', 'ch1_phase',
                      'ch2_log_mag', 'ch2_phase', 'pr_sample_real', 'pr_sample_imag', 'distance'])    
tmp=df.drop_duplicates(subset='distance', keep='first').copy()
tmp.sort_values("distance", inplace=True)
dfUnique=tmp[tmp.distance <65000].copy()

I also add two computed values (with some help):

dfUnique['ch1_log_mag'] = 20*np.log10((dfUnique.ch1_real + 1j*dfUnique.ch1_imag).abs())
dfUnique['ch2_log_mag'] = 20*np.log10((dfUnique.ch2_real + 1j*dfUnique.ch2_imag).abs())

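The problem arises when I try to find the index of the maximum magnitude. It turns out (unexpectedly, to me) that the dataframe keeps the original data index. So after sorting and removing rows, the index of a given row is not its position in the new sorted dataframe, but its row index in the original dataframe: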

         ch1_real  ch1_imag  ch2_real  ...  distance  ch1_log_mag  ch2_log_mag
79   0.011960 -0.003418  0.005127  ...       0.0   -38.104414   -33.896518
78  -0.009766 -0.005371 -0.015870  ...       1.0   -39.058001   -34.533870
343  0.002197  0.010990  0.003662  ...       2.0   -39.009865   -37.278737
80  -0.002686  0.010740  0.011960  ...       3.0   -39.116435   -34.902513
341 -0.007080  0.009033  0.016600  ...       4.0   -38.803434   -35.582833
81  -0.004883 -0.008545 -0.016850  ...      12.0   -40.138523   -35.410047
83  -0.009277  0.004883 -0.000977  ...      14.0   -39.589769   -34.848170
84   0.006592 -0.010250 -0.009521  ...      27.0   -38.282239   -33.891250
85   0.004395  0.010010  0.017580  ...      41.0   -39.225735   -34.890353
86  -0.007812 -0.005127 -0.015380  ...      53.0   -40.589187   -35.625615

When I then use:

np.argmax(dfUnique.ch1_log_mag)

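to find the index of the maximum magnitude, it returns the position within the new, sorted dataframe series. But when I use that value to index the dataframe and extract other values from the same row, I get the elements of the row that carries that index in the original dataframe.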

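I exported the dataframe to Excel to make it easier to see what is going on. Column 1 is the dataframe index. Note that this is different from the row number in the spreadsheet.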

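The np.argmax command above returns 161. If I look at the newly sorted dataframe, position 161 is the highlighted row (data starts on the second row of the spreadsheet, and indexing starts at 0 in Python), which is correct. However, in the ordering of the original dataframe, this row is at index 238. Then when I try to access ch1_log_mag[161],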

dfUnique.ch1_log_mag[161]

I get -30.9759 instead of -11.453. It fetches the value using 161 as an index label from the original dataframe.

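This is scary: two functions use two different frames of reference (at least from the point of view of a novice Python user). How do I avoid this? Should I reindex the dataframe, and if so, how? Or should I use an equivalent pandas method to find the maximum of a series within a dataframe (assuming the problem comes from the different ways pandas and numpy operate on the data)? Is the problem the way I create the copies of the dataframe?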

Answer 1

If you sort a dataframe, it keeps its index.

import numpy as np
import pandas as pd
a = pd.DataFrame(np.random.randn(24).reshape(6,4), columns=list('abcd'))
a.sort_values(by='d', inplace=True)
print(a)
>>>
          a         b         c         d
2 -0.553612  1.407712 -0.454262 -1.822359
0 -1.046893  0.656053  1.036462 -0.994408
5 -0.772923 -0.554434 -0.254187 -0.948573
4 -1.660773  0.291029  1.785757 -0.457495
3  0.128831  1.399746  0.083545 -0.101106
1 -0.250536 -0.045355  0.072153  1.871799

To reset the index, you can use .reset_index(drop=True):

b = a.sort_values(by='d').reset_index(drop=True)
print(b)
>>>
          a         b         c         d
0 -0.553612  1.407712 -0.454262 -1.822359
1 -1.046893  0.656053  1.036462 -0.994408
2 -0.772923 -0.554434 -0.254187 -0.948573
3 -1.660773  0.291029  1.785757 -0.457495
4  0.128831  1.399746  0.083545 -0.101106
5 -0.250536 -0.045355  0.072153  1.871799
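
Applied to the pipeline in the question, this could look roughly like the sketch below. The column names come from the question, but the data values are made up for illustration, and the chained style is just one way to write it. After the reset, labels and positions coincide, so a positional argmax can safely be used as a label.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the sampled data; the values are invented.
df = pd.DataFrame({
    "distance":    [3.0, 1.0, 3.0, 70000.0, 2.0, 0.0],
    "ch1_log_mag": [-20.0, -35.0, -99.0, -5.0, -11.0, -40.0],
})

clean = (
    df.drop_duplicates(subset="distance", keep="first")  # drop repeated distances
      .sort_values("distance")                           # sort, old labels are kept
      .loc[lambda d: d["distance"] < 65000]              # drop saturated values
      .reset_index(drop=True)                            # relabel rows 0..n-1
)

pos = int(np.argmax(clean["ch1_log_mag"].to_numpy()))  # position of the maximum
label = clean["ch1_log_mag"].idxmax()                  # label of the maximum

print(pos, label)                      # both refer to the same row now
print(clean.loc[label, "distance"])    # distance at the maximum magnitude
```

Resetting the index only once, at the end of the chain, matters: any later filtering or sorting would again leave labels that no longer match positions.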

To find the original index label of the maximum value, you can use .idxmax() and then .loc[]:

ix_max = a.d.idxmax()
# note: in recent pandas versions np.argmax(a.d) returns the position, not the label
print(f"ix_max = {ix_max}")
a.loc[ix_max]
>>>
ix_max = 1
a   -0.250536
b   -0.045355
c    0.072153
d    1.871799
Name: 1, dtype: float64
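
The mismatch described in the question can be reproduced on a tiny series. This is a minimal sketch with invented values and labels: the position of the maximum happens to be a valid label too, so the mistaken lookup silently returns another row's value instead of raising an error.

```python
import numpy as np
import pandas as pd

# Hypothetical series whose labels are out of order, as after a sort.
s = pd.Series([-30.0, -11.5, -38.0], index=[2, 0, 1], name="ch1_log_mag")

pos = int(np.argmax(s.to_numpy()))  # position of the maximum in the current order
label = s.idxmax()                  # index label of the maximum

right_by_label = s.loc[label]   # label lookup with a label: the maximum
right_by_pos = s.iloc[pos]      # positional lookup with a position: the maximum
wrong = s.loc[pos]              # label lookup with a position: a different row

print(pos, label)
print(right_by_label, right_by_pos, wrong)
```

The rule of thumb is to pair idxmax() with .loc and a positional argmax with .iloc, and never to mix the two.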

Or, if you have a position in the new (sorted) order, you can use .iloc:

iix = np.argmax(a.d.values)
print(f"iix = {iix}")
print(a.iloc[iix])
>>>
iix = 5
a   -0.250536
b   -0.045355
c    0.072153
d    1.871799
Name: 1, dtype: float64
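
If the position is what you already have, it can also be converted to a label through the index itself. The sketch below uses an invented three-row frame with non-sequential labels similar to those in the question; the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with labels left over from an earlier sort.
df = pd.DataFrame(
    {"distance": [0.0, 1.0, 2.0], "ch1_log_mag": [-38.0, -11.5, -30.9]},
    index=[79, 238, 161],
)

pos = int(np.argmax(df["ch1_log_mag"].to_numpy()))  # position of the maximum
label = df.index[pos]                               # label stored at that position

row_by_pos = df.iloc[pos]        # whole row, selected by position
row_by_label = df.loc[label]     # the same row, selected by label
dist = df["distance"].iloc[pos]  # another column's value in that row

print(pos, label, dist)
```

Calling .to_numpy() before np.argmax makes the result a plain position regardless of pandas version, which avoids depending on how np.argmax treats a Series.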

You can have a look at https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html

python

indexing

numpy

max

dataframe