Reference DataFrame value corresponding to column header

Question

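I'm trying to append a column to my DataFrame based on values that are referenced by the indicated column names.

I have the following DataFrame: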

     Area      1         2         3         4      Select     
-----------------------------------------------------------
0      22     54        33        46        23           4       
1      45     36        54        32        14           1        
2      67     34        29        11        14           3       
3      54     35        19        22        45           2        
4      21     27        39        43        22           3

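The values under "Select" reference the values under the column number that "Select" shows. For example, for row 0, "Select" shows 4, which refers to the value under column "4" in row 0, which is 23. Then for row 1, "Select" shows 1, which refers to the value under column "1" in row 1, which is 36.

I want to add a new column to my DataFrame that holds the values "Select" is referencing.

So I need to take my DataFrame and create the following DataFrame: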

     Area      1         2         3         4      Select      Value
----------------------------------------------------------------------
0      22     54        33        46        23           4         23
1      45     36        54        32        14           1         36 
2      67     34        29        11        14           3         11
3      54     35        19        22        45           2         19
4      21     27        39        43        22           3         43

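I'm unsure how to pull the values from under the numbered columns that the "Select" column references, since the column headers are just titles and not actual values to index on. How can this be done in Python?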

Answer 1

Try .apply with axis=1. In the lambda, you can use the value from the Select column to reference the value:

df["Value"] = df.apply(lambda x: x[x["Select"]], axis=1)
print(df)

Prints:

   Area   1   2   3   4  Select  Value
0    22  54  33  46  23       4     23
1    45  36  54  32  14       1     36
2    67  34  29  11  14       3     11
3    54  35  19  22  45       2     19
4    21  27  39  43  22       3     43
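
For a self-contained check of the approach above, here is a minimal sketch that rebuilds the question's DataFrame and applies the row-wise lookup. It assumes the column headers are the integers 1 to 4 (as in the setup shown at the end of Answer 2), and it uses row.loc[...] instead of the answer's x[...] purely to make the label-based lookup explicit.

```python
import pandas as pd

# Assumed setup: numbered column headers are integers, not strings.
df = pd.DataFrame({
    "Area": [22, 45, 67, 54, 21],
    1: [54, 36, 34, 35, 27],
    2: [33, 54, 29, 19, 39],
    3: [46, 32, 11, 22, 43],
    4: [23, 14, 14, 45, 22],
    "Select": [4, 1, 3, 2, 3],
})

# For each row, use the value in "Select" as the column label to read from.
df["Value"] = df.apply(lambda row: row.loc[row["Select"]], axis=1)
print(df)
```

Because apply with axis=1 calls the lambda once per row, this is convenient for small frames but may be slow on large ones.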

Answer 2

We can use numpy indexing, as recommended by the documentation section "Looking up values by index/column labels", as the replacement for the deprecated DataFrame.lookup.

With factorize on Select and reindex:

idx, cols = pd.factorize(df['Select'])
df['value'] = (
    df.reindex(cols, axis=1).to_numpy()[np.arange(len(df)), idx]
)

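Note 1: If a value in the factorized column does not correspond to a column header, the resulting value will be NaN (which indicates missing data).
Note 2: Both indexers need to be 0-based range indexes (compatible with numpy indexing). np.arange(len(df)) creates a range index based on the length of the DataFrame and therefore works in all cases.

However, if the DataFrame already has a compatible index (as in this example), df.index can be used directly: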

idx, cols = pd.factorize(df['Select'])
df['value'] = (
    df.reindex(cols, axis=1).to_numpy()[df.index, idx]
)

df:

   Area   1   2   3   4  Select  value
0    22  54  33  46  23       4     23
1    45  36  54  32  14       1     36
2    67  34  29  11  14       3     11
3    54  35  19  22  45       2     19
4    21  27  39  43  22       3     43
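
To see why factorize and reindex line up, it may help to inspect the intermediate values. The sketch below is only an illustration on the question's data; the names codes, uniques and picked are chosen here and are not part of the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Area": [22, 45, 67, 54, 21],
    1: [54, 36, 34, 35, 27],
    2: [33, 54, 29, 19, 39],
    3: [46, 32, 11, 22, 43],
    4: [23, 14, 14, 45, 22],
    "Select": [4, 1, 3, 2, 3],
})

codes, uniques = pd.factorize(df["Select"])
print(codes)    # position of each row's label within `uniques`
print(uniques)  # distinct labels, in order of first appearance

# Reorder the columns to match `uniques`, so that code k points at column k.
picked = df.reindex(uniques, axis=1).to_numpy()

# Pair every row position with its column position.
df["value"] = picked[np.arange(len(df)), codes]
print(df)
```

A label in Select that is not a column header would still get a code, but reindex would create an all-NaN column for it, which is where the NaN in Note 1 comes from.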

Another option is Index.get_indexer:

df['value'] = df.to_numpy()[
    df.index.get_indexer(df.index),
    df.columns.get_indexer(df['Select'])
]

Note: The same condition applies as above. If df.index is already a contiguous 0-based index (compatible with numpy indexing), we can use df.index directly instead of Index.get_indexer:

df['value'] = df.to_numpy()[
    df.index,
    df.columns.get_indexer(df['Select'])
]

df:

   Area   1   2   3   4  Select  value
0    22  54  33  46  23       4     23
1    45  36  54  32  14       1     36
2    67  34  29  11  14       3     11
3    54  35  19  22  45       2     19
4    21  27  39  43  22       3     43
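
As an illustration of when the row indexer matters, the sketch below gives the same data a hypothetical index of 10, 20, ... (not from the question). Passing df.index straight to numpy would fail here, since those labels are not valid positions, whereas get_indexer converts them to 0-based positions.

```python
import pandas as pd

# Hypothetical non-zero-based index, chosen only for this illustration.
df = pd.DataFrame({
    "Area": [22, 45, 67, 54, 21],
    1: [54, 36, 34, 35, 27],
    2: [33, 54, 29, 19, 39],
    3: [46, 32, 11, 22, 43],
    4: [23, 14, 14, 45, 22],
    "Select": [4, 1, 3, 2, 3],
}, index=[10, 20, 30, 40, 50])

row_pos = df.index.get_indexer(df.index)         # labels -> row positions
col_pos = df.columns.get_indexer(df["Select"])   # labels -> column positions

df["value"] = df.to_numpy()[row_pos, col_pos]
print(df)
```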

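Warning for get_indexer: if a value in Select does not correspond to a column header, the return value is -1, which will return the value from the last column in the DataFrame (since Python supports negative indexing relative to the end). This is less safe than NaN because it returns a value from the Select column, which is itself a number, so it may be hard to tell right away that the data is invalid.

Sample program: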

import pandas as pd

df = pd.DataFrame({
    'Select': ['B', 'A', 'C', 'D'],
    'A': [47, 2, 51, 95],
    'B': [56, 88, 10, 56],
    'C': [70, 73, 59, 56]
})

df['value'] = df.to_numpy()[
    df.index,
    df.columns.get_indexer(df['Select'])
]

print(df)

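Notice that in the last row the Select column is D, but the value is pulled from C, which is the last column in the DataFrame (-1). It is not immediately apparent that the lookup failed or is incorrect.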

  Select   A   B   C value
0      B  47  56  70    56
1      A   2  88  73     2
2      C  51  10  59    59
3      D  95  56  56    56  # <- Value from C

Compared with factorize:

idx, cols = pd.factorize(df['Select'])
df['value'] = (
    df.reindex(cols, axis=1).to_numpy()[df.index, idx]
)

Notice that in the last row the Select column is D and the corresponding value is NaN, which pandas uses to represent missing data.

  Select   A   B   C  value
0      B  47  56  70   56.0
1      A   2  88  73    2.0
2      C  51  10  59   59.0
3      D  95  56  56    NaN  # <- Missing Data
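
If get_indexer is still preferred, one possible safeguard is to mask the rows where it returned -1. This is a sketch of that idea rather than something from the answer; the names col_pos, found and picked are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Select': ['B', 'A', 'C', 'D'],
    'A': [47, 2, 51, 95],
    'B': [56, 88, 10, 56],
    'C': [70, 73, 59, 56]
})

col_pos = df.columns.get_indexer(df['Select'])  # -1 where the label is missing
found = col_pos != -1

# The lookup still runs for every row; unmatched rows read the last column.
picked = df.to_numpy()[np.arange(len(df)), col_pos]

# Blank out the unmatched rows, then convert the object values to numbers.
df['value'] = pd.to_numeric(pd.Series(picked, index=df.index).where(found))
print(df)
```

This keeps the single-pass get_indexer lookup while producing NaN for the unmatched row, similar to the factorize version.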

Setup and imports:

import numpy as np  # (Only needed if using np.arange)
import pandas as pd

df = pd.DataFrame({
    'Area': [22, 45, 67, 54, 21],
    1: [54, 36, 34, 35, 27],
    2: [33, 54, 29, 19, 39],
    3: [46, 32, 11, 22, 43],
    4: [23, 14, 14, 45, 22],
    'Select': [4, 1, 3, 2, 3]
})

Tags: python, indexing, row, dataframe, pandas