根据其他 2 个值索引 DataFrame 的更好解决方案

Better solution to index a DataFrame according to the values of 2 others

我想根据其他 2 个值(val1_df 用于列,val2_df 用于行)对 DataFrame (aaxx_df) 进行索引。

>>> val1_df = pd.DataFrame({"A": ["aa", "bb"], "B": ["bb", "aa"]})
    A   B
0  aa  bb
1  bb  aa
>>> val2_df = pd.DataFrame({"A": ["xx", "yy"], "B": ["zz", "yy"]})
    A   B
0  xx  zz
1  yy  yy
>>> aaxx_df = pd.DataFrame(
...     {"aa": [10, 20, 30], "bb": [-10, -20, -30]}, 
...     index=["xx", "yy", "zz"]
... )
    aa  bb
xx  10 -10
yy  20 -20
zz  30 -30
>>> comb(val1_df, val2_df, aaxx_df)
      A     B
0  10.0 -30.0
1 -20.0  20.0

我在下面提出了一个适用于我的问题的解决方案,但我想,一定有一些更清晰的解决方案,可能是通过 SQL(在我看来,这与关系数据库问题非常相似) .

def comb(df1, df2, link_df):

    comb_df = df1.copy()
    
    for val in link_df:
        tmp_df = df2.copy()
        tmp_df = tmp_df.applymap(lambda x: link_df.loc[x, val])
        comb_df = comb_df.replace({val: np.nan}).combine_first(tmp_df)

    return comb_df

如果您想使用 val2_df(行)和 val1_df (cols), 使用 lookup:

vals = aaxx_df.lookup(val2_df.values.flatten(), val1_df.values.flatten())
    
res = pd.DataFrame(data=vals.reshape(val1_df.shape), columns=val1_df.columns)
print(res)

输出

    A   B
0  10 -30
1 -20  20

自 2020 年 12 月 26 日起,pandas 1.2 已弃用 lookup 方法。这是作为函数的最小版本:

def lookup(df, row_labels, col_labels) -> np.ndarray:
    """
    Label-based "fancy indexing" function for DataFrame.
    Given equal-length arrays of row and column labels, return an
    array of the values corresponding to each (row, col) pair.

    Parameters
    ----------
    row_labels : sequence
        The row labels to use for lookup.
    col_labels : sequence
        The column labels to use for lookup.

    Returns
    -------
    numpy.ndarray
        The found values.
    """
    row_idx = df.index.get_indexer(row_labels)
    col_idx = df.columns.get_indexer(col_labels)
    flat_index = row_idx * len(df.columns) + col_idx
    return df.values.flat[flat_index]

代码变为:

vals = lookup(aaxx_df, val2_df.values.flatten(), val1_df.values.flatten())
    
res = pd.DataFrame(data=vals.reshape(val1_df.shape), columns=val1_df.columns)
print(res)

让我们一行完成

out = (
    pd.concat([val1_df, val2_df])
    .groupby(level=0)
    .agg(tuple)
    .applymap(lambda x: aaxx_df.loc[x[1], x[0]])
)
    A   B
0  10 -30
1 -20  20

更新

s = aaxx_df.stack().reindex(
    pd.MultiIndex.from_arrays([val2_df.stack(), val1_df.stack()])
)
val1_df[:] = s.values.reshape(val1_df.shape)
    A   B
0  10 -30
1 -20  20

此代码不够简洁,但创建字典并直接替换字典值可能比使用 lookup 更快。此外,您可以使用行 + 列组合的唯一字符串而不是元组来简化代码:

a = val2_df.values + '|' + val1_df.values.tolist()
rows = np.repeat(aaxx_df.index, len(aaxx_df.columns)).tolist()
cols = np.tile(aaxx_df.columns, len(aaxx_df)).tolist()
vals = aaxx_df.values.ravel().tolist()  # .tolist() makes code fasterfor dictionary conversion from numpy array
b = dict(zip(f'{rows}|{cols}', vals)) #this line of code is the bottleneck on this small dataframe, but it may have a high fixed cost and lower variable cost than lookup.
pd.DataFrame(a, columns=val1_df.columns).replace(b)

    A   B
0  10 -30
1 -20  20

有关创建“b”的代码,请在此处查看 Jezrael 的类似答案: