Match 5 columns across 3 DataFrames and create a calculated column

Question

I have 3 DataFrames, shown below:

df1:

a  b  c  d  e  f  2020    2021
a1 b1 c1 d1 e1 f1 334.385 340.210
a1 b1 c1 d1 e1 f1 335.385 341.210
a2 b2 c2 d2 e2 f2 344.385 350.210
a4 b2 c4 d4 e4 f4 354.385 360.210

df2:

a  g  h  i  j  k  2020    2021
a1 b1 c1 d1 e1 f1 434.385 440.210
a5 b6 c6 d6 e6 f6 444.385 450.210
a5 b6 c6 d6 e6 f6 445.385 451.210
a4 b2 c4 d4 e4 f4 454.385 460.210
a4 b2 c4 d4 e4 f4 455.385 461.210

df3:

a  l  m  n  o  p  2020    2021
a1 b1 c1 d1 e1 f1 534.385 540.210
a7 b7 c7 d7 e7 f7 544.385 550.210
a4 b2 c4 d4 e4 f4 554.385 560.210

Expected output:

a  l  m  n  o  p  2020    2021     new_2021
a1 b1 c1 d1 e1 f1 534.385 540.210  540.210*((340.210+341.210)/440.210)
a7 b7 c7 d7 e7 f7 544.385 550.210  numpy.nan
a4 b2 c4 d4 e4 f4 554.385 560.210  560.210*(360.210/(460.210+461.210))

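Explanation:
I want to match the first 5 string columns across all 3 DataFrames (by position, since the column names differ) and create a new column from a small calculation on the year columns. df3 is my reference DataFrame, and I want to adjust the values in its year columns according to the ratio between df1 and df2.
For example, for rows where all 5 columns match, I want to compute df3['new_2021'] = df3['2021'] * (df1['2021'] / df2['2021']).
If several rows share the same values in the first 5 columns, I want to sum their year columns first, as shown in row 3 of the expected output;
and, as shown in row 2 of the expected output, if a df3 row has no match on all 5 columns in df1, in df2, or in both, I want that row to stay empty.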

How can I do this efficiently? My DataFrames are very large.

Answer 1

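Because the first 5 columns may contain duplicate values, you can aggregate with sum first. Then set the index names of df1 and df2 from the column names in df3, so that all three DataFrames share the same index names and the division and multiplication align on the index: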

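# numeric_only=True keeps only the year columns; the leftover string column (f, k, p) is skipped
df1 = df1.groupby(df1.columns[:5].tolist()).sum(numeric_only=True).rename_axis(df3.columns[:5].tolist())
df2 = df2.groupby(df2.columns[:5].tolist()).sum(numeric_only=True).rename_axis(df3.columns[:5].tolist())
df3 = df3.groupby(df3.columns[:5].tolist()).sum(numeric_only=True)

# arithmetic aligns on the shared 5-level index; keys missing from df1 or df2 give NaN
df3['new_2021'] = df3['2021'] * (df1['2021'] / df2['2021'])
print(df3)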
                   2020    2021    new_2021
a  l  m  n  o                              
a1 b1 c1 d1 e1  534.385  540.21  836.214303
a4 b2 c4 d4 e4  554.385  560.21  219.002457
a7 b7 c7 d7 e7  544.385  550.21         NaN
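
For readers who want to run the approach above end to end, here is a minimal self-contained sketch. The frames are typed in from the question; the helper name `sum_by_keys` and the result names `s1`, `s2`, `out` are made up for illustration, and `numeric_only=True` is an assumption added so that the remaining string column is skipped on recent pandas versions.

```python
import pandas as pd

# Sample data copied from the question. Key columns are matched by position
# (first 5 columns), because their names differ between the frames.
df1 = pd.DataFrame(
    [["a1", "b1", "c1", "d1", "e1", "f1", 334.385, 340.210],
     ["a1", "b1", "c1", "d1", "e1", "f1", 335.385, 341.210],
     ["a2", "b2", "c2", "d2", "e2", "f2", 344.385, 350.210],
     ["a4", "b2", "c4", "d4", "e4", "f4", 354.385, 360.210]],
    columns=["a", "b", "c", "d", "e", "f", "2020", "2021"],
)
df2 = pd.DataFrame(
    [["a1", "b1", "c1", "d1", "e1", "f1", 434.385, 440.210],
     ["a5", "b6", "c6", "d6", "e6", "f6", 444.385, 450.210],
     ["a5", "b6", "c6", "d6", "e6", "f6", 445.385, 451.210],
     ["a4", "b2", "c4", "d4", "e4", "f4", 454.385, 460.210],
     ["a4", "b2", "c4", "d4", "e4", "f4", 455.385, 461.210]],
    columns=["a", "g", "h", "i", "j", "k", "2020", "2021"],
)
df3 = pd.DataFrame(
    [["a1", "b1", "c1", "d1", "e1", "f1", 534.385, 540.210],
     ["a7", "b7", "c7", "d7", "e7", "f7", 544.385, 550.210],
     ["a4", "b2", "c4", "d4", "e4", "f4", 554.385, 560.210]],
    columns=["a", "l", "m", "n", "o", "p", "2020", "2021"],
)

# Index names taken from the reference frame df3.
keys = df3.columns[:5].tolist()


def sum_by_keys(df, names):
    """Sum the numeric columns per unique combination of the first 5 columns."""
    return (
        df.groupby(df.columns[:5].tolist())
        .sum(numeric_only=True)   # skip the leftover string column
        .rename_axis(names)       # give every frame the same index names
    )


s1 = sum_by_keys(df1, keys)
s2 = sum_by_keys(df2, keys)
out = sum_by_keys(df3, keys)

# Index-aligned arithmetic: keys absent from s1 or s2 produce NaN.
out["new_2021"] = out["2021"] * (s1["2021"] / s2["2021"])
print(out)
```

With this data the a1 row uses (340.210 + 341.210) / 440.210 and the a4 row uses 360.210 / (460.210 + 461.210), which matches the values printed above, while the a7 row stays NaN.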

EDIT: This also works when the MultiIndex in df3 contains duplicates, but it needs a few more steps:

print (df3)
    a   l   m   n   o   p     2020    2021
0  a1  b1  c1  d1  e1  f1  534.385  540.21
1  a7  b7  c7  d7  e7  f7  544.385  550.21
2  a4  b2  c4  d4  e4  f4  554.385  560.21
3  a1  b1  c1  d1  e1  f1  534.385  200.00
4  a7  b7  c7  d7  e7  f7  544.385  800.00
5  a4  b2  c4  d4  e4  f4  554.385  500.00

# numeric_only=True keeps only the year columns; the leftover string column is skipped
df1 = df1.groupby(df1.columns[:5].tolist()).sum(numeric_only=True).rename_axis(df3.columns[:5].tolist())
df2 = df2.groupby(df2.columns[:5].tolist()).sum(numeric_only=True).rename_axis(df3.columns[:5].tolist())

# move the first 5 columns of df3 into the index and sort it
df3 = df3.set_index(df3.columns[:5].tolist()).sort_index()

# build a unique MultiIndex from df3 and reindex df1, df2 to it
mux = pd.MultiIndex.from_frame(df3.index.to_frame().drop_duplicates())
df1 = df1.reindex(mux)
df2 = df2.reindex(mux)
print (df2)
                   2020    2021
a  l  m  n  o                  
a1 b1 c1 d1 e1  434.385  440.21
a4 b2 c4 d4 e4  909.770  921.42
a7 b7 c7 d7 e7      NaN     NaN

df3['new_2021'] = df3['2021'] * (df1['2021'] / df2['2021'])
print (df3)
                 p     2020    2021    new_2021
a  l  m  n  o                                  
a1 b1 c1 d1 e1  f1  534.385  540.21  836.214303
            e1  f1  534.385  200.00  309.588605
a4 b2 c4 d4 e4  f4  554.385  560.21  219.002457
            e4  f4  554.385  500.00  195.464609
a7 b7 c7 d7 e7  f7  544.385  550.21         NaN
            e7  f7  544.385  800.00         NaN
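
As a possible alternative to the index-based steps above (not part of the original answer), the same adjustment can be sketched with two left merges, which keeps df3's original row order and its `p` column. The function name `add_adjusted_year` and the temporary column names `_num` and `_den` are hypothetical; the data is the duplicated df3 printed above together with df1 and df2 from the question.

```python
import pandas as pd

df1 = pd.DataFrame(
    [["a1", "b1", "c1", "d1", "e1", "f1", 334.385, 340.210],
     ["a1", "b1", "c1", "d1", "e1", "f1", 335.385, 341.210],
     ["a2", "b2", "c2", "d2", "e2", "f2", 344.385, 350.210],
     ["a4", "b2", "c4", "d4", "e4", "f4", 354.385, 360.210]],
    columns=["a", "b", "c", "d", "e", "f", "2020", "2021"],
)
df2 = pd.DataFrame(
    [["a1", "b1", "c1", "d1", "e1", "f1", 434.385, 440.210],
     ["a5", "b6", "c6", "d6", "e6", "f6", 444.385, 450.210],
     ["a5", "b6", "c6", "d6", "e6", "f6", 445.385, 451.210],
     ["a4", "b2", "c4", "d4", "e4", "f4", 454.385, 460.210],
     ["a4", "b2", "c4", "d4", "e4", "f4", 455.385, 461.210]],
    columns=["a", "g", "h", "i", "j", "k", "2020", "2021"],
)
df3 = pd.DataFrame(
    [["a1", "b1", "c1", "d1", "e1", "f1", 534.385, 540.21],
     ["a7", "b7", "c7", "d7", "e7", "f7", 544.385, 550.21],
     ["a4", "b2", "c4", "d4", "e4", "f4", 554.385, 560.21],
     ["a1", "b1", "c1", "d1", "e1", "f1", 534.385, 200.00],
     ["a7", "b7", "c7", "d7", "e7", "f7", 544.385, 800.00],
     ["a4", "b2", "c4", "d4", "e4", "f4", 554.385, 500.00]],
    columns=["a", "l", "m", "n", "o", "p", "2020", "2021"],
)


def add_adjusted_year(ref, num, den, year="2021", n_keys=5):
    """Return a copy of ref with new_<year> = ref[year] * sum(num) / sum(den) per key."""
    keys = ref.columns[:n_keys].tolist()

    def totals(df, label):
        # Rename the first n_keys columns (by position) to ref's key names,
        # then sum the year column per key combination.
        renamed = df.rename(columns=dict(zip(df.columns[:n_keys], keys)))
        return renamed.groupby(keys)[year].sum().rename(label).reset_index()

    # Left merges keep every ref row, in order; unmatched keys get NaN totals.
    out = (
        ref.merge(totals(num, "_num"), on=keys, how="left")
        .merge(totals(den, "_den"), on=keys, how="left")
    )
    out[f"new_{year}"] = out[year] * out["_num"] / out["_den"]
    return out.drop(columns=["_num", "_den"])


result = add_adjusted_year(df3, df1, df2)
print(result)
```

Whether this is faster than the reindex approach on very large frames would need to be measured; the main design difference is that nothing has to be sorted or moved into the index, so each row of the reference frame keeps its original position.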
