lambda creates new columns rather than rows when criteria are met

Someone kindly answered a previous question of mine, and the solution worked until I scaled it up to our actual dataset (roughly 5 million rows). My own for loop is very slow on that data, so I was hoping the provided lambda solution would work. However, when using the following dataset:

date       customerID   saved   purchased   savedProduct    purchasedProduct
2021-01-01  456789        1        0          11223344           0
2021-01-01  456789        1        0          55667788           0
2021-01-03  456789        0        1           0              11223344
2021-01-03  456789        0        1           0              28373827
2021-02-05  456710        1        0          55667789           0
2021-02-05  456710        1        0          55667790           0
2021-02-09  456710        1        0          556677288          0
2021-02-05  2727228       1        0          55667789           0
2021-02-05  2727228       0        1          0               11223344
2021-02-05  2727228       0        1          0               28373827
2021-02-09  2727228       0        1          0               55667789
2021-02-09  2727228       0        1          0               28373827 

created with the following code:

import numpy as np
import pandas as pd

d = {'date': ['2021-01-01', '2021-01-01', '2021-01-03', '2021-02-05', '2021-02-05', '2021-02-09', '2021-02-05', '2021-02-05', '2021-02-09', '2021-02-05', '2021-02-10'], 
     'customerID': ['456789', '456789', '456789', '456710', '456710', '456710', '2727228', '2727228', '2727228', '2727210', '2727210'],
     'saved': [1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0], 
     'purchased': [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1], 
     'savedProduct': [11223344, 55667788, 0, 55667789, 55667790, 556677288, 55667789, 0, 0, 3828292, 0], 
     'purchasedProduct': [[0], [0], [11223344, 28373827], [0], [0], [0], [0], [11223344, 28373827], [55667789, 28373827], [0], [3828292]]}
df2 = pd.DataFrame(data=d).explode('purchasedProduct').reset_index(drop=True)

When running the provided solution, which is this:

df2.groupby('customerID').apply(
  lambda df: df.apply(
    lambda x: np.nan if x.savedProduct == 0 else df.loc[df.purchasedProduct == x.savedProduct, 'date'], axis = 1))
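For context, the widening described below is standard `DataFrame.apply` behaviour: when the function passed with `axis=1` returns a Series (as `df.loc[..., 'date']` does here), pandas expands that Series into one column per index label instead of producing a single value per row. A minimal sketch of the difference, independent of the data above:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

# A row-wise function that returns a scalar keeps the result one-dimensional...
scalars = df.apply(lambda x: x.a * 2, axis=1)

# ...but one that returns a Series is expanded into columns,
# one per label in the returned Series' index.
wide = df.apply(lambda x: pd.Series({"p": x.a, "q": x.a * 2}), axis=1)

print(type(scalars))       # a Series
print(list(wide.columns))  # ['p', 'q']
```

Because each row's `df.loc[...]` match has a different index label, every match contributes its own column, which is exactly the wide table of NaNs shown below.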

the resulting table creates a new column every time a match is found, like this:

2            10 13
2021-01-03  NaN NaN
NaN         NaN NaN
NaN         NaN NaN
NaN         NaN NaN
NaN         NaN NaN
NaN         NaN NaN
NaN         NaN NaN
NaN         2021-02-09  NaN

I have tried to fix this myself, but my understanding of lambdas is very basic. My (edited) code does do what I need, but once the dataset reaches 100k+ rows it takes over an hour to run on Databricks, and I need it to run on the data described above, roughly 5 million rows. Is there a way to make the lambda work so that doing this:

df2['purchasedDates'] = df2.groupby('customerID').apply(
      lambda df: df.apply(
        lambda x: np.nan if x.savedProduct == 0 else df.loc[df.purchasedProduct == x.savedProduct, 'date'], axis = 1))

would return (expected output):

purchasedDates
2021-01-03
NaN
NaN
NaN
NaN
2021-02-09
etc

Thanks for any help, and I hope posting this is appropriate; I would have edited the previous question, but I gather you're not supposed to.

If I understand you correctly, you can get the desired column as follows:

  1. merge the DataFrame with itself to get the purchase dates of the saved products.
  2. Keep only the required columns and rename them to match the original DataFrame columns.

merged = df2[df2["savedProduct"].ne(0)].merge(df2, 
                                              left_on=["customerID", "savedProduct"], 
                                              right_on=["customerID", "purchasedProduct"], 
                                              how="left")
output = merged.rename(columns={"customerID": "customerID_x", "date_y": "purchasedDate_x"}).filter(like="_x")
# removesuffix (Python 3.9+) drops only the literal "_x" suffix;
# rstrip("_x") would also strip any trailing "x" or "_" characters from a name.
output = output.rename(columns={col: col.removesuffix("_x") for col in output.columns})

>>> output
         date customerID  saved  ...  savedProduct purchasedProduct purchasedDate
0  2021-01-01     456789      1  ...      11223344                0    2021-01-03
1  2021-01-01     456789      1  ...      55667788                0           NaN
2  2021-02-05     456710      1  ...      55667789                0           NaN
3  2021-02-05     456710      1  ...      55667790                0           NaN
4  2021-02-09     456710      1  ...     556677288                0           NaN
5  2021-02-05    2727228      1  ...      55667789                0    2021-02-09
6  2021-02-05    2727210      1  ...       3828292                0    2021-02-10

[7 rows x 7 columns]
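One caveat worth noting (my addition, not part of the answer above): if a customer purchased the same saved product on more than one date, the left merge produces one output row per matching purchase. A hedged sketch of collapsing such duplicates to the earliest purchase date, using a small made-up frame and assuming ISO-formatted date strings (which sort chronologically):

```python
import pandas as pd

# Hypothetical example: product 55667789 was saved once but purchased twice.
df2 = pd.DataFrame({
    "date": ["2021-02-05", "2021-02-09", "2021-02-12"],
    "customerID": ["2727228", "2727228", "2727228"],
    "savedProduct": [55667789, 0, 0],
    "purchasedProduct": [0, 55667789, 55667789],
})

merged = df2[df2["savedProduct"].ne(0)].merge(
    df2,
    left_on=["customerID", "savedProduct"],
    right_on=["customerID", "purchasedProduct"],
    how="left",
)
# merged now has two rows for the single saved row, one per purchase.
# Collapse them: keep the earliest purchase date per (customer, saved product).
# min() on ISO date strings is safe because they sort chronologically.
earliest = (merged.groupby(["customerID", "savedProduct_x"], as_index=False)
                  .agg(purchasedDate=("date_y", "min")))
print(earliest)
```

With the example data above, the merge yields two rows and the groupby collapses them back to one, with `purchasedDate` set to the earlier of the two purchase dates.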