lambda creates new columns rather than rows when criteria are met

Someone kindly answered a previous question of mine, and the solution worked until I scaled it up to our actual dataset (roughly 5 million rows). My own for loop is very slow on that data, so I was hoping the provided lambda solution would work. However, when using the following dataset:

date       customerID   saved   purchased   savedProduct    purchasedProduct
2021-01-01  456789        1        0          11223344           0
2021-01-01  456789        1        0          55667788           0
2021-01-03  456789        0        1           0              11223344
2021-01-03  456789        0        1           0              28373827
2021-02-05  456710        1        0          55667789           0
2021-02-05  456710        1        0          55667790           0
2021-02-09  456710        1        0          556677288          0
2021-02-05  2727228       1        0          55667789           0
2021-02-05  2727228       0        1          0               11223344
2021-02-05  2727228       0        1          0               28373827
2021-02-09  2727228       0        1          0               55667789
2021-02-09  2727228       0        1          0               28373827 

created with the following code:

import numpy as np
import pandas as pd

d = {'date': ['2021-01-01', '2021-01-01', '2021-01-03', '2021-02-05', '2021-02-05', '2021-02-09', '2021-02-05', '2021-02-05', '2021-02-09', '2021-02-05', '2021-02-10'], 
     'customerID': ['456789', '456789', '456789', '456710', '456710', '456710', '2727228', '2727228', '2727228', '2727210', '2727210'],
     'saved': [1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0], 
     'purchased': [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1], 
     'savedProduct': [11223344, 55667788, 0, 55667789, 55667790, 556677288, 55667789, 0, 0, 3828292, 0], 
     'purchasedProduct': [[0], [0], [11223344, 28373827], [0], [0], [0], [0], [11223344, 28373827], [55667789, 28373827], [0], [3828292]]}
df2 = pd.DataFrame(data=d).explode('purchasedProduct').reset_index(drop=True)

When running the provided solution, which is this:

df2.groupby('customerID').apply(
  lambda df: df.apply(
    lambda x: np.nan if x.savedProduct == 0 else df.loc[df.purchasedProduct == x.savedProduct, 'date'], axis = 1))
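For context, the widening described below is standard `DataFrame.apply` behaviour: when the function passed with `axis=1` returns a Series (as `df.loc[..., 'date']` does here), pandas expands that Series into one column per index label instead of producing a single value per row. A minimal sketch of the difference, independent of the data above:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

# A row-wise function that returns a scalar keeps the result one-dimensional...
scalars = df.apply(lambda x: x.a * 2, axis=1)

# ...but one that returns a Series is expanded into columns,
# one per label in the returned Series' index.
wide = df.apply(lambda x: pd.Series({"p": x.a, "q": x.a * 2}), axis=1)

print(type(scalars))       # a Series
print(list(wide.columns))  # ['p', 'q']
```

Because each row's `df.loc[...]` match has a different index label, every match contributes its own column, which is exactly the wide table of NaNs shown below.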

the resulting table creates a new column every time a match is found, like this:

2            10 13
2021-01-03  NaN NaN
NaN         NaN NaN
NaN         NaN NaN
NaN         NaN NaN
NaN         NaN NaN
NaN         NaN NaN
NaN         NaN NaN
NaN         2021-02-09  NaN

I have tried to fix this myself, but my understanding of lambdas is very basic. My (edited) code does do what I need, but once the dataset reaches 100k+ rows it takes over an hour to run on Databricks, and I need it to run on the data described above, roughly 5 million rows. Is there a way to make the lambda work so that doing this:

df2['purchasedDates'] = df2.groupby('customerID').apply(
      lambda df: df.apply(
        lambda x: np.nan if x.savedProduct == 0 else df.loc[df.purchasedProduct == x.savedProduct, 'date'], axis = 1))

would return (expected output):

purchasedDates
2021-01-03
NaN
NaN
NaN
NaN
2021-02-09
etc

Thanks for any help, and I hope posting this is appropriate; I would have edited the previous question, but I gather you're not supposed to.

If I understand you correctly, you can get the desired column as follows:

  1. merge the DataFrame with itself to get the purchase dates of the saved products.
  2. Keep only the required columns and rename them to match the original DataFrame columns.

merged = df2[df2["savedProduct"].ne(0)].merge(df2, 
                                              left_on=["customerID", "savedProduct"], 
                                              right_on=["customerID", "purchasedProduct"], 
                                              how="left")
output = merged.rename(columns={"customerID": "customerID_x", "date_y": "purchasedDate_x"}).filter(like="_x")
# removesuffix (Python 3.9+) drops only the literal "_x" suffix;
# rstrip("_x") would also strip any trailing "x" or "_" characters from a name.
output = output.rename(columns={col: col.removesuffix("_x") for col in output.columns})

>>> output
         date customerID  saved  ...  savedProduct purchasedProduct purchasedDate
0  2021-01-01     456789      1  ...      11223344                0    2021-01-03
1  2021-01-01     456789      1  ...      55667788                0           NaN
2  2021-02-05     456710      1  ...      55667789                0           NaN
3  2021-02-05     456710      1  ...      55667790                0           NaN
4  2021-02-09     456710      1  ...     556677288                0           NaN
5  2021-02-05    2727228      1  ...      55667789                0    2021-02-09
6  2021-02-05    2727210      1  ...       3828292                0    2021-02-10

[7 rows x 7 columns]
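One caveat worth noting (my addition, not part of the answer above): if a customer purchased the same saved product on more than one date, the left merge produces one output row per matching purchase. A hedged sketch of collapsing such duplicates to the earliest purchase date, using a small made-up frame and assuming ISO-formatted date strings (which sort chronologically):

```python
import pandas as pd

# Hypothetical example: product 55667789 was saved once but purchased twice.
df2 = pd.DataFrame({
    "date": ["2021-02-05", "2021-02-09", "2021-02-12"],
    "customerID": ["2727228", "2727228", "2727228"],
    "savedProduct": [55667789, 0, 0],
    "purchasedProduct": [0, 55667789, 55667789],
})

merged = df2[df2["savedProduct"].ne(0)].merge(
    df2,
    left_on=["customerID", "savedProduct"],
    right_on=["customerID", "purchasedProduct"],
    how="left",
)
# merged now has two rows for the single saved row, one per purchase.
# Collapse them: keep the earliest purchase date per (customer, saved product).
# min() on ISO date strings is safe because they sort chronologically.
earliest = (merged.groupby(["customerID", "savedProduct_x"], as_index=False)
                  .agg(purchasedDate=("date_y", "min")))
print(earliest)
```

With the example data above, the merge yields two rows and the groupby collapses them back to one, with `purchasedDate` set to the earlier of the two purchase dates.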