lambda is creating new columns when criteria are met, rather than rows
Someone kindly answered a previous question of mine, and the solution worked until I scaled it up to our actual dataset (roughly 5 million rows). My own for loop over that data is very slow, so I was hoping the provided lambda solution would work. However, with the following dataset:
date customerID saved purchased savedProduct purchasedProduct
2021-01-01 456789 1 0 11223344 0
2021-01-01 456789 1 0 55667788 0
2021-01-03 456789 0 1 0 11223344
2021-01-03 456789 0 1 0 28373827
2021-02-05 456710 1 0 55667789 0
2021-02-05 456710 1 0 55667790 0
2021-02-09 456710 1 0 556677288 0
2021-02-05 2727228 1 0 55667789 0
2021-02-05 2727228 0 1 0 11223344
2021-02-05 2727228 0 1 0 28373827
2021-02-09 2727228 0 1 0 55667789
2021-02-09 2727228 0 1 0 28373827
created with the following code:
import pandas as pd

d = {'date': ['2021-01-01', '2021-01-01', '2021-01-03', '2021-02-05', '2021-02-05', '2021-02-09', '2021-02-05', '2021-02-05', '2021-02-09', '2021-02-05', '2021-02-10'],
     'customerID': ['456789', '456789', '456789', '456710', '456710', '456710', '2727228', '2727228', '2727228', '2727210', '2727210'],
     'saved': [1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0],
     'purchased': [0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1],
     'savedProduct': [11223344, 55667788, 0, 55667789, 55667790, 556677288, 55667789, 0, 0, 3828292, 0],
     'purchasedProduct': [[0], [0], [11223344, 28373827], [0], [0], [0], [0], [11223344, 28373827], [55667789, 28373827], [0], [3828292]]}
df2 = pd.DataFrame(data=d).explode('purchasedProduct').reset_index(drop=True)
when running the provided solution like this:
import numpy as np

df2.groupby('customerID').apply(
    lambda df: df.apply(
        lambda x: np.nan if x.savedProduct == 0 else df.loc[df.purchasedProduct == x.savedProduct, 'date'], axis=1))
I get a result table that creates a new column each time a match is found, like this:
2 10 13
2021-01-03 NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN NaN NaN
NaN 2021-02-09 NaN
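(As far as I can tell, the numbered columns 2, 10 and 13 are the index labels of the matching purchase rows: the inner lambda returns a whole Series of dates rather than a single value, and when a row-wise apply returns Series, pandas assembles them into a DataFrame with one column per matched index.)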
I've tried to troubleshoot it, but my knowledge of lambdas is very basic. My (edited) code does produce what I need, but once the dataset reaches 100k+ rows it takes over an hour to run on Databricks, and as mentioned I need to run it on roughly 5 million rows. Is there a way to make the lambda execute so that this:
df2['purchasedDates'] = df2.groupby('customerID').apply(
    lambda df: df.apply(
        lambda x: np.nan if x.savedProduct == 0 else df.loc[df.purchasedProduct == x.savedProduct, 'date'], axis=1))
would return this (expected output):
purchasedDates
2021-01-03
NaN
NaN
NaN
NaN
2021-02-09
etc
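For reference, here is a minimal sketch of the same nested apply coerced to return one scalar per row, which avoids the column explosion. The .min() tie-breaker (earliest matching date) and group_keys=False (so the result aligns back to df2's original index) are assumptions of mine, and this still scans row by row, so the merge-based answer below remains the better fit for 5 million rows:

import numpy as np

# Return a single date (or NaN) per row instead of a Series of matches;
# min() on the ISO-formatted date strings picks the earliest match.
df2['purchasedDates'] = df2.groupby('customerID', group_keys=False).apply(
    lambda g: g.apply(
        lambda x: np.nan if x.savedProduct == 0
        else g.loc[g.purchasedProduct == x.savedProduct, 'date'].min(),
        axis=1))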
Thanks for any help, and I hope this is appropriate; I would have edited the previous question, but it seems you're not supposed to.
If I understand correctly, you can get the desired column by:
- merging the DataFrame with itself to get the purchase dates of the saved products;
- keeping only the required columns and renaming them to match the original DataFrame's columns.
merged = df2[df2["savedProduct"].ne(0)].merge(
    df2,
    left_on=["customerID", "savedProduct"],
    right_on=["customerID", "purchasedProduct"],
    how="left")

# "customerID" is a join key on both sides, so it keeps its bare name; suffix
# it manually so that filter(like="_x") retains it along with the left columns.
output = merged.rename(columns={"customerID": "customerID_x",
                                "date_y": "purchasedDate_x"}).filter(like="_x")
# Slice off the literal "_x" suffix (rstrip("_x") would also eat any other
# trailing "_" or "x" characters in a column name).
output = output.rename(columns={col: col[:-len("_x")] for col in output.columns})
>>> output
date customerID saved ... savedProduct purchasedProduct purchasedDate
0 2021-01-01 456789 1 ... 11223344 0 2021-01-03
1 2021-01-01 456789 1 ... 55667788 0 NaN
2 2021-02-05 456710 1 ... 55667789 0 NaN
3 2021-02-05 456710 1 ... 55667790 0 NaN
4 2021-02-09 456710 1 ... 556677288 0 NaN
5 2021-02-05 2727228 1 ... 55667789 0 2021-02-09
6 2021-02-05 2727210 1 ... 3828292 0 2021-02-10
[7 rows x 7 columns]
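One caveat the answer leaves open (my addition, not part of the original): if a customer purchased the same product on several dates, the self-merge emits one output row per match. A sketch that keeps only the earliest purchase date per saved row, assuming df2's default integer index uniquely identifies rows:

# Keep only the earliest matching purchase date for each saved row.
merged = (df2[df2["savedProduct"].ne(0)]
          .reset_index()              # preserve each saved row's identity
          .merge(df2,
                 left_on=["customerID", "savedProduct"],
                 right_on=["customerID", "purchasedProduct"],
                 how="left")
          .sort_values("date_y")      # ISO date strings sort chronologically
          .drop_duplicates(subset="index", keep="first"))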