A str value changed over time & Customer dimension(s) in Pandas
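I have customer data for several dates, and I want to see whether each customer switched to another product over time. Ideally, I'd like to copy the two values involved in the change into a new column.
So, if I have a table like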
period, Customer , product
2020-01, Cust1, 12 TS
2020-02, Cust1, 12 TS
2020-03, Cust1, 14 SLM
2020-01, Cust2, 12 SLM
2020-02, Cust2, 12 TS
2020-03, Cust2, 14 SLM
So Cust1 goes from TS to SLM over time, while Cust2 goes from SLM to TS and then back again.
The final table should look like this:
period, Customer , product , change
2020-01, Cust1, 12 TS , NAN
2020-02, Cust1, 12 TS , NAN
2020-03, Cust1, 14 SLM, from TS to SLM
2020-01, Cust2, 12 SLM, NAN
2020-02, Cust2, 12 TS, from SLM to TS
2020-03, Cust2, 14 SLM, from TS to SLM
I've looked at many of the available solutions, but I couldn't get any of them to work the way I want.
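We can first group the DataFrame by Customer, then shift the product column to see whether anything changed. After that, we can compare the two columns to describe the change.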
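# previous product per customer; bfill fills each customer's first row
# (this assumes every customer has at least two rows)
df['prev_product'] = df.groupby(['Customer'])['product'].shift().bfill()
# label-based access avoids the deprecated positional x[0] / x[1] lookups
df['change'] = df[['product', 'prev_product']].apply(lambda x: None if x['product'] == x['prev_product'] else f"from {x['prev_product']} to {x['product']}", axis=1)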
period Customer n product prev_product change
0 2020-01 Cust1 12 TS TS None
1 2020-02 Cust1 12 TS TS None
2 2020-03 Cust1 14 SLM TS from TS to SLM
3 2020-01 Cust2 12 SLM SLM None
4 2020-02 Cust2 12 TS SLM from SLM to TS
5 2020-03 Cust2 14 SLM TS from TS to SLM
Note: use df.drop('prev_product', axis=1) if the helper column isn't needed.
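The same comparison can also be written without a row-wise apply. This is only a minimal sketch: it rebuilds the sample frame by hand using the column names from the output above (period, Customer, n, product), and it leaves the first row of each customer as NaN instead of back-filling it.

```python
import pandas as pd

# Sample data rebuilt from the question; "n" is the numeric column.
df = pd.DataFrame({
    "period": ["2020-01", "2020-02", "2020-03"] * 2,
    "Customer": ["Cust1"] * 3 + ["Cust2"] * 3,
    "n": [12, 12, 14, 12, 12, 14],
    "product": ["TS", "TS", "SLM", "SLM", "TS", "SLM"],
})

# Sort so that shift() really looks at the previous period of each customer.
df = df.sort_values(["Customer", "period"]).reset_index(drop=True)

# Previous product within each customer; NaN on each customer's first row.
prev = df.groupby("Customer")["product"].shift()

# A change exists only where a previous value exists and differs.
changed = prev.notna() & (prev != df["product"])

# Build the label for every row, then keep it only on the changed rows.
df["change"] = ("from " + prev + " to " + df["product"]).where(changed)

print(df)
```

Because nothing is back-filled, a customer with a single row simply gets NaN and never borrows a value from the next customer.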
There are several ways to do this. I'd suggest using shift and groupby to find the latest record for each customer, then .loc to filter the rows accordingly.
Setup.
from io import StringIO
import pandas as pd
d = """period, Customer, quantity , product
2020-01, Cust1, 12, TS
2020-02, Cust1, 12, TS
2020-03, Cust1, 14, SLM
2020-01, Cust2, 12, SLM
2020-02, Cust2, 12, TS
2020-03, Cust2, 14, SLM"""
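# skipinitialspace drops the blank that follows each comma in the CSV above
df = pd.read_csv(StringIO(d), sep=',', skipinitialspace=True, parse_dates=['period'])
# the 'quantity ' header still has a trailing space, so strip the column names
df.columns = df.columns.str.strip()
# create a record end date: the next period of the same customer
df['period_end_date'] = df.groupby('Customer')['period'].shift(-1)
# find the previous product, only for each customer's latest record
df.loc[df['period_end_date'].isna(),
'previous_product'] = df.groupby('Customer')['product'].shift(1)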
The current record here is the one where period_end_date is null.
print(df)
period Customer quantity product period_end_date previous_product
0 2020-01-01 Cust1 12 TS 2020-02-01 NaN
1 2020-02-01 Cust1 12 TS 2020-03-01 NaN
2 2020-03-01 Cust1 14 SLM NaT TS
3 2020-01-01 Cust2 12 SLM 2020-02-01 NaN
4 2020-02-01 Cust2 12 TS 2020-03-01 NaN
5 2020-03-01 Cust2 14 SLM NaT TS
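To make the "current record" idea concrete, here is a small sketch that keeps only the rows whose period_end_date is missing. The CSV is rewritten without stray spaces so it parses as is, and the name current is just an illustrative choice.

```python
from io import StringIO
import pandas as pd

# Same sample data as above, written without stray spaces.
d = """period,Customer,quantity,product
2020-01,Cust1,12,TS
2020-02,Cust1,12,TS
2020-03,Cust1,14,SLM
2020-01,Cust2,12,SLM
2020-02,Cust2,12,TS
2020-03,Cust2,14,SLM"""
df = pd.read_csv(StringIO(d), parse_dates=["period"])

# Next period within each customer; missing on each customer's last row.
df["period_end_date"] = df.groupby("Customer")["period"].shift(-1)

# Keep only the current (latest) record of each customer.
current = df[df["period_end_date"].isna()]
print(current[["Customer", "period", "product"]])
```

This relies on the rows already being ordered by period within each customer; if they are not, sort by Customer and period before calling shift.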
If you need it in the predefined format outlined above:
df.loc[df['period_end_date'].isna(),
'previous_product'] = ("FROM "
+ df.groupby('Customer')['product'].shift(1)
+ " TO "
+ df['product'] )
period Customer quantity product period_end_date previous_product
0 2020-01-01 Cust1 12 TS 2020-02-01 NaN
1 2020-02-01 Cust1 12 TS 2020-03-01 NaN
2 2020-03-01 Cust1 14 SLM NaT FROM TS TO SLM
3 2020-01-01 Cust2 12 SLM 2020-02-01 NaN
4 2020-02-01 Cust2 12 TS 2020-03-01 NaN
5 2020-03-01 Cust2 14 SLM NaT FROM TS TO SLM
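If the goal is just to see which customers ever switched product, the same shift-and-compare idea can be reduced to a count per customer. This is a hedged sketch, not part of the answer above: the frame is rebuilt by hand, and the names changed and n_changes are illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "period": pd.to_datetime(["2020-01", "2020-02", "2020-03"] * 2, format="%Y-%m"),
    "Customer": ["Cust1"] * 3 + ["Cust2"] * 3,
    "product": ["TS", "TS", "SLM", "SLM", "TS", "SLM"],
})

# Order rows so that shift() compares consecutive periods of one customer.
df = df.sort_values(["Customer", "period"])

# Previous product within each customer (NaN on the first row).
prev = df.groupby("Customer")["product"].shift()

# True on rows where the product differs from the previous period.
changed = prev.notna() & (prev != df["product"])

# Number of product changes per customer.
n_changes = changed.groupby(df["Customer"]).sum()
print(n_changes)
```

Customers with a count of zero never changed product, so filtering n_changes for values greater than zero lists the ones who did.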