Pandas groupby() with conditions from another DataFrame
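I am trying to create a new column trial1['Return'] using information from trial2.
For each row of trial1, I need the product of the returns in trial2 for that row's Id, restricted to the dates between startDate and EndDate.
I tried using groupby() with a lambda, and also plain conditional indexing, but both raised errors. The only approach that works is a for loop, and I would like to know whether there is a more efficient way.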
import pandas as pd
trial1 = pd.DataFrame([[1,'2016-09-01','2016-09-05'],[1,'2016-09-03','2016-09-06'],[2,'2016-09-01','2016-09-05']] , columns=('Id','startDate','EndDate'))
trial1
trial2 = pd.DataFrame([[1,'2016-09-01',1.1],[1,'2016-09-02',1],[1,'2016-09-03',1],[1,'2016-09-04',1],[1,'2016-09-05',1],[1,'2016-09-06',1],[2,'2016-09-01',1.2],[2,'2016-09-02',1],[2,'2016-09-03',1],[2,'2016-09-04',1],[2,'2016-09-05',1]] , columns=('Id','Date','Return'))
trial2
trial1['EndDate'] = pd.to_datetime(trial1['EndDate'])
trial1['startDate'] = pd.to_datetime(trial1['startDate'])
trial2['Date'] = pd.to_datetime(trial2['Date'])
##This throws a Timestamp error
trial2_g = trial2.groupby('Id')
trial2_g.apply(lambda x: x[x['Date'].isin(pd.date_range(trial1['startDate'], trial1['EndDate']))]['Return'].prod())
##This throws a ValueError (can only compare identical-labeled series object)
trial2['Id'] = trial2['Id'].reset_index(drop=True)
trial1['Id'] = trial1['Id'].reset_index(drop=True)
trial1['Return'] = trial2[(trial2['Id'] == trial1['Id'])
                          & (trial2['Date'].isin(pd.date_range(trial1['startDate'], trial1['EndDate'])))].prod()
## This works and gives the result I want
trial1['Return'] = 0.0
for nn in range(len(trial1)):
    # Assign through a single .loc call to avoid chained assignment
    trial1.loc[nn, 'Return'] = trial2.Return[(trial2.Id == trial1.Id[nn])
                                             & (trial2.Date >= trial1.startDate[nn])
                                             & (trial2.Date <= trial1.EndDate[nn])].prod()
trial1
I would first set the index on trial2:
t2 = trial2.set_index(['Id', 'Date'])
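To illustrate why this index is useful, here is a small sketch on a trimmed-down frame (the names demo, demo_indexed, id1 and window are made up for this example). It assumes the dates within each Id are sorted, as they are in the question. xs picks out the rows for one Id, and label slicing on the resulting DatetimeIndex includes both endpoints.

```python
import pandas as pd

# Hypothetical, trimmed-down version of trial2 used only for illustration.
demo = pd.DataFrame(
    [[1, '2016-09-01', 1.1], [1, '2016-09-02', 1.0], [1, '2016-09-03', 1.0],
     [2, '2016-09-01', 1.2], [2, '2016-09-02', 1.0]],
    columns=['Id', 'Date', 'Return'])
demo['Date'] = pd.to_datetime(demo['Date'])
demo_indexed = demo.set_index(['Id', 'Date'])

# xs(1) selects Id == 1 and drops the 'Id' level, leaving a Date index.
id1 = demo_indexed.xs(1)

# Label-based slicing on the sorted DatetimeIndex keeps both endpoints.
window = id1.loc[pd.Timestamp('2016-09-01'):pd.Timestamp('2016-09-02'), 'Return']
window_product = window.prod()
```

Because both endpoints are kept, the slice matches the >= and <= conditions used in the for loop from the question.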
Then use apply on trial1:
trial1['Return'] = trial1.apply(
    # For each row: take this Id's returns, slice the date window, multiply
    lambda x: t2.xs(x.Id).loc[x.startDate:x.EndDate, 'Return'].prod(), axis=1)
trial1
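If the row-wise apply is still too slow, one possible alternative is to avoid per-row Python calls altogether: merge the two frames on Id, keep only the rows whose Date falls inside the window, and take a grouped product. This is a sketch, not part of the original answer. The names pairs, in_window, products and the helper column row are made up here, and it assumes the merged frame (one row per matching trial1/trial2 pair) fits in memory.

```python
import pandas as pd

trial1 = pd.DataFrame(
    [[1, '2016-09-01', '2016-09-05'],
     [1, '2016-09-03', '2016-09-06'],
     [2, '2016-09-01', '2016-09-05']],
    columns=['Id', 'startDate', 'EndDate'])
trial2 = pd.DataFrame(
    [[1, '2016-09-01', 1.1], [1, '2016-09-02', 1], [1, '2016-09-03', 1],
     [1, '2016-09-04', 1], [1, '2016-09-05', 1], [1, '2016-09-06', 1],
     [2, '2016-09-01', 1.2], [2, '2016-09-02', 1], [2, '2016-09-03', 1],
     [2, '2016-09-04', 1], [2, '2016-09-05', 1]],
    columns=['Id', 'Date', 'Return'])

for col in ('startDate', 'EndDate'):
    trial1[col] = pd.to_datetime(trial1[col])
trial2['Date'] = pd.to_datetime(trial2['Date'])

# Carry the original row label along (as a hypothetical 'row' column) so the
# grouped result can be mapped back onto trial1.
pairs = (trial1.reset_index()
               .rename(columns={'index': 'row'})
               .merge(trial2, on='Id', how='left'))

# Keep only dates inside each row's window (inclusive on both ends).
in_window = pairs['Date'].between(pairs['startDate'], pairs['EndDate'])
products = pairs[in_window].groupby('row')['Return'].prod()

# Rows with no dates in range get 1.0, the same value prod() returns for an
# empty selection in the for loop.
trial1['Return'] = products.reindex(trial1.index, fill_value=1.0)
```

The trade-off is memory: the merge materializes every trial1 row against every trial2 row with the same Id before filtering, so for long histories and many windows the apply version above may be the more practical choice.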