Pandas - Speeding up df.apply() - Calculating time difference
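I have working code that uses df.apply() to calculate the business hours between two dates. However, since my df has ~40k rows, it is very slow. Is there a way to speed it up through vectorization?
Original code: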
import datetime
import pytz
import businesstimedelta
import holidays as pyholidays

workday = businesstimedelta.WorkDayRule(
    start_time=datetime.time(9),
    end_time=datetime.time(17),
    working_days=[0, 1, 2, 3, 4])
vic_holidays = pyholidays.AU(prov='VIC')
holidays = businesstimedelta.HolidayRule(vic_holidays)
businesshrs = businesstimedelta.Rules([workday, holidays])

def BusHrs(start, end):
    return businesshrs.difference(start,end).hours+float(businesshrs.difference(start,end).seconds)/float(3600)

df['Diff Hrs'] = df.apply(lambda row: BusHrs(row['Updated Date'], row['Current Date']), axis=1)
This gives:
Index  Created Date         Updated Date         Diff Hrs    Current Date
10086  2016-11-04 16:00:00  2016-11-11 11:38:00   35.633333  2018-05-29 10:09:11.291391
10087  2016-11-04 16:03:00  2016-11-29 12:54:00  132.850000  2018-05-29 10:09:11.291391
10088  2016-11-04 16:05:00  2016-11-16 08:05:00   56.916667  2018-05-29 10:09:11.291391
10089  2016-11-04 16:17:00  2016-11-08 11:37:00   11.333333  2018-05-29 10:09:11.291391
10090  2016-11-04 16:20:00  2016-11-16 09:58:00   57.633333  2018-05-29 10:09:11.291391
10091  2016-11-04 16:32:00  2016-11-08 11:10:00   10.633333  2018-05-29 10:09:11.291391
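I can see that it is running, but it looks like it may take more than 5 hours to complete.
Curiously, I have a hunch that the closer the two dates are, the faster the calculation.
For example,
df['Time Since Last Update'] = df.apply(lambda row: BusHrs(row['Updated Date'], row['Current Date']), axis=1)
is much faster than
df['Time Since Last Update'] = df.apply(lambda row: BusHrs(row['Created Date'], row['Updated Date']), axis=1)
Optimizing like this is a step above what I'm used to, so any help is much appreciated.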
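If you want to speed up your code, you can start by redefining your function so that businesshrs.difference is only called once per row: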
def BusHrs(start, end):
    diff_hours = businesshrs.difference(start,end)
    return diff_hours.hours+float(diff_hours.seconds)/float(3600)
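To illustrate why this helps: the original one-liner evaluates the difference twice per row, while the version above evaluates it once. The sketch below counts calls using `counting_difference`, a hypothetical stand-in that returns a plain `datetime.timedelta` (so it reads `.days` rather than the library's `.hours`). It is not the `businesstimedelta` API, only a way to make the call count visible.

```python
from datetime import datetime

calls = {"n": 0}  # simple call counter


def counting_difference(start, end):
    """Hypothetical stand-in for businesshrs.difference that records each call."""
    calls["n"] += 1
    return end - start  # plain wall-clock timedelta, not business time


def hours_two_calls(start, end):
    # Mirrors the original one-liner: the expensive call appears twice.
    return (counting_difference(start, end).days * 24
            + counting_difference(start, end).seconds / 3600)


def hours_one_call(start, end):
    # Mirrors the refactored version: compute once, reuse the result.
    diff = counting_difference(start, end)
    return diff.days * 24 + diff.seconds / 3600


start, end = datetime(2016, 11, 4, 16, 0), datetime(2016, 11, 11, 11, 38)

calls["n"] = 0
two = hours_two_calls(start, end)
calls_for_two = calls["n"]

calls["n"] = 0
one = hours_one_call(start, end)
calls_for_one = calls["n"]

print(calls_for_two, calls_for_one)
```

Both variants return the same value; the refactor only removes the repeated call, so it should cut the per-row cost roughly in half without changing how the total scales with the number of rows.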
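Then, I think you can go faster by calculating the hours between two successive update dates and then summing these partial results up to the current date. You need two temporary columns, one with the shifted update dates and the other with the partial business hours: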
import pandas as pd

# sort from the most recent date to the oldest
df = df.sort_values('Updated Date', ascending=False)
# create a column shifted by 1 and fill the resulting NaT with the current time
df['Shift Date'] = df['Updated Date'].shift(1).fillna(pd.Timestamp.now())
# calculate the partial business hours between two successive update dates
df['BsnHrs Partial'] = df.apply(lambda row: BusHrs(row['Updated Date'], row['Shift Date']), axis=1)
# with this order, cumsum() adds up the partial business hours from each update date to now
df['Time Since Last Update'] = df['BsnHrs Partial'].cumsum()
# drop the columns that are no longer useful and sort_index to restore the original order
df = df.drop(['Shift Date', 'BsnHrs Partial'], axis=1).sort_index()
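The approach above relies on business hours being additive across consecutive intervals. Taking that idea one step further, a fully vectorized version can be sketched by computing, for every timestamp, the business hours elapsed since a fixed reference date and then subtracting two such columns. This is a different technique from the one above: it swaps `businesstimedelta` for NumPy's `np.busday_count` and `np.is_busday`. The function names `business_hours_elapsed` and `business_hours_between` and the `holidays` parameter (an iterable of dates) are hypothetical, and the sketch assumes timezone-naive timestamps and the same fixed 09:00-17:00, Monday-to-Friday workday as the question. The sample rows are the six rows shown in the question.

```python
import numpy as np
import pandas as pd

OPEN_HOUR, CLOSE_HOUR = 9, 17  # same 09:00-17:00 workday as the WorkDayRule in the question


def business_hours_elapsed(timestamps, holidays=()):
    """Business hours elapsed between 1970-01-01 and each timestamp (vectorized)."""
    ts = pd.to_datetime(pd.Series(timestamps)).reset_index(drop=True)
    days = ts.to_numpy().astype("datetime64[D]")
    hol = np.array(list(holidays), dtype="datetime64[D]")

    # Whole business days strictly before each timestamp's own calendar day.
    full_days = np.busday_count(np.datetime64("1970-01-01"), days, holidays=hol)

    # Hours worked so far on the timestamp's own day, clipped to opening hours
    # and set to zero on weekends and holidays.
    hour_of_day = (ts - ts.dt.normalize()).dt.total_seconds().to_numpy() / 3600
    partial = np.clip(hour_of_day - OPEN_HOUR, 0, CLOSE_HOUR - OPEN_HOUR)
    partial = np.where(np.is_busday(days, holidays=hol), partial, 0.0)

    return full_days * (CLOSE_HOUR - OPEN_HOUR) + partial


def business_hours_between(start, end, holidays=()):
    """Business hours from start to end, computed on whole columns at once."""
    return business_hours_elapsed(end, holidays) - business_hours_elapsed(start, holidays)


df = pd.DataFrame(
    {
        "Created Date": pd.to_datetime([
            "2016-11-04 16:00:00", "2016-11-04 16:03:00", "2016-11-04 16:05:00",
            "2016-11-04 16:17:00", "2016-11-04 16:20:00", "2016-11-04 16:32:00",
        ]),
        "Updated Date": pd.to_datetime([
            "2016-11-11 11:38:00", "2016-11-29 12:54:00", "2016-11-16 08:05:00",
            "2016-11-08 11:37:00", "2016-11-16 09:58:00", "2016-11-08 11:10:00",
        ]),
    },
    index=[10086, 10087, 10088, 10089, 10090, 10091],
)

df["Diff Hrs"] = business_hours_between(df["Created Date"], df["Updated Date"])
print(df)
```

On these rows, with no holidays passed in, the result agrees with the `Diff Hrs` column shown in the question when it is computed from `Created Date` to `Updated Date` (for example 35.633333 for index 10086). Because there is no Python-level loop over rows, this form should scale much better than `df.apply(..., axis=1)`, at the price of hard-coding the workday rule instead of reusing the `businesstimedelta` rules.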