Vectorizing datetime pandas comparisons

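I recently read a great article (https://towardsdatascience.com/apply-function-to-pandas-dataframe-rows-76df74165ee4) showing that vectorization is much faster than itertuples, and I would like to put that into practice. My current code takes about 16 hours to run over roughly 2 million rows like the following sample, which is stored in the pandas DataFrame object "data":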

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2274589 entries, 0 to 2274588
Data columns (total 5 columns):
 #   Column    Dtype         
---  ------    -----         
 0   Date      object        
 1   Time      object        
 2   Open      float64       
 3   Close     float64       
 4   datetime  datetime64[ns]
print(data)
               Date   Time      Open     Close            datetime
0        02/10/2012  07:26  191.9500  191.9500 2012-02-10 07:26:00
1        02/10/2012  07:56  191.6600  191.6600 2012-02-10 07:56:00
2        02/10/2012  08:00  191.9400  191.9400 2012-02-10 08:00:00
3        02/10/2012  09:30  191.7500  191.7500 2012-02-10 09:30:00
4        02/10/2012  09:54  191.8500  191.8500 2012-02-10 09:54:00

Working code that removes rows with times before 9:30 AM and after 3:59 PM:

from datetime import datetime

keep = []
end = data.shape[0]
for row in data.itertuples(index=True):
    # Market open (09:30) on the same calendar day as this row
    if row.datetime < datetime(year=row.datetime.year, month=row.datetime.month, day=row.datetime.day, hour=9, minute=30, second=0):
        pass
    elif row.datetime.hour >= 16:  # closes at 15:59 (keep it!) in this database's notation
        pass
    else:
        keep.append(row[0])
    print(row[0], "/", end)
data = data.loc[keep, :]

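I am new to vectorization. I have tried a few things, but because the column is a Series rather than a single value, comparing it or assigning to it seems to be the problem. From what I have read, it looks like I need to write a function so that I can do: data['keep_it'] = my_fun(data['datetime'])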

Failed attempts:

data['keep_it'] = my_fun(data['datetime'])
def my_fun(row): # returns 1 if desired to keep  # a vectorized approach
    if (row < datetime.date(year = row.year, month = row.month, day = row.day, hour = 9, minute = 30, second = 0)):
        return 1
     # AttributeError: 'Series' object has no attribute 'year'
    if (row < pd.to_datetime(str(row['datetime'].year) +'/' + str(row['datetime'].month) +'/' + str(row['datetime'].day) + 'T9:30:00')):
        return 1
    # AttributeError: 'Series' object has no attribute 'year'

Any ideas? Thanks!

Here is a vectorized version.

import datetime as dt
import io

import pandas as pd

df = pd.read_csv(io.StringIO("""    Date   Time      Open     Close            datetime
0        02/10/2012  07:26  191.9500  191.9500  2012-02-10 07:26:00
1        02/10/2012  07:56  191.6600  191.6600  2012-02-10 07:56:00
2        02/10/2012  08:00  191.9400  191.9400  2012-02-10 08:00:00
3        02/10/2012  09:30  191.7500  191.7500  2012-02-10 09:30:00
4        02/10/2012  09:54  191.8500  191.8500  2012-02-10 09:54:00"""), sep=r"\s\s+", engine="python")

df["datetime"] = pd.to_datetime(df["datetime"])
df.loc[df["datetime"].dt.time.between(dt.time(9,30),dt.time(15,59))]
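To connect this back to the question: the AttributeError appears because a Series does not expose year, hour and so on directly. For a datetime64 column those components are reached through the .dt accessor, which works on the whole column at once. Below is a minimal sketch of that idea which stores the mask in the keep_it column the question was aiming for. The small frame and its values are made up for illustration and are not from the original data.

```python
import datetime as dt

import pandas as pd

# Illustrative sample only; these rows are invented for the sketch.
data = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2012-02-10 07:26:00",
        "2012-02-10 09:30:00",
        "2012-02-10 15:59:00",
        "2012-02-10 16:00:00",
    ]),
    "Close": [191.95, 191.75, 192.10, 192.20],
})

# .dt.time yields one datetime.time per row; between() is inclusive on both
# ends by default, so 09:30 and 15:59 are both kept.
data["keep_it"] = data["datetime"].dt.time.between(dt.time(9, 30), dt.time(15, 59))

# Use the boolean column as a mask to select the rows to keep.
filtered = data.loc[data["keep_it"]]
print(filtered)
```

Note that the upper bound is 15:59:00 exactly, so this assumes minute-resolution timestamps like those in the question; a row stamped 15:59:30 would be dropped.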

Thanks to @MrFuppes, I came up with this crude approach, which is essentially instantaneous:

testing = pd.DatetimeIndex(data['datetime'])
data = data[(testing.hour<16) & (testing.hour*60+testing.minute >= 9*60+30)] 

Room for improvement includes removing the separate one-line testing variable, and possibly using the DatetimeIndex .time attribute properly.
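One possible way to avoid the hour and minute arithmetic, offered as a sketch rather than a definitive answer, is DatetimeIndex.indexer_between_time, which returns the integer positions of entries whose time of day falls in a given range (both ends included by default). The sample frame below is invented for illustration; the crude mask is reproduced alongside only to check that both give the same rows.

```python
import pandas as pd

# Illustrative sample only; these rows are invented for the sketch.
data = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2012-02-10 07:26:00",
        "2012-02-10 09:30:00",
        "2012-02-10 15:59:00",
        "2012-02-10 16:00:00",
        "2012-02-10 16:30:00",
    ]),
    "Close": [191.95, 191.75, 192.10, 192.20, 192.30],
})

idx = pd.DatetimeIndex(data["datetime"])

# The crude approach: compare hour and minutes-since-midnight directly.
crude = data[(idx.hour < 16) & (idx.hour * 60 + idx.minute >= 9 * 60 + 30)]

# Alternative: let pandas locate the positions inside the time window,
# then select those rows by position.
positions = idx.indexer_between_time("09:30", "15:59")
by_time = data.iloc[positions]

print(by_time)
```

On minute-resolution data such as the sample in the question the two selections agree; with second-level timestamps the 15:59 upper bound would exclude rows such as 15:59:30, which the hour-based test would keep.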