Vectorizing datetime pandas comparisons
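I recently read a great article (https://towardsdatascience.com/apply-function-to-pandas-dataframe-rows-76df74165ee4) showing that vectorization is much faster than itertuples, and I would like to put it into practice. My current code takes about 16 hours to run over roughly 2 million rows. Here is a sample of the data, which is held in a pandas DataFrame object named "data":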
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2274589 entries, 0 to 2274588
Data columns (total 5 columns):
# Column Dtype
--- ------ -----
0 Date object
1 Time object
2 Open float64
3 Close float64
4 datetime datetime64[ns]
print(data)
Date Time Open Close datetime
0 02/10/2012 07:26 191.9500 191.9500 2012-02-10 07:26:00
1 02/10/2012 07:56 191.6600 191.6600 2012-02-10 07:56:00
2 02/10/2012 08:00 191.9400 191.9400 2012-02-10 08:00:00
3 02/10/2012 09:30 191.7500 191.7500 2012-02-10 09:30:00
4 02/10/2012 09:54 191.8500 191.8500 2012-02-10 09:54:00
Working code that removes times before 9:30 AM and after 3:59 PM:
from datetime import datetime

keep = []
end = data.shape[0]
for row in data.itertuples(index=True):
    # Skip rows earlier than 09:30 on the row's own date.
    if row.datetime < datetime(year=row.datetime.year, month=row.datetime.month, day=row.datetime.day, hour=9, minute=30, second=0):
        pass
    elif row.datetime.hour >= 16:  # closes at 15:59 (keep it!) in this database's notation
        pass
    else:
        keep.append(row[0])  # row[0] is the index label
    print(row[0], "/", end)  # progress indicator
data = data.loc[keep, :]
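Vectorization is new to me. I have tried a few things, but because the column is a Series rather than a single value, comparing it or assigning to it is a problem. From what I have read, it seems I need to write a function so that I can do: data['keep_it'] = my_fun(data['datetime'])
Failed attempts: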
data['keep_it'] = my_fun(data['datetime'])

def my_fun(row):  # returns 1 if desired to keep # a vectorized approach
    if (row < datetime.date(year = row.year, month = row.month, day = row.day, hour = 9, minute = 30, second = 0)):
        return 1
    # AttributeError: 'Series' object has no attribute 'year'

    if (row < pd.to_datetime(str(row['datetime'].year) +'/' + str(row['datetime'].month) +'/' + str(row['datetime'].day) + 'T9:30:00')):
        return 1
    # AttributeError: 'Series' object has no attribute 'year'
Any ideas?
Thanks!
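A note on the AttributeError in the failed attempts: a Series does not expose datetime components such as year or hour directly. They are reached through the .dt accessor, and a comparison on a Series yields a boolean Series (a mask) rather than a single True/False, so it cannot drive an if statement. A minimal sketch of that idea, assuming a made-up three-row sample and the illustrative names s, minutes_since_midnight and keep_it:

```python
import pandas as pd

# Hypothetical three-row sample standing in for data['datetime'].
s = pd.Series(pd.to_datetime([
    "2012-02-10 07:26:00",
    "2012-02-10 09:30:00",
    "2012-02-10 16:05:00",
]))

# Element-wise components come from the .dt accessor, not from the Series itself.
minutes_since_midnight = s.dt.hour * 60 + s.dt.minute

# Each comparison returns a boolean Series; combine masks with & instead of `and`.
keep_it = (minutes_since_midnight >= 9 * 60 + 30) & (minutes_since_midnight <= 15 * 60 + 59)

print(keep_it.tolist())
```

The mask can then be used for row selection, for example data[keep_it], without writing a per-row function.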
This is vectorized.
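import datetime as dt
import io

import pandas as pd

# Columns are separated by two or more spaces, so the single space inside
# the datetime column does not split it. The first column becomes the index.
df = pd.read_csv(io.StringIO("""Date  Time  Open  Close  datetime
0  02/10/2012  07:26  191.9500  191.9500  2012-02-10 07:26:00
1  02/10/2012  07:56  191.6600  191.6600  2012-02-10 07:56:00
2  02/10/2012  08:00  191.9400  191.9400  2012-02-10 08:00:00
3  02/10/2012  09:30  191.7500  191.7500  2012-02-10 09:30:00
4  02/10/2012  09:54  191.8500  191.8500  2012-02-10 09:54:00"""), sep=r"\s\s+", engine="python")
df["datetime"] = pd.to_datetime(df["datetime"])
# Keep rows whose time of day lies between 09:30 and 15:59, both ends inclusive.
df.loc[df["datetime"].dt.time.between(dt.time(9, 30), dt.time(15, 59))]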
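As a possible alternative to comparing .dt.time values, pandas also offers DataFrame.between_time, which filters by time of day when the frame has a DatetimeIndex. The sketch below is not part of the answer above; it rebuilds the five sample rows by hand (the Date and Time string columns are left out for brevity) and the name in_session is an assumption for illustration:

```python
import pandas as pd

# The five sample rows from the question, rebuilt by hand.
df = pd.DataFrame({
    "Open": [191.95, 191.66, 191.94, 191.75, 191.85],
    "Close": [191.95, 191.66, 191.94, 191.75, 191.85],
    "datetime": pd.to_datetime([
        "2012-02-10 07:26:00",
        "2012-02-10 07:56:00",
        "2012-02-10 08:00:00",
        "2012-02-10 09:30:00",
        "2012-02-10 09:54:00",
    ]),
})

# between_time needs a DatetimeIndex; both endpoints are included by default.
in_session = df.set_index("datetime").between_time("09:30", "15:59")

print(in_session)
```

Note that this moves the datetime column into the index; call reset_index() afterwards if the original column layout is needed.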
Thanks to @MrFuppes, I came up with this crude approach, which is basically instantaneous:
testing = pd.DatetimeIndex(data['datetime'])
data = data[(testing.hour<16) & (testing.hour*60+testing.minute >= 9*60+30)]
Room for improvement includes eliminating the one-line "testing" step, and possibly making proper use of the DatetimeIndex .time attribute.
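One way the improvement mentioned above might look is DatetimeIndex.indexer_between_time, which returns the integer positions of the entries whose time of day falls in a range. This is a sketch under assumptions, not a benchmarked replacement: the sample frame is made up, and the name filtered is illustrative.

```python
import pandas as pd

# Hypothetical sample standing in for the question's "data" frame.
data = pd.DataFrame({
    "Close": [191.95, 191.75, 191.85, 192.10, 192.30],
    "datetime": pd.to_datetime([
        "2012-02-10 07:26:00",
        "2012-02-10 09:30:00",
        "2012-02-10 15:59:00",
        "2012-02-10 16:00:00",
        "2012-02-10 17:15:00",
    ]),
})

# Positions of rows between 09:30 and 15:59; both endpoints are included by default.
positions = pd.DatetimeIndex(data["datetime"]).indexer_between_time("09:30", "15:59")

# Positional selection keeps the original index labels and column layout.
filtered = data.iloc[positions]

print(filtered)
```

Unlike the hour and minute arithmetic, the cut-off times are stated directly, and the frame keeps its original index.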