How to join two dataframes for which column values are within a certain range?
Given two dataframes df_1 and df_2, how to join them such that the datetime column timestamp of df_1 is in between start and end in dataframe df_2:
print(df_1)
timestamp A B
0 2016-05-14 10:54:33 0.020228 0.026572
1 2016-05-14 10:54:34 0.057780 0.175499
2 2016-05-14 10:54:35 0.098808 0.620986
3 2016-05-14 10:54:36 0.158789 1.014819
4 2016-05-14 10:54:39 0.038129 2.384590
print(df_2)
start end event
0 2016-05-14 10:54:31 2016-05-14 10:54:33 E1
1 2016-05-14 10:54:34 2016-05-14 10:54:37 E2
2 2016-05-14 10:54:38 2016-05-14 10:54:42 E3
Get the corresponding event where df_1.timestamp is between df_2.start and df_2.end:
timestamp A B event
0 2016-05-14 10:54:33 0.020228 0.026572 E1
1 2016-05-14 10:54:34 0.057780 0.175499 E2
2 2016-05-14 10:54:35 0.098808 0.620986 E2
3 2016-05-14 10:54:36 0.158789 1.014819 E2
4 2016-05-14 10:54:39 0.038129 2.384590 E3
One simple solution is to create an interval index from start and end with closed='both', then use get_loc to get the event, i.e. (hoping all the datetimes are of Timestamp dtype):
df_2.index = pd.IntervalIndex.from_arrays(df_2['start'],df_2['end'],closed='both')
df_1['event'] = df_1['timestamp'].apply(lambda x : df_2.iloc[df_2.index.get_loc(x)]['event'])
Output:
timestamp A B event
0 2016-05-14 10:54:33 0.020228 0.026572 E1
1 2016-05-14 10:54:34 0.057780 0.175499 E2
2 2016-05-14 10:54:35 0.098808 0.620986 E2
3 2016-05-14 10:54:36 0.158789 1.014819 E2
4 2016-05-14 10:54:39 0.038129 2.384590 E3
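First use IntervalIndex to create a reference index based on the intervals of interest, then use get_indexer to slice the dataframe that contains the discrete events of interest.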
idx = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')
# .iloc is positional only, so select the 'event' column first and then index by position
event = df_2['event'].iloc[idx.get_indexer(df_1.timestamp)]
event
0 E1
1 E2
1 E2
1 E2
2 E3
Name: event, dtype: object
df_1['event'] = event.to_numpy()
df_1
timestamp A B event
0 2016-05-14 10:54:33 0.020228 0.026572 E1
1 2016-05-14 10:54:34 0.057780 0.175499 E2
2 2016-05-14 10:54:35 0.098808 0.620986 E2
3 2016-05-14 10:54:36 0.158789 1.014819 E2
4 2016-05-14 10:54:39 0.038129 2.384590 E3
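One caveat that may be worth sketching: get_indexer returns -1 for a timestamp that falls in no interval, and a positional lookup with -1 would then silently pick the last row. The following is a minimal sketch, assuming non-overlapping intervals; the extra timestamp 10:54:50 is made up for illustration and lies outside every span.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({"timestamp": pd.to_datetime(
    ["2016-05-14 10:54:33", "2016-05-14 10:54:35", "2016-05-14 10:54:50"])})
df_2 = pd.DataFrame({
    "start": pd.to_datetime(["2016-05-14 10:54:31", "2016-05-14 10:54:34", "2016-05-14 10:54:38"]),
    "end": pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:37", "2016-05-14 10:54:42"]),
    "event": ["E1", "E2", "E3"],
})

idx = pd.IntervalIndex.from_arrays(df_2["start"], df_2["end"], closed="both")
pos = idx.get_indexer(df_1["timestamp"])  # -1 where no interval contains the timestamp

# Look up events by position, then blank out the rows that had no match.
events = df_2["event"].to_numpy(dtype=object)
df_1["event"] = np.where(pos >= 0, events[pos], None)
print(df_1)
```

The last row ends up with a missing event instead of a wrong one.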
Option 1
idx = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')
df_2.index=idx
df_1['event']=df_2.loc[df_1.timestamp,'event'].values
Option 2
df_2['timestamp']=df_2['end']
pd.merge_asof(df_1,df_2[['timestamp','event']],on='timestamp',direction ='forward',allow_exact_matches =True)
Out[405]:
timestamp A B event
0 2016-05-14 10:54:33 0.020228 0.026572 E1
1 2016-05-14 10:54:34 0.057780 0.175499 E2
2 2016-05-14 10:54:35 0.098808 0.620986 E2
3 2016-05-14 10:54:36 0.158789 1.014819 E2
4 2016-05-14 10:54:39 0.038129 2.384590 E3
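Note that the merge_asof call above only matches against end, so a timestamp sitting in a gap between two spans would still pick up the next event. A hedged sketch of one way to guard against that, using made-up data with a gap between 10:54:33 and 10:54:35, and assuming both frames are sorted by their keys:

```python
import pandas as pd

df_1 = pd.DataFrame({"timestamp": pd.to_datetime(
    ["2016-05-14 10:54:32", "2016-05-14 10:54:34", "2016-05-14 10:54:36"])})
df_2 = pd.DataFrame({
    "start": pd.to_datetime(["2016-05-14 10:54:31", "2016-05-14 10:54:35"]),
    "end": pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:37"]),
    "event": ["E1", "E2"],
})

# Forward as-of match on `end`, keeping `start` so it can be checked afterwards.
out = pd.merge_asof(df_1, df_2, left_on="timestamp", right_on="end", direction="forward")

# Keep the event only when the timestamp is not before the matched span's start.
out["event"] = out["event"].where(out["timestamp"] >= out["start"])
out = out.drop(columns=["start", "end"])
print(out)
```

Here 10:54:34 falls in the gap, so its event is left missing rather than being assigned E2.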
In this method, we assume Timestamp objects are used.
df2 start end event
0 2016-05-14 10:54:31 2016-05-14 10:54:33 E1
1 2016-05-14 10:54:34 2016-05-14 10:54:37 E2
2 2016-05-14 10:54:38 2016-05-14 10:54:42 E3
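import numpy as np

event_num = len(df2.event)

def get_event(t):
    # Boolean mask of the spans containing t, dotted with 0..n-1 to get the span's position
    event_idx = ((t >= df2.start) & (t <= df2.end)).dot(np.arange(event_num))
    return df2.event[event_idx]

df1["event"] = df1.timestamp.transform(get_event)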
Explanation of get_event
For each timestamp in df1, say t0 = 2016-05-14 10:54:33, (t0 >= df2.start) & (t0 <= df2.end) will contain exactly 1 True (see Example 1). Then, taking the dot product with np.arange(event_num) gives the index of the event that t0 belongs to.
Examples:
Example 1
   t0 >= df2.start   t0 <= df2.end   After &   np.arange(3)
0  True              True            -> T      0              event_idx
1  False             True            -> F      1              -> 0
2  False             True            -> F      2
Take t2 = 2016-05-14 10:54:35 as another example
   t2 >= df2.start   t2 <= df2.end   After &   np.arange(3)
0  True              False           -> F      0              event_idx
1  True              True            -> T      1              -> 1
2  False             True            -> F      2
We finally use transform to convert each timestamp into an event.
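The dot-product step for a single timestamp can be reproduced in isolation. This is a small sketch that rebuilds df2 from the table above and assumes, as the answer does, that exactly one span contains the timestamp:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    "start": pd.to_datetime(["2016-05-14 10:54:31", "2016-05-14 10:54:34", "2016-05-14 10:54:38"]),
    "end": pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:37", "2016-05-14 10:54:42"]),
    "event": ["E1", "E2", "E3"],
})

t2 = pd.Timestamp("2016-05-14 10:54:35")

# One boolean per span: True where start <= t2 <= end.
mask = ((t2 >= df2["start"]) & (t2 <= df2["end"])).to_numpy()

# With a single True in the mask, the dot product with 0..n-1 is that True's position.
event_idx = int(mask.dot(np.arange(len(df2))))
event = df2["event"].iloc[event_idx]
print(event_idx, event)
```

If no span matched, the dot product would be 0 and the first event would be returned by mistake, so the single-match assumption matters.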
You can use the pandasql module:
import pandasql as ps
sqlcode = '''
select df_1.timestamp
      ,df_1.A
      ,df_1.B
      ,df_2.event
from df_1
inner join df_2
    on df_1.timestamp between df_2.start and df_2.end
'''
newdf = ps.sqldf(sqlcode, locals())
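If pandasql is not installed, the same BETWEEN join can be sketched with the standard library sqlite3 module instead (a swapped-in technique, not what the answer above uses). The timestamps are written as ISO-formatted strings so that SQLite's text comparison orders them correctly, and the end column is quoted because END is an SQL keyword; both choices are assumptions of this sketch.

```python
import sqlite3
import pandas as pd

df_1 = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:35", "2016-05-14 10:54:39"]),
    "A": [0.020228, 0.098808, 0.038129],
})
df_2 = pd.DataFrame({
    "start": pd.to_datetime(["2016-05-14 10:54:31", "2016-05-14 10:54:34", "2016-05-14 10:54:38"]),
    "end": pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:37", "2016-05-14 10:54:42"]),
    "event": ["E1", "E2", "E3"],
})

fmt = "%Y-%m-%d %H:%M:%S"
con = sqlite3.connect(":memory:")

# Store datetimes as sortable text so BETWEEN compares them lexicographically.
df_1.assign(timestamp=df_1["timestamp"].dt.strftime(fmt)).to_sql("df_1", con, index=False)
df_2.assign(
    start=df_2["start"].dt.strftime(fmt),
    end=df_2["end"].dt.strftime(fmt),
).to_sql("df_2", con, index=False)

sqlcode = '''
select df_1.timestamp, df_1.A, df_2.event
from df_1
inner join df_2
    on df_1.timestamp between df_2.start and df_2."end"
order by df_1.timestamp
'''
newdf = pd.read_sql_query(sqlcode, con)
con.close()
print(newdf)
```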
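One solution suggests that polymorphism does not work. I have to agree with firelynx (after extensive testing). However, combining that idea of polymorphism with numpy broadcasting, it can work!
The only problem is that, in the end, under the hood, numpy broadcasting actually does a kind of cross join in which we filter for all elements that compare equal, giving an O(n1*n2) memory cost and an O(n1*n2) performance hit. Probably someone can make this more efficient in a generic sense.
The reason I post here is that the question behind firelynx's solution was closed as a duplicate of this question, with which I tend to disagree. This question and the answers in it do not give a solution for multiple points belonging to multiple intervals, only for one point belonging to multiple intervals. The solution I propose below does take care of these n-m relations.
Basically, create the following two classes, PointInTime and Timespan, for the polymorphism.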
from datetime import datetime

class PointInTime(object):
    doPrint = True

    def __init__(self, year, month, day):
        self.dt = datetime(year, month, day)

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            r = (self.dt == other.dt)
            if self.doPrint:
                print(f'{self.__class__}: comparing {self} to {other} (equals) gives {r}')
            return (r)
        elif isinstance(other, Timespan):
            r = (other.start_date < self.dt < other.end_date)
            if self.doPrint:
                print(f'{self.__class__}: comparing {self} to {other} (Timespan in PointInTime) gives {r}')
            return (r)
        else:
            if self.doPrint:
                print(f'Not implemented... (PointInTime)')
            return NotImplemented

    def __repr__(self):
        return "{}-{}-{}".format(self.dt.year, self.dt.month, self.dt.day)

class Timespan(object):
    doPrint = True

    def __init__(self, start_date, end_date):
        self.start_date = start_date
        self.end_date = end_date

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            r = ((self.start_date == other.start_date) and (self.end_date == other.end_date))
            if self.doPrint:
                print(f'{self.__class__}: comparing {self} to {other} (equals) gives {r}')
            return (r)
        elif isinstance(other, PointInTime):
            r = self.start_date < other.dt < self.end_date
            if self.doPrint:
                print(f'{self.__class__}: comparing {self} to {other} (PointInTime in Timespan) gives {r}')
            return (r)
        else:
            if self.doPrint:
                print(f'Not implemented... (Timespan)')
            return NotImplemented

    def __repr__(self):
        return "{}-{}-{} -> {}-{}-{}".format(self.start_date.year, self.start_date.month, self.start_date.day, self.end_date.year, self.end_date.month, self.end_date.day)
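By the way, if you wish to use operators other than == (such as !=, <, >, <=, >=), you can define their respective methods (__ne__, __lt__, __gt__, __le__, __ge__).
The way to use this in combination with broadcasting is as follows.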
import pandas as pd
import numpy as np
df1 = pd.DataFrame({"pit":[(x) for x in [PointInTime(2015,1,1), PointInTime(2015,2,2), PointInTime(2015,3,3), PointInTime(2015,4,4)]], 'vals1':[1,2,3,4]})
df2 = pd.DataFrame({"ts":[(x) for x in [Timespan(datetime(2015,2,1), datetime(2015,2,5)), Timespan(datetime(2015,2,1), datetime(2015,4,1)), Timespan(datetime(2015,2,1), datetime(2015,2,5))]], 'vals2' : ['a', 'b', 'c']})
a = df1['pit'].values
b = df2['ts'].values
i, j = np.where((a[:,None] == b))
res = pd.DataFrame(
    np.column_stack([df1.values[i], df2.values[j]]),
    columns=df1.columns.append(df2.columns)
)
print(df1)
print(df2)
print(res)
This gives the expected output.
<class '__main__.PointInTime'>: comparing 2015-1-1 to 2015-2-1 -> 2015-2-5 (Timespan in PointInTime) gives False
<class '__main__.PointInTime'>: comparing 2015-1-1 to 2015-2-1 -> 2015-4-1 (Timespan in PointInTime) gives False
<class '__main__.PointInTime'>: comparing 2015-1-1 to 2015-2-1 -> 2015-2-5 (Timespan in PointInTime) gives False
<class '__main__.PointInTime'>: comparing 2015-2-2 to 2015-2-1 -> 2015-2-5 (Timespan in PointInTime) gives True
<class '__main__.PointInTime'>: comparing 2015-2-2 to 2015-2-1 -> 2015-4-1 (Timespan in PointInTime) gives True
<class '__main__.PointInTime'>: comparing 2015-2-2 to 2015-2-1 -> 2015-2-5 (Timespan in PointInTime) gives True
<class '__main__.PointInTime'>: comparing 2015-3-3 to 2015-2-1 -> 2015-2-5 (Timespan in PointInTime) gives False
<class '__main__.PointInTime'>: comparing 2015-3-3 to 2015-2-1 -> 2015-4-1 (Timespan in PointInTime) gives True
<class '__main__.PointInTime'>: comparing 2015-3-3 to 2015-2-1 -> 2015-2-5 (Timespan in PointInTime) gives False
<class '__main__.PointInTime'>: comparing 2015-4-4 to 2015-2-1 -> 2015-2-5 (Timespan in PointInTime) gives False
<class '__main__.PointInTime'>: comparing 2015-4-4 to 2015-2-1 -> 2015-4-1 (Timespan in PointInTime) gives False
<class '__main__.PointInTime'>: comparing 2015-4-4 to 2015-2-1 -> 2015-2-5 (Timespan in PointInTime) gives False
pit vals1
0 2015-1-1 1
1 2015-2-2 2
2 2015-3-3 3
3 2015-4-4 4
ts vals2
0 2015-2-1 -> 2015-2-5 a
1 2015-2-1 -> 2015-4-1 b
2 2015-2-1 -> 2015-2-5 c
pit vals1 ts vals2
0 2015-2-2 2 2015-2-1 -> 2015-2-5 a
1 2015-2-2 2 2015-2-1 -> 2015-4-1 b
2 2015-2-2 2 2015-2-1 -> 2015-2-5 c
3 2015-3-3 3 2015-2-1 -> 2015-4-1 b
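There is probably an extra performance penalty from the overhead of using classes compared to basic Python types, but I have not looked into that.
The above is how we create the "inner" join. It should be straightforward to create the "(outer) left", "(outer) right" and "(full) outer" joins.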
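For comparison, the same n-m matching can be sketched without custom classes by broadcasting plain datetime arrays, which also makes the "left" variant easy to show. This is only an illustration under assumptions: the frame and column names (points, spans, start, end) are hypothetical, the data mirrors the example above, and the bounds are exclusive to match the classes' `<` comparisons.

```python
import numpy as np
import pandas as pd

points = pd.DataFrame({
    "pit": pd.to_datetime(["2015-01-01", "2015-02-02", "2015-03-03", "2015-04-04"]),
    "vals1": [1, 2, 3, 4],
})
spans = pd.DataFrame({
    "start": pd.to_datetime(["2015-02-01", "2015-02-01", "2015-02-01"]),
    "end": pd.to_datetime(["2015-02-05", "2015-04-01", "2015-02-05"]),
    "vals2": ["a", "b", "c"],
})

# (n_points, n_spans) boolean matrix: still O(n1*n2) memory, as noted above.
p = points["pit"].to_numpy()[:, None]
mask = (spans["start"].to_numpy() < p) & (p < spans["end"].to_numpy())

# Inner join: one output row per (point, span) pair that matches.
i, j = np.nonzero(mask)
inner = pd.concat(
    [points.iloc[i].reset_index(drop=True), spans.iloc[j].reset_index(drop=True)],
    axis=1,
)

# Left join: append the points that matched no span, leaving span columns missing.
unmatched = points[~mask.any(axis=1)]
left = pd.concat([inner, unmatched], ignore_index=True)
print(left)
```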
You can make pandas index alignment work for you by setting df_1's index to the timestamp field:
import pandas as pd
df_1 = pd.DataFrame(
    columns=["timestamp", "A", "B"],
    data=[
        (pd.Timestamp("2016-05-14 10:54:33"), 0.020228, 0.026572),
        (pd.Timestamp("2016-05-14 10:54:34"), 0.057780, 0.175499),
        (pd.Timestamp("2016-05-14 10:54:35"), 0.098808, 0.620986),
        (pd.Timestamp("2016-05-14 10:54:36"), 0.158789, 1.014819),
        (pd.Timestamp("2016-05-14 10:54:39"), 0.038129, 2.384590),
    ],
)
# Index df_1 by timestamp so the assignment below can align on it
df_1 = df_1.set_index("timestamp")
df_2 = pd.DataFrame(
    columns=["start", "end", "event"],
    data=[
        (
            pd.Timestamp("2016-05-14 10:54:31"),
            pd.Timestamp("2016-05-14 10:54:33"),
            "E1",
        ),
        (
            pd.Timestamp("2016-05-14 10:54:34"),
            pd.Timestamp("2016-05-14 10:54:37"),
            "E2",
        ),
        (
            pd.Timestamp("2016-05-14 10:54:38"),
            pd.Timestamp("2016-05-14 10:54:42"),
            "E3",
        ),
    ],
)
df_2.index = pd.IntervalIndex.from_arrays(df_2["start"], df_2["end"], closed="both")
Then simply set df_1["event"] to df_2["event"]:
df_1["event"] = df_2["event"]
And voilà:
df_1["event"]
timestamp
2016-05-14 10:54:33 E1
2016-05-14 10:54:34 E2
2016-05-14 10:54:35 E2
2016-05-14 10:54:36 E2
2016-05-14 10:54:39 E3
Name: event, dtype: object
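If the timespans in df_2 do not overlap, you can use numpy broadcasting to compare each timestamp with all of the timespans and determine which timespan it falls in. Then use argmax to figure out which 'Event' to assign (since there can be at most 1 match with non-overlapping timespans).
The where condition is used to set to NaN anything that falls outside all of the timespans (since argmax won't deal with this properly).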
import numpy as np
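# m has shape (len(df_2), len(df_1)): m[k, n] is True when timestamp n lies in span k
m = ((df_1['timestamp'].to_numpy() >= df_2['start'].to_numpy()[:, None])
     & (df_1['timestamp'].to_numpy() <= df_2['end'].to_numpy()[:, None]))
# .to_numpy() drops the duplicated index left by take(), so the assignment is positional
df_1['Event'] = df_2['event'].take(np.argmax(m, axis=0)).where(m.sum(axis=0) > 0).to_numpy()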
print(df_1)
timestamp A B Event
0 2016-05-14 10:54:33 0.020228 0.026572 E1
1 2016-05-14 10:54:34 0.057780 0.175499 E2
2 2016-05-14 10:54:35 0.098808 0.620986 E2
3 2016-05-14 10:54:36 0.158789 1.014819 E2
4 2016-05-14 10:54:39 0.038129 2.384590 E3
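Why the where mask is needed can be seen on a tiny hand-made boolean matrix (the values below are made up for illustration): argmax returns 0 for a column with no True at all, which is indistinguishable from a genuine match in the first span.

```python
import numpy as np

# Rows are timespans, columns are timestamps; the last timestamp matches no span.
m = np.array([
    [True, False, False],
    [False, True, False],
])

first_hit = np.argmax(m, axis=0)  # position of the first True per column, 0 if none
has_hit = m.any(axis=0)           # False where the column is all False

print(first_hit)  # [0 1 0]
print(has_hit)    # [ True  True False]
```

The third column reports position 0 even though nothing matched, so the result has to be masked with has_hit (or, equivalently, m.sum(axis=0) > 0).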
One option is conditional_join from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor
(df_1
 .conditional_join(
     df_2,
     # variable arguments
     # tuple is of the form:
     # col_from_left_df, col_from_right_df, comparator
     ('timestamp', 'start', '>='),
     ('timestamp', 'end', '<='),
     how = 'inner',
     sort_by_appearance = False)
 .drop(columns=['start', 'end'])
)
timestamp A B event
0 2016-05-14 10:54:33 0.020228 0.026572 E1
1 2016-05-14 10:54:34 0.057780 0.175499 E2
2 2016-05-14 10:54:35 0.098808 0.620986 E2
3 2016-05-14 10:54:36 0.158789 1.014819 E2
4 2016-05-14 10:54:39 0.038129 2.384590 E3
You can decide the join type (left, right, or inner) with the how parameter.
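If adding a dependency is not an option, a rough pandas-only equivalent of the inner range join is a cross merge followed by a filter. This is a sketch of a different technique, not of what conditional_join does internally, and it materialises all len(df_1) * len(df_2) pairs, so it only suits small frames.

```python
import pandas as pd

df_1 = pd.DataFrame({
    "timestamp": pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:35", "2016-05-14 10:54:39"]),
    "A": [0.020228, 0.098808, 0.038129],
})
df_2 = pd.DataFrame({
    "start": pd.to_datetime(["2016-05-14 10:54:31", "2016-05-14 10:54:34", "2016-05-14 10:54:38"]),
    "end": pd.to_datetime(["2016-05-14 10:54:33", "2016-05-14 10:54:37", "2016-05-14 10:54:42"]),
    "event": ["E1", "E2", "E3"],
})

# Every row of df_1 paired with every row of df_2.
pairs = df_1.merge(df_2, how="cross")

# Keep only the pairs where start <= timestamp <= end (between is inclusive by default).
inside = pairs["timestamp"].between(pairs["start"], pairs["end"])
result = pairs.loc[inside].drop(columns=["start", "end"]).reset_index(drop=True)
print(result)
```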