How to select only the constant values in a timeseries
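I have a speed timeseries and want to detect every stretch where the value stays constant for longer than a given time. With the data below, for example, I want to detect when there was no movement for more than 2 minutes and put those stretches into another dataframe (along with all the other columns).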
2020-02-27 15:43:00 0.000000
2020-02-27 15:43:30 0.000000
2020-02-27 15:44:00 0.000000
2020-02-27 15:44:30 0.000000
2020-02-27 15:45:00 0.000000
2020-02-27 15:45:30 0.000000
2020-02-27 15:46:00 0.000000
2020-02-27 15:46:30 0.000000
2020-02-27 15:47:00 0.000000
2020-02-27 15:47:30 0.000000
2020-02-27 15:48:00 0.000000
2020-02-27 15:48:30 0.000000
2020-02-27 15:49:00 0.000000
2020-02-27 15:49:30 0.000000
2020-02-27 15:50:00 0.000000
2020-02-27 15:50:30 0.000000
2020-02-27 15:51:00 0.000000
2020-02-27 15:51:30 0.000000
2020-02-27 15:52:00 1.004333
2020-02-27 15:52:30 2.002667
2020-02-27 15:53:00 5.001000
2020-02-27 15:53:30 6.002667
2020-02-27 15:54:00 8.001000
2020-02-27 15:54:30 4.000667
2020-02-27 15:55:00 3.000000
2020-02-27 15:55:30 0.000000
2020-02-27 15:56:00 0.000000
2020-02-27 15:56:30 0.000000
2020-02-27 15:57:00 0.000000
2020-02-27 15:57:30 0.000000
2020-02-27 15:58:00 0.000000
The result would then be df_constant, containing the data from 2020-02-27 15:43:00 to 2020-02-27 15:51:30 and from 2020-02-27 15:55:30 to 2020-02-27 15:58:00.
import pandas as pd
from datetime import datetime

# Build a sample dataframe: 30 evenly spaced timestamps and a value column
d1 = datetime.strptime("2020-02-27 15:43:00", "%Y-%m-%d %H:%M:%S")
d2 = datetime.strptime("2020-02-27 15:58:00", "%Y-%m-%d %H:%M:%S")
df = pd.DataFrame(pd.date_range(d1, d2, periods=30))
df['val'] = [0]*10 + list(range(10)) + [10]*10
df.columns = ['date', 'val']

def get_cont_lists(series, n):
    '''
    Given a list returns list of lists of indices where the values are constant
    for >= n consecutive values
    '''
    lol = []
    current_list = []
    prev_value = None
    for idx, elem in enumerate(series):
        if elem == prev_value:
            # same value as before: extend the current run
            current_list.append(idx)
        else:
            # value changed: store the finished run and start a new one
            lol.append(current_list)
            current_list = [idx]
        prev_value = elem
    # store the last run, then keep only runs with at least n elements
    lol.append(current_list)
    lol = [lst for lst in lol if len(lst) >= n]
    return lol

# pass the value column (the original snippet used an undefined name, lst)
cont_lst = get_cont_lists(df['val'], 4)
# flatten the list of runs into a single list of positional indices
cont_lst = [i for j in cont_lst for i in j]
required_df = df.iloc[cont_lst]
print(required_df)
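- This is a fully vectorized solution, so it will be faster than solutions that use loops or apply.
- The datetime column should be converted to a datetime dtype and the dataframe sorted on that column, although the column itself is not used to determine consecutive occurrences.
- This solution uses parts of two other Stack Overflow answers:
  - GroupBy Pandas Count Consecutive Zero's
- The problem is that the data cannot be grouped by val directly, because the groups of consecutive values are not unique in the example (both groups are 0.0, for instance).
  - .ne, .shift and .cumsum are used to create a Series in which each run of consecutive values gets its own unique label.
- With a Series of unique run labels, groupby can be used to create a Boolean mask for selecting rows; in this case, rows where the count of consecutive values is greater than 4.
  - df['val'].groupby(g).transform('count') > 4 creates the Boolean mask used to select rows from df[['datetime', 'val']].
- Since the request is for no movement over a 2 minute period, the count should be > 4 (that is, >= 5): the time step is 30 seconds, so 5 consecutive occurrences span 2 minutes.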
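To make the run-labelling step concrete, here is a minimal sketch on a toy Series. The values are made up for illustration and are not the data from the question; the variable names starts and run_length are likewise only illustrative.

```python
import pandas as pd

# Toy values, made up for illustration (not the data from the question).
s = pd.Series([0, 0, 0, 5, 5, 0, 0])

# True wherever a value differs from the previous one. The first row is
# always True, because shift() puts NaN in the first position.
starts = s.ne(s.shift())

# The cumulative sum of the run starts gives every run its own integer label,
# so the two separate runs of 0 end up with different labels.
g = starts.cumsum()

# Length of the run each row belongs to, aligned with the original index.
run_length = s.groupby(g).transform('count')

print(g.tolist())           # [1, 1, 1, 2, 2, 3, 3]
print(run_length.tolist())  # [3, 3, 3, 2, 2, 2, 2]
```

Comparing run_length against a threshold then yields a Boolean mask with the same index as the original data, which is what allows it to be used directly for row selection.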
import pandas as pd
# sample dataframe is the same as the data in the op
data = {'datetime': ['2020-02-27 15:43:00', '2020-02-27 15:43:30', '2020-02-27 15:44:00', '2020-02-27 15:44:30', '2020-02-27 15:45:00', '2020-02-27 15:45:30', '2020-02-27 15:46:00', '2020-02-27 15:46:30', '2020-02-27 15:47:00', '2020-02-27 15:47:30', '2020-02-27 15:48:00', '2020-02-27 15:48:30', '2020-02-27 15:49:00', '2020-02-27 15:49:30', '2020-02-27 15:50:00', '2020-02-27 15:50:30', '2020-02-27 15:51:00', '2020-02-27 15:51:30', '2020-02-27 15:52:00', '2020-02-27 15:52:30', '2020-02-27 15:53:00', '2020-02-27 15:53:30', '2020-02-27 15:54:00', '2020-02-27 15:54:30', '2020-02-27 15:55:00', '2020-02-27 15:55:30', '2020-02-27 15:56:00', '2020-02-27 15:56:30', '2020-02-27 15:57:00', '2020-02-27 15:57:30', '2020-02-27 15:58:00'], 'val': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.004333, 2.002667, 5.001, 6.002667, 8.001, 4.000667, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]}
df = pd.DataFrame(data)
# display(df.head())
datetime val
0 2020-02-27 15:43:00 0.0
1 2020-02-27 15:43:30 0.0
2 2020-02-27 15:44:00 0.0
3 2020-02-27 15:44:30 0.0
4 2020-02-27 15:45:00 0.0
# create a Series with the same index as df, where the consecutive values are unique
g = df.val.ne(df.val.shift()).cumsum()
# use g with groupby to count the consecutive values and then create a Boolean using > 4 (will represent 2 minutes, when the time interval is 30 seconds).
consecutive_data = df[['datetime', 'val']][df['val'].groupby(g).transform('count') > 4]
display(consecutive_data)  # display() is available in Jupyter/IPython; use print() in a plain script
datetime val
0 2020-02-27 15:43:00 0.0
1 2020-02-27 15:43:30 0.0
2 2020-02-27 15:44:00 0.0
3 2020-02-27 15:44:30 0.0
4 2020-02-27 15:45:00 0.0
5 2020-02-27 15:45:30 0.0
6 2020-02-27 15:46:00 0.0
7 2020-02-27 15:46:30 0.0
8 2020-02-27 15:47:00 0.0
9 2020-02-27 15:47:30 0.0
10 2020-02-27 15:48:00 0.0
11 2020-02-27 15:48:30 0.0
12 2020-02-27 15:49:00 0.0
13 2020-02-27 15:49:30 0.0
14 2020-02-27 15:50:00 0.0
15 2020-02-27 15:50:30 0.0
16 2020-02-27 15:51:00 0.0
17 2020-02-27 15:51:30 0.0
25 2020-02-27 15:55:30 0.0
26 2020-02-27 15:56:00 0.0
27 2020-02-27 15:56:30 0.0
28 2020-02-27 15:57:00 0.0
29 2020-02-27 15:57:30 0.0
30 2020-02-27 15:58:00 0.0
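The count-based mask above relies on the time step being a fixed 30 seconds. If the sampling were irregular, one possible variant is to measure each run by its actual time span instead. The following is only a sketch under that assumption: the function name select_constant_runs and its parameters are hypothetical and not part of the original answer.

```python
import pandas as pd

def select_constant_runs(df, value_col='val', time_col='datetime',
                         min_duration='2min'):
    """Return rows that belong to runs of constant value spanning at least
    min_duration (hypothetical helper, for illustration only)."""
    out = df.copy()
    out[time_col] = pd.to_datetime(out[time_col])
    out = out.sort_values(time_col)
    # Same run labelling as above: a new label starts whenever the value changes.
    g = out[value_col].ne(out[value_col].shift()).cumsum()
    # Time span of each run, from its first to its last sample.
    times = out[time_col].groupby(g)
    duration = times.transform('max') - times.transform('min')
    return out[duration >= pd.Timedelta(min_duration)]

# Rebuild the data from the question: 31 samples, 30 seconds apart.
df = pd.DataFrame({
    'datetime': pd.date_range('2020-02-27 15:43:00', periods=31,
                              freq=pd.Timedelta(seconds=30)),
    'val': [0.0] * 18
           + [1.004333, 2.002667, 5.001, 6.002667, 8.001, 4.000667, 3.0]
           + [0.0] * 6,
})

df_constant = select_constant_runs(df)
print(df_constant)
```

On the question's data this keeps the same 24 rows as the count-based mask (index 0 to 17 and 25 to 30). A run of 5 samples at 30 second steps spans exactly 2 minutes, so the >= comparison here matches the count > 4 rule; change the comparison to > if the stretch has to be strictly longer than the threshold.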