How to select only the constant values in a timeseries

Question

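I have a speed timeseries and would like to detect all the parts where the value stays constant for more than a specific amount of time. Given the following data, I would like to detect when there is no movement for more than 2 minutes and put these parts into another dataframe (along with all the other columns).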

2020-02-27 15:43:00    0.000000
2020-02-27 15:43:30    0.000000
2020-02-27 15:44:00    0.000000
2020-02-27 15:44:30    0.000000
2020-02-27 15:45:00    0.000000
2020-02-27 15:45:30    0.000000
2020-02-27 15:46:00    0.000000
2020-02-27 15:46:30    0.000000
2020-02-27 15:47:00    0.000000
2020-02-27 15:47:30    0.000000
2020-02-27 15:48:00    0.000000
2020-02-27 15:48:30    0.000000
2020-02-27 15:49:00    0.000000
2020-02-27 15:49:30    0.000000
2020-02-27 15:50:00    0.000000
2020-02-27 15:50:30    0.000000
2020-02-27 15:51:00    0.000000
2020-02-27 15:51:30    0.000000
2020-02-27 15:52:00    1.004333
2020-02-27 15:52:30    2.002667
2020-02-27 15:53:00    5.001000
2020-02-27 15:53:30    6.002667
2020-02-27 15:54:00    8.001000
2020-02-27 15:54:30    4.000667
2020-02-27 15:55:00    3.000000
2020-02-27 15:55:30    0.000000
2020-02-27 15:56:00    0.000000
2020-02-27 15:56:30    0.000000
2020-02-27 15:57:00    0.000000
2020-02-27 15:57:30    0.000000
2020-02-27 15:58:00    0.000000

The result would then be df_constant, with the data from 2020-02-27 15:43:00 to 2020-02-27 15:51:30 and from 2020-02-27 15:55:30 to 2020-02-27 15:58:00.

Answer 1

import pandas as pd
from datetime import datetime


# build a sample dataframe: 30 evenly spaced timestamps between d1 and d2
d1 = datetime.strptime("2020-02-27 15:43:00", "%Y-%m-%d %H:%M:%S")
d2 = datetime.strptime("2020-02-27 15:58:00", "%Y-%m-%d %H:%M:%S")

df = pd.DataFrame({'date': pd.date_range(d1, d2, periods=30)})
# a run of zeros, a ramp from 0 to 9, then a run of tens
df['val'] = [0]*10 + list(range(10)) + [10]*10


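def get_cont_lists(series, n):
    '''
    Given a sequence, return a list of lists of positional indices where
    the value stays constant for >= n consecutive elements.
    '''
    lol = []            # list of lists: one list of indices per run
    current_list = []   # indices of the run currently being built
    prev_value = None

    for idx, elem in enumerate(series):
        if elem == prev_value:
            # same value as before: extend the current run
            current_list.append(idx)
        else:
            # value changed: store the finished run and start a new one
            lol.append(current_list)
            current_list = [idx]
            prev_value = elem

    # store the last run, which the loop never closes
    lol.append(current_list)

    # keep only runs with at least n elements (this also drops the empty initial list)
    lol = [lst for lst in lol if len(lst) >= n]

    return lol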


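# positional indices of the runs with at least 4 equal consecutive values
cont_lst = get_cont_lists(df['val'], 4)
# flatten the list of lists into a single list of indices
cont_lst = [i for j in cont_lst for i in j]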

required_df = df.iloc[cont_lst]

print(required_df)
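
For comparison, the same run detection can be sketched with itertools.groupby from the standard library. This is only an illustrative alternative, not part of the answer above; the function name constant_runs and the sample list are made up for the example.

```python
from itertools import groupby


def constant_runs(values, n):
    """Return lists of positions where a value repeats for at least n consecutive items."""
    runs = []
    # enumerate pairs each value with its position; groupby starts a new
    # group every time the value (pair[1]) changes
    for _, group in groupby(enumerate(values), key=lambda pair: pair[1]):
        positions = [idx for idx, _ in group]
        if len(positions) >= n:
            runs.append(positions)
    return runs


# hypothetical sample data, not taken from the question
sample = [0, 0, 0, 0, 1, 2, 2, 5, 5, 5, 5, 5]
runs = constant_runs(sample, 4)
print(runs)  # [[0, 1, 2, 3], [7, 8, 9, 10, 11]]
```

As with get_cont_lists, the positions can be flattened and passed to df.iloc to pull the matching rows.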

Answer 2

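This is a fully vectorized solution, so it will typically be faster than solutions that use loops or apply.
The datetime column should be converted to a datetime dtype and the dataframe sorted on it, but that column is not used to determine the consecutive occurrences.
This solution uses parts of two other Stack Overflow answers:
1. GroupBy Pandas Count Consecutive Zero's
The problem is that the data can't be grouped by val directly, because in the example the groups of consecutive values are not unique (e.g. both groups are 0.0).
- .ne, .shift and .cumsum are used to create a Series in which each run of consecutive values gets a unique label.
- With that Series of unique run labels, groupby can be used to create a Boolean mask for selecting rows, in this case where the count of consecutive values is greater than 4.
  - df['val'].groupby(g).transform('count') > 4 creates a Boolean mask, which is applied to df[['datetime', 'val']]
  - Since the request is for no movement over a 2 minute period, the count should be > 4 (i.e. at least 5): the time step is 30 seconds, so 5 consecutive occurrences span 2 minutes.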
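
To make the labelling step concrete, here is a minimal sketch on a tiny made-up Series (the values and the threshold of 3 are assumptions chosen only to keep the output short). It shows what .ne, .shift and .cumsum produce before the mask is built.

```python
import pandas as pd

# hypothetical values: a run of two zeros, a single 1.5, then a run of three zeros
val = pd.Series([0.0, 0.0, 1.5, 0.0, 0.0, 0.0])

# True wherever a value differs from the one before it (the first row is always True)
changed = val.ne(val.shift())

# running total of the change flags: every run of equal values gets its own label,
# so the two separate runs of 0.0 end up with different labels
g = changed.cumsum()
print(g.tolist())  # [1, 1, 2, 3, 3, 3]

# number of rows in the run each row belongs to, compared against the threshold
mask = val.groupby(g).transform('count') >= 3
print(mask.tolist())  # [False, False, False, True, True, True]
```

Grouping by val alone would merge both runs of 0.0 into one group of five rows, which is why the cumulative label is needed.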

import pandas as pd

# sample dataframe is the same as the data in the op
data = {'datetime': ['2020-02-27 15:43:00', '2020-02-27 15:43:30', '2020-02-27 15:44:00', '2020-02-27 15:44:30', '2020-02-27 15:45:00', '2020-02-27 15:45:30', '2020-02-27 15:46:00', '2020-02-27 15:46:30', '2020-02-27 15:47:00', '2020-02-27 15:47:30', '2020-02-27 15:48:00', '2020-02-27 15:48:30', '2020-02-27 15:49:00', '2020-02-27 15:49:30', '2020-02-27 15:50:00', '2020-02-27 15:50:30', '2020-02-27 15:51:00', '2020-02-27 15:51:30', '2020-02-27 15:52:00', '2020-02-27 15:52:30', '2020-02-27 15:53:00', '2020-02-27 15:53:30', '2020-02-27 15:54:00', '2020-02-27 15:54:30', '2020-02-27 15:55:00', '2020-02-27 15:55:30', '2020-02-27 15:56:00', '2020-02-27 15:56:30', '2020-02-27 15:57:00', '2020-02-27 15:57:30', '2020-02-27 15:58:00'], 'val': [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.004333, 2.002667, 5.001, 6.002667, 8.001, 4.000667, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]}
df = pd.DataFrame(data)

# display(df.head())
              datetime  val
0  2020-02-27 15:43:00  0.0
1  2020-02-27 15:43:30  0.0
2  2020-02-27 15:44:00  0.0
3  2020-02-27 15:44:30  0.0
4  2020-02-27 15:45:00  0.0

# create a Series with the same index as df, where the consecutive values are unique
g = df.val.ne(df.val.shift()).cumsum()

# use g with groupby to count the consecutive values and then create a Boolean using > 4 (will represent 2 minutes, when the time interval is 30 seconds).
consecutive_data = df[['datetime', 'val']][df['val'].groupby(g).transform('count') > 4]

# display(consecutive_data)

               datetime  val
0   2020-02-27 15:43:00  0.0
1   2020-02-27 15:43:30  0.0
2   2020-02-27 15:44:00  0.0
3   2020-02-27 15:44:30  0.0
4   2020-02-27 15:45:00  0.0
5   2020-02-27 15:45:30  0.0
6   2020-02-27 15:46:00  0.0
7   2020-02-27 15:46:30  0.0
8   2020-02-27 15:47:00  0.0
9   2020-02-27 15:47:30  0.0
10  2020-02-27 15:48:00  0.0
11  2020-02-27 15:48:30  0.0
12  2020-02-27 15:49:00  0.0
13  2020-02-27 15:49:30  0.0
14  2020-02-27 15:50:00  0.0
15  2020-02-27 15:50:30  0.0
16  2020-02-27 15:51:00  0.0
17  2020-02-27 15:51:30  0.0
25  2020-02-27 15:55:30  0.0
26  2020-02-27 15:56:00  0.0
27  2020-02-27 15:56:30  0.0
28  2020-02-27 15:57:00  0.0
29  2020-02-27 15:57:30  0.0
30  2020-02-27 15:58:00  0.0
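
The count-based mask assumes a fixed 30 second step. If the sampling interval is irregular, one possible variation is to measure each run by its timestamps instead. The sketch below is an assumption-laden illustration rather than part of the answer: the helper name constant_stretches, its parameters, and the small sample frame are all hypothetical.

```python
import pandas as pd


def constant_stretches(df, min_duration, time_col='datetime', value_col='val'):
    """Keep rows in runs of equal value_col spanning at least min_duration."""
    out = df.copy()
    # make sure the time column is a datetime dtype and the rows are in time order
    out[time_col] = pd.to_datetime(out[time_col])
    out = out.sort_values(time_col)
    # label each run of consecutive equal values
    run_id = out[value_col].ne(out[value_col].shift()).cumsum()
    # duration of a run = last timestamp minus first timestamp in that run
    times = out[time_col].groupby(run_id)
    duration = times.transform('max') - times.transform('min')
    return out[duration >= pd.Timedelta(min_duration)]


# hypothetical sample: five zeros (2 minutes), two changing values, two zeros (30 seconds)
df = pd.DataFrame({
    'datetime': pd.date_range('2020-02-27 15:43:00', periods=9,
                              freq=pd.Timedelta(seconds=30)),
    'val': [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 2.0, 0.0, 0.0],
})
df_constant = constant_stretches(df, '2min')
print(df_constant)
```

Only the first five rows are kept here, because the trailing run of zeros lasts just 30 seconds. Replacing >= with > would keep only runs lasting strictly longer than the threshold.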


python

time-series

dataframe

pandas

data-science
