Count the number of times values exceed a threshold in a time series
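I have time series data in which a feature crosses an upper and a lower threshold multiple times.
I want to count how many times the upper and lower limits are exceeded.
For example, my upper limit is 35°C and my lower limit is -45°C.
How do I write a function that counts the number of times the data exceeds the upper and lower thresholds, as well as the time the data was in range?
Is there a pythonic way to solve this?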
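I think you need a boolean mask from between, inverted with ~, and then the sum of the True values (between is inclusive at both ends by default, so values exactly equal to -45 or 35 count as in range):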
print ((~df['data'].between(-45, 35)).sum())
Sample:
import pandas as pd
df = pd.DataFrame({'data':[-47,10,0,30,50]})
print (df)
data
0 -47
1 10
2 0
3 30
4 50
print ((~df['data'].between(-45, 35)).sum())
2
Details:
print (df['data'].between(-45, 35))
0 False
1 True
2 True
3 True
4 False
Name: data, dtype: bool
print (~df['data'].between(-45, 35))
0 True
1 False
2 False
3 False
4 True
Name: data, dtype: bool
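The one-liner above gives a single combined count, while the question asks for the upper and lower limits separately. A minimal sketch of how that could be wrapped in a function follows; the name count_exceedances and its lower/upper parameters are hypothetical, and it assumes "exceed" means strictly below or strictly above the limits, which matches the inclusive default of between:

```python
import pandas as pd

def count_exceedances(values, lower=-45, upper=35):
    """Count points strictly below `lower`, strictly above `upper`, and in range.

    Hypothetical helper, not part of pandas.
    """
    s = pd.Series(values)
    below = int((s < lower).sum())                 # points under the lower limit
    above = int((s > upper).sum())                 # points over the upper limit
    in_range = int(s.between(lower, upper).sum())  # inclusive on both ends
    return {"below": below, "above": above, "in_range": in_range}

counts = count_exceedances([-47, 10, 0, 30, 50])
print(counts)  # {'below': 1, 'above': 1, 'in_range': 3}
```

Note that this counts individual data points, not separate excursions; missing values (NaN) fall into none of the three buckets.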
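If your data can contain "runs" of consecutive values above, below, or between the thresholds, and you want to count runs rather than individual data points, you can label the data, collapse consecutive labels, filter, and count: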
In [64]: df = pd.DataFrame({'Temp': [50, 47.7, 45, 0, 0, -1, -1, -2, -10, -30,
...: -45, -45, -46, -20, -1, 2, 2, 10, 10, 20,
...: 35.5, 35, 36, 20, 0, -10, -45.1, -50]})
Create the labels:
In [65]: df['Category'] = 0
In [66]: df.loc[df['Temp'] <= -45, 'Category'] = -1
In [67]: df.loc[df['Temp'] >= 35, 'Category'] = 1
In [68]: df
Out[68]:
Temp Category
0 50.0 1
1 47.7 1
2 45.0 1
3 0.0 0
...
9 -30.0 0
10 -45.0 -1
11 -45.0 -1
12 -46.0 -1
13 -20.0 0
...
19 20.0 0
20 35.5 1
21 35.0 1
22 36.0 1
23 20.0 0
24 0.0 0
25 -10.0 0
26 -45.1 -1
27 -50.0 -1
Then use Series.shift() to compare and collapse consecutive values:
In [69]: df[df['Category'].shift() != df['Category']]
Out[69]:
Temp Category
0 50.0 1
3 0.0 0
10 -45.0 -1
13 -20.0 0
20 35.5 1
23 20.0 0
26 -45.1 -1
From there it is simple to filter and count by category:
In [70]: collapsed = df[df['Category'].shift() != df['Category']]
In [71]: (collapsed['Category'] != 0).sum()
Out[71]: 4
In [72]: (collapsed['Category'] == 0).sum()
Out[72]: 3
Series.value_counts() can also be useful:
In [73]: collapsed['Category'].value_counts()
Out[73]:
0 3
-1 2
1 2
Name: Category, dtype: int64
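Since the question asks for a function, the label, collapse, and count steps could be bundled roughly as follows. This is only a sketch: count_runs and its parameters are hypothetical names, and it keeps the same convention as the session above, where values equal to a threshold count as outside the range:

```python
import pandas as pd

def count_runs(temps, lower=-45, upper=35):
    """Count runs of consecutive low (-1), in-range (0) and high (1) values.

    Hypothetical helper that mirrors the label/collapse/count steps.
    """
    s = pd.Series(temps, dtype="float64")
    category = pd.Series(0, index=s.index)
    category.loc[s <= lower] = -1   # at or below the lower limit
    category.loc[s >= upper] = 1    # at or above the upper limit
    # Keep only the first row of each run: rows whose label differs
    # from the previous row's label (the first row always qualifies).
    starts = category[category != category.shift()]
    return starts.value_counts().to_dict()

temps = [50, 47.7, 45, 0, 0, -1, -1, -2, -10, -30,
         -45, -45, -46, -20, -1, 2, 2, 10, 10, 20,
         35.5, 35, 36, 20, 0, -10, -45.1, -50]
runs = count_runs(temps)
print(runs.get(1, 0), runs.get(-1, 0), runs.get(0, 0))  # 2 2 3
```

A category with no runs is simply absent from the returned dict, which is why the usage line reads it with .get(key, 0).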
How do I write a function which ... the time when the data was in range?
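If you have time series data, it is easy to shift the collapsed data again to compute run durations (an integer index is used here for demonstration):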
In [74]: fake_time_series = collapsed.reset_index()
In [75]: fake_time_series
Out[75]:
index Temp Category
0 0 50.0 1
1 3 0.0 0
2 10 -45.0 -1
3 13 -20.0 0
4 20 35.5 1
5 23 20.0 0
6 26 -45.1 -1
In [76]: fake_time_series.shift(-1)['index'] - fake_time_series['index']
Out[76]:
0 3.0
1 7.0
2 3.0
3 7.0
4 3.0
5 3.0
6 NaN
Name: index, dtype: float64
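With real timestamps the same idea yields durations as Timedelta values. The sketch below assumes hourly readings starting at an arbitrary, made-up date (the original data has no timestamps), and it attributes the interval between two consecutive samples to the earlier sample. Instead of collapsing first, it sums those intervals per category, which gives the total time in range directly:

```python
import pandas as pd

temps = [50, 47.7, 45, 0, 0, -1, -1, -2, -10, -30,
         -45, -45, -46, -20, -1, 2, 2, 10, 10, 20,
         35.5, 35, 36, 20, 0, -10, -45.1, -50]

# Hypothetical timestamps: one reading every 60 minutes
index = pd.date_range("2024-01-01", periods=len(temps), freq="60min")
df = pd.DataFrame({"Temp": temps}, index=index)

df["Category"] = 0
df.loc[df["Temp"] <= -45, "Category"] = -1
df.loc[df["Temp"] >= 35, "Category"] = 1

# Interval following each sample; the last sample has no successor,
# so its interval is unknown (NaT) and is skipped by sum().
step = df.index.to_series().diff().shift(-1)

# Total time spent in each category
durations = step.groupby(df["Category"]).sum()

time_in_range = durations.loc[0]
print(time_in_range)
```

For irregularly sampled data the same code applies unchanged, since each interval is taken from the actual timestamps rather than from a row count.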