Python, pandas: Cut off filter for spikes in a cumulative series

Question

I have a series of cumulative values like this:

1821, 2015-01-26 22:14:42+02:00, 24574.7 
1822, 2015-01-26 22:15:05+02:00, 24574.7 
1823, 2015-01-26 22:15:28+02:00, 24574.8 
1824, 2015-01-26 22:15:49+02:00, 24574.9 
1825, 2015-01-26 22:16:11+02:00, 24574.9 
1826, 2015-01-26 22:16:34+02:00, 24576.0 
1828, 2015-01-26 22:17:19+02:00, 24575.1 
1829, 2015-01-26 22:17:41+02:00, 24575.2 
1830, 2015-01-26 22:18:03+02:00, 24575.3 
1831, 2015-01-26 22:18:25+02:00, 24575.3

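The problem is that I sometimes get values that are not valid for a cumulative series, where values should only ever increase. Row 1826 is an example (its value is 24576.0 and the next value is smaller). Is there a way to remove such values from a pandas Series object, i.e. whenever a value is greater than both the previous value and the next one?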

Answer 1

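You can use np.diff() to compute the differences between adjacent values. Wherever a difference is negative, you know you need to remove the preceding row.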
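A minimal sketch of this idea, using the values from the question. The names s, d, keep and cleaned are illustrative only, and a single pass like this assumes each spike is a single isolated point:

```python
import numpy as np
import pandas as pd

# Values copied from the question; the index labels are the row numbers shown there.
s = pd.Series(
    [24574.7, 24574.7, 24574.8, 24574.9, 24574.9,
     24576.0, 24575.1, 24575.2, 24575.3, 24575.3],
    index=[1821, 1822, 1823, 1824, 1825, 1826, 1828, 1829, 1830, 1831],
)

# d[i] = s[i + 1] - s[i], so d has one element fewer than s.
d = np.diff(s.to_numpy())

# A negative difference at position i marks the value at position i for removal.
# The last value has no successor, so it is always kept.
keep = np.append(d >= 0, True)

cleaned = s[keep]
print(cleaned)
```

If several bad values can occur in a row, one pass may not be enough, and the filter would have to be repeated until no negative difference remains.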

Answer 2

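This can be done with a one-line solution using pandas' boolean indexing. The one-liner also uses a few other tricks: pandas' map and diff methods and a lambda function. map is used to apply the lambda function to every row. The lambda function is needed to create a custom less-than comparison that evaluates NaN values to True.

The example below illustrates this.

Disclaimer: this only works if we can assume that each row is always greater than or equal to the row two positions before it. In other words: row[i] >= row[i-2]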

import pandas as pd
df = pd.DataFrame({'A': ['a', 'b', 'c', 'd', 'e', 'f', 'g'], 'B': [1, 2, 2, 4, 3, 5, 6]})

# Use pandas' diff method, telling it to take the difference 1 row back.
print(df['B'].diff(1))

# Create a boolean index. map and a lambda function handle the tricky case of the first row, whose diff evaluates to NaN.
print(df['B'].diff(1).map(lambda x: not (x < 0)))

# Here is the one-line solution!
# Redefine df to only contain the rows that behave themselves.
df = df[df['B'].diff(1).map(lambda x: not (x < 0))]

print(df)
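As a possible variation (a sketch, not part of the answer above), the same mask can be built without map and a lambda. Comparing NaN with < always yields False, so negating the comparison keeps the first row. The names mask and filtered are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
                   'B': [1, 2, 2, 4, 3, 5, 6]})

# diff() yields NaN for the first row; NaN < 0 is False,
# so the negated mask is True there and the first row is kept.
mask = ~(df['B'].diff() < 0)

# Keep only the rows that are not smaller than their predecessor.
filtered = df[mask]
print(filtered)
```

Like the one-liner above, this drops the row that comes after the drop in value (here the row with B equal to 3), not the row before it.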

Answer 3

There is a built-in method for this, diff:

In [30]:

pd.concat([df.head(1), df[df['cumulative value'].diff()>=0]])
Out[30]:
               timestamp  cumulative value
0                                         
1821 2015-01-26 20:14:42           24574.7
1822 2015-01-26 20:15:05           24574.7
1823 2015-01-26 20:15:28           24574.8
1824 2015-01-26 20:15:49           24574.9
1825 2015-01-26 20:16:11           24574.9
1826 2015-01-26 20:16:34           24576.0
1829 2015-01-26 20:17:41           24575.2
1830 2015-01-26 20:18:03           24575.3
1831 2015-01-26 20:18:25           24575.3

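Edit: As pointed out, filtering on diff here loses the first row (its difference is NaN, and NaN >= 0 is False), so I used an ugly hack: I concatenate the first row with the filtered result so that the first row is not lost.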
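Note that in the output above row 1826 is kept and row 1828 is dropped, because the filter removes the row where the difference is negative. If the goal is the one stated in the question (remove a value that is greater than both its neighbours), a shift-based comparison is one way to sketch it. This is an assumption about the intended behaviour, not the method of this answer, and the names v, is_spike and no_spikes are illustrative:

```python
import pandas as pd

# Values and row labels copied from the question.
df = pd.DataFrame(
    {'cumulative value': [24574.7, 24574.7, 24574.8, 24574.9, 24574.9,
                          24576.0, 24575.1, 24575.2, 24575.3, 24575.3]},
    index=[1821, 1822, 1823, 1824, 1825, 1826, 1828, 1829, 1830, 1831],
)

v = df['cumulative value']

# A spike is strictly greater than both the previous and the next value.
# Comparisons against the NaN produced by shift() are False,
# so the first and last rows are never flagged.
is_spike = (v > v.shift(1)) & (v > v.shift(-1))

no_spikes = df[~is_spike]
print(no_spikes)
```

In a series that never decreases, no value is strictly greater than its successor, so this filter leaves valid data untouched. It only catches single-point spikes.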
