How to select the first occurrence of a repeated item in sequence in a list using traditional Python or pandas/numpy/scipy
Suppose there is a list 'series' that has some repeated elements at several index positions. Is there a way to find the first occurrence of each repeated run of a number?
series = [2,3,7,10,11,16,16,9,11,12,14,16,16,16,5,7,9,17,17,4,8,18,18]
The return should be something like [5, 11, 17, 21], corresponding to the runs [16,16], [16,16,16], [17,17] and [18,18].
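Before reaching for pandas/numpy, here is a minimal "traditional Python" sketch using itertools.groupby, which groups adjacent equal values; the helper name first_repeat_indices is made up for illustration:

```python
from itertools import groupby

def first_repeat_indices(seq):
    """Return the index of the first element of every run of length >= 2."""
    out = []
    i = 0
    for _, group in groupby(seq):
        n = len(list(group))  # length of this run of equal values
        if n > 1:
            out.append(i)     # first index of a repeated run
        i += n
    return out

series = [2,3,7,10,11,16,16,9,11,12,14,16,16,16,5,7,9,17,17,4,8,18,18]
print(first_repeat_indices(series))  # [5, 11, 17, 21]
```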
You can use shift:
In [3815]: s = pd.Series(series)
In [3816]: cond = (s == s.shift(-1))
In [3817]: cond.index[cond]
Out[3817]: Int64Index([5, 11, 12, 17, 21], dtype='int64')
Or, diff:
In [3828]: cond = s.diff(-1).eq(0)
In [3829]: cond.index[cond]
Out[3829]: Int64Index([5, 11, 12, 17, 21], dtype='int64')
For list output, use tolist:
In [3833]: cond.index[cond].tolist()
Out[3833]: [5, 11, 12, 17, 21]
Details
In [3823]: s.head(10)
Out[3823]:
0 2
1 3
2 7
3 10
4 11
5 16
6 16
7 9
8 11
9 12
dtype: int64
In [3824]: cond.head(10)
Out[3824]:
0 False
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 False
9 False
dtype: bool
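Note that the masks above also flag index 12, which sits inside the [16,16,16] run rather than starting it. A minimal sketch that additionally requires the value to differ from its predecessor, so only the first index of each run survives:

```python
import pandas as pd

series = [2,3,7,10,11,16,16,9,11,12,14,16,16,16,5,7,9,17,17,4,8,18,18]
s = pd.Series(series)

# equal to the next value, but not equal to the previous one
cond = (s == s.shift(-1)) & (s != s.shift(1))
print(cond.index[cond].tolist())  # [5, 11, 17, 21]
```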
np.diff & np.flatnonzero
This answer uses np.diff and tests where that difference is zero. At those positions we know we have a repeat. We use np.flatnonzero to get the positions where the differences are zero. However, we only want the first position in each run of consecutive differences, so we use np.diff again to filter down to the first element of each repeated run. This time we use the result as a boolean mask.
d = np.flatnonzero(np.diff(series) == 0)
w = np.append(True, np.diff(d) > 1)
d[w]
array([ 5, 11, 17, 21])
np.flatnonzero
I think this is a better answer. We build a boolean array that evaluates where a value is equal to the next one but not equal to the previous one. We use np.flatnonzero to tell us the positions of the True values.
I also find the symmetry of the answer appealing.
s = np.array(series)
np.flatnonzero(
np.append(s[:-1] == s[1:], True) &
np.append(True, s[1:] != s[:-1])
)
array([ 5, 11, 17, 21])
First create unique group labels via shift and cumsum, then build a mask of the first duplicate in each group and filter via boolean indexing:
s = pd.Series([2,3,7,10,11,16,16,9,11,12,14,16,16,16,5,7,9,17,17,4,8,18,18])
s1 = s.shift(1).ne(s).cumsum()
m = ~s1.duplicated() & s1.duplicated(keep=False)
s2 = m.index[m].tolist()
print (s2)
[5, 11, 17, 21]
print (s1)
0 1
1 2
2 3
3 4
4 5
5 6
6 6
7 7
8 8
9 9
10 10
11 11
12 11
13 11
14 12
15 13
16 14
17 15
18 15
19 16
20 17
21 18
22 18
dtype: int32
print (m)
0 False
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 False
9 False
10 False
11 True
12 False
13 False
14 False
15 False
16 False
17 True
18 False
19 False
20 False
21 True
22 False
dtype: bool
You can quite simply mimic Python's itertools.groupby and group adjacent duplicates together.
>>> import pandas
>>> s = pandas.Series([2, 3, 7, 10, 11, 16, 16, 9, 11, 12, 14, 16, 16, 16, 5, 7, 9, 17, 17, 4, 8, 18, 18])
>>> for _, group in s.groupby((s != s.shift()).cumsum()):
... if len(group) > 1:
... print(group.index[0])
5
11
17
21
Or as a list:
>>> [g.index[0] for _, g in s.groupby((s != s.shift()).cumsum()) if len(g) > 1]
[5, 11, 17, 21]
Since we seem to be competing on speed, and since nobody is likely to beat Divakar / piRSquared without sidestepping the pandas/numpy/scipy requirement, here is my numba solution:
from numba import jit
import numpy as np

@jit
def rpt_idx(s):
    out = []
    j = True
    # stop before the last element so s[i+1] stays in bounds
    for i in range(len(s) - 1):
        if s[i] == s[i+1]:
            if j:
                out.append(i)  # first index of a new repeated run
            j = False
        else:
            j = True
    return out

rpt_idx(series)
Out: array([ 5, 11, 17, 21])
For such a trivial case, pulling out jit is probably complete overkill, but it does provide a big speedup:
%timeit rpt_idx(series)
The slowest run took 10.50 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 1.99 µs per loop
%timeit divakar(series)
The slowest run took 7.73 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 12.5 µs per loop
series_ = np.tile(series,10000).tolist()
%timeit divakar(series_)
100 loops, best of 3: 20.1 ms per loop
%timeit rpt_idx(series_)
100 loops, best of 3: 5.84 ms per loop
Here's a performant one using array slicing, similar to the flatnonzero approach above but without any appending/concatenation -
a = np.array(series)
out = np.flatnonzero((a[2:] == a[1:-1]) & (a[1:-1] != a[:-2]))+1
Sample run -
In [28]: a = np.array(series)
In [29]: np.flatnonzero((a[2:] == a[1:-1]) & (a[1:-1] != a[:-2]))+1
Out[29]: array([ 5, 11, 17, 21])
Runtime tests (for the working solutions)
Approaches -
def piRSquared1(series):
d = np.flatnonzero(np.diff(series) == 0)
w = np.append(True, np.diff(d) > 1)
return d[w].tolist()
def piRSquared2(series):
s = np.array(series)
return np.flatnonzero(
np.append(s[:-1] == s[1:], True) &
np.append(True, s[1:] != s[:-1])
).tolist()
def Zach(series):
s = pd.Series(series)
i = [g.index[0] for _, g in s.groupby((s != s.shift()).cumsum()) if len(g) > 1]
return i
def jezrael(series):
s = pd.Series(series)
s1 = s.shift(1).ne(s).cumsum()
m = ~s1.duplicated() & s1.duplicated(keep=False)
s2 = m.index[m].tolist()
return s2
def divakar(series):
a = np.array(series)
x = a[1:-1]
return (np.flatnonzero((a[2:] == x) & (x != a[:-2]))+1).tolist()
For the setup, we simply tile the sample input many times.
Timings -
Case #1: Large set
In [34]: series0 = [2,3,7,10,11,16,16,9,11,12,14,16,16,16,5,7,9,17,17,4,8,18,18]
In [35]: series = np.tile(series0,10000).tolist()
In [36]: %timeit piRSquared1(series)
...: %timeit piRSquared2(series)
...: %timeit Zach(series)
...: %timeit jezrael(series)
...: %timeit divakar(series)
...:
100 loops, best of 3: 8.06 ms per loop
100 loops, best of 3: 7.79 ms per loop
1 loop, best of 3: 3.88 s per loop
10 loops, best of 3: 24.3 ms per loop
100 loops, best of 3: 7.97 ms per loop
Case #2: Bigger set (on the top 2 solutions)
In [40]: series = np.tile(series0,1000000).tolist()
In [41]: %timeit piRSquared2(series)
1 loop, best of 3: 823 ms per loop
In [42]: %timeit divakar(series)
1 loop, best of 3: 823 ms per loop
Now, these two solutions differ only in that the latter avoids the append. Let's take a closer look at them on a smaller dataset -
In [43]: series = np.tile(series0,100).tolist()
In [44]: %timeit piRSquared2(series)
10000 loops, best of 3: 89.4 µs per loop
In [45]: %timeit divakar(series)
10000 loops, best of 3: 82.8 µs per loop
So, this suggests that avoiding the concatenation/append helps the latter solution a lot on smaller datasets, but on bigger ones they become comparable.
A marginal improvement on bigger datasets is possible with one concatenation. Thus, the last step could be rewritten as:
np.flatnonzero(np.concatenate(([False],(a[2:] == a[1:-1]) & (a[1:-1] != a[:-2]))))