如何改进 Python 中列表的模式匹配

Question

我有一个很大的列表，其中可能包含数千到数百万个条目。我设置了一个有限大小的 window 来滑过列表。我需要计算 windows 中的匹配元素，并通过一次向前滑动 window 1 位置来重复该过程。这是一个列表的简单示例

L = [1 2 1 3 4 5 1 2 1 2 2 2 3 ]

假设window的长度为3，它将捕获

[1 2 1] 其中包含一对匹配元素 (1 & 1)
将windows向前移动1个位置=> [2 1 3]，没有匹配的元素
将windows向前移动1个位置=> [1 3 4]，没有匹配的元素
将windows向前移动1个位置=> [3 4 5]，没有匹配的元素
将windows向前移动1个位置=> [4 5 1]，没有匹配的元素
将windows向前移动1个位置=> [5 1 2]，没有匹配的元素
将windows向前移动1个位置=> [1 2 1], 1个匹配元素(1 & 1)
将windows向前移动1个位置=> [2 1 2], 1个匹配元素(2 & 2)
将windows向前移动1个位置=> [1 2 2], 1个匹配元素(2 & 2)
将windows向前移动1个位置=> [2 2 2], 3个匹配元素([2 2 -], [2 - 2], [- 2 2])
将windows向前移动1个位置=> [2 2 3], 1个匹配元素(2 & 2)

所以总共有 1 + 1 + 1 + 1 + 3 + 1 = 8 对配对。我发现了使用 itertools 查找 window 中所有元素的组合并开发代码来查找所有匹配对

的想法

import itertools
L = [1,2,1,3,4,5,1,2,1,2,2,2,3]
winlen = 3
totalMatch = 0
for n in range(len(L)-winlen+1):
    window = [L[n+i] for i in range(winlen)]
    A = list(itertools.combinations(window, 2))
    match = [a==b for a, b in A]
    totalMatch += sum(match)

它适用于一个简短的列表，但对于列表和 windows 变大的列表，这段代码太慢了。我已经使用 C++ 多年并决定切换到 python，如果有任何提高代码效率的提示，我将不胜感激。

Answer 1

更有效地跟踪 window 中的数据？这是 O(|L|) 而不是你的 O(|L|*winlen^2)。它在 ctr 中保留 window 的元素计数，在 match 中保留 window 的匹配项。例如，当一个新值进入 window 并且在 window 中已经有两个该值的实例时，您会得到两个新的匹配项。类似地，对于落在 window 之外的值，它与它匹配的次数与它在 window.

中的其他实例一样多。

from collections import Counter

L = [1,2,1,3,4,5,1,2,1,2,2,2,3]
winlen = 3

totalMatch = match = 0
ctr = Counter()
for i, x in enumerate(L):
    
    # Remove old element falling out of window
    if i >= winlen:
        ctr[L[i-winlen]] -= 1
        match -= ctr[L[i-winlen]]

    # Add new element to window
    match += ctr[x]
    ctr[x] += 1

    # Update the total (for complete windows)
    if i >= winlen - 1:
        totalMatch += match

print(totalMatch)

L 和 winlen 的基准结果乘以 20：

 38.75 ms  original
  0.18 ms  Manuel

 38.73 ms  original
  0.19 ms  Manuel

 38.87 ms  original
  0.18 ms  Manuel

基准代码（还包括所有长度为0到9的数字1到3列表的测试代码）：

from timeit import repeat
import itertools
from itertools import product
from collections import Counter

def original(L, winlen):
    totalMatch = 0
    for n in range(len(L)-winlen+1):
        window = [L[n+i] for i in range(winlen)]
        A = list(itertools.combinations(window, 2))
        match = [a==b for a, b in A]
        totalMatch += sum(match)
    return totalMatch

def Manuel(L, winlen):
    totalMatch = match = 0
    ctr = Counter()
    for i, x in enumerate(L):
        if i >= winlen:
            ctr[L[i-winlen]] -= 1
            match -= ctr[L[i-winlen]]
        match += ctr[x]
        ctr[x] += 1
        if i >= winlen - 1:
            totalMatch += match
    return totalMatch

def test():
    for n in range(10):
        print('testing', n)
        for T in product([1, 2, 3], repeat=n):
            L = list(T)
            winlen = 3
            expect = original(L, winlen)
            result = Manuel(L, winlen)
            assert result == expect, (L, expect, result)

def bench():
    L = [1,2,1,3,4,5,1,2,1,2,2,2,3] * 20
    winlen = 3 * 20
    for _ in range(3):
        for func in original, Manuel:
            t = min(repeat(lambda: func(L, winlen), number=1))
            print('%6.2f ms ' % (t * 1e3), func.__name__)
        print()

test()
bench()

Answer 2

你可以在for循环中使用np.bincount，确定每个number/bin的组合数，然后将其加总。

import numpy as np

L = [1, 2, 1, 3, 4, 5, 1, 2, 1, 2, 2, 2, 3]
winlen = 3

L = np.array(L) # convert to array to speed up indexing

total = 0
for i in range(len(L) - winlen + 1):
    bc = np.bincount(L[i:i+winlen]) # bincount on the window
    bc = bc[bc>1] # get rid of all single and empty values
    bc = bc * (bc-1) // 2 # gauss addition, number of combinations of each number
    total += np.sum(bc) # sum up combinations, and add to total

print(total)
# 8

如何改进 Python 中列表的模式匹配

How to improve the pattern matching on a list in Python

python

list

itertools