Is there a faster way to find co-occurrence of two elements in list of lists

I have a list like this:

a = [
    ['a1', 'b2', 'c3'],
    ['c3', 'd4', 'a1'],
    ['b2', 'a1', 'e5'],
    ['d4', 'a1', 'b2'],
    ['c3', 'b2', 'a1']
    ]

I will be given x (for example: 'a1'). I have to find the co-occurrence of x with all the other elements, sort the counts, and retrieve the top n (for example: top 2). My answer should be:

[
 {'product_id': 'b2', 'count': 4}, 
 {'product_id': 'c3', 'count': 3}, 
]

My current code looks like this:

import itertools

def compute(x):
    # all unique items across the dataset
    set_a = list(set(itertools.chain(*a)))
    count_dict = []
    for item in set_a:
        if item == x:
            continue
        # count the rows in which both x and item appear
        count = 0
        for row in a:
            if x in row and item in row:
                count += 1
        if count > 0:
            count_dict.append({'product_id': item, 'count': count})
    count_dict = sorted(count_dict, key=lambda k: k['count'], reverse=True)[:2]
    return count_dict

And it works for smaller inputs. However, my actual input has 70,000 unique items instead of 5 (a1 to e5) and 1.3 million rows instead of 5, so the m×n scan becomes prohibitively expensive. Is there a faster way?

"Faster" is a very general term. Do you need shorter total processing time, or a shorter response time per request? Is this for a single request only, or do you want a system that handles repeated queries?

If what you need is the fastest response time for repeated queries, then turn this whole list of lists into a graph, with each element as a node and each edge weighted by the number of rows in which its two endpoints co-occur. You make a single pass over the data to build the graph. For each node, sort its edge list by weight. From there, each request is a simple lookup: return the weights of the node's top edges, which is one hash (a linear function) plus two direct-access operations (base address + offset).


Update after the OP's reply

"Fastest response", then, settles the choice of algorithm. What you want is a simple dict keyed by each node. Each node's value is a sorted list of related elements and their counts.

A graph package (e.g., networkx) would give you a nice entry point, but it likely won't keep a node's edges in a fast-access form, nor sorted by weight. Instead, preprocess your database. For each row, you have a list of related elements. Let's look at the processing of one such row in your dataset; call the elements a5, b2, z1 and the dictionary d, and assume a5 and b2 are already in your dictionary.

Using `itertools`, iterate through the six ordered pairs:
(a5, b2):
    d[a5][b2] += 1
(a5, z1):
    d[a5][z1]  = 1  (creates a new entry under a5)
(b2, a5):
    d[b2][a5] += 1
(b2, z1):
    d[b2][z1]  = 1  (creates a new entry under b2)
(z1, a5):
    d[z1] = {}      (creates a new z1 entry in d)
    d[z1][a5]  = 1  (creates a new entry under z1)
(z1, b2):
    d[z1][b2]  = 1  (creates a new entry under z1)

You'll want to use defaultdict to avoid the bother of detecting and initializing new entries.

After processing the entire dataset, you then sort each sub-dictionary by its values. That leaves you with an ordered sequence for each element. When you need the top n connected elements, go straight to the dict and extract:

top = d[elem][:n]

Can you finish the coding from there?
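
For reference, here is a minimal sketch of the whole approach (the function name `build_cooccurrence` and the final dict comprehension are mine, not from the answer):

from collections import defaultdict
from itertools import permutations

def build_cooccurrence(rows):
    """One pass over the data: d[x][y] = number of rows where x and y co-occur."""
    d = defaultdict(lambda: defaultdict(int))
    for row in rows:
        for x, y in permutations(set(row), 2):  # set() ignores duplicates within a row
            d[x][y] += 1
    # Sort each node's neighbour list by count, descending.
    return {x: sorted(nbrs.items(), key=lambda kv: kv[1], reverse=True)
            for x, nbrs in d.items()}

a = [
    ['a1', 'b2', 'c3'],
    ['c3', 'd4', 'a1'],
    ['b2', 'a1', 'e5'],
    ['d4', 'a1', 'b2'],
    ['c3', 'b2', 'a1'],
]
d = build_cooccurrence(a)
print(d['a1'][:2])  # [('b2', 4), ('c3', 3)]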

As @Prune mentioned, you haven't said whether you want shorter processing time or a shorter response time, so I will explain two approaches to this problem.

  1. Optimized-code approach (shorter processing time)
import itertools
from heapq import nlargest
from operator import itemgetter

# say we have K THREADS
def compute(x, top_n=2):
    # first find the unique items and keep them somewhere easily accessible
    set_a = list(set(itertools.chain(*a)))

    # find which of your ROWS contain x
    selected_rows = []
    for i, row in enumerate(a):  # this whole loop can be parallelized
        if x in row:
            selected_rows.append(i)  # store the index of each matching row

    # time complexity so far is still O(M*N), but each row can be
    # evaluated independently, so with K threads this part takes
    # roughly (M/K)*N

    count_dict = []

    # the same thing you did earlier, but the second loop now
    # scans fewer rows
    for val in set_a:
        if val == x:
            continue
        count = 0
        for ri in selected_rows:  # this part can be parallelized as well
            if val in a[ri]:
                count += 1
        count_dict.append({'product_id': val, 'count': count})

    # in the worst case selected_rows has M entries; with the U outer
    # iterations split over K threads, this part costs roughly (U/K)*M*N

    res = nlargest(top_n, count_dict, key=itemgetter('count'))
    return res

Let's work out the time complexity here. With K threads (one way to actually split the scan across workers is sketched after the list below), it is

O((M/K)*N) + O((U/K)*M*N)

where

M---> Total rows
N---> Total Columns
U---> Unique Values
K---> number of threads
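
Here is one way the row scan could be split across K workers; this sketch and its function names are mine. I've used processes rather than threads, because the GIL prevents plain Python threads from speeding up a CPU-bound loop like this one:

from concurrent.futures import ProcessPoolExecutor

def scan_chunk(args):
    """Return the original indices of the rows in this chunk that contain x."""
    x, chunk = args  # chunk is a list of (original_index, row) pairs
    return [i for i, row in chunk if x in row]

def find_rows_parallel(x, rows, k=4):
    indexed = list(enumerate(rows))
    chunks = [indexed[i::k] for i in range(k)]  # stride-split the rows over k workers
    with ProcessPoolExecutor(max_workers=k) as ex:
        parts = ex.map(scan_chunk, [(x, c) for c in chunks])
    return sorted(i for part in parts for i in part)

if __name__ == "__main__":
    a = [['a1', 'b2', 'c3'], ['c3', 'd4', 'a1'], ['b2', 'a1', 'e5'],
         ['d4', 'a1', 'b2'], ['c3', 'b2', 'a1']]
    print(find_rows_parallel('a1', a))  # [0, 1, 2, 3, 4]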
  2. Graph approach suggested by @Prune
# other approach
# building on @Prune's approach
import itertools

big_dictionary = {}
set_a = list(set(itertools.chain(*a)))
for x in set_a:
    big_dictionary[x] = []
    for y in set_a:
        if x == y:
            continue
        # count the rows in which x and y co-occur
        count = 0
        for arr in a:
            if (x in arr) and (y in arr):
                count += 1
        big_dictionary[x].append((y, count))

# sort each element's neighbour list by count, descending
for x in big_dictionary:
    big_dictionary[x] = sorted(big_dictionary[x], key=lambda v: v[1], reverse=True)

Let's work out the time complexity of this one. The one-time cost is:

O(U*U*M*N)

where

M---> Total rows
N---> Total Columns
U---> Unique Values

But once this big_dictionary has been computed, getting your top-N values takes a single step. For example, to get the top 3 values for a1:

result = big_dictionary['a1'][:3]

I followed the defaultdict approach that @Prune suggested above. Here is the final code:

from collections import defaultdict
import numpy as np

def recommender(input_item, b_list, n):
    count = []
    top_items = []
    # for every item, count the transactions it shares with input_item
    for x in b.keys():
        lst_2 = b[x]
        common_transactions = len(set(b_list) & set(lst_2))
        count.append(common_transactions)

    # take the top n+1 indices in descending order of count, then drop
    # the first one, which is input_item itself
    top_ids = list((np.argsort(count)[:-n-2:-1])[1::])
    top_values_counts = [count[i] for i in top_ids]

    key_list = list(b.keys())
    for i, v in enumerate(top_ids):
        item_id = key_list[v]
        top_items.append({item_id: top_values_counts[i]})
    print(top_items)
    return top_items

a = [
        ['a1', 'b2', 'c3'],
        ['c3', 'd4', 'a1'],
        ['b2', 'a1', 'e5'],
        ['d4', 'a1', 'b2'],
        ['c3', 'b2', 'a1']
    ]

# b maps each item to the list of row indices (transactions) containing it
b = defaultdict(list)
for i, s in enumerate(a):
    for key in s:
        b[key].append(i)

input_item = str(input("Enter the item_id: "))
n = int(input("How many values to be retrieved? (eg: top 5, top 2, etc.): "))
top_items = recommender(input_item, b[input_item], n)

Here is the output for the top 3 for 'a1':

[{'b2': 4}, {'c3': 3}, {'d4': 2}]

Thanks!!!