Is there a faster way to find co-occurrence of two elements in a list of lists
I have a list like this:
a = [
    ['a1', 'b2', 'c3'],
    ['c3', 'd4', 'a1'],
    ['b2', 'a1', 'e5'],
    ['d4', 'a1', 'b2'],
    ['c3', 'b2', 'a1']
]
I will be given an x (e.g. 'a1'). I have to find the co-occurrence of a1 with all the other elements, sort the counts, and retrieve the top n (e.g. top 2).
My answer should be:
[
    {'product_id': 'b2', 'count': 4},
    {'product_id': 'c3', 'count': 3},
]
My current code looks like this:
import itertools

def compute(x):
    # all unique items across the rows
    set_a = list(set(itertools.chain(*a)))
    count_dict = []
    for i in range(len(set_a)):
        if x == set_a[i]:
            continue
        count = 0
        for j in range(len(a)):
            # note: `if x and set_a[i] in a[j]` only tested set_a[i];
            # both items must be checked for membership in the row
            if x in a[j] and set_a[i] in a[j]:
                count += 1
        if count > 0:
            count_dict.append({'product_id': set_a[i], 'count': count})
    count_dict = sorted(count_dict, key=lambda k: k['count'], reverse=True)[:2]
    return count_dict
And it works for smaller inputs. However, my actual input has 70,000 unique items instead of 5 (a1 to e5) and 1.3 million rows instead of 5, so the m × n scan becomes prohibitively expensive. Is there a faster way to do this?
"Faster" is a very general term. Do you need shorter total processing time, or a shorter response time per request? Is this for a single request, or do you want a system that handles repeated queries?
If what you need is the fastest response time for repeated queries, then turn this whole list of lists into a graph, with each element as a node and each edge weight being the number of times the two elements co-occur. You make a single pass over the data to build the graph. For each node, sort the edge list by weight. From there, every request is a simple lookup: return the weights of the node's top edges, which is one hash (a linear-time function of the key) and two direct-access operations (base address + offset).
Update after OP's reply
"Fastest response", then, settles the algorithm. What you want is a simple dict keyed by each node. Each node's value is a sorted list of related elements and their counts.
A graph package (e.g., networkx) would give you a good entry point, but it probably won't keep a node's edges in a fast-access form, nor sorted by weight. Instead, preprocess your database. For each row, you have a list of related elements. Let's look at the processing of one row of the data set; call the elements a5, b2, z1 and the dict d. Assume a5 and b2 are already in your dict.
Using `itertools`, iterate through the six ordered pairs:
(a5, b2): d[a5][b2] += 1
(a5, z1): d[a5][z1] = 1 (creates a new entry under a5)
(b2, a5): d[b2][a5] += 1
(b2, z1): d[b2][z1] = 1 (creates a new entry under b2)
(z1, a5): d[z1] = {} (creates a new z1 entry in d)
          d[z1][a5] = 1 (creates a new entry under z1)
(z1, b2): d[z1][b2] = 1 (creates a new entry under z1)
You'll want to use a defaultdict to avoid the hassle of detecting and initializing new entries.
Once all of this is processed, sort each sub-dict by its values. That leaves you with an ordered sequence for each element. When you need the top n connected elements, go straight to the dict and extract them:
top = d[elem][:n]
Can you finish the coding from there?
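Here is a minimal sketch of the preprocessing described above, assuming the data is in a list of lists `a` as in the question (the names `pair_counts` and `top_n` are illustrative, not from the answer itself):

from collections import defaultdict, Counter
from itertools import permutations

# one pass over the data: count every ordered pair within each row
pair_counts = defaultdict(Counter)
for row in a:
    for u, v in permutations(row, 2):
        pair_counts[u][v] += 1

# sort each node's neighbours by weight once, after the single pass
d = {u: c.most_common() for u, c in pair_counts.items()}

def top_n(elem, n):
    # each request is now one hash lookup plus a slice
    return [{'product_id': p, 'count': c} for p, c in d[elem][:n]]

print(top_n('a1', 2))
# [{'product_id': 'b2', 'count': 4}, {'product_id': 'c3', 'count': 3}]

Counter.most_common() returns the (element, count) pairs sorted by count in descending order, which is exactly the per-node ordering the answer calls for.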
As @Prune mentioned, you haven't said whether you want shorter processing time or a shorter response time, so I will explain two approaches to this problem.
- Optimized-code approach (shorter processing time)
import itertools
from heapq import nlargest
from operator import itemgetter

# say we have K THREADS
def compute(x, top_n=2):
    # first, find the unique items and save them somewhere easily accessible
    set_a = list(set(itertools.chain(*a)))
    # find which of your ROWS contain x
    selected_rows = []
    for i, row in enumerate(a):  # this whole loop can be parallelized
        if x in row:
            selected_rows.append(i)  # append the row's index to selected_rows
    # time complexity so far is still O(M*N), but each of the M rows can be
    # evaluated independently, so with K threads this part takes (M/K)*N
    count_dict = []
    # the same thing you did earlier, but the second loop now scans fewer rows
    for val in set_a:
        if val == x:
            continue
        count = 0
        for ri in selected_rows:  # this part can be parallelized as well
            if val in a[ri]:
                count += 1
        count_dict.append({'product_id': val, 'count': count})
    # if selected_rows has size M in the worst case and there are U unique
    # values, this part costs (U/K)*(M/K)*N
    res = nlargest(top_n, count_dict, key=itemgetter('count'))
    return res
Let's compute the time complexity here. With K threads it is
O((M/K)*N) + O((U/K)*(M/K)*N)
where
M ---> total rows
N ---> total columns
U ---> unique values
K ---> number of threads
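The answer above only asserts that the row scan can be parallelized; here is a minimal sketch of one way to do it with the standard library (the chunking scheme and the names `scan_chunk` and `find_rows_parallel` are mine, not from the answer). Note that CPython threads won't speed up a CPU-bound loop because of the GIL, so this sketch uses processes:

from concurrent.futures import ProcessPoolExecutor

def scan_chunk(args):
    # scan one contiguous chunk of rows for x; return matching row indices
    x, rows, offset = args
    return [offset + i for i, row in enumerate(rows) if x in row]

def find_rows_parallel(x, a, k=4):
    # split the M rows into k chunks, roughly (M/k)*N work per worker
    chunk = (len(a) + k - 1) // k
    jobs = [(x, a[s:s + chunk], s) for s in range(0, len(a), chunk)]
    selected_rows = []
    with ProcessPoolExecutor(max_workers=k) as pool:
        for part in pool.map(scan_chunk, jobs):
            selected_rows.extend(part)
    return selected_rows

On platforms that spawn worker processes, the call to find_rows_parallel must sit under an if __name__ == '__main__': guard.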
- Graph approach suggested by Prune
# other approach, building on Prune's answer
import itertools

big_dictionary = {}
set_a = list(set(itertools.chain(*a)))
for x in set_a:
    big_dictionary[x] = []
    for y in set_a:
        if x == y:
            continue
        count = 0
        for arr in a:
            if (x in arr) and (y in arr):
                count += 1
        big_dictionary[x].append((y, count))
for x in big_dictionary:
    big_dictionary[x] = sorted(big_dictionary[x], key=lambda v: v[1], reverse=True)
Let's compute the time complexity of this one. The one-time cost is
O(U*U*M*N)
where
M ---> total rows
N ---> total columns
U ---> unique values
But once this big_dictionary has been computed, getting your top-N values takes a single step. For example, to get the top 3 values for a1:
result = big_dictionary['a1'][:3]
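To see the difference in scale with the question's numbers (U = 70,000 unique items, M = 1.3 million rows, and taking the sample's N = 3 columns for illustration): the quadruple loop above costs on the order of U*U*M*N ≈ 70,000² × 1.3×10⁶ × 3 ≈ 1.9×10¹⁶ operations, which is infeasible, whereas a single pass that counts the N*(N-1) ordered pairs within each row (as in the sketch at the end of Prune's answer) costs about M*N*(N-1) ≈ 7.8×10⁶ and builds essentially the same big_dictionary, except that pairs that never co-occur are simply absent rather than stored with a count of 0.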
I followed the defaultdict approach suggested by @Prune above. Here is the final code:
from collections import defaultdict
import numpy as np

def recommender(input_item, b_list, n):
    count = []
    top_items = []
    for x in b.keys():
        lst_2 = b[x]
        # rows shared by input_item and x = their co-occurrence count
        common_transactions = len(set(b_list) & set(lst_2))
        count.append(common_transactions)
    # take the n+1 largest counts in descending order, then drop the first
    # one, which is the input item's intersection with itself
    top_ids = list((np.argsort(count)[:-n-2:-1])[1:])
    top_values_counts = [count[i] for i in top_ids]
    key_list = list(b.keys())
    for i, v in enumerate(top_ids):
        item_id = key_list[v]
        top_items.append({item_id: top_values_counts[i]})
    print(top_items)
    return top_items
a = [
    ['a1', 'b2', 'c3'],
    ['c3', 'd4', 'a1'],
    ['b2', 'a1', 'e5'],
    ['d4', 'a1', 'b2'],
    ['c3', 'b2', 'a1']
]

# inverted index: item -> indices of the rows (transactions) it appears in
b = defaultdict(list)
for i, s in enumerate(a):
    for key in s:
        b[key].append(i)
input_item = str(input("Enter the item_id: "))
n = int(input("How many values to be retrieved? (eg: top 5, top 2, etc.): "))
top_items = recommender(input_item, b[input_item], n)
Here is the output for the top 3 for 'a1':
[{'b2': 4}, {'c3': 3}, {'d4': 2}]
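A small note on this design: because b is an inverted index from each item to the rows it occurs in, each co-occurrence count is a set intersection rather than a scan of all 1.3 million rows. One micro-optimization worth making (my suggestion, not part of the code above): build set(b_list) once before the loop instead of on every iteration:

input_rows = set(b_list)  # build once, outside the loop
for x in b.keys():
    count.append(len(input_rows & set(b[x])))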
Thanks!!!