How to parallelize a for loop inside a for loop? Python

Question

I am trying to parallelize this computation:

import math

def cosfunction(a,b):
     sumxx, sumxy, sumyy = 0, 0, 0
     for i in range(len(a)):
          x = a[i]
          y = b[i]
          sumxx += x*x
          sumyy += y*y
          sumxy += x*y
     return sumxy/math.sqrt(sumxx*sumyy)
     

def get_cosinesimilarity(vectrain, vectest):
     '''Calculates the cosine similarity for train and test'''
     x = vectrain 
     y = vectest 
     simlist = []
     for i in range(len(y)):
          sim = []
          listoftopten = [(0,0,0)] * 10
          for j in range(len(x)):
               cos = cosfunction(x[j],y[i])
               c = []
               for a in range(len(listoftopten)):
                    c.append(listoftopten[a][0])
               if cos > min(c):
                    listoftopten.remove(listoftopten[c.index(min(c))])
                    listoftopten.append((cos, x[j], y[i]))
          simlist.append(listoftopten)
     return simlist

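I have two lists: vectrain for the training data and vectest for the testing data. Both contain data in a format like [[0.012545, 0.58612, 0.7892],[0.4566, 0.4868, 0.789]], so basically vectors. In my get_cosinesimilarity function I want to calculate the cosine similarity of each test vector with each training vector, and then return, for each test vector, a list of 10 tuples (cos, i, j), where cos is the cosine similarity, i is the vector from the training set and j is the vector from the test set. This is what I append to listoftopten. The list of 10 tuples for each test vector is then appended to simlist, which ends up containing the top-ten lists of all test vectors. It is very important that my output has the simlist format I described.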

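However, since my vectest and vectrain lists are really long (up to 200,000 vectors), the process takes a very long time to complete if I don't parallelize it. I have never used multiprocessing in Python before. Can someone show me how to parallelize this?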

Thanks!

Answer 1

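The expensive operation here seems to be the code that follows the cosine similarity computation. You may want to use a heap data structure to get the top ten.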
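
To illustrate the heap idea on its own, here is a minimal sketch using heapq.nlargest from the standard library. The scores list and its (cos, train_index) layout are made up for the example; they are not taken from the question's data.

```python
import heapq

# hypothetical similarity scores for one test vector, as (cos, train_index) pairs
scores = [(0.12, 0), (0.98, 1), (0.45, 2), (0.77, 3), (0.30, 4)]

# nlargest returns the n best entries, largest first
top_three = heapq.nlargest(3, scores)
print(top_three)  # [(0.98, 1), (0.77, 3), (0.45, 2)]
```

This avoids rebuilding a list of scores and scanning for the minimum on every comparison, which is what the loop in the question does.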

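Here is an attempt to improve performance by parallelizing the cosine similarity computation (while keeping space complexity low). Reference: https://docs.python.org/3/library/multiprocessing.html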

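import math

def cosfunction(args):
    # pool.map passes each (a, b) tuple from args_list as one single argument
    a, b = args
    # cos function implementation, same computation as in the question
    sumxx, sumxy, sumyy = 0, 0, 0
    for x, y in zip(a, b):
        sumxx += x*x
        sumyy += y*y
        sumxy += x*y
    cos = sumxy/math.sqrt(sumxx*sumyy)
    return cos, a, b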

def insert_and_trim(heap, new_elements):
    # Please try this yourself.
    # Iterate through each element in the new_elements list and insert it into the heap.
    # Trim the heap to ensure it doesn't bloat up in size.
    # One method of doing the above is to keep a "MIN HEAP": insert the new_elements into the heap, then pop the minimum from the heap until it contains, say, 10 elements. The elements that remain are the largest ones seen so far.
    pass

def get_top_ten(heap):
    # Please try this yourself.
    # Since it is a min heap, consecutive pops return the elements in ascending order; reverse that (or sort the heap in descending order) to get the best match first.
    pass


from multiprocessing import Pool

def get_cosinesimilarity(vectrain, vectest):
    '''Calculates the cosine similarity for train and test'''
    x = vectrain
    y = vectest
    simlist = []
    BATCH_SIZE = 10 # set to any value of your choice. Capped by ulimit
    # Create the pool once and reuse it; starting a new pool for every batch is expensive
    with Pool(BATCH_SIZE) as pool:
        for i in range(len(y)):
            heap = [] # Create a heap by yourself; https://docs.python.org/3/library/heapq.html
            # listoftopten = [(0,0,0)] * 10
            args_list = []
            for j in range(len(x)):
                args_list.append((x[j], y[i]))
                if len(args_list) == BATCH_SIZE:
                    cos_list = pool.map(cosfunction, args_list)
                    insert_and_trim(heap, cos_list)
                    args_list = [] # start a new batch

            # In case of say 14 elements, process the 4 elements left over after the last full batch
            if len(args_list) > 0:
                cos_list = pool.map(cosfunction, args_list)
                insert_and_trim(heap, cos_list)
                # insert_and_trim(listoftopten, cos_list)
            simlist.append(get_top_ten(heap))
            # simlist.append(get_top_ten(listoftopten))
    return simlist
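
For reference, the two heap helpers left as an exercise above could be sketched with heapq roughly as follows. This is only one possible way to fill them in, not necessarily what was intended: TOP_N is a name introduced here, the heap is assumed to be a plain Python list, and the elements are assumed to be (cos, a, b) tuples whose vectors are plain lists (on equal cos values the tuple comparison falls back to comparing the vectors).

```python
import heapq

TOP_N = 10  # assumed size of the result list, as in the question

def insert_and_trim(heap, new_elements):
    # heap is a plain list managed by heapq, i.e. a min-heap:
    # heap[0] is always the smallest (cos, a, b) tuple kept so far.
    for element in new_elements:
        if len(heap) < TOP_N:
            heapq.heappush(heap, element)
        else:
            # push the new element, then drop the smallest one in a single step
            heapq.heappushpop(heap, element)

def get_top_ten(heap):
    # return a sorted copy so the best match comes first
    return sorted(heap, reverse=True)
```

Unlike the question's list, which starts with ten (0,0,0) placeholders, this version keeps whatever the ten largest values are, including negative similarities if there are fewer than ten positive ones.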

If you don't want to use a heap, you can use a naive implementation like the following:

def insert_and_trim(listoftopten, new_elements):
    # slightly modified code wrt question, following the cosine similarity computation
    for cos, x, y in new_elements:
        # rebuild the list of current scores for every new element,
        # so its indices always line up with listoftopten
        c = [entry[0] for entry in listoftopten]
        if cos > min(c):
            listoftopten.pop(c.index(min(c)))
            listoftopten.append((cos, x, y))

def get_top_ten(listoftopten):
    return listoftopten
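
As a possible alternative to batching the inner loop, one could also parallelize the outer loop, handing each test vector to a worker that scans all training vectors on its own. The sketch below is untested on data of the size mentioned in the question; the names cosine, top_ten_for_test_vector and get_cosinesimilarity_parallel are introduced here for illustration only.

```python
import heapq
import math
from functools import partial
from multiprocessing import Pool

def cosine(a, b):
    # same formula as cosfunction in the question
    sumxx = sum(v * v for v in a)
    sumyy = sum(v * v for v in b)
    sumxy = sum(p * q for p, q in zip(a, b))
    return sumxy / math.sqrt(sumxx * sumyy)

def top_ten_for_test_vector(test_vec, vectrain):
    # score one test vector against every training vector and keep the 10 best
    scored = ((cosine(train_vec, test_vec), train_vec, test_vec)
              for train_vec in vectrain)
    return heapq.nlargest(10, scored, key=lambda t: t[0])

def get_cosinesimilarity_parallel(vectrain, vectest, processes=None):
    # one task per test vector; Pool.map returns results in vectest order,
    # so the output has the same shape as simlist in the question
    worker = partial(top_ten_for_test_vector, vectrain=vectrain)
    with Pool(processes) as pool:
        return pool.map(worker, vectest)
```

On platforms that start worker processes by spawning (Windows, macOS), the call to get_cosinesimilarity_parallel needs to sit under an if __name__ == "__main__": guard. Note that vectrain is pickled and sent to the workers along with the tasks, which may be costly for very large training sets.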

python

parallel-processing

multiprocessing