结合使用多处理和请求时，是否有更好的方法来避免内存泄漏？

Question

我目前正在使用请求模块和多处理对单个目标进行网络抓取。

我正在使用 Pool 和多处理异步。

每个进程发送一系列连续的请求，在每个请求中我随机切换 header(User-agent) 和代理。

过了一会儿，我注意到电脑变慢了，所有脚本上的所有请求都失败了。

经过一番挖掘，我意识到问题不在于代理，而在于请求的内存泄漏。

我已经阅读了有关多处理内存泄漏的其他文章。

我的问题是，有没有更好的方法来避免这种情况而不是使用：if __ name __ == '__ main __': ？

（也许每次 tot 迭代或类似的事情都转储内存？）

下面是我的代码：

a = [[('ab.txt', 'ab', 'abo', 1), ('ac.txt', 'ac', 'aco', 3), ('acz.txt', 'acz', 'ac o', 5), ('ad.txt', 'ad', 'ado', 2), ('ae.txt', 'ae', 'aeo', 4)],[('ab.txt', 'ab', 'abo', 1), ('ac.txt', 'ac', 'aco', 3), ('acz.txt', 'acz', 'ac o', 5), ('ad.txt', 'ad', 'ado', 2), ('ae.txt', 'ae', 'aeo', 4)],[('ab.txt', 'ab', 'abo', 1), ('ac.txt', 'ac', 'aco', 3), ('acz.txt', 'acz', 'ac o', 5), ('ad.txt', 'ad', 'ado', 2), ('ae.txt', 'ae', 'aeo', 4)],[('ab.txt', 'ab', 'abo', 1), ('ac.txt', 'ac', 'aco', 3), ('acz.txt', 'acz', 'ac o', 5), ('ad.txt', 'ad', 'ado', 2), ('ae.txt', 'ae', 'aeo', 4)]]

def hydra_gecko(file_name, initial_letter, final_letter, process_number):
    # url and proxy details here
    response = requests.get(url, headers=header_switcher(), proxies={'http': proxy, 'https': proxy}, timeout=(1, 3))
    # parse html and gather data


for multi_arguments in a:
if __name__ == '__main__':
    with Pool(5) as p:
        print(p.starmap_async(hydra_gecko, multi_arguments))
        p.close()
        p.join()

有更好的方法吗？是否有代码可以在每次迭代时转储内存，或者比上面的代码更好的类似代码？谢谢

Answer 1

您正在为每个 multi_arguments 创建一个新池。那是一种资源浪费。如果总的工作进程数多于 CPU 的核心数，工作人员将争夺 CPU 资源甚至内存，从而减慢整个进程。

池的整个目的处理的项目比工作函数处理的项目多。

试试这个（使用单个池）：

a = [
  ('ab.txt', 'ab', 'abo', 1), ('ac.txt', 'ac', 'aco', 3),
  ('acz.txt', 'acz', 'ac o', 5), ('ad.txt', 'ad', 'ado', 2),
  ('ae.txt', 'ae', 'aeo', 4), ('ab.txt', 'ab', 'abo', 1),
  ('ac.txt', 'ac', 'aco', 3), ('acz.txt', 'acz', 'ac o', 5),
  ('ad.txt', 'ad', 'ado', 2), ('ae.txt', 'ae', 'aeo',4)
  ('ab.txt', 'ab', 'abo', 1), ('ac.txt', 'ac', 'aco', 3),
  ('acz.txt', 'acz', 'ac o', 5), ('ad.txt', 'ad', 'ado', 2),
  ('ae.txt', 'ae', 'aeo', 4), ('ab.txt', 'ab', 'abo', 1),
  ('ac.txt', 'ac', 'aco', 3), ('acz.txt', 'acz', 'ac o', 5),
  ('ad.txt', 'ad', 'ado', 2), ('ae.txt', 'ae', 'aeo', 4)
]

def hydra_gecko(item):
    file_name, initial_letter, final_letter, process_number = item
    # url and proxy details here
    response = requests.get(
      url, headers=header_switcher(),
      proxies={'http': proxy, 'https': proxy},
      timeout=(1, 3)
    )
    # parse html and gather data, return result.
    return response.status_code

if __name__ == '__main__':
# Do **not** choose a number of workers. The default usually works fine.
# If you are worried about memory leaks, set maxtasksperchild
# to refresh the worker process after a certain number of tasks.
with Pool(maxtasksperchild=4) as p:
    for result in p.imap_unordered(hydra_gecko, a):
        print(result)

结合使用多处理和请求时，是否有更好的方法来避免内存泄漏？

Is there a better way to avoid memory leaks when using multiprocessing and request combined?

python

memory-leaks

request

web-scraping

python-multiprocessing