How to use multithreading to download 1000+ .txt files in Python quickly

Question

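I have a .txt file containing a list of URLs for 1000+ .txt files, which I need to download and then index by word. The indexing is fast, but the downloading is a huge bottleneck. I tried both urllib2 and urllib.request, but with either library a single text file takes 0.25-0.5 seconds to download (the average file is around 600 words / 3000 characters).

I realize that I need to make use of multithreading (as a concept) at this point, but I don't know how I would go about it in Python. Right now I download the files one at a time, which looks like this: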

        import urllib.request

        with open('urls.txt', 'r') as f:                    # urls.txt is the .txt file of urls, one on each line
            for url in f:
                response = urllib.request.urlopen(url)
                data = response.read()
                text = data.decode()
                # .. and then index the text

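The project prompt let me choose any language. I chose Python because I thought it would be faster. The sample output I was given lists a total indexing time of about 1.5 seconds, so I assume that is roughly the benchmark they want applicants to hit. Is it even possible to achieve such a fast runtime in Python, on my machine (which has 4 cores), with this many files to download?

Edit (more information about the indexing):

Each .txt file I download contains a review of a college. In the end, I want to index all of the terms in every review, so that a search for a term returns a list of all the colleges whose reviews contain it, along with how many times the term is used in each college's review. I use nested dictionaries: the outer key is the search term and the outer value is a dictionary. The inner dictionary's key is the name of a college, and its value is the number of times the term appears in that college's review.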
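The nested-dictionary layout described above can be sketched roughly as follows. This is only an illustration: the helper `add_review`, the tokenizing regex, and the college names are assumptions, not the asker's actual code.

```python
from collections import Counter, defaultdict
import re

def add_review(index, college, text):
    """Add one review's word counts to the nested index, in place."""
    # Assumed tokenizer: lowercase, then keep runs of letters, digits, apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    for term, count in Counter(words).items():
        index[term][college] += count

# Outer key: search term. Inner key: college name. Inner value: occurrence count.
index = defaultdict(lambda: defaultdict(int))
add_review(index, "Example College", "Great campus, great professors.")
add_review(index, "Sample University", "The campus is small.")

print(dict(index["campus"]))  # {'Example College': 1, 'Sample University': 1}
print(dict(index["great"]))   # {'Example College': 2}
```

Looking up a term is then a single dictionary access, which fits the observation that indexing is not the bottleneck.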

Answer 1

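I don't know exactly what you do when indexing the text, whether data has to be exchanged between threads, or whether you write to a single file/variable (in which case a lock should be used), but this should work: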

import threading
import urllib.request

with open('urls.txt', 'r') as f:
    urls = f.read().splitlines()
    
def download_index(url, lock):
    response = urllib.request.urlopen(url)
    data = response.read()
    text = data.decode()
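    # indexing: process this file's text here, outside the lock
    with lock:
        # access shared resources (e.g. update the shared index)
        pass  # placeholder so the empty block is valid Python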

n = 2  # number of parallel connections
# split urls into consecutive chunks of at most n items; each chunk runs in parallel
chunks = [urls[i * n:(i + 1) * n] for i in range((len(urls) + n - 1) // n)]

lock = threading.Lock()
                
for chunk in chunks:
    threads = []
    for url in chunk:
        thread = threading.Thread(target=download_index, args=(url, lock,))
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()
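What goes inside the `with lock:` block depends on the indexing code, which the question does not show. As a rough sketch, assuming the shared resource is a nested dictionary of term counts, each thread could count words locally and hold the lock only while merging. The names `shared_index`, `index_lock`, `merge_counts`, and `index_text` are hypothetical, and the sample text stands in for a downloaded review.

```python
import threading
from collections import Counter

shared_index = {}             # term -> {college: count}, shared by all threads
index_lock = threading.Lock()

def merge_counts(college, counts):
    """Merge one review's word counts into shared_index under the lock."""
    with index_lock:
        for term, count in counts.items():
            per_college = shared_index.setdefault(term, {})
            per_college[college] = per_college.get(college, 0) + count

def index_text(college, text):
    # Counting touches only local data, so it runs outside the lock.
    counts = Counter(text.lower().split())
    merge_counts(college, counts)

threads = [
    threading.Thread(target=index_text, args=(f"College {i}", "good food good dorms"))
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared_index["good"])  # every one of the 4 colleges maps to 2
```

Keeping the locked section this short means threads spend most of their time downloading rather than waiting on each other.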

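Note that you should think about how many connections to open at once, because you will most likely get blocked if 1000+ requests go out simultaneously. I don't know the ideal number, but experiment with different values of n and see what works. Or use proxies.

Edit: added a lock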
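One alternative to the chunked loop, offered as a sketch rather than as part of the answer's method, is the standard library's `concurrent.futures.ThreadPoolExecutor`. It also caps the number of simultaneous connections, but it starts the next download as soon as any worker is free instead of waiting for a whole chunk to finish. The function names `fetch_text` and `download_all`, the 10-second timeout, and the default of 8 workers are assumptions to be tuned.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_text(url, timeout=10):
    """Download one URL and return its body decoded as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode()

def download_all(urls, fetch=fetch_text, max_workers=8):
    """Fetch every URL with at most max_workers concurrent requests.

    Returns (results, errors): url -> fetched value and url -> exception.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # record the failure and keep going
                errors[url] = exc
    return results, errors
```

With the `urls` list read from `urls.txt`, calling `download_all(urls, max_workers=8)` would return the downloaded texts keyed by URL plus any failures. Because the results are collected in the calling thread, no lock is needed in this variant.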
