How to use multiprocessing to loop through a big list of URLs?
Problem: check a list of 1000+ URLs and get each URL's return code (status_code).
My script works, but it is slow.
I figure there must be a better, more Pythonic (prettier) way to do this, where I spawn 10 or 20 threads to check the URLs and collect the responses.
(i.e.:
200 -> www.yahoo.com
404 -> www.badurl.com
...
Input file: Url10.txt
www.example.com
www.yahoo.com
www.testsite.com
.....
import requests

with open("url10.txt") as f:
    urls = f.read().splitlines()

print(urls)

for url in urls:
    url = 'http://' + url  # Add http:// to each url (there has to be a better way to do this)
    try:
        resp = requests.get(url, timeout=1)
        print(len(resp.content), '->', resp.status_code, '->', resp.url)
    except Exception as e:
        print("Error", url)
Challenge:
Make it faster with multiprocessing.
Multiprocessing
But it doesn't work.
I get the following error (note: I'm not sure whether I'm doing this correctly):
AttributeError: Can't get attribute 'checkurl' on <module '__main__' (built-in)>
--
import requests
from multiprocessing import Pool

with open("url10.txt") as f:
    urls = f.read().splitlines()

def checkurlconnection(url):
    for url in urls:
        url = 'http://' + url
        try:
            resp = requests.get(url, timeout=1)
            print(len(resp.content), '->', resp.status_code, '->', resp.url)
        except Exception as e:
            print("Error", url)

if __name__ == "__main__":
    p = Pool(processes=4)
    result = p.map(checkurlconnection, urls)
In the checkurlconnection function, the parameter must be urls instead of url. Otherwise, inside the for loop, urls refers to the global variable, which is not what you want.
import requests
from multiprocessing import Pool

with open("url10.txt") as f:
    urls = f.read().splitlines()

def checkurlconnection(urls):
    for url in urls:
        url = 'http://' + url
        try:
            resp = requests.get(url, timeout=1)
            print(len(resp.content), '->', resp.status_code, '->', resp.url)
        except Exception as e:
            print("Error", url)

if __name__ == "__main__":
    p = Pool(processes=4)
    result = p.map(checkurlconnection, urls)
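A side note on the code above (my observation, not part of the answer): Pool.map passes one element of urls to each call, so with this signature the inner for loop walks over the characters of a single URL string. An alternative sketch is to let the worker handle exactly one URL and drop the inner loop; check_one_url is a name I am introducing here, not something from the question:

import requests
from multiprocessing import Pool

def check_one_url(url):
    # Pool.map hands this function one URL string at a time
    url = 'http://' + url
    try:
        resp = requests.get(url, timeout=1)
        print(len(resp.content), '->', resp.status_code, '->', resp.url)
    except Exception as e:
        print("Error", url)

if __name__ == "__main__":
    with open("url10.txt") as f:
        urls = f.read().splitlines()
    with Pool(processes=4) as p:
        p.map(check_one_url, urls)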
In this case your task is I/O-bound, not processor-bound: it takes far longer for a website to reply than it takes your CPU to run through the script once (excluding the TCP request). This means you won't get any speedup from running the task in parallel processes (which is what multiprocessing does). What you want is multithreading. The way to achieve this is with the little-documented, arguably badly named multiprocessing.dummy:
import requests
from multiprocessing.dummy import Pool as ThreadPool

urls = ['https://www.python.org',
        'https://www.python.org/about/']

def get_status(url):
    r = requests.get(url)
    return r.status_code

if __name__ == "__main__":
    pool = ThreadPool(4)                  # Make the Pool of workers
    results = pool.map(get_status, urls)  # Open the urls in their own threads
    pool.close()                          # Close the pool and wait for the work to finish
    pool.join()
See here for an example of multiprocessing vs. multithreading in Python.
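To tie this back to the original task, here is a rough sketch that applies the same thread-pool idea to the url10.txt input, keeping the question's one-second timeout and error handling. The worker name check_url and the pool size of 20 are my choices, not taken from either answer:

import requests
from multiprocessing.dummy import Pool as ThreadPool

def check_url(url):
    url = 'http://' + url
    try:
        resp = requests.get(url, timeout=1)
        return resp.status_code, resp.url
    except Exception:
        return 'Error', url

if __name__ == "__main__":
    with open("url10.txt") as f:
        urls = f.read().splitlines()
    pool = ThreadPool(20)                # 10-20 threads, as the question suggests
    results = pool.map(check_url, urls)  # Each URL is checked in its own thread
    pool.close()
    pool.join()
    for status, url in results:
        print(status, '->', url)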