How to use multiprocessing to loop through a big list of URLs?
Problem: check a list of 1000+ URLs and get each URL's return code (status_code).
My script works, but it is slow.
I figure there must be a better, more Pythonic (prettier) way to do this, where I spawn 10 or 20 threads to check the URLs and collect the responses.
(i.e.:
200 -> www.yahoo.com
404 -> www.badurl.com
...
Input file: Url10.txt
www.example.com
www.yahoo.com
www.testsite.com
.....
import requests

with open("url10.txt") as f:
    urls = f.read().splitlines()

print(urls)

for url in urls:
    url = 'http://' + url  # Add http:// to each url (there has to be a better way to do this)
    try:
        resp = requests.get(url, timeout=1)
        print(len(resp.content), '->', resp.status_code, '->', resp.url)
    except Exception as e:
        print("Error", url)
Challenge:
Make it faster with multiprocessing.
Multiprocessing
But it doesn't work.
I get the following error (note: I'm not sure whether I'm doing this correctly):
AttributeError: Can't get attribute 'checkurl' on <module '__main__' (built-in)>
--
import requests
from multiprocessing import Pool

with open("url10.txt") as f:
    urls = f.read().splitlines()

def checkurlconnection(url):
    for url in urls:
        url = 'http://' + url
        try:
            resp = requests.get(url, timeout=1)
            print(len(resp.content), '->', resp.status_code, '->', resp.url)
        except Exception as e:
            print("Error", url)

if __name__ == "__main__":
    p = Pool(processes=4)
    result = p.map(checkurlconnection, urls)
In the checkurlconnection function, the parameter must be urls instead of url. Otherwise, inside the for loop, urls refers to the global variable, which is not what you want.
import requests
from multiprocessing import Pool

with open("url10.txt") as f:
    urls = f.read().splitlines()

def checkurlconnection(urls):
    for url in urls:
        url = 'http://' + url
        try:
            resp = requests.get(url, timeout=1)
            print(len(resp.content), '->', resp.status_code, '->', resp.url)
        except Exception as e:
            print("Error", url)

if __name__ == "__main__":
    p = Pool(processes=4)
    result = p.map(checkurlconnection, urls)
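A side note on the code above (my observation, not part of the answer): Pool.map passes one element of urls to each call, so with this signature the inner for loop walks over the characters of a single URL string. An alternative sketch is to let the worker handle exactly one URL and drop the inner loop; check_one_url is a name I am introducing here, not something from the question:

import requests
from multiprocessing import Pool

def check_one_url(url):
    # Pool.map hands this function one URL string at a time
    url = 'http://' + url
    try:
        resp = requests.get(url, timeout=1)
        print(len(resp.content), '->', resp.status_code, '->', resp.url)
    except Exception as e:
        print("Error", url)

if __name__ == "__main__":
    with open("url10.txt") as f:
        urls = f.read().splitlines()
    with Pool(processes=4) as p:
        p.map(check_one_url, urls)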
In this case your task is I/O-bound, not processor-bound: it takes far longer for a website to reply than it takes your CPU to run through the script once (excluding the TCP request). This means you won't get any speedup from running the task in parallel processes (which is what multiprocessing does). What you want is multithreading. The way to achieve this is with the little-documented, arguably badly named multiprocessing.dummy:
import requests
from multiprocessing.dummy import Pool as ThreadPool

urls = ['https://www.python.org',
        'https://www.python.org/about/']

def get_status(url):
    r = requests.get(url)
    return r.status_code

if __name__ == "__main__":
    pool = ThreadPool(4)                  # Make the Pool of workers
    results = pool.map(get_status, urls)  # Open the urls in their own threads
    pool.close()                          # Close the pool and wait for the work to finish
    pool.join()
See here for an example of multiprocessing vs. multithreading in Python.
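To tie this back to the original task, here is a rough sketch that applies the same thread-pool idea to the url10.txt input, keeping the question's one-second timeout and error handling. The worker name check_url and the pool size of 20 are my choices, not taken from either answer:

import requests
from multiprocessing.dummy import Pool as ThreadPool

def check_url(url):
    url = 'http://' + url
    try:
        resp = requests.get(url, timeout=1)
        return resp.status_code, resp.url
    except Exception:
        return 'Error', url

if __name__ == "__main__":
    with open("url10.txt") as f:
        urls = f.read().splitlines()
    pool = ThreadPool(20)                # 10-20 threads, as the question suggests
    results = pool.map(check_url, urls)  # Each URL is checked in its own thread
    pool.close()
    pool.join()
    for status, url in results:
        print(status, '->', url)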