Can't figure out the right way to use rotation of proxies within a script to speed up the execution process
I've created a script in Python that implements the rotation of proxies to fetch proper responses from some links. The function get_proxy_list() produces proxies from a source. However, I've hard-coded 5 proxies within that function for brevity.
Now, you can see there are two more functions, validate_proxies() and fetch_response(). The function validate_proxies() filters out working proxies from the crude list of proxies generated by get_proxy_list().
Finally, the function fetch_response() uses those working proxies to fetch proper responses from my list of URLs.
I don't know whether the function validate_proxies() is of any use at all, because I could use those crude proxies directly within fetch_response(). Moreover, most free proxies are short-lived, so by the time I try to filter out the crude ones, the working proxies are already dead. However, the script runs very slowly even when it does find and use working proxies.
I've tried with:
import random
import requests
from bs4 import BeautifulSoup

validation_link = 'https://icanhazip.com/'

target_links = [
    'https://whosebug.com/questions/tagged/web-scraping',
    'https://whosebug.com/questions/tagged/vba',
    'https://whosebug.com/questions/tagged/java'
]

working_proxies = []

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}

def get_proxy_list():
    proxy_list = ['198.24.171.26:8001','187.130.139.197:8080','159.197.128.8:3128','119.28.56.116:808','85.15.152.39:3128']
    return proxy_list

def validate_proxies(proxies,link):
    proxy_url = proxies.pop(random.randrange(len(proxies)))
    while True:
        proxy = {'https': f'http://{proxy_url}'}
        try:
            res = requests.get(link,proxies=proxy,headers=headers,timeout=5)
            assert res.status_code==200
            working_proxies.append(proxy_url)
            if not proxies: break
            proxy_url = proxies.pop(random.randrange(len(proxies)))
        except Exception as e:
            print("error raised as:",str(e))
            if not proxies: break
            proxy_url = proxies.pop(random.randrange(len(proxies)))

    return working_proxies

def fetch_response(proxies,url):
    proxy_url = proxies.pop(random.randrange(len(proxies)))
    while True:
        proxy = {'https': f'http://{proxy_url}'}
        try:
            resp = requests.get(url, proxies=proxy, headers=headers, timeout=7)
            assert resp.status_code==200
            return resp
        except Exception as e:
            print("error thrown as:",str(e))
            if not proxies: return
            proxy_url = proxies.pop(random.randrange(len(proxies)))

if __name__ == '__main__':
    proxies = get_proxy_list()
    working_proxy_list = validate_proxies(proxies,validation_link)
    print("working proxy list:",working_proxy_list)

    for target_link in target_links:
        print(fetch_response(working_proxy_list,target_link))
Question: what is the right way to use rotation of proxies within a script in order to make the execution faster?
I've made a few changes to your code which will hopefully help you:
- Since you mention that the proxies are short-lived, the code now fetches fresh proxies and checks whether they work for every request.
- Checking whether the proxies work is now done in parallel using concurrent.futures.ThreadPoolExecutor. This means that instead of waiting up to 5 seconds for each individual proxy check to time out, you wait at most 5 seconds in total for all of them to time out.
- Instead of choosing proxies at random, the first proxy found to be working is used.
- Type hints have been added.
import itertools as it
from concurrent.futures import ThreadPoolExecutor, TimeoutError
from typing import Dict

from bs4 import BeautifulSoup
import requests

Proxy = Dict[str, str]

executor = ThreadPoolExecutor()

validation_link = 'https://icanhazip.com/'

target_links = [
    'https://whosebug.com/questions/tagged/web-scraping',
    'https://whosebug.com/questions/tagged/vba',
    'https://whosebug.com/questions/tagged/java'
]

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}

def get_proxy_list():
    # Scrape a fresh list of elite HTTPS proxies from sslproxies.org.
    response = requests.get('https://www.sslproxies.org/')
    soup = BeautifulSoup(response.text, "html.parser")
    proxies = [
        ':'.join([item.select_one('td').text, item.select_one('td:nth-of-type(2)').text])
        for item in soup.select('table.table tr')
        if ('yes' in item.text and 'elite proxy' in item.text)
    ]
    return [{'https': f'http://{x}'} for x in proxies]

def validate_proxy(proxy: Proxy) -> Proxy:
    res = requests.get(validation_link, proxies=proxy, headers=headers, timeout=5)
    assert 200 == res.status_code
    return proxy

def get_working_proxy() -> Proxy:
    # Validate all proxies in parallel and return the first one that responds;
    # poll the futures round-robin so a single slow proxy never blocks the rest.
    futures = [executor.submit(validate_proxy, x) for x in get_proxy_list()]
    for i in it.count():
        future = futures[i % len(futures)]
        try:
            working_proxy = future.result(timeout=0.01)
            for f in futures:
                f.cancel()
            return working_proxy
        except TimeoutError:
            continue
        except Exception:
            futures.remove(future)
            if not len(futures):
                raise Exception('No working proxies found') from None

def fetch_response(url: str) -> requests.Response:
    res = requests.get(url, proxies=get_working_proxy(), headers=headers, timeout=7)
    assert res.status_code == 200
    return res
Usage:
>>> get_working_proxy()
{'https': 'http://119.81.189.194:80'}
>>> get_working_proxy()
{'https': 'http://198.50.163.192:3129'}
>>> get_working_proxy()
{'https': 'http://191.241.145.22:6666'}
>>> get_working_proxy()
{'https': 'http://169.57.1.84:8123'}
>>> get_working_proxy()
{'https': 'http://182.253.171.31:8080'}
In each case, one of the proxies with the lowest latency is returned.
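For reference, the revised fetch_response() can be wired into the same kind of __main__ loop as in your original script (just a sketch, reusing the target_links defined above):

if __name__ == '__main__':
    for target_link in target_links:
        try:
            res = fetch_response(target_link)
            print(target_link, res.status_code)
        except Exception as e:
            print("failed to fetch", target_link, "->", e)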
If you want to make the code even more efficient, and you are almost certain that a working proxy will keep working for a short period of time (say, 30 seconds), then you can improve on this by putting the proxies into a TTL cache and re-populating it as necessary, rather than finding a working proxy every time fetch_response is called. For how to implement a TTL cache in Python, see .
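As a rough, minimal sketch of that idea (the 30-second TTL, the module-level cache variables, and the helper names get_cached_proxy / fetch_response_cached below are my own additions, not part of the code above):

import time
from typing import Optional

PROXY_TTL = 30  # assumed lifetime (in seconds) of a proxy that was just verified to work

_cached_proxy: Optional[Proxy] = None
_cached_at: float = 0.0

def get_cached_proxy() -> Proxy:
    # Reuse the last working proxy while it is younger than PROXY_TTL;
    # otherwise fall back to a fresh lookup via get_working_proxy().
    global _cached_proxy, _cached_at
    now = time.monotonic()
    if _cached_proxy is None or now - _cached_at > PROXY_TTL:
        _cached_proxy = get_working_proxy()
        _cached_at = now
    return _cached_proxy

def fetch_response_cached(url: str) -> requests.Response:
    # Same as fetch_response(), but reuses the cached proxy instead of
    # validating a new one for every single request.
    res = requests.get(url, proxies=get_cached_proxy(), headers=headers, timeout=7)
    assert res.status_code == 200
    return res

If you prefer not to manage the timestamps yourself, a library such as cachetools provides a ready-made TTLCache you could use instead.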