Rotating Proxy (STORM, SMART) not giving unique IP in each scrapy request
How can I make sure I get a fresh IP on every Scrapy request? I have tried both stormproxies and smartproxies, but the IP they serve stays the same for the whole session.
The IP does change between runs; within a single run, however, every request comes from the same IP.
My code is as follows:
import json
import uuid

import scrapy
from scrapy.crawler import CrawlerProcess


class IpTest(scrapy.Spider):
    name = 'IP_test'
    previous_ip = ''
    count = 1
    ip_url = 'https://ifconfig.me/all.json'

    def start_requests(self):
        yield scrapy.Request(
            self.ip_url,
            dont_filter=True,
            meta={
                'cookiejar': uuid.uuid4().hex,
                'proxy': MY_ROTATING_PROXY  # either stormproxy or smartproxy
            }
        )

    def parse(self, response):
        ip_address = json.loads(response.text)['ip_addr']
        self.logger.info(f"IP: {ip_address}")
        if self.count < 10:
            self.count += 1
            yield from self.start_requests()


settings = {
    'DOWNLOAD_DELAY': 1,
    'CONCURRENT_REQUESTS': 1,
}

process = CrawlerProcess(settings)
process.crawl(IpTest)
process.start()
Output log:
2020-12-27 21:15:52 [scrapy.core.engine] INFO: Spider opened
2020-12-27 21:15:52 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-12-27 21:15:52 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-12-27 21:15:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: None)
2020-12-27 21:15:55 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:15:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:15:56 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:15:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:15:57 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:15:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:15:59 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:16:00 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:16:00 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:16:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:16:01 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:16:03 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:16:03 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:16:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:16:04 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:16:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:16:06 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:16:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ifconfig.me/all.json> (referer: https://ifconfig.me/all.json)
2020-12-27 21:16:07 [IP_test] INFO: IP: 190.239.69.94
2020-12-27 21:16:07 [scrapy.core.engine] INFO: Closing spider (finished)
What am I doing wrong here?
I even tried disabling cookies (COOKIES_ENABLED = False) and removing cookiejar from request.meta, but no luck.
It was hard, but I found the answer. For Storm you need to pass 'Connection': 'close' in the request headers. In that case you get a new proxy for each request. For example:

HEADERS = {'Connection': 'close'}
yield Request(url=url, callback=self.parse, body=body, headers=HEADERS)

With this, Storm closes the connection after each request and serves a new IP for every request.
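The idea above can be wired into the spider from the question as a small helper. This is a sketch, not a definitive implementation: MY_ROTATING_PROXY is a placeholder for your real Storm/Smart endpoint, and fresh_ip_request_kwargs is a hypothetical helper name. The helper only builds the keyword arguments for scrapy.Request, so its behavior can be checked without a network call:

```python
# Sketch: build per-request kwargs that force 'Connection: close' so a
# rotating proxy such as Storm cannot reuse a keep-alive session and
# (per the answer above) assigns a fresh exit IP to each request.
# MY_ROTATING_PROXY is a placeholder -- substitute your real endpoint.
MY_ROTATING_PROXY = 'http://user:pass@rotating.example:3128'

HEADERS = {'Connection': 'close'}


def fresh_ip_request_kwargs(url, proxy=MY_ROTATING_PROXY):
    """Return kwargs for scrapy.Request(**kwargs): each request carries
    'Connection: close', so the proxy tears down the upstream
    connection instead of pinning the spider to one exit IP."""
    return {
        'url': url,
        'dont_filter': True,       # allow repeated hits to the same URL
        'headers': dict(HEADERS),  # copy so callers cannot mutate HEADERS
        'meta': {'proxy': proxy},
    }


# Inside the spider's start_requests this would be used as:
#     yield scrapy.Request(**fresh_ip_request_kwargs(self.ip_url))
kwargs = fresh_ip_request_kwargs('https://ifconfig.me/all.json')
print(kwargs['headers'])
```

Keeping the header in one place means every request the spider builds gets the connection-closing behavior, instead of relying on each call site to remember it.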