Cookies / data-handling redirect causes Scrapy to scrape the wrong website
I have a problem with a very simple custom spider, but I can't figure it out.
When trying to scrape a page on Yahoo Finance, Scrapy gets redirected to a consent.yahoo.com page.
The spider looks like this:
import scrapy


class CompanyDetailsSpider(scrapy.Spider):
    name = 'company_details'
    allowed_domains = ['finance.yahoo.com']
    start_urls = ['https://finance.yahoo.com/screener/predefined/ms_technology']

    def parse(self, response):
        company_names_list = response.xpath(
            '//*[@id="scr-res-table"]/div[1]/table/tbody/tr/td[2]/text()').extract()
        company_price_list = response.xpath(
            '//*[@id="scr-res-table"]/div[1]/table/tbody/tr/td[3]/span/text()').extract()
        count = len(company_names_list)
        for i in range(0, count):
            print(company_names_list[i], company_price_list[i])
This code is taken from a Scrapy course, where it does work. The problem appears when I try to run it; it tells me:
2022-02-01 15:29:08 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (307) to <GET https://guce.yahoo.com/consent?brandType=nonEu&gcrumb=TEYoGM4&done=https%3A%2F%2Ffinance.yahoo.com%2Fscreener%2Fpredefined%2Fms_technology> from <GET https://finance.yahoo.com/screener/predefined/ms_technology>
2022-02-01 15:29:08 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://consent.yahoo.com/v2/collectConsent?sessionId=3_cc-session_4eb5a247-c8c1-47f7-b860-1b593d8ad1ef> from <GET https://guce.yahoo.com/consent?brandType=nonEu&gcrumb=TEYoGM4&done=https%3A%2F%2Ffinance.yahoo.com%2Fscreener%2Fpredefined%2Fms_technology>
2022-02-01 15:29:08 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://consent.yahoo.com/v2/collectConsent?sessionId=3_cc-session_4eb5a247-c8c1-47f7-b860-1b593d8ad1ef> (referer: None)
When I simply open the page in scrapy shell to look at the response, it shows that the request is redirected to a (cookies?) consent page.
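For reference, this is roughly what checking the final URL in scrapy shell looks like (the sessionId below is a placeholder, not a real value):

scrapy shell 'https://finance.yahoo.com/screener/predefined/ms_technology'
>>> response.url
'https://consent.yahoo.com/v2/collectConsent?sessionId=...'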
I couldn't find a solution to this anywhere, since I couldn't find anyone reporting the same problem. However, other cookie-related questions say that cookies should be enabled, which I have done, and ROBOTSTXT_OBEY is set to False. My settings look like this:
BOT_NAME = 'SimpleSpider'
SPIDER_MODULES = ['SimpleSpider.spiders']
NEWSPIDER_MODULE = 'SimpleSpider.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'SimpleSpider (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = True
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'SimpleSpider.middlewares.SimplespiderSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'SimpleSpider.middlewares.SimplespiderDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'SimpleSpider.pipelines.SimplespiderPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
I hope someone can help solve this problem!
The problem is that you need to include the cookies in start_requests; after that it is just a question of how you index the values. It is also better to yield the data the Scrapy way than to print it, and you don't need the span in your XPath for the price.
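Note that cookie values like the ones below come from an existing browser session (for example, copied out of your browser's developer tools after accepting the consent page); your own values will differ and will eventually expire.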
Here is a working solution:
import scrapy

cookies = {
    'B': '7t389hlgv4sqv&b=3&s=gb',
    'GUCS': 'AU8-5cgT',
    'EuConsent': 'CPTv0BMPTv0BMAOACBENB-CoAP_AAH_AACiQIJNe_X__bX9n-_59__t0eY1f9_r3v-QzjhfNt-8F2L_W_L0H_2E7NB36pq4KuR4ku3bBIQFtHMnUTUmxaolVrzHsak2MpyNKJ7LkmnsZe2dYGHtPn9lD-YKZ7_7___f73z___9_-39z3_9f___d9_-__-vjfV_993________9nd____BBIAkw1LyALsSxwJNo0qhRAjCsJCoBQAUUAwtEVgAwOCnZWAT6ghYAITUBGBECDEFGDAIAAAIAkIiAkALBAIgCIBAACAFCAhAARMAgsALAwCAAUA0LEAKAAQJCDI4KjlMCAiRaKCWysQSgr2NMIAyywAoFEZFQgIlCCBYGQkLBzHAEgJYAYaADAAEEEhEAGAAIIJCoAMAAQQSA',
    'A1': 'd=AQABBF9z8mECELBiwNCF9soE8MMAyI0JjX4FEgABBgHX-mHJYvbPb2UB9iMAAAcIX3PyYY0JjX4&S=AQAAAjnkhOf_LxrMMNCN1-BYfEY',
    'A3': 'd=AQABBF9z8mECELBiwNCF9soE8MMAyI0JjX4FEgABBgHX-mHJYvbPb2UB9iMAAAcIX3PyYY0JjX4&S=AQAAAjnkhOf_LxrMMNCN1-BYfEY',
    'A1S': 'd=AQABBF9z8mECELBiwNCF9soE8MMAyI0JjX4FEgABBgHX-mHJYvbPb2UB9iMAAAcIX3PyYY0JjX4&S=AQAAAjnkhOf_LxrMMNCN1-BYfEY&j=GDPR',
    'GUC': 'AQABBgFh-tdiyUIdFwSP',
    'cmp': 'v=22&t=1643742832&j=1',
}


class CompanyDetailsSpider(scrapy.Spider):
    name = 'company_details'
    allowed_domains = ['finance.yahoo.com']
    start_urls = ['https://finance.yahoo.com/screener/predefined/ms_technology']

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                cookies=cookies,
                callback=self.parse
            )

    def parse(self, response):
        company_names_list = response.xpath(
            '//*[@id="scr-res-table"]/div[1]/table/tbody/tr/td[2]/text()').extract()
        company_price_list = response.xpath(
            './/*[@id="scr-res-table"]/div[1]/table/tbody/tr/td[3]//text()').extract()
        yield {
            'company_names_list': company_names_list,
            'company_price_list': company_price_list
        }
Output:
{'company_names_list': ['Apple Inc.', 'Microsoft Corporation', 'Taiwan Semiconductor Manufacturing Company Limited', 'NVIDIA Corporation', 'ASML Holding N.V.', 'Adobe Inc.', 'Broadcom Inc.', 'Cisco Systems, Inc.', 'salesforce.com, inc.', 'Accenture plc', 'Oracle Corporation', 'Intel Corporation', 'QUALCOMM Incorporated', 'Texas Instruments Incorporated', 'Intuit Inc.', 'SAP SE', 'Sony Group Corporation', 'Advanced Micro Devices, Inc.', 'Applied Materials, Inc.', 'Shopify Inc.', 'International Business Machines Corporation', 'ServiceNow, Inc.', 'Infosys Limited', 'Micron Technology, Inc.', 'Snowflake Inc.'], 'company_price_list': ['172.95', '305.85', '121.92', '241.58', '675.08', '532.16', '585.24', '55.24', '230.07', '350.62', '81.11', '48.63', '175.59', '179.27', '557.11', '127.09', '111.82', '114.18', '137.63', '958.72', '134.34', '582.21', '23.36', '80.83', '282.22']}
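If you want one item per company instead of two parallel lists, a minimal variant of the parse method (assuming the two XPaths stay aligned row for row; the field names here are just illustrative) pairs the values with zip:

    def parse(self, response):
        names = response.xpath(
            '//*[@id="scr-res-table"]/div[1]/table/tbody/tr/td[2]/text()').extract()
        prices = response.xpath(
            './/*[@id="scr-res-table"]/div[1]/table/tbody/tr/td[3]//text()').extract()
        # zip() pairs the n-th name with the n-th price and stops at the shorter list
        for name, price in zip(names, prices):
            yield {'company_name': name, 'company_price': price}

You can then export the yielded items with, for example, scrapy crawl company_details -o companies.json.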