Scrapy FormRequest doesn't return any data
I'm submitting a form request to a website. The request succeeds, but no data is returned.
Log:
2020-09-05 22:37:57 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://safer.fmcsa.dot.gov/query.asp> (referer: https://safer.fmcsa.dot.gov/)
2020-09-05 22:37:57 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://safer.fmcsa.dot.gov/query.asp> (referer: https://safer.fmcsa.dot.gov/)
2020-09-05 22:37:59 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://safer.fmcsa.dot.gov/query.asp> (referer: https://safer.fmcsa.dot.gov/)
2020-09-05 22:37:59 [scrapy.core.engine] INFO: Closing spider (finished)
2020-09-05 22:37:59 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
My code:
# -*- coding: utf-8 -*-
import scrapy

codes = open('codes.txt').read().split('\n')


class MainSpider(scrapy.Spider):
    name = 'main'
    form_url = 'https://safer.fmcsa.dot.gov/query.asp'
    start_urls = ['https://safer.fmcsa.dot.gov/CompanySnapshot.aspx']

    def parse(self, response):
        for code in codes:
            data = {
                'searchtype': 'ANY',
                'query_type': 'queryCarrierSnapshot',
                'query_param': 'USDOT',
                'query_string': code,
            }
            yield scrapy.FormRequest(url=self.form_url, formdata=data, callback=self.parse_form)

    def parse_form(self, response):
        cargo = response.xpath('(//table[@summary="Cargo Carried"]/tbody/tr)[2]')
        for each in cargo:
            each_x = each.xpath('.//td[contains(text(), "X")]/following-sibling::td/font/text()').get()
            yield {
                "X Values": each_x if each_x else "N/A",
            }
Here are some of the sample codes I use for the POST request:
2146709
273286
120670
2036998
690147
I believe you just need to remove tbody from the XPath here:
cargo = response.xpath('(//table[@summary="Cargo Carried"]/tbody/tr)[2]')
Use it like this instead:
cargo = response.xpath('//table[@summary="Cargo Carried"]/tr[2]')
# I also removed the () inside the path because you don't need it, but that didn't cause the problem.
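This is because Scrapy parses the raw source of the page, whereas your browser may render a tbody element even when it is not present in the source. More information here.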
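To illustrate the point, here is a minimal sketch that runs the two XPath variants against a table with no tbody in its source. The markup is a hypothetical, simplified stand-in for the real page, and it uses the standard library's xml.etree.ElementTree instead of Scrapy's selectors so that it runs without any setup. ElementTree supports only a subset of XPath, so the expressions are adapted slightly.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the raw page source.
# Note that the table contains no <tbody>, which is the situation the
# answer describes.
RAW_HTML = """
<html><body>
<table summary="Cargo Carried">
  <tr><td>header row</td></tr>
  <tr><td>X</td><td><font>General Freight</font></td></tr>
</table>
</body></html>
""".strip()

root = ET.fromstring(RAW_HTML)

# Path that expects a <tbody>, as a browser's DOM inspector might show it.
with_tbody = root.findall(".//table[@summary='Cargo Carried']/tbody/tr")

# Path without <tbody>, matching the raw source; tr[2] is the second row.
without_tbody = root.findall(".//table[@summary='Cargo Carried']/tr[2]")

print(len(with_tbody))     # 0: nothing matches, so a loop over it yields nothing
print(len(without_tbody))  # 1: the second row is found
print(without_tbody[0].find("td/font").text)  # General Freight
```

An empty selection does not raise an error, which would explain why the spider finishes with 200 responses and no items. Running scrapy shell against the page and checking response.text for tbody is one way to confirm what the raw source actually contains.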