Scrapy FormRequest doesn't return any data

I am sending a form request to a website. The request succeeds, but no data is returned.

Log:

2020-09-05 22:37:57 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://safer.fmcsa.dot.gov/query.asp> (referer: https://safer.fmcsa.dot.gov/)
2020-09-05 22:37:57 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://safer.fmcsa.dot.gov/query.asp> (referer: https://safer.fmcsa.dot.gov/)
2020-09-05 22:37:59 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://safer.fmcsa.dot.gov/query.asp> (referer: https://safer.fmcsa.dot.gov/)
2020-09-05 22:37:59 [scrapy.core.engine] INFO: Closing spider (finished)
2020-09-05 22:37:59 [scrapy.statscollectors] INFO: Dumping Scrapy stats:

My code:

# -*- coding: utf-8 -*-
import scrapy

# Read one USDOT number per line, skipping blank lines
with open('codes.txt') as f:
    codes = [line.strip() for line in f if line.strip()]

class MainSpider(scrapy.Spider):
    name = 'main'
    form_url = 'https://safer.fmcsa.dot.gov/query.asp'
    start_urls = ['https://safer.fmcsa.dot.gov/CompanySnapshot.aspx']

    def parse(self, response):

        for code in codes:
        
            data = {
                'searchtype': 'ANY',
                'query_type': 'queryCarrierSnapshot',
                'query_param': 'USDOT',
                'query_string': code,
            }

            yield scrapy.FormRequest(url=self.form_url, formdata=data, callback=self.parse_form)

    def parse_form(self, response):
        cargo = response.xpath('(//table[@summary="Cargo Carried"]/tbody/tr)[2]')
        for each in cargo:
            each_x = each.xpath('.//td[contains(text(), "X")]/following-sibling::td/font/text()').get()

            yield {
                "X Values": each_x if each_x else "N/A",
            }

Here are some of the sample codes I use for the POST request.

2146709

273286

120670

2036998

690147

I believe you only need to remove tbody from the XPath here:

    cargo = response.xpath('(//table[@summary="Cargo Carried"]/tbody/tr)[2]')

Use it like this:

    cargo = response.xpath('//table[@summary="Cargo Carried"]/tr[2]')
    # I also removed the parentheses because they are not needed here, but they did not cause the problem.

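This is because Scrapy parses the raw source of the page, whereas your browser may add a tbody element when rendering a table even if it is not present in the source. An XPath copied from the browser's developer tools can therefore contain a tbody step that matches nothing in the response Scrapy receives.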
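
To see the effect in isolation, here is a minimal sketch. It uses Python's standard-library ElementTree as a stand-in for Scrapy's selectors (which support full XPath, unlike ElementTree's limited subset), and the HTML snippet is a hypothetical, simplified table, not the real page. It also shows an alternative that should work whether or not tbody is present: selecting rows with the descendant axis (`//tr`).

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the raw page source.
# Note that the table rows sit directly under <table>, with no <tbody>.
RAW_HTML = """
<html><body>
  <table summary="Cargo Carried">
    <tr><td>header</td></tr>
    <tr><td>X</td><td>General Freight</td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(RAW_HTML.strip())

# Path as copied from a browser: it expects a <tbody> that the source lacks.
with_tbody = root.findall(".//table[@summary='Cargo Carried']/tbody/tr")

# Same path with the tbody step removed.
without_tbody = root.findall(".//table[@summary='Cargo Carried']/tr")

# Descendant axis: matches rows at any depth, with or without <tbody>.
any_depth = root.findall(".//table[@summary='Cargo Carried']//tr")

# Text of the cells in the second row, selected without tbody.
second_row = root.findall(".//table[@summary='Cargo Carried']/tr[2]")[0]
second_row_cells = [td.text for td in second_row.findall("td")]

print(len(with_tbody), len(without_tbody), len(any_depth))
print(second_row_cells)
```

In the spider, the equivalent of the descendant-axis variant would be `response.xpath('//table[@summary="Cargo Carried"]//tr')`. Keep in mind that it would also match rows of any nested tables, so the direct `/tr` form is the stricter choice when the structure is known.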