Extract data from a table including files using Scrapy

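I am trying to extract data from the table that lists the active bids at https://purchasing.alabama.gov/active-statewide-contracts/. I am new to Scrapy and a bit confused about why I get no output. Also, how can I download the files found in the table? So far I have the following code: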

import scrapy

class AlabamaSpider(scrapy.Spider):

    name = 'alabama'
    allowed_domains = ['purchasing.alabama.gov']
    start_urls = ['https://purchasing.alabama.gov/active-statewide-contracts/']

    def start_requests(self):
        urls = ['https://purchasing.alabama.gov/active-statewide-contracts/']

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for row in response.xpath('//*[@class="table table-bordered table-responsive-sm dataTable no-footer"]//tbody//tr'):

            yield {
                'Description': row.xpath('td[@class="col-sm-5 sorting_asc"]//text()').extract_first(),
                'T-NBR': row.xpath('td[@class="col-sm-1 sorting"]/a/text()').extract_first(),
                'Begin Date': row.xpath('td[@class="col-sm-1 sorting"]//text()').extract_first(),
                'End Date': row.xpath('td[@class="col-sm-1 sorting"]//text()').extract_first(),
                'Buyer Name': row.xpath('td[@class="col-sm-3 sorting"]/a/text()').extract_first(),
                'Vendor Websites': row.xpath('td[@class="col-sm-1 sorting"]/a/text()').extract_first(),
            }

Any help with this would be greatly appreciated!

Thanks!

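Since you are new to Scrapy, here are my suggestions:

  • You can use either the start_urls attribute or the start_requests() method, but avoid using both in the same spider. You can read more about this here.

  • There is no need to loop over URLs, because you only make a single request.

  • Your code produces no output because your XPath expressions do not match the HTML that Scrapy receives. Classes such as dataTable, no-footer, sorting and sorting_asc are most likely added by JavaScript in the browser, so they are missing from the raw page source.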
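To illustrate the last point, here is a minimal sketch of why an exact @class comparison stops matching when the downloaded HTML carries fewer classes than the browser shows. It uses only the standard library's xml.etree.ElementTree rather than Scrapy, and the RAW_HTML fragment is made up for illustration; the real page's markup may differ.

```python
import xml.etree.ElementTree as ET

# Made-up fragment standing in for the HTML the server sends, before any
# JavaScript has run. This is an assumption, not a copy of the real page.
RAW_HTML = """
<table class="table table-bordered table-responsive-sm">
  <tbody>
    <tr>
      <td class="col-sm-5">Example contract</td>
      <td class="col-sm-1"><a href="/files/T000.pdf">T000</a></td>
      <td class="col-sm-1">01/01/2020</td>
      <td class="col-sm-1">12/31/2020</td>
    </tr>
  </tbody>
</table>
"""

root = ET.fromstring(RAW_HTML)  # root is the <table> element
row = root.find('./tbody/tr')

# @class='...' compares the whole attribute string, so a selector copied from
# the browser inspector (with the extra "sorting" class) matches nothing here.
browser_style = row.findall("td[@class='col-sm-1 sorting']")

# The same selector without the extra class matches all three cells.
raw_style = row.findall("td[@class='col-sm-1']")

# Several cells share one class, so each is picked by position. Full XPath
# would use 1-based [2] and [3]; here the Python list is indexed instead,
# because ElementTree only implements a subset of XPath.
begin_date = raw_style[1].text
end_date = raw_style[2].text

print(len(browser_style), len(raw_style), begin_date, end_date)
```

A quick way to check this against the real site is to compare the browser's "view source" output (not the element inspector) with the selector, or to try the expression in scrapy shell.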

Code

import scrapy

class AlabamaSpider(scrapy.Spider):

    name = 'alabama'
    allowed_domains = ['purchasing.alabama.gov']

    def start_requests(self):
        # Only one page is requested, so no loop and no start_urls attribute
        url = 'https://purchasing.alabama.gov/active-statewide-contracts/'

        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # One <tr> per contract; the class list omits "dataTable no-footer"
        for row in response.xpath('//*[@class="table table-bordered table-responsive-sm"]//tbody//tr'):

            yield {
                # normalize-space() trims and collapses whitespace in the cell text
                'Description': row.xpath('normalize-space(./td[@class="col-sm-5"])').extract_first(),
                'T-NBR': row.xpath('td[@class="col-sm-1"]/a//text()').extract_first(),
                # [2] and [3] pick the second and third td with class col-sm-1
                'Begin Date': row.xpath('normalize-space(./td[@class="col-sm-1"][2])').extract_first(),
                'End Date': row.xpath('normalize-space(./td[@class="col-sm-1"][3])').extract_first(),
                'Buyer Name': row.xpath('td[@class="col-sm-3"]/a//text()').extract_first(),
                'Vendor Websites': row.xpath('td[@class="col-sm-1"]/label/text()').extract_first(),
            }
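As for downloading the files, one possible approach is Scrapy's built-in FilesPipeline, which downloads every URL listed in an item's file_urls field once the pipeline is enabled and FILES_STORE is set. The sketch below is untested against this site and rests on assumptions: FILES_SETTINGS and with_file_urls are hypothetical names, 'downloads' is a placeholder directory, and the idea that the files are the a/@href links in each row is a guess about the page.

```python
from urllib.parse import urljoin

# Settings that enable Scrapy's built-in FilesPipeline. They could go in
# settings.py or in the spider's custom_settings attribute.
# 'downloads' is a placeholder directory name (an assumption).
FILES_SETTINGS = {
    'ITEM_PIPELINES': {'scrapy.pipelines.files.FilesPipeline': 1},
    'FILES_STORE': 'downloads',
}


def with_file_urls(item, hrefs, page_url):
    """Return a copy of item with a 'file_urls' list of absolute URLs.

    FilesPipeline reads 'file_urls', so relative links are resolved
    against the page URL and empty values are skipped.
    """
    absolute = [urljoin(page_url, href) for href in hrefs if href]
    return {**item, 'file_urls': absolute}


# Possible use inside parse(), assuming the files are the links in each row:
#     hrefs = row.xpath('.//a/@href').getall()
#     yield with_file_urls({'T-NBR': ...}, hrefs, response.url)
```

After the downloads finish, the pipeline adds a files field to each item describing where each file was stored.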