Extract data from a table including files using Scrapy

Question

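I am trying to extract data from the table of active bids listed here: https://purchasing.alabama.gov/active-statewide-contracts/. I am new to Scrapy and a bit stuck on why I am getting no output. Also, how can I download the files found in the table? This is the code I have so far: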

import scrapy

class AlabamaSpider(scrapy.Spider):

    name = 'alabama'
    allowed_domains = ['purchasing.alabama.gov']
    start_urls = ['https://purchasing.alabama.gov/active-statewide-contracts/']

    def start_requests(self):
        urls = ['https://purchasing.alabama.gov/active-statewide-contracts/']

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for row in response.xpath('//*[@class="table table-bordered table-responsive-sm dataTable no-footer"]//tbody//tr'):

            yield {
                'Description': row.xpath('td[@class="col-sm-5 sorting_asc"]//text()').extract_first(),
                'T-NBR': row.xpath('td[@class="col-sm-1 sorting"]/a/text()').extract_first(),
                'Begin Date': row.xpath('td[@class="col-sm-1 sorting"]//text()').extract_first(),
                'End Date': row.xpath('td[@class="col-sm-1 sorting"]//text()').extract_first(),
                'Buyer Name': row.xpath('td[@class="col-sm-3 sorting"]/a/text()').extract_first(),
                'Vendor Websites': row.xpath('td[@class="col-sm-1 sorting"]/a/text()').extract_first(),
            }

Any help with this would be greatly appreciated!

Thanks!

Answer 1

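Since you are new to Scrapy, here are my suggestions:

You can use either the start_urls attribute or the start_requests() method, but avoid using both in the same spider: once start_requests() is overridden, start_urls is not used unless the method reads it. You can read more about this here.
There is no need to loop over URLs, because you only make a single request.
Your code produces no output because your XPath expressions are incorrect.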
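
To illustrate the third point: a predicate such as @class="..." compares the whole attribute string, so a single extra class name makes it match nothing. The class names in the question (dataTable, no-footer, sorting, sorting_asc) look like ones the DataTables JavaScript plugin adds in the browser, which would explain why they are missing from the HTML Scrapy receives; that explanation is an assumption, not something verified against the live page. The sketch below uses the standard library's xml.etree.ElementTree instead of Scrapy so that it runs without extra dependencies, and the HTML fragment is made up for the example.

```python
import xml.etree.ElementTree as ET

# Made-up fragment standing in for the HTML the server sends
# (an assumption, not a copy of the real page).
RAW_HTML = """
<html><body>
<table class="table table-bordered table-responsive-sm">
  <tbody>
    <tr>
      <td class="col-sm-5">Office Supplies</td>
      <td class="col-sm-1"><a href="/files/T001.pdf">T001</a></td>
    </tr>
  </tbody>
</table>
</body></html>
""".strip()

root = ET.fromstring(RAW_HTML)


def rows_for(table_class):
    """Return the rows of the table whose class attribute equals table_class exactly."""
    return root.findall(f".//table[@class='{table_class}']/tbody/tr")


# Class string as seen in the browser's inspector (from the question).
browser_class = "table table-bordered table-responsive-sm dataTable no-footer"
# Class string used in the corrected XPath.
server_class = "table table-bordered table-responsive-sm"

print(len(rows_for(browser_class)))  # 0: the extra class names prevent any match
print(len(rows_for(server_class)))   # 1: the exact string matches
```

A quick way to check what Scrapy actually receives is to inspect the page source, or the response in scrapy shell, rather than the browser's element inspector.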

Code

import scrapy

class AlabamaSpider(scrapy.Spider):

    name = 'alabama'
    allowed_domains = ['purchasing.alabama.gov']

    def start_requests(self):
        # A single request is enough, so no loop is needed.
        url = 'https://purchasing.alabama.gov/active-statewide-contracts/'

        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for row in response.xpath('//*[@class="table table-bordered table-responsive-sm"]//tbody//tr'):

            yield {
                # normalize-space() trims and collapses whitespace in the cell text.
                'Description': row.xpath('normalize-space(./td[@class="col-sm-5"])').extract_first(),
                'T-NBR': row.xpath('td[@class="col-sm-1"]/a//text()').extract_first(),
                # [2] and [3] pick the second and third col-sm-1 cells of the row.
                'Begin Date': row.xpath('normalize-space(./td[@class="col-sm-1"][2])').extract_first(),
                'End Date': row.xpath('normalize-space(./td[@class="col-sm-1"][3])').extract_first(),
                'Buyer Name': row.xpath('td[@class="col-sm-3"]/a//text()').extract_first(),
                'Vendor Websites': row.xpath('td[@class="col-sm-1"]/label/text()').extract_first(),
            }
