Using Scrapy to iterate through Boxscore links on footballdb

Question

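I need to use Scrapy to iterate through all of the boxscore links and then extract the passing, rushing, and receiving tables from each boxscore to build a dataset. The main problem is that my code returns nothing when I run it.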

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class Nfl20Spider(CrawlSpider):
    name = 'nfl20'
    allowed_domains = ['www.footballdb.com']
    start_urls = ['http://www.footballdb.com/games']
    # fixed to iterate through all box scores
    rules = (
        Rule(LinkExtractor(restrict_xpaths='.//table/tbody/tr[1]/td[7]/a'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        # Table of stats.
        # Need to fix so that it only prints out the text and not the HTML elements.
        item['table'] = response.xpath('//table/tbody').extract_first()
        print(item['table'])
        yield item

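I was able to get it to iterate and save to a file, but I can't restrict it to just the boxscores, and it is printing out the HTML tags. I need help cleaning it up so that it extracts only the text and only follows the boxscore links. Thanks for your help.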

Answer 1

What output do you get from print(item['table'])?

Answer 2

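I suggest using scrapy shell websitetoscrap.com to make it easier to work out where in the HTML structure the information you want is located. On http://www.footballdb.com/games, simply adding //td before //a is enough to get only the links related to boxscores.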
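As a complementary sketch (not what this answer's spider does), the boxscore links can also be recognised by their URL, since boxscore pages live under /games/boxscore.html with a gid query parameter. The snippet below only demonstrates the regular expression on made-up hrefs; the sample URLs and gid values are placeholders, not real pages.

```python
import re

# Boxscore pages share the path "/games/boxscore.html" followed by a gid parameter.
BOXSCORE_RE = re.compile(r"/games/boxscore\.html\?gid=")

# Made-up hrefs of the kind a games index page might contain (placeholders only).
hrefs = [
    "https://www.footballdb.com/games/boxscore.html?gid=EXAMPLE1",
    "https://www.footballdb.com/teams/index.html",
    "https://www.footballdb.com/games/boxscore.html?gid=EXAMPLE2",
    "https://www.footballdb.com/games/index.html",
]

# Keep only the hrefs that look like boxscore pages.
boxscore_links = [href for href in hrefs if BOXSCORE_RE.search(href)]

for link in boxscore_links:
    print(link)
```

In a spider, the same pattern could be passed to LinkExtractor through its allow argument, alongside or instead of restrict_xpaths.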

The HTML structure of https://www.footballdb.com/games/boxscore.html?gid=... is not very good. There are almost no ids identifying where the different stats are located.

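First, if you want to identify the two opponents (for example, Cleveland Browns at New York Jets for this match), check whether that element has an id in the HTML structure. On this site, neither the element nor its parent tags have an id. So try to determine the most unique path possible; for this one, the best we can do is: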

response.xpath('//center//h1/text()').get()

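Since only one result is returned, I can use get() directly.

Now, if we want the date of the match (for example, December 27, 2020), after analyzing the HTML structure we can proceed like this: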

response.xpath('//center//div/text()').getall()[2]

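This query returns several results, so you have to use getall() first and then locate the position of the information you are looking for.

We can do the same for the venue (for example, MetLife Stadium, East Rutherford, NJ):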

response.xpath('//center//div/text()').getall()[3]
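Indexing straight into the getall() result raises IndexError whenever a page has fewer text nodes than expected. A minimal sketch of a guard is shown below; pick is a hypothetical helper name, and the list is a made-up stand-in for what getall() might return (only the date and venue strings come from the examples above).

```python
def pick(values, index, default=None):
    """Return the stripped string at `index`, or `default` if the list is too short."""
    try:
        return values[index].strip()
    except IndexError:
        return default


# Made-up stand-in for response.xpath('//center//div/text()').getall()
texts = ["\n", "Final", "December 27, 2020", "MetLife Stadium, East Rutherford, NJ"]

date = pick(texts, 2)
location = pick(texts, 3)
missing = pick(texts, 10)  # index out of range, so the default is returned

print(date, location, missing, sep=" | ")
```

The positions 2 and 3 are still tied to the page layout, so they need rechecking if the site changes.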

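For the stats, the same technique is needed: try to determine a path that is as unique as possible and, if several results are returned, find the one we are interested in.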

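As for the table tag, you cannot convert it directly to text; it is not that simple. You have to iterate over each table header th, then over each row tr, and then over the columns td.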
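The th / tr / td traversal can be sketched as follows. To keep the example runnable without Scrapy, it uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors, and the table markup, column names, and player rows are invented for illustration; the real page's markup may differ.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed stand-in for one stats table on a boxscore page.
SAMPLE_TABLE = """
<table>
  <thead><tr><th>Passing</th><th>Att</th><th>Cmp</th><th>Yds</th></tr></thead>
  <tbody>
    <tr><td><a href="#">Player A</a></td><td>30</td><td>20</td><td>250</td></tr>
    <tr><td><a href="#">Player B</a></td><td>5</td><td>3</td><td>40</td></tr>
  </tbody>
</table>
"""


def cell_text(cell):
    """Join every text node under a cell and trim surrounding whitespace."""
    return "".join(cell.itertext()).strip()


def table_to_records(table):
    """Turn a <table> element into a list of {header: value} dicts."""
    headers = [cell_text(th) for th in table.iter("th")]
    records = []
    for tr in table.iter("tr"):
        cells = [cell_text(td) for td in tr.findall("td")]
        if cells:  # the header row has only <th> cells, so it is skipped
            records.append(dict(zip(headers, cells)))
    return records


records = table_to_records(ET.fromstring(SAMPLE_TABLE.strip()))
for record in records:
    print(record)
```

Inside parse_item, the same three loops would run over response.xpath(...) selections instead, taking the text of each th and td rather than the raw HTML.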

I hope you now have a more complete picture of the process.

Here is your code with some of the corrections and additions described above:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class Nfl20Spider(CrawlSpider):
    name = 'nfl20'
    allowed_domains = ['www.footballdb.com']
    start_urls = ['http://www.footballdb.com/games']
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//td//a'), callback='parse_item'),
    )

    def parse_item(self, response):
        item = {}
        item['stats'] = {}
        item['stats']['visitor'] = {}
        item['stats']['home'] = {}

        item['match'] = response.xpath('//center//h1/text()').get()
        item['date'] = response.xpath('//center//div/text()').getall()[2]
        item['location'] = response.xpath('//center//div/text()').getall()[3]
        item['stats']['visitor']['name'] = response.xpath('//div[@class="boxdiv_visitor"]//span/text()').get()
        item['stats']['home']['name'] = response.xpath('//div[@class="boxdiv_home"]//span/text()').get()
        print(item)
        yield item

python

scrapy