How to switch page in a table with scrapy?
I'm trying to scrape the table from this site: https://www.burgrieden.de/index.php?id=77
I managed to get the first page, but I can't reach the other 4 pages.
The only explanations and examples I could find deal with direct links to the other pages or with simple URL manipulation.
I tried inspecting the button and watching what happens in the network logger when it is pressed, but nothing worked.
How do I get to the next page with scrapy?
Here is what I have so far:
from abc import ABC
import scrapy
from scrapy.crawler import CrawlerProcess
import re
from datetime import datetime


class TrashSpider(scrapy.Spider, ABC):
    name = "Trasher"
    start_urls = ['https://www.burgrieden.de/index.php?id=77']

    def parse(self, response, **kwargs):
        for row in response.xpath('//*[@class="contenttable"]//tr')[1:]:
            d = row.xpath('td//text()')[0].extract()
            match = re.search(r'\d{2}.\d{2}.\d{4}', d)
            date = datetime.strptime(match.group(), '%d.%m.%Y').date()
            entry = {
                'date': date,
                'type': row.xpath('td//text()')[2].extract()
            }


process = CrawlerProcess()
process.crawl(TrashSpider)
process.start()
[screenshot of the browser inspector]
Thanks in advance for your help.
For anyone with the same problem: I figured it out.
The button triggers a POST request. To trigger this request with scrapy, you have to define the request headers and the request form data.
Both can be found with your browser's network analyzer:
On the left you can see the method used to request the site. The marked entry shows POST.
Now we need to grab the headers, which can be found in the field at the bottom right, and put them into a dictionary in the spider class.
Make sure to ignore Content-Length. Leave it out of your dictionary entirely, because scrapy sends its own Content-Length, and when two are sent the server rejects the request with a 400 instead of a 200.
The form data can be found, at least in Firefox, under the request tab:
Here we need the whole line as-is, stored in a variable so it can be used in the actual request.
It should look like this:
# request-header for POST request
headers = {
    'Host': 'www.burgrieden.de',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:86.0) Gecko/20100101 Firefox/86.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'de,en-US;q=0.7,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate, br',
    'Content-Type': 'application/x-www-form-urlencoded',
    # 'Content-Length': '219',
    'Origin': 'https://www.burgrieden.de',
    'Connection': 'keep-alive',
    'Referer': 'https://www.burgrieden.de/index.php?id=77',
    'Cookie': 'style=normal.css',
    'Upgrade-Insecure-Requests': '1',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache'
}

# POST request form data for every table page
form_data_p1 = 'publish%5BbtnStart%5D=+%7C%3C+&publish%5Bstart%5D=40&publish%5BdayFrom%5D=20&publish%5BmonthFrom%5D=03&publish%5ByearFrom%5D=2021&publish%5BdayTo%5D=&publish%5BmonthTo%5D=&publish%5ByearTo%5D=&publish%5Bfulltext%5D=&id=77'
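As a side note, instead of copying the raw URL-encoded string from the network analyzer, an equivalent body can be built from a plain dict with urllib.parse.urlencode. A minimal sketch, assuming the field names and values decoded from the string above:

# sketch: build the same URL-encoded body from a dict instead of copying it verbatim
from urllib.parse import urlencode

form_fields = {
    'publish[btnStart]': ' |< ',
    'publish[start]': '40',
    'publish[dayFrom]': '20',
    'publish[monthFrom]': '03',
    'publish[yearFrom]': '2021',
    'publish[dayTo]': '',
    'publish[monthTo]': '',
    'publish[yearTo]': '',
    'publish[fulltext]': '',
    'id': '77',
}
form_data_p1 = urlencode(form_fields)  # should produce an equivalent body to the string above

This makes it easier to change single fields (for example the start offset) without editing an opaque string.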
To make sure this works, you have to disable cookies. Just put this into your custom spider class:
# custom scraper settings
custom_settings = {
    # pass cookies along with headers
    'COOKIES_ENABLED': False
}
For the actual request you have to use the start_requests() method.
# crawler's entry point
def start_requests(self):
    # make HTTP POST request
    # page 1
    yield scrapy.Request(
        url=self.start_url,
        method='POST',
        headers=self.headers,
        body=self.form_data_p1,
        callback=self.parse
    )
Now you can parse the response with a normal parse() method.
If you run into any problems, try leaving the Host header empty or removing it entirely.
Here is the complete class:
class TrashSpider(scrapy.Spider, ABC):
    name = "Trasher"
    start_url = "https://www.burgrieden.de/index.php?id=77"

    # request-header for POST request
    headers = {
        'Host': 'www.burgrieden.de',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:86.0) Gecko/20100101 Firefox/86.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'de,en-US;q=0.7,en;q=0.3',
        'Accept-Encoding': 'gzip, deflate, br',
        'Content-Type': 'application/x-www-form-urlencoded',
        # 'Content-Length': '219',
        'Origin': 'https://www.burgrieden.de',
        'Connection': 'keep-alive',
        'Referer': 'https://www.burgrieden.de/index.php?id=77',
        'Cookie': 'style=normal.css',
        'Upgrade-Insecure-Requests': '1',
        'Pragma': 'no-cache',
        'Cache-Control': 'no-cache'
    }

    # POST request form data for every table page
    form_data_p1 = 'publish%5BbtnStart%5D=+%7C%3C+&publish%5Bstart%5D=40&publish%5BdayFrom%5D=20&publish%5BmonthFrom%5D=03&publish%5ByearFrom%5D=2021&publish%5BdayTo%5D=&publish%5BmonthTo%5D=&publish%5ByearTo%5D=&publish%5Bfulltext%5D=&id=77'
    form_data_p2 = 'publish%5BbtnNext%5D=+%3E%3E+&publish%5Bstart%5D=0&publish%5BdayFrom%5D=20&publish%5BmonthFrom%5D=03&publish%5ByearFrom%5D=2021&publish%5BdayTo%5D=&publish%5BmonthTo%5D=&publish%5ByearTo%5D=&publish%5Bfulltext%5D=&id=77'
    form_data_p3 = 'publish%5BbtnNext%5D=+%3E%3E+&publish%5Bstart%5D=10&publish%5BdayFrom%5D=20&publish%5BmonthFrom%5D=03&publish%5ByearFrom%5D=2021&publish%5BdayTo%5D=&publish%5BmonthTo%5D=&publish%5ByearTo%5D=&publish%5Bfulltext%5D=&id=77'
    form_data_p4 = 'publish%5BbtnNext%5D=+%3E%3E+&publish%5Bstart%5D=20&publish%5BdayFrom%5D=20&publish%5BmonthFrom%5D=03&publish%5ByearFrom%5D=2021&publish%5BdayTo%5D=&publish%5BmonthTo%5D=&publish%5ByearTo%5D=&publish%5Bfulltext%5D=&id=77'
    form_data_p5 = 'publish%5BbtnNext%5D=+%3E%3E+&publish%5Bstart%5D=30&publish%5BdayFrom%5D=20&publish%5BmonthFrom%5D=03&publish%5ByearFrom%5D=2021&publish%5BdayTo%5D=&publish%5BmonthTo%5D=&publish%5ByearTo%5D=&publish%5Bfulltext%5D=&id=77'

    # custom scraper settings
    custom_settings = {
        # pass cookies along with headers
        'COOKIES_ENABLED': False
    }

    entrys_crawled = []

    # crawler's entry point
    def start_requests(self):
        # make HTTP POST request to burgrieden
        # page 1
        yield scrapy.Request(
            url=self.start_url,
            method='POST',
            headers=self.headers,
            body=self.form_data_p1,
            callback=self.parse
        )
        # page 2
        yield scrapy.Request(
            url=self.start_url,
            method='POST',
            headers=self.headers,
            body=self.form_data_p2,
            callback=self.parse
        )
        # page 3
        yield scrapy.Request(
            url=self.start_url,
            method='POST',
            headers=self.headers,
            body=self.form_data_p3,
            callback=self.parse
        )
        # page 4
        yield scrapy.Request(
            url=self.start_url,
            method='POST',
            headers=self.headers,
            body=self.form_data_p4,
            callback=self.parse
        )
        # page 5
        yield scrapy.Request(
            url=self.start_url,
            method='POST',
            headers=self.headers,
            body=self.form_data_p5,
            callback=self.parse
        )

    # parse date and description from table "contenttable",
    # extract the date from the loosely formatted text and store it in a dictionary entry as a date
    def parse(self, response, **kwargs):
        for row in response.xpath('//*[@class="contenttable"]//tr')[1:]:
            d = row.xpath('td//text()')[0].extract()
            match = re.search(r'\d{2}.\d{2}.\d{4}', d)
            entry = {
                'date': datetime.strptime(match.group(), '%d.%m.%Y').date(),
                'type': row.xpath('td//text()')[2].extract()
            }
            self.entrys_crawled.append(entry)
There is probably a better way to handle the multiple POST requests, but it worked for me. If anyone wants to improve it and send it to me, feel free to do so.
For anyone wondering about entrys_crawled: I process it into an .ics file in scrapy's close() method.
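For example, the five hard-coded bodies could be generated in a loop. A rough sketch (not tested against the site), reusing the field values decoded from the strings above:

# sketch: generate the five page requests in a loop instead of five hard-coded bodies
# (drop-in replacement for start_requests() above; assumes the class attributes start_url
#  and headers defined earlier, and `from urllib.parse import urlencode` at module level)
def start_requests(self):
    common = {
        'publish[dayFrom]': '20', 'publish[monthFrom]': '03', 'publish[yearFrom]': '2021',
        'publish[dayTo]': '', 'publish[monthTo]': '', 'publish[yearTo]': '',
        'publish[fulltext]': '', 'id': '77',
    }
    # page 1 uses the "|<" button, pages 2-5 use the ">>" button with increasing start offsets
    pages = [{'publish[btnStart]': ' |< ', 'publish[start]': '40'}]
    pages += [{'publish[btnNext]': ' >> ', 'publish[start]': str(offset)} for offset in (0, 10, 20, 30)]
    for fields in pages:
        yield scrapy.Request(
            url=self.start_url,
            method='POST',
            headers=self.headers,
            body=urlencode({**fields, **common}),
            callback=self.parse
        )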
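For reference, a minimal sketch of what that could look like, using the spider's closed() hook and writing a bare-bones iCalendar file by hand; the output file name, UID domain, and SUMMARY wording are made up for illustration:

# sketch: write entrys_crawled to a minimal .ics file when the spider closes
# (file name and UID domain are placeholders, not from the original post)
def closed(self, reason):
    lines = ['BEGIN:VCALENDAR', 'VERSION:2.0', 'PRODID:-//TrashSpider//EN']
    for i, entry in enumerate(self.entrys_crawled):
        lines += [
            'BEGIN:VEVENT',
            f'UID:trash-{i}@example.invalid',
            f'DTSTART;VALUE=DATE:{entry["date"].strftime("%Y%m%d")}',
            f'SUMMARY:{entry["type"]}',
            'END:VEVENT',
        ]
    lines.append('END:VCALENDAR')
    with open('trash_dates.ics', 'w', encoding='utf-8') as f:
        f.write('\r\n'.join(lines) + '\r\n')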