How to switch page in a table with scrapy?
I'm trying to scrape the table from this site: https://www.burgrieden.de/index.php?id=77
I managed to get the first page, but I can't reach the other 4 pages.
The only explanations and examples I could find deal with direct links to the other pages or with simple URL manipulation.
I tried inspecting the button and watching what happens in the network logger when it is pressed, but nothing worked.
How do I get to the next page with scrapy?
Here is what I have so far:
from abc import ABC
import scrapy
from scrapy.crawler import CrawlerProcess
import re
from datetime import datetime


class TrashSpider(scrapy.Spider, ABC):
    name = "Trasher"
    start_urls = ['https://www.burgrieden.de/index.php?id=77']

    def parse(self, response, **kwargs):
        for row in response.xpath('//*[@class="contenttable"]//tr')[1:]:
            d = row.xpath('td//text()')[0].extract()
            match = re.search(r'\d{2}.\d{2}.\d{4}', d)
            date = datetime.strptime(match.group(), '%d.%m.%Y').date()
            entry = {
                'date': date,
                'type': row.xpath('td//text()')[2].extract()
            }


process = CrawlerProcess()
process.crawl(TrashSpider)
process.start()
[screenshot of the browser inspector]
Thanks in advance for your help.
For anyone with the same problem: I figured it out.
The button triggers a POST request. To trigger this request with scrapy, you have to define the request headers and the request form data.
Both can be found with your browser's network analyzer:
On the left you can see the method used to request the site. The marked entry shows POST.
Now we need to grab the headers, which can be found in the field at the bottom right, and put them into a dictionary in the spider class.
Make sure to ignore Content-Length. Leave it out of your dictionary entirely, because scrapy sends its own Content-Length, and when two are sent the server rejects the request with a 400 instead of a 200.
The form data can be found, at least in Firefox, under the request tab:
Here we need the whole line as-is, stored in a variable so it can be used in the actual request.
It should look like this:
# request-header for POST request
headers = {
    'Host': 'www.burgrieden.de',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:86.0) Gecko/20100101 Firefox/86.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'de,en-US;q=0.7,en;q=0.3',
    'Accept-Encoding': 'gzip, deflate, br',
    'Content-Type': 'application/x-www-form-urlencoded',
    # 'Content-Length': '219',
    'Origin': 'https://www.burgrieden.de',
    'Connection': 'keep-alive',
    'Referer': 'https://www.burgrieden.de/index.php?id=77',
    'Cookie': 'style=normal.css',
    'Upgrade-Insecure-Requests': '1',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache'
}

# POST request form data for every table page
form_data_p1 = 'publish%5BbtnStart%5D=+%7C%3C+&publish%5Bstart%5D=40&publish%5BdayFrom%5D=20&publish%5BmonthFrom%5D=03&publish%5ByearFrom%5D=2021&publish%5BdayTo%5D=&publish%5BmonthTo%5D=&publish%5ByearTo%5D=&publish%5Bfulltext%5D=&id=77'
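As a side note, instead of copying the raw URL-encoded string from the network analyzer, an equivalent body can be built from a plain dict with urllib.parse.urlencode. A minimal sketch, assuming the field names and values decoded from the string above:

# sketch: build the same URL-encoded body from a dict instead of copying it verbatim
from urllib.parse import urlencode

form_fields = {
    'publish[btnStart]': ' |< ',
    'publish[start]': '40',
    'publish[dayFrom]': '20',
    'publish[monthFrom]': '03',
    'publish[yearFrom]': '2021',
    'publish[dayTo]': '',
    'publish[monthTo]': '',
    'publish[yearTo]': '',
    'publish[fulltext]': '',
    'id': '77',
}
form_data_p1 = urlencode(form_fields)  # should produce an equivalent body to the string above

This makes it easier to change single fields (for example the start offset) without editing an opaque string.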
To make sure this works, you have to disable cookies. Just put this into your custom spider class:
# custom scraper settings
custom_settings = {
    # pass cookies along with headers
    'COOKIES_ENABLED': False
}
For the actual request you have to use the start_requests() method.
# crawler's entry point
def start_requests(self):
    # make HTTP POST request
    # page 1
    yield scrapy.Request(
        url=self.start_url,
        method='POST',
        headers=self.headers,
        body=self.form_data_p1,
        callback=self.parse
    )
Now you can parse the response with a normal parse() method.
If you run into any problems, try leaving the Host header empty or removing it entirely.
Here is the complete class:
class TrashSpider(scrapy.Spider, ABC):
    name = "Trasher"
    start_url = "https://www.burgrieden.de/index.php?id=77"

    # request-header for POST request
    headers = {
        'Host': 'www.burgrieden.de',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:86.0) Gecko/20100101 Firefox/86.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'de,en-US;q=0.7,en;q=0.3',
        'Accept-Encoding': 'gzip, deflate, br',
        'Content-Type': 'application/x-www-form-urlencoded',
        # 'Content-Length': '219',
        'Origin': 'https://www.burgrieden.de',
        'Connection': 'keep-alive',
        'Referer': 'https://www.burgrieden.de/index.php?id=77',
        'Cookie': 'style=normal.css',
        'Upgrade-Insecure-Requests': '1',
        'Pragma': 'no-cache',
        'Cache-Control': 'no-cache'
    }

    # POST request form data for every table page
    form_data_p1 = 'publish%5BbtnStart%5D=+%7C%3C+&publish%5Bstart%5D=40&publish%5BdayFrom%5D=20&publish%5BmonthFrom%5D=03&publish%5ByearFrom%5D=2021&publish%5BdayTo%5D=&publish%5BmonthTo%5D=&publish%5ByearTo%5D=&publish%5Bfulltext%5D=&id=77'
    form_data_p2 = 'publish%5BbtnNext%5D=+%3E%3E+&publish%5Bstart%5D=0&publish%5BdayFrom%5D=20&publish%5BmonthFrom%5D=03&publish%5ByearFrom%5D=2021&publish%5BdayTo%5D=&publish%5BmonthTo%5D=&publish%5ByearTo%5D=&publish%5Bfulltext%5D=&id=77'
    form_data_p3 = 'publish%5BbtnNext%5D=+%3E%3E+&publish%5Bstart%5D=10&publish%5BdayFrom%5D=20&publish%5BmonthFrom%5D=03&publish%5ByearFrom%5D=2021&publish%5BdayTo%5D=&publish%5BmonthTo%5D=&publish%5ByearTo%5D=&publish%5Bfulltext%5D=&id=77'
    form_data_p4 = 'publish%5BbtnNext%5D=+%3E%3E+&publish%5Bstart%5D=20&publish%5BdayFrom%5D=20&publish%5BmonthFrom%5D=03&publish%5ByearFrom%5D=2021&publish%5BdayTo%5D=&publish%5BmonthTo%5D=&publish%5ByearTo%5D=&publish%5Bfulltext%5D=&id=77'
    form_data_p5 = 'publish%5BbtnNext%5D=+%3E%3E+&publish%5Bstart%5D=30&publish%5BdayFrom%5D=20&publish%5BmonthFrom%5D=03&publish%5ByearFrom%5D=2021&publish%5BdayTo%5D=&publish%5BmonthTo%5D=&publish%5ByearTo%5D=&publish%5Bfulltext%5D=&id=77'

    # custom scraper settings
    custom_settings = {
        # pass cookies along with headers
        'COOKIES_ENABLED': False
    }

    entrys_crawled = []

    # crawler's entry point
    def start_requests(self):
        # make HTTP POST request to burgrieden
        # page 1
        yield scrapy.Request(
            url=self.start_url,
            method='POST',
            headers=self.headers,
            body=self.form_data_p1,
            callback=self.parse
        )
        # page 2
        yield scrapy.Request(
            url=self.start_url,
            method='POST',
            headers=self.headers,
            body=self.form_data_p2,
            callback=self.parse
        )
        # page 3
        yield scrapy.Request(
            url=self.start_url,
            method='POST',
            headers=self.headers,
            body=self.form_data_p3,
            callback=self.parse
        )
        # page 4
        yield scrapy.Request(
            url=self.start_url,
            method='POST',
            headers=self.headers,
            body=self.form_data_p4,
            callback=self.parse
        )
        # page 5
        yield scrapy.Request(
            url=self.start_url,
            method='POST',
            headers=self.headers,
            body=self.form_data_p5,
            callback=self.parse
        )

    # parse date and description from table "contenttable",
    # extract the date from the loosely formatted text and store it in a dictionary entry as a date
    def parse(self, response, **kwargs):
        for row in response.xpath('//*[@class="contenttable"]//tr')[1:]:
            d = row.xpath('td//text()')[0].extract()
            match = re.search(r'\d{2}.\d{2}.\d{4}', d)
            entry = {
                'date': datetime.strptime(match.group(), '%d.%m.%Y').date(),
                'type': row.xpath('td//text()')[2].extract()
            }
            self.entrys_crawled.append(entry)
There is probably a better way to handle the multiple POST requests, but it worked for me. If anyone wants to improve it and send it to me, feel free to do so.
For anyone wondering about entrys_crawled: I process it into an .ics file in scrapy's close() method.
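For example, the five hard-coded bodies could be generated in a loop. A rough sketch (not tested against the site), reusing the field values decoded from the strings above:

# sketch: generate the five page requests in a loop instead of five hard-coded bodies
# (drop-in replacement for start_requests() above; assumes the class attributes start_url
#  and headers defined earlier, and `from urllib.parse import urlencode` at module level)
def start_requests(self):
    common = {
        'publish[dayFrom]': '20', 'publish[monthFrom]': '03', 'publish[yearFrom]': '2021',
        'publish[dayTo]': '', 'publish[monthTo]': '', 'publish[yearTo]': '',
        'publish[fulltext]': '', 'id': '77',
    }
    # page 1 uses the "|<" button, pages 2-5 use the ">>" button with increasing start offsets
    pages = [{'publish[btnStart]': ' |< ', 'publish[start]': '40'}]
    pages += [{'publish[btnNext]': ' >> ', 'publish[start]': str(offset)} for offset in (0, 10, 20, 30)]
    for fields in pages:
        yield scrapy.Request(
            url=self.start_url,
            method='POST',
            headers=self.headers,
            body=urlencode({**fields, **common}),
            callback=self.parse
        )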
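For reference, a minimal sketch of what that could look like, using the spider's closed() hook and writing a bare-bones iCalendar file by hand; the output file name, UID domain, and SUMMARY wording are made up for illustration:

# sketch: write entrys_crawled to a minimal .ics file when the spider closes
# (file name and UID domain are placeholders, not from the original post)
def closed(self, reason):
    lines = ['BEGIN:VCALENDAR', 'VERSION:2.0', 'PRODID:-//TrashSpider//EN']
    for i, entry in enumerate(self.entrys_crawled):
        lines += [
            'BEGIN:VEVENT',
            f'UID:trash-{i}@example.invalid',
            f'DTSTART;VALUE=DATE:{entry["date"].strftime("%Y%m%d")}',
            f'SUMMARY:{entry["type"]}',
            'END:VEVENT',
        ]
    lines.append('END:VCALENDAR')
    with open('trash_dates.ics', 'w', encoding='utf-8') as f:
        f.write('\r\n'.join(lines) + '\r\n')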