Writing a webscraper for a page with javascript elements?

So, I need to write a python webscraper to collect data from this website: https://online.portalberni.ca/WebApps/PIP/Pages/Search.aspx?templateName=permit%20reporting

As you can see, there doesn't seem to be any way to manually enter text into the date fields, which is what I would normally do when writing a script for a page like this. The script will run every day on a headless ubuntu server, and I need to be able to select a date range covering the 7 days before the script runs. Again, that would normally be easy by typing in text, but I don't think that's an option here. Any idea how to work with javascript elements like these?

This gets me to the next page (which has another form on it for doing something similar):

from requests import Session
from bs4 import BeautifulSoup as Bs

s = Session()  # Keeps cookies and other session state for future requests

# If you look at the HTML, this is the "action" of the form (in this case it happens to be the same as the page URL, which isn't always true)
form_url = "https://online.portalberni.ca/WebApps/PIP/Pages/Search.aspx?templateName=permit%20reporting"

# Gets the HTML of the form
r = s.get(form_url)
html = Bs(r.text, "lxml")
form = html.find("form")

# Finds hidden inputs in the form that are necessary for a successful POST
hidden = form.find_all("input", {"type": "hidden"})
data = {i["name"]: i["value"] for i in hidden}

"""
There is javascript code that changes the form data before submission (onsubmit in the
form). I found this by using developer tools in chrome to see what the POST data actually
was, not by analyzing the javascript
"""
data["ctl00$FeaturedContent$ToolkitScriptManager1"] = "ctl00$FeaturedContent$updpnl_search|ctl00$FeaturedContent$btn_ViewReport"
data["__EVENTTARGET"] = ""
data["__EVENTARGUMENT"] = ""
data["__ASYNCPOST"] = "true"
data["ctl00$FeaturedContent$btn_ViewReport"] = "Search"

# Change to your date range
data["ctl00$FeaturedContent$txt_FromDate"] = "01/01/2021"
data["ctl00$FeaturedContent$txt_ToDate"] = "01/10/2021"

# Submits the form
headers = {
    "Content-type": "application/x-www-form-urlencoded; charset=UTF-8",
    "Referer": "https://online.portalberni.ca/WebApps/PIP/Pages/Search.aspx?templateName=permit%20reporting",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36",
}
s.post(form_url, data=data, headers=headers)
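# Note: with __ASYNCPOST set, this is an ASP.NET partial postback, so the POST
# response is typically a pipe-delimited "delta" rather than a full HTML page,
# which is why the results are fetched with a separate GET below.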

# The page with the results you're looking for
results_url = "https://online.portalberni.ca/WebApps/PIP/Pages/PropBasedReportSelection.aspx?templateName=permit%20reporting"
r = s.get(results_url)
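
Once you have r.text, it can be parsed the same way as the search page. As a rough sketch (the exact table structure is an assumption; inspect the HTML and adjust the selectors), this just dumps every table row it finds:

# Rough sketch: walk every table on the results page and print its rows.
# Replace this with a selector for the specific results table once you know
# its id or class.
results_html = Bs(r.text, "lxml")
for table in results_html.find_all("table"):
    for row in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        if cells:
            print(cells)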

It might be possible to skip this form and just do the one on the second page, but I haven't tried it. This should at least get you on the right track.