Extracting html text using R - can't access some nodes
I have a large number of water take consents that are available online, and I want to extract some data from them. For example
url <- "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1"
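I don't know html at all, but I have been plugging away with help from google and a friend. I can access some nodes without any trouble using an xpath or css selector, for example to get the title: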
library(rvest)
url %>%
read_html() %>%
html_nodes(xpath = '//*[@id="main"]/div/h1') %>%
html_text()
[1] "Details for CRC000002.1"
Or using css selectors:
url %>%
read_html() %>%
html_nodes(css = "#main") %>%
html_nodes(css = "div") %>%
html_nodes(css = "h1") %>%
html_text()
[1] "Details for CRC000002.1"
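So far, so good, but the information I actually want is buried a bit deeper, and I can't seem to get at it. For example, the client name field ("Killermont Station Limited", in this case) has this xpath: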
clientxpath <- '//*[@id="main"]/div/div[1]/div/table/tbody/tr[1]/td[2]'
url %>%
read_html() %>%
html_nodes(xpath = clientxpath) %>%
html_text()
character(0)
The css selectors get rather convoluted, but I get the same result. The help file for html_nodes() says:
# XPath selectors ---------------------------------------------
# chaining with XPath is a little trickier - you may need to vary
# the prefix you're using - // always selects from the root noot
# regardless of where you currently are in the doc
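but it doesn't give me any clue about which alternative prefix to use in the xpath (it might be obvious if I knew html).
A friend pointed out that part of the document is in javascript (ajax), which may be part of the problem. That said, the bit I'm trying to reach above does show up in the html, but it sits inside a node called 'div.ajax-block'.
css selectors: #main > div > div.ajax-block > div > table > tbody > tr:nth-child(1) > td:nth-child(4)
Can anyone help? Thanks!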
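It is quite disturbing that most, if not all, SO R contributors default to terse "use a heavyweight third-party dependency" "answers" when it comes to scraping. 99% of the time you don't need Selenium. You just need to exercise the little gray cells.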
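First, the big clue that the page loads content asynchronously is the wait spinner that appears. The second is in your snippet, where the div actually has ajax as part of its selector name: a telltale sign that XHR requests are in play.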
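To make the failure mode concrete, here is a minimal sketch of what read_html() most likely receives. The HTML string is a made-up stand-in (an assumption, not the site's real markup): the title is present, but the ajax-block div is still empty because no JavaScript has run.

```r
library(rvest)

# Hypothetical stand-in for the static HTML the server sends back.
# Assumption: the ajax-block container arrives empty and is filled in later by XHR.
static_html <- '<div id="main"><div>
  <h1>Details for CRC000002.1</h1>
  <div class="ajax-block"></div>
</div></div>'

doc <- read_html(static_html)

# The title is in the static markup, so this selector matches.
title <- doc %>%
  html_nodes(xpath = '//*[@id="main"]/div/h1') %>%
  html_text()

# The table cells are not there yet, so this selector matches nothing.
cells <- doc %>%
  html_nodes(css = "div.ajax-block table td") %>%
  html_text()

print(title)
print(cells)
```

An empty result like this points at missing content rather than at a wrong xpath prefix.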
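If you open Developer Tools in your browser and reload the page, then go to Network and then the XHR tab, you'll see that most of the "real" data on the page is loaded dynamically. We can write httr calls that mimic the browser's calls.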
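However…
We first need to make one GET call to the main page to prime some cookies, which will be carried along for us, and then find a per-session token that is generated to prevent abuse of the site. It is defined using JavaScript, so we'll use the V8 package to evaluate it. We could just use a regular expression to find the string. Do whichever you like.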
library(httr)
library(rvest)
library(dplyr)
library(V8)
ctx <- v8() # we need this to eval some javascript
# Prime Cookies -----------------------------------------------------------
res <- httr::GET("https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1")
httr::cookies(res)
## domain flag path secure expiration name
## 1 .ecan.govt.nz TRUE / FALSE 2019-11-24 11:46:13 visid_incap_927063
## 2 .ecan.govt.nz TRUE / FALSE <NA> incap_ses_148_927063
## value
## 1 +p8XAM6uReGmEnVIdnaxoxWL+VsAAAAAQUIPAAAAAABjdOjQDbXt7PG3tpBpELha
## 2 nXJSYz8zbCRj8tGhzNANAhaL+VsAAAAA7JyOH7Gu4qeIb6KKk/iSYQ==
pg <- httr::content(res)
html_node(pg, xpath=".//script[contains(., '_monsido')]") %>%
html_text() %>%
ctx$eval()
## [1] "2"
# _monsido comes back as a 2-D object; the token sits in row 1, column 2
monsido_token <- ctx$get("_monsido")[1,2]
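For the regex route mentioned above, a rough sketch might look like this. The script text and the '_setDomainToken' key are assumptions about what the page's inline script looks like, and the token value is a placeholder; check the real script before relying on the pattern.

```r
# Hypothetical example of the inline script text (an assumption, not copied from the site).
script_text <- paste0(
  "var _monsido = _monsido || []; ",
  "_monsido.push(['_setDomainToken', 'EXAMPLE-TOKEN']); ",
  "_monsido.push(['_withStatistics', 'true']);"
)

# Capture the quoted value that follows the (assumed) '_setDomainToken' key.
pattern <- "_setDomainToken'[[:space:]]*,[[:space:]]*'([^']+)'"
m <- regmatches(script_text, regexec(pattern, script_text))

# m[[1]][1] is the full match, m[[1]][2] is the first capture group.
regex_token <- m[[1]][2]
print(regex_token)
```

This avoids the V8 dependency, at the cost of breaking if the script's layout changes.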
This is the searchlist (which is, indeed, empty):
httr::VERB(
verb = "POST", url = "https://www.ecan.govt.nz/data/document-library/searchlist",
httr::add_headers(
Referer = "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1",
`X-Requested-With` = "XMLHttpRequest",
TE = "Trailers"
), httr::set_cookies(
monsido = monsido_token
),
body = list(
name = "CRC000002.1",
pageSize = "999999"
),
encode = "form"
) -> res
httr::content(res)
## NULL ## <<=== this is OK as there is no response
This is the "Consent Overview" section:
httr::GET(
url = "https://www.ecan.govt.nz/data/consent-search/consentoverview/CRC000002.1",
httr::add_headers(
Referer = "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1",
Authority = "www.ecan.govt.nz",
`X-Requested-With` = "XMLHttpRequest"
),
httr::set_cookies(
monsido = monsido_token
)
) -> res
httr::content(res) %>%
html_table() %>%
glimpse()
## List of 1
## $ :'data.frame': 5 obs. of 4 variables:
## ..$ X1: chr [1:5] "RMA Authorisation Number" "Consent Location" "To" "Commencement Date" ...
## ..$ X2: chr [1:5] "CRC000002.1" "Manuka Creek, KILLERMONT STATION" "To take water from Manuka Creek at or about map reference NZMS 260 H39:5588-2366 for irrigation of up to 40.8 hectares." "29 Apr 2010" ...
## ..$ X3: chr [1:5] "Client Name" "State" "To take water from Manuka Creek at or about map reference NZMS 260 H39:5588-2366 for irrigation of up to 40.8 hectares." "29 Apr 2010" ...
## ..$ X4: chr [1:5] "Killermont Station Limited" "Issued - Active" "To take water from Manuka Creek at or about map reference NZMS 260 H39:5588-2366 for irrigation of up to 40.8 hectares." "29 Apr 2010" ...
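The overview table comes back as label/value pairs spread over four columns (labels in X1 and X3, values in X2 and X4). One possible way to turn that into a lookup is sketched below. The data frame is an abbreviated stand-in built from the first two rows of the output above, not a live response, and the name `fields` is just for illustration.

```r
# Abbreviated stand-in for html_table()'s result (first two rows of the output shown above).
overview <- data.frame(
  X1 = c("RMA Authorisation Number", "Consent Location"),
  X2 = c("CRC000002.1", "Manuka Creek, KILLERMONT STATION"),
  X3 = c("Client Name", "State"),
  X4 = c("Killermont Station Limited", "Issued - Active"),
  stringsAsFactors = FALSE
)

# Stack both label/value column pairs into a single named character vector.
fields <- c(
  setNames(overview$X2, overview$X1),
  setNames(overview$X4, overview$X3)
)

# Look a field up by its label.
print(fields[["Client Name"]])
```

With a real response you would start from `html_table(httr::content(res))[[1]]` instead of the hand-built data frame.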
Here is the "Consent Conditions" section:
httr::GET(
url = "https://www.ecan.govt.nz/data/consent-search/consentconditions/CRC000002.1",
httr::add_headers(
Referer = "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1",
Authority = "www.ecan.govt.nz",
`X-Requested-With` = "XMLHttpRequest"
),
httr::set_cookies(
monsido = monsido_token
)
) -> res
httr::content(res) %>%
as.character() %>%
substring(1, 300) %>%
cat()
## <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
## <html><body><div class="consentDetails">
## <ul class="unstyled-list">
## <li>
##
##
## <strong class="pull-left">1</strong> <div class="pad-left1">The rate at which wa
This is the "Consent Related" section:
httr::GET(
url = "https://www.ecan.govt.nz/data/consent-search/consentrelated/CRC000002.1",
httr::add_headers(
Referer = "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1",
Authority = "www.ecan.govt.nz",
`X-Requested-With` = "XMLHttpRequest"
),
httr::set_cookies(
monsido = monsido_token
)
) -> res
httr::content(res) %>%
as.character() %>%
substring(1, 300) %>%
cat()
## <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
## <html><body>
## <p>There are no related documents.</p>
##
##
##
##
##
## <div class="summary-table-wrapper">
## <table class="summary-table left">
## <thead><tr>
## <th>Relationship</th>
## <th>Recor
This is the "Workflow" section:
httr::GET(
url = "https://www.ecan.govt.nz/data/consent-search/consentworkflow/CRC000002.1",
httr::add_headers(
Referer = "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1",
Authority = "www.ecan.govt.nz",
`X-Requested-With` = "XMLHttpRequest"
),
httr::set_cookies(
monsido = monsido_token
)
) -> res
httr::content(res)
## {xml_document}
## <html>
## [1] <body><p>No workflow</p></body>
Here is the "Consent Flow Restrictions" section:
httr::GET(
url = "https://www.ecan.govt.nz/data/consent-search/consentflowrestrictions/CRC000002.1",
httr::add_headers(
Referer = "https://www.ecan.govt.nz/data/consent-search/consentdetails/CRC000002.1",
Authority = "www.ecan.govt.nz",
`X-Requested-With` = "XMLHttpRequest"
),
httr::set_cookies(
monsido = monsido_token
)
) -> res
httr::content(res) %>%
as.character() %>%
substring(1, 300) %>%
cat()
## <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
## <html><body><div class="summary-table-wrapper">
## <table class="summary-table left">
## <thead>
## <th colspan="2">Low Flow Site</th>
## <th>Todays Flow <span class="lower">(m3/s)</span>
## </th>
You still need to parse the HTML, but now you can do it all with plain R packages.
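Since you have a large number of consents, the repeated GET calls above could be folded into a small helper. This is only a sketch: `consent_url()` and `fetch_section()` are hypothetical names, and the URL pattern is inferred from the requests shown above, so it may not hold for every section or consent.

```r
# Build the URL for one section of one consent (pattern inferred from the calls above).
consent_url <- function(section, consent_id) {
  sprintf("https://www.ecan.govt.nz/data/consent-search/%s/%s", section, consent_id)
}

# Fetch one section, sending the same headers and cookie as the individual calls above.
# Assumes `token` was obtained as in the cookie-priming step.
fetch_section <- function(section, consent_id, token) {
  res <- httr::GET(
    url = consent_url(section, consent_id),
    httr::add_headers(
      Referer = consent_url("consentdetails", consent_id),
      Authority = "www.ecan.govt.nz",
      `X-Requested-With` = "XMLHttpRequest"
    ),
    httr::set_cookies(monsido = token)
  )
  httr::stop_for_status(res)  # fail loudly on HTTP errors
  httr::content(res)
}

# Example usage (not run here because it needs network access and a live token):
# overview <- fetch_section("consentoverview", "CRC000002.1", monsido_token)

print(consent_url("consentoverview", "CRC000002.1"))
```

If you loop this over many consent numbers, it would be polite to add a pause such as Sys.sleep() between requests.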