Why can't I read clickable links for webscraping with rvest?

Question

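I am trying to scrape this website.

The content I need appears after clicking on each title. For example, if I do this (I am using SelectorGadget to find the selector), I get the content I want: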


library("rvest")

url_boe ="https://www.bankofengland.co.uk/speech/2021/june/andrew-bailey-reuters-events-global-responsible-business-2021"

sample_text = html_text(html_nodes(read_html(url_boe), "#output .page-section"))

However, I need the text behind every clickable link on the site. So I would normally do this:


url_boe = "https://www.bankofengland.co.uk/news/speeches"


html_attr(html_nodes(read_html(url_boe), "#SearchResults .exclude-navigation"), name = "href")

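However, I get an empty object. I have tried different variations of the code, with the same result.

How can I read those links and then apply the code from the first part to all of them?

Can anyone help me?

Thanks!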

Answer 1

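As @KonradRudolph pointed out earlier, the links are inserted into the page dynamically, so they are not in the HTML that read_html() downloads. I therefore wrote some code that uses RSelenium together with rvest to solve the problem: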

library(rvest)
library(RSelenium)

# URL
url = "https://www.bankofengland.co.uk/news/speeches"

# Base URL
base_url = "https://www.bankofengland.co.uk"

# Start a Selenium server and a Chrome client
# (chromever should match the Chrome version installed locally)
rD <- rsDriver(browser=c("chrome"), chromever="91.0.4472.19")

# Assign the client to an object
rem_dr <- rD[["client"]]

# Navigate to the URL
rem_dr$navigate(url)

# Get page HTML
page <- read_html(rem_dr$getPageSource()[[1]])

# Extract links and concatenate them with the base_url
links <- page %>%
  html_nodes(".release-speech") %>%
  html_attr('href') %>%
  paste0(base_url, .)

# Get the link names (speech titles)
links_names <- page %>%
  html_nodes('#SearchResults .exclude-navigation') %>%
  html_text()

# Keep only even results to deduplicate
links_names <- links_names[c(FALSE, TRUE)]

# Create a data.frame with the results
df <- data.frame(links_names, links)

# Close the client and the server
rem_dr$close()
rD$server$stop()

The resulting data.frame looks like this:

> head(df)
                                                                                         links_names
1                           Stablecoins: What’s old is new again - speech by Christina Segal-Knowles
2                       Tackling climate for real: progress and next steps - speech by Andrew Bailey
3                     Tackling climate for real: the role of central banks - speech by Andrew Bailey
4 What are government bond yields telling us about the economic outlook? - speech by Gertjan Vlieghe
5                              Responsible openness in the Insurance Sector - speech by Anna Sweeney
6                           Cyber Risk: 2015 to 2027 and the Penrose steps - speech by Lyndon Nelson
                                                                                                                       links
1 https://www.bankofengland.co.uk/speech/2021/june/christina-segal-knowles-speech-at-the-westminster-eforum-poicy-conference
2           https://www.bankofengland.co.uk/speech/2021/june/andrew-bailey-bis-bank-of-france-imf-ngfs-green-swan-conference
3             https://www.bankofengland.co.uk/speech/2021/june/andrew-bailey-reuters-events-global-responsible-business-2021
4   https://www.bankofengland.co.uk/speech/2021/may/gertjan-vlieghe-speech-hosted-by-the-department-of-economics-and-the-ipr
5         https://www.bankofengland.co.uk/speech/2021/may/anna-sweeney-association-of-british-insurers-prudential-regulation
6     https://www.bankofengland.co.uk/speech/2021/may/lyndon-nelson-the-8th-operational-resilience-and-cyber-security-summit
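The question also asks how to apply the first snippet to every link. A minimal sketch of that step follows, assuming the individual speech pages are static (the question suggests they are, since plain rvest works on one of them). The helper name extract_speech_text and the inline demo page are hypothetical and not part of the original answer; the selector is the one from the question.

```r
library(rvest)

# Hypothetical helper: pull the speech text out of an already-parsed page,
# using the selector from the question.
extract_speech_text <- function(page, selector = "#output .page-section") {
  page %>%
    html_nodes(selector) %>%
    html_text(trim = TRUE) %>%
    paste(collapse = "\n")
}

# Offline stand-in for a speech page (made-up markup), so the helper can be
# tried without network access.
demo_page <- read_html('<html><body>
  <div id="output">
    <div class="page-section">First section.</div>
    <div class="page-section">Second section.</div>
  </div>
</body></html>')

demo_text <- extract_speech_text(demo_page)

# Applying it to every link collected in df (requires network access):
# df$text <- vapply(as.character(df$links), function(u) {
#   Sys.sleep(1)  # pause between requests to go easy on the server
#   extract_speech_text(read_html(u))
# }, character(1))
```

Separating the parsing from the download keeps the helper easy to test on a local snippet before running it against the real site.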

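To see why the rvest-only call in the question returns an empty object, the sketch below compares two made-up HTML strings: one standing in for what read_html() downloads (the results container is still empty), and one standing in for the page source after a browser has run the page's JavaScript. The markup and the example href are simplified assumptions, not the actual Bank of England page.

```r
library(rvest)

# Stand-in for the raw download: the container is empty because a script
# fills it in later, inside a browser.
static_html <- '<html><body>
  <div id="SearchResults"></div>
  <script>/* links are added here by JavaScript at run time */</script>
</body></html>'

# Stand-in for the page source after a browser has run the script.
rendered_html <- '<html><body>
  <div id="SearchResults">
    <a class="release-speech" href="/speech/2021/june/example-speech">Example speech</a>
  </div>
</body></html>'

# Extract the href of every matching link from an HTML string.
get_links <- function(html) {
  read_html(html) %>%
    html_nodes("#SearchResults .release-speech") %>%
    html_attr("href")
}

static_links <- get_links(static_html)      # nothing to match in the raw HTML
rendered_links <- get_links(rendered_html)  # the link is present after rendering
```

This is the gap that RSelenium closes: getPageSource() returns the HTML after the browser has executed the scripts, which rvest can then parse as usual.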
r

web-scraping

rvest