Why can't I read clickable links for web scraping with rvest?
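I am trying to scrape this website.
The content I need appears after clicking each title. For example, with a selector found using SelectorGadget, I can get the text I want from a single page like this: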
library("rvest")
url_boe = "https://www.bankofengland.co.uk/speech/2021/june/andrew-bailey-reuters-events-global-responsible-business-2021"
sample_text = html_text(html_nodes(read_html(url_boe), "#output .page-section"))
However, I need the text behind every clickable link on the site, so I would normally do this:
url_boe = "https://www.bankofengland.co.uk/news/speeches"
html_attr(html_nodes(read_html(url_boe), "#SearchResults .exclude-navigation"), name = "href")
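But I get an empty object. I tried several variations of the code, with the same result.
How can I read those links and then apply the code from the first part to all of them?
Can anyone help?
Thanks!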
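As @KonradRudolph pointed out earlier, the links are inserted into the page dynamically, so read_html() on its own never sees them. I therefore used RSelenium together with rvest to write code that solves this: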
library(rvest)
library(RSelenium)
# URL
url = "https://www.bankofengland.co.uk/news/speeches"
# Base URL
base_url = "https://www.bankofengland.co.uk"
# Instantiate a Selenium server; chromever should be a chromedriver
# version compatible with the Chrome installed on your machine
rD <- rsDriver(browser=c("chrome"), chromever="91.0.4472.19")
# Assign the client to an object
rem_dr <- rD[["client"]]
# Navigate to the URL
rem_dr$navigate(url)
# Get page HTML
page <- read_html(rem_dr$getPageSource()[[1]])
# Extract links and concatenate them with the base_url
links <- page %>%
  html_nodes(".release-speech") %>%
  html_attr('href') %>%
  paste0(base_url, .)
# Get links names
links_names <- page %>%
  html_nodes('#SearchResults .exclude-navigation') %>%
  html_text()
# Keep only even results to deduplicate
links_names <- links_names[c(FALSE, TRUE)]
# Create a data.frame with the results
df <- data.frame(links_names, links)
# Close the client and the server
rem_dr$close()
rD$server$stop()
The resulting data.frame looks like this:
> head(df)
links_names
1 Stablecoins: What’s old is new again - speech by Christina Segal-Knowles
2 Tackling climate for real: progress and next steps - speech by Andrew Bailey
3 Tackling climate for real: the role of central banks - speech by Andrew Bailey
4 What are government bond yields telling us about the economic outlook? - speech by Gertjan Vlieghe
5 Responsible openness in the Insurance Sector - speech by Anna Sweeney
6 Cyber Risk: 2015 to 2027 and the Penrose steps - speech by Lyndon Nelson
links
1 https://www.bankofengland.co.uk/speech/2021/june/christina-segal-knowles-speech-at-the-westminster-eforum-poicy-conference
2 https://www.bankofengland.co.uk/speech/2021/june/andrew-bailey-bis-bank-of-france-imf-ngfs-green-swan-conference
3 https://www.bankofengland.co.uk/speech/2021/june/andrew-bailey-reuters-events-global-responsible-business-2021
4 https://www.bankofengland.co.uk/speech/2021/may/gertjan-vlieghe-speech-hosted-by-the-department-of-economics-and-the-ipr
5 https://www.bankofengland.co.uk/speech/2021/may/anna-sweeney-association-of-british-insurers-prudential-regulation
6 https://www.bankofengland.co.uk/speech/2021/may/lyndon-nelson-the-8th-operational-resilience-and-cyber-security-summit
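The question also asked how to apply the single-page code to every link. A minimal sketch of one way to do that follows, assuming each speech page can still be read with plain read_html() and the "#output .page-section" selector from the question. The helper names extract_speech_text and scrape_speech are hypothetical, not part of rvest, and the inline HTML is made up purely so the helper can be tried without network access.

```r
library(rvest)

# Hypothetical helper: pull the speech text out of one parsed page,
# using the selector from the question.
extract_speech_text <- function(page, selector = "#output .page-section") {
  nodes <- html_nodes(page, selector)
  # Collapse all matching sections into a single string per page
  paste(html_text(nodes, trim = TRUE), collapse = "\n")
}

# Hypothetical helper: download one URL and return its text, or NA on failure
scrape_speech <- function(link) {
  tryCatch(
    extract_speech_text(read_html(link)),
    error = function(e) NA_character_
  )
}

# Offline check on a small inline HTML fragment (invented for illustration)
demo_page <- read_html(
  '<div id="output"><div class="page-section"> First part </div>
   <div class="page-section">Second part</div></div>'
)
demo_text <- extract_speech_text(demo_page)

# Applying it to the links collected above (requires network access):
# df$text <- vapply(df$links, scrape_speech, character(1), USE.NAMES = FALSE)
```

Wrapping the download in tryCatch() means one broken link yields NA instead of aborting the whole run. If you scrape many pages, a short Sys.sleep() between requests is probably a good idea.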