Why is html_nodes() in R not giving me the desired output for this webpage?

Question

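I want to extract all the links for each episode on this webpage, but I am having difficulty with html_nodes() that I have not run into before. I am trying to build the selector using "." so that all the attributes on the page are obtained through that CSS selector. The code is meant to output all the attributes, but instead I get {xml_nodeset (0)}. I know how to pull the links out once I have the attributes, but this step is proving to be a stumbling block for this website.

Here is the code I have started with in R: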

episode_list_page_1 <- "https://jrelibrary.com/episode-list/"

episode_list_page_1 %>%
  read_html() %>%
  html_node("body") %>%
  html_nodes(".type-text svelte-fugjkr first-mobile first-desktop") %>%
  html_attrs()

Answer 1

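rvest alone does not work here because the page uses JavaScript to insert another web page into an iframe on this page to display the information. read_html() only downloads the static HTML and never runs scripts, so the nodes you see in the browser are not in the document it parses.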
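To illustrate the point, here is a minimal offline sketch. The HTML string is a made-up stand-in for a JavaScript-driven page, not the real site's markup, and the class names and the `a.ep` link are hypothetical. It also shows a second issue that appears to affect the selector in the question: a space in a CSS selector is a descendant combinator, so classes on the same element have to be chained with dots.

```r
library(rvest)

# Made-up stand-in for a JavaScript-driven page (NOT the real site's markup)
html <- '<html><body>
  <div class="type-text first-mobile">static cell</div>
  <div id="chart"></div>
  <script>
    var a = document.createElement("a");
    a.className = "ep";
    a.href = "/episode-1";
    document.getElementById("chart").appendChild(a);
  </script>
</body></html>'

doc <- read_html(html)

# The script is never executed, so the link it would create is absent
dynamic <- html_nodes(doc, "a.ep")
length(dynamic)   # 0

# A space means "descendant": this looks for a <first-mobile> element
# nested inside an element with class "type-text"
wrong <- html_nodes(doc, ".type-text first-mobile")
length(wrong)     # 0

# Chaining with dots matches one element that carries both classes
right <- html_nodes(doc, ".type-text.first-mobile")
length(right)     # 1
```

On the real page, fixing the selector alone would presumably still return an empty node set, because the table is drawn inside the iframe by JavaScript.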

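If you search the inline scripts, you will find a reference to this page: "https://datawrapper.dwcdn.net/eoqPA/66/", which redirects to "https://datawrapper.dwcdn.net/eoqPA/67/". That second page contains the data you are looking for, embedded as JSON and rendered by JavaScript.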
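One way to locate such a reference programmatically is sketched below. The helper name `find_datawrapper_urls` is hypothetical and so is the sample markup; how the real page embeds the URL (iframe `src`, inline script, or something else) is an assumption, so the sketch checks both iframe sources and inline script text.

```r
library(rvest)
library(stringr)

# Hypothetical helper: collect Datawrapper embed URLs from a parsed page
find_datawrapper_urls <- function(doc) {
  iframe_src  <- html_attr(html_nodes(doc, "iframe"), "src")
  script_text <- html_text(html_nodes(doc, "script"))

  # Search both places, ignoring iframes that have no src attribute
  haystack <- c(iframe_src, script_text)
  haystack <- haystack[!is.na(haystack)]

  pattern <- "https://datawrapper\\.dwcdn\\.net/[A-Za-z0-9]+/[0-9]+/"
  unique(unlist(str_extract_all(haystack, pattern)))
}

# Made-up sample document; on the real site you would pass
# read_html("https://jrelibrary.com/episode-list/") instead
sample_doc <- read_html('<html><body>
  <script>var embed = "https://datawrapper.dwcdn.net/eoqPA/66/";</script>
</body></html>')

urls <- find_datawrapper_urls(sample_doc)
urls
```

Passing the URL it returns to read_html() follows the redirect, so you end up on the page that actually holds the data.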

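The links to the shows can be extracted, and there is also a link to a Google Sheets document that holds the complete index.

Searching this page turns up the links to the Google document: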

library(rvest)
library(dplyr)
library(stringr)

page2 <- read_html("https://datawrapper.dwcdn.net/eoqPA/67/")

# Find all of the links on the page:
str_extract_all(html_text(page2), 'https:.*?\"')

# Isolate the Google Docs links:
print(str_extract_all(html_text(page2), 'https://docs.*?\"'))
#[[1]]
#[1] "https://docs.google.com/spreadsheets/d/12iTobpwHViCIANFSX3Pc_dGMdfod-0w3I5P5QJL45L8/edit?usp=sharing"                                                
#[2] "https://docs.google.com/spreadsheets/d/12iTobpwHViCIANFSX3Pc_dGMdfod-0w3I5P5QJL45L8/export?format=csv&id=12iTobpwHViCIANFSX3Pc_dGMdfod-0w3I5P5QJL45L8"
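As a possible next step, the second link is a CSV export, so the full index could be read straight into a data frame. This is only a sketch: it assumes the spreadsheet is still shared publicly, the helper `read_episode_index` is a hypothetical name, and the column names of the sheet are not known here, so inspect the result before relying on it.

```r
library(stringr)

# The two links printed above, copied verbatim
docs_links <- c(
  "https://docs.google.com/spreadsheets/d/12iTobpwHViCIANFSX3Pc_dGMdfod-0w3I5P5QJL45L8/edit?usp=sharing",
  "https://docs.google.com/spreadsheets/d/12iTobpwHViCIANFSX3Pc_dGMdfod-0w3I5P5QJL45L8/export?format=csv&id=12iTobpwHViCIANFSX3Pc_dGMdfod-0w3I5P5QJL45L8"
)

# Keep only the CSV export link
csv_url <- docs_links[str_detect(docs_links, fixed("export?format=csv"))]

# Hypothetical helper; nothing is downloaded until it is called
read_episode_index <- function(url) {
  read.csv(url, stringsAsFactors = FALSE)
}

# Uncomment to download (requires network access and a public sheet):
# episodes <- read_episode_index(csv_url)
# str(episodes)
```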


screen-scraping

r

web-scraping

rvest