4chan: Can't find node with xml_find_all and rvest

Question

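I'm trying to harvest the infinite wisdom of 4chan, but I'm having trouble using rvest with xml2. I'm also more used to BS4 in Python, so forgive me if this is obvious.

Here I'm trying to capture the title of a thread: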

library(rvest)  # html_nodes(), html_text(), %>%
library(xml2)   # xml_find_all()

soup <- read_html('https://boards.4chan.org/pol/catalog')

soup %>% html_nodes('body') %>% 
  xml_find_all(".//id[contains(@class, 'teaser')]") %>% 
  html_text()

See attached. I think I've pointed the code in the right direction, but I get 'character(0)' as the output.

Any help is appreciated.

Best

Answer 1

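The page seems to be loaded dynamically, so rvest alone is not enough: read_html() only sees the HTML the server sends, not what JavaScript adds afterwards. You need something that drives a real browser, such as RSelenium.

For example, this code seems to work: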

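library(rvest)  # provides %>%, html_nodes() and html_text()

# start a Selenium server and open a Firefox session
rD <- RSelenium::rsDriver(browser="firefox")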
remDr <- rD[["client"]]
remDr$navigate("https://boards.4chan.org/pol/catalog")

# scroll down a bit and wait a few seconds to make sure the page has loaded
remDr$executeScript("window.scrollTo(0, 10000);")
Sys.sleep(5)

# fetch the html code
soup <- remDr$getPageSource()
soup <- xml2::read_html(soup[[1]])

# obtain the titles
thread_titles <- soup %>% html_nodes("#threads div.teaser > b") %>% 
  html_text()

# exit
remDr$close()
gc()
rD$server$stop()
# Windows only: kill any Java process left behind by the Selenium server
system("taskkill /im java.exe /f", intern=FALSE, ignore.stdout=FALSE)

The vector thread_titles then contains 127 titles.
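If you want to check for yourself whether a page is filled in by JavaScript before reaching for a browser driver, one rough approach is to count the target nodes in the HTML that read_html() receives. The sketch below does not touch the network: static_html and rendered_html are made-up stand-ins for "what the server sends" and "what the browser ends up with", and count_teasers() is a helper name I invented for illustration. The real markup on the catalog page may well differ.

```r
library(rvest)  # re-exports read_html() and %>%

# Hypothetical markup: what a server might send before any JavaScript runs.
static_html <- '<html><body><div id="threads"></div>
<script>/* thread data would be injected here by JavaScript */</script>
</body></html>'

# Hypothetical markup: the same page after the browser has rendered it.
rendered_html <- '<html><body><div id="threads">
<div class="thread"><div class="teaser"><b>First title</b>: some text</div></div>
<div class="thread"><div class="teaser"><b>Second title</b>: more text</div></div>
</div></body></html>'

# Count how many teaser nodes a given HTML string contains.
count_teasers <- function(html) {
  read_html(html) %>% html_nodes("#threads div.teaser") %>% length()
}

static_count <- count_teasers(static_html)      # nodes visible to rvest alone
rendered_count <- count_teasers(rendered_html)  # nodes visible after rendering
```

A count of zero on the static HTML combined with visible content in the browser is a fairly good hint that the page is built client-side, which would also explain the 'character(0)' from the question.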

I hope it's fine that I used a CSS selector (#threads div.teaser > b) instead of XPath?
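In case you would rather stay with XPath, here is a minimal sketch of a roughly equivalent expression, run against made-up inline HTML (the html string is hypothetical and only mimics the structure the CSS selector implies). It also shows one likely reason the XPath from the question matches nothing even on rendered HTML: in `.//id[...]`, `id` is read as an element name, so it looks for a literal <id> tag.

```r
library(rvest)  # html_nodes(), html_text(), read_html(), %>%
library(xml2)   # xml_find_all()

# Hypothetical markup standing in for the rendered catalog page.
html <- '<html><body><div id="threads">
<div class="thread"><div class="teaser"><b>First title</b>: some text</div></div>
<div class="thread"><div class="teaser"><b>Second title</b>: more text</div></div>
</div></body></html>'
page <- read_html(html)

# CSS selector from the answer.
css_titles <- page %>%
  html_nodes("#threads div.teaser > b") %>%
  html_text()

# Roughly equivalent XPath: div is the element, the class goes in the predicate.
xpath_titles <- page %>%
  xml_find_all("//*[@id='threads']//div[contains(@class, 'teaser')]/b") %>%
  html_text()

# XPath from the question: searches for elements named <id>, so nothing matches.
question_titles <- page %>%
  html_nodes("body") %>%
  xml_find_all(".//id[contains(@class, 'teaser')]") %>%
  html_text()
```

Note that contains(@class, 'teaser') is a substring match, so it is slightly looser than the CSS class selector; for this markup both return the same two titles.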

r

web-scraping

rvest