Trouble scraping data using rvest

Question

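Please, I am trying to scrape data from the Google News website. I want to extract the keywords of the trending topics on the site, using the rvest and dplyr packages together with the SelectorGadget extension for Google Chrome. Here is my code: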

library(rvest)
library(dplyr)
google.news<-read_html("https://news.google.com/topstories?hl=en-NG&gl=NG&ceid=NG:en")
google.news %>%
+html_nodes(".boy4he") %>%
+html_text()

But after running the code, I get the following error message:

google.news<-read_html("https://news.google.com/topstories?hl=en-NG&gl=NG&ceid=NG:en")
> google.news %>%
+ +html_nodes(".boy4he") %>%
+ +html_text()
Error in UseMethod("xml_find_all") : 
  no applicable method for 'xml_find_all' applied to an object of class "character"

What could be the problem? I would appreciate any input or suggestions. Thank you.

Answer 1

This works:

library(rvest)
library(dplyr)
google.news<-read_html("https://news.google.com/topstories?hl=en-NG&gl=NG&ceid=NG:en")

google.news %>%
  html_nodes(css = ".boy4he") %>%
  html_attr("aria-label")

 [1] "Godwin Obaseki"            "Abdullahi Umar Ganduje"    "Sanusi Lamido Sanusi"      "Zamfara"                  
 [5] "All Progressives Congress" "Dangote Group"             "Kano"                      "Senate of Nigeria"        
 [9] "Aliko Dangote"             "Muhammadu Buhari"
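
As for the error message in the question, it most likely comes from the leading `+` characters, which look like console continuation prompts that were pasted into the script. Unary `+` binds more tightly than `%>%`, so the pipe ends up evaluating `html_nodes(".boy4he")` with the selector string as the document argument, which would explain the complaint about an object of class "character". Below is a minimal offline sketch of that behaviour. The one-link HTML string is a hypothetical stand-in for the live page, and the exact error text may differ between rvest versions.

```r
library(rvest)

# Hypothetical stand-in for the Google News page, so no network access is needed.
page <- read_html('<html><body><a class="boy4he" aria-label="Kano"></a></body></html>')

# With the stray "+", html_nodes() is called as html_nodes(".boy4he"),
# i.e. the selector string is passed where the document should be.
with_plus <- tryCatch(
  page %>% +html_nodes(".boy4he"),
  error = function(e) e
)
print(inherits(with_plus, "error"))

# Without the "+", the page is piped in as the document and the selector is applied to it.
without_plus <- page %>% html_nodes(".boy4he")
print(length(without_plus))
```

Removing the `+` at the start of each continued line, as in the code above, should be enough to get past the error. The choice between `html_text()` and `html_attr()` is a separate matter.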

The values are "hidden" in the HTML attribute "aria-label":

<a class="boy4he" href="./topics/CAAqJQgKIh9DQkFTRVFvTEwyMHZNREV5YlRKa2RHd1NBbVZ1S0FBUAE?hl=en-NG&amp;gl=NG&amp;ceid=NG%3Aen" aria-label="Abdullahi Umar Ganduje"></a>
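
To see why `html_attr("aria-label")` is needed rather than `html_text()`, here is a small sketch that can be run offline. It assumes a hypothetical snippet modelled on the element above, with the `href` replaced by a placeholder. The anchor has no text between its tags, so only the attribute carries the topic name.

```r
library(rvest)

# Hypothetical snippet modelled on the element above; the href is a placeholder.
snippet <- read_html(
  '<html><body><a class="boy4he" href="./topics/example" aria-label="Abdullahi Umar Ganduje"></a></body></html>'
)

links <- snippet %>% html_nodes(css = ".boy4he")

# The <a> element has no text content, so html_text() returns an empty string.
link_text <- links %>% html_text()

# The topic name is stored in the aria-label attribute instead.
link_label <- links %>% html_attr("aria-label")

print(link_text)
print(link_label)
```

The same check can be done on the live page by comparing `html_text()` and `html_attr("aria-label")` on the selected nodes, keeping in mind that Google News class names such as `boy4he` may change over time.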
