How can I scrape the title from a class with rvest?

Question

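I want to scrape the names of all the smartphones from a Mexican retailer's web page.

I don't understand why my code doesn't work, since I have already done this for a similar web page. Apparently rvest is not "reading" the "class" in the HTML code.

Using the SelectorGadget tool, I found that the smartphone names are in a class called ".name", so I tried this: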

library(rvest)

url <- 'https://www.chedraui.com.mx/Departamentos/Tecnolog%C3%ADa/Telefon%C3%ADa/Celular/c/MC230202?siteName=Sitio+de+Chedraui&isAlcoholRestricted=false'
web <- read_html(url)

web %>%
  html_nodes('.name') %>%
  html_text()

But the result is: ''''

The expected result is a vector containing all the smartphone names.

Answer 1

Inspect the response and you will see that the information sits under a different class:

library(rvest)
page <- read_html("https://www.chedraui.com.mx/Departamentos/Tecnolog%C3%ADa/Telefon%C3%ADa/Celular/c/MC230202?siteName=Sitio+de+Chedraui&isAlcoholRestricted=false")
# The product names are stored in the title attribute of these elements
titles <- page %>%
  html_nodes('.product__list--thumb') %>%
  html_attr("title")
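
To see the difference between reading an element's text and reading one of its attributes without hitting the live site, here is a minimal offline sketch. The HTML string is a made-up stand-in, not the real page: the class names mirror the ones discussed in this thread, but the product names and the empty `.name` element are assumptions for illustration only.

```r
library(rvest)

# Hypothetical markup, NOT copied from the real page
html <- '
<html><body>
  <a class="product__list--thumb" title="Phone A 64GB" href="/p/1"><img src="a.jpg"></a>
  <a class="product__list--thumb" title="Phone B 128GB" href="/p/2"><img src="b.jpg"></a>
  <div class="name"></div>
</body></html>'

page <- read_html(html)

# The .name element exists but holds no text, so html_text() gives an empty string
name_text <- page %>%
  html_nodes(".name") %>%
  html_text()

# The names are in the title attribute, so html_attr() is the right accessor
titles <- page %>%
  html_nodes(".product__list--thumb") %>%
  html_attr("title")

print(name_text)
print(titles)
```

If a selector matches but `html_text()` comes back empty, checking the attributes of nearby elements in this way is a quick test of whether the value is stored as an attribute rather than as text.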

Answer 2

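To find which class your HTML text belongs to without interacting with the page structure, i.e. without using 'Inspect' on the web page, you can use a CSS selector to search for the text and then use xml2::xml_attrs() to access the class it belongs to.

Here is an example using 'Huawei' as the text, since it appears in one of your titles.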

library(rvest)

"https://www.chedraui.com.mx/Departamentos/Tecnolog%C3%ADa/Telefon%C3%ADa/Celular/c/MC230202?siteName=Sitio+de+Chedraui&isAlcoholRestricted=false" %>%
  read_html() %>%
  html_nodes(":contains('Huawei')") %>% # search for the string; note the single quotes nested inside double quotes
  xml2::xml_attrs() %>% # list all attributes of every element containing the string
  purrr::map("class") %>% # pull just the 'class' attribute from each list entry
  unlist() %>%
  unique()

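You can also search for part of a string, e.g. by replacing html_nodes(":contains('Huawei')") with html_nodes("*:contains('Huaw')"). Here * is the universal selector, which matches any element, and :contains() matches substrings.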
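
One thing worth knowing about this approach: `:contains()` matches every element whose text, including the text of its descendants, contains the string, so ancestors such as the enclosing list show up too. The sketch below demonstrates this on a made-up HTML fragment (the markup and class names are hypothetical, not taken from the real page). It uses `html_attr("class")` and drops the `NA` values instead of `xml2::xml_attrs()` plus `purrr::map()`, which is a simpler variant of the same idea rather than the code from the answer.

```r
library(rvest)

# Hypothetical markup for illustration only
html <- '
<html><body>
  <ul class="product-list">
    <li class="product__list--item"><a class="product__list--name" href="/p/1">Huawei Y9</a></li>
    <li class="product__list--item"><a class="product__list--name" href="/p/2">Other Phone</a></li>
  </ul>
</body></html>'

page <- read_html(html)

# Every element whose text (descendants included) contains "Huawei"
nodes <- page %>% html_nodes(":contains('Huawei')")

# html_attr() returns NA for elements without a class (here <html> and <body>)
classes <- nodes %>% html_attr("class")
classes <- unique(classes[!is.na(classes)])
print(classes)

# Restricting the element type and using a partial string narrows the match
link_text <- page %>%
  html_nodes("a:contains('Huaw')") %>%
  html_text()
print(link_text)
```

The classes come back in document order, outermost first, so the last entry is usually the most specific candidate to use as a selector.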


screen-scraping

r

web-scraping

rvest