data-anchor text - Web-scraping rvest question

Question

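I am trying to scrape this page: https://www.scielo.br/j/rcf/a/M6Ck7FmWQvm8nTCWkLBXLhp/?lang=pt

I need to scrape more pages similar to this one, but their layout is not identical. I can get the text with this XPath, //*[@id="articleText"]/div[1], but what I actually want is the text of the div with class="articleSection" whose data-anchor attribute is "Text".

The div index changes from link to link, but the data-anchor value "Text" does not.

I am including this image to provide some context:

R code: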

library(dplyr)
library(rvest)

article <- "https://www.scielo.br/j/rcf/a/h9fbHLPbwgRVymxmtxNhKJR/?lang=pt&format=html" # link

article_text <- article %>%
  rvest::read_html() %>% 
  rvest::html_node(xpath='//*[@id="articleText"]/div[1]') %>% # here I would like to scrape from data-anchor name "Text", inside the div Article Section
  rvest::html_text()

Answer 1

You can use an attribute=value CSS selector to match on the attribute:

library(magrittr)
library(rvest)

article <- "https://www.scielo.br/j/rcf/a/h9fbHLPbwgRVymxmtxNhKJR/?lang=pt&format=html" # link

article_text <- article %>%
  rvest::read_html() %>% 
  rvest::html_node('[data-anchor=Text]') %>% 
  rvest::html_text2()
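
Because the selector matches on the attribute rather than on the div's position, the same call should keep working when the div index differs between pages. Here is a minimal sketch of that idea, using inline HTML in place of the live pages. The markup, the `extract_text` helper, and the `References` and `Resumo` anchor values are hypothetical, made up only for illustration:

```r
library(rvest)

# Hypothetical markup imitating two article pages in which the target
# section sits at a different position among its sibling divs.
page_a <- minimal_html('
  <div id="articleText">
    <div class="articleSection" data-anchor="Text"><p>Body of article A</p></div>
    <div class="articleSection" data-anchor="References"><p>Refs A</p></div>
  </div>')
page_b <- minimal_html('
  <div id="articleText">
    <div class="articleSection" data-anchor="Resumo"><p>Abstract B</p></div>
    <div class="articleSection" data-anchor="Text"><p>Body of article B</p></div>
  </div>')

# Hypothetical helper: return the text of the first element whose
# data-anchor attribute equals "Text", wherever it sits in the page.
extract_text <- function(page) {
  page %>%
    html_element("[data-anchor=Text]") %>%
    html_text2()
}

# Apply the same extraction to every page.
texts <- vapply(list(page_a, page_b), extract_text, character(1))
```

With real links, each page would presumably be read with `read_html(url)` before being passed to the helper.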

Answer 2

I think this XPath will solve your problem:

//*[contains(@class,'articleSection') and @data-anchor='Text']
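
A minimal sketch of how this expression could be passed to rvest through the `xpath` argument of `html_element()`. The inline HTML is hypothetical and stands in for the real page:

```r
library(rvest)

# Hypothetical markup standing in for the article page.
page <- minimal_html('
  <div id="articleText">
    <div class="articleSection" data-anchor="Resumo"><p>Abstract</p></div>
    <div class="articleSection" data-anchor="Text"><p>Main body</p></div>
  </div>')

# Match any element whose class contains "articleSection" and whose
# data-anchor attribute is exactly "Text".
xp <- "//*[contains(@class,'articleSection') and @data-anchor='Text']"

section_text <- page %>%
  html_element(xpath = xp) %>%
  html_text2()
```

Compared with the plain attribute selector, this also requires the class, which may help if other elements on a page reuse the same data-anchor value.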

Tags: html, r, web-scraping, rvest