Using regular expressions in the `html_nodes` function of the rvest package

Question

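I am trying to write a function that uses the html_nodes function from the rvest package. My function takes the URL of the home page of any Medium (the blogging/publishing platform) blog. It should generate the links to every post/article on that particular Medium blog and save them in a list.

However, every Medium blog is designed differently, so the CSS that SelectorGadget generates differs too. Is there a way to use regular expressions, in particular the pipe ("|") symbol, to express the alternatives as an OR, so that my function can intelligently capture the links to every post/article on any given Medium blog?

My function is as follows: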

get_url_suffix <- function(url) {
  url_suffix <- read_html(url) %>%
    html_nodes(".u-borderLighter|.gc .bv") %>%
    html_attr("href") %>%
    as.data.frame()
  
  return(url_suffix)
}

.u-borderLighter and .gc .bv are two examples I encountered on Medium blogs whose links I intend to scrape (each one works when used on its own).

Thanks!

Answer 1

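html_nodes() takes CSS selectors (or XPath expressions), not regular expressions, so "|" is not valid there. In CSS, a comma-separated selector list acts as an OR, so in this case you should be able to use a selector like this: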

html_elements(".u-borderLighter, .gc .bv")

(Note that html_nodes() has been superseded by html_elements().)
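
To illustrate the comma-separated selector, here is a minimal sketch that runs against a small inline HTML snippet rather than a live Medium page. The markup, the hrefs, and the helper name `get_post_links` are assumptions made up for this example; only the two selectors come from the question.

```r
library(rvest)

# Hypothetical helper (not from the question): returns the href of every
# element matching either selector in an already parsed page.
get_post_links <- function(page, selector = ".u-borderLighter, .gc .bv") {
  page %>%
    html_elements(selector) %>%  # the comma means "match either selector"
    html_attr("href")
}

# Made-up markup standing in for two differently styled Medium blogs.
page <- minimal_html('
  <a class="u-borderLighter" href="/post-one">One</a>
  <div class="gc"><a class="bv" href="/post-two">Two</a></div>
  <a class="other" href="/about">About</a>
')

links <- get_post_links(page)
print(links)
```

For a real blog you would pass `read_html(url)` instead of the inline snippet. Elements matching neither selector (the "About" link here) are skipped, and more alternatives can be appended to the selector string with further commas.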
