R: scraping author names from the table of contents of a Sage journal

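I am trying to scrape the table of contents of a journal published by Sage. Scraping the titles and URLs is easy. Scraping the author names is trickier, though, probably because each name opens a pop-up containing a lot of information (affiliation, ORCID, etc.), and SelectorGadget does not seem able to separate it all out. After many attempts, this is the code I have: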

author_1 <- read_html("https://journals.sagepub.com/toc/ossa/40/1") %>%
  html_nodes('.all') %>%
  html_text(trim = TRUE)
author_1

which gives:

[1] "Andrew D. Brown Andrew D. BrownUniversity of Bath, UK View ORCID profileSee all articles by this author\nSearch Google Scholar\n for this author"
[2] "Peter Fleming Peter FlemingSee all articles by this author\nSearch Google Scholar\n for this author"
[3] "Mike Reed Mike ReedCardiff Business School, UKSee all articles by this author\nSearch Google Scholar\n for this author, Gibson Burrell Gibson BurrellUniversities of Leicester and Manchester, UK View ORCID profileSee all articles by this author\nSearch Google Scholar\n for this author"

and so on.

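The regex needed to clean this up is beyond my limited skills (especially since some articles, such as the third, have multiple authors). Any help would be greatly appreciated.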

You can identify the parent nodes and then map over them to return a list that keeps each article's authors together:

library(rvest)
library(purrr)

page <- read_html("https://journals.sagepub.com/toc/ossa/40/1")

page %>%
  html_nodes('div.tocAuthors') %>%
  map(~ html_nodes(.x, 'div.header a.entryAuthor') %>%
      html_text(trim = TRUE))

[[1]]
[1] "Andrew D. Brown"

[[2]]
[1] "Peter Fleming"

[[3]]
[1] "Mike Reed"      "Gibson Burrell"

...

Or, for a single author string per article:

page %>%
  html_nodes('div.tocAuthors') %>%
  map_chr(~ html_nodes(.x, 'div.header a.entryAuthor') %>%
         html_text(trim = TRUE) %>% toString)

 [1] "Andrew D. Brown"                              "Peter Fleming"                               
 [3] "Mike Reed, Gibson Burrell"                    "Joep Cornelissen"                            
 [5] "Silviya Svejenova"                            "Mélodie Cartel, Eva Boxenbaum, Franck Aggeri"
 [7] "Deborah N. Brewis"                            "Renate Ortlieb, Barbara Sieben"              
 [9] "Lynne Andersson, Dirk Lindebaum, Mar Pérezts" "Lidia Greco"                                 
[11] "Michael Rowlinson"                            "Andrew Crane"                                
[13] "Jean Jenkins"
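
To see why selecting the parent node first keeps co-authors grouped, here is a minimal offline sketch that runs the same selectors against a small inline HTML fragment. The markup in `toy` is hypothetical: it is a simplified guess at the structure the selectors above imply, not a copy of the real Sage page, which is more complex and may change.

```r
library(rvest)
library(purrr)

# Hypothetical, simplified markup: one div.tocAuthors per article,
# one a.entryAuthor per author. Not the real Sage HTML.
toy <- read_html('
<div class="tocAuthors">
  <div class="header"><a class="entryAuthor" href="#">Mike Reed</a></div>
  <div class="header"><a class="entryAuthor" href="#">Gibson Burrell</a></div>
</div>
<div class="tocAuthors">
  <div class="header"><a class="entryAuthor" href="#"> Peter Fleming </a></div>
</div>')

# Select each article's author container first, then look for author
# links only inside that container, so names never leak across articles.
authors <- toy %>%
  html_nodes("div.tocAuthors") %>%
  map_chr(~ html_nodes(.x, "div.header a.entryAuthor") %>%
            html_text(trim = TRUE) %>%
            toString())

authors
# [1] "Mike Reed, Gibson Burrell" "Peter Fleming"
```

Because the inner `html_nodes()` call is scoped to a single `div.tocAuthors`, no regex clean-up is needed. In rvest 1.0.0 and later, `html_nodes()` is superseded by `html_elements()`, which can be used the same way here.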