Iterating over a list column of xml nodesets with purrr without flattening the results

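Edit 2: Updated to address the problem with the dput output.

I don't know why the dput output doesn't work, so here is a roundabout way of sharing the data.

A simple zip file of the data can be downloaded here: link to zip file

The code below should reproduce the data I am trying to share. Note that you need to substitute the path to the downloaded zip folder, and that the parse_file function writes to a temporary directory: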

## Libraries
library(tidyverse)
library(rvest)
library(XML)
library(fs)  # provides path_ext_remove()

## list files
list_paths <- list.files(path = "path_to_downloaded_zip_folder", pattern = "\\.xml$", full.names = TRUE)
files <- list_paths %>% basename() %>% path_ext_remove()

## set up temporary junk folder
junk <- file.path(tempdir(), "junk")
dir.create(junk, showWarnings = FALSE)

## functions

parse_file <- function(file) {
  file <- gsub("<c", "<w", file)
  file <- gsub("</c", "</w", file)
  xml <- xmlParse(file)
  saveXML(xml,paste0(junk,"/parse.xml"))
  xml <- read_xml(paste0(junk,"/parse.xml"))
  content <- xml %>% xml_nodes("div[level='1']")
  ### remove junk xml file
  unlink(paste0(junk,"/parse.xml"))
  ## in case there is no div[level='1']
  if (length(content) == 0) {
    content <- xml %>%
      xml_nodes("div")
  }
  return(content)
}

parse_text <- function(content){
  ### get the text
  text<-content %>% xml_nodes("s") %>% html_text()
  return(text)
}


## create dat df
dat <- tibble(file = files, path = list_paths) %>%
  mutate(datasets = map(path, ~read_file(.)),
         content = map(datasets, ~parse_file(.))) %>%
  select(file, content)

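This leaves a data frame with a list column built from the xml data: the content column holds a list of xml nodeset objects.

I want to parse the nodesets in the content column into text, which can be done with the parse_text function above.

I can use purrr::map to iterate over the content column and get the text for each file...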

dat %>% mutate(text = map(content, ~ parse_text(.) %>% set_names(file)))

...and keep the file name, but this ends up flattening the nodesets in the content column and merging their text together.
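A minimal sketch of that flattening on a toy document. The XML string and the names doc, divs, flat, and nested are made up for illustration, and xml2's xml_find_all stands in for xml_nodes. The nodeset is indexed by position so the sketch does not depend on how purrr treats the xml_nodeset class:

```r
library(xml2)
library(purrr)

# Toy document standing in for one parsed file: two <div> nodes holding <s> sentences.
doc <- read_xml(
  "<text>
     <div level='1'><s>First A.</s><s>Second A.</s></div>
     <div level='1'><s>First B.</s></div>
   </text>"
)
divs <- xml_find_all(doc, "//div[@level='1']")

# Searching the whole nodeset at once pools the sentences of both divs.
flat <- xml_text(xml_find_all(divs, ".//s"))
length(flat)
# 3

# Visiting each div separately keeps one character vector per div.
nested <- map(seq_along(divs), ~ xml_text(xml_find_all(divs[[.x]], ".//s")))
lengths(nested)
# 2 1
```

The first call loses the boundary between the two divs, while the second keeps it, which is the shape wanted for the text column.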

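What I would really like is a text column in which the number of elements in each row matches the number of nodesets in the content column. That is, the text column for file news_A84 should have one element (ideally named news_A84_01), and the row for file news_A9P should have two (named news_A9P_01 and news_A9P_02).

I already have some code that does this with a for loop, but I am wondering whether the nodeset objects can be split this way with purrr?

The desired dat$text output should look like this: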

structure(list(text = list(news_A84 = list(news_A84_001 = c("Letters.", 
"By PETER STANFORD is right in maintaining (Weekend Guardian, November 4) that Graham Greene is not alone in calling himself a ‘Catholic agnostic’.", 
"An earlier thinker who declined to see any mutual contradiction between similar terms was Leslie Weatherhead, psychologist and cleric, who published his Creed of a Christian Agnostic.", 
"Weatherhead was honest enough to realise that there are many aspects of religion about which one has to remain uncertain, that is, agnostic.", 
"These ‘difficult’ areas will vary from one individual to another.", 
"As a self-confessed Christian agnostic he himself, however, was sure of a number of the major tenets of Christianity.", 
"His ‘creed’ claimed (in part): ‘I believe that God exists…", 
"I believe there is mind behind the universe…", "Such a mind must be love rather than hate…", 
"I believe in the divinity of Christ…", "I believe that sin is a grisly fact in the world…", 
"I believe that God's forgiveness is one of the most blessed and therapeutic experiences and that it is offered to all who seek it…", 
"I believe that our relationship with God is the most important thing in the world…", 
"I believe that each individual is precious to God.’", "Within this creed appear supportive rational arguments but also agnostic admissions, such as‘I can understand little about that mind’ and ‘I do not know what ‘divine’means’.", 
"I imagine that many honest people would sympathise with Weatherhead and happily echo his final paragraph: ‘All this gives me as much as I need, and seems to me the essential credo of Christianity.", 
"About the rest I am content to be agnostic.’", "Michael J.Smith.", 
"Southampton.", "FROM HIS Fifth Dimension article, Peter Stanford sounds even further burnt out than Graham Greene who continues capable of grim religious fun when opportunities arise.", 
"Indeed, Greene's latest revelations about his faith-non-faith appear quite knowingly hilarious.", 
"There's such sad unenthusiasm, on the other hand, about the way Stanford confuses personal doubt with the existence of mystery, as if one engenders the other, and, in doing so, must be assuaged by performance of some farcical rites.", 
"How boring!", "How pointless; precisely!", "Prayer, whether dramatised in a church or not, usually goes through futile-seeming passages.", 
"But even when extreme and prolonged, they have a strange tickle of meaning in them that becomes more of a mystery as faith grows.", 
"Olive Powell.", "Manchester.", "More letters Page 27")), news_A9P = list(
    news_A9P_001 = c("Snooker: Reynolds scales Davis peak.", 
    "By Clive Everton", "DEAN REYNOLDS, whitewashed 10-0 by Steve Davis in the final of the Rothmans Grand Prix in October, beat him 9-7 to reach the semi-finals of the Everest World Matchplay Championship at the Brentwood Centre late on Saturday.", 
    "The Grimsby left-hander's semi-final opponent on Wednesday will be Jimmy White, who won the last seven frames in a row to turn a 5-2 deficit into a 9-5 win over Doug Mountjoy.", 
    "Reynolds, 15th in the world rankings at the start of the season, has improved provisionally to eighth.", 
    "He had a gilt-edged opportunity to beat Stephen Hendry, so far the man of the season with four first prizes, in the Stormseal UK Open at Preston, but missed a simple pink when leading 22-0 in the deciding frame and did not have another shot.", 
    "Davis came to Brentwood a 16-12 loser to Hendry in the UK final and is going through a patch where he is making more unforced errors than usual.", 
    "‘There are certain times that are better than others,’ said Davis.", 
    "‘I'm a very good player.", "I can't be perfect.’", "Davis swept to 3-0 but missed the easiest of blues on the brink of 4-0.", 
    "Reynolds won that frame, the next on the pink, two more on the black, and the last of the afternoon to lead 5-3 at the interval.", 
    "He made it 6-3 with a 90 break, and after 52 minutes of tactical battling potted the pink for 7-3 and ran away with the next to go five up with six to play.", 
    "After this extraordinary eight-frame losing streak, Davis won four in a row to close to only 7-8, but two elementary mistakes in the following frame, failing to reach the yellow when rolling up behind it for a snooker, and failing to pot a red at close range along the top cushion, prefaced a run of 48 with which Reynolds secured his most notable scalp.", 
    "PAGE"), news_A9P_002 = c("Judo: Stevens justifies selection.", 
    "By Edward Ferrie", "THE British National Championships at Crystal Palace at the weekend once again saw Wolverhampton dominate the proceedings, with their fighters Elvis Gordon, at heavyweight, Densign White, at middleweight, Fitzroy Davies, at light-middleweight, and Owen Pinnock the bantamweight all taking gold medals.", 
    "There was, however, no Wolverhampton presence in the category which generated the greatest interest this weekend — the light-heavyweight.", 
    "Following the Olympic bronze medalist Denis Stewart's decision to retire from competition after a poor performance in the world championships in Belgrade, two months ago, the No.1 spot was up for grabs.", 
    "Stewart's bitter rival, the veteran Nicholas Kokataylo, the 33-year-old from Denton, in Manchester, was favourite for the gold medal, but a strong challenge was expected from newcomer to the weight, Ray Stevens, 26, of the London Budokwai.", 
    "The reigning Commonwealth middleweight champion Stevens was forced to move up to the heavier weight following a knee injury and a prolonged viral infection.", 
    "Stevens's superior speed and technique combined with superb fighting spirit carried him through to the final.", 
    "Kokataylo and Stevens was an all-action affair with the Manchester fighter scoring with a leg throw in the opening seconds which almost finished the bout.", 
    "A half point was awarded but Stevens fought his way back into the contest scoring with a spectacular sacrifice throw and almost arm locking the much taller and heavier Kokataylo.", 
    "Stevens, despite losing the bout, clearly did enough to justify his pre-event selection for the Commonwealth Games.", 
    "PAGE")))), row.names = c(NA, -2L), class = c("tbl_df", "tbl", 
"data.frame"))

Many thanks!

In case it helps, here is my session info:

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets 
[6] methods   base     

other attached packages:
 [1] magrittr_1.5     zip_2.0.3        openxlsx_4.1.0.1
 [4] seas_0.5-2       MASS_7.3-51.4    tictoc_1.0      
 [7] rlang_0.4.10     fs_1.3.1         rvest_0.3.4     
[10] xml2_1.2.2       XML_3.98-1.20    forcats_0.4.0   
[13] stringr_1.4.0    dplyr_1.0.5      purrr_0.3.2     
[16] readr_1.3.1      tidyr_1.1.3      tibble_2.1.3    
[19] ggplot2_3.2.1    tidyverse_1.2.1  here_1.0.1 

Use map inside the parse_text function so that each node's text is returned as a separate list element.

library(tidyverse)
library(rvest)
library(XML)

parse_text <- function(content){
  text<- map(content, ~.x %>% xml_nodes("s") %>% html_text())
  return(text)
}

Then you can call these functions inside map2 and assign the names:

tibble(file = files, path = list_paths) %>%
  mutate(datasets =  map2(path, file, function(x, y) x %>% read_file() %>% 
                                    parse_file() %>%  parse_text() %>%
                                    setNames(paste(y, seq_along(.), sep = '_'))))

#  file     path                datasets        
#  <chr>    <chr>               <list>          
#1 news_A84 ./dat//news_A84.xml <named list [1]>
#2 news_A9P ./dat//news_A9P.xml <named list [2]>
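The paste call above yields names such as news_A9P_1. If the zero-padded form from the question (news_A9P_01) is wanted, sprintf is one option. The following is a sketch only: parsed is a hypothetical stand-in for the parsed text, and name_parts is an illustrative helper that is not part of the code above.

```r
library(purrr)

# Hypothetical stand-in for the parsed output: one list of character vectors per file.
parsed <- list(
  news_A84 = list(c("Letters.", "Southampton.")),
  news_A9P = list(c("Snooker."), c("Judo."))
)

# Illustrative helper: name each part as <file>_<two-digit index>.
name_parts <- function(parts, file) {
  set_names(parts, sprintf("%s_%02d", file, seq_along(parts)))
}

# imap passes each element together with its name (here, the file name).
named <- imap(parsed, name_parts)
names(named$news_A9P)
# "news_A9P_01" "news_A9P_02"
```

Inside the map2 call, the same effect should follow from replacing the paste expression with sprintf("%s_%02d", y, seq_along(.)).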