Get node values along with parent attribute

I have an XML file with the following structure:


<dataDscr>
  <var ID="V335" name="question1" files="F1" dcml="0" intrvl="discrete">
      <location width="1"/>
      <labl>
        question 1 label
      </labl>
      <qstn>
        <qstnLit>
          question 1 literal question
        </qstnLit>
        <ivuInstr>
          question 1 interviewer instructions
        </ivuInstr>
      </qstn>
  </var>

  <var ID="V335" name="question2" files="F1" dcml="0" intrvl="discrete">
      <location width="1"/>
      <labl>
        question 2 label
      </labl>
      <qstn>
        <preQTxt>
          question 2 pre question text
        </preQTxt>
        <qstnLit>
          question 2 literal question
        </qstnLit>
        <ivuInstr>
          question 2 interviewer instructions
        </ivuInstr>
      </qstn>
  </var>

  <var ID="V335" name="question3" files="F1" dcml="0" intrvl="discrete">
      <location width="1"/>
      <labl>
        question 3 label
      </labl>
      <qstn>
        <preQTxt>
          question 3 pre question text
        </preQTxt>
        <qstnLit>
          question 3 literal question
        </qstnLit>
      </qstn>
  </var>

</dataDscr> 

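I want to collect the values of all children of <qstn>, together with the name attribute of the parent <var> tag (e.g. "question1"). Note that <qstn> has a varying number of children. For example, question1 has two children, <qstnLit> and <ivuInstr>, while question2 has every child that <qstn> can have.

I would like the final result to look like this: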


# name      | preQTxt | qstnLit | ivuInstr
# ------------------------------------------
# question1 |...      |...      |...
# question2 |...      |...      |...
# question3 |...      |...      |...

Thanks!

This should work for your case:

library(tidyverse)
library(xml2)

doc <- read_xml( "data.xml" )

# get all var elements
vars <- xml_find_all( doc, "//var" )

# extract from each "var" element the children of the "qstn" elements,
# then take the tag names and the enclosed text and put each in a column
df_long <- do.call( rbind, lapply(vars,
                             function(x) {
                               lbl <- xml_attr( x, "name" )
                               tags <- xml_find_all( x, "qstn/*" )
                               data.frame( name = lbl, 
                                           col = xml_name(tags), 
                                           txt = trimws(xml_text(tags)) )
                             }) ) 
# spread the data frame to wide format
df <- df_long %>% pivot_wider( id_cols = name, names_from = col, values_from = txt )

Output:

# A tibble: 3 x 4
  name      qstnLit                     ivuInstr                            preQTxt                     
  <chr>     <chr>                       <chr>                               <chr>                       
1 question1 question 1 literal question question 1 interviewer instructions NA                          
2 question2 question 2 literal question question 2 interviewer instructions question 2 pre question text
3 question3 question 3 literal question NA                                  question 3 pre question text

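Here, pivot_wider handles the varying number of columns: wherever a var element lacks one of the qstn children, the corresponding cell is filled with NA.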
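
To see the NA filling in isolation, here is a minimal sketch that skips the XML parsing and starts from a small hand-built long table. The values ("lit 1", "instr 1", and so on) and the name df_wide are made up for illustration and are not taken from data.xml. It also reorders the columns to the layout asked for in the question, since pivot_wider creates the new columns in order of first appearance.

```r
library(tidyr)

# Hand-built stand-in for df_long (illustrative values only):
# question1 has no preQTxt, question3 has no ivuInstr
df_long <- data.frame(
  name = c("question1", "question1", "question3", "question3"),
  col  = c("qstnLit", "ivuInstr", "preQTxt", "qstnLit"),
  txt  = c("lit 1", "instr 1", "pre 3", "lit 3"),
  stringsAsFactors = FALSE
)

# One row per name, one column per distinct tag; missing combinations become NA
df_wide <- pivot_wider(df_long, id_cols = name, names_from = col, values_from = txt)

# Reorder the columns to name | preQTxt | qstnLit | ivuInstr
df_wide <- df_wide[, c("name", "preQTxt", "qstnLit", "ivuInstr")]
print(df_wide)
```

If a placeholder other than NA is preferred, pivot_wider also accepts a values_fill argument, for example values_fill = "".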