R dataframe transformation: split character observations into multiple rows, rearrange strings

Question

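I have a data frame in which one column is filled with strings structured like this: Lastname, Firstname XX, Lastname, Firstname XX, and so on. Each name combination is therefore terminated by "XX".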

I am looking to

put each Lastname, Firstname combination on its own row;
convert each name combination to Firstname Lastname.

This would look as follows:

example <- data.frame(id = c(1,2,3), 
                      names = c("Russell-Moyle, Lloyd XX, Lucas, Caroline XX, Hobhouse, Wera XX", "Benn, Hilary XX, Sobel, Alex XX, West, Catherine XX, Doughty, Stephen XX", "Oswald, Kirsten XX, Thompson, Owen XX, Dorans, Allan XX")
                      )

example

#current output:
#1  1           Russell-Moyle, Lloyd XX, Lucas, Caroline XX, Hobhouse, Wera XX
#2  2 Benn, Hilary XX, Sobel, Alex XX, West, Catherine XX, Doughty, Stephen XX
#3  3                  Oswald, Kirsten XX, Thompson, Owen XX, Dorans, Allan XX

#ideal output:
   id   names
   1    Lloyd Russell-Moyle
   1    Caroline Lucas
   1    Wera Hobhouse
   2    Hilary Benn
   2    Alex Sobel
   2    Catherine West
   2    Stephen Doughty
   3    Kirsten Oswald 
   3    Owen Thompson   
   3    Allan Dorans

Can anyone help me? Thanks!!

Answer 1

You can do this with a few functions from the tidyr package.

library(tidyr)
library(dplyr)

example %>% 
  separate_rows(names, sep = "( *)XX(,*)( *)") %>% # create one row per name
  separate(names, into = c("last", "first"), sep = ", ") %>%   # separate names into first and last
  unite(names, first, last, sep = " ")

# A tibble: 10 x 2
      id names              
   <dbl> <chr>              
 1     1 Lloyd Russell-Moyle
 2     1 Caroline Lucas     
 3     1 Wera Hobhouse      
 4     2 Hilary Benn        
 5     2 Alex Sobel         
 6     2 Catherine West     
 7     2 Stephen Doughty    
 8     3 Kirsten Oswald     
 9     3 Owen Thompson      
10     3 Allan Dorans
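
If tidyr is not available, the same reshaping can be sketched in base R. This is only an illustration, not part of the original answer: the names `pieces` and `result` are made up here, and the sketch assumes every entry has exactly one ", " between last and first name.

```r
example <- data.frame(
  id = c(1, 2, 3),
  names = c(
    "Russell-Moyle, Lloyd XX, Lucas, Caroline XX, Hobhouse, Wera XX",
    "Benn, Hilary XX, Sobel, Alex XX, West, Catherine XX, Doughty, Stephen XX",
    "Oswald, Kirsten XX, Thompson, Owen XX, Dorans, Allan XX"
  ),
  stringsAsFactors = FALSE
)

# Split every string at the "XX" separators.
# The result is a list holding one character vector per original row.
pieces <- strsplit(example$names, "( *)XX(,*)( *)", perl = TRUE)

result <- data.frame(
  # Repeat each id once for every name found in its row.
  id = rep(example$id, times = lengths(pieces)),
  # Swap "Last, First" into "First Last" with two capture groups.
  names = sub("^(.*), (.*)$", "\\2 \\1", unlist(pieces)),
  stringsAsFactors = FALSE
)

print(result)
```

Base `strsplit()` does not return a trailing empty piece when the string ends with the separator, so the final " XX" in each row does not produce an extra empty name.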

Here is a breakdown of the regular expression used in the sep = argument of separate_rows():

( *)  # match a sequence starting with 0 or more spaces
XX    # followed by XX
(,*)  # followed by 0 or more commas
( *)  # followed by 0 or more spaces
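
To see what this pattern actually consumes, it can help to try it on a single string with base R. The snippet below is a small sketch for checking the pattern; the variable names `x`, `sep_pattern`, `separators` and `pieces` are illustrative and do not appear in the answer above.

```r
x <- "Lucas, Caroline XX, Hobhouse, Wera XX"
sep_pattern <- "( *)XX(,*)( *)"

# Extract every substring that the separator pattern matches.
separators <- regmatches(x, gregexpr(sep_pattern, x, perl = TRUE))[[1]]
print(separators)
# [1] " XX, " " XX"

# Split on the same pattern to see what is left over as the names.
pieces <- strsplit(x, sep_pattern, perl = TRUE)[[1]]
print(pieces)
# [1] "Lucas, Caroline" "Hobhouse, Wera"
```

The comma inside each name is untouched because the pattern only matches where a literal XX is present, which is why the later separate() step can still split on ", ".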
