Convert multiple rows to multiple columns (long to wide format does not work (?))

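I have a dataset containing a number of journal articles. Each article has its own identifier (WoS_No), and different articles are on different rows.

The articles have different numbers of authors. If a paper has more than one author, the identifier is repeated over several rows, one row per author.

The df also contains other information, some of which relates to the paper (and is the same for all rows with the same WoS_No code). Some of it, however, relates only to the author (such as their faculty) and is then given on that author's row.

Please see the example below: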

# Original df
df <- data.frame("WoS_No" = matrix(c("WOS:000352315900021", "WOS:000352315900021", "WOS:000352315900021", "WOS:000352315900021", "WOS:000362644700013", "WOS:000362644700013", "WOS:000382460200025", "WOS:000381736200014", "WOS:000371540200019"), 9, 1))
df$Author <- c("CHENEVIX, Georg", "CHENEVIX, Georg", "DOLCE, Ric", "DOLCE, Ric", "CLOUST, A", "STEVEN, A", "WANG, Zhi", "COIN, L", "BARL, Kare")
df$Faculty <- c("Medicine", NA, "HASS", NA, "HABS", "Medicine", "Medicine", "IMB", NA)
df$CNCI <- c(10.51, 10.51, 10.51, 10.51, 37.47, 37.47,  0.84,  8.05, 29.41)
sapply(df, class)

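I would really like to arrange the df so that there is only one row per article (i.e. one row per WoS_No).

I would like the author names split into separate columns (see the 'Author1' and 'Author2' columns below). I tried converting from long to wide format, but it did not work, probably because most articles have different authors, so it created a new column for every name (and I cannot have about 20,000 name columns).

If that is too cumbersome, I would be happy to collapse all author names into a single string in an 'Authors' column, with the names separated by semicolons (so that I can split them later if needed). See the 'Faculties' column below.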

# New df options

dfnew <- data.frame("WoS_No" = matrix(c("WOS:000352315900021", "WOS:000362644700013", "WOS:000382460200025", "WOS:000381736200014", "WOS:000371540200019"), 5, 1))
dfnew$Author1 <- c("CHENEVIX, Georg", "CLOUST, A", "WANG, Zhi", "COIN, L", "BARL, Kare")
dfnew$Author2 <- c("DOLCE, Ric", "STEVEN, A", "", "", "")
dfnew$Faculties <- c("Medicine; NA; HASS; NA", "HABS; Medicine", "Medicine", "IMB", "NA")
dfnew$CNCI <- c(10.51, 37.47,  0.84,  8.05, 29.41)

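I tried looping over each WoS_No and collapsing them one at a time, but because I have 68,000 WoS_No values it did not finish in a reasonable time.

I am really stuck and would greatly appreciate any help.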

You can first keep only the unique rows with distinct, then group_by WoS_No to create a unique identifier column, and pivot the data to wide format.

library(dplyr)

df %>%
  distinct(WoS_No, Author, .keep_all = TRUE) %>%
  group_by(WoS_No) %>%
  mutate(row = row_number()) %>%
  tidyr::pivot_wider(names_from = row, values_from = c(Author, Faculty))

#  WoS_No               CNCI Author_1        Author_2   Faculty_1 Faculty_2
#  <chr>               <dbl> <chr>           <chr>      <chr>     <chr>    
#1 WOS:000352315900021 10.5  CHENEVIX, Georg DOLCE, Ric Medicine  HASS     
#2 WOS:000362644700013 37.5  CLOUST, A       STEVEN, A  HABS      Medicine 
#3 WOS:000382460200025  0.84 WANG, Zhi       NA         Medicine  NA       
#4 WOS:000381736200014  8.05 COIN, L         NA         IMB       NA       
#5 WOS:000371540200019 29.4  BARL, Kare      NA         NA        NA 

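Note that I have also spread Faculty into separate columns. If you want to keep them in a single column, as shown in your expected output, you can do so with minimal changes to the code.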
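One way those changes might look, as a sketch rather than the answerer's exact code: collapse Faculty into a single string per WoS_No before dropping the duplicate author rows, then pivot only the Author column. The names `Faculties` and `wide`, and the choice of `names_prefix = "Author"`, are assumptions made for illustration.

```r
library(dplyr)
library(tidyr)

# Same example data as in the question
df <- data.frame(
  WoS_No = c("WOS:000352315900021", "WOS:000352315900021", "WOS:000352315900021",
             "WOS:000352315900021", "WOS:000362644700013", "WOS:000362644700013",
             "WOS:000382460200025", "WOS:000381736200014", "WOS:000371540200019"),
  Author = c("CHENEVIX, Georg", "CHENEVIX, Georg", "DOLCE, Ric", "DOLCE, Ric",
             "CLOUST, A", "STEVEN, A", "WANG, Zhi", "COIN, L", "BARL, Kare"),
  Faculty = c("Medicine", NA, "HASS", NA, "HABS", "Medicine", "Medicine", "IMB", NA),
  CNCI = c(10.51, 10.51, 10.51, 10.51, 37.47, 37.47, 0.84, 8.05, 29.41)
)

wide <- df %>%
  group_by(WoS_No) %>%
  # Collapse every Faculty entry of a paper into one string (NAs are kept,
  # as in the expected output); this must happen before duplicates are dropped
  mutate(Faculties = paste(Faculty, collapse = "; ")) %>%
  # Keep one row per paper/author pair
  distinct(WoS_No, Author, .keep_all = TRUE) %>%
  # Number the authors within each paper
  mutate(row = row_number()) %>%
  ungroup() %>%
  select(-Faculty) %>%
  # Spread only Author; WoS_No, CNCI and Faculties identify the rows
  pivot_wider(names_from = row, values_from = Author, names_prefix = "Author")
```

The result has one row per WoS_No with the columns WoS_No, CNCI, Faculties, Author1 and Author2. Papers with a single author get NA in Author2 rather than the empty string used in the question's dfnew.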

Here is a solution in which Authors and Faculties are separated by semicolons, as in the expected output.

library(dplyr)

df %>% 
  group_by(WoS_No) %>% 
  mutate(
    Authors = paste(unique(Author), collapse = "; "),
    Faculties = paste(Faculty, collapse = "; ")
    ) %>% 
  select(WoS_No, Authors, Faculties, CNCI) %>% 
  distinct()

# A tibble: 5 x 4
# Groups:   WoS_No [5]
#   WoS_No              Authors                     Faculties               CNCI
#   <chr>               <chr>                       <chr>                  <dbl> 
# 1 WOS:000352315900021 CHENEVIX, Georg; DOLCE, Ric Medicine; NA; HASS; NA 10.5 
# 2 WOS:000362644700013 CLOUST, A; STEVEN, A        HABS; Medicine         37.5  
# 3 WOS:000382460200025 WANG, Zhi                   Medicine                0.84
# 4 WOS:000381736200014 COIN, L                     IMB                     8.05
# 5 WOS:000371540200019 BARL, Kare                  NA                     29.4 

Here is a data.table approach

library( data.table )
#make df a data.table
setDT(df)
#first, paste the (unique) authors together by WoS_No
ans <- df[, .( authors = paste0( unique(Author), collapse = ";" ),
               Faculty = paste0( Faculty, collapse = ";" ),
               CNCI = unique(CNCI) ), by = WoS_No][]
#                 WoS_No                    authors             Faculty  CNCI
# 1: WOS:000352315900021 CHENEVIX, Georg;DOLCE, Ric Medicine;NA;HASS;NA 10.51
# 2: WOS:000362644700013        CLOUST, A;STEVEN, A       HABS;Medicine 37.47
# 3: WOS:000382460200025                  WANG, Zhi            Medicine  0.84
# 4: WOS:000381736200014                    COIN, L                 IMB  8.05
# 5: WOS:000371540200019                 BARL, Kare                  NA 29.41

#split the author column
ans[, paste0( "Author", 1:length(tstrsplit( ans$authors, ";" ) )) := tstrsplit( authors, ";")]
#                 WoS_No                    authors             Faculty  CNCI         Author1    Author2
# 1: WOS:000352315900021 CHENEVIX, Georg;DOLCE, Ric Medicine;NA;HASS;NA 10.51 CHENEVIX, Georg DOLCE, Ric
# 2: WOS:000362644700013        CLOUST, A;STEVEN, A       HABS;Medicine 37.47       CLOUST, A  STEVEN, A
# 3: WOS:000382460200025                  WANG, Zhi            Medicine  0.84       WANG, Zhi       <NA>
# 4: WOS:000381736200014                    COIN, L                 IMB  8.05         COIN, L       <NA>
# 5: WOS:000371540200019                 BARL, Kare                  NA 29.41      BARL, Kare       <NA>

A simple base R solution using the nice reshape():

data <- df[!duplicated(df[, c("WoS_No", "Author")]),]
data$grp.id <- ave(data$WoS_No, data$WoS_No, FUN = seq_along)

reshaped_data  <- reshape(data, idvar= "WoS_No", timevar= "grp.id",
                          v.names=c("Author", "Faculty"), direction="wide")

               WoS_No  CNCI        Author.1 Faculty.1   Author.2 Faculty.2
1 WOS:000352315900021 10.51 CHENEVIX, Georg  Medicine DOLCE, Ric      HASS
5 WOS:000362644700013 37.47       CLOUST, A      HABS  STEVEN, A  Medicine
7 WOS:000382460200025  0.84       WANG, Zhi  Medicine       <NA>      <NA>
8 WOS:000381736200014  8.05         COIN, L       IMB       <NA>      <NA>
9 WOS:000371540200019 29.41      BARL, Kare      <NA>       <NA>      <NA>

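idvar identifies the groups we want to spread.

timevar identifies the observations within a group. We need to create grp.id for this.

v.names names the columns we want to spread.

direction tells reshape() to convert to wide format.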