R function to merge rows by id and create separate columns

Question

I have a list of articles obtained from an API, and my data frame looks like this:

PMID        Year     Title                  Journal         Author 
33326729    2020     Avelumab Maintenance   PLoS biology    T., Powles
33326729    2020     Avelumab Maintenance   PLoS biology    B., Huang
33326729    2020     Avelumab Maintenance   PLoS biology    A., Di Pietro

I need to merge it into this:

PMID        Year     Title                  Journal         Author-1         Author-2     Author-3
33326729    2020     Avelumab Maintenance   PLoS biology    T., Powles       B., Huang    A., Di Pietro

So basically, I need to combine each article's authors into a single row. I tried grouping by id as follows:

test <- setDT(PubMed_df)[, lapply(.SD, function(x) toString(na.omit(x))), by = "pmid"]

Outputs:
33326729    2020,2020,2020     Avelumab Maintenance,Avelumab Maintenance,Avelumab Maintenance   PLoS biology,PLoS biology,PLoS biology    T., Powles,B., Huang,A., Di Pietro

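However, this produces comma-separated values inside each cell instead of separate columns. Does anyone know a different function, or how to adjust the setDT call, to get the result I want? Thanks in advance.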

Edit: as requested, the output of dput(head(PubMed_df)):

structure(list(pmid = c("33326729", "33326729", "33326729", "33320856", 
"33320856", "33320856"), year = c("2020", "2020", "2020", "2021", 
"2021", "2021"), month = c("12", "12", "12", "01", "01", "01"
), day = c("21", "21", "21", "07", "07", "07"), lastname = c("Powles", 
"Huang", "di Pietro", "Reijns", "Thompson", "Acosta"), firstname = c("Thomas", 
"Bo", "Alessandra", "Martin A M", "Louise", "Juan Carlos"), address = c("St. Bartholomew's Hospital, London, United Kingdom thomas.powles1@nhs", 
"Pfizer, Groton, CT", "Pfizer, Milan, Italy", "MRC Human Genetics Unit, MRC Institute of Genetics and Molecular Medicine, The University of Edinburgh, Edinburgh, United Kingdom", 
"The South East of Scotland Clinical Genetic Service, Western General Hospital, NHS Lothian, Edinburgh, United Kingdom", 
"Cancer Research UK Edinburgh Centre, MRC Institute of Genetics and Molecular Medicine, The University of Edinburgh, Edinburgh, United Kingdom"
), journal = c("The New England journal of medicine", "The New England journal of medicine", 
"The New England journal of medicine", "PLoS biology", "PLoS biology", 
"PLoS biology"), title = c("Avelumab Maintenance for Urothelial Carcinoma. Reply.", 
"Avelumab Maintenance for Urothelial Carcinoma. Reply.", "Avelumab Maintenance for Urothelial Carcinoma. Reply.", 
"A sensitive and affordable multiplex RT-qPCR assay for SARS-CoV-2 detection.", 
"A sensitive and affordable multiplex RT-qPCR assay for SARS-CoV-2 detection.", 
"A sensitive and affordable multiplex RT-qPCR assay for SARS-CoV-2 detection."
), abstract = c(NA, NA, NA, "", "", ""), doi = c("10.1056/NEJMc2032018", 
"10.1056/NEJMc2032018", "10.1056/NEJMc2032018", "10.1371/journal.pbio.3001030", 
"10.1371/journal.pbio.3001030", "10.1371/journal.pbio.3001030"
), keywords = c("Antibodies, Monoclonal; Antibodies, Monoclonal, Humanized; Carcinoma, Transitional Cell; Humans; Urologic Neoplasms", 
"Antibodies, Monoclonal; Antibodies, Monoclonal, Humanized; Carcinoma, Transitional Cell; Humans; Urologic Neoplasms", 
"Antibodies, Monoclonal; Antibodies, Monoclonal, Humanized; Carcinoma, Transitional Cell; Humans; Urologic Neoplasms", 
"COVID-19; COVID-19 Testing; Humans; Multiplex Polymerase Chain Reaction; RNA, Viral; Reverse Transcriptase Polymerase Chain Reaction; SARS-CoV-2", 
"COVID-19; COVID-19 Testing; Humans; Multiplex Polymerase Chain Reaction; RNA, Viral; Reverse Transcriptase Polymerase Chain Reaction; SARS-CoV-2", 
"COVID-19; COVID-19 Testing; Humans; Multiplex Polymerase Chain Reaction; RNA, Viral; Reverse Transcriptase Polymerase Chain Reaction; SARS-CoV-2"
)), row.names = c(NA, 6L
), class = c("data.table", "data.frame"))

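Edit 2: very detailed and specific requirements:

I need to convert the head data shown above into a form where each row has: PMID | Publication date | Author 1 | Affiliation | Address | City | State (if USA) | Country | Author 2 | Author 2's affiliation | Address | City | State (if USA) | Country | and so on for each co-author | Journal | Title | Abstract* | MH terms

I will have to break up the address, but that is something I'll deal with later. For now my goal is just to get all the information for every author attached to the right article, instead of having 3 rows for the same article.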

Edit 2 - to get the answer from @r2evans working in my case: the provided answer works if you call dcast as data.table::dcast!
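A minimal sketch of the namespaced call, on made-up toy data (the table `toy` and its values are illustrative, not from the real data set). One plausible reason the prefix is needed, which is an assumption and not confirmed here, is that another attached package also exports a function named dcast and masks the data.table one; passing several columns in value.var is a data.table feature.

```r
library(data.table)

# Toy data (hypothetical): article "A" has two author rows, "B" has one
toy <- data.table(
  pmid      = c("A", "A", "B"),
  lastname  = c("Powles", "Huang", "Reijns"),
  firstname = c("Thomas", "Bo", "Martin")
)

# Per-article author counter used as the column suffix
toy[, row := rowid(pmid)]

# Calling through the namespace guarantees the data.table version is used,
# whatever other packages are attached
wide <- data.table::dcast(toy, pmid ~ row,
                          value.var = c("lastname", "firstname"))
print(wide)
```

Writing the prefix once at the call site avoids having to worry about the order in which packages were loaded.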

Answer 1

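This is mostly a dupe of the one in Rui's comment, but it helps to add a helper column to get there (I'll use row here). Since you started with data.table, I'll stick with it.

Edited to use the updated data. (I'm assuming pmid uniquely defines the groups.)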

library(data.table)
setDT(PubMed_df)
PubMed_df[, row := seq_len(.N), by = .(pmid)]
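To make the helper column concrete, here is a small sketch on made-up data (the table `toy` is hypothetical). It shows that row simply counts the author rows within each pmid, and that data.table's rowid() should give the same counter without a grouped assignment.

```r
library(data.table)

# Toy data (hypothetical): three author rows for "A", one for "B"
toy <- data.table(
  pmid     = c("A", "A", "A", "B"),
  lastname = c("Powles", "Huang", "di Pietro", "Reijns")
)

# Number the rows within each pmid, in their current order
toy[, row := seq_len(.N), by = .(pmid)]

# rowid() builds the same per-group counter in one vectorised call
toy[, row2 := rowid(pmid)]

print(toy)
```

The counter follows the current row order, so if author order matters, the data should be sorted the way you want before the column is created.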

And in an über-wide format:

dcast(PubMed_df, pmid + year + month + day + journal + title + abstract + doi + keywords ~ row, value.var = c("lastname", "firstname", "address"))
       pmid   year  month    day                             journal                                   title abstract                          doi                                keywords lastname_1 lastname_2 lastname_3 firstname_1 firstname_2 firstname_3                               address_1                               address_2                               address_3
     <char> <char> <char> <char>                              <char>                                  <char>   <char>                       <char>                                  <char>     <char>     <char>     <char>      <char>      <char>      <char>                                  <char>                                  <char>                                  <char>
1: 33320856   2021     01     07                        PLoS biology A sensitive and affordable multiplex...          10.1371/journal.pbio.3001030 COVID-19; COVID-19 Testing; Humans; ...     Reijns   Thompson     Acosta  Martin A M      Louise Juan Carlos MRC Human Genetics Unit, MRC Institu... The South East of Scotland Clinical ... Cancer Research UK Edinburgh Centre,...
2: 33326729   2020     12     21 The New England journal of medicine Avelumab Maintenance for Urothelial ...     <NA>         10.1056/NEJMc2032018 Antibodies, Monoclonal; Antibodies, ...     Powles      Huang  di Pietro      Thomas          Bo  Alessandra St. Bartholomew's Hospital, London, ...                      Pfizer, Groton, CT                    Pfizer, Milan, Italy

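Note that when a paper has fewer authors than the maximum number of authors in the data set, its extra author columns will be empty/NA. For example, if I remove rows 5-6 and do the same,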

PubMed_df <- PubMed_df[1:4,]
dcast(PubMed_df, pmid + year + month + day + journal + title + abstract + doi + keywords ~ row, value.var = c("lastname", "firstname", "address"))
#        pmid   year  month    day                             journal                                   title abstract                          doi                                keywords lastname_1 lastname_2 lastname_3 firstname_1 firstname_2 firstname_3                               address_1          address_2            address_3
#      <char> <char> <char> <char>                              <char>                                  <char>   <char>                       <char>                                  <char>     <char>     <char>     <char>      <char>      <char>      <char>                                  <char>             <char>               <char>
# 1: 33320856   2021     01     07                        PLoS biology A sensitive and affordable multiplex...          10.1371/journal.pbio.3001030 COVID-19; COVID-19 Testing; Humans; ...     Reijns       <NA>       <NA>  Martin A M        <NA>        <NA> MRC Human Genetics Unit, MRC Institu...               <NA>                 <NA>
# 2: 33326729   2020     12     21 The New England journal of medicine Avelumab Maintenance for Urothelial ...     <NA>         10.1056/NEJMc2032018 Antibodies, Monoclonal; Antibodies, ...     Powles      Huang  di Pietro      Thomas          Bo  Alessandra St. Bartholomew's Hospital, London, ... Pfizer, Groton, CT Pfizer, Milan, Italy
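The wide result above groups columns by field (all lastname_*, then all firstname_*, and so on), whereas the layout requested in the question lists every field of author 1, then every field of author 2. A minimal sketch of one way to reorder the columns afterwards, using setcolorder() on made-up toy data (the table `toy` and the names `fields`, `n_max` and `by_author` are illustrative assumptions):

```r
library(data.table)

# Toy data (hypothetical): article "A" has two authors, "B" has one
toy <- data.table(
  pmid      = c("A", "A", "B"),
  journal   = c("J1", "J1", "J2"),
  lastname  = c("Powles", "Huang", "Reijns"),
  firstname = c("Thomas", "Bo", "Martin"),
  address   = c("London", "Groton", "Edinburgh")
)
toy[, row := rowid(pmid)]

fields <- c("lastname", "firstname", "address")
wide <- data.table::dcast(toy, pmid + journal ~ row, value.var = fields)

# Build the author-by-author order:
# lastname_1, firstname_1, address_1, lastname_2, firstname_2, address_2, ...
n_max     <- max(toy$row)
by_author <- paste(rep(fields, times = n_max),
                   rep(seq_len(n_max), each = length(fields)),
                   sep = "_")

# Reorder in place: id columns first, then one block of columns per author
setcolorder(wide, c("pmid", "journal", by_author))
print(names(wide))
```

Articles with fewer authors than n_max keep NA in the trailing author blocks, as in the output above.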
