R function to merge rows by id and create separate columns
I have a list of articles obtained from an API, and my data frame looks like this:
PMID Year Title Journal Author
33326729 2020 Avelumab Maintenance PLoS biology T., Powles
33326729 2020 Avelumab Maintenance PLoS biology B., Huang
33326729 2020 Avelumab Maintenance PLoS biology A., Di Pietro
I need to merge it into this:
PMID Year Title Journal Author-1 Author-2 Author-3
33326729 2020 Avelumab Maintenance PLoS biology T., Powles B., Huang A., Di Pietro
So basically, I need to merge an article's authors into a single row. My idea was to group by id as follows:
test <- setDT(PubMed_df)[, lapply(.SD, function(x) toString(na.omit(x))), by = "pmid"]
Outputs:
33326729 2020,2020,2020 Avelumab Maintenance,Avelumab Maintenance,Avelumab Maintenance PLoS biology,PLoS biology,PLoS biology T., Powles,B., Huang,A., Di Pietro
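However, this produces comma-separated values rather than separate columns. Does anyone know a different function, or how to adjust the setDT call, to get the result I want? Thanks in advance.
Edit:
Output of dput(head(PubMed_df)), as requested: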
structure(list(pmid = c("33326729", "33326729", "33326729", "33320856",
"33320856", "33320856"), year = c("2020", "2020", "2020", "2021",
"2021", "2021"), month = c("12", "12", "12", "01", "01", "01"
), day = c("21", "21", "21", "07", "07", "07"), lastname = c("Powles",
"Huang", "di Pietro", "Reijns", "Thompson", "Acosta"), firstname = c("Thomas",
"Bo", "Alessandra", "Martin A M", "Louise", "Juan Carlos"), address = c("St. Bartholomew's Hospital, London, United Kingdom thomas.powles1@nhs",
"Pfizer, Groton, CT", "Pfizer, Milan, Italy", "MRC Human Genetics Unit, MRC Institute of Genetics and Molecular Medicine, The University of Edinburgh, Edinburgh, United Kingdom",
"The South East of Scotland Clinical Genetic Service, Western General Hospital, NHS Lothian, Edinburgh, United Kingdom",
"Cancer Research UK Edinburgh Centre, MRC Institute of Genetics and Molecular Medicine, The University of Edinburgh, Edinburgh, United Kingdom"
), journal = c("The New England journal of medicine", "The New England journal of medicine",
"The New England journal of medicine", "PLoS biology", "PLoS biology",
"PLoS biology"), title = c("Avelumab Maintenance for Urothelial Carcinoma. Reply.",
"Avelumab Maintenance for Urothelial Carcinoma. Reply.", "Avelumab Maintenance for Urothelial Carcinoma. Reply.",
"A sensitive and affordable multiplex RT-qPCR assay for SARS-CoV-2 detection.",
"A sensitive and affordable multiplex RT-qPCR assay for SARS-CoV-2 detection.",
"A sensitive and affordable multiplex RT-qPCR assay for SARS-CoV-2 detection."
), abstract = c(NA, NA, NA, "", "", ""), doi = c("10.1056/NEJMc2032018",
"10.1056/NEJMc2032018", "10.1056/NEJMc2032018", "10.1371/journal.pbio.3001030",
"10.1371/journal.pbio.3001030", "10.1371/journal.pbio.3001030"
), keywords = c("Antibodies, Monoclonal; Antibodies, Monoclonal, Humanized; Carcinoma, Transitional Cell; Humans; Urologic Neoplasms",
"Antibodies, Monoclonal; Antibodies, Monoclonal, Humanized; Carcinoma, Transitional Cell; Humans; Urologic Neoplasms",
"Antibodies, Monoclonal; Antibodies, Monoclonal, Humanized; Carcinoma, Transitional Cell; Humans; Urologic Neoplasms",
"COVID-19; COVID-19 Testing; Humans; Multiplex Polymerase Chain Reaction; RNA, Viral; Reverse Transcriptase Polymerase Chain Reaction; SARS-CoV-2",
"COVID-19; COVID-19 Testing; Humans; Multiplex Polymerase Chain Reaction; RNA, Viral; Reverse Transcriptase Polymerase Chain Reaction; SARS-CoV-2",
"COVID-19; COVID-19 Testing; Humans; Multiplex Polymerase Chain Reaction; RNA, Viral; Reverse Transcriptase Polymerase Chain Reaction; SARS-CoV-2"
)), row.names = c(NA, 6L
), class = c("data.table", "data.frame"))
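Edit 2:
Very detailed and specific requirements:
I need to convert the head data shown above into a form where each row has:
PMID | Publication Date | Author 1 | Affiliation | Address | City | State (if USA) | Country | Author 2 | Affiliation of Author 2 | Address | City | State (if USA) | Country | and so on for each co-author | Journal | Title | Abstract* | MH terms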
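I will have to break up the address, but that is something I will deal with later. For now my goal is just to get all the information for each author attached to the right article, rather than having 3 rows for the same article.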
Edit 2 - to get the answer from @r2evans working in my case:
The provided answer works if you call dcast as data.table::dcast!
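The namespace-qualified call matters when some other attached package also exports a function named dcast (reshape2 is one such package; whether that was the cause in this session is an assumption). A minimal sketch on made-up toy data (the table toy and its values are hypothetical, not the real PubMed_df), showing the qualified call with more than one value.var column:

```r
library(data.table)

# Hypothetical toy data standing in for PubMed_df
toy <- data.table(
  pmid      = c("1", "1", "2"),
  lastname  = c("Powles", "Huang", "Reijns"),
  firstname = c("Thomas", "Bo", "Martin")
)

# Per-article author counter, as in the answer
toy[, row := seq_len(.N), by = .(pmid)]

# Qualifying the call ensures data.table's dcast is the one dispatched,
# regardless of what else is attached in the session
wide <- data.table::dcast(toy, pmid ~ row,
                          value.var = c("lastname", "firstname"))
print(wide)
```

Writing data.table::dcast everywhere is slightly more verbose, but it removes any dependence on the order in which packages were attached.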
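This is mostly a dupe of Rui's comment, but it helps to add a helper column to get there (I'll use row here). Since you've started with data.table, I'll stick with it.
Edited to use the updated data. (I'm assuming pmid uniquely defines the group.)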
library(data.table)
setDT(PubMed_df)
PubMed_df[, row := seq_len(.N), by = .(pmid)]
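As a side note, data.table also provides rowid(), which produces the same within-group counter without an explicit by. A small sketch on hypothetical toy data (toy and row2 are illustrative names, not part of the original answer), checking that the two forms agree:

```r
library(data.table)

# Hypothetical toy data: three authors for article "a", two for "b"
toy <- data.table(
  pmid     = c("a", "a", "a", "b", "b"),
  lastname = c("Powles", "Huang", "di Pietro", "Reijns", "Thompson")
)

# Counter as written in the answer
toy[, row := seq_len(.N), by = .(pmid)]

# Equivalent counter via rowid(): numbers rows within each pmid value
toy[, row2 := rowid(pmid)]

stopifnot(identical(toy$row, toy$row2))
print(toy)
```

Either form gives each author a position within its article, which is what the cast needs on the right-hand side of the formula.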
And in an über-wide format:
dcast(PubMed_df, pmid + year + month + day + journal + title + abstract + doi + keywords ~ row, value.var = c("lastname", "firstname", "address"))
pmid year month day journal title abstract doi keywords lastname_1 lastname_2 lastname_3 firstname_1 firstname_2 firstname_3 address_1 address_2 address_3
<char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char>
1: 33320856 2021 01 07 PLoS biology A sensitive and affordable multiplex... 10.1371/journal.pbio.3001030 COVID-19; COVID-19 Testing; Humans; ... Reijns Thompson Acosta Martin A M Louise Juan Carlos MRC Human Genetics Unit, MRC Institu... The South East of Scotland Clinical ... Cancer Research UK Edinburgh Centre,...
2: 33326729 2020 12 21 The New England journal of medicine Avelumab Maintenance for Urothelial ... <NA> 10.1056/NEJMc2032018 Antibodies, Monoclonal; Antibodies, ... Powles Huang di Pietro Thomas Bo Alessandra St. Bartholomew's Hospital, London, ... Pfizer, Groton, CT Pfizer, Milan, Italy
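Note that when a paper has fewer authors than the maximum number of authors in the dataset, it will have empty columns/NA. For example, if I remove rows 5-6 and do the same,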
PubMed_df <- PubMed_df[1:4,]
dcast(PubMed_df, pmid + year + month + day + journal + title + abstract + doi + keywords ~ row, value.var = c("lastname", "firstname", "address"))
# pmid year month day journal title abstract doi keywords lastname_1 lastname_2 lastname_3 firstname_1 firstname_2 firstname_3 address_1 address_2 address_3
# <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char> <char>
# 1: 33320856 2021 01 07 PLoS biology A sensitive and affordable multiplex... 10.1371/journal.pbio.3001030 COVID-19; COVID-19 Testing; Humans; ... Reijns <NA> <NA> Martin A M <NA> <NA> MRC Human Genetics Unit, MRC Institu... <NA> <NA>
# 2: 33326729 2020 12 21 The New England journal of medicine Avelumab Maintenance for Urothelial ... <NA> 10.1056/NEJMc2032018 Antibodies, Monoclonal; Antibodies, ... Powles Huang di Pietro Thomas Bo Alessandra St. Bartholomew's Hospital, London, ... Pfizer, Groton, CT Pfizer, Milan, Italy
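If the author fields should sit side by side per author (lastname_1, firstname_1, lastname_2, ...) rather than grouped by field, the columns can be reordered after the cast. A minimal sketch on hypothetical toy data (toy, fields and author_cols are illustrative names), assuming the field_row column naming shown in the output above:

```r
library(data.table)

# Hypothetical toy data: article "1" has two authors, article "2" has one
toy <- data.table(
  pmid      = c("1", "1", "2"),
  journal   = c("J A", "J A", "J B"),
  lastname  = c("Powles", "Huang", "Reijns"),
  firstname = c("Thomas", "Bo", "Martin")
)
toy[, row := seq_len(.N), by = .(pmid)]

fields <- c("lastname", "firstname")
wide <- data.table::dcast(toy, pmid + journal ~ row, value.var = fields)

# Build names author by author: outer() gives a fields x authors matrix,
# and as.vector() reads it column by column (one author at a time)
n_authors   <- max(toy$row)
author_cols <- as.vector(outer(fields, seq_len(n_authors), paste, sep = "_"))

# Reorder in place: id columns first, then each author's fields together
setcolorder(wide, c("pmid", "journal", author_cols))
print(names(wide))
```

Articles with fewer authors keep NA in the trailing author columns, exactly as in the cast output; only the column order changes.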