Function to compare the values of a specific column across rows with the same ID
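I have a large laboratory database in which some IDs have multiple results. I also created another key variable (initials + age + sex) to use for matching against hospital medical records. However, I noticed that different initials sometimes share the same hospital ID. I would like to write a function that detects this inconsistency.
An example of the database: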
df <- data.frame(
  ID = c("5606", "5606", "5728", "5824", "5824", "5824", "5824"),
  key2 = c("TN35M", "TN35M", "JJ26M", "CD47F", "CD47F", "DG44M", "DG44M"),
  date_sample = c("12/03/2012", "12/03/2012", "19/04/2012", "21/05/2012",
                  "21/05/2012", "19/10/2012", "19/10/2012"),
  service = c("ORTHO", "ORTHO", "BLOC", "VISC", "VISC", "BLOC", "BLOC"),
  germe = c("Acinetobacter sp", "Burkholderia pseudomallei",
            "Stenotrophomonas maltophilia", "Staphylococcus haemolyticus",
            " Enterobacter cloacae", "Escherichia coli", "Pseudomonas aeruginosa")
)
ID key2 date_sample service germe
5606 TN35M 12/03/2012 ORTHO Acinetobacter sp
5606 TN35M 12/03/2012 ORTHO Burkholderia pseudomallei
5728 JJ26M 19/04/2012 BLOC Stenotrophomonas maltophilia
5824 CD47F 21/05/2012 VISC Staphylococcus haemolyticus
5824 CD47F 21/05/2012 VISC Enterobacter cloacae
5824 DG44M 19/10/2012 BLOC Escherichia coli
5824 DG44M 19/10/2012 BLOC Pseudomonas aeruginosa
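Each ID should have a single, unique key2 value. How can I compare the key2 values across rows that share the same ID and produce an output variable that flags all the incoherent rows, so that I can make sure each ID belongs to one patient and is not shared by more than one patient?
Something like: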
ID key2 date_sample service germe incoherence
5606 TN35M 12/03/2012 ORTHO Acinetobacter sp N
5606 TN35M 12/03/2012 ORTHO Burkholderia pseudomallei N
5728 JJ26M 19/04/2012 BLOC Stenotrophomonas maltophilia N
5824 CD47F 21/05/2012 VISC Staphylococcus haemolyticus Y
5824 CD47F 21/05/2012 VISC Enterobacter cloacae Y
5824 DG44M 19/10/2012 BLOC Escherichia coli Y
5824 DG44M 19/10/2012 BLOC Pseudomonas aeruginosa Y
You can count the unique values of key2 within each ID group. If the count is greater than 1, the result is Y (or TRUE in this case), i.e.
!with(df, ave(key2, ID, FUN = function(i) length(unique(i)))) == 1
#[1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE
Note: make sure your variables are character, not factor (convert with as.character() if needed).
Using dplyr:
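Since the question asks for a function, the same per-ID counting idea can be wrapped up in base R. This is only a sketch: the name `flag_incoherent` and its `id_col` / `key_col` arguments are made up for illustration, and it converts both columns with `as.character()` so that factor input is handled as well.

```r
# Hypothetical helper (name and arguments are illustrative, not from any package):
# flags every row whose ID is associated with more than one distinct key.
flag_incoherent <- function(data, id_col = "ID", key_col = "key2") {
  ids  <- as.character(data[[id_col]])
  keys <- as.character(data[[key_col]])
  # Number of distinct keys per ID, named by ID
  n_keys <- tapply(keys, ids, function(k) length(unique(k)))
  # Look up the count for each row's ID and turn it into a Y/N label
  data$incoherence <- ifelse(as.vector(n_keys[ids]) > 1, "Y", "N")
  data
}

# Reduced version of the example data (only the two columns that matter here)
df_small <- data.frame(
  ID   = c("5606", "5606", "5728", "5824", "5824", "5824", "5824"),
  key2 = c("TN35M", "TN35M", "JJ26M", "CD47F", "CD47F", "DG44M", "DG44M"),
  stringsAsFactors = FALSE
)

res <- flag_incoherent(df_small)
res$incoherence
# [1] "N" "N" "N" "Y" "Y" "Y" "Y"
```

The function returns a copy of the data with the extra column, so the original data frame is left unchanged.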
library(dplyr)
df %>%
group_by(ID) %>%
mutate(incoherence = c("N", "Y")[(n_distinct(key2) > 1) +1])
# ID key2 incoherence
# <fct> <fct> <chr>
#1 5606 TN35M N
#2 5606 TN35M N
#3 5728 JJ26M N
#4 5824 CD47F Y
#5 5824 CD47F Y
#6 5824 DG44M Y
#7 5824 DG44M Y
And with data.table:
library(data.table)
setDT(df)[, incoherence := c("N", "Y")[(uniqueN(key2) > 1) +1], by = ID]
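Once the rows are flagged, it may also help to list which IDs are in conflict and which keys they carry, so they can be checked against the medical records. A minimal base R sketch, assuming the same ID and key2 columns as in the example (the names `pairs`, `bad_ids` and `conflicts` are just illustrative):

```r
# Reduced version of the example data (only the two columns that matter here)
df_small <- data.frame(
  ID   = c("5606", "5606", "5728", "5824", "5824", "5824", "5824"),
  key2 = c("TN35M", "TN35M", "JJ26M", "CD47F", "CD47F", "DG44M", "DG44M"),
  stringsAsFactors = FALSE
)

# Keep one row per distinct (ID, key2) pair
pairs <- unique(df_small[, c("ID", "key2")])

# An ID that still appears more than once has several different keys
bad_ids <- unique(pairs$ID[duplicated(pairs$ID)])

# The conflicting ID/key combinations to review
conflicts <- pairs[pairs$ID %in% bad_ids, ]
conflicts
```

For the example data this leaves only ID 5824, with the keys CD47F and DG44M.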