Transitive efficient comparison and value assignment to dataframe variables in R?

I have 3 data frames:

df1:

    key       value
1   rs1057079     C
2   rs4845882     A
3   rs1891932     T
4    rs530296     A
5  rs10497340     G

df2:

    key       value
1   rs1057079     T
2   rs4845882     G
3   rs1891932     T
4    rs530296     A
5  rs10497340     A

and a third one, df3, which is the control:

    key       value
1   rs1057079     C
2   rs4845882     A
3   rs1891932     C
4    rs530296     G
5  rs10497340     G

I want to check whether the value for every key in df1 and df2 equals the value in the control df3, e.g. check df1$rs1057079 == df3$rs1057079, and the same for df2.

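I'd like to do this without for loops. What is the simplest and most efficient way? I have considered dplyr's filter and mutate functions, but I'd be glad to hear from experts how to compare n data frames against a control data frame.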

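If we can assume that every key present in df1 and df2 is also present in df3, and that the rows are sorted in the same order, then the solution is fairly simple: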

df1$value==df3$value
df2$value==df3$value

Each line outputs a vector of TRUE and FALSE values, where each element is the answer for one row of df1 or df2 respectively.
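If the rows are not guaranteed to be in the same order, one possible base R workaround is to align the control by key with match() before comparing. This is only a sketch: the control below is a shuffled copy of df3 made up for illustration, and it assumes keys are unique within each data frame.

```r
# Toy data mirroring the question; the control rows are deliberately
# shuffled (an assumption made for this illustration)
df1 <- data.frame(
  key   = c("rs1057079", "rs4845882", "rs1891932", "rs530296", "rs10497340"),
  value = c("C", "A", "T", "A", "G"),
  stringsAsFactors = FALSE
)
df3 <- data.frame(
  key   = c("rs10497340", "rs530296", "rs1891932", "rs4845882", "rs1057079"),
  value = c("G", "G", "C", "A", "C"),
  stringsAsFactors = FALSE
)

# For each key in df1, find the row position of the same key in df3
idx <- match(df1$key, df3$key)

# Compare df1's values with the control values aligned by key;
# a key missing from df3 would give NA here
is_same <- df1$value == df3$value[idx]
is_same
```

The result follows the row order of df1, so it can be attached to df1 as a new column if needed.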

I would create a function that takes two datasets (test and control) and returns their values side by side, plus a flag that shows when they are the same.

df1 = read.table(text = "
key       value
1   rs1057079     C
2   rs4845882     A
3   rs1891932     T
4    rs530296     A
5  rs10497340     G
", header=T, stringsAsFactors=F)

df2 = read.table(text = "
key       value
1   rs1057079     T
2   rs4845882     G
3   rs1891932     T
4    rs530296     A
5  rs10497340     A
", header=T, stringsAsFactors=F)

df3 = read.table(text = "
key       value
1   rs1057079     C
2   rs4845882     A
3   rs1891932     C
4    rs530296     G
5  rs10497340     G
", header=T, stringsAsFactors=F)


library(dplyr)

# function that compares the values based on the key column
CompareDatasets = function(d1, d2) {
  d1 %>%
    left_join(d2, by="key") %>%
    mutate(IsSame = value.x == value.y)
}

# apply function
CompareDatasets(df1, df3)

#          key value.x value.y IsSame
# 1  rs1057079       C       C   TRUE
# 2  rs4845882       A       A   TRUE
# 3  rs1891932       T       C  FALSE
# 4   rs530296       A       G  FALSE
# 5 rs10497340       G       G   TRUE

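key is the key shared by both datasets, value.x is the value from the test dataset, value.y is the value from the control dataset (which is which depends on the order you pass the datasets to the function), and IsSame is a flag that shows when the values are equal.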
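One thing worth keeping in mind with the join-based approach: a key that is missing from the control gets NA in value.y, so the flag is NA rather than FALSE. The sketch below uses small made-up data frames (the key rs999 is hypothetical) to show this and one way to reduce the flag to a single yes/no answer.

```r
library(dplyr)

# Hypothetical test and control data; "rs999" is absent from the control
test_df <- data.frame(
  key   = c("rs1057079", "rs4845882", "rs999"),
  value = c("C", "G", "A"),
  stringsAsFactors = FALSE
)
control_df <- data.frame(
  key   = c("rs1057079", "rs4845882"),
  value = c("C", "A"),
  stringsAsFactors = FALSE
)

res <- test_df %>%
  left_join(control_df, by = "key") %>%
  mutate(IsSame = value.x == value.y)

res$IsSame            # the unmatched key yields NA

# Treat a missing control value as a mismatch when summarising
flag_strict <- !is.na(res$IsSame) & res$IsSame
all_match   <- all(flag_strict)
all_match
```

Whether a missing key should count as a mismatch or be ignored depends on the analysis; all(res$IsSame, na.rm = TRUE) would ignore it instead.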

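Another approach is to produce a single data frame as output (i.e. compare all test datasets against the control at once), but you need to create a column that holds the dataset name: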

library(dplyr)
library(purrr)

# function that takes the name of a dataset and returns the dataset with the name as a column
GetNameData = function(x) {
  d = get(x)
  d$name = x
  d
}

# vector of test datasets' names (multiple names)
# df3 will be the control
test = c("df1", "df2")

test %>%                                    # get the dataset names
  map_df(GetNameData) %>%                   # apply the function and get data (single dataframe)
  left_join(df3, by="key") %>%              # join the control group
  mutate(IsSame = value.x == value.y) %>%   # flag equal values
  select(name, everything())                # re-arrange columns

#    name        key value.x value.y IsSame
# 1   df1  rs1057079       C       C   TRUE
# 2   df1  rs4845882       A       A   TRUE
# 3   df1  rs1891932       T       C  FALSE
# 4   df1   rs530296       A       G  FALSE
# 5   df1 rs10497340       G       G   TRUE
# 6   df2  rs1057079       T       C  FALSE
# 7   df2  rs4845882       G       A  FALSE
# 8   df2  rs1891932       T       C  FALSE
# 9   df2   rs530296       A       G  FALSE
# 10  df2 rs10497340       A       G  FALSE
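As a possible variation on the approach above, the test datasets can be kept in a named list and stacked with dplyr's bind_rows(), whose .id argument writes the list names into a column. This avoids looking objects up by name with get(). The sketch below also adds a per-dataset summary; the names tests, result, summary_tbl, n_same and all_same are made up for this example.

```r
library(dplyr)

keys <- c("rs1057079", "rs4845882", "rs1891932", "rs530296", "rs10497340")
df1 <- data.frame(key = keys, value = c("C", "A", "T", "A", "G"), stringsAsFactors = FALSE)
df2 <- data.frame(key = keys, value = c("T", "G", "T", "A", "A"), stringsAsFactors = FALSE)
df3 <- data.frame(key = keys, value = c("C", "A", "C", "G", "G"), stringsAsFactors = FALSE)

# Named list of test datasets; the names become the "name" column
tests <- list(df1 = df1, df2 = df2)

result <- bind_rows(tests, .id = "name") %>%   # stack all test datasets
  left_join(df3, by = "key") %>%               # join the control
  mutate(IsSame = value.x == value.y)          # flag equal values

# One row per test dataset: how many keys agree, and whether all do
summary_tbl <- result %>%
  group_by(name) %>%
  summarise(n_same = sum(IsSame), all_same = all(IsSame))

summary_tbl
```

Adding another test dataset then only requires adding one more element to the list.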