Find the intersection of two dataframes and compute the average of an integer row in the dataframe

Question

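I have two data frames containing id, score, and studentName.

I want to create a data frame containing only the ids that appear in both test1 and test2. Then I want to average each student's scores.

Here is some sample data: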

test1 <- data.frame(id = numeric(0), score = integer(0), studentName = character(0), stringsAsFactors = FALSE)
test1[1, ] <- c(1, 100, "Alice")
test1[2, ] <- c(2, 98, "Bob")
test1[3, ] <- c(3, 64, "Josh")
test1[4, ] <- c(4, 84, "Jake")

test2 <- data.frame(id = numeric(0), score = integer(0), studentName = character(0), stringsAsFactors = FALSE)
test2[1, ] <- c(1, 90, "Alice")
test2[2, ] <- c(2, 95, "Bob")
test2[3, ] <- c(3, 80, "Josh")
test2[4, ] <- c(10, 50, "Emma")

The output should be a data frame with the following rows:

(1, 95, "Alice")
(2, 96.5, "Bob")
(3, 72, "Josh")

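Note that the students with ids 4 and 10 are omitted, because they do not appear in both test1 and test2.

I was thinking of using the apply function together with intersection and mean, but I'm not sure how to set it up.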

Answer 1

In base R, you can use merge and rowMeans (assuming the 'score' column is numeric).

 res <- merge(test1, test2[-1], by='studentName')
 res
 #   studentName id score.x score.y
 #1       Alice  1     100      90
 #2         Bob  2      98      95
 #3        Josh  3      64      80

We average across the rows of the "score.x" and "score.y" columns, which are the 3rd and 4th columns of "res". rowMeans(res[,3:4]) returns the mean of each row across those two columns.

 res$score <- rowMeans(res[,3:4])

If we don't need "score.x" and "score.y", we can drop them with a negative index, -c(3:4) or -(3:4):

 res[-(3:4)]
 #   studentName id score
 #1       Alice  1  95.0
 #2         Bob  2  96.5
 #3        Josh  3  72.0
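
The steps above depend on 'score' being numeric. With the sample data built as in the question, each row is assigned from a vector such as c(1, 100, "Alice"), which coerces the values to character, so rowMeans would not work on it directly. Here is a minimal end-to-end sketch, assuming the columns are converted with as.numeric first (the conversion step and the num_cols name are additions for illustration, not part of the original answer):

```r
# Build the sample data the same way the question does.
test1 <- data.frame(id = numeric(0), score = integer(0),
                    studentName = character(0), stringsAsFactors = FALSE)
test1[1, ] <- c(1, 100, "Alice")
test1[2, ] <- c(2, 98, "Bob")
test1[3, ] <- c(3, 64, "Josh")
test1[4, ] <- c(4, 84, "Jake")

test2 <- data.frame(id = numeric(0), score = integer(0),
                    studentName = character(0), stringsAsFactors = FALSE)
test2[1, ] <- c(1, 90, "Alice")
test2[2, ] <- c(2, 95, "Bob")
test2[3, ] <- c(3, 80, "Josh")
test2[4, ] <- c(10, 50, "Emma")

# Assumption: convert id and score back to numeric before averaging,
# since c() with a string coerces every value in the row to character.
num_cols <- c("id", "score")
test1[num_cols] <- lapply(test1[num_cols], as.numeric)
test2[num_cols] <- lapply(test2[num_cols], as.numeric)

# Inner join on studentName, then average the two score columns row by row.
res <- merge(test1, test2[-1], by = "studentName")
res$score <- rowMeans(res[, c("score.x", "score.y")])

# Keep only the columns the question asks for.
res <- res[, c("id", "score", "studentName")]
```

Selecting the score columns by name rather than by position keeps the sketch working even if the column order of the merged result changes.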

Answer 2

Using library(dplyr):

df <- inner_join(test1,test2[,-3],by="id")
df <- df %>% mutate(mean_score = (score.x + score.y)/2) %>% select(-c(score.x,score.y))

If you load the magrittr package, you can simplify the second line with the %<>% operator:

df %<>% mutate(mean_score = (score.x + score.y)/2) %>% select(-c(score.x,score.y))
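
For reference, the two dplyr steps can also be written as a single pipeline. The sketch below is self-contained and assumes the sample data is rebuilt with properly typed columns (the data.frame() calls here are a substitute for the question's row-by-row construction, which leaves 'score' as character); the output column is named score to match the layout requested in the question:

```r
library(dplyr)

# Assumption: sample data recreated with numeric id and score columns.
test1 <- data.frame(id = c(1, 2, 3, 4),
                    score = c(100, 98, 64, 84),
                    studentName = c("Alice", "Bob", "Josh", "Jake"),
                    stringsAsFactors = FALSE)
test2 <- data.frame(id = c(1, 2, 3, 10),
                    score = c(90, 95, 80, 50),
                    studentName = c("Alice", "Bob", "Josh", "Emma"),
                    stringsAsFactors = FALSE)

# Keep only ids present in both data frames, then average the two scores.
# test2[, -3] drops studentName so it is not duplicated in the join.
df <- inner_join(test1, test2[, -3], by = "id") %>%
  mutate(score = (score.x + score.y) / 2) %>%
  select(id, score, studentName)
```

Joining on id rather than studentName avoids accidental matches if two different students happen to share a name.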
