Efficiently look up data from one data.frame in another data.frame by group

Question

I am looking for a faster solution to the following problem.

Suppose I have the following two datasets.

df1 <- data.frame(Var1 = c(5011, 2484, 4031, 1143, 7412),
              Var2 = c(2161, 2161, 2161, 2161, 8595))
df2 <- data.frame(team=c("A","A", "B", "B", "B", "C", "C", "D", "D"),
              class=c("5011", "2161", "2484", "4031", "1143", "2161", "5011", "8595", "1143"),
              attribute=c("X1", "X2", "X1", "Z1", "Z2", "Y1", "X1", "Z1", "X2"),
              stringsAsFactors=FALSE)


> df1
  Var1 Var2
1 5011 2161
2 2484 2161
3 4031 2161
4 1143 2161
5 7412 8595

> df2
  team class attribute
1    A  5011        X1
2    A  2161        X2
3    B  2484        X1
4    B  4031        Z1
5    B  1143        Z2
6    C  2161        Y1
7    C  5011        X1
8    D  8595        Z1
9    D  1143        X2

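I want to know which teams in df2 have both of the classes that appear together in a row of df1. The order of the values within a row does not matter to me.

My current code (pasted below) works, but it is extremely inefficient.

Some rules:

Only teams A and C have a pair of classes that appears as a row in df1.
Teams B and D have no pair of classes that appears as a row in df1, so they are excluded from the output.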
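To make the rule concrete, here is a minimal sketch that checks it for the first row of df1 only. The helper names `classes_by_team`, `row1` and `meets` are illustrative and not part of the original code; the sketch assumes the df1 and df2 shown above.

```r
df1 <- data.frame(Var1 = c(5011, 2484, 4031, 1143, 7412),
                  Var2 = c(2161, 2161, 2161, 2161, 8595))
df2 <- data.frame(team = c("A", "A", "B", "B", "B", "C", "C", "D", "D"),
                  class = c("5011", "2161", "2484", "4031", "1143",
                            "2161", "5011", "8595", "1143"),
                  stringsAsFactors = FALSE)

# one character vector of classes per team (illustrative helper)
classes_by_team <- split(df2$class, df2$team)

# the two values of the first row of df1, as character to match df2$class
row1 <- as.character(unlist(df1[1, ]))

# a team qualifies for this row if it has both values among its classes
meets <- vapply(classes_by_team, function(cl) all(row1 %in% cl), logical(1))
print(meets)
```

For this row the check is TRUE for teams A and C and FALSE for B and D, which matches the rules stated above.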

Code:

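    teams <- c()
    atts <- c()
    pxs <- unique(df2$team)

    for (j in pxs) {
      # classes and attributes that belong to team j
      subs <- subset(df2, team == j)
      for (i in seq_len(nrow(df1))) {
        # keep the team if both values in row i of df1 occur among its classes
        if (all(df1[i, ] %in% subs$class)) {
          # subs$team[1] is the team name; indexing by i only worked by accident
          teams <- rbind(teams, subs$team[1])
          atts <- rbind(atts, subs$attribute)
        }
      }
    }

    output <- cbind(teams, atts)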

> output
     [,1] [,2] [,3]
[1,] "A"  "X1" "X2"
[2,] "C"  "Y1" "X1"

The real data consists of millions of rows in both df1 and df2.

How can I do this more efficiently? Perhaps with an apply approach combined with data.table?

Answer 1

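I am not entirely sure what your rules are trying to achieve.

Judging from your sample data, code, and output, you probably want to join on each column of df1 separately and then inner join the two results: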

library(data.table)
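setDT(df1)
setDT(df2)
#df2 in the question names this column `class`; rename it to `cls` (used below)
#and convert it to integer so it can be joined to the numeric columns of df1
setnames(df2, "class", "cls")
df2[, cls := as.integer(cls)]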

#left join df1 with df2 using Var1
v1 <- df2[df1, on=.(cls=Var1)]

#left join df1 with df2 using Var2
v2 <- df2[df1, on=.(cls=Var2)]

#inner join the 2 previous results to ensure that the same team is picked 
#where classes already match in v1 and v2
v1[v2, on=.(team, cls=Var1, Var2=cls), nomatch=0L]

Output:

   team  cls attribute Var2 i.attribute
1:    A 5011        X1 2161          X2
2:    C 5011        X1 2161          Y1
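
For readers without data.table, the same two-step idea can be sketched with base R `merge`. This is an illustrative alternative rather than part of the answer above, and it is likely slower on millions of rows. The column names `att1` and `att2` and the objects `m1`, `m2` and `result` are assumptions introduced for the sketch.

```r
df1 <- data.frame(Var1 = c(5011, 2484, 4031, 1143, 7412),
                  Var2 = c(2161, 2161, 2161, 2161, 8595))
df2 <- data.frame(team = c("A", "A", "B", "B", "B", "C", "C", "D", "D"),
                  class = c("5011", "2161", "2484", "4031", "1143",
                            "2161", "5011", "8595", "1143"),
                  attribute = c("X1", "X2", "X1", "Z1", "Z2",
                                "Y1", "X1", "Z1", "X2"),
                  stringsAsFactors = FALSE)

# make class numeric so it can be matched against Var1 and Var2
df2$class <- as.numeric(df2$class)

# step 1: every (df1 row, team) pair where the team has class Var1
m1 <- merge(df1, df2, by.x = "Var1", by.y = "class")
names(m1)[names(m1) == "attribute"] <- "att1"

# step 2: keep only pairs where the same team also has class Var2
m2 <- merge(m1, df2, by.x = c("team", "Var2"), by.y = c("team", "class"))
names(m2)[names(m2) == "attribute"] <- "att2"

result <- m2[, c("team", "Var1", "Var2", "att1", "att2")]
print(result)
```

On the sample data this keeps only teams A and C, the same teams as in the data.table output above.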
