R - Merging Two Data.Frames with Row-Level Conditional Variables

Short version: I have a merge operation that is a bit trickier than the usual one, and I'd like help optimizing it with dplyr or merge. I already have several solutions that work, but they run quite slowly on large datasets, and I'm curious whether a faster approach exists in R (or in SQL or Python, for that matter).


I have two data.frames:

  1. an asynchronous event log associated with stores, and
  2. a table providing more detail about the stores appearing in that log.

The problem: a store ID is a unique identifier for a particular location, but ownership of a store location can change from one period to the next (for completeness, no two owners can own the same store at the same time). So when I merge over the store-level information, I need some kind of condition so that the store-level information from the correct period is the information that gets merged in.


Reproducible example:

# asynchronous log. 
#  t for period. 
#  Store for store loc ID
#  var1 just some variable. 
set.seed(1)
df <- data.frame(
  t     = c(1,1,1,2,2,2,3,3,4,4,4),
  Store = c(1,2,3,1,2,3,1,3,1,2,3),
  var1 =  runif(11,0,1)
)

# Store table
# You can see, lots of store locations opening and closing.
#  StartDate is when this business came into existence
#  Store is the store id from df
#  CloseDate is when this store went out of business
#  storeVar1 is just some important var to merge over
Stores <- data.frame(
  StartDate = c(0,0,0,4,4),
  Store     = c(1,2,3,2,3),
  CloseDate = c(9,2,3,9,9),
  storeVar1 = c("a","b","c","d","e")
)

Now I'd simply like to merge the information from Stores onto the df records, but only if the store was open for business in that period (t). CloseDate and StartDate indicate the last and first periods, respectively, in which that business operated. (Less important, but for completeness: a StartDate of 0 means the store existed before the sample began, and a CloseDate of 9 means the store had not gone out of business at that location by the end of the sample.)
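Just to make the condition concrete, the naive way to write it is a plain merge followed by a row-level filter; it gives the output I want on toy data, but it first materializes every Store-level match before throwing rows away, which is exactly what I'd like to avoid at scale (naive is just an illustrative name):

# Naive illustration only: merge on Store, then keep the rows where the
# period falls inside that owner's StartDate/CloseDate window.
# Note: unlike a left join, log rows with no open store in that period
# would be dropped entirely here.
naive <- merge(df, Stores, by = "Store")
naive <- naive[naive$StartDate <= naive$t & naive$t <= naive$CloseDate, ]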

One solution relies on split()-ting at the period (t) level and dplyr::rbind_all(), e.g.:

# The following seems to do the trick. 
complxMerge_v1 <- function(df, Stores, by = "Store"){
  library("dplyr")
  # Split the log by period, left-join each piece against only the
  # stores that were open in that period, then stack the pieces back up.
  # (Note: rbind_all() is superseded by bind_rows() in current dplyr.)
  temp <- split(df, df$t)
  for (Period in names(temp)){
    temp[[Period]] <- dplyr::left_join(
      temp[[Period]],
      dplyr::filter(Stores, 
                    StartDate <= as.numeric(Period) & 
                    CloseDate >= as.numeric(Period)),
      by = by
    )
  }
  df <- dplyr::rbind_all(temp); rm(temp)
  df
}
complxMerge_v1(df, Stores, "Store")

Functionally, this seems to work (I haven't run into any major errors yet, anyway). However, we're dealing with (increasingly common) billions of rows of log data, so speed matters.

If you'd like to use it for benchmarking, I've put a larger reproducible example up on sense.io. See here: https://sense.io/economicurtis/r-faster-merging-of-two-data.frames-with-row-level-conditionals


A few questions:

  1. First, is there another approach along these lines that would run faster?
  2. Is there a quick and easy solution in SQL or Python (neither of which I know well, but could lean on if needed)?
  3. Also, can you help me phrase this problem in more general, abstract terms? Right now I can only talk about it in context-specific terms, but I'd like to be able to discuss these kinds of problems using more appropriate, general-purpose programming or data-manipulation vocabulary.

In R, you can take a look at the data.table::foverlaps function:

library(data.table)

# Set start and end values in `df` and key by them and by `Store`
setDT(df)[, c("StartDate", "CloseDate") := list(t, t)]      
setkey(df, Store, StartDate, CloseDate)

# Run `foverlaps` function
foverlaps(setDT(Stores), df)
#     Store t       var1 StartDate CloseDate i.StartDate i.CloseDate storeVar1
#  1:     1 1 0.26550866         1         1           0           9         a
#  2:     1 2 0.90820779         2         2           0           9         a
#  3:     1 3 0.94467527         3         3           0           9         a
#  4:     1 4 0.62911404         4         4           0           9         a
#  5:     2 1 0.37212390         1         1           0           2         b
#  6:     2 2 0.20168193         2         2           0           2         b
#  7:     3 1 0.57285336         1         1           0           3         c
#  8:     3 2 0.89838968         2         2           0           3         c
#  9:     3 3 0.66079779         3         3           0           3         c
# 10:     2 4 0.06178627         4         4           4           9         d
# 11:     3 4 0.20597457         4         4           4           9         e
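On the more abstract framing asked about above: this kind of operation is usually called a non-equi join (here specifically an interval, or range, join), which is also the term to search for in SQL. If you have dplyr >= 1.1.0 available, join_by() can express the same condition directly; a minimal sketch, assuming the original df (i.e. before the helper StartDate/CloseDate columns were added for foverlaps above):

library("dplyr")

# Non-equi (interval) join: match on Store and require
# StartDate <= t <= CloseDate (needs dplyr >= 1.1.0 for join_by()).
left_join(df, Stores,
          by = join_by(Store, between(t, StartDate, CloseDate)))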

Alternatively, you can transform your Stores data.frame by adding a t column containing all the values of t during which a given store was open, and then use the unnest function from Hadley's tidyr package to convert it to "long" form.

require("tidyr")
require("dplyr")

complxMerge_v2 <- function(df, Stores, by = NULL)    {
  Stores %>% mutate(., t = lapply(1:nrow(.), 
                                  function(ii) (.)[ii, "StartDate"]:(.)[ii, "CloseDate"]))%>%
    unnest(t) %>% left_join(df, ., by = by)
}

complxMerge_v2(df, Stores)
# Joining by: c("t", "Store")
#    t Store       var1 StartDate CloseDate storeVar1
# 1  1     1 0.26550866         0         9         a
# 2  1     2 0.37212390         0         2         b
# 3  1     3 0.57285336         0         3         c
# 4  2     1 0.90820779         0         9         a
# 5  2     2 0.20168193         0         2         b
# 6  2     3 0.89838968         0         3         c
# 7  3     1 0.94467527         0         9         a
# 8  3     3 0.66079779         0         3         c
# 9  4     1 0.62911404         0         9         a
# 10 4     2 0.06178627         4         9         d
# 11 4     3 0.20597457         4         9         e
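If the lapply() over row indices looks awkward, the same list-column can be built with Map() over the two date columns; this is only a cosmetic variation of the same idea and is not part of the benchmark below:

# Same expansion of each StartDate:CloseDate range, written with Map()
Stores %>%
  mutate(t = Map(seq, StartDate, CloseDate)) %>%
  unnest(t)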

require("microbenchmark")
# I've downloaded your large data samples
df <- read.csv("./df.csv")
Stores <- read.csv("./Stores.csv")

microbenchmark(complxMerge_v1(df, Stores), complxMerge_v2(df, Stores), times = 10L)

# Unit: milliseconds
#                       expr      min       lq      mean    median        uq       max neval
# complxMerge_v1(df, Stores) 9501.217 9623.754 9712.8689 9681.3808 9816.8984 9886.5962    10
# complxMerge_v2(df, Stores)  532.744  539.743  567.7207  561.9635  588.0637  636.5775    10
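If you also want the foverlaps approach in the same benchmark, it needs a small wrapper so that the setDT()/setkey() set-up is counted as part of the work; a sketch only (complxMerge_v3 is just an illustrative name, and the timings above do not include it):

# Hypothetical wrapper around the data.table overlap join so it can be
# passed to microbenchmark() alongside the other two versions.
complxMerge_v3 <- function(df, Stores, by = NULL){
  library("data.table")
  df <- as.data.table(df)          # copy, so the original df is untouched
  Stores <- as.data.table(Stores)
  df[, c("StartDate", "CloseDate") := list(t, t)]
  setkey(df, Store, StartDate, CloseDate)
  foverlaps(Stores, df)
}

# microbenchmark(complxMerge_v1(df, Stores), complxMerge_v2(df, Stores),
#                complxMerge_v3(df, Stores), times = 10L)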

Here are the step-by-step results, to make the process clear.

Stores_with_t <- 
  Stores %>% mutate(., t = lapply(1:nrow(.), 
                                  function(ii) (.)[ii, "StartDate"]:(.)[ii, "CloseDate"]))
#   StartDate Store CloseDate storeVar1                            t
# 1         0     1         9         a 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
# 2         0     2         2         b                      0, 1, 2
# 3         0     3         3         c                   0, 1, 2, 3
# 4         4     2         9         d             4, 5, 6, 7, 8, 9
# 5         4     3         9         e             4, 5, 6, 7, 8, 9

# After that `unnest(t)`

Stores_with_t_unnest <- 
  Stores_with_t %>% unnest(t)
#    StartDate Store CloseDate storeVar1 t
# 1          0     1         9         a 0
# 2          0     1         9         a 1
# 3          0     1         9         a 2
# 4          0     1         9         a 3
# 5          0     1         9         a 4
# 6          0     1         9         a 5
# 7          0     1         9         a 6
# 8          0     1         9         a 7
# 9          0     1         9         a 8
# 10         0     1         9         a 9
# 11         0     2         2         b 0
# 12         0     2         2         b 1
# 13         0     2         2         b 2
# 14         0     3         3         c 0
# 15         0     3         3         c 1
# 16         0     3         3         c 2
# 17         0     3         3         c 3
# 18         4     2         9         d 4
# 19         4     2         9         d 5
# 20         4     2         9         d 6
# 21         4     2         9         d 7
# 22         4     2         9         d 8
# 23         4     2         9         d 9
# 24         4     3         9         e 4
# 25         4     3         9         e 5
# 26         4     3         9         e 6
# 27         4     3         9         e 7
# 28         4     3         9         e 8
# 29         4     3         9         e 9

# And then simple `left_join`

left_join(df, Stores_with_t_unnest)
# Joining by: c("t", "Store")
#    t Store       var1 StartDate CloseDate storeVar1
# 1  1     1 0.26550866         0         9         a
# 2  1     2 0.37212390         0         2         b
# 3  1     3 0.57285336         0         3         c
# 4  2     1 0.90820779         0         9         a
# 5  2     2 0.20168193         0         2         b
# 6  2     3 0.89838968         0         3         c
# 7  3     1 0.94467527         0         9         a
# 8  3     3 0.66079779         0         3         c
# 9  4     1 0.62911404         0         9         a
# 10 4     2 0.06178627         4         9         d
# 11 4     3 0.20597457         4         9         e