Using lapply to output values between date ranges within different factor levels

Question

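I have two data frames: one holds daily sales data for different stores (df1), and the other records when each store was audited (df2). I need to create a new data frame that shows, for each audit listed in df2, the sales from that store during the week before the audit. Some example data, starting with the daily sales for the different stores over a period of time: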

Dates <- as.data.frame(seq(as.Date("2015/12/30"), as.Date("2016/4/7"),"day"))
Sales <- as.data.frame(matrix(sample(0:50, 30*10, replace=TRUE), ncol=3)) 
df1 <- cbind(Dates,Sales)
colnames(df1) <- c("Dates","Site.A","Site.B","Site.C")

And the audit dates for each store:

Store<- c("Store.A","Store.A","Store.B","Store.C","Store.C")
Audit_Dates <- as.data.frame(as.POSIXct(c("2016/1/4","2016/3/1","2016/2/1","2016/2/1","2016/3/1")))
df2 <- as.data.frame(cbind(Store,Audit_Dates ))
colnames(df2) <- c("Store","Audit_Dates")

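Note that the number of dates in each output is uneven (i.e. for some stores there may not be a full week of data before the audit). I previously asked a question about a similar problem. One of the answers, shown below, works if I only consider data from a single store: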

library(lubridate)
##Data input
Store.A_Dates <- as.data.frame(seq(as.Date("2015/12/30"), as.Date("2016/4/7"),"day"))
Store.A_Sales <- as.data.frame(matrix(sample(0:50, 10*10, replace=TRUE), ncol=1)) 
Store.A_df1 <- cbind(Store.A_Dates,Store.A_Sales)
colnames(Store.A_df1) <- c("Store.A_Dates","Store.A_Sales")
Store.A_df2 <- as.Date(c("2016/1/3","2016/3/1"))

##Output
Store.A_output<- lapply(Store.A_df2, function(x) {Store.A_df1[difftime(Store.A_df1[,1], x - days(7)) >= 0 & difftime(Store.A_df1[,1], x) <= 0, ]})
n1 <- max(sapply(Store.A_output, nrow))
output <- data.frame(lapply(Store.A_output,  function(x) x[seq_len(n1),]))

But I don't know how to do this for multiple stores.

Answer 1

Try this:

# Renamed vars for my convenience...
colnames(df1) <- c("t","Store.A","Store.B","Store.C")
colnames(df2) <- c("Store","t")

library(tidyr)
library(dplyr)

# Gather df1 so that df1 and df2 have the same format:

df1 = gather(df1, Store, Sales, -t)
head(df1)
           t   Store Sales
1 2015-12-30 Store.A    16
2 2015-12-31 Store.A    24
3 2016-01-01 Store.A     8
4 2016-01-02 Store.A    42
5 2016-01-03 Store.A     7
6 2016-01-04 Store.A    46

# This lapply call does not iterate over actual values, just indexes, which allows
# you to subset the data comfortably:

r <- lapply(1:nrow(df2), function(i) {
   audit.t = df2[i, "t"]                                     #time of audit
   audit.s = df1[, "Store"] == df2[i, "Store"]               #store audited
   df = df1[audit.s, ]                             #data from audited store
   df[, "audited"] = audit.t              #add extra column with audit date

   week_before = difftime(df[, "t"], audit.t - (7*24*3600)) >= 0
   week_audit  = difftime(df[, "t"], audit.t) <= 0

   df[week_before & week_audit, ]
})

Does this give you the correct subset?
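
One way to check the subset is to run the same window logic on a small deterministic data set where the expected rows can be counted by hand. The sketch below is only an illustration under stated assumptions, not part of the code above: the names `sales_long`, `audits`, and `window_before()` are hypothetical, and both date columns are stored as `Date`. The original data mixes `Date` and `POSIXct`, which may make the boundary day depend on the session time zone.

```r
# Hypothetical toy data: one sale per day for two stores, with known values.
days <- seq(as.Date("2016-01-01"), as.Date("2016-01-10"), by = "day")
sales_long <- data.frame(
  t     = rep(days, times = 2),
  Store = rep(c("Store.A", "Store.B"), each = length(days)),
  Sales = c(1:10, 101:110),
  stringsAsFactors = FALSE
)

# Audit dates kept as Date (an assumption; the question uses POSIXct).
audits <- data.frame(
  Store = c("Store.A", "Store.B"),
  t     = as.Date(c("2016-01-05", "2016-01-09")),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rows for one store from n_days before the audit
# up to and including the audit day, tagged with the audit date.
window_before <- function(sales, store, audit_date, n_days = 7) {
  keep <- sales$Store == store &
    sales$t >= audit_date - n_days &
    sales$t <= audit_date
  out <- sales[keep, ]
  out$audited <- rep(audit_date, nrow(out))
  out
}

# Iterate over row indexes of the audit table, as in the lapply call above.
checks <- lapply(seq_len(nrow(audits)), function(i) {
  window_before(sales_long, audits$Store[i], audits$t[i])
})

# Store.A only has data from 2016-01-01, so its window is cut short;
# Store.B has the audit day plus the 7 days before it.
sapply(checks, nrow)
```

If the row counts and dates match what you expect here, the same comparison can be applied to the real data with more confidence.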

Also, to summarise your results:

r = do.call("rbind", r) %>% 
  group_by(audited, Store) %>% 
  summarise(sales = sum(Sales))

r

     audited   Store sales
      <time>   <chr> <int>
1 2016-01-04 Store.A    97
2 2016-02-01 Store.B   156
3 2016-02-01 Store.C   226
4 2016-03-01 Store.A   115
5 2016-03-01 Store.C   187
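
As an alternative to looping over audits, the same kind of summary can be sketched with a join followed by a filter. This is not the method used above, just a base R sketch under stated assumptions: `sales_long` and `audits` are hypothetical stand-ins for the gathered `df1` and for `df2`, and both date columns are stored as `Date`.

```r
# Hypothetical stand-ins for the gathered df1 and for df2 (dates as Date).
days <- seq(as.Date("2016-01-01"), as.Date("2016-01-10"), by = "day")
sales_long <- data.frame(
  t     = rep(days, times = 2),
  Store = rep(c("Store.A", "Store.B"), each = length(days)),
  Sales = c(1:10, 101:110),
  stringsAsFactors = FALSE
)
audits <- data.frame(
  Store   = c("Store.A", "Store.A", "Store.B"),
  audited = as.Date(c("2016-01-05", "2016-01-10", "2016-01-09")),
  stringsAsFactors = FALSE
)

# Pair every sales row with every audit of the same store.
joined <- merge(sales_long, audits, by = "Store")

# Keep only rows from the 7 days before each audit up to the audit day.
in_window <- joined$t >= joined$audited - 7 & joined$t <= joined$audited
weekly <- joined[in_window, ]

# One total per (audit date, store), analogous to group_by() + summarise().
totals <- aggregate(Sales ~ audited + Store, data = weekly, FUN = sum)
totals
```

The join creates one row per sales record per audit of that store before filtering, so for large tables with many audits per store the index-based lapply may use less memory.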
