For each column, sum scores by group over prior window of time
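I have a large panel data set (10,000,000 x 53) with about 50 columns of scores. I have aggregated each score by group (there are about 15,000 of them) and date.
Now I want to calculate a rolling sum of three values, comprising the scores of the prior two dates and the current date, and store it in a new corresponding sum column.
The sums should be calculated for each score column by date and group.
For the first and second dates within a group, fewer than 3 values are allowed.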
     GROUP DATE       LAGGED     SCORE1 SUM1      SCORE2 SUM2      ... SCORE50 SUM50
#1   A     2017-04-01 2017-03-30 1      1|1       2      2|2           4       4|4
#2   A     2017-04-02 2017-03-31 1      1+1|2     3      3+2|5         3       3+4|7
#3   A     2017-04-04 2017-04-02 2      2+1+1|4   4      4+3+2|9       2       2+3+4|9
#5   B     2017-04-02 2017-03-31 2      2|2       3      3|3           1       1|1
#6   B     2017-04-05 2017-04-03 2      2+2|4     2      2+3|5         1       1+1|2
#7   B     2017-04-08 2017-04-06 3      3+2+2|7   1      1+2+3|6       3       3+1+1|5
#8   C     2017-04-02 2017-03-31 3      3|3       1      1|1           1       1|1
#9   C     2017-04-03 2017-04-01 2      2+3|5     3      3+1|4         2       2+1|3
:    :     :          :          :      :         :      :             :       :
#10M XX    2018-03-30 2018-03-28 2      2         1      1         ... 1       1
David's answer covers most of my question about summing over windows by group, but I am still missing a few pieces.
library(data.table) #v1.10.4

## Convert to a proper date class, and add another column
## in order to define the range
setDT(input)[, c("Date", "Date2") := {
  Date = as.IDate(Date)
  Date2 = Date - 2L
  .(Date, Date2)
}]

## Run a non-equi join against the unique Date/Group combination in input
## Sum the Scores on the fly
## You can ignore the second Date column
input[unique(input, by = c("Date", "Group")), ## This removes the dupes
      on = .(Group, Date <= Date, Date >= Date2), ## The join condition
      .(Score = sum(Score)), ## sum the scores
      keyby = .EACHI] ## Run the sum by each row in
                      ## unique(input, by = c("Date", "Group"))
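My question has two parts:
- What code should replace "Score" so that the time-window sum is calculated for each column in a range of columns?
- Is this solution the most efficient version for fast calculation on a large data set?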
A possible solution:
cols <- grep('^SCORE', names(input), value = TRUE)
input[, gsub('SCORE','SUM',cols) := lapply(.SD, cumsum)
      , by = GROUP
      , .SDcols = cols][]
which gives:
GROUP DATE LAGGED SCORE1 SCORE2 SUM1 SUM2
1: A 2017-04-01 2017-03-30 1 2 1 2
2: A 2017-04-02 2017-03-31 1 3 2 5
3: A 2017-04-04 2017-04-02 2 4 4 9
4: B 2017-04-02 2017-03-31 2 3 2 3
5: B 2017-04-05 2017-04-03 2 2 4 5
6: B 2017-04-08 2017-04-06 3 1 7 6
7: C 2017-04-02 2017-03-31 3 1 3 1
8: C 2017-04-03 2017-04-01 2 3 5 4
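Note that cumsum() accumulates over every earlier row in the group, so it reproduces the requested three-value sum here only because no group in the sample has more than three dates. A minimal sketch of a fixed three-row window, assuming rows are ordered by DATE within each GROUP, adds each column to its first and second lags from shift(). The helper name roll3 and the toy table dt are hypothetical and only serve as illustration:

```r
library(data.table)

# Toy data (hypothetical): group A has four dates, so a 3-row window and a
# cumulative sum differ on its last row.
dt <- data.table(
  GROUP  = c("A", "A", "A", "A", "B", "B"),
  DATE   = as.IDate(c("2017-04-01", "2017-04-02", "2017-04-04", "2017-04-05",
                      "2017-04-02", "2017-04-05")),
  SCORE1 = c(1, 1, 2, 5, 2, 2),
  SCORE2 = c(2, 3, 4, 1, 3, 2)
)

cols     <- grep("^SCORE", names(dt), value = TRUE)
sum_cols <- sub("SCORE", "SUM", cols)

# Current value plus the two preceding ones. Lags that do not exist at the
# start of a group are filled with 0, so the first two rows use fewer values.
roll3 <- function(x) Reduce(`+`, shift(x, 0:2, fill = 0))

setorder(dt, GROUP, DATE)  # the window is defined on row order within GROUP
dt[, (sum_cols) := lapply(.SD, roll3), by = GROUP, .SDcols = cols]
dt[]
# Last row of group A: SUM1 is 5 + 2 + 1 = 8, whereas cumsum() would give 9.
```

This counts rows rather than calendar days, which is what the expected output in the question shows (row #3 sums three dates although only two fall within two days).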
If you also want to take the time window into account, you could do the following (assuming LAGGED is the start of the time window):
input[input[input[, .(GROUP, DATE, LAGGED)]
            , on = .(GROUP, DATE >= LAGGED, DATE <= DATE)
            ][, setNames(lapply(.SD, sum), gsub('SCORE','SUM',cols))
              , by = .(GROUP, DATE = DATE.1)
              , .SDcols = cols]
      , on = .(GROUP, DATE)]
which gives:
GROUP DATE LAGGED SCORE1 SCORE2 SUM1 SUM2
1: A 2017-04-01 2017-03-30 1 2 1 2
2: A 2017-04-02 2017-03-31 1 3 2 5
3: A 2017-04-04 2017-04-02 2 4 3 7
4: B 2017-04-02 2017-03-31 2 3 2 3
5: B 2017-04-05 2017-04-03 2 2 2 2
6: B 2017-04-08 2017-04-06 3 1 3 1
7: C 2017-04-02 2017-03-31 3 1 3 1
8: C 2017-04-03 2017-04-01 2 3 5 4
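On the first part of the question, one way to generalize the single-column join quoted there might be to replace `.(Score = sum(Score))` with `lapply(.SD, sum)` and name the score columns in `.SDcols`, still grouping with `by = .EACHI`. The sketch below assumes DATE and LAGGED are IDate columns and that LAGGED is DATE minus two days. It rebuilds the sample data inline, and the names sums and sum_cols are illustrative only. It has not been benchmarked on data of the size described in the question.

```r
library(data.table)

# Sample data from this answer, with dates stored as IDate
input <- data.table(
  GROUP  = rep(c("A", "B", "C"), c(3, 3, 2)),
  DATE   = as.IDate(c("2017-04-01", "2017-04-02", "2017-04-04",
                      "2017-04-02", "2017-04-05", "2017-04-08",
                      "2017-04-02", "2017-04-03")),
  SCORE1 = c(1, 1, 2, 2, 2, 3, 3, 2),
  SCORE2 = c(2, 3, 4, 3, 2, 1, 1, 3)
)
input[, LAGGED := DATE - 2L]  # start of the time window

cols     <- grep("^SCORE", names(input), value = TRUE)
sum_cols <- sub("SCORE", "SUM", cols)

# Non-equi self-join: for each row of input (used as i), sum every score
# column over the rows of the same GROUP whose DATE lies in [LAGGED, DATE].
sums <- input[input,
              on = .(GROUP, DATE >= LAGGED, DATE <= DATE),
              lapply(.SD, sum),
              by = .EACHI,
              .SDcols = cols]

# by = .EACHI returns one row per row of i, in the order of i, so the sums
# line up with input row by row.
input[, (sum_cols) := sums[, cols, with = FALSE]]
input[]
```

Aggregating inside the join avoids materializing the intermediate joined table before summing. Whether that is faster on 10 million rows and 50 columns would need to be measured.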
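Used data:
input <- fread(' GROUP DATE LAGGED SCORE1 SCORE2
A 2017-04-01 2017-03-30 1 2
A 2017-04-02 2017-03-31 1 3
A 2017-04-04 2017-04-02 2 4
B 2017-04-02 2017-03-31 2 3
B 2017-04-05 2017-04-03 2 2
B 2017-04-08 2017-04-06 3 1
C 2017-04-02 2017-03-31 3 1
C 2017-04-03 2017-04-01 2 3')
## Make sure both date columns are IDate (older fread versions read them
## as character), since the non-equi join compares them
date_cols <- c('DATE', 'LAGGED')
input[, (date_cols) := lapply(.SD, as.IDate), .SDcols = date_cols]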