R sum by group if date within date range
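Assume I have two data frames.
The first one includes the "Date" on which a "Name" issues a "Rec" for an "ID", and the "Stop.Date" on which that "Rec" becomes invalid.
df (only a part of it)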
structure(list(Date = structure(c(13236, 13363, 14074, 13199,
14554), class = "Date"), ID = c("AU0000XINAA9", "AU0000XINAA9",
"AU0000XINAC5", "AU0000XINAI2", "AU0000XINAJ0"), Name = c("N+1 BREWIN",
"N+1 BREWIN", "ARBUTHNOT SECURITIES LTD.", "INVESTEC BANK (UK) PLC",
"AWRAQ INVESTMENTS"), Rec = c(1, 2, 2, 2, 1), Stop.Date = structure(c(13363,
13509, 14937, 13230, 16702), class = "Date")), .Names = c("Date",
"ID", "Name", "Rec", "Stop.Date"), class = c("data.table", "data.frame"
), row.names = c(NA, -5L))
The second data frame just contains a time series: assume in this case from 29 March 2006 until the end of 2006.
df2
Date1
1: 2006-02-20
2: 2006-02-21
3: 2006-02-22
4: 2006-02-23
5: 2006-02-24
---
311: 2006-12-27
312: 2006-12-28
313: 2006-12-29
314: 2006-12-30
315: 2006-12-31
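Now, I want my code to sum up all "Rec" by ID and Name if the "Date1" variable from df2 falls within the time range (Date until Stop.Date).
I found this post, which seems very close to my problem, but the solution does not take any groups into account.
I would like to come up with a data.frame in which, for each date in df2, the sum of "Rec" for each "ID" is shown.
Expected output, for example: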
Date1 ID SumRec
1 2006-02-20 AU0000XINAI2 2
2 2006-02-21 AU0000XINAI2 2
...
4 2006-03-29 AU0000XINAA9 1
5 2006-03-30 AU0000XINAA9 1
6 2006-08-03 AU0000XINAA9 2 # since Date1 2006-08-03 is at the end
of range in df (row#1)-> it falls
within range in df (row#2)
...
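Keep in mind that this is only a small part of the data. Normally there are many more Recs per "ID", issued by different "Names" (only then does the sum function make sense).
Thank you very much for your help in advance.
Updated version
New data frames: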
df
structure(list(Date = structure(c(9905, 10381, 10381, 10954,
10584, 10632, 10778, 10520, 10631, 10905), class = "Date"), ID = c("BMG4593F1389",
"BMG4593F1389", "BMG4593F1389", "BMG4593F1389", "BMG4593F1389",
"BMG4593F1389", "BMG4593F1389", "BMG526551004", "BMG526551004",
"BMG526551004"), Name = c("ING FM", "Permission Denied 128064",
"Permission Denied 2880", "Permission Denied 2880", "Permission Denied 32",
"Permission Denied 888", "Permission Denied 888", "Permission Denied 2880",
"Permission Denied 2880", "Permission Denied 2880"), Rec = c(2,
3, 2, 2, 3, 3, 3, 1, 3, 3), Stop.Date = structure(c(12095, 11232,
10954, 11180, 11345, 10764, 11667, 10631, 10905, 11087), class = "Date")), .Names = c("Date",
"ID", "Name", "Rec", "Stop.Date"), class = c("data.table", "data.frame"
), row.names = c(NA, -10L))
df2
structure(list(Date1 = structure(c(10954, 10955, 10956, 10957,
10958, 10959), class = "Date")), .Names = "Date1", row.names = c(NA,
-6L), class = c("data.table", "data.frame"))
If I now run the code below:
library(data.table)
library(lubridate)  # provides interval() and %within%

df <- df[, interval := interval(df$Date, df$Stop.Date)]

df1 <- do.call(rbind, lapply(df2$Date1, function(x) {
  index <- x %within% df$interval
  list(ID = ifelse(any(index), df$ID[index], NA),
       Rec = ifelse(any(index), df$Rec[index], NA),
       Name = ifelse(any(index), df$Name[index], NA),
       interval = ifelse(any(index), df$interval[index], NA))
}))

df3 <- cbind(df2, df1)
I get the following result:
Date1 ID Rec Name interval
1: 1999-12-29 BMG4593F1389 2 ING FM 189216000
2: 1999-12-30 BMG4593F1389 2 ING FM 189216000
3: 1999-12-31 BMG4593F1389 2 ING FM 189216000
4: 2000-01-01 BMG4593F1389 2 ING FM 189216000
5: 2000-01-02 BMG4593F1389 2 ING FM 189216000
6: 2000-01-03 BMG4593F1389 2 ING FM 189216000
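But since, for example, the df2$Date1 "1999-12-29" for the df$ID "BMG4593F1389" falls within the date range of 6 other entries in df (for different df$Name values), the result for this specific df2$Date1 should be:
Expected result for the date 1999-12-29 (for simplicity, the df3$interval variable is ignored here)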
Date1 ID Rec Name
1: 1999-12-29 BMG4593F1389 2 ING FM
2: 1999-12-29 BMG4593F1389 3 Permission Denied 128064
3: 1999-12-29 BMG4593F1389 2 Permission Denied 2880
4: 1999-12-29 BMG4593F1389 3 Permission Denied 32
5: 1999-12-29 BMG4593F1389 3 Permission Denied 888
6: 1999-12-29 BMG526551004 3 Permission Denied 2880
7: 1999-12-30 BMG4593F1389 2 ING FM
... etc
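So in the end I need to duplicate the dates in df2$Date1 whenever several Names have issued a Rec for a specific df$ID and the date lies within the respective date range.
Can anybody help me with this?
If I understand the updated version of the question correctly, this can be solved using a non-equi join followed by an aggregation: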
library(data.table)
# non-equi join
df[df2, on = .(Date <= Date1, Stop.Date > Date1), allow.cartesian = TRUE][
# aggregation
, .(sumRec = sum(Rec)), by = .(Date, ID, Name)]
Date ID Name sumRec
1: 1999-12-29 BMG4593F1389 ING FM 2
2: 1999-12-29 BMG4593F1389 Permission Denied 128064 3
3: 1999-12-29 BMG4593F1389 Permission Denied 2880 2
4: 1999-12-29 BMG4593F1389 Permission Denied 32 3
5: 1999-12-29 BMG4593F1389 Permission Denied 888 3
6: 1999-12-29 BMG526551004 Permission Denied 2880 3
7: 1999-12-30 BMG4593F1389 ING FM 2
8: 1999-12-30 BMG4593F1389 Permission Denied 128064 3
9: 1999-12-30 BMG4593F1389 Permission Denied 2880 2
10: 1999-12-30 BMG4593F1389 Permission Denied 32 3
11: 1999-12-30 BMG4593F1389 Permission Denied 888 3
12: 1999-12-30 BMG526551004 Permission Denied 2880 3
13: 1999-12-31 BMG4593F1389 ING FM 2
14: 1999-12-31 BMG4593F1389 Permission Denied 128064 3
15: 1999-12-31 BMG4593F1389 Permission Denied 2880 2
16: 1999-12-31 BMG4593F1389 Permission Denied 32 3
17: 1999-12-31 BMG4593F1389 Permission Denied 888 3
18: 1999-12-31 BMG526551004 Permission Denied 2880 3
19: 2000-01-01 BMG4593F1389 ING FM 2
20: 2000-01-01 BMG4593F1389 Permission Denied 128064 3
21: 2000-01-01 BMG4593F1389 Permission Denied 2880 2
22: 2000-01-01 BMG4593F1389 Permission Denied 32 3
23: 2000-01-01 BMG4593F1389 Permission Denied 888 3
24: 2000-01-01 BMG526551004 Permission Denied 2880 3
25: 2000-01-02 BMG4593F1389 ING FM 2
26: 2000-01-02 BMG4593F1389 Permission Denied 128064 3
27: 2000-01-02 BMG4593F1389 Permission Denied 2880 2
28: 2000-01-02 BMG4593F1389 Permission Denied 32 3
29: 2000-01-02 BMG4593F1389 Permission Denied 888 3
30: 2000-01-02 BMG526551004 Permission Denied 2880 3
31: 2000-01-03 BMG4593F1389 ING FM 2
32: 2000-01-03 BMG4593F1389 Permission Denied 128064 3
33: 2000-01-03 BMG4593F1389 Permission Denied 2880 2
34: 2000-01-03 BMG4593F1389 Permission Denied 32 3
35: 2000-01-03 BMG4593F1389 Permission Denied 888 3
36: 2000-01-03 BMG526551004 Permission Denied 2880 3
Date ID Name sumRec
Note that I got a strange error message when using df directly as provided by structure(...). The error message disappeared after calling
df <- as.data.table(df)
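Explanation
I was asked to explain how the non-equi join works. Non-equi joins are an extension of data.table joins. data.table is a package that enhances base R's data.frame.
Here, df2 is right joined with df, i.e., we want to see all rows of df2 in the result together with their matches in df, but only those matches where Date1 (from df2) lies between Date and Stop.Date (from df), or more precisely where Date <= Date1 < Stop.Date. Since there are many possible matches, we need to use allow.cartesian = TRUE.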
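As a minimal sketch of the half-open condition Date <= Date1 < Stop.Date, the same join can be applied to the first sample data set from the question. The three-row df2 and the result name res are assumptions chosen here for illustration; the grouping by date and ID (rather than by Name as well) follows the expected output of the original question.

```r
library(data.table)

# First sample data set from the question, dates given as days since 1970-01-01
df <- data.table(
  Date      = as.Date(c(13236, 13363, 14074, 13199, 14554), origin = "1970-01-01"),
  ID        = c("AU0000XINAA9", "AU0000XINAA9", "AU0000XINAC5",
                "AU0000XINAI2", "AU0000XINAJ0"),
  Name      = c("N+1 BREWIN", "N+1 BREWIN", "ARBUTHNOT SECURITIES LTD.",
                "INVESTEC BANK (UK) PLC", "AWRAQ INVESTMENTS"),
  Rec       = c(1, 2, 2, 2, 1),
  Stop.Date = as.Date(c(13363, 13509, 14937, 13230, 16702), origin = "1970-01-01")
)

# Illustrative df2 with three dates only (assumption, not the full series)
df2 <- data.table(Date1 = as.Date(c("2006-02-20", "2006-03-29", "2006-08-03")))

# Non-equi join: keep rows of df whose range [Date, Stop.Date) contains Date1.
# After the join, the column Date holds the value of Date1.
res <- df[df2, on = .(Date <= Date1, Stop.Date > Date1), allow.cartesian = TRUE][
  , .(SumRec = sum(Rec)), by = .(Date1 = Date, ID)]

print(res)
# 2006-08-03 is the Stop.Date of row 1 and the Date of row 2,
# so only row 2 (Rec = 2) matches on that day.
```

Because the upper bound is strict, a date that closes one range and opens the next is counted once only, which is why 2006-08-03 yields a SumRec of 2 rather than 3.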
There is a video of Arun's talk at the useR! 2016 international R user conference introducing Efficient in-memory non-equi joins using data.table.