Non-equi join of data table operation
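I would like to add columns to table 1 that are operations on table 2, joined by a variable and restricted to rows where the date in table 2 is <= the date in table 1. I am looking for a solution that is not too computationally expensive (I have about 20k rows).
Table 1 - I have a dataset of proposals, their owners, and the dates they were edited: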
proposal_df <- structure(list(
  proposal = c(41, 62, 169, 72),
  owner = c("Adam", "Adam", "Alan", "Alan"),
  totalAtEdit = c(-27, 1000, 151, 1137),
  editDate = structure(c(1556014200, 1560762240, 1563966600, 1540832280),
                       class = c("POSIXct", "POSIXt"), tzone = "UTC")),
  class = "data.frame", row.names = c(NA, -4L))
  proposal owner totalAtEdit            editDate
1       41  Adam         -27 2019-04-23 10:10:00
2       62  Adam        1000 2019-06-17 09:04:00
3      169  Alan         151 2019-07-24 11:10:00
4       72  Alan        1137 2018-10-29 16:58:00
Table 2 - I have a log of proposals and the dates on which they were won or lost (outcome == 1 or 0):
proposal_log <- structure(list(
  proposal = c(9, 48, 43, 39, 45, 73, 111, 179, 115, 146),
  outcome = c(0, 1, 1, 1, 0, 0, 0, 0, 0, 0),
  owner = c("Adam", "Adam", "Adam", "Adam", "Adam",
            "Alan", "Alan", "Alan", "Alan", "Alan"),
  totalAtEdit = c(2, 2, 4, 566, 100, 1264, 5000, 75, 493, 18),
  editDate = structure(c(1557487860, 1561368780, 1561393140, 1546446240,
                         1549463520, 1546614180, 1547196960, 1579603560,
                         1566925200, 1536751800),
                       class = c("POSIXct", "POSIXt"), tzone = "UTC")),
  class = "data.frame", row.names = c(NA, -10L))
   proposal outcome owner totalAtEdit            editDate
1         9       0  Adam           2 2019-05-10 11:31:00
2        48       1  Adam           2 2019-06-24 09:33:00
3        43       1  Adam           4 2019-06-24 16:19:00
4        39       1  Adam         566 2019-01-02 16:24:00
5        45       0  Adam         100 2019-02-06 14:32:00
6        73       0  Alan        1264 2019-01-04 15:03:00
7       111       0  Alan        5000 2019-01-11 08:56:00
8       179       0  Alan          75 2020-01-21 10:46:00
9       115       0  Alan         493 2019-08-27 17:00:00
10      146       0  Alan          18 2018-09-12 11:30:00
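I would like to add several columns to proposal_df that are operations on proposal_log, joined by owner and restricted to rows where proposal_log$editDate <= proposal_df$editDate:
countWon - number of proposals with outcome == 1
countLost - number of proposals with outcome == 0
wonValueMean - mean totalAtEdit of the proposals with outcome == 1
pctWon - share of proposals with outcome == 1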
The output would look like this:
  proposal owner totalAtEdit            editDate countWon countLost wonValueMean    pctWon
1       41  Adam         -27 2019-04-23 10:10:00        1         1          566 0.5000000
2       62  Adam        1000 2019-06-17 09:04:00        1         2          566 0.3333333
3      169  Alan         151 2019-07-24 11:10:00        0         3          NaN 0.0000000
4       72  Alan        1137 2018-10-29 16:58:00        0         1          NaN 0.0000000
Thanks!
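There may be a more elegant solution, but this gives the desired output in 4 steps.
First, convert both tables to data.tables so that the non-equi join can be performed.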
library(data.table)
setDT(proposal_df)
setDT(proposal_log)
Step 1: Non-equi join on the same owner with proposal_log$editDate <= proposal_df$editDate.
Proposals <- proposal_log[proposal_df, on = .(owner, editDate <= editDate)]
This returns the proposals in proposal_log that meet the criteria. The proposal and totalAtEdit variables from the smaller table are added to the result with the prefix i.
   proposal outcome owner totalAtEdit            editDate i.proposal i.totalAtEdit
1:       39       1  Adam         566 2019-04-23 10:10:00         41           -27
2:       45       0  Adam         100 2019-04-23 10:10:00         41           -27
3:        9       0  Adam           2 2019-06-17 09:04:00         62          1000
4:       39       1  Adam         566 2019-06-17 09:04:00         62          1000
5:       45       0  Adam         100 2019-06-17 09:04:00         62          1000
6:       73       0  Alan        1264 2019-07-24 11:10:00        169           151
7:      111       0  Alan        5000 2019-07-24 11:10:00        169           151
8:      146       0  Alan          18 2019-07-24 11:10:00        169           151
9:      146       0  Alan          18 2018-10-29 16:58:00         72          1137
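Note that the editDate column in this result holds the date from proposal_df, not the log entry's own date. If both dates are needed, one option is to pick the columns explicitly in j using the x. and i. prefixes. The following is only a sketch on a reduced copy of the data (Adam's rows only), and the output column names logProposal, logEditDate and proposalEditDate are made up for illustration:

```r
library(data.table)

# Reduced copy of the question's data (assumption: Adam's rows only)
proposal_df <- data.table(
  proposal = c(41, 62),
  owner    = c("Adam", "Adam"),
  editDate = as.POSIXct(c("2019-04-23 10:10:00", "2019-06-17 09:04:00"), tz = "UTC")
)
proposal_log <- data.table(
  proposal = c(9, 39, 45),
  outcome  = c(0, 1, 0),
  owner    = "Adam",
  editDate = as.POSIXct(c("2019-05-10 11:31:00", "2019-01-02 16:24:00",
                          "2019-02-06 14:32:00"), tz = "UTC")
)

# x.* refers to proposal_log (the outer table), i.* to proposal_df
both_dates <- proposal_log[
  proposal_df,
  on = .(owner, editDate <= editDate),
  .(proposal = i.proposal, logProposal = x.proposal, outcome,
    logEditDate = x.editDate, proposalEditDate = i.editDate)
]
print(both_dates)
```

Every row of both_dates should then satisfy logEditDate <= proposalEditDate, which is a quick way to check that the join condition points in the intended direction.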
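Step 2: Reshape this to wide format to count (fun.aggregate = length) the number of each outcome per i.proposal, then compute the proportion of outcomes that were won (outcome == 1).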
Outcomes <- dcast(Proposals, i.proposal ~ outcome,
                  fun.aggregate = length, value.var = "outcome")[
  , pctWon := `1` / (`0` + `1`)]
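Step 3: Compute the mean totalAtEdit of the won outcomes (outcome == 1) for each proposal, and join it onto Outcomes by proposal ID. This is an update join, so proposals without any wins get NA.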
Means <- Proposals[outcome==1, .(m_total = mean(totalAtEdit)), by=i.proposal]
Outcomes[Means, on=.(i.proposal), wonValueMean := m_total]
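Steps 2 and 3 could probably also be collapsed into a single grouped aggregation instead of a dcast followed by a join. A minimal sketch, assuming a reduced copy of the Step 1 result that keeps only the three columns needed; the name Summary is made up for illustration:

```r
library(data.table)

# Reduced copy of the Step 1 result (only the columns used below)
Proposals <- data.table(
  i.proposal  = c(41, 41, 62, 62, 62, 169, 169, 169, 72),
  outcome     = c(1, 0, 0, 1, 0, 0, 0, 0, 0),
  totalAtEdit = c(566, 100, 2, 566, 100, 1264, 5000, 18, 18)
)

# One pass over the joined rows, grouped by the proposal from proposal_df
Summary <- Proposals[, .(
  countWon     = sum(outcome == 1),
  countLost    = sum(outcome == 0),
  wonValueMean = mean(totalAtEdit[outcome == 1]),  # NaN when there are no wins
  pctWon       = mean(outcome == 1)
), by = .(proposal = i.proposal)]
print(Summary)
```

Summary can then be joined to proposal_df on proposal in the same way as Outcomes in Step 4. Unlike the dcast version, this does not depend on both outcome values being present in the data.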
Step 4: Join with the proposal_df table.
proposal_df[Outcomes, on=c(proposal = "i.proposal")]
   proposal owner totalAtEdit            editDate 0 1    pctWon wonValueMean
1:       41  Adam         -27 2019-04-23 10:10:00 1 1 0.5000000          566
2:       62  Adam        1000 2019-06-17 09:04:00 2 1 0.3333333          566
3:       72  Alan        1137 2018-10-29 16:58:00 1 0 0.0000000           NA
4:      169  Alan         151 2019-07-24 11:10:00 3 0 0.0000000           NA
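The count columns come out named 0 and 1, whereas the question asked for countLost and countWon. A small sketch of renaming and reordering them by reference, assuming the Step 4 result has been stored in a variable (here called result, a name chosen for illustration) and using a trimmed copy of it:

```r
library(data.table)

# Trimmed copy of the Step 4 result (assumption: stored as `result`)
result <- data.table(
  proposal     = c(41, 62, 72, 169),
  owner        = c("Adam", "Adam", "Alan", "Alan"),
  `0`          = c(1L, 2L, 1L, 3L),
  `1`          = c(1L, 1L, 0L, 0L),
  pctWon       = c(0.5, 1/3, 0, 0),
  wonValueMean = c(566, 566, NA, NA)
)

# Rename the dcast columns, then put them in the order used in the question
setnames(result, old = c("0", "1"), new = c("countLost", "countWon"))
setcolorder(result, c("proposal", "owner", "countWon", "countLost",
                      "wonValueMean", "pctWon"))
print(result)
```

Both setnames() and setcolorder() modify the table in place, so no reassignment is needed.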
Another option is to use by = .EACHI:
library(data.table)
setDT(proposal_df)
setDT(proposal_log)
proposal_df[, c("countWon", "countLost", "wonValueMean", "pctWon") :=
  proposal_log[.SD, on = .(owner, editDate <= editDate), by = .EACHI, {
    cw <- sum(outcome == 1L)  # wins among the matching log rows
    .(cw, sum(outcome == 0L), mean(x.totalAtEdit[outcome == 1L]), cw / .N)
  }][, (1L:2L) := NULL]  # drop the join columns (owner, editDate)
]
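A variant of the same idea that names the summary columns inside j and selects them by name, rather than dropping the join columns by position, may be easier to read. This is a sketch rather than a tuned solution; the intermediate names stats and cols are made up for illustration, and the data is rebuilt with data.table() so that the block runs on its own:

```r
library(data.table)

to_utc <- function(secs) as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")

proposal_df <- data.table(
  proposal    = c(41, 62, 169, 72),
  owner       = c("Adam", "Adam", "Alan", "Alan"),
  totalAtEdit = c(-27, 1000, 151, 1137),
  editDate    = to_utc(c(1556014200, 1560762240, 1563966600, 1540832280))
)
proposal_log <- data.table(
  proposal    = c(9, 48, 43, 39, 45, 73, 111, 179, 115, 146),
  outcome     = c(0, 1, 1, 1, 0, 0, 0, 0, 0, 0),
  owner       = rep(c("Adam", "Alan"), each = 5),
  totalAtEdit = c(2, 2, 4, 566, 100, 1264, 5000, 75, 493, 18),
  editDate    = to_utc(c(1557487860, 1561368780, 1561393140, 1546446240,
                         1549463520, 1546614180, 1547196960, 1579603560,
                         1566925200, 1536751800))
)

# One summary row per row of proposal_df, in proposal_df's row order
stats <- proposal_log[proposal_df, on = .(owner, editDate <= editDate),
                      by = .EACHI, {
  cw <- sum(outcome == 1L)
  .(countWon = cw,
    countLost = sum(outcome == 0L),
    wonValueMean = mean(x.totalAtEdit[outcome == 1L]),  # NaN when no wins
    pctWon = cw / .N)
}]

# Copy only the named summary columns back onto proposal_df
cols <- c("countWon", "countLost", "wonValueMean", "pctWon")
proposal_df[, (cols) := stats[, cols, with = FALSE]]
print(proposal_df)
```

On the sample data this reproduces the table requested in the question. It relies on every proposal having at least one earlier log entry for its owner, which is true here but would need checking on the full data.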