data.table cumulative stats of irregular observations with time window
I have some transaction records, as follows:
library(data.table)
customers <- 1:75
purchase_dates <- seq(as.Date('2016-01-01'),
                      as.Date('2018-12-31'),
                      by = 1)
n <- 500L
set.seed(1)
# Assume the data are already ordered and 1 row per cust_id/purch_dt
df <- data.table(cust_id   = sample(customers, n, replace = TRUE),
                 purch_dt  = sample(purchase_dates, n, replace = TRUE),
                 purch_amt = sample(500:50000, n, replace = TRUE) / 100
                 )[, .(purch_amt = sum(purch_amt)),
                   keyby = .(cust_id, purch_dt)]
df
# cust_id purch_dt purch_amt
# 1 2016-03-20 69.65
# 1 2016-05-17 413.60
# 1 2016-12-25 357.18
# 1 2017-03-20 256.21
# 2 2016-05-26 49.14
# 2 2018-05-31 261.87
# 2 2018-12-27 293.28
# 3 2016-12-10 204.12
# 3 2018-09-21 8.70
I would like to know the prior transaction count and total amount within a 365-day prior window (i.e., from d-365 through d-1 for a transaction on date d).
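I considered using a rolling join, but that matches at most one prior purchase, and there may be several.
I was able to get the desired result with a Cartesian self-join plus a date filter (see my answer below), but that is not a very memory-efficient approach.
Desired output: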
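To illustrate the rolling-join limitation, here is a minimal sketch on a hypothetical three-row table (the `toy`, `lookup`, and `rolled` names and all values are assumptions made up for this example, not part of the original data). Each lookup row asks for the latest purchase strictly before the current one, and the join returns at most one match per row:

```r
library(data.table)

# Hypothetical toy data: one customer with three purchases
toy <- data.table(
  cust_id   = c(1L, 1L, 1L),
  purch_dt  = as.Date(c("2016-01-10", "2016-06-01", "2016-09-01")),
  purch_amt = c(10, 20, 30)
)

# Look up the day before each purchase so a row cannot match itself
lookup <- toy[, .(cust_id, purch_dt = purch_dt - 1, cur_dt = purch_dt)]

# roll = TRUE carries the last earlier observation forward:
# one matched row (or NA) per lookup row, never several
rolled <- toy[lookup, on = .(cust_id, purch_dt), roll = TRUE]

rolled$purch_amt
# NA 10 20: the third purchase has two prior purchases (10 and 20),
# but only the most recent one is returned
```

So a rolling join can recover the most recent prior purchase, but not a count or sum over all purchases in the window.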
cust_id purch_dt prior_purch_cnt prior_purch_amt purch_amt
1 2016-03-20 0 0.00 69.65
1 2016-05-17 1 69.65 413.60
1 2016-12-25 2 483.25 357.18
1 2017-03-20 3 840.43 256.21
2 2016-05-26 0 0.00 49.14
2 2018-05-31 0 0.00 261.87
2 2018-12-27 1 261.87 293.28
3 2016-12-10 0 0.00 204.12
3 2018-09-21 0 0.00 8.70
Here is the Cartesian self-join with a date-range filter:
df_prior <- df[df, on = .(cust_id), allow.cartesian = TRUE
               ][i.purch_dt < purch_dt &
                 i.purch_dt >= purch_dt - 365
               ][, .(prior_purch_cnt = .N,
                     prior_purch_amt = sum(i.purch_amt)),
                 keyby = .(cust_id, purch_dt)]
df2 <- df_prior[df, on = .(cust_id, purch_dt)]
df2[is.na(prior_purch_cnt), `:=`(prior_purch_cnt = 0,
                                 prior_purch_amt = 0)]
df2
# cust_id purch_dt prior_purch_cnt prior_purch_amt purch_amt
# 1 2016-03-20 0 0.00 69.65
# 1 2016-05-17 1 69.65 413.60
# 1 2016-12-25 2 483.25 357.18
# 1 2017-03-20 3 840.43 256.21
# 2 2016-05-26 0 0.00 49.14
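My concern is how much this could blow up before the filter is applied, on a dataset where customers have many prior transactions.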
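To put a number on that concern: joining on cust_id alone pairs every row of a customer with every other row of the same customer, so the intermediate table has the sum of squared per-customer row counts before any date filter runs. A small sketch on hypothetical data (the `toy` table and its values are assumptions for illustration only):

```r
library(data.table)

# Hypothetical toy data: customer 1 has 3 purchases, customer 2 has 2
toy <- data.table(
  cust_id  = c(1L, 1L, 1L, 2L, 2L),
  purch_dt = as.Date("2016-01-01") + c(0, 30, 60, 0, 400)
)

# Predicted size of the unfiltered self-join: sum of N^2 per customer
expected_rows <- toy[, .N, by = cust_id][, sum(as.numeric(N)^2)]

# Actual size of the unfiltered Cartesian self-join
actual_rows <- nrow(toy[toy, on = .(cust_id), allow.cartesian = TRUE])

c(expected_rows, actual_rows)
# 13 13  (3^2 + 2^2)
```

Running the first two steps on the real table gives an estimate of the intermediate size without materializing the join.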
I would like to know the prior transaction count and total amount, within a 365-day prior window (i.e., at d-365 through d-1 for a transaction on date d).
I think the idiomatic way is:
df[, c("ppn", "ppa") :=
     df[.(cust_id = cust_id, d_dn = purch_dt - 365, d_up = purch_dt),
        on = .(cust_id, purch_dt >= d_dn, purch_dt < d_up),
        .(.N, sum(purch_amt, na.rm = TRUE)),
        by = .EACHI][, .(N, V2)]
   ]
cust_id purch_dt purch_amt ppn ppa
1: 1 2016-03-20 69.65 0 0.00
2: 1 2016-05-17 413.60 1 69.65
3: 1 2016-12-25 357.18 2 483.25
4: 1 2017-03-20 256.21 3 840.43
5: 2 2016-05-26 49.14 0 0.00
---
494: 75 2018-01-12 381.24 2 201.04
495: 75 2018-04-01 65.83 3 582.28
496: 75 2018-06-17 170.30 4 648.11
497: 75 2018-07-22 60.49 5 818.41
498: 75 2018-10-10 66.12 4 677.86
This is a "non-equi join".
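For readers who want to check the window logic by hand, here is a small sketch of the same non-equi join on a hypothetical four-row table. The `toy`, `win`, and `res` names and all values are assumptions for illustration, and the aggregates are named explicitly rather than relying on the default `N` and `V2`:

```r
library(data.table)

# Hypothetical toy data, small enough to verify by hand
toy <- data.table(
  cust_id   = c(1L, 1L, 1L, 2L),
  purch_dt  = as.Date(c("2016-01-10", "2016-06-01", "2017-03-01", "2016-02-01")),
  purch_amt = c(10, 20, 30, 40)
)

# One lookup row per transaction: the window [purch_dt - 365, purch_dt)
win <- toy[, .(cust_id, d_dn = purch_dt - 365, d_up = purch_dt)]

# Non-equi self-join; by = .EACHI evaluates j once per lookup row,
# so the matched rows are aggregated without building the full cross join
res <- toy[win,
           on = .(cust_id, purch_dt >= d_dn, purch_dt < d_up),
           .(ppn = .N, ppa = sum(purch_amt, na.rm = TRUE)),
           by = .EACHI]

# Results come back in lookup-row order, so they line up with toy
toy[, c("ppn", "ppa") := res[, .(ppn, ppa)]]

toy$ppn  # 0 1 1 0
toy$ppa  # 0 10 20 0
```

The third row shows the window at work: the 2016-01-10 purchase falls before 2017-03-01 minus 365 days, so only the 2016-06-01 purchase is counted. Rows with no match get a count of 0, and `sum(..., na.rm = TRUE)` turns the unmatched NA into 0.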