Add new variables (standard deviation) to `df1`, computed for each row from multiple rows of `df2` and conditioned on `DateTime` and two other variables
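I have a data frame `df1` that summarizes different people (`df1$Person`) over time, at one-hour intervals (`df1$DateTime`).
I also have another data frame, `df2`, with information about what those people did over time. The `Data_Type` column says whether a row records time spent on the phone ("Call") or money spent on a purchase ("Purchase"), and the `Value` column gives the minutes spent on the phone or the amount of money spent at that specific time.
As an example: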
df1<- data.frame(DateTime=c("2016-09-27 11:00:00","2016-09-27 11:00:00","2016-09-27 12:00:00","2016-09-27 12:00:00","2016-09-27 13:00:00","2016-09-27 13:00:00"),
Person= c(11,12,11,12,11,12))
df2<- data.frame(DateTime= c("2016-09-27 11:03:40","2016-09-27 11:07:40","2016-09-27 11:34:53","2016-09-27 11:48:32","2016-09-27 12:01:40","2016-09-27 12:09:40","2016-09-27 12:21:40","2016-09-27 12:29:40","2016-09-27 12:35:40","2016-09-27 12:41:40","2016-09-27 12:53:26","2016-09-27 13:05:40","2016-09-27 13:24:14","2016-09-27 13:32:50","2016-09-27 13:47:19"),
Person= c(11,11,12,11,12,11,11,11,11,12,12,12,11,12,11),
Data_Type=c("Call","Call","Call","Call","Purchase","Call","Call","Call","Call","Purchase","Call","Call","Call","Call","Purchase"),
Value=c(2.7,5.4,3.2,1.7,300,4.6,2.3,5.1,2.9,100,0.6,6.2,1.8,7.6,380))
> df1
DateTime Person
1 2016-09-27 11:00:00 11
2 2016-09-27 11:00:00 12
3 2016-09-27 12:00:00 11
4 2016-09-27 12:00:00 12
5 2016-09-27 13:00:00 11
6 2016-09-27 13:00:00 12
> df2
DateTime Person Data_Type Value
1 2016-09-27 11:03:40 11 Call 2.7
2 2016-09-27 11:07:40 11 Call 5.4
3 2016-09-27 11:34:53 12 Call 3.2
4 2016-09-27 11:48:32 11 Call 1.7
5 2016-09-27 12:01:40 12 Purchase 300.0
6 2016-09-27 12:09:40 11 Call 4.6
7 2016-09-27 12:21:40 11 Call 2.3
8 2016-09-27 12:29:40 11 Call 5.1
9 2016-09-27 12:35:40 11 Call 2.9
10 2016-09-27 12:41:40 12 Purchase 100.0
11 2016-09-27 12:53:26 12 Call 0.6
12 2016-09-27 13:05:40 12 Call 6.2
13 2016-09-27 13:24:14 11 Call 1.8
14 2016-09-27 13:32:50 12 Call 7.6
15 2016-09-27 13:47:19 11 Purchase 380.0
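I want to add two new variables to `df1` that summarize the standard deviation of the Calls and of the Purchases for each person and each one-hour interval.
I would like to get this (I may have made some mistakes when calculating the sd):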
> df1
DateTime Person sdCalls sdPurchases
1 2016-09-27 11:00:00 11 1.9139836 NA
2 2016-09-27 11:00:00 12 0.0000000 NA
3 2016-09-27 12:00:00 11 1.3375973 NA
4 2016-09-27 12:00:00 12 0.0000000 141.4214
5 2016-09-27 13:00:00 11 0.0000000 0.0000
6 2016-09-27 13:00:00 12 0.9899495 NA
Does anyone know how to do it?
One option is to `floor` the 'DateTime' column of the second dataset to the hour, join `on` 'Person' and the floored 'DateTime', and subset 'Value' by the 'Call' and 'Purchase' entries in 'Data_Type' to get the `sd`:
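library(lubridate)
library(data.table)
# Parse the hourly timestamps in df1 so they can be matched against the floored times
setDT(df1)[, DateTime := ymd_hms(DateTime)]
# Floor each df2 timestamp to the start of its hour
setDT(df2)[, dt_floor := floor_date(ymd_hms(DateTime), unit = "hour")]
# For each row of df1, take the sd of the matching 'Call' and 'Purchase' values
df2[df1, .(sdCalls = sd(Value[Data_Type == "Call"]),
           sdPurchases = sd(Value[Data_Type == "Purchase"])),
    on = .(Person, dt_floor = DateTime), by = .EACHI]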
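As a cross-check of the floor-then-match idea, here is a minimal base R sketch. It is not the data.table method above, only an illustration that needs no packages. The `Hour` column and the `sd_for` helper are hypothetical names introduced for this example, and the string-based flooring assumes every timestamp has the fixed `YYYY-MM-DD HH:MM:SS` layout shown in the question.

```r
# Rebuild the example data from the question.
df1 <- data.frame(
  DateTime = c("2016-09-27 11:00:00", "2016-09-27 11:00:00",
               "2016-09-27 12:00:00", "2016-09-27 12:00:00",
               "2016-09-27 13:00:00", "2016-09-27 13:00:00"),
  Person = c(11, 12, 11, 12, 11, 12),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  DateTime = c("2016-09-27 11:03:40", "2016-09-27 11:07:40", "2016-09-27 11:34:53",
               "2016-09-27 11:48:32", "2016-09-27 12:01:40", "2016-09-27 12:09:40",
               "2016-09-27 12:21:40", "2016-09-27 12:29:40", "2016-09-27 12:35:40",
               "2016-09-27 12:41:40", "2016-09-27 12:53:26", "2016-09-27 13:05:40",
               "2016-09-27 13:24:14", "2016-09-27 13:32:50", "2016-09-27 13:47:19"),
  Person = c(11, 11, 12, 11, 12, 11, 11, 11, 11, 12, 12, 12, 11, 12, 11),
  Data_Type = c("Call", "Call", "Call", "Call", "Purchase", "Call", "Call", "Call",
                "Call", "Purchase", "Call", "Call", "Call", "Call", "Purchase"),
  Value = c(2.7, 5.4, 3.2, 1.7, 300, 4.6, 2.3, 5.1, 2.9, 100, 0.6, 6.2, 1.8, 7.6, 380),
  stringsAsFactors = FALSE
)

# Floor each df2 timestamp to the hour: keep "YYYY-MM-DD HH" and append ":00:00".
# (Assumption: fixed-width timestamp strings, as in the example data.)
df2$Hour <- paste0(substr(df2$DateTime, 1, 13), ":00:00")

# Hypothetical helper: sd of one Data_Type for a given person and hour.
sd_for <- function(person, hour, type) {
  sd(df2$Value[df2$Person == person & df2$Hour == hour & df2$Data_Type == type])
}

# Apply the helper to every (Person, DateTime) row of df1.
df1$sdCalls <- mapply(sd_for, df1$Person, df1$DateTime,
                      MoreArgs = list(type = "Call"))
df1$sdPurchases <- mapply(sd_for, df1$Person, df1$DateTime,
                          MoreArgs = list(type = "Purchase"))

print(df1)
```

Note that R's `sd()` returns `NA` when a group has fewer than two values, so the cells shown as `0.0000000` in the desired output come out as `NA` here. The remaining values (1.9139836, 1.3375973, 0.9899495 and 141.4214) agree with the table in the question.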