How can I properly calculate the average of a variable for each row of `df1`, using data from `df2` over 45-minute time windows, with `data.table`?
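I have a dataframe `df1` that summarizes different observations of an individual `ID` over time, rounded to fixed 45-minute intervals starting from `00:00:00` (`00:00:00`, `00:45:00`, and so on). As an example: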
df1<- data.frame(DateTime45=c("2017-07-09 00:00:00","2017-07-09 00:45:00","2017-07-09 02:15:00","2017-07-09 03:45:00"),
ID=c("A","A","A","A"),
VariableX=c(0,2,0,4))
df1
DateTime45 ID VariableX
1 2017-07-09 00:00:00 A 0
2 2017-07-09 00:45:00 A 2
3 2017-07-09 02:15:00 A 0
4 2017-07-09 03:45:00 A 4
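I have another dataframe `df2` with other information about the same individual (`vedba`), also over time, but in this case not at 45-minute intervals. As an example: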
df2<- data.frame(DateTime= c("2017-07-08 23:40:57.245","2017-07-08 23:58:12.945","2017-07-09 00:01:00.345","2017-07-09 00:07:12.845","2017-07-09 00:28:34.845","2017-07-09 00:31:46.567","2017-07-09 00:53:21.345","2017-07-09 01:01:34.545","2017-07-09 01:09:12.246","2017-07-09 01:23:12.321","2017-07-09 01:34:26.687","2017-07-09 01:57:08.687","2017-07-09 02:05:23.789","2017-07-09 02:32:24.789","2017-07-09 02:42:34.536","2017-07-09 02:59:00.098","2017-07-09 03:03:01.434","2017-07-09 03:11:38.987","2017-07-09 03:23:31.345","2017-07-09 03:28:21.345","2017-07-09 03:42:53.345"),
ID=c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A"),
vedba=c(1.87,2.3,0.3,0.67,1.3,2.1,3.6,0.1,0.8,1.3,2.4,1.5,1.23,2.02,1.89,0.78,1.11,2.13,1.20,0.34,0.94))
df2$DateTime<- as.POSIXct(df2$DateTime, format="%Y-%m-%d %H:%M:%OS",tz="UTC")
df2
DateTime ID vedba
1 2017-07-08 23:40:57.244 A 1.87
2 2017-07-08 23:58:12.944 A 2.30
3 2017-07-09 00:01:00.345 A 0.30
4 2017-07-09 00:07:12.845 A 0.67
. . . . .
. . . . .
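For each row in `df1`, I want to calculate the mean value of `vedba` using the values in `df2`. The point is that, for each time in `df1`, I want to consider a window spanning 22 minutes and 30 seconds before and after it (that is, `df1$DateTime45` is the centre of the range). For instance, the time range for `df1[1,1]` (`2017-07-09 00:00:00`) runs from `2017-07-08 23:37:30` to `2017-07-09 00:22:30`.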
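To make the window definition concrete, here is a minimal base R sketch of the lower and upper bounds for each centre time. The names `half_width`, `centres`, `windows`, `date_lo` and `date_hi` are only illustrative; the sketch relies on the fact that adding a number to a `POSIXct` value shifts it by that many seconds.

```r
# Half-width of the window: 22 minutes 30 seconds, expressed in seconds
half_width <- 22.5 * 60

# The four centre times from df1, parsed as UTC
centres <- as.POSIXct(c("2017-07-09 00:00:00", "2017-07-09 00:45:00",
                        "2017-07-09 02:15:00", "2017-07-09 03:45:00"),
                      format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Lower and upper bound of each window around its centre
windows <- data.frame(centre  = centres,
                      date_lo = centres - half_width,
                      date_hi = centres + half_width)
print(windows)
```

The first row should reproduce the range quoted above, from `2017-07-08 23:37:30` to `2017-07-09 00:22:30`.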
In this example, I would expect to get this:
df3
DateTime45 ID VariableX meanVedba n_vedba
1 2017-07-09 00:00:00 A 0 1.2850000 4
2 2017-07-09 00:45:00 A 2 1.7750000 4
3 2017-07-09 02:15:00 A 0 1.5833333 3
4 2017-07-09 03:45:00 A 4 0.8266667 3
*Note: I include an `n_vedba` variable to check that the code takes the correct number of rows from `df2`.
My attempt was this code:
library(data.table)
library(lubridate)
setDT(df1)[, DateTime45 := ymd_hms(DateTime45)]
setDT(df2)[, dt_floor := round_date(ymd_hms(DateTime), unit = "45 mins")]
df3<- df2[df1, .(meanVedba = mean(vedba),
n_vedba=.N),
on = .(ID, dt_floor = DateTime45), by = .EACHI]
df3
ID dt_floor meanVedba n_vedba
1: A 2017-07-09 00:00:00 0.4850000 2
2: A 2017-07-09 00:45:00 2.3333333 3
3: A 2017-07-09 02:15:00 NA 0
4: A 2017-07-09 03:45:00 0.8266667 3
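However, as you can see, I don't get the result I expect.
Does anyone know why, and how I can change the code to get what I want?
Additional comments
The code I show does work when I have hourly intervals instead of 45-minute intervals.
- I create the dataframes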
df1<- data.frame(DateTime=c("2017-07-09 00:00:00","2017-07-09 01:00:00","2017-07-09 02:00:00","2017-07-09 03:00:00","2017-07-09 04:00:00"),
ID=c("A","A","A","A","A"),
VariableX=c(0,2,0,4,7))
df1$DateTime<- as.POSIXct(df1$DateTime, format="%Y-%m-%d %H:%M:%S",tz="UTC")
df1
DateTime ID VariableX
1 2017-07-09 00:00:00 A 0
2 2017-07-09 01:00:00 A 2
3 2017-07-09 02:00:00 A 0
4 2017-07-09 03:00:00 A 4
5 2017-07-09 04:00:00 A 7
df2<- data.frame(DateTime= c("2017-07-08 23:40:57.245","2017-07-08 23:58:12.945","2017-07-09 00:01:00.345","2017-07-09 00:07:12.845","2017-07-09 00:28:34.845","2017-07-09 00:31:46.567","2017-07-09 00:53:21.345","2017-07-09 01:01:34.545","2017-07-09 01:09:12.246","2017-07-09 01:23:12.321","2017-07-09 01:34:26.687","2017-07-09 01:57:08.687","2017-07-09 02:05:23.789","2017-07-09 02:32:24.789","2017-07-09 02:42:34.536","2017-07-09 02:59:00.098","2017-07-09 03:03:01.434","2017-07-09 03:11:38.987","2017-07-09 03:23:31.345","2017-07-09 03:28:21.345","2017-07-09 03:42:53.345"),
ID=c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A"),
vedba=c(1.87,2.3,0.3,0.67,1.3,2.1,3.6,0.1,0.8,1.3,2.4,1.5,1.23,2.02,1.89,0.78,1.11,2.13,1.20,0.34,0.94))
df2$DateTime<- as.POSIXct(df2$DateTime, format="%Y-%m-%d %H:%M:%OS",tz="UTC")
df2
DateTime ID vedba dt_floor
1: 2017-07-08 23:40:57 A 1.87 2017-07-09 00:00:00
2: 2017-07-08 23:58:12 A 2.30 2017-07-09 00:00:00
3: 2017-07-09 00:01:00 A 0.30 2017-07-09 00:00:00
4: 2017-07-09 00:07:12 A 0.67 2017-07-09 00:00:00
. . . . .
. . . . .
- I calculate the mean `vedba` for each hourly bin
setDT(df1)[, DateTime := ymd_hms(DateTime)]
setDT(df2)[, dt_floor := round_date(ymd_hms(DateTime), unit = "hour")]
df3<- df2[df1, .(meanVedba = mean(vedba),
n_vedba=.N),
on = .(ID, dt_floor = DateTime), by = .EACHI]
df3
ID dt_floor meanVedba n_vedba
1: A 2017-07-09 00:00:00 1.288000 5
2: A 2017-07-09 01:00:00 1.580000 5
3: A 2017-07-09 02:00:00 1.710000 3
4: A 2017-07-09 03:00:00 1.352857 7
5: A 2017-07-09 04:00:00 0.940000 1
Well, I thought of a different way to solve it. First I changed your `POSIXct` to `POSIXlt`, and I applied it to both `df1` and `df2` (instead of just `df1`).
So I ran this:
df1$DateTime45<- as.POSIXlt(df1$DateTime45, format="%Y-%m-%d %H:%M:%OS",tz="UTC")
df2$DateTime<- as.POSIXlt(df2$DateTime, format="%Y-%m-%d %H:%M:%OS",tz="UTC")
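Then I went for a condition: since you have times, you can check whether the absolute difference between each `df2` time and each of your `df1` times is less than 22.5 minutes.
I did it with 2 nested for loops: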
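for (i in seq_along(df1$DateTime45)) {
  for (n in seq_along(df2$DateTime)) {
    # Absolute gap in minutes between this df1 centre and this df2 timestamp
    gap <- abs(as.numeric(difftime(df1$DateTime45[i], df2$DateTime[n], units = "mins")))
    # Inside the 22.5-minute half-window: overwrite the df2 time with the df1 centre
    if (gap < 22.5) df2$DateTime[n] <- df1$DateTime45[i]
  }
}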
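Basically, up to this point I have overwritten (converted) every `df2` date that falls inside a window with the relevant `df1` date. So note that if you want to keep the original `df2` dates, you should run this on a copy of `df2`.
Now we can finally compute the mean `vedba` and join it to `df1`, again using a simple `for` loop: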
means <- list()
for (i in 1:length(df1$DateTime45)){
means[[i]] <- mean(df2[df1$DateTime45[i]==df2$DateTime,]$vedba)
}
df1<- cbind(df1,means = unlist(means))
rm(means)
Now printing `df1` gives us:
DateTime45 ID VariableX means
1 2017-07-09 00:00:00 A 0 1.2850000
2 2017-07-09 00:45:00 A 2 1.7750000
3 2017-07-09 02:15:00 A 0 1.5833333
4 2017-07-09 03:45:00 A 4 0.8266667
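The nested loops above assign each `df2` timestamp to its nearest 45-minute centre. Because 45 minutes divides a UTC day evenly (32 bins of 2700 seconds), the same assignment can be sketched without loops by rounding epoch seconds. This is only a base R sketch under that UTC assumption; `bin`, `centre`, `key1`, `key2` and `df3` are names introduced here for illustration. It also sidesteps `round_date(unit = "45 mins")`, which, judging from the empty `02:15:00` bin in the question's output, seems to restart its 45-minute grid within each hour rather than counting from midnight.

```r
df1 <- data.frame(DateTime45 = c("2017-07-09 00:00:00", "2017-07-09 00:45:00",
                                 "2017-07-09 02:15:00", "2017-07-09 03:45:00"),
                  ID = "A", VariableX = c(0, 2, 0, 4))
df1$DateTime45 <- as.POSIXct(df1$DateTime45, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

df2 <- data.frame(DateTime = c("2017-07-08 23:40:57.245", "2017-07-08 23:58:12.945",
    "2017-07-09 00:01:00.345", "2017-07-09 00:07:12.845", "2017-07-09 00:28:34.845",
    "2017-07-09 00:31:46.567", "2017-07-09 00:53:21.345", "2017-07-09 01:01:34.545",
    "2017-07-09 01:09:12.246", "2017-07-09 01:23:12.321", "2017-07-09 01:34:26.687",
    "2017-07-09 01:57:08.687", "2017-07-09 02:05:23.789", "2017-07-09 02:32:24.789",
    "2017-07-09 02:42:34.536", "2017-07-09 02:59:00.098", "2017-07-09 03:03:01.434",
    "2017-07-09 03:11:38.987", "2017-07-09 03:23:31.345", "2017-07-09 03:28:21.345",
    "2017-07-09 03:42:53.345"),
  ID = "A",
  vedba = c(1.87, 2.3, 0.3, 0.67, 1.3, 2.1, 3.6, 0.1, 0.8, 1.3, 2.4, 1.5,
            1.23, 2.02, 1.89, 0.78, 1.11, 2.13, 1.20, 0.34, 0.94))
df2$DateTime <- as.POSIXct(df2$DateTime, format = "%Y-%m-%d %H:%M:%OS", tz = "UTC")

# Round each df2 time to the nearest multiple of 45 minutes since the epoch (UTC)
bin <- 45 * 60
df2$centre <- as.POSIXct(round(as.numeric(df2$DateTime) / bin) * bin,
                         origin = "1970-01-01", tz = "UTC")

# Build matching keys from ID and centre, then aggregate vedba per key
key2 <- paste(df2$ID, as.numeric(df2$centre))
key1 <- paste(df1$ID, as.numeric(df1$DateTime45))
means  <- tapply(df2$vedba, key2, mean)
counts <- tapply(df2$vedba, key2, length)

# Look up the statistics for each df1 row; centres with no df2 rows get NA and 0
df3 <- df1
df3$meanVedba <- as.numeric(means[key1])
df3$n_vedba   <- as.integer(counts[key1])
df3$n_vedba[is.na(df3$n_vedba)] <- 0L
print(df3)
```

Note that a timestamp falling exactly on a window boundary is sent to one side only by `round()`, so treat the edge behaviour as unspecified here.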
You need a non-equi join:
library(data.table)
library(lubridate)
df1<- data.frame(DateTime=c("2017-07-09 00:00:00","2017-07-09 00:45:00","2017-07-09 02:15:00","2017-07-09 03:45:00"),
ID=c("A","A","A","A"),
VariableX=c(0,2,0,4))
df1$DateTime<- as.POSIXct(df1$DateTime, format="%Y-%m-%d %H:%M:%S",tz="UTC")
df2<- data.frame(DateTime= c("2017-07-08 23:40:57.245","2017-07-08 23:58:12.945","2017-07-09 00:01:00.345","2017-07-09 00:07:12.845","2017-07-09 00:28:34.845","2017-07-09 00:31:46.567","2017-07-09 00:53:21.345","2017-07-09 01:01:34.545","2017-07-09 01:09:12.246","2017-07-09 01:23:12.321","2017-07-09 01:34:26.687","2017-07-09 01:57:08.687","2017-07-09 02:05:23.789","2017-07-09 02:32:24.789","2017-07-09 02:42:34.536","2017-07-09 02:59:00.098","2017-07-09 03:03:01.434","2017-07-09 03:11:38.987","2017-07-09 03:23:31.345","2017-07-09 03:28:21.345","2017-07-09 03:42:53.345"),
ID=c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A"),
vedba=c(1.87,2.3,0.3,0.67,1.3,2.1,3.6,0.1,0.8,1.3,2.4,1.5,1.23,2.02,1.89,0.78,1.11,2.13,1.20,0.34,0.94))
df2$DateTime<- as.POSIXct(df2$DateTime, format="%Y-%m-%d %H:%M:%OS",tz="UTC")
setDT(df1)
setDT(df2)
df1[, date_lo := DateTime - minutes(22) - seconds(30)]
df1[, date_hi := DateTime + minutes(22) + seconds(30)]
df2[df1, .(mean = mean(vedba),
           N = .N), on = .(ID, DateTime <= date_hi, DateTime >= date_lo), by = .EACHI]
ID DateTime DateTime mean N
1: A 2017-07-09 00:22:30 2017-07-08 23:37:30 1.2850000 4
2: A 2017-07-09 01:07:30 2017-07-09 00:22:30 1.7750000 4
3: A 2017-07-09 02:37:30 2017-07-09 01:52:30 1.5833333 3
4: A 2017-07-09 04:07:30 2017-07-09 03:22:30 0.8266667 3
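The join result above carries the window bounds in two columns both named `DateTime`. As a hedged sketch of one way to get back to the `df3` layout the question asks for, the aggregated columns can be copied onto `df1`, since `by = .EACHI` returns one row per row of `df1`, in the same order. The column names `meanVedba` and `n_vedba` follow the question; `half_width` and `res` are names introduced here, and the bounds are computed with plain second arithmetic instead of lubridate.

```r
library(data.table)

df1 <- data.table(
  DateTime45 = as.POSIXct(c("2017-07-09 00:00:00", "2017-07-09 00:45:00",
                            "2017-07-09 02:15:00", "2017-07-09 03:45:00"),
                          format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
  ID = "A", VariableX = c(0, 2, 0, 4))

df2 <- data.table(
  DateTime = as.POSIXct(c("2017-07-08 23:40:57.245", "2017-07-08 23:58:12.945",
    "2017-07-09 00:01:00.345", "2017-07-09 00:07:12.845", "2017-07-09 00:28:34.845",
    "2017-07-09 00:31:46.567", "2017-07-09 00:53:21.345", "2017-07-09 01:01:34.545",
    "2017-07-09 01:09:12.246", "2017-07-09 01:23:12.321", "2017-07-09 01:34:26.687",
    "2017-07-09 01:57:08.687", "2017-07-09 02:05:23.789", "2017-07-09 02:32:24.789",
    "2017-07-09 02:42:34.536", "2017-07-09 02:59:00.098", "2017-07-09 03:03:01.434",
    "2017-07-09 03:11:38.987", "2017-07-09 03:23:31.345", "2017-07-09 03:28:21.345",
    "2017-07-09 03:42:53.345"), format = "%Y-%m-%d %H:%M:%OS", tz = "UTC"),
  ID = "A",
  vedba = c(1.87, 2.3, 0.3, 0.67, 1.3, 2.1, 3.6, 0.1, 0.8, 1.3, 2.4, 1.5,
            1.23, 2.02, 1.89, 0.78, 1.11, 2.13, 1.20, 0.34, 0.94))

# Window bounds: centre time plus or minus 22.5 minutes (in seconds)
half_width <- 22.5 * 60
df1[, `:=`(date_lo = DateTime45 - half_width, date_hi = DateTime45 + half_width)]

# Non-equi join: for each df1 row, aggregate the df2 rows inside its window
res <- df2[df1,
           .(meanVedba = mean(vedba), n_vedba = .N),
           on = .(ID, DateTime >= date_lo, DateTime <= date_hi),
           by = .EACHI]

# res has one row per df1 row, in df1 order, so the columns can be attached directly
df3 <- df1[, .(DateTime45, ID, VariableX)]
df3[, `:=`(meanVedba = res$meanVedba, n_vedba = res$n_vedba)]
print(df3)
```

Both bounds are inclusive (`>=` and `<=`), so a `df2` timestamp sitting exactly on a shared boundary such as `00:22:30` would be counted in two adjacent windows; change one comparison to a strict inequality if that matters for your data.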